100 AI Prompts for Data Scientists — Complete Guide

Data science combines statistics, programming, and domain expertise — and AI assistants can augment every stage of the workflow. Whether you are cleaning messy datasets, selecting the right model, or explaining results to stakeholders, these 100 prompts will help you work smarter and faster.

Exploratory Data Analysis

Prompts to understand, profile, and visualize datasets quickly.

Generate an EDA script

Beginner

Quickly profile a new dataset

Write a Python script to perform exploratory data analysis on a CSV file loaded into a pandas DataFrame called df. Include: shape, dtypes, missing value counts, descriptive statistics, cardinality for categorical columns, and correlation heatmap using seaborn. The target column is [target_column].
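A prompt like this typically yields something along these lines — a minimal sketch, with an illustrative helper name (`quick_profile`) and the plotting step omitted for brevity:

```python
import pandas as pd

def quick_profile(df: pd.DataFrame) -> dict:
    """Return a lightweight EDA summary of a DataFrame."""
    return {
        "shape": df.shape,
        "dtypes": df.dtypes.astype(str).to_dict(),
        "missing": df.isna().sum().to_dict(),
        "describe": df.describe(include="all"),
        # Cardinality is most informative for object/categorical columns.
        "cardinality": {
            c: df[c].nunique()
            for c in df.select_dtypes(include=["object", "category"]).columns
        },
    }
```

Pair it with `sns.heatmap(df.corr(numeric_only=True))` for the correlation view.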

Analyze missing data patterns

Beginner

Handle missing data strategically

Analyze the missing data in my pandas DataFrame df. Visualize missingness patterns using missingno, identify if data is MCAR/MAR/MNAR, and recommend the most appropriate imputation strategy for each column based on its type and missingness pattern. Columns: [list columns with dtypes].

Detect outliers

Beginner

Identify and handle outliers

Write Python code to detect outliers in the numerical columns of DataFrame df using three methods: IQR method, Z-score (threshold=3), and Isolation Forest. Compare results, visualize outliers with box plots, and recommend which method is most appropriate for [describe data distribution].
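For the first two methods, the core logic is small enough to sketch directly (the function name `flag_outliers` is illustrative; Isolation Forest would be layered on via scikit-learn):

```python
import pandas as pd

def flag_outliers(s: pd.Series, z_thresh: float = 3.0) -> pd.DataFrame:
    """Flag outliers in a numeric Series via the IQR rule and z-scores."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    iqr_mask = (s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)
    z = (s - s.mean()) / s.std(ddof=0)
    z_mask = z.abs() > z_thresh
    return pd.DataFrame({"iqr_outlier": iqr_mask, "z_outlier": z_mask})
```

Comparing the two masks quickly shows where the methods disagree on your data.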

Create an automated EDA report

Beginner

Generate EDA reports automatically

Generate code to create an automated EDA HTML report for a DataFrame df using [ydata-profiling/sweetviz/dtale]. Configure it to include: correlation analysis, distribution plots, duplicate detection, and alert section for data quality issues. The dataset contains [describe the domain and columns].

Analyze time series data

Intermediate

Explore time series datasets

Write Python code to perform EDA on a time series DataFrame with a datetime index and [target column]. Include: trend decomposition (STL), seasonality detection, autocorrelation and partial autocorrelation plots, rolling statistics, and identification of anomalous periods.

Perform cohort analysis

Intermediate

Analyze user retention by cohort

Write Python code to perform cohort retention analysis on a user events DataFrame with columns: user_id, event_date, and event_type. Define cohorts by first purchase month, calculate monthly retention rates, and visualize as a heatmap. Export results to a pandas pivot table.
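The heart of the analysis is a pivot of unique active users by cohort and month offset. A minimal sketch (cohorts here are defined by first event month rather than first purchase, and the heatmap step is left out):

```python
import pandas as pd

def cohort_retention(events: pd.DataFrame) -> pd.DataFrame:
    """Monthly cohort retention from (user_id, event_date) events.

    Cohort = month of a user's first event; each cell is the share of
    the cohort active in that offset month.
    """
    events = events.copy()
    events["month"] = events["event_date"].dt.to_period("M")
    first = events.groupby("user_id")["month"].min().rename("cohort")
    events = events.join(first, on="user_id")
    events["offset"] = (events["month"] - events["cohort"]).apply(lambda p: p.n)
    counts = (
        events.groupby(["cohort", "offset"])["user_id"]
        .nunique()
        .unstack(fill_value=0)
    )
    return counts.div(counts[0], axis=0)  # normalise by cohort size
```

Feed the result straight into `sns.heatmap` for the classic retention triangle.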

Create distribution comparison plots

Intermediate

Compare feature distributions

Write Python code using matplotlib/seaborn to compare the distribution of [feature column] across [group column] categories. Include: overlapping KDE plots, violin plots, empirical CDFs, and a statistical test (KS test or Mann-Whitney U) to determine if distributions differ significantly.

Compute feature correlations with target

Intermediate

Identify predictive features

Write Python code to compute the correlation between all features and the target variable [target] in DataFrame df. Use Pearson for numerical features, point-biserial for binary vs continuous, and Cramér's V for categorical vs categorical. Output a ranked table of correlation strengths.

Detect data drift

Advanced

Monitor model input data drift

Write Python code to detect data drift between a training dataset df_train and a production dataset df_prod. Use statistical tests (KS test, chi-squared) for each feature, visualize distributions side by side, flag features with significant drift (p < 0.05), and generate a drift report.
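The per-feature KS test at the core of this prompt can be sketched as follows (`drift_report` is an illustrative name; the chi-squared branch for categoricals and the visualization are omitted):

```python
import pandas as pd
from scipy import stats

def drift_report(df_train, df_prod, alpha=0.05):
    """Two-sample KS test per shared numeric column; flag significant drift."""
    rows = []
    num_cols = df_train.select_dtypes("number").columns.intersection(df_prod.columns)
    for col in num_cols:
        stat, p = stats.ks_2samp(df_train[col].dropna(), df_prod[col].dropna())
        rows.append({"feature": col, "ks_stat": stat,
                     "p_value": p, "drifted": p < alpha})
    return pd.DataFrame(rows)
```

Note that with very large samples the KS test flags tiny, practically irrelevant shifts, so pair the p-value with an effect-size threshold on `ks_stat`.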

Build an interactive EDA dashboard

Advanced

Create interactive data exploration tools

Build an interactive EDA dashboard using Plotly Dash for a DataFrame df with [list column names and types]. Include: a dropdown to select any numerical column for distribution analysis, a scatter plot matrix with color encoding by [categorical column], and a correlation heatmap with clickable drill-down.

Feature Engineering & Preprocessing

Prompts to transform raw data into high-quality model inputs.

Encode categorical variables

Beginner

Encode categorical features

Write Python code using scikit-learn to encode the categorical columns in DataFrame df. Use one-hot encoding for [low cardinality columns], target encoding for [high cardinality columns], and ordinal encoding for [ordered columns]. Wrap everything in a ColumnTransformer that can be used in a Pipeline.

Engineer date features

Beginner

Extract features from datetime columns

Write Python code to extract useful features from a datetime column [date_column] in DataFrame df. Include: year, month, day, day of week, is_weekend, is_holiday (using holidays library for [country]), quarter, week of year, days since a reference date, and cyclical encoding for periodic features.
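A condensed version of what this prompt produces — the holiday lookup is skipped here, and `add_date_features` is an illustrative name:

```python
import numpy as np
import pandas as pd

def add_date_features(df: pd.DataFrame, col: str) -> pd.DataFrame:
    """Derive calendar and cyclical features from a datetime column."""
    out = df.copy()
    dt = out[col].dt
    out["year"] = dt.year
    out["month"] = dt.month
    out["dayofweek"] = dt.dayofweek
    out["is_weekend"] = dt.dayofweek >= 5
    out["quarter"] = dt.quarter
    # Cyclical encoding so December sits next to January for the model.
    out["month_sin"] = np.sin(2 * np.pi * (out["month"] - 1) / 12)
    out["month_cos"] = np.cos(2 * np.pi * (out["month"] - 1) / 12)
    return out
```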

Handle imbalanced classes

Intermediate

Fix class imbalance

My classification dataset df has a severe class imbalance: [class distribution]. Write Python code to address this using: SMOTE oversampling, random undersampling, and class_weight='balanced' comparison. Evaluate each approach using stratified cross-validation with F1-macro score. Target: [target_column].

Build a feature engineering pipeline

Intermediate

Create reproducible preprocessing pipelines

Build a scikit-learn Pipeline for a [classification/regression] task with the following preprocessing steps: impute missing values in [numerical columns] with median, scale with RobustScaler, encode [categorical columns] with OneHotEncoder, and apply polynomial features of degree 2 to [key columns]. Make the pipeline serializable with joblib.
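The skeleton such a prompt should return looks like this (polynomial features left out for brevity; the final estimator is an assumed placeholder):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, RobustScaler

def make_pipeline(num_cols, cat_cols):
    """Preprocessing + model in one leak-free, serialisable object."""
    numeric = Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", RobustScaler()),
    ])
    pre = ColumnTransformer([
        ("num", numeric, num_cols),
        ("cat", OneHotEncoder(handle_unknown="ignore"), cat_cols),
    ])
    return Pipeline([("pre", pre), ("model", LogisticRegression(max_iter=1000))])
```

`joblib.dump(pipe, "model.joblib")` then persists preprocessing and model together, which is exactly what makes this pattern production-safe.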

Select features using multiple methods

Intermediate

Select the most predictive features

Write Python code to perform feature selection on DataFrame df for predicting [target] using: filter methods (mutual information, chi-squared), wrapper method (recursive feature elimination with cross-validation), and embedded method (Lasso regularization). Compare the selected feature sets and recommend a final set.

Create lag features for time series

Intermediate

Engineer time series features

Write Python code to create lag features and rolling statistics for a time series DataFrame with columns [list columns] and a datetime index. Create lags of [1, 3, 7, 14, 28] periods, rolling mean/std/min/max for windows of [7, 14, 28] days, and ewm features. Handle NaN values from lagging correctly.
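The key detail in this prompt is leakage-safe rolling windows, which the sketch below handles by shifting before rolling (`add_lags` is an illustrative name; ewm features follow the same pattern):

```python
import pandas as pd

def add_lags(df: pd.DataFrame, col: str, lags=(1, 7), windows=(7,)) -> pd.DataFrame:
    """Lag and rolling-window features for a datetime-indexed series.

    Rolling stats are computed on a shifted copy so each row only sees
    the past, never its own value.
    """
    out = df.copy()
    for k in lags:
        out[f"{col}_lag{k}"] = out[col].shift(k)
    past = out[col].shift(1)
    for w in windows:
        out[f"{col}_rollmean{w}"] = past.rolling(w).mean()
        out[f"{col}_rollstd{w}"] = past.rolling(w).std()
    return out
```

The leading NaNs produced by lagging should be dropped (or imputed) only after the train/test split, for the same leakage reasons.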

Apply dimensionality reduction

Advanced

Reduce high-dimensional data

Write Python code to reduce the dimensionality of a feature matrix X with [number] features using PCA, t-SNE, and UMAP. Determine the optimal number of PCA components using explained variance (95% threshold), visualize t-SNE and UMAP embeddings colored by [label column], and compare clustering quality.

Build text features with NLP

Advanced

Extract features from text data

Write Python code to engineer features from a text column [text_column] in DataFrame df. Include: TF-IDF vectors (top 500 features), sentence embeddings using sentence-transformers, readability scores (Flesch-Kincaid), sentiment scores (VADER), entity counts, and text length features. Combine with numerical features for downstream modeling.

Generate interaction features

Advanced

Discover feature interactions

Write Python code to systematically generate and evaluate interaction features between the top [N] most predictive features in DataFrame df for target [target]. Create pairwise products, ratios, and differences. Use mutual information to rank interaction features and keep the top 20 that add signal beyond the original features.

Normalize features for deep learning

Intermediate

Prepare data for deep learning

Write Python code to normalize features for a deep learning model using PyTorch/TensorFlow. Apply batch normalization layer configuration, feature-wise standardization using training set statistics, handle constant and near-constant columns, clip extreme values at the [1st, 99th] percentile, and create a reusable preprocessing class.

Model Building & Evaluation

Prompts for training, tuning, and evaluating machine learning models.

Compare multiple classifiers

Beginner

Find the best model for a classification task

Write Python code to compare multiple classifiers (Logistic Regression, Random Forest, XGBoost, LightGBM, SVM) on a classification dataset with features X and target y. Use stratified 5-fold cross-validation, report accuracy, F1-macro, ROC-AUC, and training time. Plot a comparison table and ROC curves.
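The comparison loop itself is compact; a two-model sketch (extend the `models` dict with XGBoost, LightGBM, and SVM as the prompt requests — `compare_models` is an illustrative name):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

def compare_models(X, y, models=None, cv_splits=5):
    """Cross-validated F1-macro for several classifiers, ranked best-first."""
    models = models or {
        "logreg": LogisticRegression(max_iter=1000),
        "rf": RandomForestClassifier(n_estimators=100, random_state=42),
    }
    cv = StratifiedKFold(n_splits=cv_splits, shuffle=True, random_state=42)
    rows = []
    for name, model in models.items():
        scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
        rows.append({"model": name, "mean": scores.mean(), "std": scores.std()})
    return pd.DataFrame(rows).sort_values("mean", ascending=False)
```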

Tune hyperparameters with Optuna

Intermediate

Optimize model hyperparameters

Write Python code to tune hyperparameters for [XGBoost/LightGBM/Random Forest] using Optuna on dataset X, y. Define a meaningful search space for [list key hyperparameters], use 5-fold cross-validation with [metric] as objective, run [N] trials, and plot the optimization history and parameter importances.

Build a stacking ensemble

Advanced

Build ensemble models

Write Python code to build a stacking ensemble with base models: [list models] and meta-learner: [model]. Use out-of-fold predictions to train the meta-learner (avoid data leakage), implement with scikit-learn StackingClassifier, and compare performance against the best individual base model.

Evaluate a regression model

Beginner

Evaluate regression model performance

Write Python code to fully evaluate a regression model on test set y_test vs y_pred. Report: MAE, RMSE, MAPE, R², adjusted R², and maximum error. Plot: actual vs predicted scatter, residual plot, residual distribution, and Q-Q plot. Identify systematic bias and heteroscedasticity patterns.
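The metric block of that evaluation might look like this (plots omitted; `regression_report` is an illustrative name, and adjusted R² needs the feature count passed in):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

def regression_report(y_true, y_pred, n_features: int) -> dict:
    """Core regression metrics, including adjusted R^2 and max error."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    r2 = r2_score(y_true, y_pred)
    return {
        "mae": mean_absolute_error(y_true, y_pred),
        "rmse": float(np.sqrt(mean_squared_error(y_true, y_pred))),
        "r2": r2,
        "adj_r2": 1 - (1 - r2) * (n - 1) / (n - n_features - 1),
        "max_error": float(np.max(np.abs(y_true - y_pred))),
    }
```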

Implement cross-validation correctly

Intermediate

Avoid data leakage in evaluation

Write Python code to implement proper cross-validation for a [classification/regression/time series] problem. For time series, use TimeSeriesSplit with a purge gap. For classification, use StratifiedKFold. Include preprocessing inside each fold to avoid data leakage, and report mean ± std for each metric.

Analyze feature importance

Intermediate

Understand model feature importance

Write Python code to compute and visualize feature importance from a trained [Random Forest/XGBoost/LightGBM] model. Include: built-in importance, permutation importance (on test set), and SHAP values. Compare methods and identify discrepancies. Highlight the top 15 features with a plot.
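Of the three methods, permutation importance is the one most often implemented incorrectly (it must run on held-out data). A sketch of that piece, with the illustrative name `top_features`:

```python
import pandas as pd
from sklearn.inspection import permutation_importance

def top_features(model, X_test, y_test, n=15, random_state=42):
    """Rank features by permutation importance on held-out data."""
    result = permutation_importance(
        model, X_test, y_test, n_repeats=10, random_state=random_state
    )
    imp = pd.Series(result.importances_mean, index=X_test.columns)
    return imp.sort_values(ascending=False).head(n)
```

Discrepancies between this ranking and the model's built-in importance usually point at correlated or high-cardinality features.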

Build a forecasting model

Advanced

Forecast time series data

Build a time series forecasting model for [target variable] with [frequency] data. Compare: Prophet, SARIMA, and LightGBM with lag features. Use walk-forward validation with [N] folds, optimize for [MAE/RMSE/SMAPE], and produce a forecast for the next [horizon] periods with confidence intervals.

Handle concept drift in models

Advanced

Maintain model performance over time

Design a strategy to detect and handle concept drift in a production [classification/regression] model trained on [describe data]. Implement drift detection using [ADWIN/Page-Hinkley/DDM], define retraining triggers, design a champion-challenger framework, and set up monitoring alerts for performance degradation.

Explain model predictions with SHAP

Intermediate

Make model predictions explainable

Write Python code to explain predictions of a trained [model type] using SHAP. Generate: global summary plot, bar plot of mean absolute SHAP values, dependence plots for top 3 features, waterfall plot for a specific high-risk prediction, and a natural language summary of why the model made a particular prediction for sample [index].

Build a recommendation system

Advanced

Build recommendation systems

Build a collaborative filtering recommendation system for [domain] using implicit feedback data (user_id, item_id, interaction_score). Implement matrix factorization with ALS using the implicit library, evaluate with precision@K and NDCG@K, handle cold start with a content-based fallback, and expose recommendations via a simple function.

Data Storytelling & Communication

Prompts to communicate findings clearly to technical and non-technical audiences.

Write an executive summary of findings

Beginner

Communicate results to executives

Write a 300-word executive summary of the following data science findings for a non-technical business audience: [describe findings, key metrics, model performance]. Focus on business impact, not technical details. Use plain language, highlight the key recommendation, and quantify the expected value.

Create a data visualization

Beginner

Build impactful visualizations

Write Python code using Plotly to create a publication-quality visualization for [describe what to show: e.g., 'the relationship between marketing spend and revenue segmented by channel']. Apply a clean theme, use a colorblind-friendly palette, add clear axis labels, title, subtitle, and annotation for the key insight.

Structure a data science presentation

Beginner

Structure data presentations

Create a presentation outline for sharing data science results on [project topic] to [audience: technical / business / mixed]. For each slide, provide the title, key message (one sentence), supporting evidence to include, and visualization type. The presentation should build a narrative from problem to recommendation in [N] slides.

Write a model card

Intermediate

Document ML models responsibly

Write a model card for a [model type] trained to [task description]. Include: model details, intended use cases and limitations, training data description, evaluation metrics across demographic groups, ethical considerations, and usage instructions. Follow the Google Model Card format.

Translate technical metrics for stakeholders

Intermediate

Explain model metrics to business

Translate the following model performance metrics into business impact language for stakeholders with no ML background: [list metrics: e.g., precision=0.87, recall=0.72, AUC=0.91]. Use concrete examples, analogies, and estimate the dollar impact of the model vs the current baseline of [describe baseline].

Build a Streamlit dashboard

Intermediate

Build interactive ML dashboards

Build a Streamlit dashboard that displays the results of a [classification/regression] model. Include: model performance metrics, feature importance bar chart, confusion matrix or residual plot, a prediction interface where users can input values and get a prediction with confidence score, and filters for date range and segment.

Write a data analysis memo

Beginner

Share analysis findings formally

Write a structured data analysis memo on [analysis topic] for [team/department]. Include: background and objective, methodology summary, 3-5 key findings with supporting data, limitations and caveats, and specific recommendations with expected outcomes. Keep it under 600 words and use bullet points for findings.

Critique a data visualization

Intermediate

Improve visualization quality

Critique the following data visualization description: [describe chart type, data shown, design choices]. Evaluate it against best practices for: chart type appropriateness, data-to-ink ratio, color use, accessibility, labeling clarity, and potential for misinterpretation. Suggest specific improvements.

Design an A/B test report

Intermediate

Report A/B test results

Write a structured A/B test report for an experiment testing [hypothesis]. Include: test design summary, sample sizes and duration, primary metric results with confidence intervals, secondary metrics, segment analysis, statistical significance and practical significance, recommendation, and next steps.

Create a data strategy proposal

Advanced

Propose data initiatives strategically

Write a data strategy proposal for [company/team] to improve [specific data capability: e.g., 'customer churn prediction']. Cover: current state assessment, proposed architecture, required data sources, expected business outcomes with KPIs, 3-phase implementation roadmap, estimated resource requirements, and risks.

MLOps & Production ML

Prompts for deploying, monitoring, and scaling machine learning systems.

Containerize an ML model

Beginner

Package ML models for deployment

Write a Dockerfile to containerize a Python ML model serving API built with FastAPI. The model is loaded from [model path], uses [list dependencies], and exposes a /predict endpoint. Optimize the image for size using multi-stage builds, pin all dependency versions, and add a health check endpoint.

Build a model serving API

Intermediate

Serve ML models via REST API

Build a FastAPI model serving API that loads a scikit-learn/XGBoost model from [path], accepts a JSON payload with features [list features with types], validates input with Pydantic, runs inference, and returns predictions with confidence scores. Include request logging, error handling, and a /health endpoint.

Set up model monitoring

Intermediate

Monitor production models

Design a model monitoring system for a production [model type] making [predictions per day] predictions. Define: input feature drift metrics (PSI, KS test), output distribution monitoring, performance metrics to track (requires labels), alerting thresholds, monitoring frequency, and recommended tooling (Evidently AI, Grafana, etc.).

Write an ML experiment tracking setup

Intermediate

Track ML experiments with MLflow

Set up MLflow experiment tracking for a [model type] training script. Log: hyperparameters, training/validation metrics per epoch, feature importance, model artifacts, input data hash, and environment info. Create a comparison view across runs and set up a model registry with staging/production stages.

Build an ML training pipeline

Advanced

Automate ML training pipelines

Design an ML training pipeline using [Kubeflow/Prefect/Airflow/ZenML] for a [task type] model. Define pipeline steps: data ingestion, validation, preprocessing, training, evaluation, and conditional registration. Include parameters for reuse, caching of intermediate steps, and pipeline versioning.

Implement shadow deployment

Advanced

Safely deploy new ML models

Implement a shadow deployment strategy for a new ML model to compare it against the production model without affecting users. Design the traffic mirroring setup, comparison metrics collection, statistical significance test for declaring the challenger better, and the rollout plan from 0% to 100% traffic.

Profile model inference latency

Advanced

Optimize model inference speed

Write Python code to profile the inference latency of a [model type] across different batch sizes (1, 8, 32, 128, 512). Measure: p50, p95, p99 latency and throughput (predictions/second). Identify the optimal batch size, memory usage per batch, and recommend optimizations (quantization, ONNX export, TorchScript).
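The measurement harness this prompt asks for reduces to a timing loop plus percentiles; a framework-agnostic sketch that works with any `predict_fn` (the name is illustrative, and a warm-up pass would be added for GPU models):

```python
import time
import numpy as np

def profile_latency(predict_fn, X, batch_sizes=(1, 8, 32), n_runs=20):
    """Latency percentiles and throughput per batch size for a predict function."""
    rows = []
    for bs in batch_sizes:
        batch = X[:bs]
        times = []
        for _ in range(n_runs):
            start = time.perf_counter()
            predict_fn(batch)
            times.append(time.perf_counter() - start)
        t = np.array(times)
        rows.append({
            "batch_size": bs,
            "p50_ms": float(np.percentile(t, 50) * 1000),
            "p95_ms": float(np.percentile(t, 95) * 1000),
            "throughput": bs / float(np.median(t)),
        })
    return rows
```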

Create data quality checks

Intermediate

Validate data quality in pipelines

Write Great Expectations data quality checks for a dataset used to train [model type]. Define expectations for: non-null constraints, value ranges, cardinality limits, referential integrity, schema consistency, and statistical distribution bounds based on training data. Set up checkpoint to run in CI before training.

Design a feature store

Advanced

Build a feature store

Design a feature store architecture for a [company/use case] with [N] ML models sharing features. Define: offline store (historical features for training), online store (low-latency serving), feature computation jobs, point-in-time correct joins for training, and how to handle feature backfills and versioning.

Write an ML system design

Advanced

Design production ML systems

Design an end-to-end ML system for [use case: e.g., 'real-time fraud detection']. Cover: data pipeline, feature engineering, model architecture, training infrastructure, serving architecture (latency/throughput requirements), monitoring, retraining triggers, fallback strategy, and estimated infrastructure cost at [target scale].

Pro Tips

Include dataset statistics in your prompts

Always share the number of rows, number of features, class distribution, and data types when asking for modeling advice. These details dramatically change which techniques are appropriate. A 500-row dataset needs very different treatment than a 50-million-row dataset.

Ask for leakage checks explicitly

Data leakage is the most common source of unrealistically high model performance. Always add 'check for potential data leakage sources' to prompts about feature engineering and model evaluation — it is easy to overlook and devastating in production.

Request reproducible code

Add 'set a random seed of 42 for all randomness sources and make the code fully reproducible' to any prompt involving model training or data splitting. Reproducibility is essential for debugging and production handoffs.
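The seed-setting boilerplate such a prompt yields is worth keeping as a reusable helper; a sketch for the common sources of randomness (Torch/TensorFlow seeding would be added here too if those libraries are in use):

```python
import os
import random
import numpy as np

def set_seed(seed: int = 42) -> None:
    """Seed the common sources of randomness for reproducible runs."""
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    np.random.seed(seed)
```

Call it once at the top of every training or data-splitting script.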

Separate exploration from production code

Use AI to rapidly explore in notebooks, then ask it to 'refactor this notebook code into a clean, tested Python module suitable for production'. This two-phase approach is much faster than trying to write production-quality code from the start.

Verify statistical assumptions

Always ask the AI to 'list the statistical assumptions of this method and provide tests to verify each one on my data'. Many practitioners apply models without checking assumptions, leading to unreliable results.
