100 AI Prompts for Data Scientists — Complete Guide
Data science combines statistics, programming, and domain expertise — and AI assistants can augment every stage of the workflow. Whether you are cleaning messy datasets, selecting the right model, or explaining results to stakeholders, these 100 prompts will help you work smarter and faster.
Exploratory Data Analysis
Prompts to understand, profile, and visualize datasets quickly.
Generate an EDA script
Beginner: Quickly profile a new dataset
Write a Python script to perform exploratory data analysis on a CSV file loaded into a pandas DataFrame called df. Include: shape, dtypes, missing value counts, descriptive statistics, cardinality for categorical columns, and correlation heatmap using seaborn. The target column is [target_column].
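As a rough illustration, this prompt tends to produce a profiling helper along these lines (the tiny DataFrame here is invented for demonstration):

```python
import pandas as pd

def quick_profile(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column profile: dtype, missing-value count, and cardinality."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "missing": df.isna().sum(),
        "unique": df.nunique(),
    })

# Illustrative data only
df = pd.DataFrame({"age": [25, 31, None], "city": ["NY", "NY", "LA"]})
profile = quick_profile(df)
print(df.shape)   # (3, 2)
print(profile)
```

A full EDA script would add `df.describe()` and a seaborn heatmap of `df.corr(numeric_only=True)` on top of this skeleton.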
Analyze missing data patterns
Beginner: Handle missing data strategically
Analyze the missing data in my pandas DataFrame df. Visualize missingness patterns using missingno, identify if data is MCAR/MAR/MNAR, and recommend the most appropriate imputation strategy for each column based on its type and missingness pattern. Columns: [list columns with dtypes].
Detect outliers
Beginner: Identify and handle outliers
Write Python code to detect outliers in the numerical columns of DataFrame df using three methods: IQR method, Z-score (threshold=3), and Isolation Forest. Compare results, visualize outliers with box plots, and recommend which method is most appropriate for [describe data distribution].
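A minimal sketch of the first two methods, on made-up data (the Isolation Forest variant would come from scikit-learn):

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

def zscore_outliers(s: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Flag values more than `threshold` standard deviations from the mean."""
    z = (s - s.mean()) / s.std()
    return z.abs() > threshold

values = pd.Series([10, 11, 9, 10, 12, 11, 200])  # 200 is an obvious outlier
print(iqr_outliers(values).sum())     # 1
# Note: on small samples, an extreme value inflates the std enough to
# mask itself, so the Z-score method can miss what IQR catches.
print(zscore_outliers(values).sum())  # 0
```

This masking effect is exactly why the prompt asks the assistant to compare methods rather than trust one.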
Create an automated EDA report
Beginner: Generate EDA reports automatically
Generate code to create an automated EDA HTML report for a DataFrame df using [ydata-profiling/sweetviz/dtale]. Configure it to include: correlation analysis, distribution plots, duplicate detection, and alert section for data quality issues. The dataset contains [describe the domain and columns].
Analyze time series data
Intermediate: Explore time series datasets
Write Python code to perform EDA on a time series DataFrame with a datetime index and [target column]. Include: trend decomposition (STL), seasonality detection, autocorrelation and partial autocorrelation plots, rolling statistics, and identification of anomalous periods.
Perform cohort analysis
Intermediate: Analyze user retention by cohort
Write Python code to perform cohort retention analysis on a user events DataFrame with columns: user_id, event_date, and event_type. Define cohorts by first purchase month, calculate monthly retention rates, and visualize as a heatmap. Export results to a pandas pivot table.
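The core of the expected answer is a groupby-plus-pivot pattern; a sketch on a five-event toy dataset (for simplicity, cohorts here are defined by first event rather than first purchase):

```python
import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3],
    "event_date": pd.to_datetime(
        ["2024-01-05", "2024-02-10", "2024-01-20", "2024-01-25", "2024-02-01"]),
})

# Cohort = month of each user's first event; period = months since cohort
events["month"] = events["event_date"].dt.to_period("M")
events["cohort"] = events.groupby("user_id")["month"].transform("min")
events["period"] = (events["month"] - events["cohort"]).apply(lambda d: d.n)

# Distinct active users per (cohort, period), then divide by cohort size
counts = (events.groupby(["cohort", "period"])["user_id"]
          .nunique().unstack(fill_value=0))
retention = counts.div(counts[0], axis=0)
print(retention)
```

Pass `retention` to `seaborn.heatmap` with `annot=True` for the classic cohort triangle.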
Create distribution comparison plots
Intermediate: Compare feature distributions
Write Python code using matplotlib/seaborn to compare the distribution of [feature column] across [group column] categories. Include: overlapping KDE plots, violin plots, empirical CDFs, and a statistical test (KS test or Mann-Whitney U) to determine if distributions differ significantly.
Compute feature correlations with target
Intermediate: Identify predictive features
Write Python code to compute the correlation between all features and the target variable [target] in DataFrame df. Use Pearson for numerical features, point-biserial for binary vs continuous, and Cramér's V for categorical vs categorical. Output a ranked table of correlation strengths.
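Cramér's V is the piece most often written by hand; a sketch built on `scipy.stats.chi2_contingency`, using invented perfectly-associated data (no small-sample bias correction applied):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V association between two categorical series (0 to 1)."""
    table = pd.crosstab(x, y)
    # correction=False: Yates' correction would bias V below 1 for 2x2 tables
    chi2, _, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    r, k = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))

a = pd.Series(["x", "x", "y", "y"] * 10)
b = pd.Series(["p", "p", "q", "q"] * 10)  # perfectly tracks a
c = pd.Series(["p", "q", "p", "q"] * 10)  # independent of a
print(cramers_v(a, b))  # 1.0
print(cramers_v(a, c))  # 0.0
```

Pearson and point-biserial are both available directly via `scipy.stats`.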
Detect data drift
Advanced: Monitor model input data drift
Write Python code to detect data drift between a training dataset df_train and a production dataset df_prod. Use statistical tests (KS test, chi-squared) for each feature, visualize distributions side by side, flag features with significant drift (p < 0.05), and generate a drift report.
Build an interactive EDA dashboard
Advanced: Create interactive data exploration tools
Build an interactive EDA dashboard using Plotly Dash for a DataFrame df with [list column names and types]. Include: a dropdown to select any numerical column for distribution analysis, a scatter plot matrix with color encoding by [categorical column], and a correlation heatmap with clickable drill-down.
Feature Engineering & Preprocessing
Prompts to transform raw data into high-quality model inputs.
Encode categorical variables
Beginner: Encode categorical features
Write Python code using scikit-learn to encode the categorical columns in DataFrame df. Use one-hot encoding for [low cardinality columns], target encoding for [high cardinality columns], and ordinal encoding for [ordered columns]. Wrap everything in a ColumnTransformer that can be used in a Pipeline.
Engineer date features
Beginner: Extract features from datetime columns
Write Python code to extract useful features from a datetime column [date_column] in DataFrame df. Include: year, month, day, day of week, is_weekend, is_holiday (using holidays library for [country]), quarter, week of year, days since a reference date, and cyclical encoding for periodic features.
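A condensed sketch of the calendar and cyclical parts (the column name `order_date` and the two dates are illustrative; holiday flags would need the `holidays` library):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"order_date": pd.to_datetime(["2024-03-15", "2024-12-21"])})
d = df["order_date"].dt

df["year"] = d.year
df["month"] = d.month
df["day_of_week"] = d.dayofweek          # Monday = 0
df["is_weekend"] = d.dayofweek >= 5
df["quarter"] = d.quarter
# Cyclical encoding so December and January land close together
df["month_sin"] = np.sin(2 * np.pi * df["month"] / 12)
df["month_cos"] = np.cos(2 * np.pi * df["month"] / 12)
print(df[["day_of_week", "is_weekend", "quarter"]])
```

The sin/cos pair matters for models that treat month 12 and month 1 as numerically far apart.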
Handle imbalanced classes
Intermediate: Fix class imbalance
My classification dataset df has a severe class imbalance: [class distribution]. Write Python code to address this using: SMOTE oversampling, random undersampling, and class_weight='balanced' comparison. Evaluate each approach using stratified cross-validation with F1-macro score. Target: [target_column].
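The `class_weight='balanced'` option simply applies the formula n_samples / (n_classes * class_count); a hand-rolled equivalent on an invented 9:1 split, useful when a library does not expose the option:

```python
import numpy as np

def balanced_weights(y: np.ndarray) -> dict:
    """Per-class weights matching scikit-learn's class_weight='balanced'."""
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return dict(zip(classes.tolist(), weights.tolist()))

y = np.array([0] * 90 + [1] * 10)   # 9:1 imbalance
w = balanced_weights(y)
print(w)   # minority class gets a weight of 5.0
```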
Build a feature engineering pipeline
Intermediate: Create reproducible preprocessing pipelines
Build a scikit-learn Pipeline for a [classification/regression] task with the following preprocessing steps: impute missing values in [numerical columns] with median, scale with RobustScaler, encode [categorical columns] with OneHotEncoder, and apply polynomial features of degree 2 to [key columns]. Make the pipeline serializable with joblib.
Select features using multiple methods
Intermediate: Select the most predictive features
Write Python code to perform feature selection on DataFrame df for predicting [target] using: filter methods (mutual information, chi-squared), wrapper method (recursive feature elimination with cross-validation), and embedded method (Lasso regularization). Compare the selected feature sets and recommend a final set.
Create lag features for time series
Intermediate: Engineer time series features
Write Python code to create lag features and rolling statistics for a time series DataFrame with columns [list columns] and a datetime index. Create lags of [1, 3, 7, 14, 28] periods, rolling mean/std/min/max for windows of [7, 14, 28] days, and ewm features. Handle NaN values from lagging correctly.
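The lag-and-rolling mechanics reduce to `shift` and `rolling`; a sketch on a six-day toy series (column name and lag choices are illustrative):

```python
import pandas as pd

idx = pd.date_range("2024-01-01", periods=6, freq="D")
ts = pd.DataFrame({"sales": [10, 12, 11, 13, 15, 14]}, index=idx)

for lag in (1, 3):
    ts[f"sales_lag_{lag}"] = ts["sales"].shift(lag)
ts["sales_roll_mean_3"] = ts["sales"].rolling(window=3).mean()

# Rows whose lags reach before the series start contain NaN; drop them
# so the model never trains on fabricated history.
ts_model = ts.dropna()
print(ts_model)
```

Note that `shift` only moves values forward in time, which is what keeps these features leakage-free.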
Apply dimensionality reduction
Advanced: Reduce high-dimensional data
Write Python code to reduce the dimensionality of a feature matrix X with [number] features using PCA, t-SNE, and UMAP. Determine the optimal number of PCA components using explained variance (95% threshold), visualize t-SNE and UMAP embeddings colored by [label column], and compare clustering quality.
Build text features with NLP
Advanced: Extract features from text data
Write Python code to engineer features from a text column [text_column] in DataFrame df. Include: TF-IDF vectors (top 500 features), sentence embeddings using sentence-transformers, readability scores (Flesch-Kincaid), sentiment scores (VADER), entity counts, and text length features. Combine with numerical features for downstream modeling.
Generate interaction features
Advanced: Discover feature interactions
Write Python code to systematically generate and evaluate interaction features between the top [N] most predictive features in DataFrame df for target [target]. Create pairwise products, ratios, and differences. Use mutual information to rank interaction features and keep the top 20 that add signal beyond the original features.
Normalize features for deep learning
Intermediate: Prepare data for deep learning
Write Python code to normalize features for a deep learning model using PyTorch/TensorFlow. Apply batch normalization layer configuration, feature-wise standardization using training set statistics, handle constant and near-constant columns, clip extreme values at the [1st, 99th] percentile, and create a reusable preprocessing class.
Model Building & Evaluation
Prompts for training, tuning, and evaluating machine learning models.
Compare multiple classifiers
Beginner: Find the best model for a classification task
Write Python code to compare multiple classifiers (Logistic Regression, Random Forest, XGBoost, LightGBM, SVM) on a classification dataset with features X and target y. Use stratified 5-fold cross-validation, report accuracy, F1-macro, ROC-AUC, and training time. Plot a comparison table and ROC curves.
Tune hyperparameters with Optuna
Intermediate: Optimize model hyperparameters
Write Python code to tune hyperparameters for [XGBoost/LightGBM/Random Forest] using Optuna on dataset X, y. Define a meaningful search space for [list key hyperparameters], use 5-fold cross-validation with [metric] as objective, run [N] trials, and plot the optimization history and parameter importances.
Build a stacking ensemble
Advanced: Build ensemble models
Write Python code to build a stacking ensemble with base models: [list models] and meta-learner: [model]. Use out-of-fold predictions to train the meta-learner (avoid data leakage), implement with scikit-learn StackingClassifier, and compare performance against the best individual base model.
Evaluate a regression model
Beginner: Evaluate regression model performance
Write Python code to fully evaluate a regression model on test set y_test vs y_pred. Report: MAE, RMSE, MAPE, R², adjusted R², and maximum error. Plot: actual vs predicted scatter, residual plot, residual distribution, and Q-Q plot. Identify systematic bias and heteroscedasticity patterns.
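The headline metrics are a few lines of NumPy; a sketch on three invented predictions (the plots would come from matplotlib on top of the same `resid` array):

```python
import numpy as np

def regression_report(y_true, y_pred):
    """Core regression metrics from two arrays of equal length."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    resid = y_true - y_pred
    mae = np.mean(np.abs(resid))
    rmse = np.sqrt(np.mean(resid ** 2))
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    r2 = 1 - ss_res / ss_tot
    return {"mae": mae, "rmse": rmse, "r2": r2,
            "max_error": np.max(np.abs(resid))}

report = regression_report([3.0, 5.0, 7.0], [2.5, 5.0, 7.5])
print(report)   # r2 = 0.9375, max_error = 0.5
```

Adjusted R² additionally needs the feature count: 1 - (1 - r2) * (n - 1) / (n - p - 1).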
Implement cross-validation correctly
Intermediate: Avoid data leakage in evaluation
Write Python code to implement proper cross-validation for a [classification/regression/time series] problem. For time series, use TimeSeriesSplit with a purge gap. For classification, use StratifiedKFold. Include preprocessing inside the fold to avoid data leakage, and report mean ± std for each metric.
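The key point, sketched with scikit-learn on synthetic data: the scaler sits inside the Pipeline, so each fold fits it on its own training split only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# Scaling INSIDE the pipeline: no statistics leak from the held-out fold.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1_macro")
print(f"{scores.mean():.3f} ± {scores.std():.3f}")
```

Fitting the scaler on all of X before splitting would quietly inflate every fold's score.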
Analyze feature importance
Intermediate: Understand model feature importance
Write Python code to compute and visualize feature importance from a trained [Random Forest/XGBoost/LightGBM] model. Include: built-in importance, permutation importance (on test set), and SHAP values. Compare methods and identify discrepancies. Highlight the top 15 features with a plot.
Build a forecasting model
Advanced: Forecast time series data
Build a time series forecasting model for [target variable] with [frequency] data. Compare: Prophet, SARIMA, and LightGBM with lag features. Use walk-forward validation with [N] folds, optimize for [MAE/RMSE/SMAPE], and produce a forecast for the next [horizon] periods with confidence intervals.
Handle concept drift in models
Advanced: Maintain model performance over time
Design a strategy to detect and handle concept drift in a production [classification/regression] model trained on [describe data]. Implement drift detection using [ADWIN/Page-Hinkley/DDM], define retraining triggers, design a champion-challenger framework, and set up monitoring alerts for performance degradation.
Explain model predictions with SHAP
Intermediate: Make model predictions explainable
Write Python code to explain predictions of a trained [model type] using SHAP. Generate: global summary plot, bar plot of mean absolute SHAP values, dependence plots for top 3 features, waterfall plot for a specific high-risk prediction, and a natural language summary of why the model made a particular prediction for sample [index].
Build a recommendation system
Advanced: Build recommendation systems
Build a collaborative filtering recommendation system for [domain] using implicit feedback data (user_id, item_id, interaction_score). Implement matrix factorization with ALS using implicit library, evaluate with precision@K and NDCG@K, handle cold start with content-based fallback, and expose recommendations via a simple function.
Data Storytelling & Communication
Prompts to communicate findings clearly to technical and non-technical audiences.
Write an executive summary of findings
Beginner: Communicate results to executives
Write a 300-word executive summary of the following data science findings for a non-technical business audience: [describe findings, key metrics, model performance]. Focus on business impact, not technical details. Use plain language, highlight the key recommendation, and quantify the expected value.
Create a data visualization
Beginner: Build impactful visualizations
Write Python code using Plotly to create a publication-quality visualization for [describe what to show: e.g., 'the relationship between marketing spend and revenue segmented by channel']. Apply a clean theme, use a colorblind-friendly palette, add clear axis labels, title, subtitle, and annotation for the key insight.
Structure a data science presentation
Beginner: Structure data presentations
Create a presentation outline for sharing data science results on [project topic] to [audience: technical / business / mixed]. For each slide, provide the title, key message (one sentence), supporting evidence to include, and visualization type. The presentation should build a narrative from problem to recommendation in [N] slides.
Write a model card
Intermediate: Document ML models responsibly
Write a model card for a [model type] trained to [task description]. Include: model details, intended use cases and limitations, training data description, evaluation metrics across demographic groups, ethical considerations, and usage instructions. Follow the Google Model Card format.
Translate technical metrics for stakeholders
Intermediate: Explain model metrics to business
Translate the following model performance metrics into business impact language for stakeholders with no ML background: [list metrics: e.g., precision=0.87, recall=0.72, AUC=0.91]. Use concrete examples, analogies, and estimate the dollar impact of the model vs the current baseline of [describe baseline].
Build a Streamlit dashboard
Intermediate: Build interactive ML dashboards
Build a Streamlit dashboard that displays the results of a [classification/regression] model. Include: model performance metrics, feature importance bar chart, confusion matrix or residual plot, a prediction interface where users can input values and get a prediction with confidence score, and filters for date range and segment.
Write a data analysis memo
Beginner: Share analysis findings formally
Write a structured data analysis memo on [analysis topic] for [team/department]. Include: background and objective, methodology summary, 3-5 key findings with supporting data, limitations and caveats, and specific recommendations with expected outcomes. Keep it under 600 words and use bullet points for findings.
Critique a data visualization
Intermediate: Improve visualization quality
Critique the following data visualization description: [describe chart type, data shown, design choices]. Evaluate it against best practices for: chart type appropriateness, data-to-ink ratio, color use, accessibility, labeling clarity, and potential for misinterpretation. Suggest specific improvements.
Design an A/B test report
Intermediate: Report A/B test results
Write a structured A/B test report for an experiment testing [hypothesis]. Include: test design summary, sample sizes and duration, primary metric results with confidence intervals, secondary metrics, segment analysis, statistical significance and practical significance, recommendation, and next steps.
Create a data strategy proposal
Advanced: Propose data initiatives strategically
Write a data strategy proposal for [company/team] to improve [specific data capability: e.g., 'customer churn prediction']. Cover: current state assessment, proposed architecture, required data sources, expected business outcomes with KPIs, 3-phase implementation roadmap, estimated resource requirements, and risks.
MLOps & Production ML
Prompts for deploying, monitoring, and scaling machine learning systems.
Containerize an ML model
Beginner: Package ML models for deployment
Write a Dockerfile to containerize a Python ML model serving API built with FastAPI. The model is loaded from [model path], uses [list dependencies], and exposes a /predict endpoint. Optimize the image for size using multi-stage builds, pin all dependency versions, and add a health check endpoint.
Build a model serving API
Intermediate: Serve ML models via REST API
Build a FastAPI model serving API that loads a scikit-learn/XGBoost model from [path], accepts a JSON payload with features [list features with types], validates input with Pydantic, runs inference, and returns predictions with confidence scores. Include request logging, error handling, and a /health endpoint.
Set up model monitoring
Intermediate: Monitor production models
Design a model monitoring system for a production [model type] making [predictions per day] predictions. Define: input feature drift metrics (PSI, KS test), output distribution monitoring, performance metrics to track (requires labels), alerting thresholds, monitoring frequency, and recommended tooling (Evidently AI, Grafana, etc.).
Write an ML experiment tracking setup
Intermediate: Track ML experiments with MLflow
Set up MLflow experiment tracking for a [model type] training script. Log: hyperparameters, training/validation metrics per epoch, feature importance, model artifacts, input data hash, and environment info. Create a comparison view across runs and set up a model registry with staging/production stages.
Build an ML training pipeline
Advanced: Automate ML training pipelines
Design an ML training pipeline using [Kubeflow/Prefect/Airflow/ZenML] for a [task type] model. Define pipeline steps: data ingestion, validation, preprocessing, training, evaluation, and conditional registration. Include parameters for reuse, caching of intermediate steps, and pipeline versioning.
Implement shadow deployment
Advanced: Safely deploy new ML models
Implement a shadow deployment strategy for a new ML model to compare it against the production model without affecting users. Design the traffic mirroring setup, comparison metrics collection, statistical significance test for declaring the challenger better, and the rollout plan from 0% to 100% traffic.
Profile model inference latency
Advanced: Optimize model inference speed
Write Python code to profile the inference latency of a [model type] across different batch sizes (1, 8, 32, 128, 512). Measure: p50, p95, p99 latency and throughput (predictions/second). Identify the optimal batch size, memory usage per batch, and recommend optimizations (quantization, ONNX export, TorchScript).
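The measurement harness itself is model-agnostic; a sketch with a matrix multiply standing in for the model call (warmup runs, batch size, and run count are illustrative choices):

```python
import time
import numpy as np

def profile_latency(predict, batch, n_runs=50, warmup=5):
    """Time repeated calls to `predict` and report latency percentiles."""
    for _ in range(warmup):          # warm caches / lazy init before timing
        predict(batch)
    times_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        predict(batch)
        times_ms.append((time.perf_counter() - start) * 1000)
    p50, p95, p99 = np.percentile(times_ms, [50, 95, 99])
    return {"p50_ms": p50, "p95_ms": p95, "p99_ms": p99}

# Stand-in "model": replace the lambda with your real inference call
weights = np.random.rand(128, 64)
stats = profile_latency(lambda X: X @ weights, np.random.rand(32, 128))
print(stats)
```

Run this once per batch size and divide batch size by p50 to get throughput.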
Create data quality checks
Intermediate: Validate data quality in pipelines
Write Great Expectations data quality checks for a dataset used to train [model type]. Define expectations for: non-null constraints, value ranges, cardinality limits, referential integrity, schema consistency, and statistical distribution bounds based on training data. Set up checkpoint to run in CI before training.
Design a feature store
Advanced: Build a feature store
Design a feature store architecture for a [company/use case] with [N] ML models sharing features. Define: offline store (historical features for training), online store (low-latency serving), feature computation jobs, point-in-time correct joins for training, and how to handle feature backfills and versioning.
Write an ML system design
Advanced: Design production ML systems
Design an end-to-end ML system for [use case: e.g., 'real-time fraud detection']. Cover: data pipeline, feature engineering, model architecture, training infrastructure, serving architecture (latency/throughput requirements), monitoring, retraining triggers, fallback strategy, and estimated infrastructure cost at [target scale].
Pro Tips
Include dataset statistics in your prompts
Always share the number of rows, number of features, class distribution, and data types when asking for modeling advice. These details dramatically change which techniques are appropriate. A 500-row dataset needs very different treatment than a 50-million-row dataset.
Ask for leakage checks explicitly
Data leakage is the most common source of unrealistically high model performance. Always add 'check for potential data leakage sources' to prompts about feature engineering and model evaluation — it is easy to overlook during development and devastating in production.
Request reproducible code
Add 'set a random seed of 42 for all randomness sources and make the code fully reproducible' to any prompt involving model training or data splitting. Reproducibility is essential for debugging and production handoffs.
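A typical seeding helper such a prompt might produce (extend it with `torch.manual_seed` or `tf.random.set_seed` if those frameworks are in play):

```python
import os
import random
import numpy as np

def set_seed(seed: int = 42) -> None:
    """Seed the common randomness sources used in a typical DS stack."""
    random.seed(seed)
    np.random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)

set_seed(42)
a = np.random.rand(3)
set_seed(42)
b = np.random.rand(3)
print((a == b).all())   # reseeding reproduces the exact same draws
```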
Separate exploration from production code
Use AI to rapidly explore in notebooks, then ask it to 'refactor this notebook code into a clean, tested Python module suitable for production'. This two-phase approach is much faster than trying to write production-quality code from the start.
Verify statistical assumptions
Always ask the AI to 'list the statistical assumptions of this method and provide tests to verify each one on my data'. Many practitioners apply models without checking assumptions, leading to unreliable results.