📊 Data Analysis · Beginner · All AIs

Getting Started with Python and Pandas for Analysis

Complete beginner's guide to data analysis with Python Pandas, from data loading to first visualizations.

Paste in your AI

Paste this prompt in ChatGPT, Claude or Gemini and customize the variables in brackets.

I'm new to Python for data analysis and I want to analyze [DESCRIPTION_DONNEES] contained in a [FORMAT_FICHIER] file (CSV / Excel / JSON). My environment: [ENVIRONNEMENT] (Jupyter Notebook / Google Colab / VS Code).

My data has the following characteristics:
- Approximate number of rows: [LIGNES]
- Main columns: [COLONNES]
- Desired type of analysis: [TYPE_ANALYSE]

Guide me step by step through my analysis with Python Pandas:
1. Installing and importing the necessary libraries (pandas, numpy, matplotlib, seaborn)
2. Loading the data, with handling for encodings and missing values
3. Initial exploration: shape, dtypes, head(), describe(), info()
4. Basic cleaning: handling NaN values, duplicates, and column types
5. Selecting, filtering, and sorting data (loc, iloc, query)
6. Aggregations and groupby for per-category statistics
7. Creating 3 visualizations relevant to my data with matplotlib/seaborn
8. Saving the results (CSV, Excel)

Provide complete code, commented line by line for a beginner.
Explain each concept the first time it appears.
Point out common errors at each step.
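To see what steps 2–4 of the workflow above typically produce, here is a minimal sketch. It uses a small hypothetical in-memory CSV (`csv_text`) so it runs as-is; with a real file you would pass a path to `pd.read_csv` instead:

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for your real file
csv_text = "name,score\nAlice,10\nBob,\nAlice,10\nCarol,7\n"
raw = csv_text.encode("utf-8")

# Step 2: load the data, falling back to Latin-1 if UTF-8 decoding fails
try:
    df = pd.read_csv(io.BytesIO(raw), encoding="utf-8")
except UnicodeDecodeError:
    df = pd.read_csv(io.BytesIO(raw), encoding="latin-1")

# Step 3: initial exploration
print(df.shape)   # (4, 2) - four rows, two columns
print(df.dtypes)  # column types inferred by pandas

# Step 4: basic cleaning - drop duplicate rows, then rows with missing values
df = df.drop_duplicates().dropna()
print(df.shape)   # (2, 2) - the duplicate "Alice" row and the empty "Bob" score are gone
```

A common beginner pitfall this sketch sidesteps: `read_csv` raises `UnicodeDecodeError` on files exported from Excel in a legacy encoding, which is why the `try`/`except` fallback is worth asking the AI about explicitly.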

Why this prompt works

This prompt is effective for beginners because it requires line-by-line comments and conceptual explanations, turning the generated code into a learning resource. Asking the AI to flag common errors at each step prevents frustration.

Use Cases

- Python data analysis learning
- Excel to Python transition
- First personal data project

Expected Output

Complete Python Pandas code commented line by line with conceptual explanations and 3 visualizations.

```python
# Import necessary libraries
import pandas as pd  # Main library for data manipulation and analysis
import numpy as np  # Library for numerical operations
import matplotlib.pyplot as plt  # Library for creating static visualizations
import seaborn as sns  # Library for statistical data visualization

# Set visualization style for better aesthetics
plt.style.use('seaborn-v0_8')  # Apply seaborn style to matplotlib plots
sns.set_palette("husl")  # Set color palette for seaborn

# Create sample dataset
# DataFrame is the main data structure in pandas - a 2D labeled data structure
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve', 'Frank', 'Grace', 'Henry'],
    'Age': [25, 30, 35, 28, 32, 27, 29, 33],
    'Department': ['IT', 'HR', 'Finance', 'IT', 'Marketing', 'Finance', 'HR', 'IT'],
    'Salary': [50000, 45000, 60000, 55000, 48000, 52000, 46000, 58000],
    'Years_Experience': [2, 5, 8, 3, 6, 4, 3, 7]
}

# Create DataFrame from dictionary
df = pd.DataFrame(data)  # Convert dictionary to pandas DataFrame
print("Original Dataset:")
print(df)  # Display the complete dataset
print("\n" + "=" * 50 + "\n")

# BASIC DATA EXPLORATION
# Shape gives us (rows, columns) - fundamental info about dataset size
print(f"Dataset shape: {df.shape}")  # Returns tuple (number_of_rows, number_of_columns)

# Info provides a comprehensive overview of DataFrame structure
print("\nDataset Info:")
df.info()  # Shows data types, non-null counts, memory usage

# Describe generates descriptive statistics for numerical columns
print("\nDescriptive Statistics:")
print(df.describe())  # Mean, std, min, max, quartiles for numeric data

# Check for missing values - crucial for data quality assessment
print(f"\nMissing values:\n{df.isnull().sum()}")  # Sum of null values per column
print("\n" + "=" * 50 + "\n")

# DATA MANIPULATION AND ANALYSIS
# Filtering data - boolean indexing is a core pandas concept
# Create a boolean mask where the condition is True
it_employees = df[df['Department'] == 'IT']  # Filter rows where Department equals 'IT'
print("IT Department Employees:")
print(it_employees)

# Multiple conditions using the & (and) operator
# Parentheses are required around each condition
high_earners = df[(df['Salary'] > 50000) & (df['Age'] < 35)]  # Salary > 50k AND Age < 35
print(f"\nHigh earners under 35:\n{high_earners}")

# Groupby operations - split-apply-combine methodology
# Group data by a categorical variable and apply aggregation functions
dept_stats = df.groupby('Department').agg({
    'Salary': ['mean', 'min', 'max'],  # Multiple aggregations for Salary
    'Age': 'mean',  # Single aggregation for Age
    'Years_Experience': 'mean'  # Single aggregation for Experience
})
print(f"\nDepartment Statistics:\n{dept_stats}")

# Sorting data - arrange rows based on column values
# ascending=False for descending order
df_sorted = df.sort_values('Salary', ascending=False)  # Sort by Salary (highest first)
print(f"\nEmployees sorted by Salary (descending):\n{df_sorted}")

# Creating new columns - feature engineering
# Vectorized operations apply to the entire column at once
df['Salary_per_Experience'] = df['Salary'] / df['Years_Experience']  # Calculate efficiency metric
df['Age_Group'] = pd.cut(df['Age'], bins=[20, 30, 40], labels=['Young', 'Middle'])  # Categorize ages
print(f"\nDataset with new columns:\n{df[['Name', 'Salary_per_Experience', 'Age_Group']]}")
print("\n" + "=" * 50 + "\n")

# VISUALIZATION 1: BAR CHART - Department Distribution
plt.figure(figsize=(15, 5))  # Figure size (width, height) in inches - wide enough for 3 subplots
# One color per department (the sample data has four departments)
colors = ['#FF6B6B', '#4ECDC4', '#45B7D1', '#FFA07A']

# Count occurrences of each department
dept_counts = df['Department'].value_counts()  # Returns a Series with counts
plt.subplot(1, 3, 1)  # Create subplot: 1 row, 3 columns, position 1

# Create bar chart
bars = plt.bar(dept_counts.index, dept_counts.values, color=colors)
plt.title('Employee Distribution by Department', fontsize=14, fontweight='bold')
plt.xlabel('Department', fontsize=12)
plt.ylabel('Number of Employees', fontsize=12)
plt.xticks(rotation=45)  # Rotate x-axis labels for better readability

# Add value labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(bar.get_x() + bar.get_width() / 2., height + 0.05,
             f'{int(height)}', ha='center', va='bottom', fontweight='bold')

# VISUALIZATION 2: SCATTER PLOT - Salary vs Experience
plt.subplot(1, 3, 2)  # Position 2 in the subplot grid

# Create scatter plot with color mapping by department
departments = df['Department'].unique()  # Get unique department names
for i, dept in enumerate(departments):
    # Filter data for the current department
    dept_data = df[df['Department'] == dept]
    plt.scatter(dept_data['Years_Experience'], dept_data['Salary'],
                c=colors[i], label=dept, alpha=0.7, s=100)  # s controls point size

plt.title('Salary vs Years of Experience', fontsize=14, fontweight='bold')
plt.xlabel('Years of Experience', fontsize=12)
plt.ylabel('Salary ($)', fontsize=12)
plt.legend()  # Show legend for department colors
plt.grid(True, alpha=0.3)  # Add grid for better readability

# VISUALIZATION 3: BOX PLOT - Salary Distribution by Department
plt.subplot(1, 3, 3)  # Position 3 in the subplot grid

# Create box plot using seaborn for better aesthetics
sns.boxplot(data=df, x='Department', y='Salary')
plt.title('Salary Distribution by Department', fontsize=14, fontweight='bold')
plt.xlabel('Department', fontsize=12)
plt.ylabel('Salary ($)', fontsize=12)
plt.xticks(rotation=45)

# Adjust layout to prevent overlapping
plt.tight_layout()
plt.show()
print("\n" + "=" * 50 + "\n")

# ADVANCED OPERATIONS
# Pivot tables - reshape data for analysis
# Similar to Excel pivot tables
pivot_table = df.pivot_table(
    values='Salary',  # Values to aggregate
    index='Department',  # Rows
    columns='Age_Group',  # Columns
    aggfunc='mean',  # Aggregation function
    fill_value=0  # Fill missing combinations with 0
)
print("Pivot Table - Average Salary by Department and Age Group:")
print(pivot_table)

# Correlation analysis - measure linear relationships between variables
# Only works with numerical columns
numerical_cols = df.select_dtypes(include=[np.number])  # Select only numeric columns
correlation_matrix = numerical_cols.corr()  # Calculate correlation coefficients
print(f"\nCorrelation Matrix:\n{correlation_matrix}")

# String operations - working with text data
# The .str accessor provides vectorized string methods
df['Name_Length'] = df['Name'].str.len()  # Calculate the length of each name
df['Name_Upper'] = df['Name'].str.upper()  # Convert names to uppercase
print(f"\nString operations:\n{df[['Name', 'Name_Length', 'Name_Upper']]}")

# Final summary statistics
print("\nFinal Dataset Summary:")
print(f"Total employees: {len(df)}")
print(f"Average salary: ${df['Salary'].mean():.2f}")
print(f"Salary range: ${df['Salary'].min()} - ${df['Salary'].max()}")
print(f"Most common department: {df['Department'].mode()[0]}")
print(f"Average experience: {df['Years_Experience'].mean():.1f} years")
```

**Key Conceptual Explanations:**

1. **DataFrame Structure**: Think of it as an Excel spreadsheet with labeled rows and columns, but with powerful programmatic capabilities.
2. **Boolean Indexing**: Creates a mask of True/False values to filter data based on conditions - fundamental for data analysis.
3. **Groupby Operations**: Implements the split-apply-combine pattern - split data into groups, apply a function to each group, combine the results.
4. **Vectorized Operations**: Operations applied to entire columns at once, much faster than loops - the core pandas philosophy.
5. **Data Types**: Pandas automatically infers data types, but understanding them is crucial for memory efficiency and operations.

This code demonstrates the essential pandas workflows: data creation, exploration, manipulation, visualization, and analysis - forming the foundation for most data science projects.

Learn more

Check the full skill on Prompt Guide to master this technique from A to Z.

View on Prompt Guide