Advanced Python Data Analysis & Pandas Script Generator
Quickly generate robust Python code for data cleaning, aggregation, and visualization using Pandas and Matplotlib/Seaborn.
Prompt
Act as a Senior Data Scientist. Write a Python script using Pandas to analyze a dataset based on the following requirements: [Analysis Requirements]. The dataset has the following columns: [List of Columns]. Ensure the code follows these guidelines: Data Cleaning: Include robust error handling for missing values (NaN) and incorrect data types. Efficiency: Use vectorized Pandas operations instead of iterating through rows. Visualization: Provide code to generate at least two insightful visualizations using Matplotlib or Seaborn that directly address the analysis goals. Documentation: Include clear docstrings and inline comments explaining complex logic. Tone & Style: Professional, calm, and confident. Adhere to PEP-8 standards. Avoid unnecessary abstraction.
Instructions
List your columns: Provide an accurate list of the columns in your CSV or SQL table.
Specify objectives: Clearly state what insights you are trying to extract from the data.
Check dependencies: Ensure you have the required libraries (pandas, matplotlib, seaborn) installed in your environment.
Review visualizations: Adjust the generated plot parameters to fit your presentation needs.
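The dependency check can be scripted before running any generated code. The following is a minimal sketch, assuming the three libraries are importable under the names pandas, matplotlib, and seaborn; the helper name `missing_packages` is hypothetical and not part of any library.

```python
from importlib.util import find_spec


def missing_packages(names):
    """Return the names from `names` that cannot be imported here."""
    return [name for name in names if find_spec(name) is None]


required = ["pandas", "matplotlib", "seaborn"]
missing = missing_packages(required)
if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All required packages are installed.")
```

Anything reported as missing can be installed with the package manager you normally use for the environment.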
Examples
Example Input
Act as a Senior Data Scientist. Write a Python script using Pandas to analyze a dataset based on the following requirements: [Identify the top 5 highest-selling products by region, handle missing 'Price' values by using the median, and visualize the monthly sales trend for the last 12 months.]. The dataset has the following columns: [['order_id', 'product_name', 'category', 'price', 'quantity', 'order_date', 'region', 'customer_id']]. Ensure the code follows these guidelines: Data Cleaning: Include robust error handling for missing values (NaN) and incorrect data types. Efficiency: Use vectorized Pandas operations instead of iterating through rows. Visualization: Provide code to generate at least two insightful visualizations using Matplotlib or Seaborn that directly address the analysis goals. Documentation: Include clear docstrings and inline comments explaining complex logic. Tone & Style: Professional, calm, and confident. Adhere to PEP-8 standards. Avoid unnecessary abstraction.
Example Output
Here’s a professional, PEP-8-compliant Python script using Pandas that meets all your requirements. It includes robust error handling, vectorized operations, and two visualizations addressing the analysis goals: top 5 products by region and monthly sales trends.
"""
Sales Analysis Script
Author: John
Purpose: Analyze sales data to identify top-selling products by region, handle missing prices,
and visualize monthly sales trends for the last 12 months.
"""
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# --- Load Dataset ---
try:
    # Replace 'sales_data.csv' with your dataset path
    df = pd.read_csv('sales_data.csv')
except FileNotFoundError:
    raise FileNotFoundError("The dataset file was not found. Please check the file path.")
except pd.errors.EmptyDataError:
    raise ValueError("The dataset is empty.")
except Exception as e:
    raise RuntimeError(f"Error loading dataset: {e}")
# --- Data Cleaning ---
# Ensure correct data types
df['price'] = pd.to_numeric(df['price'], errors='coerce') # Convert invalid prices to NaN
df['quantity'] = pd.to_numeric(df['quantity'], errors='coerce')
df['order_date'] = pd.to_datetime(df['order_date'], errors='coerce')
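# Handle missing 'price' values with the median
median_price = df['price'].median()
# Assign back instead of using inplace=True on a column selection,
# which is deprecated in recent pandas and may not update df
df['price'] = df['price'].fillna(median_price)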
# Drop rows with critical missing values
df.dropna(subset=['product_name', 'order_date', 'region'], inplace=True)
# Add a 'total_sales' column
df['total_sales'] = df['price'] * df['quantity']
# --- Top 5 Highest-Selling Products by Region ---
top_products_by_region = (
    df.groupby(['region', 'product_name'], as_index=False)['total_sales']
    .sum()
    .sort_values(['region', 'total_sales'], ascending=[True, False])
)
# Extract top 5 products per region
top_5_products = top_products_by_region.groupby('region').head(5)
print("Top 5 Products by Region:")
print(top_5_products)
# --- Monthly Sales Trend for Last 12 Months ---
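# Keep the 12 most recent calendar months, counting the latest month.
# Subtracting a full 12 months from the latest date would also pull in
# part of a 13th month.
latest_month = df['order_date'].max().to_period('M')
window_start = (latest_month - 11).to_timestamp()
last_12_months_df = df[df['order_date'] >= window_start]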
# Aggregate monthly sales
monthly_sales = (
    last_12_months_df
    .groupby(last_12_months_df['order_date'].dt.to_period('M'))
    .agg({'total_sales': 'sum'})
    .reset_index()
)
monthly_sales['order_date'] = monthly_sales['order_date'].dt.to_timestamp()
# --- Visualization 1: Top 5 Products by Region ---
plt.figure(figsize=(12, 6))
sns.barplot(
    data=top_5_products,
    x='product_name',
    y='total_sales',
    hue='region'
)
plt.title('Top 5 Highest-Selling Products by Region')
plt.xlabel('Product Name')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.legend(title='Region')
plt.tight_layout()
plt.show()
# --- Visualization 2: Monthly Sales Trend ---
plt.figure(figsize=(12, 6))
sns.lineplot(
    data=monthly_sales,
    x='order_date',
    y='total_sales',
    marker='o'
)
plt.title('Monthly Sales Trend (Last 12 Months)')
plt.xlabel('Month')
plt.ylabel('Total Sales')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()
✅ Key Features of the Script
Data Cleaning & Error Handling
Converts price and quantity to numeric, coercing errors to NaN.
Fills missing prices with the median.
Drops rows with missing critical values like product_name, order_date, or region.
Parses order_date with pd.to_datetime, coercing unparseable values to NaT so the affected rows are dropped.
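The cleaning steps listed above can be tried on a tiny frame. This is a sketch with made-up sample values, not data from any real dataset. It assigns the filled column back to the frame, one way to write the fill that does not rely on in-place modification of a column selection.

```python
import pandas as pd

# Hypothetical sample: one invalid price ("abc") and one missing price
df = pd.DataFrame({
    "price": ["10.0", "abc", None, "30.0"],
    "quantity": [1, 2, 3, 4],
})

# Invalid strings become NaN instead of raising an error
df["price"] = pd.to_numeric(df["price"], errors="coerce")

# The median is computed over the valid values only (10.0 and 30.0)
median_price = df["price"].median()

# Fill both the coerced and the originally missing price
df["price"] = df["price"].fillna(median_price)

print(df)
```

Note that the fill covers prices that were invalid as well as prices that were missing, since coercion turns both into NaN.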
Efficiency
Uses vectorized Pandas operations (groupby, agg, sort_values) instead of row-wise loops.
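To illustrate the difference, here is a small sketch with made-up values that computes the same total_sales column both ways. The row-wise version is shown only for comparison.

```python
import pandas as pd

# Hypothetical sample values
df = pd.DataFrame({"price": [2.5, 4.0, 1.0], "quantity": [2, 1, 10]})

# Vectorized: a single column-wise multiplication
df["total_sales"] = df["price"] * df["quantity"]

# Row-wise equivalent: same result, but one Python-level step per row
looped = [row.price * row.quantity for row in df.itertuples(index=False)]

print(df["total_sales"].tolist())
print(looped)
```

Both produce the same numbers; the vectorized form tends to be shorter and faster on large frames because the loop runs inside pandas rather than in Python.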
Insights
Top 5 products per region identified using groupby and head.
Monthly sales trends for the last 12 months calculated and visualized.
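Both insights rest on short groupby patterns. The sketch below uses made-up frames (the names `sales`, `orders`, and `top_n` are illustrative) to show a top-N-per-group selection and a monthly aggregation.

```python
import pandas as pd

# Hypothetical per-product totals, already summed by region and product
sales = pd.DataFrame({
    "region": ["East", "East", "East", "West", "West"],
    "product_name": ["A", "B", "C", "A", "B"],
    "total_sales": [100.0, 300.0, 200.0, 50.0, 75.0],
})

# Sort so the best sellers come first within each region,
# then keep the first top_n rows of every region
top_n = 2
top_products = (
    sales.sort_values(["region", "total_sales"], ascending=[True, False])
    .groupby("region")
    .head(top_n)
)

# Hypothetical order lines for the monthly aggregation
orders = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-15", "2024-01-20", "2024-02-03"]),
    "total_sales": [10.0, 5.0, 7.0],
})

# Group by calendar month and sum the sales in each month
monthly = orders.groupby(orders["order_date"].dt.to_period("M"))["total_sales"].sum()

print(top_products)
print(monthly)
```

Because head keeps rows in their sorted order, the ranking within each region is preserved in the result.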
Visualization
Barplot for top products by region (multi-hue for regions).
Line plot for month-over-month sales trends.
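When the script runs without a display (a server or a scheduled job, for example), plt.show() has nothing to render to. A possible adjustment, sketched here with made-up monthly totals and a hypothetical output file name, is to select a non-interactive backend and save the figure instead.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; assumes no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical monthly totals
monthly_sales = pd.DataFrame({
    "order_date": pd.to_datetime(["2024-01-01", "2024-02-01", "2024-03-01"]),
    "total_sales": [120.0, 95.0, 140.0],
})

fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(monthly_sales["order_date"], monthly_sales["total_sales"], marker="o")
ax.set_title("Monthly Sales Trend")
ax.set_xlabel("Month")
ax.set_ylabel("Total Sales")
fig.autofmt_xdate()  # rotate date labels so they do not overlap
fig.tight_layout()

# Hypothetical output location; change the path to suit your project
output_path = os.path.join(tempfile.gettempdir(), "monthly_sales_trend.png")
fig.savefig(output_path, dpi=150)
plt.close(fig)
```

Figure size, dpi, and titles are the parameters most likely to need adjusting for a presentation.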
Documentation
Docstrings and inline comments explain each step.