Data Science · Lesson 4

Python for DS

Write Python code that loads, explores, and manipulates real datasets using libraries built for data science

This lesson covers

Essential Libraries · Data Loading · First Analysis · Common Operations

The Data Science Python Stack

Python wasn't originally designed for data science. But the community built incredible libraries on top of it. Think of Python as the engine and these libraries as specialized tools bolted onto that engine.

1. Install libraries
2. Import what you need
3. Load your data
4. Start analyzing

Pandas

Excel-like data manipulation. Loading CSVs, filtering rows, grouping data.

NumPy

Fast math on arrays. Statistics, matrix operations, mathematical functions.

Matplotlib

Basic charts and plots. Bar charts, line graphs, scatter plots.

Seaborn

Beautiful statistical plots. Heatmaps, distribution plots, correlation matrices.

Honestly, Pandas alone covers the bulk of everyday data science work. The other libraries add specific capabilities you'll need later. But Pandas is where everyone starts.

$ pip install pandas numpy matplotlib seaborn   # Core data science stack

import pandas as pd    # Standard alias - everyone uses 'pd'
import numpy as np     # Standard alias - everyone uses 'np'

df = pd.read_csv('dataplexa_ecommerce.csv')   # Load data into DataFrame
df.head(5)      # Show first 5 rows
df.info()       # Data types and null counts
df.describe()   # Summary statistics

Loading Your First Dataset

The scenario: You're a data analyst at Flipkart. The marketing team needs urgent insights about customer purchase patterns. They've sent you a CSV file with 12,847 orders from last quarter.

# Import the essential library
import pandas as pd

# Load the dataset
df = pd.read_csv('dataplexa_ecommerce.csv')

# First look at the data
print("Dataset shape:", df.shape)
print("\nFirst 3 rows:")
print(df.head(3))

What just happened?

pd.read_csv() loaded 12,847 rows and 12 columns into memory. df.shape shows (rows, columns). df.head(3) displays the first 3 records. Try this: Change the number in head() to see more rows.
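If you want to poke around further, a few other standard DataFrame methods are worth knowing. This sketch uses only core pandas; nothing in it is specific to this dataset:

print(df.head(10))           # first 10 rows instead of the default 5
print(df.tail(3))            # last 3 rows - check the end of the file
print(df.sample(5))          # 5 random rows - good for spotting oddities
print(df.columns.tolist())   # all column names as a plain list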

That single line of code loaded 12,847 customer transactions. Excel can open a file this size, but it bogs down quickly as data grows and tops out at roughly a million rows. Pandas handles datasets with millions of rows without breaking a sweat.

📊 Data Insight

This dataset contains ₹2.4 crores worth of transactions across 5 cities and 5 product categories. Electronics leads with 35% of total revenue.
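If you want to verify a breakdown like that yourself, here's a minimal sketch. It assumes the product_category and revenue columns this dataset is described as having later in the lesson:

# Revenue share by product category, as a percentage of total
category_share = df.groupby('product_category')['revenue'].sum()
category_share = category_share / category_share.sum() * 100
print(category_share.sort_values(ascending=False).round(1))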

Understanding Your Data Structure

Before you analyze anything, you need to understand what you're working with. Think of this as your data reconnaissance mission.

Dataset Overview
Column   | Data Type | Sample Values            | Business Meaning
---------|-----------|--------------------------|--------------------------
order_id | Integer   | 1001, 1002, 1003         | Unique transaction ID
date     | String    | 2023-01-05, 2023-01-06   | Purchase date
revenue  | Float     | 25999.0, 1798.0, 1350.0  | Total order value in INR
returned | Boolean   | True, False              | Product returned or not
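Notice that the date column arrives as a string, so it won't sort or compare chronologically until you convert it. A minimal sketch using pd.to_datetime (covered again under Common Mistakes below):

# Convert 'date' from string to a proper datetime column
df['date'] = pd.to_datetime(df['date'])

# Now chronological operations behave correctly
print(df['date'].min(), "to", df['date'].max())   # date range covered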

The scenario: Your manager at Swiggy wants to know if the data quality is good enough for analysis. You need to check for missing values, data types, and basic statistics.

# Get comprehensive info about the dataset
print("=== DATA INFO ===")
df.info()

print("\n=== MISSING VALUES ===")
print(df.isnull().sum())

print("\n=== BASIC STATISTICS ===")
print(df.describe())

What just happened?

df.info() shows data types and memory usage. df.isnull().sum() counts missing values per column (0 means clean data). df.describe() gives min/max/mean for numeric columns. Try this: Add include='all' to describe() to see text columns too.

Perfect! Zero missing values means this dataset is clean and ready for analysis. The average order value is ₹52,110, which seems reasonable for an e-commerce platform selling electronics and clothing.
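For reference, here's the include='all' variant from the Try this above. Text columns get counts and most-frequent values instead of means:

# Summary statistics for every column, text columns included
print(df.describe(include='all'))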

Essential Data Operations

Now the real work begins. You'll spend most of your time filtering, grouping, and transforming data. A handful of operations handle most scenarios you'll encounter: filtering rows, grouping and aggregating, binning continuous values, and sorting results.

The scenario: Your manager at Zomato wants answers to specific questions. Which city generates the most revenue? What's the return rate for electronics? How do ratings vary by product category?

# Filter high-value orders (above 50K)
high_value = df[df['revenue'] > 50000]
print(f"High-value orders: {len(high_value)} out of {len(df)}")

# Group by city to find revenue leaders
city_revenue = df.groupby('city')['revenue'].sum().sort_values(ascending=False)
print(f"\nTop revenue city: {city_revenue.index[0]} with ₹{city_revenue.iloc[0]:,.0f}")

# Calculate return rate for electronics
electronics = df[df['product_category'] == 'Electronics']
return_rate = electronics['returned'].mean() * 100
print(f"Electronics return rate: {return_rate:.1f}%")

What just happened?

df[df['revenue'] > 50000] filtered 6,424 high-value orders. groupby('city') aggregated revenue by city. mean() on boolean column calculated return percentage. Try this: Change the revenue threshold to 75000 and see how many orders remain.

Mumbai leads with ₹52.3 crores, followed by Delhi at ₹48.1 crores

Mumbai generates 23% more revenue than Pune despite serving the same product categories. This suggests either higher customer spending power or better market penetration. The marketing team should investigate what's working in Mumbai and replicate it in other cities.

The 8.3% return rate for electronics is actually quite good for this category. Industry benchmarks hover around 12-15% for consumer electronics, so this indicates solid product quality and accurate product descriptions.
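The boolean-mean trick generalizes: group by category first, then take the mean. A quick sketch using the same columns as the filtering example above:

# Return rate for every category at once
# (mean of a boolean column = fraction of True values)
return_rates = df.groupby('product_category')['returned'].mean() * 100
print(return_rates.sort_values(ascending=False).round(1))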

# Create age groups for analysis
df['age_group'] = pd.cut(df['customer_age'], 
                        bins=[17, 25, 35, 50, 70], 
                        labels=['18-25', '26-35', '36-50', '51+'])

# Average order value by age group
avg_order = df.groupby('age_group')['revenue'].mean()
print("Average order value by age:")
for group, value in avg_order.items():
    print(f"{group}: ₹{value:,.0f}")

# Top 3 products by quantity sold
top_products = df.groupby('product_name')['quantity'].sum().sort_values(ascending=False).head(3)
print(f"\nBest sellers:")
for product, qty in top_products.items():
    print(f"{product}: {qty} units")

What just happened?

pd.cut() created age buckets from continuous ages. groupby('age_group') calculated average spending per bucket. .head(3) showed only top 3 products. Try this: Change the age bins to [17, 30, 45, 70] and see how spending patterns shift.

📊 Data Insight

The 36-50 age group spends 61% more per order than younger customers. They represent the premium customer segment worth targeting with higher-value products.
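One way a figure like that 61% can be derived from the avg_order Series computed above (a sketch, assuming "younger customers" means the 18-25 bucket):

# Premium of the 36-50 group over the 18-25 group, in percent
premium = (avg_order['36-50'] / avg_order['18-25'] - 1) * 100
print(f"36-50 customers spend {premium:.0f}% more per order than 18-25")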

Common Mistakes to Avoid

Three mistakes trip up almost every beginner, and I've seen senior analysts make them too. Each one can completely invalidate your analysis if you're not careful.

Mistake #1: Assuming clean data

Always run df.info() and df.isnull().sum() first. Missing values will break your calculations. Fix: Check for nulls, duplicates, and data types before any analysis.

Mistake #2: Wrong data types

Dates stored as strings won't sort chronologically. Numbers as text won't calculate properly. Fix: Use pd.to_datetime() for dates and pd.to_numeric() for numbers.

Mistake #3: Forgetting to save results

Operations like filtering create new DataFrames but don't modify the original. Fix: Assign results to variables: filtered_df = df[df['revenue'] > 50000]
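Here's a short sketch putting all three fixes together. Column names match the dataset described above; errors='coerce' turns unparseable values into NaN instead of raising an error:

# Mistake #1: inspect before you analyze
print(df.isnull().sum())       # missing values per column
print(df.duplicated().sum())   # exact duplicate rows
print(df.dtypes)               # is each column the type you expect?

# Mistake #2: fix types explicitly
df['date'] = pd.to_datetime(df['date'], errors='coerce')
df['revenue'] = pd.to_numeric(df['revenue'], errors='coerce')

# Mistake #3: filtering returns a NEW DataFrame - assign it
filtered_df = df[df['revenue'] > 50000]   # df itself is unchanged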

The scenario: You're wrapping up your analysis for the OYO revenue team. They need a final summary with key metrics that they can present to executives tomorrow morning.

# Final business summary
print("=== EXECUTIVE SUMMARY ===")
total_revenue = df['revenue'].sum()
total_orders = len(df)
avg_order_value = df['revenue'].mean()
top_city = df.groupby('city')['revenue'].sum().idxmax()
top_category = df.groupby('product_category')['revenue'].sum().idxmax()

print(f"Total Revenue: ₹{total_revenue/10000000:.1f} crores")
print(f"Total Orders: {total_orders:,}")
print(f"Average Order Value: ₹{avg_order_value:,.0f}")
print(f"Top Revenue City: {top_city}")
print(f"Top Revenue Category: {top_category}")
print(f"Overall Customer Rating: {df['rating'].mean():.1f}/5.0")

What just happened?

.sum() totaled all revenue values. .idxmax() returned the city/category with maximum revenue. Dividing by 10,000,000 converted rupees to crores. Try this: Add df['returned'].sum() to count total returns.
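And the Try this spelled out, since True counts as 1 when summed:

# Total returned orders and the overall return rate
total_returns = df['returned'].sum()
print(f"Total Returns: {total_returns:,} ({df['returned'].mean() * 100:.1f}% of orders)")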

There's your first complete data analysis using Python. A few lines of code generated insights that would take hours in Excel. But honestly, this is just scratching the surface. Next, you'll learn how to handle the messy reality of missing data.

Best workflow: Dataplexa on one side, Kaggle or Colab on the other. Read here, run code there immediately.

Quiz

1. Your manager at BigBasket wants to know total revenue by city from the ecommerce dataset. Which code gives the correct answer?


2. You're analyzing a new dataset for Myntra and need to check data types, memory usage, and null counts all at once. Which single command provides this information?


3. In a dataset with 12,847 orders, you filter for orders above ₹50,000 using df[df['revenue'] > 50000]. The result shows 6,424 rows. What percentage of orders are high-value?


Up Next

Missing Values

Learn to detect, understand, and handle missing data that could invalidate your entire analysis if ignored.