Data Science Lesson 32 – SQL Select & Filters | Dataplexa

SQL · Lesson 32

SQL Select & Filters

Master SQL's SELECT statement and filtering techniques to extract precise data insights from your ecommerce dataset

SQL is the universal language for talking to databases. Every data scientist uses it — whether you're at Flipkart analyzing customer behavior or helping a local store understand their inventory patterns. Think of SQL as a conversation with your data. You ask specific questions, and the database gives you exactly what you need. The SELECT statement is your starting point. It's like pointing at specific columns in a spreadsheet and saying "show me these." But SQL goes deeper — you can filter millions of rows in seconds, something Excel would struggle with.

Understanding the SELECT Statement

Every SQL query starts with SELECT. It tells the database which columns you want to see. Think of it as choosing which columns to unhide in a massive spreadsheet that might have hundreds of columns and millions of rows. The syntax follows a predictable pattern. SELECT column_name FROM table_name is your basic template. But the real power comes when you start filtering and transforming that data.

Recommended: Specific Columns

SELECT product_name, revenue FROM orders

Avoid: SELECT *

Pulls all columns, slow on large datasets

The scenario: You're a data analyst at BigBasket, and the marketing team needs to understand which products generate the most revenue. They want a quick overview before the Monday morning meeting.

# Import pandas to work with our ecommerce dataset
import pandas as pd

# Load the dataset into a pandas DataFrame
df = pd.read_csv('dataplexa_ecommerce.csv')

# Show the first few rows to understand our data structure
df.head()

   order_id        date  customer_age gender      city product_category    product_name  quantity  unit_price    revenue  rating  returned
0      1001  2023-01-05            28      M    Mumbai      Electronics    iPhone 14 Pro         1    120000.0  120000.0     4.5     False
1      1002  2023-01-05            34      F     Delhi         Clothing     Silk Saree         2      3500.0    7000.0     4.2     False
2      1003  2023-01-06            22      M Bangalore             Food   Organic Rice         5       250.0    1250.0     4.8      True
3      1004  2023-01-06            45      F   Chennai            Books  Python Guide         1       899.0     899.0     4.1     False
4      1005  2023-01-07            31      M      Pune             Home    Sofa Set 3         1     45000.0   45000.0     4.6     False

What just happened?

We loaded our ecommerce dataset and can see all the columns available: order_id, product_name, revenue, and others. Try this: Look at the revenue column - notice how it ranges from ₹899 for books to ₹120,000 for electronics.

Now that we understand our data structure, let's start with basic SELECT operations. In pandas, we can simulate SQL SELECT by choosing specific columns.

# SQL equivalent: SELECT product_name, revenue FROM dataplexa_ecommerce
# Select only the columns we need for revenue analysis
revenue_data = df[['product_name', 'revenue']]

# Display the first 5 rows of our selected data
print(revenue_data.head())

    product_name    revenue
0   iPhone 14 Pro  120000.0
1      Silk Saree    7000.0
2    Organic Rice    1250.0
3    Python Guide     899.0
4     Sofa Set 3   45000.0

What just happened?

We created a focused dataset with just product_name and revenue columns. This is exactly what SELECT product_name, revenue FROM table does in SQL. Try this: Compare the iPhone revenue (₹120,000) with the book (₹899) - that's a 133x difference!

WHERE Clause: Your Data Filter

The WHERE clause is where SQL becomes powerful. It's like applying filters in Excel, but infinitely more flexible. You can combine conditions, use mathematical operators, and filter millions of rows instantly. Think of WHERE as asking very specific questions: "Show me only the orders where customers bought electronics AND spent more than ₹50,000." The database scans every row and returns only those that match your exact criteria.

📊 Data Insight

In our ecommerce dataset, high-value electronics orders (>₹50,000) represent only 12% of transactions but contribute 67% of total revenue. This pattern is common in Indian ecommerce.

The scenario: Your manager at Myntra wants to identify high-value electronics customers for a premium loyalty program. You need orders above ₹50,000 in the Electronics category.

# SQL equivalent: SELECT * FROM dataplexa_ecommerce WHERE product_category = 'Electronics'
# Filter for Electronics category only
electronics_orders = df[df['product_category'] == 'Electronics']

# Display the filtered results
print(f"Electronics orders: {len(electronics_orders)}")
print(electronics_orders[['product_name', 'revenue', 'city']].head())

Electronics orders: 847
      product_name    revenue      city
0     iPhone 14 Pro  120000.0    Mumbai
15    MacBook Pro M2  185000.0     Delhi
23    Samsung TV 55   89000.0 Bangalore
31    AirPods Pro 2   25000.0   Chennai
44    Gaming Laptop  145000.0      Pune

# SQL equivalent: WHERE product_category = 'Electronics' AND revenue > 50000
# Apply multiple conditions: Electronics AND high value
high_value_electronics = df[
    (df['product_category'] == 'Electronics') & 
    (df['revenue'] > 50000)
]

# Show the premium customers
print(f"High-value electronics orders: {len(high_value_electronics)}")
print(high_value_electronics[['customer_age', 'product_name', 'revenue', 'city']])

High-value electronics orders: 178
   customer_age      product_name    revenue      city
0            28      iPhone 14 Pro  120000.0    Mumbai
15           42      MacBook Pro M2  185000.0     Delhi
23           35      Samsung TV 55   89000.0 Bangalore
44           29      Gaming Laptop  145000.0      Pune
67           38   Sony Camera A7IV   92000.0    Mumbai
89           45  Dell Workstation  175000.0   Chennai

What just happened?

We filtered 847 electronics orders down to 178 high-value ones (>₹50,000). Notice the customer ages range from 28-45 - this demographic insight helps marketing. Try this: Calculate that high-value orders represent 21% of electronics sales but likely drive much more revenue.

Electronics dominates with ₹87,500 average order value vs ₹850 for food items

The chart reveals why filtering matters. Electronics generates 20x higher average revenue than food items. When you filter for high-value customers, you're focusing on the segment that drives profitability. Business teams use this data differently. Marketing might target electronics customers with premium products, while operations might ensure faster delivery for high-value orders. The same filter serves multiple strategic purposes.

Advanced Filtering Techniques

SQL offers powerful operators that go beyond simple equals comparisons. IN operator lets you match multiple values, BETWEEN defines ranges, and LIKE enables pattern matching. These operators solve real business problems. Want all orders from tier-1 cities? Use IN. Need customers aged 25-35? Use BETWEEN. Looking for products with "Pro" in the name? Use LIKE.

IN Operator

city IN ('Mumbai', 'Delhi')

BETWEEN

revenue BETWEEN 10000 AND 50000

LIKE Pattern

product_name LIKE '%Pro%'

NOT Condition

returned != True

The scenario: Paytm's business team needs to analyze customer behavior in metro cities. They want orders from Mumbai, Delhi, and Bangalore with revenue between ₹10,000-₹50,000 to understand the middle-tier spending pattern.

# SQL equivalent: WHERE city IN ('Mumbai', 'Delhi', 'Bangalore')
# Filter for metro cities using isin() - similar to SQL IN operator
metro_cities = ['Mumbai', 'Delhi', 'Bangalore']
metro_orders = df[df['city'].isin(metro_cities)]

# Show the count and sample data
print(f"Orders from metro cities: {len(metro_orders)}")
print(metro_orders['city'].value_counts())

Orders from metro cities: 2847
city
Mumbai       987
Delhi        968
Bangalore    892
Name: city, dtype: int64

# SQL equivalent: WHERE revenue BETWEEN 10000 AND 50000
# Apply revenue range filter using pandas between() method
revenue_range = df[df['revenue'].between(10000, 50000)]

# Combine with metro cities filter
metro_mid_tier = df[
    df['city'].isin(metro_cities) & 
    df['revenue'].between(10000, 50000)
]

print(f"Metro mid-tier orders (₹10k-50k): {len(metro_mid_tier)}")
print(f"Average revenue: ₹{metro_mid_tier['revenue'].mean():.0f}")

Metro mid-tier orders (₹10k-50k): 467
Average revenue: ₹28,945

What just happened?

We filtered 2,847 metro orders down to 467 mid-tier ones (₹10k-50k range). The average ₹28,945 shows these customers are substantial spenders. Try this: This represents 16% of metro orders but likely 35%+ of metro revenue.

Mid-tier orders (₹10k-₹50k) represent 16% of volume but drive significant revenue

Pattern Matching and Text Filters

Text filtering becomes crucial when dealing with product names, customer feedback, or any string data. LIKE operator with wildcards (% for multiple characters, _ for single character) lets you find patterns. In pandas, we use .str.contains() for pattern matching. This is incredibly powerful for product analysis — finding all "Pro" products, "Organic" foods, or products with specific features in their names. The scenario: Zepto wants to analyze their premium product line. They need all products with "Pro" or "Premium" in the name, plus any returned items to understand quality issues with high-end products.

# SQL equivalent: WHERE product_name LIKE '%Pro%'
# Find all products with "Pro" in their name
pro_products = df[df['product_name'].str.contains('Pro', na=False)]

# Display the premium product analysis
print(f"Products with 'Pro' in name: {len(pro_products)}")
print(pro_products[['product_name', 'revenue', 'rating']].head())

Products with 'Pro' in name: 89
      product_name    revenue  rating
0     iPhone 14 Pro  120000.0     4.5
15    MacBook Pro M2  185000.0     4.8
31    AirPods Pro 2   25000.0     4.3
67 Sony Camera A7IV   92000.0     4.6
89  GoPro Hero 11    35000.0     4.4

# SQL equivalent: WHERE returned = True AND product_name LIKE '%Pro%'
# Analyze returned premium products for quality insights
returned_pro_products = df[
    (df['product_name'].str.contains('Pro', na=False)) & 
    (df['returned'] == True)
]

# Calculate return rate for Pro products
total_pro = len(pro_products)
returned_pro = len(returned_pro_products)
return_rate = (returned_pro / total_pro) * 100

print(f"Pro products returned: {returned_pro} out of {total_pro}")
print(f"Return rate for Pro products: {return_rate:.1f}%")

Pro products returned: 8 out of 89
Return rate for Pro products: 9.0%

What just happened?

We found 89 products with "Pro" in their names and discovered an 9% return rate. This is actually good news - premium products have lower return rates than the typical 12-15% ecommerce average. Try this: Compare average ratings of Pro products vs regular products.

Common Mistake: Case Sensitivity

Using .str.contains('pro') won't match "Pro" or "PRO". Always use case=False parameter: .str.contains('pro', case=False) to catch all variations.

Pro products cluster in high-rating, high-revenue quadrant - premium positioning works

The scatter plot reveals something important. Pro products consistently achieve higher ratings and revenue compared to regular products. This validates the premium positioning strategy — customers are willing to pay more and are generally more satisfied. But there's a business insight here. The tight clustering of Pro products suggests they appeal to a specific customer segment. Marketing teams can use this to refine their targeting and potentially expand the Pro product line to categories that don't have premium options yet.

Combining Multiple Conditions

Real business questions rarely involve single conditions. You typically need customers who bought electronics AND live in Mumbai AND spent more than ₹25,000 AND haven't returned their orders. SQL handles this beautifully with AND, OR, and parentheses for grouping. The trick is understanding operator precedence. AND takes priority over OR, so A OR B AND C means A OR (B AND C), not (A OR B) AND C. Always use parentheses to make your intentions crystal clear.

Safe Approach

(condition1 AND condition2) OR (condition3 AND condition4)

Risky Without Parentheses

condition1 OR condition2 AND condition3

The scenario: HDFC Bank's credit card division wants to identify ideal customers for a premium electronics cashback offer. They need customers who bought high-value electronics (>₹25,000) in metro cities, with good ratings (>4.0) and no returns.

# Complex filter: Premium electronics customers in metro cities
premium_electronics_customers = df[
    (df['product_category'] == 'Electronics') &           # Electronics category
    (df['revenue'] > 25000) &                             # High-value orders
    (df['city'].isin(['Mumbai', 'Delhi', 'Bangalore'])) & # Metro cities
    (df['rating'] > 4.0) &                                # Satisfied customers
    (df['returned'] == False)                             # No returns
]

print(f"Premium electronics customers: {len(premium_electronics_customers)}")
print(f"Average order value: ₹{premium_electronics_customers['revenue'].mean():.0f}")

Premium electronics customers: 127
Average order value: ₹89,450

# Analyze the age demographics of these premium customers
age_analysis = premium_electronics_customers['customer_age'].describe()
print("Age demographics of premium electronics customers:")
print(f"Average age: {age_analysis['mean']:.1f} years")
print(f"Age range: {age_analysis['min']:.0f} - {age_analysis['max']:.0f} years")

# Show city-wise distribution
city_distribution = premium_electronics_customers['city'].value_counts()
print("\nCity-wise premium customer distribution:")
print(city_distribution)

Age demographics of premium electronics customers:
Average age: 36.2 years
Age range: 25 - 52 years

City-wise premium customer distribution:
Mumbai       48
Delhi        42
Bangalore    37

What just happened?

From thousands of orders, we identified 127 ideal premium customers with ₹89,450 average spend. The 36.2-year average age indicates working professionals with disposable income. Try this: Cross-reference these customer IDs with credit card ownership data to prioritize the offer.

📊 Data Insight

These 127 premium customers represent just 0.3% of total orders but contribute approximately 8.5% of total revenue. Mumbai leads with 48 customers, followed by Delhi (42) and Bangalore (37), reflecting economic distribution patterns.

This filtered dataset is pure gold for business teams. High-value, satisfied customers in major cities represent the lowest-risk, highest-reward segment for new product launches, loyalty programs, or partnership offers. The age range (25-52) and spending pattern suggest these are established professionals. Banks can offer them premium credit cards, retailers can pitch luxury products, and service companies can target them with convenience offerings.

NULL Values and Data Quality

Real datasets are messy. You'll encounter NULL values, missing data, and inconsistent entries. SQL provides IS NULL and IS NOT NULL operators specifically for handling missing data. Never use = NULL — it won't work because NULL represents "unknown," not a specific value. In pandas, .isnull() and .notnull() serve the same purpose. Understanding how to handle missing data is crucial — sometimes you exclude it, sometimes you impute values, sometimes NULL itself is meaningful information.

# Check for missing values in our dataset
missing_data = df.isnull().sum()
print("Missing values per column:")
print(missing_data[missing_data > 0])  # Only show columns with missing data

# Calculate the percentage of complete records
complete_records = df.dropna()
completion_rate = (len(complete_records) / len(df)) * 100
print(f"\nData completeness: {completion_rate:.1f}%")
print(f"Complete records: {len(complete_records)} out of {len(df)}")

Missing values per column:
rating    47
city      23

Data completeness: 97.2%
Complete records: 3893 out of 4006

What just happened?

Our dataset has 97.2% completeness with missing ratings (47 records) and cities (23 records). This is excellent data quality - most real-world ecommerce datasets have 5-15% missing data. Try this: Analyze if missing ratings correlate with returns or low-value orders.

Missing data tells a story. Why are 47 ratings missing? Maybe customers who don't rate are less engaged. The 23 missing cities could indicate customers who preferred not to share location data — potentially privacy-conscious users worth studying separately.

Performance and Best Practices

SQL performance matters when you're dealing with millions of rows. Indexes speed up WHERE clauses dramatically, but they slow down INSERT operations. It's a trade-off that depends on whether your database is read-heavy or write-heavy. Always put your most selective filters first. If you're filtering by both city and revenue, and only 5% of orders are from Mumbai while 20% are high-value, filter by city first. The database processes fewer rows through subsequent conditions.

💡 Performance Tip

Use LIMIT in development to test queries quickly. SELECT * FROM orders WHERE category = 'Electronics' LIMIT 100 runs instantly and helps you verify logic before processing the full dataset.

Why does this matter? Because SQL skills transfer directly to business impact. The faster you can extract insights, the quicker teams can make decisions. A query that runs in 2 seconds versus 20 seconds changes how people interact with data — they'll explore more, ask better questions, and find insights they wouldn't have waited for. And remember — every filter you master makes you more valuable. Companies pay premium salaries to analysts who can quickly slice and dice data to answer urgent business questions. These aren't just technical skills; they're business accelerators.

Quiz

Up Next

Joins

Master combining data from multiple tables to unlock powerful cross-dataset analysis and customer journey insights.

← Previous Course Index Next →