Data Science
Types of Data
Master the four data types that drive every analysis decision — from choosing the right visualization to picking statistical tests that actually make sense for your variables.
This lesson covers
Numerical vs Categorical Data · Discrete vs Continuous · Ordinal vs Nominal · Statistical Implications
Identify the data type
Choose appropriate analysis methods
Apply correct visualizations
Generate reliable insights
The Four Data Types That Matter
Think of data types like cooking ingredients. You can't substitute flour for sugar and expect the same cake. Each data type has unique properties that determine what you can and cannot do with it mathematically.

Numerical Data
Counts, measures, amounts. You can add, subtract, find averages. Examples: revenue, quantity, customer_age
Categorical Data
Groups, labels, categories. You count frequencies but can't average them. Examples: city, product_category, gender
Discrete Data
Whole numbers only — counts of things. Can't have 2.5 orders. Examples: quantity, order_id
Continuous Data
Any decimal value possible. Measured, not counted. Examples: unit_price, rating (1.0 to 5.0)
Your quantity column is numerical AND discrete. Your rating is numerical AND continuous.
Why This Classification Exists
Each data type restricts which mathematical operations make sense. You can calculate the mean of customer_age because averaging ages gives you meaningful information. But what's the mean of city? It's mathematically undefined.
This isn't academic nitpicking. When Flipkart's analytics team runs A/B tests on checkout flows, they need different statistical tests for conversion rates (continuous, 0-1) versus payment method (categorical). Use the wrong test and your p-values become garbage.
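To make the constraint concrete, here is a minimal sketch using a tiny hypothetical dataframe (the values are made up purely for illustration, not taken from the lesson's dataset):

# Toy frame showing which summaries are defined for which data types
import pandas as pd

toy = pd.DataFrame({'customer_age': [25, 34, 41], 'city': ['Mumbai', 'Delhi', 'Pune']})

print(toy['customer_age'].mean())    # 33.33..., averaging a numerical column is meaningful
print(toy['city'].value_counts())    # frequencies, the right summary for a nominal column
# toy['city'].mean() would raise a TypeError: the mean of city names is undefined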
Numerical Data Deep Dive
Numerical data is your workhorse. It's what lets you build predictive models, run correlations, and calculate business metrics that actually drive decisions. But here's the critical distinction most analysts miss: discrete numerical data behaves differently from continuous. Your quantity column is technically numerical, but it only takes whole number values. This changes which distributions you can assume and which statistical tests are appropriate.

# Identify numerical columns automatically — saves time when datasets have 50+ columns
import pandas as pd
import numpy as np
# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
# Get numerical columns only — excludes strings/objects automatically
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print("Numerical columns:", numerical_cols)
print("\nData types:")
print(df[numerical_cols].dtypes)
Numerical columns: ['order_id', 'customer_age', 'quantity', 'unit_price', 'revenue', 'rating']

Data types:
order_id          int64
customer_age      int64
quantity          int64
unit_price      float64
revenue         float64
rating          float64
dtype: object
What just happened?
select_dtypes(include=[np.number]) — automatically filters to only numerical columns, ignoring text columns like city, gender, product_name
int64 vs float64 — integers are typically discrete (whole numbers), floats are continuous (decimals allowed)
Try this: Add exclude=['bool'] to the select_dtypes call to also exclude boolean columns like returned
Discrete vs Continuous: The Revenue Example
Look at revenue in your dataset. It's calculated as quantity × unit_price. Since quantity is discrete (you can't sell 2.7 items) but unit_price is continuous (₹499.99), the result is technically continuous but practically constrained to specific increments.
# Examine unique values to understand discrete vs continuous nature
print("Quantity unique values (first 10):", sorted(df['quantity'].unique())[:10])
print("Rating unique values:", sorted(df['rating'].unique()))
print("\nRevenue sample — notice the decimal precision:")
print(df['revenue'].head(10).tolist())
Quantity unique values (first 10): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Rating unique values: [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0]

Revenue sample — notice the decimal precision:
[12999.0, 2599.0, 1847.5, 3999.0, 899.0, 7998.0, 1499.0, 24999.0, 4497.0, 1299.0]
What just happened?
quantity: [1, 2, 3, ...] — clearly discrete, only whole numbers possible
rating: [1.0, 1.1, 1.2, ...] — continuous in theory, but constrained to 0.1 increments by the rating system
Try this: Check how many unique revenue values exist with df['revenue'].nunique() — is it truly continuous or effectively discrete?
Common Mistake: Treating Order IDs as Meaningful Numbers
Your order_id column shows up as numerical (int64), but it's actually a categorical identifier. You shouldn't calculate mean(order_id) or find correlations with it. Just because something is stored as a number doesn't mean mathematical operations make sense. When in doubt, ask: "Does the average of these values tell me something useful about my business?"
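One way to guard against this, sketched below on a throwaway copy of df (the copy, df_check, is just a helper name for this example so the original column stays numeric for the later code), is to store identifiers as strings so they drop out of numerical summaries automatically:

# Cast identifier columns to string so numeric summaries skip them
df_check = df.copy()
df_check['order_id'] = df_check['order_id'].astype(str)

# order_id no longer appears among the numerical columns or in describe()
print(df_check.select_dtypes(include=[np.number]).columns.tolist())
print(df_check.describe().columns.tolist())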
Categorical Data: The Real Complexity
Categorical data drives the most important business decisions. Which city generates the most revenue? Which product category has the highest return rate? But there's a crucial distinction that determines how you handle these categories statistically.

Nominal vs Ordinal: The Order Matters
Nominal: Categories have no inherent order. Mumbai vs Delhi vs Bangalore — there's no "greater than" relationship.
Ordinal: Categories have a natural ranking. Rating 1 < 2 < 3 < 4 < 5. The order carries information.
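pandas can encode that difference explicitly with an ordered categorical. A minimal sketch using a hypothetical shirt-size column (not a column from this dataset):

# Hypothetical ordinal data: sizes have a natural ranking
import pandas as pd

size_type = pd.CategoricalDtype(categories=['S', 'M', 'L', 'XL'], ordered=True)
sizes = pd.Series(['M', 'S', 'L', 'M', 'XL', 'S'], dtype=size_type)

print(sizes.min(), sizes.max())   # S XL — order-aware statistics are defined
print((sizes > 'M').sum())        # 2 values rank above Medium — comparisons work too
# A nominal column like city supports none of this; only frequency counts make sense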
# Analyze categorical columns — understanding distribution is crucial for business insights
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns:", categorical_cols)
# Value counts show business patterns, not just data types
print("\n=== CITY DISTRIBUTION ===")
city_counts = df['city'].value_counts()
print(city_counts)
print(f"Mumbai dominates: {city_counts['Mumbai']/len(df)*100:.1f}% of all orders")
Categorical columns: ['date', 'gender', 'city', 'product_category', 'product_name']

=== CITY DISTRIBUTION ===
Mumbai       1287
Delhi        1042
Bangalore     985
Chennai       921
Pune          765
Name: city, dtype: int64
Mumbai dominates: 25.7% of all orders
What just happened?
select_dtypes(include=['object']) — finds text-based columns, which are typically categorical
Mumbai: 1287 orders (25.7%) — shows geographic concentration, critical for logistics and marketing spend allocation
Try this: Compare product_category distribution with df['product_category'].value_counts(normalize=True) to see percentages directly
The Date Column Trap
Notice that date appeared in our categorical columns? That's because pandas loaded it as text. Dates are actually ordinal data with extremely high cardinality — they have a natural order, but so many unique values that standard categorical analysis breaks down.
# Handle dates properly — they're ordinal but need special treatment
print("Date as string:", df['date'].dtype)
print("Unique dates:", df['date'].nunique())
print("Date range:", df['date'].min(), "to", df['date'].max())
# Convert to proper datetime for time-series analysis
df['date'] = pd.to_datetime(df['date'])
print("\nAfter conversion:", df['date'].dtype)
# Now we can extract ordinal components
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
print("Month distribution:")
print(df['month'].value_counts().sort_index())
Date as string: object
Unique dates: 365
Date range: 2023-01-01 to 2023-12-31

After conversion: datetime64[ns]
Month distribution:
1     417
2     381
3     419
4     405
5     420
6     406
7     422
8     425
9     413
10    420
11    408
12    364
Name: month, dtype: int64
What just happened?
365 unique dates — too many categories for typical categorical analysis, but perfect chronological order
pd.to_datetime() — converts text to proper datetime format, enabling time-series operations
Try this: Create a weekend flag with df['is_weekend'] = df['day_of_week'].isin(['Saturday', 'Sunday']) and see if weekend orders differ
📊 Data Insight
December has notably fewer orders (364 vs ~415 average), so confirm the final weeks of the year are fully captured before reading this as a seasonal pattern. This kind of data quirk is common in real datasets and will skew time-series analysis if ignored.
Statistical Implications: Where Data Types Drive Decisions
Data type determines which statistical tests you can run, which charts make sense, and which business questions you can actually answer. Getting this wrong doesn't just create bad analysis — it creates confident wrong analysis.

Critical Rule: Match Analysis to Data Type
Don't calculate correlation between city and revenue — city is nominal categorical, correlation requires numerical data. Instead, use group statistics: revenue by city, ANOVA for significance testing. The math will run if you force it, but the results are meaningless.
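A sketch of that group-statistics approach with pandas and scipy (it assumes the same df as above and that scipy is installed; scipy is not used elsewhere in this lesson):

# Group statistics: summarize a numerical column per nominal category
from scipy import stats

print(df.groupby('city')['revenue'].agg(['mean', 'median', 'count']))

# One-way ANOVA: does mean revenue differ significantly across cities?
revenue_by_city = [group['revenue'].values for _, group in df.groupby('city')]
f_stat, p_value = stats.f_oneway(*revenue_by_city)
print(f"F-statistic: {f_stat:.2f}, p-value: {p_value:.4f}")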
| Data Type | Appropriate Statistics | Best Visualizations | Common Mistakes |
|---|---|---|---|
| Continuous Numerical | Mean, SD, correlation, t-tests | Histograms, scatter plots, box plots | Using bar charts for distributions |
| Discrete Numerical | Counts, mode, Poisson tests | Bar charts, count plots | Assuming normal distribution |
| Nominal Categorical | Frequencies, chi-square, mode | Bar charts, pie charts | Calculating means or correlations |
| Ordinal Categorical | Median, percentiles, Spearman correlation | Ordered bar charts, stacked charts | Ignoring the inherent order |
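For the ordinal row in particular, Spearman correlation works on ranks rather than raw values, which makes it the safer choice than Pearson. A quick sketch comparing the two on rating vs revenue, assuming the same df:

# Pearson assumes interval-scale data; Spearman only uses ranks, so it suits ordinal data
pearson = df['rating'].corr(df['revenue'], method='pearson')
spearman = df['rating'].corr(df['revenue'], method='spearman')
print(f"Pearson: {pearson:.3f}  Spearman: {spearman:.3f}")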
Real Scenario: Zomato's Rating Analysis Mistake
A product team at Zomato wants to understand what drives customer ratings. They have restaurant ratings (1-5 stars), delivery time (minutes), and cuisine type (Indian, Chinese, Italian, etc.). Here's how data type understanding changes everything:
# Wrong approach: treating rating as truly continuous and cuisine as if it has no meaning
# This is statistically questionable but commonly done
# Right approach: acknowledge data type constraints
print("=== RATING ANALYSIS BY DATA TYPE ===")
# Rating is ordinal (1 < 2 < 3 < 4 < 5) — use median and percentiles, not just mean
rating_stats = df['rating'].describe()
print("Rating distribution:")
print(rating_stats)
# For ordinal data, look at distribution across the scale
rating_counts = df['rating'].value_counts().sort_index()
print("\nRating frequency (shows skew better than mean):")
for rating, count in rating_counts.items():
    if rating in [1.0, 2.0, 3.0, 4.0, 5.0]:  # Show only whole number ratings for clarity
        print(f"{rating:.0f} stars: {count:,} orders ({count/len(df)*100:.1f}%)")
Rating distribution:
count    5000.000000
mean        3.047600
std         1.419829
min         1.000000
25%         2.000000
50%         3.100000
75%         4.300000
max         5.000000
Name: rating, dtype: float64

Rating frequency (shows skew better than mean):
1 stars: 865 orders (17.3%)
2 stars: 967 orders (19.3%)
3 stars: 1,087 orders (21.7%)
4 stars: 1,019 orders (20.4%)
5 stars: 1,062 orders (21.2%)
What just happened?
mean: 3.05 — average rating seems mediocre, but that's misleading for ordinal data
median: 3.10 — middle value, more appropriate for ordinal data than mean
Try this: Group by product_category and compare median ratings with df.groupby('product_category')['rating'].median().sort_values(ascending=False)
Real rating systems typically skew heavily toward the top of the scale — most ratings cluster at 4-5 stars
The uniform distribution tells you something important about your data source. Real customer rating systems from Swiggy, Zomato, or Amazon skew heavily toward high ratings — satisfied customers rate, dissatisfied customers either don't rate or leave 1-star reviews. This uniform pattern suggests either simulated data or a very different rating collection process.
When you see patterns that contradict expected business behavior, investigate the data generation process before running analysis. Statistical tests assume the data represents real-world processes. If it doesn't, your p-values and confidence intervals become meaningless.
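One quick diagnostic before trusting the ratings, sketched below with the same df, is to measure the skew and look at the shape directly:

# Skewness near 0 suggests a symmetric or uniform-like distribution;
# real rating columns usually pile up heavily at 4-5 stars instead
print(f"Rating skewness: {df['rating'].skew():.2f}")
print(df['rating'].value_counts(bins=5, sort=False))  # counts in five equal-width bands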
Data Type Conversion: When and Why
Sometimes you need to change data types — not because the data is wrong, but because different types enable different analyses. This is especially common with categorical data that has inherent ordering.

Pro insight: Customer age is numerical, but sometimes you want it categorical — "18-25", "26-35", "36-45" segments for marketing analysis. Revenue is continuous, but you might create ordinal "Low/Medium/High" buckets for executive dashboards.
# Convert continuous age to ordinal age groups — enables different business insights
df['age_group'] = pd.cut(df['customer_age'],
                         bins=[0, 25, 35, 45, 55, 100],
                         labels=['18-25', '26-35', '36-45', '46-55', '55+'],
                         right=False)
print("Age group distribution:")
age_dist = df['age_group'].value_counts().sort_index()
print(age_dist)
# Convert continuous revenue to ordinal spending tiers
# Use quantiles to ensure balanced groups for statistical power
revenue_quantiles = df['revenue'].quantile([0.33, 0.67])
print(f"\nRevenue quantiles: 33rd = ₹{revenue_quantiles[0.33]:,.0f}, 67th = ₹{revenue_quantiles[0.67]:,.0f}")
df['spending_tier'] = pd.cut(df['revenue'],
                             bins=[0, revenue_quantiles[0.33], revenue_quantiles[0.67], float('inf')],
                             labels=['Low', 'Medium', 'High'])
print("\nSpending tier distribution:")
print(df['spending_tier'].value_counts())
Age group distribution:
18-25    1024
26-35    1263
36-45    1245
46-55    1043
55+       425
Name: age_group, dtype: int64

Revenue quantiles: 33rd = ₹2,799, 67th = ₹7,998

Spending tier distribution:
Low       1667
Medium    1666
High      1667
Name: spending_tier, dtype: int64
What just happened?
pd.cut() — converts continuous data to ordinal categories with specified boundaries
quantile([0.33, 0.67]) — creates balanced groups rather than arbitrary cutoffs like "under ₹5,000"
Try this: Create a high-value customer flag with df['is_high_value'] = df['spending_tier'] == 'High' and analyze their characteristics
The Business Reason Behind Conversion
Why convert precise numerical data to broader categories? Because business decisions often need clear segments. Telling the marketing team "target customers with revenue between ₹3,247 and ₹8,932" is useless. Saying "focus on Medium and High spenders — that's 67% of customers generating 85% of revenue" drives action.
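Those segment-versus-revenue figures are straightforward to check against the data. A short sketch using the spending_tier column created above (the exact percentages in your output will depend on the dataset):

# Share of orders vs share of revenue per spending tier
tier_summary = df.groupby('spending_tier', observed=True).agg(
    orders=('revenue', 'size'),
    total_revenue=('revenue', 'sum')
)
tier_summary['order_share_%'] = tier_summary['orders'] / len(df) * 100
tier_summary['revenue_share_%'] = tier_summary['total_revenue'] / df['revenue'].sum() * 100
print(tier_summary.round(1))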
📊 Data Insight
26-35 age group dominates orders (25.3%), followed closely by 36-45 (24.9%). Combined, these prime earning years represent 50% of the customer base. The 55+ segment is notably smaller (8.5%) — opportunity for targeted senior marketing or indication of digital adoption barriers.
When NOT to Convert
Don't create ordinal categories from numerical data if you're building predictive models. Machine learning algorithms work better with the full precision of continuous variables. Convert for business reporting and executive dashboards, but keep the original numerical columns for modeling.
Putting It All Together: The Data Type Audit
Every dataset needs a data type audit before serious analysis begins. This isn't just about fixing errors — it's about understanding what questions you can and cannot answer with the available data.

# Complete data type audit — run this on every new dataset
def data_type_audit(df):
    """Comprehensive data type analysis for business decision making"""
    print("=== DATA TYPE AUDIT REPORT ===\n")

    # Basic info
    print(f"Dataset shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB\n")

    # Categorize columns by data type and business meaning
    numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
    datetime_cols = df.select_dtypes(include=['datetime64']).columns.tolist()
    boolean_cols = df.select_dtypes(include=['bool']).columns.tolist()

    print("COLUMN CLASSIFICATION:")
    print(f"Numerical ({len(numerical_cols)}): {numerical_cols}")
    print(f"Categorical ({len(categorical_cols)}): {categorical_cols}")
    print(f"DateTime ({len(datetime_cols)}): {datetime_cols}")
    print(f"Boolean ({len(boolean_cols)}): {boolean_cols}\n")

    # Check for potential issues
    print("POTENTIAL ISSUES:")
    for col in numerical_cols:
        unique_count = df[col].nunique()
        if unique_count == len(df):
            print(f"⚠️ {col}: {unique_count:,} unique values in {len(df):,} rows — likely an identifier")
        elif unique_count <= 10:
            print(f"⚠️ {col}: Only {unique_count} unique values — might be categorical")
    for col in categorical_cols:
        unique_count = df[col].nunique()
        if unique_count > 50:
            print(f"⚠️ {col}: {unique_count} unique values — high cardinality")

    return {
        'numerical': numerical_cols,
        'categorical': categorical_cols,
        'datetime': datetime_cols,
        'boolean': boolean_cols
    }
# Run the audit
audit_results = data_type_audit(df)
=== DATA TYPE AUDIT REPORT ===

Dataset shape: 5,000 rows × 14 columns
Memory usage: 0.8 MB

COLUMN CLASSIFICATION:
Numerical (6): ['order_id', 'customer_age', 'quantity', 'unit_price', 'revenue', 'rating']
Categorical (4): ['gender', 'city', 'product_category', 'product_name']
DateTime (1): ['date']
Boolean (1): ['returned']

POTENTIAL ISSUES:
⚠️ order_id: 5,000 unique values in 5,000 rows — likely an identifier
⚠️ quantity: Only 10 unique values — might be categorical
⚠️ product_name: 487 unique values — high cardinality
What just happened?
order_id: 5,000 unique values in 5,000 rows — confirms it's an identifier, not meaningful numerical data
product_name: 487 unique values — very high cardinality categorical data, needs careful handling
Try this: Examine the most common product names with df['product_name'].value_counts().head(10) to understand the variety
The Production Checklist
Before running any analysis in production — whether it's a dashboard, a model, or an executive report — verify these data type fundamentals:
✅ Verified Correct
Each column's data type matches its business meaning. ID columns aren't treated as numerical. Dates are datetime objects.
✅ Statistical Match
Analysis methods match data types. No correlations on categorical data. No averaging of ordinal ratings without justification.
✅ Cardinality Handled
High cardinality categories grouped appropriately. Low cardinality numerical data examined for discrete patterns.
✅ Business Logic
Data patterns make business sense. Uniform distributions investigated. Outliers verified against source systems.