Data Science
Types of Data
Master the four data types that drive every analysis decision — from choosing the right visualization to picking statistical tests that actually make sense for your variables.
This lesson covers
Numerical vs Categorical Data · Discrete vs Continuous · Ordinal vs Nominal · Statistical Implications
Identify the data type
Choose appropriate analysis methods
Apply correct visualizations
Generate reliable insights
The Four Data Types That Matter
Think of data types like cooking ingredients. You can't substitute flour for sugar and expect the same cake. Each data type has unique properties that determine what you can and cannot do with it mathematically.

Numerical Data
Counts, measures, amounts. You can add, subtract, find averages. Examples: revenue, quantity, customer_age
Categorical Data
Groups, labels, categories. You count frequencies but can't average them. Examples: city, product_category, gender
Discrete Data
Whole numbers only — counts of things. Can't have 2.5 orders. Examples: quantity, order_id
Continuous Data
Any decimal value possible. Measured, not counted. Examples: unit_price, rating (1.0 to 5.0)
Your quantity column is numerical AND discrete. Your rating is numerical AND continuous.
Why This Classification Exists
Each data type restricts which mathematical operations make sense. You can calculate the mean of customer_age because averaging ages gives you meaningful information. But what's the mean of city? It's mathematically undefined.
This isn't academic nitpicking. When Flipkart's analytics team runs A/B tests on checkout flows, they need different statistical tests for conversion rates (continuous, 0-1) versus payment method (categorical). Use the wrong test and your p-values become garbage.
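To make the constraint concrete, here is a minimal sketch using a tiny hypothetical dataframe (the values are made up purely for illustration, not taken from the lesson's dataset):

# Toy frame showing which summaries are defined for which data types
import pandas as pd

toy = pd.DataFrame({'customer_age': [25, 34, 41], 'city': ['Mumbai', 'Delhi', 'Pune']})

print(toy['customer_age'].mean())    # 33.33..., averaging a numerical column is meaningful
print(toy['city'].value_counts())    # frequencies, the right summary for a nominal column
# toy['city'].mean() would raise a TypeError: the mean of city names is undefined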
Numerical Data Deep Dive
Numerical data is your workhorse. It's what lets you build predictive models, run correlations, and calculate business metrics that actually drive decisions. But here's the critical distinction most analysts miss: discrete numerical data behaves differently from continuous. Your quantity column is technically numerical, but it only takes whole number values. This changes which distributions you can assume and which statistical tests are appropriate.

# Identify numerical columns automatically — saves time when datasets have 50+ columns
import pandas as pd
import numpy as np
# Load the ecommerce dataset
df = pd.read_csv('dataplexa_ecommerce.csv')
# Get numerical columns only — excludes strings/objects automatically
numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
print("Numerical columns:", numerical_cols)
print("\nData types:")
print(df[numerical_cols].dtypes)
Numerical columns: ['order_id', 'customer_age', 'quantity', 'unit_price', 'revenue', 'rating']

Data types:
order_id          int64
customer_age      int64
quantity          int64
unit_price      float64
revenue         float64
rating          float64
dtype: object
What just happened?
select_dtypes(include=[np.number]) — automatically filters to only numerical columns, ignoring text columns like city, gender, product_name
int64 vs float64 — integers are typically discrete (whole numbers), floats are continuous (decimals allowed)
Try this: Add exclude=['bool'] to the select_dtypes call to also exclude boolean columns like returned
Discrete vs Continuous: The Revenue Example
Look at revenue in your dataset. It's calculated as quantity × unit_price. Since quantity is discrete (you can't sell 2.7 items) but unit_price is continuous (₹499.99), the result is technically continuous but practically constrained to specific increments.
# Examine unique values to understand discrete vs continuous nature
print("Quantity unique values (first 10):", sorted(df['quantity'].unique())[:10])
print("Rating unique values:", sorted(df['rating'].unique()))
print("\nRevenue sample — notice the decimal precision:")
print(df['revenue'].head(10).tolist())
Quantity unique values (first 10): [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
Rating unique values: [1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.4, 2.5, 2.6, 2.7, 2.8, 2.9, 3.0, 3.1, 3.2, 3.3, 3.4, 3.5, 3.6, 3.7, 3.8, 3.9, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 4.6, 4.7, 4.8, 4.9, 5.0]

Revenue sample — notice the decimal precision:
[12999.0, 2599.0, 1847.5, 3999.0, 899.0, 7998.0, 1499.0, 24999.0, 4497.0, 1299.0]
What just happened?
quantity: [1, 2, 3, ...] — clearly discrete, only whole numbers possible
rating: [1.0, 1.1, 1.2, ...] — continuous in theory, but constrained to 0.1 increments by the rating system
Try this: Check how many unique revenue values exist with df['revenue'].nunique() — is it truly continuous or effectively discrete?
Common Mistake: Treating Order IDs as Meaningful Numbers
Your order_id column shows up as numerical (int64), but it's actually a categorical identifier. You shouldn't calculate mean(order_id) or find correlations with it. Just because something is stored as a number doesn't mean mathematical operations make sense. When in doubt, ask: "Does the average of these values tell me something useful about my business?"
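One way to guard against this, sketched below on a throwaway copy of df (the copy, df_check, is just a helper name for this example so the original column stays numeric for the later code), is to store identifiers as strings so they drop out of numerical summaries automatically:

# Cast identifier columns to string so numeric summaries skip them
df_check = df.copy()
df_check['order_id'] = df_check['order_id'].astype(str)

# order_id no longer appears among the numerical columns or in describe()
print(df_check.select_dtypes(include=[np.number]).columns.tolist())
print(df_check.describe().columns.tolist())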
Categorical Data: The Real Complexity
Categorical data drives the most important business decisions. Which city generates the most revenue? Which product category has the highest return rate? But there's a crucial distinction that determines how you handle these categories statistically.

Nominal vs Ordinal: The Order Matters
Nominal: Categories have no inherent order. Mumbai vs Delhi vs Bangalore — there's no "greater than" relationship.
Ordinal: Categories have a natural ranking. Rating 1 < 2 < 3 < 4 < 5. The order carries information.
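pandas can encode that difference explicitly with an ordered categorical. A minimal sketch using a hypothetical shirt-size column (not a column from this dataset):

# Hypothetical ordinal data: sizes have a natural ranking
import pandas as pd

size_type = pd.CategoricalDtype(categories=['S', 'M', 'L', 'XL'], ordered=True)
sizes = pd.Series(['M', 'S', 'L', 'M', 'XL', 'S'], dtype=size_type)

print(sizes.min(), sizes.max())   # S XL — order-aware statistics are defined
print((sizes > 'M').sum())        # 2 values rank above Medium — comparisons work too
# A nominal column like city supports none of this; only frequency counts make sense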
# Analyze categorical columns — understanding distribution is crucial for business insights
categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
print("Categorical columns:", categorical_cols)
# Value counts show business patterns, not just data types
print("\n=== CITY DISTRIBUTION ===")
city_counts = df['city'].value_counts()
print(city_counts)
print(f"Mumbai dominates: {city_counts['Mumbai']/len(df)*100:.1f}% of all orders")
Categorical columns: ['date', 'gender', 'city', 'product_category', 'product_name']

=== CITY DISTRIBUTION ===
Mumbai       1287
Delhi        1042
Bangalore     985
Chennai       921
Pune          765
Name: city, dtype: int64
Mumbai dominates: 25.7% of all orders
What just happened?
select_dtypes(include=['object']) — finds text-based columns, which are typically categorical
Mumbai: 1287 orders (25.7%) — shows geographic concentration, critical for logistics and marketing spend allocation
Try this: Compare product_category distribution with df['product_category'].value_counts(normalize=True) to see percentages directly
The Date Column Trap
Notice that date appeared in our categorical columns? That's because pandas loaded it as text. Dates are actually ordinal data with extremely high cardinality — they have a natural order, but so many unique values that standard categorical analysis breaks down.
# Handle dates properly — they're ordinal but need special treatment
print("Date as string:", df['date'].dtype)
print("Unique dates:", df['date'].nunique())
print("Date range:", df['date'].min(), "to", df['date'].max())
# Convert to proper datetime for time-series analysis
df['date'] = pd.to_datetime(df['date'])
print("\nAfter conversion:", df['date'].dtype)
# Now we can extract ordinal components
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.day_name()
print("Month distribution:")
print(df['month'].value_counts().sort_index())
Date as string: object
Unique dates: 365
Date range: 2023-01-01 to 2023-12-31

After conversion: datetime64[ns]
Month distribution:
1     417
2     381
3     419
4     405
5     420
6     406
7     422
8     425
9     413
10    420
11    408
12    364
Name: month, dtype: int64
What just happened?
365 unique dates — too many categories for typical categorical analysis, but perfect chronological order
pd.to_datetime() — converts text to proper datetime format, enabling time-series operations
Try this: Create a weekend flag with df['is_weekend'] = df['day_of_week'].isin(['Saturday', 'Sunday']) and see if weekend orders differ
📊 Data Insight
December has notably fewer orders (364 vs ~415 average), so confirm the final weeks of the year are fully captured before reading this as a seasonal pattern. This kind of data quirk is common in real datasets and will skew time-series analysis if ignored.
Statistical Implications: Where Data Types Drive Decisions
Data type determines which statistical tests you can run, which charts make sense, and which business questions you can actually answer. Getting this wrong doesn't just create bad analysis — it creates confident wrong analysis.

Critical Rule: Match Analysis to Data Type
Don't calculate correlation between city and revenue — city is nominal categorical, correlation requires numerical data. Instead, use group statistics: revenue by city, ANOVA for significance testing. The math will run if you force it, but the results are meaningless.
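A sketch of that group-statistics approach with pandas and scipy (it assumes the same df as above and that scipy is installed; scipy is not used elsewhere in this lesson):

# Group statistics: summarize a numerical column per nominal category
from scipy import stats

print(df.groupby('city')['revenue'].agg(['mean', 'median', 'count']))

# One-way ANOVA: does mean revenue differ significantly across cities?
revenue_by_city = [group['revenue'].values for _, group in df.groupby('city')]
f_stat, p_value = stats.f_oneway(*revenue_by_city)
print(f"F-statistic: {f_stat:.2f}, p-value: {p_value:.4f}")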
| Data Type | Appropriate Statistics | Best Visualizations | Common Mistakes |
|---|---|---|---|
| Continuous Numerical | Mean, SD, correlation, t-tests | Histograms, scatter plots, box plots | Using bar charts for distributions |
| Discrete Numerical | Counts, mode, Poisson tests | Bar charts, count plots | Assuming normal distribution |
| Nominal Categorical | Frequencies, chi-square, mode | Bar charts, pie charts | Calculating means or correlations |
| Ordinal Categorical | Median, percentiles, Spearman correlation | Ordered bar charts, stacked charts | Ignoring the inherent order |
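For the ordinal row in particular, Spearman correlation works on ranks rather than raw values, which makes it the safer choice than Pearson. A quick sketch comparing the two on rating vs revenue, assuming the same df:

# Pearson assumes interval-scale data; Spearman only uses ranks, so it suits ordinal data
pearson = df['rating'].corr(df['revenue'], method='pearson')
spearman = df['rating'].corr(df['revenue'], method='spearman')
print(f"Pearson: {pearson:.3f}  Spearman: {spearman:.3f}")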
Real Scenario: Zomato's Rating Analysis Mistake
A product team at Zomato wants to understand what drives customer ratings. They have restaurant ratings (1-5 stars), delivery time (minutes), and cuisine type (Indian, Chinese, Italian, etc.). Here's how data type understanding changes everything:
# Wrong approach: treating rating as truly continuous and cuisine as if it has no meaning
# This is statistically questionable but commonly done
# Right approach: acknowledge data type constraints
print("=== RATING ANALYSIS BY DATA TYPE ===")
# Rating is ordinal (1 < 2 < 3 < 4 < 5) — use median and percentiles, not just mean
rating_stats = df['rating'].describe()
print("Rating distribution:")
print(rating_stats)
# For ordinal data, look at distribution across the scale
rating_counts = df['rating'].value_counts().sort_index()
print("\nRating frequency (shows skew better than mean):")
for rating, count in rating_counts.items():
    if rating in [1.0, 2.0, 3.0, 4.0, 5.0]:  # Show only whole number ratings for clarity
        print(f"{rating:.0f} stars: {count:,} orders ({count/len(df)*100:.1f}%)")
Rating distribution:
count    5000.000000
mean        3.047600
std         1.419829
min         1.000000
25%         2.000000
50%         3.100000
75%         4.300000
max         5.000000
Name: rating, dtype: float64

Rating frequency (shows skew better than mean):
1 stars: 865 orders (17.3%)
2 stars: 967 orders (19.3%)
3 stars: 1,087 orders (21.7%)
4 stars: 1,019 orders (20.4%)
5 stars: 1,062 orders (21.2%)
What just happened?
mean: 3.05 — average rating seems mediocre, but that's misleading for ordinal data
median: 3.10 — middle value, more appropriate for ordinal data than mean
Try this: Group by product_category and compare median ratings with df.groupby('product_category')['rating'].median().sort_values(ascending=False)
Real rating systems typically skew heavily toward the top of the scale — most ratings cluster at 4-5 stars
The uniform distribution tells you something important about your data source. Real customer rating systems from Swiggy, Zomato, or Amazon skew heavily toward high ratings — satisfied customers rate, dissatisfied customers either don't rate or leave 1-star reviews. This uniform pattern suggests either simulated data or a very different rating collection process.
When you see patterns that contradict expected business behavior, investigate the data generation process before running analysis. Statistical tests assume the data represents real-world processes. If it doesn't, your p-values and confidence intervals become meaningless.
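One quick diagnostic before trusting the ratings, sketched below with the same df, is to measure the skew and look at the shape directly:

# Skewness near 0 suggests a symmetric or uniform-like distribution;
# real rating columns usually pile up heavily at 4-5 stars instead
print(f"Rating skewness: {df['rating'].skew():.2f}")
print(df['rating'].value_counts(bins=5, sort=False))  # counts in five equal-width bands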
Data Type Conversion: When and Why
Sometimes you need to change data types — not because the data is wrong, but because different types enable different analyses. This is especially common with categorical data that has inherent ordering.

Pro insight: Customer age is numerical, but sometimes you want it categorical — "18-25", "26-35", "36-45" segments for marketing analysis. Revenue is continuous, but you might create ordinal "Low/Medium/High" buckets for executive dashboards.
# Convert continuous age to ordinal age groups — enables different business insights
df['age_group'] = pd.cut(df['customer_age'],
                         bins=[0, 25, 35, 45, 55, 100],
                         labels=['18-25', '26-35', '36-45', '46-55', '55+'],
                         right=False)
print("Age group distribution:")
age_dist = df['age_group'].value_counts().sort_index()
print(age_dist)
# Convert continuous revenue to ordinal spending tiers
# Use quantiles to ensure balanced groups for statistical power
revenue_quantiles = df['revenue'].quantile([0.33, 0.67])
print(f"\nRevenue quantiles: 33rd = ₹{revenue_quantiles[0.33]:,.0f}, 67th = ₹{revenue_quantiles[0.67]:,.0f}")
df['spending_tier'] = pd.cut(df['revenue'],
                             bins=[0, revenue_quantiles[0.33], revenue_quantiles[0.67], float('inf')],
                             labels=['Low', 'Medium', 'High'])
print("\nSpending tier distribution:")
print(df['spending_tier'].value_counts())
Age group distribution:
18-25    1024
26-35    1263
36-45    1245
46-55    1043
55+       425
Name: age_group, dtype: int64

Revenue quantiles: 33rd = ₹2,799, 67th = ₹7,998

Spending tier distribution:
Low       1667
Medium    1666
High      1667
Name: spending_tier, dtype: int64
What just happened?
pd.cut() — converts continuous data to ordinal categories with specified boundaries
quantile([0.33, 0.67]) — creates balanced groups rather than arbitrary cutoffs like "under ₹5,000"
Try this: Create a high-value customer flag with df['is_high_value'] = df['spending_tier'] == 'High' and analyze their characteristics
The Business Reason Behind Conversion
Why convert precise numerical data to broader categories? Because business decisions often need clear segments. Telling the marketing team "target customers with revenue between ₹3,247 and ₹8,932" is useless. Saying "focus on Medium and High spenders — that's 67% of customers generating 85% of revenue" drives action.
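Those segment-versus-revenue figures are straightforward to check against the data. A short sketch using the spending_tier column created above (the exact percentages in your output will depend on the dataset):

# Share of orders vs share of revenue per spending tier
tier_summary = df.groupby('spending_tier', observed=True).agg(
    orders=('revenue', 'size'),
    total_revenue=('revenue', 'sum')
)
tier_summary['order_share_%'] = tier_summary['orders'] / len(df) * 100
tier_summary['revenue_share_%'] = tier_summary['total_revenue'] / df['revenue'].sum() * 100
print(tier_summary.round(1))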
📊 Data Insight
26-35 age group dominates orders (25.3%), followed closely by 36-45 (24.9%). Combined, these prime earning years represent 50% of the customer base. The 55+ segment is notably smaller (8.5%) — opportunity for targeted senior marketing or indication of digital adoption barriers.
When NOT to Convert
Don't create ordinal categories from numerical data if you're building predictive models. Machine learning algorithms work better with the full precision of continuous variables. Convert for business reporting and executive dashboards, but keep the original numerical columns for modeling.
Putting It All Together: The Data Type Audit
Every dataset needs a data type audit before serious analysis begins. This isn't just about fixing errors — it's about understanding what questions you can and cannot answer with the available data.

# Complete data type audit — run this on every new dataset
def data_type_audit(df):
    """Comprehensive data type analysis for business decision making"""
    print("=== DATA TYPE AUDIT REPORT ===\n")

    # Basic info
    print(f"Dataset shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
    print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.1f} MB\n")

    # Categorize columns by data type and business meaning
    numerical_cols = df.select_dtypes(include=[np.number]).columns.tolist()
    categorical_cols = df.select_dtypes(include=['object']).columns.tolist()
    datetime_cols = df.select_dtypes(include=['datetime64']).columns.tolist()
    boolean_cols = df.select_dtypes(include=['bool']).columns.tolist()

    print("COLUMN CLASSIFICATION:")
    print(f"Numerical ({len(numerical_cols)}): {numerical_cols}")
    print(f"Categorical ({len(categorical_cols)}): {categorical_cols}")
    print(f"DateTime ({len(datetime_cols)}): {datetime_cols}")
    print(f"Boolean ({len(boolean_cols)}): {boolean_cols}\n")

    # Check for potential issues
    print("POTENTIAL ISSUES:")
    for col in numerical_cols:
        unique_count = df[col].nunique()
        if unique_count == len(df):
            print(f"⚠️ {col}: {unique_count:,} unique values in {len(df):,} rows — likely an identifier")
        elif unique_count <= 10:
            print(f"⚠️ {col}: Only {unique_count} unique values — might be categorical")
    for col in categorical_cols:
        unique_count = df[col].nunique()
        if unique_count > 50:
            print(f"⚠️ {col}: {unique_count} unique values — high cardinality")

    return {
        'numerical': numerical_cols,
        'categorical': categorical_cols,
        'datetime': datetime_cols,
        'boolean': boolean_cols
    }
# Run the audit
audit_results = data_type_audit(df)
=== DATA TYPE AUDIT REPORT ===

Dataset shape: 5,000 rows × 14 columns
Memory usage: 0.8 MB

COLUMN CLASSIFICATION:
Numerical (6): ['order_id', 'customer_age', 'quantity', 'unit_price', 'revenue', 'rating']
Categorical (4): ['gender', 'city', 'product_category', 'product_name']
DateTime (1): ['date']
Boolean (1): ['returned']

POTENTIAL ISSUES:
⚠️ order_id: 5,000 unique values in 5,000 rows — likely an identifier
⚠️ quantity: Only 10 unique values — might be categorical
⚠️ product_name: 487 unique values — high cardinality
What just happened?
order_id: 5,000 unique values in 5,000 rows — confirms it's an identifier, not meaningful numerical data
product_name: 487 unique values — very high cardinality categorical data, needs careful handling
Try this: Examine the most common product names with df['product_name'].value_counts().head(10) to understand the variety
The Production Checklist
Before running any analysis in production — whether it's a dashboard, a model, or an executive report — verify these data type fundamentals:
✅ Verified Correct
Each column's data type matches its business meaning. ID columns aren't treated as numerical. Dates are datetime objects.
✅ Statistical Match
Analysis methods match data types. No correlations on categorical data. No averaging of ordinal ratings without justification.
✅ Cardinality Handled
High cardinality categories grouped appropriately. Low cardinality numerical data examined for discrete patterns.
✅ Business Logic
Data patterns make business sense. Uniform distributions investigated. Outliers verified against source systems.