Python Lesson 42 – CSV & XML | Dataplexa

CSV & XML

Not all data lives in databases. A huge proportion of real-world data exchange happens through files — spreadsheet exports, system reports, configuration files, API responses, government datasets. CSV and XML are two of the most widely used structured file formats, and Python has excellent built-in support for both.

This lesson covers reading and writing CSV files with the csv module and pandas, and parsing and building XML documents with xml.etree.ElementTree.

CSV — Comma-Separated Values

CSV is the simplest structured data format — each row is a line, each field is separated by a delimiter (usually a comma). Despite its simplicity, CSV has edge cases: fields with commas inside them, quoted strings, different line endings, and varying encodings. Python's csv module handles all of these correctly.

1. Reading CSV Files

Real-world use: reading a sales export from an accounting system, loading a dataset of customer records, or ingesting a bank statement downloaded as CSV.

# Reading CSV files — csv.reader and csv.DictReader

import csv
import io

# Sample CSV data (in practice, replace io.StringIO with open("file.csv"))
raw = """name,department,salary,start_date
Alice,Engineering,95000,2021-03-15
Bob,Marketing,72000,2020-07-01
Charlie,Engineering,88000,2019-11-20
Diana,HR,65000,2022-01-10
"""

# csv.reader — rows as lists
reader = csv.reader(io.StringIO(raw))
header = next(reader)   # consume the header row
print("Columns:", header)
for row in reader:
    print(row)

print()

# csv.DictReader — rows as dicts (keys from header row)
reader = csv.DictReader(io.StringIO(raw))
for row in reader:
    print(f"{row['name']:10} | {row['department']:12} | ${int(row['salary']):,}")

csv.DictReader is almost always preferred over csv.reader — column names as keys make code far more readable
All values from CSV are strings — convert explicitly: int(row["salary"]), float(row["price"])
For real files: open("file.csv", newline="", encoding="utf-8") — always specify newline="" and encoding
Change the delimiter for tab-separated or pipe-separated files: csv.DictReader(f, delimiter="\t")

2. Writing CSV Files

# Writing CSV files — csv.writer and csv.DictWriter

import csv
import io

# csv.writer — write rows as lists
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(["product", "quantity", "price"])   # header
writer.writerows([
    ["Notebook", 50,  4.99],
    ["Pen",      200, 1.50],
    ["Desk",     10,  89.99],
])
print("csv.writer output:")
print(output.getvalue())

# csv.DictWriter — write rows as dicts
output2 = io.StringIO()
fields  = ["name", "score", "grade"]
writer2 = csv.DictWriter(output2, fieldnames=fields)
writer2.writeheader()
writer2.writerows([
    {"name": "Alice",   "score": 92, "grade": "A"},
    {"name": "Bob",     "score": 78, "grade": "B"},
    {"name": "Charlie", "score": 85, "grade": "B+"},
])
print("csv.DictWriter output:")
print(output2.getvalue())

csv.writer output:
product,quantity,price
Notebook,50,4.99
Pen,200,1.5
Desk,10,89.99

csv.DictWriter output:
name,score,grade
Alice,92,A
Bob,78,B
Charlie,85,B+

For real files: open("output.csv", "w", newline="", encoding="utf-8")
writer.writerows(list_of_lists) writes multiple rows in one call
DictWriter.writeheader() writes the field names as the first row automatically
If a field value contains a comma, csv.writer automatically wraps it in quotes — you do not need to handle this yourself

3. CSV with pandas

For data analysis or any CSV with more than a few rows, pandas is far more powerful — it loads the entire file into a DataFrame with automatic type inference, filtering, aggregation, and export built in.

# CSV with pandas — read, filter, aggregate, export

import pandas as pd
import io

raw = """name,department,salary,start_date
Alice,Engineering,95000,2021-03-15
Bob,Marketing,72000,2020-07-01
Charlie,Engineering,88000,2019-11-20
Diana,HR,65000,2022-01-10
Eve,Engineering,102000,2018-05-30
"""

df = pd.read_csv(io.StringIO(raw), parse_dates=["start_date"])

print(df.dtypes)
print()
print(df[["name", "salary"]])
print()

# Filter and aggregate
eng = df[df["department"] == "Engineering"]
print("Engineering team:")
print(eng[["name", "salary"]])
print(f"Average salary: ${eng['salary'].mean():,.0f}")

# Export back to CSV
df.to_csv("employees_export.csv", index=False)
print("\nExported to employees_export.csv")

name object
department object
salary int64
start_date datetime64[ns]
dtype: object

name salary
0 Alice 95000
1 Bob 72000
2 Charlie 88000
3 Diana 65000
4 Eve 102000

Engineering team:
name salary
0 Alice 95000
2 Charlie 88000
4 Eve 102000
Average salary: $95,000

pd.read_csv() automatically infers column types — integers stay integers, dates parse with parse_dates=
df.to_csv("file.csv", index=False) exports without the row number index column
Use pandas when you need filtering, grouping, merging, or any analysis beyond simple reading and writing

XML — eXtensible Markup Language

XML uses nested tags to represent hierarchical data. It is used in configuration files, document formats (Word, SVG), SOAP web services, RSS feeds, and many enterprise data exchange formats. Python's xml.etree.ElementTree module is built in and covers most use cases.

4. Parsing XML

Real-world use: parsing an RSS feed to extract article titles, reading a configuration file, consuming a SOAP API response, or processing an SVG file.

# Parsing XML with ElementTree

import xml.etree.ElementTree as ET

xml_data = """

    
        The Great Gatsby
        F. Scott Fitzgerald
        1925
        12.99
    
    
        Sapiens
        Yuval Noah Harari
        2011
        16.99
    
    
        1984
        George Orwell
        1949
        9.99
    
"""

root = ET.fromstring(xml_data)   # parse from string
# ET.parse("file.xml").getroot() — parse from file

print("Root tag:", root.tag)
print(f"Total books: {len(root)}\n")

# Iterate over child elements
for book in root.findall("book"):
    book_id = book.get("id")          # get an attribute
    genre   = book.get("genre")
    title   = book.find("title").text  # get element text
    author  = book.find("author").text
    price   = float(book.find("price").text)
    print(f"[{book_id}] {title} by {author} — ${price:.2f} ({genre})")

# findall with a simple filter — fiction books only
print("\nFiction books:")
for book in root.findall("book[@genre='fiction']"):
    print(" ", book.find("title").text)

Root tag: library
Total books: 3

[1] The Great Gatsby by F. Scott Fitzgerald — $12.99 (fiction)
[2] Sapiens by Yuval Noah Harari — $16.99 (non-fiction)
[3] 1984 by George Orwell — $9.99 (fiction)

Fiction books:
The Great Gatsby
1984

root.findall("tag") — returns all direct children with that tag name
element.find("tag") — returns the first matching child element (or None)
element.get("attr") — returns the attribute value (or None if not present)
element.text — returns the text content between the opening and closing tags
XPath-style expressions like [@genre='fiction'] filter on attribute values

5. Building and Writing XML

# Building XML with ElementTree

import xml.etree.ElementTree as ET

# Build the tree programmatically
root = ET.Element("inventory")

items = [
    {"sku": "NB001", "name": "Notebook", "qty": 150, "price": 4.99},
    {"sku": "PN002", "name": "Pen",      "qty": 500, "price": 1.50},
    {"sku": "DS003", "name": "Desk",     "qty": 20,  "price": 89.99},
]

for item in items:
    el = ET.SubElement(root, "item", sku=item["sku"])
    ET.SubElement(el, "name").text  = item["name"]
    ET.SubElement(el, "qty").text   = str(item["qty"])
    ET.SubElement(el, "price").text = str(item["price"])

# Pretty-print (Python 3.9+)
ET.indent(root, space="    ")
tree = ET.ElementTree(root)

# Print to console
print(ET.tostring(root, encoding="unicode"))

# Write to file
tree.write("inventory.xml", encoding="unicode", xml_declaration=True)

<inventory>
<item sku="NB001">
<name>Notebook</name>
<qty>150</qty>
<price>4.99</price>
</item>
<item sku="PN002">
<name>Pen</name>
<qty>500</qty>
<price>1.5</price>
</item>
<item sku="DS003">
<name>Desk</name>
<qty>20</qty>
<price>89.99</price>
</item>
</inventory>

ET.Element("tag") — creates the root element
ET.SubElement(parent, "tag", attr=value) — creates a child element with optional attributes
ET.indent(root) — adds whitespace for human-readable output (Python 3.9+)
ET.tostring(root, encoding="unicode") — serialises the tree to a string
For complex XML with namespaces or XSLT, consider the third-party lxml library instead

Summary Table

Tool	Format	Best Used For
`csv.DictReader`	CSV	Reading CSV rows as named dicts
`csv.DictWriter`	CSV	Writing dicts as CSV rows
`pandas.read_csv`	CSV	Analysis, filtering, aggregation
`ET.fromstring()`	XML	Parse XML from a string
`ET.parse()`	XML	Parse XML from a file
`ET.SubElement()`	XML	Build XML trees programmatically

Practice Questions

Practice 1. What is the difference between csv.reader and csv.DictReader?

Practice 2. Why must you specify newline="" when opening a CSV file in Python?

Practice 3. In ElementTree, what method retrieves the text between an element's opening and closing tags?

Practice 4. What does df.to_csv("file.csv", index=False) do differently from the default?

Practice 5. Which ElementTree method adds a child element to an existing parent element?

Quiz

Quiz 1. What data type does csv.DictReader return for each row?

A dictionary with column names as keys and cell values as strings
A list of strings
A pandas Series
A named tuple

Quiz 2. What does pd.read_csv(..., parse_dates=["start_date"]) do?

Automatically converts the start_date column from strings to datetime objects
Filters rows by the start_date column
Sorts the DataFrame by start_date
Removes rows with missing start_date values

Quiz 3. In ElementTree, what is the difference between find() and findall()?

find() returns the first matching element, findall() returns a list of all matches
find() searches all descendants, findall() only searches direct children
find() is for attributes, findall() is for elements
They are identical

Quiz 4. How does csv.writer handle a field value that contains a comma?

It automatically wraps the field in double quotes
It raises a ValueError
It replaces the comma with a semicolon
It escapes the comma with a backslash

Quiz 5. Which Python version introduced ET.indent() for pretty-printing XML?

Python 3.9
Python 3.6
Python 3.11
Python 3.7

Next up — NumPy: fast numerical computing with arrays, vectorised operations, and linear algebra.

← Previous Course Index Next →

Python Course