Python Lesson 42 – CSV & XML | Dataplexa

CSV & XML

Not all data lives in databases. A huge proportion of real-world data exchange happens through files — spreadsheet exports, system reports, configuration files, API responses, government datasets. CSV and XML are two of the most widely used structured file formats, and Python has excellent built-in support for both.

This lesson covers reading and writing CSV files with the csv module and pandas, and parsing and building XML documents with xml.etree.ElementTree.

CSV — Comma-Separated Values

CSV is the simplest structured data format — each row is a line, each field is separated by a delimiter (usually a comma). Despite its simplicity, CSV has edge cases: fields with commas inside them, quoted strings, different line endings, and varying encodings. Python's csv module handles all of these correctly.

1. Reading CSV Files

Real-world use: reading a sales export from an accounting system, loading a dataset of customer records, or ingesting a bank statement downloaded as CSV.

# Reading CSV files — csv.reader and csv.DictReader

import csv
import io

# Sample CSV data (in practice, replace io.StringIO with open("file.csv"))
raw = """name,department,salary,start_date
Alice,Engineering,95000,2021-03-15
Bob,Marketing,72000,2020-07-01
Charlie,Engineering,88000,2019-11-20
Diana,HR,65000,2022-01-10
"""

# csv.reader — rows as lists
reader = csv.reader(io.StringIO(raw))
header = next(reader)   # consume the header row
print("Columns:", header)
for row in reader:
    print(row)

print()

# csv.DictReader — rows as dicts (keys from header row)
reader = csv.DictReader(io.StringIO(raw))
for row in reader:
    print(f"{row['name']:10} | {row['department']:12} | ${int(row['salary']):,}")
Columns: ['name', 'department', 'salary', 'start_date']
['Alice', 'Engineering', '95000', '2021-03-15']
['Bob', 'Marketing', '72000', '2020-07-01']
['Charlie', 'Engineering', '88000', '2019-11-20']
['Diana', 'HR', '65000', '2022-01-10']

Alice | Engineering | $95,000
Bob | Marketing | $72,000
Charlie | Engineering | $88,000
Diana | HR | $65,000
  • csv.DictReader is almost always preferred over csv.reader — column names as keys make code far more readable
  • All values from CSV are strings — convert explicitly: int(row["salary"]), float(row["price"])
  • For real files: open("file.csv", newline="", encoding="utf-8") — always specify newline="" and encoding
  • Change the delimiter for tab-separated or pipe-separated files: csv.DictReader(f, delimiter="\t")

2. Writing CSV Files

# Writing CSV files — csv.writer and csv.DictWriter

import csv
import io

# csv.writer — write rows as lists
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(["product", "quantity", "price"])   # header
writer.writerows([
    ["Notebook", 50,  4.99],
    ["Pen",      200, 1.50],
    ["Desk",     10,  89.99],
])
print("csv.writer output:")
print(output.getvalue())

# csv.DictWriter — write rows as dicts
output2 = io.StringIO()
fields  = ["name", "score", "grade"]
writer2 = csv.DictWriter(output2, fieldnames=fields)
writer2.writeheader()
writer2.writerows([
    {"name": "Alice",   "score": 92, "grade": "A"},
    {"name": "Bob",     "score": 78, "grade": "B"},
    {"name": "Charlie", "score": 85, "grade": "B+"},
])
print("csv.DictWriter output:")
print(output2.getvalue())
csv.writer output:
product,quantity,price
Notebook,50,4.99
Pen,200,1.5
Desk,10,89.99

csv.DictWriter output:
name,score,grade
Alice,92,A
Bob,78,B
Charlie,85,B+
  • For real files: open("output.csv", "w", newline="", encoding="utf-8")
  • writer.writerows(list_of_lists) writes multiple rows in one call
  • DictWriter.writeheader() writes the field names as the first row automatically
  • If a field value contains a comma, csv.writer automatically wraps it in quotes — you do not need to handle this yourself

3. CSV with pandas

For data analysis or any CSV with more than a few rows, pandas is far more powerful — it loads the entire file into a DataFrame with automatic type inference, filtering, aggregation, and export built in.

# CSV with pandas — read, filter, aggregate, export

import pandas as pd
import io

raw = """name,department,salary,start_date
Alice,Engineering,95000,2021-03-15
Bob,Marketing,72000,2020-07-01
Charlie,Engineering,88000,2019-11-20
Diana,HR,65000,2022-01-10
Eve,Engineering,102000,2018-05-30
"""

df = pd.read_csv(io.StringIO(raw), parse_dates=["start_date"])

print(df.dtypes)
print()
print(df[["name", "salary"]])
print()

# Filter and aggregate
eng = df[df["department"] == "Engineering"]
print("Engineering team:")
print(eng[["name", "salary"]])
print(f"Average salary: ${eng['salary'].mean():,.0f}")

# Export back to CSV
df.to_csv("employees_export.csv", index=False)
print("\nExported to employees_export.csv")
name object
department object
salary int64
start_date datetime64[ns]
dtype: object

name salary
0 Alice 95000
1 Bob 72000
2 Charlie 88000
3 Diana 65000
4 Eve 102000

Engineering team:
name salary
0 Alice 95000
2 Charlie 88000
4 Eve 102000
Average salary: $95,000
  • pd.read_csv() automatically infers column types — integers stay integers, dates parse with parse_dates=
  • df.to_csv("file.csv", index=False) exports without the row number index column
  • Use pandas when you need filtering, grouping, merging, or any analysis beyond simple reading and writing

XML — eXtensible Markup Language

XML uses nested tags to represent hierarchical data. It is used in configuration files, document formats (Word, SVG), SOAP web services, RSS feeds, and many enterprise data exchange formats. Python's xml.etree.ElementTree module is built in and covers most use cases.

4. Parsing XML

Real-world use: parsing an RSS feed to extract article titles, reading a configuration file, consuming a SOAP API response, or processing an SVG file.

# Parsing XML with ElementTree

import xml.etree.ElementTree as ET

xml_data = """

    
        The Great Gatsby
        F. Scott Fitzgerald
        1925
        12.99
    
    
        Sapiens
        Yuval Noah Harari
        2011
        16.99
    
    
        1984
        George Orwell
        1949
        9.99
    
"""

root = ET.fromstring(xml_data)   # parse from string
# ET.parse("file.xml").getroot() — parse from file

print("Root tag:", root.tag)
print(f"Total books: {len(root)}\n")

# Iterate over child elements
for book in root.findall("book"):
    book_id = book.get("id")          # get an attribute
    genre   = book.get("genre")
    title   = book.find("title").text  # get element text
    author  = book.find("author").text
    price   = float(book.find("price").text)
    print(f"[{book_id}] {title} by {author} — ${price:.2f} ({genre})")

# findall with a simple filter — fiction books only
print("\nFiction books:")
for book in root.findall("book[@genre='fiction']"):
    print(" ", book.find("title").text)
Root tag: library
Total books: 3

[1] The Great Gatsby by F. Scott Fitzgerald — $12.99 (fiction)
[2] Sapiens by Yuval Noah Harari — $16.99 (non-fiction)
[3] 1984 by George Orwell — $9.99 (fiction)

Fiction books:
The Great Gatsby
1984
  • root.findall("tag") — returns all direct children with that tag name
  • element.find("tag") — returns the first matching child element (or None)
  • element.get("attr") — returns the attribute value (or None if not present)
  • element.text — returns the text content between the opening and closing tags
  • XPath-style expressions like [@genre='fiction'] filter on attribute values

5. Building and Writing XML

# Building XML with ElementTree

import xml.etree.ElementTree as ET

# Build the tree programmatically
root = ET.Element("inventory")

items = [
    {"sku": "NB001", "name": "Notebook", "qty": 150, "price": 4.99},
    {"sku": "PN002", "name": "Pen",      "qty": 500, "price": 1.50},
    {"sku": "DS003", "name": "Desk",     "qty": 20,  "price": 89.99},
]

for item in items:
    el = ET.SubElement(root, "item", sku=item["sku"])
    ET.SubElement(el, "name").text  = item["name"]
    ET.SubElement(el, "qty").text   = str(item["qty"])
    ET.SubElement(el, "price").text = str(item["price"])

# Pretty-print (Python 3.9+)
ET.indent(root, space="    ")
tree = ET.ElementTree(root)

# Print to console
print(ET.tostring(root, encoding="unicode"))

# Write to file
tree.write("inventory.xml", encoding="unicode", xml_declaration=True)
<inventory>
<item sku="NB001">
<name>Notebook</name>
<qty>150</qty>
<price>4.99</price>
</item>
<item sku="PN002">
<name>Pen</name>
<qty>500</qty>
<price>1.5</price>
</item>
<item sku="DS003">
<name>Desk</name>
<qty>20</qty>
<price>89.99</price>
</item>
</inventory>
  • ET.Element("tag") — creates the root element
  • ET.SubElement(parent, "tag", attr=value) — creates a child element with optional attributes
  • ET.indent(root) — adds whitespace for human-readable output (Python 3.9+)
  • ET.tostring(root, encoding="unicode") — serialises the tree to a string
  • For complex XML with namespaces or XSLT, consider the third-party lxml library instead

Summary Table

Tool Format Best Used For
csv.DictReader CSV Reading CSV rows as named dicts
csv.DictWriter CSV Writing dicts as CSV rows
pandas.read_csv CSV Analysis, filtering, aggregation
ET.fromstring() XML Parse XML from a string
ET.parse() XML Parse XML from a file
ET.SubElement() XML Build XML trees programmatically

Practice Questions

Practice 1. What is the difference between csv.reader and csv.DictReader?



Practice 2. Why must you specify newline="" when opening a CSV file in Python?



Practice 3. In ElementTree, what method retrieves the text between an element's opening and closing tags?



Practice 4. What does df.to_csv("file.csv", index=False) do differently from the default?



Practice 5. Which ElementTree method adds a child element to an existing parent element?



Quiz

Quiz 1. What data type does csv.DictReader return for each row?






Quiz 2. What does pd.read_csv(..., parse_dates=["start_date"]) do?






Quiz 3. In ElementTree, what is the difference between find() and findall()?






Quiz 4. How does csv.writer handle a field value that contains a comma?






Quiz 5. Which Python version introduced ET.indent() for pretty-printing XML?






Next up — NumPy: fast numerical computing with arrays, vectorised operations, and linear algebra.