Python Course
CSV & XML
Not all data lives in databases. A huge proportion of real-world data exchange happens through files — spreadsheet exports, system reports, configuration files, API responses, government datasets. CSV and XML are two of the most widely used structured file formats, and Python has excellent built-in support for both.
This lesson covers reading and writing CSV files with the csv module and pandas, and parsing and building XML documents with xml.etree.ElementTree.
CSV — Comma-Separated Values
CSV is the simplest structured data format — each row is a line, each field is separated by a delimiter (usually a comma). Despite its simplicity, CSV has edge cases: fields with commas inside them, quoted strings, different line endings, and varying encodings. Python's csv module handles all of these correctly.
1. Reading CSV Files
Real-world use: reading a sales export from an accounting system, loading a dataset of customer records, or ingesting a bank statement downloaded as CSV.
# Reading CSV files — csv.reader and csv.DictReader
import csv
import io
# Sample CSV data (in practice, replace io.StringIO with open("file.csv"))
raw = """name,department,salary,start_date
Alice,Engineering,95000,2021-03-15
Bob,Marketing,72000,2020-07-01
Charlie,Engineering,88000,2019-11-20
Diana,HR,65000,2022-01-10
"""
# csv.reader — rows as lists
reader = csv.reader(io.StringIO(raw))
header = next(reader) # consume the header row
print("Columns:", header)
for row in reader:
print(row)
print()
# csv.DictReader — rows as dicts (keys from header row)
reader = csv.DictReader(io.StringIO(raw))
for row in reader:
print(f"{row['name']:10} | {row['department']:12} | ${int(row['salary']):,}")['Alice', 'Engineering', '95000', '2021-03-15']
['Bob', 'Marketing', '72000', '2020-07-01']
['Charlie', 'Engineering', '88000', '2019-11-20']
['Diana', 'HR', '65000', '2022-01-10']
Alice | Engineering | $95,000
Bob | Marketing | $72,000
Charlie | Engineering | $88,000
Diana | HR | $65,000
csv.DictReaderis almost always preferred overcsv.reader— column names as keys make code far more readable- All values from CSV are strings — convert explicitly:
int(row["salary"]),float(row["price"]) - For real files:
open("file.csv", newline="", encoding="utf-8")— always specifynewline=""and encoding - Change the delimiter for tab-separated or pipe-separated files:
csv.DictReader(f, delimiter="\t")
2. Writing CSV Files
# Writing CSV files — csv.writer and csv.DictWriter
import csv
import io
# csv.writer — write rows as lists
output = io.StringIO()
writer = csv.writer(output)
writer.writerow(["product", "quantity", "price"]) # header
writer.writerows([
["Notebook", 50, 4.99],
["Pen", 200, 1.50],
["Desk", 10, 89.99],
])
print("csv.writer output:")
print(output.getvalue())
# csv.DictWriter — write rows as dicts
output2 = io.StringIO()
fields = ["name", "score", "grade"]
writer2 = csv.DictWriter(output2, fieldnames=fields)
writer2.writeheader()
writer2.writerows([
{"name": "Alice", "score": 92, "grade": "A"},
{"name": "Bob", "score": 78, "grade": "B"},
{"name": "Charlie", "score": 85, "grade": "B+"},
])
print("csv.DictWriter output:")
print(output2.getvalue())product,quantity,price
Notebook,50,4.99
Pen,200,1.5
Desk,10,89.99
csv.DictWriter output:
name,score,grade
Alice,92,A
Bob,78,B
Charlie,85,B+
- For real files:
open("output.csv", "w", newline="", encoding="utf-8") writer.writerows(list_of_lists)writes multiple rows in one callDictWriter.writeheader()writes the field names as the first row automatically- If a field value contains a comma,
csv.writerautomatically wraps it in quotes — you do not need to handle this yourself
3. CSV with pandas
For data analysis or any CSV with more than a few rows, pandas is far more powerful — it loads the entire file into a DataFrame with automatic type inference, filtering, aggregation, and export built in.
# CSV with pandas — read, filter, aggregate, export
import pandas as pd
import io
raw = """name,department,salary,start_date
Alice,Engineering,95000,2021-03-15
Bob,Marketing,72000,2020-07-01
Charlie,Engineering,88000,2019-11-20
Diana,HR,65000,2022-01-10
Eve,Engineering,102000,2018-05-30
"""
df = pd.read_csv(io.StringIO(raw), parse_dates=["start_date"])
print(df.dtypes)
print()
print(df[["name", "salary"]])
print()
# Filter and aggregate
eng = df[df["department"] == "Engineering"]
print("Engineering team:")
print(eng[["name", "salary"]])
print(f"Average salary: ${eng['salary'].mean():,.0f}")
# Export back to CSV
df.to_csv("employees_export.csv", index=False)
print("\nExported to employees_export.csv")department object
salary int64
start_date datetime64[ns]
dtype: object
name salary
0 Alice 95000
1 Bob 72000
2 Charlie 88000
3 Diana 65000
4 Eve 102000
Engineering team:
name salary
0 Alice 95000
2 Charlie 88000
4 Eve 102000
Average salary: $95,000
pd.read_csv()automatically infers column types — integers stay integers, dates parse withparse_dates=df.to_csv("file.csv", index=False)exports without the row number index column- Use
pandaswhen you need filtering, grouping, merging, or any analysis beyond simple reading and writing
XML — eXtensible Markup Language
XML uses nested tags to represent hierarchical data. It is used in configuration files, document formats (Word, SVG), SOAP web services, RSS feeds, and many enterprise data exchange formats. Python's xml.etree.ElementTree module is built in and covers most use cases.
4. Parsing XML
Real-world use: parsing an RSS feed to extract article titles, reading a configuration file, consuming a SOAP API response, or processing an SVG file.
# Parsing XML with ElementTree
import xml.etree.ElementTree as ET
xml_data = """
The Great Gatsby
F. Scott Fitzgerald
1925
12.99
Sapiens
Yuval Noah Harari
2011
16.99
1984
George Orwell
1949
9.99
"""
root = ET.fromstring(xml_data) # parse from string
# ET.parse("file.xml").getroot() — parse from file
print("Root tag:", root.tag)
print(f"Total books: {len(root)}\n")
# Iterate over child elements
for book in root.findall("book"):
book_id = book.get("id") # get an attribute
genre = book.get("genre")
title = book.find("title").text # get element text
author = book.find("author").text
price = float(book.find("price").text)
print(f"[{book_id}] {title} by {author} — ${price:.2f} ({genre})")
# findall with a simple filter — fiction books only
print("\nFiction books:")
for book in root.findall("book[@genre='fiction']"):
print(" ", book.find("title").text)Total books: 3
[1] The Great Gatsby by F. Scott Fitzgerald — $12.99 (fiction)
[2] Sapiens by Yuval Noah Harari — $16.99 (non-fiction)
[3] 1984 by George Orwell — $9.99 (fiction)
Fiction books:
The Great Gatsby
1984
root.findall("tag")— returns all direct children with that tag nameelement.find("tag")— returns the first matching child element (orNone)element.get("attr")— returns the attribute value (orNoneif not present)element.text— returns the text content between the opening and closing tags- XPath-style expressions like
[@genre='fiction']filter on attribute values
5. Building and Writing XML
# Building XML with ElementTree
import xml.etree.ElementTree as ET
# Build the tree programmatically
root = ET.Element("inventory")
items = [
{"sku": "NB001", "name": "Notebook", "qty": 150, "price": 4.99},
{"sku": "PN002", "name": "Pen", "qty": 500, "price": 1.50},
{"sku": "DS003", "name": "Desk", "qty": 20, "price": 89.99},
]
for item in items:
el = ET.SubElement(root, "item", sku=item["sku"])
ET.SubElement(el, "name").text = item["name"]
ET.SubElement(el, "qty").text = str(item["qty"])
ET.SubElement(el, "price").text = str(item["price"])
# Pretty-print (Python 3.9+)
ET.indent(root, space=" ")
tree = ET.ElementTree(root)
# Print to console
print(ET.tostring(root, encoding="unicode"))
# Write to file
tree.write("inventory.xml", encoding="unicode", xml_declaration=True)<item sku="NB001">
<name>Notebook</name>
<qty>150</qty>
<price>4.99</price>
</item>
<item sku="PN002">
<name>Pen</name>
<qty>500</qty>
<price>1.5</price>
</item>
<item sku="DS003">
<name>Desk</name>
<qty>20</qty>
<price>89.99</price>
</item>
</inventory>
ET.Element("tag")— creates the root elementET.SubElement(parent, "tag", attr=value)— creates a child element with optional attributesET.indent(root)— adds whitespace for human-readable output (Python 3.9+)ET.tostring(root, encoding="unicode")— serialises the tree to a string- For complex XML with namespaces or XSLT, consider the third-party
lxmllibrary instead
Summary Table
| Tool | Format | Best Used For |
|---|---|---|
csv.DictReader |
CSV | Reading CSV rows as named dicts |
csv.DictWriter |
CSV | Writing dicts as CSV rows |
pandas.read_csv |
CSV | Analysis, filtering, aggregation |
ET.fromstring() |
XML | Parse XML from a string |
ET.parse() |
XML | Parse XML from a file |
ET.SubElement() |
XML | Build XML trees programmatically |
Practice Questions
Practice 1. What is the difference between csv.reader and csv.DictReader?
Practice 2. Why must you specify newline="" when opening a CSV file in Python?
Practice 3. In ElementTree, what method retrieves the text between an element's opening and closing tags?
Practice 4. What does df.to_csv("file.csv", index=False) do differently from the default?
Practice 5. Which ElementTree method adds a child element to an existing parent element?
Quiz
Quiz 1. What data type does csv.DictReader return for each row?
Quiz 2. What does pd.read_csv(..., parse_dates=["start_date"]) do?
Quiz 3. In ElementTree, what is the difference between find() and findall()?
Quiz 4. How does csv.writer handle a field value that contains a comma?
Quiz 5. Which Python version introduced ET.indent() for pretty-printing XML?
Next up — NumPy: fast numerical computing with arrays, vectorised operations, and linear algebra.