Practical examples of Polars in use for modern, cloud-native data engineering pipelines. The primary dataset is the HM Land Registry Price Paid Data — publicly available records of every residential property sale in England and Wales.
| Folder | Description |
|---|---|
| `notebooks/sqlbits_2026/` | Demo notebooks for the SQLbits 2026 conference: download Land Registry data and run the full wrangling pipeline |
| `notebooks/polars_blog/` | Polars blog post series covering eager/lazy patterns, Polars vs pandas, and Microsoft Fabric |
| `src/data_wrangler/` | `DataWrangler` class: transformation pipeline and Pandera schema |
| `src/data_importers/` | `LandRegistryImporter`: downloads yearly price paid CSV files |
| `tests/bdd/` | Behavioural tests (Gherkin + Behave) covering every pipeline step |
| `skills/` | How-to guides for working in this codebase |
| `data/land_registry_data/` | Downloaded CSVs (not committed; generated by the download notebook) |
The `notebooks/sqlbits_2026/` folder contains two notebooks intended to be run in order:

- `download_land_registry_data.ipynb` – downloads up to 10 years of yearly price paid CSV files (~100 MB each) from the Land Registry S3 bucket into `data/land_registry_data/`. Already-downloaded files are skipped.
- `summarise_land_registry_data.ipynb` – runs the full `DataWrangler.run_pipeline()` against the downloaded data and plots median house price by year and property type using Plotly.
`DataWrangler` in `src/data_wrangler/data_wrangler.py` exposes a `run_pipeline(data_folder)` class method that chains the following steps using Polars lazy execution (`scan_csv` → `collect`):
| Step | What it does |
|---|---|
| `load_data` | `scan_csv` glob, name columns, cast price/date |
| `drop_records_without_postcode` | remove rows with a null/empty postcode |
| `drop_records_without_date` | remove rows with a null date |
| `filter_other_property_types` | remove `property_type = "O"` (Other) |
| `extract_year_from_date` | add a `year` column derived from `date` |
| `rename_property_type` | D/S/T/F → Detached/Semi-Detached/Terraced/Flat |
| `rename_duration` | F/L/U → Freehold/Leasehold/Unknown |
| `rename_old_new` | Y/N → New/Old |
| `extract_postcode_area` | add `postcode_area` (e.g. SW1A from SW1A 2AA) |
| `summarise_by_year_and_property_type` | group_by `year` + `property_type`, aggregate sales/prices |
| `sort_by_year_and_property_type` | sort ascending by `year`, then `property_type` |
| `collect` | execute the lazy plan, return an eager DataFrame |
All transformation methods are also callable individually as static methods, accepting either a `pl.DataFrame` or a `pl.LazyFrame`.
- uv (Python package manager)
- VS Code with the Python and Behave VSC extensions
- `az login` if reading/writing to Azure storage (local runs use the filesystem)

Install dependencies:

```shell
uv sync
```

Run `notebooks/sqlbits_2026/download_land_registry_data.ipynb`, or use the importer directly:

```python
from data_importers import LandRegistryImporter

importer = LandRegistryImporter(
    raw_data_download_path="data/land_registry_data",
    storage_options={},
    number_of_years=5,
)
importer.download_land_registry_data()
```

Then run the wrangling pipeline:

```python
from data_wrangler import DataWrangler

summary = DataWrangler.run_pipeline("data/land_registry_data")
print(summary)
```

Run the tests:

```shell
uv run behave --tags @unit   # fast in-memory unit tests
uv run behave --tags @e2e    # end-to-end test against tests/bdd/test_data/
uv run behave                # all tests
```

HM Land Registry Price Paid Data tracks residential property sales in England and Wales submitted for registration. Data runs from 1995 to the present; each yearly file is approximately 100 MB. Published under the Open Government Licence v3.0.
Source: https://www.gov.uk/guidance/about-the-price-paid-data
Task-specific how-to guides live in skills/. Read the relevant skill before working in each area:
| Task | Skill |
|---|---|
| Writing or running BDD / Gherkin tests | skills/executable-specifications/SKILL.md |
| Adding or structuring Python packages | skills/python-package-management/SKILL.md |
| Writing Polars transformations | skills/polars-best-practices/SKILL.md |
| Land Registry field definitions, schema, loading | skills/land-registry-price-paid-data/SKILL.md |