endjin-polars-examples

Practical examples of using Polars in modern, cloud-native data engineering pipelines. The primary dataset is the HM Land Registry Price Paid Data — publicly available records of residential property sales in England and Wales.

Contents

Folder Description
notebooks/sqlbits_2026/ Demo notebooks for the SQLbits 2026 conference — download Land Registry data and run the full wrangling pipeline
notebooks/polars_blog/ Polars blog post series covering eager/lazy patterns, Polars vs pandas, and Microsoft Fabric
src/data_wrangler/ DataWrangler class — transformation pipeline and Pandera schema
src/data_importers/ LandRegistryImporter — downloads yearly price paid CSV files
tests/bdd/ Behavioural tests (Gherkin + Behave) covering every pipeline step
skills/ How-to guides for working in this codebase
data/land_registry_data/ Downloaded CSVs (not committed — generated by the download notebook)

SQLbits 2026 notebooks

The notebooks/sqlbits_2026/ folder contains two notebooks intended to be run in order:

  1. download_land_registry_data.ipynb — downloads up to 10 years of yearly price paid CSV files (~100 MB each) from the Land Registry S3 bucket into data/land_registry_data/. Already-downloaded files are skipped.

  2. summarise_land_registry_data.ipynb — runs the full DataWrangler.run_pipeline() against the downloaded data and plots median house price by year and property type using Plotly.

DataWrangler pipeline

DataWrangler in src/data_wrangler/data_wrangler.py exposes a run_pipeline(data_folder) class method that chains the following steps using Polars lazy execution (scan_csv → collect):

load_data                          scan_csv glob, name columns, cast price/date
drop_records_without_postcode      remove rows with null/empty postcode
drop_records_without_date          remove rows with null date
filter_other_property_types        remove property_type = "O" (Other)
extract_year_from_date             add year column from date
rename_property_type               D/S/T/F → Detached/Semi-Detached/Terraced/Flat
rename_duration                    F/L/U → Freehold/Leasehold/Unknown
rename_old_new                     Y/N → New/Old
extract_postcode_area              add postcode_area (e.g. SW1A from SW1A 2AA)
summarise_by_year_and_property_type  group_by year + property_type, agg sales/prices
sort_by_year_and_property_type     sort ascending by year then property_type
collect                            execute the lazy plan, return eager DataFrame

All transformation methods are also callable individually as static methods, accepting either pl.DataFrame or pl.LazyFrame.

Getting started

Prerequisites

  • uv (Python package manager)
  • VS Code with the Python and Behave VSC extensions
  • az login if reading/writing to Azure storage (local runs use the filesystem)

Install dependencies

uv sync

Download Land Registry data

Run notebooks/sqlbits_2026/download_land_registry_data.ipynb, or use the importer directly:

from data_importers import LandRegistryImporter

importer = LandRegistryImporter(
    raw_data_download_path="data/land_registry_data",
    storage_options={},
    number_of_years=5,
)
importer.download_land_registry_data()

Run the pipeline

from data_wrangler import DataWrangler

summary = DataWrangler.run_pipeline("data/land_registry_data")
print(summary)

Run the tests

uv run behave --tags @unit    # fast in-memory unit tests
uv run behave --tags @e2e     # end-to-end test against tests/bdd/test_data/
uv run behave                 # all tests

Dataset

HM Land Registry Price Paid Data tracks residential property sales in England and Wales submitted for registration. Data runs from 1995 to present. Each yearly file is approximately 100 MB. Published under the Open Government Licence v3.0.

Source: https://www.gov.uk/guidance/about-the-price-paid-data

Skills

Task-specific how-to guides live in skills/. Read the relevant skill before working in each area:

Task Skill
Writing or running BDD / Gherkin tests skills/executable-specifications/SKILL.md
Adding or structuring Python packages skills/python-package-management/SKILL.md
Writing Polars transformations skills/polars-best-practices/SKILL.md
Land Registry field definitions, schema, loading skills/land-registry-price-paid-data/SKILL.md
