endjin-polars-examples

Practical examples of using Polars in modern, cloud-native data engineering pipelines. The primary dataset is the HM Land Registry Price Paid Data — publicly available records of residential property sales in England and Wales.

Contents

Folder Description
notebooks/sqlbits_2026/ Demo notebooks for the SQLbits 2026 conference — download Land Registry data and run the full wrangling pipeline
notebooks/polars_blog/ Polars blog post series covering eager/lazy patterns, Polars vs pandas, and Microsoft Fabric
src/data_wrangler/ DataWrangler class — transformation pipeline and Pandera schema
src/data_importers/ LandRegistryImporter — downloads yearly price paid CSV files
tests/bdd/ Behavioural tests (Gherkin + Behave) covering every pipeline step
skills/ How-to guides for working in this codebase
data/land_registry_data/ Downloaded CSVs (not committed — generated by the download notebook)

SQLbits 2026 notebooks

The notebooks/sqlbits_2026/ folder contains two notebooks intended to be run in order:

  1. download_land_registry_data.ipynb — downloads up to 10 years of yearly price paid CSV files (~100 MB each) from the Land Registry S3 bucket into data/land_registry_data/. Already-downloaded files are skipped.

  2. summarise_land_registry_data.ipynb — runs the full DataWrangler.run_pipeline() against the downloaded data and plots median house price by year and property type using Plotly.

DataWrangler pipeline

DataWrangler in src/data_wrangler/data_wrangler.py exposes a run_pipeline(data_folder) class method that chains the following steps using Polars lazy execution (scan_csv → collect):

load_data                          scan_csv glob, name columns, cast price/date
drop_records_without_postcode      remove rows with null/empty postcode
drop_records_without_date          remove rows with null date
filter_other_property_types        remove property_type = "O" (Other)
extract_year_from_date             add year column from date
rename_property_type               D/S/T/F → Detached/Semi-Detached/Terraced/Flat
rename_duration                    F/L/U → Freehold/Leasehold/Unknown
rename_old_new                     Y/N → New/Old
extract_postcode_area              add postcode_area (e.g. SW1A from SW1A 2AA)
summarise_by_year_and_property_type  group_by year + property_type, agg sales/prices
sort_by_year_and_property_type     sort ascending by year then property_type
collect                            execute the lazy plan, return eager DataFrame

All transformation methods are also callable individually as static methods, accepting either pl.DataFrame or pl.LazyFrame.

Getting started

Prerequisites

  • uv (Python package manager)
  • VS Code with the Python and Behave VSC extensions
  • az login if reading/writing to Azure storage (local runs use the filesystem)

Install dependencies

uv sync

Download Land Registry data

Run notebooks/sqlbits_2026/download_land_registry_data.ipynb, or use the importer directly:

from data_importers import LandRegistryImporter

importer = LandRegistryImporter(
    raw_data_download_path="data/land_registry_data",
    storage_options={},
    number_of_years=5,
)
importer.download_land_registry_data()

Run the pipeline

from data_wrangler import DataWrangler

summary = DataWrangler.run_pipeline("data/land_registry_data")
print(summary)

Run the tests

uv run behave --tags @unit    # fast in-memory unit tests
uv run behave --tags @e2e     # end-to-end test against tests/bdd/test_data/
uv run behave                 # all tests

Dataset

HM Land Registry Price Paid Data tracks residential property sales in England and Wales submitted for registration. Data runs from 1995 to present. Each yearly file is approximately 100 MB. Published under the Open Government Licence v3.0.

Source: https://www.gov.uk/guidance/about-the-price-paid-data

Skills

Task-specific how-to guides live in skills/. Read the relevant skill before working in each area:

Task Skill
Writing or running BDD / Gherkin tests skills/executable-specifications/SKILL.md
Adding or structuring Python packages skills/python-package-management/SKILL.md
Writing Polars transformations skills/polars-best-practices/SKILL.md
Land Registry field definitions, schema, loading skills/land-registry-price-paid-data/SKILL.md
