Incremental Sync refactor #52

@swalkinshaw

Incremental Sync Architecture

Motivation

WP Packages currently runs a full pipeline on every sync cycle: discover packages → fetch updates → build ~140k files to disk → deploy via symlink swap → upload to R2. This worked when Composer v1 required a complete provider tree, but since dropping v1 support, the build directory is vestigial overhead. Every run rewrites all files regardless of whether anything changed, and the R2 sync walks the entire build directory doing byte comparisons — O(total packages) instead of O(changed packages).

Goal

Replace the build-directory pipeline with a DB-driven architecture where SQLite is the single source of truth. Packages get a content_hash (what the data looks like) and a deployed_hash (what's live on R2). Finding what needs uploading becomes a single query: WHERE content_hash != deployed_hash. No intermediate files, no filesystem walking, no manifest.
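A minimal sketch of the two-hash idea, assuming SHA-256 over the serialized JSON. The `Package` struct, its `Versions` payload, and both function names are hypothetical stand-ins for the real table and serializer, and rows live in memory rather than in SQLite:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
)

// Package is a hypothetical in-memory stand-in for one row of the packages table.
type Package struct {
	Name         string
	Versions     map[string]string // version -> download URL (illustrative payload)
	ContentHash  string            // what the data looks like now
	DeployedHash string            // what was last uploaded; empty if never deployed
}

// ComputeContentHash hashes the serialized metadata. encoding/json writes map
// keys in sorted order, so identical data always yields identical bytes.
func ComputeContentHash(p Package) (string, error) {
	b, err := json.Marshal(struct {
		Name     string            `json:"name"`
		Versions map[string]string `json:"versions"`
	}{p.Name, p.Versions})
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(b)
	return hex.EncodeToString(sum[:]), nil
}

// Dirty is the in-memory equivalent of WHERE content_hash != deployed_hash.
func Dirty(pkgs []Package) []Package {
	var out []Package
	for _, p := range pkgs {
		if p.ContentHash != p.DeployedHash {
			out = append(out, p)
		}
	}
	return out
}
```

One thing to watch in the SQL form: if deployed_hash is nullable, `content_hash != deployed_hash` evaluates to NULL for never-deployed rows and skips them. Using `IS NOT` in SQLite, or defaulting the column to an empty string, would avoid that.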

How It Works

Three-step pipeline: Discover → Update → Sync

  • Discover checks what packages exist and which ones changed (via SVN revision log). Cheap — no API calls.
  • Update fetches full metadata from wp.org only for changed packages, normalizes versions, and computes content_hash. If the hash changed, the package is marked dirty.
  • Sync queries for dirty packages, serializes their Composer JSON, uploads to R2 in parallel, then stamps deployed_hash. Crash-safe — if interrupted, the next run picks up where it left off.
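The Sync step above can be sketched as a bounded worker pool that stamps the deployed hash only after a successful upload, which is what makes a rerun resume where the last one stopped. This is an illustration under assumptions, not the project's code: `Row`, `Uploader`, `ErrUpload`, and `SyncDirty` are hypothetical names, and rows are held in memory instead of SQLite.

```go
package main

import (
	"errors"
	"sync"
)

// Row is a hypothetical stand-in for one package row.
type Row struct {
	Name         string
	ContentHash  string
	DeployedHash string
}

// Uploader abstracts the R2 upload so the loop can be exercised without a network.
type Uploader func(name string) error

// ErrUpload is an illustrative sentinel error for failed uploads.
var ErrUpload = errors.New("upload failed")

// SyncDirty uploads every row whose hashes differ, running at most `workers`
// uploads at once. A failed or interrupted upload leaves the row dirty, so the
// next run retries it.
func SyncDirty(rows []*Row, upload Uploader, workers int) error {
	if workers < 1 {
		workers = 1
	}
	sem := make(chan struct{}, workers)
	var (
		wg   sync.WaitGroup
		mu   sync.Mutex
		errs []error
	)
	for _, r := range rows {
		if r.ContentHash == r.DeployedHash {
			continue // already live
		}
		wg.Add(1)
		sem <- struct{}{} // blocks while `workers` uploads are in flight
		go func(r *Row) {
			defer wg.Done()
			defer func() { <-sem }()
			if err := upload(r.Name); err != nil {
				mu.Lock()
				errs = append(errs, err)
				mu.Unlock()
				return
			}
			r.DeployedHash = r.ContentHash // stamp only after success
		}(r)
	}
	wg.Wait()
	return errors.Join(errs...) // nil when every upload succeeded
}
```

In a real implementation the stamp would be an UPDATE against the database, ideally writing the hash that was actually uploaded so an Update that lands mid-sync is not masked.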

DB-backed serving for local dev: the HTTP server serializes Composer metadata directly from SQLite on each request, eliminating the build step entirely for development.
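As a rough sketch of per-request serialization, the handler below renders a package's metadata on demand. A `Store` map stands in for SQLite, the `Version` fields are trimmed to two, and the names `Store`, `Version`, and `Render` are hypothetical. The response uses the `packages` → name → version-list shape that Composer 2 metadata files follow.

```go
package main

import (
	"encoding/json"
	"net/http"
	"strings"
)

// Version is a deliberately tiny stand-in for a Composer version entry.
type Version struct {
	Name    string `json:"name"`
	Version string `json:"version"`
}

// Store is a hypothetical stand-in for the SQLite lookup: package name -> versions.
type Store map[string][]Version

// Render maps a request path such as /p2/wp-plugin/akismet.json to a JSON body.
// It reports false when the path is malformed or the package is unknown.
func (s Store) Render(path string) ([]byte, bool) {
	name, ok := strings.CutPrefix(path, "/p2/")
	if !ok {
		return nil, false
	}
	name, ok = strings.CutSuffix(name, ".json")
	if !ok {
		return nil, false
	}
	versions, found := s[name]
	if !found {
		return nil, false
	}
	body, err := json.Marshal(map[string]map[string][]Version{
		"packages": {name: versions},
	})
	if err != nil {
		return nil, false
	}
	return body, true
}

// ServeHTTP serializes on every request; there is no build step or cache.
func (s Store) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	body, ok := s.Render(r.URL.Path)
	if !ok {
		http.NotFound(w, r)
		return
	}
	w.Header().Set("Content-Type", "application/json")
	w.Write(body)
}
```

Because `Store` implements `http.Handler`, something like `http.ListenAndServe("localhost:8080", store)` would be enough to try it locally.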

Conditional packages.json upload: the root Composer config is effectively static, so it's uploaded with If-None-Match — a no-op on most runs.
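One way the conditional upload might look, sketched with plain `net/http`. The assumptions here are significant: that the object store honors `If-None-Match` on PUT and answers 412 Precondition Failed when the stored ETag matches, and that the ETag is the quoted MD5 hex of the body (true for simple single-part S3-style uploads). Request signing is omitted, and `NewConditionalPut` and `Uploaded` are hypothetical helpers.

```go
package main

import (
	"bytes"
	"crypto/md5"
	"encoding/hex"
	"fmt"
	"net/http"
)

// NewConditionalPut builds a PUT that asks the server to skip the write when
// the stored object already has the same ETag as the new body.
func NewConditionalPut(url string, body []byte) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	sum := md5.Sum(body)
	// Assumption: ETag == quoted MD5 hex of the content.
	req.Header.Set("If-None-Match", `"`+hex.EncodeToString(sum[:])+`"`)
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}

// Uploaded interprets the response status: 412 means the content was already
// live (a no-op), 2xx means it was written, anything else is an error.
func Uploaded(status int) (bool, error) {
	switch {
	case status == http.StatusPreconditionFailed:
		return false, nil
	case status >= 200 && status < 300:
		return true, nil
	default:
		return false, fmt.Errorf("unexpected status %d", status)
	}
}
```

An S3 SDK would normally carry the same precondition as a parameter on the put call rather than a hand-built request; the raw form is used here only to make the header visible.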

Phases

  1. Schema + Content Hash — Add content_hash, deployed_hash, and content_changed_at columns. Extract serialization logic into a pure composer package. Compute hashes at update time.
  2. DB-Backed Serve Layer — Serve /p2/{type}/{name}.json and /packages.json directly from SQLite. Remove the dev command in favor of Makefile-composed CLI commands.
  3. R2 Sync — The main cut-over. Replace filesystem-based build + deploy with DB-driven sync. Combine builds and sync_runs tables into a single pipeline_runs table. Delete ~1,200 lines of build/deploy/filesystem code.
  4. Test Infrastructure — Update existing integration tests (mock wp.org server and gofakes3 already built) for the new architecture. Add a full round-trip test: seed DB → sync to fake S3 → resolve with Composer.
  5. Metadata Changes Feed — Packagist-compatible /metadata/changes.json endpoint powered by the content_changed_at column, enabling third-party mirrors to poll for updates efficiently.
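To illustrate how Phase 5 could build on the content_changed_at column, here is a small sketch that selects packages changed after a caller-supplied timestamp. The `ChangeRow` and `Action` types and their JSON field names are assumptions loosely modeled on a changes feed, not a confirmed wire format, and timestamps are treated as opaque integers.

```go
package main

import "sort"

// ChangeRow is a hypothetical projection of (name, content_changed_at).
type ChangeRow struct {
	Name             string
	ContentChangedAt int64
}

// Action is one feed entry; the field names are illustrative assumptions.
type Action struct {
	Type    string `json:"type"`
	Package string `json:"package"`
	Time    int64  `json:"time"`
}

// ChangesSince returns an "update" action for every row changed strictly after
// `since`, oldest first, so a mirror can replay them in order.
func ChangesSince(rows []ChangeRow, since int64) []Action {
	actions := []Action{}
	for _, r := range rows {
		if r.ContentChangedAt > since {
			actions = append(actions, Action{Type: "update", Package: r.Name, Time: r.ContentChangedAt})
		}
	}
	sort.SliceStable(actions, func(i, j int) bool {
		return actions[i].Time < actions[j].Time
	})
	return actions
}
```

Against SQLite the same selection would be a range query on content_changed_at, which an index on that column keeps proportional to the number of changed packages.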

Phases are sequential — each builds on the previous — but Phase 2 can coexist with the old pipeline (the serve layer reads from DB while the old pipeline still runs), making the transition incremental.
