close
Skip to content

fix(ingest/snowflake): fix silently dropped views in batched SHOW VIEWS#17434

Draft
rob-1019 wants to merge 1 commit into
masterfrom
fix/snowflake-view-batch-drop
Draft

fix(ingest/snowflake): fix silently dropped views in batched SHOW VIEWS#17434
rob-1019 wants to merge 1 commit into
masterfrom
fix/snowflake-view-batch-drop

Conversation

@rob-1019
Copy link
Copy Markdown
Contributor

@rob-1019 rob-1019 commented May 13, 2026

Summary

Fixes a silent failure in the Snowflake source where view definitions
(and therefore view lineage) were not ingested for ~25% of views in
schemas with > 1000 views, even when those views matched the user's
table_pattern/view_pattern filter and include_view_definitions: true
was set.

Observed: a schema with 1,122 views produced the report
line "Could not fetch definitions for 287 views in
SOME_DB.FOO_V after processing all batches."
The dataset URN was
created, schema/columns were populated, but the viewProperties aspect
(containing the DDL) and view-lineage edges were silently missing.

Root cause

_maybe_populate_empty_view_definitions in snowflake_schema.py called
build_prefix_batches(..., max_groups_in_batch=1) and then took
batch[0] from each returned batch. A comment claimed this was safe
("max_groups_in_batch=1 makes it safe to access batch[0]") — but the
underlying _batch_prefix_groups packer used a strict > check on the
group-count axis, so max_groups_in_batch=N actually packed up to N+1
groups per batch. The second group of every paired batch was silently
dropped — no SHOW VIEWS LIKE '<prefix>%' query ran for those prefixes,
so the text/DDL column was never fetched.

Concrete trace (the production case): a top-level group split by
first letter yielded A(681), B(55), C(55). The greedy packer:

A(681) → batch=[A], size=681
B(55)  → 681+55=736 ≤ 1000 and len=1 not > 1 → batch=[A, B], size=736
C(55)  → len=2 > 1 → close [A, B]; start fresh with [C]

Result: batches = [[A, B], [C]]. Caller takes batch[0] of each → only
A and C get queries; everything under B is dropped. Repeating
across the alphabet produced the observed 10-batch SHOW VIEWS sequence
(A, C, E, G, K, M, O, R, T, V) with D/F/H/L/N/P/S/U missing, and
287 / 1122 unfetched view definitions.

The fix

Three concerns addressed in one commit:

  1. Caller fix — flatten the comprehension in
    _maybe_populate_empty_view_definitions to iterate every PrefixGroup
    instead of slicing batch[0]. This alone fixes the production bug.

  2. Library fix — change _batch_prefix_groups to mirror the
    size-axis check's (current state) + (incoming delta) > cap form. The
    parameter name now means what it says:
    N packs at most N groups per batch.

    The column-fetch call site bumps max_groups_in_batch from 5 to
    6 to preserve the historical 6-groups-per-batch density (one
    SELECT against information_schema.columns packs that many LIKE
    prefixes). Verified locally that (10000, 6) under the new
    semantics produces byte-for-byte identical batches to (10000, 5)
    under the old strict-> semantics on both uniform and skewed inputs.

  3. Regression tests:

    • Snowflake-level test pinning the caller's "every prefix gets a
      query" contract via mocked SHOW VIEWS responses. Constructs an
      A=990, B=55, C=55 workload that exercises the bug-triggering
      pairing under the production parameters; verified to fail
      without the caller-side fix.
    • Library-level parametrized invariant test asserting that the
      union of names across all groups equals the input, every name
      shares its group's declared prefix, and no batch exceeds
      max_groups_in_batch. Covers production parameters from both
      Snowflake call sites plus pathological cases. Would have caught
      the original bug on its own.

Test plan

  • New snowflake-level regression test verified to fail without
    the caller-side fix and pass after.
  • New library-level invariant test covers 7 parameter combinations.
  • Existing test_build_prefix_batches_exceeds_max_batch_size
    still passes under the new semantics.
  • ./gradlew :metadata-ingestion:lintFix clean.
  • 19 unit tests pass (4 pre-existing builder + 7 new invariant
    parametrizations + 7 pre-existing snowflake schema + 1 new
    snowflake regression).

Checklist

  • PR conforms to the Contributing Guideline (PR Title Format)
  • Links to related issues — none filed; this PR documents the
    observed symptom.
  • Tests for the changes have been added.
  • Docs related to the changes have been added/updated — N/A;
    bug fix to existing behavior, no surface change.
  • Breaking change considerations — none. Ingestion produces more
    metadata than before; no schema or config API changes.

🤖 Generated with Claude Code

`_maybe_populate_empty_view_definitions` called `build_prefix_batches` with
`max_groups_in_batch=1` and then took `batch[0]` per batch. But the packer
used a strict `>` check on the group-count axis, so each batch could hold
up to two `PrefixGroup`s. The second group of every paired batch — entire
alphabetical ranges of views — never had its `SHOW VIEWS LIKE '<prefix>%'`
query issued, and those views were left with empty `viewProperties` even
when `include_view_definitions: true` and the view was in scope of the
filter. The view dataset was created in DataHub, but with no DDL and no
view-lineage edges.

Concrete trace (the actual production case): a schema with 1122 views,
top-level group split by first letter. Letter A produces a group of 681,
letters B and C produce ~55 each. The packer in `_batch_prefix_groups`:

  Start: batch=[], size=0
  A(681): 0+681 <= 1000 and len(batch)=0 not > 1 → append.
          batch=[A], size=681
  B(55):  681+55 = 736 <= 1000 but len(batch)=1 not > 1 → append.
          batch=[A, B], size=736  ← second group lands in the same batch
  C(55):  len(batch)=2 > 1 → close batch.
          batches=[[A, B]]. New batch=[C], size=55.

`batch[0]` of `[A, B]` is A; B is silently dropped, so no
`SHOW VIEWS LIKE 'B%'` ever runs. Same pattern repeats for D/F/H/L/N/P/S/U,
which produced the observed 10-batch sequence (A, C, E, G, K, M, O, R, T, V)
and "Could not fetch definitions for 287 views" in a real ingestion run.

This change does three things:

1. Caller fix: flatten the comprehension to iterate every group instead
   of slicing `batch[0]`. This alone fixes the production bug.

2. Library fix: change `_batch_prefix_groups` to mirror the size-axis
   check's "(current state) + (incoming delta) > cap" form on the count
   axis (`len(batch) + 1 > max_groups_in_batch`) so the parameter name
   means what it says. The size check has used this form since the file
   was first written; the count check is now consistent. Bump the
   column-fetch call site from `max_groups_in_batch=5` to `6` to
   preserve the historical 6-groups-per-batch density. Verified locally
   that `(10000, 6)` under the new semantics produces byte-for-byte
   identical batches to `(10000, 5)` under the old strict-`>` semantics
   on both uniform and skewed inputs.

3. Regression tests: a snowflake-level test that pins the caller's
   "every prefix gets a query" contract via mocked SHOW VIEWS responses,
   and a parametrized library-level invariant test asserting that the
   union of names across all groups equals the input, every name shares
   its group's prefix, and no batch exceeds `max_groups_in_batch`.
   The library invariant would have caught the original bug on its own.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the ingestion PR or Issue related to the ingestion of metadata label May 13, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 13, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

@datahub-connector-tests
Copy link
Copy Markdown

Connector Tests Results

All connector tests passed for commit 53ec1d5

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ingestion PR or Issue related to the ingestion of metadata

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant