fix(ingest/snowflake): fix silently dropped views in batched SHOW VIEWS by rob-1019 · Pull Request #17434 · datahub-project/datahub

rob-1019 · 2026-05-13T23:54:03Z

Summary

Fixes a silent failure in the Snowflake source where view definitions
(and therefore view lineage) were not ingested for ~25% of views in
schemas with > 1000 views, even when those views matched the user's
table_pattern/view_pattern filter and include_view_definitions: true
was set.

Observed: a schema with 1,122 views produced the report
line "Could not fetch definitions for 287 views in
SOME_DB.FOO_V after processing all batches." The dataset URN was
created, schema/columns were populated, but the viewProperties aspect
(containing the DDL) and view-lineage edges were silently missing.

Root cause

_maybe_populate_empty_view_definitions in snowflake_schema.py called
build_prefix_batches(..., max_groups_in_batch=1) and then took
batch[0] from each returned batch. A comment claimed this was safe
("max_groups_in_batch=1 makes it safe to access batch[0]") — but the
underlying _batch_prefix_groups packer used a strict > check on the
group-count axis, so max_groups_in_batch=N actually packed up to N+1
groups per batch. The second group of every paired batch was silently
dropped — no SHOW VIEWS LIKE '<prefix>%' query ran for those prefixes,
so the text/DDL column was never fetched.

Concrete trace (the production case): a top-level group split by
first letter yielded A(681), B(55), C(55). The greedy packer:

A(681) → batch=[A], size=681
B(55)  → 681+55=736 ≤ 1000 and len=1 not > 1 → batch=[A, B], size=736
C(55)  → len=2 > 1 → close [A, B]; start fresh with [C]

Result: batches = [[A, B], [C]]. Caller takes batch[0] of each → only
A and C get queries; everything under B is dropped. Repeating
across the alphabet produced the observed 10-batch SHOW VIEWS sequence
(A, C, E, G, K, M, O, R, T, V) with D/F/H/L/N/P/S/U missing, and
287 / 1122 unfetched view definitions.

The fix

Three concerns addressed in one commit:

Caller fix — flatten the comprehension in
_maybe_populate_empty_view_definitions to iterate every PrefixGroup
instead of slicing batch[0]. This alone fixes the production bug.
Library fix — change _batch_prefix_groups to mirror the
size-axis check's (current state) + (incoming delta) > cap form. The
parameter name now means what it says:
N packs at most N groups per batch.

The column-fetch call site bumps max_groups_in_batch from 5 to
6 to preserve the historical 6-groups-per-batch density (one
SELECT against information_schema.columns packs that many LIKE
prefixes). Verified locally that (10000, 6) under the new
semantics produces byte-for-byte identical batches to (10000, 5)
under the old strict-> semantics on both uniform and skewed inputs.
Regression tests:
- Snowflake-level test pinning the caller's "every prefix gets a
  query" contract via mocked SHOW VIEWS responses. Constructs an
  A=990, B=55, C=55 workload that exercises the bug-triggering
  pairing under the production parameters; verified to fail
  without the caller-side fix.
- Library-level parametrized invariant test asserting that the
  union of names across all groups equals the input, every name
  shares its group's declared prefix, and no batch exceeds
  max_groups_in_batch. Covers production parameters from both
  Snowflake call sites plus pathological cases. Would have caught
  the original bug on its own.

Test plan

New snowflake-level regression test verified to fail without
the caller-side fix and pass after.
New library-level invariant test covers 7 parameter combinations.
Existing test_build_prefix_batches_exceeds_max_batch_size
still passes under the new semantics.
./gradlew :metadata-ingestion:lintFix clean.
19 unit tests pass (4 pre-existing builder + 7 new invariant
parametrizations + 7 pre-existing snowflake schema + 1 new
snowflake regression).

Checklist

PR conforms to the Contributing Guideline (PR Title Format)
Links to related issues — none filed; this PR documents the
observed symptom.
Tests for the changes have been added.
Docs related to the changes have been added/updated — N/A;
bug fix to existing behavior, no surface change.
Breaking change considerations — none. Ingestion produces more
metadata than before; no schema or config API changes.

🤖 Generated with Claude Code

`_maybe_populate_empty_view_definitions` called `build_prefix_batches` with `max_groups_in_batch=1` and then took `batch[0]` per batch. But the packer used a strict `>` check on the group-count axis, so each batch could hold up to two `PrefixGroup`s. The second group of every paired batch — entire alphabetical ranges of views — never had its `SHOW VIEWS LIKE '<prefix>%'` query issued, and those views were left with empty `viewProperties` even when `include_view_definitions: true` and the view was in scope of the filter. The view dataset was created in DataHub, but with no DDL and no view-lineage edges. Concrete trace (the actual production case): a schema with 1122 views, top-level group split by first letter. Letter A produces a group of 681, letters B and C produce ~55 each. The packer in `_batch_prefix_groups`: Start: batch=[], size=0 A(681): 0+681 <= 1000 and len(batch)=0 not > 1 → append. batch=[A], size=681 B(55): 681+55 = 736 <= 1000 but len(batch)=1 not > 1 → append. batch=[A, B], size=736 ← second group lands in the same batch C(55): len(batch)=2 > 1 → close batch. batches=[[A, B]]. New batch=[C], size=55. `batch[0]` of `[A, B]` is A; B is silently dropped, so no `SHOW VIEWS LIKE 'B%'` ever runs. Same pattern repeats for D/F/H/L/N/P/S/U, which produced the observed 10-batch sequence (A, C, E, G, K, M, O, R, T, V) and "Could not fetch definitions for 287 views" in a real ingestion run. This change does three things: 1. Caller fix: flatten the comprehension to iterate every group instead of slicing `batch[0]`. This alone fixes the production bug. 2. Library fix: change `_batch_prefix_groups` to mirror the size-axis check's "(current state) + (incoming delta) > cap" form on the count axis (`len(batch) + 1 > max_groups_in_batch`) so the parameter name means what it says. The size check has used this form since the file was first written; the count check is now consistent. Bump the column-fetch call site from `max_groups_in_batch=5` to `6` to preserve the historical 6-groups-per-batch density. Verified locally that `(10000, 6)` under the new semantics produces byte-for-byte identical batches to `(10000, 5)` under the old strict-`>` semantics on both uniform and skewed inputs. 3. Regression tests: a snowflake-level test that pins the caller's "every prefix gets a query" contract via mocked SHOW VIEWS responses, and a parametrized library-level invariant test asserting that the union of names across all groups equals the input, every name shares its group's prefix, and no batch exceeds `max_groups_in_batch`. The library invariant would have caught the original bug on its own. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov · 2026-05-13T23:57:52Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

datahub-connector-tests · 2026-05-14T00:30:47Z

Connector Tests Results

All connector tests passed for commit 53ec1d5

View full test logs →

To skip connector tests, add the skip-connector-tests label (org members only).

Autogenerated by the connector-tests CI pipeline.

github-actions Bot added the ingestion PR or Issue related to the ingestion of metadata label May 13, 2026

github-actions Bot deployed to datahub-wheels (Preview) May 13, 2026 23:56 View deployment

vercel Bot deployed to Preview May 14, 2026 00:07 View deployment

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(ingest/snowflake): fix silently dropped views in batched SHOW VIEWS#17434

fix(ingest/snowflake): fix silently dropped views in batched SHOW VIEWS#17434
rob-1019 wants to merge 1 commit into
masterfrom
fix/snowflake-view-batch-drop

rob-1019 commented May 13, 2026 •

edited

Loading

Uh oh!

codecov Bot commented May 13, 2026 •

edited

Loading

Uh oh!

datahub-connector-tests Bot commented May 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

rob-1019 commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Root cause

The fix

Test plan

Checklist

Uh oh!

codecov Bot commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

datahub-connector-tests Bot commented May 14, 2026

Connector Tests Results

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

rob-1019 commented May 13, 2026 •

edited

Loading

codecov Bot commented May 13, 2026 •

edited

Loading