fix(ingest/snowflake): fix silently dropped views in batched SHOW VIEWS #17434
Draft
rob-1019 wants to merge 1 commit into
`_maybe_populate_empty_view_definitions` called `build_prefix_batches` with
`max_groups_in_batch=1` and then took `batch[0]` per batch. But the packer
used a strict `>` check on the group-count axis, so each batch could hold
up to two `PrefixGroup`s. The second group of every paired batch — entire
alphabetical ranges of views — never had its `SHOW VIEWS LIKE '<prefix>%'`
query issued, and those views were left with empty `viewProperties` even
when `include_view_definitions: true` and the view was in scope of the
filter. The view dataset was created in DataHub, but with no DDL and no
view-lineage edges.
Concrete trace (the actual production case): a schema with 1122 views,
top-level group split by first letter. Letter A produces a group of 681,
letters B and C produce ~55 each. The packer in `_batch_prefix_groups`:
    Start:  batch=[], size=0
    A(681): 0+681 <= 1000 and len(batch)=0 not > 1 → append.
            batch=[A], size=681
    B(55):  681+55 = 736 <= 1000 but len(batch)=1 not > 1 → append.
            batch=[A, B], size=736 ← second group lands in the same batch
    C(55):  len(batch)=2 > 1 → close batch.
            batches=[[A, B]]. New batch=[C], size=55.
`batch[0]` of `[A, B]` is A; B is silently dropped, so no
`SHOW VIEWS LIKE 'B%'` ever runs. Same pattern repeats for D/F/H/L/N/P/S/U,
which produced the observed 10-batch sequence (A, C, E, G, K, M, O, R, T, V)
and "Could not fetch definitions for 287 views" in a real ingestion run.
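The trace can be reproduced with a minimal stand-in for the packer. This is a sketch only: `pack_strict` and the `(prefix, count)` tuples are hypothetical simplifications written for illustration, not the actual `_batch_prefix_groups` code; the one thing taken from the description is the strict `>` comparison on the group count.

```python
from typing import List, Tuple

# (prefix, number of views under it): a hypothetical stand-in for PrefixGroup
Group = Tuple[str, int]


def pack_strict(
    groups: List[Group], max_batch_size: int, max_groups_in_batch: int
) -> List[List[str]]:
    """Greedy packer using the old strict `>` check on the group count."""
    batches: List[List[str]] = []
    batch: List[str] = []
    size = 0
    for prefix, count in groups:
        over_size = size + count > max_batch_size
        # Old check: compares the current length, not the length after appending,
        # so a batch is only closed once it already holds cap + 1 groups.
        over_count = len(batch) > max_groups_in_batch
        if batch and (over_size or over_count):
            batches.append(batch)
            batch, size = [], 0
        batch.append(prefix)
        size += count
    if batch:
        batches.append(batch)
    return batches


batches = pack_strict([("A", 681), ("B", 55), ("C", 55)], 1000, 1)
print(batches)  # [['A', 'B'], ['C']]
```

With `max_groups_in_batch=1` the first batch still ends up with two groups, which is the pairing the trace describes.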
This change does three things:
1. Caller fix: flatten the comprehension to iterate every group instead
of slicing `batch[0]`. This alone fixes the production bug.
2. Library fix: change `_batch_prefix_groups` to mirror the size-axis
check's "(current state) + (incoming delta) > cap" form on the count
axis (`len(batch) + 1 > max_groups_in_batch`) so the parameter name
means what it says. The size check has used this form since the file
was first written; the count check is now consistent. Bump the
column-fetch call site from `max_groups_in_batch=5` to `6` to
preserve the historical 6-groups-per-batch density. Verified locally
that `(10000, 6)` under the new semantics produces byte-for-byte
identical batches to `(10000, 5)` under the old strict-`>` semantics
on both uniform and skewed inputs.
3. Regression tests: a snowflake-level test that pins the caller's
"every prefix gets a query" contract via mocked SHOW VIEWS responses,
and a parametrized library-level invariant test asserting that the
union of names across all groups equals the input, every name shares
its group's prefix, and no batch exceeds `max_groups_in_batch`.
The library invariant would have caught the original bug on its own.
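The claim that `(10000, 6)` under the new check matches `(10000, 5)` under the old one follows from the two comparisons being the same predicate shifted by one: `len(batch) > N` holds exactly when `len(batch) + 1 > N + 1`. A rough sketch of that equivalence, using a hypothetical simplified packer over bare group sizes (not the real `_batch_prefix_groups`) and made-up uniform and skewed inputs:

```python
import random
from typing import List


def pack(
    sizes: List[int], max_batch_size: int, max_groups: int, strict: bool
) -> List[List[int]]:
    """Greedy packer over group sizes; `strict` selects the old count check."""
    batches: List[List[int]] = []
    batch: List[int] = []
    total = 0
    for n in sizes:
        if strict:
            over_count = len(batch) > max_groups  # old: allows max_groups + 1
        else:
            over_count = len(batch) + 1 > max_groups  # new: allows max_groups
        if batch and (total + n > max_batch_size or over_count):
            batches.append(batch)
            batch, total = [], 0
        batch.append(n)
        total += n
    if batch:
        batches.append(batch)
    return batches


rng = random.Random(0)
uniform = [100] * 50
skewed = [rng.choice([1, 5, 55, 681, 3000]) for _ in range(200)]

for sizes in (uniform, skewed):
    old = pack(sizes, 10000, 5, strict=True)
    new = pack(sizes, 10000, 6, strict=False)
    assert old == new  # same batches, group for group

# Under the new check, a cap of 1 really means one group per batch.
assert all(len(b) == 1 for b in pack(skewed, 10000, 1, strict=False))
```

The inputs here are illustrative; the PR's own verification used its real packer and parameters.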
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Connector Tests Results: All connector tests passed.
Summary
Fixes a silent failure in the Snowflake source where view definitions
(and therefore view lineage) were not ingested for ~25% of views in
schemas with > 1000 views, even when those views matched the user's
`table_pattern`/`view_pattern` filter and `include_view_definitions: true` was set.
Observed: a schema with 1,122 views produced the report
line "Could not fetch definitions for 287 views in
SOME_DB.FOO_V after processing all batches." The dataset URN was
created, schema/columns were populated, but the
`viewProperties` aspect (containing the DDL) and view-lineage edges were silently missing.
Root cause
`_maybe_populate_empty_view_definitions` in `snowflake_schema.py` called
`build_prefix_batches(..., max_groups_in_batch=1)` and then took `batch[0]`
from each returned batch. A comment claimed this was safe
("max_groups_in_batch=1 makes it safe to access batch[0]"), but the
underlying `_batch_prefix_groups` packer used a strict `>` check on the
group-count axis, so `max_groups_in_batch=N` actually packed up to N+1
groups per batch. The second group of every paired batch was silently
dropped: no `SHOW VIEWS LIKE '<prefix>%'` query ran for those prefixes,
so the text/DDL column was never fetched.
Concrete trace (the production case): a top-level group split by
first letter yielded `A(681)`, `B(55)`, `C(55)`. The greedy packer
appended `B` to the batch already holding `A`, because `len(batch) = 1`
is not `> 1`, and closed that batch only when `C` arrived. Result:
`batches = [[A, B], [C]]`. The caller takes `batch[0]` of each, so only
`A` and `C` get queries; everything under `B` is dropped. Repeating
across the alphabet produced the observed 10-batch SHOW VIEWS sequence
(A, C, E, G, K, M, O, R, T, V) with D/F/H/L/N/P/S/U missing, and
287 / 1122 unfetched view definitions.
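To make the caller-side effect concrete, here is a small sketch of the two comprehension shapes applied to the traced batches. The `show_views_query` helper and the use of bare prefix strings in place of `PrefixGroup` objects are assumptions made for illustration; only the `SHOW VIEWS LIKE '<prefix>%'` shape comes from the description.

```python
from typing import List

# Batches as the old packer returned them for the traced input
# (prefix letters stand in for PrefixGroup objects).
batches: List[List[str]] = [["A", "B"], ["C"]]


def show_views_query(prefix: str) -> str:
    # Hypothetical helper: query shape from the description, scope clause omitted.
    return f"SHOW VIEWS LIKE '{prefix}%'"


# Old caller shape: one query per batch, assuming each batch holds one group.
old_queries = [show_views_query(batch[0]) for batch in batches]

# Flattened caller shape: one query per group, however the batches are packed.
new_queries = [show_views_query(group) for batch in batches for group in batch]

print(old_queries)  # no query for prefix B
print(new_queries)  # A, B and C each get a query
```

The flattened form does not depend on the packer honoring a cap of one, which is why it fixes the symptom even without the library change.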
The fix
Three concerns addressed in one commit:
1. Caller fix: flatten the comprehension in
   `_maybe_populate_empty_view_definitions` to iterate every `PrefixGroup`
   instead of slicing `batch[0]`. This alone fixes the production bug.
2. Library fix: change `_batch_prefix_groups` to mirror the
   size-axis check's `(current state) + (incoming delta) > cap` form. The
   parameter name now means what it says: `N` packs at most N groups per batch.
   The column-fetch call site bumps `max_groups_in_batch` from `5` to `6`
   to preserve the historical 6-groups-per-batch density (one SELECT against
   `information_schema.columns` packs that many `LIKE` prefixes). Verified
   locally that `(10000, 6)` under the new semantics produces byte-for-byte
   identical batches to `(10000, 5)` under the old strict-`>` semantics on
   both uniform and skewed inputs.
3. Regression tests:
   - A snowflake-level test that pins the caller's "every prefix gets a
     query" contract via mocked SHOW VIEWS responses. Constructs an
     `A=990, B=55, C=55` workload that exercises the bug-triggering
     pairing under the production parameters; verified to fail
     without the caller-side fix.
   - A parametrized library-level invariant test asserting that the
     union of names across all groups equals the input, every name
     shares its group's declared prefix, and no batch exceeds
     `max_groups_in_batch`. Covers production parameters from both
     Snowflake call sites plus pathological cases. Would have caught
     the original bug on its own.
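The three library invariants can be expressed as one small checker. This is a sketch under stated assumptions: groups are modeled as hypothetical `(prefix, names)` tuples rather than the real `PrefixGroup` type, prefix matching is plain case-sensitive `str.startswith`, and `invariant_violations` is an illustrative name, not a function from the PR.

```python
from typing import List, Tuple

# (declared prefix, names in the group): a hypothetical stand-in for PrefixGroup
Group = Tuple[str, List[str]]


def invariant_violations(
    names: List[str], batches: List[List[Group]], max_groups_in_batch: int
) -> List[str]:
    """Return a description of every invariant the batches break."""
    problems: List[str] = []
    seen = [n for batch in batches for _, members in batch for n in members]
    if sorted(seen) != sorted(names):
        problems.append("union of group names differs from input")
    for batch in batches:
        if len(batch) > max_groups_in_batch:
            problems.append(
                f"batch holds {len(batch)} groups, cap is {max_groups_in_batch}"
            )
        for prefix, members in batch:
            for name in members:
                if not name.startswith(prefix):
                    problems.append(f"{name} does not share prefix {prefix}")
    return problems


names = ["A1", "A2", "B1", "C1"]
paired = [[("A", ["A1", "A2"]), ("B", ["B1"])], [("C", ["C1"])]]
one_per_batch = [[("A", ["A1", "A2"])], [("B", ["B1"])], [("C", ["C1"])]]

print(invariant_violations(names, paired, 1))         # cap violation reported
print(invariant_violations(names, one_per_batch, 1))  # []
```

A check of this shape flags the paired batch directly, without needing a mocked Snowflake connection.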
Test plan
the caller-side fix and pass after.
`test_build_prefix_batches_exceeds_max_batch_size` still passes under the new semantics.
`./gradlew :metadata-ingestion:lintFix` clean.
parametrizations + 7 pre-existing snowflake schema + 1 new
snowflake regression).
Checklist
observed symptom.
bug fix to existing behavior, no surface change.
metadata than before; no schema or config API changes.
🤖 Generated with Claude Code