close
Skip to content

Schema parsing and resolved schema rework#536

Draft
KitFieldhouse wants to merge 3 commits intoapache:mainfrom
KitFieldhouse:currentFeature
Draft

Schema parsing and resolved schema rework#536
KitFieldhouse wants to merge 3 commits intoapache:mainfrom
KitFieldhouse:currentFeature

Conversation

@KitFieldhouse
Copy link
Copy Markdown

Hey you all, this PR reworks schema parsing and ResolvedSchema.

I apologize for the size of this PR. I tried making it smaller, but I kept finding the changes made to parsing/ResovledSchema would trickle into many other sections of the crate.

This PR addresses issues #531, #353, and #508.

Here is the gist of what this PR does:

The new stuff

Introduces SchemaWithSymbols.

Instead of representing a schema as a single recursive Schema variant, SchemaWithSymbols represents a schema as a hash map of all defined named schemata and a hash set of all referenced named schemata from that schema. This, coupled with the root schema enum, is enough to represent the entire schema parse tree.

The upshot of this representation is that when we go to check if a set of schemata form a complete context we don't need to recurse into each schema and can just compare the "symbol tables" directly.

In addition, SchemaWithSymbols is essentially just made up of Arc's and HashMaps of Arc's. This should make cloning fairly quick compared to Schema.

Introduces ResolvedContext.

ResovledContext wraps a list of schema definitions that have been checked for compatibility (no double naming of named schemata, all named references have a definition). This is closest to what the original ResolvedSchema looked like.

Completely reworks ResolvedSchema.

As can be seen by the diff, ResolvedSchema has been essentially replaced in its entirety. ResolvedSchema now wraps a schema and all of the definitions needed to uniquely resolve all named schemata. In addition, we can convert ResolvedSchema into a ResolvedNode.

Introduces ResolvedNode.

The struct ResolvedNode allows other code to walk down a schema tree without having to resolve Schema::Ref variants themselves. Further, since a ResolvedNode can only be created from a ResolvedSchema, we can make the type guarantee that this resolution will always find a unique named schema definition. ResolvedNode also makes the promise that any default for a record field has been checked against the field's schema.

Introduces a Resolver trait.

This trait is designed to be used to hook up this crate's schema resolution logic to an external schema registry. Struct's implementing this trait can be provided to ResolvedContext either directly or through ResolvedSchema. If a local definition for a schema can not be found, the Resolver is called asking for it to either provide that schema or to signal an error.

API changes

This PR introduces several breaking API changes, both in the public API and to the private API. I will try to summarize the changes for both layers.

Semantic changes -- parsing

  • Parser now always generates a SchemaWithSymbols. To keep compatibility with existing Schema based methods, the original schema parsing methods like Schema::parse_str will use the parser to generate a SchemaWithSymbols which is then unraveled into a normal Schema enum. This adds some overhead, but this is already a "slow" path so I don't think this introduces performance concerns.

  • Schema::parse_list no longer checks if every named schema has a unique definition somewhere in the list. This is due to the conceptual shift mentioned in issue schema parsing/resolution improvements? #353. In this PR, parser methods do nothing more than parse each schema string and check that each one is internally consistent. Checking that a list of schemata is consistent (no duplicate named definitions, all references resolved) is pushed to ResolvedSchema.

API changes -- public facing

  • ResolvedSchema completely reworked. Any code that relied on the old ResolvedSchema would have to be refactored (though I think the needed changes are rather minimal). The new ResolvedSchema wraps a single schema and its resolved context, and is created via ResolvedSchema::builder(). See the new builder API: .additional() to provide
    extra schemata, then .build_one(), .build_array(), or .build_list() as the terminal method depending on how many schemas you need resolved. Variants with _with_resolver accept a custom Resolver implementation for schema registry lookups.
  • Elimination of Schema::parse_str_with_list. This method parsed a schema string alongside a list of other schema strings and returned the parsed schema plus the resolved list. The same result can now be achieved by parsing each schema independently with SchemaWithSymbols::parse_str and then resolving them together via ResolvedSchema::builder().additional(others)?.build_one(target)?.
  • Elimination of Schema::denormalize. Denormalization (inlining named type definitions into references) is now handled via ResolvedSchema::unravel() or SchemaWithSymbols::unravel().
  • Elimination of Schema::independent_canonical_form. This is now accessed through CompleteSchema::independent_canonical_form() which requires a ResolvedSchema, making the resolved-context dependency explicit.
  • resolve_names and resolve_names_with_schemata (were pub(crate)) are removed. Their role is subsumed by ResolvedContext.
  • ResolvedOwnedSchema is removed. ResolvedSchema now owns its data (via Arc), making the separate owned variant unnecessary.
  • Schema::name() now returns Option<&Arc<Name>> instead of Option<&Name>.
  • Value::validate_schemata(schemata) replaced by Value::validate_against_resolved(resolved_schema).
  • find_schema_with_known_schemata is removed. Finding a matching element in a union requires that you have a ResolvedSchema. The method that performs union element matching is now structural_match_on_schema defined on ResolvedUnion.

API changes -- private facing

  • decode_internal signature changed from (schema: &Schema, names: &HashMap<Name, S>, enclosing_namespace: NamespaceRef, reader: &mut R) to
    (node: ResolvedNode, reader: &mut R). The ResolvedNode carries the schema, its resolved definitions, and handles namespace tracking internally
  • encode_internal similarly changed from (value: &Value, schema: &Schema, names: &HashMap<Name, S>, enclosing_namespace: NamespaceRef, writer: &mut W)
    to (value: &Value, node: ResolvedNode, writer: &mut W).
  • New functions decode_complete and encode_complete are added as the preferred entry points for decoding/encoding with a ResolvedSchema.
  • Block now takes a ResolvedSchema instead of Names.
  • validate_internal and resolve_internal on Value changed from generic <S: Borrow<Schema> + Debug> with (schema, names, namespace) parameters to taking a ResolvedNode directly.
  • The generic S: Borrow<Schema> pattern used throughout decode, encode, and types for referencing named schemas is entirely replaced by ResolvedNode's transparent reference resolution.
  • Elimination of Schema::parse_with_names (was pub(crate)). Internal callers now use SchemaWithSymbols and ResolvedSchema instead of passing Names maps through parsing.

@KitFieldhouse KitFieldhouse marked this pull request as draft April 20, 2026 22:23
@martin-g
Copy link
Copy Markdown
Member

There are many conflicts!
You need to rebase!

@martin-g
Copy link
Copy Markdown
Member

Also it would be good to address #368 if such big changes are going to be made.

@KitFieldhouse
Copy link
Copy Markdown
Author

There are many conflicts! You need to rebase!

Yeah sorry, my bad, I will get that done.

Also it would be good to address #368 if such big changes are going to be made.

I'll take a look!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants