feat: add unenforced_clustering_key to format spec #6552

Open
beinan wants to merge 2 commits into lance-format:main from beinan:feat/unenforced-clustering-key

@beinan beinan commented Apr 17, 2026

Summary

  • Add unenforced_clustering_key metadata to the Lance schema format, mirroring the existing unenforced_primary_key pattern
  • Clustering keys hint at the physical ordering of data within a table, enabling query engine optimizations such as storage-partitioned joins (SPJ)
  • Unlike primary keys, clustering key fields may be nullable
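The nullability difference in the last bullet can be made concrete with a minimal Python sketch. The `Field` class, its attribute names, and `validate` are hypothetical stand-ins written for illustration, not the Lance API. The only rule encoded is the one implied by the summary: primary key fields are expected to be non-nullable, while clustering key fields may be nullable.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema field; not the Lance API."""
    name: str
    nullable: bool
    primary_key_position: Optional[int] = None     # None: not in the primary key
    clustering_key_position: Optional[int] = None  # None: not in the clustering key


def validate(fields: List[Field]) -> List[str]:
    """Return one message per field that breaks the nullability rule."""
    errors = []
    for f in fields:
        # Only primary key membership constrains nullability;
        # clustering key fields are deliberately left unchecked.
        if f.primary_key_position is not None and f.nullable:
            errors.append(f"primary key field '{f.name}' must not be nullable")
    return errors


fields = [
    Field("id", nullable=False, primary_key_position=1),
    Field("region", nullable=True, clustering_key_position=1),
    Field("event_date", nullable=True, clustering_key_position=2),
]
print(validate(fields))
```

Because the key is unenforced, a check like this would be advisory metadata validation at most; nothing in the summary suggests the storage layer rejects rows based on it.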

Changes across all layers:

  • Protobuf: unenforced_clustering_key (bool) + unenforced_clustering_key_position (uint32) fields 14-15
  • Rust core: field struct, constants, Arrow metadata parsing, schema method
  • Protobuf serialization: round-trip support with backward compat
  • Java JNI + LanceField: constructor args and getters
  • Python bindings + type stubs: is_unenforced_clustering_key() / unenforced_clustering_key_position()
  • Format docs: clustering key metadata section
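The "round-trip support with backward compat" item can be illustrated with a small Python sketch. It assumes proto3 semantics, where unset scalar fields are omitted on the wire and read back as zero values. Plain dictionaries stand in for serialized `Field` messages, and `DEFAULTS`, `encode`, and `decode` are hypothetical helpers, not part of Lance.

```python
from typing import Any, Dict

# Zero values that a reader sees when a writer never set the new fields
# (assumption: proto3-style defaults for bool and uint32).
DEFAULTS: Dict[str, Any] = {
    "unenforced_clustering_key": False,
    "unenforced_clustering_key_position": 0,
}


def encode(field: Dict[str, Any]) -> Dict[str, Any]:
    """Drop entries equal to their default, as proto3 does for scalars."""
    return {k: v for k, v in field.items()
            if k not in DEFAULTS or v != DEFAULTS[k]}


def decode(message: Dict[str, Any]) -> Dict[str, Any]:
    """Fill in defaults for anything the writer did not emit."""
    return {**DEFAULTS, **message}


# A message written before the new fields existed decodes as "not clustered".
old_message = {"name": "region"}
print(decode(old_message))

# A field that is part of the clustering key survives a round trip unchanged.
new_field = {
    "name": "region",
    "unenforced_clustering_key": True,
    "unenforced_clustering_key_position": 1,
}
print(decode(encode(new_field)) == new_field)
```

The point of the sketch is that older files need no migration: a reader that knows the new fields simply sees their zero values.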

Motivation

This was discussed in the lance-spark SPJ PR (lance-format/lance-spark#445). Rather than using custom table properties, embedding clustering key info in the schema metadata follows the established pattern and avoids migration issues.

Test plan

  • cargo check -p lance-core -p lance-file passes
  • cargo test -p lance-core -p lance-file passes (all tests including existing primary key tests)
  • cargo clippy -p lance-core -p lance-file --tests -- -D warnings clean

🤖 Generated with Claude Code

github-actions bot added the enhancement, python, and java labels on Apr 17, 2026
Comment thread docs/src/format/table/schema.md Outdated
### Clustering Key Metadata

Clustering key configuration is handled by two protobuf fields in the Field message:
- **unenforced_clustering_key** (bool): Whether this field is part of the clustering key
Contributor

For the unenforced primary key, we initially introduced the boolean and later moved to a position, because a position makes the key fields ordered. For clustering, I think we can go with the position directly, without the boolean.

Contributor Author

Thanks Jack! Good call - updated to drop the boolean and use only unenforced_clustering_key_position as the single source of truth.
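A minimal Python sketch of why a single position field is enough: membership can be derived from the position, so a flag and a position can never disagree. The `FieldMeta` class is a hypothetical stand-in, and treating `0` as "not part of the clustering key" is an assumption based on `0` being the proto3 default for an unset `uint32`; the PR text does not spell out the encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldMeta:
    """Hypothetical stand-in for per-field metadata; not the Lance API."""
    name: str
    # Assumption: 0 means "not in the clustering key"; 1, 2, ... give the order.
    unenforced_clustering_key_position: int = 0

    def is_unenforced_clustering_key(self) -> bool:
        # Derived from the position, so there is no separate flag to keep in sync.
        return self.unenforced_clustering_key_position > 0


region = FieldMeta("region", unenforced_clustering_key_position=2)
payload = FieldMeta("payload")
print(region.is_unenforced_clustering_key(), payload.is_unenforced_clustering_key())
```

With a separate boolean, states such as "flag set, position 0" or "flag unset, position 2" would need a documented interpretation; deriving the flag removes those cases.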

codecov bot commented Apr 17, 2026

Codecov Report

❌ Patch coverage is 45.83333% with 13 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| rust/lance-core/src/datatypes/schema.rs | 0.00% | 8 Missing ⚠️ |
| rust/lance-core/src/datatypes/field.rs | 60.00% | 3 Missing and 1 partial ⚠️ |
| rust/lance-file/src/datatypes.rs | 83.33% | 1 Missing ⚠️ |

beinan force-pushed the feat/unenforced-clustering-key branch from 415fb4f to 1118bfc on April 17, 2026 01:28
Contributor

We should probably also add getUnenforcedPrimaryKey() and getUnenforcedClusteringKey() to LanceSchema.

Contributor

The same comment applies to Python.

beinan force-pushed the feat/unenforced-clustering-key branch from 1118bfc to ad6157f on April 17, 2026 07:40
beinan commented Apr 17, 2026

Thanks Jack! Added getUnenforcedPrimaryKey() and getUnenforcedClusteringKey() to LanceSchema in both Java and Python, returning fields sorted by position.
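The schema-level getter described in this reply, returning key fields sorted by position, could look roughly like the following Python sketch. The `Field` tuple and the free function `unenforced_clustering_key` are hypothetical illustrations rather than the actual `LanceSchema` methods, and `0` is again assumed to mean "not part of the key".

```python
from typing import List, NamedTuple


class Field(NamedTuple):
    """Hypothetical stand-in for a schema field; not the Lance API."""
    name: str
    clustering_key_position: int = 0  # assumption: 0 means not part of the key


def unenforced_clustering_key(fields: List[Field]) -> List[Field]:
    """Return the clustering key fields ordered by their position."""
    keyed = [f for f in fields if f.clustering_key_position > 0]
    return sorted(keyed, key=lambda f: f.clustering_key_position)


# Schema order differs from key order; the getter returns key order.
schema = [Field("payload"), Field("event_date", 2), Field("region", 1)]
print([f.name for f in unenforced_clustering_key(schema)])  # ['region', 'event_date']
```

Returning the fields in key order, rather than schema order, is what lets a query engine compare the clustering of two tables column by column.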

beinan force-pushed the feat/unenforced-clustering-key branch from ad6157f to 3304e87 on April 18, 2026 06:21
Add clustering key metadata to the Lance schema, following the same
pattern as unenforced_primary_key. Clustering keys hint at the physical
ordering of data within a table, enabling query engine optimizations
such as storage-partitioned joins (SPJ).

Changes across all layers:
- Protobuf: two new fields (bool + uint32 position) in Field message
- Rust core: field struct, constants, Arrow metadata parsing
- Rust schema: ordered field collection method
- Protobuf serialization: round-trip support
- Java JNI + LanceField: constructor and getters
- Python bindings + type stubs: is/position methods
- Format docs: clustering key metadata section

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
beinan force-pushed the feat/unenforced-clustering-key branch from 3304e87 to 417ddb4 on April 18, 2026 06:33