close
Skip to content

GH-463: Add more types - time, nano timestamps, UUID to Variant spec#464

Merged
emkornfield merged 5 commits into
apache:masterfrom
aihuaxu:aixu-add-more-variant-types
Dec 10, 2024
Merged

GH-463: Add more types - time, nano timestamps, UUID to Variant spec#464
emkornfield merged 5 commits into
apache:masterfrom
aihuaxu:aixu-add-more-variant-types

Conversation

@aihuaxu
Copy link
Copy Markdown
Contributor

@aihuaxu aihuaxu commented Oct 29, 2024

Rationale for this change

Iceberg tables support time, nano timestamps, UUID types and currently Variant spec doesn't include those. Propose to add those missing types.

What changes are included in this PR?

The spec change.

Fix #463

@aihuaxu
Copy link
Copy Markdown
Contributor Author

aihuaxu commented Oct 29, 2024

cc @gene-db, @rdblue.

Comment thread VariantEncoding.md Outdated
| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
| TimeNTZ | time without time zone | `21` | TIME(false, MICROS) | 8-byte little-endian |
| Timestamp_ns | timestamp | `22` | TIMESTAMP(true, NANOS) | 8-byte little-endian |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know that iceberg has added timestamp_ns in the V3 spec. From the Parquet side, we'd better be more generic. Should we consider parameterized timestamp type like what Trino does for Variant type: https://trino.io/docs/current/language/types.html#timestamp-p?

Copy link
Copy Markdown
Contributor

@Fokko Fokko Oct 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I very much agree here. This is also in line with the LogicalTypeAnnotation in parquet-java.

Suggested change
| Timestamp_ns | timestamp | `22` | TIMESTAMP(true, NANOS) | 8-byte little-endian |
| Timestamp | timestamp | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |

Since this hasn't been published yet, I would also propose:

  • Group these with the other timestamp/time types.
  • Remove the NTZ, as it is similar to the timeunit

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reviewing. I will try to make the name change to align with Parquet annotation The only thing is that we need to have separate type ID for ltz/ntz and micro/nano seconds. so it would be cleaner to have them into separate entries rather than grouping them with parameter.

@aihuaxu aihuaxu requested review from Fokko and wgtmac November 1, 2024 16:41
Comment thread VariantEncoding.md Outdated
| Date | date | `11` | DATE | 4 byte little-endian |
| Timestamp | timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte little-endian |
| TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian |
| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian |
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume some of these changes were weren't' intended, these lines shouldn't be changed

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AH nvm, I see what's going on here

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually maybe I don't, "timestamp" isn't a physical type in parquet correct?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no Date or timestamp for physical type. I'm wondering if this column should mean type category. @gene-db ?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should just eliminate the first column (and annotate logical type with additional information like exact numeric).

Comment thread VariantEncoding.md Outdated
| Timestamp | timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte little-endian |
| TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian |
| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian |
| Timestamp | timestamp without time zone | `13` | TIMESTAMP(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we are trying to correct this it seems like timetamp without time zone should be the logical type and the physical type is int64?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this duplicates the the question @RussellSpitzer asked abve.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is answered in the paragraph below. Although "behaves" the same is ambiguous. I'm not sure if the intent here Is if parquet has flexibility to change the physical type. We should make sure we clarify this

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks. Just saw the explanation below for logical type.
I think it means: for a string, the engines can choose to encode differently but when it's read, they should be the same.

I updated the logical types for newly added types. Since engines may choose different precisions to encode, they have the same logical type from the paragraph below.

| Timestamp            | timestamp with time zone    | `12`    | TIMESTAMP(isAdjustedToUTC=true, MICROS)     | 8-byte little-endian                                                                        |
| Timestamp            | timestamp with time zone   | `22`    | TIMESTAMP(isAdjustedToUTC=true, NANOS)       | 8-byte little-endian                                                                        |

Comment thread VariantEncoding.md Outdated
| Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes |
| String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes |
| Time | time without time zone | `21` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think logical type should indicate precision here (and aboe for the other timestamps)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually reading the paragraph below I guess we don't need precision but do need nts

Comment thread VariantEncoding.md Outdated
| Time | time without time zone | `21` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian |
| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |
| Timestamp | timestamp without time zone | `23` | TIMESTAMP(isAdjustedToUTC=false, NANOS) | 8-byte little-endian |
| UUID | uuid | `24` | UUID | 16 bytes |
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably note there the encoding order (even though it is duplicative)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added 16-byte big-endian.

Copy link
Copy Markdown
Contributor

@emkornfield emkornfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think once we iron out wording on timestamp we can merge.

@aihuaxu aihuaxu requested a review from emkornfield November 26, 2024 05:10
@aihuaxu aihuaxu force-pushed the aixu-add-more-variant-types branch from 45420e7 to de25a28 Compare November 26, 2024 05:14
@emkornfield
Copy link
Copy Markdown
Contributor

This LGTM, @RussellSpitzer any more comments.

Also, CC @gene-db @rdblue in case there are any concerns.

@emkornfield
Copy link
Copy Markdown
Contributor

Will merge end of week if there aren't more comments.

Copy link
Copy Markdown
Member

@RussellSpitzer RussellSpitzer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me @emkornfield.

Comment thread VariantEncoding.md Outdated
Comment thread VariantEncoding.md Outdated
@emkornfield
Copy link
Copy Markdown
Contributor

@aihuaxu sorry two small suggestions to avoid overloading "Logical Type" which is already a separate concept in Parquet.

@aihuaxu
Copy link
Copy Markdown
Contributor Author

aihuaxu commented Dec 9, 2024

@aihuaxu sorry two small suggestions to avoid overloading "Logical Type" which is already a separate concept in Parquet.

Yeah. Totally agree that we should avoid using logical type. I was thinking of using "type category" or "type group" before to avoid overload. But "Type Equivalence Class" also works for me.

aihuaxu and others added 2 commits December 8, 2024 16:20
Co-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: emkornfield <emkornfield@gmail.com>
@emkornfield
Copy link
Copy Markdown
Contributor

Going to merge. Thanks @aihuaxu

@emkornfield emkornfield merged commit a3dda6a into apache:master Dec 10, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add more types - time, nano timestamps, UUID to Variant spec

5 participants