GH-463: Add more types - time, nano timestamps, UUID to Variant spec by aihuaxu · Pull Request #464 · apache/parquet-format

aihuaxu · 2024-10-29T21:35:18Z

Rationale for this change

Iceberg tables support time, nanosecond timestamp, and UUID types, but the Variant spec does not currently include them. This PR proposes adding the missing types.

What changes are included in this PR?

The spec change.

Fix #463

aihuaxu · 2024-10-29T21:36:13Z

cc @gene-db, @rdblue.

wgtmac · 2024-10-30T01:41:08Z

 | Binary               | binary                      | `15`    | BINARY                      | 4 byte little-endian size, followed by bytes                                                                        |
 | String               | string                      | `16`    | STRING                      | 4 byte little-endian size, followed by UTF-8 encoded bytes                                                          |
+| TimeNTZ              | time without time zone      | `21`    | TIME(false, MICROS)          | 8-byte little-endian                                                                                                 |
+| Timestamp_ns         | timestamp                   | `22`    | TIMESTAMP(true, NANOS)       | 8-byte little-endian                                                                                                 |


I know that Iceberg has added timestamp_ns in the V3 spec. From the Parquet side, we'd better be more generic. Should we consider a parameterized timestamp type, like what Trino does for the Variant type: https://trino.io/docs/current/language/types.html#timestamp-p?

Thanks for reviewing. I will change the names to align with the Parquet annotations. The only issue is that we need separate type IDs for ltz/ntz and for micro/nanoseconds, so it would be cleaner to list them as separate entries rather than grouping them under one parameterized type.
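To make the "separate entries" point concrete, here is a minimal sketch of a lookup keyed by type ID. The IDs and annotations are copied from the diff rows quoted in this thread and may not match the merged spec; `TemporalType`, `VARIANT_TEMPORAL_TYPES`, and `describe` are hypothetical names used only for illustration.

```python
from typing import NamedTuple


class TemporalType(NamedTuple):
    kind: str              # "timestamp" or "time"
    adjusted_to_utc: bool  # Parquet's isAdjustedToUTC flag
    unit: str              # "MICROS" or "NANOS"


# One entry per type ID, as quoted in this thread's diff rows (assumption:
# these IDs reflect the PR at review time, not necessarily the final spec).
VARIANT_TEMPORAL_TYPES = {
    12: TemporalType("timestamp", True, "MICROS"),
    13: TemporalType("timestamp", False, "MICROS"),
    21: TemporalType("time", False, "MICROS"),
    22: TemporalType("timestamp", True, "NANOS"),
    23: TemporalType("timestamp", False, "NANOS"),
}


def describe(type_id: int) -> str:
    """Render the Parquet-style annotation for a temporal type ID."""
    t = VARIANT_TEMPORAL_TYPES[type_id]
    flag = str(t.adjusted_to_utc).lower()
    return f"{t.kind.upper()}(isAdjustedToUTC={flag}, {t.unit})"
```

With one ID per combination, a reader can dispatch on the ID alone, without parsing any extra parameter from the value.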

RussellSpitzer · 2024-11-01T16:51:24Z

 | Date                 | date                        | `11`    | DATE                        | 4 byte little-endian                                                                                                |
-| Timestamp            | timestamp                   | `12`    | TIMESTAMP(true, MICROS)     | 8-byte little-endian                                                                                                |
-| TimestampNTZ         | timestamp without time zone | `13`    | TIMESTAMP(false, MICROS)    | 8-byte little-endian                                                                                                |
+| Timestamp            | timestamp with time zone    | `12`    | TIMESTAMP(isAdjustedToUTC=true, MICROS)     | 8-byte little-endian                                                                                                |


I assume some of these changes weren't intended; these lines shouldn't be changed.

Ah, never mind, I see what's going on here.

Actually, maybe I don't. "timestamp" isn't a physical type in Parquet, correct?

There is no date or timestamp physical type. I'm wondering whether this column should mean type category. @gene-db?

I wonder if we should just eliminate the first column (and annotate logical type with additional information like exact numeric).

emkornfield · 2024-11-23T00:08:57Z

-| Timestamp            | timestamp                   | `12`    | TIMESTAMP(true, MICROS)     | 8-byte little-endian                                                                                                |
-| TimestampNTZ         | timestamp without time zone | `13`    | TIMESTAMP(false, MICROS)    | 8-byte little-endian                                                                                                |
+| Timestamp            | timestamp with time zone    | `12`    | TIMESTAMP(isAdjustedToUTC=true, MICROS)     | 8-byte little-endian                                                                                                |
+| Timestamp            | timestamp without time zone | `13`    | TIMESTAMP(isAdjustedToUTC=false, MICROS)    | 8-byte little-endian                                                                                                |


While we are correcting this, it seems like timestamp without time zone should be the logical type, and the physical type is int64?

I see this duplicates the question @RussellSpitzer asked above.

The question is answered in the paragraph below, although "behaves the same" is ambiguous. I'm not sure whether the intent here is that Parquet has flexibility to change the physical type. We should make sure we clarify this.

Thanks. Just saw the explanation below for logical type.
I think it means: for a string, engines can choose to encode it differently, but when it is read back, the values should be the same.

I updated the logical types for the newly added types. Since engines may choose different precisions to encode, they have the same logical type according to the paragraph below.

| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian |
| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |
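One possible reading of "same logical type, different precision" is that a microsecond value and a nanosecond value denoting the same instant should compare equal once normalized. The sketch below illustrates only that interpretation; the thread itself notes the wording is ambiguous, and `to_nanos` and `same_instant` are hypothetical helpers, not part of any spec or library.

```python
NANOS_PER_MICRO = 1_000


def to_nanos(value: int, unit: str) -> int:
    """Normalize a timestamp count to nanoseconds since the epoch."""
    if unit == "NANOS":
        return value
    if unit == "MICROS":
        return value * NANOS_PER_MICRO
    raise ValueError(f"unsupported unit: {unit}")


def same_instant(a: int, a_unit: str, b: int, b_unit: str) -> bool:
    """True when both values denote the same instant, whatever the precision."""
    return to_nanos(a, a_unit) == to_nanos(b, b_unit)
```

Under this reading, a value stored with type ID `12` (micros) and one stored with type ID `22` (nanos) can be equal even though their encoded integers differ by a factor of 1000.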

emkornfield · 2024-11-23T00:10:05Z

 | Binary               | binary                      | `15`    | BINARY                      | 4 byte little-endian size, followed by bytes                                                                        |
 | String               | string                      | `16`    | STRING                      | 4 byte little-endian size, followed by UTF-8 encoded bytes                                                          |
+| Time                 | time without time zone      | `21`    | TIME(isAdjustedToUTC=false, MICROS)          | 8-byte little-endian                                                                                                 |
+| Timestamp            | timestamp with time zone   | `22`    | TIMESTAMP(isAdjustedToUTC=true, NANOS)       | 8-byte little-endian                                                                                                 |


I think the logical type should indicate precision here (and above for the other timestamps).

Actually, reading the paragraph below, I guess we don't need precision but do need ntz.

emkornfield · 2024-11-23T00:11:02Z

+| Time                 | time without time zone      | `21`    | TIME(isAdjustedToUTC=false, MICROS)          | 8-byte little-endian                                                                                                 |
+| Timestamp            | timestamp with time zone   | `22`    | TIMESTAMP(isAdjustedToUTC=true, NANOS)       | 8-byte little-endian                                                                                                 |
+| Timestamp            | timestamp without time zone | `23`    | TIMESTAMP(isAdjustedToUTC=false, NANOS)      | 8-byte little-endian                                                                                                 |
+| UUID                 | uuid                        | `24`    | UUID                         | 16 bytes                                                                                                             |


We should probably note the encoding order there (even though it is duplicative).

Added 16-byte big-endian.
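As a rough illustration of the two byte orders mentioned in the table, the sketch below packs only the value payloads: an 8-byte little-endian integer for timestamps and 16 big-endian bytes for a UUID. It deliberately omits any Variant header byte, and the function names are hypothetical; only `struct` and `uuid` from the Python standard library are used.

```python
import struct
import uuid


def encode_timestamp(value: int) -> bytes:
    """Pack a signed 64-bit count (micros or nanos) as 8 little-endian bytes."""
    return struct.pack("<q", value)


def decode_timestamp(payload: bytes) -> int:
    """Inverse of encode_timestamp."""
    (value,) = struct.unpack("<q", payload)
    return value


def encode_uuid(u: uuid.UUID) -> bytes:
    """Return the 16-byte big-endian form (uuid.UUID.bytes is big-endian)."""
    return u.bytes


def decode_uuid(payload: bytes) -> uuid.UUID:
    """Inverse of encode_uuid."""
    return uuid.UUID(bytes=payload)
```

For a UUID, the big-endian payload reads in the same order as the hex digits of its string form, which is why spelling out the byte order in the table removes any doubt for implementers.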

emkornfield

I think once we iron out wording on timestamp we can merge.

emkornfield · 2024-12-04T06:27:36Z

This LGTM. @RussellSpitzer, any more comments?

Also, CC @gene-db @rdblue in case there are any concerns.

emkornfield · 2024-12-04T06:29:19Z

Will merge end of week if there aren't more comments.

RussellSpitzer

Looks good to me @emkornfield.

emkornfield · 2024-12-08T22:43:34Z

@aihuaxu sorry, two small suggestions to avoid overloading "Logical Type", which is already a separate concept in Parquet.

aihuaxu · 2024-12-09T00:20:42Z

> @aihuaxu sorry, two small suggestions to avoid overloading "Logical Type", which is already a separate concept in Parquet.

Yeah. Totally agree that we should avoid using logical type. I was thinking of using "type category" or "type group" before to avoid overload. But "Type Equivalence Class" also works for me.

Co-authored-by: emkornfield <emkornfield@gmail.com>

emkornfield · 2024-12-10T22:48:21Z

Going to merge. Thanks @aihuaxu

wgtmac reviewed Oct 30, 2024

aihuaxu requested review from Fokko and wgtmac November 1, 2024 16:41

RussellSpitzer reviewed Nov 1, 2024

aihuaxu requested a review from RussellSpitzer November 6, 2024 18:08

emkornfield reviewed Nov 23, 2024

emkornfield requested changes Nov 23, 2024

aihuaxu requested a review from emkornfield November 26, 2024 05:10

aihuaxu added 3 commits November 25, 2024 21:11

Add more types - time, nano timestamps, UUID to Variant.

b2ab479

Update type names to align with Parquet logical type

c6fc1eb

Update logical type

de25a28

aihuaxu force-pushed the aixu-add-more-variant-types branch from 45420e7 to de25a28 Compare November 26, 2024 05:14

emkornfield approved these changes Dec 4, 2024

RussellSpitzer approved these changes Dec 5, 2024

aihuaxu mentioned this pull request Dec 8, 2024

Spec: add variant type apache/iceberg#10831

Merged

emkornfield reviewed Dec 8, 2024

Comment thread VariantEncoding.md Outdated

emkornfield reviewed Dec 8, 2024

Comment thread VariantEncoding.md Outdated

aihuaxu and others added 2 commits December 8, 2024 16:20

Update VariantEncoding.md

e40e3f4

Co-authored-by: emkornfield <emkornfield@gmail.com>

Update VariantEncoding.md

c0c78fa

Co-authored-by: emkornfield <emkornfield@gmail.com>

emkornfield merged commit a3dda6a into apache:master Dec 10, 2024

| Timestamp_ns | timestamp | `22` | TIMESTAMP(true, NANOS) | 8-byte little-endian |
| Timestamp | timestamp | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |
