GH-463: Add more types - time, nano timestamps, UUID to Variant spec#464
Conversation
| | Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes | | ||
| | String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes | | ||
| | TimeNTZ | time without time zone | `21` | TIME(false, MICROS) | 8-byte little-endian | | ||
| | Timestamp_ns | timestamp | `22` | TIMESTAMP(true, NANOS) | 8-byte little-endian | |
There was a problem hiding this comment.
I know that iceberg has added timestamp_ns in the V3 spec. From the Parquet side, we'd better be more generic. Should we consider parameterized timestamp type like what Trino does for Variant type: https://trino.io/docs/current/language/types.html#timestamp-p?
There was a problem hiding this comment.
I very much agree here. This is also in line with the LogicalTypeAnnotation in parquet-java.
| | Timestamp_ns | timestamp | `22` | TIMESTAMP(true, NANOS) | 8-byte little-endian | | |
| | Timestamp | timestamp | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian | |
Since this hasn't been published yet, I would also propose:
- Group these with the other timestamp/time types.
- Remove the NTZ, as it is similar to the timeunit
There was a problem hiding this comment.
Thanks for reviewing. I will try to make the name change to align with Parquet annotation The only thing is that we need to have separate type ID for ltz/ntz and micro/nano seconds. so it would be cleaner to have them into separate entries rather than grouping them with parameter.
| | Date | date | `11` | DATE | 4 byte little-endian | | ||
| | Timestamp | timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte little-endian | | ||
| | TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian | | ||
| | Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian | |
There was a problem hiding this comment.
I assume some of these changes were weren't' intended, these lines shouldn't be changed
There was a problem hiding this comment.
AH nvm, I see what's going on here
There was a problem hiding this comment.
Actually maybe I don't, "timestamp" isn't a physical type in parquet correct?
There was a problem hiding this comment.
There is no Date or timestamp for physical type. I'm wondering if this column should mean type category. @gene-db ?
There was a problem hiding this comment.
I wonder if we should just eliminate the first column (and annotate logical type with additional information like exact numeric).
| | Timestamp | timestamp | `12` | TIMESTAMP(true, MICROS) | 8-byte little-endian | | ||
| | TimestampNTZ | timestamp without time zone | `13` | TIMESTAMP(false, MICROS) | 8-byte little-endian | | ||
| | Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian | | ||
| | Timestamp | timestamp without time zone | `13` | TIMESTAMP(isAdjustedToUTC=false, MICROS) | 8-byte little-endian | |
There was a problem hiding this comment.
While we are trying to correct this it seems like timetamp without time zone should be the logical type and the physical type is int64?
There was a problem hiding this comment.
I see this duplicates the the question @RussellSpitzer asked abve.
There was a problem hiding this comment.
The question is answered in the paragraph below. Although "behaves" the same is ambiguous. I'm not sure if the intent here Is if parquet has flexibility to change the physical type. We should make sure we clarify this
There was a problem hiding this comment.
Thanks. Just saw the explanation below for logical type.
I think it means: for a string, the engines can choose to encode differently but when it's read, they should be the same.
I updated the logical types for newly added types. Since engines may choose different precisions to encode, they have the same logical type from the paragraph below.
| Timestamp | timestamp with time zone | `12` | TIMESTAMP(isAdjustedToUTC=true, MICROS) | 8-byte little-endian |
| Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian |
| | Binary | binary | `15` | BINARY | 4 byte little-endian size, followed by bytes | | ||
| | String | string | `16` | STRING | 4 byte little-endian size, followed by UTF-8 encoded bytes | | ||
| | Time | time without time zone | `21` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian | | ||
| | Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian | |
There was a problem hiding this comment.
I think logical type should indicate precision here (and aboe for the other timestamps)
There was a problem hiding this comment.
Actually reading the paragraph below I guess we don't need precision but do need nts
| | Time | time without time zone | `21` | TIME(isAdjustedToUTC=false, MICROS) | 8-byte little-endian | | ||
| | Timestamp | timestamp with time zone | `22` | TIMESTAMP(isAdjustedToUTC=true, NANOS) | 8-byte little-endian | | ||
| | Timestamp | timestamp without time zone | `23` | TIMESTAMP(isAdjustedToUTC=false, NANOS) | 8-byte little-endian | | ||
| | UUID | uuid | `24` | UUID | 16 bytes | |
There was a problem hiding this comment.
we should probably note there the encoding order (even though it is duplicative)
There was a problem hiding this comment.
Added 16-byte big-endian.
emkornfield
left a comment
There was a problem hiding this comment.
I think once we iron out wording on timestamp we can merge.
45420e7 to
de25a28
Compare
|
This LGTM, @RussellSpitzer any more comments. |
|
Will merge end of week if there aren't more comments. |
RussellSpitzer
left a comment
There was a problem hiding this comment.
Looks good to me @emkornfield.
|
@aihuaxu sorry two small suggestions to avoid overloading "Logical Type" which is already a separate concept in Parquet. |
Yeah. Totally agree that we should avoid using logical type. I was thinking of using "type category" or "type group" before to avoid overload. But "Type Equivalence Class" also works for me. |
Co-authored-by: emkornfield <emkornfield@gmail.com>
Co-authored-by: emkornfield <emkornfield@gmail.com>
|
Going to merge. Thanks @aihuaxu |
Rationale for this change
Iceberg tables support time, nano timestamps, UUID types and currently Variant spec doesn't include those. Propose to add those missing types.
What changes are included in this PR?
The spec change.
Fix #463