PARQUET-2473: Clarify records can not be split across v2 pages or PageIndex#244
Conversation
| /** Number of NULL values, in this data page. | ||
| Number of non-null = num_values - num_nulls which is also the number of values in the data section **/ | ||
| 2: required i32 num_nulls | ||
| /** Number of rows in this data page. which means pages change on record boundaries (r = 0) **/ |
There was a problem hiding this comment.
I also propose expanding the r = 0 terminology to be explicitly repetition_level = 0 to match the conventions in the rest of the documentation
| * | ||
| * If a ColumnIndex is present, a page must begin at a record | ||
| * boundary (repetition_level = 0). Otherwise, pages may begin | ||
| * within a record (repetition_level > 0). |
There was a problem hiding this comment.
I wonder if we could instead state that records should not be split across pages, but that readers should tolerate this in the abscence of a page index or v2 data pages, as certain writers historically wrote such data?
There was a problem hiding this comment.
I agree with you that splitting pages will make the file less compatible, even though some writers historically did that.
However, I would like to ensure there is consensus on updating the spec to say that writers should do something before changing the spec
Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
|
Updated to clarify that records can't repeat when an |
| /** | ||
| * Number of values, including NULLs, in this data page. | ||
| * | ||
| * If a OffsetIndex is present, a page must begin at a record |
There was a problem hiding this comment.
Can we unify the row/record terminology? They seem to mean the same thing.
There was a problem hiding this comment.
I agree row/record seem to mean the same thing
I double checked and it appears the rest of the file is inconsistent in the terminology as well
For example the repetition level documentation refers to records
parquet-format/src/main/thrift/parquet.thrift
Lines 180 to 192 in 8d59c7d
but there are several fields that are named num_rows that clearly refer to rows.
parquet-format/src/main/thrift/parquet.thrift
Lines 623 to 631 in 8d59c7d
In this PR I followed the term used in the repetition level documentation as I think it is the most relevant.
I will start a conversation on the mailing list about using consistent terminology
There was a problem hiding this comment.
Mailing list conversation: https://lists.apache.org/thread/0gxjk4tphxls0gcrc7lt775pf8s7gtz3
Currently the consensus seems to be to use "row"
I will make a follow on PR to change that terminology after this PR is merged.
I would like to retain the "record" terminology here to remain consistent with the current wording in the repetition levels section
| /** | ||
| * Number of rows in this data page. Every page must begin at a | ||
| * record boundary (repetition_level = 0): records must **not** be | ||
| * split across page boundaries when using V2 data pages. |
|
I think this PR is now ready to go @wgtmac . Is there anything else we are waiting on? |
wgtmac
left a comment
There was a problem hiding this comment.
Sorry for missing this. I'll merge it now.
No worries -- thank you! I'll file a JIRA shortly to improve the spec to use row terminology |
|
Update: here is a PR to consistently use the "row" terminology: #256 |
…eIndex (apache#244) Co-authored-by: Ed Seidl <etseidl@users.noreply.github.com>
This was sparked by a mailing list discussion: https://lists.apache.org/thread/rd8twnvg4bg3558r507rzpxckcxt5wdn
Several implementors of the Parquet spec have been confused by this point
Notes
There seems to be clear consensus that record boundaries can't span pages in V2 pages or if there is a page index, so let's make that clear in the spec to avoid future confusion
There also seemed to be consensus that Row Groups must start on record boundaries, and that the existing spec language was clear on this point, so I did not propose any changes to that language
https://issues.apache.org/jira/browse/PARQUET-2473
Jira
Commits
Documentation
This PR is only a clarification