Arrow: Fix int96 timestamp offset in arrow dictionary decode #16435

Open
sungwy wants to merge 1 commit into apache:main from sungwy:int96-timestamp-offset-fix

@sungwy (Contributor) commented May 20, 2026

Issue: #13485

This PR fixes the packed dictionary INT96 timestamp decode path in VectorizedParquetDefinitionLevelReader so that it writes to the Arrow buffer at byte offsets rather than at row indexes.

The unit test constructs a dictionary of known values, decodes multiple rows, and verifies that the decoded Arrow buffer contains the expected timestamp for every row. This confirms that no value is written at the wrong offset, where it would overwrite adjacent entries.

The fix is the same as the change proposed earlier in #13486, and mirrors the offset handling used by other readers in the same class:

.setLong((long) idx * typeWidth, dict.decodeToLong(reader.readInteger()));
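
To illustrate why the offset matters, here is a minimal, self-contained sketch that uses a plain `java.nio.ByteBuffer` as a stand-in for the Arrow data buffer. It is not the Iceberg reader code: the class name `ByteOffsetDemo`, its helper methods, and the sample values are hypothetical, and the 8-byte slot width is an assumption chosen to match a 64-bit timestamp value. `ByteBuffer` takes `int` indexes, so the `(long)` cast from the fix is omitted here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

// Hypothetical demo class; a ByteBuffer stands in for the Arrow data buffer.
final class ByteOffsetDemo {
  // Assumed slot width: one 64-bit timestamp value occupies 8 bytes.
  static final int TYPE_WIDTH = 8;

  // Corrected pattern: row idx is written at byte offset idx * TYPE_WIDTH.
  static long[] writeWithByteOffsets(long[] values) {
    ByteBuffer buf = newBuffer(values.length);
    for (int idx = 0; idx < values.length; idx++) {
      buf.putLong(idx * TYPE_WIDTH, values[idx]);
    }
    return readSlots(buf, values.length);
  }

  // Buggy pattern: the row index is used directly as the byte offset,
  // so consecutive 8-byte writes overlap and clobber each other.
  static long[] writeWithRowIndexes(long[] values) {
    ByteBuffer buf = newBuffer(values.length);
    for (int idx = 0; idx < values.length; idx++) {
      buf.putLong(idx, values[idx]);
    }
    return readSlots(buf, values.length);
  }

  // Allocates a zero-filled little-endian buffer with one slot per row.
  static ByteBuffer newBuffer(int rows) {
    return ByteBuffer.allocate(rows * TYPE_WIDTH).order(ByteOrder.LITTLE_ENDIAN);
  }

  // Reads the buffer back as a reader would: one long per TYPE_WIDTH-byte slot.
  static long[] readSlots(ByteBuffer buf, int rows) {
    long[] out = new long[rows];
    for (int i = 0; i < rows; i++) {
      out[i] = buf.getLong(i * TYPE_WIDTH);
    }
    return out;
  }

  public static void main(String[] args) {
    long[] values = {1_000L, 2_000L, 3_000L};
    System.out.println("byte offsets: " + Arrays.toString(writeWithByteOffsets(values)));
    System.out.println("row indexes:  " + Arrays.toString(writeWithRowIndexes(values)));
  }
}
```

With byte offsets, each row lands in its own slot and reads back unchanged. With row indexes, all three writes fall within the first ten bytes of the buffer, so the first slot holds a mix of overlapping values and the later slots are never populated.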

Disclosure:

Copilot AI review requested due to automatic review settings May 20, 2026 01:32
@sungwy requested a review from nastra May 20, 2026 01:32
@github-actions (bot) added the arrow label May 20, 2026
@sungwy changed the title from "Fix int96 timestamp offset in arrow dictionary decode" to "Arrow: Fix int96 timestamp offset in arrow dictionary decode" May 20, 2026
Contributor

Copilot AI left a comment

Pull request overview

Fixes incorrect INT96 timestamp decoding for Parquet dictionary-encoded packed data in Iceberg’s Arrow vectorized reader by writing to Arrow buffers using byte offsets (instead of value indices), preventing buffer corruption and wrong timestamps.

Changes:

  • Correct TimestampInt96Reader.nextDictEncodedVal (PACKED mode) to use (long) idx * typeWidth when calling setLong.
  • Add a unit test that reproduces the corruption scenario and verifies correct decoded timestamps across multiple rows.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java | Fixes packed dictionary INT96 decode to write to Arrow buffers using byte offsets, matching other readers and preventing data corruption. |
| arrow/src/test/java/org/apache/iceberg/arrow/vectorized/parquet/TestVectorizedParquetDefinitionLevelReader.java | Adds a regression test ensuring packed dictionary INT96 decoding writes correct per-row values without overwriting adjacent entries. |

@sungwy sungwy requested review from Fokko, kevinjqliu and rdblue May 20, 2026 15:11