Arrow: Fix int96 timestamp offset in arrow dictionary decode by sungwy · Pull Request #16435 · apache/iceberg

sungwy · 2026-05-20T01:32:26Z

This PR fixes the packed dictionary INT96 timestamp decode path in the VectorizedParquetDefinitionLevelReader to write using byte offsets rather than row indexes.

The unit test constructs a dictionary with known values and verifies that the decoded Arrow buffer contains the expected timestamps. It also decodes multiple rows to confirm that no value is written at the wrong offset and corrupts its neighbors.

The fix is the same as the change proposed earlier in #13486, and mirrors the offset handling used by other readers in the same class:

iceberg/arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java

Line 302 in 8e7ab3c

.setLong((long) idx * typeWidth, dict.decodeToLong(reader.readInteger()));
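To illustrate why the multiplication matters, here is a minimal sketch that uses `java.nio.ByteBuffer` as a stand-in for the Arrow data buffer. Both `ByteBuffer.putLong(int, long)` and Arrow's `ArrowBuf.setLong(long, long)` take a byte position, not an element index. The class name `OffsetSketch`, the `decode` helper, and the sample dictionary values are hypothetical and are not taken from the PR; `TYPE_WIDTH = 8` assumes an 8-byte timestamp slot.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.Arrays;

public class OffsetSketch {
  // Assumption: each decoded timestamp occupies one 8-byte slot.
  static final int TYPE_WIDTH = 8;

  // Decodes dictionary ids into a byte buffer, then reads the slots back.
  // useByteOffsets = false reproduces the bug (row index used as byte position).
  static long[] decode(long[] dict, int[] ids, boolean useByteOffsets) {
    ByteBuffer buf =
        ByteBuffer.allocate(ids.length * TYPE_WIDTH).order(ByteOrder.LITTLE_ENDIAN);
    for (int idx = 0; idx < ids.length; idx++) {
      int position = useByteOffsets ? idx * TYPE_WIDTH : idx;
      buf.putLong(position, dict[ids[idx]]); // absolute write at a byte position
    }
    long[] out = new long[ids.length];
    for (int row = 0; row < ids.length; row++) {
      out[row] = buf.getLong(row * TYPE_WIDTH); // readers always use byte offsets
    }
    return out;
  }

  public static void main(String[] args) {
    long[] dict = {1000L, 2000L, 3000L};
    int[] ids = {0, 1, 2};
    System.out.println("row index as position: " + Arrays.toString(decode(dict, ids, false)));
    System.out.println("byte offset:           " + Arrays.toString(decode(dict, ids, true)));
  }
}
```

With the row index as the position, consecutive writes land one byte apart and overlap each other, so every slot after the first write is clobbered or left empty. The real reader casts to `long` before multiplying, which the `int`-indexed `ByteBuffer` in this sketch cannot show.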

Disclosure:

AI-assisted analysis and test drafting were used while investigating this issue. The final code and test were reviewed, edited, and validated manually. Refer: https://iceberg.apache.org/contribute/#guidelines-for-ai-assisted-contributions

Copilot

Pull request overview

Fixes incorrect INT96 timestamp decoding for Parquet dictionary-encoded packed data in Iceberg’s Arrow vectorized reader by writing to Arrow buffers using byte offsets (instead of value indices), preventing buffer corruption and wrong timestamps.

Changes:

- Correct `TimestampInt96Reader.nextDictEncodedVal` (PACKED mode) to use `(long) idx * typeWidth` when calling `setLong`.
- Add a unit test that reproduces the corruption scenario and verifies correct decoded timestamps across multiple rows.
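For context on what is written into each slot, the following sketch converts one Parquet INT96 timestamp into a single 64-bit microsecond value. It relies on the conventional INT96 layout (8 little-endian bytes of nanoseconds within the day, followed by a 4-byte little-endian Julian day number) and on Julian day 2440588 being 1970-01-01. The class `Int96Sketch` and its helpers are hypothetical and are not the reader's actual code; that the reader stores microseconds in an 8-byte slot is an assumption here.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Int96Sketch {
  static final long UNIX_EPOCH_JULIAN_DAY = 2_440_588L; // Julian day of 1970-01-01
  static final long MICROS_PER_DAY = 86_400_000_000L;

  // Converts a 12-byte INT96 timestamp to microseconds since the Unix epoch.
  static long int96ToMicros(byte[] int96) {
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong(); // first 8 bytes: nanoseconds within the day
    int julianDay = buf.getInt();    // last 4 bytes: Julian day number
    return (julianDay - UNIX_EPOCH_JULIAN_DAY) * MICROS_PER_DAY + nanosOfDay / 1_000L;
  }

  // Test helper (hypothetical): builds the 12-byte INT96 representation.
  static byte[] encode(int julianDay, long nanosOfDay) {
    return ByteBuffer.allocate(12)
        .order(ByteOrder.LITTLE_ENDIAN)
        .putLong(nanosOfDay)
        .putInt(julianDay)
        .array();
  }

  public static void main(String[] args) {
    // One day and one millisecond after the epoch.
    System.out.println(int96ToMicros(encode(2_440_589, 1_000_000L)));
  }
}
```

Because a 12-byte input collapses to one 8-byte result per row under these assumptions, the write position for row `idx` has to advance by the slot width rather than by one.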

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| `arrow/src/main/java/org/apache/iceberg/arrow/vectorized/parquet/VectorizedParquetDefinitionLevelReader.java` | Fixes packed dictionary INT96 decode to write to Arrow buffers using byte offsets, matching other readers and preventing data corruption. |
| `arrow/src/test/java/org/apache/iceberg/arrow/vectorized/parquet/TestVectorizedParquetDefinitionLevelReader.java` | Adds a regression test ensuring packed dictionary INT96 decoding writes correct per-row values without overwriting adjacent entries. |

fix int96 timestamp offset in arrow dictionary decode

eba9847

Copilot AI review requested due to automatic review settings May 20, 2026 01:32

Copilot started reviewing on behalf of sungwy May 20, 2026 01:32

sungwy requested a review from nastra May 20, 2026 01:32

github-actions Bot added the arrow label May 20, 2026

sungwy changed the title ~~Fix int96 timestamp offset in arrow dictionary decode~~ Arrow: Fix int96 timestamp offset in arrow dictionary decode May 20, 2026

Copilot AI reviewed May 20, 2026

sungwy requested review from Fokko, kevinjqliu and rdblue May 20, 2026 15:11

sungwy wants to merge 1 commit into apache:main from sungwy:int96-timestamp-offset-fix

2 participants
