Spark: Fix type mismatch in SPJ with bucket partition key on string column #16424

Open
ammarchalifah wants to merge 1 commit into apache:main from ammarchalifah:fix/spj-bucket-type-mismatch

@ammarchalifah

Problem

When a table is partitioned by bucket(N, string_column), the bucket transform produces an Integer partition value. During Storage Partitioned Joins (SPJ), Spark reads partition values through StructInternalRow, which calls struct.get(ordinal, CharSequence.class) in getUTF8StringInternal(). This assumes the value is always a CharSequence, so the class check fails for the Integer bucket value and throws an IllegalArgumentException:

IllegalArgumentException: Wrong class, expected java.lang.CharSequence, but was java.lang.Integer, for object: 1

This affects any SPJ query (e.g. MERGE INTO or JOIN) on tables partitioned with bucket(N, string_column).

Fix

getUTF8StringInternal() now uses struct.get(ordinal, Object.class) instead of struct.get(ordinal, CharSequence.class), and then calls value.toString() on the result. This follows the same pattern already used by getBinaryInternal() in the same class, which uses Object.class to handle multiple possible runtime types.

The fix is applied to all Spark versions: 3.4, 3.5, 4.0, and 4.1.
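
For illustration, here is a minimal, self-contained sketch of the getter change. It is not the actual StructInternalRow code: the Struct interface, structOf helper, and readStringBefore/readStringAfter methods are hypothetical stand-ins, and plain String is used in place of Spark's UTF8String so the example runs without Spark or Iceberg on the classpath. The error text only imitates the message quoted in the Problem section.

```java
public class BucketValueSketch {

  /** Hypothetical stand-in for a struct whose get() verifies the requested class. */
  interface Struct {
    <T> T get(int pos, Class<T> javaClass);
  }

  /** Array-backed struct that rejects values of an unexpected class. */
  static Struct structOf(Object... values) {
    return new Struct() {
      @Override
      public <T> T get(int pos, Class<T> javaClass) {
        Object value = values[pos];
        if (value == null || javaClass.isInstance(value)) {
          return javaClass.cast(value);
        }
        throw new IllegalArgumentException(
            String.format(
                "Wrong class, expected %s, but was %s, for object: %s",
                javaClass.getName(), value.getClass().getName(), value));
      }
    };
  }

  /** Old behavior: assumes the stored value is always a CharSequence. */
  static String readStringBefore(Struct struct, int ordinal) {
    return struct.get(ordinal, CharSequence.class).toString();
  }

  /** New behavior: accept any runtime type, then convert with toString(). */
  static String readStringAfter(Struct struct, int ordinal) {
    // Assumes the caller has already handled nulls, as a row getter normally would.
    Object value = struct.get(ordinal, Object.class);
    return value.toString();
  }

  public static void main(String[] args) {
    // Position 0: identity partition value (a string).
    // Position 1: bucket partition value (an Integer bucket number).
    Struct partition = structOf("us-east", 1);

    System.out.println(readStringAfter(partition, 0)); // prints "us-east"
    System.out.println(readStringAfter(partition, 1)); // prints "1"

    try {
      readStringBefore(partition, 1);
    } catch (IllegalArgumentException e) {
      // The CharSequence check rejects the Integer bucket value.
      System.out.println(e.getMessage());
    }
  }
}
```

Requesting Object.class sidesteps the class check entirely, so string values still round-trip unchanged while Integer bucket values are rendered as their decimal text.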

Testing

  • Added testJoinsWithBucketingOnStringColumn using the existing checkJoin helper to cover bucket-only partitioning on string columns.
  • Added testJoinsWithIdentityAndBucketOnStringColumn as a targeted regression test for the exact scenario from the issue: identity + bucket partitioning on a string column with an SPJ join.

Both tests are added consistently across all four Spark versions.

Notes

AI tools were used to assist with drafting this change. I have reviewed and validated the logic, tests, and code style end-to-end.

Closes #15349

github-actions bot added the spark label May 19, 2026
@ammarchalifah
Author

This PR is a re-implementation of this closed PR: #15555

I am the reporter who filed the original bug report, and I really need this bug to be fixed.

Development

Successfully merging this pull request may close these issues.

SPJ with Bucket Partition Key: Error: Wrong class, expected java.lang.CharSequence, but was java.lang.Integer