
[Improve][OrcRead] parse config on file level rather than field level#10759

Open
liunaijie wants to merge 1 commit into apache:dev from liunaijie:improve/orc_read

Conversation

@liunaijie
Member

Purpose of this pull request

Currently, the charset is extracted from the config at the field level.
If I have 20 rows of data and each row has 30 string-type fields, the extraction needs to run 20*30 = 600 times.
This change moves the extraction to the file level, so it runs only once per file.
This can significantly improve performance.
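The idea can be sketched as a minimal, self-contained illustration (the `resolveCharset` helper and the loop below are hypothetical stand-ins for the connector's actual read path, not the PR's code):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetResolutionSketch {

    // Hypothetical stand-in for looking up FileBaseSourceOptions.ENCODING.
    static Charset resolveCharset(String configured) {
        return configured == null ? StandardCharsets.UTF_8 : Charset.forName(configured);
    }

    public static void main(String[] args) {
        // Before: the lookup ran once per string field (rows * columns times).
        // After: resolve once at file scope, then reuse for every field.
        Charset charset = resolveCharset("UTF-8"); // once per read(...) call

        String[][] rows = {{"a", "b"}, {"c", "d"}};
        for (String[] row : rows) {
            for (String field : row) {
                byte[] bytes = field.getBytes(charset); // reuses the resolved charset
            }
        }
        System.out.println(charset.name()); // UTF-8
    }
}
```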

Does this PR introduce any user-facing change?

How was this patch tested?

Check list

Copilot AI review requested due to automatic review settings April 13, 2026 14:23
Contributor

Copilot AI left a comment


Pull request overview

This PR optimizes ORC file reading by resolving the configured character encoding once per file read (instead of per field), reducing repeated config parsing and charset lookup overhead during row/column decoding.

Changes:

  • Move FileBaseSourceOptions.ENCODING resolution to the read(...) method (file scope) and reuse the Charset during row decoding.
  • Thread the resolved Charset through readColumn into BYTES/STRUCT/UNION decoding paths.
Comments suppressed due to low confidence (1)

seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java:459

  • In readBytesVal, ORC BINARY values are converted to String (via bytesVectorToString) and then back to byte[] using getBytes(charset). This can corrupt arbitrary binary payloads and adds avoidable allocations. For TypeDescription.Category.BINARY, it should return the raw bytes slice from BytesColumnVector (copy-of-range) without any charset-based decoding/encoding.
            BytesColumnVector bytesVector = (BytesColumnVector) colVec;
            bytesObj = this.bytesVectorToString(bytesVector, rowNum, charset);
            if (typeDescription.getCategory() == TypeDescription.Category.BINARY
                    && bytesObj != null) {
                bytesObj = ((String) bytesObj).getBytes(charset);
            }
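Copilot's suggested fix could look roughly like the following standalone sketch. The parallel arrays here are simplified stand-ins for BytesColumnVector's vector/start/length fields, not the connector's actual code; the point is that a BINARY payload survives a raw copy but not a String round trip:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class BinaryReadSketch {

    // For BINARY columns: copy the raw byte slice, with no charset round trip.
    static byte[] rawBinarySlice(byte[][] vector, int[] start, int[] length, int rowNum) {
        return Arrays.copyOfRange(
                vector[rowNum], start[rowNum], start[rowNum] + length[rowNum]);
    }

    public static void main(String[] args) {
        // A payload that is not valid UTF-8.
        byte[] payload = {(byte) 0xC3, (byte) 0x28, (byte) 0xFF};
        byte[][] vector = {payload};

        byte[] safe = rawBinarySlice(vector, new int[] {0}, new int[] {3}, 0);

        // The String round trip replaces invalid sequences with U+FFFD,
        // so the original bytes are lost.
        byte[] corrupted =
                new String(payload, StandardCharsets.UTF_8).getBytes(StandardCharsets.UTF_8);

        System.out.println(Arrays.equals(safe, payload));      // true
        System.out.println(Arrays.equals(corrupted, payload)); // false
    }
}
```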


Comment on lines 399 to +407
@@ -392,7 +404,7 @@ private Object readColumn(
columnObj = readMapVal(colVec, colType, rowNum);
break;
case UNION:
columnObj = readUnionVal(colVec, colType, rowNum);
columnObj = readUnionVal(colVec, colType, rowNum, charset);

Copilot AI Apr 13, 2026


readColumn now receives the file-level charset, but it isn’t propagated into the LIST/MAP paths. As a result, nested string values (e.g., LIST<STRING> / MAP<STRING,...>) still get decoded using platform default charset (see readBytesListVector using new String(vecCopy)), ignoring the configured FileBaseSourceOptions.ENCODING. Consider threading charset into readListVal / readMapVal (and the bytes-to-string helpers they call) so all string decoding consistently uses the same file-level charset.
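The propagation Copilot describes can be illustrated with a small standalone sketch (`decodeStringList` is a hypothetical helper standing in for readListVal / readBytesListVector, not the connector's real method):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class NestedDecodeSketch {

    // Problematic pattern: new String(bytes) silently uses the platform
    // default charset, ignoring the configured encoding.
    static String decodeWithDefault(byte[] bytes) {
        return new String(bytes);
    }

    // Fixed pattern: thread the file-level charset into nested element decoding.
    static List<String> decodeStringList(List<byte[]> elements, Charset charset) {
        List<String> decoded = new ArrayList<>();
        for (byte[] element : elements) {
            decoded.add(new String(element, charset)); // explicit file-level charset
        }
        return decoded;
    }

    public static void main(String[] args) {
        List<byte[]> raw = List.of("héllo".getBytes(StandardCharsets.UTF_8));
        System.out.println(decodeStringList(raw, StandardCharsets.UTF_8).get(0)); // héllo
    }
}
```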


@DanielLeens left a comment


I pulled the current HEAD locally and checked the ORC read path end to end.

Runtime chain:

BaseFileSourceReader.pollNext()
  -> OrcReadStrategy.read(...)
    -> resolve encoding once per file
    -> readColumn(...)
      -> readBytesVal / readStructVal / readUnionVal

What this PR really changes:

  • it moves encoding resolution from the per-field BYTES path into the file-level read(...) scope
  • the same resolved Charset is then threaded through nested STRUCT/UNION/BYTES decoding
  • row conversion semantics stay the same; this is a hot-path de-duplication, not a behavior change

Conclusion: I do not see a code blocker in the current diff.
Remaining merge gate: the GitHub Build check is still pending.

Contributor

@davidzollo left a comment


+1 if CI passes.
Good job.



4 participants