[Improve][OrcRead] parse config on file level rather than field level #10759
liunaijie wants to merge 1 commit into apache:dev from
Conversation
Pull request overview
This PR optimizes ORC file reading by resolving the configured character encoding once per file read (instead of per field), reducing repeated config parsing and charset lookup overhead during row/column decoding.
Changes:
- Move `FileBaseSourceOptions.ENCODING` resolution to the `read(...)` method (file scope) and reuse the `Charset` during row decoding.
- Thread the resolved `Charset` through `readColumn` into BYTES/STRUCT/UNION decoding paths.
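A minimal sketch of the file-scope pattern the PR adopts (class and method names below are illustrative, not the connector's actual code): resolve the `Charset` once when the file read starts, then pass it down to every field decode instead of re-parsing the config per field.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FileLevelCharsetSketch {
    // Stand-in for looking up FileBaseSourceOptions.ENCODING (assumed default UTF-8).
    static Charset resolveEncoding(String configured) {
        return configured == null ? StandardCharsets.UTF_8 : Charset.forName(configured);
    }

    // Field decode reuses the already-resolved Charset instead of re-parsing config.
    static String decodeStringField(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        Charset charset = resolveEncoding("UTF-8"); // once per read(...) call
        byte[] raw = "héllo".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeStringField(raw, charset));
    }
}
```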
Comments suppressed due to low confidence (1)
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java:459
- In `readBytesVal`, ORC `BINARY` values are converted to `String` (via `bytesVectorToString`) and then back to `byte[]` using `getBytes(charset)`. This can corrupt arbitrary binary payloads and adds avoidable allocations. For `TypeDescription.Category.BINARY`, it should return the raw bytes slice from `BytesColumnVector` (copy-of-range) without any charset-based decoding/encoding.
```java
BytesColumnVector bytesVector = (BytesColumnVector) colVec;
bytesObj = this.bytesVectorToString(bytesVector, rowNum, charset);
if (typeDescription.getCategory() == TypeDescription.Category.BINARY
        && bytesObj != null) {
    bytesObj = ((String) bytesObj).getBytes(charset);
}
```
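A hedged sketch of the fix the comment suggests: for `BINARY`, copy the raw slice straight out of the vector with no charset involved. The `vector`/`start`/`length` arrays below mirror ORC's `BytesColumnVector` layout, but this is a standalone illustration, not the connector's code.

```java
import java.util.Arrays;

public class BinarySliceSketch {
    // Copy-of-range with no charset involved, so arbitrary bytes survive intact.
    static byte[] readBinary(byte[][] vector, int[] start, int[] length, int rowNum) {
        return Arrays.copyOfRange(vector[rowNum], start[rowNum], start[rowNum] + length[rowNum]);
    }

    public static void main(String[] args) {
        // 0xFF is not valid UTF-8, so a String round-trip could corrupt it.
        byte[][] vector = { new byte[] {0x00, (byte) 0xFF, 0x7F} };
        byte[] out = readBinary(vector, new int[] {0}, new int[] {3}, 0);
        System.out.println(Arrays.toString(out)); // raw bytes preserved
    }
}
```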
```diff
@@ -392,7 +404,7 @@ private Object readColumn(
         columnObj = readMapVal(colVec, colType, rowNum);
         break;
     case UNION:
-        columnObj = readUnionVal(colVec, colType, rowNum);
+        columnObj = readUnionVal(colVec, colType, rowNum, charset);
```
`readColumn` now receives the file-level charset, but it isn't propagated into the LIST/MAP paths. As a result, nested string values (e.g., `LIST<STRING>` / `MAP<STRING,...>`) are still decoded with the platform default charset (see `readBytesListVector` using `new String(vecCopy)`), ignoring the configured `FileBaseSourceOptions.ENCODING`. Consider threading charset into `readListVal` / `readMapVal` (and the bytes-to-string helpers they call) so all string decoding consistently uses the same file-level charset.
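A sketch of the suggested threading, with illustrative names (the actual helpers are `readListVal` / `readBytesListVector`): pass the file-level `Charset` down so nested string elements never fall back to the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class NestedDecodeSketch {
    // Was effectively: new String(vecCopy)  -> platform-default charset.
    // Now: the caller's file-level charset is used for every element.
    static List<String> decodeBytesList(byte[][] elements, Charset charset) {
        List<String> out = new ArrayList<>();
        for (byte[] element : elements) {
            out.add(new String(element, charset));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[][] elements = {
            "α".getBytes(StandardCharsets.UTF_8),
            "β".getBytes(StandardCharsets.UTF_8)
        };
        System.out.println(decodeBytesList(elements, StandardCharsets.UTF_8));
    }
}
```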
DanielLeens
left a comment
I pulled the current HEAD locally and checked the ORC read path end to end.
Runtime chain:
```
BaseFileSourceReader.pollNext()
  -> OrcReadStrategy.read(...)        // resolve encoding once per file
  -> readColumn(...)
  -> readBytesVal / readStructVal / readUnionVal
```
What this PR really changes:
- it moves `encoding` resolution from the per-field BYTES path into the file-level `read(...)` scope
- the same resolved `Charset` is then threaded through nested STRUCT/UNION/BYTES decoding
- row conversion semantics stay the same; this is a hot-path de-duplication, not a behavior change
Conclusion: I do not see a code blocker in the current diff.
Remaining merge gate: the GitHub Build check is still pending.
Purpose of this pull request
Currently the charset is extracted from the config at field level.
With 20 rows of data, each containing 30 string-type fields, the extraction runs 20 × 30 = 600 times.
This change extracts it once at file level.
This can significantly improve performance.
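A toy model of the saving described above (it only counts resolutions; it is not the connector code): per-field resolution for 20 rows × 30 string fields performs 600 lookups, while file-level resolution performs one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LookupCountModel {
    static int lookups = 0;

    // Stand-in for parsing the config and resolving the charset.
    static Charset resolve() {
        lookups++;
        return StandardCharsets.UTF_8;
    }

    static int perField(int rows, int cols) {
        lookups = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                resolve(); // old behavior: once per field
            }
        }
        return lookups;
    }

    static int perFile() {
        lookups = 0;
        resolve(); // new behavior: once per file
        return lookups;
    }

    public static void main(String[] args) {
        System.out.println("per-field lookups: " + perField(20, 30)); // 600
        System.out.println("per-file lookups:  " + perFile());        // 1
    }
}
```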
Does this PR introduce any user-facing change?
How was this patch tested?
Check list
New License Guide
incompatible-changes.md to describe the incompatibility caused by this PR.