[Improve][OrcRead] parse config on file level rather than field level #10759
liunaijie wants to merge 1 commit into apache:dev from
Conversation
Pull request overview
This PR optimizes ORC file reading by resolving the configured character encoding once per file read (instead of per field), reducing repeated config parsing and charset lookup overhead during row/column decoding.
Changes:
- Move `FileBaseSourceOptions.ENCODING` resolution to the `read(...)` method (file scope) and reuse the `Charset` during row decoding.
- Thread the resolved `Charset` through `readColumn` into BYTES/STRUCT/UNION decoding paths.
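A minimal sketch of the file-scope pattern the PR adopts (class and method names below are illustrative, not the connector's actual code): resolve the `Charset` once when the file read starts, then pass it down to every field decode instead of re-parsing the config per field.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class FileLevelCharsetSketch {
    // Stand-in for looking up FileBaseSourceOptions.ENCODING (assumed default UTF-8).
    static Charset resolveEncoding(String configured) {
        return configured == null ? StandardCharsets.UTF_8 : Charset.forName(configured);
    }

    // Field decode reuses the already-resolved Charset instead of re-parsing config.
    static String decodeStringField(byte[] raw, Charset charset) {
        return new String(raw, charset);
    }

    public static void main(String[] args) {
        Charset charset = resolveEncoding("UTF-8"); // once per read(...) call
        byte[] raw = "héllo".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeStringField(raw, charset));
    }
}
```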
Comments suppressed due to low confidence (1)
seatunnel-connectors-v2/connector-file/connector-file-base/src/main/java/org/apache/seatunnel/connectors/seatunnel/file/source/reader/OrcReadStrategy.java:459
- In `readBytesVal`, ORC `BINARY` values are converted to `String` (via `bytesVectorToString`) and then back to `byte[]` using `getBytes(charset)`. This can corrupt arbitrary binary payloads and adds avoidable allocations. For `TypeDescription.Category.BINARY`, it should return the raw bytes slice from `BytesColumnVector` (copy-of-range) without any charset-based decoding/encoding.
```java
BytesColumnVector bytesVector = (BytesColumnVector) colVec;
bytesObj = this.bytesVectorToString(bytesVector, rowNum, charset);
if (typeDescription.getCategory() == TypeDescription.Category.BINARY
        && bytesObj != null) {
    bytesObj = ((String) bytesObj).getBytes(charset);
}
```
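A hedged sketch of the fix the comment suggests: for `BINARY`, copy the raw slice straight out of the vector with no charset involved. The `vector`/`start`/`length` arrays below mirror ORC's `BytesColumnVector` layout, but this is a standalone illustration, not the connector's code.

```java
import java.util.Arrays;

public class BinarySliceSketch {
    // Copy-of-range with no charset involved, so arbitrary bytes survive intact.
    static byte[] readBinary(byte[][] vector, int[] start, int[] length, int rowNum) {
        return Arrays.copyOfRange(vector[rowNum], start[rowNum], start[rowNum] + length[rowNum]);
    }

    public static void main(String[] args) {
        // 0xFF is not valid UTF-8, so a String round-trip could corrupt it.
        byte[][] vector = { new byte[] {0x00, (byte) 0xFF, 0x7F} };
        byte[] out = readBinary(vector, new int[] {0}, new int[] {3}, 0);
        System.out.println(Arrays.toString(out)); // raw bytes preserved
    }
}
```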
```diff
@@ -392,7 +404,7 @@ private Object readColumn(
         columnObj = readMapVal(colVec, colType, rowNum);
         break;
     case UNION:
-        columnObj = readUnionVal(colVec, colType, rowNum);
+        columnObj = readUnionVal(colVec, colType, rowNum, charset);
```
`readColumn` now receives the file-level charset, but it isn't propagated into the LIST/MAP paths. As a result, nested string values (e.g., `LIST<STRING>` / `MAP<STRING,...>`) are still decoded with the platform default charset (see `readBytesListVector` using `new String(vecCopy)`), ignoring the configured `FileBaseSourceOptions.ENCODING`. Consider threading charset into `readListVal` / `readMapVal` (and the bytes-to-string helpers they call) so all string decoding consistently uses the same file-level charset.
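A sketch of the suggested threading, with illustrative names (the actual helpers are `readListVal` / `readBytesListVector`): pass the file-level `Charset` down so nested string elements never fall back to the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class NestedDecodeSketch {
    // Was effectively: new String(vecCopy)  -> platform-default charset.
    // Now: the caller's file-level charset is used for every element.
    static List<String> decodeBytesList(byte[][] elements, Charset charset) {
        List<String> out = new ArrayList<>();
        for (byte[] element : elements) {
            out.add(new String(element, charset));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[][] elements = {
            "α".getBytes(StandardCharsets.UTF_8),
            "β".getBytes(StandardCharsets.UTF_8)
        };
        System.out.println(decodeBytesList(elements, StandardCharsets.UTF_8));
    }
}
```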
DanielLeens
left a comment
I pulled the current HEAD locally and checked the ORC read path end to end.
Runtime chain:
```
BaseFileSourceReader.pollNext()
  -> OrcReadStrategy.read(...)        // resolve encoding once per file
  -> readColumn(...)
  -> readBytesVal / readStructVal / readUnionVal
```
What this PR really changes:
- it moves `encoding` resolution from the per-field BYTES path into the file-level `read(...)` scope
- the same resolved `Charset` is then threaded through nested STRUCT/UNION/BYTES decoding
- row conversion semantics stay the same; this is a hot-path de-duplication, not a behavior change
Conclusion: I do not see a code blocker in the current diff.
Remaining merge gate: the GitHub Build check is still pending.
Purpose of this pull request
Currently the charset is extracted from the config at field level.
With 20 rows of data, each containing 30 string-type fields, the extraction runs 20 × 30 = 600 times.
This change extracts it once at file level.
This can significantly improve performance.
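A toy model of the saving described above (it only counts resolutions; it is not the connector code): per-field resolution for 20 rows × 30 string fields performs 600 lookups, while file-level resolution performs one.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class LookupCountModel {
    static int lookups = 0;

    // Stand-in for parsing the config and resolving the charset.
    static Charset resolve() {
        lookups++;
        return StandardCharsets.UTF_8;
    }

    static int perField(int rows, int cols) {
        lookups = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < cols; c++) {
                resolve(); // old behavior: once per field
            }
        }
        return lookups;
    }

    static int perFile() {
        lookups = 0;
        resolve(); // new behavior: once per file
        return lookups;
    }

    public static void main(String[] args) {
        System.out.println("per-field lookups: " + perField(20, 30)); // 600
        System.out.println("per-file lookups:  " + perFile());        // 1
    }
}
```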
Does this PR introduce any user-facing change?
How was this patch tested?
Check list
New License Guide
incompatible-changes.md to describe the incompatibility caused by this PR.