Package io.delta.kernel.engine
Interface ParquetHandler
- All Known Implementing Classes:
DefaultParquetHandler
Provides Parquet file related functionalities to Delta Kernel. Connectors can leverage this
interface to provide their own custom implementation of Parquet data file functionalities to
Delta Kernel.
- Since:
- 3.0.0
-
Method Summary
Modifier and TypeMethodDescriptionreadParquetFiles
(CloseableIterator<FileStatus> fileIter, StructType physicalSchema, Optional<Predicate> predicate) Read the Parquet format files at the given locations and return the data as aColumnarBatch
with the columns requested byphysicalSchema
.void
writeParquetFileAtomically
(String filePath, CloseableIterator<FilteredColumnarBatch> data) Write the given data as a Parquet file.writeParquetFiles
(String directoryPath, CloseableIterator<FilteredColumnarBatch> dataIter, List<Column> statsColumns) Write the given data batches to a Parquet files.
-
Method Details
-
readParquetFiles
CloseableIterator<ColumnarBatch> readParquetFiles(CloseableIterator<FileStatus> fileIter, StructType physicalSchema, Optional<Predicate> predicate) throws IOException Read the Parquet format files at the given locations and return the data as aColumnarBatch
with the columns requested byphysicalSchema
.If
physicalSchema
has aStructField
with column nameStructField.METADATA_ROW_INDEX_COLUMN_NAME
and the field is a metadata columnStructField.isMetadataColumn()
the column must be populated with the file row index.How does a column in
physicalSchema
match to the column in the Parquet file? If theStructField
has a field id in themetadata
with key `parquet.field.id` the column is attempted to match by id. If the column is not found by id, the column is matched by name. When trying to find the column in Parquet by name, first case-sensitive match is used. If not found then a case-insensitive match is attempted.- Parameters:
fileIter
- Iterator of files to read data from.physicalSchema
- Select list of columns to read from the Parquet file.predicate
- Optional predicate which the Parquet reader can optionally use to prune rows that don't satisfy the predicate. Because pruning is optional and may be incomplete, caller is still responsible apply the predicate on the data returned by this method.- Returns:
- an iterator of
ColumnarBatch
s containing the data in columnar format. It is the responsibility of the caller to close the iterator. The data returned is in the same as the order of files given inscanFileIter
. - Throws:
IOException
- if an I/O error occurs during the read.
-
writeParquetFiles
CloseableIterator<DataFileStatus> writeParquetFiles(String directoryPath, CloseableIterator<FilteredColumnarBatch> dataIter, List<Column> statsColumns) throws IOException Write the given data batches to a Parquet files. Try to keep the Parquet file size to given size. If the current file exceeds this size close the current file and start writing to a new file.- Parameters:
directoryPath
- Location where the data files should be written.dataIter
- Iterator of data batches to write. It is the responsibility of the calle to close the iterator.statsColumns
- List of columns to collect statistics for. The statistics collection is optional. If the implementation does not support statistics collection, it is ok to return no statistics.- Returns:
- an iterator of
DataFileStatus
containing the status of the written files. Each status contains the file path and the optionally collected statistics for the file It is the responsibility of the caller to close the iterator. - Throws:
IOException
- if an I/O error occurs during the file writing. This may leave some files already written in the directory. It is the responsibility of the caller to clean up.- Since:
- 3.2.0
-
writeParquetFileAtomically
void writeParquetFileAtomically(String filePath, CloseableIterator<FilteredColumnarBatch> data) throws IOException Write the given data as a Parquet file. This call either succeeds in creating the file with given contents or no file is created at all. It won't leave behind a partially written file.- Parameters:
filePath
- Fully qualified destination file pathdata
- Iterator ofFilteredColumnarBatch
- Throws:
FileAlreadyExistsException
- if the file already exists andoverwrite
is false.IOException
- if any other I/O error occurs.
-