@Evolving public interface ParquetHandler
| Modifier and Type | Method and Description |
|---|---|
| `CloseableIterator<ColumnarBatch>` | `readParquetFiles(CloseableIterator<FileStatus> fileIter, StructType physicalSchema, java.util.Optional<Predicate> predicate)` Read the Parquet format files at the given locations and return the data as a `ColumnarBatch` with the columns requested by `physicalSchema`. |
| `void` | `writeParquetFileAtomically(String filePath, CloseableIterator<FilteredColumnarBatch> data)` Write the given data as a Parquet file. |
| `CloseableIterator<DataFileStatus>` | `writeParquetFiles(String directoryPath, CloseableIterator<FilteredColumnarBatch> dataIter, java.util.List<Column> statsColumns)` Write the given data batches to Parquet files. |
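For context, an engine typically obtains a `ParquetHandler` from its `Engine` implementation. Below is a minimal sketch, assuming the Hadoop-backed `DefaultEngine` from the `delta-kernel-defaults` module and the `Engine.getParquetHandler()` accessor; verify both against your Kernel version:

```java
import org.apache.hadoop.conf.Configuration;

import io.delta.kernel.defaults.engine.DefaultEngine;
import io.delta.kernel.engine.Engine;
import io.delta.kernel.engine.ParquetHandler;

public class GetHandlerExample {
  public static void main(String[] args) {
    // DefaultEngine wires up Hadoop-backed handlers, including the ParquetHandler.
    Engine engine = DefaultEngine.create(new Configuration());
    ParquetHandler parquetHandler = engine.getParquetHandler();
    System.out.println("ParquetHandler impl: " + parquetHandler.getClass().getName());
  }
}
```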
`CloseableIterator<ColumnarBatch> readParquetFiles(CloseableIterator<FileStatus> fileIter, StructType physicalSchema, java.util.Optional<Predicate> predicate) throws java.io.IOException`

Read the Parquet format files at the given locations and return the data as a `ColumnarBatch` with the columns requested by `physicalSchema`.

If `physicalSchema` has a `StructField` with column name `StructField.METADATA_ROW_INDEX_COLUMN_NAME` and the field is a metadata column (`StructField.isMetadataColumn()`), the column must be populated with the file row index.

How does a column in `physicalSchema` match a column in the Parquet file? If the `StructField` has a field id in the metadata with key `parquet.field.id`, the reader first attempts to match the column by id. If the column is not found by id, it is matched by name. When matching by name, a case-sensitive match is tried first; if none is found, a case-insensitive match is attempted.

Parameters:

- `fileIter` - Iterator of files to read data from.
- `physicalSchema` - Select list of columns to read from the Parquet file.
- `predicate` - Optional predicate which the Parquet reader can use to prune rows that don't satisfy it. Because pruning is optional and may be incomplete, the caller is still responsible for applying the predicate on the data returned by this method.

Returns:

- `ColumnarBatch`s containing the data in columnar format. It is the responsibility of the caller to close the iterator. The data is returned in the same order as the files given in `fileIter`.

Throws:

- `java.io.IOException` - if an I/O error occurs during the read.
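A sketch of calling `readParquetFiles` under the contract above. The file path, size, schema, and predicate are illustrative, and the `asCloseable` adapter is a local helper, not a Kernel API; `FileStatus.of`, `LongType.LONG`, `Literal.ofLong`, and the two-argument `Predicate` constructor are assumed to match recent Kernel releases:

```java
import java.io.IOException;
import java.util.Iterator;
import java.util.List;
import java.util.Optional;

import io.delta.kernel.data.ColumnarBatch;
import io.delta.kernel.engine.ParquetHandler;
import io.delta.kernel.expressions.Column;
import io.delta.kernel.expressions.Literal;
import io.delta.kernel.expressions.Predicate;
import io.delta.kernel.types.LongType;
import io.delta.kernel.types.StructType;
import io.delta.kernel.utils.CloseableIterator;
import io.delta.kernel.utils.FileStatus;

public class ReadExample {
  // Local adapter: wrap a plain Iterator as a CloseableIterator (no resources to release).
  static <T> CloseableIterator<T> asCloseable(Iterator<T> it) {
    return new CloseableIterator<T>() {
      @Override public boolean hasNext() { return it.hasNext(); }
      @Override public T next() { return it.next(); }
      @Override public void close() { /* nothing to release */ }
    };
  }

  public static void readOneFile(ParquetHandler handler) throws IOException {
    // Hypothetical file; FileStatus.of(path, size, modificationTime) builds the status.
    FileStatus file = FileStatus.of("/data/part-00000.parquet", 1024L, 0L);

    // Select list: read only the `id` column.
    StructType physicalSchema = new StructType().add("id", LongType.LONG);

    // Pruning hint: id = 5. Pruning is best-effort, so the predicate must be re-applied.
    Predicate predicate = new Predicate("=", new Column("id"), Literal.ofLong(5L));

    try (CloseableIterator<ColumnarBatch> batches = handler.readParquetFiles(
        asCloseable(List.of(file).iterator()), physicalSchema, Optional.of(predicate))) {
      while (batches.hasNext()) {
        ColumnarBatch batch = batches.next();
        // The reader may return rows that do NOT satisfy the predicate;
        // the caller must still filter the returned data.
        System.out.println("Read batch of " + batch.getSize() + " rows");
      }
    }
  }
}
```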
`CloseableIterator<DataFileStatus> writeParquetFiles(String directoryPath, CloseableIterator<FilteredColumnarBatch> dataIter, java.util.List<Column> statsColumns) throws java.io.IOException`

Write the given data batches to Parquet files.
Parameters:

- `directoryPath` - Location where the data files should be written.
- `dataIter` - Iterator of data batches to write. It is the responsibility of the caller to close the iterator.
- `statsColumns` - List of columns to collect statistics for. Statistics collection is optional; if the implementation does not support it, it may return no statistics.

Returns:

- An iterator of `DataFileStatus` containing the status of the written files. Each status contains the file path and the optionally collected statistics for the file. It is the responsibility of the caller to close the iterator.

Throws:

- `java.io.IOException` - if an I/O error occurs during the file writing. This may leave some files already written in the directory. It is the responsibility of the caller to clean up.
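A sketch of driving `writeParquetFiles` and consuming the returned statuses. The directory path and stats column are illustrative, and `DataFileStatus.getPath()` (inherited from `FileStatus`) is assumed from the Kernel utils API:

```java
import java.io.IOException;
import java.util.List;

import io.delta.kernel.data.FilteredColumnarBatch;
import io.delta.kernel.engine.ParquetHandler;
import io.delta.kernel.expressions.Column;
import io.delta.kernel.utils.CloseableIterator;
import io.delta.kernel.utils.DataFileStatus;

public class WriteFilesExample {
  public static void writeBatches(
      ParquetHandler handler,
      CloseableIterator<FilteredColumnarBatch> dataIter) throws IOException {
    // Ask for statistics on `id`; implementations may skip stats entirely.
    List<Column> statsColumns = List.of(new Column("id"));

    try (CloseableIterator<DataFileStatus> results =
             handler.writeParquetFiles("/data/output-dir", dataIter, statsColumns)) {
      while (results.hasNext()) {
        DataFileStatus status = results.next();
        // Each status carries the written file's path plus any collected statistics.
        System.out.println("Wrote: " + status.getPath());
      }
    }
    // On IOException, partially written files may remain under /data/output-dir;
    // the caller is responsible for cleaning them up.
  }
}
```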
`void writeParquetFileAtomically(String filePath, CloseableIterator<FilteredColumnarBatch> data) throws java.io.IOException`

Write the given data as a Parquet file.
Parameters:

- `filePath` - Fully qualified destination file path.
- `data` - Iterator of `FilteredColumnarBatch`.

Throws:

- `java.nio.file.FileAlreadyExistsException` - if the file already exists and `overwrite` is false.
- `java.io.IOException` - if any other I/O error occurs.
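A sketch of the atomic-write call with the documented failure mode handled. The checkpoint-style target path is hypothetical; the caller supplies the `FilteredColumnarBatch` iterator:

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;

import io.delta.kernel.data.FilteredColumnarBatch;
import io.delta.kernel.engine.ParquetHandler;
import io.delta.kernel.utils.CloseableIterator;

public class AtomicWriteExample {
  public static void writeSingleFile(
      ParquetHandler handler,
      CloseableIterator<FilteredColumnarBatch> data) throws IOException {
    // Hypothetical fully qualified destination path.
    String target = "/table/_delta_log/00000000000000000010.checkpoint.parquet";
    try {
      // Writes the file atomically (per the method's contract);
      // fails if the target already exists.
      handler.writeParquetFileAtomically(target, data);
    } catch (FileAlreadyExistsException e) {
      // Another writer got there first; the target file is already present.
      System.err.println("Target already exists: " + target);
    }
  }
}
```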