@Evolving public interface ParquetHandler
| Modifier and Type | Method and Description |
|---|---|
| CloseableIterator&lt;ColumnarBatch&gt; | readParquetFiles(CloseableIterator&lt;FileStatus&gt; fileIter, StructType physicalSchema, java.util.Optional&lt;Predicate&gt; predicate) Read the Parquet format files at the given locations and return the data as a ColumnarBatch with the columns requested by physicalSchema. |
| void | writeParquetFileAtomically(String filePath, CloseableIterator&lt;FilteredColumnarBatch&gt; data) Write the given data as a Parquet file. |
| CloseableIterator&lt;DataFileStatus&gt; | writeParquetFiles(String directoryPath, CloseableIterator&lt;FilteredColumnarBatch&gt; dataIter, java.util.List&lt;Column&gt; statsColumns) Write the given data batches to Parquet files. |
CloseableIterator&lt;ColumnarBatch&gt; readParquetFiles(CloseableIterator&lt;FileStatus&gt; fileIter, StructType physicalSchema, java.util.Optional&lt;Predicate&gt; predicate) throws java.io.IOException

Read the Parquet format files at the given locations and return the data as a ColumnarBatch with the columns requested by physicalSchema.

If physicalSchema has a StructField with column name StructField.METADATA_ROW_INDEX_COLUMN_NAME, and the field is a metadata column (StructField.isMetadataColumn() returns true), the column must be populated with the file row index.
How does a column in physicalSchema match a column in the Parquet file? If the StructField has a field id in its metadata under the key `parquet.field.id`, the reader first attempts to match the column by id. If the column is not found by id, it falls back to matching by name. When matching by name, a case-sensitive match is tried first; if none is found, a case-insensitive match is attempted.
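As an illustration of id-based matching, here is a minimal sketch of building a physicalSchema whose column carries a `parquet.field.id`. The FieldMetadata builder calls shown here are assumptions about the io.delta.kernel.types API; the key name itself comes from the text above.

```java
import io.delta.kernel.types.FieldMetadata;
import io.delta.kernel.types.IntegerType;
import io.delta.kernel.types.StructField;
import io.delta.kernel.types.StructType;

public class PhysicalSchemaExample {
  public static StructType schemaWithFieldId() {
    // The reader first tries to match this column by id 7; only if no
    // Parquet column has that id does it fall back to name matching.
    // FieldMetadata.builder().putLong(...) is assumed API, not confirmed here.
    FieldMetadata fieldId =
        FieldMetadata.builder().putLong("parquet.field.id", 7L).build();
    return new StructType()
        .add(new StructField("renamed_col", IntegerType.INTEGER, true, fieldId));
  }
}
```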
Parameters:
- fileIter - Iterator of files to read data from.
- physicalSchema - Select list of columns to read from the Parquet file.
- predicate - Optional predicate which the Parquet reader can optionally use to prune rows that don't satisfy the predicate. Because pruning is optional and may be incomplete, the caller is still responsible for applying the predicate to the data returned by this method.

Returns:
ColumnarBatches containing the data in columnar format. It is the responsibility of the caller to close the iterator. The data is returned in the same order as the files given in fileIter.

Throws:
- java.io.IOException - if an I/O error occurs during the read.
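A minimal usage sketch, assuming a ParquetHandler instance is already in hand; the helper name countRows is hypothetical. The try-with-resources block satisfies the caller's obligation to close the returned iterator.

```java
import io.delta.kernel.data.ColumnarBatch;
import io.delta.kernel.engine.ParquetHandler;
import io.delta.kernel.types.StructType;
import io.delta.kernel.utils.CloseableIterator;
import io.delta.kernel.utils.FileStatus;
import java.io.IOException;
import java.util.Optional;

public class ReadParquetExample {
  // Hypothetical helper: reads every batch and tallies the row count.
  static long countRows(
      ParquetHandler handler,
      CloseableIterator<FileStatus> files,
      StructType physicalSchema) throws IOException {
    long rows = 0;
    // Optional.empty(): no predicate, so no pruning is requested. With a
    // predicate, the caller must still re-apply it to the returned rows.
    try (CloseableIterator<ColumnarBatch> batches =
        handler.readParquetFiles(files, physicalSchema, Optional.empty())) {
      while (batches.hasNext()) {
        rows += batches.next().getSize();
      }
    }
    return rows;
  }
}
```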
CloseableIterator&lt;DataFileStatus&gt; writeParquetFiles(String directoryPath, CloseableIterator&lt;FilteredColumnarBatch&gt; dataIter, java.util.List&lt;Column&gt; statsColumns) throws java.io.IOException

Write the given data batches to Parquet files.

Parameters:
- directoryPath - Location where the data files should be written.
- dataIter - Iterator of data batches to write. It is the responsibility of the caller to close the iterator.
- statsColumns - List of columns to collect statistics for. Statistics collection is optional; if the implementation does not support it, it may return no statistics.

Returns:
DataFileStatus containing the status of the written files. Each status contains the file path and the optionally collected statistics for the file. It is the responsibility of the caller to close the iterator.

Throws:
- java.io.IOException - if an I/O error occurs during the file writing. This may leave some files already written in the directory; it is the responsibility of the caller to clean them up.
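A usage sketch under the same assumptions; statistics collection is skipped by passing an empty statsColumns list, which the interface permits. The helper name writeAll is hypothetical.

```java
import io.delta.kernel.data.FilteredColumnarBatch;
import io.delta.kernel.engine.ParquetHandler;
import io.delta.kernel.utils.CloseableIterator;
import io.delta.kernel.utils.DataFileStatus;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class WriteParquetExample {
  // Hypothetical helper: writes all batches and collects the file statuses.
  static List<DataFileStatus> writeAll(
      ParquetHandler handler,
      String directoryPath,
      CloseableIterator<FilteredColumnarBatch> dataIter) throws IOException {
    List<DataFileStatus> written = new ArrayList<>();
    // Empty statsColumns: no statistics requested for the written files.
    try (CloseableIterator<DataFileStatus> results =
        handler.writeParquetFiles(directoryPath, dataIter, Collections.emptyList())) {
      while (results.hasNext()) {
        written.add(results.next());
      }
    }
    // On IOException, partially written files may remain in directoryPath;
    // cleanup is the caller's responsibility (see Throws above).
    return written;
  }
}
```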
void writeParquetFileAtomically(String filePath, CloseableIterator&lt;FilteredColumnarBatch&gt; data) throws java.io.IOException

Write the given data as a Parquet file.

Parameters:
- filePath - Fully qualified destination file path.
- data - Iterator of FilteredColumnarBatch.

Throws:
- java.nio.file.FileAlreadyExistsException - if the file already exists and overwrite is false.
- java.io.IOException - if any other I/O error occurs.
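A sketch of working with the atomic-write contract: catching FileAlreadyExistsException lets a caller detect that the destination was created first by another writer. The helper name tryWrite is hypothetical.

```java
import io.delta.kernel.data.FilteredColumnarBatch;
import io.delta.kernel.engine.ParquetHandler;
import io.delta.kernel.utils.CloseableIterator;
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;

public class AtomicWriteExample {
  // Hypothetical helper: returns false instead of throwing when the
  // destination file already exists; rethrows any other I/O error.
  static boolean tryWrite(
      ParquetHandler handler,
      String filePath,
      CloseableIterator<FilteredColumnarBatch> data) throws IOException {
    try {
      handler.writeParquetFileAtomically(filePath, data);
      return true;
    } catch (FileAlreadyExistsException e) {
      return false;
    }
  }
}
```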