Interface ParquetHandler

All Known Implementing Classes:
DefaultParquetHandler

@Evolving public interface ParquetHandler
Provides Parquet file related functionalities to Delta Kernel. Connectors can leverage this interface to provide their own custom implementation of Parquet data file functionalities to Delta Kernel.
Since:
3.0.0
  • Method Details

    • readParquetFiles

      CloseableIterator<ColumnarBatch> readParquetFiles(CloseableIterator<FileStatus> fileIter, StructType physicalSchema, Optional<Predicate> predicate) throws IOException
      Read the Parquet files at the given locations and return the data as ColumnarBatches containing the columns requested by physicalSchema.

      If physicalSchema has a StructField with the column name StructField.METADATA_ROW_INDEX_COLUMN_NAME, and the field is a metadata column (StructField.isMetadataColumn() returns true), the column must be populated with the file row index.

      How does a column in physicalSchema match a column in the Parquet file? If the StructField has a field id in its metadata under the key `parquet.field.id`, the reader first attempts to match the column by id. If no column is found by id, the column is matched by name: a case-sensitive match is tried first, and if that fails, a case-insensitive match is attempted.
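The matching order above can be sketched in plain Java. This is an illustrative sketch, not Delta Kernel API: the `ColumnResolver` and `ParquetColumn` names are hypothetical, and real readers match against the Parquet file footer schema rather than a simple list.

```java
import java.util.Optional;
import java.util.OptionalInt;

// Hypothetical sketch of the column-matching order described above:
// 1. match by field id (metadata key `parquet.field.id`),
// 2. case-sensitive name match, 3. case-insensitive name match.
public class ColumnResolver {
    // Parquet-side column: a name plus an optional field id.
    record ParquetColumn(String name, OptionalInt fieldId) {}

    static Optional<ParquetColumn> resolve(
            String requestedName, OptionalInt requestedFieldId,
            Iterable<ParquetColumn> fileColumns) {
        // 1. Try to match by field id when the requested column carries one.
        if (requestedFieldId.isPresent()) {
            for (ParquetColumn c : fileColumns) {
                if (c.fieldId().isPresent()
                        && c.fieldId().getAsInt() == requestedFieldId.getAsInt()) {
                    return Optional.of(c);
                }
            }
        }
        // 2. Fall back to a case-sensitive name match.
        for (ParquetColumn c : fileColumns) {
            if (c.name().equals(requestedName)) return Optional.of(c);
        }
        // 3. Finally, try a case-insensitive name match.
        for (ParquetColumn c : fileColumns) {
            if (c.name().equalsIgnoreCase(requestedName)) return Optional.of(c);
        }
        return Optional.empty();
    }
}
```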

      Parameters:
      fileIter - Iterator of files to read data from.
      physicalSchema - Select list of columns to read from the Parquet file.
      predicate - Optional predicate which the Parquet reader can use to prune rows that don't satisfy it. Because pruning is optional and may be incomplete, the caller is still responsible for applying the predicate to the data returned by this method.
      Returns:
      an iterator of ColumnarBatches containing the data in columnar format. It is the responsibility of the caller to close the iterator. The data is returned in the same order as the files given in fileIter.
      Throws:
      IOException - if an I/O error occurs during the read.
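Because predicate pushdown is best-effort, rows that fail the predicate may still appear in the returned batches, so the caller must re-apply the predicate itself. A minimal sketch of that contract, using plain ints to stand in for rows (the class and method names are illustrative, not Delta Kernel API):

```java
import java.util.List;
import java.util.function.IntPredicate;
import java.util.stream.Collectors;

// Hypothetical sketch: the reader may have pruned only some non-matching
// rows, so the caller filters the returned data again with the same predicate.
public class PredicateReapply {
    static List<Integer> filterReturnedRows(List<Integer> rowsFromReader,
                                            IntPredicate predicate) {
        // Pruning inside the reader is optional and may be incomplete,
        // so re-apply the predicate on everything that came back.
        return rowsFromReader.stream()
                .filter(predicate::test)
                .collect(Collectors.toList());
    }
}
```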
    • writeParquetFiles

      CloseableIterator<DataFileStatus> writeParquetFiles(String directoryPath, CloseableIterator<FilteredColumnarBatch> dataIter, List<Column> statsColumns) throws IOException
      Write the given data batches to Parquet files. Try to keep each Parquet file within the given target size: if the current file exceeds this size, close it and start writing to a new file.

      Parameters:
      directoryPath - Location where the data files should be written.
      dataIter - Iterator of data batches to write. It is the responsibility of the caller to close the iterator.
      statsColumns - List of columns to collect statistics for. The statistics collection is optional. If the implementation does not support statistics collection, it is ok to return no statistics.
      Returns:
      an iterator of DataFileStatus containing the status of the written files. Each status contains the file path and the optionally collected statistics for the file. It is the responsibility of the caller to close the iterator.
      Throws:
      IOException - if an I/O error occurs during the file writing. This may leave some files already written in the directory. It is the responsibility of the caller to clean up.
      Since:
      3.2.0
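The size-based rolling described above can be sketched as a simple planning step: accumulate batches into the current file until it reaches the target size, then roll to a new file. This is an illustrative sketch, not the DefaultParquetHandler implementation; batch sizes are modeled as byte counts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of size-based file rolling: group incoming batch
// sizes (in bytes) into files of roughly targetFileSize bytes each.
public class RollingWriter {
    static List<List<Long>> planFiles(List<Long> batchSizes, long targetFileSize) {
        List<List<Long>> files = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : batchSizes) {
            current.add(size);
            currentSize += size;
            // Once the current file reaches the target size, close it
            // and start a new file for the next batch.
            if (currentSize >= targetFileSize) {
                files.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) files.add(current); // close the last partial file
        return files;
    }
}
```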
    • writeParquetFileAtomically

      void writeParquetFileAtomically(String filePath, CloseableIterator<FilteredColumnarBatch> data) throws IOException
      Write the given data as a single Parquet file. This call either succeeds in creating the file with the given contents, or no file is created at all; it won't leave behind a partially written file.

      Parameters:
      filePath - Fully qualified destination file path
      data - Iterator of FilteredColumnarBatch
      Throws:
      FileAlreadyExistsException - if the file already exists and overwrite is false.
      IOException - if any other I/O error occurs.
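One common way to achieve the all-or-nothing behavior described above is to write to a temporary file in the destination directory and then atomically rename it into place. This is an illustrative sketch under that assumption, not the DefaultParquetHandler implementation; whether the rename is actually atomic depends on the underlying file system.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of atomic file creation via temp-file-then-rename.
public class AtomicWrite {
    static void writeAtomically(Path target, byte[] contents) throws IOException {
        if (Files.exists(target)) {
            throw new FileAlreadyExistsException(target.toString());
        }
        // Write everything to a temp file first; a crash here leaves
        // nothing at the target path.
        Path tmp = Files.createTempFile(target.getParent(), ".tmp-", ".parquet");
        try {
            Files.write(tmp, contents);
            // Atomic rename: the target either appears fully written or not at all.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            Files.deleteIfExists(tmp); // clean up if the move did not happen
        }
    }
}
```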