Interface ParquetHandler

All Known Implementing Classes:
DefaultParquetHandler

@Evolving public interface ParquetHandler
Provides Parquet file related functionalities to Delta Kernel. Connectors can leverage this interface to provide their own custom implementation of Parquet data file functionalities to Delta Kernel.
Since:
3.0.0
  • Method Details

    • readParquetFiles

      CloseableIterator<ColumnarBatch> readParquetFiles(CloseableIterator<FileStatus> fileIter, StructType physicalSchema, Optional<Predicate> predicate) throws IOException
      Read the Parquet files at the given locations and return the data as ColumnarBatches containing the columns requested by physicalSchema.

      If physicalSchema has a StructField with the column name StructField.METADATA_ROW_INDEX_COLUMN_NAME, and the field is a metadata column (StructField.isMetadataColumn() returns true), the column must be populated with the file row index.

      How does a column in physicalSchema match a column in the Parquet file? If the StructField has a field id in its metadata under the key `parquet.field.id`, the reader first attempts to match the column by id. If no column is found by id, the column is matched by name: a case-sensitive match is tried first, and if that fails, a case-insensitive match is attempted.
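The matching order above can be sketched in plain Java. This is an illustrative sketch, not Delta Kernel API: the `ColumnResolver` and `ParquetColumn` names are hypothetical, and real readers match against the Parquet file footer schema rather than a simple list.

```java
import java.util.Optional;
import java.util.OptionalInt;

// Hypothetical sketch of the column-matching order described above:
// 1. match by field id (metadata key `parquet.field.id`),
// 2. case-sensitive name match, 3. case-insensitive name match.
public class ColumnResolver {
    // Parquet-side column: a name plus an optional field id.
    record ParquetColumn(String name, OptionalInt fieldId) {}

    static Optional<ParquetColumn> resolve(
            String requestedName, OptionalInt requestedFieldId,
            Iterable<ParquetColumn> fileColumns) {
        // 1. Try to match by field id when the requested column carries one.
        if (requestedFieldId.isPresent()) {
            for (ParquetColumn c : fileColumns) {
                if (c.fieldId().isPresent()
                        && c.fieldId().getAsInt() == requestedFieldId.getAsInt()) {
                    return Optional.of(c);
                }
            }
        }
        // 2. Fall back to a case-sensitive name match.
        for (ParquetColumn c : fileColumns) {
            if (c.name().equals(requestedName)) return Optional.of(c);
        }
        // 3. Finally, try a case-insensitive name match.
        for (ParquetColumn c : fileColumns) {
            if (c.name().equalsIgnoreCase(requestedName)) return Optional.of(c);
        }
        return Optional.empty();
    }
}
```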

      Parameters:
      fileIter - Iterator of files to read data from.
      physicalSchema - Select list of columns to read from the Parquet file.
      predicate - Optional predicate which the Parquet reader can use to prune rows that don't satisfy it. Because pruning is optional and may be incomplete, the caller is still responsible for applying the predicate to the data returned by this method.
      Returns:
      an iterator of ColumnarBatches containing the data in columnar format. It is the responsibility of the caller to close the iterator. The data is returned in the same order as the files given in fileIter.
      Throws:
      IOException - if an I/O error occurs during the read.
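Because predicate pushdown is best-effort, rows that fail the predicate may still appear in the returned batches, so the caller must re-apply the predicate itself. A minimal sketch of that contract, using plain ints to stand in for rows (the class and method names are illustrative, not Delta Kernel API):

```java
import java.util.List;
import java.util.function.IntPredicate;
import java.util.stream.Collectors;

// Hypothetical sketch: the reader may have pruned only some non-matching
// rows, so the caller filters the returned data again with the same predicate.
public class PredicateReapply {
    static List<Integer> filterReturnedRows(List<Integer> rowsFromReader,
                                            IntPredicate predicate) {
        // Pruning inside the reader is optional and may be incomplete,
        // so re-apply the predicate on everything that came back.
        return rowsFromReader.stream()
                .filter(predicate::test)
                .collect(Collectors.toList());
    }
}
```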
    • writeParquetFiles

      CloseableIterator<DataFileStatus> writeParquetFiles(String directoryPath, CloseableIterator<FilteredColumnarBatch> dataIter, List<Column> statsColumns) throws IOException
      Write the given data batches to Parquet files. Try to keep each Parquet file within the given target size: if the current file exceeds this size, close it and start writing to a new file.

      Parameters:
      directoryPath - Location where the data files should be written.
      dataIter - Iterator of data batches to write. It is the responsibility of the caller to close the iterator.
      statsColumns - List of columns to collect statistics for. The statistics collection is optional. If the implementation does not support statistics collection, it is ok to return no statistics.
      Returns:
      an iterator of DataFileStatus containing the status of the written files. Each status contains the file path and the optionally collected statistics for the file. It is the responsibility of the caller to close the iterator.
      Throws:
      IOException - if an I/O error occurs during the file writing. This may leave some files already written in the directory. It is the responsibility of the caller to clean up.
      Since:
      3.2.0
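The size-based rolling described above can be sketched as a simple planning step: accumulate batches into the current file until it reaches the target size, then roll to a new file. This is an illustrative sketch, not the DefaultParquetHandler implementation; batch sizes are modeled as byte counts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of size-based file rolling: group incoming batch
// sizes (in bytes) into files of roughly targetFileSize bytes each.
public class RollingWriter {
    static List<List<Long>> planFiles(List<Long> batchSizes, long targetFileSize) {
        List<List<Long>> files = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long size : batchSizes) {
            current.add(size);
            currentSize += size;
            // Once the current file reaches the target size, close it
            // and start a new file for the next batch.
            if (currentSize >= targetFileSize) {
                files.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
        }
        if (!current.isEmpty()) files.add(current); // close the last partial file
        return files;
    }
}
```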
    • writeParquetFileAtomically

      void writeParquetFileAtomically(String filePath, CloseableIterator<FilteredColumnarBatch> data) throws IOException
      Write the given data as a single Parquet file. This call either succeeds in creating the file with the given contents, or no file is created at all; it won't leave behind a partially written file.

      Parameters:
      filePath - Fully qualified destination file path
      data - Iterator of FilteredColumnarBatch
      Throws:
      FileAlreadyExistsException - if the file already exists and overwrite is false.
      IOException - if any other I/O error occurs.
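One common way to achieve the all-or-nothing behavior described above is to write to a temporary file in the destination directory and then atomically rename it into place. This is an illustrative sketch under that assumption, not the DefaultParquetHandler implementation; whether the rename is actually atomic depends on the underlying file system.

```java
import java.io.IOException;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical sketch of atomic file creation via temp-file-then-rename.
public class AtomicWrite {
    static void writeAtomically(Path target, byte[] contents) throws IOException {
        if (Files.exists(target)) {
            throw new FileAlreadyExistsException(target.toString());
        }
        // Write everything to a temp file first; a crash here leaves
        // nothing at the target path.
        Path tmp = Files.createTempFile(target.getParent(), ".tmp-", ".parquet");
        try {
            Files.write(tmp, contents);
            // Atomic rename: the target either appears fully written or not at all.
            Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
        } finally {
            Files.deleteIfExists(tmp); // clean up if the move did not happen
        }
    }
}
```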