Optimizations

Delta Lake provides optimizations that accelerate data lake operations.

Optimize performance with file management

To improve query speed, Delta Lake supports the ability to optimize the layout of data in storage. There are various ways to optimize the layout.

Compaction (bin-packing)

Note

This feature is available in Delta Lake 1.2.0 and above.

Delta Lake can improve the speed of read queries from a table by coalescing small files into larger ones.

OPTIMIZE '/path/to/delta/table' -- Optimizes the path-based Delta Lake table

OPTIMIZE delta_table_name;

OPTIMIZE delta.`/path/to/delta/table`;

-- If you have a large amount of data and only want to optimize a subset of it, you can specify an optional partition predicate using `WHERE`:
OPTIMIZE delta_table_name WHERE date >= '2017-01-01'

Note

  • Bin-packing optimization is idempotent, meaning that if it is run twice on the same dataset, the second run has no effect.
  • Bin-packing aims to produce evenly-balanced data files with respect to their size on disk, but not necessarily number of tuples per file. However, the two measures are most often correlated.

Readers of Delta tables use snapshot isolation, which means that they are not interrupted when OPTIMIZE removes unnecessary files from the transaction log. OPTIMIZE makes no data related changes to the table, so a read before and after an OPTIMIZE has the same results. Performing OPTIMIZE on a table that is a streaming source does not affect any current or future streams that treat this table as a source. OPTIMIZE returns the file statistics (min, max, total, and so on) for the files removed and the files added by the operation. Optimize stats also contains the number of batches, and partitions optimized.

File compaction is controlled by following Apache Spark configuration options:

  • optimize.minFileSize (default is 1,000,000,000 bytes (1GB)): Files which are smaller than this threshold (in bytes) will be grouped together and rewritten as larger files by the OPTIMIZE command.
  • optimize.maxFileSize (default is 1,000,000,000 bytes (1GB)): Target file size (in bytes) produced by the OPTIMIZE command.
  • optimize.maxThreads (default is 15): Maximum number of parallel jobs allowed in OPTIMIZE command. Increasing the maximum parallel jobs allows the OPTIMIZE command to run faster, but increases the job management on the Spark driver side.

Data skipping

Note

This feature is available in Delta Lake 1.2.0 and above.

Data skipping information is collected automatically when you write data into a Delta Lake table. Delta Lake takes advantage of this information (minimum and maximum values for each column) at query time to provide faster queries. You do not need to configure data skipping; the feature is activated whenever applicable.

Collecting statistics on a column containing long values such as string or binary is an expensive operation. To avoid collecting statistics on such columns you can configure the table property delta.dataSkippingNumIndexedCols. This property indicates the position index of a column in the table’s schema. All columns with a position index less than the delta.dataSkippingNumIndexedCols property will have statistics collected. For the purposes of collecting statistics, each field within a nested column is considered as an individual column. To avoid collecting statistics on columns containing long values, either set the delta.dataSkippingNumIndexedCols property so that the long value columns are after this index in the table’s schema, or move columns containing long strings to an index position greater than the delta.dataSkippingNumIndexedCols property by using ALTER TABLE ALTER COLUMN.