Table Utility Commands

Delta Lake tables support vacuum and history utility commands.

Vacuum

You can remove files that are no longer referenced by a Delta Lake table and are older than the retention threshold by running the vacuum command on the table. The default retention threshold is 7 days. vacuum is not triggered automatically; you must run it yourself. Note that after running vacuum, you can no longer time travel back to versions older than the retention period. vacuum operates recursively on the directories associated with the Delta Lake table.

Scala
import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, pathToTable)

deltaTable.vacuum()        // vacuum files not required by versions older than the default retention period

deltaTable.vacuum(100)     // vacuum files not required by versions more than 100 hours old
Java
import io.delta.tables.*;

DeltaTable deltaTable = DeltaTable.forPath(spark, pathToTable);

deltaTable.vacuum();        // vacuum files not required by versions older than the default retention period

deltaTable.vacuum(100);    // vacuum files not required by versions more than 100 hours old
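
By default, Delta Lake refuses to vacuum with a retention interval shorter than the default threshold, as a safeguard against removing files still needed by in-progress readers or recent time travel. The following is a minimal sketch of how you might relax that check for a one-off aggressive vacuum; it assumes the spark.databricks.delta.retentionDurationCheck.enabled configuration key available in open-source Delta Lake, so verify it against your Delta Lake version before relying on it.

Scala
// Sketch only: disabling the retention duration check lets vacuum accept
// intervals below the default threshold. The configuration key
// spark.databricks.delta.retentionDurationCheck.enabled is assumed here;
// confirm it exists in your Delta Lake release.
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "false")

deltaTable.vacuum(1)   // vacuum files not required by versions more than 1 hour old

// Re-enable the safety check afterwards
spark.conf.set("spark.databricks.delta.retentionDurationCheck.enabled", "true")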

See Programmatic API Docs for more details.

History

You can retrieve information about each write to a Delta Lake table, such as the operation, user, and timestamp, by running the history command. The operations are returned in reverse chronological order. By default, table history is retained for 30 days.

Scala
import io.delta.tables._

val deltaTable = DeltaTable.forPath(spark, pathToTable)

val fullHistoryDF = deltaTable.history()    // get the full history of the table.

val lastOperationDF = deltaTable.history(1) // get the last operation.
Java
import io.delta.tables.*;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

DeltaTable deltaTable = DeltaTable.forPath(spark, pathToTable);

Dataset<Row> fullHistoryDF = deltaTable.history();       // get the full history of the table.

Dataset<Row> lastOperationDF = deltaTable.history(1);    // get the last operation.

The returned DataFrame has the following structure:

+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+
|version|          timestamp|userId|userName|operation| operationParameters| job|notebook|clusterId|readVersion|isolationLevel|isBlindAppend|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+
|      5|2019-07-29 14:07:47|  null|    null|   DELETE|[predicate -> ["(...|null|    null|     null|          4|          null|        false|
|      4|2019-07-29 14:07:41|  null|    null|   UPDATE|[predicate -> (id...|null|    null|     null|          3|          null|        false|
|      3|2019-07-29 14:07:29|  null|    null|   DELETE|[predicate -> ["(...|null|    null|     null|          2|          null|        false|
|      2|2019-07-29 14:06:56|  null|    null|   UPDATE|[predicate -> (id...|null|    null|     null|          1|          null|        false|
|      1|2019-07-29 14:04:31|  null|    null|   DELETE|[predicate -> ["(...|null|    null|     null|          0|          null|        false|
|      0|2019-07-29 14:01:40|  null|    null|    WRITE|[mode -> ErrorIfE...|null|    null|     null|       null|          null|         true|
+-------+-------------------+------+--------+---------+--------------------+----+--------+---------+-----------+--------------+-------------+
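
A common follow-up is to use a version from the history output with time travel. The sketch below is illustrative: it assumes the versionAsOf reader option supported by the Delta Lake data source, and it reuses the deltaTable and pathToTable values from the examples above.

Scala
import org.apache.spark.sql.functions.col

// Sketch: history() returns rows in reverse chronological order, so the
// first row matching the filter is the most recent UPDATE.
val lastUpdateVersion = deltaTable.history()
  .filter(col("operation") === "UPDATE")
  .select("version")
  .head()
  .getLong(0)

// Read the table as of that version via time travel
val dfAsOfUpdate = spark.read
  .format("delta")
  .option("versionAsOf", lastUpdateVersion)
  .load(pathToTable)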

See Programmatic API Docs for more details.