Concurrency Control
Delta Lake provides ACID transaction guarantees between reads and writes. This means that:
- Readers continue to see the consistent snapshot of the table that the Spark job started with, even when the table is modified during the job (a read sketch follows this list).
- Multiple writers can modify a table simultaneously; each writer sees a consistent snapshot of the table, and the committed writes are placed in a serial order.
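For example, a batch read binds to the snapshot that is current when the scan starts. The following is a minimal PySpark sketch, assuming the delta-spark package is installed; the session settings are the standard ones for enabling Delta Lake, and the table path is illustrative:

```python
from pyspark.sql import SparkSession

# Standard delta-spark session configuration.
spark = (
    SparkSession.builder.appName("snapshot-read-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# The scan binds to the table snapshot that exists when the read starts.
events = spark.read.format("delta").load("/tmp/delta/events")  # illustrative path

# Even if another writer commits new table versions while this job runs,
# actions on `events` keep computing over the original snapshot.
print(events.count())
```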
Optimistic concurrency control
Delta Lake uses optimistic concurrency control to provide transactional guarantees between writes. Under this mechanism, writes operate in three stages:
- Read: Reads (if needed) the latest available version of the table to identify which files need to be modified (that is, rewritten).
- Write: Stages all the changes by writing new data files.
- Validate and commit: Before committing the changes, checks whether the proposed changes conflict with any other changes that may have been concurrently committed since the snapshot that was read. If there are no conflicts, all the staged changes are committed as a new versioned snapshot and the write operation succeeds. If there are conflicts, the write operation fails with a concurrent modification exception rather than corrupting the table, as concurrent writes can do to a plain Parquet table in open source Spark. A retry sketch follows this list.
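Because a failed commit leaves the table untouched, the usual response to a concurrent modification exception is to re-read the newer snapshot and retry the write. Below is a hedged PySpark sketch, assuming the delta-spark package, whose `delta.exceptions` module exposes `DeltaConcurrentModificationException` in recent releases; the path, predicate, and retry policy are illustrative:

```python
from delta.exceptions import DeltaConcurrentModificationException
from delta.tables import DeltaTable

def update_with_retry(spark, path, max_attempts=3):
    """Retry an update that may lose an optimistic-concurrency race."""
    for attempt in range(max_attempts):
        try:
            # Read: bind to the latest available snapshot of the table.
            table = DeltaTable.forPath(spark, path)
            # Write: stage new data files for the rows being rewritten.
            table.update(
                condition="status = 'pending'",      # illustrative predicate
                set={"status": "'processed'"},
            )
            return  # validate-and-commit succeeded; a new version is visible
        except DeltaConcurrentModificationException:
            # A concurrent commit conflicted; the failed attempt did not
            # change the table, so it is safe to re-read and try again.
            if attempt == max_attempts - 1:
                raise
```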
Concurrency level
Delta Lake supports concurrent reads and append-only writes. To be considered append-only, a writer must only add new data without reading or modifying existing data in any way. Concurrent reads and appends are allowed and get snapshot isolation even when they operate on the same partition of a Delta Lake table.
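For example, a batch append in PySpark only adds new data files and never reads or rewrites existing ones, so it cannot conflict with concurrent readers or other appends. The schema and path below are illustrative:

```python
# Illustrative rows; `spark` is an existing SparkSession with delta-spark enabled.
new_rows = spark.createDataFrame(
    [("2024-01-01", "click"), ("2024-01-01", "view")],
    ["date", "event_type"],
)

# mode("append") stages only new files; no existing file is read or modified,
# so this write qualifies as append-only.
new_rows.write.format("delta").mode("append").save("/tmp/delta/events")
```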
See Append using DataFrames for batch appends and Append mode for streaming appends to a Delta Lake table.