Frequently Asked Questions (FAQ)
- What is Delta Lake?
- What format does Delta Lake use to store data?
- How can I read and write data with Delta Lake?
- Where does Delta Lake store the data?
- Can I stream data directly into Delta Lake tables?
- Does Delta Lake support writes or reads using the Spark Streaming DStream API?
- When I use Delta Lake, will I be able to port my code to other Spark platforms easily?
- Does Delta Lake support multi-table transactions?
- When should I use partitioning with Delta Lake tables?
What is Delta Lake?
Delta Lake is an open source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions and scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs.
What format does Delta Lake use to store data?
Delta Lake uses versioned Parquet files to store your data in your cloud storage. In addition to the versioned data files, Delta Lake stores a transaction log that keeps track of all the commits made to the table or blob store directory, which is what provides ACID transactions.
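For illustration, a Delta table directory typically contains Parquet data files alongside a `_delta_log` directory of JSON commit files. The file names below are illustrative:

```
/tmp/delta/events/
├── _delta_log/
│   ├── 00000000000000000000.json
│   └── 00000000000000000001.json
├── part-00000-....parquet
└── part-00001-....parquet
```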
How can I read and write data with Delta Lake?
You can use your favorite Apache Spark APIs to read and write data with Delta Lake. See Read a table and Write to a table.
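For example, here is a minimal PySpark sketch. The path `/tmp/delta/events` is hypothetical, and it assumes the delta-spark package is available to your Spark session:

```python
from pyspark.sql import SparkSession

# Assumes the delta-spark package is on the classpath; these two
# configs enable Delta Lake's SQL extensions and catalog.
spark = (
    SparkSession.builder.appName("delta-faq")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

df = spark.range(0, 5)

# Write a DataFrame out as a Delta table.
df.write.format("delta").save("/tmp/delta/events")

# Read the Delta table back with the ordinary DataFrame reader.
spark.read.format("delta").load("/tmp/delta/events").show()
```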
Where does Delta Lake store the data?
When writing data, you can specify the location in your cloud storage. Delta Lake stores the data in that location in Parquet format.
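Continuing the sketch above, you point the writer at any supported storage path; the bucket name here is hypothetical:

```python
# Delta writes the Parquet data files plus the _delta_log directory
# under this path (hypothetical S3 bucket).
spark.range(0, 5).write.format("delta").save("s3a://my-bucket/delta/events")
```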
Can I stream data directly into Delta Lake tables?
Yes, you can use Structured Streaming to directly write data into Delta Lake tables. See Stream data into Delta Lake tables.
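As a minimal sketch, the following streams rows from Spark's built-in rate source into a Delta table. The paths are hypothetical, and Structured Streaming writes require a checkpoint location:

```python
# Continuously append rows from the rate source to a Delta table.
query = (
    spark.readStream.format("rate").load()
    .writeStream.format("delta")
    .outputMode("append")
    .option("checkpointLocation", "/tmp/delta/events/_checkpoints")
    .start("/tmp/delta/events")
)
```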
Does Delta Lake support writes or reads using the Spark Streaming DStream API?
Delta Lake does not support the DStream API. We recommend using Structured Streaming instead; see Table Streaming Reads and Writes.
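For example, reusing the session from the earlier sketch, a Structured Streaming read from a Delta table looks like this (path hypothetical):

```python
# Read a Delta table as a stream with Structured Streaming, not DStreams.
updates = spark.readStream.format("delta").load("/tmp/delta/events")
query = updates.writeStream.format("console").start()
```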
When I use Delta Lake, will I be able to port my code to other Spark platforms easily?
Yes. When you use Delta Lake, you are using open Apache Spark APIs, so you can easily port your code to other Spark platforms. To port your code, replace the `delta` format with the `parquet` format.
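For example, reusing `df` from the earlier sketch, porting is typically a one-word change to the format string (paths hypothetical):

```python
# Writing with Delta Lake.
df.write.format("delta").save("/tmp/delta/out")

# The same code on a platform without Delta Lake support.
df.write.format("parquet").save("/tmp/parquet/out")
```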
Does Delta Lake support multi-table transactions?
Delta Lake does not support multi-table transactions or foreign keys. Delta Lake supports transactions at the table level.
When should I use partitioning with Delta Lake tables?
You can partition a Delta Lake table by a column. The most commonly used partition column is `date`. Follow these two rules of thumb for deciding which column to partition by (a brief example follows the list):
- Cardinality of the column: the number of distinct values the column will have. If the cardinality of a column is very high, do not use that column for partitioning. For example, if you partition by a column `userId` and there can be 1M distinct user IDs, that is a bad partitioning strategy.
- Amount of data in each partition: partition by a column only if you expect data in each partition to be at least 1 GB.
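For example, here is a sketch of partitioning by a low-cardinality `date` column. The column values and path are hypothetical:

```python
# Partition the table by the date column; each distinct date becomes
# a separate partition directory under the table path.
events = spark.createDataFrame(
    [("2024-01-01", 1), ("2024-01-02", 2)],
    ["date", "value"],
)
events.write.format("delta").partitionBy("date").save("/tmp/delta/events_by_date")
```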