Storage Configuration
Delta Lake ACID guarantees are predicated on the atomicity and durability guarantees of the storage system. Specifically, Delta Lake relies on the following when interacting with storage systems:
Atomic visibility
There must be a way for a file to be made visible in its entirety or not visible at all.
Mutual exclusion
Only one writer must be able to create (or rename) a file at the final destination.
Consistent listing
Once a file has been written in a directory, all future listings for that directory must return that file.
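The "atomic visibility" guarantee is commonly achieved on file systems with a write-to-temp-then-rename pattern. The sketch below (plain JVM file APIs, not Delta Lake code) illustrates the idea on a local file system; the file names are illustrative only:

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, StandardCopyOption}

// Write the content to a hidden temporary file first, then atomically
// rename it into place. Readers either see the complete file or no file
// at all -- the "atomic visibility" guarantee described above.
val dir = Files.createTempDirectory("delta-demo")
val tmp = dir.resolve("_tmp_00000000000000000001.json")
val dst = dir.resolve("00000000000000000001.json")

Files.write(tmp, "{\"commitInfo\":{}}".getBytes(StandardCharsets.UTF_8))
// ATOMIC_MOVE makes the rename all-or-nothing (or fails if unsupported).
Files.move(tmp, dst, StandardCopyOption.ATOMIC_MOVE)
```

Object stores such as S3 do not offer an atomic rename, which is one reason a dedicated `LogStore` implementation is needed there.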
Given that storage systems do not necessarily provide all of these guarantees out of the box, Delta Lake transactional operations typically go through the LogStore API instead of accessing the storage system directly. Custom LogStore implementations can be plugged in to provide the above guarantees for different storage systems. Delta Lake has built-in LogStore implementations for HDFS since 0.1.0 and for Amazon S3 and Azure storage services since 0.2.0. If you are interested in adding a custom LogStore implementation for your storage system, you can start a discussion in the community mailing group.
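To make the guarantees concrete, here is a toy, in-memory store that mirrors the *shape* of a log store. This is not the actual `org.apache.spark.sql.delta.storage.LogStore` interface (the class name and method signatures here are illustrative only); it just shows how a put-if-absent primitive yields mutual exclusion and how consistent listing behaves:

```scala
import java.nio.file.FileAlreadyExistsException
import java.util.concurrent.ConcurrentHashMap

// Illustrative only: a simplified stand-in for a LogStore, not the real API.
class InMemoryLogStore {
  private val files = new ConcurrentHashMap[String, Seq[String]]()

  // "Mutual exclusion": only one writer may create the file at the final
  // destination; putIfAbsent makes a second write of the same path fail.
  def write(path: String, actions: Seq[String]): Unit = {
    if (files.putIfAbsent(path, actions) != null) {
      throw new FileAlreadyExistsException(path)
    }
  }

  // Returns the stored actions, or null if the path was never written.
  def read(path: String): Seq[String] = files.get(path)

  // "Consistent listing": once written, a file appears in every listing.
  def listFrom(prefix: String): Seq[String] =
    files.keySet().toArray(Array.empty[String]).filter(_.startsWith(prefix)).sorted
}
```

A real implementation must enforce the same invariants against the actual storage system, which is exactly what is hard on stores without an atomic create-if-not-exists operation.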
This topic covers how to configure Delta Lake on HDFS, Amazon S3, and Azure storage services.
HDFS
You can use Delta Lake to read and write data on HDFS. Delta Lake supports concurrent reads and writes from multiple clusters.
Configuration for HDFS
You can use Delta Lake on HDFS out of the box, as the default LogStore implementation is HDFSLogStore, which accesses HDFS through Hadoop's FileContext APIs. If you have previously configured a different LogStore implementation, you can unset the spark.delta.logStore.class Spark configuration property or set it as follows:

```
spark.delta.logStore.class=org.apache.spark.sql.delta.storage.HDFSLogStore
```

This configures Delta Lake to use HDFSLogStore as the LogStore implementation. This is an application property: it must be set before starting the SparkContext and cannot be changed during the lifetime of the context.
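For example, a Spark shell for HDFS needs no extra storage package beyond Delta Lake itself (assuming Delta Lake 0.2.0 with a Scala 2.11 build of Spark; adjust the versions to your environment):

```shell
bin/spark-shell \
  --packages io.delta:delta-core_2.11:0.2.0 \
  --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.HDFSLogStore
```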
Amazon S3
You can create, read, and write Delta tables on S3. Delta Lake supports concurrent reads from multiple clusters, but concurrent writes to S3 must originate from a single Spark driver in order for Delta Lake to provide transactional guarantees.
Requirements
- S3 credentials: IAM roles (recommended) or access keys
- Apache Spark version 2.4.2 and above
- Delta Lake 0.2.0 and above
Quickstart
This is a quick guide on how to start reading and writing Delta tables on S3. For a detailed explanation of the configuration, refer to the next section.
Use the following command to launch a Spark shell with Delta Lake and S3 support (assuming you use Spark 2.4.3 pre-built for Hadoop 2.7):
```shell
bin/spark-shell \
  --packages io.delta:delta-core_2.11:0.2.0,org.apache.hadoop:hadoop-aws:2.7.7 \
  --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
  --conf spark.hadoop.fs.s3a.access.key=<your-s3-access-key> \
  --conf spark.hadoop.fs.s3a.secret.key=<your-s3-secret-key>
```
Try out some basic Delta table operations on S3 (in Scala):
```scala
// Create a Delta table on S3:
spark.range(5).write.format("delta").save("s3a://<your-s3-bucket>/<path>/<to>/<delta-table>")

// Read a Delta table on S3:
spark.read.format("delta").load("s3a://<your-s3-bucket>/<path>/<to>/<delta-table>")
```
For other languages and more examples of Delta table operations, see Delta Lake Quickstart.
Configuration for S3
Here are the steps to configure Delta Lake for S3.
1. Configure the LogStore implementation.

   Set the spark.delta.logStore.class Spark configuration property:

   ```
   spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
   ```

   As the name suggests, the S3SingleDriverLogStore implementation only works properly when all concurrent writes originate from a single Spark driver. This is an application property: it must be set before starting the SparkContext and cannot be changed during the lifetime of the context.

2. Include the hadoop-aws JAR in the classpath.

   Delta Lake needs the org.apache.hadoop.fs.s3a.S3AFileSystem class from the hadoop-aws package, which implements Hadoop's FileSystem API for S3. Make sure the version of this package matches the Hadoop version with which Spark was built.

3. Set up S3 credentials.

   We recommend using IAM roles for authentication and authorization. If you want to use access keys instead, here is one way to set the Hadoop configurations (in Scala):

   ```scala
   sc.hadoopConfiguration.set("fs.s3a.access.key", "<your-s3-access-key>")
   sc.hadoopConfiguration.set("fs.s3a.secret.key", "<your-s3-secret-key>")
   ```
Microsoft Azure storage
You can create, read, and write Delta tables on Azure Blob storage and Azure Data Lake Storage Gen1. Delta Lake supports concurrent writes from multiple clusters.
Requirements
Delta Lake relies on Hadoop FileSystem APIs to access Azure storage services. Specifically, Delta Lake requires the implementation of FileSystem.rename() to be atomic, which is only supported in newer Hadoop versions (HADOOP-15156 and HADOOP-15086). For this reason, you may need to build Spark with newer Hadoop versions. Here is a list of requirements:
Azure Blob storage
- A Shared Key or Shared Access Signature (SAS)
- Spark 2.4.2 and above built with Hadoop version 2.9.1 and above (not 3.x)
- Delta Lake 0.2.0 and above
Azure Data Lake Storage Gen1
- A Service Principal for OAuth 2.0 access
- Spark 2.4.2 and above built with Hadoop version 2.9.1 and above (not 3.x)
- Delta Lake 0.2.0 and above
Refer to Specifying the Hadoop Version and Enabling YARN for building Spark with a specific Hadoop version and Delta Lake Quickstart for setting up Spark with Delta Lake.
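As a sketch, one way to build Spark 2.4.x against Hadoop 2.9.2 looks like the following (the exact profile name and flags depend on your Spark version, so treat this as an assumption to verify against the Spark build documentation):

```shell
# From the Spark source root: select the Hadoop 2.x profile and
# override the Hadoop version within it.
./build/mvn -Phadoop-2.7 -Dhadoop.version=2.9.2 -DskipTests clean package
```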
Quickstart
This is a quick guide on setting up Delta Lake on Azure Data Lake Storage Gen1. For detailed configurations on both Azure Data Lake Storage Gen1 and Azure Blob storage, see the subsequent sections.
Launch a Spark shell from your Spark home directory with your ADLS credentials (assuming your Spark is built with Scala 2.11 and Hadoop 2.9.2):
```shell
bin/spark-shell \
  --packages io.delta:delta-core_2.11:0.2.0,org.apache.hadoop:hadoop-azure-datalake:2.9.2 \
  --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.AzureLogStore \
  --conf spark.hadoop.dfs.adls.oauth2.access.token.provider.type=ClientCredential \
  --conf spark.hadoop.dfs.adls.oauth2.client.id=<your-oauth2-client-id> \
  --conf spark.hadoop.dfs.adls.oauth2.credential=<your-oauth2-credential> \
  --conf spark.hadoop.dfs.adls.oauth2.refresh.url=https://login.microsoftonline.com/<your-directory-id>/oauth2/token
```
Try out some basic Delta table operations on ADLS Gen 1:
```scala
// Create a Delta table on ADLS Gen1:
spark.range(5).write.format("delta").save("adl://<your-adls-account>.azuredatalakestore.net/<path>/<to>/<delta-table>")

// Read a Delta table on ADLS Gen1:
spark.read.format("delta").load("adl://<your-adls-account>.azuredatalakestore.net/<path>/<to>/<delta-table>")
```
For other languages and more examples of Delta table operations, see Delta Lake Quickstart.
Configuration for Azure Data Lake Storage Gen1
Here are the steps to configure Delta Lake on Azure Data Lake Storage Gen1.
1. Configure the LogStore implementation.

   Set the spark.delta.logStore.class Spark configuration property:

   ```
   spark.delta.logStore.class=org.apache.spark.sql.delta.storage.AzureLogStore
   ```

   The AzureLogStore implementation works for all storage services in Azure and supports multi-cluster concurrent writes. This is an application property: it must be set before starting the SparkContext and cannot be changed during the lifetime of the context.

2. Include the hadoop-azure-datalake JAR in the classpath.

   Delta Lake requires version 2.9.1 or above of this package on Hadoop 2, or 3.0.1 or above on Hadoop 3. Make sure the version you use matches the Hadoop version with which Spark was built.

3. Set up Azure Data Lake Storage Gen1 credentials.

   You can set the following Hadoop configurations with your credentials (in Scala):

   ```scala
   sc.hadoopConfiguration.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
   sc.hadoopConfiguration.set("dfs.adls.oauth2.client.id", "<your-oauth2-client-id>")
   sc.hadoopConfiguration.set("dfs.adls.oauth2.credential", "<your-oauth2-credential>")
   sc.hadoopConfiguration.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")
   ```
Configuration for Azure Blob storage
Here are the steps to configure Delta Lake on Azure Blob storage.
1. Configure the LogStore implementation.

   Set the spark.delta.logStore.class Spark configuration property:

   ```
   spark.delta.logStore.class=org.apache.spark.sql.delta.storage.AzureLogStore
   ```

   The AzureLogStore implementation works for all Azure storage services and supports multi-cluster concurrent writes. This is an application property: it must be set before starting the SparkContext and cannot be changed during the lifetime of the context.

2. Include the hadoop-azure JAR in the classpath.

   Delta Lake requires version 2.9.1 or above of this package on Hadoop 2, or 3.0.1 or above on Hadoop 3. Make sure the version you use matches the Hadoop version with which Spark was built.

3. Set up credentials.

   You can set up your credentials in the Spark configuration properties. We recommend that you use a SAS token. In Scala:

   ```scala
   spark.conf.set(
     "fs.azure.sas.<your-container-name>.<your-storage-account-name>.blob.core.windows.net",
     "<complete-query-string-of-your-sas-for-the-container>")
   ```

   Or you can specify an account access key:

   ```scala
   spark.conf.set(
     "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
     "<your-storage-account-access-key>")
   ```

4. Access data on your Azure Blob storage account.

   Use Delta Lake to read and write data in your storage account. For example, in Scala:

   ```scala
   // Write a Delta table to Azure Blob storage:
   spark.range(5).write.format("delta").save("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<path>/<to>/<delta-table>")

   // Read a Delta table from Azure Blob storage:
   spark.read.format("delta").load("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<path>/<to>/<delta-table>")
   ```
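The wasbs:// URIs above encode both the container and the storage account name. A small helper (hypothetical, not part of Delta Lake or Hadoop) makes the layout explicit and avoids hand-assembling the string:

```scala
// Hypothetical helper: assembles a wasbs:// URI from its parts so the
// container/account layout is explicit. Not part of any library API.
def wasbsPath(container: String, account: String, path: String): String =
  s"wasbs://$container@$account.blob.core.windows.net/${path.stripPrefix("/")}"
```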
For other languages and more examples of Delta table operations, see Delta Lake Quickstart.