Storage configuration
Delta Lake ACID guarantees are predicated on the atomicity and durability guarantees of the storage system. Specifically, Delta Lake relies on the following when interacting with storage systems:
- Atomic visibility: There must be a way for a file to be visible in its entirety or not visible at all.
- Mutual exclusion: Only one writer must be able to create (or rename) a file at the final destination.
- Consistent listing: Once a file has been written in a directory, all future listings for that directory must return that file.
Because storage systems do not necessarily provide all of these guarantees out-of-the-box, Delta Lake transactional operations typically go through the LogStore API instead of accessing the storage system directly. To provide the ACID guarantees for different storage systems, you can plug in custom LogStore implementations. Delta Lake has a built-in LogStore implementation for HDFS since 0.1.0 and for Amazon S3 and Azure storage services since 0.2.0. If you are interested in adding a custom LogStore implementation for your storage system, you can start a discussion in the community mailing group.
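Whichever implementation you use, it is selected through the same Spark property, so a custom LogStore is wired in the same way as the built-in ones described below. Here is a minimal sketch in Scala; com.example.MyCustomLogStore is a hypothetical class name used only for illustration:
import org.apache.spark.sql.SparkSession

// Point Delta Lake at a LogStore implementation before the SparkContext starts.
// com.example.MyCustomLogStore is a placeholder; the built-in classes used in
// this topic are HDFSLogStore, S3SingleDriverLogStore, and AzureLogStore.
val spark = SparkSession.builder()
  .appName("delta-custom-logstore")
  .config("spark.delta.logStore.class", "com.example.MyCustomLogStore")
  .getOrCreate()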
This topic covers how to configure Delta Lake on HDFS, Amazon S3, and Azure storage services.
HDFS
You can use Delta Lake to read and write data on HDFS. Delta Lake supports concurrent reads and writes from multiple clusters.
Configuration
You can use Delta Lake on HDFS out-of-the-box, as the default implementation of LogStore is HDFSLogStore, which accesses HDFS through Hadoop’s FileContext APIs. If you have configured a different LogStore implementation before, you can unset the spark.delta.logStore.class Spark configuration property or set it as follows:
spark.delta.logStore.class=org.apache.spark.sql.delta.storage.HDFSLogStore
This configures Delta Lake to choose HDFSLogStore as the LogStore implementation. This is an application property: it must be set before starting SparkContext and cannot be changed during the lifetime of the context.
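Once the property is in place, Delta table operations on an HDFS path need no further setup. The following is a minimal sketch in Scala; the namenode address and table path are placeholders:
// Create a Delta table on HDFS (namenode and path are placeholders):
spark.range(5).write.format("delta").save("hdfs://<namenode>:<port>/<path-to-delta-table>")

// Read the Delta table back:
spark.read.format("delta").load("hdfs://<namenode>:<port>/<path-to-delta-table>")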
Amazon S3
You can create, read, and write Delta tables on S3. Delta Lake supports concurrent reads from multiple clusters, but concurrent writes to S3 must originate from a single Spark driver in order for Delta Lake to provide transactional guarantees.
Warning
Concurrent writes to the same Delta table from multiple Spark drivers can lead to data loss.
Requirements
- S3 credentials: IAM roles (recommended) or access keys
- Delta Lake 0.2.0 or above
- Apache Spark associated with the corresponding Delta Lake version; see the Quick Start page of the relevant Delta version’s documentation for details.
Quickstart
This section explains how to quickly start reading and writing Delta tables on S3. For a detailed explanation of the configuration, see Configuration.
Use the following command to launch a Spark shell with Delta Lake and S3 support (assuming you use Spark pre-built for Hadoop 2.7):
bin/spark-shell \
  --packages io.delta:delta-core_2.12:0.7.0,org.apache.hadoop:hadoop-aws:2.7.7 \
  --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
  --conf spark.hadoop.fs.s3a.access.key=<your-s3-access-key> \
  --conf spark.hadoop.fs.s3a.secret.key=<your-s3-secret-key>
Try out some basic Delta table operations on S3 (in Scala):
// Create a Delta table on S3:
spark.range(5).write.format("delta").save("s3a://<your-s3-bucket>/<path-to-delta-table>")

// Read a Delta table on S3:
spark.read.format("delta").load("s3a://<your-s3-bucket>/<path-to-delta-table>")
For other languages and more examples of Delta table operations, see Quickstart.
Configuration
Here are the steps to configure Delta Lake for S3.
1. Configure LogStore implementation. Set the spark.delta.logStore.class Spark configuration property:
spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
As the name suggests, the S3SingleDriverLogStore implementation only works properly when all concurrent writes originate from a single Spark driver. This is an application property: it must be set before starting SparkContext and cannot be changed during the lifetime of the context.
2. Include the hadoop-aws JAR in the classpath. Delta Lake needs the org.apache.hadoop.fs.s3a.S3AFileSystem class from the hadoop-aws package, which implements Hadoop’s FileSystem API for S3. Make sure the version of this package matches the Hadoop version with which Spark was built.
3. Set up S3 credentials. We recommend using IAM roles for authentication and authorization. If you want to use access keys instead, here is one way to set up the Hadoop configurations (in Scala); a session-level alternative is sketched after these steps:
sc.hadoopConfiguration.set("fs.s3a.access.key", "<your-s3-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<your-s3-secret-key>")
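If you prefer to set everything before the SparkContext starts, the same properties can be supplied when the session is built. This is a sketch in Scala, relying on the fact that spark.hadoop.-prefixed properties are forwarded to the Hadoop configuration, just as the --conf flags in the quickstart command are:
import org.apache.spark.sql.SparkSession

// Sketch: configure the LogStore and S3 credentials at session construction time.
// The spark.hadoop. prefix forwards the fs.s3a.* settings to Hadoop, mirroring the
// --conf flags used in the quickstart command above.
val spark = SparkSession.builder()
  .config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
  .config("spark.hadoop.fs.s3a.access.key", "<your-s3-access-key>")
  .config("spark.hadoop.fs.s3a.secret.key", "<your-s3-secret-key>")
  .getOrCreate()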
Microsoft Azure storage
You can create, read, and write Delta tables on Azure Blob storage and Azure Data Lake Storage Gen1. Delta Lake supports concurrent writes from multiple clusters.
Delta Lake relies on Hadoop FileSystem APIs to access Azure storage services. Specifically, Delta Lake requires the implementation of FileSystem.rename() to be atomic, which is only supported in newer Hadoop versions (HADOOP-15156 and HADOOP-15086). For this reason, you may need to build Spark with newer Hadoop versions and use them for deploying your application. Refer to Specifying the Hadoop Version and Enabling YARN for building Spark with a specific Hadoop version and Quickstart for setting up Spark with Delta Lake.
Here is a list of requirements specific to each type of Azure storage system:
Azure Blob storage
Requirements
- A shared key or shared access signature (SAS)
- Delta Lake 0.2.0 or above
- Hadoop’s Azure Blob Storage libraries for deployment with the following versions:
- 2.9.1+ for Hadoop 2
- 3.0.1+ for Hadoop 3
- Apache Spark associated with the corresponding Delta Lake version (see the Quick Start page of the relevant Delta version’s documentation) and compiled with a Hadoop version that is compatible with the chosen Hadoop libraries.
For example, a possible combination that will work is Delta 0.7.0 or above, along with Apache Spark 3.0 compiled and deployed with Hadoop 3.2.
Configuration
Here are the steps to configure Delta Lake on Azure Blob storage.
1. Configure LogStore implementation. Set the spark.delta.logStore.class Spark configuration property:
spark.delta.logStore.class=org.apache.spark.sql.delta.storage.AzureLogStore
The AzureLogStore implementation works for all Azure storage services and supports multi-cluster concurrent writes. This is an application property: it must be set before starting SparkContext and cannot be changed during the lifetime of the context.
2. Include the hadoop-azure JAR in the classpath. See the requirements above for version details.
3. Set up credentials. You can set up your credentials in the Spark configuration properties. We recommend that you use a SAS token. In Scala, you can use the following:
spark.conf.set(
  "fs.azure.sas.<your-container-name>.<your-storage-account-name>.blob.core.windows.net",
  "<complete-query-string-of-your-sas-for-the-container>")
Or you can specify an account access key:
spark.conf.set(
  "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
  "<your-storage-account-access-key>")
Usage
spark.range(5).write.format("delta").save("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<path-to-delta-table>")
spark.read.format("delta").load("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<path-to-delta-table>")
Azure Data Lake Storage Gen1
Requirements
- A service principal for OAuth 2.0 access
- Delta Lake 0.2.0 or above
- Hadoop’s Azure Data Lake Storage Gen1 libraries for deployment with the following versions:
- 2.9.1+ for Hadoop 2
- 3.0.1+ for Hadoop 3
- Apache Spark associated with the corresponding Delta Lake version (see the Quick Start page of the relevant Delta version’s documentation) and compiled with a Hadoop version that is compatible with the chosen Hadoop libraries.
For example, a possible combination that will work is Delta 0.7.0 or above, along with Apache Spark 3.0 compiled and deployed with Hadoop 3.2.
Configuration
Here are the steps to configure Delta Lake on Azure Data Lake Storage Gen1.
1. Configure LogStore implementation. Set the spark.delta.logStore.class Spark configuration property:
spark.delta.logStore.class=org.apache.spark.sql.delta.storage.AzureLogStore
The AzureLogStore implementation works for all Azure storage services and supports multi-cluster concurrent writes. This is an application property: it must be set before starting SparkContext and cannot be changed during the lifetime of the context.
2. Include the hadoop-azure-datalake JAR in the classpath. See the requirements above for version details.
3. Set up Azure Data Lake Storage Gen1 credentials.
You can set the following Hadoop configurations with your credentials (in Scala):
spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential") spark.conf.set("dfs.adls.oauth2.client.id", "<your-oauth2-client-id>") spark.conf.set("dfs.adls.oauth2.credential", "<your-oauth2-credential>") spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")
Usage
spark.range(5).write.format("delta").save("adl://<your-adls-account>.azuredatalakestore.net/<path-to-delta-table>")
spark.read.format("delta").load("adl://<your-adls-account>.azuredatalakestore.net/<path-to-delta-table>")
Azure Data Lake Storage Gen2
Requirements
- An account created in Azure Data Lake Storage Gen2
- Service principal created and assigned the Storage Blob Data Contributor role for the storage account.
- Note the storage-account-name, directory-id (also known as tenant-id), application-id, and password of the principal. These will be used for configuring Spark.
- Delta Lake 0.7.0 or above
- Apache Spark 3.0 or above
- The Apache Spark build used must include Hadoop 3.2 or above.
For example, a possible combination that will work is Delta 0.7.0 or above, along with Apache Spark 3.0 compiled and deployed with Hadoop 3.2.
Configuration
Here are the steps to configure Delta Lake on Azure Data Lake Storage Gen2.
1. Configure LogStore implementation. Set the spark.delta.logStore.class Spark configuration property:
spark.delta.logStore.class=org.apache.spark.sql.delta.storage.AzureLogStore
2. Include the hadoop-azure-datalake JAR in the classpath. See the requirements above for version details. In addition, you may also have to include JARs for the Maven artifacts hadoop-azure and wildfly-openssl.
3. Set up Azure Data Lake Storage Gen2 credentials (in Scala):
spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth") spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net", "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider") spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>") spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net","<password>") spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
where <storage-account-name>, <application-id>, <directory-id>, and <password> are details of the service principal we set up as requirements earlier.
4. Initialize the file system if needed (a non-Databricks sketch follows these steps):
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/")
spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")
Usage
spark.range(5).write.format("delta").save("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-delta-table>")
spark.read.format("delta").load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-delta-table>")
where <container-name> is the name of the file system (container) in the storage account.