Storage Configuration

Delta Lake ACID guarantees are predicated on the atomicity and durability guarantees of the storage system. Specifically, Delta Lake relies on the following when interacting with storage systems:

Atomic visibility

There must be a way for a file to be visible in its entirety or not visible at all.

Mutual exclusion

Only one writer must be able to create (or rename) a file at the final destination.

Consistent listing

Once a file has been written in a directory, all future listings for that directory must return that file.
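
To make these three requirements concrete, here is a minimal sketch on a local file system using `java.nio.file`. It is an illustration only and does not show how Delta Lake or any LogStore is implemented; the object `StorageGuaranteesSketch` and its method names are hypothetical, chosen for this example.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{FileAlreadyExistsException, Files, Path, StandardCopyOption}

// Hypothetical helpers that illustrate the three guarantees on a local file system.
object StorageGuaranteesSketch {

  // Mutual exclusion: Files.createFile checks for existence and creates the
  // file as a single atomic step, so only one caller can create a given path.
  def tryClaim(target: Path): Boolean =
    try {
      Files.createFile(target)
      true
    } catch {
      case _: FileAlreadyExistsException => false
    }

  // Atomic visibility: write the full content to a temporary file in the same
  // directory, then rename it, so readers never see a partially written file.
  // Note: an atomic move alone does not give mutual exclusion, because it may
  // replace an existing target on some file systems.
  def publishAtomically(dir: Path, name: String, content: String): Path = {
    val tmp = Files.createTempFile(dir, ".tmp-", ".part")
    Files.write(tmp, content.getBytes(StandardCharsets.UTF_8))
    Files.move(tmp, dir.resolve(name), StandardCopyOption.ATOMIC_MOVE)
  }

  // Consistent listing: once a file is written, listing its directory returns it.
  def listNames(dir: Path): Set[String] =
    Option(dir.toFile.list()).map(_.toSet).getOrElse(Set.empty[String])
}
```

A storage system that cannot offer equivalents of all three operations needs a LogStore implementation that compensates for the missing ones.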

Because storage systems do not necessarily provide all of these guarantees out of the box, Delta Lake transactional operations typically go through the LogStore API instead of accessing the storage system directly. You can plug in custom LogStore implementations to provide the above guarantees for different storage systems. Delta Lake has had a built-in LogStore implementation for HDFS since 0.1.0 and for Amazon S3 and Azure storage services since 0.2.0. If you are interested in adding a custom LogStore implementation for your storage system, you can start a discussion in the community mailing group.
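
As a rough illustration of what a pluggable log store has to provide, here is a deliberately simplified sketch. The `MiniLogStore` trait and the `InMemoryLogStore` class are hypothetical names invented for this example and are not Delta Lake's actual LogStore interface; an in-memory sorted map stands in for a storage system so that the guarantees are easy to see.

```scala
import java.io.FileNotFoundException
import java.nio.file.FileAlreadyExistsException
import java.util.concurrent.ConcurrentSkipListMap

// Hypothetical, simplified interface; Delta Lake's real LogStore API differs.
trait MiniLogStore {
  def read(path: String): Seq[String]
  def write(path: String, lines: Seq[String]): Unit
  def listFrom(path: String): Seq[String]
}

// In-memory stand-in for a storage system, used only for illustration.
final class InMemoryLogStore extends MiniLogStore {
  private val files = new ConcurrentSkipListMap[String, Vector[String]]()

  override def read(path: String): Seq[String] =
    Option(files.get(path)).getOrElse(throw new FileNotFoundException(path))

  override def write(path: String, lines: Seq[String]): Unit = {
    // putIfAbsent is atomic: the whole content becomes visible at once
    // (atomic visibility) and only the first writer wins (mutual exclusion).
    val previous = files.putIfAbsent(path, lines.toVector)
    if (previous != null) throw new FileAlreadyExistsException(path)
  }

  override def listFrom(path: String): Seq[String] = {
    // Every committed path at or after `path` is returned, in sorted order
    // (consistent listing).
    val it = files.tailMap(path, true).keySet().iterator()
    val names = Vector.newBuilder[String]
    while (it.hasNext) names += it.next()
    names.result()
  }
}
```

In this sketch a second `write` to the same path fails with `FileAlreadyExistsException`, which is the behavior a transaction log needs in order to detect a conflicting commit.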

This topic covers how to configure Delta Lake on HDFS, Amazon S3, and Azure storage services.

HDFS

You can use Delta Lake to read and write data on HDFS. Delta Lake supports concurrent reads and writes from multiple clusters.

Requirements

Apache Spark version 2.4.2 and above
Delta Lake 0.1.0 and above

Configuration for HDFS

You can use Delta Lake on HDFS out-of-the-box, as the default implementation of LogStore is HDFSLogStore, which accesses HDFS through Hadoop’s FileContext APIs. If you have configured a different LogStore implementation before, you can unset the spark.delta.logStore.class Spark configuration property or set it as follows:

spark.delta.logStore.class=org.apache.spark.sql.delta.storage.HDFSLogStore

This configures Delta Lake to choose HDFSLogStore as the LogStore implementation. This is an application property, must be set before starting SparkContext, and cannot change during the lifetime of the context.
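
Because the property has to be in place before the SparkContext starts, one option in a standalone application is to set it on the session builder. The following is a minimal sketch, assuming Spark 2.4.2 or above and Delta Lake are on the classpath; the object name, application name, and `hdfs://` path are placeholders for this example.

```scala
import org.apache.spark.sql.SparkSession

object HdfsDeltaExample {
  def main(args: Array[String]): Unit = {
    // Set the LogStore class before the session (and its SparkContext) is created.
    // If a SparkContext already exists, getOrCreate() reuses it and this setting
    // comes too late to take effect.
    val spark = SparkSession.builder()
      .appName("delta-on-hdfs")
      .config("spark.delta.logStore.class",
        "org.apache.spark.sql.delta.storage.HDFSLogStore")
      .getOrCreate()

    // Placeholder path; replace it with a location on your cluster.
    val tablePath = "hdfs://<namenode>/<path>/<to>/<delta-table>"

    // Create a Delta table on HDFS, then read it back.
    spark.range(5).write.format("delta").save(tablePath)
    spark.read.format("delta").load(tablePath).show()

    spark.stop()
  }
}
```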

Amazon S3

You can create, read, and write Delta tables on S3. Delta Lake supports concurrent reads from multiple clusters, but concurrent writes to S3 must originate from a single Spark driver in order for Delta Lake to provide transactional guarantees.

Requirements

S3 credentials: IAM roles (recommended) or access keys
Apache Spark version 2.4.2 and above
Delta Lake 0.2.0 and above

Quickstart

This is a quick guide on how to start reading and writing Delta tables on S3. For a detailed explanation of the configuration, refer to the next section.

Use the following command to launch a Spark shell with Delta Lake and S3 support (assuming you use Spark 2.4.3 pre-built for Hadoop 2.7):

bin/spark-shell \
 --packages io.delta:delta-core_2.11:0.2.0,org.apache.hadoop:hadoop-aws:2.7.7 \
 --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
 --conf spark.hadoop.fs.s3a.access.key=<your-s3-access-key> \
 --conf spark.hadoop.fs.s3a.secret.key=<your-s3-secret-key>

Try out some basic Delta table operations on S3 (in Scala):

// Create a Delta table on S3:
spark.range(5).write.format("delta").save("s3a://<your-s3-bucket>/<path>/<to>/<delta-table>")

// Read a Delta table on S3:
spark.read.format("delta").load("s3a://<your-s3-bucket>/<path>/<to>/<delta-table>")

For other languages and more examples of Delta table operations, see Delta Lake quickstart.

Configuration for S3

Here are the steps to configure Delta Lake for S3.

Configure LogStore implementation.

Set the spark.delta.logStore.class Spark configuration property:
```
spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
```
As the name suggests, the S3SingleDriverLogStore implementation only works properly when all concurrent writes originate from a single Spark driver. This is an application property, must be set before starting SparkContext, and cannot change during the lifetime of the context.

Include hadoop-aws JAR in the classpath.

Delta Lake needs the org.apache.hadoop.fs.s3a.S3AFileSystem class from the hadoop-aws package, which implements Hadoop’s FileSystem API for S3. Make sure the version of this package matches the Hadoop version with which Spark was built.

Set up S3 credentials.

We recommend using IAM roles for authentication and authorization. If you want to use access keys instead, one way is to set the following Hadoop configurations (in Scala):
```
sc.hadoopConfiguration.set("fs.s3a.access.key", "<your-s3-access-key>")
sc.hadoopConfiguration.set("fs.s3a.secret.key", "<your-s3-secret-key>")
```
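
To avoid hard-coding keys in source code, the same two settings could be derived from environment variables. This is a sketch under stated assumptions: the helper `S3ACredentials` is hypothetical, and the variable names `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` are the conventional AWS names, assumed here rather than required by Delta Lake.

```scala
// Hypothetical helper: builds the two S3A Hadoop settings from an environment map.
object S3ACredentials {
  def hadoopSettings(env: Map[String, String]): Map[String, String] = {
    val settings = for {
      accessKey <- env.get("AWS_ACCESS_KEY_ID")
      secretKey <- env.get("AWS_SECRET_ACCESS_KEY")
    } yield Map(
      "fs.s3a.access.key" -> accessKey,
      "fs.s3a.secret.key" -> secretKey)

    // Return nothing unless both keys are present, so a half-configured
    // environment never produces a partial credential set.
    settings.getOrElse(Map.empty[String, String])
  }
}
```

In a Spark shell, the result could then be applied with `S3ACredentials.hadoopSettings(sys.env).foreach { case (k, v) => sc.hadoopConfiguration.set(k, v) }`.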

Microsoft Azure storage

You can create, read, and write Delta tables on Azure Blob storage and Azure Data Lake Storage Gen1. Delta Lake supports concurrent writes from multiple clusters.

Requirements

Delta Lake relies on Hadoop FileSystem APIs to access Azure storage services. Specifically, Delta Lake requires the implementation of FileSystem.rename() to be atomic, which is only supported in newer Hadoop versions (HADOOP-15156 and HADOOP-15086). For this reason, you may need to build Spark with a newer Hadoop version. Here is the list of requirements:

Azure Blob storage

A Shared Key or Shared Access Signature (SAS)
Spark 2.4.2 and above built with Hadoop version 2.9.1 and above (not 3.x)
Delta Lake 0.2.0 and above

Azure Data Lake Storage Gen1

A Service Principal for OAuth 2.0 access
Spark 2.4.2 and above built with Hadoop version 2.9.1 and above (not 3.x)
Delta Lake 0.2.0 and above

To build Spark with a specific Hadoop version, refer to Specifying the Hadoop Version and Enabling YARN. To set up Spark with Delta Lake, see Delta Lake quickstart.
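
Before configuring Azure storage, it can help to confirm that the Hadoop version on the classpath satisfies the requirements listed above (2.9.1 and above, not 3.x). The following is a small sketch of such a check; the object `HadoopVersionCheck` and its method are hypothetical names for this example.

```scala
// Hypothetical helper that encodes the "2.9.1 and above (not 3.x)" requirement.
object HadoopVersionCheck {
  // Matches versions such as "2.9.2" or "2.9.1-SNAPSHOT".
  private val Version = """(\d+)\.(\d+)\.(\d+).*""".r

  def meetsAzureRequirement(hadoopVersion: String): Boolean =
    hadoopVersion match {
      case Version(major, minor, patch) =>
        major.toInt == 2 &&
          (minor.toInt > 9 || (minor.toInt == 9 && patch.toInt >= 1))
      case _ =>
        false // Unparseable version strings are treated as unsupported.
    }
}
```

In a Spark shell, `org.apache.hadoop.util.VersionInfo.getVersion` returns the Hadoop version string that could be passed to this function.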

Quickstart

This is a quick guide on setting up Delta Lake on Azure Data Lake Storage Gen1. For detailed configurations on both Azure Data Lake Storage Gen1 and Azure Blob storage, see the subsequent sections.

Launch a Spark shell from your Spark home directory with your ADLS credentials (assuming your Spark is built with Scala 2.11 and Hadoop 2.9.2):

bin/spark-shell \
  --packages io.delta:delta-core_2.11:0.2.0,org.apache.hadoop:hadoop-azure-datalake:2.9.2 \
  --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.AzureLogStore \
  --conf spark.hadoop.dfs.adls.oauth2.access.token.provider.type=ClientCredential \
  --conf spark.hadoop.dfs.adls.oauth2.client.id=<your-oauth2-client-id> \
  --conf spark.hadoop.dfs.adls.oauth2.credential=<your-oauth2-credential> \
  --conf spark.hadoop.dfs.adls.oauth2.refresh.url=https://login.microsoftonline.com/<your-directory-id>/oauth2/token

Try out some basic Delta table operations on ADLS Gen 1:

// Create a Delta table on ADLS Gen 1:
spark.range(5).write.format("delta").save("adl://<your-adls-account>.azuredatalakestore.net/<path>/<to>/<delta-table>")

// Read a Delta table on ADLS Gen 1:
spark.read.format("delta").load("adl://<your-adls-account>.azuredatalakestore.net/<path>/<to>/<delta-table>")

For other languages and more examples of Delta table operations, see Delta Lake quickstart.

Configure for Azure Data Lake Storage Gen1

Here are the steps to configure Delta Lake on Azure Data Lake Storage Gen1.

Configure LogStore implementation.

Set the spark.delta.logStore.class Spark configuration property:
```
spark.delta.logStore.class=org.apache.spark.sql.delta.storage.AzureLogStore
```
The AzureLogStore implementation works for all storage services in Azure and supports multi-cluster concurrent writes. This is an application property, must be set before starting SparkContext, and cannot change during the lifetime of the context.

Include hadoop-azure-datalake JAR in the classpath. Delta Lake requires version 2.9.1+ for Hadoop 2 and 3.0.1+ for Hadoop 3. Make sure the version you use for this package matches the Hadoop version with which Spark was built.

Set up Azure Data Lake Storage Gen1 credentials.

You can set the following Hadoop configurations with your credentials (in Scala):

sc.hadoopConfiguration.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
sc.hadoopConfiguration.set("dfs.adls.oauth2.client.id", "<your-oauth2-client-id>")
sc.hadoopConfiguration.set("dfs.adls.oauth2.credential", "<your-oauth2-credential>")
sc.hadoopConfiguration.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")
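
The quickstart passes these same settings on the command line with a `spark.hadoop.` prefix, which Spark copies into the Hadoop configuration. The sketch below shows the relationship between the two forms; the object `AdlsOAuthSettings` and its methods are hypothetical helpers written for this example.

```scala
// Hypothetical helper relating the Hadoop keys to their spark.hadoop.* form.
object AdlsOAuthSettings {
  // The four Hadoop settings shown above, built from the three credentials.
  def hadoopSettings(
      clientId: String,
      credential: String,
      directoryId: String): Map[String, String] = Map(
    "dfs.adls.oauth2.access.token.provider.type" -> "ClientCredential",
    "dfs.adls.oauth2.client.id" -> clientId,
    "dfs.adls.oauth2.credential" -> credential,
    "dfs.adls.oauth2.refresh.url" ->
      s"https://login.microsoftonline.com/${directoryId}/oauth2/token")

  // The same settings as Spark properties, as used with --conf in the quickstart.
  def asSparkConf(hadoop: Map[String, String]): Map[String, String] =
    hadoop.map { case (key, value) => s"spark.hadoop.${key}" -> value }
}
```

Either form can be used: apply the Hadoop keys with `sc.hadoopConfiguration.set`, or pass the prefixed keys as `--conf` values when launching Spark.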

Configure for Azure Blob storage

Here are the steps to configure Delta Lake on Azure Blob storage.

Configure LogStore implementation.

Set the spark.delta.logStore.class Spark configuration property:
```
spark.delta.logStore.class=org.apache.spark.sql.delta.storage.AzureLogStore
```
The AzureLogStore implementation works for all Azure storage services and supports multi-cluster concurrent writes. This is an application property, must be set before starting SparkContext, and cannot change during the lifetime of the context.

Include hadoop-azure JAR in the classpath. Delta Lake requires version 2.9.1+ for Hadoop 2 and 3.0.1+ for Hadoop 3. Make sure the version you use for this package matches the Hadoop version with which Spark was built.

Set up credentials.

You can set up your credentials through Spark configuration properties.

We recommend that you use a SAS token. In Scala, you can use the following:

spark.conf.set(
  "fs.azure.sas.<your-container-name>.<your-storage-account-name>.blob.core.windows.net",
   "<complete-query-string-of-your-sas-for-the-container>")

Or you can specify an account access key:

spark.conf.set(
  "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
   "<your-storage-account-access-key>")

Access data on your Azure Blob storage account.

Use Delta Lake to read and write data on your Azure Blob storage account. For example, in Scala:

spark.range(5).write.format("delta").save("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<path>/<to>/<delta-table>")
spark.read.format("delta").load("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<path>/<to>/<delta-table>")
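
The container and storage account names appear in both the credential keys and the `wasbs://` paths, so it can be convenient to derive all three strings from one place. The following is a minimal sketch; the case class `BlobLocation` and its members are hypothetical names for this example.

```scala
// Hypothetical helper that derives configuration keys and table paths
// from a container name and a storage account name.
final case class BlobLocation(container: String, account: String) {
  private val host = s"${account}.blob.core.windows.net"

  // Configuration key for a SAS token scoped to this container.
  def sasKey: String = s"fs.azure.sas.${container}.${host}"

  // Configuration key for the storage account access key.
  def accountKey: String = s"fs.azure.account.key.${host}"

  // wasbs:// URI for a table path inside the container.
  def tablePath(path: String): String = {
    val relative = path.stripPrefix("/")
    s"wasbs://${container}@${host}/${relative}"
  }
}
```

For example, with `val loc = BlobLocation("<your-container-name>", "<your-storage-account-name>")`, the SAS token could be set with `spark.conf.set(loc.sasKey, "<complete-query-string-of-your-sas-for-the-container>")` and a table written to `loc.tablePath("<path>/<to>/<delta-table>")`.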

For other languages and more examples of Delta table operations, see Delta Lake quickstart.