Storage configuration

Delta Lake ACID guarantees are predicated on the atomicity and durability guarantees of the storage system. Specifically, Delta Lake relies on the following when interacting with storage systems:

  • Atomic visibility: There must be a way for a file to be visible in its entirety or not visible at all.
  • Mutual exclusion: Only one writer must be able to create (or rename) a file at the final destination.
  • Consistent listing: Once a file has been written in a directory, all future listings for that directory must return that file.

Because storage systems do not necessarily provide all of these guarantees out-of-the-box, Delta Lake transactional operations typically go through the LogStore API instead of accessing the storage system directly. To provide the ACID guarantees for different storage systems, you can plug in custom LogStore implementations. Delta Lake has a built-in LogStore implementation for HDFS since 0.1.0 and for Amazon S3 and Azure storage services since 0.2.0. If you are interested in adding a custom LogStore implementation for your storage system, you can start a discussion in the community mailing group.
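
For example, here is a minimal sketch (in Scala) of selecting a LogStore implementation when building a SparkSession; the application name is illustrative, and the same property can instead be passed with --conf when launching spark-shell or spark-submit:

import org.apache.spark.sql.SparkSession

// The LogStore class must be chosen before the SparkContext is created.
// HDFSLogStore, shown here, is the built-in default implementation.
val spark = SparkSession.builder()
  .appName("delta-logstore-example")  // illustrative application name
  .config("spark.delta.logStore.class",
    "org.apache.spark.sql.delta.storage.HDFSLogStore")
  .getOrCreate()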

This topic covers how to configure Delta Lake on HDFS, Amazon S3, and Azure storage services.

HDFS

You can use Delta Lake to read and write data on HDFS. Delta Lake supports concurrent reads and writes from multiple clusters.

Configuration for HDFS

You can use Delta Lake on HDFS out-of-the-box, as the default implementation of LogStore is HDFSLogStore, which accesses HDFS through Hadoop’s FileContext APIs. If you have configured a different LogStore implementation before, you can unset the spark.delta.logStore.class Spark configuration property or set it as follows:

spark.delta.logStore.class=org.apache.spark.sql.delta.storage.HDFSLogStore

This configures Delta Lake to choose HDFSLogStore as the LogStore implementation. This is an application property; it must be set before starting the SparkContext and cannot be changed during the lifetime of the context.
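
Once configured, reading and writing Delta tables on HDFS looks the same as on any other storage system. Here is a minimal sketch (in Scala); the NameNode host, port, and table path are placeholders:

// Write a Delta table to HDFS (path components are placeholders):
spark.range(5).write.format("delta").save("hdfs://<namenode>:<port>/<path-to-delta-table>")

// Read the Delta table back from HDFS:
spark.read.format("delta").load("hdfs://<namenode>:<port>/<path-to-delta-table>").show()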

Amazon S3

You can create, read, and write Delta tables on S3. Delta Lake supports concurrent reads from multiple clusters, but concurrent writes to S3 must originate from a single Spark driver in order for Delta Lake to provide transactional guarantees.

Warning

Concurrent writes to the same Delta table from multiple Spark drivers can lead to data loss.

Requirements

  • S3 credentials: IAM roles (recommended) or access keys
  • Delta Lake 0.2.0 and above
  • Apache Spark associated with the corresponding Delta Lake version; see the Quick Start page of the relevant Delta version’s documentation for details.

Quickstart

This section explains how to quickly start reading and writing Delta tables on S3. For a detailed explanation of the configuration, see Configure for S3.

  1. Use the following command to launch a Spark shell with Delta Lake and S3 support (assuming you use Spark pre-built for Hadoop 2.7):

    bin/spark-shell \
     --packages io.delta:delta-core_2.12:0.7.0,org.apache.hadoop:hadoop-aws:2.7.7 \
     --conf spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore \
     --conf spark.hadoop.fs.s3a.access.key=<your-s3-access-key> \
     --conf spark.hadoop.fs.s3a.secret.key=<your-s3-secret-key>
    
  2. Try out some basic Delta table operations on S3 (in Scala):

    // Create a Delta table on S3:
    spark.range(5).write.format("delta").save("s3a://<your-s3-bucket>/<path>/<to>/<delta-table>")
    
    // Read a Delta table on S3:
    spark.read.format("delta").load("s3a://<your-s3-bucket>/<path>/<to>/<delta-table>")
    

For other languages and more examples of Delta table operations, see Quickstart.

Configure for S3

Here are the steps to configure Delta Lake for S3.

  1. Configure LogStore implementation.

    Set the spark.delta.logStore.class Spark configuration property:

    spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
    

    As the name suggests, the S3SingleDriverLogStore implementation only works properly when all concurrent writes originate from a single Spark driver. This is an application property; it must be set before starting the SparkContext and cannot be changed during the lifetime of the context.

  2. Include hadoop-aws JAR in the classpath.

    Delta Lake needs the org.apache.hadoop.fs.s3a.S3AFileSystem class from the hadoop-aws package, which implements Hadoop’s FileSystem API for S3. Make sure the version of this package matches the Hadoop version with which Spark was built.

  3. Set up S3 credentials.

    We recommend using IAM roles for authentication and authorization. But if you want to use keys, here is one way to set the Hadoop configurations (in Scala):

    sc.hadoopConfiguration.set("fs.s3a.access.key", "<your-s3-access-key>")
    sc.hadoopConfiguration.set("fs.s3a.secret.key", "<your-s3-secret-key>")
    
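Combining the steps above, here is a minimal sketch (in Scala) that sets both the LogStore implementation and S3A credentials while building the SparkSession; the bucket name and keys below are placeholders, the hadoop-aws JAR is assumed to be on the classpath, and when IAM roles are used the two fs.s3a.* key settings can be omitted:

import org.apache.spark.sql.SparkSession

// All names and keys below are placeholders; with IAM roles (recommended),
// the two spark.hadoop.fs.s3a.* settings are unnecessary.
val spark = SparkSession.builder()
  .appName("delta-on-s3")  // illustrative application name
  .config("spark.delta.logStore.class",
    "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
  .config("spark.hadoop.fs.s3a.access.key", "<your-s3-access-key>")
  .config("spark.hadoop.fs.s3a.secret.key", "<your-s3-secret-key>")
  .getOrCreate()

// Round-trip a small Delta table on S3:
spark.range(5).write.format("delta").save("s3a://<your-s3-bucket>/<path-to-delta-table>")
spark.read.format("delta").load("s3a://<your-s3-bucket>/<path-to-delta-table>").show()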

Microsoft Azure storage

You can create, read, and write Delta tables on Azure Blob storage, Azure Data Lake Storage Gen1, and Azure Data Lake Storage Gen2. Delta Lake supports concurrent writes from multiple clusters.

Delta Lake relies on Hadoop FileSystem APIs to access Azure storage services. Specifically, Delta Lake requires the implementation of FileSystem.rename() to be atomic, which is only supported in newer Hadoop versions (HADOOP-15156 and HADOOP-15086). For this reason, you may need to build Spark with a newer Hadoop version and use it for deploying your application. Refer to Specifying the Hadoop Version and Enabling YARN for building Spark with a specific Hadoop version and Quickstart for setting up Spark with Delta Lake.

Here is a list of requirements specific to each type of Azure storage system:

Azure Blob storage

Requirements

  • A shared key or shared access signature (SAS)
  • Delta Lake 0.2.0 and above
  • Hadoop’s Azure Blob Storage libraries for deployment with the following versions:
    • 2.9.1+ for Hadoop 2
    • 3.0.1+ for Hadoop 3
  • Apache Spark associated with the corresponding Delta Lake version (see the Quick Start page of the relevant Delta version’s documentation) and compiled with a Hadoop version that is compatible with the chosen Hadoop libraries.

For example, a possible combination that will work is Delta 0.7.0 or above, along with Apache Spark 3.0 compiled and deployed with Hadoop 3.2.

Configuration

Here are the steps to configure Delta Lake on Azure Blob storage.

  1. Configure LogStore implementation.

    Set the spark.delta.logStore.class Spark configuration property:

    spark.delta.logStore.class=org.apache.spark.sql.delta.storage.AzureLogStore
    

    The AzureLogStore implementation works for all Azure storage services and supports multi-cluster concurrent writes. This is an application property; it must be set before starting the SparkContext and cannot be changed during the lifetime of the context.

  2. Include hadoop-azure JAR in the classpath. See the requirements above for version details.

  3. Set up credentials.

    You can set up your credentials as Spark configuration properties.

    We recommend that you use a SAS token. In Scala, you can use the following:

    spark.conf.set(
      "fs.azure.sas.<your-container-name>.<your-storage-account-name>.blob.core.windows.net",
       "<complete-query-string-of-your-sas-for-the-container>")
    

    Or you can specify an account access key:

    spark.conf.set(
      "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
       "<your-storage-account-access-key>")
    

Usage

// Write a Delta table to Azure Blob storage:
spark.range(5).write.format("delta").save("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<path-to-delta-table>")

// Read the Delta table back:
spark.read.format("delta").load("wasbs://<your-container-name>@<your-storage-account-name>.blob.core.windows.net/<path-to-delta-table>")

Azure Data Lake Storage Gen1

Requirements

  • A service principal for OAuth 2.0 access
  • Delta Lake 0.2.0 and above
  • Hadoop’s Azure Data Lake Storage Gen1 libraries for deployment with the following versions:
    • 2.9.1+ for Hadoop 2
    • 3.0.1+ for Hadoop 3
  • Apache Spark associated with the corresponding Delta Lake version (see the Quick Start page of the relevant Delta version’s documentation) and compiled with a Hadoop version that is compatible with the chosen Hadoop libraries.

For example, a possible combination that will work is Delta 0.7.0 or above, along with Apache Spark 3.0 compiled and deployed with Hadoop 3.2.

Configuration

Here are the steps to configure Delta Lake on Azure Data Lake Storage Gen1.

  1. Configure LogStore implementation.

    Set the spark.delta.logStore.class Spark configuration property:

    spark.delta.logStore.class=org.apache.spark.sql.delta.storage.AzureLogStore
    

    The AzureLogStore implementation works for all storage services in Azure and supports multi-cluster concurrent writes. This is an application property; it must be set before starting the SparkContext and cannot be changed during the lifetime of the context.

  2. Include hadoop-azure-datalake JAR in the classpath. See the requirements above for version details.

  3. Set up Azure Data Lake Storage Gen1 credentials.

    You can set the following Hadoop configurations with your credentials (in Scala):

    spark.conf.set("dfs.adls.oauth2.access.token.provider.type", "ClientCredential")
    spark.conf.set("dfs.adls.oauth2.client.id", "<your-oauth2-client-id>")
    spark.conf.set("dfs.adls.oauth2.credential", "<your-oauth2-credential>")
    spark.conf.set("dfs.adls.oauth2.refresh.url", "https://login.microsoftonline.com/<your-directory-id>/oauth2/token")
    

Usage

spark.range(5).write.format("delta").save("adl://<your-adls-account>.azuredatalakestore.net/<path-to-delta-table>")

spark.read.format("delta").load("adl://<your-adls-account>.azuredatalakestore.net/<path-to-delta-table>")

Azure Data Lake Storage Gen2

Requirements

  • Account created in Azure Data Lake Storage Gen2
  • Service principal created and assigned the Storage Blob Data Contributor role for the storage account.
    • Note the storage-account-name, directory-id (also known as tenant-id), application-id, and password of the principal. These will be used for configuring Spark.
  • Delta Lake 0.7.0 and above
  • Apache Spark 3.0 or above, built with Hadoop 3.2 or above

For example, a possible combination that will work is Delta 0.7.0 or above, along with Apache Spark 3.0 compiled and deployed with Hadoop 3.2.

Configuration

Here are the steps to configure Delta Lake on Azure Data Lake Storage Gen2.

  1. Configure LogStore implementation.

    Set the spark.delta.logStore.class Spark configuration property:

    spark.delta.logStore.class=org.apache.spark.sql.delta.storage.AzureLogStore
    
  2. Include hadoop-azure-datalake JAR in the classpath. See the requirements above for version details. In addition, you may also have to include JARs for the Maven artifacts hadoop-azure and wildfly-openssl.

  3. Set up Azure Data Lake Storage Gen2 credentials.

     spark.conf.set("fs.azure.account.auth.type.<storage-account-name>.dfs.core.windows.net", "OAuth")
     spark.conf.set("fs.azure.account.oauth.provider.type.<storage-account-name>.dfs.core.windows.net",  "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider")
     spark.conf.set("fs.azure.account.oauth2.client.id.<storage-account-name>.dfs.core.windows.net", "<application-id>")
     spark.conf.set("fs.azure.account.oauth2.client.secret.<storage-account-name>.dfs.core.windows.net","<password>")
     spark.conf.set("fs.azure.account.oauth2.client.endpoint.<storage-account-name>.dfs.core.windows.net", "https://login.microsoftonline.com/<directory-id>/oauth2/token")
    

    where <storage-account-name>, <application-id>, <directory-id>, and <password> are details of the service principal created as part of the requirements above.

  4. Initialize the file system if needed.

    spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "true")
    dbutils.fs.ls("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/")
    spark.conf.set("fs.azure.createRemoteFileSystemDuringInitialization", "false")

Usage

spark.range(5).write.format("delta").save("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-delta-table>")

spark.read.format("delta").load("abfss://<container-name>@<storage-account-name>.dfs.core.windows.net/<path-to-delta-table>")

where <container-name> is the name of the file system (container) in your ADLS Gen2 storage account.