Welcome to Delta Lake’s Python documentation page¶
DeltaTable¶
- class delta.tables.DeltaTable(spark: SparkSession, jdt: JavaObject)¶
Main class for programmatically interacting with Delta tables. You can create DeltaTable instances using the path of the Delta table:
deltaTable = DeltaTable.forPath(spark, "/path/to/table")
In addition, you can convert an existing Parquet table in place into a Delta table:
deltaTable = DeltaTable.convertToDelta(spark, "parquet.`/path/to/table`")
Added in version 0.4.
- toDF() DataFrame¶
Get a DataFrame representation of this Delta table.
Added in version 0.4.
- alias(aliasName: str) DeltaTable¶
Apply an alias to the Delta table.
Added in version 0.4.
- generate(mode: str) None¶
Generate manifest files for the given delta table.
- Parameters:
mode – Mode for the type of manifest file to be generated. The valid modes are as follows (not case sensitive):
- "symlink_format_manifest": generates manifests in symlink format for Presto and Athena read support.
See the online documentation for more information.
Added in version 0.5.
- delete(condition: Column | str | None = None) None¶
Delete data from the table that matches the given condition.
Example:
from pyspark.sql.functions import col
deltaTable.delete("date < '2017-01-01'")        # predicate using SQL formatted string
deltaTable.delete(col("date") < "2017-01-01")   # predicate using Spark SQL functions
- Parameters:
condition (str or pyspark.sql.Column) – Optional condition of the delete
Added in version 0.4.
- update(condition: str | Column, set: Dict[str, str | Column]) None¶
- update(*, set: Dict[str, str | Column]) None
Update data from the table on the rows that match the given condition, applying the rules defined by set.
Example:
from pyspark.sql.functions import col, lit
# condition using SQL formatted string
deltaTable.update(
    condition = "eventType = 'clck'",
    set = { "eventType": "'click'" } )
# condition using Spark SQL functions
deltaTable.update(
    condition = col("eventType") == "clck",
    set = { "eventType": lit("click") } )
- Parameters:
condition (str or pyspark.sql.Column) – Optional condition of the update
set (dict with str as keys and str or pyspark.sql.Column as values) – Defines the rules of setting the values of columns that need to be updated. Note: This param is required. Default value None is present to allow positional args in same order across languages.
Added in version 0.4.
- merge(source: DataFrame, condition: str | Column) DeltaMergeBuilder¶
Merge data from the source DataFrame based on the given merge condition. This returns a DeltaMergeBuilder object that can be used to specify the update, delete, or insert actions to be performed on rows based on whether the rows matched the condition or not. See DeltaMergeBuilder for a full description of this operation and what combinations of update, delete and insert operations are allowed.
Example 1 with conditions and update expressions as SQL formatted string:
deltaTable.alias("events").merge(
    source = updatesDF.alias("updates"),
    condition = "events.eventId = updates.eventId"
).whenMatchedUpdate(set = {
    "data": "updates.data",
    "count": "events.count + 1"
}).whenNotMatchedInsert(values = {
    "date": "updates.date",
    "eventId": "updates.eventId",
    "data": "updates.data",
    "count": "1"
}).execute()
Example 2 with conditions and update expressions as Spark SQL functions:
from pyspark.sql.functions import col, expr, lit
deltaTable.alias("events").merge(
    source = updatesDF.alias("updates"),
    condition = expr("events.eventId = updates.eventId")
).whenMatchedUpdate(set = {
    "data": col("updates.data"),
    "count": col("events.count") + 1
}).whenNotMatchedInsert(values = {
    "date": col("updates.date"),
    "eventId": col("updates.eventId"),
    "data": col("updates.data"),
    "count": lit("1")
}).execute()
- Parameters:
source (pyspark.sql.DataFrame) – Source DataFrame
condition (str or pyspark.sql.Column) – Condition to match source rows with the Delta table rows.
- Returns:
builder object to specify whether to update, delete or insert rows based on whether the condition matched or not
- Return type:
DeltaMergeBuilder
Added in version 0.4.
- vacuum(retentionHours: float | None = None) DataFrame¶
Recursively delete files and directories in the table that are not needed by the table for maintaining older versions up to the given retention threshold. This method will return an empty DataFrame on successful completion.
Example:
deltaTable.vacuum()     # vacuum files not required by versions more than 7 days old
deltaTable.vacuum(100)  # vacuum files not required by versions more than 100 hours old
- Parameters:
retentionHours – Optional number of hours to retain history. If not specified, then the default retention period of 168 hours (7 days) will be used.
Added in version 0.4.
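To make the retention threshold concrete, the sketch below computes the point in time implied by retentionHours. It is plain Python and does not call Delta Lake; the helper name vacuum_cutoff and the fixed clock value are assumptions made for illustration, and the actual selection of files is done by vacuum() itself.

```python
from datetime import datetime, timedelta
from typing import Optional

DEFAULT_RETENTION_HOURS = 168.0  # 7 days, the default stated above


def vacuum_cutoff(now: datetime, retention_hours: Optional[float] = None) -> datetime:
    """Return the retention threshold as a timestamp: `now` minus the retention period."""
    # Compare against None so that an explicit 0 is not replaced by the default.
    hours = DEFAULT_RETENTION_HOURS if retention_hours is None else retention_hours
    return now - timedelta(hours=hours)


now = datetime(2024, 1, 8, 0, 0, 0)  # hypothetical "current time"
print(vacuum_cutoff(now))       # threshold for deltaTable.vacuum()
print(vacuum_cutoff(now, 100))  # threshold for deltaTable.vacuum(100)
```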
- history(limit: int | None = None) DataFrame¶
Get the information of the latest limit commits on this table as a Spark DataFrame. The information is in reverse chronological order.
Example:
fullHistoryDF = deltaTable.history()     # get the full history of the table
lastOperationDF = deltaTable.history(1)  # get the last operation
- Parameters:
limit – Optional number of latest commits to return in the history.
- Returns:
Table’s commit history. See the online Delta Lake documentation for more details.
- Return type:
pyspark.sql.DataFrame
Added in version 0.4.
- detail() DataFrame¶
Get the details of a Delta table such as the format, name, and size.
Example:
detailDF = deltaTable.detail() # get the full details of the table
- Returns:
Information of the table (format, name, size, etc.)
- Return type:
pyspark.sql.DataFrame
Note
Evolving
Added in version 2.1.
- classmethod convertToDelta(sparkSession: SparkSession, identifier: str, partitionSchema: StructType | str | None = None) DeltaTable¶
Create a DeltaTable from the given parquet table. Takes an existing parquet table and constructs a delta transaction log in the base path of the table. Note: Any changes to the table during the conversion process may not result in a consistent state at the end of the conversion. Users should stop any changes to the table before the conversion is started.
Example:
# Convert unpartitioned parquet table at path 'path/to/table'
deltaTable = DeltaTable.convertToDelta(spark, "parquet.`path/to/table`")
# Convert partitioned parquet table at path 'path/to/table' and partitioned by
# integer column named 'part'
partitionedDeltaTable = DeltaTable.convertToDelta(spark, "parquet.`path/to/table`", "part int")
- Parameters:
sparkSession (pyspark.sql.SparkSession) – SparkSession to use for the conversion
identifier (str) – Parquet table identifier formatted as “parquet.`path`”
partitionSchema – Hive DDL formatted string, or pyspark.sql.types.StructType
- Returns:
DeltaTable representing the converted Delta table
- Return type:
DeltaTable
Added in version 0.4.
- classmethod forPath(sparkSession: SparkSession, path: str, hadoopConf: Dict[str, str] = {}) DeltaTable¶
Instantiate a DeltaTable object representing the data at the given path. If the given path is invalid (i.e. either no table exists or an existing table is not a Delta table), it throws a "not a Delta table" error.
- Parameters:
sparkSession (pyspark.sql.SparkSession) – SparkSession to use for loading the table
path (str) – path of the Delta table to load
hadoopConf (optional dict with str as key and str as value.) – Hadoop configuration starting with “fs.” or “dfs.” will be picked up by DeltaTable to access the file system when executing queries. Other configurations will not be allowed.
- Returns:
loaded Delta table
- Return type:
DeltaTable
Example:
hadoopConf = {"fs.s3a.access.key": "<access-key>", "fs.s3a.secret.key": "<secret-key>"}
deltaTable = DeltaTable.forPath(spark, "/path/to/table", hadoopConf)
Added in version 0.4.
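Because only options starting with "fs." or "dfs." are accepted in hadoopConf, it can help to separate them from other settings before calling forPath. The following is a minimal sketch in plain Python; split_hadoop_conf is a hypothetical helper, not part of the Delta Lake API, and the sample keys are placeholders.

```python
from typing import Dict, Tuple

ALLOWED_PREFIXES = ("fs.", "dfs.")  # prefixes named in the hadoopConf description


def split_hadoop_conf(conf: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Split options into those with an allowed prefix and all others."""
    allowed = {k: v for k, v in conf.items() if k.startswith(ALLOWED_PREFIXES)}
    rejected = {k: v for k, v in conf.items() if not k.startswith(ALLOWED_PREFIXES)}
    return allowed, rejected


allowed, rejected = split_hadoop_conf({
    "fs.s3a.access.key": "<access-key>",
    "dfs.replication": "2",
    "spark.sql.shuffle.partitions": "8",
})
print(sorted(allowed))   # keys that may be passed as hadoopConf
print(sorted(rejected))  # keys to leave out of hadoopConf
```

Only the first dictionary would then be passed as the hadoopConf argument.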
- classmethod forName(sparkSession: SparkSession, tableOrViewName: str) DeltaTable¶
Instantiate a DeltaTable object using the given table name. If the given tableOrViewName is invalid (i.e. either no table exists or an existing table is not a Delta table), it throws a "not a Delta table" error. Note: Passing a view name will also result in this error as views are not supported.
The given tableOrViewName can also be the absolute path of a delta datasource (i.e. delta.`path`). If so, instantiate a DeltaTable object representing the data at the given path (consistent with forPath).
- Parameters:
sparkSession – SparkSession to use for loading the table
tableOrViewName – name of the table or view
- Returns:
loaded Delta table
- Return type:
DeltaTable
Example:
deltaTable = DeltaTable.forName(spark, "tblName")
Added in version 0.7.
- classmethod create(sparkSession: SparkSession | None = None) DeltaTableBuilder¶
Return a DeltaTableBuilder object that can be used to specify the table name, location, columns, partitioning columns, table comment, and table properties to create a Delta table. An error is raised if the table exists (the same as SQL CREATE TABLE).
See DeltaTableBuilder for a full description and examples of this operation.
- Parameters:
sparkSession – SparkSession to use for creating the table
- Returns:
an instance of DeltaTableBuilder
- Return type:
DeltaTableBuilder
Note
Evolving
Added in version 1.0.
- classmethod createIfNotExists(sparkSession: SparkSession | None = None) DeltaTableBuilder¶
Return a DeltaTableBuilder object that can be used to specify the table name, location, columns, partitioning columns, table comment, and table properties to create a Delta table if it does not exist (the same as SQL CREATE TABLE IF NOT EXISTS).
See DeltaTableBuilder for a full description and examples of this operation.
- Parameters:
sparkSession – SparkSession to use for creating the table
- Returns:
an instance of DeltaTableBuilder
- Return type:
DeltaTableBuilder
Note
Evolving
Added in version 1.0.
- classmethod replace(sparkSession: SparkSession | None = None) DeltaTableBuilder¶
Return a DeltaTableBuilder object that can be used to specify the table name, location, columns, partitioning columns, table comment, and table properties to replace a Delta table. An error is raised if the table doesn’t exist (the same as SQL REPLACE TABLE).
See DeltaTableBuilder for a full description and examples of this operation.
- Parameters:
sparkSession – SparkSession to use for creating the table
- Returns:
an instance of DeltaTableBuilder
- Return type:
DeltaTableBuilder
Note
Evolving
Added in version 1.0.
- classmethod createOrReplace(sparkSession: SparkSession | None = None) DeltaTableBuilder¶
Return a DeltaTableBuilder object that can be used to specify the table name, location, columns, partitioning columns, table comment, and table properties to replace a Delta table, or to create it if it does not exist (the same as SQL CREATE OR REPLACE TABLE).
See DeltaTableBuilder for a full description and examples of this operation.
- Parameters:
sparkSession – SparkSession to use for creating the table
- Returns:
an instance of DeltaTableBuilder
- Return type:
DeltaTableBuilder
Note
Evolving
Added in version 1.0.
- classmethod isDeltaTable(sparkSession: SparkSession, identifier: str) bool¶
Check if the provided identifier string, in this case a file path, is the root of a Delta table using the given SparkSession.
- Parameters:
sparkSession – SparkSession to use to perform the check
identifier – location of the table
- Returns:
True if the table is a Delta table, False otherwise
- Return type:
bool
Example:
DeltaTable.isDeltaTable(spark, "/path/to/table")
Added in version 0.4.
- upgradeTableProtocol(readerVersion: int, writerVersion: int) None¶
Updates the protocol version of the table to leverage new features. Upgrading the reader version will prevent all clients that have an older version of Delta Lake from accessing this table. Upgrading the writer version will prevent older versions of Delta Lake from writing to this table. The reader or writer version cannot be downgraded.
See online documentation and Delta’s protocol specification at PROTOCOL.md for more details.
Added in version 0.8.
- addFeatureSupport(featureName: str) None¶
Modify the protocol to add a supported feature, and if the table does not support table features, upgrade the protocol automatically. In such a case when the provided feature is writer-only, the table’s writer version will be upgraded to 7, and when the provided feature is reader-writer, both reader and writer versions will be upgraded, to (3, 7).
See online documentation and Delta’s protocol specification at PROTOCOL.md for more details.
Added in version 3.3.
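The automatic upgrade described above can be modelled in a few lines. This is a sketch of the stated rule only, not of Delta Lake's implementation: protocol_after_add is a hypothetical helper, and the use of max() is an assumption that a version which is already high enough is left unchanged.

```python
from typing import Tuple

TABLE_FEATURES_READER = 3  # reader version stated above for reader-writer features
TABLE_FEATURES_WRITER = 7  # writer version stated above for table features


def protocol_after_add(reader: int, writer: int, reader_writer_feature: bool) -> Tuple[int, int]:
    """Model the (reader, writer) versions after adding one feature."""
    new_writer = max(writer, TABLE_FEATURES_WRITER)
    # Only a reader-writer feature requires raising the reader version.
    new_reader = max(reader, TABLE_FEATURES_READER) if reader_writer_feature else reader
    return new_reader, new_writer


print(protocol_after_add(1, 2, reader_writer_feature=False))  # writer-only feature
print(protocol_after_add(1, 2, reader_writer_feature=True))   # reader-writer feature
```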
- dropFeatureSupport(featureName: str, truncateHistory: bool | None = None) None¶
Modify the protocol to drop a supported feature. The operation always normalizes the resulting protocol. Protocol normalization is the process of converting a table features protocol to the weakest possible form. This primarily refers to converting a table features protocol to a legacy protocol. A table features protocol can be represented with the legacy representation only when the feature set of the former exactly matches a legacy protocol. Normalization can also decrease the reader version of a table features protocol when it is higher than necessary. For example:
(1, 7, None, {AppendOnly, Invariants, CheckConstraints}) -> (1, 3)
(3, 7, None, {RowTracking}) -> (1, 7, RowTracking)
The dropFeatureSupport method can be used as follows:
deltaTable.dropFeatureSupport("rowTracking")
- Parameters:
featureName – The name of the feature to drop.
truncateHistory – Optional value whether to truncate history. If not specified, the history is not truncated.
- Returns:
None.
Added in version 3.4.
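The two normalization examples above can be reproduced with a small model. This is a simplified sketch, not Delta Lake's algorithm: the normalize helper is hypothetical, the legacy table lists only the single legacy protocol that appears in the example, and the feature names are written in lower camel case as an assumption based on the "rowTracking" spelling in the usage line.

```python
from typing import FrozenSet, Iterable, Optional, Tuple

# Assumption: only the legacy protocol shown in the example above is listed here.
LEGACY_WRITER_PROTOCOLS = {
    frozenset({"appendOnly", "invariants", "checkConstraints"}): (1, 3),
}

Protocol = Tuple[int, int, Optional[FrozenSet[str]]]


def normalize(reader: int, writer: int,
              reader_features: Iterable[str],
              writer_features: Iterable[str]) -> Protocol:
    """Return (reader, writer, writer_features) in the weakest form this model knows."""
    r_feats = frozenset(reader_features)
    w_feats = frozenset(writer_features)
    # Case 1: the feature set exactly matches a legacy protocol.
    if not r_feats and w_feats in LEGACY_WRITER_PROTOCOLS:
        legacy_reader, legacy_writer = LEGACY_WRITER_PROTOCOLS[w_feats]
        return legacy_reader, legacy_writer, None
    # Case 2: no reader features, so the reader version is higher than necessary.
    if not r_feats:
        return 1, writer, w_feats
    # Otherwise this model leaves the protocol unchanged.
    return reader, writer, w_feats


print(normalize(1, 7, [], ["appendOnly", "invariants", "checkConstraints"]))
print(normalize(3, 7, [], ["rowTracking"]))
```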
- restoreToVersion(version: int) DataFrame¶
Restore the DeltaTable to an older version of the table specified by version number.
Example:
deltaTable.restoreToVersion(1)
- Parameters:
version – target version of restored table
- Returns:
DataFrame with metrics of restore operation.
- Return type:
pyspark.sql.DataFrame
Added in version 1.2.
- restoreToTimestamp(timestamp: str) DataFrame¶
Restore the DeltaTable to an older version of the table specified by a timestamp. Timestamp can be of the format yyyy-MM-dd or yyyy-MM-dd HH:mm:ss.
Example:
deltaTable.restoreToTimestamp('2021-01-01')
deltaTable.restoreToTimestamp('2021-01-01 01:01:01')
- Parameters:
timestamp – target timestamp of restored table
- Returns:
DataFrame with metrics of restore operation.
- Return type:
pyspark.sql.DataFrame
Added in version 1.2.
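A timestamp string can be checked against the two documented formats before it is passed to restoreToTimestamp. The sketch below uses only the Python standard library; parse_restore_timestamp is a hypothetical helper, and the strptime patterns are the Python counterparts of yyyy-MM-dd HH:mm:ss and yyyy-MM-dd. It is a client-side check only and makes no claim about which other inputs Delta Lake might accept.

```python
from datetime import datetime

# Python equivalents of yyyy-MM-dd HH:mm:ss and yyyy-MM-dd
ACCEPTED_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def parse_restore_timestamp(text: str) -> datetime:
    """Parse a timestamp string in one of the two documented formats."""
    for fmt in ACCEPTED_FORMATS:
        try:
            return datetime.strptime(text, fmt)
        except ValueError:
            continue  # try the next format
    raise ValueError(f"unsupported timestamp format: {text!r}")


print(parse_restore_timestamp("2021-01-01"))
print(parse_restore_timestamp("2021-01-01 01:01:01"))
```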
- optimize() DeltaOptimizeBuilder¶
Optimize the data layout of the table. This returns a DeltaOptimizeBuilder object that can be used to specify the partition filter to limit the scope of optimize and also execute different optimization techniques such as file compaction or ordering data using Z-Order curves.
See DeltaOptimizeBuilder for a full description of this operation.
Example:
deltaTable.optimize().where("date='2021-11-18'").executeCompaction()
- Returns:
an instance of DeltaOptimizeBuilder.
- Return type:
DeltaOptimizeBuilder
Added in version 2.0.
- clone(target, isShallow=False, replace=False, properties=None) DeltaTable¶
Clone the latest state of a DeltaTable to a destination which mirrors the existing table’s data and metadata at that version.
Example:
# Shallow clone a table to path '/path/to/table', replacing any table already there
clonedTable = deltaTable.clone("/path/to/table", isShallow=True, replace=True)
- Parameters:
self (DeltaTable) – The current instance
target (str) – Path where we should clone the Delta table
isShallow (bool) – True for shallow clones, false for deep clones
replace (bool) – True if the desired behavior is to overwrite the target table if one exists otherwise throw an error if table exists at the target
properties (dict) – user-defined table properties that should override any properties with the same key from the source table
- Return type:
DeltaTable
- cloneAtVersion(version, target, isShallow=False, replace=False, properties=None) DeltaTable¶
Clone a DeltaTable at the given version to a destination which mirrors the existing table’s data and metadata at that version.
Example:
# Shallow clone a table to path '/path/to/table' at version 1
clonedTable = deltaTable.cloneAtVersion(1, "/path/to/table", isShallow=True)
- Parameters:
self (DeltaTable) – The current instance
version (number) – Version at which to clone the source directory. Take the metadata at this version of the table as well.
target (str) – Path where we should clone the Delta table
isShallow (bool) – True for shallow clones, false for deep clones
replace (bool) – True if the desired behavior is to overwrite the target table if one exists otherwise throw an error if table exists at the target
properties (dict) – user-defined table properties that should override any properties with the same key from the source table
- Return type:
DeltaTable
- cloneAtTimestamp(timestamp, target, isShallow=False, replace=False, properties=None) DeltaTable¶
Clone a DeltaTable at the given timestamp to a destination which mirrors the existing table’s data and metadata at that timestamp.
Example:
# Shallow clone a table to path '/path/to/table' at time of format yyyy-MM-dd'T'HH:mm:ss
# or yyyy-MM-dd
clonedTable = deltaTable.cloneAtTimestamp("2019-01-01", "/path/to/table", isShallow=True)
- Parameters:
self (DeltaTable) – The current instance
timestamp (str) – Timestamp at which to clone the source directory. Take the metadata at this timestamp as well.
target (str) – Path where we should clone the Delta table
isShallow (bool) – True for shallow clones, false for deep clones
replace (bool) – True if the desired behavior is to overwrite the target table if one exists otherwise throw an error if table exists at the target
properties (dict) – user-defined table properties that should override any properties with the same key from the source table
- Return type:
DeltaTable
- class delta.tables.DeltaMergeBuilder(spark: SparkSession, jbuilder: JavaObject)¶
Builder to specify how to merge data from source DataFrame into the target Delta table. Use delta.tables.DeltaTable.merge() to create an object of this class. Using this builder, you can specify any number of whenMatched, whenNotMatched and whenNotMatchedBySource clauses. Here are the constraints on these clauses.
Constraints in the whenMatched clauses:
- The condition in a whenMatched clause is optional. However, if there are multiple whenMatched clauses, then only the last one may omit the condition.
- When there are more than one whenMatched clauses and there are conditions (or the lack of) such that a row satisfies multiple clauses, then the action for the first clause satisfied is executed. In other words, the order of the whenMatched clauses matters.
- If none of the whenMatched clauses match a source-target row pair that satisfy the merge condition, then the target rows will not be updated or deleted.
- If you want to update all the columns of the target Delta table with the corresponding column of the source DataFrame, then you can use whenMatchedUpdateAll(). This is equivalent to:
whenMatchedUpdate(set = {
    "col1": "source.col1",
    "col2": "source.col2",
    ...    # for all columns in the delta table
})
Constraints in the whenNotMatched clauses:
- The condition in a whenNotMatched clause is optional. However, if there are multiple whenNotMatched clauses, then only the last one may omit the condition.
- When there are more than one whenNotMatched clauses and there are conditions (or the lack of) such that a row satisfies multiple clauses, then the action for the first clause satisfied is executed. In other words, the order of the whenNotMatched clauses matters.
- If no whenNotMatched clause is present or if it is present but the non-matching source row does not satisfy the condition, then the source row is not inserted.
- If you want to insert all the columns of the target Delta table with the corresponding column of the source DataFrame, then you can use whenNotMatchedInsertAll(). This is equivalent to:
whenNotMatchedInsert(values = {
    "col1": "source.col1",
    "col2": "source.col2",
    ...    # for all columns in the delta table
})
Constraints in the whenNotMatchedBySource clauses:
- The condition in a whenNotMatchedBySource clause is optional. However, if there are multiple whenNotMatchedBySource clauses, then only the last whenNotMatchedBySource clause may omit the condition.
- Conditions and update expressions in whenNotMatchedBySource clauses may only refer to columns from the target Delta table.
- When there is more than one whenNotMatchedBySource clause and there are conditions (or the lack of) such that a row satisfies multiple clauses, then the action for the first clause satisfied is executed. In other words, the order of the whenNotMatchedBySource clauses matters.
- If no whenNotMatchedBySource clause is present or if it is present but the non-matching target row does not satisfy any of the whenNotMatchedBySource clause conditions, then the target row will not be updated or deleted.
Example 1 with conditions and update expressions as SQL formatted string:
deltaTable.alias("events").merge(
    source = updatesDF.alias("updates"),
    condition = "events.eventId = updates.eventId"
).whenMatchedUpdate(set = {
    "data": "updates.data",
    "count": "events.count + 1"
}).whenNotMatchedInsert(values = {
    "date": "updates.date",
    "eventId": "updates.eventId",
    "data": "updates.data",
    "count": "1",
    "missed_count": "0"
}).whenNotMatchedBySourceUpdate(set = {
    "missed_count": "events.missed_count + 1"
}).execute()
Example 2 with conditions and update expressions as Spark SQL functions:
from pyspark.sql.functions import col, expr, lit
deltaTable.alias("events").merge(
    source = updatesDF.alias("updates"),
    condition = expr("events.eventId = updates.eventId")
).whenMatchedUpdate(set = {
    "data": col("updates.data"),
    "count": col("events.count") + 1
}).whenNotMatchedInsert(values = {
    "date": col("updates.date"),
    "eventId": col("updates.eventId"),
    "data": col("updates.data"),
    "count": lit("1"),
    "missed_count": lit("0")
}).whenNotMatchedBySourceUpdate(set = {
    "missed_count": col("events.missed_count") + 1
}).execute()
Added in version 0.4.
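The clause-ordering constraints stated for this builder (only the last clause of a kind may omit its condition, and the first satisfied clause decides the action) can be illustrated with a small model in plain Python. It does not use Delta Lake; the names Clause, validate and first_action, and the sample "deleted" field, are assumptions made for this sketch.

```python
from typing import Callable, List, Optional, Tuple

Row = dict
Clause = Tuple[Optional[Callable[[Row], bool]], str]  # (condition, action name)


def validate(clauses: List[Clause]) -> None:
    """Only the last clause of a kind may omit its condition."""
    for condition, _ in clauses[:-1]:
        if condition is None:
            raise ValueError("only the last clause may omit the condition")


def first_action(clauses: List[Clause], row: Row) -> Optional[str]:
    """Return the action of the first clause the row satisfies, or None."""
    validate(clauses)
    for condition, action in clauses:
        if condition is None or condition(row):
            return action
    return None  # no clause applies: the row is left untouched


# Two whenMatched-style clauses: a conditional delete, then an unconditional update.
when_matched = [
    (lambda r: r["deleted"], "delete"),
    (None, "update"),
]
print(first_action(when_matched, {"deleted": True}))   # delete
print(first_action(when_matched, {"deleted": False}))  # update
```

Swapping the two clauses would make validate raise, because the unconditional clause would no longer be last.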
- whenMatchedUpdate(condition: Column | str | None, set: Dict[str, str | Column]) DeltaMergeBuilder¶
- whenMatchedUpdate(*, set: Dict[str, str | Column]) DeltaMergeBuilder
Update a matched table row based on the rules defined by set. If a condition is specified, then it must evaluate to true for the row to be updated.
See DeltaMergeBuilder for complete usage details.
- Parameters:
condition (str or pyspark.sql.Column) – Optional condition of the update
set (dict with str as keys and str or pyspark.sql.Column as values) – Defines the rules of setting the values of columns that need to be updated. Note: This param is required. Default value None is present to allow positional args in same order across languages.
- Returns:
this builder
Added in version 0.4.
- whenMatchedUpdateAll(condition: Column | str | None = None) DeltaMergeBuilder¶
Update all the columns of the matched table row with the values of the corresponding columns in the source row. If a condition is specified, then it must be true for the row to be updated.
See DeltaMergeBuilder for complete usage details.
- Parameters:
condition (str or pyspark.sql.Column) – Optional condition of the update
- Returns:
this builder
Added in version 0.4.
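The class description states that whenMatchedUpdateAll() is equivalent to whenMatchedUpdate with a set dictionary that maps every column to the same column of the source. The sketch below builds such a dictionary in plain Python; update_all_mapping is a hypothetical helper, and the default alias "source" simply mirrors the wording of that description.

```python
from typing import Dict, Iterable


def update_all_mapping(columns: Iterable[str], source_alias: str = "source") -> Dict[str, str]:
    """Build the explicit dict that maps every column to the same source column."""
    return {name: f"{source_alias}.{name}" for name in columns}


mapping = update_all_mapping(["date", "eventId", "data"], source_alias="updates")
print(mapping)
```

A dictionary of this shape can be passed as set to whenMatchedUpdate or as values to whenNotMatchedInsert when an explicit column list is preferred.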
- whenMatchedDelete(condition: Column | str | None = None) DeltaMergeBuilder¶
Delete a matched row from the table only if the given condition (if specified) is true for the matched row.
See DeltaMergeBuilder for complete usage details.
- Parameters:
condition (str or pyspark.sql.Column) – Optional condition of the delete
- Returns:
this builder
Added in version 0.4.
- whenNotMatchedInsert(condition: str | Column, values: Dict[str, str | Column]) DeltaMergeBuilder¶
- whenNotMatchedInsert(*, values: Dict[str, str | Column] = None) DeltaMergeBuilder
Insert a new row to the target table based on the rules defined by values. If a condition is specified, then it must evaluate to true for the new row to be inserted.
See DeltaMergeBuilder for complete usage details.
- Parameters:
condition (str or pyspark.sql.Column) – Optional condition of the insert
values (dict with str as keys and str or pyspark.sql.Column as values) – Defines the rules of setting the values of columns that need to be inserted. Note: This param is required. Default value None is present to allow positional args in same order across languages.
- Returns:
this builder
Added in version 0.4.
- whenNotMatchedInsertAll(condition: Column | str | None = None) DeltaMergeBuilder¶
Insert a new target Delta table row by assigning the target columns to the values of the corresponding columns in the source row. If a condition is specified, then it must evaluate to true for the new row to be inserted.
See DeltaMergeBuilder for complete usage details.
- Parameters:
condition (str or pyspark.sql.Column) – Optional condition of the insert
- Returns:
this builder
Added in version 0.4.
- whenNotMatchedBySourceUpdate(condition: Column | str | None, set: Dict[str, str | Column]) DeltaMergeBuilder¶
- whenNotMatchedBySourceUpdate(*, set: Dict[str, str | Column]) DeltaMergeBuilder
Update a target row that has no matches in the source based on the rules defined by set. If a condition is specified, then it must evaluate to true for the row to be updated.
See DeltaMergeBuilder for complete usage details.
- Parameters:
condition (str or pyspark.sql.Column) – Optional condition of the update
set (dict with str as keys and str or pyspark.sql.Column as values) – Defines the rules of setting the values of columns that need to be updated. Note: This param is required. Default value None is present to allow positional args in same order across languages.
- Returns:
this builder
Added in version 2.3.
- whenNotMatchedBySourceDelete(condition: Column | str | None = None) DeltaMergeBuilder¶
Delete a target row that has no matches in the source from the table only if the given condition (if specified) is true for the target row.
See DeltaMergeBuilder for complete usage details.
- Parameters:
condition (str or pyspark.sql.Column) – Optional condition of the delete
- Returns:
this builder
Added in version 2.3.
- withSchemaEvolution() DeltaMergeBuilder¶
Enable schema evolution for the merge operation. This allows the target table schema to be automatically updated based on the schema of the source DataFrame.
See DeltaMergeBuilder for complete usage details.
- Returns:
this builder
Added in version 3.2.
- execute() DataFrame¶
Execute the merge operation based on the built matched and not matched actions.
See DeltaMergeBuilder for complete usage details.
Added in version 0.4.
- class delta.tables.IdentityGenerator(start: int = 1, step: int = 1)¶
Identity generator specifications for the identity column in the Delta table.
- Parameters:
start (int) – the start for the identity column. Default is 1.
step (int) – the step for the identity column. Default is 1.
- start: int = 1¶
- step: int = 1¶
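To show how start and step relate, the sketch below lists the nominal sequence they describe. It is plain Python and only illustrates the two parameters; nominal_identity_values is a hypothetical helper, and nothing here is a statement about how Delta Lake actually allocates identity values.

```python
from itertools import count, islice
from typing import List


def nominal_identity_values(start: int = 1, step: int = 1, n: int = 5) -> List[int]:
    """Return the first n values of the sequence start, start + step, ..."""
    return list(islice(count(start, step), n))


print(nominal_identity_values())          # [1, 2, 3, 4, 5]
print(nominal_identity_values(10, 5, 3))  # [10, 15, 20]
```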
- class delta.tables.DeltaTableBuilder(spark: SparkSession, jbuilder: JavaObject)¶
Builder to specify how to create / replace a Delta table. You must specify the table name or the path before executing the builder. You can specify the table columns, the partitioning columns, the location of the data, the table comment and the property, and how you want to create / replace the Delta table.
After executing the builder, a DeltaTable object is returned.
Use delta.tables.DeltaTable.create(), delta.tables.DeltaTable.createIfNotExists(), delta.tables.DeltaTable.replace(), or delta.tables.DeltaTable.createOrReplace() to create an object of this class.
Example 1 to create a Delta table with separate columns, using the table name:
from pyspark.sql.types import IntegerType
deltaTable = (
    DeltaTable.create(sparkSession)
    .tableName("testTable")
    .addColumn("c1", dataType = "INT", nullable = False)
    .addColumn("c2", dataType = IntegerType(), generatedAlwaysAs = "c1 + 1")
    .partitionedBy("c1")
    .execute()
)
Example 2 to replace a Delta table with the columns of an existing schema, using the table name:
df = spark.createDataFrame([('a', 1), ('b', 2), ('c', 3)], ["key", "value"])
deltaTable = (
    DeltaTable.replace(sparkSession)
    .tableName("testTable")
    .addColumns(df.schema)
    .execute()
)
Added in version 1.0.
Note
Evolving
- tableName(identifier: str) DeltaTableBuilder¶
Specify the table name. Optionally qualified with a database name [database_name.] table_name.
- Parameters:
identifier (str) – the table name
- Returns:
this builder
Note
Evolving
Added in version 1.0.
- location(location: str) DeltaTableBuilder¶
Specify the path to the directory where table data is stored, which could be a path on distributed storage.
- Parameters:
location (str) – the location where the table data is stored
- Returns:
this builder
Note
Evolving
Added in version 1.0.
- comment(comment: str) DeltaTableBuilder¶
Comment to describe the table.
- Parameters:
comment (str) – the table comment
- Returns:
this builder
Note
Evolving
Added in version 1.0.
- addColumn(colName: str, dataType: str | DataType, nullable: bool = True, generatedAlwaysAs: str | IdentityGenerator | None = None, generatedByDefaultAs: IdentityGenerator | None = None, comment: str | None = None) DeltaTableBuilder¶
Specify a column in the table
- Parameters:
colName (str) – the column name
dataType (str or pyspark.sql.types.DataType) – the column data type
nullable (bool) – whether column is nullable
generatedAlwaysAs (str or delta.tables.IdentityGenerator) – a SQL expression if the column is always generated as a function of other columns; an IdentityGenerator object if the column is always generated using an identity generator. See online documentation for details on Generated Columns.
generatedByDefaultAs (delta.tables.IdentityGenerator) – an IdentityGenerator object to generate identity values if the user does not provide values for the column. See online documentation for details on Generated Columns.
comment (str) – the column comment
- Returns:
this builder
Note
Evolving
Added in version 1.0.
- addColumns(cols: StructType | List[StructField]) DeltaTableBuilder¶
Specify columns in the table using an existing schema
- Parameters:
cols (pyspark.sql.types.StructType or a list of pyspark.sql.types.StructField) – the columns in the existing schema
- Returns:
this builder
Note
Evolving
Added in version 1.0.
- partitionedBy(*cols: str) DeltaTableBuilder¶
- partitionedBy(__cols: List[str] | Tuple[str, ...]) DeltaTableBuilder
Specify columns for partitioning
- Parameters:
cols (str or list of str) – the names of the partitioning columns
- Returns:
this builder
Note
Evolving
Added in version 1.0.
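The two partitionedBy signatures accept either several column names or a single list or tuple of names. One way such overloads can be reduced to a single list is sketched below; normalize_cols is a hypothetical helper written for illustration and is not claimed to be how the builder handles its arguments.

```python
from typing import List, Tuple, Union


def normalize_cols(*cols: Union[str, List[str], Tuple[str, ...]]) -> List[str]:
    """Accept either several names or one list/tuple of names; return a flat list."""
    if len(cols) == 1 and isinstance(cols[0], (list, tuple)):
        return list(cols[0])
    return list(cols)


print(normalize_cols("year", "month"))    # separate arguments
print(normalize_cols(["year", "month"]))  # a single list
```

Under this model, partitionedBy("year", "month") and partitionedBy(["year", "month"]) describe the same partitioning columns.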
- clusterBy(*cols: str) DeltaTableBuilder¶
- clusterBy(__cols: List[str] | Tuple[str, ...]) DeltaTableBuilder
Specify columns for clustering
- Parameters:
cols (str or list of str) – the names of the clustering columns
- Returns:
this builder
Note
Evolving
Added in version 3.2.
- property(key: str, value: str) DeltaTableBuilder¶
Specify a table property
- Parameters:
key – the table property key
value – the table property value
- Returns:
this builder
Note
Evolving
Added in version 1.0.
- execute() DeltaTable¶
Execute Table Creation.
- Return type:
DeltaTable
Note
Evolving
Added in version 1.0.
- class delta.tables.DeltaOptimizeBuilder(spark: SparkSession, jbuilder: JavaObject)¶
Builder class for constructing and executing the OPTIMIZE command.
Use delta.tables.DeltaTable.optimize() to create an instance of this class.
Added in version 2.0.0.
- where(partitionFilter: str) DeltaOptimizeBuilder¶
Apply partition filter on this optimize command builder to limit the operation on selected partitions.
- Parameters:
partitionFilter (str) – The partition filter to apply
- Returns:
DeltaOptimizeBuilder with partition filter applied
- Return type:
DeltaOptimizeBuilder
Added in version 2.0.
- executeCompaction() DataFrame¶
Compact the small files in selected partitions.
- Returns:
DataFrame containing the OPTIMIZE execution metrics
- Return type:
pyspark.sql.DataFrame
Added in version 2.0.
- executeZOrderBy(*cols: str | List[str] | Tuple[str, ...]) DataFrame¶
Z-Order the data in selected partitions using the given columns.
- Parameters:
cols (str or list of str) – the names of the Z-Order columns
- Returns:
DataFrame containing the OPTIMIZE execution metrics
- Return type:
pyspark.sql.DataFrame
Added in version 2.0.
Exceptions¶
- exception delta.exceptions.DeltaConcurrentModificationException(message: str | None = None, errorClass: str | None = None, messageParameters: Dict[str, str] | None = None, contexts: List[QueryContext] | None = None)¶
The base class for all Delta commit conflict exceptions.
Added in version 1.0.
Note
Evolving
- exception delta.exceptions.ConcurrentWriteException(message: str | None = None, errorClass: str | None = None, messageParameters: Dict[str, str] | None = None, contexts: List[QueryContext] | None = None)¶
Thrown when a concurrent transaction has written data after the current transaction read the table.
Added in version 1.0.
Note
Evolving
- exception delta.exceptions.MetadataChangedException(message: str | None = None, errorClass: str | None = None, messageParameters: Dict[str, str] | None = None, contexts: List[QueryContext] | None = None)¶
Thrown when the metadata of the Delta table has changed between the time of read and the time of commit.
Added in version 1.0.
Note
Evolving
- exception delta.exceptions.ProtocolChangedException(message: str | None = None, errorClass: str | None = None, messageParameters: Dict[str, str] | None = None, contexts: List[QueryContext] | None = None)¶
Thrown when the protocol version has changed between the time of read and the time of commit.
Added in version 1.0.
Note
Evolving
- exception delta.exceptions.ConcurrentAppendException(message: str | None = None, errorClass: str | None = None, messageParameters: Dict[str, str] | None = None, contexts: List[QueryContext] | None = None)¶
Thrown when files are added that would have been read by the current transaction.
Added in version 1.0.
Note
Evolving
- exception delta.exceptions.ConcurrentDeleteReadException(message: str | None = None, errorClass: str | None = None, messageParameters: Dict[str, str] | None = None, contexts: List[QueryContext] | None = None)¶
Thrown when the current transaction reads data that was deleted by a concurrent transaction.
Added in version 1.0.
Note
Evolving
- exception delta.exceptions.ConcurrentDeleteDeleteException(message: str | None = None, errorClass: str | None = None, messageParameters: Dict[str, str] | None = None, contexts: List[QueryContext] | None = None)¶
Thrown when the current transaction deletes data that was deleted by a concurrent transaction.
Added in version 1.0.
Note
Evolving
- exception delta.exceptions.ConcurrentTransactionException(message: str | None = None, errorClass: str | None = None, messageParameters: Dict[str, str] | None = None, contexts: List[QueryContext] | None = None)¶
Thrown when concurrent transactions both attempt to update the same idempotent transaction.
Added in version 1.0.
Note
Evolving
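A common way to handle commit conflicts is to retry the failed operation a bounded number of times. The sketch below shows that pattern in plain Python; retry_on_conflict and its parameters are hypothetical, and whether a retry is appropriate depends on the operation being safe to repeat.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_on_conflict(action: Callable[[], T],
                      conflict_types: Tuple[Type[BaseException], ...],
                      attempts: int = 3,
                      delay_seconds: float = 0.0) -> T:
    """Run `action`, retrying when it raises one of `conflict_types`."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except conflict_types:
            if attempt == attempts:
                raise  # out of retries: surface the last conflict
            time.sleep(delay_seconds)
    raise AssertionError("unreachable")
```

With Delta Lake installed, conflict_types could be, for example, (delta.exceptions.ConcurrentAppendException,), and action a function that rebuilds and executes the merge.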
Others¶
- delta.pip_utils.configure_spark_with_delta_pip(spark_session_builder: Builder, extra_packages: List[str] | None = None) Builder¶
Utility function to configure a SparkSession builder such that the generated SparkSession will automatically download the required Delta Lake JARs from Maven. This function is required when you want to
- Install Delta Lake locally using pip, and
- Execute your Python code using Delta Lake + Pyspark directly, that is, not using spark-submit --packages io.delta:… or pyspark --packages io.delta:….
builder = (
    SparkSession.builder
    .master("local[*]")
    .appName("test")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()
If you would like to add more packages, use the extra_packages parameter.
builder = (
    SparkSession.builder
    .master("local[*]")
    .appName("test")
)
my_packages = ["org.apache.spark:spark-sql-kafka-0-10_2.12:x.y.z"]
spark = configure_spark_with_delta_pip(builder, extra_packages=my_packages).getOrCreate()
- Parameters:
spark_session_builder – SparkSession.Builder object being used to configure and create a SparkSession.
extra_packages – Set other packages to add to Spark session besides Delta Lake.
- Returns:
Updated SparkSession.Builder object
Added in version 1.0.
Note
Evolving