Database and Tables in Azure Databricks
An Azure Databricks Database is a collection of tables.
An Azure Databricks Table is a collection of structured data. We can cache, filter, and perform any operation supported by Apache Spark DataFrames on Azure Databricks tables, and we can query tables with Spark APIs and Spark SQL.
Types of Tables
There are two types of tables: Global and Local
Global Table
A global table is available across all clusters. Azure Databricks registers global tables to either the Azure Databricks Hive metastore or an external Hive metastore.
Local Table
A local table is not accessible from other clusters and is not registered in the Hive metastore. This is also known as a temporary view.
When we create a table using the UI, we cannot:
Upload a file if we are using a High Concurrency cluster. Instead, use the Databricks File System (DBFS) to load the data into Azure Databricks.
Update the table. Instead, create a table programmatically.
Create a table using the UI
- Click Data in the sidebar. The Databases and Tables folders will display.
- In the Databases folder, select a database.
- Above the Tables folder, click Create Table.
- Choose a data source and follow the steps in the corresponding section to configure the table.
- Drag files to the Files dropzone, or click the dropzone to browse and choose files. After upload, a path of the form /FileStore/tables/<filename>-<integer>.<file-type> is displayed for each file; you can use this path in a notebook to read the data.
- Click Create Table with UI.
Create a Table Programmatically
Create a Global Table
To create a Global table in SQL:
CREATE TABLE <table-name> (<column-name> <data-type>, ...)
To create a Global table from a DataFrame in Python or Scala:
dataFrame.write.saveAsTable("<table-name>")
To create a Local table from a DataFrame in Python or Scala:
dataFrame.createOrReplaceTempView("<table-name>")
Access a Table
We can view table details and read, update, and delete a table. We can view a table in the UI as well as in notebooks.
To Access/View the Table from the UI
- Click Data in the sidebar.
- In the Databases folder, click a database.
- In the Tables folder, click the table name.
- In the Cluster drop-down, optionally select another cluster to render the table preview.
To Access/View the Table from Notebooks
Query the table using SQL
SELECT * FROM <table-name>
Query the table using Spark SQL
df = spark.sql("SELECT * FROM table_name")
display(df)
Query the table using Python
diamonds = spark.table("table_name")
display(diamonds.select("*"))
Update a Table
The table schema is immutable. However, we can update table data by changing the underlying files.
For example, for tables created from a storage directory, adding or removing files in that directory changes the table's contents.
After updating the files underlying a table, refresh the table with the SQL command:
REFRESH TABLE <table-name>
Delete a table
We can delete a table from the UI (by clicking the Delete option) or programmatically with the SQL command:
DROP TABLE <table-name>