Understanding the AI_EXTRACT SQL Function in Databricks

What is AI_EXTRACT?

The AI_EXTRACT function in Databricks enables users to extract structured entities or insights from unstructured text data using AI models. It simplifies the process of retrieving critical information such as dates, names, sentiments, or other context-specific entities from text columns in tables.

Key Benefits of AI_EXTRACT

  1. AI-Powered Insights: Extract entities using pre-trained AI models, reducing the need for custom NLP pipelines.
  2. Ease of Use: The function integrates seamlessly into SQL workflows.
  3. Flexibility: Extract various data types like dates, numeric values, locations, and more.

Syntax

AI_EXTRACT(content, labels)

Parameters

  1. content: The text input (a string column or expression) from which entities are extracted.
  2. labels: An array of strings naming the entity types to extract (e.g., 'person', 'date', 'location', 'organization').

The function returns a struct with one field for each requested label.
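For instance, a minimal call might look like this (the reviews table and review_text column here are hypothetical placeholders):

-- entities comes back as a struct keyed by the requested labels
SELECT ai_extract(review_text, array('person', 'date')) AS entities
FROM reviews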

Examples

In the screenshot below, I have text data in a Unity Catalog table: my_catalog.my_schema.data

[Screenshot: sample rows of the table]

To extract names and years from the data, I can write this SQL query:

SELECT ai_extract(content, array('person', 'year'))
FROM my_catalog.my_schema.data

[Screenshot: query results in Databricks]
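Since the result is a struct keyed by the requested labels, each entity can also be pulled out into its own column. A quick sketch of that pattern (the alias names here are arbitrary):

-- Expand the extracted struct into individual columns.
SELECT
  t.content,
  t.extracted.person AS person,
  t.extracted.year AS year
FROM (
  SELECT content, ai_extract(content, array('person', 'year')) AS extracted
  FROM my_catalog.my_schema.data
) AS t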

If I also want to extract location and organization, I can extend the query as follows:

SELECT ai_extract(content, array('person', 'year', 'location', 'organization'))
FROM my_catalog.my_schema.data

[Screenshot: AI_EXTRACT function output in Databricks]
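To make these entities reusable downstream, one option is to materialize the output into a new table. A rough sketch, where enriched_data is just an illustrative table name:

-- Persist the extracted entities alongside the original text.
CREATE OR REPLACE TABLE my_catalog.my_schema.enriched_data AS
SELECT
  content,
  ai_extract(content, array('person', 'year', 'location', 'organization')) AS entities
FROM my_catalog.my_schema.data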

When to Use AI_EXTRACT?

Ideal Use Cases

  1. Text Analysis for BI: Extract entities from textual customer feedback for business intelligence.
  2. Data Cleaning: Structure unstructured fields in raw datasets.
  3. Event and Log Parsing: Extract timestamps, error codes, or other relevant details from logs (see the sketch after this list).
  4. Contextual Insights: Identify entities like names, locations, and amounts to enrich your data warehouse.
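For the log-parsing case above, a sketch might look like this (the raw_logs table and its message column are hypothetical):

-- Extract a timestamp and error code from free-text log messages.
SELECT
  message,
  ai_extract(message, array('timestamp', 'error_code')) AS parsed
FROM my_catalog.my_schema.raw_logs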

Limitations

While AI_EXTRACT is powerful, it’s essential to be aware of its limitations:

  1. Model Dependency: The function relies on pre-trained models, which may not always capture domain-specific nuances.
  2. Performance: Extracting entities from large text datasets can be resource-intensive.
  3. Custom Needs: For highly specific extractions, additional NLP frameworks or custom-trained models might be required.

Bye for now.
