What is AI_EXTRACT?
The AI_EXTRACT function in Databricks enables users to extract structured entities or insights from unstructured text data using AI models. It simplifies the process of retrieving critical information such as dates, names, sentiments, or other context-specific entities from text columns in tables.
Key Benefits of AI_EXTRACT
- AI-Powered Insights: Extract entities using pre-trained AI models, reducing the need for custom NLP pipelines.
- Ease of Use: The function integrates seamlessly into SQL workflows.
- Flexibility: Extract various data types like dates, numeric values, locations, and more.
Syntax
AI_EXTRACT(expression, 'entityType', options)
Parameters
- expression: The text input from which data needs to be extracted.
- entityType: Specifies the type of information to extract (e.g., DATE, NUMBER, LOCATION, PERSON).
- options: (Optional) Additional configurations such as model parameters or language settings.
Examples
In the screenshot below, I have text data in unity catalog: my_catalog.my_schema.data
To extract names and year from the data, I will author this SQL query:
SELECT ai_extract(content,array('person','year') )
FROM my_catalog.my_schema.data
What about if I want to further extract location and organization, I can further write the query as thus
SELECT ai_extract(content,array('person','year','location','organization') )
FROM my_catalog.my_schema.data
When to Use AI_EXTRACT?
Ideal Use Cases
- Text Analysis for BI: Extract entities from textual customer feedback for business intelligence.
- Data Cleaning: Structure unstructured fields in raw datasets.
- Event and Log Parsing: Extract timestamps, error codes, or other relevant details from logs.
- Contextual Insights: Identify entities like names, locations, and amounts to enrich your data warehouse.
Limitations
While AI_EXTRACT is powerful, it’s essential to be aware of its limitations:
- Model Dependency: The function relies on pre-trained models, which may not always capture domain-specific nuances.
- Performance: Extracting entities from large text datasets can be resource-intensive.
- Custom Needs: For highly specific extractions, additional NLP frameworks or custom-trained models might be required.
Bye for now.