Problem Statement
Often, our data doesn’t come in a neat Excel sheet or CSV file but is buried as a table in a PDF. The challenge with PDF files is that they can contain a variety of content: text, tables, images, hyperlinks, and so on.
There are programming libraries that can extract data from these files, but that approach again poses a challenge for a non-technical user.
Is it possible to read and import data from a PDF file in an easier, automated way?
Prerequisites
- Microsoft Fabric Dataflow Gen2
- Azure Storage account
A sample PDF file is attached.
Solution
1. Log in to https://app.fabric.microsoft.com/home?experience=data-factory and select Dataflow Gen2.
Then rename the dataflow.
2. Click ‘Get Data’ >> More and select Azure Blobs.
3. Create a new connection (if there isn’t an existing one) via your organizational account and click ‘Next’.
Note: In this scenario, we use Account Key authentication.
4. The list of containers within the Azure Blob storage account is visible, as seen below:
Filter for your container and select the PDF file whose data needs to be extracted.
5. The PDF connector detected all 6 tables in our sample PDF file, so it provides an option to import specific tables. This is a great option for those who are interested in a specific piece of data from the PDF file. If you look carefully, it also tells you which table is on which page. The remaining options at the bottom of the list let you read an entire page; one item is created per page.
Note: There are 5 pages in the PDF file and 6 tables distributed across those 5 pages.
We need data from only one table, so select Table001; a preview of the table appears on the right-hand side. (A small M sketch of this navigation table follows below.)
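For reference, the same choice can also be expressed directly in M. The snippet below is only a sketch: it reuses the storage URL and file name from the final script further down, the step names (Container, PdfFile, AllItems, TablesOnly) are illustrative, and it assumes the connector’s usual Kind column values of "Table" and "Page" to separate detected tables from whole-page items.
let
    Source = AzureStorage.Blobs("https://datasharkxstg.blob.core.windows.net/"),
    Container = Source{[Name = "fabric"]}[Data],
    PdfFile = Container{[#"Folder Path" = "https://datasharkxstg.blob.core.windows.net/fabric/", Name = "DataSharkXSamplePDF.pdf"]}[Content],
    // One row per detected table plus one row per page
    AllItems = Pdf.Tables(PdfFile, [Implementation = "1.3"]),
    // Keep only the table items and drop the whole-page items
    TablesOnly = Table.SelectRows(AllItems, each [Kind] = "Table")
in
    TablesOnly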
6. Apply the transformations you need in the Power Query editor.
In this scenario, we promote the first row to headers.
Final State
The Advanced editor code:
let
    Source = AzureStorage.Blobs("https://datasharkxstg.blob.core.windows.net/"),
    // Navigate into the 'fabric' container
    #"Navigation 1" = Source{[Name = "fabric"]}[Data],
    // Get the binary content of the sample PDF file
    Navigation = #"Navigation 1"{[#"Folder Path" = "https://datasharkxstg.blob.core.windows.net/fabric/", Name = "DataSharkXSamplePDF.pdf"]}[Content],
    // Detect the tables and pages within the PDF
    #"Imported PDF" = Pdf.Tables(Navigation, [Implementation = "1.3"]),
    // Select the table we need (Table001)
    #"Navigation 2" = #"Imported PDF"{[Id = "Table001"]}[Data],
    // Promote the first row to column headers
    #"Promoted headers" = Table.PromoteHeaders(#"Navigation 2", [PromoteAllScalars = true])
in
    #"Promoted headers"
7. Now select the Sink / Destination settings.
As of today, only 4 sinks are supported:
For our use case, we select Lakehouse.
8. Filter for the workspace and select the Lakehouse.
You can either create a new table at run time or map to an existing one.
In our use case, we create a new table in the destination named PDFExtractionSink.
9. You can either append the data or replace the data, based on these settings.
We proceed with the Replace data scenario.
Then click “Save settings”.
10. Finally, click “Publish”.
11. The dataflow begins its first execution once it is published.
Lakehouse output:
12. To execute the dataflow again, click its Refresh now option.
Since Replace data is enabled at the sink, the existing data is overwritten.
13. To schedule the data extraction, you can either set the dataflow refresh frequency via the Dataflow settings:
Or schedule it via Fabric Data pipelines.