Problem Statement
Often, our data doesn’t come in a neat Excel sheet or CSV file but is buried as a table in a PDF. The challenge with PDF files is that they can contain a variety of content: text, tables, images, hyperlinks, and so on.
There are programming libraries that can extract data from these files, but that approach again poses a challenge for a non-technical user.
Is it possible to read and import data from a PDF file in an easier, automated way?
Prerequisites
- Microsoft Fabric Dataflow Gen2
- Azure Storage account
A sample PDF file is attached.
Solution
1. Log in to https://app.fabric.microsoft.com/home?experience=data-factory and select Dataflow Gen2.
Then rename the dataflow.
2. Click ‘Get Data’ >> More and select Azure Blobs.
3. Create a new connection (if there isn’t an existing one) via your organizational account and click ‘Next’.
Note: In this scenario, we use Account Key authentication.
4. The list of containers within the Azure Blob storage account is visible, as seen below:
Filter for your container and select the PDF file whose data needs to be extracted.
5. The PDF connector detected all 6 tables in our sample PDF file, so it provides an option to import specific tables. This is a great option for those who are interested in a specific piece of data from the PDF file. If you look carefully, it also tells you which table is on which page. The remaining options at the bottom of the list let you read an entire page; one item is created per page.
Note: There are 5 pages in the PDF file and 6 tables distributed across those 5 pages.
We need data from only one table, so select Table001; a preview of the table appears on the right-hand side. (A small M sketch of this navigation table follows below.)
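For reference, the same choice can also be expressed directly in M. The snippet below is only a sketch: it reuses the storage URL and file name from the final script further down, the step names (Container, PdfFile, AllItems, TablesOnly) are illustrative, and it assumes the connector’s usual Kind column values of "Table" and "Page" to separate detected tables from whole-page items.
let
    Source = AzureStorage.Blobs("https://datasharkxstg.blob.core.windows.net/"),
    Container = Source{[Name = "fabric"]}[Data],
    PdfFile = Container{[#"Folder Path" = "https://datasharkxstg.blob.core.windows.net/fabric/", Name = "DataSharkXSamplePDF.pdf"]}[Content],
    // One row per detected table plus one row per page
    AllItems = Pdf.Tables(PdfFile, [Implementation = "1.3"]),
    // Keep only the table items and drop the whole-page items
    TablesOnly = Table.SelectRows(AllItems, each [Kind] = "Table")
in
    TablesOnly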
6. Apply the transformations you need in the Power Query editor.
In this scenario, we promote the first row to headers.
Final State
The Advanced editor code:
let
    Source = AzureStorage.Blobs("https://datasharkxstg.blob.core.windows.net/"),
    // Navigate into the 'fabric' container
    #"Navigation 1" = Source{[Name = "fabric"]}[Data],
    // Get the binary content of the sample PDF file
    Navigation = #"Navigation 1"{[#"Folder Path" = "https://datasharkxstg.blob.core.windows.net/fabric/", Name = "DataSharkXSamplePDF.pdf"]}[Content],
    // Detect the tables and pages within the PDF
    #"Imported PDF" = Pdf.Tables(Navigation, [Implementation = "1.3"]),
    // Select the table we need (Table001)
    #"Navigation 2" = #"Imported PDF"{[Id = "Table001"]}[Data],
    // Promote the first row to column headers
    #"Promoted headers" = Table.PromoteHeaders(#"Navigation 2", [PromoteAllScalars = true])
in
    #"Promoted headers"
7. Now select the Sink / Destination settings.
As of today, only 4 sinks are supported:
For our use case, we select Lakehouse.
8. Filter for the workspace and select the Lakehouse.
You can either create a new table at run time or map to an existing one.
In our use case, we create a new table in the destination named PDFExtractionSink.
9. You can either append the data or replace the data, based on these settings.
We proceed with the Replace data scenario.
Then click “Save settings”.
10. Finally, click “Publish”.
11. The dataflow begins its first execution once it is published.
Lakehouse output:
12. To execute the dataflow again, click its Refresh now option.
Since Replace data is enabled at the sink, the existing data is overwritten.
13. To schedule the data extraction, you can either set the dataflow refresh frequency via the Dataflow settings:
Or schedule it via Fabric Data pipelines.