How To Extract Tables From PDF In Java

C# Curator
3y

A PDF document, such as an electronic invoice or financial report, often contains tables. You may need to extract that table data from the PDF file and save it in an Excel worksheet so you can analyze it further with the tools MS Excel provides. This article demonstrates how to extract PDF tables using Spire.PDF for Java, covering the following three topics.

Extract a Specific Table from a Specified PDF Page
Extract All Tables from the Entire PDF Document
Export Table Data from PDF to Excel

Install Spire.Pdf.Jar File

To begin with, download Spire.PDF for Java and add the Spire.Pdf.jar file to your Java application as a dependency. If you use Maven, you can import the jar by adding the following configuration to your pom.xml file.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>5.1.0</version>
    </dependency>
</dependencies>
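
After declaring the dependency, it can be useful to confirm that the jar actually resolved before running the longer examples. The following is only a minimal sketch of such a check: `ClasspathCheck` and `isOnClasspath` are hypothetical names introduced for illustration and are not part of Spire.PDF, while the probed class name `com.spire.pdf.PdfDocument` is the one imported by the examples in this article.

```java
public class ClasspathCheck {

    // Hypothetical helper (not part of Spire.PDF): returns true if the named
    // class can be located by the class loader that loaded this class.
    static boolean isOnClasspath(String className) {
        try {
            // initialize = false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Class name taken from the imports used in this article's examples
        String probe = "com.spire.pdf.PdfDocument";
        if (isOnClasspath(probe)) {
            System.out.println(probe + " found on the classpath");
        } else {
            System.out.println(probe + " not found; check the Maven configuration");
        }
    }
}
```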

Extract a Specific Table from a Specified PDF Page

Spire.PDF for Java offers the PdfTableExtractor class to extract tables from a PDF. Create an instance of the PdfTableExtractor class and invoke its extractTable(int pageIndex) method to get all tables on a given page. The method returns the results as a PdfTable[] array, so a specific table can be accessed by its index.

Once you have the table you want, use the PdfTable.getText(int rowIndex, int columnIndex) method to retrieve the text of each table cell.

import com.spire.pdf.PdfDocument;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;

import java.io.FileWriter;
import java.io.IOException;

public class ExtractSpecificTableFromSpecifiedPage {

    public static void main(String[] args) throws IOException {

        //Load a sample PDF document
        PdfDocument pdf = new PdfDocument("C:\\Users\\Administrator\\Desktop\\TwoTables.pdf");

        //Create a PdfTableExtractor instance
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);
        
        //Extract tables from the first page
        PdfTable[] pdfTables = extractor.extractTable(0);

        //Get the first table (this assumes the page contains at least one table)
        PdfTable table = pdfTables[0];

        //Create a StringBuilder instance
        StringBuilder builder = new StringBuilder();

        //Loop through the rows in the current table
        for (int i = 0; i < table.getRowCount(); i++) {

            //Loop through the columns in the current table
            for (int j = 0; j < table.getColumnCount(); j++) {

                //Extract data from the current table cell
                String text = table.getText(i, j);

                //Append the text to the string builder
                builder.append(text + " ");
            }
            builder.append("\r\n");
        }

        //Write data into a .txt document (the "output" folder must already exist)
        try (FileWriter fw = new FileWriter("output/ExtractSpecificTableFromSpecifiedPage.txt")) {
            fw.write(builder.toString());
        }
    }
}
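
The loop above separates cells with a single space, which makes columns hard to tell apart when a cell itself contains spaces or line breaks. One possible alternative is to write each row as a CSV line. The sketch below works on a plain `String[][]` grid so that it runs without the library; in practice the grid would be filled from `table.getText(i, j)`. The names `TableCsvFormatter`, `escape`, and `toCsv` are hypothetical, and the quoting follows the common CSV convention of doubling embedded quotes.

```java
import java.util.StringJoiner;

public class TableCsvFormatter {

    // Quote a cell if it contains a comma, a double quote, or a line break.
    // A null cell is written as an empty field.
    static String escape(String cell) {
        if (cell == null) {
            return "";
        }
        boolean needsQuotes = cell.contains(",") || cell.contains("\"")
                || cell.contains("\n") || cell.contains("\r");
        String escaped = cell.replace("\"", "\"\"");
        return needsQuotes ? "\"" + escaped + "\"" : escaped;
    }

    // Turn a grid of cell texts into CSV, one line per table row.
    static String toCsv(String[][] cells) {
        StringBuilder builder = new StringBuilder();
        for (String[] row : cells) {
            StringJoiner line = new StringJoiner(",");
            for (String cell : row) {
                line.add(escape(cell));
            }
            builder.append(line).append("\r\n");
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        // Sample data standing in for text read with table.getText(i, j)
        String[][] cells = {
                {"Name", "Amount"},
                {"Smith, John", "1,200.50"}
        };
        System.out.print(toCsv(cells));
    }
}
```

A CSV file produced this way can also be opened directly in Excel, which may be enough when no formatting is needed.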

Extract All Tables from the Entire PDF Document

The previous example fetches a specific table from a single page. By looping through every page in the document, and through every table found on each page, you can obtain all tables in the PDF document.

import com.spire.pdf.PdfDocument;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;

import java.io.FileWriter;
import java.io.IOException;

public class ExtractTablesFromPdf {

    public static void main(String[] args) throws IOException {

        //Load a sample PDF document
        PdfDocument pdf = new PdfDocument("C:\\Users\\Administrator\\Desktop\\TwoTables.pdf");

        //Create a PdfTableExtractor instance
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);

        //Declare a PdfTable array variable
        PdfTable[] pdfTables = null;

        //Create a StringBuilder instance
        StringBuilder builder = new StringBuilder();

        //Loop through the pages
        for (int pageIndex = 0; pageIndex < pdf.getPages().getCount(); pageIndex++) {

            //Extract tables from the current page
            pdfTables = extractor.extractTable(pageIndex);

            //If any tables are found
            if (pdfTables != null && pdfTables.length > 0) {

                //Loop through the tables
                for (PdfTable table : pdfTables) {

                    //Loop through the rows in the current table
                    for (int i = 0; i < table.getRowCount(); i++) {

                        //Loop through the columns in the current table
                        for (int j = 0; j < table.getColumnCount(); j++) {

                            //Extract data from the current table cell
                            String text = table.getText(i, j);

                            //Append the text to the string builder
                            builder.append(text + " ");
                        }
                        builder.append("\r\n");
                    }
                    builder.append("\r\n");
                }
            }
        }

        //Write data into a .txt document (the "output" folder must already exist)
        try (FileWriter fw = new FileWriter("output/ExtractAllTables.txt")) {
            fw.write(builder.toString());
        }
    }
}
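
The examples write to a relative `output/` folder, and `FileWriter` fails if that folder does not exist. A minimal sketch of a more forgiving write step follows, using only `java.nio.file`. `TextFileSaver` and `save` are hypothetical names, the sample string stands in for `builder.toString()`, and wrapping the `IOException` in an unchecked exception is just one possible design choice.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class TextFileSaver {

    // Write text to the target file as UTF-8, creating missing parent folders first.
    // An existing file at the target path is overwritten.
    static Path save(String text, Path target) {
        try {
            Path parent = target.toAbsolutePath().getParent();
            if (parent != null) {
                Files.createDirectories(parent);
            }
            return Files.write(target, text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Sample text standing in for builder.toString()
        Path written = save("Name Amount \r\n", Paths.get("output", "ExtractAllTables.txt"));
        System.out.println("Wrote " + written.toAbsolutePath());
    }
}
```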

Export Table Data from PDF to Excel

This scenario uses Spire.PDF for Java to extract tables from the PDF and Spire.XLS for Java to generate the Excel file. To use both in the same project, you'll need another library called Spire.Office for Java. You can either download it or install it from the Maven repository.

<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.office</artifactId>
        <version>4.12.2</version>
    </dependency>
</dependencies>

We already know how to obtain the text of a specific PDF table cell. The next step is to write that text into an Excel cell using Worksheet.get(int row, int column).setText(String string), where Worksheet is the com.spire.xls.Worksheet class. Worksheet rows and columns are addressed starting from 1, which is why the code adds 1 to the 0-based table indices. The following code snippet demonstrates how to export each table on a given page into its own worksheet by using Spire.Office for Java.

import com.spire.pdf.PdfDocument;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;
import com.spire.xls.ExcelVersion;
import com.spire.xls.Workbook;
import com.spire.xls.Worksheet;

public class ExtractTableDataAndSaveInExcel {

    public static void main(String[] args) {

        //Load a sample PDF document
        PdfDocument pdf = new PdfDocument("C:\\Users\\Administrator\\Desktop\\TwoTables.pdf");

        //Create a PdfTableExtractor instance
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);

        //Create a Workbook object
        Workbook wb = new Workbook();

        //Remove default worksheets
        wb.getWorksheets().clear();

        //Extract tables from the first page
        PdfTable[] pdfTables = extractor.extractTable(0);

        //If any tables are found
        if (pdfTables != null && pdfTables.length > 0) {

            //Loop through the tables
            for (int tableNum = 0; tableNum < pdfTables.length; tableNum++) {

                //Add a worksheet to workbook
                String sheetName = String.format("Table - %d", tableNum + 1);
                Worksheet sheet = wb.getWorksheets().add(sheetName);

                //Loop through the rows in the current table
                for (int rowNum = 0; rowNum < pdfTables[tableNum].getRowCount(); rowNum++) {

                    //Loop through the columns in the current table
                    for (int colNum = 0; colNum < pdfTables[tableNum].getColumnCount(); colNum++) {

                        //Extract data from the current table cell
                        String text = pdfTables[tableNum].getText(rowNum, colNum);

                        //Insert data into a specific cell
                        sheet.get(rowNum + 1, colNum + 1).setText(text);

                    }
                }

                //Auto fit column width
                for (int sheetColNum = 0; sheetColNum < sheet.getColumns().length; sheetColNum++) {
                    sheet.autoFitColumn(sheetColNum + 1);
                }
            }
        }

        //Save the workbook to an Excel file
        wb.saveToFile("output/ExportTableToExcel.xlsx", ExcelVersion.Version2016);
    }
}
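
The example writes every cell with `setText`, so figures such as `1,200.50` presumably arrive in the worksheet as text rather than as numbers. If numeric analysis is the goal, one option is to detect numeric cells before writing them. The sketch below covers only that detection step in plain Java. `CellValueParser` and `parseNumber` are hypothetical names, the US-style format (comma as thousands separator, dot as decimal separator) is an assumption about the source data, and the call for writing a numeric value into a cell should be taken from the Spire.XLS documentation.

```java
import java.math.BigDecimal;
import java.util.Optional;

public class CellValueParser {

    // Try to read a cell text as a number. Assumes commas are thousands separators.
    // Returns an empty Optional for null, blank, or non-numeric text.
    static Optional<BigDecimal> parseNumber(String text) {
        if (text == null) {
            return Optional.empty();
        }
        String cleaned = text.trim().replace(",", "");
        if (cleaned.isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(new BigDecimal(cleaned));
        } catch (NumberFormatException e) {
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Sample cell texts standing in for values read with getText(rowNum, colNum)
        String[] samples = {"1,200.50", "Total", " 42 "};
        for (String sample : samples) {
            String kind = parseNumber(sample).isPresent() ? "number" : "text";
            System.out.println("[" + sample + "] is treated as " + kind);
        }
    }
}
```

Cells for which `parseNumber` returns a value could then be written as numbers, and all others written with `setText` as in the example above.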
