PDF documents such as electronic invoices and financial reports often contain tables. You may need to extract that table data and save it in an Excel worksheet so that you can analyze it further with the tools MS Excel provides. This article demonstrates how to extract PDF tables using Spire.PDF for Java, covering the following three topics.
- Extract a Specific Table from a Specified PDF Page
- Extract All Tables from the Entire PDF Document
- Export Table Data from PDF to Excel
Install Spire.Pdf.Jar File
To begin with, download Spire.PDF for Java and add the Spire.Pdf.jar file to your Java application as a dependency. If you use Maven, you can import the jar file by adding the following configuration to your pom.xml file.
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.pdf</artifactId>
        <version>5.1.0</version>
    </dependency>
</dependencies>
Extract a Specific Table from a Specified PDF Page
Spire.PDF for Java offers the PdfTableExtractor class to extract tables from PDF documents. Create an instance of the PdfTableExtractor class and call its extractTable(int pageIndex) method to get all tables on a given page (the sample below passes 0 for the first page). The results are returned as a PdfTable[] array, so a specific table can be accessed by its index.
Once you have a table, use the PdfTable.getText(int rowIndex, int columnIndex) method to retrieve the text of each table cell.
import com.spire.pdf.PdfDocument;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractSpecificTableFromSpecifiedPage {
    public static void main(String[] args) throws IOException {
        //Load a sample PDF document
        PdfDocument pdf = new PdfDocument("C:\\Users\\Administrator\\Desktop\\TwoTables.pdf");
        //Create a PdfTableExtractor instance
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);
        //Extract tables from the first page
        PdfTable[] pdfTables = extractor.extractTable(0);
        //Stop if no tables were found on the page
        if (pdfTables == null || pdfTables.length == 0) {
            System.out.println("No tables found on the first page.");
            return;
        }
        //Get the first table
        PdfTable table = pdfTables[0];
        //Create a StringBuilder instance
        StringBuilder builder = new StringBuilder();
        //Loop through the rows in the current table
        for (int i = 0; i < table.getRowCount(); i++) {
            //Loop through the columns in the current table
            for (int j = 0; j < table.getColumnCount(); j++) {
                //Extract data from the current table cell
                String text = table.getText(i, j);
                //Append the text to the string builder
                builder.append(text).append(" ");
            }
            builder.append("\r\n");
        }
        //Write data into a .txt document (the output folder must already exist);
        //try-with-resources closes the writer even if writing fails
        try (FileWriter fw = new FileWriter("output/ExtractSpecificTableFromSpecifiedPage.txt")) {
            fw.write(builder.toString());
        }
    }
}
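The sample above separates cells with a single space, which becomes ambiguous as soon as a cell itself contains spaces. One possible alternative, sketched below, is to write the rows as CSV so that cell boundaries survive. The CsvRows class and its methods are hypothetical helpers and not part of Spire.PDF; the String[][] array stands in for the values returned by table.getText(i, j), and the file name table.csv is only an example.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of Spire.PDF: formats already-extracted cell text as CSV.
final class CsvRows {

    // Quote a cell only when it contains a comma, a double quote, or a line break.
    static String escape(String cell) {
        String value = (cell == null) ? "" : cell;
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        if (!needsQuotes) {
            return value;
        }
        // Inside a quoted field, a double quote is written as two double quotes.
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    // Join one row of cells into a single CSV line (no trailing separator).
    static String toCsvLine(String[] cells) {
        StringBuilder line = new StringBuilder();
        for (int j = 0; j < cells.length; j++) {
            if (j > 0) {
                line.append(',');
            }
            line.append(escape(cells[j]));
        }
        return line.toString();
    }

    // Write all rows to a file; try-with-resources closes the writer even on failure.
    static void write(String[][] rows, Path target) throws IOException {
        try (Writer writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (String[] row : rows) {
                writer.write(toCsvLine(row));
                writer.write("\r\n");
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in data; in the article's code these values would come from table.getText(i, j).
        String[][] rows = {
            {"Name", "Amount"},
            {"Smith, John", "1,200"}
        };
        write(rows, Paths.get("table.csv"));
    }
}
```

A CSV file produced this way can be opened directly in a spreadsheet application, which may be enough when a full Excel workbook is not required.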
Extract All Tables from the Entire PDF Document
The code example above shows how we can fetch a specific table from a page. By traversing all pages in the document, and all tables on every single page, you can obtain all tables from the entire PDF document.
import com.spire.pdf.PdfDocument;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;
import java.io.FileWriter;
import java.io.IOException;

public class ExtractTablesFromPdf {
    public static void main(String[] args) throws IOException {
        //Load a sample PDF document
        PdfDocument pdf = new PdfDocument("C:\\Users\\Administrator\\Desktop\\TwoTables.pdf");
        //Create a PdfTableExtractor instance
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);
        //Create a StringBuilder instance
        StringBuilder builder = new StringBuilder();
        //Loop through the pages
        for (int pageIndex = 0; pageIndex < pdf.getPages().getCount(); pageIndex++) {
            //Extract tables from the current page
            PdfTable[] pdfTables = extractor.extractTable(pageIndex);
            //If any tables are found
            if (pdfTables != null && pdfTables.length > 0) {
                //Loop through the tables
                for (PdfTable table : pdfTables) {
                    //Loop through the rows in the current table
                    for (int i = 0; i < table.getRowCount(); i++) {
                        //Loop through the columns in the current table
                        for (int j = 0; j < table.getColumnCount(); j++) {
                            //Extract data from the current table cell
                            String text = table.getText(i, j);
                            //Append the text to the string builder
                            builder.append(text).append(" ");
                        }
                        builder.append("\r\n");
                    }
                    //Separate tables with a blank line
                    builder.append("\r\n");
                }
            }
        }
        //Write data into a .txt document (the output folder must already exist);
        //try-with-resources closes the writer even if writing fails
        try (FileWriter fw = new FileWriter("output/ExtractAllTables.txt")) {
            fw.write(builder.toString());
        }
    }
}
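When every table in a document lands in one text file, it can be hard to tell which page a block of rows came from, and a cell whose text contains line breaks (which may happen with wrapped cell content) would split a row across several output lines. The sketch below illustrates one way to handle both: it prefixes each table with a page and table label and collapses whitespace inside each cell. TableDump and its nested CellGrid interface are hypothetical; CellGrid only mirrors the three accessors the article uses on PdfTable, so the example can run without a PDF file.

```java
// Hypothetical helper, not part of Spire.PDF.
final class TableDump {

    // Stand-in for PdfTable: exposes only the three accessors used in the article.
    interface CellGrid {
        int getRowCount();
        int getColumnCount();
        String getText(int rowIndex, int columnIndex);
    }

    // Wrap a plain 2D array so the sketch can run without a PDF file.
    static CellGrid of(final String[][] cells) {
        return new CellGrid() {
            public int getRowCount() {
                return cells.length;
            }
            public int getColumnCount() {
                return cells.length == 0 ? 0 : cells[0].length;
            }
            public String getText(int rowIndex, int columnIndex) {
                return cells[rowIndex][columnIndex];
            }
        };
    }

    // Render one table as tab-separated lines under a "# page N, table M" label.
    static String render(int pageIndex, int tableIndex, CellGrid table) {
        StringBuilder builder = new StringBuilder();
        // Indexes are zero-based in the loops, so add 1 for a human-readable label.
        builder.append("# page ").append(pageIndex + 1)
               .append(", table ").append(tableIndex + 1).append("\r\n");
        for (int i = 0; i < table.getRowCount(); i++) {
            for (int j = 0; j < table.getColumnCount(); j++) {
                if (j > 0) {
                    builder.append('\t');
                }
                String text = table.getText(i, j);
                // Collapse line breaks and repeated spaces so each row stays on one line.
                builder.append(text == null ? "" : text.replaceAll("\\s+", " ").trim());
            }
            builder.append("\r\n");
        }
        return builder.toString();
    }

    public static void main(String[] args) {
        CellGrid sample = of(new String[][] {{"Item", "Price"}, {"Blue\nwidget", "4.50"}});
        System.out.print(render(0, 1, sample));
    }
}
```

In the article's loop, the same labelling and cleanup could be applied to each PdfTable directly, using pageIndex and a table counter in place of the stand-in values.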
Export Table Data from PDF to Excel
This scenario uses Spire.PDF for Java to extract tables from the PDF and Spire.XLS for Java to generate the Excel file. To use both in the same project, you'll need another library called Spire.Office for Java. You can either download it or install it from the Maven repository.
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.office</artifactId>
        <version>4.12.2</version>
    </dependency>
</dependencies>
We already know how to obtain the text value of a specific PDF table cell. You can then write that text directly into an Excel cell by calling Worksheet.get(int row, int column).setText(String string), where Worksheet is the com.spire.xls.Worksheet class. The following code snippet demonstrates how to export each table on a given page into its own worksheet using Spire.Office for Java.
import com.spire.pdf.PdfDocument;
import com.spire.pdf.utilities.PdfTable;
import com.spire.pdf.utilities.PdfTableExtractor;
import com.spire.xls.ExcelVersion;
import com.spire.xls.Workbook;
import com.spire.xls.Worksheet;

public class ExtractTableDataAndSaveInExcel {
    public static void main(String[] args) {
        //Load a sample PDF document
        PdfDocument pdf = new PdfDocument("C:\\Users\\Administrator\\Desktop\\TwoTables.pdf");
        //Create a PdfTableExtractor instance
        PdfTableExtractor extractor = new PdfTableExtractor(pdf);
        //Create a Workbook object
        Workbook wb = new Workbook();
        //Remove default worksheets
        wb.getWorksheets().clear();
        //Extract tables from the first page
        PdfTable[] pdfTables = extractor.extractTable(0);
        //If any tables are found
        if (pdfTables != null && pdfTables.length > 0) {
            //Loop through the tables
            for (int tableNum = 0; tableNum < pdfTables.length; tableNum++) {
                //Add a worksheet to the workbook, named "Table - 1", "Table - 2", ...
                String sheetName = String.format("Table - %d", tableNum + 1);
                Worksheet sheet = wb.getWorksheets().add(sheetName);
                //Loop through the rows in the current table
                for (int rowNum = 0; rowNum < pdfTables[tableNum].getRowCount(); rowNum++) {
                    //Loop through the columns in the current table
                    for (int colNum = 0; colNum < pdfTables[tableNum].getColumnCount(); colNum++) {
                        //Extract data from the current table cell
                        String text = pdfTables[tableNum].getText(rowNum, colNum);
                        //Insert data into the matching worksheet cell
                        //(table indexes start at 0, so each is shifted by 1 for the worksheet)
                        sheet.get(rowNum + 1, colNum + 1).setText(text);
                    }
                }
                //Auto fit column width
                for (int sheetColNum = 0; sheetColNum < sheet.getColumns().length; sheetColNum++) {
                    sheet.autoFitColumn(sheetColNum + 1);
                }
            }
        }
        //Save the workbook to an Excel file
        wb.saveToFile("output/ExportTableToExcel.xlsx", ExcelVersion.Version2016);
    }
}
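The sample names its worksheets "Table - 1", "Table - 2", and so on, which works for a single page. If you extend it to loop over every page, the names would repeat, and Excel requires worksheet names to be unique (ignoring case), at most 31 characters long, and free of the characters \ / ? * [ ] and :. The sketch below shows one way to generate names that respect those rules. SheetNames is a hypothetical helper and not part of Spire.XLS; its result would be passed to wb.getWorksheets().add(...) in place of the sheetName string above.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Spire.XLS: builds valid, unique worksheet names.
final class SheetNames {

    // Excel limits worksheet names to 31 characters.
    private static final int MAX_LENGTH = 31;

    // Names already handed out, stored lower-cased because Excel compares names case-insensitively.
    private final Set<String> used = new HashSet<>();

    // Build a name such as "Page 1 - Table 2" from zero-based page and table indexes.
    String next(int pageIndex, int tableIndex) {
        String base = String.format("Page %d - Table %d", pageIndex + 1, tableIndex + 1);
        return unique(sanitize(base));
    }

    // Replace characters Excel rejects in sheet names and trim to the length limit.
    static String sanitize(String raw) {
        String cleaned = raw.replaceAll("[\\\\/?*\\[\\]:]", "_").trim();
        if (cleaned.isEmpty()) {
            cleaned = "Sheet";
        }
        return cleaned.length() > MAX_LENGTH ? cleaned.substring(0, MAX_LENGTH) : cleaned;
    }

    // Append " (2)", " (3)", ... until the name has not been used before.
    String unique(String name) {
        String candidate = name;
        int suffix = 2;
        while (!used.add(candidate.toLowerCase(Locale.ROOT))) {
            String tail = " (" + suffix++ + ")";
            // Shorten the base name if needed so the suffix still fits within the limit.
            int keep = Math.min(name.length(), MAX_LENGTH - tail.length());
            candidate = name.substring(0, keep) + tail;
        }
        return candidate;
    }

    public static void main(String[] args) {
        SheetNames names = new SheetNames();
        System.out.println(names.next(0, 0));
        System.out.println(names.next(1, 0));
    }
}
```

Including the page number in the name also makes it easier to trace a worksheet back to its place in the source PDF.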