Introduction
Extracting text from documents is a common task, whether the goal is to analyze a document's content or to retrieve specific information from it. Because Word documents are widely used for storing and processing text, this article focuses on extracting text from Word documents in Java using Free Spire.Doc for Java.
A Word document can contain a wide range of elements, such as sections, paragraphs, tables, and bookmarks. This article shows how to extract text from a whole Word document and from individual elements within it:
- Extract Text from a Whole Word Document
- Extract Text from a Section or Paragraph in a Word Document
- Extract Text from Paragraphs that Use Specific Styles in a Word Document
- Extract Text from a Table in a Word Document
- Extract Text from a Bookmark in a Word Document
Add Dependencies
If you are using Maven, you can add Free Spire.Doc for Java to your application by adding the following configuration to your project's pom.xml file.
<repositories>
    <repository>
        <id>com.e-iceblue</id>
        <name>e-iceblue</name>
        <url>https://repo.e-iceblue.com/nexus/content/groups/public/</url>
    </repository>
</repositories>
<dependencies>
    <dependency>
        <groupId>e-iceblue</groupId>
        <artifactId>spire.doc.free</artifactId>
        <version>5.2.0</version>
    </dependency>
</dependencies>
If you are not using Maven, you can download Free Spire.Doc for Java from the official website, extract the zip file, and then import the Spire.Doc.jar file from the lib folder into your project as a dependency.
Extract Text from a Whole Word Document in Java
Extracting text from a whole Word document takes only four steps:
- Initialize an instance of the Document class.
- Load a Word document using the Document.loadFromFile() method.
- Get text from the document using the Document.getText() method.
- Write the text into a .txt file.
import com.spire.doc.Document;
import java.io.File;
import java.io.FileWriter;
public class ExtractTextFromDocument {
    public static void main(String[] args) throws Exception {
        //Initialize an instance of the Document class
        Document document = new Document();
        //Load a Word document
        document.loadFromFile("Input.docx");
        //Get text from the whole document
        String content = document.getText();
        //Initialize an instance of the File class
        File output = new File("Document.txt");
        //Initialize an instance of the FileWriter class
        FileWriter writer = new FileWriter(output);
        //Write the text into a .txt file
        writer.write(content);
        writer.flush();
        writer.close();
    }
}
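The example above writes the file with FileWriter, which uses the platform's default character encoding and is not closed if write() throws. If the extracted text contains non-ASCII characters, it may be safer to set the encoding explicitly. The following is a minimal sketch of the final "write the text into a .txt file" step using only the JDK. The class name Utf8TextSaver and its save method are hypothetical helpers for illustration, not part of Free Spire.Doc for Java.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (an assumption for illustration, not a library class)
final class Utf8TextSaver {

    // Writes the text to the target path as UTF-8, replacing any existing file
    static void save(String text, Path target) throws IOException {
        // try-with-resources closes the writer even if write() throws
        try (BufferedWriter writer = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in value; in the example above this would be document.getText()
        String content = "Text extracted from Input.docx";
        save(content, Paths.get("Document.txt"));
    }
}
```

The same helper could replace the File and FileWriter lines in the other examples in this article, since each of them ends by writing a single string to a .txt file.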
Extract Text from a Section or Paragraph in a Word Document in Java
A Word document can contain one or more sections, and a section can contain one or more paragraphs.
You can extract text from a specific paragraph in a section, or extract text from a section by iterating through all paragraphs in it and then extracting text from them.
The following steps show you how to extract text from a specific paragraph in a section:
- Initialize an instance of the Document class.
- Load a Word document using the Document.loadFromFile() method.
- Get the desired section by its index using the Document.getSections().get(int) method.
- Get the desired paragraph in the section by its index using the Section.getParagraphs().get(int) method.
- Get the text of the paragraph using the Paragraph.getText() method.
- Write the text into a .txt file.
import com.spire.doc.Document;
import com.spire.doc.Section;
import com.spire.doc.documents.Paragraph;
import java.io.File;
import java.io.FileWriter;
public class ExtractTextFromParagraph {
    public static void main(String[] args) throws Exception {
        //Initialize an instance of the Document class
        Document document = new Document();
        //Load a Word document
        document.loadFromFile("Input.docx");
        //Get the first section
        Section section = document.getSections().get(0);
        //Get the second paragraph in the section
        Paragraph paragraph = section.getParagraphs().get(1);
        //Get the text of the paragraph
        String text = paragraph.getText();
        //Initialize an instance of the File class
        File output = new File("Paragraphs.txt");
        //Initialize an instance of the FileWriter class
        FileWriter writer = new FileWriter(output);
        //Write the text into a .txt file
        writer.write(text);
        writer.flush();
        writer.close();
    }
}
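The example above reads a single paragraph. To extract a whole section, as mentioned earlier, you would loop over every paragraph index and collect the text. The following sketch shows that loop in isolation, using only the JDK so that it runs without the library. IndexedTextCollector is a hypothetical helper, and the array in main is stand-in data.

```java
import java.util.function.IntFunction;

// Hypothetical helper (an assumption for illustration, not a library class)
final class IndexedTextCollector {

    // Collects the text of items 0..count-1, ending each item with a newline
    static String collect(int count, IntFunction<String> textAt) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < count; i++) {
            sb.append(textAt.apply(i)).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in data; with the library, the lambda would read a paragraph by index
        String[] paragraphs = {"First paragraph", "Second paragraph"};
        System.out.print(collect(paragraphs.length, i -> paragraphs[i]));
    }
}
```

With the section variable from the example above, the call would presumably look like collect(section.getParagraphs().getCount(), j -> section.getParagraphs().get(j).getText()), which relies only on the getCount(), get(int), and getText() calls already used in this article.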
Extract Text from Paragraphs that Use Specific Styles in a Word Document in Java
Paragraphs in a Word document can use different styles, such as Heading 1, Heading 2, Heading 3, or a custom style.
Free Spire.Doc for Java provides the ability to extract text from paragraphs that use specific styles in a Word document. The following are the main steps to do so:
- Initialize an instance of the Document class.
- Load a Word document using the Document.loadFromFile() method.
- Initialize an instance of the StringBuilder class.
- Iterate through all sections in the document.
- Iterate through all paragraphs in each section.
- Check if the current paragraph uses a specific style using the Paragraph.getStyleName().equals(String) method.
- Get the text from the paragraph using the Paragraph.getText() method.
- Save the text into the StringBuilder.
- Write the text in the StringBuilder into a .txt file.
import com.spire.doc.Document;
import com.spire.doc.documents.Paragraph;
import java.io.File;
import java.io.FileWriter;
public class ExtractTextFromParagraphsWithSpecificStyles {
    public static void main(String[] args) throws Exception {
        //Initialize an instance of the Document class
        Document document = new Document();
        //Load a Word document
        document.loadFromFile("Input.docx");
        //Initialize an instance of the StringBuilder class
        StringBuilder sb = new StringBuilder();
        //Loop through all sections in the document
        for (int i = 0; i < document.getSections().getCount(); i++) {
            //Loop through the paragraphs in each section
            for (int j = 0; j < document.getSections().get(i).getParagraphs().getCount(); j++) {
                //Get the current paragraph
                Paragraph paragraph = document.getSections().get(i).getParagraphs().get(j);
                //Check if the paragraph style name is "Heading1"
                if (paragraph.getStyleName().equals("Heading1")) {
                    //Get the text of the paragraph
                    String text = paragraph.getText();
                    //Save the text into the StringBuilder
                    sb.append(text + "\n");
                }
            }
        }
        //Initialize an instance of the File class
        File output = new File("ParagraphsWithStyles.txt");
        //Initialize an instance of the FileWriter class
        FileWriter writer = new FileWriter(output);
        //Write the text in the StringBuilder into a .txt file
        writer.write(sb.toString());
        writer.flush();
        writer.close();
    }
}
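The check above compares against the exact string "Heading1". A paragraph whose style name is reported differently, for example "Heading 1" with a space or with different capitalization, would be skipped, and calling equals on a null style name would throw. The following is a minimal sketch of a more forgiving comparison, assuming you want to ignore case and whitespace. StyleNameMatcher is a hypothetical helper, not part of Free Spire.Doc for Java.

```java
import java.util.Locale;

// Hypothetical helper (an assumption for illustration, not a library class)
final class StyleNameMatcher {

    // Removes all whitespace and lower-cases the name
    static String normalize(String styleName) {
        return styleName.replaceAll("\\s+", "").toLowerCase(Locale.ROOT);
    }

    // Null-safe comparison that ignores case and whitespace
    static boolean matches(String actual, String wanted) {
        if (actual == null || wanted == null) {
            return false;
        }
        return normalize(actual).equals(normalize(wanted));
    }

    public static void main(String[] args) {
        System.out.println(matches("Heading1", "Heading 1")); // prints true
        System.out.println(matches(null, "Heading 1"));       // prints false
        System.out.println(matches("Heading2", "Heading 1")); // prints false
    }
}
```

In the loop above, the condition could then be written as StyleNameMatcher.matches(paragraph.getStyleName(), "Heading 1"). Whether this looser matching is appropriate depends on how strictly your documents name their styles.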
Extract Text from a Table in a Word Document in Java
A table is made up of cells. To extract text from a table, you need to access the cells in the table and then get the text from them. The following are the detailed steps:
- Initialize an instance of the Document class.
- Load a Word document using the Document.loadFromFile() method.
- Get the desired section by its index using the Document.getSections().get(int) method.
- Get the desired table in the section by its index using the Section.getTables().get(int) method.
- Initialize an instance of the StringBuilder class.
- Iterate through the rows in the table.
- Iterate through the cells in each row.
- Iterate through the paragraphs in each cell.
- Get the text of each paragraph using the Paragraph.getText() method and save the result into the StringBuilder.
- Write the text in the StringBuilder into a .txt file.
import com.spire.doc.Document;
import com.spire.doc.Section;
import com.spire.doc.TableCell;
import com.spire.doc.TableRow;
import com.spire.doc.documents.Paragraph;
import com.spire.doc.interfaces.ITable;
import java.io.File;
import java.io.FileWriter;
public class ExtractTextFromTable {
    public static void main(String[] args) throws Exception {
        //Initialize an instance of the Document class
        Document document = new Document();
        //Load a Word document
        document.loadFromFile("Table.docx");
        //Get the first section
        Section section = document.getSections().get(0);
        //Get the first table in the first section
        ITable table = section.getTables().get(0);
        //Initialize an instance of the StringBuilder class
        StringBuilder sb = new StringBuilder();
        //Iterate through the rows in the table
        for (int i = 0; i < table.getRows().getCount(); i++) {
            TableRow row = table.getRows().get(i);
            //Iterate through the cells in each row
            for (int j = 0; j < row.getCells().getCount(); j++) {
                TableCell cell = row.getCells().get(j);
                //Iterate through the paragraphs in each cell
                for (int k = 0; k < cell.getParagraphs().getCount(); k++) {
                    //Extract text from each paragraph
                    Paragraph paragraph = cell.getParagraphs().get(k);
                    String text = paragraph.getText();
                    //Append the text to the StringBuilder, followed by a tab
                    sb.append(text + "\t");
                }
            }
            //End the current row with a line break
            sb.append("\r\n");
        }
        //Initialize an instance of the File class
        File output = new File("Table.txt");
        //Initialize an instance of the FileWriter class
        FileWriter writer = new FileWriter(output);
        //Write the text in the StringBuilder into a .txt file
        writer.write(sb.toString());
        writer.flush();
        writer.close();
    }
}
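The loop above appends a tab after every paragraph, so each row ends with a trailing tab, and a cell that holds several paragraphs spreads across several tab-separated fields. If the output will be parsed later, one option is to collect the text of each cell first and join the cells afterwards. The following is a minimal sketch on plain lists, which stand in for cell text already read from the table. TableTextFormatter is a hypothetical helper, not part of Free Spire.Doc for Java.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper (an assumption for illustration, not a library class)
final class TableTextFormatter {

    // Joins the cells of each row with tabs and the rows with newlines,
    // without a trailing tab or trailing newline
    static String toTabSeparated(List<List<String>> rows) {
        return rows.stream()
                .map(cells -> String.join("\t", cells))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        // Stand-in data; each inner list holds the text of one row's cells
        List<List<String>> rows = Arrays.asList(
                Arrays.asList("Name", "Quantity"),
                Arrays.asList("Pen", "3"));
        System.out.println(toTabSeparated(rows));
    }
}
```

To use it with the example above, you would build one list per row inside the row loop, adding one string per cell, and call toTabSeparated once after the loops finish.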
Extract Text from a Bookmark in a Word Document in Java
In Word, text can be bookmarked to enable readers to quickly navigate to its location.
You can retrieve the text of a specific bookmark in a Word document by following the steps below:
- Initialize an instance of the Document class.
- Load a Word document using the Document.loadFromFile() method.
- Initialize an instance of the BookmarksNavigator class.
- Find the specific bookmark by its name using the BookmarksNavigator.moveToBookmark(String) method.
- Get the content of the bookmark using the BookmarksNavigator.getBookmarkContent() method.
- Initialize an instance of the StringBuilder class.
- Iterate through the items in the bookmark content.
- Check if the current item is of Paragraph type.
- Iterate through the child objects in the paragraph.
- Check if the current child object is of TextRange type.
- Get the text of the text range using the TextRange.getText() method and save the result into the StringBuilder.
- Write the text in the StringBuilder into a .txt file.
import com.spire.doc.Document;
import com.spire.doc.documents.BookmarksNavigator;
import com.spire.doc.documents.Paragraph;
import com.spire.doc.documents.TextBodyPart;
import com.spire.doc.fields.TextRange;
import java.io.File;
import java.io.FileWriter;
public class ExtractTextFromBookmark {
    public static void main(String[] args) throws Exception {
        //Initialize an instance of the Document class
        Document document = new Document();
        //Load a Word document
        document.loadFromFile("Bookmark.docx");
        //Initialize an instance of the BookmarksNavigator class
        BookmarksNavigator navigator = new BookmarksNavigator(document);
        //Find the specific bookmark by its name
        navigator.moveToBookmark("MyFirstBookmark");
        //Get the content of the bookmark
        TextBodyPart textBodyPart = navigator.getBookmarkContent();
        //Initialize an instance of the StringBuilder class
        StringBuilder sb = new StringBuilder();
        //Iterate through the items in the bookmark content
        for (Object item : textBodyPart.getBodyItems()) {
            //Check if the current item is of Paragraph type
            if (item instanceof Paragraph) {
                //Iterate through the child objects in the paragraph
                for (Object childObject : ((Paragraph) item).getChildObjects()) {
                    //Check if the current child object is of TextRange type
                    if (childObject instanceof TextRange) {
                        //Get the text of the text range and save the result into the StringBuilder
                        TextRange range = (TextRange) childObject;
                        sb.append(range.getText() + "\n");
                    }
                }
            }
        }
        //Initialize an instance of the File class
        File output = new File("Bookmark.txt");
        //Initialize an instance of the FileWriter class
        FileWriter writer = new FileWriter(output);
        //Write the text in the StringBuilder into a .txt file
        writer.write(sb.toString());
        writer.flush();
        writer.close();
    }
}
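The example above adds a line break after every text range. Because a paragraph can contain several text ranges, the text of one bookmarked paragraph may end up split across several lines. If you would rather keep one line per paragraph, you can concatenate the ranges of a paragraph and add the line break only once the paragraph ends. The following is a minimal sketch of that idea on plain lists, where each inner list stands in for the text ranges of one paragraph. BookmarkTextJoiner is a hypothetical helper, not part of Free Spire.Doc for Java.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (an assumption for illustration, not a library class)
final class BookmarkTextJoiner {

    // Concatenates the ranges of each paragraph, then ends the paragraph with a newline
    static String join(List<List<String>> paragraphs) {
        StringBuilder sb = new StringBuilder();
        for (List<String> ranges : paragraphs) {
            for (String range : ranges) {
                sb.append(range);
            }
            sb.append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Stand-in data; with the library, each string would come from TextRange.getText()
        List<List<String>> paragraphs = Arrays.asList(
                Arrays.asList("Bookmarked ", "text", " in one paragraph"),
                Arrays.asList("Second paragraph"));
        System.out.print(join(paragraphs));
    }
}
```

In the bookmark example, the equivalent change would be to append range.getText() without "\n" inside the inner loop and to append "\n" once after that loop completes.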