Text search and extraction in a PDF file

Question

I am working on text search and extraction from PDF files using the third-party library iTextSharp.

The search finds the text, but what I get back is the text of the whole page rather than just the matched text.

I thought of using phrases or chunks so that I get only the matched text together with the text immediately before and after it, instead of the whole page. Can anyone suggest code using phrases, or any other approach, for this? Thanks!

My code is:

// Requires: using System.IO; using System.Collections.Generic;
// using iTextSharp.text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
string filename = @"C:\test.pdf";
string searchText = textBox.Text;
List<int> pages = new List<int>();

if (File.Exists(filename))
{
    PdfReader pdfReader = new PdfReader(filename);
    List<Phrase> PhraseList = new List<Phrase>(); // not used yet

    for (int page = 1; page <= pdfReader.NumberOfPages; page++)
    {
        ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
        string currentPageText = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy);

        if (currentPageText.Contains(searchText))
        {
            pages.Add(page);
            // This appends the text of the entire page, not just the match.
            textBox1.AppendText(PdfTextExtractor.GetTextFromPage(pdfReader, page));
            // Show the page number (pages.ToString() would only print the list's type name).
            textBox1.Text += page.ToString();
        }
    }

    // Close the reader once, after all pages have been processed.
    pdfReader.Close();
}

Leon D · Answer

The Spire.PDF library lets you search and extract text in a PDF using the FindText method of the PdfPageBase class: https://www.e-iceblue.com/Tutorials/Spire.PDF/Program-Guide/Text/Find-and-replace-text-on-PDF-document-in-C.html

Kip Hackman · Answer

I understand that this is an older post, but for those still seeking an answer: the LEADTOOLS Document Viewer can search a PDF and report where the text is located using the Find() method. Sample code for this text search is here: https://www.leadtools.com/help/sdk/v22/dh/doxui/documentviewertext-find.html

You also have the option to OCR the document, extract all recognized text, and handle it as a string. This step-by-step tutorial shows how to extract the text of a document: https://www.leadtools.com/help/sdk/v22/tutorials/dotnet-console-parse-the-text-of-a-document.html

Salman Beg · Answer

Hi. You can do it like this:

if (currentPageText.Contains(searchText))
{
    // Create a new PDF and insert the same searchText
}

Thanks.

Ritika · Answer

Hi Salman, I want to extract only the searched text, but I'm getting the whole page text. That's what I am asking for help with: getting only the searched text in the result instead of the whole page text. Thanks!

Salman Beg · Answer

@Ritika When the search text is found, do you want to extract just that search text or the whole page? I got confused here. If you want only the search text, you can create a new PDF file inside your if condition and write the text there. If you want the whole page, you can read that page and paste it into a new PDF file. Thanks.
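
For the case where only the matched text plus a little surrounding context is wanted, a minimal sketch might look like the following. It works on the page string already returned by PdfTextExtractor.GetTextFromPage and uses only standard .NET string methods, so it does not rely on phrases or chunks. The SnippetFinder class, the FindSnippets method, and the contextChars parameter are hypothetical names chosen for illustration; they are not part of iTextSharp. The match is case-sensitive, like the Contains call in the question.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of iTextSharp.
public static class SnippetFinder
{
    // Returns every occurrence of searchText in pageText, each with up to
    // contextChars characters of text before and after the match.
    public static List<string> FindSnippets(string pageText, string searchText, int contextChars)
    {
        var snippets = new List<string>();
        if (string.IsNullOrEmpty(pageText) || string.IsNullOrEmpty(searchText))
            return snippets;

        int index = pageText.IndexOf(searchText, StringComparison.Ordinal);
        while (index >= 0)
        {
            // Clamp the window so it never runs past either end of the page text.
            int start = Math.Max(0, index - contextChars);
            int end = Math.Min(pageText.Length, index + searchText.Length + contextChars);
            snippets.Add(pageText.Substring(start, end - start));

            // Continue searching after the current match.
            index = pageText.IndexOf(searchText, index + searchText.Length, StringComparison.Ordinal);
        }
        return snippets;
    }
}
```

Inside the page loop, the whole-page AppendText call could then be replaced with something like: foreach (string snippet in SnippetFinder.FindSnippets(currentPageText, searchText, 40)) textBox1.AppendText(snippet + Environment.NewLine); where 40 is an arbitrary context width. Because extracted PDF text can contain line breaks in unexpected places, a phrase that wraps across lines may not be found by a plain string search.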
