Udai Mathur
Paragraph Reading in PDF
Jul 1 2019 3:46 AM
In my code I need to read the content of a PDF file and, based on some specific requirements, insert that content into a SQL Server database.
I am using iTextSharp to read the PDF. It reads well when a line of text spans the whole width of the page.
The problem appears when the PDF contains a table.
The extractor reads one line from column 1, then jumps to column 2 and reads the line there, and so on.
Both column 1 and column 2 contain paragraphs, so each paragraph gets broken into separate single lines, interleaved with lines from the other column, and the result has no meaning.
I want it to read all of column 1 first: read a paragraph, and if a new paragraph starts after a line break, read that paragraph next.
Only after column 1 has been processed should it move on to column 2.
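To make the paragraph requirement concrete, here is a minimal sketch of the grouping step once the text of a single column is available as one string. It assumes that paragraphs are separated by a blank line in the extracted text, which may not hold for every PDF. The class ParagraphSplitter and the method SplitIntoParagraphs are hypothetical names, not part of iTextSharp.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ParagraphSplitter
{
    // Hypothetical helper: joins the wrapped lines of one column into paragraphs.
    // Assumption: a blank line marks the boundary between two paragraphs.
    public static List<string> SplitIntoParagraphs(string columnText)
    {
        var paragraphs = new List<string>();
        var current = new StringBuilder();

        foreach (string rawLine in columnText.Split('\n'))
        {
            string line = rawLine.Trim(); // also drops a trailing '\r'

            if (line.Length == 0)
            {
                // Blank line: close the paragraph collected so far, if any.
                if (current.Length > 0)
                {
                    paragraphs.Add(current.ToString());
                    current.Clear();
                }
                continue;
            }

            // Non-blank line: it continues the current paragraph.
            if (current.Length > 0)
            {
                current.Append(' ');
            }
            current.Append(line);
        }

        // Flush the last paragraph when the text does not end with a blank line.
        if (current.Length > 0)
        {
            paragraphs.Add(current.ToString());
        }
        return paragraphs;
    }
}
```

If the PDF does not leave a blank line between paragraphs, the separator test would have to be replaced by some other rule, for example a line that ends with a full stop.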
I am attaching screenshots of the PDF file (PDF_File) and of the extracted content (PDF_Content). There you can see that it is merging two different paragraphs from different cells.
Currently I am using the code below:
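// Requires: using System.Text; using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
PdfReader reader = new PdfReader(@"D:\pdf1.pdf");
int pageCount = reader.NumberOfPages;
StringBuilder text = new StringBuilder(); // declared here so the snippet is complete
for (int i = 1; i <= pageCount; i++)
{
    // Use a fresh strategy for every page so earlier pages are not returned again.
    ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
    // GetTextFromPage already returns a .NET string; the Encoding round-trip
    // was dropped because it could only lose characters, never repair them.
    string currentText = PdfTextExtractor.GetTextFromPage(reader, i, strategy);
    text.Append(currentText);
}
reader.Close();
// Split once, after all pages have been read, instead of on every iteration.
string[] sentence = text.ToString().Split('\n');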
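One possible way to get the column-by-column order asked for above is to extract each column separately with a region filter. The sketch below uses iTextSharp 5 classes (RegionTextRenderFilter, FilteredTextRenderListener, LocationTextExtractionStrategy). It assumes a two-column layout whose boundary is the middle of the page, which is only a guess about the attached PDF, so the rectangles would need to be adjusted to the real table. ColumnExtractor and ReadColumns are hypothetical names.

```csharp
using System.Collections.Generic;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class ColumnExtractor
{
    // Returns one string per column per page: left column first, then right.
    public static List<string> ReadColumns(string path)
    {
        var columns = new List<string>();
        PdfReader reader = new PdfReader(path);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                Rectangle size = reader.GetPageSize(page);

                // Assumption: the two columns meet at the horizontal middle of the page.
                float middle = (size.Left + size.Right) / 2f;

                Rectangle[] regions =
                {
                    new Rectangle(size.Left, size.Bottom, middle, size.Top),  // column 1
                    new Rectangle(middle, size.Bottom, size.Right, size.Top)  // column 2
                };

                foreach (Rectangle region in regions)
                {
                    // Keep only the text that falls inside the current region.
                    RenderFilter[] filters = { new RegionTextRenderFilter(region) };
                    ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                        new LocationTextExtractionStrategy(), filters);

                    columns.Add(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return columns;
    }
}
```

A call such as ColumnExtractor.ReadColumns(@"D:\pdf1.pdf") would then give the text of column 1 before the text of column 2 for each page. Text that crosses the assumed boundary may show up in both regions, so the boundary has to match the real cell edges.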
Attachment:
PDF.rar