Introduction
Reading files programmatically is a common task. A typical situation is downloading a file from S3 or GCS and then extracting its text for further processing, or modifying the file based on some condition. The most common case involves .csv or .xlsx files, but PDF files can be read in Python as well: the PyPDF2 library provides a collection of functions and classes that make reading PDF files and extracting their text straightforward. This article explains how to read a PDF file using PyPDF2 and covers some useful scenarios: finding the number of pages in a PDF, reading the document's details, extracting text from a specific page, accessing page contents, and moving the text of a set of pages to a separate file.
Installation
PyPDF2 is installed in the usual way with pip: run pip install PyPDF2 from the command line and wait for the log to report a successful installation.
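The class and method names used in this article (PdfFileReader, getPage, extractText) belong to the older PyPDF2 interface, which the 3.x releases replaced with names such as PdfReader. It can therefore help to confirm which version pip installed. The following is a minimal sketch using only the standard library; the helper name installed_version is an assumption made for illustration and is not part of PyPDF2.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# Hypothetical helper, not part of PyPDF2 itself.
version = installed_version("PyPDF2")
if version is None:
    print("PyPDF2 is not installed; run: pip install PyPDF2")
else:
    print("PyPDF2 version:", version)
```

If the reported version is 3.0 or later, the examples below would need the newer names, or an older release can be pinned with pip install "PyPDF2<3".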
Classes
The example reads and parses a PDF file of 424 pages. The PyPDF2 library has four main classes:
- PdfFileReader – Main class for reading a PDF file
- PdfFileWriter – Main class for creating a PDF file
- PdfFileMerger – Merges multiple PDFs into one
- PageObject – Represents a single page in the PDF
Since this article is about reading and processing PDF files, we focus on two of these classes: PdfFileReader and PageObject.
Reading PDF
To read a PDF file, we first import PyPDF2 and instantiate a PdfFileReader object.
import PyPDF2
doc = PyPDF2.PdfFileReader('Data Visualization with Python Pragmatic Eyes.pdf')
Through the getDocumentInfo() method or the documentInfo attribute we can access the PDF's information dictionary, which holds entries such as Title, Licensed to, Creator, and the PDF creation date.
import PyPDF2
doc = PyPDF2.PdfFileReader('Data Visualization with Python Pragmatic Eyes.pdf')
doc.getDocumentInfo()
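The information dictionary uses PDF-style keys that start with a slash, such as /Title and /Creator. A small helper can turn it into a plain dictionary that is easier to print or log. This is only a sketch: summarize_info is a hypothetical name, and the sample values stand in for a real getDocumentInfo() result so that the code runs without a PDF.

```python
def summarize_info(info):
    """Return a plain dict with the leading '/' removed from each key."""
    summary = {}
    for key, value in info.items():
        summary[str(key).lstrip("/")] = str(value)
    return summary


# Stand-in for the result of doc.getDocumentInfo(); the values are made up.
sample_info = {"/Title": "Sample Title", "/Creator": "Sample Creator"}
print(summarize_info(sample_info))
```

With a real document, the call would be summarize_info(doc.getDocumentInfo()).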
To find the number of pages in the PDF, PyPDF2 offers the getNumPages() method and the equivalent numPages attribute.
import PyPDF2
doc = PyPDF2.PdfFileReader('Data Visualization with Python Pragmatic Eyes.pdf')
print('No. of Pages:', doc.numPages)
The isEncrypted attribute tells us whether the PDF file is encrypted.
import PyPDF2
doc = PyPDF2.PdfFileReader('Data Visualization with Python Pragmatic Eyes.pdf')
doc.isEncrypted  # False
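When isEncrypted is True, the pages cannot be read until the reader's decrypt() method has been called with a valid password; decrypt() returns 0 when the password does not match. The sketch below wraps that check in a hypothetical helper named unlock. FakeReader is a stand-in invented for this example so the code runs without an encrypted PDF at hand.

```python
def unlock(reader, password=""):
    """Return True if the reader is readable, decrypting first when needed."""
    if not reader.isEncrypted:
        return True
    # decrypt() returns 0 when the password does not match.
    return reader.decrypt(password) != 0


class FakeReader:
    """Stand-in for PdfFileReader so the sketch runs without a real PDF."""

    def __init__(self, encrypted, correct_password=""):
        self.isEncrypted = encrypted
        self._correct_password = correct_password

    def decrypt(self, password):
        return 1 if password == self._correct_password else 0


print(unlock(FakeReader(encrypted=False)))
print(unlock(FakeReader(encrypted=True, correct_password="secret"), "secret"))
print(unlock(FakeReader(encrypted=True, correct_password="secret"), "wrong"))
```

With a real document, unlock(doc, password) would be called once before any page is accessed.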
PageObject
PageObject is the class PyPDF2 uses to represent a single page of a PDF, and it offers several methods for working with that page. A PageObject is returned by the getPage() method of PdfFileReader, and several other methods return PageObject instances as well.
To extract the content of a specific page, pass the page index to getPage(), which returns a PageObject, and then call the PageObject's extractText() method to get the text.
import PyPDF2
doc = PyPDF2.PdfFileReader('Data Visualization with Python Pragmatic Eyes.pdf')
doc.getPage(3).extractText()  # indices start at 0, so this is the fourth page
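Because getPage() counts from zero, the page number a reader sees in a PDF viewer is one higher than the index passed to getPage(). A small helper can do the conversion and reject numbers that fall outside the document. This is a sketch only; page_index is a hypothetical name, and num_pages would come from doc.numPages.

```python
def page_index(page_number, num_pages):
    """Convert a 1-based page number into the 0-based index getPage() expects."""
    if not 1 <= page_number <= num_pages:
        raise ValueError(f"page {page_number} is outside 1..{num_pages}")
    return page_number - 1


# With the 424-page document from this article, the fourth page has index 3.
print(page_index(4, 424))
```

With a real document, the call would look like doc.getPage(page_index(4, doc.numPages)).extractText().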
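With the help of the extractText() method, let's move the content of a few pages from the PDF to a separate text file.
import PyPDF2
doc = PyPDF2.PdfFileReader('Data Visualization with Python Pragmatic Eyes.pdf')
def createTextFromPDF():
    # Page indices are zero-based, so range(1, 5) covers the second through fifth pages.
    with open('demo.txt', 'w', encoding='utf-8') as file:
        for page in range(1, 5):
            # Extract the text of one page and append it to the output file.
            data = doc.getPage(page).extractText()
            file.write(data)
createTextFromPDF()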
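The same idea can be written as a reusable function that takes the reader, the output path, and the page range as parameters, and that marks where each page begins in the output. The sketch below is one possible variation, not the only way to do it: write_pages_to_text, FakePage, and FakeReader are hypothetical names, and the fake classes only imitate getPage() and extractText() so the code runs without a PDF.

```python
import os
import tempfile


def write_pages_to_text(reader, out_path, first_index, last_index):
    """Write the text of pages first_index..last_index (0-based, inclusive)."""
    pages_written = 0
    with open(out_path, "w", encoding="utf-8") as out_file:
        for index in range(first_index, last_index + 1):
            text = reader.getPage(index).extractText()
            # Mark each page with its 1-based number so the output stays navigable.
            out_file.write(f"--- page {index + 1} ---\n")
            out_file.write(text + "\n")
            pages_written += 1
    return pages_written


class FakePage:
    """Stand-in for PageObject."""

    def __init__(self, text):
        self._text = text

    def extractText(self):
        return self._text


class FakeReader:
    """Stand-in for PdfFileReader."""

    def __init__(self, texts):
        self._pages = [FakePage(t) for t in texts]

    def getPage(self, index):
        return self._pages[index]


demo_reader = FakeReader(["first", "second", "third"])
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt")
print(write_pages_to_text(demo_reader, demo_path, 1, 2))
```

With a real document, write_pages_to_text(doc, 'demo.txt', 1, 4) would cover the same pages as createTextFromPDF().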
The extracted text is written to the demo.txt file.
Summary
In this article we have seen how convenient it is to read PDF files using PyPDF2. We covered utilities of PdfFileReader such as getDocumentInfo(), numPages, isEncrypted, and getPage(), along with the extractText() method of the PageObject class. For reading PDF files in Python, the PyPDF2 library is well worth considering.