We will be using the following third-party DLLs to get our work done:
- PdfBox: This third-party NuGet package will be used to read a PDF file.
- DocX: This package will be used to write a Word document.
Step 1
The first step is to get the PdfBox package using the NuGet Package Manager, which you can reach as follows.
Right-click on the solution and select the “Manage NuGet Packages” option.
Now, select the “Online” option from the left-side menu and search for “PdfBox” in the right-side panel. After searching, click the “Install” button next to “PdfBox”.
Once it is installed, you will see that some DLLs have been added to the project.
Step 2
Now, import the following namespaces into your .cs file:
- using org.apache.pdfbox.pdmodel;
- using org.apache.pdfbox.util;
Please make sure you follow this step; otherwise you will not be able to read the PDF document.
Step 3
The third step is to install the DocX NuGet package from the NuGet Package Manager.
Doing this will import some more DLLs into your solution.
Step 4
Let's read a PDF file and get the text from it.
We will use the PdfBox package to do so; the code for it is as follows:
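A minimal sketch of this step, assuming the 1.x line of PdfBox (where PDFTextStripper lives in org.apache.pdfbox.util, as imported in Step 2). The class name PdfTextReader, the method name ReadPdfText, and the sample path in the usage note are hypothetical and only serve the illustration.

```csharp
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// Hypothetical helper class; the name is not part of PdfBox.
public static class PdfTextReader
{
    // Loads the PDF at pdfPath and returns its text content as a single string.
    public static string ReadPdfText(string pdfPath)
    {
        PDDocument doc = null;
        try
        {
            // PDDocument.load opens and parses the PDF file.
            doc = PDDocument.load(pdfPath);

            // PDFTextStripper walks the pages and extracts the plain text.
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
        finally
        {
            // Always release the document, even if extraction fails.
            if (doc != null)
            {
                doc.close();
            }
        }
    }
}
```

It could then be called as `string text = PdfTextReader.ReadPdfText(@"C:\temp\sample.pdf");`, where the path is a placeholder. Only the text is extracted; images and layout in the PDF are not carried over.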
Step 5
The next part of the code reads this string and writes it to a Word document.
You need to import the following two namespaces in the .cs file:
- using Novacode;
- using System.Diagnostics;
The Novacode namespace lets us use the DocX package included in the solution.
The System.Diagnostics namespace lets us open the new Word document automatically. This is done by the code: “Process.Start("WINWORD.EXE", fn);”.
The code for it will be:
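A minimal sketch of this step, assuming the Novacode build of DocX and that Microsoft Word is installed so that WINWORD.EXE can be started. The class name WordWriter and the method name WriteAndOpen are hypothetical; fn is the output file name, as in the Process.Start call above.

```csharp
using System.Diagnostics;
using Novacode;

// Hypothetical helper class; the name is not part of DocX.
public static class WordWriter
{
    // Writes text into a new .docx file at fn, then opens it in Word.
    public static void WriteAndOpen(string text, string fn)
    {
        // DocX.Create builds a new, empty Word document bound to the file name.
        using (DocX document = DocX.Create(fn))
        {
            // Put the extracted PDF text into the document as one paragraph.
            document.InsertParagraph(text);

            // Save writes the document to the path given to DocX.Create.
            document.Save();
        }

        // Open the result in Word. The path is wrapped in quotes here
        // (an addition to the call quoted above) so that file names
        // containing spaces are passed as a single argument.
        Process.Start("WINWORD.EXE", "\"" + fn + "\"");
    }
}
```

With the text read in Step 4 held in a string, the call would look like `WordWriter.WriteAndOpen(text, @"C:\temp\output.docx");`, where the path is again a placeholder.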
In this way, the content of the PDF document is now available as a Word document file.
I hope this helped.
Thanks,
Vipul