In data management and publishing scenarios, documents often need to be routed based on file metadata. For example, a scanner saves scanned documents to a shared folder, and a utility then reads each document's metadata and routes it to the appropriate folder. In SharePoint 2013, extracting the metadata of Microsoft Office documents is fairly straightforward, but the metadata of a PDF file is not extracted automatically. In this article, we will see how to extract the metadata of a PDF file using the iTextSharp library.
Step 1: Use GacUtil.exe to install iTextSharp.dll in the GAC. The latest iTextSharp can be downloaded from here.
Step 2: In the document library, create a new column called "Metadata" and set its type to "Multiple lines of text".
Step 3: Create a new empty SharePoint 2013 project in Visual Studio.
Step 4: Add an Event Receiver to the project. Choose "Document Library" as the event source and select the "An item was added" (ItemAdded) and "An item is being updated" (ItemUpdating) events.
Step 5: Add a reference to the iTextSharp library and add the following using directives to the event receiver class file:
using iTextSharp.text.pdf;
using System.Collections.Generic;
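Step 6: In the ItemAdded event handler, add the following lines of code. Please note that this code only works for document libraries that contain the additional "Metadata" column, because the column name is hardcoded.
base.ItemAdded(properties);
SPListItem currentItem = properties.ListItem;
SPFile f = currentItem.File;
// Only process PDF files; compare the extension case-insensitively.
if (f != null && f.Name.EndsWith(".pdf", System.StringComparison.OrdinalIgnoreCase))
{
    byte[] pdfIn = f.OpenBinary();
    PdfReader pr = new PdfReader(pdfIn);
    string meta = "";
    try
    {
        // pr.Info holds the document information entries (Title, Author, and so on).
        foreach (KeyValuePair<string, string> pair in pr.Info)
        {
            meta = meta + pair.Key + " = " + pair.Value + ";";
        }
    }
    finally
    {
        pr.Close(); // Release the reader's resources.
    }
    // The "Metadata" column name is hardcoded; the library must contain it.
    currentItem["Metadata"] = meta;
    // SystemUpdate saves without changing Modified/Modified By or creating a new version.
    currentItem.SystemUpdate();
}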
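The string-building part of the handler can also be pulled out into a small helper that has no SharePoint dependency, which makes it easier to test outside a farm. The following is a minimal sketch, not part of iTextSharp or SharePoint: the class name PdfMetadataFormatter and its two methods are hypothetical, and it assumes the metadata arrives as an IDictionary<string, string> (which the dictionary used in the loop above satisfies).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper; the names are illustrative and not from any library.
public static class PdfMetadataFormatter
{
    // Returns true when the file name ends in ".pdf", ignoring case.
    public static bool IsPdf(string fileName)
    {
        return !string.IsNullOrEmpty(fileName)
            && fileName.EndsWith(".pdf", StringComparison.OrdinalIgnoreCase);
    }

    // Joins the entries as "Key = Value;" pairs, the same format the handler writes.
    public static string Format(IDictionary<string, string> info)
    {
        var sb = new StringBuilder();
        if (info == null)
        {
            return sb.ToString(); // No metadata: return an empty string.
        }
        foreach (KeyValuePair<string, string> pair in info)
        {
            sb.Append(pair.Key).Append(" = ").Append(pair.Value).Append(";");
        }
        return sb.ToString();
    }
}
```

In the handler, the extension check would then become PdfMetadataFormatter.IsPdf(f.Name) and the loop would become PdfMetadataFormatter.Format(pr.Info). A StringBuilder is used here only to avoid repeated string concatenation; for the handful of entries in a typical PDF, either approach works.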
Step 7: For the ItemUpdating event add the following lines of code:
base.ItemUpdating(properties);
// Carry the incoming "Metadata" value (from AfterProperties) over to the list item.
properties.ListItem["Metadata"] = properties.AfterProperties["Metadata"];
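Step 8: Now you can either debug the code or deploy and test it. After you upload a PDF file, the "Metadata" column should contain the extracted key/value pairs.
Please note that you need to refresh the page to see the results, because the ItemAdded event runs asynchronously after the upload completes.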
This code can be tweaked as required: new columns and values can be added to the list based on a document's metadata, or only a fixed set of metadata entries can be read. You can even write a timer job that watches the common folder, reads the metadata, and routes documents accordingly. I hope this helps.
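As a starting point for the routing idea, the decision of where a document should go can be kept separate from the SharePoint calls that actually move it. The sketch below is only one possible shape for that logic: the class MetadataRouter, the method ResolveFolder, and the folder names in the usage note are all hypothetical, and the rule (look up one metadata key, then map its value to a folder) is an assumption rather than anything prescribed by SharePoint or iTextSharp.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical routing helper; the rule format and names are illustrative only.
public static class MetadataRouter
{
    // Looks up one metadata entry (for example "Author") and maps its value
    // to a target folder. Falls back to defaultFolder when the key is missing
    // or no folder is registered for the value.
    public static string ResolveFolder(
        IDictionary<string, string> info,
        string key,
        IDictionary<string, string> folderByValue,
        string defaultFolder)
    {
        string value;
        string folder;
        if (info != null
            && folderByValue != null
            && info.TryGetValue(key, out value)
            && value != null
            && folderByValue.TryGetValue(value.Trim(), out folder))
        {
            return folder;
        }
        return defaultFolder;
    }
}
```

For example, with a map of "Finance Team" to "Invoices" and a default of "Unsorted", a PDF whose Author entry is "Finance Team" resolves to "Invoices", and anything else resolves to "Unsorted". The event receiver or timer job would then move the file to the resolved folder, for instance with SPFile.MoveTo.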