Extracting Values from PDFs in .NET Core 8 without ASP.NET

Ashutosh Singh
1y
11.9k
0
2

Article

Introduction

In the realm of software development, extracting data from PDF files is a common necessity for various tasks such as data analysis, content indexing, and information retrieval. While ASP.NET Core 8 offers robust tools for PDF manipulation, there are instances where developers may prefer alternatives for flexibility or specific project requirements. In this article, we'll explore how to extract values from PDF files within the .NET Core 8 ecosystem without relying on ASP.NET, using the PdfSharpCore library. We'll provide a step-by-step guide along with examples in C# to demonstrate how to accomplish this task effectively.

Understanding PdfSharpCore: PdfSharpCore is a popular .NET library for PDF document manipulation. It provides functionalities to create, modify, and extract content from PDF files. In this guide, we'll focus on utilizing PdfSharpCore to extract text from PDF documents.
Installing PdfSharpCore: Before we can start using PdfSharpCore in our .NET Core application, we need to install the PdfSharpCore NuGet package. This can be done via the NuGet Package Manager Console or the .NET CLI.

Using the NuGet Package Manager Console

Install-Package PdfSharpCore

Using the .NET CLI

dotnet add package PdfSharpCore

Extracting Text from PDFs in C#: Now that we have PdfSharpCore installed, let's dive into how we can extract text from PDF files using C#.

using PdfSharpCore.Pdf;
using PdfSharpCore.Pdf.IO;
using System;

public class PdfTextExtractor
{
    public static string ExtractTextFromPdf(string filePath)
    {
        using (PdfDocument document = PdfReader.Open(filePath, PdfDocumentOpenMode.Import))
        {
            string text = "";
            foreach (PdfPage page in document.Pages)
            {
                text += page.GetText();
            }
            return text;
        }
    }

    // Example usage:
    public static void Main(string[] args)
    {
        string pdfText = ExtractTextFromPdf("sample.pdf");
        Console.WriteLine(pdfText);
    }
}

In this example, we've created a PdfTextExtractor class with a static method ExtractTextFromPdf that takes the file path of the PDF as input and returns the extracted text. Inside the method, we use PdfSharpCore to open the PDF file, iterate through its pages, and extract text from each page. Finally, the extracted text is concatenated and returned.

Conclusion

By leveraging PdfSharpCore in .NET Core 8, developers have access to a powerful and efficient tool for extracting text from PDF files. Without relying on ASP.NET, PdfSharpCore provides a straightforward solution for handling PDF documents within the .NET Core ecosystem. Whether you're building data processing pipelines, content management systems, or document parsing utilities, PdfSharpCore empowers developers to accomplish PDF manipulation tasks effectively and seamlessly.

😊Please consider liking and following me for more articles and if you find this content helpful.👍