Cinchoo ETL - Merge Different CSV Files Into One Large CSV File

1. Introduction

ChoETL is an open source ETL (extract, transform and load) framework for .NET. It is a code-based library for extracting data from multiple sources, transforming it, and loading it into your own data warehouse in a .NET environment. You can have data in your data warehouse in no time.

This article shows how to merge different CSV files into one large CSV file using the Cinchoo ETL framework. It is simple to use: the merge takes only a few lines of code. Large files can be merged because the process is stream based, fast, and has a low memory footprint.

2. Requirement

This framework library is written in C# and targets .NET Framework 4.5 and .NET Core 3.x.

3. How to Use
 

3.1 Sample Data

Let's begin by looking at the sample CSV files below. Assume these CSV files are large, contain different fields, and may vary in column count.

Listing 3.1.1. CSV file 1 (sample1.csv)

col1, col2, col3
val1, val2, val3
val11, val21, val31

Listing 3.1.2. CSV file 2 (sample2.csv)

col1, col3
val4, val5
val41, val51

Listing 3.1.3. CSV file 3 (sample3.csv)

col1, col4
val6, val7
val61, val71

After a successful merge, the expected CSV file should look like this:

Listing 3.1.4. CSV output (merge.csv)

col1, col2, col3, col4
val1, val2, val3,
val11, val21, val31,
val4, , val5,
val41, , val51,
val6, , , val7
val61, , , val71
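
To make the expected result concrete, here is a minimal, library-free sketch of the rule behind it: the merged header is the union of all input headers in order of first appearance, and each row leaves the columns its file does not have empty. This is not how Cinchoo ETL is implemented. The class name MergeSketch and its methods are hypothetical, and the sketch assumes plain comma-separated values with no quoting, with every input having a header line.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class MergeSketch
{
    // Illustration only: split on ',' and trim; no quoting or escaping support.
    private static string[] Split(string line) =>
        line.Split(',').Select(f => f.Trim()).ToArray();

    // Each input is one CSV file given as its lines, header line first.
    public static List<string> Merge(IEnumerable<IReadOnlyList<string>> files)
    {
        var inputs = files.ToList();

        // Pass 1: merged header = union of all headers, in order of first appearance.
        var header = new List<string>();
        foreach (var file in inputs)
            foreach (var name in Split(file[0]))
                if (!header.Contains(name))
                    header.Add(name);

        var output = new List<string> { string.Join(",", header) };

        // Pass 2: re-emit every row under the merged header; missing columns stay empty.
        foreach (var file in inputs)
        {
            var columns = Split(file[0]);
            foreach (var line in file.Skip(1))
            {
                var values = Split(line);
                var row = header.Select(name =>
                {
                    int i = Array.IndexOf(columns, name);
                    return i >= 0 && i < values.Length ? values[i] : "";
                });
                output.Add(string.Join(",", row));
            }
        }
        return output;
    }
}
```

Passing the three sample files as line arrays yields a header of col1,col2,col3,col4 followed by six rows, matching Listing 3.1.4 apart from the spaces after the commas.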

The first thing to do is to install the ChoETL / ChoETL.NETStandard NuGet package, which contains the CSV reader and writer. To do this, run the following command in the Package Manager Console.

.NET Framework

Install-Package ChoETL

.NET Core

Install-Package ChoETL.NETStandard

Now add the ChoETL namespace to the program.

using ChoETL;

3.2 Merge Operation

As the input files may be large, we need a way to merge them efficiently. Here is an approach to merging such CSV files.

  1. First, open each CSV file and read its first record. Put these records into a collection.
  2. Next, determine all possible columns across the input CSV files by writing the collection to a dummy ChoCSVWriter. Call WithMaxScanRows() so the writer scans these records for columns. Capture the Configuration object (containing all the scanned CSV columns) for later use.
  3. Finally, open each CSV file again and write its records to a ChoCSVWriter created with the captured configuration object.

Listing 3.2.1. Merge CSV files

// Requires: using System; using System.Collections.Generic;
//           using System.Linq; using System.Text; using ChoETL;
private static void MergeCSVFiles()
{
    ChoCSVRecordConfiguration config = null;
    List<object> items = new List<object>();

    // Step 1: read only the first record of each input file.
    using (var r1 = new ChoCSVReader("sample1.csv").WithFirstLineHeader())
    using (var r2 = new ChoCSVReader("sample2.csv").WithFirstLineHeader())
    using (var r3 = new ChoCSVReader("sample3.csv").WithFirstLineHeader())
    {
        items.Add(r1.First());
        items.Add(r2.First());
        items.Add(r3.First());
    }

    // Step 2: write the collected records to a dummy writer so that it
    // scans them and discovers every column used by any input file.
    StringBuilder csv = new StringBuilder();
    using (var w = new ChoCSVWriter(csv).WithFirstLineHeader().WithMaxScanRows(5).ThrowAndStopOnMissingField(false))
    {
        w.Write(items);
        // Capture configuration for later use to merge CSV files
        config = w.Configuration;
    }

    // Step 3: stream all records from all files through a writer that
    // uses the captured configuration (the full set of columns).
    using (var r1 = new ChoCSVReader("sample1.csv").WithFirstLineHeader())
    using (var r2 = new ChoCSVReader("sample2.csv").WithFirstLineHeader())
    using (var r3 = new ChoCSVReader("sample3.csv").WithFirstLineHeader())
    using (var w = new ChoCSVWriter(Console.Out, config).WithFirstLineHeader())
    {
        w.Write(Enumerable.Concat(r1, r2).Concat(r3));
    }
}

Sample fiddle: https://dotnetfiddle.net/4L8f0k
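
Listing 3.2.1 hard-codes three files. The same three steps can be sketched for any number of files, as below. This is a hedged sketch rather than a tested recipe: CsvMerger, Merge and ReadAll are hypothetical names, and the only Cinchoo ETL calls used are the ones already shown in Listing 3.2.1. It assumes that setting WithMaxScanRows() to the number of collected records is enough for the dummy writer to see every column, and it skips empty files using LINQ's FirstOrDefault().

```csharp
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;
using ChoETL;

public static class CsvMerger
{
    // Hypothetical helper (not part of Cinchoo ETL): merges the given CSV files into 'output'.
    public static void Merge(IReadOnlyList<string> paths, TextWriter output)
    {
        // Step 1: collect the first record of every file that has one.
        var items = new List<object>();
        foreach (var path in paths)
        {
            using (var r = new ChoCSVReader(path).WithFirstLineHeader())
            {
                object first = r.FirstOrDefault();
                if (first != null)
                    items.Add(first);
            }
        }
        if (items.Count == 0)
            return; // nothing to merge

        // Step 2: let a dummy writer scan the collected records for columns.
        // Assumption: a scan window equal to items.Count covers all of them.
        ChoCSVRecordConfiguration config;
        using (var w = new ChoCSVWriter(new StringBuilder())
            .WithFirstLineHeader()
            .WithMaxScanRows(items.Count)
            .ThrowAndStopOnMissingField(false))
        {
            w.Write(items);
            config = w.Configuration;
        }

        // Step 3: stream every record of every file through the real writer.
        using (var w = new ChoCSVWriter(output, config).WithFirstLineHeader())
        {
            w.Write(ReadAll(paths));
        }
    }

    // Opens the files one at a time and yields their records lazily,
    // so only one reader is open at any moment.
    private static IEnumerable<dynamic> ReadAll(IEnumerable<string> paths)
    {
        foreach (var path in paths)
            using (var r = new ChoCSVReader(path).WithFirstLineHeader())
                foreach (var record in r)
                    yield return record;
    }
}
```

A call such as CsvMerger.Merge(new[] { "sample1.csv", "sample2.csv", "sample3.csv" }, Console.Out) would then play the role of MergeCSVFiles(). Reading the files through a lazy iterator keeps the low memory footprint of the original approach while avoiding one nested using block per file.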