How to Load 5 Million Records from CSV and Process Them In Under Three Seconds

We have a scenario where we have to load 5 million records from a CSV file in under 2 seconds using C#, process them, and return records that match certain criteria. That sounds like loading and processing should take much longer, but it only does if we go about it the wrong way.
 
This is what we will solve in the code below.
 
Let's dive in and do some processing ourselves. First, download the file from the URL below; it is a sample sales records CSV file with 5 million records.
 
http://eforexcel.com/wp/wp-content/uploads/2020/09/5m-Sales-Records.7z
 
Now we will load this CSV in our program and get the top ten sales records with the highest revenue, in order.
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;
using System.Linq;

Stopwatch stopwatch = new Stopwatch();
stopwatch.Start();
//LOAD
//Create a temporary list to hold the records
List<Tuple<string, string, string>> listA = new List<Tuple<string, string, string>>();
using (var reader = new StreamReader(@"C:\Users\Lenovo\Desktop\5m Sales Records.csv")) {
    reader.ReadLine(); // skip the header row
    while (!reader.EndOfStream) {
        var line = reader.ReadLine();
        var values = line.Split(',');
        // columns 0, 1 and 11: Region, Country and Total Revenue
        listA.Add(new Tuple<string, string, string>(values[0], values[1], values[11]));
    }
}
//PROCESS
// order by revenue (parsed as a number, not compared as a string), highest first, then take ten
var top10HighestRevenueSalesRecords =
    (from salesrec in listA
     orderby double.Parse(salesrec.Item3) descending
     select salesrec).Take(10);
//PRINT
foreach (var item in top10HighestRevenueSalesRecords) {
    Console.WriteLine($"{item.Item1} - {item.Item2} - {item.Item3}");
}
stopwatch.Stop();
Console.WriteLine($"Time elapsed {stopwatch.ElapsedMilliseconds / 1000.0} s");
Console.ReadLine();
Now all three main steps in the process (Load, Process, and Print) are done in under 2 seconds.
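If you want to see where the time actually goes, a minimal sketch like the one below breaks the timing down per phase. It compresses the load step into a File.ReadLines loop, restarts the same Stopwatch between phases, and assumes the same file path and column indexes as above.

var stopwatch = Stopwatch.StartNew();
//LOAD
var listA = new List<Tuple<string, string, string>>();
foreach (var line in File.ReadLines(@"C:\Users\Lenovo\Desktop\5m Sales Records.csv").Skip(1)) {
    var values = line.Split(',');
    listA.Add(new Tuple<string, string, string>(values[0], values[1], values[11]));
}
Console.WriteLine($"Load: {stopwatch.ElapsedMilliseconds} ms");
stopwatch.Restart();
//PROCESS
var top10 = listA.OrderByDescending(r => double.Parse(r.Item3)).Take(10).ToList();
Console.WriteLine($"Process: {stopwatch.ElapsedMilliseconds} ms");

Loading dominates here; the sort over 5 million in-memory tuples is comparatively cheap.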
 
Adding Parallel.For or Parallel.ForEach does not help much in this scenario either; in fact, it slows things down slightly, though the difference is small enough to be negligible.
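For reference, a parallel load attempt might look like the sketch below (my own illustration, not code from the original article). The ConcurrentBag needed for thread-safe adds is exactly the kind of synchronization overhead that cancels out any gain from parsing lines on multiple threads.

using System.Collections.Concurrent;
using System.Threading.Tasks;

var bag = new ConcurrentBag<Tuple<string, string, string>>();
// lines are split on multiple threads, but every Add contends on the shared collection
Parallel.ForEach(File.ReadLines(@"C:\Users\Lenovo\Desktop\5m Sales Records.csv").Skip(1), line => {
    var values = line.Split(',');
    bag.Add(new Tuple<string, string, string>(values[0], values[1], values[11]));
});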
 
We can improve it further, down to one second, by using custom NuGet packages that reduce the time spent loading large CSV files. Here we use the LumenWorksCsvReader package.
using LumenWorks.Framework.IO.Csv;

//LOAD, replacing the StreamReader loop above; listA is the same list as before
// the 'true' argument tells CsvReader the file has a header row, which it skips for us
using (CsvReader csv = new CsvReader(new StreamReader(@"C:\Users\Lenovo\Desktop\5m Sales Records.csv"), true)) {
    while (csv.ReadNextRecord()) {
        listA.Add(new Tuple<string, string, string>(csv[0], csv[1], csv[11]));
    }
}
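CsvHelper is another widely used NuGet package for this kind of work. A rough sketch of the same load step with it would look like the following (my addition, not benchmarked in the original article); note the fully qualified CsvHelper.CsvReader to avoid a name clash with the LumenWorks type.

using System.Globalization;
using CsvHelper;

using (var reader = new StreamReader(@"C:\Users\Lenovo\Desktop\5m Sales Records.csv"))
using (var csv = new CsvHelper.CsvReader(reader, CultureInfo.InvariantCulture)) {
    csv.Read();       // advance to the header row
    csv.ReadHeader(); // consume it
    while (csv.Read()) {
        listA.Add(new Tuple<string, string, string>(csv.GetField(0), csv.GetField(1), csv.GetField(11)));
    }
}

Either way, the PROCESS and PRINT steps stay exactly as before; only the LOAD step changes.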
Happy coding, fellows.

