Convert HTML To DataTable

Sometimes, we have situations where we get data as HTML data and we have to convert it to DataTable. From DataTable, we can export it to xlsx or use it according to our need.

The problem I faced with HTML data, was that I had to export a report in xlsx format and I had only HTML data. The HTML data was exporting into xlsx but with a format message, while opening the Excel. That means, it was writing the HTML to the Excel sheet as it appeared. But for the Excel sheet, we need columns to be written properly into each cell. xlsx doesn't support direct HTML conversion to xlsx cells.
 
To export HTML data to xlsx, I had to convert the HTML data to DataTable from where I can export the data easily and there will be no format message while opening the Excel. Below is the code to convert the HTML to DataTable.
  1. public static DataTable ConvertHTMLTablesToDataTable(string HTML)  
  2.       {  
  3.           
  4.           
  5.           DataTable dt = null;  
  6.           DataRow dr = null;  
  7.           DataColumn dc = null;  
  8.           string TableExpression = "<table[^>]*>(.*?)</table>";  
  9.           string HeaderExpression = "<th[^>]*>(.*?)</th>";  
  10.           string RowExpression = "<tr[^>]*>(.*?)</tr>";  
  11.           string ColumnExpression = "<td[^>]*>(.*?)</td>";  
  12.           bool HeadersExist = false;  
  13.           int iCurrentColumn = 0;  
  14.           int iCurrentRow = 0;  
  15.   
  16.           // Get a match for all the tables in the HTML    
  17.           MatchCollection Tables = Regex.Matches(HTML, TableExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase);  
  18.   
  19.           // Loop through each table element    
  20.           foreach (Match Table in Tables)  
  21.           {  
  22.   
  23.               // Reset the current row counter and the header flag    
  24.               iCurrentRow = 0;  
  25.               HeadersExist = false;  
  26.   
  27.               // Add a new table to the DataSet    
  28.               dt = new DataTable();  
  29.   
  30.               // Create the relevant amount of columns for this table (use the headers if they exist, otherwise use default names)    
  31.               if (Table.Value.Contains("<th"))  
  32.               {  
  33.                   // Set the HeadersExist flag    
  34.                   HeadersExist = true;  
  35.   
  36.                   // Get a match for all the rows in the table    
  37.                   MatchCollection Headers = Regex.Matches(Table.Value, HeaderExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase);  
  38.   
  39.                   // Loop through each header element    
  40.                   foreach (Match Header in Headers)  
  41.                   {  
  42.                       //dt.Columns.Add(Header.Groups(1).ToString);  
  43.                       dt.Columns.Add(Header.Groups[1].ToString());  
  44.   
  45.                   }  
  46.               }  
  47.               else  
  48.               {  
  49.                   for (int iColumns = 1; iColumns <= Regex.Matches(Regex.Matches(Regex.Matches(Table.Value, TableExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase)[0].ToString(), RowExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase)[0].ToString(), ColumnExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase).Count; iColumns++)  
  50.                   {  
  51.                       dt.Columns.Add("Column " + iColumns);  
  52.                   }  
  53.               }  
  54.   
  55.               // Get a match for all the rows in the table    
  56.               MatchCollection Rows = Regex.Matches(Table.Value, RowExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase);  
  57.   
  58.               // Loop through each row element    
  59.               foreach (Match Row in Rows)  
  60.               {  
  61.   
  62.                   // Only loop through the row if it isn't a header row    
  63.                   if (!(iCurrentRow == 0 & HeadersExist == true))  
  64.                   {  
  65.   
  66.                       // Create a new row and reset the current column counter    
  67.                       dr = dt.NewRow();  
  68.                       iCurrentColumn = 0;  
  69.   
  70.                       // Get a match for all the columns in the row    
  71.                       MatchCollection Columns = Regex.Matches(Row.Value, ColumnExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase);  
  72.   
  73.                       // Loop through each column element    
  74.                       foreach (Match Column in Columns)  
  75.                       {  
  76.   
  77.                           DataColumnCollection columns = dt.Columns;  
  78.   
  79.                           if (!columns.Contains("Column " + iCurrentColumn))  
  80.                           {  
  81.                               //Add Columns  
  82.                               dt.Columns.Add("Column " + iCurrentColumn);  
  83.                           }  
  84.                           // Add the value to the DataRow    
  85.                           dr[iCurrentColumn] = Column.Groups[1].ToString();  
  86.                           // Increase the current column    
  87.                           iCurrentColumn += 1;  
  88.   
  89.                       }  
  90.   
  91.                       // Add the DataRow to the DataTable    
  92.                       dt.Rows.Add(dr);  
  93.   
  94.                   }  
  95.   
  96.                   // Increase the current row counter    
  97.                   iCurrentRow += 1;  
  98.               }  
  99.               
  100.   
  101.           }  
  102.   
  103.           return (dt);  
  104.   
  105.       }  
This line gets all the Table matches as defined in TableExpression variable.
  1. MatchCollection Tables = Regex.Matches(HTML, TableExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase);  
This line gets the total number of columns using ColumnExpressions as <td>. It created each <td> as a column.
  1. MatchCollection Columns = Regex.Matches(Row.Value, ColumnExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase);   
This line gets the header using HeaderExpression as <th>. It created each <th> as a Header.
  1. MatchCollection Headers = Regex.Matches(Table.Value, HeaderExpression, RegexOptions.Multiline | RegexOptions.Singleline | RegexOptions.IgnoreCase);   
Use this code to convert HTML to DataTable. If you have found any problem, please comment below.
Next Recommended Reading Converting Generic List into DataTable