In this article, I explain how to convert an HTML file to XML. We shall proceed step by step.
Let us first look at some facts about HTML files.
Like XML, HTML is a tag-based language, but it does not conform to the XML standard. One non-conformance concerns tags that do not require closing, such as img. HTML also permits several characters and character sequences that are illegal in XML.
Therefore, our first task is to clean the HTML file so that it can be parsed as XML. A variety of tools for cleaning HTML files are available on the Internet. I referred to a blog post and the code it provides, and modified that code to fit my own needs.
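To give an idea of what such a cleaning step involves, here is a minimal sketch. It is not the code from the blog mentioned above; the class name HtmlCleaner and the method ToXmlFriendly are hypothetical names chosen for illustration. It only self-closes a handful of void tags, replaces the `&nbsp;` entity, and escapes bare ampersands. Regular expressions are fragile on real-world HTML (unquoted attributes, unclosed `<p>` or `<li>` tags, and other named entities are not handled), so a dedicated cleaning tool is the safer choice for anything non-trivial.

```csharp
using System;
using System.Text.RegularExpressions;
using System.Xml;

// Hypothetical helper: a rough sketch of HTML clean-up, not a complete cleaner.
public static class HtmlCleaner
{
    // A few void elements that HTML allows without a closing tag.
    private static readonly Regex VoidTag = new Regex(
        @"<(img|br|hr|input|meta|link)\b([^>]*?)\s*/?>",
        RegexOptions.IgnoreCase);

    // An ampersand that does not start an entity or character reference.
    private static readonly Regex BareAmpersand = new Regex(
        @"&(?![A-Za-z][A-Za-z0-9]*;|#[0-9]+;|#x[0-9A-Fa-f]+;)");

    public static string ToXmlFriendly(string html)
    {
        // <img src="a.png"> becomes <img src="a.png" />
        string result = VoidTag.Replace(html, "<$1$2 />");

        // XML does not predefine &nbsp;, so use the numeric reference instead.
        result = result.Replace("&nbsp;", "&#160;");

        // "Tom & Jerry" becomes "Tom &amp; Jerry"
        result = BareAmpersand.Replace(result, "&amp;");
        return result;
    }
}
```

The cleaned string can then be passed to XmlDocument.LoadXml, or saved to a file and loaded as shown in the next step.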
The next step is to extract the required information. Assuming a given HTML page uses a fixed layout to show a report or order information, we can read the information with any standard XML parser. In my case, I used the XML parser from the .NET Framework.
    // Requires: using System; using System.Xml;
    // Assumes bom (a BOM instance) and outputFileTextBox are declared elsewhere.
    string xmlContents;
    try
    {
        XmlDocument doc = new XmlDocument();
        doc.Load(outputFileTextBox.Text);

        // Take the fourth <table> element in document order.
        XmlNode node = doc.GetElementsByTagName("table")[3];

        // Skip the first and last child nodes of the table.
        for (int i = 1; i < node.ChildNodes.Count - 1; i++)
        {
            XmlNodeList cells = node.ChildNodes[i].ChildNodes;

            // Cell 3 is not read, and Order.Color is not set here.
            Order order = new Order()
            {
                Part_Number = cells[0].InnerText ?? string.Empty,
                Customer_Part_Number = cells[1].InnerText ?? string.Empty,
                Supplier_Part_Number = cells[2].InnerText ?? string.Empty,
                Supplier_Name = cells[4].InnerText ?? string.Empty,
                Type = cells[5].InnerText ?? string.Empty,
                Material = cells[6].InnerText ?? string.Empty,
                Unit_of_Measure = cells[7].InnerText ?? string.Empty,
                Quantity = cells[8].InnerText ?? string.Empty
            };
            bom.BomList.Add(order);
        }
    }
    catch (XmlException exception)
    {
        Console.WriteLine("xml parsing failed {0}", exception.Message);
    }
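Reading cells by fixed index works as long as the layout never changes. As a more defensive variant, the rows can be collected with XPath first and checked before they are mapped to an Order. The following is only a sketch under a few assumptions: the class TableReader and the method ReadRows are hypothetical names, the cleaned document uses lowercase tag names, and it declares no default XML namespace (otherwise the XPath expressions would need a namespace manager). Note that ".//tr" also picks up rows of nested tables.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper: reads a table into plain string arrays.
public static class TableReader
{
    // Returns the trimmed text of every cell, row by row,
    // for the table at the given zero-based index.
    public static List<string[]> ReadRows(XmlDocument doc, int tableIndex)
    {
        var rows = new List<string[]>();
        XmlNodeList tables = doc.GetElementsByTagName("table");
        if (tableIndex < 0 || tableIndex >= tables.Count)
        {
            return rows; // no such table: return an empty list
        }

        // ".//tr" finds rows whether or not they are wrapped in <tbody>.
        foreach (XmlNode row in tables[tableIndex].SelectNodes(".//tr"))
        {
            XmlNodeList cells = row.SelectNodes("td|th");
            var values = new string[cells.Count];
            for (int c = 0; c < cells.Count; c++)
            {
                values[c] = cells[c].InnerText.Trim();
            }
            rows.Add(values);
        }
        return rows;
    }
}
```

With the rows in hand, a caller can skip any row that has fewer cells than expected (for example, fewer than nine for the layout above) instead of failing on an out-of-range index.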
Before the data can be transformed into XML, the extracted information has to be stored somewhere. This is a crucial step that deserves some thought, and it is worth using the features the .NET Framework already provides. I created a class that can be serialized using XmlSerializer.
    using System;
    using System.Collections.Generic;
    using System.Xml.Serialization;

    // One row of the order table; every value is kept as text.
    [Serializable]
    public class Order
    {
        public string Part_Number { get; set; }
        public string Customer_Part_Number { get; set; }
        public string Supplier_Part_Number { get; set; }
        public string Supplier_Name { get; set; }
        public string Type { get; set; }
        public string Color { get; set; }
        public string Material { get; set; }
        public string Unit_of_Measure { get; set; }
        public string Quantity { get; set; }
    }

    [XmlInclude(typeof(Order))]
    public class BOM
    {
        // Each list entry is written as an <Order> element directly under the root.
        [XmlElement(ElementName = "Order")]
        public List<Order> BomList = new List<Order>();
    }
The advantage of this approach is that you get the whole serialized XML in a string, which can then be written to an XML file easily. There are many ways to achieve the required functionality, but I personally found it easiest to do it this way.
    // Requires: using System; using System.IO; using System.Xml.Serialization;
    XmlSerializer xmlSerializer = new XmlSerializer(typeof(BOM), new Type[] { typeof(Order) });

    // Serialize the BOM into an in-memory string.
    using (StringWriter writer = new StringWriter())
    {
        xmlSerializer.Serialize(writer, bom);
        xmlContents = writer.ToString();
    }

    // Write the serialized XML to the output file.
    using (StreamWriter fileWriter = new StreamWriter(outputFileTextBox.Text))
    {
        fileWriter.Write(xmlContents);
    }
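One detail to watch with the string-based route: a StringWriter reports UTF-16 as its encoding, so the serialized string begins with an XML declaration naming utf-16, while a StreamWriter writes UTF-8 by default. The declaration and the bytes on disk then disagree, and some parsers may reject the file. If the intermediate string is not needed, a possible alternative is to serialize straight to the file so that the two match. The sketch below is one way to do that; XmlFileStore, Save, and Load are hypothetical names, not part of the .NET Framework.

```csharp
using System.IO;
using System.Text;
using System.Xml.Serialization;

// Hypothetical helper: serializes directly to a UTF-8 file and reads it back.
public static class XmlFileStore
{
    public static void Save<T>(T value, string path)
    {
        var serializer = new XmlSerializer(typeof(T));

        // The serializer takes the declared encoding from the writer,
        // so the declaration says utf-8 and the file is UTF-8 (no byte order mark).
        using (var writer = new StreamWriter(path, false, new UTF8Encoding(false)))
        {
            serializer.Serialize(writer, value);
        }
    }

    public static T Load<T>(string path)
    {
        var serializer = new XmlSerializer(typeof(T));
        using (var stream = File.OpenRead(path))
        {
            return (T)serializer.Deserialize(stream);
        }
    }
}
```

With the classes from this article, the call would be XmlFileStore.Save(bom, outputFileTextBox.Text), and XmlFileStore.Load<BOM>(path) reads the file back for a quick round-trip check.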
Note: For this article, I used an HTML file provided by one of the users on C-SharpCorner. To be on the safe side: I am not responsible for any data in that HTML file.