Introduction
XML (eXtensible Markup Language) is a widely used format for structuring and storing data in a hierarchical manner. It consists of elements enclosed in tags, which can have attributes and contain text or other elements. However, processing large XML files can be memory-intensive, especially if you load the entire document into memory at once.
Use XmlTextReader to parse large XML documents
using System.Xml;

public void FindParticularNodesUsingTextReader()
{
    string xmlFilePath = @"C:\Documents and Settings\Administrator\Desktop\sampleXmlDoc.xml";
    using (XmlTextReader txtReader = new XmlTextReader(xmlFilePath))
    {
        // Drop whitespace nodes so the Read() after a start tag lands on the text node.
        txtReader.WhitespaceHandling = WhitespaceHandling.None;
        while (txtReader.Read())
        {
            // Stop at each <TotalPrice> start tag.
            if (txtReader.NodeType == XmlNodeType.Element && txtReader.Name == "TotalPrice")
            {
                txtReader.Read(); // advance to the text node inside the element
                richTextBox1.AppendText(txtReader.Value + " ");
            }
        }
    }
}
Output
12.36 11.99 7.97
For faster, read-only, XPath-based access to data, use XPathDocument and XPathNavigator together with an XPath query.
using System.Xml.XPath;

public void FindTagsUsingXPathNavigatorAndXPathDocumentNew()
{
    string xmlFilePath = @"C:\Documents and Settings\Administrator\Desktop\sampleXmlDoc.xml";
    // XPathDocument is a read-only, in-memory store optimized for XPath queries.
    XPathDocument xpDoc = new XPathDocument(xmlFilePath);
    XPathNavigator xpNav = xpDoc.CreateNavigator();
    // Compile the query once so it can be reused.
    XPathExpression xpExpression = xpNav.Compile("/Orders/Order/TotalPrice");
    XPathNodeIterator xpIter = xpNav.Select(xpExpression);
    while (xpIter.MoveNext())
    {
        richTextBox1.AppendText(xpIter.Current.Value + " ");
    }
}
Output
12.36 11.99 7.97
Combining XmlReader and XmlDocument. On the XmlReader, use the MoveToContent and Skip methods to skip unwanted items.
using System.Xml;

public void UseXmlReaderAndXmlDocument()
{
    string xmlFilePath = @"C:\Documents and Settings\Administrator\Desktop\sampleXmlDoc.xml";
    using (XmlReader rdrObj = XmlReader.Create(xmlFilePath))
    {
        while (rdrObj.Read())
        {
            // Element nodes are start tags, so no separate IsStartElement() check is needed.
            if (rdrObj.NodeType == XmlNodeType.Element && rdrObj.Name == "TotalPrice")
            {
                rdrObj.Read(); // advance to the text node inside the element
                richTextBox1.AppendText(rdrObj.Value + " ");
            }
        }
    }
}
Output
12.36 11.99 7.97
using System.Xml;

public void UseXmlReaderAndXmlDocumentNew()
{
    string xmlFilePath = @"C:\Documents and Settings\Administrator\Desktop\sampleXmlDoc.xml";
    using (XmlReader rdrObj = XmlReader.Create(xmlFilePath))
    {
        XmlDocument xmlDocObj = new XmlDocument();
        rdrObj.MoveToContent(); // jump past the prolog to the root element
        while (!rdrObj.EOF)
        {
            if (rdrObj.NodeType == XmlNodeType.Element && rdrObj.Name == "TotalPrice")
            {
                // ReadNode() builds a DOM node for this element only and leaves
                // the reader on the node that follows it, so no extra Read() here.
                XmlNode priceNode = xmlDocObj.ReadNode(rdrObj);
                richTextBox1.AppendText(priceNode.InnerText + " ");
            }
            else
            {
                rdrObj.Read();
            }
        }
    }
}
Design Considerations
- Avoid XML wherever possible.
- Avoid processing large documents.
- Avoid validation. XmlValidatingReader is 2-3x slower than XmlTextReader.
- Avoid DTD, especially IDs and entity references.
- Use streaming interfaces such as XmlReader or SAXdotnet.
- Consider hard-coded processing, including validation.
- Keep node names short.
- Consider sharing a NameTable, but only when the same names are likely to recur across documents. As more unrelated names accumulate in the table, lookups become slower.
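The NameTable advice can be sketched as follows. This is a minimal illustration, not a definitive implementation: the class and method names (SharedNameTableSketch, CountElements) are hypothetical, and it assumes single-threaded use because NameTable is not documented as thread-safe. It shares one NameTable across readers and compares atomized names by reference.

```csharp
using System.IO;
using System.Xml;

public static class SharedNameTableSketch
{
    // One NameTable reused by every reader created below (hypothetical helper;
    // assumes single-threaded use).
    private static readonly NameTable Shared = new NameTable();

    public static int CountElements(string xml, string elementName)
    {
        var settings = new XmlReaderSettings { NameTable = Shared };

        // Atomize the wanted name once. The reader returns the same string
        // instance for matching names, so a reference comparison is enough.
        object wanted = Shared.Add(elementName);

        int count = 0;
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element &&
                    object.ReferenceEquals(reader.LocalName, wanted))
                {
                    count++;
                }
            }
        }
        return count;
    }
}
```

Sharing pays off only when the documents use the same vocabulary; otherwise the table simply grows.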
Parsing XML
- Use XmlTextReader and avoid validating readers.
- When only a single node is required, consider using XmlDocument.ReadNode() rather than loading the entire document with Load().
- Set the XmlResolver property to null on XmlReaders that expose it, to avoid access to external resources.
- Make full use of MoveToContent() and Skip(). They avoid extraneous name creation. However, the benefit almost disappears when you use XmlValidatingReader.
- Avoid accessing Value for Text/CDATA nodes wherever possible.
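The MoveToContent(), Skip(), and XmlResolver tips above might look like this in practice. It is a sketch under assumptions: the document shape (Orders/Order/TotalPrice, as in the earlier examples) and the names SkipSketch and ReadTotalPrices are illustrative only.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SkipSketch
{
    // Hypothetical helper: collects the text of every Orders/Order/TotalPrice.
    public static List<string> ReadTotalPrices(string xml)
    {
        var settings = new XmlReaderSettings
        {
            XmlResolver = null,      // never fetch external DTDs or entities
            IgnoreWhitespace = true
        };

        var prices = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            reader.MoveToContent();  // jump to the root element
            reader.Read();           // step to its first child
            while (!reader.EOF)
            {
                if (reader.NodeType != XmlNodeType.Element)
                {
                    reader.Read();   // end tags, comments, and so on
                }
                else if (reader.Name == "Order")
                {
                    reader.Read();   // descend into the order
                }
                else if (reader.Name == "TotalPrice")
                {
                    // Reads the text and moves past the end tag.
                    prices.Add(reader.ReadElementContentAsString());
                }
                else
                {
                    reader.Skip();   // jump over the whole unwanted subtree
                }
            }
        }
        return prices;
    }
}
```

Skip() moves past an entire subtree in one call, so the children of unwanted elements are never examined by the loop.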
Validating XML
- Avoid extraneous validation.
- Consider caching schemas.
- Avoid identity constraints. They store keys/fields for the entire document, and the keys are boxed.
- Avoid extraneous strong typing. It results in calls to XmlSchemaDatatype.ParseValue(). Avoiding it can also avoid access to the Value string.
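Schema caching, mentioned above, could be sketched like this. The inline schema, the class name CachedSchemaSketch, and the IsValid helper are all assumptions made for illustration. The point is only that the XmlSchemaSet is parsed and compiled once and then reused for every validation.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class CachedSchemaSketch
{
    // Hypothetical schema for the sample document, inlined to keep the sketch self-contained.
    private const string Xsd = @"
<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>
  <xs:element name='Orders'>
    <xs:complexType>
      <xs:sequence>
        <xs:element name='Order' maxOccurs='unbounded'>
          <xs:complexType>
            <xs:sequence>
              <xs:element name='TotalPrice' type='xs:string'/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>";

    // Parsed and compiled once, then shared by every validation call.
    private static readonly XmlSchemaSet Schemas = BuildSchemas();

    private static XmlSchemaSet BuildSchemas()
    {
        var set = new XmlSchemaSet();
        using (XmlReader xsd = XmlReader.Create(new StringReader(Xsd)))
        {
            set.Add(null, xsd); // null: use the schema's own target namespace
        }
        set.Compile();
        return set;
    }

    public static bool IsValid(string xml)
    {
        var settings = new XmlReaderSettings
        {
            ValidationType = ValidationType.Schema,
            Schemas = Schemas
        };
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
            {
                while (reader.Read()) { } // validation happens as the reader advances
            }
            return true;
        }
        catch (XmlSchemaValidationException)
        {
            return false; // with no event handler set, the first error throws
        }
    }
}
```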
Writing XML
- Write output directly wherever possible.
- When saving documents, an XmlTextWriter without indentation is faster than saving to a TextWriter, Stream, or file (all of which indent), unless the output is meant for human reading.
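As a rough illustration of writing output directly without indentation, the sketch below streams elements through an XmlTextWriter instead of building a DOM first. The element names follow the earlier examples; DirectWriteSketch and WriteOrders are hypothetical names.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class DirectWriteSketch
{
    // Hypothetical helper: writes one Order/TotalPrice pair per value.
    public static string WriteOrders(IEnumerable<decimal> totals)
    {
        var buffer = new StringWriter();
        using (var writer = new XmlTextWriter(buffer))
        {
            writer.Formatting = Formatting.None; // no indentation (also the default)
            writer.WriteStartElement("Orders");
            foreach (decimal total in totals)
            {
                writer.WriteStartElement("Order");
                // XmlConvert formats the number independently of the current culture.
                writer.WriteElementString("TotalPrice", XmlConvert.ToString(total));
                writer.WriteEndElement();
            }
            writer.WriteEndElement();
        }
        return buffer.ToString();
    }
}
```

Switching Formatting to Formatting.Indented yields the human-readable form when that is actually needed.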
DOM Processing
- Avoid InnerXml. It internally creates XmlTextReader/XmlTextWriter. InnerText is fine.
- Avoid PreviousSibling. XmlDocument is very inefficient for backward traversal.
- Append nodes as soon as possible. Adding a big subtree later results in a longer, extraneous run to check ID attributes.
- Prefer FirstChild/NextSibling and avoid accessing ChildNodes. Accessing it creates an XmlNodeList, which is not instantiated until then.
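The FirstChild/NextSibling advice amounts to a simple forward loop. A minimal sketch, with hypothetical names (SiblingWalkSketch, ChildElementNames):

```csharp
using System.Collections.Generic;
using System.Xml;

public static class SiblingWalkSketch
{
    // Hypothetical helper: names of the element children of a node,
    // gathered without touching ChildNodes or PreviousSibling.
    public static List<string> ChildElementNames(XmlNode parent)
    {
        var names = new List<string>();
        for (XmlNode node = parent.FirstChild; node != null; node = node.NextSibling)
        {
            if (node.NodeType == XmlNodeType.Element)
            {
                names.Add(node.Name);
            }
        }
        return names;
    }
}
```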
XPath Processing
- Consider using XPathDocument, but only when you need the entire document. With XmlDocument you can use ReadNode(), but XPathDocument has no equivalent.
- Avoid preceding-sibling and preceding axes queries, especially over XmlDocument. They would result in sorting, and for XmlDocument, they need access to PreviousSibling.
- Avoid // (descendant). Most of the returned nodes are likely to be irrelevant.
- Avoid position(), last() and positional predicates (especially things like foo[last()-1]).
- Compile the XPath string to XPathExpression and reuse it for frequent queries.
- Don't run XPath queries more often than necessary. Each query is costly because it always has to Clone() XPathNavigators.
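Compiling an expression once and reusing it might be sketched as below. The names CompiledXPathSketch and SelectTotalPrices are hypothetical, the query path is taken from the earlier example, and the sketch assumes single-threaded use of the cached expression.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class CompiledXPathSketch
{
    // Compiled once and reused for every document (assumes single-threaded use).
    private static readonly XPathExpression TotalPrices =
        XPathExpression.Compile("/Orders/Order/TotalPrice");

    // Hypothetical helper: runs the cached query against one document.
    public static List<string> SelectTotalPrices(string xml)
    {
        var doc = new XPathDocument(new StringReader(xml));
        XPathNavigator nav = doc.CreateNavigator();

        var values = new List<string>();
        XPathNodeIterator iter = nav.Select(TotalPrices);
        while (iter.MoveNext())
        {
            values.Add(iter.Current.Value);
        }
        return values;
    }
}
```

The path is absolute and has no positional predicates or // steps, in line with the tips above.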
XSLT Processing
- Reuse (cache) XslTransform objects.
- Avoid key() in XSLT. It can return all kinds of nodes, which prevents node-type-based optimization.
- Avoid document(), especially with nonstatic arguments.
- Pull style (e.g. xsl:for-each) is usually better than template match.
- Minimize output size. More importantly, minimize input.
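Caching a transform object could be sketched as follows. Note the substitution: this uses XslCompiledTransform, the newer replacement for the XslTransform class named above, and the stylesheet, CachedTransformSketch, and ListTotals are illustrative assumptions. The stylesheet uses the pull style (xsl:for-each) and emits only the text that is needed.

```csharp
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class CachedTransformSketch
{
    // Hypothetical pull-style stylesheet: lists each order's TotalPrice as text.
    private const string Stylesheet = @"
<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>
  <xsl:output method='text'/>
  <xsl:template match='/'>
    <xsl:for-each select='Orders/Order'>
      <xsl:value-of select='TotalPrice'/>
      <xsl:text> </xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>";

    // Loaded (compiled) once; every call to ListTotals reuses it.
    private static readonly XslCompiledTransform Cached = LoadOnce();

    private static XslCompiledTransform LoadOnce()
    {
        var xslt = new XslCompiledTransform();
        using (XmlReader reader = XmlReader.Create(new StringReader(Stylesheet)))
        {
            xslt.Load(reader);
        }
        return xslt;
    }

    public static string ListTotals(string xml)
    {
        var output = new StringWriter();
        using (XmlReader input = XmlReader.Create(new StringReader(xml)))
        {
            Cached.Transform(input, null, output);
        }
        return output.ToString();
    }
}
```

Loading a stylesheet is the expensive step, so paying for it once per stylesheet rather than once per document is where the saving comes from.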