Introduction
XML (eXtensible Markup Language) is a widely used format for structuring and storing data in a hierarchical manner. It consists of elements enclosed in tags, which can have attributes and contain text or other elements. However, processing large XML files can be memory-intensive, especially if you load the entire document into memory at once.
Use XmlTextReader to parse large XML documents
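A minimal sketch of this approach. The sample document, its element names (bookstore, book, price) and its values are assumptions made up for illustration. The reader moves forward through the input one node at a time, so only the current node is held in memory.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class StreamingPriceReader
{
    // Hypothetical sample data; a real document would be streamed from a file.
    public const string SampleXml =
        "<bookstore>" +
        "<book><title>First</title><price>12.36</price></book>" +
        "<book><title>Second</title><price>11.99</price></book>" +
        "<book><title>Third</title><price>7.97</price></book>" +
        "</bookstore>";

    public static List<string> ReadPrices(TextReader input)
    {
        var prices = new List<string>();
        using (var reader = new XmlTextReader(input))
        {
            // Forward-only pass: inspect each node once and keep nothing else.
            while (reader.Read())
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "price")
                {
                    // ReadString returns the element's text and stops at </price>.
                    prices.Add(reader.ReadString());
                }
            }
        }
        return prices;
    }

    public static void Main()
    {
        List<string> prices = ReadPrices(new StringReader(SampleXml));
        Console.WriteLine(string.Join(" ", prices));
    }
}
```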
Output
12.36 11.99 7.97
For faster, read-only, XPath-based access to data, use XPathDocument and XPathNavigator together with an XPath query.
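A minimal sketch of the XPathDocument approach, again assuming a made-up bookstore document. The element names and the query path /bookstore/book/price are illustrative assumptions.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class XPathPriceQuery
{
    // Hypothetical sample data used only for illustration.
    public const string SampleXml =
        "<bookstore>" +
        "<book><title>First</title><price>12.36</price></book>" +
        "<book><title>Second</title><price>11.99</price></book>" +
        "<book><title>Third</title><price>7.97</price></book>" +
        "</bookstore>";

    public static List<string> SelectPrices(TextReader input)
    {
        // XPathDocument is a read-only store designed for XPath queries.
        var document = new XPathDocument(input);
        XPathNavigator navigator = document.CreateNavigator();

        var prices = new List<string>();
        XPathNodeIterator iterator = navigator.Select("/bookstore/book/price");
        while (iterator.MoveNext())
        {
            prices.Add(iterator.Current.Value);
        }
        return prices;
    }

    public static void Main()
    {
        List<string> prices = SelectPrices(new StringReader(SampleXml));
        Console.WriteLine(string.Join(" ", prices));
    }
}
```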
Output
12.36 11.99 7.97
Combine XmlReader and XmlDocument. On the XmlReader, use the MoveToContent and Skip methods to skip unwanted items, and load only the nodes you need into the XmlDocument.
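A minimal sketch of the combined approach. The sample document is an assumption: it mixes book elements, which are wanted, with a magazine element, which is skipped. Each book is materialized as a small DOM fragment with XmlDocument.ReadNode(), so the full document is never loaded.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SelectiveLoader
{
    // Hypothetical sample data; the magazine subtree stands in for unwanted content.
    public const string SampleXml =
        "<bookstore>" +
        "<book><title>First</title><price>12.36</price></book>" +
        "<magazine><title>Skipped</title><price>3.50</price></magazine>" +
        "<book><title>Second</title><price>11.99</price></book>" +
        "<book><title>Third</title><price>7.97</price></book>" +
        "</bookstore>";

    public static List<string> ReadBookPrices(TextReader input)
    {
        var prices = new List<string>();
        var document = new XmlDocument();
        using (XmlReader reader = XmlReader.Create(input))
        {
            reader.MoveToContent(); // jump past any prolog to the root element
            reader.Read();          // step inside the root
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "book")
                {
                    // Builds a DOM node for this subtree only and leaves the
                    // reader positioned on the node that follows it.
                    XmlNode book = document.ReadNode(reader);
                    XmlElement price = book["price"];
                    if (price != null)
                    {
                        prices.Add(price.InnerText);
                    }
                }
                else if (reader.NodeType == XmlNodeType.Element)
                {
                    reader.Skip(); // discard the unwanted subtree
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return prices;
    }

    public static void Main()
    {
        List<string> prices = ReadBookPrices(new StringReader(SampleXml));
        Console.WriteLine(string.Join(" ", prices));
    }
}
```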
Output
12.36 11.99 7.97
Design Considerations
- Avoid XML wherever possible.
- Avoid processing large documents.
- Avoid validation. XmlValidatingReader is 2-3x slower than XmlTextReader.
- Avoid DTDs, especially IDs and entity references.
- Use streaming interfaces such as XmlReader or SAXdotnet.
- Consider hard-coding the processing logic, including validation, in your own code.
- Keep node names short.
- Consider sharing a NameTable between readers, but only when the documents are likely to have many names in common. As more and more unrelated names accumulate in the table, it becomes slower and slower.
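As a sketch of the NameTable tip above, the following shares one NameTable across several readers. The class name, method name, and the "book" element are assumptions for illustration. Because names are atomized in the shared table, an element name can be compared by object reference instead of by string content.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SharedNameTableExample
{
    public static int CountBooks(IEnumerable<string> documents)
    {
        var names = new NameTable();
        // Add returns the atomized instance stored in the table.
        object book = names.Add("book");

        int count = 0;
        foreach (string xml in documents)
        {
            // Every reader is given the same table, so common names are stored once.
            using (var reader = new XmlTextReader(new StringReader(xml), names))
            {
                while (reader.Read())
                {
                    // Reference comparison against the atomized name.
                    if (reader.NodeType == XmlNodeType.Element &&
                        (object)reader.LocalName == book)
                    {
                        count++;
                    }
                }
            }
        }
        return count;
    }

    public static void Main()
    {
        var documents = new[] { "<a><book/><book/></a>", "<b><book/></b>" };
        Console.WriteLine(CountBooks(documents));
    }
}
```

Sharing pays off only when the documents really do use the same vocabulary, as the tip notes.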
Parsing XML
- Use XmlTextReader and avoid validating readers.
- When only a single node is required, consider XmlDocument.ReadNode() rather than loading the entire document with Load().
- Set the XmlResolver property to null on the XmlReaders that expose it, to avoid access to external resources.
- Make full use of MoveToContent() and Skip(). They avoid extraneous name creation. However, the benefit shrinks to almost nothing when you use XmlValidatingReader.
- Avoid accessing Value on Text/CDATA nodes unless you actually need it.
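A small sketch that applies several of the parsing tips together: a non-validating XmlTextReader, XmlResolver set to null, and a loop that looks only at node types and names and never touches Value. The class and method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ElementCounter
{
    public static int CountElements(TextReader input, string localName)
    {
        int count = 0;
        using (var reader = new XmlTextReader(input))
        {
            // No resolver: the reader will not fetch external resources.
            reader.XmlResolver = null;

            while (reader.Read())
            {
                // Only NodeType and LocalName are inspected; Value is never read.
                if (reader.NodeType == XmlNodeType.Element &&
                    reader.LocalName == localName)
                {
                    count++;
                }
            }
        }
        return count;
    }

    public static void Main()
    {
        string xml = "<bookstore><book><price>12.36</price></book><book/></bookstore>";
        Console.WriteLine(CountElements(new StringReader(xml), "book"));
    }
}
```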
Validating XML
- Avoid extraneous validation.
- Consider caching schemas.
- Avoid identity constraints. They store keys/fields for the entire document, and the keys are boxed.
- Avoid extraneous strong typing. It results in calls to XmlSchemaDatatype.ParseValue(), although it can also let you avoid accessing the Value string.
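One way to cache schemas, sketched under assumptions: the schema is compiled once into a static XmlSchemaSet and reused for every validation. This sketch uses XmlReader.Create with XmlReaderSettings rather than XmlValidatingReader, and the one-element schema is a made-up example.

```csharp
using System;
using System.IO;
using System.Xml;
using System.Xml.Schema;

public static class CachedSchemaValidator
{
    // Hypothetical schema: a single decimal-typed price element.
    private const string Xsd =
        "<xs:schema xmlns:xs='http://www.w3.org/2001/XMLSchema'>" +
        "<xs:element name='price' type='xs:decimal'/>" +
        "</xs:schema>";

    // Parsed and compiled once, then shared by every call to IsValid.
    private static readonly XmlSchemaSet Schemas = BuildSchemas();

    private static XmlSchemaSet BuildSchemas()
    {
        var set = new XmlSchemaSet();
        using (XmlReader schemaReader = XmlReader.Create(new StringReader(Xsd)))
        {
            set.Add(null, schemaReader);
        }
        set.Compile();
        return set;
    }

    public static bool IsValid(string xml)
    {
        bool valid = true;
        var settings = new XmlReaderSettings();
        settings.ValidationType = ValidationType.Schema;
        settings.Schemas = Schemas; // reuse the cached, compiled set
        settings.ValidationEventHandler += (sender, e) => valid = false;

        using (XmlReader reader = XmlReader.Create(new StringReader(xml), settings))
        {
            while (reader.Read()) { } // validation happens as the reader advances
        }
        return valid;
    }

    public static void Main()
    {
        Console.WriteLine(IsValid("<price>12.36</price>"));
        Console.WriteLine(IsValid("<price>abc</price>"));
    }
}
```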
Writing XML
- Write output directly wherever possible.
- To save documents, use an XmlTextWriter without indentation rather than saving to a TextWriter, Stream, or file name (all of which indent the output), unless the result is meant for human reading.
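A sketch contrasting the two ways of saving mentioned above. The helper names and the tiny document are assumptions. Save(XmlWriter) with a non-indenting XmlTextWriter writes the markup on one line, while Save(TextWriter) indents it.

```csharp
using System;
using System.IO;
using System.Xml;

public static class CompactSave
{
    public static string SaveCompact(XmlDocument document)
    {
        var buffer = new StringWriter();
        using (var writer = new XmlTextWriter(buffer))
        {
            writer.Formatting = Formatting.None; // no indentation or line breaks
            document.Save(writer);
        }
        return buffer.ToString();
    }

    public static string SaveIndented(XmlDocument document)
    {
        var buffer = new StringWriter();
        document.Save(buffer); // the TextWriter overload indents its output
        return buffer.ToString();
    }

    public static void Main()
    {
        var document = new XmlDocument();
        document.LoadXml("<a><b>1</b></a>");
        Console.WriteLine(SaveCompact(document).Length);
        Console.WriteLine(SaveIndented(document).Length);
    }
}
```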
DOM Processing
- Avoid InnerXml. It internally creates XmlTextReader/XmlTextWriter. InnerText is fine.
- Avoid PreviousSibling. XmlDocument is very inefficient for backward traversal.
- Append nodes to the document as early as possible. Adding a large subtree later results in a longer extra pass to check ID attributes.
- Prefer FirstChild/NextSibling and avoid accessing ChildNodes, which creates an XmlNodeList that is otherwise never instantiated.
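The forward traversal recommended above can be sketched as follows. The helper name and sample markup are hypothetical.

```csharp
using System;
using System.Xml;

public static class SiblingWalk
{
    public static int CountElementChildren(XmlNode parent)
    {
        int count = 0;
        // Walk forward through the sibling chain without touching ChildNodes.
        for (XmlNode child = parent.FirstChild; child != null; child = child.NextSibling)
        {
            if (child.NodeType == XmlNodeType.Element)
            {
                count++;
            }
        }
        return count;
    }

    public static void Main()
    {
        var document = new XmlDocument();
        document.LoadXml("<r><a/><b/>text<c/></r>");
        Console.WriteLine(CountElementChildren(document.DocumentElement));
    }
}
```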
XPath Processing
- Consider using XPathDocument, but only when you need the entire document. With XmlDocument you can use ReadNode() to load a single node, but XPathDocument has no equivalent.
- Avoid queries on the preceding-sibling and preceding axes, especially over XmlDocument. They result in sorting, and for XmlDocument they need access to PreviousSibling.
- Avoid // (descendant). The returned nodes are most likely to be irrelevant.
- Avoid position(), last() and positional predicates (especially things like foo[last()-1]).
- Compile the XPath string to XPathExpression and reuse it for frequent queries.
- Don't run XPath queries more often than necessary. Each query is costly because it always has to Clone() XPathNavigators.
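A sketch of compiling an XPath string once and reusing the resulting XPathExpression. The query path and sample document are assumptions for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml.XPath;

public static class CompiledQuery
{
    // Compiled once; each call to SelectPrices reuses it instead of re-parsing the string.
    private static readonly XPathExpression PriceExpression =
        XPathExpression.Compile("/bookstore/book/price");

    public static List<string> SelectPrices(XPathNavigator navigator)
    {
        var prices = new List<string>();
        XPathNodeIterator iterator = navigator.Select(PriceExpression);
        while (iterator.MoveNext())
        {
            prices.Add(iterator.Current.Value);
        }
        return prices;
    }

    public static void Main()
    {
        // Hypothetical sample data.
        string xml =
            "<bookstore>" +
            "<book><price>12.36</price></book>" +
            "<book><price>11.99</price></book>" +
            "</bookstore>";
        XPathNavigator navigator = new XPathDocument(new StringReader(xml)).CreateNavigator();
        Console.WriteLine(string.Join(" ", SelectPrices(navigator)));
    }
}
```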
XSLT Processing
- Reuse (cache) XslTransform objects.
- Avoid key() in XSLT. It can return all kinds of nodes, which prevents node-type-based optimization.
- Avoid document(), especially with nonstatic arguments.
- Pull style (e.g. xsl:for-each) is usually better than template matching.
- Minimize output size. More importantly, minimize input.
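A sketch of caching a loaded transform so that each stylesheet is compiled only once. It uses XslCompiledTransform, the class that superseded XslTransform in later framework versions, rather than XslTransform itself. The cache class, its keying scheme, and the stylesheet are assumptions for illustration. The stylesheet uses the pull style (xsl:for-each) mentioned above.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;
using System.Xml.Xsl;

public static class TransformCache
{
    private static readonly Dictionary<string, XslCompiledTransform> Cache =
        new Dictionary<string, XslCompiledTransform>();
    private static readonly object Gate = new object();

    // Hypothetical stylesheet: emits each price followed by a space.
    public const string PriceStylesheet =
        "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>" +
        "<xsl:output method='text'/>" +
        "<xsl:template match='/'>" +
        "<xsl:for-each select='bookstore/book/price'>" +
        "<xsl:value-of select='.'/><xsl:text> </xsl:text>" +
        "</xsl:for-each>" +
        "</xsl:template>" +
        "</xsl:stylesheet>";

    public static XslCompiledTransform Get(string key, string stylesheetText)
    {
        lock (Gate)
        {
            XslCompiledTransform transform;
            if (!Cache.TryGetValue(key, out transform))
            {
                // Loading compiles the stylesheet; do it once per key.
                transform = new XslCompiledTransform();
                using (XmlReader reader = XmlReader.Create(new StringReader(stylesheetText)))
                {
                    transform.Load(reader);
                }
                Cache[key] = transform;
            }
            return transform;
        }
    }

    public static string Apply(XslCompiledTransform transform, string xml)
    {
        var output = new StringWriter();
        using (XmlReader input = XmlReader.Create(new StringReader(xml)))
        {
            transform.Transform(input, null, output);
        }
        return output.ToString();
    }

    public static void Main()
    {
        string xml = "<bookstore><book><price>12.36</price></book></bookstore>";
        XslCompiledTransform transform = Get("prices", PriceStylesheet);
        Console.WriteLine(Apply(transform, xml).Trim());
    }
}
```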