In EPiServer Framework we now (since the Community 4 release) ship a new HTML parser. I will do some more technically oriented posts later on, but for now I just wanted to explain why we decided to invest in a new parser.
Short aside - the "HTML parsing" that we do is lexical analysis and tokenization. In a Computer Science class we would not get away with calling this parsing since we don't really care about the syntax.
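To make that distinction a bit more concrete, here is a minimal sketch of tokenizing without caring about syntax. This is not how the HtmlStreamReader is implemented; `NaiveHtmlLexer` is a made-up name for illustration, and the sketch ignores comments, script blocks, and `>` characters inside quoted attribute values.

```csharp
using System;
using System.Collections.Generic;

static class NaiveHtmlLexer
{
    // Hypothetical helper: splits markup into tag tokens ("<...>") and text tokens.
    // It never checks that tags are balanced or correctly nested.
    public static IEnumerable<string> Tokenize(string html)
    {
        int pos = 0;
        while (pos < html.Length)
        {
            if (html[pos] == '<')
            {
                // Tag token: everything up to and including the next '>'
                int end = html.IndexOf('>', pos);
                if (end < 0) end = html.Length - 1; // unterminated tag: take the rest
                yield return html.Substring(pos, end - pos + 1);
                pos = end + 1;
            }
            else
            {
                // Text token: everything up to the next '<'
                int next = html.IndexOf('<', pos);
                if (next < 0) next = html.Length;
                yield return html.Substring(pos, next - pos);
                pos = next;
            }
        }
    }
}
```

Feeding it `<p>Hi <b>there</p>` gives five tokens (`<p>`, `Hi `, `<b>`, `there`, `</p>`), and the unclosed `<b>` passes straight through. A real parser would have to decide what to do about it; a tokenizer does not.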
In EPiServer CMS we primarily need HTML parsing for Friendly URL (aka FURL) rewriting of outgoing HTML. It is also used to deal with the permanent link scheme used when storing CMS data in the database. As part of this process we also do the "soft link" indexing. There are also a few other situations, for example allowing a subset of HTML for untrusted user input (Relate), where a solid HTML parser can help. Finally, there are all sorts of interesting scenarios that you, our partners, come up with - pulling information from other sites and extracting links, custom markup languages, etc.
The SGML Reader is basically an XML reader (built around the same .NET infrastructure as the XML readers) that accepts malformed XML / HTML.
Unfortunately that codebase is very complex. There are a couple of long-standing bugs that we have been unable to fix. Another aspect is that the SGML Reader will force your HTML code into well-formed XHTML. This is usually the right thing to do, but in some cases you don't want your HTML code to be reformatted at all.
The XML reader model, which returns the node and its attributes separately, also makes client code much more complicated than necessary, usually forcing you into an event-based architecture. I will give some examples of this in future posts.
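As a rough illustration of the "node and attributes separately" point, here is a small sketch using the standard `System.Xml.XmlReader` as a stand-in for the XML reader model (it is not the SGML Reader itself, and it only accepts well-formed input). `LinkExtractor` and `ExtractLinks` are names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

static class LinkExtractor
{
    // Hypothetical helper: collects href/src values from well-formed markup
    // using the pull-style XmlReader.
    public static List<string> ExtractLinks(string xml)
    {
        var links = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xml)))
        {
            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element || !reader.HasAttributes)
                    continue;

                // The element node does not carry its attributes with it;
                // the reader has to be moved onto each attribute in turn.
                while (reader.MoveToNextAttribute())
                {
                    if (reader.Name == "href" || reader.Name == "src")
                        links.Add(reader.Value);
                }

                // Step back to the owning element before reading on.
                reader.MoveToElement();
            }
        }
        return links;
    }
}
```

Even for something as simple as picking out links, the client code has to keep track of where the reader is currently positioned, and that bookkeeping grows quickly once you need to rewrite the markup rather than just read it.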
Very good question - it is so easy to fall into the "Not Invented Here" trap. Creating a good HTML parser is a major undertaking, so let's first go through the "must haves" for our parser and compare them to existing HTML parser implementations:
Now let's take a look at the existing parsers:
|XML DOM-based parsing, although it can be used without actually creating an XML DOM. As previously noted, fails on #3 and #4.|
|HTML Agility Pack
|Very nice API, but it does not support a streaming model. Everything gets read into memory before you can act on it, breaking #2.|
|Extremely fast, but enforces a hard, compile-time limit on the size of the HTML. The API is also a bit clunky, breaking must-haves #2 and #3.|
|LINQ to HTML
|Nice API, but still DOM-based, breaking #2.|
|A wrapper for the native-code HTML Tidy library, breaking #5.|
(Please let me know if there are any interesting libraries that I have missed.)
None of these features in itself makes the HtmlStreamReader unique, but the combination of features is perfect for our needs. I hope you will find it useful too!
Since I feel that a blog post is not complete unless it contains at least a few lines of code, here is a short sample showing off the LINQ capabilities. This code snippet will show all external references from the start page of the Swedish newspaper DN.
var html = new HtmlStreamReader(new StreamReader(WebRequest.Create("http://www.dn.se/").GetResponse().GetResponseStream()));
var result = html.
SelectMany(e => e.Attributes).
Where(a => (a.Token == AttributeToken.Src || a.Token == AttributeToken.Href) &&