Validating and non-validating XML parsers
Regular expressions are a tool that is insufficiently sophisticated to understand the constructs employed by HTML.
HTML is not a regular language and hence cannot be parsed by regular expressions.
But many will try, some will claim success and others will find the fault and totally mess you up.
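To make the nesting problem concrete, here is a minimal sketch in Python (my choice of language; the thread does not name one). The sample markup, the `DivText` class, and the variable names are all made up for illustration. It compares a non-greedy regex with the standard library's `html.parser` on a `div` nested inside another `div`.

```python
import re
from html.parser import HTMLParser

# Hypothetical sample: a div nested inside another div.
html = '<div id="outer"><div id="inner">a</div>b</div>'

# The non-greedy regex stops at the FIRST closing tag, so the outer div is cut short.
match = re.search(r'<div id="outer">(.*?)</div>', html)
regex_inner = match.group(1)

class DivText(HTMLParser):
    """Collects the text inside the div with a given id, tracking nesting depth."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0      # 0 means we are outside the target div
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self.depth > 0:
            self.depth += 1  # a nested div inside the target
        elif dict(attrs).get("id") == self.target_id:
            self.depth = 1   # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)

parser = DivText("outer")
parser.feed(html)
parser.close()
parser_text = "".join(parser.chunks)

print(repr(regex_inner))  # '<div id="inner">a'
print(repr(parser_text))  # 'ab'
```

A greedy `.*` happens to work on this one string, but it overshoots as soon as a sibling `div` follows. The parser gets it right because it counts depth, which is exactly what a plain regular expression cannot do.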
No matter how many times we say it, they won't stop coming every day... It is a lost cause, which someone else can fight for a bit.
Also, scraping fairly regularly formatted data from large documents is going to be WAY faster with judicious use of scan & regex than any generic parser.
Maybe if you give examples of the "(X)HTML syntax errors implemented in real world user agents" you're referring to, I'll understand what you're getting at.
Mihalcin is exactly right. Most extant regex engines are more powerful than Chomsky Type 3 grammars (e.g. non-greedy matching, backrefs).
Regex queries are not equipped to break down HTML into its meaningful parts. Even enhanced irregular regular expressions as used by Perl are not up to the task of parsing HTML. HTML is a language of sufficient complexity that it cannot be parsed by regular expressions. Even Jon Skeet cannot parse HTML using regular expressions.
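For the "scan & regex" point about regularly formatted data, a rough sketch of what that looks like, assuming the markup is machine-generated, flat, and identically shaped on every row. The sample rows and the `ROW` pattern are hypothetical, and `re.findall` stands in for the "scan" mentioned above. Whether this actually beats a generic parser depends on the data, so measure before relying on it.

```python
import re

# Hypothetical machine-generated report: one flat, identically shaped row per line.
page = """
<tr><td>alpha</td><td>10</td></tr>
<tr><td>beta</td><td>20</td></tr>
<tr><td>gamma</td><td>30</td></tr>
"""

# The pattern assumes no nesting, no attributes, and no stray whitespace inside a row.
ROW = re.compile(r"<tr><td>([^<]*)</td><td>(\d+)</td></tr>")

# findall returns one (name, count) tuple per matching row.
rows = [(name, int(count)) for name, count in ROW.findall(page)]

print(rows)  # [('alpha', 10), ('beta', 20), ('gamma', 30)]
```

This is pattern matching on a known, fixed layout, not parsing HTML. The moment a row gains an attribute, a nested tag, or a line break, the pattern silently skips it, which is the failure mode the rest of this thread is warning about.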
As I have answered in HTML-and-regex questions here so many times before, the use of regex will not allow you to consume HTML.