HtmlPrag provides permissive HTML parsing capability to Scheme programs, which is useful for software agent extraction of information from Web pages, for programmatically transforming HTML files, and for implementing interactive Web browsers. HtmlPrag emits "SHTML," which is an encoding of HTML in [SXML], so that conventional HTML may be processed with XML tools such as [SXPath] and [SXML-Tools]. Like [SSAX-HTML], HtmlPrag provides a permissive tokenizer, but also attempts to recover structure. HtmlPrag also includes procedures for encoding SHTML in HTML syntax.
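As a minimal sketch of the round trip described above, the following parses a small erroneous HTML string into SHTML with html->shtml and encodes it back with shtml->html, both of which appear in the index of this document. The import line assumes HtmlPrag is packaged as a module named (htmlprag), as in Guile-Lib; other Scheme implementations load it differently. The sample string is illustrative only.

```scheme
;; Assumption: HtmlPrag is available as the (htmlprag) module (Guile-Lib style).
;; Adjust this line for other Scheme implementations.
(use-modules (htmlprag))

;; Illustrative erroneous input: the <p> and <b> elements are never closed.
(define page "<p>Hello <b>world")

;; Permissive parse: the result is SHTML rather than a parse error.
(define shtml (html->shtml page))
(write shtml)
(newline)

;; Encode the recovered structure back into HTML syntax.
(display (shtml->html shtml))
(newline)
```

Printing the SHTML with write shows the structure that HtmlPrag recovered, which is a convenient way to check how a given malformed page was interpreted.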
The HtmlPrag parsing behavior is permissive in that it accepts erroneous HTML, handling several classes of HTML syntax errors gracefully, without yielding a parse error. This is crucial for parsing arbitrary real-world Web pages, since many pages actually contain syntax errors that would defeat a strict or validating parser. HtmlPrag's handling of errors is intended to generally emulate popular Web browsers' interpretation of the structure of erroneous HTML. We euphemistically term this kind of parse "pragmatic." To disable the pragmatic behavior and parse HTML more rigidly, the %strict-tokenizer? parameter can be set to #true. In this mode of operation, for example, HTML tags that have no matching end tag are not treated specially, and their content is coalesced. On the other hand, valid HTML will parse more accurately, so it makes sense to use this mode when working with HTML known to be valid.
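The difference between the two modes can be explored with a sketch like the one below. It relies on the optional #:strict? keyword shown in the html->shtml signature in the index; that the keyword takes a boolean and selects the same behavior as the %strict-tokenizer? parameter is an assumption here, as is the (htmlprag) module name (Guile-Lib style). The input string is illustrative only.

```scheme
;; Assumption: HtmlPrag is available as the (htmlprag) module (Guile-Lib style).
(use-modules (htmlprag))

;; Illustrative input: list items without end tags.
(define page "<ul><li>one<li>two</ul>")

;; Default, pragmatic parse.
(define pragmatic (html->shtml page))

;; Rigid parse. Assumption: #:strict? accepts a boolean and corresponds
;; to setting the %strict-tokenizer? parameter to true.
(define strict (html->shtml page #:strict? #t))

;; Print both results so the structures can be compared by eye.
(write pragmatic)
(newline)
(write strict)
(newline)
```

Comparing the two printed trees on a representative page is a reasonable way to decide which mode suits a given source of HTML.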
HtmlPrag also has some support for [XHTML], although XML namespace qualifiers [XML-Names] are currently accepted only to be stripped from the resulting SHTML. Note that valid XHTML input is better handled by a validating XML parser like [SSAX].
To receive notification of new versions of HtmlPrag, and to be polled for input on changes to HtmlPrag being considered, ask the author to add you to the moderated, announce-only email list, htmlprag-announce.
Thanks to Oleg Kiselyov and Kirill Lisovsky for their help with SXML.
%default-parent-constraints | [Variable] |
%parent-constraints | [Variable] |
%strict-tokenizer? | [Variable] |
shtml-comment-symbol | [Variable] |
shtml-decl-symbol | [Variable] |
shtml-empty-symbol | [Variable] |
shtml-end-symbol | [Variable] |
shtml-entity-symbol | [Variable] |
shtml-named-char-id | [Variable] |
shtml-numeric-char-id | [Variable] |
shtml-pi-symbol | [Variable] |
shtml-start-symbol | [Variable] |
shtml-text-symbol | [Variable] |
shtml-top-symbol | [Variable] |
html->shtml input [#:strict?] | [Function] |
html->sxml input [#:strict?] | [Function] |
html->sxml-0nf input [#:strict?] | [Function] |
html->sxml-1nf input [#:strict?] | [Function] |
html->sxml-2nf input [#:strict?] | [Function] |
make-html-tokenizer in normalized? | [Function] |
parse-html/tokenizer tokenizer normalized? [#:strict?] | [Function] |
shtml->html shtml | [Function] |
shtml-entity-value entity | [Function] |
shtml-token-kind token | [Function] |
sxml->html shtml | [Function] |
test-htmlprag | [Function] |
tokenize-html in normalized? | [Function] |
write-shtml-as-html shtml out | [Function] |
write-sxml-html shtml out | [Function] |