WizardHTML¶
WHATWG-compliant HTML5 toolkit: DFA tokenizer, spec-guided tree builder, DOM, configurable serializer, and high-level helpers for cleaning, pretty-printing, and HTML→Markdown.
Installation¶
Requires Python 3.9+.
pip install wizardhtml
Quick start¶
import wizardhtml as wh
# Mode A: text-only extraction
print(wh.clean_html("<div><p>Hello</p><script>x()</script></div>"))
# -> "Hello"
# Pretty print
html = "<body><p>Hi <b>there</b></p><img src=x></body>"
print(wh.beautiful_html(html, indent=2))
# HTML → Markdown
print(wh.html_to_markdown("<h1>T</h1><p>Body</p>"))
# Parser and DOM
doc = wh.parse("<!doctype html><html><body><p>Hi</p></body></html>")
Public API¶
| Function | Purpose |
|---|---|
| parse | Parse into Document or DocumentFragment. |
| clean_html | High-level HTML cleaning with A/B/C modes. |
| beautiful_html | Non-destructive pretty-printer for HTML. |
| html_to_markdown | Convert HTML → Markdown. |
| serialize | Serialize DOM → HTML. |
| to_text | Extract readable text (internally uses Mode A, then normalizes whitespace). |
DOM types¶
Node, Document, DocumentFragment, Element, Text, Comment.
Parsing¶
Signature¶
import wizardhtml as wh
wh.parse(
    html: str,
    fragment_context: str | None = None,
    return_errors: bool = False,
) -> Document | DocumentFragment | tuple[Document | DocumentFragment, list[str]]
Behavior¶
- Full document when fragment_context is None → returns Document.
- Fragment parsing when fragment_context is an element name (e.g. "div", "template", "tbody", "svg", "math") → returns DocumentFragment. Tokenizer state and insertion mode follow WHATWG rules for the context element.
- With return_errors=True, returns (node, errors: list[str]), where the errors are informational (parsing still completes and returns a node).
Examples¶
Full document:
import wizardhtml as wh
doc = wh.parse("<!doctype html><html><body><p>Hi</p></body></html>")
Fragment:
import wizardhtml as wh
frag = wh.parse("<li>item</li>", fragment_context="ul")
Collecting parse errors:
import wizardhtml as wh
node, errors = wh.parse("<p><b>x</p>", return_errors=True)
print(errors)
HTML cleaning¶
HTML cleanup with granular switches for scripts, metadata, embedded media, interactive elements, headings, phrasing content, and more. Supports wildcard-based tag and attribute removal, selective content stripping, and empty-node pruning. Returns text or HTML depending on the mode.
Behavior¶
Three explicit modes with different outputs:
| Mode | How to trigger | Returns | Description |
|---|---|---|---|
| A) text-only | No parameters provided (all None) | str (text) | Extracts text, skips script-supporting tags, inserts spaces. |
| B) structural clean | At least one flag is True | str (HTML) | Removes/unwraps per flags and serializes sanitized HTML. |
| C) text+preserve | Parameters present and all are False | str (text plus preserved tags) | Extracts text but preserves groups explicitly set False. |
Note
When deleting nodes between adjacent text nodes, the cleaner inserts one space to avoid word concatenation.
In Mode B the serializer uses quote_attr_values="always" for stable diffs.
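The space-insertion rule in the note can be illustrated with a small, library-independent sketch. It models children as a flat list in which strings are text nodes and (tag, text) tuples are elements; remove_between_text is a hypothetical helper written for this illustration, not WizardHTML's implementation.

```python
def remove_between_text(children, drop):
    """Drop elements whose tag is in `drop`; merge text nodes that become adjacent.

    children: list of str (text nodes) or (tag, text) tuples (element nodes).
    """
    out = []
    removed_since_last_kept = False
    for child in children:
        if isinstance(child, tuple):
            if child[0] in drop:
                removed_since_last_kept = True
                continue
            out.append(child)
        elif removed_since_last_kept and out and isinstance(out[-1], str):
            # Two text nodes just became neighbours: keep the words apart
            # with exactly one space, unless one side already has one.
            if not out[-1].endswith(" ") and not child.startswith(" "):
                out[-1] += " "
            out[-1] += child
        else:
            out.append(child)
        removed_since_last_kept = False
    return out


print(remove_between_text(["Hello", ("script", "x()"), "world"], {"script"}))
# ['Hello world']
```

Without the inserted space, the two text nodes would serialize as "Helloworld".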
Parameters¶
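| Parameter | Description |
|---|---|
| html | (str) Raw HTML input. |
| remove_script | (bool or None) Drop executable tags (<script>, …). |
| remove_metadata_tags | (bool or None) Drop metadata (<title>, <meta>, <link>, …). |
|  | (bool or None) Drop flow content (layout + phrasing, e.g. …). |
| remove_sectioning_tags | (bool or None) Drop sectioning (<section>, <article>, <aside>, <nav>). |
| remove_heading_tags | (bool or None) Drop headings <h1>-<h6>. |
|  | (bool or None) Drop phrasing (inline) elements, e.g. …. |
| remove_embedded_tags | (bool or None) Drop embedded content (<img>, <iframe>, <embed>, <video>, <audio>, …). |
| remove_interactive_tags | (bool or None) Drop interactive elements (<button>, <input>, <select>, …). |
|  | (bool or None) Drop palpable elements (broad set incl. …). |
| remove_comments | (bool or None) Remove comments. |
| remove_doctype | (bool or None) Remove the doctype. |
| remove_specific_attributes | (str, list, or None) Remove attributes by name or wildcard (e.g. "id", "data-*", "on*"). |
|  | (str, list, or None) Unwrap tags by name or wildcard (children are lifted into parent). |
| remove_empty_tags | (bool or None) Prune empty nodes after edits. |
| remove_content_tags | (str, list, or None) Keep tag but drop inner content. |
| remove_tags_and_contents | (str, list, or None) Remove tag and its entire content. |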
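Wildcard removal by name can be pictured as shell-style pattern matching. The sketch below uses the standard library's fnmatch for that; strip_attributes is a hypothetical helper, and case-insensitive matching is an assumption made here rather than documented WizardHTML behavior.

```python
from fnmatch import fnmatchcase


def strip_attributes(attrs, patterns):
    """Return a copy of `attrs` without names matching any pattern."""
    if isinstance(patterns, str):
        patterns = [patterns]  # accept a single pattern as well as a list
    lowered = [p.lower() for p in patterns]
    return {
        name: value
        for name, value in attrs.items()
        if not any(fnmatchcase(name.lower(), p) for p in lowered)
    }


attrs = {"id": "hero", "data-track": "x", "class": "k", "onclick": "hack()"}
print(strip_attributes(attrs, ["id", "data-*", "on*"]))
# {'class': 'k'}
```

A pattern such as "on*" matches every attribute whose name starts with "on", so it is worth checking that no wanted attribute shares the prefix.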
Parameter semantics¶
None → flag unset. If all are None ⇒ Mode A.
True → request removal/operation ⇒ Mode B.
False → request preservation ⇒ Mode C (text output that preserves those groups; remove_comments=False and remove_doctype=False also preserve them).
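The three rules above can be read as a small decision function. The sketch below is illustrative only: select_mode is a hypothetical helper, and it assumes that a single True flag selects Mode B even when other flags are False, a mixed case the rules do not spell out.

```python
from typing import Optional


def select_mode(**flags: Optional[bool]) -> str:
    """Return "A", "B", or "C" for a set of tri-state flags."""
    values = list(flags.values())
    if all(v is None for v in values):
        return "A"  # nothing set: text-only extraction
    if any(v is True for v in values):
        return "B"  # at least one removal requested: sanitized HTML
    return "C"      # only False (and None) flags: text with preserved groups


print(select_mode())                           # A
print(select_mode(remove_script=True))         # B
print(select_mode(remove_heading_tags=False))  # C
```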
Tag groups reference¶
Flag |
Tags affected |
|---|---|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Examples¶
Mode A — text only¶
import wizardhtml as wh
txt = wh.clean_html("<div><p>Hello</p><script>x()</script></div>")
print(txt)
Output
Hello
Mode B — structural clean (HTML out)¶
Drop scripts, metadata, embeds; strip attributes; prune empties.
import wizardhtml as wh
html = """
<html><head>
<title>x</title><meta charset="utf-8">
<link rel="preload" href="x.css"><script>evil()</script>
</head>
<body>
<article><h1>Title</h1><img src="a.png"><p id="k" onclick="x()">hello</p></article>
<!-- comment -->
</body></html>
"""
out = wh.clean_html(
    html,
    remove_script=True,
    remove_metadata_tags=True,
    remove_embedded_tags=True,
    remove_specific_attributes=["id", "on*"],
    remove_empty_tags=True,
    remove_comments=True,
    remove_doctype=True,
)
print(out)
Output
<html>
<body>
<article><h1>Title</h1><p>hello</p></article>
</body>
</html>
Wildcards and unwrap vs hard remove:
import wizardhtml as wh
html = """
<div id="hero" data-track="x">
<svg viewBox="0 0 10 10"><circle r="5"/></svg>
<p class="k" onclick="hack()">Hello</p>
<iframe src="a.html"></iframe>
</div>
"""
out = wh.clean_html(
    html,
    remove_tags_and_contents=["iframe", "template"],
    remove_specific_attributes=["id", "data-*", "on*"],
    remove_empty_tags=True,
)
print(out)
Output
<html><body><div>
<p class="k">Hello</p>
</div></body></html>
Content stripping vs tag deletion:
import wizardhtml as wh
html = """
<article>
<script>track()</script>
<style>p{}</style>
<pre>code stays</pre>
<noscript>fallback</noscript>
</article>
"""
keep_tags_drop_content = wh.clean_html(
    html,
    remove_content_tags=["script", "style"],  # keep <script>/<style> but empty them
)
print(keep_tags_drop_content)
Output
<html><head></head><body><article>
<script></script>
<style></style>
<pre>code stays</pre>
<noscript>fallback</noscript>
</article></body></html>
Sectioning, headings, flow:
import wizardhtml as wh
html = "<section><h1>T</h1><div><address>X</address><p>Body</p></div></section>"
out = wh.clean_html(
    html,
    remove_sectioning_tags=True,  # drop <section>/<article>/<aside>/<nav>
    remove_heading_tags=True,     # drop <h1>-<h6>
)
print(out)
Output
<html><head></head><body></body></html>
Interactive and embedded:
import wizardhtml as wh
html = """
<button id="b" disabled>Click</button>
<img src="logo.png" alt="Logo">
<video src="v.mp4"></video>
"""
out = wh.clean_html(
    html,
    remove_interactive_tags=True,  # button, input, select
    remove_embedded_tags=True,     # img, iframe, embed, video, audio
    remove_specific_attributes=["id"],
    remove_empty_tags=True,
)
print(out) # empty string if everything got removed
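Empty-node pruning, and the empty-string result when nothing survives, can be sketched with the standard library's ElementTree on a well-formed snippet. This is a stand-in for illustration, not WizardHTML's DOM or algorithm; prune_empty, clean_sketch, and the partial VOID set are assumptions introduced here.

```python
import xml.etree.ElementTree as ET

VOID = {"br", "hr", "img", "input", "meta", "link"}  # partial list for the demo


def _detach(parent, child):
    """Remove `child` but keep the text that followed it (its tail)."""
    tail = child.tail or ""
    index = list(parent).index(child)
    if index == 0:
        parent.text = (parent.text or "") + tail
    else:
        previous = parent[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)


def prune_empty(elem):
    """Bottom-up: drop non-void children with no text and no children left."""
    for child in list(elem):
        prune_empty(child)
        has_text = bool((child.text or "").strip())
        if len(child) == 0 and not has_text and child.tag not in VOID:
            _detach(elem, child)


def clean_sketch(snippet):
    root = ET.fromstring(snippet)
    prune_empty(root)
    if len(root) == 0 and not (root.text or "").strip():
        return ""  # nothing meaningful left
    return ET.tostring(root, encoding="unicode")


print(repr(clean_sketch("<div><section><span/></section></div>")))  # ''
print(clean_sketch("<div><p>Hi</p><b></b></div>"))  # <div><p>Hi</p></div>
```

Pruning runs bottom-up so that a parent whose only children were empty is itself removed in the same pass.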
Mode C — text with preservation¶
Preserve sectioning + headings + comments:
import wizardhtml as wh
html = "<article><h1>T</h1><p>Body</p><!-- c --></article>"
txt = wh.clean_html(
    html,
    remove_sectioning_tags=False,
    remove_heading_tags=False,
    remove_comments=False,
)
print(txt)
Output
<article><h1>T</h1>Body<!-- c --></article>
Preserve images but text-only elsewhere:
import wizardhtml as wh
html = '<p>A<img src="a.png" alt="A">B</p>'
txt = wh.clean_html(
    html,
    remove_embedded_tags=False,  # keep <img>
)
print(txt)
Output
A<img src="a.png" alt="A">B
Operational notes¶
When deleting nodes between adjacent text nodes, the cleaner inserts one space to avoid word concatenation.
In Mode B the serializer prefers stable quoting for diff-friendly output.
If the DOM becomes empty after removals, clean_html returns "".
Text helper¶
Extract readable text with whitespace normalization.
import wizardhtml as wh
txt = wh.to_text("<div> A <b> B </b>\n\n <i>C</i></div>", separator=" ")
print(txt) # "A B C"
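For readers who want to see the idea behind text extraction with whitespace normalization, here is a minimal standard-library sketch. It is not WizardHTML's implementation: TextExtractor and to_text_sketch are hypothetical names, the skip list is limited to script and style, and treating separator as the string placed between words is an assumption.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    SKIP = {"script", "style"}  # assumed skip list for this sketch

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def to_text_sketch(html, separator=" "):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    # Joining chunks with a space keeps neighbouring nodes apart;
    # split() then collapses every run of whitespace.
    words = " ".join(parser.chunks).split()
    return separator.join(words)


print(to_text_sketch("<div>  A <b> B </b>\n\n <i>C</i></div>"))  # A B C
```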
Beautiful HTML¶
Pretty-print raw HTML without changing semantics. The formatter parses html,
serializes a normalized DOM, and indents nodes by a configurable amount. It never
reflows raw-text or RCDATA content (<script>, <style>, <textarea>) and avoids introducing
visible whitespace unless explicitly requested.
Parameters¶
- html (str): Raw HTML input.
- indent (int, default 2): Spaces per indentation level.
- quote_attr_values ({"always", "spec", "legacy"}, default "spec"): Quoting policy for attribute values.
  - "always" → always quote.
  - "spec" → quote only when required by the HTML5 spec (space, quotes, =, <, >, backtick).
  - "legacy" → legacy behavior; quote only for whitespace or quotes.
- quote_char ({'"', "'"}, default '"'): Preferred quote character when quoting.
- use_best_quote_char (bool, default True): Choose the quote character that minimizes escaping per attribute.
- minimize_boolean_attributes (bool, default False): Render compact boolean attributes (e.g., disabled instead of disabled="disabled").
- use_trailing_solidus (bool, default False): Emit a trailing solidus on void elements (<br />). Cosmetic in HTML5.
- space_before_trailing_solidus (bool, default True): Insert a space before the trailing solidus if it is used.
- escape_lt_in_attrs (bool, default False): Escape < and > inside attribute values.
- escape_rcdata (bool, default False): Escape characters inside RCData elements (usually keep False).
- resolve_entities (bool, default True): Prefer named entities where available during serialization.
- alphabetical_attributes (bool, default True): Sort attributes alphabetically (useful for diff-friendly output).
- strip_whitespace (bool, default False): Trim leading/trailing whitespace in text nodes and collapse runs of spaces.
- include_doctype (bool, default True): Prepend <!DOCTYPE html> if missing.
- expand_mixed_content (bool, default True): For elements that contain both text and child elements, place each child on its own indented line (may introduce visible whitespace in inline contexts).
- expand_empty_elements (bool, default True): Render empty non-void elements on two lines (open/close on separate lines).
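The three quote_attr_values policies can be summarized as a predicate over the attribute value. The sketch below follows the character sets listed above; needs_quotes is a hypothetical helper, and quoting the empty string under every policy is an assumption (HTML syntax does not allow an empty unquoted value).

```python
import re

# Characters that force quoting under each policy, per the list above.
_SPEC_UNSAFE = re.compile(r"[\s\"'=<>`]")
_LEGACY_UNSAFE = re.compile(r"[\s\"']")


def needs_quotes(value: str, policy: str = "spec") -> bool:
    """Decide whether an attribute value must be quoted."""
    if policy == "always":
        return True
    if value == "":
        return True  # assumption: empty values are always quoted
    if policy == "spec":
        return bool(_SPEC_UNSAFE.search(value))
    if policy == "legacy":
        return bool(_LEGACY_UNSAFE.search(value))
    raise ValueError(f"unknown policy: {policy!r}")


print(needs_quotes("x"))                # False
print(needs_quotes("a=b", "spec"))      # True
print(needs_quotes("a=b", "legacy"))    # False
```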
Examples¶
Basic pretty-print:
import wizardhtml as wh
html = """
<body>
<button id='btn1' class="primary" disabled="disabled">
Click <b>me</b>
</button>
<img alt="Logo" src="/static/logo.png">
</body>
"""
pretty = wh.beautiful_html(
    html=html,
    indent=4,
    alphabetical_attributes=True,
    minimize_boolean_attributes=True,
    quote_attr_values="always",
    strip_whitespace=True,
    include_doctype=True,
    expand_mixed_content=True,
    expand_empty_elements=True,
)
print(pretty)
Output
<!DOCTYPE html>
<html>
    <head>
    </head>
    <body>
        <button class="primary" disabled id="btn1">
            Click
            <b>
                me
            </b>
        </button>
        <img alt="Logo" src="/static/logo.png">
    </body>
</html>
Quote policies & best quote char¶
import wizardhtml as wh
html = '<a data-title=\'He said "hi"\'>x</a>'
out = wh.beautiful_html(
    html,
    quote_attr_values="always",
    quote_char='"',
    use_best_quote_char=True,  # picks ' to minimize escaping
)
print(out)
Output
<!DOCTYPE html>
<html>
  <head></head>
  <body>
    <a data-title='He said "hi"'>
      x
    </a>
  </body>
</html>
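The choice made by use_best_quote_char in the example above can be sketched as counting quote characters in the value and picking the delimiter that needs fewer escapes. quote_attr is a hypothetical helper; the tie-breaking rule (fall back to quote_char) and the entity spellings are assumptions, not confirmed WizardHTML behavior.

```python
def quote_attr(value, quote_char='"', use_best_quote_char=True):
    """Quote an attribute value, escaping only what the delimiter requires."""
    if use_best_quote_char:
        doubles, singles = value.count('"'), value.count("'")
        if doubles > singles:
            quote_char = "'"
        elif singles > doubles:
            quote_char = '"'
        # On a tie, keep the caller's preferred quote_char.
    escaped = value.replace("&", "&amp;")
    entity = "&quot;" if quote_char == '"' else "&#39;"
    escaped = escaped.replace(quote_char, entity)
    return f"{quote_char}{escaped}{quote_char}"


print(quote_attr('He said "hi"'))                             # 'He said "hi"'
print(quote_attr('He said "hi"', use_best_quote_char=False))  # "He said &quot;hi&quot;"
```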
Void elements and trailing solidus¶
import wizardhtml as wh
html = "<br><img src=x>"
out = wh.beautiful_html(
    html,
    use_trailing_solidus=True,
    space_before_trailing_solidus=False,
)
print(out)
Output
<!DOCTYPE html>
<html>
  <head></head>
  <body>
    <br/>
    <img src=x/>
  </body>
</html>
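How the two solidus options interact can be shown with a tiny start-tag renderer. This is a sketch under stated assumptions: render_start_tag is a hypothetical helper, and it always quotes attribute values, which differs from the "spec" default used in the example above.

```python
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


def render_start_tag(name, attrs=None, use_trailing_solidus=False,
                     space_before_trailing_solidus=True):
    """Render a start tag; only void elements ever get the solidus."""
    parts = [name]
    for key, value in (attrs or {}).items():
        parts.append(f'{key}="{value}"')
    tag = "<" + " ".join(parts)
    if use_trailing_solidus and name in VOID_ELEMENTS:
        tag += " /" if space_before_trailing_solidus else "/"
    return tag + ">"


print(render_start_tag("br", use_trailing_solidus=True))  # <br />
print(render_start_tag("img", {"src": "x"}, use_trailing_solidus=True,
                       space_before_trailing_solidus=False))  # <img src="x"/>
```

Quoting matters when the space is omitted: an HTML parser reading an unquoted value followed directly by a solidus treats the solidus as part of the value.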
Whitespace & mixed content¶
import wizardhtml as wh
html = "<p>Hello <b>world</b>!</p>"
out = wh.beautiful_html(
    html,
    expand_mixed_content=True,  # puts <b> on its own line
    strip_whitespace=False,
)
print(out)
Output
<!DOCTYPE html>
<html>
  <head></head>
  <body>
    <p>
      Hello
      <b>
        world
      </b>
      !
    </p>
  </body>
</html>
Serialization¶
Signature¶
import wizardhtml as wh
wh.serialize(
    node,
    *,
    quote_attr_values: str = "spec",  # "spec" | "legacy" | "always"
    quote_char: str = '"',
    use_best_quote_char: bool = True,
    minimize_boolean_attributes: bool = False,
    resolve_entities: bool = True,
    alphabetical_attributes: bool = False,
    strip_whitespace: bool = False,
    include_doctype: bool = True,
) -> str
Notes¶
- include_doctype is effective only when node is a Document.
- alphabetical_attributes=True is useful for diff-friendly output.
- Does not alter DOM structure.
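Why sorted attributes help with diffs, and what boolean minimization does, can be seen in a short attribute renderer. render_attrs and the partial BOOLEAN_ATTRIBUTES set are hypothetical names for this illustration; the real serializer's rules may differ.

```python
BOOLEAN_ATTRIBUTES = {"disabled", "checked", "selected", "readonly", "required", "hidden"}  # partial


def render_attrs(attrs, alphabetical_attributes=False,
                 minimize_boolean_attributes=False):
    """Render an attribute dict as it would appear inside a start tag."""
    items = sorted(attrs.items()) if alphabetical_attributes else list(attrs.items())
    rendered = []
    for name, value in items:
        is_boolean = name in BOOLEAN_ATTRIBUTES and value in ("", name)
        if minimize_boolean_attributes and is_boolean:
            rendered.append(name)  # disabled="disabled" becomes disabled
        else:
            rendered.append(f'{name}="{value}"')
    return " ".join(rendered)


attrs = {"id": "btn1", "class": "primary", "disabled": "disabled"}
print(render_attrs(attrs))
# id="btn1" class="primary" disabled="disabled"
print(render_attrs(attrs, alphabetical_attributes=True, minimize_boolean_attributes=True))
# class="primary" disabled id="btn1"
```

With sorting enabled, two documents that differ only in source attribute order serialize identically, so a textual diff shows only real changes.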
HTML → Markdown¶
Best-effort conversion of common HTML structures to Markdown (headings, paragraphs, inline emphasis/code, lists, links, images, blockquotes, code blocks, horizontal rules). Attributes and presentational markup are ignored. When the input cannot be converted safely, the original HTML is returned unchanged.
Parameters¶
html(str): Raw HTML input.
Return value¶
str— Markdown representation of the input HTML (or the original HTML if conversion is not applicable).
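To make the conversion idea concrete, the sketch below handles only headings, paragraphs, and bold/italic using the standard library's HTMLParser. It is a deliberately tiny stand-in, not the converter behind html_to_markdown; MiniMarkdown and mini_markdown are hypothetical names, and separating blocks with a blank line is an assumption.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
INLINE_MARKS = {"strong": "**", "b": "**", "em": "*", "i": "*"}


class MiniMarkdown(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []   # finished block-level strings
        self.current = []  # inline pieces of the open block
        self.prefix = ""   # "# ", "## ", ... for headings

    def flush(self):
        text = "".join(self.current).strip()
        if text:
            self.blocks.append(self.prefix + text)
        self.current, self.prefix = [], ""

    def handle_starttag(self, tag, attrs):
        if tag in HEADINGS:
            self.flush()
            self.prefix = "#" * int(tag[1]) + " "
        elif tag == "p":
            self.flush()
        elif tag in INLINE_MARKS:
            self.current.append(INLINE_MARKS[tag])

    def handle_endtag(self, tag):
        if tag in HEADINGS or tag == "p":
            self.flush()
        elif tag in INLINE_MARKS:
            self.current.append(INLINE_MARKS[tag])

    def handle_data(self, data):
        self.current.append(data)


def mini_markdown(html):
    parser = MiniMarkdown()
    parser.feed(html)
    parser.close()
    parser.flush()  # emit any trailing text outside a block
    return "\n\n".join(parser.blocks)


print(mini_markdown("<h1>Hello</h1><p>Hi <b>there</b></p>"))
```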
Examples¶
Basic¶
import wizardhtml as wh
md = wh.html_to_markdown("<h1>Hello</h1><p>World</p>")
print(md)
Output
# Hello
World
Links, lists, code¶
import wizardhtml as wh
html = """
<h2>Quick links</h2>
<p>Visit <a href="https://example.com">our site</a>.</p>
<ul>
<li><strong>One</strong></li>
<li>Two</li>
</ul>
<pre><code>print("hi")</code></pre>
<hr>
"""
print(wh.html_to_markdown(html))
Output
## Quick links
Visit [our site](https://example.com)\.
- **One**
- Two
```
print("hi")
```
---
Errors & validation¶
- TypeError on non-string input for parse / to_text / clean_html / beautiful_html.
- Malformed markup is normalized whenever possible following WHATWG rules.
- When the DOM becomes empty after removals, clean_html returns "".
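Because non-string input raises TypeError, callers holding raw bytes (for example from a file or an HTTP response) need to decode first. A minimal caller-side sketch follows; ensure_text is a hypothetical helper, and the UTF-8 default with replacement of undecodable bytes is an assumption about what suits the caller.

```python
def ensure_text(data, encoding="utf-8"):
    """Return `data` as str, decoding bytes so it can be passed to the API."""
    if isinstance(data, str):
        return data
    if isinstance(data, (bytes, bytearray)):
        # errors="replace" keeps going on bad bytes instead of raising.
        return bytes(data).decode(encoding, errors="replace")
    raise TypeError(f"expected str or bytes, got {type(data).__name__}")


print(ensure_text(b"<p>caf\xc3\xa9</p>"))  # <p>café</p>
```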