Body

item

", return_errors=True) print(errors) HTML cleaning ============= HTML cleanup with granular switches for scripts, metadata, embedded media, interactive elements, headings, phrasing content, and more. Supports wildcard-based *tag* and *attribute* removal, selective content stripping, and empty-node pruning. Returns **text** or **HTML** depending on the mode. Behavior -------- Three explicit modes with different outputs: +-----------------------------------------------+--------------------------------------------+-------------------------+--------------------------------------------------------------+ | **Mode** | **How to trigger** | **Returns** | **Description** | +===============================================+============================================+=========================+==============================================================+ | **A) text-only** | No parameters provided (all ``None``) | ``str`` (plain text) | Extracts text, skips script-supporting tags, inserts spaces. | +-----------------------------------------------+--------------------------------------------+-------------------------+--------------------------------------------------------------+ | **B) structural clean** | At least one flag is ``True`` | ``str`` (HTML) | Removes/unwraps per flags and serializes sanitized HTML. | +-----------------------------------------------+--------------------------------------------+-------------------------+--------------------------------------------------------------+ | **C) text+preserve** | Parameters present and all are ``False`` | ``str`` (text+markup) | Extracts text but **preserves** groups explicitly set False. | +-----------------------------------------------+--------------------------------------------+-------------------------+--------------------------------------------------------------+ .. note:: When deleting nodes between adjacent text nodes, the cleaner inserts **one space** to avoid word concatenation. 
   In Mode B the serializer uses ``quote_attr_values="always"`` for stable diffs.

Parameters
----------

+-------------------+---------------------------------------+
| **Parameter**     | **Description**                       |
+===================+=======================================+
| ``text``          | (*str*) Raw HTML input.               |
+-------------------+---------------------------------------+
| ``remove_script`` | (*bool | None*) Drop executable tags. |
+-------------------+---------------------------------------+

")
print(txt)

**Output**

.. code-block:: text

   Hello

Mode B: structural clean (HTML out)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Drop scripts, metadata, embeds; strip attributes; prune empties.

.. code-block:: python

   import wizardhtml as wh

   html = """ x

   Title

   hello

   """
   out = wh.clean_html(
       html,
       remove_script=True,
       remove_metadata_tags=True,
       remove_embedded_tags=True,
       remove_specific_attributes=["id", "on*"],
       remove_empty_tags=True,
       remove_comments=True,
       remove_doctype=True,
   )
   print(out)

**Output**

.. code-block:: html

   Title

   hello

Wildcards and unwrap vs hard remove:

.. code-block:: python

   import wizardhtml as wh

   html = """

   Hello

   """
   out = wh.clean_html(
       html,
       remove_tags_and_contents=["iframe", "template"],
       remove_specific_attributes=["id", "data-*", "on*"],
       remove_empty_tags=True,
   )
   print(out)

**Output**

.. code-block:: html

   Hello

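The ``id``, ``data-*``, and ``on*`` entries in the example above read like shell-style glob patterns. As a rough illustration of that kind of matching, and not wizardhtml's actual implementation, the sketch below filters an attribute mapping with the standard library's ``fnmatch`` module. The helper name ``filter_attributes`` is hypothetical, and glob semantics are an assumption.

```python
from fnmatch import fnmatchcase


def filter_attributes(attrs, patterns):
    """Return a copy of ``attrs`` without names matching any glob pattern.

    Hypothetical helper: assumes wildcard patterns behave like shell globs.
    """
    return {
        name: value
        for name, value in attrs.items()
        # Attribute names are compared in lowercase, case-sensitively.
        if not any(fnmatchcase(name.lower(), pattern) for pattern in patterns)
    }


attrs = {"id": "x", "data-k": "v", "onclick": "evil()", "class": "box"}
print(filter_attributes(attrs, ["id", "data-*", "on*"]))
# {'class': 'box'}
```

A literal entry such as ``id`` matches only that exact name, while ``on*`` covers every event-handler attribute without listing them one by one.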
Content stripping vs tag deletion:

.. code-block:: python

   import wizardhtml as wh

   html = """

   code stays

   """
   keep_tags_drop_content = wh.clean_html(
       html,
       remove_content_tags=["script", "style"],  # keep
   )

code stays

Sectioning, headings, flow:

.. code-block:: python

   import wizardhtml as wh

   html = "T Body"
   out = wh.clean_html(
       html,
       remove_sectioning_tags=True,  # drop
       remove_heading_tags=True,     # drop
   )
   print(out)

**Output**

.. code-block:: html

Interactive and embedded:

.. code-block:: python

   import wizardhtml as wh

   html = """ """
   out = wh.clean_html(
       html,
       remove_interactive_tags=True,  # button, input, select
       remove_embedded_tags=True,     # img, iframe, embed, video, audio
       remove_specific_attributes=["id"],
       remove_empty_tags=True,
   )
   print(out)  # empty string if everything got removed

Mode C: text with preservation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Preserve sectioning + headings + comments:

.. code-block:: python

   import wizardhtml as wh

   html = "T Body"
   txt = wh.clean_html(
       html,
       remove_sectioning_tags=False,
       remove_heading_tags=False,
       remove_comments=False,
   )
   print(txt)

**Output**

.. code-block:: html

   T
   Body
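
Mode C applies in the example above because every flag that was passed is ``False``. The dispatch rule from the Behavior table can be sketched as a small helper. This is only an illustration, assuming the boolean flags alone decide the mode; ``select_mode`` is a hypothetical name and not part of wizardhtml.

```python
def select_mode(**flags):
    """Pick mode "A", "B", or "C" from boolean-or-None flags.

    Hypothetical sketch of the rule in the Behavior table:
    A = nothing provided, B = at least one True, C = only False provided.
    """
    values = list(flags.values())
    if all(value is None for value in values):
        return "A"  # text-only
    if any(value is True for value in values):
        return "B"  # structural clean, HTML out
    return "C"  # text with preservation


print(select_mode())                                              # A
print(select_mode(remove_script=True, remove_comments=False))     # B
print(select_mode(remove_heading_tags=False, remove_script=None)) # C
```

Under this reading, a single ``True`` flag is enough to switch the return value from text to serialized HTML, so mixing ``True`` and ``False`` flags always lands in Mode B.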
Preserve images but text-only elsewhere:

.. code-block:: python

   import wizardhtml as wh

   html = 'AB'
   txt = wh.clean_html(
       html,
       remove_embedded_tags=False,  # keep
   )
   print(txt)

**Output**

.. code-block:: html

   AB

Operational notes
-----------------

- When deleting nodes between adjacent text nodes, the cleaner inserts one space to avoid word concatenation.
- In Mode B the serializer prefers stable quoting for diff-friendly output.
- If the DOM becomes empty after removals, returns ``""``.

Text helper
===========

Extract readable text with whitespace normalization.

.. code-block:: python

   import wizardhtml as wh

   txt = wh.to_text("A B \n\n C", separator=" ")
   print(txt)  # "A B C"

Beautiful HTML
==============

Pretty-print raw HTML without changing semantics. The formatter parses html, serializes a normalized DOM, and indents nodes by a configurable amount. It never reflows RCData content (``