==========
WizardHTML
==========

.. figure:: _static/img/WizardHTMLBanner.png
   :alt: WizardHTML Banner
   :width: 800
   :height: 300
   :align: center

.. image:: https://img.shields.io/pypi/v/wizardhtml.svg
   :target: https://pypi.org/project/wizardhtml/
   :alt: PyPI - Version

.. image:: https://img.shields.io/pypi/dm/wizardhtml.svg?label=PyPI%20downloads
   :target: https://pypistats.org/packages/wizardhtml
   :alt: PyPI - Downloads/month

.. image:: https://img.shields.io/pypi/l/wizardhtml.svg
   :target: https://github.com/textwizard-dev/wizardhtml/blob/main/LICENSE
   :alt: License

WHATWG-compliant HTML5 toolkit: DFA tokenizer, spec-guided tree builder, DOM, configurable serializer, and high-level helpers for cleaning, pretty-printing, and HTML→Markdown.

Installation
============

Requires Python 3.9+.

.. code-block:: bash

    pip install wizardhtml

Quick start
===========

.. code-block:: python

    import wizardhtml as wh

    # Mode A: text-only extraction
    print(wh.clean_html("<p>Hello</p>"))  # -> "Hello"

    # Pretty print
    html = "<div><p>Hi there</p></div>"
    print(wh.beautiful_html(html, indent=2))

    # HTML → Markdown
    print(wh.html_to_markdown("<h1>T</h1><p>Body</p>"))

    # Parser and DOM
    doc = wh.parse("<p>Hi</p>")

Public API
==========

.. list-table::
   :header-rows: 1
   :widths: 50 40

   * - Function
     - Purpose
   * - ``parse(html, fragment_context=None, return_errors=False)``
     - Parse into ``Document`` or ``DocumentFragment``; optional parse error list.
   * - ``clean_html(text, **flags)``
     - High-level HTML cleaning with A/B/C modes.
   * - ``beautiful_html(html, **opts)``
     - Non-destructive pretty-printer for HTML.
   * - ``html_to_markdown(html)``
     - Convert HTML → Markdown.
   * - ``serialize(node, **opts)``
     - Serialize DOM → HTML.
   * - ``to_text(html, separator="\n", strip=True, collapse_ws=True)``
     - Extract readable text (internally uses Mode A, then normalizes whitespace).

DOM types
=========

``Node``, ``Document``, ``DocumentFragment``, ``Element``, ``Text``, ``Comment``.

Parsing
=======

Signature
---------

.. code-block:: python

    import wizardhtml as wh

    wh.parse(
        html: str,
        fragment_context: str | None = None,
        return_errors: bool = False,
    ) -> Document | DocumentFragment | tuple[Document | DocumentFragment, list[str]]

Behavior
--------

- **Full document** when ``fragment_context is None`` → returns ``Document``.
- **Fragment parsing** when ``fragment_context`` is an element name (e.g. ``"div"``, ``"template"``, ``"tbody"``, ``"svg"``, ``"math"``) → returns ``DocumentFragment``. Tokenizer state and insertion mode follow WHATWG rules for the context element.
- With ``return_errors=True``, returns ``(node, errors)``, where ``errors`` is a ``list[str]`` of informational parse-error messages.

Examples
--------

Full document:

.. code-block:: python

    import wizardhtml as wh

    doc = wh.parse("<p>Hi</p>")

Fragment:

.. code-block:: python

    import wizardhtml as wh

    frag = wh.parse("<li>item</li>", fragment_context="ul")

Collecting parse errors:

.. code-block:: python

    import wizardhtml as wh

    # Mis-nested input: <b> is still open when </p> arrives
    node, errors = wh.parse("<p><b>x</p>", return_errors=True)
    print(errors)

HTML cleaning
=============

HTML cleanup with granular switches for scripts, metadata, embedded media, interactive elements, headings, phrasing content, and more. Supports wildcard-based *tag* and *attribute* removal, selective content stripping, and empty-node pruning. Returns **text** or **HTML** depending on the mode.

Behavior
--------

Three explicit modes with different outputs:

+-------------------------+--------------------------------------------+-------------------------+--------------------------------------------------------------+
| **Mode**                | **How to trigger**                         | **Returns**             | **Description**                                              |
+=========================+============================================+=========================+==============================================================+
| **A) text-only**        | No parameters provided (all ``None``)      | ``str`` (plain text)    | Extracts text, skips script-supporting tags, inserts spaces. |
+-------------------------+--------------------------------------------+-------------------------+--------------------------------------------------------------+
| **B) structural clean** | At least one flag is ``True``              | ``str`` (HTML)          | Removes/unwraps per flags and serializes sanitized HTML.     |
+-------------------------+--------------------------------------------+-------------------------+--------------------------------------------------------------+
| **C) text+preserve**    | Parameters present and all are ``False``   | ``str`` (text+markup)   | Extracts text but **preserves** groups explicitly set False. |
+-------------------------+--------------------------------------------+-------------------------+--------------------------------------------------------------+

.. note::
   When deleting nodes between adjacent text nodes, the cleaner inserts **one space** to avoid word concatenation.
   In Mode B the serializer uses ``quote_attr_values="always"`` for stable diffs.

Parameters
----------

+-------------------+------------------------------------------------------------+
| **Parameter**     | **Description**                                            |
+===================+============================================================+
| ``text``          | (*str*) Raw HTML input.                                    |
+-------------------+------------------------------------------------------------+
| ``remove_script`` | (*bool | None*) Drop executable tags (``<script>``, ...).  |
+-------------------+------------------------------------------------------------+

Text-only extraction (Mode A):

.. code-block:: python

    import wizardhtml as wh

    txt = wh.clean_html("<p>Hello</p><script>x()</script>")
    print(txt)

**Output**

.. code-block:: text

    Hello

Mode B — structural clean (HTML out)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Drop scripts, metadata, embeds; strip attributes; prune empties.

.. code-block:: python

    import wizardhtml as wh

    html = """
    <title>x</title>
    <h1 id="t">Title</h1>
    <p onclick="x()">hello</p>
    """
    out = wh.clean_html(
        html,
        remove_script=True,
        remove_metadata_tags=True,
        remove_embedded_tags=True,
        remove_specific_attributes=["id", "on*"],
        remove_empty_tags=True,
        remove_comments=True,
        remove_doctype=True,
    )
    print(out)

**Output**

.. code-block:: html

    <h1>Title</h1>
    <p>hello</p>

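
The three modes in the Behavior table reduce to a small decision rule. The sketch below is illustrative only: ``pick_mode`` is a hypothetical helper, not part of wizardhtml, and it assumes that list-valued options count as "set" when they are non-empty.

.. code-block:: python

    from typing import Any


    def pick_mode(**flags: Any) -> str:
        """Return "A", "B", or "C" following the Behavior table (hypothetical helper)."""
        # Ignore every option that was left at its default of None.
        given = [value for value in flags.values() if value is not None]
        if not given:
            return "A"  # nothing provided: text-only extraction
        if any(given):
            return "B"  # at least one flag is True: structural clean
        return "C"      # options provided and all False: text + preserve


    print(pick_mode())                                          # A
    print(pick_mode(remove_script=True, remove_comments=False)) # B
    print(pick_mode(remove_script=False))                       # C

Mixing ``True`` and ``False`` flags lands in Mode B under this rule, since one ``True`` flag is enough to request structural cleaning.
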
Wildcards and unwrap vs hard remove:

.. code-block:: python

    import wizardhtml as wh

    html = """
    <p id="x" data-k="v" onclick="a()">Hello</p>
    <iframe src="x"></iframe>
    """
    out = wh.clean_html(
        html,
        remove_tags_and_contents=["iframe", "template"],
        remove_specific_attributes=["id", "data-*", "on*"],
        remove_empty_tags=True,
    )
    print(out)

**Output**

.. code-block:: html

    <p>Hello</p>

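
Patterns such as ``"data-*"`` and ``"on*"`` read like shell-style globs. The sketch below shows that interpretation using only the standard library; it is an assumption about the matching rule, not wizardhtml's actual implementation, and ``strip_attributes`` is a hypothetical name.

.. code-block:: python

    from fnmatch import fnmatchcase


    def strip_attributes(attrs, patterns):
        """Return a copy of attrs without names matching any glob pattern."""
        return {
            name: value
            for name, value in attrs.items()
            # HTML attribute names are case-insensitive, so compare in lowercase.
            if not any(fnmatchcase(name.lower(), pattern) for pattern in patterns)
        }


    attrs = {"id": "x", "data-k": "v", "onclick": "a()", "class": "c"}
    print(strip_attributes(attrs, ["id", "data-*", "on*"]))  # {'class': 'c'}

An exact name like ``"id"`` is just a glob with no wildcard, so exact and wildcard entries can share one list.
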
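Content stripping vs tag deletion:

.. code-block:: python

    import wizardhtml as wh

    html = """
    <script>x()</script>
    <style>p { color: red }</style>
    <pre>code stays</pre>
    """
    keep_tags_drop_content = wh.clean_html(
        html,
        remove_content_tags=["script", "style"],  # keep the tags, drop their content
    )
    print(keep_tags_drop_content)

**Output**

.. code-block:: html

    <script></script>
    <style></style>
    <pre>code stays</pre>
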
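
To make "keep the tag, drop its content" concrete, here is a minimal standard-library sketch built on ``html.parser``. It does not use wizardhtml and is not how the library is implemented; ``strip_content`` is a hypothetical helper, it assumes the listed tags are non-void elements, and it silently discards comments and doctypes.

.. code-block:: python

    from html.parser import HTMLParser


    class _ContentStripper(HTMLParser):
        """Re-emit markup unchanged, but drop everything inside the listed tags."""

        def __init__(self, tags):
            # convert_charrefs=False lets entities pass through untouched.
            super().__init__(convert_charrefs=False)
            self.tags = set(tags)
            self.depth = 0  # > 0 while inside an element whose content is dropped
            self.out = []

        def handle_starttag(self, tag, attrs):
            if self.depth == 0:
                self.out.append(self.get_starttag_text())
            if tag in self.tags:
                self.depth += 1

        def handle_startendtag(self, tag, attrs):
            # Self-closing syntax such as <br/>: emit once, no end tag.
            if self.depth == 0:
                self.out.append(self.get_starttag_text())

        def handle_endtag(self, tag):
            if tag in self.tags and self.depth:
                self.depth -= 1
            if self.depth == 0:
                self.out.append(f"</{tag}>")

        def handle_data(self, data):
            if self.depth == 0:
                self.out.append(data)

        def handle_entityref(self, name):
            if self.depth == 0:
                self.out.append(f"&{name};")

        def handle_charref(self, name):
            if self.depth == 0:
                self.out.append(f"&#{name};")


    def strip_content(html, tags):
        parser = _ContentStripper(tags)
        parser.feed(html)
        parser.close()
        return "".join(parser.out)


    print(strip_content("<script>x()</script><pre>code stays</pre>", ["script"]))

Tag deletion would differ only in also suppressing the start and end tags of the listed elements.
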
Sectioning, headings, flow:

.. code-block:: python

    import wizardhtml as wh

    html = "<section><h1>T</h1><h2>X</h2><p>Body</p></section>"
    out = wh.clean_html(
        html,
        remove_sectioning_tags=True,  # drop
    /
    /