An extremely configurable markdown reverser for Python3.

Last update: Jun 27, 2022

Overview

🔄 Unmarkd

A markdown reverser.

Unmarkd is a BeautifulSoup-powered Markdown reverser written in Python and for Python.

Why

This was created as a dependency of StackSearch (one of my other projects). In order to create a better API, I needed a way to reverse HTML back into markdown. So I created this.

There are similar projects (written in Ruby) ~~but I have not found any written in Python (or for Python)~~. Later I found a popular library, html2text, but Unmarkd is still better. See the comparison.

Installation

You know the drill:

pip install unmarkd

Known issues

~~Nested lists are not properly indented (#4)~~ Fixed in #11
~~Blockquote bug (#18)~~ Fixed in #23

Comparison

TL;DR: Html2Text is fast. If you don't need much configuration, you could use Html2Text for the little speed increase.

Speed

TL;DR: Unmarkd < Html2Text

Html2Text is basically faster:

(The DOC variable used can be found here)

Unmarkd sacrifices speed for power.

Html2Text directly uses Python's html.parser module (in the standard library). On the other hand, Unmarkd uses the powerful HTML parsing library, beautifulsoup4. BeautifulSoup can be configured to use different HTML parsers. In Unmarkd, we configure it to use Python's html.parser, too.

But another layer of code means more code is run.

I hope that's a good explanation of the speed difference.
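If you want to measure the difference on your own input, here is a minimal timing sketch. The helper name `time_converter`, the sample `DOC`, and the stand-in converter `strip_bold` are all made up for illustration; the stand-in is only there so the snippet runs with the standard library alone.

```python
import timeit
from typing import Callable


def time_converter(convert: Callable[[str], str], doc: str, runs: int = 100) -> float:
    """Return the average number of seconds one convert(doc) call takes."""
    total = timeit.timeit(lambda: convert(doc), number=runs)
    return total / runs


# Stand-in converter so this sketch runs without third-party packages.
# For a real comparison, pass unmarkd.unmark and html2text.html2text instead.
def strip_bold(html: str) -> str:
    return html.replace("<b>", "**").replace("</b>", "**")


DOC = "<b>I love markdown!</b>" * 50
average = time_converter(strip_bold, DOC)
print(f"{average * 1e6:.2f} µs per call")
```

Run both converters on the same document and with the same number of runs, otherwise the averages are not comparable.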

Correctness

TL;DR: Unmarkd == Html2Text

I actually found two html-to-markdown libraries. One of them was Tomd which had an incorrect implementation:

It seems to be abandoned, anyway.

Now with Html2Text and Unmarkd:

In other words, they both work.

Configurability

TL;DR: Unmarkd > Html2Text

This is Unmarkd's strong point.

In Html2Text, you only have a limited set of options.

In Unmarkd, you can subclass the BaseUnmarker and implement conversions for new tags (e.g. ), etc. In my opinion, it's much easier to extend and configure Unmarkd.

Unmarkd was originally written as a StackSearch dependency.

Html2Text has no options for configuring parsing of code blocks. Unmarkd does.

Documentation

Here's an example of basic usage

import unmarkd
print(unmarkd.unmark("<b>I <i>love</i> markdown!</b>"))
# Output: **I *love* markdown!**

or something more complex (shamelessly taken from here):

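import unmarkd
html_doc = R"""<h1>Sample Markdown</h1>
<p>This is some basic, sample markdown.</p>
<h2>Second Heading</h2>
<ul>
<li>Unordered lists, and:
<ol>
<li>One</li>
<li>Two</li>
<li>Three</li>
</ol>
</li>
<li>More</li>
</ul>
<blockquote>
<p>Blockquote</p>
</blockquote>
<p>And <strong>bold</strong>, <em>italics</em>, and even <em>italics and later <strong>bold</strong></em>. Even <del>strikethrough</del>. <a href="https://markdowntohtml.com">A link</a> to somewhere.</p>
<p>And code highlighting:</p>
<pre><code class="lang-js">var foo = 'bar';

function baz(s) {
   return foo + ':' + s;
}
</code></pre>
<p>Or inline code like <code>var foo = 'bar';</code>.</p>
<p>Or an image of bears</p>
<p><img src="http://placebear.com/200/200" alt="bears"></p>
<p>The end ...</p>
"""
print(unmarkd.unmark(html_doc))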

and the output:

    # Sample Markdown


    This is some basic, sample markdown.

    ## Second Heading



    - Unordered lists, and:
     1. One
     2. Two
     3. Three
    - More

    >Blockquote


    And **bold**, *italics*, and even *italics and later **bold***. Even ~~strikethrough~~. [A link](https://markdowntohtml.com) to somewhere.

    And code highlighting:


    ```js
    var foo = 'bar';

    function baz(s) {
       return foo + ':' + s;
    }
    ```


    Or inline code like `var foo = 'bar';`.

    Or an image of bears

    ![bears](http://placebear.com/200/200)

    The end ...

Extending

Brief Overview

Most functionality should be covered by the BasicUnmarker class defined in unmarkd.unmarkers.

If you need to reverse markdown from StackExchange (as is the case for my other project), you may use the StackOverflowUnmarker (or its alias, StackExchangeUnmarker), which is also defined in unmarkd.unmarkers.

Customizing

If the above two classes do not suit your needs, you can subclass the unmarkd.unmarkers.BaseUnmarker abstract class.

Currently, you can optionally override the following methods:

detect_language (parameters: 1)
- Parameters:
  - html: bs4.BeautifulSoup
- When a fenced code block is encountered, this function is called with a bs4.BeautifulSoup argument: the element the code block was detected from (i.e. pre).
- This function is responsible for detecting the programming language of the code block (or returning '' if none was detected).
- Note: This method differs from the one in unmarkd.unmarkers.BasicUnmarker: it is simpler and does less checking/filtering.
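As a rough sketch of such an override, the example below reads the language from a class name such as `lang-js` or `language-python`. The helper `language_from_classes`, the class `PrefixUnmarker`, and the two prefixes are assumptions for illustration, not something Unmarkd ships. The `try`/`except` fallback is only there so the snippet still runs where unmarkd is not installed.

```python
from typing import Iterable

try:
    from unmarkd.unmarkers import BaseUnmarker
except ImportError:  # fallback so the sketch still runs without unmarkd installed
    BaseUnmarker = object

# Assumed class-name prefixes; sites differ, so adjust as needed.
LANGUAGE_PREFIXES = ("language-", "lang-")


def language_from_classes(classes: Iterable[str]) -> str:
    """Return the language named by the first prefixed class, or '' if none."""
    for name in classes:
        for prefix in LANGUAGE_PREFIXES:
            if name.startswith(prefix):
                return name[len(prefix):]
    return ""


class PrefixUnmarker(BaseUnmarker):
    def detect_language(self, html) -> str:
        # `html` is the <pre> element; the class may sit on it or on a nested <code>.
        classes = list(html.get("class") or [])
        code = html.find("code")
        if code is not None:
            classes.extend(code.get("class") or [])
        return language_from_classes(classes)
```

Keeping the string logic in a plain function makes it easy to test without building any HTML.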

But Unmarkd is more flexible than that.

Customizable constants

There are currently 3 constants you may override:

Formats: NOTE: Use the Format String Syntax
- UNORDERED_FORMAT
  - The string format of unordered (bulleted) lists.
- ORDERED_FORMAT
  - The string format of ordered (numbered) lists.
Miscellaneous:
- ESCAPABLES
  - A container (preferably a set) of length-1 str values that should be escaped.
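The format constants are ordinary `str.format` templates. The sketch below only shows how such a template expands: the placeholder names `next_item` and `number_index` and the two sample templates are assumptions for illustration, so check the unmarkd.unmarkers source for the names the library actually passes.

```python
# Hypothetical templates; the placeholder names are assumptions, not confirmed API.
UNORDERED_FORMAT = "\n* {next_item}"
ORDERED_FORMAT = "\n{number_index}) {next_item}"

items = ["One", "Two", "Three"]

# Expand one template per list item and concatenate the results.
unordered = "".join(UNORDERED_FORMAT.format(next_item=item) for item in items)
ordered = "".join(
    ORDERED_FORMAT.format(number_index=index, next_item=item)
    for index, item in enumerate(items, start=1)
)

print(unordered)
print(ordered)
```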

Customize converting HTML tags

For an HTML tag some_tag, you can customize how it's converted to markdown by overriding a method like so:

from unmarkd.unmarkers import BaseUnmarker
class MyCustomUnmarker(BaseUnmarker):
    def tag_some_tag(self, child) -> str:
        ...  # parse code here

To reduce code duplication, if your tag also has aliases (e.g. strong is an alias for b in HTML) then you may modify the TAG_ALIASES.

If you really need to, you may also modify DEFAULT_TAG_ALIASES. Be warned: if you do so, you will also need to implement the aliases (currently em and strong).
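To make the naming convention concrete, here is a toy stand-in that mimics `tag_*` dispatch with aliases. It is not Unmarkd's implementation: `MiniDispatcher`, `WithMark`, and `convert` are made-up names, handlers take plain text rather than a bs4 element, and the alias mapping direction (alias to canonical tag) is an assumption.

```python
class MiniDispatcher:
    """Toy stand-in for tag_* dispatch; NOT unmarkd's actual code."""

    # Assumed direction: alias tag name -> canonical tag name.
    TAG_ALIASES = {"strong": "b", "em": "i"}

    def convert(self, tag_name: str, text: str) -> str:
        # Resolve the alias first, then look up a method named tag_<name>.
        canonical = self.TAG_ALIASES.get(tag_name, tag_name)
        handler = getattr(self, f"tag_{canonical}", None)
        if handler is None:
            return text  # unknown tag: pass the text through unchanged
        return handler(text)

    def tag_b(self, text: str) -> str:
        return f"**{text}**"

    def tag_i(self, text: str) -> str:
        return f"*{text}*"


class WithMark(MiniDispatcher):
    # Extend the aliases instead of replacing them, then add one handler.
    TAG_ALIASES = {**MiniDispatcher.TAG_ALIASES, "highlight": "mark"}

    def tag_mark(self, text: str) -> str:
        return f"=={text}=="


converter = WithMark()
print(converter.convert("strong", "bold"))
print(converter.convert("highlight", "marked"))
```

One handler per canonical tag plus an alias table means an aliased tag never needs its own method.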

Utility functions when overriding

You may use (when extending) the following functions:

__parse: 2 parameters:
- html: bs4.BeautifulSoup
  - The HTML to unmark. This is used internally by the unmark method and is slightly faster.
- escape: bool
  - Whether to escape the characters inside the string. Defaults to False.
escape: 1 parameter:
- string: str
  - The string to escape and make markdown-safe.
wrap: 2 parameters:
- element: bs4.BeautifulSoup
  - The element to wrap.
- around_with: str
  - The string to wrap around the element. WILL NOT BE ESCAPED.
And, of course, tag_* and detect_language.
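The behavior described for escape and wrap can be sketched with plain strings. These are simplified stand-ins, not the library's methods: the real wrap takes a bs4 element rather than text, and the `ESCAPABLES` set below is a hypothetical example.

```python
from typing import Collection

# Hypothetical set of characters to escape; the library's default may differ.
ESCAPABLES = {"*", "_", "`", "\\"}


def escape(string: str, escapables: Collection[str] = ESCAPABLES) -> str:
    """Backslash-escape every character of `string` found in `escapables`."""
    return "".join("\\" + char if char in escapables else char for char in string)


def wrap(text: str, around_with: str) -> str:
    """Surround `text` with `around_with`; the wrapper itself is not escaped."""
    return f"{around_with}{text}{around_with}"


print(wrap(escape("2 * 3"), "**"))
```

Escaping the content before wrapping keeps the literal asterisk from being read as emphasis, while the surrounding `**` stays intact.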

Comments

Nested lists of same type don't work

Neither unordered nor ordered lists work when nested inside a list of the same type:

Two nested ordered lists

HTML:

<ol>
    <li>Top level 1</li>
    <li>Top level 2
        <ol>
            <li>A</li>
            <li>B</li>
            <li>C</li>
        </ol>
    </li>
    <li>Top level 3</li>
</ol>

Output:

1. Top level 1
 2. Top level 2
        
 1. A
 2. B
 3. C
 3. Top level 3

Two nested unordered lists

HTML:

<ul>
    <li>Top level 1</li>
    <li>Top level 2
        <ul>
            <li>A</li>
            <li>B</li>
            <li>C</li>
        </ul>
    </li>
    <li>Top level 3</li>
</ul>

Output:

- Top level 1
- Top level 2
        
- A
- B
- C
- Top level 3
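For reference, this is the indentation I would expect for the cases above. It is a minimal sketch, not a patch for Unmarkd: it works on a made-up nested tuple model instead of HTML, and `render_list` is a hypothetical helper.

```python
from typing import List, Optional, Tuple

# Hypothetical model: an item is (text, children), where children is either
# None or a (ordered, items) pair describing a nested list.
Item = Tuple[str, Optional[tuple]]


def render_list(items: List[Item], ordered: bool, indent: str = "") -> str:
    """Render items as markdown, indenting nested lists under their parent."""
    lines = []
    for number, (text, children) in enumerate(items, start=1):
        marker = f"{number}. " if ordered else "- "
        lines.append(f"{indent}{marker}{text}")
        if children is not None:
            child_ordered, child_items = children
            # Indent children by the marker width so they nest under the item text.
            child_indent = indent + " " * len(marker)
            lines.append(render_list(child_items, child_ordered, child_indent))
    return "\n".join(lines)


nested = [
    ("Top level 1", None),
    ("Top level 2", (True, [("A", None), ("B", None), ("C", None)])),
    ("Top level 3", None),
]
print(render_list(nested, ordered=True))
```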

bug good first issue reproduced

opened by sirnacnud 3

[ImgBot] Optimize images

Beep boop. Your images are optimized!

Your image file size has been reduced by 39% 🎉

Details

| File | Before | After | Percent reduction |
|:--|:--|:--|:--|
| /assets/correct.png | 372.04kb | 224.67kb | 39.61% |
| /assets/tomd_cant_handle.png | 347.74kb | 210.22kb | 39.55% |
| /assets/benchmark.png | 219.28kb | 141.36kb | 35.53% |
| | | | |
| Total : | 939.06kb | 576.25kb | 38.64% |

opened by imgbot[bot] 1
Fix indent getting added to list children that weren't other lists
I was running into an issue where list items using tags were getting indented when they shouldn't have been.

Example:

<ol> <li>A</li> <li>B</li> <li>C</li> </ol>

Output:

1. A 2. B 3. **C**

I added a test for this case as well. When doing the roundtrip style test, this indentation got lost, so I made the test compare the markdown output.
opened by sirnacnud 1
Support for tables
While Unmarkd currently accepts tables, it just spits out the HTML it was given. It would be nice if it converted them to markdown tables:

| Syntax | Description |
| ----------- | ----------- |
| Header | Title |
| Paragraph | Text |
enhancement
opened by ThatXliner 1

Nested lists are not properly indented

When the following HTML block is parsed:

<ul>
    <li>Unordered lists, and:
        <ol>
            <li>One</li>
            <li>Two</li>
            <li>Three</li>
        </ol>
    </li>
    <li>More</li>
</ul>

The output is incorrect:

 * Unordered lists, and:
 0. One
 1. Two
 2. Three
 * More

bug

opened by ThatXliner 1

Blockquote bug

Apply this patch:

diff --git a/tests/test_roundtrip.py b/tests/test_roundtrip.py
index a836024..5c1e097 100644
--- a/tests/test_roundtrip.py
+++ b/tests/test_roundtrip.py
@@ -1,10 +1,9 @@
 import unicodedata
 
 import markdown_it
-from hypothesis import assume, example, given
-from hypothesis import strategies as st
-
 import unmarkd
+from hypothesis import assume, example, given, reproduce_failure
+from hypothesis import strategies as st
 
 md = markdown_it.MarkdownIt()
 
@@ -17,6 +16,7 @@ def helper(text: str, func=unmarkd.unmark) -> None:
 
 
 @given(text=st.text(st.characters(blacklist_categories=("Cc", "Cf", "Cs", "Co", "Cn"))))
+@reproduce_failure("6.10.1", b"AAEADgEADgEADgA=")
 def test_roundtrip_commonmark_unmark(text):
     assume(unicodedata.normalize("NFKC", text) == text)
     helper(text)

Or add an example with text=">>>". Tests will fail.

bug

opened by ThatXliner 0

Update README for better comparison
html2text is fast but not very configurable (there are only so many options)

Tomd sucks

Add an unmarker (with html2text-style configuration) to prove that unmarkd's configurability is at least equal to html2text

documentation
opened by ThatXliner 0
Use a more reliable markdown parser

Instead of using commonmark, maybe https://github.com/executablebooks/markdown-it-py, https://github.com/trentm/python-markdown2, https://github.com/lepture/mistune, or https://github.com/Python-Markdown/markdown.

(Also, I found tomd, which might render this project useless 😬)
tests

opened by ThatXliner 0
Cannot handle nested bold and italics

When encountering input like Italic and bold and italic, the output is wrong, usually shadowed by the outer tag (in this case, )
bug

opened by ThatXliner 0
Optimize code
I've noticed that unmarkers.BaseUnmarker has been documented as an "abstract base class" when we're actually using it otherwise.

Also, there's some dead code and we should actually sprinkle @staticmethod on some of them.

Here's my idea:

Move all the tag_* methods in BaseUnmarker ➡️ BasicUnmarker

Rename: BaseUnmarker ➡️ AbstractUnmarker

Alias: BaseUnmarker ➡️ BasicMarker

Run shed on the whole codebase (with --refactor)

Version bump: minor
enhancement
opened by ThatXliner 0
Save CSS information
Parse any css files or style tags found. Save it

When a class attribute is found, try to resolve it to the css

Add the resolved to the style attribute: convert to inline css

enhancement
opened by ThatXliner 1

Releases(v0.1.9)

v0.1.9(Jun 18, 2022)
🐛 Bug fixes

Now we escape headings

Full Changelog: https://github.com/ThatXliner/unmarkd/compare/v0.1.7...v0.1.9
v0.1.7(Jul 31, 2021)
🐛 Bug fixes

Fix indent getting added to list children that weren't other lists #24

v0.1.6(Jul 6, 2021)
🐛 Bug fixes

Fixed bug with rendering lists (#22)

Fixed bug with block quotes (#23)

v0.1.5(May 31, 2021)
🐛 Bug fixes

Fixed bug with handling <!DOCTYPE> (#19)

v0.1.4(May 12, 2021)

Fixed some bugs
v0.1.3(May 2, 2021)
Better docs

Fixed bug for handling nested lists

v0.1.2(Feb 28, 2021)
Notable changes

Significantly better tests https://github.com/ThatXliner/unmarkd/commit/60656e5ff16265785f966985d972a49b7e92714e

Support for arbitrary starting points for ordered lists https://github.com/ThatXliner/unmarkd/commit/0ed23f24c934626a8f6fc1b219b0af5ee10ae877

Added escaping https://github.com/ThatXliner/unmarkd/commit/36c50c1e357808a57d09baa9e320ab646a16ffd8

Other changes

Fixed tons of edge case bugs

v0.1.1(Feb 22, 2021)
Changes

✨ New features

A new alias: StackExchangeUnmarker in unmarkd.unmarkers (aliases to StackOverflowUnmarker in the same file) https://github.com/ThatXliner/unmarkd/commit/81eb8462b38426a2f11fd78a3c9b18190a12ea7f

🐛 Bug fixes

Fixed bugs relating nested structures (#3)

📝 API Changes

Updated parameter types for unmarkd.unmark (https://github.com/ThatXliner/unmarkd/commit/94fe9733677a8ad262f5a48affbc29a8616a2baf)

Other

Added more metadata (https://github.com/ThatXliner/unmarkd/commit/8f80ec86ad30fe77218b1fcd47869aed08e138d8)

Added license (https://github.com/ThatXliner/unmarkd/commit/4a47dca88ffa23b9a7daa0fa19fbcf658658b2c9)

v0.1.0(Feb 21, 2021)

First release!

Owner

ThatXliner

I code Python. To me, programming is a logic puzzle. A fun one :D

GitHub Repository https://pypi.org/project/unmarkd/
