Obsidian tools - a Python package for analysing an Obsidian.md vault

Overview

obsidiantools 🪨 ⚒️

obsidiantools is a Python package for getting structured metadata about your Obsidian.md notes and analysing your vault. Complement your Obsidian workflows by getting metrics and detail about all your notes in one place through the widely-used Python data stack.

It's incredibly easy to explore structured data on your vault through this fluent interface. This is all the code you need to generate a vault object that stores the key data:

import obsidiantools.api as otools

vault = otools.Vault(<VAULT_DIRECTORY>).connect()

See some of the key features below - all accessible from the vault object either through a method or an attribute.

As this package relies upon note (file)names, it is only recommended for use on vaults where wikilinks are not formatted as paths and where note names are unique. This should cover the vast majority of vaults that people create.

💡 Key features

This is how obsidiantools can complement your workflows for note-taking:

  • Access a networkx graph of your vault (vault.graph)
    • NetworkX is the main Python library for network analysis, enabling sophisticated analyses of your vault.
    • NetworkX also supports the ability to export your graph to other data formats.
  • Get summary stats about your notes, e.g. number of backlinks and wikilinks, in a Pandas dataframe
    • Get the dataframe via vault.get_note_metadata()
  • Retrieve detail about your notes' links as built-in Python types
    • The various types of links:
      • Wikilinks (incl. header links, links with alt text)
      • Backlinks
    • You can access all the links in one place, or you can load them for an individual note:
      • e.g. vault.backlinks_index for all backlinks in the vault
      • e.g. vault.get_backlinks( ) for the backlinks of an individual note
    • Check which notes are isolated (vault.isolated_notes)
    • Check which notes do not exist as files yet (vault.nonexistent_notes)
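To make the relationship between these link types concrete, here is a minimal sketch in plain Python that does not use obsidiantools at all. It assumes a hypothetical toy `wikilinks_index` (note name mapped to the notes it links to) and derives backlinks, isolated notes and nonexistent notes from it. The helper names are illustrative only and are not the package's internals.

```python
# Toy data (hypothetical): note name -> wikilink targets found in that note.
wikilinks_index = {
    "Alpha": ["Beta", "Gamma", "Beta"],
    "Beta": ["Alpha"],
    "Delta": [],
}


def build_backlinks_index(wikilinks):
    """Invert a wikilinks index: target note -> notes that link to it."""
    backlinks = {name: [] for name in wikilinks}
    for source, targets in wikilinks.items():
        for target in targets:
            backlinks.setdefault(target, []).append(source)
    return backlinks


def find_nonexistent_notes(wikilinks):
    """Notes that are linked to but have no file (no key in the index)."""
    linked = {t for targets in wikilinks.values() for t in targets}
    return sorted(linked - set(wikilinks))


def find_isolated_notes(wikilinks):
    """Notes with neither outgoing wikilinks nor incoming backlinks."""
    backlinks = build_backlinks_index(wikilinks)
    return sorted(n for n in wikilinks
                  if not wikilinks[n] and not backlinks[n])


backlinks_index = build_backlinks_index(wikilinks_index)
```

In this toy vault, "Gamma" is linked to but has no file, and "Delta" has no links in either direction.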

Check out the functionality in the demo repo. The '10 minutes' demo can be launched in a virtual machine via Binder.

There are other API features that try to mirror the Obsidian.md app, for your convenience when working with Python, but they are no substitute for the interactivity of the app!

The text from vault notes goes through this process: markdown → HTML → ASCII plaintext. The functions for text processing are in the md_utils module so they can be used to get text, e.g. for use in NLP analysis.
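As a rough illustration of the final HTML → plaintext stage, the sketch below strips tags with the standard library's `html.parser`. This is an assumption-laden stand-in: the package lists html2text as a dependency, and the `PlainTextExtractor` and `html_to_plaintext` names here are hypothetical and not part of md_utils.

```python
from html.parser import HTMLParser


class PlainTextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self):
        super().__init__()
        self._chunks = []

    def handle_data(self, data):
        self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed tags.
        return " ".join("".join(self._chunks).split())


def html_to_plaintext(html):
    """Hypothetical helper: HTML string in, whitespace-normalised text out."""
    parser = PlainTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A real converter keeps more structure (lists, links, headings); this only shows why the intermediate HTML step makes text extraction straightforward.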

⏲️ Installation

pip install obsidiantools

Developed for Python 3.9 but may still work on lower versions.

As of Sep 2021, NetworkX requires Python 3.7 or higher (similar for Pandas too) so that is recommended as a minimum.

🖇️ Dependencies

  • markdown
  • html2text
  • pandas
  • numpy
  • networkx

🏗️ Tests

A small 'dummy vault' of lipsum notes is in tests/vault-stub (generated with the help of the lorem-markdownum tool). Sense-checking on the API functionality was also done on a personal vault of up to 100 notes.

I am not sure how the parsing will work outside of Latin languages - if you have ideas on how that can be supported feel free to suggest a feature or pull request.

⚖️ Licence

Modified BSD (3-clause)

Comments
  • [FR] Options : choose to use file name / frontmatter title for graph

    I noticed that the generated graph uses the filepath, and I want to use the frontmatter title or the filename instead. How can I do that?

    Graphic reference : image Generated using pyvis

    enhancement make recipe 
    opened by Lisandra-dev 20
  • `TypeError: 'NoneType' object is not iterable` (in `_remove_front_matter`)

    Running the following on my vault:

    from pathlib import Path
    import obsidiantools.api as ot
    vault = ot.Vault(Path("/path/to/a/vault")).connect()
    

    Results in:

    # --->8--- Irrelevant frames omitted --->8---
    
    ~/.cache/pypoetry/virtualenvs/knowledgebase-scripts-Fe_uWe_V-py3.9/lib/python3.9/site-packages/obsidiantools/md_utils.py in _get_ascii_plaintext_from_md_file(filepath)
        190     html = _get_html_from_md_file(filepath)
        191     # strip out front matter (if any):
    --> 192     html = _remove_front_matter(html)
        193     return _get_ascii_plaintext_from_html(html)
        194 
    
    ~/.cache/pypoetry/virtualenvs/knowledgebase-scripts-Fe_uWe_V-py3.9/lib/python3.9/site-packages/obsidiantools/md_utils.py in _remove_front_matter(html)
        201     if hr_content:
        202         # wipe out content from first hr (the front matter)
    --> 203         for fm_detail in hr_content.find_next("p"):
        204             fm_detail.extract()
        205         # then wipe all hr elements
    
    TypeError: 'NoneType' object is not iterable
    

    A quick roundtrip in a debugger shows this happens with at least:

    1. Notes containing an hr (---) but no YAML frontmatter.
    2. Notes containing only frontmatter, no body.
    bug 
    opened by zoni 10
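The two failing cases in this report can be told apart at the text level, before any HTML is generated. The following is a minimal sketch under that assumption. `split_front_matter` is a hypothetical helper, not the package's `_remove_front_matter`, which operates on HTML.

```python
def split_front_matter(text):
    """Return (front_matter, body) for a note's raw text.

    Front matter is only recognised when the very first line is '---'
    and a later line closes it with '---'; any other '---' is treated
    as an ordinary horizontal rule in the body.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return "", text
    for i in range(1, len(lines)):
        if lines[i].strip() == "---":
            front = "\n".join(lines[1:i])
            body = "\n".join(lines[i + 1:])
            return front, body
    # Opening '---' without a closing one: not front matter.
    return "", text
```

A note with an hr but no front matter comes back unchanged, and a note that is only front matter yields an empty body rather than `None`.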
  • performance - file opens & reads

    Hi.

    Every markdown file is being opened & read a total of 8 times in normal connect & gather flow. Might make sense to model a note as a class and have it load its own data once.

    enhancement 
    opened by stepsal 6
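The suggestion of a note class that loads its own data once could look roughly like this. It is a sketch only: the `Note` class, its attribute names and the `read_count` counter are hypothetical and exist purely to make the caching visible.

```python
from functools import cached_property
from pathlib import Path


class Note:
    """Hypothetical wrapper that reads a markdown file at most once."""

    def __init__(self, filepath):
        self.filepath = Path(filepath)
        self.read_count = 0  # only here to make the caching observable

    @cached_property
    def raw_text(self):
        # Runs on first access only; later accesses reuse the stored value.
        self.read_count += 1
        return self.filepath.read_text(encoding="utf-8")

    @cached_property
    def n_words(self):
        # Derived data reuses the cached text instead of reopening the file.
        return len(self.raw_text.split())
```

Every piece of derived data (links, tags, text) would hang off `raw_text` in the same way, so the file is opened once however many indexes are built.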
  • Unable to filter index using Windows filepath with include_subdirs=[]

    Hi,

    I can successfully view my vault file index in Windows. If I then try to filter the list by subdirectory I can successfully list notes in the root and in the 'docs' folders. If I filter by the name of a lower subdirectory using a Windows filepath the returned list is empty.

    For example, my file index includes the following list items:

    {'README': WindowsPath('README.md'),
     'index': WindowsPath('docs/index.md'),
     'Quotations': WindowsPath('docs/Quotations.md'),
     'Creative Commons': WindowsPath('docs/Concepts/Creative Commons.md'),
     'Crowdsourcing': WindowsPath('docs/Concepts/Crowdsourcing.md'),
     'Data Format': WindowsPath('docs/Concepts/Data Format.md'),
     'Data Model': WindowsPath('docs/Concepts/Data Model.md'),
     'Data Sovereignty': WindowsPath('docs/Concepts/Data Sovereignty.md'),
    }
    

    Based on the obsidiantools-demo I would expect to be able to list all the markdown files in the 'Concepts' folder using the following call:

    (otools.Vault(vault_dir, include_subdirs=['docs/Concepts'], include_root=False) .file_index)

    Instead the returned object is empty {}.

    Reversing the slash to create a linux path resolves the issue:

    (otools.Vault(vault_dir, include_subdirs=['docs\Concepts'], include_root=False)
     .file_index)
    

    Returns:

    {'Creative Commons': WindowsPath('docs/Concepts/Creative Commons.md'),
     'Crowdsourcing': WindowsPath('docs/Concepts/Crowdsourcing.md'),
     'Data Format': WindowsPath('docs/Concepts/Data Format.md'),
     'Data Model': WindowsPath('docs/Concepts/Data Model.md'),
     'Data Sovereignty': WindowsPath('docs/Concepts/Data Sovereignty.md')}
    

    Ideally this would be resolved by the obsidiantools package rather than the user. Alternatively suggest updating the documentation.

    bug 
    opened by virtualarchitectures 6
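One way to make a subdirectory filter insensitive to slash direction is to normalise both the user-supplied string and the indexed paths to a single separator style before comparing them. The sketch below shows the normalising step with the standard library. `normalise_subdir` is a hypothetical helper, not something the package is known to provide.

```python
from pathlib import PureWindowsPath


def normalise_subdir(subdir):
    """Return a subdirectory string with forward slashes only.

    PureWindowsPath treats both the backslash and the forward slash as
    separators on every platform, so this also runs on Linux and macOS.
    """
    return PureWindowsPath(subdir).as_posix()
```

A caveat of this approach: on Linux a backslash is a legal character in a directory name, so such names would be split by this helper.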
  • UnicodeDecodeError when connecting to Obsidian Vault

    Hi, I'm testing out the package and I'm getting an error when I try to connect via Jupyter notebook in Windows 10: vault = otools.Vault(vault_dir).connect().gather().

    I'm receiving the following UnicodeDecodeError: UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1400: character maps to <undefined>.

    Assuming the filepath and connection are working I'm unclear whether it is a problem I can correct in Obsidian or if it is a problem with the parser used by obsidiantools. Can you advise how I can resolve the issue?

    For reference the stack trace is as follows:

    ---------------------------------------------------------------------------
    UnicodeDecodeError                        Traceback (most recent call last)
    ~\AppData\Local\Temp/ipykernel_19592/2568229718.py in <module>
    ----> 1 vault = otools.Vault(vault_dir).connect().gather()
          2 print(f"Connected?: {vault.is_connected}")
          3 print(f"Gathered?:  {vault.is_gathered}")
    
    ~\anaconda3\envs\Obsidian_Tools\lib\site-packages\obsidiantools\api.py in connect(self)
        199         if not self._is_connected:
        200             # default graph to mirror Obsidian's link counts
    --> 201             wiki_link_map = self._get_wikilinks_index()
        202             G = nx.MultiDiGraph(wiki_link_map)
        203             self._graph = G
    
    ~\anaconda3\envs\Obsidian_Tools\lib\site-packages\obsidiantools\api.py in _get_wikilinks_index(self)
        438         where k is the md filename
        439         and v is list of ALL wikilinks found in k"""
    --> 440         return {k: get_wikilinks(self._dirpath / v)
        441                 for k, v in self._file_index.items()}
        442 
    
    ~\anaconda3\envs\Obsidian_Tools\lib\site-packages\obsidiantools\api.py in <dictcomp>(.0)
        438         where k is the md filename
        439         and v is list of ALL wikilinks found in k"""
    --> 440         return {k: get_wikilinks(self._dirpath / v)
        441                 for k, v in self._file_index.items()}
        442 
    
    ~\anaconda3\envs\Obsidian_Tools\lib\site-packages\obsidiantools\md_utils.py in get_wikilinks(filepath)
         92         list of strings
         93     """
    ---> 94     plaintext = _get_ascii_plaintext_from_md_file(filepath, remove_code=True)
         95 
         96     wikilinks = _get_all_wikilinks_from_html_content(
    
    ~\anaconda3\envs\Obsidian_Tools\lib\site-packages\obsidiantools\md_utils.py in _get_ascii_plaintext_from_md_file(filepath, remove_code)
        265     """md file -> html -> ASCII plaintext"""
        266     # strip out front matter (if any):
    --> 267     html = _get_html_from_md_file(filepath)
        268     if remove_code:
        269         html = _remove_code(html)
    
    ~\anaconda3\envs\Obsidian_Tools\lib\site-packages\obsidiantools\md_utils.py in _get_html_from_md_file(filepath)
        251 def _get_html_from_md_file(filepath):
        252     """md file -> html (without front matter)"""
    --> 253     _, content = _get_md_front_matter_and_content(filepath)
        254     return markdown.markdown(content, output_format='html')
        255 
    
    ~\anaconda3\envs\Obsidian_Tools\lib\site-packages\obsidiantools\md_utils.py in _get_md_front_matter_and_content(filepath)
        242     with open(filepath) as f:
        243         try:
    --> 244             front_matter, content = frontmatter.parse(f.read())
        245         except yaml.scanner.ScannerError:
        246             # for invalid YAML, return the whole file as content:
    
    ~\anaconda3\envs\Obsidian_Tools\lib\encodings\cp1252.py in decode(self, input, final)
         21 class IncrementalDecoder(codecs.IncrementalDecoder):
         22     def decode(self, input, final=False):
    ---> 23         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
         24 
         25 class StreamWriter(Codec,codecs.StreamWriter):
    
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1400: character maps to <undefined>
    
    bug 
    opened by virtualarchitectures 5
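The traceback shows the file being opened with `open(filepath)` and no explicit encoding, with decoding then going through cp1252. Byte 0x9d has no mapping in cp1252, yet it occurs in the UTF-8 encoding of common characters such as the right curly quote (U+201D, bytes E2 80 9D). A minimal sketch of reading with an explicit encoding follows. `read_note_text` is a hypothetical helper, and it assumes the notes are saved as UTF-8.

```python
from pathlib import Path


def read_note_text(filepath):
    """Read a note as UTF-8 regardless of the platform's default codec."""
    return Path(filepath).read_text(encoding="utf-8")
```

Passing `encoding="utf-8"` to `open()` has the same effect and avoids depending on the locale of the machine that runs the code.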
  • Reference to non-md files

    It seems that obsidiantools does not track non-markdown files (pictures, etc.). As a consequence, vault.nonexistent_notes lists all references to such files. I suggest including non-markdown files in the graph as well.

    And, on a related note, vault.nonexistent_notes wrongly includes notes that are referenced with extension. For example, a reference of the form [[note.md]] leads to note.md being listed as non-existent even if the file note.md exists.

    enhancement 
    opened by martinlackner 4
  • error on malformed frontmatter

    In case of malformed frontmatter in a document, an exception is raised and not handled.

    • error can be handled in https://github.com/mfarragher/obsidiantools/blob/ddd78669ef27346fdb4b13cdc956a6f8c00e98f4/obsidiantools/md_utils.py#L259
    • adding the following solves the problem (although the specific error should be named)
    except:
        print("problem with file ", filepath)
    
    bug 
    opened by Dorianux 4
  • get_md_relpaths_from_dir() globbing issue

    Hello. Thanks for this library

    I have tried this with 3.9 as suggested. But I'm getting an error straight away on gather.

    File "/home/steve/.pyenv/versions/obsidian-python3.9.0/lib/python3.9/site-packages/obsidiantools/md_utils.py", line 29, in get_md_relpaths_from_dir
        for p in glob(str(dir_path / '**/*.md'), recursive=True)]
    TypeError: unsupported operand type(s) for /: 'str' and 'str'
    

    I have to change the '/' to a '+' in the line in get_md_relpaths_from_dir() to get the globbing working!

    change from

    return [Path(p).relative_to(dir_path)
            for p in glob(str(dir_path / '**/*.md'), recursive=True)]
    

    to

    return [Path(p).relative_to(dir_path)
            for p in glob(str(dir_path + '**/*.md'), recursive=True)]
    
    bug 
    opened by stepsal 4
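The `TypeError` arises because `/` is applied to two plain strings. An alternative to switching to `+` is to coerce the argument to a `Path` up front, so that a caller can pass either type. The sketch below assumes that approach. `get_md_relpaths` is a hypothetical stand-in, not the package's `get_md_relpaths_from_dir`.

```python
from pathlib import Path


def get_md_relpaths(dir_path):
    """List markdown files under dir_path, relative to it.

    Accepts either a str or a Path; coercing with Path() first means
    the '/' operator is never applied to two plain strings.
    """
    root = Path(dir_path)
    return sorted(p.relative_to(root) for p in root.glob("**/*.md"))
```

Unlike string concatenation, this does not depend on whether the directory string ends with a separator.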
  • Tags in code blocks are taken

    As a placeholder. I think code blocks should be ignored for tags? What do you think?

    "file_tags": [
        "meta",
        "idea",
        "shower-thought",
        "to-digest",
        "shower-thought",
        "introduction",
        "shower-thought\"",
        "guru\"",
        "shroedinger-uncertain\"",
        "floating-point-error\"",
        "socratic\""
    ]
    
    bug 
    opened by louis030195 3
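Ignoring code when collecting tags can be done by blanking out code spans before the tag pattern is applied. The sketch below works on raw markdown text with the standard library only. The `TAG` pattern is a deliberately simplified assumption and is not the regex the package uses.

```python
import re

FENCE = "`" * 3  # built at runtime so this example can sit inside a fence
FENCED_CODE = re.compile(FENCE + r".*?" + FENCE, re.DOTALL)
INLINE_CODE = re.compile(r"`[^`\n]*`")
# Simplified tag pattern: '#' at the start of the text or after whitespace.
TAG = re.compile(r"(?:^|\s)#([A-Za-z][\w/-]*)")


def get_tags(markdown_text):
    """Return tags in order of appearance, ignoring anything inside code."""
    # Remove fenced blocks first, then inline code spans.
    text = FENCED_CODE.sub(" ", markdown_text)
    text = INLINE_CODE.sub(" ", text)
    return TAG.findall(text)
```

Fenced blocks are removed before inline spans so that the three-backtick delimiters are not misread as inline code.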
  • Handle .md inside wikilinks to reflect Obsidian graph

    [[Foo]] and [[Bar.md]] will both be related to note 'Foo' in the knowledge graph.

    Currently, wikilinks getters will extract the wikilinks as 'Foo' and 'Bar.md'. The expected behaviour of getters to reflect Obsidian's behaviour is 'Foo' and 'Bar' respectively.

    bug 
    opened by mfarragher 2
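The expected getter behaviour amounts to dropping a trailing `.md` from a wikilink target before it is used as a note name. A minimal sketch, with `normalise_wikilink_target` as a hypothetical helper and case-sensitive matching as an assumption:

```python
def normalise_wikilink_target(target):
    """Drop a trailing '.md' so '[[Bar.md]]' and '[[Bar]]' name one note."""
    suffix = ".md"
    # Leave a bare '.md' alone so the result is never an empty name.
    if target.endswith(suffix) and len(target) > len(suffix):
        return target[:-len(suffix)]
    return target
```

Other extensions are left untouched, so references to media files keep their full name.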
  • Text goes missing even though the HTML is OK (html2text parsing issues)

    For one of my notes with a mix of tables, LaTeX, lists & code blocks, there is a lot of text from the note that isn't captured in source_text_index, but is kept in the HTML. This suggests some parsing issues with how html2text is configured.

    Whole paragraph blocks & headers can be completely missing.

    This starts to happen after a table with LaTeX. Anything in body text (<p>) afterwards is missing, yet it keeps all the remaining LaTeX (even the stuff in tables).

    Perhaps it doesn't like MathJax? Maybe wiping out a few tags from HTML, for the source_text functionality, before it gets processed by html2text could make the output smoother in this case.

    Need to think more about:

    • What HTML tags are not necessary for source_text?
      • LaTeX is one aspect to remove if causing problems. Keep as much as possible for html2text to handle (including URLs, images, etc.). Anything more opinionated (e.g. do we want strikethrough text or not) would be better covered in readable_text.
      • May involve another Markdown class if switching off markdown extensions, more functions to do this specific HTML generation, etc.
    • A test case from reduced format of my note
    bug 
    opened by mfarragher 2
  • Incremental refresh

    Hey, for https://github.com/louis030195/obsidian-ava, I'm trying to implement incremental refresh of the state of the vault.

    Concretely, I build sentence embeddings of the whole vault and would like to re-compute embeddings every time a note is updated/deleted/created.

    Do you see any way of doing this incrementally rather than reloading the vault and recomputing everything every time? (It takes ~1 min on mps device on my 500k words vault)

    Ideally, I'd see maybe an API that lets me listen to vault changes with callback(s) in this library?

    Thanks 🚀😃

    enhancement make recipe 
    opened by louis030195 3
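Without any callback API, one possible approach is to detect changes outside the library by comparing content hashes between two scans, then recompute embeddings only for the notes that differ. The sketch below assumes that approach. `snapshot_vault` and `diff_snapshots` are hypothetical helpers and are not part of obsidiantools.

```python
import hashlib
from pathlib import Path


def snapshot_vault(vault_dir):
    """Map each markdown file's relative path to a hash of its bytes."""
    root = Path(vault_dir)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.glob("**/*.md")
    }


def diff_snapshots(old, new):
    """Return (created, updated, deleted) note paths between two snapshots."""
    created = sorted(new.keys() - old.keys())
    deleted = sorted(old.keys() - new.keys())
    updated = sorted(k for k in new.keys() & old.keys() if new[k] != old[k])
    return created, updated, deleted
```

Hashing still reads every file, but that is usually far cheaper than re-embedding the whole vault. Comparing modification times instead would avoid the reads at the cost of occasional false positives.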
Releases(0.10.0)
  • 0.10.0(Jan 8, 2023)

    New features:

    • connect method has attachments argument to give the option to include 'attachment' files like media files and canvas files in the graph. The behaviour from v0.9 is kept in this new release (via the default attachments=False).
    • Information about media files & their filepaths is stored in the media_file_index attribute.
    • New methods for metadata: get_canvas_file_metadata and get_media_file_metadata. The new get_all_file_metadata method is best-placed to get all the metadata for files & notes in a vault.
    • isolated_media_files and isolated_canvas_files attributes.
    • nonexistent_media_files and nonexistent_canvas_files attributes.

    Important API changes vs previous version:

    • file_index attribute is now md_file_index, to avoid ambiguity from the extra support now for media files and canvas files.
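Code that has to run against versions on both sides of this rename could look the attribute up defensively. The sketch below is one hedged way to do that. `get_md_file_index` is a hypothetical helper, and it assumes only the two attribute names mentioned in these notes.

```python
def get_md_file_index(vault):
    """Return the markdown file index under either attribute name."""
    if hasattr(vault, "md_file_index"):
        # Name used from v0.10 onwards.
        return vault.md_file_index
    # Name used by earlier versions.
    return vault.file_index
```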

    Other improvements:

    • Speed improvements for the gather() method and processing of HTML content.
    • Tweaks to the code to address deprecation warnings from other packages.
  • 0.9.0(Dec 24, 2022)

    New features:

    • Support for canvas files (the latest addition to Obsidian). See the Canvas files features notebook for detail on functionality.
    • Nested tags are now supported. By default, Vault will not account for nested tags (as has previously been the case).
    • Column for n_tags in get_note_metadata method.
    • get_wikilink_counts method added.

    Other improvements:

    • Vault can handle duplicate filenames, better reflecting the 'Shortest path when possible' wikilink setting in Obsidian.
    • More robust regex for tags.
    • Fix bug where "tags" were being parsed from code block.
    • More error handling for front matter.
    • Fix bug where source_text gets cut off after a LaTeX block.
    • Wikilinks are robust to the use of file extension md in them.

    Package now requires Python 3.9 as a minimum.

    Wiki has now been added to the Github repo, to cover detail for advanced users.

  • 0.8.1(Aug 7, 2022)

    Bug fixes:

    • Fixed issue where markdown could not parse at all on some environments. md_mermaid extension has been removed. This issue was hard to reproduce so I have removed the extension for now.
  • 0.8.0(Aug 7, 2022)

    New features:

    • The API now has two forms of text: 'source text' and 'readable text'. These have their own object attributes and methods, e.g. get_source_text() and get_readable_text(). The old text attributes and objects have been removed, but they were closest to the source text functionality. The readable text has a lot of formatting removed, while still retaining the context within notes, so it is in a form that can be used quite easily for NLP analysis. The source text best reflects how notes are formatted in the Obsidian app's 'source mode'.
    • Applied multiple pymarkdown extensions to the logic, to reflect some of the main features of how Obsidian uses extended markdown. For example, mermaid diagrams, LaTeX equations and tables can now be parsed, tilde characters mark the deletion of text, etc.
    • LaTeX equations can now be accessed for notes via get_math method and math_index attribute.
    • More robust tag parsing.

    Bug fixes:

    • More robust paths for cross-platform support.
    • Front matter parsing is more robust.
  • 0.7.0(Dec 27, 2021)

    New features:

    • Support for tags
    • Support for instantiating Vault on a filtered list of subdirectories
    • 'Gather' functionality for note text: gather function; get_note function and notes_index attr

    Fixes:

    • Fix embedded files output where the pipe operator is used (e.g. to scale images): avoid backslashes appearing in output
    • More robust processing of front matter
  • 0.6.0(Oct 19, 2021)

  • 0.5.0(Sep 13, 2021)

Owner
Mark Farragher
🧬 I solve data problems in areas like healthcare, SaaS & economics