A pipeline for making highlighted text stand-alone.

Overview
title emoji colorFrom colorTo sdk app_file pinned
decontextualizer
📤
green
gray
streamlit
main.py
false

Decontextualizer

As a second step in improving our content consumption workflows, I investigated a new approach to extracting fragments from a content item before saving them for subsequent surfacing. While the lexiscore deals with content items on a holistic level -- evaluating entire books, articles, and papers -- I speculated then that going granular is a natural next step in building tools which help us locate specific valuable ideas in long-form content. The decontextualizer is a stepping stone in that direction, consisting of a pipeline for making text excerpts compact and semantically self-contained. Concretely, the decontextualizer is a web app able to take in an annotated PDF and automatically tweak the highlighted excerpts so that they make more sense on their own, even out of context.

Read more...

Installation

The decontextualizer can either be deployed from source or using Docker.

Docker

To deploy the decontextualizer labeler using Docker, first make sure to have Docker installed, then simply run the following.

docker run -p 8501:8501 paulbricman/decontextualizer 

The tool should be available at localhost:8501.

From Source

To set up the decontextualizer, clone the repository and run the following:

python3 -m pip install -r requirements.txt
streamlit run main.py

The tool should be available at localhost:8501.

Screenshots

You might also like...
🐸   Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️
🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️

🐸 Identify anything. pyWhat easily lets you identify emails, IP addresses, and more. Feed it a .pcap file or some text and it'll tell you what it is! 🧙‍♀️

box is a text-based visual programming language inspired by Unreal Engine Blueprint function graphs.
box is a text-based visual programming language inspired by Unreal Engine Blueprint function graphs.

Box is a text-based visual programming language inspired by Unreal Engine blueprint function graphs. $ cat factorial.box ┌─ƒ(Factorial)───┐

Export solved codewars kata challenges to a text file.

Codewars Kata Exporter Note:this is not totally my work.i've edited the project to make more easier and faster for me.you can find the original work h

Extract knowledge from raw text

Extract knowledge from raw text This repository is a nearly copy-paste of "From Text to Knowledge: The Information Extraction Pipeline" with some cosm

AnnIE - Annotation Platform, tool for open information extraction annotations using text files.
AnnIE - Annotation Platform, tool for open information extraction annotations using text files.

AnnIE - Annotation Platform, tool for open information extraction annotations using text files.

py-trans is a Free Python library for translate text into different languages.

Free Python library to translate text into different languages.

text-to-speach bot - You really do NOT have time for read a newsletter? Now you can listen to it

NewsletterReader You really do NOT have time for read a newsletter? Now you can listen to it The Newsletter of Filipe Deschamps is a great place to re

a python package that lets you add custom colors and text formatting to your scripts in a very easy way!
a python package that lets you add custom colors and text formatting to your scripts in a very easy way!

colormate Python script text formatting package What is colormate? colormate is a python library that lets you add text formatting to your scripts, it

Repository containing the code for An-Gocair text normaliser

Scottish Gaelic Text Normaliser The following project contains the code and resources for the Scottish Gaelic text normalisation project. The repo can

Comments
  • PDF Failures

    PDF Failures

    Hello!

    I attempted two different PDFs (which I can share in a DM or a email) — one was an older-pre computer PDF that had been OCRed professionally and hilighted. Another was a modern PDF, of a book published last year, also highlighted. Zotero was able to extract from both using pdfjs. When i use https://huggingface.co/spaces/paulbricman/decontextualizer i get:

    File "/home/user/.local/lib/python3.8/site-packages/streamlit/script_runner.py", line 354, in _run_script
        exec(code, module.__dict__)
    File "/home/user/app/main.py", line 17, in <module>
        components.add_section()
    File "/home/user/app/components.py", line 46, in add_section
        excerpts = pdf_to_excerpts(filename)
    File "/home/user/app/processing.py", line 48, in pdf_to_excerpts
        excerpt = extract_annot(annot, words)
    File "/home/user/app/processing.py", line 24, in extract_annot
        quad_count = int(len(quad_points) / 4)
    
    opened by brimwats1 4
Releases(v1.0.0)
Owner
Paul Bricman
Building tools which augment the mind.
Paul Bricman
A generator library for concise, unambiguous and URL-safe UUIDs.

Description shortuuid is a simple python library that generates concise, unambiguous, URL-safe UUIDs. Often, one needs to use non-sequential IDs in pl

Stavros Korokithakis 1.8k Dec 31, 2022
A Python app which can convert normal text to Handwritten text.

Text to HandWritten Text ✍️ Converter Watch Tutorial for this project Usage:- Clone my repository. Open CMD in working directory. Run following comman

Kushal Bhavsar 5 Dec 11, 2022
Fuzzy String Matching in Python

FuzzyWuzzy Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

SeatGeek 8.8k Jan 08, 2023
The project is investigating methods to extract human-marked data from document forms such as surveys and tests.

The project is investigating methods to extract human-marked data from document forms such as surveys and tests. They can read questions, multiple-choice exam papers, and grade.

Harry 5 Mar 27, 2022
Making simplex testing clean and simple

Making Simplex Project Testing - Clean and Simple What does this repo do? It organizes the python stack for the coding project What do I need to do in

Mohit Mahajan 1 Jan 30, 2022
Hotpotato is a recipe portfolio App that assists users to discover and comment new recipes.

Hotpotato Hotpotato is a recipe portfolio App that assists users to discover and comment new recipes. It is a fullstack React App made with a Redux st

Nico G Pierson 13 Nov 05, 2021
A python tool one can extract the "hash" from a WINDOWS HELLO PIN

WINHELLO2hashcat About With this tool one can extract the "hash" from a WINDOWS HELLO PIN. This hash can be cracked with Hashcat, more precisely with

33 Dec 05, 2022
Export solved codewars kata challenges to a text file.

Codewars Kata Exporter Note:this is not totally my work.i've edited the project to make more easier and faster for me.you can find the original work h

Oussama Ben Sassi 4 Aug 13, 2021
Translate .sbv subtitle files

deepl4subtitle Deeplを使って字幕ファイル(.sbv)を翻訳します。タイムスタンプも含めて出力しますが、翻訳時はタイムスタンプは文の一部とは切り離されるので、.sbvファイルをそのまま翻訳機に突っ込むよりも高精度な翻訳ができるはずです。 つかいかた 入力する.sbvファイルの前処理

Yasunori Toshimitsu 1 Oct 20, 2021
Paranoid text spacing in Python

pangu.py Paranoid text spacing for good readability, to automatically insert whitespace between CJK (Chinese, Japanese, Korean) and half-width charact

Vinta Chen 194 Nov 19, 2022
Text to ASCII and ASCII to text

Text2ASCII Description This python script (converter.py) contains two functions: encode() is used to return a list of Integer, one item per character

4 Jan 22, 2022
Extract price amount and currency symbol from a raw text string

price-parser is a small library for extracting price and currency from raw text strings.

Scrapinghub 252 Dec 31, 2022
A python tool to convert Bangla Bijoy text to Unicode text.

Unicode Converter A python tool to convert Bangla Bijoy text to Unicode text. Installation Unicode Converter can be installed via PyPi. Make sure pip

Shahad Mahmud 10 Sep 29, 2022
box is a text-based visual programming language inspired by Unreal Engine Blueprint function graphs.

Box is a text-based visual programming language inspired by Unreal Engine blueprint function graphs. $ cat factorial.box ┌─ƒ(Factorial)───┐

Pranav 104 Dec 24, 2022
PyMultiDictionary is a Dictionary Module for Python 3+ to get meanings, translations, synonyms and antonyms of words in 20 different languages

PyMultiDictionary PyMultiDictionary is a Dictionary Module for Python 3+ to get meanings, translations, synonyms and antonyms of words in 20 different

Pablo Pizarro R. 19 Dec 26, 2022
CowExcept - Spice up those exceptions with cowexcept!

CowExcept - Spice up those exceptions with cowexcept!

James Ansley 41 Jun 30, 2022
Find a Doc is a free online resource aimed at helping connect the foreign community in Japan with health services in their native language.

Find a Doc - Localization Find a Doc is a free online resource aimed at helping connect the foreign community in Japan with health services in their n

Our Japan Life 18 Dec 19, 2022
🚩 A simple and clean python banner generator - Banners

🚩 A simple and clean python banner generator - Banners

Kumar Vicku 12 Oct 09, 2022
A query extract python package

A query extract python package

Fayas Noushad 4 Nov 28, 2021
Compute distance between sequences. 30+ algorithms, pure python implementation, common interface, optional external libs usage.

TextDistance TextDistance -- python library for comparing distance between two or more sequences by many algorithms. Features: 30+ algorithms Pure pyt

Life4 3k Jan 02, 2023