japanese-ebook-analysis

Analyse Japanese ebooks using MeCab to determine their difficulty level for Japanese learners.

Overview

The aim of this project is to make analysing the contents of a Japanese ebook easy and to streamline the process for non-technical users. You can analyse an ebook and see the following information (a rough sketch of how these figures can be derived appears after the list):

  • The length of the book in words
  • The length of the book in characters
  • The number of unique words used in the book
  • The number of unique words that are only used once in the book
  • The percentage of unique words that are only used once
  • The number of unique characters used
  • The number of unique characters that are only used once
  • The percentage of unique characters that are only used once
  • A list of all the words used in the book as well as how often they are used
  • A list of all the characters used in the book as well as how often they are used

For text processing, we use MeCab.
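
As a minimal sketch of that step (assuming the mecab-python3 bindings and a dictionary such as unidic-lite are installed; this is illustrative, not the project's actual code), MeCab can tokenise Japanese text and build the kind of word-frequency mapping used above:

    from collections import Counter

    import MeCab


    def word_frequencies(text: str) -> Counter:
        """Tokenise Japanese text with MeCab and count surface forms."""
        tagger = MeCab.Tagger()
        counts: Counter = Counter()
        node = tagger.parseToNode(text)
        while node:
            if node.surface:  # skip BOS/EOS nodes, which have empty surfaces
                counts[node.surface] += 1
            node = node.next
        return counts


    print(word_frequencies("猫が好きです。猫はかわいい。").most_common(5))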

Usage

Currently, the project is not deployed anywhere, so to use the service you will need to follow the steps in the Development section below to get the server running.

  1. Upload a .epub file containing Japanese text to the server.
  2. The server will redirect you to a page showing information about the ebook. You can then click the 'See more details' button to see all the generated data, including a list of every word used in the book together with its number of occurrences, and the same for the characters.

Development

  1. Clone repository: git clone https://github.com/christofferaakre/japanese-ebook-analysis.git
  2. Make sure you have MeCab set up on your system; see http://www.robfahey.co.uk/blog/japanese-text-analysis-in-python/ for a good guide on how to set it up. (This is only required if you will actually upload ebooks or run the analyse_epub.py script, which you will not need to do to contribute to other parts of the app.)
  3. Install Python dependencies: pip install -r requirements.txt
  4. Install other dependencies (these all need to be in your system path):
    • pandoc (see the sketch after this list for one way it can be used)
  5. Run ./app.py to start the Flask dev server.
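
For context on the pandoc dependency: one plausible way to turn an uploaded .epub into plain text for analysis is to shell out to pandoc, as sketched below (illustrative only; the project's actual extraction code may differ):

    import subprocess


    def epub_to_text(epub_path: str) -> str:
        """Convert an .epub file to plain text using pandoc (must be on the system path)."""
        result = subprocess.run(
            ["pandoc", epub_path, "--to", "plain"],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout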

Contributing

I'm very happy to receive any contributions! Before contributing, please have a look at CONTRIBUTING.md.

To see what needs work, have a look at the repo's Issues and its Pull requests.

Feel free to submit your own issue or pull request about a new feature or anything else. When submitting a pull request, don't be afraid to modify any of the files; I'm not very attached to the coding style used in the repo.

Owner
Christoffer Aakre