Get list of common stop words in various languages in Python

Last update: Dec 21, 2022

Python Stop Words

Table of contents

Overview
Available languages
Installation
Basic usage
Python compatibility

Overview

Get list of common stop words in various languages in Python.

Available languages

Arabic
Bulgarian
Catalan
Czech
Danish
Dutch
English
Finnish
French
German
Hungarian
Indonesian
Italian
Norwegian
Polish
Portuguese
Romanian
Russian
Spanish
Swedish
Turkish
Ukrainian

Installation

stop-words is available on PyPI:

http://pypi.python.org/pypi/stop-words

Install it with pip:

$ pip install stop-words

Alternatively, clone the stop-words git repository:

$ git clone --recursive git://github.com/Alir3z4/python-stop-words.git

Then install it by running:

$ python setup.py install

Basic usage

from stop_words import get_stop_words

stop_words = get_stop_words('en')
stop_words = get_stop_words('english')

from stop_words import safe_get_stop_words

stop_words = safe_get_stop_words('unsupported language')
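
The calls above only return a list of words; removing those words from a text is left to the caller. A minimal sketch of that step follows. The helper `remove_stop_words` and the short `sample_stop_words` list are illustrative assumptions, not part of the package; in real use you would pass the result of `get_stop_words('en')` instead of the sample list.

```python
import re


def remove_stop_words(text, stop_words):
    """Return the words of `text` that are not in `stop_words`, in order.

    Hypothetical helper for illustration; not provided by stop-words.
    """
    blocked = {word.lower() for word in stop_words}
    # Split into word tokens, keeping simple contractions such as "don't".
    tokens = re.findall(r"\w+(?:'\w+)?", text)
    return [token for token in tokens if token.lower() not in blocked]


# Stand-in for get_stop_words('en'); the real list is much longer.
sample_stop_words = ["the", "of", "is", "in", "a", "it", "one"]

text = (
    "The University of Waterloo Stratford Campus is located "
    "in Stratford Ontario Canada."
)
filtered = remove_stop_words(text, sample_stop_words)
print(" ".join(filtered))
```

Matching is case-insensitive here because the sample list is lowercase; whether that suits your data is a design choice.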

Python compatibility

Python Stop Words is compatible with:

Python 2.7
Python 3.4
Python 3.5
Python 3.6
Python 3.7

Comments

Enforces packaging of eggs into folders.

We had an error in our CI pipeline where a package build would fail since the .egg of stop-words is downloaded as a zip.

This leads to the following error where the initializer tries to open a directory when it is actually a zip archive.

Not a directory: '/opt/project/.eggs/stop_words-2015.2.23.1-py3.6.egg/stop_words/stop-words/languages.json'

opened by hfjn 10
add indonesian stop word list

Add stop word list for indonesian language, added mapping to JSON file. Source: https://www.illc.uva.nl/Research/Publications/Reports/MoL-2003-02.text.pdf

opened by frankdevans 4
Can you handle a text?

Hello, there is no description of how to use the package. I have this text:

The University of Waterloo Stratford Campus is located in Stratford Ontario Canada. It is one of the three satellite campuses of the University of Waterloo a member of the U15 Group of Canadian Research Universities. Established in June 2009 the University of Waterloo Stratford Campus is part of the Faculty of Arts at the University of Waterloo.

How can I use python-stop-words to filter out the stop words and get the text without them?

Thank you very much!
question

opened by PapaMadeleine2022 2
Python 3 support
List of improvements:

Tests

Python 3 support

Dev installation via zc.buildout

Continuous integration via Travis

Can you make a new release once the branch merged ?

Regards
enhancement
opened by Fantomas42 2
languages.json is missing, if you don't git clone with `--recursive`

languages.json is still missing, if you don't clone with --recursive

$ git clone git://github.com/Alir3z4/python-stop-words.git
$ cd python-stop-words
$ python3 setup.py install
Traceback (most recent call last):
  File "setup.py", line 5, in <module>
    version=__import__("stop_words").get_version(),
  File "./stop_words/__init__.py", line 9, in <module>
    with open(os.path.join(STOP_WORDS_DIR, 'languages.json'), 'rb') as map_file:
FileNotFoundError: [Errno 2] No such file or directory: './stop_words/stop-words/languages.json'

opened by marcindulak 1
Update submodule to the latest

Include the stops for newly added languages

https://github.com/Alir3z4/stop-words/pull/4
https://github.com/Alir3z4/stop-words/pull/5
https://github.com/Alir3z4/stop-words/pull/6
https://github.com/Alir3z4/stop-words/pull/7
enhancement

opened by norkans7 1
Decode error AND Add catalan language to LANGUAGE_MAPPING
1. Add the Catalan language to LANGUAGE_MAPPING. I previously added the file with its stop words to the "stop-words" project.

2. Decode error

stop_words = [line.strip().decode('utf-8') for line in language_file.readlines()]

strip() returns a copy of the string with leading and trailing whitespace characters removed. But if the byte string contains non-ASCII characters, calling strip() before decoding causes a UnicodeDecodeError (e.g. UnicodeDecodeError: 'utf8' codec can not decode byte 0xc3 in position 34: unexpected end of data).

The workaround is to reorder the call:

stop_words = [line.decode('utf-8').strip() for line in language_file.readlines()]
opened by dmiro 1
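
The reordering described in this issue can be sketched in Python 3 as follows. The helper name `read_stop_words` and the sample words are assumptions made for illustration. The issue does not say which byte was stripped, so the second half only simulates one plausible cause: a strip that treats byte 0xa0 (the second byte of "à" in UTF-8) as whitespace.

```python
def read_stop_words(raw_lines):
    """Decode each raw byte line first, then strip whitespace.

    Hypothetical helper mirroring the reordered call from the issue.
    """
    return [line.decode("utf-8").strip() for line in raw_lines]


# Sample byte lines as they would come from a file opened in 'rb' mode.
raw_lines = ["perquè\n".encode("utf-8"), "allà\n".encode("utf-8")]
words = read_stop_words(raw_lines)
print(words)

# Simulation (assumption): stripping bytes before decoding, with 0xa0
# counted as whitespace, cuts the two-byte sequence for "à" in half.
damaged = "allà\n".encode("utf-8").strip(b"\xa0\n")
try:
    damaged.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True
print(decode_failed)
```

Decoding first means strip() operates on characters rather than bytes, so a multi-byte character cannot be split.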
Defining custom stop words in NLTK

Hi, I want to know how to define my own custom stop words. I'm currently developing a sentiment analysis for my local language, using a Naive Bayes classifier to classify the text. I'm quite new to this type of NLP project, so sorry if there's a method that I missed.

Hope you can help me thanks.

opened by AllikDaniel 0
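
One way to approach this question, independent of NLTK, is to treat the stop-word list as ordinary Python data and merge your own words into it. The sketch below is an assumption-laden illustration: `extend_stop_words` is a hypothetical helper, and `base_words` stands in for whatever list you start from (for example the result of `get_stop_words('en')`).

```python
def extend_stop_words(base_words, custom_words):
    """Merge two word lists, lowercased, without duplicates, keeping order.

    Hypothetical helper for illustration; not provided by stop-words or NLTK.
    """
    merged = []
    seen = set()
    for word in list(base_words) + list(custom_words):
        key = word.lower()
        if key not in seen:
            seen.add(key)
            merged.append(key)
    return merged


# Stand-in base list and some domain-specific additions.
base_words = ["the", "a"]
custom_words = ["rt", "The", "via"]
all_stop_words = extend_stop_words(base_words, custom_words)
print(all_stop_words)
```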

Example does not work on Python 3.7.0

It returns an empty list []

from stop_words import get_stop_words

stop_words = get_stop_words('en')
stop_words = get_stop_words('english')

from stop_words import safe_get_stop_words

stop_words = safe_get_stop_words('unsupported language')
print(stop_words)

opened by nadavvin 2

Releases(2018.7.23)

2018.7.23(Jul 23, 2018)
2018.7.23

Fixed #14: languages.json is missing, if you don't git clone with --recursive.

Feature: Support latest version of Python (3.7+).

Feature #22: Enforces packaging of eggs into folders.

Update the stop-words repository to get the latest languages.

Fixed Travis failing and tests due to bootstrap.

PyPI: https://pypi.org/project/stop-words/2018.7.23/

To install:

$ pip install stop-words==2018.7.23
2015.2.23.1(Feb 23, 2015)
2015.2.23.1

Fix #9: Missing languages.json file that breaks the installation.

PyPI: https://pypi.python.org/pypi/stop-words/2015.2.23
2015.2.23(Feb 23, 2015)
2015.2.23

Feature: Using the cache is optional

Feature: Filtering stopwords

Special thanks to Taras Labiak @kissarat

PyPI: https://pypi.python.org/pypi/stop-words/2015.2.21
2015.2.21(Feb 21, 2015)
2015.2.21

Feature: LANGUAGE_MAPPING is loaded from stop-words/languages.json

Fix: Made paths OS-independent

PyPI: https://pypi.python.org/pypi/stop-words/2015.2.21

Special thanks to Taras Labiak @kissarat
2015.1.31(Feb 1, 2015)
2015.1.31

Feature #5: Decode error AND Add catalan language to LANGUAGE_MAPPING.

Feature: Update stop-words dictionary.

2015.1.22(Jan 22, 2015)
2015.1.22

Feature: Tests

Feature: Python 3 support

Feature: Dev installation via zc.buildout

Feature: Continuous integration via Travis

PyPI: https://pypi.python.org/pypi/stop-words/2015.1.22
2015.1.19(Jan 19, 2015)
2015.1.19

Feature #3: Handle language code, cache and custom errors


Owner

Alireza Savand

I am Alireza Savand, a Software Architect.

GitHub Repository https://pypi.org/project/stop-words/
