Snowball compiler and stemming algorithms

Last update: Jan 07, 2023

Overview

Snowball is a small string processing language for creating stemming algorithms for use in Information Retrieval, plus a collection of stemming algorithms implemented using it.

Snowball was originally designed and built by Martin Porter. Martin retired from development in 2014 and Snowball is now maintained as a community project. Martin originally chose the name Snowball as a tribute to SNOBOL, the excellent string handling language from the 1960s. It now also serves as a metaphor for how the project grows by gathering contributions over time.

The Snowball compiler translates a Snowball program into source code in another language. Currently ISO C, C#, Go, Java, JavaScript, Object Pascal, Python and Rust are supported.

This repository contains the source code for the Snowball compiler and the stemming algorithms. The Snowball compiler is written in ISO C, so you'll need a C compiler which supports C99 to build it (but the C code it generates should work with any ISO C compiler).

See https://snowballstem.org/ for more information about Snowball.

What is Stemming?

Stemming maps different forms of the same word to a common "stem". For example, the English stemmer maps connection, connections, connective, connected, and connecting to connect. So a search for connected would also find documents which only have the other forms.
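
To make the search behaviour above concrete, here is a minimal sketch in Python. It does not use the Snowball English stemmer: `toy_stem`, its suffix list, `build_index`, `search` and the sample documents are all hypothetical, chosen only so that the five example words conflate to connect. The point is just that indexing and querying by stem lets a search for one form find documents containing the other forms.

```python
# Toy illustration only: this is NOT the Snowball English stemmer.
# The suffix list is a hypothetical one, ordered longest first, that
# happens to cover the "connect" examples from the text.
SUFFIXES = ("ions", "ion", "ive", "ing", "ed", "s")


def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least min_stem_len characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stem to the set of ids of documents containing a word with that stem."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.split():
            index.setdefault(toy_stem(word), set()).add(doc_id)
    return index


def search(index, query):
    """Return the sorted ids of documents sharing a stem with the query word."""
    return sorted(index.get(toy_stem(query), set()))


# Hypothetical sample documents; none of them contains the word "connected".
DOCS = {
    1: "a stable connection",
    2: "connecting two networks",
    3: "connective tissue",
    4: "an unrelated note",
}
INDEX = build_index(DOCS)

print(search(INDEX, "connected"))  # prints [1, 2, 3]
```

A real stemmer applies far more careful rules than this suffix list, which would mangle many ordinary words; the sketch only shows where a stemmer sits in a search pipeline.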

The stem is often a word itself, but not always, because text search systems, which are the intended field of use, do not require it. We also aim to conflate words with the same meaning, rather than all words with a common linguistic root (so awe and awful don't have the same stem). Over-stemming is more problematic than under-stemming, so we tend not to stem in cases that are hard to resolve. If you want to always reduce words to a root form and/or get a root form which is itself a word, then Snowball's stemming algorithms likely aren't the right answer.
