Spam filtering made easy for you

Last update: Dec 18, 2022

spammy

Author:	Tasdik Rahman
Latest version:	1.0.3

Contents

1 Overview
2 Features
3 Example
- 3.1 Accuracy of the classifier
4 Installation
- 4.1 Upgrading
- 4.2 Installation behind a proxy
5 Benchmarks
6 Contributing
- 6.1 Roadmap
7 Licensing
8 Credits
9 Donation

1 Overview

spammy: Spam filtering at your service

spammy powers the web app https://plino.herokuapp.com

2 Features

Train the classifier on your own dataset to classify your emails as spam or ham
Dead simple to use (see the Example section)
Blazingly fast once the classifier is trained (see the Benchmarks section)
Custom exceptions, so that when you miss something, spammy tells you gracefully where you went wrong
Written in uncomplicated Python
Built on the giant shoulders of nltk

3 Example

Your data directory should be structured something like this:

$ tree /home/tasdik/Dropbox/projects/spammy/examples/test_dataset
/home/tasdik/Dropbox/projects/spammy/examples/test_dataset
├── ham
│   ├── 5458.2001-04-25.kaminski.ham.txt
│   ├── 5459.2001-04-25.kaminski.ham.txt
│   ...
│   ...
│   └── 5851.2001-05-22.kaminski.ham.txt
└── spam
    ├── 4136.2005-07-05.SA_and_HP.spam.txt
    ├── 4137.2005-07-05.SA_and_HP.spam.txt
    ...
    ...
    └── 5269.2005-07-19.SA_and_HP.spam.txt
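Before training, it can help to confirm that a directory follows this layout. The sketch below is not part of spammy: `count_labelled_files` is a hypothetical helper that uses only the standard library and assumes one `.txt` file per message inside `ham` and `spam` subdirectories, as in the tree above.

```python
import os


def count_labelled_files(directory, labels=("ham", "spam")):
    """Return {label: number of .txt files} for each expected subdirectory.

    Hypothetical helper, not part of spammy. Raises ValueError if one of
    the label subdirectories is missing.
    """
    counts = {}
    for label in labels:
        label_dir = os.path.join(directory, label)
        if not os.path.isdir(label_dir):
            raise ValueError("missing subdirectory: %s" % label_dir)
        # Count only message files; other files in the folder are ignored.
        counts[label] = sum(
            1 for name in os.listdir(label_dir) if name.endswith(".txt")
        )
    return counts
```

Calling `count_labelled_files(directory)` on a corpus directory gives a quick view of how many ham and spam files are available, which is useful when choosing a `limit` value.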

Example

>>> import os
>>> from spammy import Spammy
>>>
>>> directory = '/home/tasdik/Dropbox/projects/spamfilter/data/corpus3'
>>>
>>> # directory structure
>>> os.listdir(directory)
['spam', 'Summary.txt', 'ham']
>>> os.listdir(os.path.join(directory, 'spam'))[:3]
['4257.2005-04-06.BG.spam.txt', '0724.2004-09-21.BG.spam.txt', '2835.2005-01-19.BG.spam.txt']
>>>
>>> # Spammy object created
>>> cl = Spammy(directory, limit=100)
>>> cl.train()
>>>
>>> SPAM_TEXT = \
... """
... My Dear Friend,
...
... How are you and your family? I hope you all are fine.
...
... My dear I know that this mail will come to you as a surprise, but it's for my
... urgent need for a foreign partner that made me to contact you for your sincere
... genuine assistance My name is Mr.Herman Hirdiramani, I am a banker by
... profession currently holding the post of Director Auditing Department in
... the Islamic Development Bank(IsDB)here in Ouagadougou, Burkina Faso.
...
... I got your email information through the Burkina's Chamber of Commerce
... and industry on foreign business relations here in Ouagadougou Burkina Faso
... I haven'disclose this deal to any body I hope that you will not expose or
... betray this trust and confident that I am about to repose on you for the
... mutual benefit of our both families.
...
... I need your urgent assistance in transferring the sum of Eight Million,
... Four Hundred and Fifty Thousand United States Dollars ($8,450,000:00) into
... your account within 14 working banking days This money has been dormant for
... years in our bank without claim due to the owner of this fund died along with
... his entire family and his supposed next of kin in an underground train crash
... since years ago. For your further informations please visit
... (http://news.bbc.co.uk/2/hi/5141542.stm)
... """
>>> cl.classify(SPAM_TEXT)
'spam'
>>>
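To sort several messages at once, the `classify` call shown above can be wrapped in a small loop. This is a minimal sketch, not part of spammy: `group_by_label` is a hypothetical helper, and it assumes only that the classifier object has a `classify(text)` method returning a label string such as 'spam' or 'ham'.

```python
from collections import defaultdict


def group_by_label(classifier, messages):
    """Group message texts by the label that classifier.classify() returns.

    Hypothetical helper: `classifier` is any object with a classify(text)
    method, for example a trained Spammy instance.
    """
    grouped = defaultdict(list)
    for text in messages:
        grouped[classifier.classify(text)].append(text)
    return dict(grouped)
```

With the trained `cl` object from the session above, `group_by_label(cl, [SPAM_TEXT, another_message])` would return a dictionary keyed by label, where `another_message` stands for any other message text.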

3.1 Accuracy of the classifier

>>> from spammy import Spammy
>>> directory = '/home/tasdik/Dropbox/projects/spammy/examples/training_dataset'
>>> cl = Spammy(directory, limit=300)  # training on only 300 spam and ham files
>>> cl.train()
>>> data_dir = '/home/tasdik/Dropbox/projects/spammy/examples/test_dataset'
>>>
>>> cl.accuracy(directory=data_dir, label='spam', limit=300)
0.9554794520547946
>>> cl.accuracy(directory=data_dir, label='ham', limit=300)
0.9033333333333333
>>>
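One plausible reading of these figures is the fraction of files under a label that the classifier assigns to that same label. The sketch below illustrates that reading only; it is an assumption, not spammy's actual implementation, and `label_accuracy` is a hypothetical name. Under this reading, the ham figure above would correspond to 271 correct out of 300 files.

```python
def label_accuracy(classify, texts, label):
    """Return the fraction of texts for which classify(text) == label.

    Hypothetical helper: `classify` is any callable that maps a message
    text to a label string, for example a trained classifier's method.
    """
    texts = list(texts)
    if not texts:
        raise ValueError("no texts to evaluate")
    correct = sum(1 for text in texts if classify(text) == label)
    # float() keeps the division exact on Python 2 as well as Python 3.
    return float(correct) / len(texts)
```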

NOTE: More examples can be found in the examples directory.

4 Installation

NOTE: spammy currently supports only Python 2.

Install the dependencies first:

$ pip install nltk==3.2.1 beautifulsoup4==4.4.1

To install, use pip:

$ pip install spammy

or, if you don't have pip, use easy_install:

$ easy_install spammy

Or build it yourself (only if you must):

$ git clone https://github.com/tasdikrahman/spammy.git
$ cd spammy
$ python setup.py install

4.1 Upgrading

To upgrade the package:

$ pip install -U spammy

4.2 Installation behind a proxy

If you are behind a proxy, this should work:

$ pip --proxy [username:password@]domain_name:port install spammy

5 Benchmarks

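Spammy is fast once trained: in the run below, classifying a message takes roughly 0.03 seconds (each timeit figure is the total for five calls).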

Don't believe me? Have a look

>>> import timeit
>>> from spammy import Spammy
>>>
>>> directory = '/home/tasdik/Dropbox/projects/spamfilter/data/corpus3'
>>> cl = Spammy(directory, limit=100)
>>> cl.train()
>>> SPAM_TEXT_2 = \
... """
... INTERNATIONAL MONETARY FUND (IMF)
... DEPT: WORLD DEBT RECONCILIATION AGENCIES.
... ADVISE: YOUR OUTSTANDING PAYMENT NOTIFICATION
...
... Attention
... A power of attorney was forwarded to our office this morning by two gentle men,
... one of them is an American national and he is MR DAVID DEANE by name while the
... other person is MR... JACK MORGAN by name a CANADIAN national.
... This gentleman claimed to be your representative, and this power of attorney
... stated that you are dead; they brought an account to replace your information
... in other to claim your fund of (US$9.7M) which is now lying DORMANT and UNCLAIMED,
...  below is the new account they have submitted:
...                     BANK.-HSBC CANADA
...                     Vancouver, CANADA
...                     ACCOUNT NO. 2984-0008-66
...
... Be further informed that this power of attorney also stated that you suffered.
... """
>>>
>>> def classify_timeit():
...    result = cl.classify(SPAM_TEXT_2)
...
>>> timeit.repeat(classify_timeit, number=5)
[0.1810469627380371, 0.16121697425842285, 0.16121196746826172]
>>>
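Each number that `timeit.repeat` returns is the total time in seconds for `number` calls, so dividing by `number` gives the cost of a single call. The sketch below applies that arithmetic to the figures reported above; `per_call_seconds` is a hypothetical helper, not part of spammy or timeit.

```python
def per_call_seconds(totals, number):
    """Convert timeit.repeat() totals into seconds per single call."""
    return [total / float(number) for total in totals]


# Totals reported in the session above (number=5 calls per measurement).
reported = [0.1810469627380371, 0.16121697425842285, 0.16121196746826172]

# The minimum is the usual figure to quote, since higher values mostly
# reflect interference from other processes.
best = min(per_call_seconds(reported, number=5))
print("best time per classify call: %.4f s" % best)
```

For this run, the best measurement works out to about 0.032 seconds per `classify` call.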

6 Contributing

Refer to the CONTRIBUTING page for details.

6.1 Roadmap

Include more algorithms for increased accuracy
Python 3 support

7 Licensing

Spammy is built by Tasdik Rahman and licensed under GPLv3.

spammy Copyright (C) 2016 Tasdik Rahman

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>.

A full copy of the license is in the LICENSE file.

8 Credits

If you'd like to give me credit somewhere on your blog or tweet a shout-out to @tasdikrahman, well hey, I'll take it.

9 Donation

If you have found my little bits of software of any use to you, you can help me pay my internet bills :)
