4CAT: Capture and Analysis Toolkit

Overview


DOI: 10.5281/zenodo.4742622 · License: MPL 2.0 · Requires Python 3.8

Screenshots: 4CAT's 'Create Dataset' interface, and a network visualisation of a dataset.

4CAT is a research tool that can be used to analyse and process data from online social platforms. Its goal is to make the capture and analysis of data from these platforms accessible to people through a web interface, without requiring any programming or web scraping skills. Our target audience is researchers, students and journalists interested in using Digital Methods in their work.

In 4CAT, you create a dataset from a given platform according to a given set of parameters; the result of this (usually a CSV file containing matching items) can then be downloaded or analysed further with a suite of analytical 'processors', which range from simple frequency charts to more advanced analyses such as the generation and visualisation of word embedding models.
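As a toy illustration of such further analysis outside of 4CAT itself, a downloaded CSV export can be processed with ordinary tooling; in the pandas sketch below, the column names ('timestamp' and 'body') are assumptions that vary per data source:

    # Toy post-processing of a 4CAT CSV export with pandas; the column
    # names ('timestamp', 'body') are assumptions and vary per data source.
    import pandas as pd

    df = pd.read_csv("4cat-export.csv", parse_dates=["timestamp"])
    posts_per_day = df.resample("D", on="timestamp")["body"].count()
    print(posts_per_day.head())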

4CAT has a (growing) number of supported data sources corresponding to popular platforms that are part of the tool, but you can also add additional data sources using 4CAT's Python API. The following data sources are currently actively supported:

  • 4chan
  • 8kun
  • Bitchute
  • Parler
  • Reddit
  • Telegram
  • Twitter API (Academic and regular tracks)

Data from several other platforms can be imported into 4CAT for analysis through companion tools, such as Zeeschuimer.

A number of other platforms have built-in support that is untested, or requires e.g. special API access. You can view the full list of data sources in the GitHub repository.
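At its core, a data source boils down to a search worker that yields matching items, plus a map_item routine that flattens raw items into rows for the CSV export. The skeleton below is purely illustrative of that shape; the class layout and names are assumptions, not 4CAT's actual Python API (refer to the wiki for the real interface):

    # Purely illustrative data source skeleton; the class and method names
    # are assumptions about the general shape, not 4CAT's actual Python API.

    def fetch_from_platform(query):
        # Stub standing in for real calls to a platform's API.
        return [{"id": "1", "text": "a post matching " + query}]

    class SearchExamplePlatform:
        type = "exampleplatform-search"  # unique worker ID (assumed convention)

        def get_items(self, query):
            # Yield matching items one by one; 4CAT writes them to the dataset.
            for post in fetch_from_platform(query["query"]):
                yield post

        @staticmethod
        def map_item(item):
            # Flatten a raw item into the columns of the CSV export.
            return {"id": item["id"], "body": item["text"]}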

Install

You can install 4CAT locally or on a server via Docker or manually. The usual

docker-compose up

will work, but detailed and alternative installation instructions are available in our wiki. Currently 4chan, 8chan, and 8kun require additional steps; please see the wiki.

Please check our issues and create one if you experience any problems (pull requests are also very welcome).

Components

4CAT consists of several components, each in a separate folder:

  • backend: A standalone daemon that collects and processes data, as queued via the tool's web interface or API.
  • webtool: A Flask app that provides a web front-end to search and analyse the stored data.
  • common: Assets and libraries.
  • datasources: Data source definitions. These comprise configuration options, database definitions and Python scripts to process the data with. If you want to set up your own data sources, refer to the wiki.
  • processors: A collection of data processing scripts that plug into 4CAT to manipulate or process datasets created with 4CAT. There is an API you can use to make your own processors; a rough sketch follows below this list.
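The skeleton below illustrates the general plug-in shape only; the base class is a stand-in and every name is an assumption, not 4CAT's actual processor API (see the wiki for the real one):

    # Illustrative processor skeleton; ProcessorStub stands in for 4CAT's
    # real processor base class, and all names here are assumptions.

    class ProcessorStub:
        def iterate_source_items(self):
            # Stand-in for streaming items from the parent dataset.
            yield {"body": "first post"}
            yield {"body": "second post"}

        def write_output(self, rows):
            # Stand-in for writing a result file and marking the dataset done.
            print(rows)

    class CountItemsProcessor(ProcessorStub):
        type = "example-count"  # unique processor ID (assumed convention)
        title = "Count items"
        extension = "csv"       # type of result file

        def process(self):
            # Count the parent dataset's items and write a one-row summary.
            total = sum(1 for _ in self.iterate_source_items())
            self.write_output([{"items": total}])

    CountItemsProcessor().process()  # prints [{'items': 2}]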

Credits & License

4CAT was created at OILab and the Digital Methods Initiative at the University of Amsterdam. The tool was inspired by DMI-TCAT, a tool with comparable functionality for capturing and analysing Twitter data.

4CAT development is supported by the Dutch PDI-SSH foundation through the CAT4SMR project.

4CAT is licensed under the Mozilla Public License, 2.0. Refer to the LICENSE file for more information.

Comments
  • Allow autologin to _always_ work (or perhaps disable login?)

    I am running a 4CAT server in Docker, with an Apache2 reverse proxy in front. It works fine except for one small thing.

    MYSERVER.domain hosts my Apache proxy.

    In Settings -> Flask settings I have: Auto-login name = MYSERVER.domain

    However, when I access 4CAT through the proxy I don't want to be met with a login; I just want to be inside. I was thinking that the auto-login name would whitelist hosts so they could bypass login?

    enhancement 
    opened by anderscollstrup 21
  • Docker swarm server: Cannot make Flask frontend work and log in (not using default docker-compose); Flask overwriting settings values in database

    Hi, I have 4CAT running on a Docker swarm server. After modifying the compose file a little to make it compatible with Docker swarm, and the environment variables a little as well, I got it running, but I cannot log in. I see this is a security feature with Flask. I have read https://github.com/digitalmethodsinitiative/4cat/issues/269 and it is also related to issue https://github.com/digitalmethodsinitiative/4cat/issues/272. I cannot find the whitelist or where it is, since there is now no config.py.

    Here is a dump of the settings table of my PostgreSQL database. Maybe it is relevant.

     DATASOURCES               | {"bitchute": {}, "custom": {}, "douban": {}, "customimport": {}, "parler": {}, "reddit": {"boards": "*"}, "telegram": {}, "twitterv2": {"id_lookup": false}}
     4cat.name                 | "4CAT"
     4cat.name_long            | "4CAT: Capture and Analysis Toolkit"
     4cat.github_url           | "https://github.com/digitalmethodsinitiative/4cat"
     path.versionfile          | ".git-checked-out"
     expire.timeout            | 0
     expire.allow_optout       | true
     logging.slack.level       | "WARNING"
     logging.slack.webhook     | null
     mail.admin_email          | null
     mail.host                 | null
     mail.ssl                  | false
     mail.username             | null
     mail.password             | null
     mail.noreply              | "[email protected]"
     SCRAPE_TIMEOUT            | 5
     SCRAPE_PROXIES            | {"http": []}
     IMAGE_INTERVAL            | 3600
     explorer.max_posts        | 100000
     flask.flask_app           | "webtool/fourcat"
     flask.secret_key          | "2e3037b7533c100f324e472a"
     flask.https               | false
     flask.autologin.name      | "Automatic login"
     flask.autologin.api       | ["localhost", "4cat.coraldigital.mx", "\"4cat.coraldigital.mx\"", "51.81.52.207", "0.0.0.0"]
     flask.server_name         | ""
     flask.autologin.hostnames | ["*"]
    
    docker issue 
    opened by hydrosIII 17
  • Cannot make flask frontend work

    Backend is running:

    [email protected]:/usr/local/4cat# ps -ef | grep python
    root      497    1  0 10:36 ?      00:00:02 /usr/bin/python3 /usr/bin/fail2ban-server -xf start
    root      516    1  0 10:36 ?      00:00:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
    4cat    18989    1 59 12:39 ?      00:00:01 /usr/bin/python3 4cat-daemon.py start
    root    19008  891  0 12:39 pts/0  00:00:00 grep python

    [email protected]:/usr/local/4cat# pip install python-dotenv
    Collecting python-dotenv
      Downloading python_dotenv-0.20.0-py3-none-any.whl (17 kB)
    Installing collected packages: python-dotenv
    Successfully installed python-dotenv-0.20.0
    [email protected]:/usr/local/4cat# FLASK_APP=webtool flask run --host=0.0.0.0

    * Serving Flask app "webtool"
    * Environment: production
      WARNING: This is a development server. Do not use it in a production deployment.
      Use a production WSGI server instead.
    * Debug mode: off
    * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
    /usr/local/lib/python3.9/dist-packages/flask/sessions.py:208: UserWarning: "localhost" is not a valid cookie domain, it must contain a ".". Add an entry to your hosts file, for example "localhost.localdomain", and use that instead.
      warnings.warn(
    MY PC IP - - [10/Jun/2022 12:36:54] "GET / HTTP/1.1" 404 -

    And I get 404 in my browser when I point to http://server_ip:5000

    4CAT was installed manually, using this guide: https://github.com/digitalmethodsinitiative/4cat/wiki/Installing-4CAT

    docker issue 
    opened by anderscollstrup 17
  • Issue with migrate.py preventing me from running 4cat or accessing web interface

    Hello, thanks for making this tool available. I'd be grateful for any tips: I'm getting an 'EOFError: EOF when reading a line' message when I run docker-compose up. I'm using Windows 10 Home. I initially tried to install 4CAT manually to scrape 4chan, but I couldn't get it to work, so I uninstalled it and then tried to install through Docker.

    I'm using Windows Powershell to run the command because when I run docker-compose up in Ubuntu 20.04 LTS I'm getting this message:

    'The command 'docker-compose' could not be found in this WSL 2 distro. We recommend to activate the WSL integration in Docker Desktop settings.

    See https://docs.docker.com/desktop/windows/wsl/ for details.'

    The WSL integration is activated in Docker Desktop settings by default. Could it be because I didn't bind-mount the folder I'm storing 4cat in to the Linux file system? I skipped that step and just stored 4cat in /c/users/myusername/ on Windows.

    This is the message I get when I run the docker-compose up command from PowerShell:

    PS C:\users\myusername\4cat> docker-compose up
    [+] Running 2/2
     - Container cat_db_1  Running    0.0s
     - Container api       Recreated  0.7s
    Attaching to api, db_1
    api | Waiting for postgres...
    api | PostgreSQL started
    api | 1
    api | Seed present
    api | Starting app
    api | Running migrations
    api |
    api | 4CAT migration agent
    api | ------------------------------------------
    api | Current 4CAT version: 1.9
    api | Checked out version: 1.16
    api | The following migration scripts will be run:
    api |   migrate-1.9-1.10.py
    api |   migrate-1.10-1.11.py
    api |   migrate-1.11-1.12.py
    api |   migrate-1.12-1.13.py
    api |   migrate-1.13-1.14.py
    api |   migrate-1.14-1.15.py
    api | WARNING: Migration can take quite a while. 4CAT will not be available during migration.
    api | If 4CAT is still running, it will be shut down now.
    api | Do you want to continue [y/n]? Traceback (most recent call last):
    api |   File "helper-scripts/migrate.py", line 142, in <module>
    api |     if not args.yes and input("").lower() != "y":
    api | EOFError: EOF when reading a line
    api exited with code 1
    opened by robbydigital 15
  • Unknown local index '4chan_posts' in search request

    We managed to overcome our previous issue thanks to your advice. However, we are now stuck with an error related to the indexes, appearing whenever we query 4chan.

    First we generated the sphinx.conf using helper-scripts/generate_sphinx_config.py. This results in the following indexes:

    [...]

    /* Indexes */

    index 4cat_index {
        min_infix_len = 3
        html_strip = 1
        type = template
        charset_table = 0..9, a..z, _, A..Z->a..z, U+47, U+58, U+40, U+41, U+00C0->a, U+00C1->a, U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00C7->c,$
    }

    index 4chan_posts : 4cat_index {
        type = plain
        source = 4chan_posts_old
        path = /opt/sphinx/data/4chan_posts
    }

    index 4chan_posts : 4cat_index {
        type = plain
        source = 4chan_posts_new
        path = /opt/sphinx/data/4chan_posts
    }

    [...]

    However, starting Sphinx with this setup results in the following error:

    Mar 16 11:48:44 dev sphinxsearch[505]: ERROR: section '4chan_posts' (type='index') already exists in /etc/sphinxsearch/sphinx.conf line 51 col 19.

    I have then attempted to comment out one of the indexes and/or change the path, which allows Sphinx to start. However, another error then appears once collection has been initiated:

    16-03-2020 11:50:54 | ERROR (threading.py:884): Sphinx crash during query deb9cfe3e0a47d56612fd6e453208ed6: (1064, "unknown local index '4chan_posts' in search request\x00")
    

    Hope you can once again help me figure out how the indexes should be set up.

    opened by bornakke 12
  • Installing problem: frontend failed to run with 'docker-compose up' command

    When running the command docker-compose up, the database and backend components come up fine, but the frontend never finishes starting and always gets stuck at "[INFO] Booting worker with pid: 12". The problem is still there after restarting the frontend component in the Docker UI.

    docker issue 
    opened by baiyuan523 11
  • Error "string indices must be integers" from search_twitter.py:403

    Error "string indices must be integers" from search_twitter.py:403

    From our 4cat.log

    21-09-2021 10:48:11 | INFO (processor.py:890): Running processor count-posts on dataset a5eeaf86aa27ff91f212d35880090d70
    21-09-2021 10:48:11 | INFO (processor.py:890): Running processor attribute-frequencies on dataset 659e224c54209146f7551523e8d26633
    21-09-2021 10:48:11 | ERROR (worker.py:890): Processor count-posts raised TypeError while processing dataset a5eeaf86aa27ff91f212d35880090d70 (via 76e33804acca3ac18d3cfa8de8059780) in count_posts.py:59->processor.py:316->search_twitter.py:403:
       string indices must be integers
    
    21-09-2021 10:48:11 | ERROR (worker.py:890): Processor attribute-frequencies raised TypeError while processing dataset 659e224c54209146f7551523e8d26633 (via 01db05ce10f58b320a397d68b61986a2) in rank_attribute.py:132->processor.py:316->search_twitter.py:403:
       string indices must be integers
    

    The line in question is from SearchWithTwitterAPIv2.map_item() https://github.com/digitalmethodsinitiative/4cat/blob/f0e01fb500b7dafb58a05873cf34bf15e288a88c/datasources/twitterv2/search_twitter.py#L403

    and I haven't found a good way to bring 4CAT under a debugger and/or have it inform me of the ID of the offending tweet.

    Could this be related to #169 ?

    opened by xmacex 10
  • AttributeError: 'Namespace' object has no attribute 'release'

    Fresh installation on Mac with Docker from local files. Any idea what I did wrong?

    4cat_backend:

    Waiting for postgres...
    PostgreSQL started
    Database already created

    Traceback (most recent call last):
      File "helper-scripts/migrate.py", line 66, in <module>
        if args.release:
    AttributeError: 'Namespace' object has no attribute 'release'

    4cat_backend EXITED (1)

    bug deployment 
    opened by psegovias 9
  • Docker setup fails to "import config" on macOS Big Sur (M1)

    Discussed in https://github.com/digitalmethodsinitiative/4cat/discussions/191

    Originally posted by p-charis October 25, 2021 Hey everyone! First, thanks a million to the developers for building this & making it available :)

    Now, I managed to get 4CAT working on macOS (latest version, M1 native), but only after I removed the following lines from the docker-setup.py file (line #36 onwards). With these lines in place the installation wouldn't work, as it returned the error that no module named config was found. I suspect it might have something to do with the way Docker runs on macOS generally and the paths it creates, but I haven't figured it out yet. So I just wanted to let the devs know, as well as other macOS users, that if they've had a similar problem, they could try this workaround.

    # Ensure filepaths exist
    import os                  # imports added here so the excerpt is self-contained
    from pathlib import Path

    import config
    for path in [config.PATH_DATA,
                 config.PATH_IMAGES,
                 config.PATH_LOGS,
                 config.PATH_LOCKFILE,
                 config.PATH_SESSIONS,
                 ]:
        if Path(config.PATH_ROOT, path).is_dir():
            pass
        else:
            os.makedirs(Path(config.PATH_ROOT, path))
    
    bug docker issue 
    opened by p-charis 8
  • Tokeniser exclusion list ignores last word in list

    I'm filtering some commonly used words out of a corpus with the Tokenise processor and it only seems to be partially successful. For example, in one month there are 37,325 instances of one word. When I add the word to the reject list, there are still 6,307 instances of the word. So it's getting most, but not all. I'm having the same issue with some common swear words that I'm trying to filter out: most are gone, but some remain. Is there a reason for this?

    Thanks for any insight!

    opened by robbydigital 6
  • Datasource that interfaces with a TCAT instance

    It works, and arguably fixes #117, but:

    • The form looks hideous with the million query fields. Do we need them all for 4CAT? Is there a way to make it look better?
    • The list of bins displayed in the 'create dataset' form simply lists bins from all instances. This can get really long really fast when supporting multiple instances. A custom form control may be necessary to make this user-friendly.
    • The list of bins is loaded synchronously whenever get_options() is run. The result should probably be cached or updated in the background (with a separate worker...?)
    • The data format now follows that of twitterv2's map_item(), but there is quite a bit more data in the TCAT export that we could include.
    opened by stijn-uva 6
  • Update 'FAQ' and 'About' pages

    The 'About' page should probably refer to documentation and guides etc rather than the 'news' thing it's doing now, and the FAQ is still very 4chan-oriented.

    enhancement (mostly) front-end 
    opened by stijn-uva 0
  • Feature request: allow data from linked telegram chat channels to be collected

    Telegram chats have linked "discussion" channels, where users can respond to messages in the main channel. Occasionally, these are also public, and if so, can also be found by the API. It would be useful to allow users to also automatically collect data from these chat channels if they're found.

    A note on this and future feature requests: we (https://github.com/GateNLP) are putting in some additions to the Telegram data collector on our end, and thought it might be worth checking whether there's scope for them to be added to the original/main instance.

    If there are any issues with this, or they don't really fit with what you have in mind for your instance, that's all fine; we'll continue to maintain them on our own fork instead!

    Linked pull request: https://github.com/digitalmethodsinitiative/4cat/pull/322

    enhancement data source 
    opened by muneerahp 1
  • LIHKG data source

    A data source for LIHKG. It uses the web interface's web API, which seems reasonably straightforward and stable. There is some rate limiting, which 4CAT tries to respect by pacing requests and implementing an exponential backoff.
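    For reference, request pacing with exponential backoff generally takes the shape below; this is a generic sketch, not the code in this pull request, and the URL passed in would be a placeholder:

    # Generic exponential backoff for a rate-limited HTTP API; a sketch,
    # not the code in this pull request.
    import time
    import requests

    def get_with_backoff(url, max_retries=5):
        delay = 1  # seconds; doubled after each rate-limited response
        for _ in range(max_retries):
            response = requests.get(url, timeout=30)
            if response.status_code != 429:  # not rate-limited
                return response
            time.sleep(delay)
            delay *= 2
        raise RuntimeError("still rate-limited after %d retries" % max_retries)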

    enhancement data source questionable 
    opened by stijn-uva 0
  • ability to count frequency for specific (and multiple) keywords over time

    A processor that can filter a dataset on multiple particular words or phrases and output the count values (overall, or over time) per item, producing a .csv that can be imported into RAW Graphs to compare the evolution of different words/phrases over time, in either absolute or relative numbers.
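    A rough sketch of that logic with pandas follows; the column names ('timestamp' and 'body') are assumptions that vary per data source:

    # Count occurrences of each keyword per month; a sketch of the requested
    # processor, with assumed column names ('timestamp', 'body').
    import pandas as pd

    keywords = ["climate", "energy"]
    df = pd.read_csv("dataset.csv", parse_dates=["timestamp"])

    counts = pd.DataFrame({
        word: df[df["body"].str.contains(word, case=False, na=False)]
              .resample("M", on="timestamp").size()
        for word in keywords
    }).fillna(0)

    counts.to_csv("keyword-frequencies.csv")  # import into RAW Graphs to chart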

    processors data source 
    opened by daniel-dezeeuw 0
  • Warn about need to update Docker `.env` file when upgrading 4CAT to new version

    When using Docker, the .env file can be used to ensure you pull a particular version of 4CAT. If you then upgrade 4CAT interactively, we cannot modify the .env file (which lives on the user's host machine). If a user removes or rebuilds 4CAT, it will pull the version of 4CAT listed in the .env file, which will not be the latest version that was upgraded to.

    I will look at adding a warning/notification to the upgrade logs to notify users of the need to update their .env file.

    enhancement deployment 
    opened by dale-wahl 0
Releases (v1.29)
  • v1.29 (Oct 6, 2022)

    Snapshot of 4CAT as of October 2022. Many changes and fixes since the last official release, including:

    • Restart and upgrade 4CAT via the web interface (#181, #287, #288)
    • Addition of several processors for Twitter datasets to increase interoperability with DMI-TCAT
    • DMI-TCAT data source, which can interface with a DMI-TCAT instance to create datasets from tweets stored therein (#226)
    • LinkedIn data source, to be used together with Zeeschuimer
    • Fixes & improvements to Docker container set-up and build process (#269, #270, #290)
    • A number of processors have been updated to transparently filter NDJSON datasets instead of turning them into CSV datasets (#253, #282, #291, #292)
    • And many smaller fixes & updates

    From this release onwards, 4CAT can be upgraded to the latest release via the Control Panel in the web interface.

  • v1.26 (May 10, 2022)

    Many updates:

    • Configuration is now stored in the database and (mostly) editable via the web GUI
    • The Telegram datasource now collects more data and stores the 'raw' message objects as NDJSON
    • Dialogs in the web UI now use custom widgets instead of alert()
    • Twitter datasets will retrieve the expected number of tweets before capturing, and ask for confirmation if it is a high number
    • Various fixes and tweaks to the Dockerfiles
    • New extended data source information pages with details about limitations, caveats, useful links, etc
    • And much more
  • v1.25 (Feb 24, 2022)

    Snapshot of 4CAT as of 24 February 2022. Many changes and fixes since the last official release, including:

    • Explore and annotate your datasets interactively with the new Explorer (beta)
    • Datasets can be set to automatically get deleted after a set amount of time, and can be made private
    • Incremental refinement of the web interface
    • Twitter datasets can be exported to a DMI-TCAT instance
    • User accounts can now be deactivated (banned)
    • Many smaller fixes and new features
  • v1.21 (Sep 28, 2021)

    Snapshot of 4CAT as of 28 September 2021. Many changes and fixes since the last official release, including:

    • User management via control panel
    • Improved Docker support
    • Improved 4chan data dump import helper scripts
    • Improved country code filtering for 4chan/pol/ datasets
    • More robust and versatile network analysis processors
    • Various new filter processors
    • Topic modeling processor
    • Support for non-academic Twitter API queries
    • Option to download NDJSON datasets as CSV
    • Support for hosting 4CAT with a non-root URL
    • And many more
  • v1.18a (May 7, 2021)

  • v1.17 (Apr 8, 2021)

  • v1.9b1 (Jan 17, 2020)

  • v1.0b1 (Feb 28, 2019)

    4CAT is now ready for wider use! It offers...

    • An API that can be used to queue and manipulate queries programmatically
    • Diverse analytical post-processors that may be combined to further analyse data sets
    • A flexible interface for adding various data sources
    • A robust scraper
    • A very retro interface
Owner
Digital Methods Initiative
The Digital Methods Initiative (DMI) is one of Europe's leading Internet Studies research groups. Research tools it develops are collected here.