4CAT: Capture and Analysis Toolkit

Overview


DOI: 10.5281/zenodo.4742622 · License: MPL 2.0 · Requires Python 3.8

Screenshots: 4CAT's 'Create Dataset' interface, and a network visualisation of a dataset.

4CAT is a research tool that can be used to analyse and process data from online social platforms. Its goal is to make the capture and analysis of data from these platforms accessible to people through a web interface, without requiring any programming or web scraping skills. Our target audience is researchers, students and journalists interested in using Digital Methods in their work.

In 4CAT, you create a dataset from a given platform according to a given set of parameters; the result of this (usually a CSV file containing matching items) can then be downloaded or analysed further with a suite of analytical 'processors', which range from simple frequency charts to more advanced analyses such as the generation and visualisation of word embedding models.
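As a toy illustration of such further analysis outside of 4CAT itself, a downloaded CSV export can be processed with ordinary tooling; in the pandas sketch below, the column names ('timestamp' and 'body') are assumptions that vary per data source:

    # Toy post-processing of a 4CAT CSV export with pandas; the column
    # names ('timestamp', 'body') are assumptions and vary per data source.
    import pandas as pd

    df = pd.read_csv("4cat-export.csv", parse_dates=["timestamp"])
    posts_per_day = df.resample("D", on="timestamp")["body"].count()
    print(posts_per_day.head())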

4CAT has a (growing) number of supported data sources corresponding to popular platforms that are part of the tool, but you can also add additional data sources using 4CAT's Python API. The following data sources are currently actively supported:

  • 4chan
  • 8kun
  • Bitchute
  • Parler
  • Reddit
  • Telegram
  • Twitter API (Academic and regular tracks)

Data from several other platforms can be imported into 4CAT for analysis through companion tools, such as Zeeschuimer.

A number of other platforms have built-in support that is untested, or requires e.g. special API access. You can view the full list of data sources in the GitHub repository.
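At its core, a data source boils down to a search worker that yields matching items, plus a map_item routine that flattens raw items into rows for the CSV export. The skeleton below is purely illustrative of that shape; the class layout and names are assumptions, not 4CAT's actual Python API (refer to the wiki for the real interface):

    # Purely illustrative data source skeleton; the class and method names
    # are assumptions about the general shape, not 4CAT's actual Python API.

    def fetch_from_platform(query):
        # Stub standing in for real calls to a platform's API.
        return [{"id": "1", "text": "a post matching " + query}]

    class SearchExamplePlatform:
        type = "exampleplatform-search"  # unique worker ID (assumed convention)

        def get_items(self, query):
            # Yield matching items one by one; 4CAT writes them to the dataset.
            for post in fetch_from_platform(query["query"]):
                yield post

        @staticmethod
        def map_item(item):
            # Flatten a raw item into the columns of the CSV export.
            return {"id": item["id"], "body": item["text"]}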

Install

You can install 4CAT locally or on a server via Docker or manually. The usual

docker-compose up

will work, but detailed and alternative installation instructions are available in our wiki. Currently 4chan, 8chan, and 8kun require additional steps; please see the wiki.

Please check our issues and create one if you experience any problems (pull requests are also very welcome).

Components

4CAT consists of several components, each in a separate folder:

  • backend: A standalone daemon that collects and processes data, as queued via the tool's web interface or API.
  • webtool: A Flask app that provides a web front-end to search and analyse the stored data.
  • common: Assets and libraries.
  • datasources: Data source definitions. These comprise configuration options, database definitions and Python scripts to process the data with. If you want to set up your own data sources, refer to the wiki.
  • processors: A collection of data processing scripts that plug into 4CAT to manipulate or process datasets created with 4CAT. There is an API you can use to make your own processors; a rough sketch follows below this list.
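The skeleton below illustrates the general plug-in shape only; the base class is a stand-in and every name is an assumption, not 4CAT's actual processor API (see the wiki for the real one):

    # Illustrative processor skeleton; ProcessorStub stands in for 4CAT's
    # real processor base class, and all names here are assumptions.

    class ProcessorStub:
        def iterate_source_items(self):
            # Stand-in for streaming items from the parent dataset.
            yield {"body": "first post"}
            yield {"body": "second post"}

        def write_output(self, rows):
            # Stand-in for writing a result file and marking the dataset done.
            print(rows)

    class CountItemsProcessor(ProcessorStub):
        type = "example-count"  # unique processor ID (assumed convention)
        title = "Count items"
        extension = "csv"       # type of result file

        def process(self):
            # Count the parent dataset's items and write a one-row summary.
            total = sum(1 for _ in self.iterate_source_items())
            self.write_output([{"items": total}])

    CountItemsProcessor().process()  # prints [{'items': 2}]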

Credits & License

4CAT was created at OILab and the Digital Methods Initiative at the University of Amsterdam. The tool was inspired by DMI-TCAT, a tool with comparable functionality for capturing and analysing Twitter data.

4CAT development is supported by the Dutch PDI-SSH foundation through the CAT4SMR project.

4CAT is licensed under the Mozilla Public License, 2.0. Refer to the LICENSE file for more information.

Comments
  • Allow autologin to _always_ work (or perhaps disable login?)

    I am running a 4CAT server in Docker, with an Apache2 reverse proxy in front. It works fine except for one small thing.

    MYSERVER.domain hosts my Apache proxy.

    In Settings -> Flask settings I have: Auto-login name = MYSERVER.domain

    However, when I access 4CAT through the proxy I don't want to be met with a login; I just want to be inside. I was thinking that the auto-login name would whitelist hosts so they could bypass login?

    enhancement 
    opened by anderscollstrup 21
  • Docker swarm server: Cannot make Flask frontend work and log in (not using default docker-compose); Flask overwriting settings values in database

    Hi, I have 4CAT running on a Docker swarm server. After modifying the compose file a little to make it compatible with Docker swarm, and the environment variables a little as well, I got it running, but I cannot log in. I see this is a security feature with Flask. I have read https://github.com/digitalmethodsinitiative/4cat/issues/269 and it is also related to issue https://github.com/digitalmethodsinitiative/4cat/issues/272. I cannot find the whitelist or where it is, since there is now no config.py.

    Here is a dump of the settings table of my PostgreSQL database. Maybe it is relevant.

     DATASOURCES               | {"bitchute": {}, "custom": {}, "douban": {}, "customimport": {}, "parler": {}, "reddit": {"boards": "*"}, "telegram": {}, "twitterv2": {"id_lookup": false}}
     4cat.name                 | "4CAT"
     4cat.name_long            | "4CAT: Capture and Analysis Toolkit"
     4cat.github_url           | "https://github.com/digitalmethodsinitiative/4cat"
     path.versionfile          | ".git-checked-out"
     expire.timeout            | 0
     expire.allow_optout       | true
     logging.slack.level       | "WARNING"
     logging.slack.webhook     | null
     mail.admin_email          | null
     mail.host                 | null
     mail.ssl                  | false
     mail.username             | null
     mail.password             | null
     mail.noreply              | "[email protected]"
     SCRAPE_TIMEOUT            | 5
     SCRAPE_PROXIES            | {"http": []}
     IMAGE_INTERVAL            | 3600
     explorer.max_posts        | 100000
     flask.flask_app           | "webtool/fourcat"
     flask.secret_key          | "2e3037b7533c100f324e472a"
     flask.https               | false
     flask.autologin.name      | "Automatic login"
     flask.autologin.api       | ["localhost", "4cat.coraldigital.mx", "\"4cat.coraldigital.mx\"", "51.81.52.207", "0.0.0.0"]
     flask.server_name         | ""
     flask.autologin.hostnames | ["*"]
    
    docker issue 
    opened by hydrosIII 17
  • Cannot make flask frontend work

    Backend is running:

    [email protected]:/usr/local/4cat# ps -ef | grep python
    root      497    1  0 10:36 ?      00:00:02 /usr/bin/python3 /usr/bin/fail2ban-server -xf start
    root      516    1  0 10:36 ?      00:00:00 /usr/bin/python3 /usr/share/unattended-upgrades/unattended-upgrade-shutdown --wait-for-signal
    4cat    18989    1 59 12:39 ?      00:00:01 /usr/bin/python3 4cat-daemon.py start
    root    19008  891  0 12:39 pts/0  00:00:00 grep python

    [email protected]:/usr/local/4cat# pip install python-dotenv
    Collecting python-dotenv
      Downloading python_dotenv-0.20.0-py3-none-any.whl (17 kB)
    Installing collected packages: python-dotenv
    Successfully installed python-dotenv-0.20.0
    [email protected]:/usr/local/4cat# FLASK_APP=webtool flask run --host=0.0.0.0

    * Serving Flask app "webtool"
    * Environment: production
      WARNING: This is a development server. Do not use it in a production deployment.
      Use a production WSGI server instead.
    * Debug mode: off
    * Running on http://0.0.0.0:5000/ (Press CTRL+C to quit)
    /usr/local/lib/python3.9/dist-packages/flask/sessions.py:208: UserWarning: "localhost" is not a valid cookie domain, it must contain a ".". Add an entry to your hosts file, for example "localhost.localdomain", and use that instead.
      warnings.warn(
    MY PC IP - - [10/Jun/2022 12:36:54] "GET / HTTP/1.1" 404 -

    And I get 404 in my browser when I point to http://server_ip:5000

    4CAT was installed manually, using this guide: https://github.com/digitalmethodsinitiative/4cat/wiki/Installing-4CAT

    docker issue 
    opened by anderscollstrup 17
  • Issue with migrate.py preventing me from running 4cat or accessing web interface

    Hello, thanks for making this tool available. I'd be grateful for any tips: I'm getting an 'EOFError: EOF when reading a line' message when I run docker-compose up. I'm using Windows 10 Home. I initially tried to install 4CAT manually to scrape 4chan, but I couldn't get it to work, so I uninstalled it and then tried to install through Docker.

    I'm using Windows Powershell to run the command because when I run docker-compose up in Ubuntu 20.04 LTS I'm getting this message:

    'The command 'docker-compose' could not be found in this WSL 2 distro. We recommend to activate the WSL integration in Docker Desktop settings.

    See https://docs.docker.com/desktop/windows/wsl/ for details.'

    The WSL integration is activated in Docker Desktop settings by default. Could it be because I didn't bind-mount the folder I'm storing 4cat in to the Linux file system? I skipped that step and just stored 4cat in /c/users/myusername/ on Windows.

    This is the message I get when I run the docker-compose up command from PowerShell:

    PS C:\users\myusername\4cat> docker-compose up
    [+] Running 2/2
     - Container cat_db_1  Running    0.0s
     - Container api       Recreated  0.7s
    Attaching to api, db_1
    api | Waiting for postgres...
    api | PostgreSQL started
    api | 1
    api | Seed present
    api | Starting app
    api | Running migrations
    api |
    api | 4CAT migration agent
    api | ------------------------------------------
    api | Current 4CAT version: 1.9
    api | Checked out version: 1.16
    api | The following migration scripts will be run:
    api |   migrate-1.9-1.10.py
    api |   migrate-1.10-1.11.py
    api |   migrate-1.11-1.12.py
    api |   migrate-1.12-1.13.py
    api |   migrate-1.13-1.14.py
    api |   migrate-1.14-1.15.py
    api | WARNING: Migration can take quite a while. 4CAT will not be available during migration.
    api | If 4CAT is still running, it will be shut down now.
    api | Do you want to continue [y/n]? Traceback (most recent call last):
    api |   File "helper-scripts/migrate.py", line 142, in <module>
    api |     if not args.yes and input("").lower() != "y":
    api | EOFError: EOF when reading a line
    api exited with code 1
    opened by robbydigital 15
  • Unknown local index '4chan_posts' in search request

    We managed to overcome our previous issue thanks to your advice. However, we are now stuck with an error related to the indexes, appearing whenever we query 4chan.

    First we generated the sphinx.conf using helper-scripts/generate_sphinx_config.py. This results in the following indexes:

    [...]

    /* Indexes */

    index 4cat_index {
        min_infix_len = 3
        html_strip = 1
        type = template
        charset_table = 0..9, a..z, _, A..Z->a..z, U+47, U+58, U+40, U+41, U+00C0->a, U+00C1->a, U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00C7->c,$
    }

    index 4chan_posts : 4cat_index {
        type = plain
        source = 4chan_posts_old
        path = /opt/sphinx/data/4chan_posts
    }

    index 4chan_posts : 4cat_index {
        type = plain
        source = 4chan_posts_new
        path = /opt/sphinx/data/4chan_posts
    }

    [...]

    However, starting Sphinx with this setup results in the following error:

    Mar 16 11:48:44 dev sphinxsearch[505]: ERROR: section '4chan_posts' (type='index') already exists in /etc/sphinxsearch/sphinx.conf line 51 col 19.

    I have then attempted to comment out one of the indexes and/or change the path, which allows Sphinx to start. However, another error then appears once collection has been initiated:

    16-03-2020 11:50:54 | ERROR (threading.py:884): Sphinx crash during query deb9cfe3e0a47d56612fd6e453208ed6: (1064, "unknown local index '4chan_posts' in search request\x00")
    

    Hope you can once again help me figure out how the indexes should be set up.

    opened by bornakke 12
  • Installing problem: frontend failed to run with 'docker-compose up' command

    When running the command docker-compose up, the database and backend components come up fine, but the frontend never finishes starting and always gets stuck at "[INFO] Booting worker with pid: 12". The problem is still there after restarting the frontend component in the Docker UI.

    docker issue 
    opened by baiyuan523 11
  • Error "string indices must be integers" from search_twitter.py:403

    Error "string indices must be integers" from search_twitter.py:403

    From our 4cat.log

    21-09-2021 10:48:11 | INFO (processor.py:890): Running processor count-posts on dataset a5eeaf86aa27ff91f212d35880090d70
    21-09-2021 10:48:11 | INFO (processor.py:890): Running processor attribute-frequencies on dataset 659e224c54209146f7551523e8d26633
    21-09-2021 10:48:11 | ERROR (worker.py:890): Processor count-posts raised TypeError while processing dataset a5eeaf86aa27ff91f212d35880090d70 (via 76e33804acca3ac18d3cfa8de8059780) in count_posts.py:59->processor.py:316->search_twitter.py:403:
       string indices must be integers
    
    21-09-2021 10:48:11 | ERROR (worker.py:890): Processor attribute-frequencies raised TypeError while processing dataset 659e224c54209146f7551523e8d26633 (via 01db05ce10f58b320a397d68b61986a2) in rank_attribute.py:132->processor.py:316->search_twitter.py:403:
       string indices must be integers
    

    The line in question is from SearchWithTwitterAPIv2.map_item() https://github.com/digitalmethodsinitiative/4cat/blob/f0e01fb500b7dafb58a05873cf34bf15e288a88c/datasources/twitterv2/search_twitter.py#L403

    and I haven't found a good way to bring 4CAT under a debugger and/or have it inform me of the ID of the offending tweet.

    Could this be related to #169 ?

    opened by xmacex 10
  • AttributeError: 'Namespace' object has no attribute 'release'

    Fresh installation on Mac with Docker from local files. Any idea what I did wrong?

    4cat_backend:

    Waiting for postgres...
    PostgreSQL started
    Database already created

    Traceback (most recent call last):
      File "helper-scripts/migrate.py", line 66, in <module>
        if args.release:
    AttributeError: 'Namespace' object has no attribute 'release'

    4cat_backend EXITED (1)

    bug deployment 
    opened by psegovias 9
  • Docker setup fails to "import config" on macOS Big Sur (M1)

    Discussed in https://github.com/digitalmethodsinitiative/4cat/discussions/191

    Originally posted by p-charis October 25, 2021 Hey everyone! First, thanks a million to the developers for building this & making it available :)

    Now, I managed to get 4CAT working on macOS (latest version, M1 native), but only after I removed the following lines from the docker-setup.py file (line #36 onwards). With these lines in place the installation wouldn't work, as it returned the error that no module named config was found. I suspect it might have something to do with the way Docker runs on macOS generally and the paths it creates, but I haven't figured it out yet. So I just wanted to let the devs know, as well as other macOS users, that if they've had a similar problem, they could try this workaround.

    # Ensure filepaths exist
    import os                  # imports added here so the excerpt is self-contained
    from pathlib import Path

    import config
    for path in [config.PATH_DATA,
                 config.PATH_IMAGES,
                 config.PATH_LOGS,
                 config.PATH_LOCKFILE,
                 config.PATH_SESSIONS,
                 ]:
        if Path(config.PATH_ROOT, path).is_dir():
            pass
        else:
            os.makedirs(Path(config.PATH_ROOT, path))
    
    bug docker issue 
    opened by p-charis 8
  • Tokeniser exclusion list ignores last word in list

    I'm filtering some commonly used words out of a corpus with the Tokenise processor and it only seems to be partially successful. For example, in one month there are 37,325 instances of one word. When I add the word to the reject list, there are still 6,307 instances of the word. So it's getting most, but not all. I'm having the same issue with some common swear words that I'm trying to filter out: most are gone, but some remain. Is there a reason for this?

    Thanks for any insight!

    opened by robbydigital 6
  • Datasource that interfaces with a TCAT instance

    It works, and arguably fixes #117, but:

    • The form looks hideous with the million query fields. Do we need them all for 4CAT? Is there a way to make it look better?
    • The list of bins displayed in the 'create dataset' form simply lists bins from all instances. This can get really long really fast when supporting multiple instances. A custom form control may be necessary to make this user-friendly.
    • The list of bins is loaded synchronously whenever get_options() is run. The result should probably be cached or updated in the background (with a separate worker...?)
    • The data format now follows that of twitterv2's map_item(), but there is quite a bit more data in the TCAT export that we could include.
    opened by stijn-uva 6
  • Update 'FAQ' and 'About' pages

    The 'About' page should probably refer to documentation and guides etc rather than the 'news' thing it's doing now, and the FAQ is still very 4chan-oriented.

    enhancement (mostly) front-end 
    opened by stijn-uva 0
  • Feature request: allow data from linked telegram chat channels to be collected

    Telegram chats have linked "discussion" channels, where users can respond to messages in the main channel. Occasionally, these are also public, and if so, can also be found by the API. It would be useful to allow users to also automatically collect data from these chat channels if they're found.

    A note on this and future feature requests: we (https://github.com/GateNLP) are putting in some additions to the Telegram data collector on our end, and thought it might be worth checking whether there's scope for them to be added to the original/main instance.

    If there are any issues with this, or they don't really fit with what you have in mind for your instance, that's all fine; we'll continue to maintain them on our own fork instead!

    Linked pull request: https://github.com/digitalmethodsinitiative/4cat/pull/322

    enhancement data source 
    opened by muneerahp 1
  • LIHKG data source

    A data source for LIHKG. It uses the web interface's web API, which seems reasonably straightforward and stable. There is some rate limiting, which 4CAT tries to respect by pacing requests and implementing an exponential backoff.
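    For reference, request pacing with exponential backoff generally takes the shape below; this is a generic sketch, not the code in this pull request, and the URL passed in would be a placeholder:

    # Generic exponential backoff for a rate-limited HTTP API; a sketch,
    # not the code in this pull request.
    import time
    import requests

    def get_with_backoff(url, max_retries=5):
        delay = 1  # seconds; doubled after each rate-limited response
        for _ in range(max_retries):
            response = requests.get(url, timeout=30)
            if response.status_code != 429:  # not rate-limited
                return response
            time.sleep(delay)
            delay *= 2
        raise RuntimeError("still rate-limited after %d retries" % max_retries)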

    enhancement data source questionable 
    opened by stijn-uva 0
  • ability to count frequency for specific (and multiple) keywords over time

    A processor that can filter a dataset on multiple particular words or phrases and output the count values (overall, or over time) per item, producing a .csv that can be imported into RAW Graphs to compare the evolution of different words/phrases over time, in either absolute or relative numbers.
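    A rough sketch of that logic with pandas follows; the column names ('timestamp' and 'body') are assumptions that vary per data source:

    # Count occurrences of each keyword per month; a sketch of the requested
    # processor, with assumed column names ('timestamp', 'body').
    import pandas as pd

    keywords = ["climate", "energy"]
    df = pd.read_csv("dataset.csv", parse_dates=["timestamp"])

    counts = pd.DataFrame({
        word: df[df["body"].str.contains(word, case=False, na=False)]
              .resample("M", on="timestamp").size()
        for word in keywords
    }).fillna(0)

    counts.to_csv("keyword-frequencies.csv")  # import into RAW Graphs to chart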

    processors data source 
    opened by daniel-dezeeuw 0
  • Warn about need to update Docker `.env` file when upgrading 4CAT to new version

    When using Docker, the .env file can be used to ensure you pull a particular version of 4CAT. If you then upgrade 4CAT interactively, we cannot modify the .env file (which lives on the user's host machine). If a user removes or rebuilds 4CAT, it will pull the version of 4CAT listed in the .env file, which will not be the latest version that was upgraded to.

    I will look at adding a warning/notification to the upgrade logs to notify users of the need to update their .env file.

    enhancement deployment 
    opened by dale-wahl 0
Releases (v1.29)
  • v1.29 (Oct 6, 2022)

    Snapshot of 4CAT as of October 2022. Many changes and fixes since the last official release, including:

    • Restart and upgrade 4CAT via the web interface (#181, #287, #288)
    • Addition of several processors for Twitter datasets to increase interoperability with DMI-TCAT
    • DMI-TCAT data source, which can interface with a DMI-TCAT instance to create datasets from tweets stored therein (#226)
    • LinkedIn data source, to be used together with Zeeschuimer
    • Fixes & improvements to Docker container set-up and build process (#269, #270, #290)
    • A number of processors have been updated to transparently filter NDJSON datasets instead of turning them into CSV datasets (#253, #282, #291, #292)
    • And many smaller fixes & updates

    From this release onwards, 4CAT can be upgraded to the latest release via the Control Panel in the web interface.

  • v1.26 (May 10, 2022)

    Many updates:

    • Configuration is now stored in the database and (mostly) editable via the web GUI
    • The Telegram datasource now collects more data and stores the 'raw' message objects as NDJSON
    • Dialogs in the web UI now use custom widgets instead of alert()
    • Twitter datasets will retrieve the expected number of tweets before capturing, and ask for confirmation if it is a high number
    • Various fixes and tweaks to the Dockerfiles
    • New extended data source information pages with details about limitations, caveats, useful links, etc
    • And much more
  • v1.25 (Feb 24, 2022)

    Snapshot of 4CAT as of 24 February 2022. Many changes and fixes since the last official release, including:

    • Explore and annotate your datasets interactively with the new Explorer (beta)
    • Datasets can be set to automatically get deleted after a set amount of time, and can be made private
    • Incremental refinement of the web interface
    • Twitter datasets can be exported to a DMI-TCAT instance
    • User accounts can now be deactivated (banned)
    • Many smaller fixes and new features
  • v1.21 (Sep 28, 2021)

    Snapshot of 4CAT as of 28 September 2021. Many changes and fixes since the last official release, including:

    • User management via control panel
    • Improved Docker support
    • Improved 4chan data dump import helper scripts
    • Improved country code filtering for 4chan/pol/ datasets
    • More robust and versatile network analysis processors
    • Various new filter processors
    • Topic modeling processor
    • Support for non-academic Twitter API queries
    • Option to download NDJSON datasets as CSV
    • Support for hosting 4CAT with a non-root URL
    • And many more
  • v1.18a (May 7, 2021)

  • v1.17 (Apr 8, 2021)

  • v1.9b1 (Jan 17, 2020)

  • v1.0b1 (Feb 28, 2019)

    4CAT is now ready for wider use! It offers...

    • An API that can be used to queue and manipulate queries programmatically
    • Diverse analytical post-processors that may be combined to further analyse data sets
    • A flexible interface for adding various data sources
    • A robust scraper
    • A very retro interface
Owner
Digital Methods Initiative
The Digital Methods Initiative (DMI) is one of Europe's leading Internet Studies research groups. Research tools it develops are collected here.