A Python module for clustering creators of social media content into networks

Overview

sm_content_clustering

A Python module for clustering creators of social media content into networks.

Currently supports identifying potential networks of Facebook Pages in the CSV output files from CrowdTangle.

Installation

Can install via pip with

pip install git+https://github.com/jdallen83/sm_content_clustering

Install requires pandas and fasttext

Language Prediction

To enable language prediction, you will need to download a fasttext language model. Module was tested with lid.176.ftz.

Usage

Command line

Can be called as a module for command line usage.

For usage guide:

python -m sm_content_clustering -h

Example that will create an output CSV with potential networks and predicted languages from several input CSVs:

python -m sm_content_clustering --add_language --ft_model_path /path/to/lid.176.ftz --output_path /path/to/output.csv --min_threshold 0.03 /path/to/input_1.csv /path/to/input_2.csv

Python

Module can also be called from within Python.

Example that will generate a Pandas dataframe that contains potential networks:

import sm_content_clustering.sm_processor as sm_processor

input_files = ['/path/to/1.csv', '/path/to/2.csv', '/path/to/3.csv']
df = sm_processor.ct_generate_page_clusters(input_files, add_language=True, ft_model_path='/path/to/lid.176.ftz')
print(df)

Options

Arguments for sm_processor.ct_generate_page_clusters() are

  1. infiles: Input files to read content from. Required.
  2. content_cols: Which columns from the input files to use as content for the purposes of clustering identical posts. Default: Message, Image Text, Link, Link Text
  3. add_language: Whether to predict the page and network languages. Default: False
  4. ft_model_path: Path to fasttext model file. Default: None
  5. outfile: Path to write output CSV with potential networks. Default: None
  6. update_every: How often to output clustering status. (Print status 1 every N pages). Default: 1000
  7. min_threshold: Minimum similarity score for clustering. Possible range between 0 and 1, with 1 being perfect high confidence overlap, and 0 being no overlap. Default: 0.03
  8. second_cluster_factor: Requirement that the best matched cluster for a page must score a factor X above the second best matched cluster. Default: 2.5

Methodology

Module assumes you have social media content, which includes the body content of a message and the account that created it. It begins by grouping by all messages, and finds which accounts have shared identical messages within the dataset. It then applies a basic agglomerative clustering algorithm to group the accounts into clusters that are frequently sharing the same messages.

The clustering loops through the list of all accounts, normally sorted in reverse size or popularity, and for each account, searches all existing clusters to see if there is a valid match, given the min_threshold and second_cluster_factor parameters. If there is a match, the account is added to the existing cluster. If there is not a match, then, if there is enough messages from the account to justify, a new cluster will be created with the account acting as the seed. Otherwise the account is discarded.

In theory, any measure could be used to determine if a given account should be added to a given cluster, such as, what fraction of the accounts messages match those within the cluster. Currently, the module combines message coverage, Normalized Pointwise Mutual Information, and a dampening factor that reduces matching score when there is an insufficient number of messages to be confident.

At the end, any clusters that are below a size threshold are discarded.

License

MIT License

Time ranges with python

timeranges Time ranges. Read the Docs Installation pip timeranges is available on pip: pip install timeranges GitHub You can also install the latest v

Micael Jarniac 2 Sep 01, 2022
Projeto para realizar o RPA Challenge . Utilizando Python e as bibliotecas Selenium e Pandas.

RPA Challenge in Python Projeto para realizar o RPA Challenge (www.rpachallenge.com), utilizando Python. O objetivo deste desafio é criar um fluxo de

Henrique A. Lourenço 1 Apr 12, 2022
ASTR 302: Python for Astronomy (Winter '22)

ASTR 302, Winter 2022, University of Washington: Python for Astronomy Mario Jurić Location When: 2:30-3:50, Monday & Wednesday, Winter quarter 2022 Wh

UW ASTR 302: Python for Astronomy 4 Jan 12, 2022
[CVPR2022] This repository contains code for the paper "Nested Collaborative Learning for Long-Tailed Visual Recognition", published at CVPR 2022

Nested Collaborative Learning for Long-Tailed Visual Recognition This repository is the official PyTorch implementation of the paper in CVPR 2022: Nes

Jun Li 65 Dec 09, 2022
An orchestration platform for the development, production, and observation of data assets.

Dagster An orchestration platform for the development, production, and observation of data assets. Dagster lets you define jobs in terms of the data f

Dagster 6.2k Jan 08, 2023
Pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.

weightedcalcs weightedcalcs is a pandas-based Python library for calculating weighted means, medians, standard deviations, and more. Features Plays we

Jeremy Singer-Vine 98 Dec 31, 2022
Hatchet is a Python-based library that allows Pandas dataframes to be indexed by structured tree and graph data.

Hatchet Hatchet is a Python-based library that allows Pandas dataframes to be indexed by structured tree and graph data. It is intended for analyzing

Lawrence Livermore National Laboratory 14 Aug 19, 2022
pipeline for migrating lichess data into postgresql

How Long Does It Take Ordinary People To "Get Good" At Chess? TL;DR: According to 5.5 years of data from 2.3 million players and 450 million games, mo

Joseph Wong 182 Nov 11, 2022
The OHSDI OMOP Common Data Model allows for the systematic analysis of healthcare observational databases.

The OHSDI OMOP Common Data Model allows for the systematic analysis of healthcare observational databases.

Bell Eapen 14 Jan 02, 2023
This tool parses log data and allows to define analysis pipelines for anomaly detection.

logdata-anomaly-miner This tool parses log data and allows to define analysis pipelines for anomaly detection. It was designed to run the analysis wit

AECID 32 Nov 27, 2022
A Numba-based two-point correlation function calculator using a grid decomposition

A Numba-based two-point correlation function (2PCF) calculator using a grid decomposition. Like Corrfunc, but written in Numba, with simplicity and hackability in mind.

Lehman Garrison 3 Aug 24, 2022
Weather Image Recognition - Python weather application using series of data

Weather Image Recognition - Python weather application using series of data

Kushal Shingote 1 Feb 04, 2022
Improving your data science workflows with

Make Better Defaults Author: Kjell Wooding [email protected] This is the git re

Kjell Wooding 18 Dec 23, 2022
An ETL framework + Monitoring UI/API (experimental project for learning purposes)

Fastlane An ETL framework for building pipelines, and Flask based web API/UI for monitoring pipelines. Project structure fastlane |- fastlane: (ETL fr

Dan Katz 2 Jan 06, 2022
Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Data Scientist Learning Plan Demonstrate the breadth and depth of your data science skills by earning all of the Databricks Data Scientist credentials

Trung-Duy Nguyen 27 Nov 01, 2022
pandas: powerful Python data analysis toolkit

pandas is a Python package that provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data both easy and intuitive.

pandas 36.4k Jan 03, 2023
a tool that compiles a csv of all h1 program stats

h1stats - h1 Program Stats Scraper This python3 script will call out to HackerOne's graphql API and scrape all currently active programs for informati

Evan 40 Oct 27, 2022
MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data.

MetPy MetPy is a collection of tools in Python for reading, visualizing and performing calculations with weather data. MetPy follows semantic versioni

Unidata 971 Dec 25, 2022
peptides.py is a pure-Python package to compute common descriptors for protein sequences

peptides.py Physicochemical properties and indices for amino-acid sequences. 🗺️ Overview peptides.py is a pure-Python package to compute common descr

Martin Larralde 32 Dec 31, 2022
A real-time financial data streaming pipeline and visualization platform using Apache Kafka, Cassandra, and Bokeh.

Realtime Financial Market Data Visualization and Analysis Introduction This repo shows my project about real-time stock data pipeline. All the code is

6 Sep 07, 2022