Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs (CIKM 2020)

Overview

Version License repo size Arxiv build badge coverage badge


Karate Club is an unsupervised machine learning extension library for NetworkX.

Please look at the Documentation, relevant Paper, Promo Video, and External Resources.

Karate Club consists of state-of-the-art methods to do unsupervised learning on graph structured data. To put it simply it is a Swiss Army knife for small-scale graph mining research. First, it provides network embedding techniques at the node and graph level. Second, it includes a variety of overlapping and non-overlapping community detection methods. Implemented methods cover a wide range of network science (NetSci, Complenet), data mining (ICDM, CIKM, KDD), artificial intelligence (AAAI, IJCAI) and machine learning (NeurIPS, ICML, ICLR) conferences, workshops, and pieces from prominent journals.

The newly introduced graph classification datasets are available at SNAP, TUD Graph Kernel Datasets, and GraphLearning.io.


Citing

If you find Karate Club and the new datasets useful in your research, please consider citing the following paper:

@inproceedings{karateclub,
       title = {{Karate Club: An API Oriented Open-source Python Framework for Unsupervised Learning on Graphs}},
       author = {Benedek Rozemberczki and Oliver Kiss and Rik Sarkar},
       year = {2020},
       pages = {3125–3132},
       booktitle = {Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM '20)},
       organization = {ACM},
}

A simple example

Karate Club makes the use of modern community detection techniques quite easy (see here for the accompanying tutorial). For example, this is all it takes to use on a Watts-Strogatz graph Ego-splitting:

import networkx as nx
from karateclub import EgoNetSplitter

g = nx.newman_watts_strogatz_graph(1000, 20, 0.05)

splitter = EgoNetSplitter(1.0)

splitter.fit(g)

print(splitter.get_memberships())

Models included

In detail, the following community detection and embedding methods were implemented.

Overlapping Community Detection

Non-Overlapping Community Detection

Neighbourhood-Based Node Level Embedding

Structural Node Level Embedding

Attributed Node Level Embedding

Meta Node Embedding

Graph Level Embedding

Head over to our documentation to find out more about installation and data handling, a full list of implemented methods, and datasets. For a quick start, check out our examples.

If you notice anything unexpected, please open an issue and let us know. If you are missing a specific method, feel free to open a feature request. We are motivated to constantly make Karate Club even better.


Installation

Karate Club can be installed with the following pip command.

$ pip install karateclub

As we create new releases frequently, upgrading the package casually might be beneficial.

$ pip install karateclub --upgrade

Running examples

As part of the documentation we provide a number of use cases to show how the clusterings and embeddings can be utilized for downstream learning. These can accessed here with detailed explanations.

Besides the case studies we provide synthetic examples for each model. These can be tried out by running the example scripts. In order to run one of the examples, the Graph2Vec snippet:

$ cd examples/whole_graph_embedding/
$ python graph2vec_example.py

Running tests

$ python setup.py test

License

Issues
  • GL2vec : RuntimeError: you must first build vocabulary before training the model

    GL2vec : RuntimeError: you must first build vocabulary before training the model

    Hello, First thanks for your work, it's just great.

    However, I have an error while trying to run GL2vec on my dataset, while it works perfectly with the example. Where is exactly this type of error coming from ?

    Thanks in advance

    opened by hug0prevoteau 14
  • How to build my own dataset?

    How to build my own dataset?

    I have to build graphs, and following that I have to generate graph embedding.

    I checked the documentation i.e. https://karateclub.readthedocs.io/.

    But I didn't understand how to build my own graphs.

    1. Can you please point out a sample code where you create dataset from scratch?
    2. I have already checked code here. But they all load pre-defined dataset.
    3. Can you show any code snippet where you create graph i.e. create nodes and add edges.
    4. How to set attributes (features) for the nodes and edges?

    Thanks in advance for your help.

    I am following the https://karateclub.readthedocs.io/en/latest/notes/installation.html.

    opened by smith-co 9
  • About GL2vec

    About GL2vec

    Hello, thanks for the awesome work!!

    It seems that there are 2 mistakes in the implementation of GL2vec module.

    The first one is :

    in the code below, """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" def _create_line_graph(self, graph): r"""Getting the embedding of graphs. Arg types: * graph (NetworkX graph) - The graph transformed to be a line graph. Return types: * line_graph (NetworkX graph) - The line graph of the source graph. """ graph = nx.line_graph(graph) node_mapper = {node: i for i, node in enumerate(graph.nodes())} edges = [[node_mapper[edge[0]], node_mapper[edge[1]]] for edge in graph.edges()] line_graph = nx.from_edgelist(edges) return line_graph """"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" when converting graph G to line graph LG, the method "_create_line_graph()" ignores the edge attribute of G. (It means that there will be no node arrtibutes in LG) so consequently, the method "WeisfeilerLehmanHashing" will not use the attribute information, and will always use the structural information (degree) instead.

    The second one is :

    The GL2vec module only returns the embedding of line graph. But in the original paper of GL2vec, they concatenate the embedding of graph and of line graph.
    Then named the framework "GL2vec", which means "Graph and Line graph to vector".

    Only use the embedding of line graph for downstream task may lead to worse performance.

    We noticed that when applying the embeddings to the graph classification task, (when graph both have node attribute and edge attribute) the performance (accuracy) are as follow: concat(G , LG) > G > LG

    Hope it helps :)

    opened by cheezy88 8
  • Classifying with graph2vec model

    Classifying with graph2vec model

    I'm able to obtain an embedding for a list of NetworkX graphs using graph2vec, and I was wondering if karateclub has a function to make classifications for graphs outside the training set? That is, given my embedding, I want to input a graph outside my original graph list (used in the model) and obtain a list of most similar graphs (something like a "most similar" function).

    opened by joseluisfalla 6
  • How to improve the performance of Graph2Vec model fit function ?

    How to improve the performance of Graph2Vec model fit function ?

    I tried to increase the performance of the Graph2Vec model by using increasing the worker parameter when initializing the model. But it seems that still, the model takes only 1 core to process the fit function.

    Is method I have used to assign the workers correct ? Is there another method to improve the performance ?

    model =  Graph2Vec(workers=28)
    graphs_list=create_graph_list(graph_df)
    model.fit(graphs_list)
    graph_x = model.get_embedding()
    
    opened by 1209973 6
  • Is consecutive numeric indexing necessary for Graph2Vec?

    Is consecutive numeric indexing necessary for Graph2Vec?

    Thanks for the awesome work, networkx is truly helpful when we are dealing with Graph data structure.

    I'm trying to get graph embedding using Graph2vec so that we could compare similarity among graphs. But I'm stuck in this assertion: assert numeric_indices == node_indices, "The node indexing is wrong."

    Say if we have two graphs, each node in the graph represents a word. We build a mapping so that we could replace text with number. For example, whenever the word "Library" occurs in any graph, we label it with the number "2". In this case, the indexes inside one graph might not be consecutive because the mapping is created from a number of graphs.

    So is it still necessary for enforce consecutive indexing in this case? Or I understand the usage of Graph2Vec wrong?

    opened by bdeng3 5
  • Graph Embeddings using node features and inductivity

    Graph Embeddings using node features and inductivity

    Hello,

    First of all thank you for this amazing library! I have a serie of small graphs where each node contains features and I am trying to learn graph-level embedding in an unsupervised manner. However, I couldn't find how to load node features in the graphs before feeding them to a graph embedding algorithm. Could you describe the input needed by the algorithms ?

    Also, is it possible to generate embedding with some sort of forward function once the models are trained (without retraining the model) ? I.e. does the library support inductivity ?

    Thank you!

    opened by TrovatelliT 5
  • graph2vec implementation and graphs with missing nodes

    graph2vec implementation and graphs with missing nodes

    Hi there,

    first of all, thanks a lot for developing this, it has potential to simplify in-silico experiments on biological networks and I am grateful for that!

    I have a question related to the graph2vec implementation. The requirement of the package for graph notation is that nodes have to be named with integers starting from 0 and have to be consecutive. I am working with a collection of 9.000 small networks and would like to embed all of them into an N-dimensional space. Now, all those networks consist of about 25.000 nodes but in some networks these nodes (here it's really genes) are missing (not all genes are supposed to be present in all networks).

    If I rename all my nodes from actual gene names to integers and know that some networks don't have all the genes, I will end up with some networks without consecutive node names, e.g. there will be (..), 20, 21, 24, 25, (...) in one network and perhaps (...), 20, 21, 22, 24, 25, (...) in another. That would violate the requirement of being consecutive.

    My question is: is the implementation aware that a node 25 is the same object between the different networks? Or is it not important and in reality the embedding only takes into account the structure only and I should 'rename' all my networks separately to keep the node naming consecutive?

    opened by kajocina 5
  • Multithreading for WL hasing function

    Multithreading for WL hasing function

    Hi!

    Maybe just another suggestion. In the embedding algorithms, the WeisfeilerLehmanHashing function in the fit function could be time-consuming and the WL hashing function for each graph is independent. Therefore, maybe using multhreading from python can speed them up and I modify the code for my application of graph2vec:

    ==================================

    def fit(self, graphs):
        """
        Fitting a Graph2Vec model.
    
        Arg types:
            * **graphs** *(List of NetworkX graphs)* - The graphs to be embedded.
        """
        pool = ThreadPool(8)
        args_generator = [(graph, self.wl_iterations, self.attributed) for graph in graphs]
        documents = pool.starmap(WeisfeilerLehmanHashing, args_generator)
        pool.close()
        pool.join()
        #documents = [WeisfeilerLehmanHashing(graph, self.wl_iterations, self.attributed) for graph in graphs]
        documents = [TaggedDocument(words=doc.get_graph_features(), tags=[str(i)]) for i, doc in enumerate(documents)]
    
        model = Doc2Vec(documents,
                        vector_size=self.dimensions,
                        window=0,
                        min_count=self.min_count,
                        dm=0,
                        sample=self.down_sampling,
                        workers=self.workers,
                        epochs=self.epochs,
                        alpha=self.learning_rate,
                        seed=self.seed)
    
        self._embedding = [model.docvecs[str(i)] for i, _ in enumerate(documents)]
    
    opened by zslwyuan 5
  •     assert connected or self.allow_disjoint,

    assert connected or self.allow_disjoint, "Graph is not connected." AttributeError: 'Graph2Vec' object has no attribute 'allow_disjoint'

    I am receiving this error after running the provided example on my own dataset: a set of sub-graphs, each representing one sessions of my experiment. How may solve this?

    The graph is un-directed and without features.

    opened by jiruhameru 4
  • Questions regarding RMST-RBS

    Questions regarding RMST-RBS

    There is this paper called "Community detection and role identification in directed networks: understanding the Twitter network of the care.data debate" https://arxiv.org/abs/1508.03165, which follows "Interest communities and flow roles in directed networks: the Twitter network of the UK riots" https://arxiv.org/abs/1311.6785 and "Finding role communities in directed networks using Role-Based Similarity, Markov Stability and the Relaxed Minimum Spanning Tree" https://arxiv.org/abs/1309.1795

    The papers mainly addresses the optimization of node embedding coupled with community detection as its main goal.

    1. Are there any algorithms that does both of them at the same time? (e.g. Node2Vec)
    2. Would RMST-RBS be included in KarateClub in the near future?
    opened by BrandonKMLee 4
  • Add get_params method to BaseEstimator

    Add get_params method to BaseEstimator

    Hi there, thanks for a great library 🙏

    • Implements a get_params method for BaseEstimator
    • Consistent with scikit-learn API and useful for model tracking in MLflow
    • Passing CI ✅

    This is my first, albeit minor, contribution to an open-source package so hope all OK 🤞

    Thanks again and all good wishes Chris

    opened by tomlincr 3
Releases(v_10301)
Owner
Benedek Rozemberczki
Machine Learning Engineer at AstraZeneca and PhD candidate at The University of Edinburgh.
Benedek Rozemberczki
Kedro is an open-source Python framework for creating reproducible, maintainable and modular data science code

A Python framework for creating reproducible, maintainable and modular data science code.

QuantumBlack Labs 7.5k Aug 19, 2022
CONCEPT (COsmological N-body CodE in PyThon) is a free and open-source simulation code for cosmological structure formation

CONCEPT (COsmological N-body CodE in PyThon) is a free and open-source simulation code for cosmological structure formation. The code should run on any Linux system, from massively parallel computer clusters to laptops.

Jeppe Dakin 58 Apr 1, 2022
An open-source application for biological image analysis

CellProfiler is a free open-source software designed to enable biologists without training in computer vision or programming to quantitatively measure

CellProfiler 693 Aug 15, 2022
OPEM (Open Source PEM Fuel Cell Simulation Tool)

Table of contents What is PEM? Overview Installation Usage Executable Library Telegram Bot Try OPEM in Your Browser! MATLAB Issues & Bug Reports Contr

ECSIM 116 Aug 12, 2022
Mathics is a general-purpose computer algebra system (CAS). It is an open-source alternative to Mathematica

Mathics is a general-purpose computer algebra system (CAS). It is an open-source alternative to Mathematica. It is free both as in "free beer" and as in "freedom".

Mathics 158 Aug 7, 2022
PsychoPy is an open-source package for creating experiments in behavioral science.

PsychoPy is an open-source package for creating experiments in behavioral science. It aims to provide a single package that is: precise enoug

PsychoPy 1.3k Aug 19, 2022
Open Delmic Microscope Software

Odemis Odemis (Open Delmic Microscope Software) is the open-source microscopy software of Delmic B.V. Odemis is used for controlling microscopes of De

Delmic 30 Aug 12, 2022
A framework for feature exploration in Data Science

Beehive A framework for feature exploration in Data Science Background What do we do when we finish one episode of feature exploration in a jupyter no

Steven IJ 1 Jan 3, 2022
Probabilistic Programming in Python: Bayesian Modeling and Probabilistic Machine Learning with Aesara

PyMC3 is a Python package for Bayesian statistical modeling and Probabilistic Machine Learning focusing on advanced Markov chain Monte Carlo (MCMC) an

PyMC 6.9k Aug 13, 2022
3D visualization of scientific data in Python

Mayavi: 3D visualization of scientific data in Python Mayavi docs: http://docs.enthought.com/mayavi/mayavi/ TVTK docs: http://docs.enthought.com/mayav

Enthought, Inc. 1k Aug 13, 2022
Datamol is a python library to work with molecules

Datamol is a python library to work with molecules. It's a layer built on top of RDKit and aims to be as light as possible.

datamol 152 Aug 18, 2022
Incubator for useful bioinformatics code, primarily in Python and R

Collection of useful code related to biological analysis. Much of this is discussed with examples at Blue collar bioinformatics. All code, images and

Brad Chapman 546 Aug 13, 2022
Statsmodels: statistical modeling and econometrics in Python

About statsmodels statsmodels is a Python package that provides a complement to scipy for statistical computations including descriptive statistics an

statsmodels 7.7k Aug 20, 2022
A computer algebra system written in pure Python

SymPy See the AUTHORS file for the list of authors. And many more people helped on the SymPy mailing list, reported bugs, helped organize SymPy's part

SymPy 9.5k Aug 22, 2022
PennyLane is a cross-platform Python library for differentiable programming of quantum computers.

PennyLane is a cross-platform Python library for differentiable programming of quantum computers. Train a quantum computer the same way as a neural network.

PennyLaneAI 1.5k Aug 19, 2022
SCICO is a Python package for solving the inverse problems that arise in scientific imaging applications.

Scientific Computational Imaging COde (SCICO) SCICO is a Python package for solving the inverse problems that arise in scientific imaging applications

Los Alamos National Laboratory 21 Aug 11, 2022
Efficient Python Tricks and Tools for Data Scientists

Why efficient Python? Because using Python more efficiently will make your code more readable and run more efficiently.

Khuyen Tran 677 Aug 18, 2022
Float2Binary - A simple python class which finds the binary representation of a floating-point number.

Float2Binary A simple python class which finds the binary representation of a floating-point number. You can find a class in IEEE754.py file with the

Bora Canbula 3 Dec 14, 2021
A simple computer program made with Python on the brachistochrone curve.

Brachistochrone-curve This is a simple computer program made with Python on the brachistochrone curve. I decided to write it after a physics lesson on

Diego Romeo 1 Dec 16, 2021