Precision Medicine Knowledge Graph (PrimeKG)

Last update: Dec 10, 2022

Overview

Website | bioRxiv Paper | Harvard Dataverse

Precision Medicine Knowledge Graph (PrimeKG) presents a holistic view of diseases. PrimeKG integrates 20 high-quality biomedical resources to describe 17,080 diseases with 4,050,249 relationships representing ten major biological scales, considerably expanding previous efforts in disease-rooted knowledge graphs. We accompany PrimeKG’s graph structure with text descriptions of clinical guidelines for drugs and diseases to enable multimodal analyses.

Updates

PrimeKG is live on bioRxiv and Harvard Dataverse!

Unique Features of PrimeKG

Diverse coverage of diseases: PrimeKG contains over 17,000 diseases, including rare diseases. Disease nodes in PrimeKG are densely connected to other nodes in the graph and have been optimized for clinical relevance in downstream precision medicine tasks.
Heterogeneous knowledge graph: PrimeKG contains over 100,000 nodes distributed over various biological scales. PrimeKG also contains over 4 million relationships between these nodes, distributed over 29 types of edges.
Multimodal integration of clinical knowledge: Disease and drug nodes in PrimeKG are augmented with clinical descriptors that come from medical authorities such as Mayo Clinic, Orphanet, and DrugBank.
Ready-to-use datasets: PrimeKG is minimally dependent on external packages. Our knowledge graph can be retrieved in a ready-to-use format from Harvard Dataverse.
Data functions: PrimeKG provides extensive data functions, including processors for primary resources and scripts to build an updated knowledge graph.
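
As a rough illustration of the heterogeneous structure described above, the sketch below tallies edges by relation type from an edge list in CSV form. The column name `relation` and the sample rows are assumptions made for illustration only; check the header of the file you retrieved from Harvard Dataverse before relying on them.

```python
import csv
import io
from collections import Counter


def count_relations(handle, relation_column="relation"):
    """Count edges per relation type in a CSV edge list.

    `relation_column` is an assumed column name, not confirmed by this README.
    """
    reader = csv.DictReader(handle)
    return Counter(row[relation_column] for row in reader)


# Tiny made-up sample standing in for kg.csv (hypothetical rows and columns).
sample = io.StringIO(
    "relation,x_name,y_name\n"
    "indication,drug A,disease B\n"
    "indication,drug C,disease B\n"
    "disease_phenotype_positive,disease B,phenotype D\n"
)
counts = count_relations(sample)
print(counts.most_common())
```

To run this on a downloaded file instead of the sample, open it with `open("kg.csv", newline="")` and pass the handle to `count_relations`. The standard library is used here to keep the sketch free of extra dependencies.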

Environment setup

Using `pip`

To install the dependencies required to run the PrimeKG code, use pip:

pip install -r requirements.txt

Or use `conda`

conda env create --name PrimeKG --file=environments.yml

Building an updated PrimeKG

Downloading primary data resources

All persistent identifiers and weblinks needed to download the 20 primary data resources used to build PrimeKG are provided in the Data Records section of our article. We also list the exact filenames that were downloaded from each resource so they can be cross-checked.

Curating primary data resources

We provide the scripts used to process all primary data resources and the names of the resulting output files generated by those scripts. We would be happy to share the intermediate processing datasets that were used to create PrimeKG on request.

Database	Processing scripts	Expected script output
Bgee	bgee.py	anatomy_gene.csv
Comparative Toxicogenomics Database	ctd.py	exposure_data.csv
DisGeNET	-	curated_gene_disease_associations.tsv
DrugBank	drugbank_drug_drug.py	drug_drug.csv
DrugBank	parsexml_drugbank.ipynb, Parsed_feature.ipynb	12 drug feature files
DrugBank	drugbank_drug_protein.py	drug_protein.csv
Drug Central	drugcentral_queries.txt	drug_disease.csv
Drug Central	drugcentral_feature.Rmd	dc_features.csv
Entrez Gene	ncbigene.py	protein_go_associations.csv
Gene Ontology	go.py	go_terms_info.csv, go_terms_relations.csv
Human Phenotype Ontology	hpo.py, hpo_obo_parser.py	hp_terms.csv, hp_parents.csv, hp_references.csv
Human Phenotype Ontology	hpoa.py	disease_phenotype_pos.csv, disease_phenotype_neg.csv
MONDO	mondo.py, mondo_obo_parser.py	mondo_terms.csv, mondo_parents.csv, mondo_references.csv, mondo_subsets.csv, mondo_definitions.csv
Reactome	reactome.py	reactome_ncbi.csv, reactome_terms.csv, reactome_relations.csv
SIDER	sider.py	sider.csv
UBERON	uberon.py	uberon_terms.csv, uberon_rels.csv, uberon_is_a.csv
UMLS	umls.py, map_umls_mondo.py	umls_mondo.csv
UMLS	umls.ipynb	umls_def_disorder_2021.csv, umls_def_disease_2021.csv
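
Before moving on to the harmonization step, it can help to confirm that the processing scripts produced the files listed in the table. The following is a minimal sketch, assuming all outputs are written to a single flat directory (the README does not specify the layout). The helper name `missing_outputs` is hypothetical, and only a subset of the table is included to keep the example short.

```python
import tempfile
from pathlib import Path

# A subset of the expected outputs listed in the table above.
EXPECTED_OUTPUTS = {
    "bgee.py": ["anatomy_gene.csv"],
    "ctd.py": ["exposure_data.csv"],
    "go.py": ["go_terms_info.csv", "go_terms_relations.csv"],
    "sider.py": ["sider.csv"],
    "uberon.py": ["uberon_terms.csv", "uberon_rels.csv", "uberon_is_a.csv"],
}


def missing_outputs(data_dir, expected=EXPECTED_OUTPUTS):
    """Return {script: [missing file names]} for files not found in data_dir."""
    root = Path(data_dir)
    missing = {}
    for script, filenames in expected.items():
        absent = [name for name in filenames if not (root / name).is_file()]
        if absent:
            missing[script] = absent
    return missing


# Demo on a temporary directory that contains only sider.csv.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sider.csv").touch()
    report = missing_outputs(tmp)
print(report)
```

An empty dictionary means every listed file was found. Extend `EXPECTED_OUTPUTS` with the remaining rows of the table as needed.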

Harmonizing datasets into PrimeKG

The code to harmonize datasets and construct PrimeKG is available in build_graph.ipynb. Run this Jupyter notebook to construct the knowledge graph from the outputs of the processing scripts listed above. The notebook produces all three versions of PrimeKG: kg_raw.csv, kg_giant.csv, and the complete version, kg.csv.
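
This README does not describe how the three versions differ. Going by the file name alone, one plausible reading is that kg_giant.csv keeps only the largest connected component of the raw graph. The sketch below shows that general technique on a toy edge list; it is an assumption for illustration and is not taken from build_graph.ipynb. The function name `largest_component_edges` is hypothetical.

```python
from collections import defaultdict, deque


def largest_component_edges(edges):
    """Return the edges that lie in the largest connected component.

    `edges` is an iterable of (source, target) pairs, treated as undirected.
    """
    edges = list(edges)
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    seen = set()
    best = set()
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search collects one connected component.
        component = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in component:
                    component.add(nxt)
                    queue.append(nxt)
        seen |= component
        if len(component) > len(best):
            best = component
    # Both endpoints of an edge share a component, so checking one is enough.
    return [(u, v) for u, v in edges if u in best]


toy_edges = [("a", "b"), ("b", "c"), ("x", "y")]
giant = largest_component_edges(toy_edges)
print(giant)
```

Consult the notebook itself for the actual filtering steps applied between kg_raw.csv, kg_giant.csv, and kg.csv.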

Feature extraction

The code required to engineer features can be found at engineer_features.ipynb and mapping_mayo.ipynb.

Cite Us

If you find PrimeKG useful, cite our work:

@article{chandak2022building,
  title={Building a knowledge graph to enable precision medicine},
  author={Chandak, Payal and Huang, Kexin and Zitnik, Marinka},
  journal={bioRxiv},
  doi={10.1101/2022.05.01.489928},
  URL={https://www.biorxiv.org/content/early/2022/05/01/2022.05.01.489928},
  year={2022}
}

Data Server

PrimeKG is hosted on Harvard Dataverse under the persistent identifier https://doi.org/10.7910/DVN/IXA7BM. PrimeKG datasets cannot be retrieved while Dataverse is under maintenance. This happens rarely; if a download fails, please check the status on the Dataverse website.

License

The PrimeKG codebase is released under the MIT license. For individual dataset usage, please refer to the dataset license on the website.

Owner

Machine Learning for Medicine and Science @ Harvard
