Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks

Overview

MOF-Water-Affinity-Prediction-

The following Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks (MOFs). The training set is extracted from the Cambridge Structural Database and the CoRE_MOF 2019 dataset.

Prediction Model

The prediction model determines whether a given MOF is hydrophobic or hydrophilic. It uses a Random Forest model from the XGBoost library through its scikit-learn interface. The model reads a .csv file of training data and then predicts the water affinity of a user-supplied MOF. The user can specify which input parameters the model uses.
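The workflow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual script: it substitutes scikit-learn's RandomForestClassifier as a stand-in for the XGBoost Random Forest, and the column names, refcodes, and values in TRAINING_CSV are made up for the example.

```python
import csv
import io

from sklearn.ensemble import RandomForestClassifier

# Made-up training rows. Column names, refcodes, and values are placeholders,
# not data from the project's training set.
TRAINING_CSV = """\
refcode,pore_diameter,void_fraction,label
EXAMPLE01,4.1,0.35,hydrophobic
EXAMPLE02,3.8,0.30,hydrophobic
EXAMPLE03,4.4,0.38,hydrophobic
EXAMPLE04,3.9,0.32,hydrophobic
EXAMPLE05,9.2,0.71,hydrophilic
EXAMPLE06,8.7,0.66,hydrophilic
EXAMPLE07,10.1,0.75,hydrophilic
EXAMPLE08,9.5,0.69,hydrophilic
"""


def train_model(csv_text, feature_columns, label_column="label"):
    """Train a Random Forest on the columns chosen by the user."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    X = [[float(row[column]) for column in feature_columns] for row in rows]
    y = [row[label_column] for row in rows]
    # Stand-in for the XGBoost Random Forest used by the project.
    model = RandomForestClassifier(n_estimators=100, random_state=0)
    model.fit(X, y)
    return model


# The user chooses which input parameters to use.
features = ["pore_diameter", "void_fraction"]
model = train_model(TRAINING_CSV, features)

# Predict the water affinity of a user-supplied MOF (values are made up).
prediction = model.predict([[4.0, 0.33]])[0]
print(prediction)
```

Because both libraries expose the same fit/predict interface, swapping in the XGBoost Random Forest classifier should only require changing the line that constructs the model, although XGBoost may expect numerically encoded labels.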

Overfitting/Underfitting

This script investigates how the prediction model's accuracy and precision vary with the number and combination of inputs. It lets a user compare how different combinations of inputs affect the score and the standard deviation of the model's results.

It reads a .csv file of training data containing 13 input parameters, then generates every combination of input parameters for the lengths the user specifies. For example, if the user requests lengths 3, 4, and 10, the program generates all combinations of those lengths and uses each one as the input set for the model. Each combination goes through the same process as in the prediction model above, and its results are added to a .csv file for later analysis. Finally, a plot with filters is created for visualization.
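The combination step can be illustrated with the standard library. This is a sketch only: the names param_1 to param_13 are placeholders for the 13 input parameters, whose real column names are not listed here, and the helper feature_combinations is hypothetical.

```python
from itertools import combinations


def feature_combinations(features, lengths):
    """Yield every combination of `features` for each requested length."""
    for length in lengths:
        yield from combinations(features, length)


# Placeholder names standing in for the 13 input parameters.
features = [f"param_{i}" for i in range(1, 14)]

# Lengths 3, 4, and 10, as in the example above.
combos = list(feature_combinations(features, [3, 4, 10]))
print(len(combos))  # 286 + 715 + 286 = 1287
```

Each tuple in combos would then be passed to the model as its list of input columns. The count grows quickly with the lengths chosen, so the run time of the script depends heavily on which lengths the user requests.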

.cif to .csv Converter

To create a training set for the prediction model, a .csv file must be created with all the available data points: the MOFs and their crystallographic data. The data is collected from three sources: WebCSD, the CoRE_MOF 2019 dataset, and the MOFs' .cif files. Additional values are then calculated from the information in the .cif files.

The code reads a .txt file, a folder, or both, containing the refcodes assigned to the MOFs by the Cambridge Structural Database and the corresponding .cif files. It then searches for these refcodes in the CoRE_MOF 2019 dataset and retrieves the crystallographic data attached to them. It also uses the .cif files to calculate the atomic mass percentage of the metals contained in each MOF. These values are stored in columns 14-17, but the models treat them as a single input parameter in an attempt to relate them to each other. The code also labels each MOF in the training set as hydrophobic or hydrophilic, based on information previously collected from the literature. Finally, it produces a .csv file ready for use in the prediction model.
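As an illustration of the metal mass percentage calculation, here is a minimal sketch. It assumes the element counts are available as a formula string such as the _chemical_formula_sum entry of a .cif file; how the project actually extracts them is not described here. The function metal_mass_percent, the short atomic mass table, and the METALS set are hypothetical and cover only a few elements.

```python
import re

# Small illustrative subset of standard atomic masses (g/mol).
ATOMIC_MASS = {
    "H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
    "Cu": 63.546, "Zn": 65.38, "Zr": 91.224,
}
# Elements treated as metals in this sketch (an assumption, not a full list).
METALS = {"Cu", "Zn", "Zr"}


def metal_mass_percent(formula):
    """Return the mass percentage of each metal in a formula like 'C24 H12 O13 Zn4'."""
    counts = {}
    # Each match is an element symbol followed by an optional count.
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?", formula):
        counts[symbol] = counts.get(symbol, 0.0) + (float(number) if number else 1.0)
    total_mass = sum(ATOMIC_MASS[symbol] * n for symbol, n in counts.items())
    return {
        symbol: 100.0 * ATOMIC_MASS[symbol] * n / total_mass
        for symbol, n in counts.items()
        if symbol in METALS
    }


# Example formula: four Zn atoms in a C24 H12 O13 framework unit.
result = metal_mass_percent("C24 H12 O13 Zn4")
print(result)
```

A MOF containing several metals would yield one percentage per metal, which is consistent with the values occupying more than one column in the output file.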

.cif folders

Three different folders are used to store .cif files.

  1. cif: these are hydrophobic MOFs received from Dr. Z. Qiao.
  2. manual hydrophobic: these are hydrophobic MOFs collected from the literature.
  3. manual hydrophilic: these are hydrophilic MOFs collected from the literature.

To add additional .cif files:

Add additional .cif files into either the manual hydrophobic folder or the manual hydrophilic folder. Make sure the file names represent the CCDC refcodes (including or excluding the CoRE_MOF 2019 name extensions). Finally, add these refcodes into the .txt file available in each folder so that the .cif files can be read by the cif Reader program.

This project is licensed under the terms of the GNU General Public License v3.0
