Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks

Overview

MOF-Water-Affinity-Prediction-

The following Python scripts aim to use a Random Forest machine learning algorithm to predict the water affinity of Metal-Organic Frameworks (MOFs). The training set is extracted from the Cambridge Structural Database and the CoRE_MOF 2019 dataset.

Prediction Model

The prediction model is used to determine whether a given MOF is hydrophobic or hydrophilic. It uses a Random Forest model from the XGBoost library through a scikit-learn interface. The model reads in a .csv file of training data and then predicts the water affinity of a user inputted MOF. The user can specify what input parameters are to be used in the model.

Overfitting/Underfitting

This script was created to investigate how the prediction model’s accuracy and precision vary with the number and combination of inputs. This script allows a user to compare how the different combinations of inputs affect the score and the standard deviation of the model’s results.

It operates by reading in a .csv file of training data containing 13 input parameters. It then generates a list of all the possible combinations of input parameters according to the lengths specified by the user. For example, if the user wants all the combinations of length 3, 4, and 10 possible, the program will generate a list of all combinations of those lengths, and then use each combination as input for the model. Basically, each combination will undergo the same process as in the prediction model above, and then its results will be added into a .csv file for later analysis. Finally, a plot is created with filters for visualization.

.cif to .csv Converter

In order to create a training set for the prediction model, a csv must be created with all the available datapoints. This includes the MOFs and their crystallographic data. The data needed is collected from three different sources: WebCSD, CoRE_MOF 2019 dataset, and the MOF’s .cif files. Furthermore, additional calculations need to be performed from the information collected from the .cif files.

The code works by reading a .txt file, folder, or both, containing the refcodes and .cif files given to the MOF by the Cambrdige Structural Database. It then searches for these refcodes in the CoRE_MOF 2019 dataset, and retrieves the crystallographic data attached to them. Additionally, it uses the .cif files of the MOFs to calculate the atomic mass percentage of the metals contained in the MOF. These calculations are stored in columns 14-17, but are treated as one input parameter in the models in an attempt to relate them to each other. It also states the MOFs in the training set as hydrophobic and hydrophilic based on previously collected information from the literature. Finally, it produces a .csv file ready for use in the prediction model.

.cif folders

Three different folders are used to store .cif files.

  1. cif: these are hydrophobic MOFs received from Dr. Z. Qiao.
  2. manual hydrophobic: these are hydrophobic MOFs collected from the literature
  3. manual hydrophilic: these are hydrophilic MOFs collected from the literature

To add additional .cif files:

Add additional .cif files into either the manual hydrophobic folder or the manual hydrophilic folder. Make sure the file names represent the CCDC refcodes (including or excluding the CoRE_MOF 2019 name extensions). Finally, add these refcodes into the .txt file available in each folder so that the .cif files can be read by the cif Reader program.

This project is licensed under the terms of the GNU General Public License v3.0

Techdegree Data Analysis Project 2

Basketball Team Stats Tool In this project you will be writing a program that reads from the "constants" data (PLAYERS and TEAMS) in constants.py. Thi

2 Oct 23, 2021
This mini project showcase how to build and debug Apache Spark application using Python

Spark app can't be debugged using normal procedure. This mini project showcase how to build and debug Apache Spark application using Python programming language. There are also options to run Spark a

Denny Imanuel 1 Dec 29, 2021
Spaghetti: an open-source Python library for the analysis of network-based spatial data

pysal/spaghetti SPAtial GrapHs: nETworks, Topology, & Inference Spaghetti is an open-source Python library for the analysis of network-based spatial d

Python Spatial Analysis Library 203 Jan 03, 2023
pyhsmm MITpyhsmm - Bayesian inference in HSMMs and HMMs. MIT

Bayesian inference in HSMMs and HMMs This is a Python library for approximate unsupervised inference in Bayesian Hidden Markov Models (HMMs) and expli

Matthew Johnson 527 Dec 04, 2022
A notebook to analyze Amazon Recommendation Review Dataset.

Amazon Recommendation Review Dataset Analyzer A notebook to analyze Amazon Recommendation Review Dataset. Features Calculates distinct user count, dis

isleki 3 Aug 22, 2022
Projects that implement various aspects of Data Engineering.

DATAWAREHOUSE ON AWS The purpose of this project is to build a datawarehouse to accomodate data of active user activity for music streaming applicatio

2 Oct 14, 2021
A set of tools to analyse the output from TraDIS analyses

QuaTradis (Quadram TraDis) A set of tools to analyse the output from TraDIS analyses Contents Introduction Installation Required dependencies Bioconda

Quadram Institute Bioscience 2 Feb 16, 2022
This cosmetics generator allows you to generate the new Fortnite cosmetics, Search pak and search cosmetics!

COSMETICS GENERATOR This cosmetics generator allows you to generate the new Fortnite cosmetics, Search pak and search cosmetics! Remember to put the l

ᴅᴊʟᴏʀ3xᴢᴏ 11 Dec 13, 2022
apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly.

Please consider citing the manuscript if you use apricot in your academic work! You can find more thorough documentation here. apricot implements subm

Jacob Schreiber 457 Dec 20, 2022
Udacity-api-reporting-pipeline - Udacity api reporting pipeline

udacity-api-reporting-pipeline In this exercise, you'll use portions of each of

Fabio Barbazza 1 Feb 15, 2022
Tools for the analysis, simulation, and presentation of Lorentz TEM data.

ltempy ltempy is a set of tools for Lorentz TEM data analysis, simulation, and presentation. Features Single Image Transport of Intensity Equation (SI

McMorran Lab 1 Dec 26, 2022
Tools for analyzing data collected with a custom unity-based VR for insects.

unityvr Tools for analyzing data collected with a custom unity-based VR for insects. Organization: The unityvr package contains the following submodul

Hannah Haberkern 1 Dec 14, 2022
Stock Analysis dashboard Using Streamlit and Python

StDashApp Stock Analysis Dashboard Using Streamlit and Python If you found the content useful and want to support my work, you can buy me a coffee! Th

StreamAlpha 27 Dec 09, 2022
Template for a Dataflow Flex Template in Python

Dataflow Flex Template in Python This repository contains a template for a Dataflow Flex Template written in Python that can easily be used to build D

STOIX 5 Apr 28, 2022
Generates a simple report about the current Covid-19 cases and deaths in Malaysia

Generates a simple report about the current Covid-19 cases and deaths in Malaysia. Results are delay one day, data provided by the Ministry of Health Malaysia Covid-19 public data.

Yap Khai Chuen 7 Dec 15, 2022
Statsmodels: statistical modeling and econometrics in Python

About statsmodels statsmodels is a Python package that provides a complement to scipy for statistical computations including descriptive statistics an

statsmodels 8k Dec 29, 2022
Aggregating gridded data (xarray) to polygons

A package to aggregate gridded data in xarray to polygons in geopandas using area-weighting from the relative area overlaps between pixels and polygons. Check out the binder link above for a sample c

Kevin Schwarzwald 42 Nov 09, 2022
Picka: A Python module for data generation and randomization.

Picka: A Python module for data generation and randomization. Author: Anthony Long Version: 1.0.1 - Fixed the broken image stuff. Whoops What is Picka

Anthony 108 Nov 30, 2021
PyEmits, a python package for easy manipulation in time-series data.

PyEmits, a python package for easy manipulation in time-series data. Time-series data is very common in real life. Engineering FSI industry (Financial

Thompson 5 Sep 23, 2022
Random dataframe and database table generator

Random database/dataframe generator Authored and maintained by Dr. Tirthajyoti Sarkar, Fremont, USA Introduction Often, beginners in SQL or data scien

Tirthajyoti Sarkar 249 Jan 08, 2023