This repo is dedicated to the data extraction and manipulation of the World Bank's database called STEP.

Overview

Overview

Welcome to the Step-X repository. This repo is dedicated to the data extraction and manipulation of the World Bank's database called STEP. Bellow in this readme, it will be explained the installation and usage process.

The extractor was created using the following technologies:

  • Python 3.8
  • Pandas
  • Geeckodriver
  • Selenium
  • MongoDB

Installation process

To install and prepare the Step-X environment it's necessary to follow these instructions in order, step by step. To start, it's needed to:

  • Install the Geckodriver
  • Install the Firefox web browser
  • Install Anaconda and create an environment to proceed with the next steps (if you wish, you can skip this step)
  • Install MongoDB in your machine or server

Once installed the required tools describe above, we need to install the Python's libraries used in this project. To make that, execute the command below:

conda create --name 
   
     --file requirements.txt

   

This command installs the libraries and create a new conda environment. After that, your workspace is prepared to execute the extractor, but you will need to follow some configuration instructions that will be described in the next steps.

Configuration process

To start the extraction, first some configurations is required, such as the World Bank's credentials and the project list that the extractor will retrieve data. Notice that all necessary configuration is imbued in the file called environment.py. To set the World Bank's credentials just replace the variable called wb_credentials with the correct credentials as the example bellow:

wb_credentials = {"email": '[email protected]', 'password': 'password123'}

The geckodriver path is also needed to ensure that the Selenium will be work properly. To set the geckodriver path, just replace the variable geckodriver_path with the desired location:

geckodriver_path = r'/Users/userName/webdriverLocationFolder/geckodriver'

The next step is to set up the database credentials pass name, and the url in environment.py as the example bellow:

database_name = "stepX"
database_url = "localhost"

Finally, for the last configuration, pass the project's list that you wish to extract and manipulate. Follow the example:

PROJECTS_LIST =['PROJECT_ID']
Owner
Keanu Pang
Sr. Mobile App/Web/Software Engineer, Writer, Teacher & Researcher.
Keanu Pang
Describing statistical models in Python using symbolic formulas

Patsy is a Python library for describing statistical models (especially linear models, or models that have a linear component) and building design mat

Python for Data 866 Dec 16, 2022
Code for the DH project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval Muslim World"

Damast This repository contains code developed for the digital humanities project "Dhimmis & Muslims – Analysing Multireligious Spaces in the Medieval

University of Stuttgart Visualization Research Center 2 Jul 01, 2022
Pandas-based utility to calculate weighted means, medians, distributions, standard deviations, and more.

weightedcalcs weightedcalcs is a pandas-based Python library for calculating weighted means, medians, standard deviations, and more. Features Plays we

Jeremy Singer-Vine 98 Dec 31, 2022
TextDescriptives - A Python library for calculating a large variety of statistics from text

A Python library for calculating a large variety of statistics from text(s) using spaCy v.3 pipeline components and extensions. TextDescriptives can be used to calculate several descriptive statistic

150 Dec 30, 2022
Analysiscsv.py for extracting analysis and exporting as CSV

wcc_analysis Lichess page documentation: https://lichess.org/page/world-championships Each WCC has a study, studies are fetched using: https://lichess

32 Apr 25, 2022
BAyesian Model-Building Interface (Bambi) in Python.

Bambi BAyesian Model-Building Interface in Python Overview Bambi is a high-level Bayesian model-building interface written in Python. It's built on to

861 Dec 29, 2022
Exploratory Data Analysis of the 2019 Indian General Elections using a dataset from Kaggle.

2019-indian-election-eda Exploratory Data Analysis of the 2019 Indian General Elections using a dataset from Kaggle. This project is a part of the Cou

Souradeep Banerjee 5 Oct 10, 2022
Exploring the Top ML and DL GitHub Repositories

This repository contains my work related to my project where I scraped data on the most popular machine learning and deep learning GitHub repositories in order to further visualize and analyze it.

Nico Van den Hooff 17 Aug 21, 2022
An easy-to-use feature store

A feature store is a data storage system for data science and machine-learning. It can store raw data and also transformed features, which can be fed straight into an ML model or training script.

ByteHub AI 48 Dec 09, 2022
Python library for creating data pipelines with chain functional programming

PyFunctional Features PyFunctional makes creating data pipelines easy by using chained functional operators. Here are a few examples of what it can do

Pedro Rodriguez 2.1k Jan 05, 2023
Catalogue data - A Python Scripts to prepare catalogue data

catalogue_data Scripts to prepare catalogue data. Setup Clone this repo. Install

BigScience Workshop 3 Mar 03, 2022
A notebook to analyze Amazon Recommendation Review Dataset.

Amazon Recommendation Review Dataset Analyzer A notebook to analyze Amazon Recommendation Review Dataset. Features Calculates distinct user count, dis

isleki 3 Aug 22, 2022
WithPipe is a simple utility for functional piping in Python.

A utility for functional piping in Python that allows you to access any function in any scope as a partial.

Michael Milton 1 Oct 26, 2021
apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly.

Please consider citing the manuscript if you use apricot in your academic work! You can find more thorough documentation here. apricot implements subm

Jacob Schreiber 457 Dec 20, 2022
Programmatically access the physical and chemical properties of elements in modern periodic table.

API to fetch elements of the periodic table in JSON format. Uses Pandas for dumping .csv data to .json and Flask for API Integration. Deployed on "pyt

the techno hack 3 Oct 23, 2022
Statistical & Probabilistic Analysis of Store Sales, University Survey, & Manufacturing data

Statistical_Modelling Statistical & Probabilistic Analysis of Store Sales, University Survey, & Manufacturing data Statistical Methods for Decision Ma

Avnika Mehta 1 Jan 27, 2022
Python package for analyzing sensor-collected human motion data

Python package for analyzing sensor-collected human motion data

Simon Ho 71 Nov 05, 2022
This is a repo documenting the best practices in PySpark.

Spark-Syntax This is a public repo documenting all of the "best practices" of writing PySpark code from what I have learnt from working with PySpark f

Eric Xiao 447 Dec 25, 2022
Universal data analysis tools for atmospheric sciences

U_analysis Universal data analysis tools for atmospheric sciences Script written in python 3. This file defines multiple functions that can be used fo

Luis Ackermann 1 Oct 10, 2021
A computer algebra system written in pure Python

SymPy See the AUTHORS file for the list of authors. And many more people helped on the SymPy mailing list, reported bugs, helped organize SymPy's part

SymPy 9.9k Dec 31, 2022