apricot implements submodular optimization for the purpose of selecting subsets of massive data sets to train machine learning models quickly.

Overview

Downloadsbuild Documentation Status

Please consider citing the manuscript if you use apricot in your academic work!

You can find more thorough documentation here.

apricot implements submodular optimization for the purpose of summarizing massive data sets into minimally redundant subsets that are still representative of the original data. These subsets are useful for both visualizing the modalities in the data (such as in the two data sets below) and for training accurate machine learning models with just a fraction of the examples and compute.

Submodular optimization is a well studied field that focuses on set functions which have the diminishing return property. These set functions have the dimishing return property, such that adding an item to a set produces a smaller marginal gain than adding the same item to a superset of that set. More formally, for , the property is . When these functions represent a notion of diversity, finding the subset of examples that maximize these functions corresponds to finding a minimally redundant set.

There are many built-in submodular functions and optimizers in apricot. These include:

Functions

Optimizers

Installation

apricot can be installed easily from PyPI with pip install apricot-select

Usage

The main objects in apricot are the selectors. Each selector encapsulates a submodular function and the cached statistics that speed up the optimization process. The API of the selectors follows the style of a scikit-learn transformer and consists of a fit method, where the examples are selected, a transform method, where the subset is selected from the data set, and a fit_transform method, where both are done sequentially.

Here is an example of reducing a data set of size 5000 down to 100 examples with a facility location function that uses negative Euclidean distance as a similarity measure.

import numpy
from apricot import FacilityLocationSelection

X = numpy.random.normal(100, 1, size=(5000, 25))
X_subset = FacilityLocationSelection(100, metric='euclidean', optimizer='lazy').fit_transform(X)

Facility location functions are general purpose

Facility location functions are more general purpose than feature based functions and work whenever a similarity measure can be defined over pairs of examples. When these functions are maximized, elements are selected that represent elements that are currently underrepresented. However, a downside of facility location functions (and all other submodular functions that rely on a similarity matrix) when compared to feature based functions is that the full similarity matrix requires quadratic memory with the number of examples and is generally impractical to store.

Here is an example of applying facility location selection to the MNIST data set.

from apricot import FacilityLocationSelection
from sklearn.datasets import load_digits

data = load_digits()
X_train = data.data[:1250]

selector = FacilityLocationSelection(n_samples, metric='euclidean', optimizer='lazy', verbose=False)
selector.fit(X_train)

And here is the performance of logistic regression models trained on subsets of either the MNIST or Fashion MNIST data sets where the subsets are selected using either facility location (orange) or random selection (grey).

The selected examples from facility location selection can be used in several ways other than training machine learning models, such as for visualizing the modalities of data (see the example at the start) or as centroids in a greedy version of k-means clustering. The animation below shows samples being selected according to facility location and compared to random selection, with facility location first selecting a sample that represents the entire data set, then selecting a sample that is in the largest cluster but near the second largest, and then the centers of local neighborhoods. This selection process is much faster than finding centroids using the EM based approaches of K-means or GMMs.

Feature based functions scale to summarize massive data sets

Feature-based functions work well when the features correspond to some notion of quantity or importance, e.g. when the features are number of times a word appears in a document, or the strength of a signal at a sensor. When these functions are maximized, the resulting subsets are comprised of examples that are enriched in value in different sets of features. The intuition behind using a feature-based function is that different modalities in the data (like classes or clusters) will exhibit high values in different sets of features, and so selecting examples enriched for different features means covering these modalities better.

When using a feature-based function on the 20 newsgroups data set, one can train a logistic regression model using only 100 samples and get the same performance as using all 1,187 potential samples, much better than using random sampling. The code below shows an example snippet of code usage and the graph shows the results over many subset sizes.

from apricot import FeatureBasedSelection
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

train_data = fetch_20newsgroups(subset='train', categories=('sci.med', 'sci.space'))
vectorizer = TfidfVectorizer(stop_words='english', max_features=1000)

X_train = vectorizer.fit_transform(train_data.data) # This returns a sparse matrix which is supported in apricot

selector = FeatureBasedSelection(100, concave_func='sqrt', optimizer='two-stage', verbose=False)
selector.fit(X_train)

Initial subsets

The selection process for most optimizers implemented in apricot is greedy, meaning that one example is selected at a time and this is (usually) the best example to include next given those that have already been selected. While a greedy algorithm is not guaranteed to find the best subset of a given size, it was famously shown that this subset cannot have an objective value $1 - e^{-1}$ worse than the optimal subset, and in practice the subsets are usually near-optimal.

apricot exploits the greedy aspect of the algorithms to allow users to define an initial subset to build off of. This can be useful when one already has some elements selected (such as by outside information) and one would like to select elements that are not redundant with these already-selected elements. The initial subsets can be defined in the constructor like so:

import numpy
from apricot import FacilityLocationSelection

X = numpy.random.normal(20, 1, size=(5000, 25))
X_reordered = FacilityLocationSelection(100, initial_subset=[1, 5, 6, 8, 10]).fit_transform(X)

model = FacilityLocationSelection(1000).fit(X)
X_reordered2 = X[model.ranking]

When should I use submodular selection?

If the amount of data that you have right now is not burdensome to deal with then it may not be helpful to use submodular selection. However, if training even simple models using the amount of data you have is difficult, you might consider summarizing your data set using apricot. If you're currently running many random selections because of the amount of data you have you may find that a single run of apricot will yield a better subset.

Comments
  • How to combine feature-based function and Graph based Function?

    How to combine feature-based function and Graph based Function?

    I'm solving a problem that can be formulated as image

    The second part is a normal facility location problem, while the first one is a function associate with each data in X (e.g. the value of the first feature).

    And I try to use the MixtureSelection with it as follows,

    def feature_based_function(X):
        return X[:, -1].sum().sum()
    
    def facility_location_function(X):
        return X[:, :-1].max(axis=0).sum()
    
    v = np.random.uniform(0.5, 1.0, (50, 1))
    similarity = np.random.uniform(0.5, 0.9, size=(50, 50))
    n = 10
    x = np.concatenate([v, similarity], axis=1)
    weight = [1, 1]
    func1 = CustomSelection(n, feature_based_function)
    func2 = CustomGraphSelection(n, facility_location_function, metric='precomputed')
    func3 = MixtureSelection(n, [func1, func2], weights=weight, optimizer='naive').fit(x)
    selected = func3.transform(x)
    

    But it seems this can not generate a good subset, is there any problem?

    opened by mrbeann 12
  • partial_fit behavior for Facility Location

    partial_fit behavior for Facility Location

    I am using the library with great results, but I am struggling to understand if the behavior reported below is correct.

    I am loading batches of data from the disk and then calling partial_fit on every batch. Let's assume that each batch contains 1000 data points, and I load 10 batches, the data points will be indexed from 0 to 9999. After each partial fit call, I store the indices returned in the ranking and the associated data points (e.g., [1, 2,.... 100]). I would expect the list of the indices returned by partial_fit at batch B should be either from the list of the stored indices or from the indices of the batch B but that does not seem to be the case when I use facility location (empirically it is the case for the feature based selection).

    For example, if my stored indices are pi = [1,2,5,90] and the current batch contains the points with in bi = [100, 101, ..., 110], I would expect that given ci = selector.ranking to have ci in pi + bi. Is this a wrong assumption? If my assumption is wrong, how do you retrieve the points associated to the selected indices without a second pass on the data?

    opened by 610v4nn1 7
  • __init__() got an unexpected keyword argument 'pairwise_func'

    __init__() got an unexpected keyword argument 'pairwise_func'

    Getting the following error while calling the FacilityLocationSelection function

    1 from apricot import FacilityLocationSelection 2 ----> 3 model = FacilityLocationSelection(6, pairwise_func='precomputed') 4 model.fit(route_map) 5

    TypeError: init() got an unexpected keyword argument 'pairwise_func'

    opened by aresBlues 3
  • Any plan to implement streaming submodular implementation for the saturated coverage problem?

    Any plan to implement streaming submodular implementation for the saturated coverage problem?

    "Streaming optimization is implemented for mixtures of functions, but not for sum redundancy or saturated coverage functions."

    I was wondering if there is a technical reason for the above or just that its doable but not yet implemented. Specifically I was interested in streaming sub-modular optimization for saturated coverage.

    opened by badrisnps 3
  • bidirectional optimization with Graph Cut ValueError

    bidirectional optimization with Graph Cut ValueError

    Thanks for the amazing package and documentation: I am facing issue when I try to use bidirectional optimization with graph cut GraphCutSelection(N_Samples_Per_Class, metric='euclidean', optimizer='bidirectional').fit_transform(.......... ValueError: zero-size array to reduction operation maximum which has no identity

    Note that using other optimizers worked just fine GraphCutSelection(N_Samples_Per_Class, metric='euclidean', optimizer='lazy').fit_transform(.......... GraphCutSelection(N_Samples_Per_Class, metric='euclidean', optimizer='two-stage').fit_transform(..........

    Your help is appreciated

    opened by M-A-Hassan 2
  • Memory Error for Large Dataset

    Memory Error for Large Dataset

    I'm trying the FacilityLocationSelection with a dataset of 7000000 samples, using Hamming distance as a metric. I get a memory error:

    numpy.core._exceptions.MemoryError: Unable to allocate 196. TiB for an array with shape (26998899737040,) and data type float64 In my previous dataset I had 2500000 samples, and the code worked fine. I assume the issue is calculating the distances between every sample. I'm wondering if there are any options for reducing the memory cost- I've also tried faster optimizers and different functions but I run into the same error.

    opened by devinity1337 2
  • Guidance about usage with large datasets

    Guidance about usage with large datasets

    I tried the library on some datasets and I have to say I am positively surprised about the usability and the effectiveness of the methods provided.

    At the same time, I found some serious blocker in using it with large datasets since this requires to read the literature referenced in the documentation. It would be extremely useful to provide guidance about the computational complexity of the different methods or distinguish between scalable methods (e.g., streaming methods) and less scalable ones.

    opened by 610v4nn1 2
  • Retrieving the original index of the samples in subset

    Retrieving the original index of the samples in subset

    Hi Jacob,

    Thank you for a very cool library. I have a quick question. Maybe I missed something, but right now there are no way to retrieve the original index of the samples in the subset to identify which samples/row number the algorithm chose? This need comes up in 2 scenarios:

    1. When I want to know whether I have sufficiently covered my data distribution based on UMAP 2D embedding, I need to know the index of the samples in the subset to merge it back with the UMAP representation

    2. When I have features [A, B, C], but only want to perform submodular optimization on features [A, B], and then want to know the values of feature C of all the samples in the subset.

    Right now I don't see any way to extract out the index/numpy array row number of the chosen samples.

    opened by hoangthienan95 2
  • Error in running Example A. Facility Location on MNIST and Fashion-MNIST

    Error in running Example A. Facility Location on MNIST and Fashion-MNIST

    ValueError Traceback (most recent call last) in ----> 1 X_fashion_umap = UMAP(metric="precomputed").fit_transform(X_fashion_pairwise)

    ~\Anaconda3\envs\tf\lib\site-packages\umap\umap_.py in fit_transform(self, X, y) 1966 Embedding of the training data in low-dimensional space. 1967 """ -> 1968 self.fit(X, y) 1969 return self.embedding_ 1970

    ~\Anaconda3\envs\tf\lib\site-packages\umap\umap_.py in fit(self, X, y) 1623 self._initial_alpha = self.learning_rate 1624 -> 1625 self._validate_parameters() 1626 1627 if self.verbose:

    ~\Anaconda3\envs\tf\lib\site-packages\umap\umap_.py in _validate_parameters(self) 1499 elif self.metric == "precomputed": 1500 if self.unique is False: -> 1501 raise ValueError("unique is poorly defined on a precomputed metric") 1502 warn( 1503 "using precomputed metric; transform will be unavailable for new data and inverse_transform "

    ValueError: unique is poorly defined on a precomputed metric

    opened by DeekshaDixit15 2
  • Error in running Introduction to Submodular Optimization tutorial: AttributeError: 'FeatureBasedSelection' object has no attribute 'indices'

    Error in running Introduction to Submodular Optimization tutorial: AttributeError: 'FeatureBasedSelection' object has no attribute 'indices'


    AttributeError Traceback (most recent call last) in 3 selector = FeatureBasedSelection(20) 4 selector.fit_transform(X_shap) ----> 5 Xi2 = X[selector.indices] 6 yi2 = y[selector.indices] 7

    AttributeError: 'FeatureBasedSelection' object has no attribute 'indices'

    opened by DeekshaDixit15 2
  • Definition of submodularity in README.md

    Definition of submodularity in README.md

    The definition of submodularity in the README seems to be incorrect. In the readme, it's written

    image

    but in the documentation, the inequality sign is reversed

    image

    The latter definition appears to be the correct one.

    opened by akshayka 2
  • Understanding `partial_fit` results

    Understanding `partial_fit` results

    Hi there, I'm trying to use apricot to help find a diverse set of texts. When I use the fit method, everything works intuitively. However, when I start using the partial_fit method, the outputs do not appear to be correct. I suspect that I'm misunderstanding something about how the library works. In case I'm not, I've prepared a small demo of the issue with explanations of what I got vs. what I expected.

    # environment setup
    pip install textdiversity apricot-select --quiet
    
    from textdiversity import POSSequenceDiversity
    from apricot import FacilityLocationSelection
    
    def chunker(seq, size):
        return (seq[pos:pos + size] for pos in range(0, len(seq), size))
    
    def test_apricot(featurizer, texts, fit_type="full_fit", batch_size = 2):
        selector = FacilityLocationSelection(
            n_samples=len(texts), 
            metric='euclidean', 
            optimizer='lazy')
        if fit_type == "full_fit":
            f, c = d.extract_features(texts)
            Z = d.calculate_similarities(f)
            selector.fit(Z)
        elif fit_type == "unbatched_partial":
            f, c = d.extract_features(texts)
            Z = d.calculate_similarities(f)
            selector.partial_fit(Z)
        elif fit_type == "batched_partial":
            for batch in chunker(texts, batch_size):
                f, c = d.extract_features(batch)
                Z = d.calculate_similarities(f)
                selector.partial_fit(Z)
        print(f"{fit_type} ranking: {selector.ranking} | gain: {sum(selector.gains)}")
    
    # test ====================================================
    
    d = POSSequenceDiversity()
    
    texts = ["This is a test.", 
             "This is also a test.", 
             "This is the real deal.", 
             "So is this one."]
    
    test_apricot(d, texts, "full_fit") # > ranking: [0 3 1 2] | gain: 2.8888888888888893
    test_apricot(d, texts, "unbatched_partial") # > ranking: [0 1 2 3] | gain: 0.7222222222222221
    test_apricot(d, texts, "batched_partial") #> ranking: [2 3] | gain: 0.4444444444444444
    
    texts = ["This is the real deal.",
             "So is this one.",
             "This is a test.", 
             "This is also a test."]
    
    test_apricot(d, texts, "full_fit") # > ranking: [0 1 3 2] | gain: 2.8888888888888893
    test_apricot(d, texts, "unbatched_partial") # > ranking: [0 1 2 3] | gain: 0.7222222222222221
    test_apricot(d, texts, "batched_partial") #> ranking: [0 1] | gain: 0.5
    

    Full fit: makes intuitive sense. Texts with overlapping semantics get relegated to lower rankings, etc. Unbatched partial: I would have expected the unbatched partial fit to behave the same as full fit, but no matter what order I put the texts in (e.g. reverse it or any other permutation), I always get [0 1 2 3]. Since the partial_fit method always provides the same ranking despite changes in the underlying order, this may indicate a bug or I don't understand it well enough. Please let me know. Batched partial: This one is responsive to changes in the order of the texts, but a) does not respect the n_samples parameter (I wanted to rank all the texts) and b) does not appear to agree with the ranking from the full fit (which I trust the most, but unfortunately cannot use due to the size of my dataset).

    Thanks for taking the time to read + potentially helping me out.

    opened by fabriceyhc 0
  • Selection with pre-selected set

    Selection with pre-selected set

    Hi,

    Is there a way to pass a preselected set of data to the optimizer to be considered while calculating the gain? The preselected set is user-defined input to the optimizer and will not be altered or modified in any way by the optimizer.

    Thanks for the effort and making this package available. Mohamed

    opened by M-A-Hassan 1
  • License file missing in PyPI builds

    License file missing in PyPI builds

    The setup.py refers to LICENSE.txt, but there's no license file in the tar.gz on PyPI. Maybe as simple as the file being called LICENSE in the source tree.

    opened by hyandell 0
  • `partial_fit` and `sieve` can easily outgrow available memory

    `partial_fit` and `sieve` can easily outgrow available memory

    Thank you for putting together such a great library. It's been extremely helpful.

    I was toying with the parameters in the example in the documentation on massive datasets. I realized that when using partial_fit (and therefore the sieve optimizer) and slightly more features or I set my target sample size to something larger, it is easy to get a memory error. Here is an example that I tried:

    # apricot-massive-dataset-example.py
    from apricot import FeatureBasedSelection
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    
    train_data = fetch_20newsgroups(subset='train', categories=('sci.med', 'sci.space'))
    vectorizer = TfidfVectorizer()
    
    X_train = vectorizer.fit_transform(train_data.data) # This returns a sparse matrix which is supported in apricot
    print(X_train.shape)
    
    selector = FeatureBasedSelection(1000, concave_func='sqrt', verbose=False)
    selector.partial_fit(X_train)
    

    Running the above, I get:

    $ python apricot-massive-dataset-example.py
    (1187, 25638)
    Traceback (most recent call last):
      File "apricot-example.py", line 12, in <module>
        selector.partial_fit(X_train)
      File "/envs/bla/lib/python3.8/site-packages/apricot/functions/base.py", line 258, in partial_fit
        self.optimizer.select(X, self.n_samples, sample_cost=sample_cost)
      File "/envs/bla/lib/python3.8/site-packages/apricot/optimizers.py", line 1103, in select
        self.function._calculate_sieve_gains(X, thresholds, idxs)
      File "/envs/bla/lib/python3.8/site-packages/apricot/functions/featureBased.py", line 360, in _calculate_sieve_gains
        super(FeatureBasedSelection, self)._calculate_sieve_gains(X,
      File "/envs/bla/lib/python3.8/site-packages/apricot/functions/base.py", line 418, in _calculate_sieve_gains
        self.sieve_subsets_ = numpy.zeros((l, self.n_samples, self._X.shape[1]), dtype='float32')
    numpy.core._exceptions.MemoryError: Unable to allocate 117. GiB for an array with shape (1227, 1000, 25638) and data type float32
    

    This behavior doesn't happen when I use fit() and another optimizer, e.g., two-stage.

    Looking into the code, it seems that the root is at an array initialization of sieve_subsets_, and can happen again later here. In both places, we ask for a zero, float64, non-sparse matrix, of size |thresholds| x |n_samples| x |feature_dims|, which can become quite large and not fit in memory when dealing with massive datasets. I wonder if there is a more memory efficient way of writing it? Thanks!

    opened by nucflash 0
  • Have you ever considering implement the multilinear extension and continuous greedy algorithm in apricot package?

    Have you ever considering implement the multilinear extension and continuous greedy algorithm in apricot package?

    Since for now the relaxation method is one of the important solution to submodular optimization and it is easy to do extension comparing with the highly tailored combinatorial algorithms.

    btw, it is a fantastic job you have done!

    opened by athossun 2
  • Nearest neighbors doesn't work

    Nearest neighbors doesn't work

    from apricot import FacilityLocationSelection
    import numpy
    
    X = numpy.random.uniform(0, 1, size=(500, 500))
    
    
    FacilityLocationSelection(100,n_neighbors=10, verbose=True).fit(X)
    

    n_neighbors parameter gives an error, when I remove it it works fine.

    opened by devinity1337 4
Releases(v0.5.0)
Owner
Jacob Schreiber
I am a post-doc at Stanford University (previously University of Washington), studying large scale machine learning and computational biology
Jacob Schreiber
🌍 Create 3d-printable STLs from satellite elevation data 🌏

mapa 🌍 Create 3d-printable STLs from satellite elevation data Installation pip install mapa Usage mapa uses numpy and numba under the hood to crunch

Fabian Gebhart 13 Dec 15, 2022
πŸ’¬ Python scripts to parse Messenger, Hangouts, WhatsApp and Telegram chat logs into DataFrames.

Chatistics Python 3 scripts to convert chat logs from various messaging platforms into Pandas DataFrames. Can also generate histograms and word clouds

Florian 893 Jan 02, 2023
This python script allows you to manipulate the audience data from Sl.ido surveys

Slido-Automated-VoteBot This python script allows you to manipulate the audience data from Sl.ido surveys Since Slido blocks interference from automat

Pranav Menon 1 Jan 24, 2022
ELFXtract is an automated analysis tool used for enumerating ELF binaries

ELFXtract ELFXtract is an automated analysis tool used for enumerating ELF binaries Powered by Radare2 and r2ghidra This is specially developed for PW

Monish Kumar 49 Nov 28, 2022
Fast, flexible and easy to use probabilistic modelling in Python.

Please consider citing the JMLR-MLOSS Manuscript if you've used pomegranate in your academic work! pomegranate is a package for building probabilistic

Jacob Schreiber 3k Jan 02, 2023
Python-based Space Physics Environment Data Analysis Software

pySPEDAS pySPEDAS is an implementation of the SPEDAS framework for Python. The Space Physics Environment Data Analysis Software (SPEDAS) framework is

SPEDAS 98 Dec 22, 2022
Data Analytics on Genomes and Genetics

Data Analytics performed on On genomes and Genetics dataset to predict genetic disorder and disorder subclass. DONE by TEAM SIGMA!

1 Jan 12, 2022
Data Competition: automated systems that can detect whether people are not wearing masks or are wearing masks incorrectly

Table of contents Introduction Dataset Model & Metrics How to Run Quickstart Install Training Evaluation Detection DATA COMPETITION The COVID-19 pande

Thanh Dat Vu 1 Feb 27, 2022
Multiple Pairwise Comparisons (Post Hoc) Tests in Python

scikit-posthocs is a Python package that provides post hoc tests for pairwise multiple comparisons that are usually performed in statistical data anal

Maksim Terpilowski 264 Dec 30, 2022
DenseClus is a Python module for clustering mixed type data using UMAP and HDBSCAN

DenseClus is a Python module for clustering mixed type data using UMAP and HDBSCAN. Allowing for both categorical and numerical data, DenseClus makes it possible to incorporate all features in cluste

Amazon Web Services - Labs 53 Dec 08, 2022
Integrate bus data from a variety of sources (batch processing and real time processing).

Purpose: This is integrate bus data from a variety of sources such as: csv, json api, sensor data ... into Relational Database (batch processing and r

1 Nov 25, 2021
Utilize data analytics skills to solve real-world business problems using Humana’s big data

Humana-Mays-2021-HealthCare-Analytics-Case-Competition- The goal of the project is to utilize data analytics skills to solve real-world business probl

Yongxian (Caroline) Lun 1 Dec 27, 2021
A Numba-based two-point correlation function calculator using a grid decomposition

A Numba-based two-point correlation function (2PCF) calculator using a grid decomposition. Like Corrfunc, but written in Numba, with simplicity and hackability in mind.

Lehman Garrison 3 Aug 24, 2022
Mining the Stack Overflow Developer Survey

Mining the Stack Overflow Developer Survey A prototype data mining application to compare the accuracy of decision tree and random forest regression m

1 Nov 16, 2021
PySpark bindings for H3, a hierarchical hexagonal geospatial indexing system

h3-pyspark: Uber's H3 Hexagonal Hierarchical Geospatial Indexing System in PySpark PySpark bindings for the H3 core library. For available functions,

Kevin Schaich 12 Dec 24, 2022
Predictive Modeling & Analytics on Home Equity Line of Credit

Predictive Modeling & Analytics on Home Equity Line of Credit Data (Python) HMEQ Data Set In this assignment we will use Python to examine a data set

Dhaval Patel 1 Jan 09, 2022
Pandas and Dask test helper methods with beautiful error messages.

beavis Pandas and Dask test helper methods with beautiful error messages. test helpers These test helper methods are meant to be used in test suites.

Matthew Powers 18 Nov 28, 2022
A highly efficient and modular implementation of Gaussian Processes in PyTorch

GPyTorch GPyTorch is a Gaussian process library implemented using PyTorch. GPyTorch is designed for creating scalable, flexible, and modular Gaussian

3k Jan 02, 2023
BIGDATA SIMULATION ONE PIECE WORLD CENSUS

ONE PIECE is a Japanese manga of great international success. The story turns inhabited in a fictional world, tells the adventures of a young man whose body gained rubber properties after accidentall

Maycon Cypriano 3 Jun 30, 2022
Uses MIT/MEDSL, New York Times, and US Census datasources to analyze per-county COVID-19 deaths.

Covid County Executive summary Setup Install miniconda, then in the command line, run conda create -n covid-county conda activate covid-county conda i

Ahmed Fasih 1 Dec 22, 2021