Select, weight and analyze complex sample data

Overview

Sample Analytics

docs

In large-scale surveys, often complex random mechanisms are used to select samples. Estimates derived from such samples must reflect the random mechanism. Samplics is a python package that implements a set of sampling techniques for complex survey designs. These survey sampling techniques are organized into the following four sub-packages.

Sampling provides a set of random selection techniques used to draw a sample from a population. It also provides procedures for calculating sample sizes. The sampling subpackage contains:

  • Sample size calculation and allocation: Wald and Fleiss methods for proportions.
  • Equal probability of selection: simple random sampling (SRS) and systematic selection (SYS)
  • Probability proportional to size (PPS): Systematic, Brewer's method, Hanurav-Vijayan method, Murphy's method, and Rao-Sampford's method.

Weighting provides the procedures for adjusting sample weights. More specifically, the weighting subpackage allows the following:

  • Weight adjustment due to nonresponse
  • Weight poststratification, calibration and normalization
  • Weight replication i.e. Bootstrap, BRR, and Jackknife

Estimation provides methods for estimating the parameters of interest with uncertainty measures that are consistent with the sampling design. The estimation subpackage implements the following types of estimation methods:

  • Taylor-based, also called linearization methods
  • Replication-based estimation i.e. Boostrap, BRR, and Jackknife
  • Regression-based e.g. generalized regression (GREG)

Small Area Estimation (SAE). When the sample size is not large enough to produce reliable / stable domain level estimates, SAE techniques can be used to model the output variable of interest to produce domain level estimates. This subpackage provides Area-level and Unit-level SAE methods.

For more details, visit https://samplics.readthedocs.io/en/latest/

Usage

Let's assume that we have a population and we would like to select a sample from it. The goal is to calculate the sample size for an expected proportion of 0.80 with a precision (half confidence interval) of 0.10.

from samplics.sampling import SampleSize

sample_size = SampleSize(parameter = "proportion")
sample_size.calculate(target=0.80, half_ci=0.10)

Furthermore, the population is located in four natural regions i.e. North, South, East, and West. We could be interested in calculating sample sizes based on region specific requirements e.g. expected proportions, desired precisions and associated design effects.

from samplics.sampling import SampleSize

sample_size = SampleSize(parameter="proportion", method="wald", stratification=True)

expected_proportions = {"North": 0.95, "South": 0.70, "East": 0.30, "West": 0.50}
half_ci = {"North": 0.30, "South": 0.10, "East": 0.15, "West": 0.10}
deff = {"North": 1, "South": 1.5, "East": 2.5, "West": 2.0}

sample_size = SampleSize(parameter = "proportion", method="Fleiss", stratification=True)
sample_size.calculate(target=expected_proportions, half_ci=half_ci, deff=deff)

To select a sample of primary sampling units using PPS method, we can use code similar to the snippets below. Note that we first use the datasets module to import the example dataset.

# First we import the example dataset
from samplics.datasets import load_psu_frame
psu_frame_dict = load_psu_frame()
psu_frame = psu_frame_dict["data"]

# Code for the sample selection
from samplics.sampling import SampleSelection

psu_sample_size = {"East":3, "West": 2, "North": 2, "South": 3}
pps_design = SampleSelection(
   method="pps-sys",
   stratification=True,
   with_replacement=False
   )

psu_frame["psu_prob"] = pps_design.inclusion_probs(
   psu_frame["cluster"],
   psu_sample_size,
   psu_frame["region"],
   psu_frame["number_households_census"]
   )

The initial weighting step is to obtain the design sample weights. In this example, we show a simple example of two-stage sampling design.

import pandas as pd

from samplics.datasets import load_psu_sample, load_ssu_sample
from samplics.weighting import SampleWeight

# Load PSU sample data
psu_sample_dict = load_psu_sample()
psu_sample = psu_sample_dict["data"]

# Load PSU sample data
ssu_sample_dict = load_ssu_sample()
ssu_sample = ssu_sample_dict["data"]

full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]],
    ssu_sample[["cluster", "household", "ssu_prob"]],
    on="cluster"
)

full_sample["inclusion_prob"] = full_sample["psu_prob"] * full_sample["ssu_prob"]
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"]

To adjust the design sample weight for nonresponse, we can use code similar to:

import numpy as np

from samplics.weighting import SampleWeight

# Simulate response
np.random.seed(7)
full_sample["response_status"] = np.random.choice(
    ["ineligible", "respondent", "non-respondent", "unknown"],
    size=full_sample.shape[0],
    p=(0.10, 0.70, 0.15, 0.05),
)
# Map custom response statuses to teh generic samplics statuses
status_mapping = {
   "in": "ineligible",
   "rr": "respondent",
   "nr": "non-respondent",
   "uk":"unknown"
   }
# adjust sample weights
full_sample["nr_weight"] = SampleWeight().adjust(
   samp_weight=full_sample["design_weight"],
   adjust_class=full_sample["region"],
   resp_status=full_sample["response_status"],
   resp_dict=status_mapping
   )

To estimate population parameters using Taylor-based and replication-based methods, we can use code similar to:

# Taylor-based
from samplics.datasets import load_nhanes2

nhanes2_dict = load_nhanes2()
nhanes2 = nhanes2_dict["data"]

from samplics.estimation import TaylorEstimator

zinc_mean_str = TaylorEstimator("mean")
zinc_mean_str.estimate(
    y=nhanes2["zinc"],
    samp_weight=nhanes2["finalwgt"],
    stratum=nhanes2["stratid"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)

# Replicate-based
from samplics.datasets import load_nhanes2brr

nhanes2brr_dict = load_nhanes2brr()
nhanes2brr = nhanes2brr_dict["data"]

from samplics.estimation import ReplicateEstimator

ratio_wgt_hgt = ReplicateEstimator("brr", "ratio").estimate(
    y=nhanes2brr["weight"],
    samp_weight=nhanes2brr["finalwgt"],
    x=nhanes2brr["height"],
    rep_weights=nhanes2brr.loc[:, "brr_1":"brr_32"],
    remove_nan=True,
)

To predict small area parameters, we can use code similar to:

import numpy as np
import pandas as pd

# Area-level basic method
from samplics.datasets import load_expenditure_milk

milk_exp_dict = load_expenditure_milk()
milk_exp = milk_exp_dict["data"]

from samplics.sae import EblupAreaModel

fh_model_reml = EblupAreaModel(method="REML")
fh_model_reml.fit(
    yhat=milk_exp["direct_est"],
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    error_std=milk_exp["std_error"],
    intercept=True,
    tol=1e-8,
)
fh_model_reml.predict(
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    intercept=True,
)

# Unit-level basic method
from samplics.datasets import load_county_crop, load_county_crop_means

# Load County Crop sample data
countycrop_dict = load_county_crop()
countycrop = countycrop_dict["data"]
# Load County Crop Area Means sample data
countycropmeans_dict = load_county_crop_means()
countycrop_means = countycropmeans_dict["data"]

from samplics.sae import EblupUnitModel

eblup_bhf_reml = EblupUnitModel()
eblup_bhf_reml.fit(
    countycrop["corn_area"],
    countycrop[["corn_pixel", "soybeans_pixel"]],
    countycrop["county_id"],
)
eblup_bhf_reml.predict(
    Xmean=countycrop_means[["ave_corn_pixel", "ave_corn_pixel"]],
    area=np.linspace(1, 12, 12),
)

Installation

pip install samplics

Python 3.7 or newer is required and the main dependencies are numpy, pandas, scpy, and statsmodel.

Contribution

If you would like to contribute to the project, please read contributing to samplics

License

MIT

Contact

created by Mamadou S. Diallo - feel free to contact me!

Owner
samplics
samplics
This repo provides a demo for the CVPR 2021 paper "A Fourier-based Framework for Domain Generalization" on the PACS dataset.

FACT This repo provides a demo for the CVPR 2021 paper "A Fourier-based Framework for Domain Generalization" on the PACS dataset. To cite, please use:

105 Dec 17, 2022
Rotation-Only Bundle Adjustment

ROBA: Rotation-Only Bundle Adjustment Paper, Video, Poster, Presentation, Supplementary Material In this repository, we provide the implementation of

Seong 51 Nov 29, 2022
CodeContests is a competitive programming dataset for machine-learning

CodeContests CodeContests is a competitive programming dataset for machine-learning. This dataset was used when training AlphaCode. It consists of pro

DeepMind 1.6k Jan 08, 2023
SlotRefine: A Fast Non-Autoregressive Model forJoint Intent Detection and Slot Filling

SlotRefine: A Fast Non-Autoregressive Model for Joint Intent Detection and Slot Filling Reference Main paper to be cited (Di Wu et al., 2020) @article

Moore 34 Nov 03, 2022
Dynamic Graph Event Detection

DyGED Dynamic Graph Event Detection Get Started pip install -r requirements.txt TODO Paper link to arxiv, and how to cite. Twitter Weather dataset tra

Mert Koşan 3 May 09, 2022
FedCV: A Federated Learning Framework for Diverse Computer Vision Tasks

FedCV: A Federated Learning Framework for Diverse Computer Vision Tasks Image Classification Dataset: Google Landmark, COCO, ImageNet Model: Efficient

FedML-AI 62 Dec 10, 2022
GPU Programming with Julia - course at the Swiss National Supercomputing Centre (CSCS), ETH Zurich

Course Description The programming language Julia is being more and more adopted in High Performance Computing (HPC) due to its unique way to combine

Samuel Omlin 192 Jan 03, 2023
A computational block to solve entity alignment over textual attributes in a knowledge graph creation pipeline.

How to apply? Create your config.ini file following the example provided in config.ini Choose one of the options below to run: Run with Python3 pip in

Scientific Data Management Group 3 Jun 23, 2022
Time Delayed NN implemented in pytorch

Pytorch Time Delayed NN Time Delayed NN implemented in PyTorch. Usage kernels = [(1, 25), (2, 50), (3, 75), (4, 100), (5, 125), (6, 150)] tdnn = TDNN

Daniil Gavrilov 79 Aug 04, 2022
RoboDesk A Multi-Task Reinforcement Learning Benchmark

RoboDesk A Multi-Task Reinforcement Learning Benchmark If you find this open source release useful, please reference in your paper: @misc{kannan2021ro

Google Research 66 Oct 07, 2022
A Jupyter notebook to play with NVIDIA's StyleGAN3 and OpenAI's CLIP for a text-based guided image generation.

A Jupyter notebook to play with NVIDIA's StyleGAN3 and OpenAI's CLIP for a text-based guided image generation.

Eugenio Herrera 175 Dec 29, 2022
A set of tools to pre-calibrate and calibrate (multi-focus) plenoptic cameras (e.g., a Raytrix R12) based on the libpleno.

COMPOTE: Calibration Of Multi-focus PlenOpTic camEra. COMPOTE is a set of tools to pre-calibrate and calibrate (multifocus) plenoptic cameras (e.g., a

ComSEE - Computers that SEE 4 May 10, 2022
🚩🚩🚩

My CTF Challenges 2021 AIS3 Pre-exam / MyFirstCTF Name Category Keywords Difficulty ⒸⓄⓋⒾⒹ-①⑨ (MyFirstCTF Only) Reverse Baby ★ Piano Reverse C#, .NET ★

6 Oct 28, 2021
pip install python-office

🍬 python for office 👉 http://www.python4office.cn/ 👈 🌎 English Documentation 📚 简介 Python-office 是一个 Python 自动化办公第三方库,能解决大部分自动化办公的问题。而且每个功能只需一行代码,

程序员晚枫 272 Dec 29, 2022
Spatial Single-Cell Analysis Toolkit

Single-Cell Image Analysis Package Scimap is a scalable toolkit for analyzing spatial molecular data. The underlying framework is generalizable to spa

Laboratory of Systems Pharmacology @ Harvard 30 Nov 08, 2022
Generic Event Boundary Detection: A Benchmark for Event Segmentation

Generic Event Boundary Detection: A Benchmark for Event Segmentation We release our data annotation & baseline codes for detecting generic event bound

47 Nov 22, 2022
This repository contains several image-to-image translation models, whcih were tested for RGB to NIR image generation. The models are Pix2Pix, Pix2PixHD, CycleGAN and PointWise.

RGB2NIR_Experimental This repository contains several image-to-image translation models, whcih were tested for RGB to NIR image generation. The models

5 Jan 04, 2023
Parris, the automated infrastructure setup tool for machine learning algorithms.

README Parris, the automated infrastructure setup tool for machine learning algorithms. What Is This Tool? Parris is a tool for automating the trainin

Joseph Greene 319 Aug 02, 2022
Implementation of PyTorch-based multi-task pre-trained models

mtdp Library containing implementation related to the research paper "Multi-task pre-training of deep neural networks for digital pathology" (Mormont

Romain Mormont 27 Oct 14, 2022
Parasite: a tool allowing you to compress and decompress files, to reduce their size

🦠 Parasite 🦠 Parasite is a tool written in Python3 allowing you to "compress" any file, reducing its size. ⭐ Features ⭐ + Fast + Good optimization,

Billy 30 Nov 25, 2022