Select, weight and analyze complex sample data

Overview

Sample Analytics


Large-scale surveys often use complex random mechanisms to select samples, and estimates derived from such samples must reflect those mechanisms. Samplics is a Python package that implements a set of sampling techniques for complex survey designs. These techniques are organized into the following four subpackages.

Sampling provides a set of random selection techniques used to draw a sample from a population. It also provides procedures for calculating sample sizes. The sampling subpackage contains:

  • Sample size calculation and allocation: Wald and Fleiss methods for proportions.
  • Equal probability of selection: simple random sampling (SRS) and systematic selection (SYS).
  • Probability proportional to size (PPS): Systematic, Brewer's method, Hanurav-Vijayan method, Murphy's method, and Rao-Sampford's method.

Weighting provides the procedures for adjusting sample weights. More specifically, the weighting subpackage allows the following:

  • Weight adjustment due to nonresponse
  • Weight poststratification, calibration and normalization
  • Weight replication i.e. Bootstrap, BRR, and Jackknife

Estimation provides methods for estimating the parameters of interest with uncertainty measures that are consistent with the sampling design. The estimation subpackage implements the following types of estimation methods:

  • Taylor-based, also called linearization methods
  • Replication-based estimation, i.e. Bootstrap, BRR, and Jackknife
  • Regression-based e.g. generalized regression (GREG)

Small Area Estimation (SAE) provides methods for situations where the sample size is not large enough to produce reliable or stable domain-level estimates. In those cases, SAE techniques model the output variable of interest to produce the domain-level estimates. This subpackage provides area-level and unit-level SAE methods.

For more details, visit https://samplics.readthedocs.io/en/latest/

Usage

Let's assume that we have a population and we would like to select a sample from it. The goal is to calculate the sample size for an expected proportion of 0.80 with a precision (half-width of the confidence interval) of 0.10.

from samplics.sampling import SampleSize

sample_size = SampleSize(parameter="proportion")
sample_size.calculate(target=0.80, half_ci=0.10)
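For intuition, the textbook Wald formula for a proportion under simple random sampling is n = z² p (1 − p) / e², where e is the half-width of the confidence interval. The sketch below is a standalone illustration using only the standard library. The function name `wald_sample_size` and the 95% default confidence level are assumptions made for this example, and it is not the samplics implementation.

```python
import math
from statistics import NormalDist


def wald_sample_size(p: float, half_ci: float, alpha: float = 0.05) -> int:
    """Textbook Wald sample size for a proportion (hypothetical helper)."""
    # Two-sided normal quantile, about 1.96 when alpha = 0.05
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Round up so the requested precision is met or exceeded
    return math.ceil(z**2 * p * (1 - p) / half_ci**2)


n = wald_sample_size(p=0.80, half_ci=0.10)
print(n)  # 62
```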

Furthermore, the population is located in four natural regions, i.e. North, South, East, and West. We may want to calculate sample sizes based on region-specific requirements, e.g. expected proportions, desired precisions, and associated design effects.

from samplics.sampling import SampleSize

# Region-specific requirements: expected proportions, precisions, and design effects
expected_proportions = {"North": 0.95, "South": 0.70, "East": 0.30, "West": 0.50}
half_ci = {"North": 0.30, "South": 0.10, "East": 0.15, "West": 0.10}
deff = {"North": 1, "South": 1.5, "East": 2.5, "West": 2.0}

# Stratified sample size calculation using the Fleiss method
sample_size = SampleSize(parameter="proportion", method="Fleiss", stratification=True)
sample_size.calculate(target=expected_proportions, half_ci=half_ci, deff=deff)
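To see how a design effect enters a stratified calculation, the sketch below inflates the textbook Wald size in each region by its design effect. It uses the Wald formula rather than the Fleiss method used above, so the numbers will differ from what samplics returns. The function name `stratified_wald_sizes` and the choice to round up after applying the design effect are assumptions for this illustration.

```python
import math
from statistics import NormalDist


def stratified_wald_sizes(targets, half_cis, deffs, alpha=0.05):
    """Per-stratum Wald sample sizes inflated by a design effect (hypothetical helper)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    sizes = {}
    for stratum, p in targets.items():
        # Size required under simple random sampling in this stratum
        n_srs = z**2 * p * (1 - p) / half_cis[stratum] ** 2
        # A design effect above 1 increases the required size
        sizes[stratum] = math.ceil(deffs[stratum] * n_srs)
    return sizes


region_sizes = stratified_wald_sizes(
    targets={"North": 0.95, "South": 0.70, "East": 0.30, "West": 0.50},
    half_cis={"North": 0.30, "South": 0.10, "East": 0.15, "West": 0.10},
    deffs={"North": 1, "South": 1.5, "East": 2.5, "West": 2.0},
)
print(region_sizes)  # {'North': 3, 'South': 122, 'East': 90, 'West': 193}
```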

To select a sample of primary sampling units (PSUs) using a PPS method, we can use code similar to the snippet below. Note that we first use the datasets module to import the example dataset.

# First we import the example dataset
from samplics.datasets import load_psu_frame
psu_frame_dict = load_psu_frame()
psu_frame = psu_frame_dict["data"]

# Code for the sample selection
from samplics.sampling import SampleSelection

psu_sample_size = {"East": 3, "West": 2, "North": 2, "South": 3}
pps_design = SampleSelection(
   method="pps-sys",
   stratification=True,
   with_replacement=False
   )

psu_frame["psu_prob"] = pps_design.inclusion_probs(
   psu_frame["cluster"],
   psu_sample_size,
   psu_frame["region"],
   psu_frame["number_households_census"]
   )
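Under PPS sampling without replacement, the first-order inclusion probability of a unit is commonly taken as the stratum sample size times the unit's share of the stratum's total size measure. The sketch below illustrates that formula on made-up data. The helper `pps_inclusion_probs` is hypothetical, and it does not handle certainty units (values that would exceed 1), which a real implementation has to treat separately.

```python
from collections import defaultdict


def pps_inclusion_probs(sizes, strata, n_by_stratum):
    """pi_i = n_h * size_i / (total size in stratum h); hypothetical helper."""
    totals = defaultdict(float)
    for size, stratum in zip(sizes, strata):
        totals[stratum] += size
    return [n_by_stratum[s] * size / totals[s] for size, s in zip(sizes, strata)]


# Made-up frame: four clusters in "East", two in "West"
sizes = [150, 250, 200, 400, 300, 700]
strata = ["East", "East", "East", "East", "West", "West"]
probs = pps_inclusion_probs(sizes, strata, {"East": 2, "West": 1})

# Within a stratum, the probabilities add up to the stratum sample size
east_total = sum(p for p, s in zip(probs, strata) if s == "East")
print(round(east_total, 6))
```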

The initial weighting step is to obtain the design sample weights. Here, we show a simple example for a two-stage sampling design.

import pandas as pd

from samplics.datasets import load_psu_sample, load_ssu_sample
from samplics.weighting import SampleWeight

# Load PSU sample data
psu_sample_dict = load_psu_sample()
psu_sample = psu_sample_dict["data"]

# Load SSU sample data
ssu_sample_dict = load_ssu_sample()
ssu_sample = ssu_sample_dict["data"]

full_sample = pd.merge(
    psu_sample[["cluster", "region", "psu_prob"]],
    ssu_sample[["cluster", "household", "ssu_prob"]],
    on="cluster"
)

full_sample["inclusion_prob"] = full_sample["psu_prob"] * full_sample["ssu_prob"]
full_sample["design_weight"] = 1 / full_sample["inclusion_prob"]

To adjust the design sample weight for nonresponse, we can use code similar to:

import numpy as np

from samplics.weighting import SampleWeight

# Simulate response
np.random.seed(7)
full_sample["response_status"] = np.random.choice(
    ["ineligible", "respondent", "non-respondent", "unknown"],
    size=full_sample.shape[0],
    p=(0.10, 0.70, 0.15, 0.05),
)
# Map custom response statuses to the generic samplics statuses
status_mapping = {
   "in": "ineligible",
   "rr": "respondent",
   "nr": "non-respondent",
   "uk":"unknown"
   }
# adjust sample weights
full_sample["nr_weight"] = SampleWeight().adjust(
   samp_weight=full_sample["design_weight"],
   adjust_class=full_sample["region"],
   resp_status=full_sample["response_status"],
   resp_dict=status_mapping
   )
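The idea behind a weighting-class nonresponse adjustment can be sketched without the package: within each adjustment class, the weight of nonrespondents is transferred to respondents so that the class total is preserved. The helper `adjust_for_nonresponse` below is hypothetical and deliberately simplified. In particular, it leaves ineligible and unknown-status units unchanged, which is an assumption of this sketch and not a description of how samplics treats them.

```python
from collections import defaultdict


def adjust_for_nonresponse(weights, classes, statuses):
    """Move nonrespondent weight onto respondents within each class (sketch)."""
    resp_total = defaultdict(float)
    nonresp_total = defaultdict(float)
    for w, c, s in zip(weights, classes, statuses):
        if s == "respondent":
            resp_total[c] += w
        elif s == "non-respondent":
            nonresp_total[c] += w

    adjusted = []
    for w, c, s in zip(weights, classes, statuses):
        if s == "respondent":
            # Inflate by (respondent + nonrespondent weight) / respondent weight
            factor = (resp_total[c] + nonresp_total[c]) / resp_total[c]
            adjusted.append(w * factor)
        elif s == "non-respondent":
            adjusted.append(0.0)
        else:
            adjusted.append(w)  # simplification: other statuses are untouched
    return adjusted


toy_weights = adjust_for_nonresponse(
    weights=[10.0, 10.0, 20.0, 20.0],
    classes=["North", "North", "South", "South"],
    statuses=["respondent", "non-respondent", "respondent", "respondent"],
)
print(toy_weights)  # [20.0, 0.0, 20.0, 20.0]
```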

To estimate population parameters using Taylor-based and replication-based methods, we can use code similar to:

# Taylor-based
from samplics.datasets import load_nhanes2

nhanes2_dict = load_nhanes2()
nhanes2 = nhanes2_dict["data"]

from samplics.estimation import TaylorEstimator

zinc_mean_str = TaylorEstimator("mean")
zinc_mean_str.estimate(
    y=nhanes2["zinc"],
    samp_weight=nhanes2["finalwgt"],
    stratum=nhanes2["stratid"],
    psu=nhanes2["psuid"],
    remove_nan=True,
)

# Replication-based
from samplics.datasets import load_nhanes2brr

nhanes2brr_dict = load_nhanes2brr()
nhanes2brr = nhanes2brr_dict["data"]

from samplics.estimation import ReplicateEstimator

ratio_wgt_hgt = ReplicateEstimator("brr", "ratio").estimate(
    y=nhanes2brr["weight"],
    samp_weight=nhanes2brr["finalwgt"],
    x=nhanes2brr["height"],
    rep_weights=nhanes2brr.loc[:, "brr_1":"brr_32"],
    remove_nan=True,
)
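As a rough picture of what a replication-based estimator does, the sketch below computes a weighted ratio with the full-sample weights, recomputes it with each set of replicate weights, and averages the squared deviations, which is the standard BRR variance formula without a Fay adjustment. The data, the replicate weights, and the function names are made up for illustration and are not taken from samplics or the NHANES II dataset.

```python
def weighted_ratio(y, x, w):
    """Ratio of the weighted total of y to the weighted total of x."""
    return sum(wi * yi for wi, yi in zip(w, y)) / sum(wi * xi for wi, xi in zip(w, x))


def brr_ratio(y, x, full_weights, replicate_weights):
    """Point estimate and BRR variance: mean squared deviation of the replicates."""
    theta = weighted_ratio(y, x, full_weights)
    reps = [weighted_ratio(y, x, rw) for rw in replicate_weights]
    variance = sum((r - theta) ** 2 for r in reps) / len(reps)
    return theta, variance


# Toy data: two strata with two PSUs each, one observation per PSU
y = [2.0, 4.0, 6.0, 8.0]
x = [1.0, 2.0, 2.0, 4.0]
full_w = [1.0, 1.0, 1.0, 1.0]
# Each replicate keeps one PSU per stratum and doubles its weight
rep_w = [
    [2.0, 0.0, 2.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [2.0, 0.0, 0.0, 2.0],
    [0.0, 2.0, 2.0, 0.0],
]
ratio, var = brr_ratio(y, x, full_w, rep_w)
print(ratio, var ** 0.5)  # point estimate and its standard error
```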

To predict small area parameters, we can use code similar to:

import numpy as np
import pandas as pd

# Area-level basic method
from samplics.datasets import load_expenditure_milk

milk_exp_dict = load_expenditure_milk()
milk_exp = milk_exp_dict["data"]

from samplics.sae import EblupAreaModel

fh_model_reml = EblupAreaModel(method="REML")
fh_model_reml.fit(
    yhat=milk_exp["direct_est"],
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    error_std=milk_exp["std_error"],
    intercept=True,
    tol=1e-8,
)
fh_model_reml.predict(
    X=pd.get_dummies(milk_exp["major_area"], drop_first=True),
    area=milk_exp["small_area"],
    intercept=True,
)

# Unit-level basic method
from samplics.datasets import load_county_crop, load_county_crop_means

# Load County Crop sample data
countycrop_dict = load_county_crop()
countycrop = countycrop_dict["data"]
# Load County Crop Area Means sample data
countycropmeans_dict = load_county_crop_means()
countycrop_means = countycropmeans_dict["data"]

from samplics.sae import EblupUnitModel

eblup_bhf_reml = EblupUnitModel()
eblup_bhf_reml.fit(
    countycrop["corn_area"],
    countycrop[["corn_pixel", "soybeans_pixel"]],
    countycrop["county_id"],
)
eblup_bhf_reml.predict(
    Xmean=countycrop_means[["ave_corn_pixel", "ave_soybeans_pixel"]],
    area=np.linspace(1, 12, 12),
)
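For the area-level (Fay-Herriot) model, the predictor for each area is a weighted average of the direct estimate and a regression-based synthetic estimate, with weight γ = σ²ᵤ / (σ²ᵤ + ψ), where ψ is the sampling variance of the direct estimate. The sketch below shows only this shrinkage step. It assumes the model variance `sigma2_u` and the synthetic estimates are already known, whereas a package such as samplics estimates them from the data (for example by REML). The function name and the numbers are made up.

```python
def fay_herriot_predict(direct_est, synthetic_est, error_var, sigma2_u):
    """Shrink each direct estimate toward its synthetic estimate (sketch)."""
    predictions = []
    for direct, synthetic, psi in zip(direct_est, synthetic_est, error_var):
        # Areas with noisy direct estimates (large psi) get a small gamma
        gamma = sigma2_u / (sigma2_u + psi)
        predictions.append(gamma * direct + (1 - gamma) * synthetic)
    return predictions


area_preds = fay_herriot_predict(
    direct_est=[10.0, 20.0],
    synthetic_est=[12.0, 16.0],
    error_var=[1.0, 3.0],
    sigma2_u=1.0,
)
print(area_preds)  # [11.0, 17.0]
```

The second area has the larger sampling variance, so its prediction sits closer to the synthetic estimate than to the direct one.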

Installation

pip install samplics

Python 3.7 or newer is required. The main dependencies are numpy, pandas, scipy, and statsmodels.

Contribution

If you would like to contribute to the project, please read the "Contributing to samplics" guide.

License

MIT

Contact

Created by Mamadou S. Diallo - feel free to contact me!
