Fully Distributed CIDACS-RL

CIDACS-RL is a Brazilian record linkage tool suitable for integrating large amounts of data with high accuracy. However, its current implementation relies on an Elasticsearch cluster to distribute the queries and on a single node to perform them through the Python multiprocessing library. This implementation of the CIDACS-RL tool can be deployed on a Spark cluster, using all the resources available to the Jupyter kernel while still using the Elasticsearch cluster, thus becoming a fully distributed, cluster-based solution. It can outperform the legacy version of CIDACS-RL in either a multi-node or a single-node Spark environment.

config.json

Almost all aspects of the linkage can be controlled through the config.json file.

Section	Sub-section	Field (datatype)	Field description
General info		index_data (str<'yes', 'no'>)	Whether the linkage process includes indexing a dataset into Elasticsearch. Constraints: string, either "yes" or "no".
General info		es_index_name (str<ES_VALID_INDEX>)	The name of an existing elasticsearch index (if index_data is 'no') or a new one (if index_data is 'yes'). Constraints: string, elasticsearch valid.
General info		es_connect_string (str<ES_URL:ES_PORT>)	Elasticsearch API address. Constraints: string, URL format.
General info		query_size (int)	Number of candidates output for each Elasticsearch query. Constraints: int.
General info		cutoff_exact_match (str<0:1 number>)	Cutoff point to determine whether a pair is an exact match or not. Constraints: str, number between 0 and 1.
General info		null_value (str)	Value used to replace missing values in both datasets involved. Constraints: string.
General info		temp_dir (str)	Directory used to write checkpoints for exact match and non-exact match phases. Constraints: string, fully qualified path.
General info		debug (str<'true', 'false'>)	If it is set as "true", all records found on exact match will be queried again on non-exact match phase.
Datasets info	Indexed dataset	path (str)	Path for csv or parquet folder of dataset to index.
Datasets info	Indexed dataset	extension (str<'csv', 'parquet'>)	String to determine the type of data reading on Spark.
Datasets info	Indexed dataset	columns (list)	Python list with column names involved on linkage.
Datasets info	Indexed dataset	id_column_name (str)	Name of id column.
Datasets info	Indexed dataset	storage_level (str<'MEMORY_AND_DISK', 'MEMORY_ONLY'>)	Directive for memory allocation on Spark.
Datasets info	Indexed dataset	default_paralelism (str<4*N_OF_AVAILABLE_CORES>)	Number of partitions of a given Spark dataframe.
Datasets info	tolink dataset	path (str)	Path for csv or parquet folder of dataset to link.
Datasets info	tolink dataset	extension (str<'csv', 'parquet'>)	String to determine the type of data reading on Spark.
Datasets info	tolink dataset	columns (list)	Python list with column names involved on linkage.
Datasets info	tolink dataset	id_column_name (str)	Name of id column.
Datasets info	tolink dataset	storage_level (str<'MEMORY_AND_DISK', 'MEMORY_ONLY'>)	Directive for memory allocation on Spark.
Datasets info	tolink dataset	default_paralelism (str<4*N_OF_AVAILABLE_CORES>)	Number of partitions of a given Spark dataframe.
Datasets info	result dataset	path (str)	Path for the folder where the linkage result is written.
Comparisons	label1	indexed_col (str)	Name of first column to be compared on indexed dataset
Comparisons	label1	tolink_col (str)	Name of first column to be compared on tolink dataset
Comparisons	label1	must_match (str<'true', 'false'>)	Whether this pair of columns is included in the exact match phase
Comparisons	label1	should_match (str<'true', 'false'>)	Whether this pair of columns is included in the non-exact match phase
Comparisons	label1	is_fuzzy (str<'true', 'false'>)	Whether this pair of columns is included in fuzzy queries for the non-exact match phase
Comparisons	label1	boost (str)	Set the boost/weight of this pair of columns on queries
Comparisons	label1	query_type (str<'match', 'term'>)	Set the type of matching for this pair of columns on non-exact match phase
Comparisons	label1	similarity (str<'jaro_winkler', 'overlap', 'hamming'>)	Set the similarity to be calculated between the values of this pair of columns
Comparisons	label1	weight (str)	Set the weight of this pair of columns.
Comparisons	label1	penalty (str)	Set the penalty of the overall similarity in case of missing value(s).
Comparisons	label2	...	...
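
To make the roles of must_match, should_match, is_fuzzy, boost and query_type more concrete, here is a minimal sketch of how a comparisons block could be turned into Elasticsearch bool queries for the two phases. It is an illustration under assumptions, not the tool's actual query builder: the function names are hypothetical, an empty boost is assumed to mean 1.0, the exact match phase is assumed to use match clauses, and fuzzy matching is assumed to use fuzziness "AUTO".

```python
def _boost(comparison):
    # Assumption: an empty 'boost' string falls back to 1.0.
    raw = comparison.get("boost", "")
    return float(raw) if raw else 1.0


def _clause(field, value, query_type, boost, fuzzy):
    """Build one Elasticsearch query clause for a single indexed column."""
    if query_type == "term":
        return {"term": {field: {"value": value, "boost": boost}}}
    body = {"query": value, "boost": boost}
    if fuzzy:
        body["fuzziness"] = "AUTO"  # assumption, not confirmed by the source
    return {"match": {field: body}}


def build_exact_query(comparisons, record, size):
    """Exact match phase: every 'must_match' column pair has to agree."""
    must = [
        _clause(c["indexed_col"], record[c["tolink_col"]], "match", _boost(c), False)
        for c in comparisons.values()
        if c["must_match"] == "true"
    ]
    return {"size": size, "query": {"bool": {"must": must}}}


def build_non_exact_query(comparisons, record, size):
    """Non-exact match phase: 'should_match' column pairs contribute to the score."""
    should = [
        _clause(c["indexed_col"], record[c["tolink_col"]], c["query_type"],
                _boost(c), c["is_fuzzy"] == "true")
        for c in comparisons.values()
        if c["should_match"] == "true"
    ]
    return {"size": size, "query": {"bool": {"should": should}}}


comparisons = {
    "name": {"indexed_col": "nome_a", "tolink_col": "nome_b",
             "must_match": "true", "should_match": "true", "is_fuzzy": "true",
             "boost": "3.0", "query_type": "match"},
    "birthdate": {"indexed_col": "dt_nasc_a", "tolink_col": "dt_nasc_b",
                  "must_match": "false", "should_match": "true", "is_fuzzy": "false",
                  "boost": "", "query_type": "term"},
}
record = {"nome_b": "MARIA DA SILVA", "dt_nasc_b": "19900131"}

exact_query = build_exact_query(comparisons, record, size=100)
non_exact_query = build_non_exact_query(comparisons, record, size=100)
```

Here size plays the role of query_size: it caps how many candidates Elasticsearch returns for each record of the tolink dataset.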

config.json example


{
    "index_data": "no",
    "es_index_name": "fd-cidacs-rl",
    "es_connect_string": "http://localhost:9200",
    "query_size": 100,
    "cutoff_exact_match": "0.95",
    "null_value": "99",
    "temp_dir": "../../../0_global_data/fd-cidacs-rl/temp_dataframe/",
    "debug": "false",

    "datasets_info": {
        "indexed_dataset": {
            "path": "../../../0_global_data/fd-cidacs-rl/sinthetic-dataset-A.parquet",
            "extension": "parquet",
            "columns": ["id_cidacs_a", "nome_a", "nome_mae_a", "dt_nasc_a", "sexo_a"],
            "id_column_name": "id_cidacs_a",
            "storage_level": "MEMORY_ONLY",
            "default_paralelism": "16"
        },
        "tolink_dataset": {
            "path": "../../../0_global_data/fd-cidacs-rl/sinthetic-datasets-b/sinthetic-datasets-b-500000.parquet",
            "extension": "parquet",
            "columns": ["id_cidacs_b", "nome_b", "nome_mae_b", "dt_nasc_b", "sexo_b"],
            "id_column_name": "id_cidacs_b",
            "storage_level": "MEMORY_ONLY",
            "default_paralelism": "16"
        },
        "result_dataset": {
            "path": "../0_global_data/result/500000/"
        }
    },

    "comparisons": {
        "name": {
            "indexed_col": "nome_a",
            "tolink_col": "nome_b",
            "must_match": "true",
            "should_match": "true",
            "is_fuzzy": "true",
            "boost": "3.0",
            "query_type": "match",
            "similarity": "jaro_winkler",
            "weight": 5.0,
            "penalty": 0.02
        },
        "mothers_name": {
            "indexed_col": "nome_mae_a",
            "tolink_col": "nome_mae_b",
            "must_match": "true",
            "should_match": "true",
            "is_fuzzy": "true",
            "boost": "2.0",
            "query_type": "match",
            "similarity": "jaro_winkler",
            "weight": 5.0,
            "penalty": 0.02
        },
        "birthdate": {
            "indexed_col": "dt_nasc_a",
            "tolink_col": "dt_nasc_b",
            "must_match": "false",
            "should_match": "true",
            "is_fuzzy": "false",
            "boost": "",
            "query_type": "term",
            "similarity": "hamming",
            "weight": 1.0,
            "penalty": 0.02
        },
        "sex": {
            "indexed_col": "sexo_a",
            "tolink_col": "sexo_b",
            "must_match": "true",
            "should_match": "true",
            "is_fuzzy": "false",
            "boost": "",
            "query_type": "term",
            "similarity": "overlap",
            "weight": 3.0,
            "penalty": 0.02
        }
    }
}
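
Before starting a long linkage run, it can help to check a configuration against the constraints listed in the field table. The following is a minimal sketch of such a check, assuming the configuration is stored as standard JSON; load_config and validate_config are hypothetical helpers, not part of CIDACS-RL, and they only cover a subset of the constraints.

```python
import json

REQUIRED_TOP_LEVEL = [
    "index_data", "es_index_name", "es_connect_string", "query_size",
    "cutoff_exact_match", "null_value", "temp_dir", "debug",
    "datasets_info", "comparisons",
]
REQUIRED_COMPARISON = [
    "indexed_col", "tolink_col", "must_match", "should_match", "is_fuzzy",
    "boost", "query_type", "similarity", "weight", "penalty",
]
SIMILARITIES = ("jaro_winkler", "overlap", "hamming")
QUERY_TYPES = ("match", "term")


def load_config(path):
    """Read a config.json file into a Python dict."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def validate_config(config):
    """Return a list of problems found; an empty list means none were detected."""
    problems = [f"missing field: {key}" for key in REQUIRED_TOP_LEVEL if key not in config]

    if config.get("index_data") not in ("yes", "no"):
        problems.append("index_data must be 'yes' or 'no'")
    if not isinstance(config.get("query_size"), int):
        problems.append("query_size must be an int")
    try:
        cutoff_ok = 0.0 <= float(config.get("cutoff_exact_match", "")) <= 1.0
    except (TypeError, ValueError):
        cutoff_ok = False
    if not cutoff_ok:
        problems.append("cutoff_exact_match must be a number between 0 and 1")

    for label, comp in config.get("comparisons", {}).items():
        for key in REQUIRED_COMPARISON:
            if key not in comp:
                problems.append(f"comparisons.{label}: missing field {key}")
        if "similarity" in comp and comp["similarity"] not in SIMILARITIES:
            problems.append(f"comparisons.{label}: unknown similarity {comp['similarity']!r}")
        if "query_type" in comp and comp["query_type"] not in QUERY_TYPES:
            problems.append(f"comparisons.{label}: unknown query_type {comp['query_type']!r}")
    return problems


# A deliberately broken configuration, used only to show the kind of report produced.
sample = json.loads(
    '{"index_data": "maybe", "query_size": 100, "cutoff_exact_match": "1.5",'
    ' "comparisons": {"name": {"similarity": "cosine"}}}'
)
problems = validate_config(sample)
```

In practice one would call validate_config(load_config("config.json")) and stop if the returned list is not empty.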

Running in a Standalone Spark Cluster

Read more:
https://github.com/elastic/elasticsearch-hadoop
https://www.elastic.co/guide/en/elasticsearch/hadoop/current/spark.html
https://search.maven.org/artifact/org.elasticsearch/elasticsearch-spark-30_2.12

If you intend to run this tool in a single-node Spark environment, consider including the following in your spark-submit or spark-shell command line:


pyspark --packages org.elasticsearch:elasticsearch-spark-30_2.12:7.14.0 --conf spark.es.nodes="localhost" --conf spark.es.port="9200"
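
The host and port given on the command line should agree with es_connect_string in config.json. As a minimal sketch, the arguments could be derived from that single value; submit_args is a hypothetical helper, and the fallback to port 9200 when the URL has no port is an assumption.

```python
from urllib.parse import urlsplit

# Connector coordinates as used in the command line of this README.
ES_SPARK_PACKAGE = "org.elasticsearch:elasticsearch-spark-30_2.12:7.14.0"


def submit_args(es_connect_string, master=None):
    """Build pyspark/spark-submit arguments from an Elasticsearch URL."""
    parts = urlsplit(es_connect_string)
    args = []
    if master:
        args += ["--master", master]
    args += [
        "--packages", ES_SPARK_PACKAGE,
        "--conf", f"spark.es.nodes={parts.hostname}",
        # Assumption: fall back to 9200 when the URL does not state a port.
        "--conf", f"spark.es.port={parts.port or 9200}",
    ]
    return args


local_args = submit_args("http://localhost:9200")
command_line = "pyspark " + " ".join(local_args)
```

The same list, joined with spaces and followed by pyspark-shell, could be used as the value of the PYSPARK_SUBMIT_ARGS environment variable.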

If you are running on a Spark cluster under JupyterHub kernels, try adding this kernel or editing an existing one:


{
    "display_name": "Spark3.3",
    "language": "python",
    "argv": [
        "/opt/bigdata/anaconda3/bin/python",
        "-m",
        "ipykernel",
        "-f",
        "{connection_file}"
    ],
    "env": {
        "SPARK_HOME": "/opt/bigdata/spark",
        "PYTHONPATH": "/opt/bigdata/spark/python:/opt/bigdata/spark/python/lib/py4j-0.10.9.2-src.zip",
        "PYTHONSTARTUP": "/opt/bigdata/spark/python/pyspark/python/pyspark/shell.py",
        "PYSPARK_PYTHON": "/opt/bigdata/anaconda3/bin/python",
        "PYSPARK_SUBMIT_ARGS": "--master spark://node1.sparkcluster:7077 --packages org.elasticsearch:elasticsearch-spark-30_2.12:7.14.0 --conf spark.es.nodes=['node1','node2'] --conf spark.es.port='9200' pyspark-shell"
    }
}

Some advice for indexed data and queries

Every column should be cast to string (df.withColumn('column', F.col('column').cast('string'))).
Date-type columns will not be properly indexed as strings unless a preprocessing step transforms them from yyyy-MM-dd to yyyyMMdd.
All the nodes of the Elasticsearch cluster must be listed in the Spark configuration passed along with --packages (the spark.es.nodes setting).
Term queries are well suited to well-structured variables, such as CPF, dates, CNPJ, etc.
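
The first two points can be illustrated on a single value. The following plain-Python sketch shows the intended normalization: everything becomes a string, dates are rewritten from yyyy-MM-dd to yyyyMMdd, and missing values are replaced by the configured null_value. The helper to_index_string is hypothetical and is not part of CIDACS-RL; on a Spark dataframe the same effect would normally be obtained with column expressions such as cast('string') and pyspark.sql.functions.date_format.

```python
from datetime import date, datetime


def to_index_string(value, null_value="99"):
    """Render one cell as the string that would be sent to Elasticsearch."""
    if value is None or value == "":
        return null_value  # same role as 'null_value' in config.json
    if isinstance(value, date):  # also covers datetime, a subclass of date
        return value.strftime("%Y%m%d")
    text = str(value).strip()
    try:
        # Strings already in yyyy-MM-dd form are rewritten as yyyyMMdd.
        return datetime.strptime(text, "%Y-%m-%d").strftime("%Y%m%d")
    except ValueError:
        return text


row = {"nome_a": "MARIA DA SILVA", "dt_nasc_a": "1990-01-31", "sexo_a": 2, "nome_mae_a": None}
prepared = {column: to_index_string(value) for column, value in row.items()}
```

Keeping dates as compact digit strings is what makes term queries and the hamming similarity behave predictably on them.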

A new version of the CIDACS-RL linkage tool suitable for a cluster computing environment.

Owner

Robespierre Pita
