Map Reduce Wordcount in Python using gRPC

Overview

Map-Reduce Word Count

This project is implemented in Python using gRPC. The input files are given in .txt format and the word count operation is performed.

The medium article for understansing the code better is Learn gRPC with an Example

Description

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication

Installation and Usage

Setup

Clone this repository:

$ git clone https://github.com/divija-swetha/coding-exercise.git

Dependencies

$ python -V
    Python 3.8.5
$ python -m pip install grpcio
$ python -m pip install grpcio tools

Code Structure

There are three main files called the client, worker and driver. The client gives the input files and the number of output files and the worker ports (e.g. 127.0.0.1:4001). The worker nodes are launched with their ports and are responsible for the map and reduce operations. The driver takes the input from the client and distributes the work among all the worker nodes.

Proto Files

The code implenation begins with writing the proto files for the driver and worker.

driver.proto

Driver file launches the data processing operation, which is carried out when rpc launchDriver (launchData) returns (status); is executed in the code.

Once the proto file is written, the following command is executed in the terminal.

$ python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. driver.proto

This generates two files in the directory named, driver_pb2_grpc.py and driver_pb2.py.

worker.proto

The worker file sets the driver port, carries out the map and reduce operations. An additional method, die, is provided to terminate the process. The worker methods are as follows:

rpc setDriverPort(driverPort) returns (status);
rpc map(mapInput) returns (status);
rpc reduce (rid) returns (status);
rpc die (empty) returns (status);

Similar to driver, the following command is executed in the terminal.

$ python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. worker.proto

This generates two files in the directory named, worker_pb2_grpc.py and worker_pb2.py.

Python Files

Once the proto files are ready, python files for client, driver and worker are written.

worker.py

The files worker_pb2_grpc.py and worker_pb2.py are imported along with the python libraries. The code for map and reduce are defined in the worker class along with connecting to the driver port.

MAP

Mapper function maps input key/value pairs to a set of intermediate key/value pairs. Maps are the individual tasks that transform input records into intermediate records. The transformed intermediate records do not need to be of the same type as the input records. A given input pair may map to zero or many output pairs. The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

The input text files are opened in read mode. The operations performed on a given input file are converting it to lower case, removing special charectors (other than words) and tranform the document into words. Then each word is sent to a bucket and these are stored in files.

REDUCE

Reducer reduces a set of intermediate values which share a key to a smaller set of values. The numver of reduce operations are drfined in client.py. In reduce function, glob library is used to extract all files based on a similar id. Then we use the counter function (imported from library collections) to generate a dictionary with the frequency of words.

driver.py

The worker.py file, all the files generated from proto files and python libraries are imported. The Driver class has several functions. The files are loaded and the worker ports are saved in launchDriver and then connections are established with all the workers. For each worker, map operation is sent along with the parameters. This loop continues untill all the input files are mapped. Then reduce operation is performed similar to the map operation. The driver sents task completed update to the client.

client.py

The in-built python libraries are imported along with the files generated from the proto file. A connection channel is established with the driver. Inputs and number of reduce operations and worker ports are initialized when the client is launched. Once the word count operation is performed, it exits with a message.

All the connections communicate through gRPC and evry connection is timed. If a worker or driver connection doesn't respond within 10 sec, the connection would be timed out. The driver distributed the work based on the information of worker's state and assigns jobs to the idle workers. If no workers are available, the driver waits and tries again. Once it receives the worker outputs, it aggregates the results.

The input text files are stored in inputs folder. The intermediate files are stored in temp folder and outputs in out folder. The visualization is provided in the video on how the files are created in temp and out during the execution of the code. The video shows the working of the code in visual studio in anaconda environment.

Video

Running the Code

You can run the code with any number of workers and output files. Here, I am running for 3 workers and 6 output files.

  1. Launch 3 workers by running the following command in the terminal.Provide the port number along with the command.
$ python worker.py 4001

Open another terminal and run the following command:

$ python worker.py 4002

Each worker needs to be declared in a seperate terminal. Open a new terminal and run the following command:

$ python worker.py 4003
  1. Launch the driver in a new terminal using the following command:
$ python driver.py 4000
  1. Finally launch the client in a new terminal using the following command:
$ python client.py ./inputs 6 4001 4002 4003

Additional Information

  1. Bloom RPC can be used to visualize the gRPC server client communication.
  2. Stagglers can be handled by various methods like replication, etc.

References

GRPC Tutorial

Map Reduce

Map Reduce Tutorial using Python

Bloom RPC

Contributors

License

MIT LICENSE

Owner
Divija
Divija
一款高性能敏感词(非法词/脏字)检测过滤组件,附带繁体简体互换,支持全角半角互换,汉字转拼音,模糊搜索等功能。

一款高性能非法词(敏感词)检测组件,附带繁体简体互换,支持全角半角互换,获取拼音首字母,获取拼音字母,拼音模糊搜索等功能。

ToolGood 3.6k Jan 07, 2023
CowExcept - Spice up those exceptions with cowexcept!

CowExcept - Spice up those exceptions with cowexcept!

James Ansley 41 Jun 30, 2022
Widevine KEY Extractor in Python

Widevine Client 3 This was originally written by T3rry7f. This repo is slightly modified version of his repo. This only works on standard Windows! Usa

Vank0n (SJJeon) 68 Dec 29, 2022
Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

Fuzzy string matching like a boss. It uses Levenshtein Distance to calculate the differences between sequences in a simple-to-use package.

SeatGeek 1.2k Jan 01, 2023
Hamming code generation, error detection & correction.

Hamming code generation, error detection & correction.

Farhan Bin Amin 2 Jun 30, 2022
Redlines produces a Markdown text showing the differences between two strings/text

Redlines Redlines produces a Markdown text showing the differences between two strings/text. The changes are represented with strike-throughs and unde

Houfu Ang 2 Apr 08, 2022
RSS Reader application for the Emacs Application Framework.

EAF RSS Reader RSS Reader application for the Emacs Application Framework. Load application (add-to-list 'load-path "~/.emacs.d/site-lisp/eaf-rss-read

EAF 15 Dec 07, 2022
Map Reduce Wordcount in Python using gRPC

This project is implemented in Python using gRPC. The input files are given in .txt format and the word count operation is performed.

Divija 4 Dec 05, 2022
pydantic-i18n is an extension to support an i18n for the pydantic error messages.

pydantic-i18n is an extension to support an i18n for the pydantic error messages

Boardpack 48 Dec 21, 2022
Athens: a great tool for taking notes and organising knowldge

AthensSyncer Athens is a great tool for taking notes and organising knowldge. But it is a bummer that you cannot use it accross multiple devices. Well

6 Dec 14, 2022
Wikipedia Reader for the GNOME Desktop

Wike Wike is a Wikipedia reader for the GNOME Desktop. Provides access to all the content of this online encyclopedia in a native application, with a

Hugo Olabera 126 Dec 24, 2022
Skype export archive to text converter for python

Skype export archive to text converter This software utility extracts chat logs

Roland Pihlakas open source projects 2 Jun 30, 2022
A Python3 script that simulates the user typing a text on their keyboard.

A Python3 script that simulates the user typing a text on their keyboard. (control the speed, randomness, rate of typos and more!)

Jose Gracia Berenguer 3 Feb 22, 2022
A minimal code sceleton for a textadveture parser written in python.

Textadventure sceleton written in python Use with a map file generated on https://www.trizbort.io Use the following Sockets for walking directions: n

1 Jan 06, 2022
Vastasanuli - Vastasanuli pelaa Sanuli-peliä.

Vastasanuli Vastasanuli pelaa SANULI -peliä. Se ei aina voita. Käyttö Tarttet Pythonin (3.6+). Aja make (tai lataa words.txt muualta) Asentele vaaditt

Aarni Koskela 1 Jan 06, 2022
An implementation of figlet written in Python

All of the documentation and the majority of the work done was by Christopher Jones ([emai

Peter Waller 1.1k Jan 02, 2023
Bidirectionally transformed strings

bistring The bistring library provides non-destructive versions of common string processing operations like normalization, case folding, and find/repl

Microsoft 352 Dec 19, 2022
从flomo导出的笔记中生成词云

flomo-word-cloud 从flomo导出的笔记中生成词云 如何使用? 将本项目克隆到你的电脑上,使用如下的命令,安装所需python库 pip install -r requirements.txt 在项目里新建一个file文件夹,把所有从flomo导出的html文件放入其中 运行main

Hannnk 9 Dec 30, 2022
Format Covid values to ASCII-Table (Only for Germany and Austria)

Covid-19-Formatter (Only for Germany and Austria) Dieses Script speichert die gemeldeten Daten des RKIs / BMSGPK und formatiert diese zu einer Asci Ta

56 Jan 22, 2022
The app gets your sutitle.srt and proccess it to extract sentences

DubbingAssistants This app gets your sutitle.srt and proccess it to extract sentences, and also find Start time and End time of them. Step 1: install

Ali Booresh 1 Jan 04, 2022