Calling Julia from Python - blog post material

This material is part of a series of blog posts about using Julia from Python (Soon). The idea was initially presented internally at the Netherlands eScience Center(See the slides).

Post links:

Summary

We read Patrick's blog post about improving the reading of irregular files.
Patrick has a Python (Pandas) code that is slow.
Using some packages, he moves the reading and parsing to C++.
We decided to try to replace C++ with Julia to check:
- How easy/hard it is
- How much improvement can be gained with a basic Julia code;
- How much further improvement can be gained with an optimized Julia code.

The strategies we examined are below, with a plot with the comparison following it:

Python with Pandas, as seen in Patrick's post. label: "Pure Python".
Python with reading and parsing in C++, as seen in Patrick's post. label: "C++".
Python with reading and parsing in Julia, in 4 different versions:
- Basic Julia version with mostly disregard for efficiency, label="Basic Julia".
- Julia version trying to improve memory usage. label: "Prealloc Julia".
- Julia version where the elements are read with fscanf from C. label: "Julia + C parsing".
- Julia version reading the file as bytes and manually walking through the bytes. label: "Optimized Julia".

Take-aways (see blog post):

The "Prealloc Julia" strategy is already an improvement over the "Pure Python" strategy.
The "Optimized Julia" strategy is faster than the "C++" strategy.
If you don't know Julia nor C++, moving the slow code to Julia yields benefits faster and with less effort.

The image below shows the speedup gain over the effort to get there:

Building the docker images

docker build --tag jl-from-py:<VERSION>

Reproducting the results

Download dataset and store in a folder called dataset.

Get the image with

docker pull abelsiqueira/faster-python-with-julia-blogpost:post3

Run it with

docker run --rm --volume "$PWD/dataset:/app/dataset" --volume "$PWD/out:/app/out" abelsiqueira/faster-python-with-julia-blogpost:post3

You will find the outputs in the out/ folder.

The execution of this script with default options took about 45 minutes on a Dell Precision 5530 with the Intel chip i7-8850H (2.6GHz) and 16GiB of RAM.

The docker runs the script src/main.py that runs run_experiments.py and run_analysis.py.

Arguments

--folder FOLDER: Set the dataset folder. (Default: dataset).
--max-num-files N: Maximum number of files to read from can be used to limit the experiment. The files are traversed in sorted name order. Use 0 or a negative number to run all. (Default: 0).
--skip-after X: Time threshold in seconds to skip the tests of a specific version. If the threshold is reached twice, that version is skipped in the additional tests. (Default: 0).
--skip VALUE1 [VALUE2 ...]: List of versions to skip. Valid values: python, cpp, julia_basic, julia_c, julia_prealloc, julia_opt.

Name		Name	Last commit message	Last commit date
Latest commit History 72 Commits
assets		assets
post1		post1
post2		post2
post3		post3
slides		slides
.gitignore		.gitignore
CITATION.cff		CITATION.cff
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

assets

assets

post1

post1

post2

post2

post3

post3

slides

slides

.gitignore

.gitignore

CITATION.cff

CITATION.cff

LICENSE

LICENSE

README.md

README.md

Repository files navigation

Calling Julia from Python - blog post material

Summary

Building the docker images

Reproducting the results

Arguments

About

Releases 5

Packages

Contributors 2

Languages

License

abelsiqueira/faster-python-using-julia-blogposts

Folders and files

Latest commit

History

Repository files navigation

Calling Julia from Python - blog post material

Summary

Building the docker images

Reproducting the results

Arguments

About

Resources

License

Stars

Watchers

Forks

Languages