abelsiqueira/faster-python-using-julia-blogposts

Calling Julia from Python - blog post material

This material is part of a series of blog posts about using Julia from Python (coming soon). The idea was initially presented internally at the Netherlands eScience Center (see the slides).

Post links:

Summary

  • We read Patrick's blog post about improving the reading of irregular files.
  • Patrick has Python (Pandas) code that is slow.
  • Using some packages, he moves the reading and parsing to C++.
  • We decided to try replacing C++ with Julia to check:
    • How easy or hard it is;
    • How much improvement can be gained with basic Julia code;
    • How much further improvement can be gained with optimized Julia code.

The strategies we examined are listed below, followed by a plot comparing them:

  • Python with Pandas, as seen in Patrick's post. label: "Pure Python".
  • Python with reading and parsing in C++, as seen in Patrick's post. label: "C++".
  • Python with reading and parsing in Julia, in 4 different versions:
    • Basic Julia version, written with little regard for efficiency. label: "Basic Julia".
    • Julia version trying to improve memory usage. label: "Prealloc Julia".
    • Julia version where the elements are read with fscanf from C. label: "Julia + C parsing".
    • Julia version reading the file as bytes and manually walking through the bytes. label: "Optimized Julia".
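
To make the "reading the file as bytes and manually walking through the bytes" idea concrete, here is a minimal sketch in Python. It is not the repository's Julia code: the function name parse_ints and the input format (non-negative ASCII integers separated by any non-digit bytes) are assumptions chosen only for illustration.

```python
def parse_ints(data: bytes) -> list[int]:
    """Walk the raw bytes once and build integers digit by digit.

    Assumption (illustrative only): the input holds non-negative ASCII
    integers separated by arbitrary non-digit bytes such as spaces or newlines.
    """
    values = []
    current = 0
    in_number = False
    for byte in data:  # iterating over bytes yields ints in 0..255
        if 48 <= byte <= 57:  # ASCII '0'..'9'
            current = current * 10 + (byte - 48)
            in_number = True
        elif in_number:
            # A non-digit byte ends the number being accumulated.
            values.append(current)
            current = 0
            in_number = False
    if in_number:
        # Flush the last number when the data does not end with a separator.
        values.append(current)
    return values
```

For example, parse_ints(b"12 7\n305") returns [12, 7, 305]. The technique avoids creating intermediate strings, which tends to pay off in a compiled language such as Julia; a byte-by-byte loop like this one is unlikely to be fast in CPython itself.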

Take-aways (see blog post):

  • The "Prealloc Julia" strategy is already an improvement over the "Pure Python" strategy.
  • The "Optimized Julia" strategy is faster than the "C++" strategy.
  • If you know neither Julia nor C++, moving the slow code to Julia yields benefits faster and with less effort.

The image below shows the speedup gained relative to the effort required to get there:

Building the docker images

docker build --tag jl-from-py:<VERSION> .

Reproducing the results

  • Download the dataset and store it in a folder called dataset.

  • Get the image with

    docker pull abelsiqueira/faster-python-with-julia-blogpost:post3
  • Run it with

    docker run --rm --volume "$PWD/dataset:/app/dataset" --volume "$PWD/out:/app/out" abelsiqueira/faster-python-with-julia-blogpost:post3
  • You will find the outputs in the out/ folder.

The execution of this script with default options took about 45 minutes on a Dell Precision 5530 with the Intel chip i7-8850H (2.6GHz) and 16GiB of RAM.

The Docker container runs the script src/main.py, which in turn runs run_experiments.py and run_analysis.py.

Arguments

  • --folder FOLDER: Set the dataset folder. (Default: dataset).
  • --max-num-files N: Maximum number of files to read, which can be used to limit the experiment. The files are traversed in sorted name order. Use 0 or a negative number to run all. (Default: 0).
  • --skip-after X: Time threshold, in seconds, for skipping the tests of a specific version. If a version reaches the threshold twice, it is skipped in the subsequent tests. (Default: 0).
  • --skip VALUE1 [VALUE2 ...]: List of versions to skip. Valid values: python, cpp, julia_basic, julia_c, julia_prealloc, julia_opt.
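
The arguments above could be wired up roughly as in the following sketch. It is not the actual src/main.py: the helper names build_parser, select_files and SkipTracker are hypothetical, and treating a --skip-after value of 0 as "never skip" is an assumption, since the default is documented but its meaning is not.

```python
import argparse
from pathlib import Path

# Version names taken from the --skip description above.
VERSIONS = ["python", "cpp", "julia_basic", "julia_c", "julia_prealloc", "julia_opt"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented command-line flags."""
    parser = argparse.ArgumentParser(description="Run the experiments and the analysis.")
    parser.add_argument("--folder", default="dataset", help="Dataset folder.")
    parser.add_argument("--max-num-files", type=int, default=0, metavar="N",
                        help="Maximum number of files to read; 0 or negative runs all.")
    parser.add_argument("--skip-after", type=float, default=0, metavar="X",
                        help="Time threshold in seconds for skipping a version.")
    parser.add_argument("--skip", nargs="+", default=[], choices=VERSIONS,
                        metavar="VALUE", help="Versions to skip.")
    return parser


def select_files(folder: str, max_num_files: int) -> list[Path]:
    """Return the files in sorted name order, limited to max_num_files if positive."""
    files = sorted((p for p in Path(folder).iterdir() if p.is_file()),
                   key=lambda p: p.name)
    return files if max_num_files <= 0 else files[:max_num_files]


class SkipTracker:
    """Marks a version as skipped once it has reached the threshold twice."""

    def __init__(self, threshold: float, skipped=()):
        self.threshold = threshold
        self.strikes = {}
        self.skipped = set(skipped)

    def record(self, version: str, elapsed: float) -> None:
        # Assumption: a threshold of 0 (the default) disables the check.
        if self.threshold > 0 and elapsed >= self.threshold:
            self.strikes[version] = self.strikes.get(version, 0) + 1
            if self.strikes[version] >= 2:
                self.skipped.add(version)

    def should_skip(self, version: str) -> bool:
        return version in self.skipped
```

With this sketch, build_parser().parse_args(["--max-num-files", "3", "--skip", "cpp", "julia_c"]) yields folder "dataset", max_num_files 3 and skip ["cpp", "julia_c"], and a SkipTracker seeded with args.skip starts out skipping those versions.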