In-place Parallel Super Scalar Samplesort (IPS⁴o)

Related tags

Deep Learningips4o
Overview

In-place Parallel Super Scalar Samplesort (IPS⁴o)

This is the implementation of the algorithm IPS⁴o presented in the paper Engineering In-place (Shared-memory) Sorting Algorithms, which contains an in-depth description of its inner workings, as well as an extensive experimental performance evaluation. Here's the abstract:

We present new sequential and parallel sorting algorithms that now represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. Somewhat surprisingly, part of the speed advantage is due to the additional feature of the algorithms to work in-place, i.e., they do not need a significant amount of space beyond the input array. Previously, the in-place feature often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach taking dynamic load balancing and memory locality into account.

Our new comparison-based algorithm In-place Superscalar Samplesort (IPS⁴o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best previous in-place parallel comparison-based sorting algorithms by almost a factor of three. That algorithm also outperforms the best comparison-based competitors regardless of whether we consider in-place or not in-place, parallel or sequential settings.

Another surprising result is that IPS⁴o even outperforms the best (in-place or not in-place) integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new In-place Parallel Super Scalar Radix Sort (IPS²Ra) turns out to be the best algorithm.

Claims to have the -- in some sense -- "best" sorting algorithm can be found in many papers which cannot all be true. Therefore, we base our conclusions on an extensive experimental study involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This confirms the claims made about the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications. This is particularly true for integer sorting algorithms giving one reason to prefer comparison-based algorithms for robust general-purpose sorting.

An initial version of IPS⁴o has been described in our publication on the 25th Annual European Symposium on Algorithms.

Usage

Clone this repository and check out its submodule

git clone --recurse-submodules https://github.com/ips4o/ips4o.git

or use the following commands instead if you want to include this repository as a submodule:

git submodule add https://github.com/ips4o/ips4o.git
git submodule update --recursive --init

IPS⁴o provides a CMake library for simple usage:

add_subdirectory(<path-to-the-ips4o-repository>)
target_link_libraries(<your-target> PRIVATE ips4o)

A minimal working example:

#include "ips4o.hpp"

// sort sequentially
ips4o::sort(begin, end[, comparator]);

// sort in parallel (uses OpenMP if available, std::thread otherwise)
ips4o::parallel::sort(begin, end[, comparator]);

The parallel version of IPS⁴o requires 16-byte atomic compare-and-exchange instructions to run the fastest. Most CPUs and compilers support 16-byte compare-and-exchange instructions nowadays. If the CPU in question does so, IPS⁴o uses 16-byte compare-and-exchange instructions when you set your CPU correctly (e.g., -march=native) or when you enable the instructions explicitly (-mcx16). In this case, you also have to link against GCC's libatomic (-latomic). Otherwise, we emulate some 16-byte compare-and-exchange instructions with locks which may slightly mitigate the performance of IPS⁴o.

If you use the CMake example shown above, we automatically optimize IPS⁴o for the native CPU (e.g., -march=native). You can disable the CMake property IPS4O_OPTIMIZE_FOR_NATIVE to avoid native optimization and you can enable the CMake property IPS4O_USE_MCX16 if you compile with GCC or Clang to enable 16-byte compare-and-exchange instructions explicitly.

IPS⁴o uses C++ threads if not specified otherwise. If you prefer OpenMP threads, you need to enable OpenMP threads, e.g., enable the CMake property IPS4O_USE_OPENMP or add OpenMP to your target. If you enable the CMake property DISABLE_IPS4O_PARALLEL, most of the parallel code will not be compiled and no parallel libraries will be linked. Otherwise, CMake automatically enables C++ threads (e.g., -pthread) and links against TBB and GCC's libatomic. (Only when you compile your code for 16-byte compare-and-exchange instructions you need libatomic.) Thus, you need the Thread Building Blocks (TBB) library to compile and execute the parallel version of IPS⁴o. We search for TBB with find_package(TBB REQUIRED). If you want to execute IPS⁴o in parallel but your TBB library is not accessible via find_package(TBB REQUIRED), you can still compile IPS⁴o with parallel support. Just enable the CMake property DISABLE_IPS4O_PARALLEL, enable C++ threads for your own target and link your own target against your TBB library (and also link your target against libatomic if you want 16-byte atomic compare-and-exchange instruction support).

If you do not set a CMake build type, we use the build type Release which disables debugging (e.g., -DNDEBUG) and enables optimizations (e.g., -O3).

Currently, the code does not compile on Windows.

Licensing

IPS⁴o is free software provided under the BSD 2-Clause License described in the LICENSE file. If you use this implementation of IPS⁴o in an academic setting please cite the paper Engineering In-place (Shared-memory) Sorting Algorithms using the BibTeX entry

@misc{axtmann2020engineering,
  title =	 {Engineering In-place (Shared-memory) Sorting Algorithms},
  author =	 {Michael Axtmann and Sascha Witt and Daniel Ferizovic and Peter Sanders},
  howpublished = {Computing Research Repository (CoRR)},
  year =	 {Sept. 2020},
  archivePrefix ={arXiv},
  eprint =	 {2009.13569},
}
This is an example implementation of the paper "Cross Domain Robot Imitation with Invariant Representation".

IR-GAIL This is an example implementation of the paper "Cross Domain Robot Imitation with Invariant Representation". Dependency The experiments are de

Zhao-Heng Yin 1 Jul 14, 2022
Performant, differentiable reinforcement learning

deluca Performant, differentiable reinforcement learning Notes This is pre-alpha software and is undergoing a number of core changes. Updates to follo

Google 114 Dec 27, 2022
Official Code Release for "CLIP-Adapter: Better Vision-Language Models with Feature Adapters"

Official Code Release for "CLIP-Adapter: Better Vision-Language Models with Feature Adapters" Pipeline of CLIP-Adapter CLIP-Adapter is a drop-in modul

peng gao 157 Dec 26, 2022
Deep Learning and Reinforcement Learning Library for Scientists and Engineers 🔥

TensorLayer is a novel TensorFlow-based deep learning and reinforcement learning library designed for researchers and engineers. It provides an extens

TensorLayer Community 7.1k Dec 29, 2022
Bayesian optimisation library developped by Huawei Noah's Ark Library

Bayesian Optimisation Research This directory contains official implementations for Bayesian optimisation works developped by Huawei R&D, Noah's Ark L

HUAWEI Noah's Ark Lab 395 Dec 30, 2022
Permeability Prediction Via Multi Scale 3D CNN

Permeability-Prediction-Via-Multi-Scale-3D-CNN Data: The raw CT rock cores are obtained from the Imperial Colloge portal. The CT rock cores are sub-sa

Mohamed Elmorsy 2 Jul 06, 2022
Definition of a business problem according to Wilson Lower Bound Score and Time Based Average Rating

Wilson Lower Bound Score, Time Based Rating Average In this study I tried to calculate the product rating and sorting reviews more accurately. I have

3 Sep 30, 2021
An open source app to help calm you down when needed.

By: Seanpm2001, Et; Al. Top README.md Read this article in a different language Sorted by: A-Z Sorting options unavailable ( af Afrikaans Afrikaans |

Sean P. Myrick V19.1.7.2 2 Oct 24, 2022
Incomplete easy-to-use math solver and PDF generator.

Math Expert Let me do your work Preview preview.mp4 Introduction Math Expert is our (@salastro, @younis-tarek, @marawn-mogeb) math high school graduat

SalahDin Ahmed 22 Jul 11, 2022
This is the official implementation of the paper "Object Propagation via Inter-Frame Attentions for Temporally Stable Video Instance Segmentation".

ObjProp Introduction This is the official implementation of the paper "Object Propagation via Inter-Frame Attentions for Temporally Stable Video Insta

Anirudh S Chakravarthy 6 May 03, 2022
Complementary Patch for Weakly Supervised Semantic Segmentation, ICCV21 (poster)

CPN (ICCV2021) This is an implementation of Complementary Patch for Weakly Supervised Semantic Segmentation, which is accepted by ICCV2021 poster. Thi

Ferenas 20 Dec 12, 2022
AITUS - An atomatic notr maker for CYTUS

AITUS an automatic note maker for CYTUS. 利用AI根据指定乐曲生成CYTUS游戏谱面。 效果展示:https://www

GradiusTwinbee 6 Feb 24, 2022
[ICLR 2021 Spotlight Oral] "Undistillable: Making A Nasty Teacher That CANNOT teach students", Haoyu Ma, Tianlong Chen, Ting-Kuei Hu, Chenyu You, Xiaohui Xie, Zhangyang Wang

Undistillable: Making A Nasty Teacher That CANNOT teach students "Undistillable: Making A Nasty Teacher That CANNOT teach students" Haoyu Ma, Tianlong

VITA 71 Dec 28, 2022
AdaFocus (ICCV 2021) Adaptive Focus for Efficient Video Recognition

AdaFocus (ICCV 2021) This repo contains the official code and pre-trained models for AdaFocus. Adaptive Focus for Efficient Video Recognition Referenc

Rainforest Wang 115 Dec 21, 2022
Code for Universal Semi-Supervised Semantic Segmentation models paper accepted in ICCV 2019

USSS_ICCV19 Code for Universal Semi Supervised Semantic Segmentation accepted to ICCV 2019. Full Paper available at https://arxiv.org/abs/1811.10323.

Tarun K 68 Nov 24, 2022
NaturalProofs: Mathematical Theorem Proving in Natural Language

NaturalProofs: Mathematical Theorem Proving in Natural Language NaturalProofs: Mathematical Theorem Proving in Natural Language Sean Welleck, Jiacheng

Sean Welleck 83 Jan 05, 2023
Code for our ICASSP 2021 paper: SA-Net: Shuffle Attention for Deep Convolutional Neural Networks

SA-Net: Shuffle Attention for Deep Convolutional Neural Networks (paper) By Qing-Long Zhang and Yu-Bin Yang [State Key Laboratory for Novel Software T

Qing-Long Zhang 199 Jan 08, 2023
From a body shape, infer the anatomic skeleton.

OSSO: Obtaining Skeletal Shape from Outside (CVPR 2022) This repository contains the official implementation of the skeleton inference from: OSSO: Obt

Marilyn Keller 166 Dec 28, 2022
SOTA easy to use PyTorch-based DL training library

Easily train or fine-tune SOTA computer vision models from one training repository. SuperGradients Introduction Welcome to SuperGradients, a free open

619 Jan 03, 2023