simdjson : Parsing gigabytes of JSON per second

Overview



JSON is everywhere on the Internet. Servers spend a *lot* of time parsing it. We need a fresh approach. The simdjson library uses commonly available SIMD instructions and microparallel algorithms to parse JSON 4x faster than RapidJSON and 25x faster than JSON for Modern C++.
  • Fast: Over 4x faster than commonly used production-grade JSON parsers.
  • Record Breaking Features: Minify JSON at 6 GB/s, validate UTF-8 at 13 GB/s, NDJSON at 3.5 GB/s.
  • Easy: First-class, easy to use and carefully documented APIs.
  • Strict: Full JSON and UTF-8 validation, lossless parsing. Performance with no compromises.
  • Automatic: Selects a CPU-tailored parser at runtime. No configuration needed.
  • Reliable: From memory allocation to error handling, simdjson's design avoids surprises.
  • Peer Reviewed: Our research appears in venues like VLDB Journal, Software: Practice and Experience.

This library is part of the Awesome Modern C++ list.


Quick Start

The simdjson library is easily consumable with a single .h and .cpp file.

  1. Prerequisites: g++ (version 7 or better) or clang++ (version 6 or better), and a 64-bit system with a command-line shell (e.g., Linux, macOS, FreeBSD). We also support programming environments like Visual Studio and Xcode, but different steps are needed.

  2. Pull simdjson.h and simdjson.cpp into a directory, along with the sample file twitter.json.

    wget https://raw.githubusercontent.com/simdjson/simdjson/master/singleheader/simdjson.h https://raw.githubusercontent.com/simdjson/simdjson/master/singleheader/simdjson.cpp https://raw.githubusercontent.com/simdjson/simdjson/master/jsonexamples/twitter.json
    
  3. Create quickstart.cpp:

#include <iostream>
#include "simdjson.h"
using namespace simdjson;
int main(void) {
    ondemand::parser parser;
    padded_string json = padded_string::load("twitter.json");
    ondemand::document tweets = parser.iterate(json);
    std::cout << uint64_t(tweets["search_metadata"]["count"]) << " results." << std::endl;
}
  4. c++ -o quickstart quickstart.cpp simdjson.cpp
  5. ./quickstart
    100 results.
    

Documentation

Usage documentation is available:

  • Basics is an overview of how to use simdjson and its APIs.
  • Performance shows some more advanced scenarios and how to tune for them.
  • Implementation Selection describes runtime CPU detection and how you can work with it.
  • API contains the automatically generated API documentation.

Performance results

The simdjson library uses roughly a quarter as many instructions as the state-of-the-art parser RapidJSON. To our knowledge, simdjson is the first fully-validating JSON parser to run at gigabytes per second (GB/s) on commodity processors. It can parse millions of JSON documents per second on a single core.

The following figure represents parsing speed in GB/s for various files on an Intel Skylake processor (3.4 GHz) using the GNU GCC 10 compiler (with the -O3 flag). We compare against the best and fastest C++ libraries on benchmarks that load and process the data. The simdjson library offers full Unicode (UTF-8) validation and exact number parsing.

The simdjson library offers high speed whether it processes tiny files (e.g., 300 bytes) or larger files (e.g., 3 MB). The following plot presents parsing speed for synthetic files of various sizes, generated with a script, on a 3.4 GHz Skylake processor (GNU GCC 9, -O3).

All our experiments are reproducible.

For NDJSON files, we can exceed 3 GB/s with our multithreaded parsing functions.

Real-world usage

If you are planning to use simdjson in a product, please work from one of our releases.

Bindings and Ports of simdjson

We distinguish between "bindings" (which just wrap the C++ code) and a port to another programming language (which reimplements everything).

About simdjson

The simdjson library takes advantage of modern microarchitectures, parallelizing with SIMD vector instructions, reducing branch misprediction, and reducing data dependency to take advantage of each CPU's multiple execution cores.

A description of the design and implementation of simdjson is in our research article:

We have an in-depth paper focused on the UTF-8 validation:

We also have an informal blog post providing some background and context.

For the video inclined,
simdjson at QCon San Francisco 2019
(it was voted the best talk; we're kinda proud of it).

Funding

The work is supported by the Natural Sciences and Engineering Research Council of Canada under grant number RGPIN-2017-03910.

Contributing to simdjson

Head over to CONTRIBUTING.md for information on contributing to simdjson, and HACKING.md for information on source, building, and architecture/design.

License

This code is made available under the Apache License 2.0.

Under Windows, we build some tools using the windows/dirent_portable.h file (which is outside our library code): it is under the liberal (business-friendly) MIT license.

For compilers that do not support C++17, we bundle the string-view library which is published under the Boost license (http://www.boost.org/LICENSE_1_0.txt). Like the Apache license, the Boost license is a permissive license allowing commercial redistribution.

For efficient number serialization, we bundle Florian Loitsch's implementation of the Grisu2 algorithm for binary to decimal floating-point numbers. The implementation was slightly modified by JSON for Modern C++ library. Both Florian Loitsch's implementation and JSON for Modern C++ are provided under the MIT license.

For runtime dispatching, we use some code from the PyTorch project licensed under 3-clause BSD.

Comments
  • On-Demand Parsing

    On-Demand Parsing

    This introduces a DOM-like API that parses JSON with forward-only streaming, combining the ease of traditional DOM parsers with the performance of SAX. One major virtue of this approach is that we know what type the user wants a value to be before we parse it, so we eliminate the typical "type check" employed by DOM and SAX, instead parsing with a parser dedicated to that type.

    It is far faster than using the DOM (4.0 GB/s vs. 2.3 GB/s). It is also much easier than the SAX approach (15 lines for a Tweet reader versus 300 in SAX), as well as being slightly faster (results on the Clang 10 compiler, on a Skylake machine):

    | Benchmark   | Generic DOM | SAX       | On-Demand |
    |-------------|-------------|-----------|-----------|
    | Tweets      | 2.3 GB/s    | 3.5 GB/s  | 4.0 GB/s  |
    | LargeRandom | 0.50 GB/s   | 0.71 GB/s | 0.71 GB/s |

    Examples

    The benchmarks have some real, fairly simple examples with equivalent DOM and SAX implementations.

    points.json

    This parses a giant array of points: [ { "x": 1.0, "y": 1.2, "z": 1.3 }, ... ]

    struct my_point { double x; double y; double z; };

    ondemand::parser parser;
    std::vector<my_point> container;
    for (ondemand::object p : parser.parse(json)) {
      container.emplace_back(my_point{p["x"], p["y"], p["z"]});
    }
    

    twitter.json

    This parses a list of Tweets (from the Twitter API) into C++ structs, extracting the text, screen names, ids, and favorite/retweet counts.

    // Walk the document, parsing the tweets as we go
    std::vector<twitter::tweet> tweets;
    ondemand::parser parser;
    auto doc = parser.parse(json);
    for (ondemand::object tweet : doc["statuses"]) {
      tweets.emplace_back(twitter::tweet{
        tweet["created_at"],
        tweet["id"],
        tweet["text"],
        nullable_int(tweet["in_reply_to_status_id"]),
        read_user(tweet["user"]),
        tweet["retweet_count"],
        tweet["favorite_count"]
      });
    }
    

    With these auxiliary functions:

    simdjson_really_inline twitter::twitter_user read_user(ondemand::object && u) {
      return { u["id"], u["screen_name"] };
    }
    simdjson_really_inline uint64_t nullable_int(ondemand::value && value) {
      if (value.is_null()) { return 0; }
      return std::move(value);
    }
    

    Principles

    • Inline Iteration: Iterating arrays or objects is done through exactly the kind of for loop you'd expect. You can nest them, iterating an array of arrays or an array of objects with array values through nested for loops. Under the hood, the iterator checks for the "[", passes you the index of the value, and when you finish with a value, it checks for "," and passes the next value until it sees "]".
    • Forward-Only Iteration: To prevent reiteration of the same values and to keep the number of variables down (literally), only a single index is maintained and everything uses it (even if you have nested for loops). This means when you're going through an array of arrays, for example, that the inner array loop will advance the index to the next comma, and the array can just pick it up and look at it.
    • Inline, On-Demand Parsing: Parses exactly the type you want and nothing else. Because it's inline this means way fewer branches per value, and they're more predictable as well. For example, if you ask for an unsigned integer, we just start parsing digits. If there were no digits, we toss an error. With a generic parser you have to do a big switch statement checking whether it's a digit before you even start parsing, so it's both an extra branch, and a hard to predict one (because you are also checking other values).
    • Streaming Output: This is streaming in the sense of output, but not input. You still have to pass the whole file as input; it just doesn't parse anything besides the marks until you ask. This also means the parser memory has to grow as the file grows (because of structural indexes). Streaming input is a whole other problem, however.
    • Validate What You Use: It deliberately validates the values you use and the structure leading to it, but nothing else. The goal is a guarantee that the value you asked for is the correct one and is not malformed so that there is no confusion over whether you got the right value. But it leaves the possibility that the JSON as a whole is invalid. A full-validation mode is possible and planned, but I think this mode should be the default, personally, or at least pretty heavily advertised. Full-validation mode should really only be for debug.
    • Avoids Genericity Pitfalls I think it avoids the pitfalls of generating a generic DOM, which are that you don't know what to expect in the file so you can't tune the parser to expect it (and thus branch mispredictions abound). Even SAX falls into this, though definitely less than others: the core of SAX still has to have a giant switch statement in a loop, and that's just going to be inherently branchy.

    Impact on DOM parse (skylake haswell gcc10.2)

    As expected / hoped, this is entirely neutral with respect to our existing performance, all the way down to identical instruction count:

    | File | Blocks | master Cycles | PR Cycles | +Throughput | master Instr. | PR Instr. | -Instr. |
    | --- | --: | --: | --: | --: | --: | --: | --: |
    | gsoc-2018 | 51997 | 72.7 | 71.6 | 1% | 160.1 | 160.1 | 0% |
    | instruments | 3442 | 108.2 | 107.3 | 0% | 370.6 | 370.6 | 0% |
    | github_events | 1017 | 78.6 | 78.4 | 0% | 256.7 | 256.7 | 0% |
    | numbers | 2345 | 284.1 | 283.4 | 0% | 791.0 | 791.0 | 0% |
    | apache_builds | 1988 | 84.9 | 84.7 | 0% | 295.1 | 295.1 | 0% |
    | mesh | 11306 | 319.3 | 318.5 | 0% | 984.2 | 984.2 | 0% |
    | twitterescaped | 8787 | 188.1 | 187.8 | 0% | 493.0 | 493.0 | 0% |
    | marine_ik | 46616 | 318.4 | 318.1 | 0% | 895.7 | 895.7 | 0% |
    | update-center | 8330 | 113.2 | 113.2 | 0% | 326.5 | 326.5 | 0% |
    | mesh.pretty | 24646 | 189.0 | 188.9 | 0% | 571.3 | 571.3 | 0% |
    | twitter | 9867 | 92.0 | 92.0 | 0% | 281.6 | 281.6 | 0% |
    | citm_catalog | 26987 | 81.7 | 81.7 | 0% | 287.5 | 287.5 | 0% |
    | canada | 35172 | 311.2 | 311.4 | 0% | 946.7 | 946.7 | 0% |
    | semanticscholar-corpus | 134271 | 108.8 | 109.0 | 0% | 274.4 | 274.4 | 0% |
    | random | 7976 | 141.2 | 142.2 | 0% | 482.1 | 482.1 | 0% |

    Design

    The primary classes are:

    • ondemand::parser: The equivalent of dom::parser.
      • This handles allocation and parse calls, and keeps memory around between parses.
    • ondemand::document: Holds iteration state. Can be cast to array, object or scalar.
      • Forward-Only: This is a forward-only input iterator. You may only get the document's value once. Once you have retrieved an array, object, or scalar, subsequent attempts to get other values fail.
      • Iteration Owner: If you let go of the document object, iteration will fail. This is not checked, but the failure will be really obvious :) Moves are disallowed after iteration has started, because array/object/value all point to the document.
      • Locks the Parser: Only one iteration is allowed at a time. If you attempt to parse a new document before destroying the old one, you will get an error. document cannot be copied.
    • ondemand::array: Manages array iteration.
      • Forward-Only: Retrieving the same array element twice will fail. There is currently no check on whether it has handed out a value, it's just something you shouldn't do multiple times without ++. This is consistent with C++'s "input iterator" concept.
      • Child Blindness: Once you get an array element, it has no control over whether you do anything with it. For example, you could decide not to handle a value if it's an array or object. To control for this, when you ++ the array checks whether there is an unfinished array or object by checking if we're at the current depth. If so, it skips tokens until it's returned to the current depth.
      • Chainable: We allow you to pass an error into the iterator, which it will yield on its first iteration and then stop. This allows error chaining to make its way all the way to the loop: for (auto o : parser.parse(json)) works!
      • C++ Iterator: Because C++ breaks what could be a single next() call into !=, ++, and * calls, we have to break up the algorithm into parts and keep some state between them.
        • operator * Reads key, : and a value, advancing exactly three times. Returns an error if key or : is invalid. value takes up the slack from there, incrementing depth if there is a [ or { even if the user doesn't use the value. If there is an error to report, it decrements depth so that the loop will terminate, and returns it.
        • operator ++ Checks if we have a ] (decrementing depth) or , (setting error if no comma).
        • operator != lets the loop continue if current depth >= array depth.
      • Zero Overhead: It keeps state, but that state is all either constant and knowable at compile time, or only has an effect for a single iteration (the first or last). Our goal is to get the compiler to elide them all.
        • document: This member variable is in many objects, but always has the same value. It is likely the compiler will elide it.
        • at_start: This member variable starts out true. While it is true, we check for ] before the first != (to handle an empty array) and then set it to false.
        • error: Whether this member variable is passed in initially or detected by ++, error has no effect unless it is nonzero, and when it is zero, the loop always terminates after the next iteration. We hope this will be elided, therefore, into a trailing control flow.
        • depth: This member variable is constant and knowable at compile time, because depth will have been incremented a constant number of times based on how many nested objects you have. Whether the compiler recognizes this is anybody's game, however :/
    • ondemand::object: Manages object iteration and object["key"].
      • Forward-Only: [] will never go backwards; you must do [] in order to get all the fields you want. Retrieving the same field twice with * will fail. There is currently no check on whether it has handed out a value, it's just something you shouldn't do multiple times without ++. This is consistent with C++'s "input iterator" concept.
      • Child Blindness: Once you get a field or value, the object has no control over whether you do anything with it. For example, you could decide not to handle a value if it's an array or object. To control for this, when you ++ or do a second [], the array checks whether there is an unfinished array or object by checking if we're at the current depth. If so, it skips tokens until it's returned to the current depth.
      • Chainable: We allow you to pass an error into the iterator, which it will yield on its first iteration and then stop. This allows error chaining to make its way all the way to the loop: for (auto o : parser.parse(json)) works!
      • C++ Iterator: Because C++ breaks what could be a single next() call into !=, ++, and * calls, we have to break up the algorithm into parts and keep some state between them.
        • operator * Reads key, : and a value, advancing exactly three times. Returns an error if key or : is invalid. value takes up the slack from there, incrementing depth if there is a [ or { even if the user doesn't use the value. If there is an error to report, it decrements depth so that the loop will terminate, and returns it.
        • operator ++ Checks if we have a } (decrementing depth) or , (setting error if no comma).
        • operator != lets the loop continue if current depth >= object depth.
      • Zero Overhead: It keeps state, but that state is all either constant and knowable at compile time, or only has an effect for a single iteration (the first or last). Our goal is to get the compiler to elide them all.
        • document: This member variable is in many objects, but always has the same value. It is likely the compiler will elide it.
        • at_start: This member variable starts out true. While it is true, we check for } before the first != and then set it to false thereafter. We expect this to be elided in favor of leading control flow.
        • error: Whether this member variable is passed in initially or detected by ++, error has no effect unless it is nonzero, and when it is zero, the loop always terminates after the next iteration. We hope this will be elided, therefore, into a trailing control flow.
        • depth: This member variable is constant and knowable at compile time, because depth will have been incremented a constant number of times based on how many nested objects you have. Whether the compiler recognizes this is anybody's game, however :/
    • ondemand::value: A transient object giving you the opportunity to convert a JSON value into an array, object, or scalar.
      • Forward-Only: This is transient: its value can only be retrieved once. Further retrievals will fail. It is an error to keep multiple value objects around at once (it is also hard to do, but possible).
      • Skippable: If you don't use the value (for example, if you have a field and don't care about the key), the destructor will check if it's { or [ and increment depth, to keep things consistent.
    • ondemand::raw_json_string: Represents the raw json string inside the buffer, terminated by ". This allows you to inspect and do comparisons against the raw, escaped json string without performance penalty.
      • The unescape() method on it will parse it (escapes and all) into a string buffer of your choice.
    • ondemand::token_iterator: Internal. Used to actually track structural iteration. document is the only way users will see this.

    Concerns / Rough Edges

    • Compiler Flags: You have to compile all of simdjson.cpp/simdjson.h with the target flags. Otherwise, simdjson_result<> and other things in the include/ headers can't be inlined with your Haswell-specific code.
    • Heavy Reliance On Optimizers: The object and array structs have four member variables each, which I'm expecting the compiler to elide completely in normal cases. I think these are largely unavoidable given C++'s iterator design. Without that elision, register pressure will be intense and stuff will get shoved into memory. I made some assumptions about how optimizers should work, particularly that it can deduce the depth value since it's constant, and that it can elide variables like at_start into control flow since it only affects the header of the loop.
    • Unconsumed Values: Because the user drives the parse, we hand them array, object and value objects which they can then iterate or call get_string/etc. on. However, if they don't do this, or only partially iterate, then the structural_index and depth won't get updated. I added a destructor to value to check whether the value has been used, and check for start/end array if not. We also keep track of depth so that if child iterations are left partially iterated, we can skip everything until we get back to our own depth. All of this can be optimized away if the compiler is smart enough ... but I'm not convinced it will be :) We'll see.

    Performance

    clang does a lot better on the (relatively complex) twitter benchmark with ondemand than g++. I assume this has to do with its affinity for SSA optimizations:

    Haswell clang10.0 (Skylake)

    | Benchmark     | Generic DOM | On-Demand | SAX       |
    |---------------|-------------|-----------|-----------|
    | PartialTweets | 2.3 GB/s    | 4.0 GB/s  | 3.5 GB/s  |
    | LargeRandom   | 0.50 GB/s   | 0.71 GB/s | 0.71 GB/s |

    Haswell gcc10 (Skylake)

    GCC is more or less on par with clang:

    | Benchmark     | DOM       | On-Demand | SAX       |
    |---------------|-----------|-----------|-----------|
    | PartialTweets | 2.5 GB/s  | 3.8 GB/s  | 3.7 GB/s  |
    | LargeRandom   | 0.50 GB/s | 0.78 GB/s | 0.74 GB/s |

    Running the Benchmark

    You can see several examples in benchmark/bench_ondemand.cpp. To compile for your native platform, do this:

    rm -rf build
    mkdir build
    cd build
    cmake -DCMAKE_CXX_FLAGS="-march=native" ..
    make bench_ondemand
    benchmark/bench_ondemand --benchmark_counters_tabular=true
    

    Raw Data: Haswell clang10.0 (Skylake)

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Benchmark                         Time             CPU   Iterations best_branch_miss best_bytes_per_sec best_cache_miss best_cache_ref best_cycles best_cycles_per_byte best_docs_per_sec best_frequency best_instructions best_instructions_per_byte best_instructions_per_cycle best_items_per_sec branch_miss      bytes bytes_per_second cache_miss  cache_ref     cycles cycles_per_byte docs_per_sec  frequency instructions instructions_per_byte instructions_per_cycle      items items_per_second
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    PartialTweets<OnDemand>      165179 ns       165149 ns         4237           1.538k             4.056G               0        58.152k    574.945k             0.910422          6.42265k       3.69267G          1.93943M                    3.07108                     3.37325           642.265k    1.71353k   631.515k       3.56129G/s   8.96861m   58.1731k   580.103k         0.91859   6.05512k/s  3.5126G/s     1.93943M               3.07108                3.34325        100       605.512k/s [best: throughput=  4.06 GB/s doc_throughput=  6422 docs/s instructions=     1939432 cycles=      574945 branch_miss=    1538 cache_miss=       0 cache_ref=     58152 items=       100 avg_time=    157210 ns]
    PartialTweets<Iter>          182774 ns       182773 ns         3828           2.825k           3.66457G               0         58.28k    636.406k              1.00774          5.80282k       3.69295G             1.84M                    2.91363                     2.89124           580.282k    3.08751k   631.515k       3.21789G/s    3.9185m    58.158k   644.682k         1.02085   5.47127k/s 3.52723G/s        1.84M               2.91363                2.85412        100       547.127k/s [best: throughput=  3.66 GB/s doc_throughput=  5802 docs/s instructions=     1840002 cycles=      636406 branch_miss=    2825 cache_miss=       0 cache_ref=     58280 items=       100 avg_time=    174683 ns]
    PartialTweets<Dom>           282433 ns       282382 ns         2480           3.149k           2.31979G               0        92.404k    1005.16k              1.59167          3.67338k       3.69235G           2.9721M                     4.7063                     2.95682           367.338k     3.2693k   631.515k        2.0828G/s   0.439516   92.8304k   1011.84k         1.60225    3.5413k/s 3.58324G/s      2.9721M                4.7063                2.93731        100        354.13k/s [best: throughput=  2.32 GB/s doc_throughput=  3673 docs/s instructions=     2972097 cycles=     1005165 branch_miss=    3149 cache_miss=       0 cache_ref=     92404 items=       100 avg_time=    274248 ns]
    Creating a source file spanning 44921 KB 
    LargeRandom<Dom>           91468995 ns     91467607 ns            8         968.537k           503.312M        10.8974M       15.4273M    337.109M              7.32865           10.9419        3.6886G          1041.23M                    22.6361                     3.08872           10.9419M    967.752k   45.9988M         479.6M/s   10.9539M   15.4278M   337.335M         7.33356    10.9328/s 3.68803G/s     1041.23M               22.6361                3.08665      1000k       10.9328M/s [best: throughput=  0.50 GB/s doc_throughput=    10 docs/s instructions=  1041233885 cycles=   337109080 branch_miss=  968537 cache_miss=10897426 cache_ref=  15427295 items=   1000000 avg_time=  91455371 ns]
    LargeRandomSum<Dom>        89605540 ns     89588793 ns            8         968.435k           514.102M        10.3415M       14.5719M     330.07M              7.17562           11.1764         3.689G          1022.23M                    22.2231                     3.09702           11.1764M    968.274k   45.9988M       489.658M/s   10.3269M   14.5726M   330.174M         7.17789    11.1621/s 3.68544G/s     1022.23M               22.2231                3.09604      1000k       11.1621M/s [best: throughput=  0.51 GB/s doc_throughput=    11 docs/s instructions=  1022233883 cycles=   330069936 branch_miss=  968435 cache_miss=10341543 cache_ref=  14571861 items=   1000000 avg_time=  89591929 ns]
    LargeRandom<OnDemand>      64622779 ns     64622223 ns           11         929.755k           712.493M        5.63348M       8.01073M     238.12M              5.17666           15.4894       3.68834G           648.69M                    14.1023                     2.72422           15.4894M    930.709k   45.9988M       678.835M/s   5.66211M   8.01242M   238.304M         5.18066    15.4746/s 3.68765G/s      648.69M               14.1023                2.72212      1000k       15.4746M/s [best: throughput=  0.71 GB/s doc_throughput=    15 docs/s instructions=   648690465 cycles=   238120090 branch_miss=  929755 cache_miss= 5633485 cache_ref=   8010731 items=   1000000 avg_time=  64609673 ns]
    LargeRandomSum<OnDemand>   64569064 ns     64568969 ns           11         951.323k           714.434M        4.98723M       7.12271M    237.524M              5.16371           15.5316       3.68913G           641.69M                    13.9502                     2.70158           15.5316M    958.366k   45.9988M       679.395M/s    5.0381M   7.12423M    238.12M         5.17667    15.4873/s 3.68785G/s      641.69M               13.9502                2.69481      1000k       15.4873M/s [best: throughput=  0.71 GB/s doc_throughput=    15 docs/s instructions=   641690192 cycles=   237524226 branch_miss=  951323 cache_miss= 4987231 cache_ref=   7122709 items=   1000000 avg_time=  64556393 ns]
    LargeRandom<Iter>          60862746 ns     60863442 ns           11         990.089k           757.035M        5.62286M       7.98759M    224.156M              4.87309           16.4577        3.6891G          581.692M                    12.6458                     2.59503           16.4577M    995.475k   45.9988M       720.759M/s   5.65907M   7.98859M   224.456M          4.8796    16.4302/s 3.68786G/s     581.692M               12.6458                2.59157      1000k       16.4302M/s [best: throughput=  0.76 GB/s doc_throughput=    16 docs/s instructions=   581691751 cycles=   224156097 branch_miss=  990089 cache_miss= 5622863 cache_ref=   7987593 items=   1000000 avg_time=  60850056 ns]
    LargeRandomSum<Iter>       59778441 ns     59777987 ns           12          981.57k           770.555M        5.01271M       7.15428M    220.194M              4.78696           16.7516       3.68861G          570.691M                    12.4067                     2.59177           16.7516M    986.227k   45.9988M       733.846M/s   5.05344M   7.15511M   220.459M         4.79271    16.7286/s 3.68796G/s     570.691M               12.4067                2.58865      1000k       16.7286M/s [best: throughput=  0.77 GB/s doc_throughput=    16 docs/s instructions=   570691393 cycles=   220194110 branch_miss=  981570 cache_miss= 5012710 cache_ref=   7154280 items=   1000000 avg_time=  59765975 ns]
    Creating a source file spanning 134087 KB 
    Kostya<Dom>                94265351 ns     94266428 ns            7          1045.8k           1.45789G        15.7114M       22.3239M    347.272M               2.5292           10.6179        3.6873G          975.883M                    7.10741                     2.81014           5.56684M    1045.46k   137.305M       1.35653G/s   15.7288M   22.2886M   347.637M         2.53186    10.6082/s 3.68782G/s     975.883M               7.10741                2.80719   524.288k       5.56177M/s [best: throughput=  1.46 GB/s doc_throughput=    10 docs/s instructions=   975882556 cycles=   347271978 branch_miss= 1045795 cache_miss=15711430 cache_ref=  22323923 items=    524288 avg_time=  94251475 ns]
    KostyaSum<Dom>             93482419 ns     93481371 ns            7         1048.42k           1.47166G        15.4012M       21.9002M    344.184M              2.50671           10.7182       3.68902G          970.115M                     7.0654                      2.8186            5.6194M    1049.48k   137.305M       1.36792G/s   15.4135M   21.7919M   344.773M           2.511    10.6973/s 3.68815G/s     970.115M                7.0654                2.81378   524.288k       5.60848M/s [best: throughput=  1.47 GB/s doc_throughput=    10 docs/s instructions=   970115386 cycles=   344183956 branch_miss= 1048419 cache_miss=15401206 cache_ref=  21900243 items=    524288 avg_time=  93468372 ns]
    Kostya<OnDemand>           59869079 ns     59857167 ns           12         468.968k            2.2974G        9.97175M       13.9606M    220.483M              1.60579           16.7321       3.68914G          635.858M                    4.63099                     2.88393           8.77243M    472.341k   137.305M       2.13634G/s   9.99057M   13.9227M   220.746M         1.60771    16.7064/s 3.68788G/s     635.858M               4.63099                2.88049   524.288k       8.75898M/s [best: throughput=  2.30 GB/s doc_throughput=    16 docs/s instructions=   635857684 cycles=   220483050 branch_miss=  468968 cache_miss= 9971751 cache_ref=  13960563 items=    524288 avg_time=  59856199 ns]
    KostyaSum<OnDemand>        60280325 ns     60279422 ns           12         469.529k           2.27999G        9.67866M       13.4947M    222.157M              1.61799           16.6053       3.68898G          630.615M                     4.5928                     2.83859           8.70594M    469.704k   137.305M       2.12137G/s   9.67636M   13.4264M   222.299M         1.61902    16.5894/s 3.68781G/s     630.615M                4.5928                2.83678   524.288k       8.69763M/s [best: throughput=  2.28 GB/s doc_throughput=    16 docs/s instructions=   630614946 cycles=   222157441 branch_miss=  469529 cache_miss= 9678657 cache_ref=  13494711 items=    524288 avg_time=  60267368 ns]
    Kostya<Iter>               61758017 ns     61757605 ns           11         497.377k           2.22614G        9.95741M       13.9293M    227.537M              1.65716           16.2131       3.68908G          606.497M                    4.41715                     2.66549           8.50035M    497.937k   137.305M        2.0706G/s   9.99912M   13.9207M   227.752M         1.65873    16.1923/s 3.68785G/s     606.497M               4.41715                2.66297   524.288k       8.48945M/s [best: throughput=  2.23 GB/s doc_throughput=    16 docs/s instructions=   606497405 cycles=   227536701 branch_miss=  497377 cache_miss= 9957411 cache_ref=  13929258 items=    524288 avg_time=  61745137 ns]
    KostyaSum<Iter>            59370790 ns     59359390 ns           12         464.597k           2.31499G        9.64345M       13.4522M    218.801M              1.59354           16.8602       3.68902G          597.061M                    4.34843                     2.72878           8.83958M    464.774k   137.305M       2.15425G/s   9.67782M   13.4532M    218.89M         1.59419    16.8465/s 3.68754G/s     597.061M               4.34843                2.72767   524.288k       8.83244M/s [best: throughput=  2.31 GB/s doc_throughput=    16 docs/s instructions=   597060518 cycles=   218801084 branch_miss=  464597 cache_miss= 9643448 cache_ref=  13452173 items=    524288 avg_time=  59357614 ns]
    
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Benchmark                    Time             CPU   Iterations best_branch_miss best_bytes_per_sec best_cache_miss best_cache_ref best_cycles best_cycles_per_byte best_docs_per_sec best_frequency best_instructions best_instructions_per_byte best_instructions_per_cycle best_items_per_sec branch_miss      bytes bytes_per_second cache_miss  cache_ref     cycles cycles_per_byte docs_per_sec  frequency instructions instructions_per_byte instructions_per_cycle      items items_per_second
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    PartialTweets<Sax>      191072 ns       191072 ns         3670           1.323k           3.47984G               0        59.009k    670.174k              1.06122          5.51031k       3.69287G          2.17912M                    3.45062                     3.25157           551.031k    1.47164k   631.515k       3.07814G/s   0.013624   58.9976k   674.469k         1.06802   5.23364k/s 3.52993G/s     2.17912M               3.45062                3.23086        100       523.364k/s [best: throughput=  3.48 GB/s doc_throughput=  5510 docs/s instructions=     2179117 cycles=      670174 branch_miss=    1323 cache_miss=       0 cache_ref=     59009 items=       100 avg_time=    182754 ns]
    Creating a source file spanning 44921 KB 
    LargeRandom<Dom>      91475275 ns     91476347 ns            8         933.026k           503.328M        11.1398M       15.6303M    337.144M              7.32942           10.9422        3.6891G          1040.23M                    22.6144                     3.08543           10.9422M     932.62k   45.9988M       479.554M/s   11.1208M   15.6317M   337.404M         7.33507    10.9318/s 3.68843G/s     1040.23M               22.6144                3.08305      1000k       10.9318M/s [best: throughput=  0.50 GB/s doc_throughput=    10 docs/s instructions=  1040233883 cycles=   337144298 branch_miss=  933026 cache_miss=11139804 cache_ref=  15630295 items=   1000000 avg_time=  91461980 ns]
    LargeRandomSum<Dom>   90329253 ns     90329264 ns            8         932.216k            509.83M        10.4788M       14.7649M    332.849M              7.23603           11.0836       3.68915G          1022.23M                    22.2231                     3.07117           11.0836M    932.897k   45.9988M       485.644M/s    10.522M   14.7658M   333.176M         7.24315    11.0706/s 3.68846G/s     1022.23M               22.2231                3.06815      1000k       11.0706M/s [best: throughput=  0.51 GB/s doc_throughput=    11 docs/s instructions=  1022233881 cycles=   332848515 branch_miss=  932216 cache_miss=10478836 cache_ref=  14764904 items=   1000000 avg_time=  90315728 ns]
    LargeRandom<Sax>      67111081 ns     67111854 ns           10         973.014k           686.397M         5.6914M       8.09484M    247.225M               5.3746           14.9221        3.6891G          675.692M                    14.6893                     2.73311           14.9221M    975.305k   45.9988M       653.653M/s    5.7521M    8.0964M   247.525M         5.38113    14.9005/s 3.68825G/s     675.692M               14.6893                2.72979      1000k       14.9005M/s [best: throughput=  0.69 GB/s doc_throughput=    14 docs/s instructions=   675691776 cycles=   247224897 branch_miss=  973014 cache_miss= 5691399 cache_ref=   8094842 items=   1000000 avg_time=  67098397 ns]
    

    Raw Data: Haswell gcc 10.2 (Skylake)

    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Benchmark                         Time             CPU   Iterations best_branch_miss best_bytes_per_sec best_cache_miss best_cache_ref best_cycles best_cycles_per_byte best_docs_per_sec best_frequency best_instructions best_instructions_per_byte best_instructions_per_cycle best_items_per_sec branch_miss      bytes bytes_per_second cache_miss  cache_ref     cycles cycles_per_byte docs_per_sec  frequency instructions instructions_per_byte instructions_per_cycle      items items_per_second
    -----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    PartialTweets<OnDemand>      177528 ns       177530 ns         3941           1.657k           3.74809G               0        54.903k    622.194k              0.98524          5.93507k       3.69277G          2.11159M                    3.34369                     3.39378           593.507k    1.80355k   631.515k       3.31293G/s  0.0284192   54.9838k   626.235k        0.991639   5.63285k/s 3.52748G/s     2.11159M               3.34369                3.37188        100       563.285k/s [best: throughput=  3.75 GB/s doc_throughput=  5935 docs/s instructions=     2111588 cycles=      622194 branch_miss=    1657 cache_miss=       0 cache_ref=     54903 items=       100 avg_time=    169681 ns]
    PartialTweets<Iter>          384499 ns       384501 ns         1819           2.968k           1.68681G               0         55.31k    1.38216M              2.18865          2.67105k       3.69182G          4.40862M                    6.98102                     3.18965           267.105k    3.19399k   631.515k       1.52963G/s   0.239142   55.4276k   1.38969M         2.20056   2.60078k/s 3.61426G/s     4.40862M               6.98102                3.17239        100       260.078k/s [best: throughput=  1.69 GB/s doc_throughput=  2671 docs/s instructions=     4408621 cycles=     1382163 branch_miss=    2968 cache_miss=       0 cache_ref=     55310 items=       100 avg_time=    376595 ns]
    PartialTweets<Dom>           266111 ns       266112 ns         2630            3.53k           2.46696G               0        87.587k    945.164k              1.49666          3.90642k       3.69221G          2.91945M                    4.62293                     3.08883           390.642k    3.70404k   631.515k       2.21014G/s  0.0285171   87.3744k   952.087k         1.50762   3.75782k/s 3.57777G/s     2.91945M               4.62293                3.06637        100       375.782k/s [best: throughput=  2.47 GB/s doc_throughput=  3906 docs/s instructions=     2919449 cycles=      945164 branch_miss=    3530 cache_miss=       0 cache_ref=     87587 items=       100 avg_time=    257984 ns]
    Creating a source file spanning 44921 KB 
    LargeRandom<Dom>           91849988 ns     91849888 ns            8         889.651k           502.486M        10.9371M       15.2509M    337.691M              7.34129           10.9239        3.6889G          970.316M                    21.0944                     2.87339           10.9239M    889.185k   45.9988M       477.604M/s   11.0046M   15.2543M   338.763M         7.36461    10.8873/s 3.68822G/s     970.316M               21.0944                2.86429      1000k       10.8873M/s [best: throughput=  0.50 GB/s doc_throughput=    10 docs/s instructions=   970315574 cycles=   337690544 branch_miss=  889651 cache_miss=10937148 cache_ref=  15250891 items=   1000000 avg_time=  91836392 ns]
    LargeRandomSum<Dom>        92188755 ns     92188563 ns            8         889.635k           499.861M          10.36M       14.4073M    339.484M              7.38027           10.8668       3.68911G          974.316M                    21.1813                     2.86999           10.8668M    889.272k   45.9988M       475.849M/s   10.4213M   14.4099M   340.028M          7.3921    10.8473/s 3.68839G/s     974.316M               21.1813                 2.8654      1000k       10.8473M/s [best: throughput=  0.50 GB/s doc_throughput=    10 docs/s instructions=   974315578 cycles=   339483518 branch_miss=  889635 cache_miss=10360012 cache_ref=  14407265 items=   1000000 avg_time=  92175359 ns]
    LargeRandom<OnDemand>      58992605 ns     58991725 ns           12         869.377k           781.677M        5.62944M       7.89403M    217.093M              4.71954           16.9934       3.68916G          615.695M                     13.385                     2.83609           16.9934M    868.032k   45.9988M       743.627M/s   5.67331M   7.89681M    217.57M         4.72991    16.9515/s 3.68815G/s     615.695M                13.385                2.82987      1000k       16.9515M/s [best: throughput=  0.78 GB/s doc_throughput=    16 docs/s instructions=   615694894 cycles=   217093162 branch_miss=  869377 cache_miss= 5629445 cache_ref=   7894027 items=   1000000 avg_time=  58980167 ns]
    LargeRandomSum<OnDemand>   56594492 ns     56594225 ns           12         876.324k           813.963M        5.01997M       7.05425M    208.485M               4.5324           17.6953       3.68921G          606.695M                    13.1894                     2.91002           17.6953M    876.066k   45.9988M       775.129M/s     5.066M   7.05672M   208.735M         4.53784    17.6696/s 3.68828G/s     606.695M               13.1894                2.90653      1000k       17.6696M/s [best: throughput=  0.81 GB/s doc_throughput=    17 docs/s instructions=   606694893 cycles=   208485037 branch_miss=  876324 cache_miss= 5019967 cache_ref=   7054246 items=   1000000 avg_time=  56582402 ns]
    LargeRandom<Iter>          53364551 ns     53364201 ns           13         894.323k            863.44M        5.63805M       7.89683M    196.541M              4.27273           18.7709       3.68925G          570.695M                    12.4067                      2.9037           18.7709M    894.787k   45.9988M       822.046M/s   5.66445M   7.89822M    196.82M          4.2788    18.7392/s 3.68823G/s     570.695M               12.4067                2.89958      1000k       18.7392M/s [best: throughput=  0.86 GB/s doc_throughput=    18 docs/s instructions=   570694596 cycles=   196540518 branch_miss=  894323 cache_miss= 5638049 cache_ref=   7896828 items=   1000000 avg_time=  53352485 ns]
    LargeRandomSum<Iter>       54883627 ns     54883439 ns           13         871.251k           841.314M        5.02219M       7.05069M    201.706M              4.38502           18.2899       3.68918G          577.695M                    12.5589                     2.86405           18.2899M    871.778k   45.9988M       799.291M/s   5.06164M   7.05285M   202.423M         4.40061    18.2204/s 3.68823G/s     577.695M               12.5589                2.85391      1000k       18.2204M/s [best: throughput=  0.84 GB/s doc_throughput=    18 docs/s instructions=   577695426 cycles=   201705764 branch_miss=  871251 cache_miss= 5022193 cache_ref=   7050692 items=   1000000 avg_time=  54871279 ns]
    Creating a source file spanning 134087 KB 
    Kostya<Dom>                86984857 ns     86984354 ns            8         494.739k           1.58086G        15.8348M       22.1883M      320.4M              2.33349           11.5135       3.68893G          936.468M                    6.82035                     2.92281            6.0364M    494.617k   137.305M       1.47009G/s    15.849M   22.1757M   320.827M          2.3366    11.4963/s 3.68833G/s     936.468M               6.82035                2.91892   524.288k       6.02738M/s [best: throughput=  1.58 GB/s doc_throughput=    11 docs/s instructions=   936467833 cycles=   320400264 branch_miss=  494739 cache_miss=15834791 cache_ref=  22188266 items=    524288 avg_time=  86971545 ns]
    KostyaSum<Dom>             86970961 ns     86970134 ns            8         495.135k           1.58123G        15.5502M       21.6331M    320.352M              2.33314           11.5162       3.68924G          938.565M                    6.83562                     2.92979           6.03781M    494.743k   137.305M       1.47034G/s   15.6095M   21.6834M   320.783M         2.33628    11.4982/s 3.68842G/s     938.565M               6.83562                2.92586   524.288k       6.02837M/s [best: throughput=  1.58 GB/s doc_throughput=    11 docs/s instructions=   938564987 cycles=   320352061 branch_miss=  495135 cache_miss=15550182 cache_ref=  21633093 items=    524288 avg_time=  86957969 ns]
    Kostya<OnDemand>           60343057 ns     60343775 ns           12         456.213k           2.28197G        10.1305M       13.9803M    221.966M              1.61659           16.6197       3.68902G          647.782M                    4.71783                     2.91838           8.71353M    456.138k   137.305M       2.11911G/s   10.1682M    13.868M   222.551M         1.62085    16.5717/s 3.68805G/s     647.782M               4.71783                2.91072   524.288k       8.68835M/s [best: throughput=  2.28 GB/s doc_throughput=    16 docs/s instructions=   647782119 cycles=   221966325 branch_miss=  456213 cache_miss=10130481 cache_ref=  13980257 items=    524288 avg_time=  60330609 ns]
    KostyaSum<OnDemand>        58642231 ns     58641748 ns           12          453.15k            2.3471G        9.82263M       13.5381M    215.814M              1.57178            17.094       3.68913G          643.064M                    4.68347                     2.97972            8.9622M    453.464k   137.305M       2.18062G/s   9.84356M   13.5389M    216.28M         1.57518    17.0527/s 3.68816G/s     643.064M               4.68347                2.97329   524.288k       8.94052M/s [best: throughput=  2.35 GB/s doc_throughput=    17 docs/s instructions=   643063529 cycles=   215813521 branch_miss=  453150 cache_miss= 9822627 cache_ref=  13538124 items=    524288 avg_time=  58629573 ns]
    Kostya<Iter>               59348929 ns     59348585 ns           12          452.97k           2.32237G        10.0784M       13.9769M    218.108M              1.58849           16.9139       3.68906G          642.015M                    4.67583                     2.94357           8.86776M    453.045k   137.305M       2.15465G/s   10.1584M    13.924M    218.88M         1.59412    16.8496/s 3.68804G/s     642.015M               4.67583                2.93318   524.288k       8.83404M/s [best: throughput=  2.32 GB/s doc_throughput=    16 docs/s instructions=   642015174 cycles=   218107739 branch_miss=  452970 cache_miss=10078433 cache_ref=  13976859 items=    524288 avg_time=  59336271 ns]
    KostyaSum<Iter>           121340358 ns    121337200 ns            6         453.895k           1.13236G          9.935M       13.6518M    447.335M              3.25796           8.24704       3.68919G          1.31992G                    9.61305                     2.95063           4.32383M    454.037k   137.305M       1079.18M/s   9.95978M   13.6513M   447.577M         3.25973     8.2415/s 3.68871G/s     1.31992G               9.61305                2.94903   524.288k       4.32092M/s [best: throughput=  1.13 GB/s doc_throughput=     8 docs/s instructions=  1319919270 cycles=   447334605 branch_miss=  453895 cache_miss= 9934995 cache_ref=  13651799 items=    524288 avg_time= 121327101 ns]
    
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Benchmark                    Time             CPU   Iterations best_branch_miss best_bytes_per_sec best_cache_miss best_cache_ref best_cycles best_cycles_per_byte best_docs_per_sec best_frequency best_instructions best_instructions_per_byte best_instructions_per_cycle best_items_per_sec branch_miss      bytes bytes_per_second cache_miss  cache_ref     cycles cycles_per_byte docs_per_sec  frequency instructions instructions_per_byte instructions_per_cycle      items items_per_second
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
    PartialTweets<Sax>      181416 ns       181385 ns         3854           1.384k           3.67147G               0        58.443k    635.165k              1.00578          5.81375k       3.69269G          2.07459M                     3.2851                     3.26623           581.375k    1.53157k   631.515k       3.24252G/s   0.044629   58.4857k   640.112k         1.01361   5.51314k/s 3.52903G/s     2.07459M                3.2851                3.24098        100       551.314k/s [best: throughput=  3.67 GB/s doc_throughput=  5813 docs/s instructions=     2074593 cycles=      635165 branch_miss=    1384 cache_miss=       0 cache_ref=     58443 items=       100 avg_time=    173490 ns]
    Creating a source file spanning 44921 KB 
    LargeRandom<Dom>      88806316 ns     88806323 ns            8         871.991k           518.893M        10.8318M       15.3949M    326.983M              7.10851           11.2806       3.68855G          970.316M                    21.0944                     2.96748           11.2806M    872.063k   45.9988M       493.972M/s   10.8779M   15.3961M   327.465M         7.11899    11.2605/s 3.68741G/s     970.316M               21.0944                2.96311      1000k       11.2605M/s [best: throughput=  0.52 GB/s doc_throughput=    11 docs/s instructions=   970315579 cycles=   326982612 branch_miss=  871991 cache_miss=10831777 cache_ref=  15394882 items=   1000000 avg_time=  88792975 ns]
    LargeRandomSum<Dom>   89824596 ns     89807350 ns            8          872.01k           513.081M        10.3216M       14.5612M    330.716M              7.18967           11.1542       3.68889G          974.316M                    21.1813                     2.94608           11.1542M    871.863k   45.9988M       488.466M/s   10.3039M   14.5623M   331.208M         7.20036    11.1349/s 3.68798G/s     974.316M               21.1813                 2.9417      1000k       11.1349M/s [best: throughput=  0.51 GB/s doc_throughput=    11 docs/s instructions=   974315579 cycles=   330716103 branch_miss=  872010 cache_miss=10321626 cache_ref=  14561194 items=   1000000 avg_time=  89811290 ns]
    LargeRandom<Sax>      62784008 ns     62783943 ns           11         913.123k           738.521M        5.61415M       8.01956M    229.784M              4.99545           16.0552       3.68924G          672.695M                    14.6242                      2.9275           16.0552M    918.166k   45.9988M       698.711M/s   5.65478M   8.02137M   231.532M         5.03344    15.9276/s 3.68776G/s     672.695M               14.6242                2.90541      1000k       15.9276M/s [best: throughput=  0.74 GB/s doc_throughput=    16 docs/s instructions=   672694521 cycles=   229784372 branch_miss=  913123 cache_miss= 5614148 cache_ref=   8019564 items=   1000000 avg_time=  62771549 ns]
    

    Loose Ends

    Things that likely won't get finished in this check-in, but might be considered for a full release (and also might not :)):

    • Bugs
      • Possible corruption with incomplete arrays / objects ("[ 1, 2, ")
      • Win32 / VS2019 failure (#1208)
    • Rough Edges
      • x["a"]["b"] unsupported (right now x["a"] would be released early and the element never gets fully skipped)
      • parser.load()
      • parser.iterate(buf, len)
      • get_c_str()
      • Out-of-order key lookup support
    • Features
      • Print / minify
      • document_stream
      • Validation of skipped values
      • Strict object support (don't ever skip keys, error immediately if the next key is not what you expect)
      • Nullable value support: .get_nullable_int64()
      • .is_false_or_null() is probably useful ...
      • SIMDJSON_ONDEMAND_SAFETY_RAILS tests
      • Compile-time safety tests (make sure bad things don't compile)
    • Performance
      • Add ondemand to competitions
      • Make more competitions
      • Performance optimization for recursion
      • Tuple support [ x, y, z ]? It would force-unroll loops, basically, possibly improving performance.
      • Ability to stop iterating when finished (right now it will inspect and skip all remaining elements)
    • Sanity checks:
      • Sanity review of & and && versions of methods (I hate the error messages but I hate letting people compile programs that will fail at runtime even more). Can we make it so people can use things in more flexible ways (for example, ["a"]["b"] above)? Can we make error messages better?
      • Sanity review of document vs. value. I don't like that they behave almost identically but have different code. Even with the root value parsing difference notwithstanding, the fact that document owns the iterator and value has a reference makes the code fundamentally non-reusable. We should look into doing something about that.

    Next Steps

    • [X] Supported way to use it on multiple platforms ("single kernel mode"? Plug in to simdjson's architecture selection?)
    • [X] Parse numbers/booleans at the root correctly, without overrun
    • [X] Don't overrun when objects/arrays are unbalanced
    • [X] Thorough type tests similar to DOM API tests
    • [X] Error tests for common error cases (to make sure errors are actually raised and iteration stops)
    • [X] Last performance check to ensure we haven't dropped below 4.0GB/s in the final round of fixes
    • [X] Resolve compiler failures on other platforms
    enhancement performance research 
    opened by jkeiser 187
  • Bringing ndjson(document_stream) to On Demand

    Bringing ndjson(document_stream) to On Demand

    I have been trying to implement a simple document stream for On Demand. The main issue, as discussed by @lemire in this comment, is that On Demand does not know where the end of a single document lies in a document stream (if stage 1 covers multiple documents). To overcome this, I created a JSON iterator that is used to traverse a document and advance to the next document when needed. This JSON iterator is also used to create a document instance when operator* is called.

    • [x] implements threaded version so that stage 1 is processed independently
    • [x] add the rest of the tests so that we have coverage as good as the DOM document_stream,
    • [x] add documentation and examples,
    • [x] add benchmarking.
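The boundary problem described above can be illustrated with a small, self-contained sketch. This is not simdjson's implementation (which derives document boundaries from the structural indexes produced by stage 1), and `split_documents` is a hypothetical name: it is just a scalar scan that tracks nesting depth and string state to cut a concatenated JSON stream into individual documents.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split a stream of whitespace-separated JSON documents into individual
// document strings by tracking brace/bracket depth and in-string state.
// Scalar illustration only; handles streams of root-level objects and
// arrays (bare root scalars would need extra handling).
std::vector<std::string> split_documents(const std::string& stream) {
    std::vector<std::string> docs;
    size_t start = std::string::npos;  // first byte of the current document
    int depth = 0;
    bool in_string = false, escaped = false;
    for (size_t i = 0; i < stream.size(); i++) {
        char c = stream[i];
        if (in_string) {
            // Inside a string: only track escapes and the closing quote,
            // so that braces/brackets in string content are ignored.
            if (escaped)        escaped = false;
            else if (c == '\\') escaped = true;
            else if (c == '"')  in_string = false;
            continue;
        }
        if (c == ' ' || c == '\t' || c == '\n' || c == '\r') continue;
        if (start == std::string::npos) start = i;
        if (c == '"')                  in_string = true;
        else if (c == '{' || c == '[') depth++;
        else if (c == '}' || c == ']') depth--;
        if (depth == 0 && start != std::string::npos && !in_string) {
            // Depth returned to zero outside a string: document complete.
            docs.push_back(stream.substr(start, i - start + 1));
            start = std::string::npos;
        }
    }
    return docs;
}
```

A threaded design like the one in this PR can then hand each such span to its own stage 1 run while iteration proceeds on the previous document.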

    Fixes https://github.com/simdjson/simdjson/issues/1464

    opened by NicolasJiaxin 106
  • MSVC simdjson is slower than g++ on Windows

    MSVC simdjson is slower than g++ on Windows

    On the same machine and OS, WSL g++ 7.5-compiled simdjson parses at 2.6GB/s and MSVC 2019-compiled simdjson parses at 1.0GB/s. ClangCL parses at 1.4GB/s, so there might be a link.exe thing going on there. My machine is Kaby Lake R (AVX2 but not AVX512).

    After investigation, these seem to be the major contributors:

    • [ ] 40%: @TrianglesPCT may be fixing some or all of the most major regression, caused by generic SIMD, by removing lambdas.
    • [ ] 10%: We need to understand why this did not fully recover the performance we had before this. Either one of them could be the culprit, but it's probably not anything in between.
    • [ ] 10%: We need to understand why we lost another 10% to the stage 1 structural scanner refactor.

    Data

    g++ 7.5.0 under WSL

    [email protected]:~/simdjson/build$ benchmark/parse ../jsonexamples/twitter.json
    number of iterations 200 
                                                         
    ../jsonexamples/twitter.json
    
         9867 blocks -     631515 bytes - 55263 structurals (  8.8 %)
    special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
    special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)
    
    All Stages
    |    Speed        :  24.3210 ns per block ( 70.04%) -   0.3800 ns per byte -   4.3429 ns per structural -    2.631 GB/s
    |- Stage 1
    |    Speed        :  11.5728 ns per block ( 33.33%) -   0.1808 ns per byte -   2.0665 ns per structural -    5.530 GB/s
    |- Stage 2
    |    Speed        :  12.6267 ns per block ( 36.36%) -   0.1973 ns per byte -   2.2547 ns per structural -    5.068 GB/s
    
    3181.7 documents parsed per second
    

    VS 2019 (cl.exe 19.25.28614)

    PS C:\Users\john\Source\simdjson\build> .\benchmark\Release\parse.exe ..\jsonexamples\twitter.json
    number of iterations 200 
    
    ..\jsonexamples\twitter.json
    
         9867 blocks -     631515 bytes - 55263 structurals (  8.8 %)
    special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
    special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)
    
    All Stages
    |    Speed        :  65.5249 ns per block ( 83.29%) -   1.0239 ns per byte -  11.7004 ns per structural -    0.977 GB/s
    |- Allocation
    |    Speed        :   2.8679 ns per block (  3.65%) -   0.0448 ns per byte -   0.5121 ns per structural -   22.315 GB/s
    |- Stage 1
    |    Speed        :  32.2862 ns per block ( 41.04%) -   0.5045 ns per byte -   5.7652 ns per structural -    1.982 GB/s
    |- Stage 2
    |    Speed        :  29.4285 ns per block ( 37.41%) -   0.4598 ns per byte -   5.2549 ns per structural -    2.175 GB/s
    
    1976.0 documents parsed per second
    

    VS 2019 (cl.exe 19.25.28614) with /arch:AVX2

    Compiling with /arch:AVX2 only gave a 10% improvement:

    PS C:\Users\john\Source\simdjson\build> .\benchmark\Release\parse.exe ..\jsonexamples\twitter.json
    number of iterations 200
    
    ..\jsonexamples\twitter.json
    
         9867 blocks -     631515 bytes - 55263 structurals (  8.8 %)
    special blocks with: utf8      2284 ( 23.1 %) - escape       598 (  6.1 %) - 0 structurals      1287 ( 13.0 %) - 1+ structurals      8581 ( 87.0 %) - 8+ structurals      3272 ( 33.2 %) - 16+ structurals         0 (  0.0 %)
    special block flips: utf8      1104 ( 11.2 %) - escape       642 (  6.5 %) - 0 structurals       940 (  9.5 %) - 1+ structurals       940 (  9.5 %) - 8+ structurals      2593 ( 26.3 %) - 16+ structurals         0 (  0.0 %)
    
    All Stages
    |    Speed        :  60.7013 ns per block ( 82.70%) -   0.9485 ns per byte -  10.8391 ns per structural -    1.054 GB/s
    |- Allocation
    |    Speed        :   2.4726 ns per block (  3.37%) -   0.0386 ns per byte -   0.4415 ns per structural -   25.882 GB/s
    |- Stage 1
    |    Speed        :  27.1889 ns per block ( 37.04%) -   0.4249 ns per byte -   4.8550 ns per structural -    2.354 GB/s
    |- Stage 2
    |    Speed        :  29.8135 ns per block ( 40.62%) -   0.4659 ns per byte -   5.3236 ns per structural -    2.147 GB/s
    
    2246.1 documents parsed per second
    
    performance 
    opened by jkeiser 103
  • UTF-8 validation flag lookup algorithm

    UTF-8 validation flag lookup algorithm

    This lookup algorithm's primary feature is that it does most of the work with 3 lookup tables against the high nibbles of bytes 1 and 2, and the low nibble of byte 1.

    EDIT: @zwegner independently came up with a better variant that uses scalar masks to process continuation bytes (which probably makes better use of the execution cores by spreading the load across SIMD and scalar execution units). I have integrated it here. ARM still uses fastvalidate, because none of the algorithms could match it.

    UTF-8 Shootout

    While evaluating the algorithms, I ran a "UTF-8 shootout" to figure out what was going to be the fastest. What you see here represents the winner :)

    I put all of the algorithms in here in separate headers, which can be switched between by changing the #include. A brief "shootout" between the algorithms, running stage 1 against twitter.json with ./parse -tf -n 1000 jsonexamples/twitter.json (and ./parse -tfs -n 1000 jsonexamples/twitter.json for SSE) yields this on my Kaby Lake machine (run multiple times for each and pick the best number):

    twitter.json:

    |              | AVX2     | SSE4.2   |
    |--------------|----------|----------|
    | @zwegner     | 5.952074 | 3.364491 |
    | lookup       | 5.825784 | 3.400727 |
    | range        | 5.715068 | 3.263643 |
    | fastvalidate | 5.588628 | 3.208918 |

    random.json:

    |              | AVX2     | SSE4.2   |
    |--------------|----------|----------|
    | @zwegner     | 4.318748 | 2.632677 |
    | lookup       | 4.087078 | 2.387633 |
    | range        | 3.917698 | 2.295306 |
    | fastvalidate | 3.778505 | 2.174089 |

    gsoc-2018.json:

    |              | AVX2     | SSE4.2   |
    |--------------|----------|----------|
    | @zwegner     | 7.923407 | 4.403640 |
    | lookup       | 8.091007 | 4.583158 |
    | range        | 7.891465 | 4.483739 |
    | fastvalidate | 7.949907 | 4.563674 |

    This algorithm uses 4-bit table lookups to look up 8-bit "error flags" from each nibble, and treats a sequence as an error only if all nibbles in the sequence have the error flag. It turns out UTF-8 has 8 3-nibble error sequences (byte 1 alongside the high nibble of the next 4 bytes) and 2 2-nibble error sequences (byte 1 by itself). It is also notable that, incredibly, no more than 8 distinct combinations of the first byte and later paired bytes are required to reliably detect all errors.

    It works by sequences of 4-bit table lookups, &'d together, so that the error is present only if all nibbles are part of the error value. For example, to detect this overlong encoding of "a" (0x61):

    • overlong encoding of a: 11000001 10010001
    • Look up high nibble 1100 in this table, yielding ERROR_OVERLONG_2
    • Look up low nibble 0001 in this table, yielding ERROR_OVERLONG_2
    • When &'d together, the bits align and we have an error! If the low nibble had been 0010, we would have detected no errors.

    The algorithm is simple:

    1. First byte errors: Check for errors that only require the first byte to detect: overlong encoding of 2 bytes (1100000_) and a subset of too-large encodings (anything >= 11110101). This is done by a pair of table lookups (high and low nibble) and a single &. Too-large encodings could be detected quickly with >, but when figuring out both of these together, I couldn't find any way to beat 2 table lookups and an &, instruction-count-wise.
    2. Second byte errors: Check for errors that require the first byte + another (overlong 3- and 4-byte encodings, missing/extra continuations, some of the smaller overlarge values, and surrogates). To accomplish this, we're essentially doing the same thing as step 1 (flag lookups with &) but with 3 nibbles each time.

    This made AVX around 1-1.5% faster on my machine (about the same as the range lookup algorithm), and astonishingly passed make test the first time I ran it! Submitting this is partly to see what ARM thinks of it :)

    I'm curious if anyone has ideas for further improvement. It feels like overuse of the AND-flag-lookup method, but that method takes so few instructions (4-5 per pair) that it's hard to come up with non-lookup-based methods that compare. I suspect 2-byte-long error detection of overlong/surrogate/too-large bytes is about as optimal as it can get, but perhaps there are clever ways to handle missing/extra continuation detection or the 1-byte-long error detection that save us a few instructions.

    opened by jkeiser 97
  • [WIP] Exact float parsing

    [WIP] Exact float parsing

    This is an attempt at providing exact (or more exact) float parsing at high speed.

    • [ ] in functions like compute_float_64 and its 128-bit counterpart, we need to ensure that the binary exponent is always greater than 0 and strictly smaller than 0x7FF.
    • [ ] we need to add more testing
    • [ ] we should compare the performance against abseil-cpp, ScanadvDouble, Boost Spirit and Andrei Alexandrescu's implementation (Folly)
    • [ ] the lookup tables could be made smaller, we should investigate
    • [ ] Make sure that the code compiles under most compilers including Visual Studio; this involves avoiding 128-bit integers.
    • [ ] Benchmark on ARM processors.

    Fixes https://github.com/lemire/simdjson/issues/242

    This replaces https://github.com/lemire/simdjson/pull/303

    research 
    opened by lemire 87
  • elimination of g++ -Weffc++ warnings

    elimination of g++ -Weffc++ warnings

    • document.h bugfixes
      • operator++
      • document.h
        • noninitialized members
        • document_stream load_many trailing whitespace
    • implementation.h
      • missing virtual destructor
      • whitespace (not me, my editor)
    • parsedjson_iterator.h
      • operator=
    • document_stream.h
      • trailing blank
      • noninitialized members
    • document.h
      • trailing whitespace
      • noninitialized members
      • operator++
    • parsedjson_iterator.h
      • noninitialized members
    • json_minifier.h
      • noninitialized members
    • json_scanner.h
      • noninitialized members
      • trailing space
    • json_structural_indexer.h
      • noninitialized members
      • trailing space
    • stage2_build_tape.h
      • noninitialized members
    opened by ostri 80
  • MSVC simdjson twice as slow as ClangCL

    MSVC simdjson twice as slow as ClangCL

    On Windows, simdjson compiled with MSVC parses twitter.json at almost half the speed of ClangCL-compiled simdjson, on the same machine, in the same command prompt.

    | Platform          | Overall | Stage 1 | Stage 2 |
    |-------------------|---------|---------|---------|
    | MSVC 19.25.28614  | 1.3051  | 2.3777  | 3.3898  |
    | ClangCL 9.0.0     | 2.2221  | 5.4161  | 4.6401  |

    Methodology:

    • MSVC: git clean -ffdx build && cmake -B build && cmake --build build --target parse --config Release && build\benchmark\Release\parse jsonexamples\twitter.json
    • ClangCL: git clean -ffdx build && cmake -B build -T ClangCL && cmake --build build --target parse --config Release && build\benchmark\Release\parse jsonexamples\twitter.json

    I validated that MSVC simdjson is using the haswell implementation, both by running json2json to print out the implementation, and by doing set SIMDJSON_FORCE_IMPLEMENTATION=haswell.

    performance platform coverage 
    opened by jkeiser 64
  • [WIP] Unroll the loop, do more work during pipeline stall

    [WIP] Unroll the loop, do more work during pipeline stall

    This patch improves simdjson performance on AVX2 by 3-4% in my tests. This comment describes the 4 changes made, and the reasons why. It also adds a -f option to parse.cpp that causes it to run only find_structural_bits; this was critical to isolating the different performance gains.

    opened by jkeiser 52
  • Faster float parsing

    Faster float parsing

    @lemire here's a version that works for all floats in canada.json. There are a few edge cases to be worked out, and the perf does take a hit (about a 20% increase). But I will be playing with using a mix of powers of 10 and 5, or just using both all-10 and all-5, to see if I can get more accurate mantissas (or at least mantissas whose inaccuracy is not correlated). It does hit the fast path most of the time; it just gets hit hard by what I guess is a slow strtod on this Mac.

    research 
    opened by michaeleisel 52
  • Change new usages to std::allocator to accept custom memory allocator

    Change new usages to std::allocator to accept custom memory allocator

    I've checked the performance using benchmark/perfdiff as well as benchmark/on_demand, and there seems to be no difference.

    The main issue with this PR is that std::allocator::allocate throws bad_alloc when it fails to allocate, instead of gracefully returning a nullptr like new (std::nothrow) does. Per this Stack Overflow post, forcing a bad_alloc with -fno-exceptions would result in an abort() (also confirmed locally).

    Edit: I also wasn't sure whether to touch the malloc calls or not, so I chose not to.

    Closes issue #1017

    opened by rrohak 46
  • Make `object["field"]` order-insensitive in On Demand

    Make `object["field"]` order-insensitive in On Demand

    This makes field lookup order-insensitive by default in On Demand: out-of-order lookups will succeed.

    This means this JSON can be properly processed with code that does object["x"].get_double() + object["y"].get_double():

    [
      { "x": 1, "y": 2 },
      { "y": 2, "x": 1 }
    ]
    

    The previous order-sensitive behavior can still be accessed with object.find_field("field").

    Design

    • object.find_field("field") does an order-sensitive search.
    • object.find_field_unordered("field") does an order-insensitive search.
    • object["field"] uses find_field_unordered().
    • When fields are in order, it behaves exactly the same as find_field(), starting after "x" when it's looking for "y".
    • When fields are out of order, the find_field() algorithm will reach the end without finding anything. At this point, find_field_unordered() cycles back to the beginning of the object and searches from there, stopping when it reaches the original (in order) starting point. If the field is still not found, it returns NO_SUCH_FIELD. This is what allows it to find "y" even though it's not after "x".

    Performance

    This may cause minor performance regressions relative to order-sensitive lookup (there is certainly no reason to believe it will improve performance). There is now an OnDemandUnordered version of the LargeRandom benchmark, which is roughly 2% worse than the ordered version, a real difference caused almost entirely by increased instruction count:

    | Benchmark                      | Throughput   | Instructions | Branch Misses |
    |--------------------------------|--------------|--------------|---------------|
    | LargeRandom<OnDemand>          | 634.066 MB/s | 614,119,494  | 972,480       |
    | LargeRandom<OnDemandUnordered> | 650.795 MB/s | 636,119,541  | 956,152       |

    All benchmarks were updated to use find_field() since they are intended to be order-sensitive. This also let me verify that there is, as expected, no performance difference between the previous [] and the new find_field().

    on demand 
    opened by jkeiser 45
  • Add support for JSON Path

    Add support for JSON Path

    The simdjson library has support for JSON Pointers. JSON Path is a much more powerful query language. It seems that it could be efficiently implemented with On Demand.

    cc @jkeiser

    opened by lemire 0
  • Disabling fallback kernel on systems where it is not needed

    Disabling fallback kernel on systems where it is not needed

    This might save a bit on the binary size (20 kB to 30 kB) and speed up the build slightly.

    @jkeiser correctly shies away from testing whether the macro is defined, and instead tests its value. I think that's a good practice. We tend to abuse #ifdef in the code base, which makes some logic less clear.

    Fixes: https://github.com/simdjson/simdjson/issues/1772

    opened by lemire 0
  • Add get_uint32()/get_int32()

    Add get_uint32()/get_int32()

    Couldn't find any discussion on this. The user can of course check whether the value is in range, but this is such a common case that it would be a useful addition alongside get_uint64() and get_int64().

    opened by karlisolte 1
  • RISC-V 64 - Vector Support

    RISC-V 64 - Vector Support

    Hi!

    I have recently been doing some tests using riscv_vector.h, but I still can't find any uses in real projects, only the examples cited in the references below. I don't know how significant the gains would be on a new architecture, but it might be interesting to try, both in simulated environments and on real devices.


    References

    opened by kassane 1
  • Document and test std::ranges functionality

    Document and test std::ranges functionality

    For C++20 users, we support std::ranges but it is undocumented.

    #include "simdjson.h"
    #include <iostream>
    #include <ranges>
    using namespace simdjson;
    int main(void) {
      auto cars_json = R"( [
      { "make": "Toyota", "model": "Camry",  "year": 2018, "tire_pressure": [ 40.1, 39.9, 37.7, 40.4 ] },
      { "make": "Kia",    "model": "Soul",   "year": 2012, "tire_pressure": [ 30.1, 31.0, 28.6, 28.7 ] },
      { "make": "Toyota", "model": "Tercel", "year": 1999, "tire_pressure": [ 29.8, 30.0, 30.2, 30.5 ] }
    ] )"_padded;
      dom::parser parser;
      auto justmodel = [](auto car) { return car["model"]; };
      for (auto car : parser.parse(cars_json).get_array() | std::views::transform(justmodel)) {
        std::cout << car << std::endl;
      }
    }
    

    Also, it would be nice to extend std::ranges support to ondemand.

    opened by lemire 0
Releases(v3.0.1)
  • v3.0.1(Nov 23, 2022)

    What's Changed

    • Adding more development checks to the DOM front-end by @lemire in https://github.com/simdjson/simdjson/pull/1915
    • Fix: Add padded_string_view overload for parser::parse by @spnda in https://github.com/simdjson/simdjson/pull/1916
    • Serialize integers stored as floats with trailing .0 by @lemire in https://github.com/simdjson/simdjson/pull/1921

    Full Changelog: https://github.com/simdjson/simdjson/compare/v3.0.0...v3.0.1

    Source code(tar.gz)
    Source code(zip)
    simdjson.cpp(647.58 KB)
    simdjson.h(1.20 MB)
  • v3.0.0(Oct 6, 2022)

    What's Changed (gist)

    The main change in version 3.0.0 is that the is_null() methods may now return an error. Previously, they would simply return true or false.

    What's Changed (details)

    • [skip ci] Add an .editorconfig for .cpp/.h/.md for whitespace settings by @TysonAndre in https://github.com/simdjson/simdjson/pull/1901
    • Minor fix (documentation and safety) regarding max. depth in ondemand. by @lemire in https://github.com/simdjson/simdjson/pull/1906
    • Documenting how one can check for the end of the document. by @lemire in https://github.com/simdjson/simdjson/pull/1907
    • Check for trailing tokens in json2msgpack ondemand benchmark by @TysonAndre in https://github.com/simdjson/simdjson/pull/1908
    • Documents better the type method and makes is_null return an error condition in some instances by @lemire in https://github.com/simdjson/simdjson/pull/1909

    Full Changelog: https://github.com/simdjson/simdjson/compare/v2.2.3...v3.0.0

    Source code(tar.gz)
    Source code(zip)
    simdjson.cpp(647.61 KB)
    simdjson.h(1.20 MB)
  • v2.2.3(Oct 2, 2022)

    What's Changed

    • Fixes and verifies issue 1878 https://github.com/simdjson/simdjson/issues/1878. by @lemire in https://github.com/simdjson/simdjson/pull/1880
    • Fixed if-else-if condition, Win64 _fseeki64, trivial constructors C++11 by @GermanAizek in https://github.com/simdjson/simdjson/pull/1883
    • build: add pkg-config support by @Tachi107 in https://github.com/simdjson/simdjson/pull/1767
    • This fixes an error caused by overeager gcc static analyzer by @lemire in https://github.com/simdjson/simdjson/pull/1891
    • Fix various warnings by @spnda in https://github.com/simdjson/simdjson/pull/1888
    • Fix documentation of description() method in implementation by @epoll-reactor in https://github.com/simdjson/simdjson/pull/1895
    • Fix typos in doc/basics.md by @TysonAndre in https://github.com/simdjson/simdjson/pull/1893
    • fix: Reject surrogate pairs with invalid low surrogate by @TysonAndre in https://github.com/simdjson/simdjson/pull/1896
    • Micro-optimization for parsing surrogate pairs by @TysonAndre in https://github.com/simdjson/simdjson/pull/1897
    • Fix typo, formatting nit in HACKING.md by @TysonAndre in https://github.com/simdjson/simdjson/pull/1902
    • Fixing issue 1898 (https://github.com/simdjson/simdjson/issues/1898) by @TysonAndre and @lemire in https://github.com/simdjson/simdjson/pull/1899

    New Contributors

    • @GermanAizek made their first contribution in https://github.com/simdjson/simdjson/pull/1883
    • @spnda made their first contribution in https://github.com/simdjson/simdjson/pull/1888

    Full Changelog: https://github.com/simdjson/simdjson/compare/v2.2.2...v2.2.3

    Source code(tar.gz)
    Source code(zip)
  • v2.2.2(Jul 29, 2022)

    What's Changed

    • remove empty if block by @striezel in https://github.com/simdjson/simdjson/pull/1873
    • cleaning on-demand benchmarks by @lemire in https://github.com/simdjson/simdjson/pull/1875
    • Verifying and fixing issue 1876 by @lemire in https://github.com/simdjson/simdjson/pull/1877

    Full Changelog: https://github.com/simdjson/simdjson/compare/v2.2.1...v2.2.2

    Source code(tar.gz)
    Source code(zip)
    simdjson.cpp(645.96 KB)
    simdjson.h(1.20 MB)
  • v2.2.1(Jul 19, 2022)

    What's Changed

    • update JsonCpp to version 1.9.5 by @striezel in https://github.com/simdjson/simdjson/pull/1863
    • update dependency nlohmann/json for benchmarks to current version 3.10.5 by @striezel in https://github.com/simdjson/simdjson/pull/1862
    • Documenting a specific use case where you need a value if and only if another key is not present by @lemire in https://github.com/simdjson/simdjson/pull/1865
    • Adding DOM benchmark to msgpack by @lemire in https://github.com/simdjson/simdjson/pull/1866
    • Improve build times for debug builds by @strager in https://github.com/simdjson/simdjson/pull/1859
    • Fixing issue 1870 by @lemire in https://github.com/simdjson/simdjson/pull/1871
    • We slightly changed when development checks are enabled. by @lemire in https://github.com/simdjson/simdjson/pull/1869

    Full Changelog: https://github.com/simdjson/simdjson/compare/v2.2.0...v2.2.1

    Source code(tar.gz)
    Source code(zip)
  • v2.2.0(Jul 5, 2022)

  • v2.1.0(Jun 30, 2022)

    What's Changed

    • add SIMDJSON_IMPLEMENTATION_ICELAKE to implementation-selection.md by @striezel in https://github.com/simdjson/simdjson/pull/1848
    • Improve string performance in ondemand by making the string processing runtime dispatched. by @lemire in https://github.com/simdjson/simdjson/pull/1849
    • Removing dead code. by @lemire in https://github.com/simdjson/simdjson/pull/1852
    • adding msgpack benchmarks by @lemire in https://github.com/simdjson/simdjson/pull/1853

    Full Changelog: https://github.com/simdjson/simdjson/compare/v2.0.4...v2.1.0

    Source code(tar.gz)
    Source code(zip)
    simdjson.cpp(652.81 KB)
    simdjson.h(1.21 MB)
  • v2.0.4(Jun 15, 2022)

    What's Changed

    • Compiling under gcc12 without warnings https://github.com/simdjson/simdjson/pull/1836, credit @bertptrs
    • Simpler counters when benchmarking (useful for graviton processors). by @lemire in https://github.com/simdjson/simdjson/pull/1841
    • Fixed a minor bug https://github.com/simdjson/simdjson/issues/1834, credit @MashPlant
    • Fixed clang-13 compiler warning https://github.com/simdjson/simdjson/issues/1840, credit @He3lixxx

    Full Changelog: https://github.com/simdjson/simdjson/compare/v2.0.3...v2.0.4

    Source code(tar.gz)
    Source code(zip)
    simdjson.cpp(618.20 KB)
    simdjson.h(1.24 MB)
  • v2.0.3(Jun 2, 2022)

  • v2.0.2(Jun 2, 2022)

  • v2.0.1(May 26, 2022)

  • v2.0.0(May 25, 2022)

    Adding a new icelake kernel with AVX-512 support. When the compiler and the processor have adequate AVX-512 support, you might see a performance boost of 25% to 40% on several tasks compared to the best previous kernel (haswell).

    We rely on the fact that AVX-512 instructions no longer produce systematic frequency throttling on recent Intel processors (Ice Lake), https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

    Blog post: Parsing JSON faster with Intel AVX-512 https://lemire.me/blog/2022/05/25/parsing-json-faster-with-intel-avx-512/

    Credit: Fangzheng Zhang and Weiqiang Wan (both from Intel) with indirect contributions by Kim Walisch and Jatin Bhateja.

    (This release only enables AVX-512 if SIMDJSON_AVX512_ALLOWED is set to 1, which is the default when building with CMake. An upcoming patch release will make AVX-512 available by default no matter how the code is built.)

    Source code(tar.gz)
    Source code(zip)
    simdjson.cpp(617.98 KB)
    simdjson.h(1.23 MB)
  • v1.1.0(May 17, 2022)

  • v1.0.2(Oct 27, 2021)

  • v1.0.1(Oct 20, 2021)

  • v1.0.0(Sep 7, 2021)

    Release 1.0.0 of the simdjson library builds on the earlier pre-1.0 releases that made the On Demand front-end our default. The On Demand front-end is a new way to build parsers. With On Demand, if you open a file containing 1000 numbers and you need just one of them, only that one number is parsed. If you need to put the numbers into your own data structure, they are materialized there directly, without first being written to a temporary tree. We thus expect that simdjson On Demand will often provide superior performance when you do not need the intermediate materialized view of a DOM tree. The On Demand front-end was primarily developed by @jkeiser.

    If you adopted simdjson in an earlier version and relied on the DOM approach, it remains available as always. Though On Demand is our new default, we remain committed to supporting the conventional DOM approach in the future, as there are instances where it is more appropriate.

    Release 1.0.0 adds several key features:

    • In big data analytics, it is common to serialize large sets of records as multiple JSON documents separated by white space. You can now get the benefits of On Demand while parsing almost infinitely long streams of JSON records (see iterate_many). At each step, you have access to the current document, while a secondary thread indexes the following block. You can thus process enormous files while using a small amount of memory and achieve record-breaking speeds. Credit: @NicolasJiaxin.
    • In some cases, JSON documents contain numbers embedded within strings (e.g., "3.1416"). You can access these numbers directly using methods such as get_double_in_string(). Credit: @NicolasJiaxin
    • Given an On Demand instance (value, array, object, etc.), you can now convert it to a JSON string using the to_json_string method which returns a string view in the original document for unbeatable speeds. Credit: @NicolasJiaxin
    • The On Demand front-end now supports the JSON Pointer specification. You can request a specific value using a JSON Pointer within a large document. Credit: @NicolasJiaxin
    • Arrays in On Demand now have a count_elements() method. Objects have a count_fields() method. Arrays and objects have a reset method for when you need to iterate through them more than once. Document instances now have a rewind method in case you need to process the same document multiple times.

    Other improvements include:

    • We have extended and improved our documentation and we have added much testing.
    • We have accelerated the JSON minification function (simdjson::minify) under ARM processors (credit @dougallj)

    We encourage users of previous versions of the simdjson library to update, and we encourage deploying it for production uses.

    Source code(tar.gz)
    Source code(zip)
    simdjson.cpp(520.94 KB)
    simdjson.h(1.13 MB)
  • v0.9.7(Jul 31, 2021)

    Fixing an issue whereby operator << would halt the program, due to a noexcept clause, when called with a simdjson_result input.

    credit @teathinker

    Source code(tar.gz)
    Source code(zip)
  • v0.9.6(Jun 6, 2021)

    This is a patch release fixing issue https://github.com/simdjson/simdjson/issues/1601. That is, users reusing the same document instance may get spurious errors in subsequent parsing attempts because the error flag in the document is "sticky". This patch fixes the issue by modifying two lines of code.

    Source code(tar.gz)
    Source code(zip)
  • v0.9.5(May 28, 2021)

    Patch release fixing issue 1588: unexpected behavior while traversing a JSON array of JSON objects containing a subset of known keys. This patch adds a single line. Users relying on the On Demand front-end should update.

    Source code(tar.gz)
    Source code(zip)
  • v0.9.4(May 20, 2021)

    This fourth patch release is the second and final fix for an issue with the ondemand front-end where, when we search for a key that is not found, we can end up in a poor state from which follow-up queries lead to spurious errors even on valid JSON.

    Source code(tar.gz)
    Source code(zip)
  • v0.9.3(May 14, 2021)

  • v0.9.2(Apr 1, 2021)

    This is a patch release which fixes a bug for users of the On Demand front-end. In some instances, when trying to access keys that are missing, the parser will fail with a generic error (TAPE_ERROR) under versions 0.9.0 and 0.9.1. Thanks to @jpalus for reporting the issue (https://github.com/simdjson/simdjson/issues/1521) and to @jkeiser for reviewing the patch.

    Source code(tar.gz)
    Source code(zip)
  • v0.9.1(Mar 18, 2021)

  • v0.9.0(Mar 17, 2021)

    The high-performance On Demand front-end introduced in version 0.7 and optimized in version 0.8 becomes the primary simdjson front-end in version 0.9 (credit @jkeiser). The On Demand front-end benefits from new features:

    • The On Demand elements have received a type() method so branching on the schema becomes possible.
    • We make it safer to recover one JSON element as multiple types https://github.com/simdjson/simdjson/pull/1414
    • We added safety checks for out-of-order iterations https://github.com/simdjson/simdjson/pull/1416

    Other changes :

    • You can now parse a DOM document as a separate entity from the parser https://github.com/simdjson/simdjson/pull/1430
    • Improve compatibility with the Qt library.
    • We encourage users to replace templated get<T>() methods with specific methods, using get_object() instead of get<simdjson::dom::object>(), and get_uint64() instead of get<uint64_t>().
    • dom::document_stream::iterator has received a default constructor and is copyable
    Source code(tar.gz)
    Source code(zip)
  • v0.8.2(Feb 11, 2021)

    This patch adds an explicit include for string_view, which is needed by Visual Studio 2017. It also includes other minor Visual Studio 2017 fixes. Visual Studio 2017 users are reminded that they should specifically target x64 builds.

    credit to @NancyLi1013 for reporting the issue

    Source code(tar.gz)
    Source code(zip)
  • v0.8.1(Jan 28, 2021)

  • v0.8.0(Jan 22, 2021)

    The high-performance On Demand front-end introduced in version 0.7 has received major quality-of-life and performance improvements (credit @jkeiser).

    • Runtime dispatching is now supported, achieving high performance without compiling for a specific CPU.
    • Object field lookup is now order-insensitive: double x = object["x"]; double y = object["y"]; will work no matter which order the fields appear in the object. Reading fields in order still gives maximum performance.
    • Object lookup and array iteration can now be used against untyped values, enabling things like chained lookup (object["a"]["b"]).
    • Numbers, strings and boolean values can be saved and parsed later by storing the ondemand::value, allowing more efficient filtering and bulk parsing, as well as fixing smaller quality-of-life issues.

    We have improved our CMake build with respect to installed artefacts so that CMake dependencies automatically handle thread dependencies.

    We have greatly improved our benchmarks with a set of realistic tasks on realistic datasets, using Google Benchmark as a framework.

    skylake-clang10

    Source code(tar.gz)
    Source code(zip)
  • v0.7.1(Dec 14, 2020)

    This is a patch release for version 0.7.0, which mistakenly disabled, by default, the optimized ARM NEON and POWER kernels. The result was a substantial loss of performance, by default, on these platforms. Users could still work around the limitation by passing macro values to the compiler.

    Source code(tar.gz)
    Source code(zip)
  • v0.7.0(Dec 5, 2020)

    This version improves our support for streams of documents (ndjson). We have improved the documentation, especially regarding how one might deal with truncated inputs and track the location of the current JSON document. We have added fuzz testing and other tests, and made some minor fixes.

    Performance:

    • SIMD accelerated (AltiVec) kernel for POWER processors.

    Platform support:

    • We can now disable exceptions when compiling with MSVC
    Source code(tar.gz)
    Source code(zip)
  • v0.6.1(Nov 4, 2020)

    This is a minor patch release for version 0.6.0 to support legacy libc++ builds.

    https://github.com/simdjson/simdjson/issues/1285

    https://github.com/simdjson/simdjson/issues/1286

    Source code(tar.gz)
    Source code(zip)