High performance datastore for time series and tick data

Overview

Arctic TimeSeries and Tick store


Arctic is a high performance datastore for numeric data. It supports Pandas, numpy arrays and pickled objects out-of-the-box, with pluggable support for other data types and optional versioning.

Arctic can query millions of rows per second per client, achieves ~10x compression on network bandwidth, ~10x compression on disk, and scales to hundreds of millions of rows per second per MongoDB instance.

Arctic has been under active development at Man AHL since 2012.

Quickstart

Install Arctic

pip install git+https://github.com/manahl/arctic.git

Run a MongoDB

mongod --dbpath <path/to/db_directory>

Using VersionStore

from arctic import Arctic
import quandl

# Connect to the local MongoDB
store = Arctic('localhost')

# Create the library - defaults to VersionStore
store.initialize_library('NASDAQ')

# Access the library
library = store['NASDAQ']

# Load some data - maybe from Quandl
aapl = quandl.get("WIKI/AAPL", authtoken="your token here")

# Store the data in the library
library.write('AAPL', aapl, metadata={'source': 'Quandl'})

# Reading the data
item = library.read('AAPL')
aapl = item.data
metadata = item.metadata

VersionStore supports much more: See the HowTo!
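
For instance, every write creates a new version, and you can take named snapshots across a whole library. A minimal sketch, reusing the 'NASDAQ' library and 'AAPL' symbol from above (the version numbers shown are illustrative):

# Each write creates a new version; older versions remain readable
library.write('AAPL', aapl.head(100))            # e.g. version 1
library.write('AAPL', aapl)                      # e.g. version 2
previous = library.read('AAPL', as_of=1).data    # read a specific version back

# Snapshot the current state of every symbol in the library, then read as of it
library.snapshot('month-end')
month_end_aapl = library.read('AAPL', as_of='month-end').data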

Adding your own storage engine

Plugging a custom class in as a library type is straightforward; the custom-library example in the repository shows how.
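
As a rough sketch of the idea (the class, method bodies and library names below are illustrative, not the shipped example): a library type is a class that wraps the backing Arctic library and is registered with register_library_type.

from arctic import Arctic, register_library_type


class KeyValueStore(object):
    """Illustrative custom storage engine: a trivial key-value store."""

    def __init__(self, arctic_lib):
        self._arctic_lib = arctic_lib
        # Each library is backed by MongoDB collections it can use as it sees fit
        self._collection = arctic_lib.get_top_level_collection()

    @classmethod
    def initialize_library(cls, arctic_lib, **kwargs):
        # Called once when the library is created, e.g. to build indexes
        pass

    def write(self, key, value):
        self._collection.update_one({'_id': key}, {'$set': {'data': value}}, upsert=True)

    def read(self, key):
        doc = self._collection.find_one({'_id': key})
        return doc['data'] if doc else None


# Register the class under a library-type name, then create libraries of that type
register_library_type('KeyValueStore', KeyValueStore)
store = Arctic('localhost')
store.initialize_library('username.kv', lib_type='KeyValueStore')
library = store['username.kv']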

Documentation

You can find complete documentation at Arctic docs

Concepts

Libraries

Arctic provides namespaced libraries of data. These libraries allow bucketing data by source, user or some other metric (for example frequency: End-Of-Day; Minute Bars; etc.).

Arctic supports multiple data libraries per user. A user (or namespace) maps to a MongoDB database (the granularity of mongo authentication). The library itself is composed of a number of collections within the database. Libraries look like:

  • user.EOD
  • user.ONEMINUTE

A library is mapped to a Python class. All library databases in MongoDB are prefixed with 'arctic_'.
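
A short sketch of the naming, assuming a local MongoDB (the namespace and library names are just the examples above):

from arctic import Arctic

store = Arctic('localhost')

# 'user' is the namespace - it maps to the MongoDB database 'arctic_user';
# 'EOD' and 'ONEMINUTE' are libraries (collections) inside that database.
store.initialize_library('user.EOD')
store.initialize_library('user.ONEMINUTE')

print(store.list_libraries())   # ['user.EOD', 'user.ONEMINUTE', ...]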

Storage Engines

Arctic includes three storage engines:

  • VersionStore: a key-value versioned TimeSeries store. It supports:
    • Pandas data types (other Python types are pickled)
    • Multiple versions of each data item; previous versions can easily be read back
    • Point-in-time snapshots across symbols in a library
    • Soft quota support
    • Hooks for persisting other data types
    • Audited writes: an API for saving metadata and data before and after a write
    • A wide range of time series frequencies: End-Of-Day to minute bars
    • See the HowTo
    • Documentation
  • TickStore: a column-oriented tick database. It supports dynamic fields; chunks aren't versioned. Designed for large, continuously ticking data.
  • ChunkStore: a storage type that allows data to be stored in customizable chunk sizes. Chunks aren't versioned, and can be appended to and updated in place.

Arctic storage implementations are pluggable. VersionStore is the default.
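
The engine is chosen when a library is created. A minimal sketch, assuming the lib_type constants exported by the arctic package and illustrative library names:

from arctic import Arctic, TICK_STORE, CHUNK_STORE

store = Arctic('localhost')

store.initialize_library('user.EOD')                            # defaults to VersionStore
store.initialize_library('user.TICKS', lib_type=TICK_STORE)     # TickStore
store.initialize_library('user.CHUNKED', lib_type=CHUNK_STORE)  # ChunkStore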

Requirements

Arctic currently works with:

  • Python 2.7, 3.4, 3.5, 3.6
  • pymongo >= 3.6
  • Pandas
  • MongoDB >= 2.4.x

Operating Systems:

  • Linux
  • macOS
  • Windows 10

Acknowledgements

Arctic has been under active development at Man AHL since 2012.

It wouldn't be possible without the work of the AHL Data Engineering Team.

Contributions welcome!

License

Arctic is licensed under the GNU LGPL v2.1. A copy is included in LICENSE.

Comments
  • Cannot compile on Windows

    I installed the recommended C++ Compiler for Python 2.7; however, I can't seem to run the installer.

    The fatal error I got is: C1083: Cannot open include file: 'stdint.h': No such file or directory

    Is it possible to release a compiled version of this?

    help wanted 
    opened by beartastic 36
  • can't install on mac os x

    clang -fno-strict-aliasing -fno-common -dynamic -g -O2 -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/2.7.10/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/_compress.c -o build/temp.macosx-10.10-x86_64-2.7/src/_compress.o -fopenmp
    src/_compress.c:259:10: fatal error: 'omp.h' file not found
    #include "omp.h"
             ^
    1 error generated.
    error: command 'clang' failed with exit status 1
    

    I installed clang-omp.

    enhancement help wanted 
    opened by ccoossdddffdd 21
  • Usage discussion: VersionStore vs TickStore, allowed options for VersionStore.write..

    First of all - my thanks to the maintainers. This library is exactly what I was looking for and looks very promising.

    I've been having a bit of trouble figuring out how to optimally use arctic, though. I've been following the examples in /howto, which are... sparse. Is there somewhere else I might find examples or docs?

    Now, some dumb questions about VersionStore and TickStore:

    • I've noticed that every time I write to a VersionStore, an entirely new version is created. Are finer-grained options for versioning available? For instance, I would like to write streaming updates to a single version, only incrementing the version when manually specified. I tried just passing version=1 to lib.write, but this doesn't seem to be supported.
    • In what scenarios might one want to use VersionStore vs TickStore? It's not clear to me what the differences are from the README or the code.
    • My current use case is primarily as a database for streams - is TickStore recommended for this? Is there a reason one might want to use VersionStore instead?
    • ~~Is TickStore appropriate for data which may have more than one row for each timestamp (event data)?~~ Nope, not allowed by TickStore

    Thanks in advance for your help and patience!

    opened by rueberger 19
  • how_to_use_arctic.py fails with SyntaxError: unexpected EOF while parsing on library.read()

    Attempting to run the demo as a test, and getting errors on library.read(). Also, I get an error the first time I run store.initialize_library('username.scratch'), but the second time it did work. I did have to make the following change to get the Arctic import to run:

    store/_version_store_utils.py, line 66: if pd.__version__.startswith("0.14"):

    See the paste of output, including the EOF error, below. I'm not sure if it's a numpy bug or an arctic one, but I'm not sure where to go from here.

    Python 2.7.10 |Anaconda 2.3.0 (64-bit)| (default, Oct 19 2015, 18:04:42)
    [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    Anaconda is brought to you by Continuum Analytics.
    Please check out: http://continuum.io/thanks and https://anaconda.org

    >>> from arctic import Arctic
    >>> from datetime import datetime as dt
    >>> import pandas as pd
    >>> store = Arctic('localhost')
    >>> store.list_libraries()
    [u'NASDAQ', u'username.scratch']
    >>> store.initialize_library('username.scratch')
    No handlers could be found for logger "arctic.store.version_store"
    >>> store.initialize_library('username.scratch')
    >>> library = store['username.scratch']
    >>> df = pd.DataFrame({'prices': [1, 2, 3]}, [dt(2014, 1, 1), dt(2014, 1, 2), dt(2014, 1, 3)])
    >>> library.write('SYMBOL', df)
    VersionedItem(symbol=SYMBOL,library=arctic_username.scratch,data=<type 'NoneType'>,version=19,metadata=None)
    >>> library.read('SYMBOL')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/jeff/anaconda/lib/python2.7/site-packages/arctic/store/version_store.py", line 319, in read
        date_range=date_range, read_preference=read_preference, **kwargs)
      File "/home/jeff/anaconda/lib/python2.7/site-packages/arctic/store/version_store.py", line 363, in _do_read
        data = handler.read(self._arctic_lib, version, symbol, from_version=from_version, **kwargs)
      File "/home/jeff/anaconda/lib/python2.7/site-packages/arctic/store/_pandas_ndarray_store.py", line 279, in read
        item = super(PandasDataFrameStore, self).read(arctic_lib, version, symbol, **kwargs)
      File "/home/jeff/anaconda/lib/python2.7/site-packages/arctic/store/_pandas_ndarray_store.py", line 193, in read
        date_range=date_range, **kwargs)
      File "/home/jeff/anaconda/lib/python2.7/site-packages/arctic/store/_ndarray_store.py", line 180, in read
        return self._do_read(collection, version, symbol, index_range=index_range)
      File "/home/jeff/anaconda/lib/python2.7/site-packages/arctic/store/_ndarray_store.py", line 219, in _do_read
        dtype = self._dtype(version['dtype'], version.get('dtype_metadata', {}))
      File "/home/jeff/anaconda/lib/python2.7/site-packages/arctic/store/_ndarray_store.py", line 139, in _dtype
        return np.dtype(string, metadata=metadata)
      File "/home/jeff/anaconda/lib/python2.7/site-packages/numpy/core/_internal.py", line 191, in _commastring
        newitem = (dtype, eval(repeats))
      File "<string>", line 1
        (
        ^
    SyntaxError: unexpected EOF while parsing

    opened by jeffneuen 19
  • ServerSelectionTimeoutError:  even when can connect to mongo via CLI or Compass

    Arctic Version

    # arctic==1.66.0
    

    Arctic Store

    # TICK_STORE
    

    Platform and version

    Ubuntu 18.04 LTS (mongodb)

    Description of problem and/or code sample that reproduces the issue

    I am trying to access a mongodb instance on the network. I have been able to connect via the MongoDB CLI and Compass using the ip:port shown below, but somehow when I run this I get a server timeout.

    from arctic import Arctic
    
    # also tried full mongo uri 'mongodb://197.168.0.210:27200/'
    store = Arctic('197.168.0.210:27200')
    
    store.list_libraries()
    
    
    ServerSelectionTimeoutError: 197.168.0.210:27200: timed out
    
    

    I checked the pymongo call to connect, and attempted to use the full mongo URI but it still fails any call to the database.

    opened by derekwong9 17
  • Problem with TickStore: "arctic.date._mktz.TimezoneError: Timezone "UTC" can not be read"

    I'm having an issue trying to add data to a TickStore library.

    The data is a DataFrame and its index is a DatetimeIndex:

    (Pdb) data_new.index
    DatetimeIndex(['2014-11-21 16:56:58.534000-02:00', '2014-11-21 17:49:56.935000-02:00',
                   '2014-11-21 18:00:01.099000-02:00', '2014-11-21 18:06:00.012000-02:00'],
                  dtype='datetime64[ns, America/Sao_Paulo]', freq=None)

    When I try to add it to the library, I get this:

    (Pdb) library.write(i, data_new)
    arctic.date._mktz.TimezoneError: Timezone "UTC" can not be read, error: "[Errno 2] No such file or directory: '/usr/share/zoneinfo/UTC'"

    How can I deal with it?

    opened by rtadewald 17
  • Future Development? - Java API

    Hi there,

    I have been trialling this out and it looks like a great framework for storage and retrieval. Are there any plans for a Java API, or are you looking for contributors to help? I'm testing this out as data storage for the zipline/quantopian backtester and also for a JVM-based project, and was wondering what stage that is at, if any?

    opened by michaeljohnbennett 17
  • arctic import error (gcc/cpython)

    Arctic Version

    1.55.0
    

    Arctic Store

    ChunkStore
    

    Platform and version

    macOS Sierra 10.12.6 anaconda python3.6

    Description of problem and/or code sample that reproduces the issue

    I am having an error when importing arctic that I believe relates to my current gcc or cython build. I think I need to install gcc --without-multilib, but I am not entirely sure. I also think it is possible that the gcc version python is using is outdated.

    Here is the full code output and error statement:

    Suryas-MacBook-Pro:~ surya$ python
    Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 6 2017, 12:04:38)
    [GCC 4.2.1 Compatible Clang 4.0.1 (tags/RELEASE_401/final)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import arctic
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/surya/anaconda3/lib/python3.6/site-packages/arctic/__init__.py", line 3, in <module>
        from .arctic import Arctic, register_library_type
      File "/Users/surya/anaconda3/lib/python3.6/site-packages/arctic/arctic.py", line 12, in <module>
        from .store import version_store, bson_store, metadata_store
      File "/Users/surya/anaconda3/lib/python3.6/site-packages/arctic/store/version_store.py", line 15, in <module>
        from ._pickle_store import PickleStore
      File "/Users/surya/anaconda3/lib/python3.6/site-packages/arctic/store/_pickle_store.py", line 7, in <module>
        from .._compression import decompress, compress_array
      File "/Users/surya/anaconda3/lib/python3.6/site-packages/arctic/_compression.py", line 3, in <module>
        from . import _compress as clz4
    ImportError: dlopen(/Users/surya/anaconda3/lib/python3.6/site-packages/arctic/_compress.cpython-36m-darwin.so, 2): Symbol not found: _GOMP_parallel
      Referenced from: /Users/surya/anaconda3/lib/python3.6/site-packages/arctic/_compress.cpython-36m-darwin.so
      Expected in: flat namespace in /Users/surya/anaconda3/lib/python3.6/site-packages/arctic/_compress.cpython-36m-darwin.so

    Here is my output for brew list --versions:

    coreutils 8.28_1
    gcc 7.2.0
    gmp 6.1.2_1
    isl 0.18
    libmpc 1.0.3_1
    mongodb 3.4.10
    mpfr 3.1.6
    openssl 1.0.2l
    pkg-config 0.29.2

    opened by suryabahubalendruni 15
  • Randomly raise Exception: Error decompressing if append many times to a symbol in chunk store

    Arctic Version

    arctic (1.51.0)
    

    Arctic Store

    ChunkStore
    

    Platform and version

    Red Hat Enterprise Linux Server release 7.2 (Maipo)

    Description of problem and/or code sample that reproduces the issue

    I append daily data to one symbol (write if not exists and set chunk_size = 'A').

    The data looks like this:

    • columns of the DataFrame
    Index(['beta', 'btop', 'earnyild', 'growth', 'industry', 'leverage',
           'liquidty', 'momentum', 'resvol', 'sid', 'size', 'sizenl'],
          dtype='object')
    
    • head of the DataFrame (part)
                 beta   btop  earnyild  growth  industry  leverage  liquidty 
    date                                                                       
    2008-12-25  0.200 -0.386    -0.669  -0.432        23    -0.307     0.746   
    2008-12-25  0.653  0.048     0.671   0.182        10     0.255     1.097   
    2008-12-25 -1.726 -1.105    -1.042  -2.661        22    -0.732    -3.400   
    2008-12-25 -0.407  2.840     2.588  -1.505        19    -0.454    -1.137   
    2008-12-25  0.931  1.302    -0.946  -0.306        31     3.042    -0.429   
    
    • the dtypes
    beta        float64
    btop        float64
    earnyild    float64
    growth      float64
    industry      int64
    leverage    float64
    liquidty    float64
    momentum    float64
    resvol      float64
    sid           int64
    size        float64
    sizenl      float64
    

    it will randomly raise the following exception (2008-12-25 for example)

    [2017-08-29 14:17:00] [factor.value] [INFO] update 2008-12-25 barra exposures failed:Error decompressing
    [2017-08-29 14:17:00] [factor.value] [ERROR] Traceback (most recent call last):
      File "/home/quant/newalpha/warden/warden/_update_factors.py", line 88, in _update_barra_exposures
        n = update_lib(lib_factors, 'barra_exposures', exposures)
      File "/home/quant/newalpha/warden/warden/utils.py", line 70, in update_lib
        lib.append(symbol, data_to_append, metadata=meta)
      File "/opt/anaconda3/lib/python3.5/site-packages/arctic/chunkstore/chunkstore.py", line 503, in append
        self.__update(sym, item, metadata=metadata, combine_method=SER_MAP[sym[SERIALIZER]].combine, audit=audit)
      File "/opt/anaconda3/lib/python3.5/site-packages/arctic/chunkstore/chunkstore.py", line 415, in __update
        df = self.read(symbol, chunk_range=chunker.to_range(start, end), filter_data=False)
      File "/opt/anaconda3/lib/python3.5/site-packages/arctic/chunkstore/chunkstore.py", line 268, in read
        data = SER_MAP[sym[SERIALIZER]].deserialize(chunks, **kwargs)
      File "/opt/anaconda3/lib/python3.5/site-packages/arctic/serialization/numpy_arrays.py", line 195, in deserialize
        df = pd.concat([self.converter.objify(d, columns) for d in data], ignore_index=not index)
      File "/opt/anaconda3/lib/python3.5/site-packages/arctic/serialization/numpy_arrays.py", line 195, in <listcomp>
        df = pd.concat([self.converter.objify(d, columns) for d in data], ignore_index=not index)
      File "/opt/anaconda3/lib/python3.5/site-packages/arctic/serialization/numpy_arrays.py", line 126, in objify
        d = decompress(doc[DATA][doc[METADATA][LENGTHS][col][0]: doc[METADATA][LENGTHS][col][1] + 1])
      File "/opt/anaconda3/lib/python3.5/site-packages/arctic/_compression.py", line 55, in decompress
        return clz4.decompress(_str)
      File "_compress.pyx", line 121, in _compress.decompress (src/_compress.c:2151)
    Exception: Error decompressing
    
    

    It seems that the data was broken and cannot be decompressed (for any date range). If I delete the document related to 2008, it can be decompressed again.

    Thx a lot!

    opened by lf-shaw 14
  • With lib_type='TickStoreV3': No field of name index - index.name and index.tzinfo not preserved - max_date returning min date (without timezone)

    Hello,

    this code

    from pandas_datareader import data as pdr
    symbol = "IBM"
    df = pdr.DataReader(symbol, "yahoo", "2010-01-01", "2015-12-29")
    df.index = df.index.tz_localize('UTC')
    
    from arctic import Arctic
    store = Arctic('localhost')
    store.initialize_library('library_name', 'TickStoreV3')
    library = store['library_name']
    library.write(symbol, df)
    

    raises

    ValueError: no field of name index
    

    I'm using TickStoreV3 as lib_type because I'm not very interested (at least for now) in audited writes, versioning...

    I noticed that

    >>> df['index']=0
    >>> library.write(symbol, df)
    1 buckets in 0.015091: approx 6626466 ticks/sec
    

    seems to fix this... but

    >>> library.read(symbol)
                               index        High   Adj Close     ...             Low       Close        Open
    1970-01-01 01:00:00+01:00      0  132.970001  116.564610     ...      130.850006  132.449997  131.179993
    1970-01-01 01:00:00+01:00      0  131.850006  115.156514     ...      130.100006  130.850006  131.679993
    1970-01-01 01:00:00+01:00      0  131.490005  114.408453     ...      129.809998  130.000000  130.679993
    1970-01-01 01:00:00+01:00      0  130.250000  114.012427     ...      128.910004  129.550003  129.869995
    1970-01-01 01:00:00+01:00      0  130.919998  115.156514     ...      129.050003  130.850006  129.070007
    ...                          ...         ...         ...     ...             ...         ...         ...
    1970-01-01 01:00:00+01:00      0  135.830002  135.500000     ...      134.020004  135.500000  135.830002
    1970-01-01 01:00:00+01:00      0  138.190002  137.929993     ...      135.649994  137.929993  135.880005
    1970-01-01 01:00:00+01:00      0  139.309998  138.539993     ...      138.110001  138.539993  138.300003
    1970-01-01 01:00:00+01:00      0  138.880005  138.250000     ...      138.110001  138.250000  138.429993
    1970-01-01 01:00:00+01:00      0  138.039993  137.610001     ...      136.539993  137.610001  137.740005
    
    [1507 rows x 7 columns]
    

    It looks as if write were looking for a DataFrame with a column named 'index'... which is quite odd.

    If I do

    df['index']=1
    library.write(symbol, df)
    

    then

    library.write(symbol, df)
    

    raises

    OverflowError: Python int too large to convert to C long
    

    Any idea?

    opened by femtotrader 13
  • MemoryError when saving a dataframe with large strings to TickStore

    Arctic Version

    1.79.2

    Arctic Store

    TickStore

    Platform and version

    Python 3.6.7, Linux Mint 19 Cinnamon 64-bit

    Description of problem and/or code sample that reproduces the issue

    Hi, I'm trying to save the following data: https://drive.google.com/file/d/1dWWBNvx6vjyNK4kjZTVL4-YM0fmWxT5b/view?usp=sharing

    to TickStore, code: https://pastebin.com/jEqXxq2t

    and getting a MemoryError, see the stack traces: https://pastebin.com/Uy4pYAfH

    I'm quite new to arctic so I might be doing something wrong, and I would appreciate if you could guide me with this.

    Side question: considering the nature of my data (2 columns made of a timestamp and a long string/JSON), what is the best way to store these using arctic?

    Thanks, Alan

    opened by alanbogossian 12
  • Missing last chunk in CHUNK_STORE

    Arctic Version

    1.80.5
    

    Arctic Store

    # ChunkStore
    

    Platform and version

    Python 3.8.5

    Description of problem and/or code sample that reproduces the issue

    I noticed that if I save a dataframe where the UTC date carries over to the next day, most functions (reverse_iterator, get_chunk_ranges, get_info, ...) don't return the chunk for the new date. The following example will make this clear (jupyter notebook attached in the zip file):

    Set Up

    import pandas as pd
    from arctic import Arctic, CHUNK_STORE
    store = Arctic("localhost")
    store.initialize_library("scratch_lib", lib_type=CHUNK_STORE)
    
    lib = store["scratch_lib"]
    

    Create an Index with some times that will change dates when converted to UTC

    ind = pd.Index([pd.Timestamp("20121208T16:00", tz="US/Eastern"), pd.Timestamp("20121208T18:00", tz="US/Eastern"), 
                    pd.Timestamp("20121208T20:00", tz="US/Eastern"), pd.Timestamp("20121208T22:00", tz="US/Eastern")], name="date")
    print(ind)
    

    Output:

    DatetimeIndex(['2012-12-08 16:00:00-05:00', '2012-12-08 18:00:00-05:00', '2012-12-08 20:00:00-05:00', '2012-12-08 22:00:00-05:00'], dtype='datetime64[ns, US/Eastern]', name='date', freq=None)

    print(ind.tz_convert("UTC"))

    Output

    DatetimeIndex(['2012-12-08 21:00:00+00:00', '2012-12-08 23:00:00+00:00', '2012-12-09 01:00:00+00:00', '2012-12-09 03:00:00+00:00'], dtype='datetime64[ns, UTC]', name='date', freq=None)

    Create dataframe, write it to the library, and read it back out

    df = pd.DataFrame([1, 2, 3, 4], index=ind, columns=["col"])
    lib.write("example_df", df, chunk_size="D")
    df_read = lib.read("example_df")
    print(df_read)
    

    Output

    date                 col
    2012-12-08 21:00:00    1
    2012-12-08 23:00:00    2
    2012-12-09 01:00:00    3
    2012-12-09 03:00:00    4

    This is different from what I expected. Is this behavior expected?

    lib.get_info("example_df")

    Output

    {'chunk_count': 1, 'len': 4, 'appended_rows': 0, 'metadata': {'columns': ['date', 'col']}, 'chunker': 'date', 'chunk_size': 'D', 'serializer': 'FrameToArray'}

    >> expected chunk_count = 2, not 1

    list(lib.get_chunk_ranges("example_df"))

    Output

    [(b'2012-12-08 00:00:00', b'2012-12-08 23:59:59.999000')]

    >> expected [(b'2012-12-08 00:00:00', b'2012-12-08 23:59:59.999000'), (b'2012-12-09 00:00:00', b'2012-12-09 23:59:59.999000')]

    iterator = lib.reverse_iterator("example_df")
    while True:
        data = next(iterator, None)
        if data is None:
            break
        print(data)
    

    Output

    date                 col
    2012-12-08 21:00:00    1
    2012-12-08 23:00:00    2

    >> expected the following:

    date                 col
    2012-12-09 01:00:00    3
    2012-12-09 03:00:00    4

    date                 col
    2012-12-08 21:00:00    1
    2012-12-08 23:00:00    2

    arctic_issue_example.zip

    opened by atamkapoor 1
  • best practice usage

    Hello, thank you very much for making this open source.

    1/ Is there an optimised way to access a timeseries of revisions? In the VersionStore, if we have saved several versions:

    version1, saved at 2022-01-04:
        2022-01-01  1
        2022-01-02  2

    version2, saved at 2022-01-05:
        2022-01-01  1
        2022-01-02  3

    then I would like to retrieve, in an efficient manner, the timeseries of changes for the value as of 2022-01-02, i.e.:

        2022-01-04  2
        2022-01-05  3

    2/ Is there a permission layer that allows choosing who has access to which ticker?

    opened by RockScience 0
  • Index Monotonic Sort Bug in class DateChunker

    Index Monotonic Sort Bug in class DateChunker (in file date_chunker.py)

    If the df's index is not monotonically increasing, arctic will sort the df by index, BUT the variable dates is still not in order.

    I suggest arctic move the line dates = df.index.get_level_values('date') to after the if statement; a sketch of that reordering follows the code below.

    def to_chunks(self, df, chunk_size='D', func=None, **kwargs):
        """
        chunks the dataframe/series by dates
    
        Parameters
        ----------
        df: pandas dataframe or series
        chunk_size: str
            any valid Pandas frequency string
        func: function
            func will be applied to each `chunk` generated by the chunker.
            This function CANNOT modify the date column of the dataframe!
    
        Returns
        -------
        generator that produces tuples: (start date, end date,
                  chunk_size, dataframe/series)
        """
        if 'date' in df.index.names:
            dates = df.index.get_level_values('date')
            if not df.index.is_monotonic_increasing:
                df = df.sort_index()
            # TODO: dates won't be sorted here, which will cause a data store error.
            # Suggested fix: dates = df.index.get_level_values('date')
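
    A minimal sketch of the reordering the reporter suggests (not the shipped implementation):

    if 'date' in df.index.names:
        if not df.index.is_monotonic_increasing:
            df = df.sort_index()
        # fetch dates only after sorting, so they line up with the sorted frame
        dates = df.index.get_level_values('date')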

    Anyway, arctic is an excellent project!

    (This is my first time leaving a comment on GitHub; please excuse my broken English.)

    opened by qcyfred 0
  • MongoDB 4.2 EOL April 2023 - What's Next?

    I recently received an email from MongoDB reminding me that my 4.2 cluster will be reaching EOL in April 2023. I am sure I am not the only one who received this...

    @jamesblackburn - What's the plan here? Will we need to request an extension from Mongo to continue running 4.2? Is there any roadmap for when, if at all, you will support 4.4 or 5.0? Per #938, a 4.4 or 5.0 version doesn't seem to be close?

    Will we need to plan to move to the S3 version? What's going on here?

    Some clarity here would help... as the date is approaching fast.

    Btw - Thanks for the v1.80.5

    opened by luongjames8 6
  • mongodump and mongorestore library - Blob (not pure dataframe)

    Arctic Version

    # 1.80.0
    

    Arctic Store

    # VersionStore
    

    Platform and version

    Spyder (Python 3.8)

    Description of problem and/or code sample that reproduces the issue

    Hi, I use mongodump and mongorestore to move libraries between PCs (let me know if there are easier ways). Each library (in this case, mine is called "attribution_europe_data") has 5 collections from MongoDB's point of view, which are attribution_europe_data / ....ARCTIC / ....snapshots / ...version_nums / ...versions, and during the mongodump process it dumps 2 files for each collection, so a total of 10 files per library.

    I successfully managed to mongorestore those 10 files onto a separate PC, i.e. I can do things like print(Arctic('localhost')['attribution_data_europe'].list_symbols()).

    Now, each symbol in my library represents a pandas dataframe (actually they are saved as a Blob, since they contain Objects) of around 5000 rows x 2000 columns. The issue is that if I read it on the new PC, e.g. "Arctic('localhost')['attribution_europe_data'].read('20220913').data" in Spyder, it will freeze and eventually show "restarting kernel....".

    It shouldn't be a memory issue reading that dataframe, as I generated a similarly sized dataframe randomly on the same PC and it is OK.

    As a test, I used the same mongodump and mongorestore method on a smaller/simpler library, consisting of a single very simple symbol: a dictionary {'hi': 1}. The new PC (where I restore it) is able to read this library and this symbol without any issue. Similarly, when I use the same method on a pure dataframe, as opposed to a Blob, it works as well!

    So do you think the mongodump and mongorestore process corrupts the Blob objects?

    Also, what do you guys normally use to transfer arctic libraries from one PC to another? Surely there is a simpler way than mongodump and mongorestore?

    Just to update with more investigation:

    1. if the symbol is a dataframe (that is NOT saved as a blob), it works
    2. if the symbol is a dict, say {'hi': 1}, it works
    3. if the symbol is a blob, it DOES NOT work (i.e. it will have trouble reading that symbol from the restored library on the new PC)
    4. if the symbol is a dict wrapped around a pure dataframe, e.g. {'hi': pd.DataFrame(np.random.rand(2, 2))}, then it works
    5. if the symbol is a dict wrapped around a blob, e.g. {'hi': some_blob}, it DOES NOT work

    I have included what it looks like on the old PC, and what error it throws up on the new PC, when the symbol is a dict wrapped around a blob:

    (old PC) [screenshot]

    (new PC) [screenshot]

    opened by fengster123 0