thunder

scalable analysis of image and time series data in python

Thunder is an ecosystem of tools for the analysis of image and time series data in Python. It provides data structures and algorithms for loading, processing, and analyzing these data, and can be useful in a variety of domains, including neuroscience, medical imaging, video processing, and geospatial and climate analysis. It can be used locally, but also supports large-scale analysis through the distributed computing engine Spark. All data structures and analyses in Thunder are designed to run identically and with the same API whether local or distributed.

Thunder is designed around modularity and composability — the core thunder package, in this repository, only defines common data structures and read/write patterns, and most functionality is broken out into several related packages. Each one is independently versioned, with its own GitHub repository for organizing issues and contributions.

This readme provides an overview of the core thunder package, its data types, and methods for loading and saving. Tutorials, detailed API documentation, and info about all associated packages can be found at the documentation site.

install

The core thunder package defines data structures and read/write patterns for images and series data. It is built on numpy, scipy, scikit-learn, and scikit-image, and is compatible with Python 2.7+ and 3.4+. You can install it using:

pip install thunder-python

related packages

Lots of functionality in Thunder, especially for specific types of analyses, is broken out into the following separate packages.

You can install the ones you want with pip, for example

pip install thunder-regression
pip install thunder-registration

example

Here's a short snippet showing how to load an image sequence (in this case random data), median filter it, transform it to a series, detrend and compute a fourier transform on each pixel, then convert it to an array.

import thunder as td

# generate an images object containing random data
data = td.images.fromrandom()
# median filter each image, then convert to series (one time series per pixel)
ts = data.median_filter(3).toseries()
# detrend each series, take a fourier transform, and collect a local numpy array
frequencies = ts.detrend().fourier(freq=3).toarray()

usage

Most workflows in Thunder begin by loading data, which can come from a variety of sources and locations, and can be either local or distributed (see below).

The two primary data types are images and series. images are used for collections or sequences of images, and are especially useful when working with movie data. series are used for collections of one-dimensional arrays, often representing time series.

Once loaded, each data type can be manipulated through a variety of statistical operators, including simple statistical aggregations like mean, min, and max, and more complex operations like gaussian_filter, detrend, and subsample. Both images and series objects are wrappers for ndarrays: either a local numpy ndarray or a distributed ndarray using bolt and spark. Calling toarray() on an images or series object at any time returns a local numpy ndarray, which is an easy way to move between Thunder and other Python data analysis tools, like pandas and scikit-learn.
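
For instance, here's a minimal sketch, using random example data, of chaining a few of these operators and collecting results as local arrays (exact method availability may vary slightly by version):

import thunder as td

# generate a small random images dataset locally
data = td.images.fromrandom(shape=(10, 50, 50))

# aggregate and smooth on the images side, then collect local arrays
mean_image = data.mean().toarray()
smoothed = data.gaussian_filter(sigma=2).toarray()

# the same pattern works on the series side
standardized = data.toseries().zscore().toarray()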

For a full list of methods on image and series data, see the documentation site.

loading data

Both images and series can be loaded from a variety of data types and locations. For all loading methods, the optional argument engine allows you to specify whether data should be loaded in 'local' mode, which is backed by a numpy array, or in 'spark' mode, which is backed by an RDD.

All loading methods are available on the module for the corresponding data type, for example

import thunder as td

data = td.images.fromtif('/path/to/tifs')
data = td.series.fromarray(somearray)
data_distributed = td.series.fromarray(somearray, engine=sc)

The argument engine can be either None for local use or a SparkContext for distributed use with Spark. In either case, methods that load from files, e.g. fromtif or frombinary, can load from either a local filesystem or Amazon S3, with the optional argument credentials for S3 credentials. See the documentation site for a full list of data loading methods.
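
As a hedged sketch of S3 loading (the bucket path here is hypothetical, and the exact credentials format is described on the documentation site):

import thunder as td

# hypothetical S3 path and AWS credentials
credentials = {'access': 'MY_ACCESS_KEY', 'secret': 'MY_SECRET_KEY'}
data = td.images.fromtif('s3://my-bucket/path/to/tifs', credentials=credentials)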

using with spark

Thunder doesn't require Spark and can run locally without it, but Spark and Thunder work great together! To install and configure a Spark cluster, consult the official Spark documentation. Thunder supports Spark version 1.5+ (currently tested against 2.0.0), and uses the Python API PySpark. If you have Spark installed, you can install Thunder just by calling pip install thunder-python on both the master node and all worker nodes of your cluster. Alternatively, you can clone this GitHub repository, and make sure it is on the PYTHONPATH of both the master and worker nodes.

Once you have a running cluster with a valid SparkContext — this is created automatically as the variable sc if you call the pyspark executable — you can pass it as the engine to any of Thunder's loading methods, and this will load your data in distributed 'spark' mode. In this mode, all operations will be parallelized, and chained operations will be lazily executed.
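
For example, a minimal sketch from inside a pyspark session, where sc already exists (the path is hypothetical):

import thunder as td

# pass the SparkContext as the engine to load in distributed 'spark' mode
data = td.images.fromtif('/path/to/tifs', engine=sc)

# chained operations are lazy and parallelized; toarray() triggers computation
result = data.median_filter(3).toseries().detrend().toarray()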

contributing

Thunder is a community effort! The codebase so far is due to the excellent work of the following individuals:

Andrew Osheroff, Ben Poole, Chris Stock, Davis Bennett, Jascha Swisher, Jason Wittenbach, Jeremy Freeman, Josh Rosen, Kunal Lillaney, Logan Grosenick, Matt Conlen, Michael Broxton, Noah Young, Ognen Duzlevski, Richard Hofer, Owen Kahn, Ted Fujimoto, Tom Sainsbury, Uri Laseron, W J Liddy

If you run into a problem, have a feature request, or want to contribute, submit an issue or a pull request, or come talk to us in the chatroom!

Comments
  • Serializable

    Note: This pull request came out of a face-to-face discussion between @freeman-lab , @poolio , @logang, and @broxtronix.

    This pull request introduces a new @serializable decorator that can decorate any class to make it easy to store that class in a human-readable JSON format and then recall it and recover the original object instance. Class instances that are wrapped in this decorator gain a serialize() method, and the class also gains a deserialize() static method that can automatically "pickle" and "unpickle" a wide variety of objects like so:

    import datetime

    @serializable
    class Visitor():
        def __init__(self, ip_addr=None, agent=None, referrer=None):
            self.ip = ip_addr
            self.ua = agent
            self.referrer = referrer
            self.time = datetime.datetime.now()

    orig_visitor = Visitor('192.168', 'UA-1', 'http://www.google.com')

    # serialize the object
    pickled_visitor = orig_visitor.serialize()

    # restore object
    recov_visitor = Visitor.deserialize(pickled_visitor)
    

    Note that this decorator is NOT designed to provide generalized pickling capabilities. Rather, it is designed to make it very easy to convert small classes containing model properties to a human and machine parsable format for later analysis or visualization. A few classes under consideration for such decorating include the Transformation class for image alignment and the Source classes for source extraction.

    A key feature of the @serializable decorator is that it can "pickle" data types that are not normally supported by Python's stock JSON dump() and load() methods. Supported datatypes include: list, set, tuple, namedtuple, OrderedDict, datetime objects, numpy ndarrays, and dicts with non-string (but still data) keys. Serialization is performed recursively, and descends into the standard python container types (list, dict, tuple, set).

    opened by broxtronix 20
  • Error running ICA on a local machine

    Hi all,

    I am posting an error log that I am getting when trying to run ICA on a recording of Ca2+ traces. There are about 50 cells in the field of view. So I set the number of ICs to 75, with 150 PCs.

    The images at each time point are stored as .tif files. I loaded them in as a series and then normalized them using:

    normdata = data.toTimeSeries().normalize(baseline='mean') #Normalize data by the global mean. (data-mean)/mean

    normdata = data.toTimeSeries()

    normdata.cache()

    Thanks a lot for your help! And also, thanks a lot for Thunder :)


    Py4JJavaError                             Traceback (most recent call last)
    <ipython-input> in <module>()
          3 start_time = time.time()
          4 from thunder import ICA
    ----> 5 modelICA = ICA(k=150,c=75).fit(normdata) # Run ICA on normalized data. k=#of principal components, c=#of ICs
          6 sns.set_style('darkgrid')
          7 plt.plot(modelICA.a);

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/factorization/ica.pyc in fit(self, data)
         95
         96     # reduce dimensionality
    ---> 97     svd = SVD(k=self.k, method=self.svdMethod).calc(data)
         98
         99     # whiten data

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/factorization/svd.pyc in calc(self, mat)
        137
        138     # compute (xx')^-1 through a map reduce
    --> 139     xx = mat.times(cInv).gramian()
        140     xxInv = inv(xx)
        141

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/matrices.pyc in times(self, other)
        191     newindex = arange(0, new_d)
        192     return self._constructor(self.rdd.mapValues(lambda x: dot(x, other_b.value)),
    --> 193         nrows=self._nrows, ncols=new_d, index=newindex).finalize(self)
        194
        195     def elementwise(self, other, op):

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/matrices.pyc in __init__(self, rdd, index, dims, dtype, nrows, ncols, nrecords)
         52     elif ncols is not None:
         53         index = arange(ncols)
    ---> 54     super(RowMatrix, self).__init__(rdd, nrecords=nrecs, dtype=dtype, dims=dims, index=index)
         55
         56     @property

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/series.pyc in __init__(self, rdd, nrecords, dtype, index, dims)
         48     self._index = None
         49     if index is not None:
    ---> 50         self.index = index
         51     if dims and not isinstance(dims, Dimensions):
         52         try:

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/series.pyc in index(self, value)
         65     def index(self, value):
         66         # touches self.index to trigger automatic calculation from first record if self.index is not set
    ---> 67         lenSelf = len(self.index)
         68         if type(value) is str:
         69             value = [value]

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/series.pyc in index(self)
         59     def index(self):
         60         if self._index is None:
    ---> 61             self.populateParamsFromFirstRecord()
         62         return self._index
         63

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/series.pyc in populateParamsFromFirstRecord(self)
        103     Returns the result of calling self.rdd.first().
        104     """
    --> 105     record = super(Series, self).populateParamsFromFirstRecord()
        106     if self._index is None:
        107         val = record[1]

    /home/stuberlab/anaconda/lib/python2.7/site-packages/thunder/rdds/data.pyc in populateParamsFromFirstRecord(self)
         76     from numpy import asarray
         77
    ---> 78     record = self.rdd.first()
         79     self._dtype = str(asarray(record[1]).dtype)
         80     return record

    /home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/rdd.pyc in first(self)
       1165     2
       1166     """
    -> 1167     return self.take(1)[0]
       1168
       1169     def saveAsNewAPIHadoopDataset(self, conf, keyConverter=None, valueConverter=None):

    /home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/rdd.pyc in take(self, num)
       1151     p = range(
       1152         partsScanned, min(partsScanned + numPartsToTry, totalParts))
    -> 1153     res = self.context.runJob(self, takeUpToNumLeft, p, True)
       1154
       1155     items += res

    /home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/context.pyc in runJob(self, rdd, partitionFunc, partitions, allowLocal)
        768     # SparkContext#runJob.
        769     mappedRDD = rdd.mapPartitions(partitionFunc)
    --> 770     it = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, javaPartitions, allowLocal)
        771     return list(mappedRDD._collect_iterator_through_file(it))
        772

    /home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
        536     answer = self.gateway_client.send_command(command)
        537     return_value = get_return_value(answer, self.gateway_client,
    --> 538         self.target_id, self.name)
        539
        540     for temp_arg in temp_args:

    /home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
        298     raise Py4JJavaError(
        299         'An error occurred while calling {0}{1}{2}.\n'.
    --> 300         format(target_id, '.', name), value)
        301     else:
        302         raise Py4JError(

    Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 17.0 failed 1 times, most recent failure: Lost task 0.0 in stage 17.0 (TID 12005, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
      File "/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/worker.py", line 75, in main
        command = pickleSer._read_with_length(infile)
      File "/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 146, in _read_with_length
        length = read_int(stream)
      File "/home/stuberlab/Downloads/spark-1.1.0-bin-hadoop1/python/pyspark/serializers.py", line 464, in read_int
        raise EOFError
    EOFError

        org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:124)
        org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:154)
        org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:87)
        org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
        org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
        org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply$mcV$sp(PythonRDD.scala:209)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.api.python.PythonRDD$WriterThread$$anonfun$run$1.apply(PythonRDD.scala:184)
        org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
        org.apache.spark.api.python.PythonRDD$WriterThread.run(PythonRDD.scala:183)

    Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

    opened by vjlbym 18
  • Refactor (WIP)

    This is a huge refactoring of Thunder, and will be the basis of an upcoming new release. We'd normally break it up into multiple PRs, but this touches so much of the code base that it was easier to do all at once.

    There are three primary goals, based on a year of community experience and feedback, and consideration of the current ecosystem:

    1. Loosen the dependency on Spark. This is a big one. Many superficial issues, including installation issues, complexity for new users and contributors, etc., are due to Thunder's hard dependence on Spark. While we will definitely continue to support Spark, we also want to enable working seamlessly across local and distributed environments, and against a variety of execution engines, including Spark but also new libraries like Dask. This PR begins that effort through some fundamental but necessary refactoring.
    2. Modularize the components. Thunder has started absorbing a wide variety of algorithms / analyses, especially with recent additions to image registration and spatiotemporal source extraction. These components are at different levels of maturity and specificity, and are better off as pluggable, composable pieces living in separate repos.
    3. Modernize the codebase, and make it more friendly to the Python ecosystem, in particular by ensuring Python 3 compatibility, using py.test for unit tests, and adopting Pythonic naming conventions.

    refactoring

    • [x] develop global context manager for backend
    • [x] refactor data reading / writing
    • [x] update reading / writing tests
    • [x] remove executables
    • [x] remove standalone scripts
    • [x] use S3 for external data
    • [x] use py.test for unit tests
    • [ ] update documentation
    • [ ] make python 3 compatible
    • [x] use snakecase

    new packages (inside thunder-project)

    • [ ] rime - source extraction
    • [ ] sleet - image registration
    • [ ] thundercloud - manage cluster on ec2

    new packages (external)

    • [x] station - context manager for distributed backends
    • [x] checkist - minimal argument checking
    • [x] showit - simple display of images and tiled images
    • [ ] serdeme - custom class serialization/deserialization
    opened by freeman-lab 14
  • Thunder integration with OCP

    Hey Jeremy,

    I have merged the latest branch of thunder and documented my function. In addition, the tests are also fixed. They will not fail if OCP is down.

    opened by kunallillaney 10
  • adding support for writing multipage tiffs

    The current totif method only supports writing 2D arrays, or 3D arrays where the third channel is color. It uses

    from scipy.misc import imsave
    

    Instead, it could use

    from skimage.io import imsave
    

    which supports writing 3D arrays to tiffs using tifffile.

    Unfortunately, this support is only for writing directly to files, not to file objects / byte streams, so I was unable to swap it in for the current imsave directly.

    There is however a modified version of tifffile here that supports writing to file objects

    Using this version could allow for writing multipage tiffs

    opened by sofroniewn 9
  • fix incorrect propagation of dtype in Series normalize and other methods

    This PR addresses a bug in Series.normalize() and other methods, where the dtype attribute of the output was being set incorrectly to the dtype of the input RDD.

    After this patch, the default behavior for apply() and most other methods that can potentially produce output with a dtype different from the input will be to leave this attribute unset, to be lazily determined as needed by making an implicit call to first() when the dtype attribute is requested.

    For normalize() in particular, the output dtype will now be the smallest floating-point type that can safely store the data without over/underflow, as determined by commons.smallest_float_type(). This will be properly set on the output Series, so that no implicit first() call is needed.
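
    To illustrate the idea (this sketch uses numpy's generic type promotion, not Thunder's actual implementation), the "smallest safe float" reasoning looks like:

    import numpy as np

    # uint8 values fit exactly in float16...
    print(np.promote_types(np.uint8, np.float16))   # float16
    # ...but int32 values need float64 to be stored without loss
    print(np.promote_types(np.int32, np.float16))   # float64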

    @freeman-lab, if you're okay with this PR, I can take care of merging it into master from the 0.4.x branch.

    opened by industrial-sloth 9
  • Negative Value Errors for Images

    When using the images.minus() function, sometimes the values of some pixels may become negative.

    To correct for this, I would like to shift the whole image by a scalar value (The minimum of the difference between the images). However, after doing the minus call, anytime I try to access the new image object, I get this error:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "build/bdist.linux-x86_64/egg/thunder/images/images.py", line 191, in map
      File "build/bdist.linux-x86_64/egg/thunder/base.py", line 460, in _map
      File "build/bdist.linux-x86_64/egg/bolt/spark/array.py", line 141, in map
      File "build/bdist.linux-x86_64/egg/bolt/spark/array.py", line 94, in _align
    TypeError: unsupported operand type(s) for -: 'int' and 'NoneType'

    Thus, I'm not able to calculate the minimum value across the images to then adjust.

    A current workaround is to convert the Images object to an RDD, calculate the minimum and adjust the values at the RDD level, then use td.images.fromrdd() to get back to an Images object.

    opened by kr-hansen 8
  • 1.0.0 labels

    Implementation

    This PR implements labels, a new feature on the Series object that allows the user to keep track of the identity of the individual series that make up the Series object even through operations such as Series.filter and indexing (Series[...]). In analogy to how Series.index allows the user to keep tabs on the final dimension of the Series object, Series.labels allows the user to track the identities of the "base axes" (the non-final axes which, in spark mode, are distributed).

    Assume we have a Series object named series with shape (x, y, z, t) or (n, t). We can attach a set of labels to these series with:

    series.labels = labels
    

    where labels is an array-like object of size (x, y, z) or (n) respectively.

    In regards to how they affect the labels, operations on Series fall into three categories:

    1. Operations that are effectively a map do not change the structure of the non-final dimensions, so the labels are unaffected -- e.g. Series.map, Series.zscore, Series.between.
    2. Operations that are effectively a reduce combine all the individual series in the Series object, so the identities of the individual series are lost and the labels are dropped -- e.g. Series.reduce, Series.mean.
    3. Operations that are effectively a filter drop some of the series. This is where labels are most useful in tracking the identities of the retained series. In these cases, the labels will be updated to reflect the new structure of the Series object -- e.g. Series.filter and Series.__getitem__ (i.e. indexing); see the sketch below.
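
    A hedged usage sketch (assuming the API described above; the data and threshold are illustrative):

    import thunder as td
    from numpy import arange

    series = td.series.fromrandom(shape=(100, 10))
    series.labels = arange(100)        # one label per record

    # filtering drops records, and the labels are updated in lockstep
    filtered = series.filter(lambda x: x.std() > 1)
    print(filtered.labels)             # labels of only the retained series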

    A note on performance in spark mode:

    In the distributed setting, determining which elements of the Series object were dropped/retained during a filter can be expensive. It effectively involves making two passes through the data: the first to determine which values will be dropped (a map) and a second to actually drop those values (a filter). When labels are set (i.e. not None), these two passes will happen in a non-lazy fashion so that the labels can be appropriately updated (NB: filter is already non-lazy in this setting).

    Indexing is similar to a filter in that records are dropped, however the specification of which records will be dropped is knowable directly from the inputs, thus updating the labels (like the indexing itself) is fast and the indexing operation remains lazy.

    opened by jwittenbach 8
  • added map_as_series

    Adds an Images.map_as_series method that uses Blocks to apply a function to each series in an Images object and then turn the data back into an Images object. This avoids needing to transform the data all the way to a Series representation, which can be quite expensive to turn back into Images due to the high level of fragmentation that can occur when the total size of the spatial dimensions greatly outnumbers the size of the temporal dimension.
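
    A hedged usage sketch (assuming the method behaves as described above):

    import thunder as td

    data = td.images.fromrandom(shape=(20, 50, 50))

    # subtract each pixel's temporal mean without a full Series round-trip
    centered = data.map_as_series(lambda ts: ts - ts.mean())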

    opened by jwittenbach 8
  • JSON serializable registration model

    This PR modifies the existing JSON serialization code quite heavily, with the end goal of having this be usable to serialize RegistrationModel objects from the imgprocessing image registration code.

    This gets around a couple issues with the previous serialization code:

    • RegistrationModels have nested within them Displacement objects. However the previous serialization code didn't handle custom classes. The current code can handle nested custom classes, so long as those nested classes are themselves serializable.
    • The previous decorator-based code produced objects that were not pickleable, since their type (ThunderSerializableObjectWrapper) was defined inside a function rather than at the top level of a module, and thus pickle could not dynamically instantiate them. RegistrationModels need to be pickleable, since they are broadcast by pyspark, which uses pickle to do so. This PR moves the serialization logic into an abstract base class rather than a decorator, so serializable classes must now extend ThunderSerializable (can be multiple inheritance) rather than being wrapped by the @serializable decorator.

    At present this is still a little messy. I'm opening this PR right now for visibility and comment, but I don't yet consider it ready to be merged in.

    opened by industrial-sloth 8
  • support multiple time points per image file

    This PR adds an nplanes option to the main Images-loading methods. If nplanes is specified, then a single input file will be interpreted as containing multiple image volumes, each with size nplanes in its final dimension. For instance, a single binary stack file loaded with arguments dims=(x, y, 8), nplanes=2 would turn into 4 separate records in an Images RDD, each with size (x, y, 2). In general, images that are loaded with z planes and a positive nplanes argument will result in z / nplanes time points, each with nplanes planes.
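
    As a worked example of that arithmetic (the path is hypothetical and the keywords follow the PR description):

    # one binary stack file with 8 z-planes, loaded with nplanes=2,
    # becomes 8 / 2 = 4 records, each of shape (x, y, 2)
    data = tsc.loadImages('/path/to/stacks', dims=(512, 512, 8),
                          inputFormat='stack', nplanes=2)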

    opened by industrial-sloth 8
  • will it work for multivariate time series prediction both regression and classification

    Great code, thanks! Could you clarify: will it work for multivariate time series prediction, both regression and classification?

    1. Where all values are continuous values:

          weight  height  age  target
       1  56      160     34   1.2
       2  77      170     54   3.5
       3  87      167     43   0.7
       4  55      198     72   0.5
       5  88      176     32   2.3

    2. Or will it even work for multivariate time series where the values are a mixture of continuous and categorical values, for example where 2 dimensions have continuous values and 3 dimensions are categorical values?

          color   weight  gender  height  age  target
       1  black   56      m       160     34   yes
       2  white   77      f       170     54   no
       3  yellow  87      m       167     43   yes
       4  white   55      m       198     72   no
       5  white   88      f       176     32   yes

    opened by Sandy4321 0
  • Question: Liveness of this project

    The last commits were two years ago - which would generally be a conclusive signal that a project is abandonware. However, there are over 2K commits and, iirc, 22 contributors - so I'll venture asking whether there are still plans to keep this project - with so much effort placed in it - afloat?

    opened by javadba 0
  • support google tagmanager

    It would be nice if Google Tag Manager was supported as well. We use this on every site. It deprecates Google Analytics for us and allows Hotjar or other implementations without changes in Drupal. You can just add these things in Tag Manager. https://www.drupal.org/project/google_tag

    opened by woutersf 0
  • Installing Thunder in Windows 7

    Hi, folks, I tried to install Thunder in Anaconda on Windows 7, using pip install thunder-python. It asked to install Visual C++ for Python, which I did. Still, installation fails with the following errors:

    ...
    writing dependency_links to tifffile.egg-info\dependency_links.txt
    warning: manifest_maker: standard file '-c' not found

    reading manifest file 'tifffile.egg-info\SOURCES.txt'
    reading manifest template 'MANIFEST.in'
    warning: no files found matching '*.c'
    warning: no previously-included files matching '__pycache__' found under directory '*'
    warning: no previously-included files matching '*.py[co]' found under directory '*'
    writing manifest file 'tifffile.egg-info\SOURCES.txt'
    copying tifffile\_tifffile.c -> build\lib.win-amd64-2.7\tifffile
    running build_ext
    building 'tifffile._tifffile' extension
    creating build\temp.win-amd64-2.7
    creating build\temp.win-amd64-2.7\Release
    creating build\temp.win-amd64-2.7\Release\tifffile
    C:\Users\username\AppData\Local\Programs\Common\Microsoft\Visual C++ for Python\9.0\VC\Bin\amd64\cl.exe /c /nologo /Ox /MD /W3 /GS- /DNDEBUG -Ic:\users\username\appdata\local\continuum\anaconda2\lib\site-packages\numpy\core\include -Ic:\users\username\appdata\local\continuum\anaconda2\include -Ic:\users\username\appdata\local\continuum\anaconda2\PC /Tctifffile/_tifffile.c /Fobuild\temp.win-amd64-2.7\Release\tifffile/_tifffile.obj
    _tifffile.c
    tifffile/_tifffile.c(75) : fatal error C1083: Cannot open include file: 'stdint.h': No such file or directory
    error: command 'C:\\Users\\username\\AppData\\Local\\Programs\\Common\\Microsoft\\Visual C++ for Python\\9.0\\VC\\Bin\\amd64\\cl.exe' failed with exit status 2

    ----------------------------------------
    Command "c:\users\username\appdata\local\continuum\anaconda2\python.exe -u -c "import setuptools, tokenize;__file__='c:\\users\\username\\appdata\\local\\temp\\pip-build-tcw9mf\\tifffile\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record c:\users\username\appdata\local\temp\pip-bn1cya-record\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in c:\users\username\appdata\local\temp\pip-build-tcw9mf\tifffile\

    Is there any fix or workaround for it? Thanks!

    opened by nvladimus 1
  • updates references to Bolt to align with restructuring

    There is a pending PR in Bolt that makes some minor API changes related to removing BoltArrayLocal and renaming BoltArraySpark to simply BoltArray. This PR makes a few small updates to Thunder to take these changes into account.

    Tests will not pass until new version of Bolt is released on PyPI.

    opened by jwittenbach 0
Releases
  • v0.5.1(Jul 28, 2015)

    This is a maintenance release of Thunder.

    The main focus is fixing a variety of deployment and installation related issues, and adding initial support for the recently released Spark 1.4. Thunder has not been extensively used alongside Spark 1.4, but with this release all core functionality has been verified.

    Changes and Bug Fixes


    • Fix launching error when starting Thunder with Spark 1.4 (addresses #201)
    • Fix EC2 deployment with Spark 1.4
    • More informative errors for handling import errors on startup
    • Remove pylab when starting notebooks on EC2
    • Improved dependency handling on EC2
    • Updated documentation for factorization methods

    Contributions


    • Davis Bennet (@d-v-b): doc improvements
    • Andrew Giessel (@andrewgiessel): EC2 deployment
    • Jeremy Freeman (@freeman-lab): various bug fixes

    If you have any questions, come chat with us, and stay tuned for Thunder 0.6.0 in the near future.

  • v0.5.0(Apr 2, 2015)

    We are pleased to announce the release of Thunder 0.5.0. This release introduces several new features, including a new framework for image registration algorithms, performance improvements for core data conversions, improved EC2 deployment, and many bug fixes. This release requires Spark 1.1.0 or later, and is compatible with the most recent Spark release, 1.3.0.

    Major features


    • A new image registration API inside the new thunder.imgprocessing package. See the tutorial.
    • Significant performance improvements to the Images to Series conversion, including a Blocks object as an intermediate stage. The inverse conversion, from Series back to Images, is now supported.
    • Support for tiff image files as an input format has been expanded and made more robust. Multiple image volumes can now be read from a single input file via the nplanes argument in the loading functions, and files can be read from nested directory trees using the recursive=True flag.
    • New methods for working with multi-level indexing on Series objects, including selectByIndex and seriesStatByIndex, see the tutorial.
    • Convenient new getter methods for extracting individual records or small sets of records using bracket notation, as in Series[(x,y,z)] or Images[k].
    • A new serializable decorator to make it easy to save/load small objects (e.g. models) to JSON, including handling of numpy arrays. See saving/loading of RegistrationModel for an example.

    Minor features


    • Parameter files can be loaded from a file with simple JSON schema (useful for working with covariates), using ThunderContext.loadParams
    • A new method ThunderContext.setAWSCredentials handles AWS credential settings in managed cluster environments (where it may not be possible to modify system config files)
    • An Images object can be saved to a collection of binary files using Images.saveAsBinaryImages
    • Data objects now have a consistent __repr__ method, displaying uniform and informative results when these objects are printed.
    • Images and Series objects now each offer a meanByRegions() method, which calculates a mean over one or more regions specified either by a set of indices or a mask image.
    • TimeSeries has a new convolve() method.
    • The thunder and thunder-submit executables have been modified to better expose the options available in the underlying pyspark and spark-submit Spark executable scripts.
    • An improved and streamlined Colorize with new colorization options.
    • Load data hosted by the Open Connectome Project with the loadImagesOCP method.
    • New example data sets available, both for local testing and on S3
    • New tutorials: regression, image registration, multi-level indexing

    Transition guide


    • Some keyword parameters have been changed for consistency with the Thunder style guide naming conventions. Examples are the inputformat, startidx, and stopidx parameters on the ThunderContext loading methods, which are now inputFormat, startIdx, and stopIdx, respectively. We expect minimal future changes in existing method and parameter names.
    • The Series methods normalize() and detrend() have been moved to TimeSeries objects, which can be created by the Series.toTimeSeries() method.
    • The default file extension for the binary stack format is now bin instead of stack. If you need to load files with the stack extension, you can use the ext='stack' keyword argument of loadImages.
    • export is now a method on the ThunderContext instead of a standalone function, and now supports exporting to S3.
    • The loadImagesAsSeries and convertImagesToSeries methods on ThunderContext now default to shuffle=True, making use of a revised execution path that should improve performance.
    • The method for loading example data has been renamed from loadExampleEC2 to loadExampleS3

    Deployment and development


    • Anaconda is now the default Python installation on EC2 deployments, as well as on our Travis server for testing.
    • EC2 scripts and unit tests provide quieter and prettier status outputs.
    • Egg files now included with official releases, so that a pip install of thunder-python can immediately be deployed on a cluster without cloning the repo and building an egg.

    Contributions:


    • Andrew Osheroff (data getter improvements)
    • Ben Poole (optimized window normalization, image registration)
    • Jascha Swisher (images to series conversion, serializable class, tif handling, get and meanBy methods, bug fixes)
    • Jason Wittenbach (new series indexing functionality, regression and indexing tutorials, bug fixes)
    • Jeremy Freeman (image registration, EC2 deployment, exporting, colorizing, bug fixes)
    • Kunal Lillaney (loading from OCP)
    • Michael Broxton (serializable class, new series statistics, improved EC2 deployment)
    • Noah Young (improved EC2 deployment)
    • Tom Sainsbury (image filtering, PNG saving options)
    • Uri Laseron (submit scripts, Hadoop versioning)

    Roadmap


    Moving forward we will do a code freeze and cut a release every three months. The next will be June 30th.

    For 0.6.0 we will focus on the following components:

    • A source extraction / segmentation API
    • New capabilities for regression and GLM model fitting
    • New image registration algorithms (including volumetric methods)
    • Latent factor and network models
    • Improved performance on single-core workflows
    • Bug fixes and performance improvements throughout

    If you are interested in contributing, let us know! Check out the existing issues or join us in the chatroom.

  • v0.4.1(Nov 4, 2014)

    We are happy to announce the 0.4.1 release of Thunder. This is a maintenance / bug fix release.

    The focus is ensuring consistent array indexing across all supported input types and internal data formats. For 3D image volumes, the z-plane will now be on the third array axis (e.g. ary[:,:,2]), and will be in the same position for Series indices and the dims attribute on Images and Series objects. Visualizing image data with matplotlib's imshow() function will yield an image in the expected orientation, both for Images objects and for the arrays returned by a Series.pack() call. Other changes are described below.
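
    For example, a minimal illustration of the convention, using a random stand-in array:

    import numpy as np
    import matplotlib.pyplot as plt

    ary = np.random.rand(64, 64, 8)   # stand-in for a Series.pack() result
    plt.imshow(ary[:, :, 2])          # the z-plane sits on the third axis
    plt.show()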

    Changes and Bug Fixes


    • Handling of wildcards in path strings for the local filesystem and S3 is improved.
    • New Data.astype method for converting numerical type of values.
    • A dtype parameter has been added to the ThunderContext.load* methods.
    • Several exceptions thrown by uncommon edge cases in tif handling code have been resolved.
    • The Series.pack() method no longer automatically casts returned data to float16. This can instead be performed ahead of time using the new astype methods.
    • tsc.convertImagesToSeries() did not previously write output files with tif file input when shuffle=True.
    • A ValueError thrown by the random sampling methods with numpy 1.9 has been resolved (issue #41).
    • The thunder-ec2 script will now generate a ~/.boto configuration file containing AWS access keys on all nodes, allowing workers to access S3 with no additional configuration.
    • Test example data files are now copied out to all nodes in a cluster as part of the thunder-ec2 script.
    • Now compatible with boto 2.8.0 and later versions, for EC2 deployments (issue #40).
    • Fixed a dimension bug when colorizing 2D images with the indexed conversion type.
    • Fixed an issue with optimization approach being misspecified in colorization.

    Thanks


    • Joseph Naegele: reporting path and data type bugs
    • Allan Wong: reporting random sampling bug
    • Sung Soo Kim: reporting colorization optimization issue
    • Thomas Sainsbury: reporting indexed colorization bug

    Contributions


    • Jascha Swisher (@industrial-sloth): unified indexing schemes, bug fixes
    • Jeremy Freeman (@freeman-lab): bug fixes

    Thanks very much for your interest in Thunder. Questions and comments can be sent to the mailing list.

  • v0.4.0(Oct 16, 2014)

    We are pleased to announce the release of Thunder 0.4.0.

    This release introduces some major API changes, especially around loading and converting data types. It also brings some substantial updates to the documentation and tutorials, and better support for data sets stored on Amazon S3. While some big changes have been made, we feel that this new architecture provides a more solid foundation for the project, better supporting existing use cases, and encouraging contributions. Please read on for more!

    Major Changes

    • Data representation. Most data in Thunder now exists as subclasses of the new thunder.rdds.Data object. This wraps a PySpark RDD and provides several general convenience methods. Users will typically interact with two main subclasses of data, thunder.rdds.Images and thunder.rdds.Series, representing spatially- and temporally-oriented data sets, respectively. A common workflow will be to load image data into an Images object and then convert it to a Series object for further analysis, or just to convert Images directly to Series data.
    • Loading data. The main entry point for most users remains the thunder.utils.context.ThunderContext object, available in the interactive shell as tsc, but this class has many new, expanded, or renamed methods, in particular loadImages(), loadSeries(), loadImagesAsSeries(), and convertImagesToSeries(). Please see the Thunder Context tutorial and the API documentation for more examples and detail.
    • New methods for manipulating and processing images and series data, including refactored versions of some earlier analyses (e.g. routines from the package previously known as timeseries).
    • Documentation has been expanded, and new tutorials have been added.
    • Core API components are now exposed at the top level for simpler importing, e.g. from thunder import Series or from thunder import ICA.
    • Improved support for loading image data directly from Amazon S3, using the boto AWS client library. The load* methods in ThunderContext now all support s3n:// schema URIs as data path specifiers.

    Notes about requirements and environments

    • Spark 1.1.0 is required. Most functionality will be intact with earlier versions of Spark, with the exception of loading flat binary data.
    • “Hadoop 1” jars as packaged with Spark are recommended, but Thunder should work fine if recompiled against the CDH4, CDH5, or “Hadoop 2” builds.
    • Python 2 required, version 2.6 or greater.
    • PIL/pillow libraries are used to handle tif images. We have encountered some issues working with these libraries, particularly on OSX 10.9. Some errors related to image loading may be traceable to a broken PIL/pillow installation.
    • This release has been tested most extensively in three environments: local usage, a private research compute cluster, and Amazon EC2 clusters stood up using the thunder-ec2 script packaged with the distribution.

    Future Directions

    Thunder is still young, and will continue to grow. Now is a great time to get involved! While we will try to minimize changes to the API, it should not yet be considered stable, and may change in upcoming releases. That said, if you are using or contemplating using Thunder in a production environment, please reach out and let us know what you’re working on, or post to the mailing list.

    Contributors

    • Jascha Swisher (@industrial-sloth): loading functionality, data types, AWS compatibility, API
    • Jeremy Freeman (@freeman-lab): API, data types, analyses, general performance and stability

  • v0.3.2(Sep 11, 2014)

    This release includes bug fixes and other minor improvements.

    Bug fixes

    • Removed pillow dependency, to prevent a bug that appears to occur frequently in Mac OS 10.9 installations (87280ec)
    • Customized EC2 installation and configuration, to avoid using Anaconda AMI, which was failing to properly configure mounted drives (fixes #21)

    Improvements

    • Handle either zero- or one-based indexing in keys (#20)
    • Support requester pays bucket setting for example data (fixes #21)
  • v0.3.1(Sep 4, 2014)

    Maintenance release with bug fixes and minor improvements.

    Bug fixes

    • Fixed error specifying path to shell.py in pip installations
    • Fixed a broken import that prevented use of Colorize

    Improvements

    • Query returns average keys as well as average values
    • Loading example data from EC2 supports "requester pays" mode
    • Fixed documentation typos (#19)
  • v0.3.0(Aug 23, 2014)

    This update adds new functionality for loading data, alongside changes to the API for loading, and a variety of smaller bug fixes.

    API changes

    • All data loading is performed through the new Thunder Context, a thin wrapper for a Spark Context. This context is automatically created when starting thunder, and has methods for loading data from different input sources.
    • tsc.loadText behaves identically to the load from previous versions.
    • Example data sets can now be loaded from tsc.makeExample, tsc.loadExample, and tsc.loadExampleEC2.
    • Output of the pack operation now preserves xy definition, but outputs will be transposed relative to previous versions.

    New features

    • Include design matrix with example data set on EC2
    • Faster nmf implementation by changing update equation order (#15)
    • Support for loading local MAT files into RDDs through tsc.loadMatLocal
    • Preliminary support for loading binary files from HDFS using tsc.loadBinary (depends on features currently only available in Spark's master branch)

    Bug fixes

    • Used pillow instead of PIL (#11)
    • Fixed important typo in documentation page (#18)
    • Fixed sorting bug in local correlations
  • v0.2.0(Aug 4, 2014)

    This is a significant update with changes and enhancements to the API, new analyses, and bug fixes.

    Major changes

    • Updated for compatibility with Spark 1.0.0, which brings with it a number of significant performance improvements
    • Reorganization of the API such that all analyses are accessed through their respective classes and methods (e.g. ICA.fit, Stats.calc). Standalone functions use the same classes, and act as wrappers solely for non-interactive job submission (e.g. thunder-submit factorization/ica <opts>)
    • Executables included with the release for easily launching a PySpark shell, or an EC2 cluster, with Thunder dependencies and set-up handled automatically
    • Improved and expanded documentation, built with Sphinx
    • Basic functionality for colorization of results, useful for visualization, see example
    • Registered project in PyPi

    New analyses and features

    • A DataSet class for easily loading simulated and real data examples
    • A decoding package and MassUnivariateClassifier class, currently supporting two mass univariate classification analyses (GaussNaiveBayes and TTest)
    • An NMF class for dense non-negative matrix factorization, a useful analysis for spatio-temporal decompositions

    Bug fixes and other changes

    • Renamed sigprocessing library to timeseries
    • Replace eig with eigh for symmetric matrix
    • Use set and broadcasting to speed up filtering for subsets in Query
    • Several optimizations and bug fixes in basic saving functionality, including new pack function
    • Fixed handling of integer indices in subtoind
  • v0.1.0(Jan 8, 2014)

    First development release, highlighting four newly refactored analysis packages (clustering, factorization, regression, and sigprocessing) and more extensive testing and documentation.

    Release notes:

    General

    • Preprocessing is an optional argument for all analysis scripts
    • Tests for accuracy for all analyses

    Clustering

    • Max iterations and tolerance are optional arguments for kmeans

    Factorization

    • Unified singular value decomposition into one function with a method option ("direct" or "em")
    • Made max iterations and tolerance optional arguments to ICA
    • Added a random seed argument to ICA to facilitate testing

    Regression

    • All functions use derivatives of a single RegressionModel or TuningModel class
    • Allow input to RegressionModel classes to be arrays or tuples for increased flexibility
    • Made regression-related arguments to tuning optional

    Signal processing

    • All functions use derivatives of a single SigProcessMethod class
    • Added a crosscorr function

    Thanks to many contributions from @JoshRosen!
