⚡️Optimizing einsum functions in NumPy, Tensorflow, Dask, and more with contraction order optimization.

Overview

Optimized Einsum


Optimized Einsum: A tensor contraction order optimizer

Optimized einsum can significantly reduce the overall execution time of einsum-like expressions (e.g., np.einsum, dask.array.einsum, torch.einsum, tensorflow.einsum) by optimizing the expression's contraction order and dispatching many operations to canonical BLAS, cuBLAS, or other specialized routines.

Optimized einsum is agnostic to the backend and can handle NumPy, Dask, PyTorch, Tensorflow, CuPy, Sparse, Theano, JAX, and Autograd arrays as well as potentially any library which conforms to a standard API. See the documentation for more information.

Example usage

The opt_einsum.contract function can often act as a drop-in replacement for einsum functions without further changes to the code while providing superior performance. Here, a tensor contraction is performed with and without optimization:

import numpy as np
from opt_einsum import contract

N = 10
C = np.random.rand(N, N)
I = np.random.rand(N, N, N, N)

%timeit np.einsum('pi,qj,ijkl,rk,sl->pqrs', C, C, I, C, C)
1 loops, best of 3: 934 ms per loop

%timeit contract('pi,qj,ijkl,rk,sl->pqrs', C, C, I, C, C)
1000 loops, best of 3: 324 us per loop

In this particular example, we see a ~3000x performance improvement which is not uncommon when compared against unoptimized contractions. See the backend examples for more information on using other backends.
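
As a rough sketch of dispatching the same contraction to a different backend (assuming PyTorch is installed; the compiled expression converts the NumPy inputs for the chosen backend on the fly):

from opt_einsum import contract_expression

# Build the expression once from shapes, then evaluate it on another backend.
expr = contract_expression('pi,qj,ijkl,rk,sl->pqrs', C.shape, C.shape, I.shape, C.shape, C.shape)
result = expr(C, C, I, C, C, backend='torch')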

Features

The algorithms found in this repository often power the einsum optimizations in many of the above projects. For example, the optimization of np.einsum has been passed upstream, and most of the same features found in this repository can be enabled with np.einsum(..., optimize=True). However, this repository often has more up-to-date algorithms for complex contractions.
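
For reference, the upstream NumPy hook can be enabled as follows (plain NumPy, no opt_einsum import required):

import numpy as np

a = np.random.rand(10, 10)
b = np.random.rand(10, 10, 10)
# optimize=True asks np.einsum itself to choose a contraction order.
np.einsum('ij,jkl,lm->ikm', a, b, a, optimize=True)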

Please see the documentation for the full list of capabilities enabled by opt_einsum!

Installation

opt_einsum can be installed either via pip (pip install opt_einsum) or from conda-forge (conda install opt_einsum -c conda-forge). See the installation documentation for further methods.

Citation

If this code has benefited your research, please support us by citing:

Daniel G. A. Smith and Johnnie Gray, opt_einsum - A Python package for optimizing contraction order for einsum-like expressions. Journal of Open Source Software, 2018, 3(26), 753

DOI: https://doi.org/10.21105/joss.00753

Contributing

All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.

A detailed overview on how to contribute can be found in the contributing guide.

Comments
  • added dynamic programming path

    added dynamic programming path

    Description

    In the field of tensor network states (matrix product states, projected entangled pair states, ...) the most common approach to optimise tensor network contractions is based on dynamic programming [Phys. Rev. E 90, 033315 (2014)], which usually provides an optimal contraction path within a reasonable amount of time (even for a large number of tensors). This pull request provides such an implementation.
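
    A minimal usage sketch, assuming the optimizer is exposed through the optimize='dp' keyword (the dp path referenced in later release notes):

    import numpy as np
    import opt_einsum as oe

    eq, shapes = oe.helpers.rand_equation(12, 3, seed=0)
    views = [np.random.rand(*s) for s in shapes]
    # Request the dynamic programming optimizer when searching for a path.
    path, info = oe.contract_path(eq, *views, optimize='dp')
    print(info.opt_cost)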

    Questions

    • [ ] Is there a simple way to run the benchmarks of the "Performance Comparison" section of the documentation for the dynamic programming optimiser to compare with the other approaches?
    opened by mrader1248 45
  • [WIP] recursive, depth-first optimizer

    [WIP] recursive, depth-first optimizer

    Use recursive, depth-first optimizer for optimal & possibly 'greedy2'

    I had a play around with making the opt_einsum optimal path finder a bit more efficient. The basic idea is:

    1. Use a recursive search to simplify the logic and find paths 'depth-first', i.e. follow a path all the way to its termination before searching for the next one.
    2. Enabled by this, terminate searches as soon as the path cost rises above the best found so far, sieving many known-bad paths

    The results are pretty promising, current optimal:

    >> eq, shapes = oe.helpers.rand_equation(8, 3, seed=42)
    >> views = [np.random.rand(*s) for s in shapes]
    >> %time path = oe.contract_path(eq, *views, optimize='optimal')[1]
    CPU times: user 25.3 s, sys: 716 ms, total: 26 s
    Wall time: 26 s
    >>> path.opt_cost
    130316
    

    and with this recursive strategy:

    >> %time path = oe.contract_path(eq, *views, optimize='roptimal')[1]
    CPU times: user 107 ms, sys: 324 µs, total: 107 ms
    Wall time: 108 ms
    >>> path.opt_cost
    130316
    

    The fundamental scaling is still factorial, but it should allow optimal-paths to be realistically found for expressions of up to 10 tensors. Possibly other heuristics can be used to sieve paths eagerly as well.
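
    Not the PR's actual code, but a rough sketch of the depth-first, cost-pruned search described above, using a simple product-of-dimensions proxy for the cost of each pairwise contraction:

    from math import prod

    def best_path(inputs, output, size_dict):
        """Depth-first search over pairwise contractions with cost pruning.

        inputs: list of index strings/sets, output: output indices,
        size_dict: dimension of each index. Returns (best_cost, best_path).
        """
        best = {'cost': float('inf'), 'path': None}

        def cost(indices):
            return prod(size_dict[ix] for ix in indices)

        def recurse(terms, path, so_far):
            if len(terms) == 1:
                if so_far < best['cost']:
                    best['cost'], best['path'] = so_far, path
                return
            for i in range(len(terms)):
                for j in range(i + 1, len(terms)):
                    union = terms[i] | terms[j]
                    new_cost = so_far + cost(union)
                    # Prune: abandon branches that already exceed the best complete path.
                    if new_cost >= best['cost']:
                        continue
                    keep = set(output)
                    for k, t in enumerate(terms):
                        if k not in (i, j):
                            keep |= t
                    rest = [t for k, t in enumerate(terms) if k not in (i, j)]
                    rest.append(union & keep)
                    recurse(rest, path + [(i, j)], new_cost)

        recurse([frozenset(t) for t in inputs], [], 0)
        return best['cost'], best['path']

    best_path(['ab', 'bc', 'cd'], 'ad', {'a': 2, 'b': 30, 'c': 30, 'd': 2})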

    One additional option this opens up is to recurse only into the n best possible contractions (current greedy/eager being n=1). Thus you could have a 'greedy2', 'greedy3' etc., which scale like 2**n, 3**n; while still exponential (modulo sieving), this is significantly more manageable than factorial scaling. These might use custom cost functions #64 and random shaking as well #63.

    Anyway, thought I'd share this WIP initial implementation. My suggestion might be to have this approach eventually replace optimal, with 'greedy', 'greedy2'... using it also.

    cc @fritzo --- took some inspiration from your 'eager' path if you have any input!

    TODO:

    • [x] apply some caching?
    • [ ] keep track of the 'keep' indices rather than computing them each time?
    • [x] handle memory_limit case properly? (i.e. contract all remaining tensors at once)
    enhancement 
    opened by jcmgray 34
  • Use best of many random greedy paths

    Use best of many random greedy paths

    In quantum computing simulations particularly, finding the optimal contraction sequence has a huge effect on what is computable - the amount of memory and time required is exponential in how 'bad' the path found is. The greedy approach is generally worse than the optimal approach but it is cheap to compute (and even cheaper with #60!).

    If we introduce a random greedy algorithm by tweaking the cost objective function, we get a variety of different paths with varying costs and we can just pick the best one. I had a little play with contracting a 49 qubit depth 20 circuit (661 tensors overall) and randomly adjusting each cost by +- 5%.

    The following are 'contraction complexities' (maximum rank of the tensor produced; 2^rank is the memory required, which is also a rough proxy for computation time) on the left and their frequency on the right, from 100 runs.

    [(25, 1),
     (27, 1),
     (28, 3),
     (29, 6),
     (30, 10),
     (31, 16),
     (32, 6),  # <- this is what the current greedy finds
     (33, 9),
     (34, 21),
     (35, 8),
     (36, 8),
     (37, 1),
     (38, 4),
     (39, 2),
     (40, 1),
     (42, 3)]
    

    So the best contraction found vs the current contraction found requires ~128x less memory and computation, and turns the simulation from 'super'-computable to 'laptop'-computable.
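
    This idea eventually landed as the random-greedy path (see the v3.0.0 and v3.2.0 release notes below); a minimal usage sketch, assuming the RandomGreedy optimizer with a max_repeats argument:

    import numpy as np
    import opt_einsum as oe

    eq, shapes = oe.helpers.rand_equation(40, 3, seed=1)
    views = [np.random.rand(*s) for s in shapes]
    # Sample many randomized greedy paths and keep the cheapest one found.
    optimizer = oe.RandomGreedy(max_repeats=100)
    path, info = oe.contract_path(eq, *views, optimize=optimizer)
    print(info.opt_cost, info.largest_intermediate)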

    enhancement 
    opened by jcmgray 19
  • Implement shared_intermediates context manager

    Implement shared_intermediates context manager

    Resolves #9

    Description

    This implements a shared_intermediates context manager, within which multiple calls to contract will share intermediate computations. This is very useful e.g. in message passing algorithms. The first draft of this PR is copied directly from Pyro, where we are using sharing for probabilistic inference in graphical models.

    The implementation uses a special internal backend opt_einsum.backends.shared. When a sharing context is opened

    with shared_intermediates():
        contract(..., backend='foo')
        contract(..., backend='foo')
    

    a special handle_sharing() context is activated inside the contract function, temporarily switching to the shared backend, and setting the original backend foo in a global shared._CURRENT_BACKEND variable. The sharing backend then performs all the original foo operations and also memoizes them.

    One design choice is to memoize by id() rather than by value. This makes sense from a computational perspective (equality comparison is expensive), but requires a bit more care by users.
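
    A small illustration of that consequence (a sketch, assuming shared_intermediates is importable from the top-level package): only passing the very same array objects hits the cache, while a value-equal copy does not.

    import numpy as np
    from opt_einsum import contract, shared_intermediates

    a = np.random.rand(4, 4)
    b = np.random.rand(4, 4)
    with shared_intermediates():
        contract('ab,bc,cd->ad', a, b, b)
        # Same objects again: intermediates are reused, since memoization is keyed on id().
        contract('ab,bc,cd->ad', a, b, b)
        # A value-equal copy has a different id(), so it does not hit the cache.
        contract('ab,bc,cd->ad', a, b.copy(), b.copy())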

    A second design choice is to expose the memoization cache to users. This makes it easy to share intermediates by passing the cache around, as is necessary when it is difficult to perform all computation in a context:

    with shared_intermediates() as cache:
        contract(...)
    ...pass control elsewhere...
    with shared_intermediates(cache):  # <-- reuse previous cache
        contract(...)
    del cache  # <-- intermediates will be garbage collected
    

    Todos

    • [x] Cache type conversions e.g. numpy->torch->numpy so args can be compared by id.
    • [x] fix failing tests
    • [x] add tests that pass around the cache
    • [x] add tests where we test the normal ordering:
      with shared_intermediates() as cache:
          a = np.random.rand(4, 4)
          b = np.random.rand(4, 4)
          contract("ab,bc->ac", a, b)
      
      assert get_cache_key("zc,az->ac", b, a) in cache
      ...
      
    • [x] test against the Pyro rm-einsum-shared branch
    • [x] test nested sharing
    • [x] write docs
    • [x] move _alpha_canonicalize() to parser.py
    • [x] test caching with constant expressions

    Questions

    • [x] Could a maintainer suggest where the handle_sharing() context should be moved? I've attempted to insert it into ContractExpression.__call__(), but I'm a bit lost.

    Status

    • [x] Ready to go
    enhancement 
    opened by fritzo 16
  • Re-use intermediates across various contractions (Common Subexpression Elimination)

    Re-use intermediates across various contractions (Common Subexpression Elimination)

    Suppose you want to compute two contractions of the same tensors, e.g. contraction strings

    ['ab,dca,eb,cde', 'ab,cda,eb,cde']

    The (globally) optimal way to do this would be to first perform the contractions over indices a,b and e, and then perform the remaining contractions over c and d for the two sets of contractions. The current opt_einsum implementation does not allow for such a global optimization of contraction order and re-use of common intermediates.

    I'm still organizing my thoughts on this, and all input would be most welcome. On a side note, I believe implementing such a more general optimization strategy will also fix #7 as a by-product.

    Optimization logic

    A relevant publication suggesting a concrete algorithm for this optimization problem is

    Hartono et al, Identifying Cost-Effective Common Subexpressions to Reduce Operation Count in Tensor Contraction Evaluations

    I do not know to what extent the current code can be re-used in this more general setup, but the single-term optimization should be under control with the current implementation.

    Interface

    Such a multi_contract function could be called with a list of tuples (contractionString, [tensors, ..]), and would return a list of results with the same length as the input list.

    Internally, one would have to find out which tensors are actually identical between the various contractions, and then use the above contraction logic. Ideally this information should be deduced automatically and not rely on user input being in the correct form. In the same spirit, dummy indices should be allowed to have arbitrary names, i.e. they should not have to match across different contractions to be correctly identified as a common subexpression. This may require transforming the input into a 'canonical' form first to make sure that common subexpressions are captured correctly.
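
    As a sketch of such a canonical form (a hypothetical helper, separate from the package's internal _alpha_canonicalize), indices can be relabelled by order of first appearance so that equivalent expressions compare equal as plain strings:

    import string

    def alpha_canonicalize(subscripts):
        """Relabel indices by order of first appearance, e.g. 'ij,jk->ik' and
        'ab,bc->ac' both map to 'ab,bc->ac'."""
        mapping = {}
        out = []
        for char in subscripts:
            if char in ',->':
                out.append(char)
                continue
            if char not in mapping:
                mapping[char] = string.ascii_lowercase[len(mapping)]
            out.append(mapping[char])
        return ''.join(out)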

    In contrast to the setup outlined in Hartono et al, contraction strings should maybe remain in their current 'simple' form and not be generalized to allow for numerical expressions like sums of tensors etc. Such a behavior can be implemented a posteriori with the interface described here by computing the sum of the resulting tensors, e.g.

    contract('ab,ab + ab,ba', N,M) --> 'sum'(multi_contract( [('ab,ab', N,M), ('ab,ba', N,M)] ))

    Thus, restricting contraction strings to be of the form currently used does not cause loss of generality, the only downside being that it might lead to a slightly increased memory-footprint as the function would return several arrays instead of one.

    Other thoughts?

    enhancement 
    opened by ebatz 15
  • Support more symbols for subscripts (#67)

    Support more symbols for subscripts (#67)

    • Add more symbol support based on hash

    • Refactor the converting to subscripts routine a little bit

    • Update tests

    Description

    Make opt_einsum support anything hashable as subscripts, such as:

    contract(arr1, ['left edge', 'bond'], arr2, ['bond', 'right edge'])
    

    See #67

    Questions

    • [x] I've tried to add the new feature to the documentation, but it seems the documentation does not have a chapter describing the inputs. Maybe I've missed something, or maybe this should be solved in a different PR?

    Status

    • [x] Ready to go
    enhancement 
    opened by liwt31 14
  • non-numpy (e.g. GPU) backend support

    non-numpy (e.g. GPU) backend support

    This adds non-numpy backend support to opt_einsum in two flavors:

    General duck-typed support

    For arbitrary ndarray libraries, i.e. if the arrays supplied have got a .shape attribute and opt_einsum can find {backend}.tensordot etc then it should work, returning whatever type the library uses. So for example

    contract("abc,bcd,def->aef", *dask_arrays, backend='dask')
    

    will return a dask.Array without any explicit logic. I've added tests for dask and also sparse.

    Edit: I should add that extra functionality like taking diagonals etc. is still reliant on an einsum implementation.

    Special conversion support

    To support GPU operations on numpy arrays, on the first call the arrays are converted to the appropriate placeholder type, an expression is generated using these, and this expression can then be called with normal numpy arrays, returning numpy arrays as well. E.g.:

    expr = contract_expression("abc,bcd,def->aef", *shapes)
    expr(*numpy_arrays, backend='theano')
    

    Explicit tested support here is for tensorflow, theano and cupy - GPU libraries where it makes sense to convert to and from.

    Adding more backends

    contract.py doesn't have any explicit reference to the various backends; that's all contained in backends.py now. So it should be very easy to add new backends. E.g. if tblis were to add tensordot, it would fall under the duck-typing category above and I think no changes would actually be needed at all.

    Obviously any recommendations welcome! Getting the right backend function and all that is cached, so I don't think this has added more than a few hundred nanoseconds of overhead.

    opened by jcmgray 14
  • `Optimal` path not optimal?

    `Optimal` path not optimal?

    First, thanks for the great package!

    Was running optimal vs. auto on a 10-element random expression, and while I know optimal is not recommended for larger networks, I found it odd that it gives worse scaling and a larger intermediate size. Some of the other threads mention memory as a factor in the optimization, but the auto solution seems better on memory as well? All the tensors are 3x3. Is this intentional behaviour?

    auto path:

    ([(2, 3), (4, 8), (0, 4), (0, 4), (4, 5), (0, 2), (2, 3), (1, 2), (0, 1)],
       Complete contraction:  db,cc,fe,fe,aa,ff,fe,cb,ea,ac->d
              Naive scaling:  6
          Optimized scaling:  3
           Naive FLOP count:  7.290e+3
       Optimized FLOP count:  2.700e+2
        Theoretical speedup:  27.000
       Largest intermediate:  9.000e+0 elements
     --------------------------------------------------------------------------------
     scaling        BLAS                current                             remaining
     --------------------------------------------------------------------------------
        2              0              fe,fe->fe         db,cc,aa,ff,fe,cb,ea,ac,fe->d
        2              0              fe,fe->fe            db,cc,aa,ff,cb,ea,ac,fe->d
        3           GEMM              cb,db->cd               cc,aa,ff,ea,ac,fe,cd->d
        2              0              ac,cc->ac                  aa,ff,ea,fe,cd,ac->d
        3           GEMM              ac,cd->ad                     aa,ff,ea,fe,ad->d
        2              0              ea,aa->ea                        ff,fe,ad,ea->d
        3           GEMM              ea,ad->ed                           ff,fe,ed->d
        3           GEMM              ed,fe->df                              ff,df->d
    

    optimal path:

    ([(1, 7), (0, 8), (0, 2), (0, 1), (0, 5), (0, 3), (0, 3), (0, 2), (0, 1)],
       Complete contraction:  db,cc,fe,fe,aa,ff,fe,cb,ea,ac->d
              Naive scaling:  6
          Optimized scaling:  4
           Naive FLOP count:  7.290e+3
       Optimized FLOP count:  5.130e+2
        Theoretical speedup:  14.211
       Largest intermediate:  2.700e+1 elements
     --------------------------------------------------------------------------------
     scaling        BLAS                current                             remaining
     --------------------------------------------------------------------------------
        2              0              cb,cc->cb         db,fe,fe,aa,ff,fe,ea,ac,cb->d
        3           GEMM              cb,db->cd            fe,fe,aa,ff,fe,ea,ac,cd->d
        3              0             aa,fe->afe              fe,ff,fe,ea,ac,cd,afe->d
        2              0              ff,fe->fe                 fe,ea,ac,cd,afe,fe->d
        2              0              fe,fe->fe                    ea,ac,cd,afe,fe->d
        3              0            afe,ea->afe                       ac,cd,fe,afe->d
        4           GEMM            afe,ac->fec                          cd,fe,fec->d
        4           GEMM            fec,cd->fed                             fe,fed->d
        3           GEMM              fed,fe->d                                  d->d) 
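
    A self-contained sketch to reproduce the comparison (assuming helpers.build_views accepts a per-index dimension mapping):

    import opt_einsum as oe

    eq = 'db,cc,fe,fe,aa,ff,fe,cb,ea,ac->d'
    views = oe.helpers.build_views(eq, {ix: 3 for ix in 'abcdef'})
    path_auto, info_auto = oe.contract_path(eq, *views, optimize='auto')
    path_opt, info_opt = oe.contract_path(eq, *views, optimize='optimal')
    print(info_auto.opt_cost, info_opt.opt_cost)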
    
    bug 
    opened by Bonnevie 13
  • [WIP] allow constant tensors in expressions

    [WIP] allow constant tensors in expressions

    This allows constant tensors to be supplied to contract_expression, which will then build as many constant intermediaries as it can. Still a bit of a work in progress but the core functionality is working so thought I'd share. Resolves #7.

    Example

    Basic setup:

    >>> from opt_einsum import contract_expression
    >>> from numpy.random import rand
    
    >>> equation = 'ij,jk,kl,lm,mn->ni'
    >>> const_ops = rand(2, 2), rand(2, 2), rand(2, 2)
    >>> ops = (3, 2), *const_ops, (2, 3)  # non-const ops are just shapes
    
    >>> expr = contract_expression(equation, *ops, constants=[1, 2, 3])
    >>> expr
    <ContractExpression('ij,[jk,kl,lm],mn->ni', constants=[1, 2, 3])>
    
    >>> print(expr)
    <ContractExpression('ij,[jk,kl,lm],mn->ni', constants=[1, 2, 3])>
      1.  'jm,ij->mi' [GEMM]
      2.  'mi,mn->ni' [GEMM]
    

    Now we can call it with just two arrays:

    >>> expr(rand(3, 2), rand(2, 3))
    array([[0.10982582, 0.06334146, 0.09379349],
           [0.09159049, 0.05282374, 0.07821902],
           [0.15775775, 0.09097925, 0.13471551]])
    
    >>> expr(rand(3, 2), rand(2, 3))
    array([[0.19207156, 0.21354127, 0.04811443],
           [0.01633236, 0.01815862, 0.004091  ],
           [0.06079124, 0.06758219, 0.0152304 ]])
    

    Notes

    • Currently this just performs as many contractions in the contraction_list as possible, stopping when it reaches a non-constant array. This means that some potential constant contractions might still be missed even though they don't depend on anything (e.g. 'a,a,b,b' with constants=[0, 1] might still choose to contract b,b first). One solution would be to somehow reshuffle the contraction_list but this gets a bit messy.
    • How do we feel about the repr?

    TODO

    • [x] ~~Sort out interaction with backend (need to not cache expression on first constants pass, and deal with various conversion points (e.g. keep constants on GPU)~~
    • [x] Document
    • [ ] Add edge case tests? Additional errors to help user?
    opened by jcmgray 13
  • shared_intermediates and contraction trees

    shared_intermediates and contraction trees

    The shared_intermediates example in the docs is to calculate multiple marginals, which is a use case of interest to me. Calculating marginals is equivalent to calculating environment tensors, so I'm interested in using the approach from https://arxiv.org/abs/1310.8023 which basically guarantees that calculating all marginals has the same cost as 3x the cost of calculating 1 marginal by using extensive memoization. This is ensured by making each contraction path a member of the same contraction tree family, with each tree node at worst corresponding to 3 different contractions that can be appropriately memoized. This function is called multienv() in various tensor network packages.

    Here is a rough but working prototype of a ContractionTree class that can generate the various contraction paths by calls to get_environment_path.

    Contraction tree prototype
    from itertools import chain

    import networkx as nx
    from opt_einsum import contract_path

    # Note: HitDict is the dict subclass for counting cache hits referenced in the discussion below.

    class ContractionTree:
        """Interface for finding contraction paths for each possible environment tensor with high cache reuse.
    
        Based on "Improving the efficiency of variational tensor network algorithms" by Evenbly and Pfeifer. Returned
        environment paths stay in same contraction tree family.
        """
    
        def __init__(self, tensors, optimize='greedy'):
            operands = chain.from_iterable([(tensor.tensor, tensor.axis_names) for tensor in tensors])
            self.partition_path, _ = contract_path(*operands, (), optimize=optimize)
            self.contraction_tree = nx.DiGraph()
            for index, tensor in enumerate(tensors):
                self.contraction_tree.add_node(tensor, index=index)
            self.tensors = tensors
            self.cache = HitDict()
    
            active_nodes = [tensor for tensor in tensors]
            for path_step in self.partition_path:
                children = []
                for index in sorted(path_step, reverse=True):
                    children.append(active_nodes.pop(index))
                active_nodes.append(frozenset(children))
                for child in children:
                    self.contraction_tree.add_edge(child, active_nodes[-1])
    
        def get_environment_path(self, marginal_tensor):
            """Returns opt_einsum path for computing the environment tensor of `marginal_tensor`"""
            environment_tree = self.contraction_tree.copy()
    
            def redirect_edges(node, child, visited):
                """recursive function that redirects all edges towards initial node passed"""
                visited = visited + [node]
                out_edges = list(environment_tree.out_edges(node))
                for start, end in out_edges:
                    if child is None or end != child:
                        environment_tree.remove_edge(start, end)
                        environment_tree.add_edge(end, start)
                    if end not in visited:
                        redirect_edges(end, node, visited)
                in_edges = list(environment_tree.in_edges(node))
                for start, end in in_edges:
                    if start not in visited:
                        redirect_edges(start, node, visited)
    
            #  redirect and remove redundant nodes
            redirect_edges(marginal_tensor, None, [marginal_tensor])
            nodes = list(environment_tree.nodes)
            for node in nodes:
                in_edges = environment_tree.in_edges(node)
                out_edges = environment_tree.out_edges(node)
                if len(out_edges) and len(in_edges) == 1:
                    origin = list(in_edges)[0][0]
                    for out_edge in out_edges:
                        environment_tree.add_edge(origin, out_edge[1])
                    environment_tree.remove_node(node)
    
            #  maintain dict mapping tensor -> index position in list of tensors
            positions = {tensor: index for index, tensor in enumerate(self.tensors)}
            positions = {tensor: position - (positions[marginal_tensor] < position)
                         for tensor, position in positions.items()}
            del positions[marginal_tensor] #  assume the query tensor is removed from input string
    
            #  rebuild path from reoriented contraction tree
            path = []
            active = len(self.tensors) - 1
            for node in nx.topological_sort(environment_tree):
                children = [start for start, end in environment_tree.in_edges(node)]
                if len(children) > 1:
                    path.append(tuple(positions[tensor] for tensor in children))
                    positions = {tensor: position - sum(positions[removed_tensor] < position for removed_tensor in children)
                                 for tensor, position in positions.items() if tensor not in children}
                    active = active - len(children) + 1
                    positions[node] = active - 1
            return path
    

    My issue is

    1. using the HitDict class proposed for counting cache hits it seems that using these paths gives me 0 cache hits?? I am using jax, but tried with numpy as well.
    2. Is shared_intermediates capable of reusing high-order intermediates as found in the contraction tree?

    I am calling it using interleaved input (I am using a wrapper of tensor with an axis_names attribute):

        def get_environment(self, tensors, environment_of, optimize='auto', **kwargs):
            environment_tensors = [tensor for tensor in tensors if tensor != environment_of]
            operands = chain.from_iterable([(tensor.tensor, tensor.axis_names) for tensor in environment_tensors])
            return oe.contract(*operands, environment_of.axis_names, optimize=optimize, **kwargs)
        
        def marginals(all_tensors, marginal_tensors):
            tree = ContractionTree(all_tensors)
            marginals = []
            for marginal_tensor in marginal_tensors:
                path = tree.get_environment_path(marginal_tensor)
                with shared_intermediates(tree.cache):
                    marginals.append(get_environment(all_tensors, marginal_tensor, optimize=path))
            return marginals, tree.cache
                
    

    I realize this might be outside the scope of opt_einsum, but it also seems like this would be a valuable extension of shared_intermediates - or it might already be supported via the current caching mechanism and there's just a bug in my implementation :)

    opened by Bonnevie 12
  • v3.0 Planning

    v3.0 Planning

    We are planning a major version bump for the following reasons:

    • Transition to be backend agnostic (NumPy is no longer the principal backend).
      • See #81 which has the (small) potential to break current implementations.

    TODO Items:

    • [ ] Refactoring in preparation for NEP18 (#46) should be considered.
    • [ ] Consider adding a Ricci calculus parser contract('A_ij B_jk A_kl', {"A": a, "B": b}).
    • [x] Run through the backends to check where we can refactor.
    • [x] Update docs and README to reflect that we are NumPy agnostic.
    • [ ] Consider parsers that use metrics besides raw FLOPS (actually trying out several contraction paths for example).

    Are there other items where we might need to do a few (small) compatibility breaks that would be good for future sustainability?

    release 
    opened by dgasmith 12
  • Contiguous loss after opt-einsum that may affect the efficiency of subsequent operations

    Contiguous loss after opt-einsum that may affect the efficiency of subsequent operations

    After a tensor contraction, the resulting tensor from numpy.einsum is C-contiguous, whereas with opt_einsum this property may be lost. If there is a further tensor addition, the non-contiguous array may be slower. Is there any way to make the resulting tensor from opt_einsum C-contiguous? If not, is there an option to specify that the result should be C-contiguous, while constraining the path search accordingly?

    For example, the following code

    import numpy as np
    import opt_einsum as oe
    import time
    
    
    na = 5
    nb = 5
    nc = 8
    nd = 8
    ne = 2
    nf = 1
    ng = 8
    nh = 8
    ni = 8
    nj = 8
    
    dim_a = (na,nb,nc,nd,ne,nf)
    dim_b = (ne,nf,ng,nh,ni,nj)
    dim_c = (na,nb,nc,nd,ng,nh,ni,nj)
    n_times = 20
    
    contraction = 'abcdef,efghij -> abcdghij'
    a = np.random.random(dim_a)
    b = np.random.random(dim_b)
    c = np.einsum(contraction, a, b)
    d = oe.contract(contraction, a, b)
    
    sum_array = np.zeros(dim_c)
    print(a.flags)
    print(b.flags)
    print(c.flags)
    print(d.flags)
    
    print('list')
    list_a = []
    list_b = []
    list_c = []
    for n in range(n_times):
        a = np.random.random(dim_a)
        b = np.random.random(dim_b)
        list_a.append(a)
        list_b.append(b)
    
    
    print('np')
    
    start = time.time()
    for n in range(n_times):
        list_c.append( np.einsum(contraction, list_a[n], list_b[n]) )
    print(time.time() - start) 
    
    sum_array = np.zeros(dim_c)
    start = time.time()
    for n in range(n_times):
        np.add(sum_array, list_c[n], out = sum_array)
    print(time.time() - start) 
    
    
    print('oe')
    opt_path = oe.contract_expression(contraction, a.shape, b.shape, optimize = "optimal") 
    
    list_a = []
    list_b = []
    list_c = []
    for n in range(n_times):
        a = np.random.random(dim_a)
        b = np.random.random(dim_b)
        list_a.append(a)
        list_b.append(b)
    
    
    start = time.time()
    for n in range(n_times):
       list_c.append(opt_path(list_a[n], list_b[n]))
    print(time.time() - start) 
    
    #start = time.time()
    #for n in range(n_times):
    #    list_c[n] = np.ascontiguousarray(list_c[n])
    #print(time.time() - start) 
    
    
    
    sum_array = np.zeros(dim_c)
    start = time.time()
    for n in range(n_times):
      #  list_c[n] = np.ascontiguousarray(list_c[n])
        np.add(sum_array, list_c[n], out = sum_array)
    print(time.time() - start) 
    

    The result is

      C_CONTIGUOUS : True
      F_CONTIGUOUS : False
      OWNDATA : True
      WRITEABLE : True
      ALIGNED : True
      WRITEBACKIFCOPY : False
    
      C_CONTIGUOUS : True
      F_CONTIGUOUS : False
      OWNDATA : True
      WRITEABLE : True
      ALIGNED : True
      WRITEBACKIFCOPY : False
    
      C_CONTIGUOUS : True
      F_CONTIGUOUS : False
      OWNDATA : True
      WRITEABLE : True
      ALIGNED : True
      WRITEBACKIFCOPY : False
    
      C_CONTIGUOUS : False
      F_CONTIGUOUS : False
      OWNDATA : False
      WRITEABLE : True
      ALIGNED : True
      WRITEBACKIFCOPY : False
    
    list
    np
    1.9423425197601318
    0.32601356506347656
    oe
    1.6317214965820312
    1.2006006240844727
    

    in which the np.add part is much slower with opt_einsum than with einsum. If I use np.ascontiguousarray to make the resulting array C-contiguous, that step itself is slow.
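
    One possible workaround sketch, assuming contract accepts an out= argument like np.einsum: pre-allocate a C-contiguous buffer and contract directly into it.

    out_buf = np.zeros(dim_c)  # freshly allocated, so C-contiguous
    oe.contract(contraction, a, b, out=out_buf)
    print(out_buf.flags['C_CONTIGUOUS'])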

    opened by Geositta2000 7
  • [WIP] add a preprocessor

    [WIP] add a preprocessor

    Description

    This adds a potential preprocessor that takes an equation and converts it into a minimal form with the following simplifications:

    • All indices which only appear on a single input (and not the output) are summed over.
    • All indices which appear multiple times on the same term are traced.
    • All scalars are multiplied into the smallest other term.
    • All terms with the same indices are multiplied (Hadamard product / elementwise) into a single term ('de-duplication').

    along with a function that transforms input arrays into the inputs for the new simplified equation. It is a function so that it could be part of a contract 'expression'.

    This would address (#114, #99, #112, #167, #189, ...).

    Example from the docstring:

    eq = ',ab,abee,,cd,cd,dd->ac'
    arrays = helpers.build_views(eq)
    new_eq, simplifier = make_simplifier(eq)
    new_eq
    # 'ab,cd,d->ac'
    sarrays = simplifier(arrays)
    oe.contract(new_eq, *sarrays)
    # array([[0.65245661, 0.14684493, 0.42543411, 0.32350895],
    #        [0.38357005, 0.08632807, 0.25010672, 0.19018636]])
    

    Todos

    Notable points that this PR has either accomplished or will accomplish.

    • [ ] when and how should this be called?
    • [ ] return a partial path rather than a 'simplifier'?
    • [ ] should shape information (size_dict) be used and propagated for use with a contraction expression (and so that the smallest tensor to multiply scalars into can be most accurately chosen)?
    • [ ] other simplifications are possible (e.g. hadamard deduplication up to transpose, performing all non-rank increasing contractions)...
    • [x] ~currently the parsing is a single sweep, should it iteratively sweep til no more changes? E.g. currently examples like ab,ab-> transform to ab-> but if processed again could transform to ->~

    Status

    • [ ] Ready to go
    opened by jcmgray 1
  • Disable torch.backends.opt_einsum to avoid duplicate work

    Disable torch.backends.opt_einsum to avoid duplicate work

    Description

    Recently, torch.einsum has improved to automatically optimize for multi-contractions if the opt_einsum library is installed. This way, torch users can reap benefits easily. However, this change may inadvertently cause regressions for those who use opt_einsum.contract with torch tensors. While these use cases can be improved by just changing opt_einsum.contract calls to torch.einsum, it is not right to cause silent regressions.

    Consider the following example:

    import torch
    import opt_einsum
    
    A = torch.rand(4, 5)
    B = torch.rand(5, 6)
    C = torch.rand(6, 7) 
    D = opt_einsum.contract('ij,jk,kl->il', A, B, C)
    

    Looking through the code, opt_einsum.contract will figure out an ideal path + return that path in the form of contraction tuples (commonly pairs). Then, for each of the contraction tuples, opt_einsum may call BACK into the torch backend and call torch.einsum.

    Since the contractions are commonly pairs, calling back into torch.einsum will do what it used to--just directly do the contraction.

    HOWEVER. There are cases where the contractions are not necessarily pairs (@dgasmith had mentioned that Hadamard products are faster when chained vs staggered in pairs) and in this case, torch.einsum will do unnecessary work in recomputing an ideal path, which is a regression from the previous state.

    This PR simply asks "hey, does the opt_einsum backend exist in torch?" If so, this means torch.einsum will do the unnecessary work in recomputing an ideal path, so let's just turn it off.
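
    A sketch of that check, assuming the installed torch version exposes the torch.backends.opt_einsum toggle:

    import torch

    # If torch exposes the opt_einsum backend toggle, disable it so torch.einsum
    # does not re-plan paths that opt_einsum.contract has already computed.
    if hasattr(torch.backends, 'opt_einsum'):
        torch.backends.opt_einsum.enabled = False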

    Todos

    Notable points that this PR has either accomplished or will accomplish.

    • [x] Add the code to disable torch's opt_einsum optimization
    • [ ] Test cases?

    Questions

    • [ ] Unsure how to best test this/what use case would ensure that, without this code, the path would be computed more than once, and after this code, the path would only be computed at the top level. We could just test that torch.backends.opt_einsum.enabled is globally False, I guess.

    Status

    • [ ] Ready to go
    opened by janeyx99 8
  • Remove NumPy as a hard dependency

    Remove NumPy as a hard dependency

    Description

    This PR removes NumPy from opt_einsum as a runtime dependency. The change has little effect on the opt_einsum code base but significantly affects the testing infrastructure. Several testing strategies are currently employed to help understand the most effective one. Currently, these are:

    • Move all NumPy functions to a testing.py module with pytest.skip
    • Various pytest.skip functions wrapped in decorators
    • Local NumPy imports.

    Feedback welcome!

    Todos

    Notable points that this PR has either accomplished or will accomplish.

    • [x] Remove unused custom tensordot code
    • [ ] Remove import numpy from possibly_convert_to_numpy
    • [ ] Fix ssa_to_linear with a proper implementation

    Questions

    • [ ] Question1

    Status

    • [ ] Ready to go
    opened by dgasmith 3
  • Thoughts on making the numpy dependency optional?

    Thoughts on making the numpy dependency optional?

    Proposal

    Make the numpy dependency optional, if possible.

    Why?

    Minimizing dependencies is a general goal as it allows a bigger audience to reap the benefits of this library. More specifically, some of us are interested in making opt_einsum a hard dependency for torch, but we would like to keep numpy unrequired. (If you're curious why torch does not have a hard dependency on numpy, see https://github.com/pytorch/pytorch/issues/60081, tl;dr being the last comment.)

    A hard dependency would mean all torch users would get the benefits of opt_einsum right away without thinking too hard/needing to manually install opt_einsum themselves.

    Alternatives

    We could also have torch vendor in opt_einsum, but that is increased complexity/maintenance + we would like to automatically subscribe to improvements in opt_einsum!

    opened by janeyx99 4
  • When I use opt_einsum to optimize torch.einsum, the running time after optimization increases.

    When I use opt_einsum to optimize torch.einsum, the running time after optimization increases.

    import numpy as np
    import time
    import torch
    from opt_einsum import contract
    
    dim = 4
    
    x = torch.randn(6 ,4 ,4, 4)
    w1 = torch.randn(1,4,dim)
    w2 = torch.randn(dim,4,dim)
    w3 = torch.randn(dim,4,dim)
    w4 = torch.randn(dim,8,dim)
    w5 = torch.randn(dim,8,dim)
    w6 = torch.randn(dim,4,1)
    
    def naive(x, w1, w2, w3, w4, w5, w6):
        return torch.einsum('bkxy,ikj,jxm,myf,fpl,lqz,zri -> bpqr', x, w1, w2, w3, w4, w5, w6)
    
    def optimized(x, w1, w2, w3, w4, w5, w6):
        return contract('bkxy,ikj,jxm,myf,fpl,lqz,zri -> bpqr', x, w1, w2, w3, w4, w5, w6)
    
    

    The respective running time:

    naive
    0.0005145072937011719
    
    optimized
    0.876018762588501
    

    I want to know what caused this. Thanks!

    opened by edwin-zft 5
Releases(v3.3.0)
  • v3.3.0(Jul 19, 2020)

    Adds an object backend for optimized contractions on arbitrary Python objects.

    New Features

    • (#145) Adds an object-based backend so that contract(backend='object') can be used on arbitrary objects such as SymPy symbols.

    Enhancements

    • (#140) Better error messages when the requested contract backend cannot be found.
    • (#141) Adds a check with RandomOptimizers to ensure the objects are not accidentally reused for different contractions.
    • (#149) Limits the remaining category for the contract_path output to only show up to 20 tensors to prevent issues with the quadratically scaling memory requirements and the number of print lines for large contractions.
    Source code(tar.gz)
    Source code(zip)
  • v3.2.1(Apr 15, 2020)

    Bug Fixes

    • (#131) Fixes an einstein subscript error message to point to the correct value.
    • (#135) Lazily loads JAX similar to other backends.
    Source code(tar.gz)
    Source code(zip)
  • v3.2.0(Mar 3, 2020)

    Small fixes for the dp path and support for a new mars backend.

    New Features

    • (#109) Adds mars backend support.

    Enhancements

    • (#110) New auto-hq and 'random-greedy-128' paths.
    • (#119) Fixes several edge cases in the dp path.

    Bug fixes

    • (#127) Fixes an issue where Python 3.6 features are required while Python 3.5 is opt_einsum's stated minimum version.
    Source code(tar.gz)
    Source code(zip)
  • v3.1.0(Sep 30, 2019)

  • v3.0.1(Aug 22, 2019)

    Alters setup.py to correctly state that opt_einsum requires Python 3.5+. This will now correctly affect PyPI and other pip mirror downloads; v3.0.0 will be removed from PyPI to prevent further issues.

    Source code(tar.gz)
    Source code(zip)
  • v3.0.0(Aug 12, 2019)

    This release moves opt_einsum to be backend agnostic while adding support for additional backends such as JAX and Autograd. Support for Python 2.7 has been dropped and Python 3.5 becomes the new minimum version; a Python deprecation policy equivalent to NumPy's has been adopted.

    New Features

    • (#78) A new random optimizer has been implemented which uses Boltzmann weighting to explore alternative near-minimum paths using greedy-like schemes. This provides fairly large path performance enhancements with a linear path-time overhead.
    • (#78) A new PathOptimizer class has been implemented to provide a framework for building new optimizers. For example, custom cost functions can now be provided in the greedy formalism, allowing custom optimizers to be built without a large amount of additional code.
    • (#81) The backend="auto" keyword has been implemented for contract, allowing automatic detection of the correct backend to use based on the tensors provided to the contraction.
    • (#88) Autograd and JAX support have been implemented.
    • (#96) Deprecates Python 2 functionality and devops improvements.

    Enhancements

    • (#84) The contract_path function can now accept shape tuples rather than full tensors.
    • (#84) The automated path algorithm selection in contract_path has been refactored into a standalone function.
    Source code(tar.gz)
    Source code(zip)
  • v2.3.2(Dec 9, 2018)

  • v2.3.1(Dec 1, 2018)

  • v2.3.0(Dec 1, 2018)

    This release primarily focuses on expanding the suite of available path technologies to provide better optimization characteristics for 4-20 tensors while decreasing the time to find paths for 50-200+ tensors. See Path Overview for more information.

    New Features:

    • (#60) A new greedy implementation has been added which is up to two orders of magnitude faster for 200 tensors.
    • (#73) Adds a new branch path that uses greedy ideas to prune the optimal exploration space, providing a better path than greedy at sub-optimal cost.
    • (#73) Adds a new auto keyword to the opt_einsum.contract path option. This keyword automatically chooses the best path technology that takes under 1ms to execute.

    Enhancements:

    • (#61) The opt_einsum.contract path keyword has been changed to optimize to more closely match NumPy. path will be deprecated in the future.
    • (#61) The opt_einsum.contract_path now returns a opt_einsum.contract.PathInfo object that can be queried for the scaling, flops, and intermediates of the path. The print representation of this object is identical to before.
    • (#61) Based on community feedback, memory_limit is now unlimited by default.
    • (#66) The Torch backend will now use tensordot when using a version of Torch which includes this functionality.
    • (#68) Indices can now be any hashable object when provided in the "Interleaved Input" syntax.
    • (#74) Allows the default transpose operation to be overridden to take advantage of more advanced tensor transpose libraries.
    • (#73) The optimal path is now significantly faster.

    Bug fixes:

    • (#72) Fixes the "Interleaved Input" syntax and adds documentation.
    Source code(tar.gz)
    Source code(zip)
  • v2.2.0(Aug 29, 2018)

    New features:

    • (#48) Intermediates can now be shared between contractions, see here for more details.
    • (#53) Intermediate caching is thread safe.

    Enhancements:

    • (#48) Expressions are now mapped to a non-Unicode index set so that Unicode input is supported for all backends.
    • (#58) Adds tensorflow and theano with shared intermediates.

    Bug fixes:

    • (#41) PyTorch indices are mapped back to a small a-z subset valid for PyTorch's einsum implementation.
    Source code(tar.gz)
    Source code(zip)
  • v2.1.3(Aug 23, 2018)

  • v2.1.2(Aug 15, 2018)

  • v2.1.1(Aug 14, 2018)

  • v2.1.0(Aug 14, 2018)

    opt_einsum continues to improve its support for backends beyond NumPy, now including PyTorch.

    We have also published the opt_einsum package in the Journal of Open Source Software. If you use this package in your work, please consider citing us!

    New features:

    • PyTorch backend support
    • Tensorflow eager-mode execution backend support

    Enhancements:

    • Intermediate tensordot-like expressions are now ordered to avoid transposes.
    • CI now uses conda backend to better support GPU and tensor libraries.
    • Now accepts arbitrary unicode indices rather than a subset.
    • New auto path option which switches between optimal and greedy at four tensors.

    Bug fixes:

    • Fixed issue where broadcast indices were incorrectly locked out of tensordot-like evaluations even after their dimension was broadcast.
    Source code(tar.gz)
    Source code(zip)
  • v2.0.1(Jun 28, 2018)

    opt_einsum is a powerful tensor contraction order optimizer for NumPy and related ecosystems.

    New Features

    • Allows unlimited Unicode indices.
    • Adds a Journal of Open-Source Software paper.
    • Minor documentation improvements.
    Source code(tar.gz)
    Source code(zip)
  • v2.0.0(May 17, 2018)

    opt_einsum is a powerful tensor contraction order optimizer for NumPy and related ecosystems.

    New Features

    • Expressions can be precompiled so that the expression optimization need not happen multiple times.
    • The greedy order optimization algorithm has been tuned to be able to handle hundreds of tensors in several seconds.
    • Input indices can now be unicode so that expressions can have many thousands of indices.
    • GPU and distributed computing backends such as Dask, TensorFlow, CuPy, Theano, and Sparse have been added.

    Bug Fixes

    • An error affecting cases where opt_einsum mistook broadcasting operations for matrix multiplication has been fixed.
    • Most error messages are now more expressive.
    Source code(tar.gz)
    Source code(zip)
  • 1.0(Oct 15, 2016)

    Official 1.0 release.

    Einsum is a very powerful function for contracting tensors of arbitrary dimension and index. However, it is only optimized to contract two terms at a time resulting in non-optimal scaling for contractions with many terms. Opt_einsum aims to fix this by optimizing the contraction order which can lead to arbitrarily large speed ups at the cost of additional intermediate tensors.

    Opt_einsum is also implemented into the np.einsum function as of NumPy v1.12.

    Source code(tar.gz)
    Source code(zip)
  • v0.2.0(Jul 30, 2016)

    A large step towards a full 1.0 release. BLAS usage is now automatically applied to all operations. Future releases will be more careful with regard to views and needless data copying.

    Source code(tar.gz)
    Source code(zip)
  • v0.1.1(Mar 23, 2016)
