LowRankModels.jl is a julia package for modeling and fitting generalized low rank models.

Overview

LowRankModels.jl

Build Status Build status codecov

LowRankModels.jl is a Julia package for modeling and fitting generalized low rank models (GLRMs). GLRMs model a data array by a low rank matrix, and include many well known models in data analysis, such as principal components analysis (PCA), matrix completion, robust PCA, nonnegative matrix factorization, k-means, and many more.

For more information on GLRMs, see our paper. There is a python interface to this package, and a GLRM implementation in the H2O machine learning platform with interfaces in a variety of languages.

LowRankModels.jl makes it easy to mix and match loss functions and regularizers to construct a model suitable for a particular data set. In particular, it supports

  • using different loss functions for different columns of the data array, which is useful when data types are heterogeneous (e.g., real, boolean, and ordinal columns);
  • fitting the model to only some of the entries in the table, which is useful for data tables with many missing (unobserved) entries; and
  • adding offsets and scalings to the model without destroying sparsity, which is useful when the data is poorly scaled.

Installation

To install, just call

Pkg.add("LowRankModels")

at the Julia prompt.

Generalized Low Rank Models

GLRMs form a low rank model for tabular data A with m rows and n columns, which can be input as an array or any array-like object (for example, a data frame). It is fine if only some of the entries have been observed (i.e., the others are missing); the GLRM will only be fit on the !ismissing entries. The desired model is specified by choosing a rank k for the model, an array of loss functions losses, and two regularizers, rx and ry. The data is modeled as X'*Y, where X is a kxm matrix and Y is a kxn matrix. X and Y are found by solving the optimization problem

minimize sum_{(i,j) in obs} losses[j]((X'*Y)[i,j], A[i,j]) + sum_i rx(X[:,i]) + sum_j ry(Y[:,j])

The basic type used by LowRankModels.jl is the GLRM. To form a GLRM, the user specifies

  • the data A (any AbstractArray, such as an array, a sparse matrix, or a data frame)
  • the array of loss functions losses
  • the regularizers rx and ry
  • the rank k

The user may also specify

  • the observed entries obs
  • starting matrices X₀ and Y₀

obs is a list of tuples of the indices of the observed entries in the matrix, and may be omitted if all the entries in the matrix have been observed. If A is a sparse matrix, implicit zeros are interpreted as missing entries by default; see the discussion of sparse matrices below for more details. X₀ and Y₀ are initialization matrices that represent a starting guess for the optimization.

Losses and regularizers must be of type Loss and Regularizer, respectively, and may be chosen from a list of supported losses and regularizers, which include

Losses:

  • quadratic loss QuadLoss
  • hinge loss HingeLoss
  • logistic loss LogisticLoss
  • Poisson loss PoissonLoss
  • weighted hinge loss WeightedHingeLoss
  • l1 loss L1Loss
  • ordinal hinge loss OrdinalHingeLoss
  • periodic loss PeriodicLoss
  • multinomial categorical loss MultinomialLoss
  • multinomial ordinal (aka ordered logit) loss OrderedMultinomialLoss
  • bigger-vs-smaller loss BvSLoss (for ordinal data)
  • one-vs all-loss OvALoss (for categorical data)

The constructors for all the ordinal and categorical losses take as an argument the maximum (or both minimum and maximum) value the variable may take. Using the one-vs-all loss is equivalent to transforming a categorical value to a one-hot vector and using a binary loss on each entry in that vector. Using the bigger-vs-smaller loss is equivalent to transforming the ordinal value to a Boolean vector and using a binary loss on each entry in that vector. By default, the binary loss used is the logistic loss.

Regularizers:

  • quadratic regularization QuadReg
  • constrained squared euclidean norm QuadConstraint
  • l1 regularization OneReg
  • no regularization ZeroReg
  • nonnegative constraint NonNegConstraint (e.g., for nonnegative matrix factorization)
  • 1-sparse constraint OneSparseConstraint (e.g., for orthogonal NNMF)
  • unit 1-sparse constraint UnitOneSparseConstraint (e.g., for k-means)
  • simplex constraint SimplexConstraint
  • l1 regularization, combined with nonnegative constraint NonNegOneReg
  • fix features at values y0 FixedLatentFeaturesConstraint(y0)

Each of these losses and regularizers can be scaled (for example, to increase the importance of the loss relative to the regularizer) by calling mul!(loss, newscale). Users may also implement their own losses and regularizers, or adjust internal parameters of the losses and regularizers; see losses.jl and regularizers.jl for more details.

Example

For example, the following code forms a k-means model with k=5 on the 100x100 matrix A:

using LowRankModels
m, n, k = 100, 100, 5
losses = QuadLoss() # minimize squared distance to cluster centroids
rx = UnitOneSparseConstraint() # each row is assigned to exactly one cluster
ry = ZeroReg() # no regularization on the cluster centroids
glrm = GLRM(A, losses, rx, ry, k)

To fit the model, call

X, Y, ch = fit!(glrm)

which runs an alternating directions proximal gradient method on glrm to find the X and Y minimizing the objective function. (ch gives the convergence history; see Technical details below for more information.)

The losses argument can also be an array of loss functions, with one for each column (in order). For example, for a data set with 3 columns, you could use

losses = Loss[QuadLoss(), LogisticLoss(), HingeLoss()]

Similiarly, the ry argument can be an array of regularizers, with one for each column (in order). For example, for a data set with 3 columns, you could use

ry = Regularizer[QuadReg(1), QuadReg(10), FixedLatentFeaturesConstraint([1.,2.,3.])]

This regularizes the first to columns of Y with ||Y[:,1]||^2 + 10||Y[:,2]||^2 and constrains the third (and last) column of Y to be equal to [1,2,3].

More examples here.

Missing data

If not all entries are present in your data table, just tell the GLRM which observations to fit the model to by listing tuples of their indices in obs, e.g., if obs=[(1,2), (5,3)], exactly two entries have been observed. Then initialize the model using

GLRM(A, losses, rx, ry, k, obs=obs)

If A is a DataFrame and you just want the model to ignore any entry that is missing, you can use

obs = observations(A)

Standard low rank models

Low rank models can easily be used to fit standard models such as PCA, k-means, and nonnegative matrix factorization. The following functions are available:

  • pca: principal components analysis
  • qpca: quadratically regularized principal components analysis
  • rpca: robust principal components analysis
  • nnmf: nonnegative matrix factorization
  • k-means: k-means

See the code for usage. Any keyword argument valid for a GLRM object, such as an initial value for X or Y or a list of observations, can also be used with these standard low rank models.

Scaling and offsets

If you choose, LowRankModels.jl can add an offset to your model and scale the loss functions and regularizers so all columns have the same pull in the model. Simply call

glrm = GLRM(A, losses, rx, ry, k, offset=true, scale=true)

This transformation generalizes standardization, a common preprocessing technique applied before PCA. (For more about offsets and scaling, see the code or the paper.)

You can also add offsets and scalings to previously unscaled models:

  • Add an offset to the model (by applying no regularization to the last row of the matrix Y, and enforcing that the last column of X be all 1s) using
add_offset!(glrm)
  • Scale the loss functions and regularizers by calling
equilibrate_variance!(glrm)
  • Scale only the columns associated to QuadLoss or HuberLoss loss functions.
prob_scale!(glrm)

Fitting DataFrames

Perhaps all this sounds like too much work. Perhaps you happen to have a DataFrame df lying around that you'd like a low rank (e.g., k=2) model for. For example,

import RDatasets
df = RDatasets.dataset("psych", "msq")

Never fear! Just call

glrm, labels = GLRM(df, k)
X, Y, ch = fit!(glrm)

This will fit a GLRM with rank k to your data, using a QuadLoss loss for real valued columns, HingeLoss loss for boolean columns, and ordinal HingeLoss loss for integer columns, a small amount of QuadLoss regularization, and scaling and adding an offset to the model as described here. It returns the column labels for the columns it fit, along with the model. Right now, all other data types are ignored. NaN values are treated as missing values (missings) and ignored in the fit.

The full call signature is

function GLRM(df::DataFrame, k::Int;
              losses = Loss[], rx = QuadReg(.01), ry = QuadReg(.01),
              offset = true, scale = false,
              prob_scale = true, NaNs_to_NAs = true)

You can modify the losses or regularizers, or turn off offsets or scaling, using these keyword arguments.

Or to specify a map from data types to losses, define a new loss_map from datatypes to losses (like probabilistic_losses, below):

probabilistic_losses = Dict{Symbol, Any}(
    :real        => QuadLoss,
    :bool        => LogisticLoss,
    :ord         => MultinomialOrdinalLoss,
    :cat         => MultinomialLoss
)

and input an array of datatypes (one for each column of your data frame: GLRM(A, k, datatypes; loss_map = loss_map). The full call signature is

function GLRM(df::DataFrame, k::Int, datatypes::Array{Symbol,1};
              loss_map = probabilistic_losses,
              rx = QuadReg(.01), ry = QuadReg(.01),
              offset = true, scale = false, prob_scale = true,
              transform_data_to_numbers = true, NaNs_to_NAs = true)

You can modify the losses or regularizers, or turn off offsets or scaling, using these keyword arguments.

To fit a data frame with categorical values, you can use the function expand_categoricals! to turn categorical columns into a Boolean column for each level of the categorical variable. For example, expand_categoricals!(df, [:gender]) will replace the gender column with a column corresponding to gender=male, a column corresponding to gender=female, and other columns corresponding to labels outside the gender binary, if they appear in the data set.

You can use the model to get some intuition for the data set. For example, try plotting the columns of Y with the labels; you might see that similar features are close to each other!

Fitting Sparse Matrices

If you have a very large, sparsely observed dataset, then you may want to encode your data as a sparse matrix. By default, LowRankModels interprets the sparse entries of a sparse matrix as missing entries (i.e. NA values). There is no need to pass the indices of observed entries (obs) -- this is done automatically when GLRM(A::SparseMatrixCSC,...) is called. In addition, calling fit!(glrm) when glrm.A is a sparse matrix will use the sparse variant of the proximal gradient descent algorithm, fit!(glrm, SparseProxGradParams(); kwargs...).

If, instead, you'd like to interpret the sparse entries as zeros, rather than missing or NA entries, use:

glrm = GLRM(...; sparse_na=false)

In this case, the dataset is dense in terms of observations, but sparse in terms of nonzero values. Thus, it may make more sense to fit the model with the vanilla proximal gradient descent algorithm, fit!(glrm, ProxGradParams(); kwargs...).

Parallel fitting (experimental)

LowRankModels makes use of Julia v0.5's new multithreading functionality to fit models in parallel. To fit a LowRankModel in parallel using multithreading, simply set the number of threads from the command line before starting Julia: e.g.,

export JULIA_NUM_THREADS=4

Technical details

Optimization

The function fit! uses an alternating directions proximal gradient method to minimize the objective. This method is not guaranteed to converge to the optimum, or even to a local minimum. If your code is not converging or is converging to a model you dislike, there are a number of parameters you can tweak.

Warm start

The algorithm starts with glrm.X and glrm.Y as the initial estimates for X and Y. If these are not given explicitly, they will be initialized randomly. If you have a good guess for a model, try setting them explicitly. If you think that you're getting stuck in a local minimum, try reinitializing your GLRM (so as to construct a new initial random point) and see if the model you obtain improves.

The function fit! sets the fields glrm.X and glrm.Y after fitting the model. This is particularly useful if you want to use the model you generate as a warm start for further iterations. If you prefer to preserve the original glrm.X and glrm.Y (e.g., for cross validation), you should call the function fit, which does not mutate its arguments.

You can even start with an easy-to-optimize loss function, run fit!, change the loss function (glrm.losses = newlosses), and keep going from your warm start by calling fit! again to fit the new loss functions.

Initialization

If you don't have a good guess at a warm start for your model, you might try one of the initializations provided in LowRankModels.

  • init_svd! initializes the model as the truncated SVD of the matrix of observed entries, with unobserved entries filled in with zeros. This initialization is known to result in provably good solutions for a number of "PCA-like" problems. See our paper for details.
  • init_kmeanspp! initializes the model using a modification of the kmeans++ algorithm for data sets with missing entries; see our paper for details. This works well for fitting clustering models, and may help in achieving better fits for nonnegative matrix factorization problems as well.
  • init_nndsvd! initializes the model using a modification of the NNDSVD algorithm as implemented by the NMF package. This modification handles data sets with missing entries by replacing missing entries with zeros. Optionally, by setting the argument max_iters=n with n>0, it will iteratively replace missing entries by their values as imputed by the NNDSVD, and call NNDSVD again on the new matrix. (This procedure is similar to the soft impute method of Mazumder, Hastie and Tibshirani for matrix completion.)

Parameters

As mentioned earlier, LowRankModels uses alternating proximal gradient descent to derive estimates of X and Y. This can be done by two slightly different procedures: (A) compute the full reconstruction, X' * Y, to compute the gradient and objective function; (B) only compute the model estimate for entries of A that are observed. The first method is likely preferred when there are few missing entries for A because of hardware level optimizations (e.g. chunking the operations so they just fit in various caches). The second method is likely preferred when there are many missing entries of A.

To fit with the first (dense) method:

fit!(glrm, ProxGradParams(); kwargs...)

To fit with the second (sparse) method:

fit!(glrm, SparseProxGradParams(); kwargs...)

The first method is used by default if glrm.A is a standard matrix/array. The second method is used by default if glrm.A is a SparseMatrixCSC.

ProxGradParams() and SparseProxGradParams() run these respective methods with the default parameters:

  • stepsize: The step size controls the speed of convergence. Small step sizes will slow convergence, while large ones will cause divergence. stepsize should be of order 1.
  • abs_tol: The algorithm stops when the decrease in the objective per iteration is less than abs_tol*length(obs).
  • rel_tol: The algorithm stops when the decrease in the objective per iteration is less than rel_tol.
  • max_iter: The algorithm also stops if maximum number of rounds max_iter has been reached.
  • min_stepsize: The algorithm also stops if stepsize decreases below this limit.
  • inner_iter: specifies how many proximal gradient steps to take on X before moving on to Y (and vice versa).

The default parameters are: ProxGradParams(stepsize=1.0;max_iter=100,inner_iter=1,abs_tol=0.00001,rel_tol=0.0001,min_stepsize=0.01*stepsize)

Convergence

ch gives the convergence history so that the success of the optimization can be monitored; ch.objective stores the objective values, and ch.times captures the times these objective values were achieved. Try plotting this to see if you just need to increase max_iter to converge to a better model.

Imputation

After fitting a GLRM, you can use it to impute values of A in four different ways:

  • impute(glrm) gives the maximum likelihood estimates for each entry
  • impute_missing(glrm) imputes missing entries and leaves observed entries unchanged
  • sample(glrm) gives a draw from the posterior distribution, conditioned on the fit values of X and Y, for each entry
  • sample_missing(glrm) samples missing entries and leaves observed entries unchanged

Cross validation

A number of useful functions are available to help you check whether a given low rank model overfits to the test data set. These functions should help you choose adequate regularization for your model.

Cross validation

  • cross_validate(glrm::GLRM, nfolds=5, params=Params(); verbose=false, use_folds=None, error_fn=objective, init=None): performs n-fold cross validation and returns average loss among all folds. More specifically, splits observations in glrm into nfolds groups, and builds new GLRMs, each with one group of observations left out. Fits each GLRM to the training set (the observations revealed to each GLRM) and returns the average loss on the test sets (the observations left out of each GLRM).

    Optional arguments:

    • use_folds: build use_folds new GLRMs instead of n_folds new GLRMs, each with 1/nfolds of the entries left out. (use_folds defaults to nfolds.)
    • error_fn: use a custom error function to evaluate the fit, rather than the objective. For example, one might use the imputation error by setting error_fn = error_metric.
    • init: initialize the fit using a particular procedure. For example, consider init=init_svd!. See Initialization for more options.
  • cv_by_iter(glrm::GLRM, holdout_proportion=.1, params=Params(1,1,.01,.01), niters=30; verbose=true): computes the test error and train error of the GLRM as it is trained. Splits the observations into a training set (1-holdout_proportion of the original observations) and a test set (holdout_proportion of the original observations). Performs params.maxiter iterations of the fitting algorithm on the training set niters times, and returns the test and train error as a function of iteration.

Regularization paths

  • regularization_path(glrm::GLRM; params=Params(), reg_params=exp10.(range(2,stop=-2,length=5)), holdout_proportion=.1, verbose=true, ch::ConvergenceHistory=ConvergenceHistory("reg_path")): computes the train and test error for GLRMs varying the scaling of the regularization through any scaling factor in the array reg_params.

Utilities

  • get_train_and_test(obs, m, n, holdout_proportion=.1): splits observations obs into a train and test set. m and n must be at least as large as the maximal value of the first or second elements of the tuples in observations, respectively. Returns observed_features and observed_examples for both train and test sets.

ScikitLearn

This library implements the ScikitLearn.jl interface. These models are available: SkGLRM, PCA, QPCA, NNMF, KMeans, RPCA. See their docstrings for more information (e.g. ?QPCA). All models support the ScikitLearnBase.fit! and ScikitLearnBase.transform interface. Examples:

## Apply PCA to the iris dataset
using LowRankModels
import ScikitLearnBase
using RDatasets    # may require Pkg.add("RDatasets")

A = convert(Matrix, dataset("datasets", "iris")[[:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]])
ScikitLearnBase.fit_transform!(PCA(k=3, max_iter=500), A)
## Fit K-Means to a fake dataset of two Gaussians
using LowRankModels
import ScikitLearnBase

# Generate two disjoint Gaussians with 100 and 50 points
gaussian1 = randn(100, 2) + 5
gaussian2 = randn(50, 2) - 10
# Merge them into a single dataset
A = vcat(gaussian1, gaussian2)

model = ScikitLearnBase.fit!(LowRankModels.KMeans(), A)
# Count how many points are assigned to each Gaussians (should be 100 and 50)
Set(sum(ScikitLearnBase.transform(model, A), 1))

See also this notebook demonstrating K-Means.

These models can be used inside a ScikitLearn pipeline, and every hyperparameter can be tuned with GridSearchCV.

Citing this package

If you use LowRankModels for published work, we encourage you to cite the software.

Use the following BibTeX citation:

@article{glrm,
    title = {Generalized Low Rank Models},
    author ={Madeleine Udell and Horn, Corinne and Zadeh, Reza and Boyd, Stephen},
    doi = {10.1561/2200000055},
    year = {2016},
    archivePrefix = "arXiv",
    eprint = {1410.0342},
    primaryClass = "stat-ml",
    journal = {Foundations and Trends in Machine Learning},
    number = {1},
    volume = {9},
    issn = {1935-8237},
    url = {http://dx.doi.org/10.1561/2200000055},
}
Comments
  • Penalties/regularization of imputed values

    Penalties/regularization of imputed values

    In many experimental contexts (RNA-sequencing is a good example), ground truth data that are below a detection threshold are not observed due to technical error. Thus, we have a data matrix A with many NA entries. However, we suspect many of these NA entries to be small, though not necessarily zero.

    For each observed entry A_ij we have a loss function: L[A_ij, dot(x_i,y_j)]

    Maybe for each unobserved entry we could add some regularization: R_a[dot(x_i,y_j)]

    Where R_a is a function defined by the user... Perhaps R_a[z] = z^2 or R_a[z] = abs(z)

    Aside: For the RNA sequencing application, something like R_a[z] = sqrt(z), z > 0 would be interesting (though not convex). Actually even something non-monotonic would be interesting for reasons I won't go into. These are probably too weird/specialized to include, but I would be curious.

    opened by ahwillia 14
  • ScikitLearn.jl wrap

    ScikitLearn.jl wrap

    get-params wasn't working out with the current type definition. I've decided to wrap GLRM instead, it's going to be more consistent with the rest of the ScikitLearn.jl algorithms, and allows me to expose more hyperparameters.

    I branched from dataframe-ux this time.

    PCA works fine, but I'm having issues with K-Means. I've translated this example, and instead of getting a nice Voronoi diagram, I get this:

    screen shot 2016-04-28 at 14 05 14

    Clearly not random, but not quite Voronoi either! To get this picture, I call transform(kmeans, X) where X is a two-column matrix containing all (x,y) coordinates. transform uses FixedLatentFeaturesConstraint. Is there any reason why this would be incompatible with K-Means? Its output doesn't look right, even with abs_tol=1.e-20 (it converges in ~20 iterations)

    2430x10 Array{Float64,2}:
      0.0       0.0        0.0      …  0.0      0.0      0.0      1.0     
      0.0       0.0        0.0         0.0      0.0      0.0      0.0     
      0.0       0.0        0.0         0.0      0.0      0.0      0.0     
     -0.570026  0.714604  -2.94676     2.10088  2.1646  -1.42798  0.191742
      0.0       0.0        0.0         0.0      0.0      0.0      0.0     
      0.0       0.0        0.0      …  0.0      0.0      0.0      0.0     
      0.0       0.0        0.0         0.0      0.0      0.0      0.0     
      0.0       0.0        0.0         0.0      0.0      0.0      0.0     
      0.0       0.0        0.0         0.0      0.0      0.0      1.0     
      0.0       0.0        0.0         0.0      0.0      0.0      0.0     
    ...
    

    (10 columns because there are 10 digits). Those 10 rows correspond to neighboring pixels. They should all be classified in the same cluster. But even ignoring the anomaly on row 4, the other rows are not consistent at all.

    It would be straight-forward to implement transform using the distance from each centroid, but if there's something wrong with transform, it's better to fix that instead. Any ideas?

    opened by cstjean 9
  • initial implementation of nndsvd

    initial implementation of nndsvd

    Here is a quick implementation of nndsvd.

    We should probably develop a few test cases to make sure this is working as expected before merging. But it seems to be behaving reasonably in my hands so far...

    Note: Sorry for the duplicate pull request. I'll delete the other one.

    opened by ahwillia 8
  • Working on randomly generated data returns 0 element Arrays for labels and glrm.ry

    Working on randomly generated data returns 0 element Arrays for labels and glrm.ry

    Hi again,

    I was trying to debug my analysis by working with a randomly generated dataset:

    A = rand(100,2)*rand(2,100) A = convert(DataFrame, A) glrm, labels = GLRM(A, 2)

    However the corresponding labels and glrm.ry objects are empty:

    julia> labels 0-element Array{Symbol,1}

    julia> glrm.ry 0-element Array{Regularizer,1}

    It would be great if you could help me understand why this is happening?

    Also could you suggest a base dataset that could be used to understand the different aspects of the LowRankModels code like the loss and regularizer options? I tried using the "psych" Dataset referred to in the README and this randomly generated matrix and ran into issues with them both.

    Thanks a lot, Nandana

    opened by NandanaSengupta 6
  • WIP: Support for the ScikitLearn API

    WIP: Support for the ScikitLearn API

    I followed your suggestion. The result is not the prettiest, but it works. If it looks reasonable, I'll add the remaining functions and support NMF/KMeans/etc.

    I'd like to understand why you copy Regularizer and Loss objects. Are they modified anywhere?

    opened by cstjean 5
  • Compute X given Y

    Compute X given Y

    Once a model has been fit to a matrix A, is there any way to fit it to another matrix holding Y constant? For example, if factor analysis is part of a pipeline that ends with an SVM classifier, the cross-validation code should learn the feature matrix Y on the training set, and compute the data matrix X on the test set, given Y.

    opened by cstjean 5
  • Change loss type names

    Change loss type names

    Changes all loss function types to have CamelCase names, preventing naming conflicts with StatsBase. See: https://github.com/madeleineudell/LowRankModels.jl/issues/41

    I can also rename all the regularization types to CamelCase if desired...

    opened by ahwillia 5
  • Regularizers for Soft K-means

    Regularizers for Soft K-means

    I've implemented some new regularizers that let you move between a soft and hard clustering. Here is a quick demo/explanation:

    http://nbviewer.ipython.org/github/ahwillia/notebooks/blob/master/code_pubs/2015_08_11_regularized_soft_kmeans.ipynb

    Is this too specialized to include in the repository? The code is pretty short so I've just pasted it below. Let me know if you want me to open a PR.

    ## indicator of vectors in the simplex, plus a penalty on Shannon entropy
    ## (intuition: soft k-means with encouraged sparseness)
    type entr_simplex<:Regularizer
        scale::Float64
    end
    function evaluate(r::entr_simplex,a::AbstractArray)
        evaluate(simplex(),a) == Inf && return Inf # simplex constraint
        b = length(a) # base for entropy calculation (normalizes: 0<=entropy<=1)
        return r.scale*entropy(a,b) # penalize entropy if constraint satisfied
    end
    function prox!(r::entr_simplex,u::AbstractArray,alpha::Number)
        prox!(simplex(),u,alpha) # first project onto unit simplex
        b = length(u)    # base for entropy calculation (normalizes: 0<=entropy<=1)
        g = -log(b,u)-1  # gradient of entropy
    
        # project entropy gradient onto simplex, ignoring Infs
        gs = g - mean(g[g.!=Inf]) 
        for i = 1:length(u)
            u[i] = (g[i]==Inf) ? 0.0 : u[i] - r.scale*alpha*gs[i]
        end
    
        # make sure we didn't step off the simplex before returning
        prox!(simplex(),u,alpha)
    end
    entr_simplex() = entr_simplex(1)
    
    ## indicator of vectors in the simplex, plus a penalty on
    ## the l1 distance to the center of the simplex
    ## (intuition: soft k-means with encouraged sparseness)
    type dist_simplex<:Regularizer
        scale::Float64
        k::Int # rank of model (store up front)
        d::Float64 # distance from corner to center (calculate up front)
    end
    function evaluate(r::dist_simplex,a::AbstractArray)
        evaluate(simplex(),a) == Inf && return Inf # simplex constraint
        dist = sum(abs(a-ones(r.k)/r.k)) # distance from center
        return r.scale*(1 - (dist/r.d))  # penalize dist from corners
    end
    function prox(r::dist_simplex,u::AbstractArray,alpha::Number)
        prox!(simplex(),u,alpha) # first project onto unit simplex
    
        # Calculate gradient
        shrink_step = -r.scale*alpha/(r.k-1)
        imax = indmax(u)
        for i = 1:length(u)
            if i == imax
                u[i] += r.scale*alpha
            else
                u[i] += shrink_step
            end
        end
    
        # make sure we didn't step off the simplex before returning
        prox!(simplex(),u,alpha)
    end
    function dist_simplex(s::Float64,k::Int)
        # k is the rank of the data
        e = zeros(k); e[1] = 1.0  # corner of simplex
        c = ones(k)/k             # center of simplex
        return dist_simplex(s,k,sum(abs(c-e)))
    end
    dist_simplex(k::Int) = dist_simplex(1.0,k)
    
    opened by ahwillia 5
  • Use Sparse Matrices

    Use Sparse Matrices

    I have a few applications that involve very large, very sparse (incomplete), matrices.

    At the moment it appears that the entire data matrix, A, is estimated/reconstructed from X'*Y on each gradient step (see here). However, if many entries of A are missing, then only a subset of the X'*Y entries need to be computed to evaluate the loss function.

    Would it make sense to make a new file sparse_proxgrad.jl in src/algorithms/ that is called when glrm.A is a sparse matrix?

    opened by ahwillia 5
  • @threads causes massive slowdown

    @threads causes massive slowdown

    I find, while running julia v1.2 and LowRankModels v1.0.1, that the @threads annotation here sometimes causes massive slowdowns (3x in execution time and CPU utilization very low). My problem size is m,n,k = 200,150,10 and nthreads() == 2. Removing the annotation makes everything much faster.

    opened by baggepinnen 4
  • One regularizer per column for X?

    One regularizer per column for X?

    I've noticed that while one can pass an array of regularizers for Y (i.e. one per column of Y), the same is not possible for X. Is there a reason for that? Is there any interest in generalizing the code in that direction? Maybe I could help!

    opened by amanoel 4
  • Compatibility and dependency updates

    Compatibility and dependency updates

    I tried to make the minimal changes here to make this package work with recent versions of Julia and package dependencies. Currently this package drastically downgrades DataFrames and probably other packages on installation. All tests seem to pass for me.

    opened by kshedden 1
  • Use a more recent version of DataFrames

    Use a more recent version of DataFrames

    DataFrames has now released their v1.0, but this repo still locks its dependency to some prior versions, which can lead to compat issues with other packages. It would be great to relax that dependency.

    opened by eperim 1
  • support `missing` in matrix rather than having to give `obs`

    support `missing` in matrix rather than having to give `obs`

    Since julia 1.0 has missing built in it would be nice to just naturally support it. Rather than having to pass in obs specifically.

    Example using pca.

    julia> data = map(x->rand()>0.2 ? x : missing, rand(5, 5)*rand(5,5))
    5×5 Matrix{Union{Missing, Float64}}:
     1.06069   1.08958   missing  1.67055   1.20225
     0.751405  1.1944   1.31671    missing  1.23701
     1.33475   1.62695   missing  2.05511   1.52327
     0.909519  1.18476  1.70125   1.93304   1.3857
      missing  1.07582  1.23387   1.57813   1.09896
    
    julia> pca(data, 3)
    ERROR: TypeError: non-boolean (Missing) used in boolean context
    Stacktrace:
     [1] GLRM(A::Matrix{Union{Missing, Float64}}, losses::Vector{Loss}, rx::Vector{Regularizer}, ry::Vector{Regularizer}, k::Int64; X::Matrix{Float64}, Y::Matrix{Float64}, obs::Nothing, observed_features::Vector{UnitRange{Int64}}, observed_examples::Vector{UnitRange{Int64}}, offset::Bool, scale::Bool, checknan::Bool, sparse_na::Bool)
       @ LowRankModels ~/JuliaEnvs/LowRankModels.jl/src/glrm.jl:66
     [2] GLRM(A::Matrix{Union{Missing, Float64}}, losses::Vector{Loss}, rx::Vector{Regularizer}, ry::Vector{Regularizer}, k::Int64)
       @ LowRankModels ~/JuliaEnvs/LowRankModels.jl/src/glrm.jl:38
     [3] #GLRM#172
       @ ~/JuliaEnvs/LowRankModels.jl/src/utilities/conveniencemethods.jl:48 [inlined]
     [4] GLRM
       @ ~/JuliaEnvs/LowRankModels.jl/src/utilities/conveniencemethods.jl:48 [inlined]
     [5] #pca#107
       @ ~/JuliaEnvs/LowRankModels.jl/src/simple_glrms.jl:8 [inlined]
     [6] pca(A::Matrix{Union{Missing, Float64}}, k::Int64)
       @ LowRankModels ~/JuliaEnvs/LowRankModels.jl/src/simple_glrms.jl:6
     [7] top-level scope
       @ REPL[34]:1
    

    It's not to go and list all the obs manually and the it works:

    julia> obs = [Tuple(ind) for ind in CartesianIndices(data) if !(data[ind] isa Missing)];
    
    julia> pca(data, 3; obs=obs)
    GLRM(Union{Missing, Float64}[1.0606880432831138 1.089579677823029 … 1.6705515670551134 1.2022467329602542; 0.7514051967634282 1.1944014349828096 … missing 1.2370061047120438; … ; 0.9095188136002703 1.1847633026389974 … 1.9330409876767132 1.38570067320635; missing 1.0758247860288004 … 1.5781294994670754 1.0989562029066084], Loss[QuadLoss(1.0, RealDomain()), QuadLoss(1.0, RealDomain()), QuadLoss(1.0, RealDomain()),
     QuadLoss(1.0, RealDomain()), QuadLoss(1.0, RealDomain())], Regularizer[ZeroReg(), ZeroReg(), ZeroReg(),
     ZeroReg(), ZeroReg()], Regularizer[ZeroReg(), ZeroReg(), ZeroReg(), ZeroReg(), ZeroReg()], 3, [[1, 2, 4
    , 5], [1, 2, 3, 5], [1, 2, 4, 5], [1, 2, 3, 4, 5], [2, 3, 4, 5]], [[1, 2, 3, 4], [1, 2, 3, 4, 5], [2, 4,
     5], [1, 3, 4, 5], [1, 2, 3, 4, 5]], [-0.6386482530950347 -1.018777729621554 … -1.9382553356751036 -1.053052636035794; -0.14517788192107056 -1.2458123989543037 … -2.2837186272581693 -1.5608357286351906; 0.2540552369948007 -1.4235247052609152 … 0.2976611081413319 0.7210816049116382], [0.4108564480213463 -1.0415581335449255 … 0.36006578018169627 0.11669723598708971; 0.05544466402953537 0.7168696680990544 … 0.19156334692424293 -0.48792765211129047; -0.8552253694850784 -1.5810995605873126 … -0.8580967626857172 -0.4419037908246037])
    

    but we should make it happen automatically.

    @jiahao said he has a fix for this.

    opened by oxinabox 0
  • fix NaNtoMissing

    fix NaNtoMissing

    running the tests was erroring with:

    ERROR: LoadError: LoadError: LoadError: type CategoricalValue has no field value
    Stacktrace:
      [1] getproperty(x::CategoricalValue{String, UInt8}, f::Symbol)
        @ Base ./Base.jl:33
      [2] isnan(x::CategoricalValue{String, UInt8})
        @ LowRankModels ~/JuliaEnvs/LowRankModels.jl/src/fit_dataframe.jl:275
      [3] (::LowRankModels.var"#147#148"{DataFrame, Int64})(::Tuple{Int64, CategoricalValue{String, UInt8}})
        @ LowRankModels ./none:0
      [4] iterate
        @ ./generator.jl:47 [inlined]
      [5] collect(itr::Base.Generator{Base.Iterators.Enumerate{CategoricalVector{String, UInt8, String, Cate
    goricalValue{String, UInt8}, Union{}}}, LowRankModels.var"#147#148"{DataFrame, Int64}})
        @ Base ./array.jl:678
      [6] NaNs_to_Missing!(df::DataFrame)
        @ LowRankModels ~/JuliaEnvs/LowRankModels.jl/src/fit_dataframe.jl:281
      [7] GLRM(df::DataFrame, k::Int64; losses::Vector{Loss}, rx::QuadReg, ry::QuadReg, offset::Bool, scale:
    :Bool, prob_scale::Bool, NaNs_to_Missing::Bool)
        @ LowRankModels ~/JuliaEnvs/LowRankModels.jl/src/fit_dataframe_w_type_imputation.jl:17
      [8] GLRM(df::DataFrame, k::Int64)
        @ LowRankModels ~/JuliaEnvs/LowRankModels.jl/src/fit_dataframe_w_type_imputation.jl:15
      [9] top-level scope
        @ ~/JuliaEnvs/LowRankModels.jl/examples/fit_rdataset.jl:9
     [10] include(fname::String)
        @ Base.MainInclude ./client.jl:444
     [11] top-level scope
        @ ~/JuliaEnvs/LowRankModels.jl/examples/runexamples.jl:17
     [12] include(fname::String)
        @ Base.MainInclude ./client.jl:444
     [13] top-level scope
        @ ~/JuliaEnvs/LowRankModels.jl/test/runtests.jl:34
     [14] include(fname::String)
        @ Base.MainInclude ./client.jl:444
     [15] top-level scope
        @ none:6
    in expression starting at /Users/oxinabox/JuliaEnvs/LowRankModels.jl/examples/fit_rdataset.jl:9
    in expression starting at /Users/oxinabox/JuliaEnvs/LowRankModels.jl/examples/runexamples.jl:17
    in expression starting at /Users/oxinabox/JuliaEnvs/LowRankModels.jl/test/runtests.jl:34
    ERROR: Package LowRankModels errored during testing
    

    Looking at the code for NaNtoMissing it was a bit funky. I made it less funky.

    Now tests pass without errors for me

    opened by oxinabox 1
  • Applying model to new data

    Applying model to new data

    Hi there, This looks a great package. I'm particularly interested in the ability to fit LRMs to datasets with missing data (or in my case, outliers that need to be masked). I have a quick question that may be pretty basic, but an answer would help me to apply the code to my own data. Apologies if I've missed something in the documentation. I'm also fairly new to Julia.

    If I fit a PCA model to a set of training data A (following your example):

    loss        = QuadLoss()
    r           = ZeroReg()
    n_comp      = 1
    glrm        = GLRM(A,loss,r,r,n_comp)
    X,Y,ch.     = fit!(glrm)
    

    how do I then apply the same model to a new set of data B? I would like to keep X fixed and obtain new values Y_b that give the best fit of X to B. That is, I would like to project the observations in B onto the PCA components found from A.

    There are other PCA packages in Julia that will do this (e.g., the reconstruct function in MultivariateStats), but they don't seem to be able to handle missing data or sparse arrays.

    Thanks in advance! Any help is appreciated!

    opened by gdbeck 2
Releases(v1.1.1)
Owner
Madeleine Udell
Madeleine Udell
BLEND: A Fast, Memory-Efficient, and Accurate Mechanism to Find Fuzzy Seed Matches

BLEND is a mechanism that can efficiently find fuzzy seed matches between sequences to significantly improve the performance and accuracy while reducing the memory space usage of two important applic

SAFARI Research Group at ETH Zurich and Carnegie Mellon University 19 Dec 26, 2022
Semi-supervised semantic segmentation needs strong, varied perturbations

Semi-supervised semantic segmentation using CutMix and Colour Augmentation Implementations of our papers: Semi-supervised semantic segmentation needs

146 Dec 20, 2022
A project to build an AI voice assistant using Python . The Voice assistant interacts with the humans to perform basic tasks.

AI_Personal_Voice_Assistant_Using_Python A project to build an AI voice assistant using Python . The Voice assistant interacts with the humans to perf

Chumui Tripura 1 Oct 30, 2021
chainladder - Property and Casualty Loss Reserving in Python

chainladder (python) chainladder - Property and Casualty Loss Reserving in Python This package gets inspiration from the popular R ChainLadder package

Casualty Actuarial Society 130 Dec 07, 2022
Repo for code associated with Modeling the Mitral Valve.

Project Title Mitral Valve Getting Started Repo for code associated with Modeling the Mitral Valve. See https://arxiv.org/abs/1902.00018 for preprint,

Alex Kaiser 1 May 17, 2022
Scripts of Machine Learning Algorithms from Scratch. Implementations of machine learning models and algorithms using nothing but NumPy with a focus on accessibility. Aims to cover everything from basic to advance.

Algo-ScriptML Python implementations of some of the fundamental Machine Learning models and algorithms from scratch. The goal of this project is not t

Algo Phantoms 81 Nov 26, 2022
PyTorch implementation of CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition

PyTorch implementation of CDistNet: Perceiving Multi-Domain Character Distance for Robust Text Recognition The unofficial code of CDistNet. Now, we ha

25 Jul 20, 2022
The official repository for our paper "The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization".

Codebase for learning control flow in transformers The official repository for our paper "The Neural Data Router: Adaptive Control Flow in Transformer

Csordás Róbert 24 Oct 15, 2022
Introducing neural networks to predict stock prices

IntroNeuralNetworks in Python: A Template Project IntroNeuralNetworks is a project that introduces neural networks and illustrates an example of how o

Vivek Palaniappan 637 Jan 04, 2023
Dynamic Bottleneck for Robust Self-Supervised Exploration

Dynamic Bottleneck Introduction This is a TensorFlow based implementation for our paper on "Dynamic Bottleneck for Robust Self-Supervised Exploration"

Bai Chenjia 4 Nov 14, 2022
TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation

TransFGU: A Top-down Approach to Fine-Grained Unsupervised Semantic Segmentation Zhaoyun Yin, Pichao Wang, Fan Wang, Xianzhe Xu, Hanling Zhang, Hao Li

DamoCV 25 Dec 16, 2022
High-Resolution 3D Human Digitization from A Single Image.

PIFuHD: Multi-Level Pixel-Aligned Implicit Function for High-Resolution 3D Human Digitization (CVPR 2020) News: [2020/06/15] Demo with Google Colab (i

Meta Research 8.4k Dec 29, 2022
Official implementation of ACMMM'20 paper 'Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework'

Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework Official code for paper, Self-supervised Video Representation Le

Li Tao 103 Dec 21, 2022
PyTorch code for Composing Partial Differential Equations with Physics-Aware Neural Networks

FInite volume Neural Network (FINN) This repository contains the PyTorch code for models, training, and testing, and Python code for data generation t

Cognitive Modeling 20 Dec 18, 2022
Cross-media Structured Common Space for Multimedia Event Extraction (ACL2020)

Cross-media Structured Common Space for Multimedia Event Extraction Table of Contents Overview Requirements Data Quickstart Citation Overview The code

Manling Li 49 Nov 21, 2022
MNE: Magnetoencephalography (MEG) and Electroencephalography (EEG) in Python

MNE-Python MNE-Python software is an open-source Python package for exploring, visualizing, and analyzing human neurophysiological data such as MEG, E

MNE tools for MEG and EEG data analysis 2.1k Dec 28, 2022
Tutorial for the PERFECTING FACTORY 5.0 WITH EDGE-POWERED AI workshop

Workshop Advantech Jetson Nano This tutorial has been designed for the PERFECTING FACTORY 5.0 WITH EDGE-POWERED AI workshop in collaboration with Adva

Edge Impulse 18 Nov 22, 2022
QAT(quantize aware training) for classification with MQBench

MQBench Quantization Aware Training with PyTorch I am using MQBench(Model Quantization Benchmark)(http://mqbench.tech/) to quantize the model for depl

Ling Zhang 29 Nov 18, 2022
[CVPR 2021] Involution: Inverting the Inherence of Convolution for Visual Recognition, a brand new neural operator

involution Official implementation of a neural operator as described in Involution: Inverting the Inherence of Convolution for Visual Recognition (CVP

Duo Li 1.3k Dec 28, 2022
Using Machine Learning to Create High-Res Fine Art

BIG.art: Using Machine Learning to Create High-Res Fine Art How to use GLIDE and BSRGAN to create ultra-high-resolution paintings with fine details By

Robert A. Gonsalves 13 Nov 27, 2022