LowRankModels.jl is a Julia package for modeling and fitting generalized low rank models.

Last update: Dec 17, 2022

Overview

LowRankModels.jl

LowRankModels.jl is a Julia package for modeling and fitting generalized low rank models (GLRMs). GLRMs model a data array by a low rank matrix, and include many well known models in data analysis, such as principal components analysis (PCA), matrix completion, robust PCA, nonnegative matrix factorization, k-means, and many more.

For more information on GLRMs, see our paper. There is a Python interface to this package, and a GLRM implementation in the H2O machine learning platform with interfaces in a variety of languages.

LowRankModels.jl makes it easy to mix and match loss functions and regularizers to construct a model suitable for a particular data set. In particular, it supports

using different loss functions for different columns of the data array, which is useful when data types are heterogeneous (e.g., real, boolean, and ordinal columns);
fitting the model to only some of the entries in the table, which is useful for data tables with many missing (unobserved) entries; and
adding offsets and scalings to the model without destroying sparsity, which is useful when the data is poorly scaled.

Installation

To install, just call

using Pkg; Pkg.add("LowRankModels")

at the Julia prompt.

Generalized Low Rank Models

GLRMs form a low rank model for tabular data A with m rows and n columns, which can be input as an array or any array-like object (for example, a data frame). It is fine if only some of the entries have been observed (i.e., the others are missing); the GLRM will only be fit on the !ismissing entries. The desired model is specified by choosing a rank k for the model, an array of loss functions losses, and two regularizers, rx and ry. The data is modeled as X'*Y, where X is a kxm matrix and Y is a kxn matrix. X and Y are found by solving the optimization problem

minimize sum_{(i,j) in obs} losses[j]((X'*Y)[i,j], A[i,j]) + sum_i rx(X[:,i]) + sum_j ry(Y[:,j])
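As a rough illustration of the objective above, the following sketch evaluates it for one special case: quadratic loss on every column and quadratic regularization on X and Y. The helper glrm_objective and the weights gamma_x and gamma_y are hypothetical names introduced here; this is not the package's own code.

```julia
# Illustrative objective for quadratic loss and quadratic regularization
# (hypothetical helper, not part of LowRankModels.jl).
# X is k×m, Y is k×n, obs is a list of (i, j) index tuples.
function glrm_objective(A, X, Y, obs; gamma_x = 1.0, gamma_y = 1.0)
    Z = X' * Y                                   # m×n low rank reconstruction
    loss = sum((Z[i, j] - A[i, j])^2 for (i, j) in obs)
    regx = gamma_x * sum(abs2, X)                # sum_i ||X[:, i]||^2
    regy = gamma_y * sum(abs2, Y)                # sum_j ||Y[:, j]||^2
    return loss + regx + regy
end

A = [1.0 2.0; 3.0 4.0]
X = [1.0 1.0]                                    # k = 1, m = 2
Y = [1.0 2.0]                                    # k = 1, n = 2
obs = [(1, 1), (1, 2), (2, 2)]                   # entry (2, 1) is unobserved
val = glrm_objective(A, X, Y, obs)
```

Only the entries listed in obs contribute to the loss term, while the regularizers act on every column of X and Y.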

The basic type used by LowRankModels.jl is the GLRM. To form a GLRM, the user specifies

the data A (any AbstractArray, such as an array, a sparse matrix, or a data frame)
the array of loss functions losses
the regularizers rx and ry
the rank k

The user may also specify

the observed entries obs
starting matrices X₀ and Y₀

obs is a list of tuples of the indices of the observed entries in the matrix, and may be omitted if all the entries in the matrix have been observed. If A is a sparse matrix, implicit zeros are interpreted as missing entries by default; see the discussion of sparse matrices below for more details. X₀ and Y₀ are initialization matrices that represent a starting guess for the optimization.

Losses and regularizers must be of type Loss and Regularizer, respectively, and may be chosen from a list of supported losses and regularizers, which include

Losses:

quadratic loss QuadLoss
hinge loss HingeLoss
logistic loss LogisticLoss
Poisson loss PoissonLoss
weighted hinge loss WeightedHingeLoss
l1 loss L1Loss
ordinal hinge loss OrdinalHingeLoss
periodic loss PeriodicLoss
multinomial categorical loss MultinomialLoss
multinomial ordinal (aka ordered logit) loss OrderedMultinomialLoss
bigger-vs-smaller loss BvSLoss (for ordinal data)
one-vs-all loss OvALoss (for categorical data)

The constructors for all the ordinal and categorical losses take as an argument the maximum (or both minimum and maximum) value the variable may take. Using the one-vs-all loss is equivalent to transforming a categorical value to a one-hot vector and using a binary loss on each entry in that vector. Using the bigger-vs-smaller loss is equivalent to transforming the ordinal value to a Boolean vector and using a binary loss on each entry in that vector. By default, the binary loss used is the logistic loss.
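The two encodings described above can be sketched in a few lines. The helper names onehot and bigger_vs_smaller are hypothetical, and the convention that entry l of the ordinal encoding answers "is the value bigger than l?" is an assumption made for illustration; the package's internal representation may differ.

```julia
# One-hot encoding of a categorical value a in 1:d (one-vs-all view).
onehot(a::Int, d::Int) = [a == l for l in 1:d]

# Boolean encoding of an ordinal value a in 1:d (bigger-vs-smaller view):
# entry l answers "is a bigger than l?" for l = 1, ..., d-1.
bigger_vs_smaller(a::Int, d::Int) = [a > l for l in 1:d-1]

c = onehot(3, 4)              # third of four categories
o = bigger_vs_smaller(3, 4)   # level 3 on a 4-level ordinal scale
```

A binary loss (logistic by default, per the text above) is then applied to each entry of the resulting Boolean vector.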

Regularizers:

quadratic regularization QuadReg
constrained squared euclidean norm QuadConstraint
l1 regularization OneReg
no regularization ZeroReg
nonnegative constraint NonNegConstraint (e.g., for nonnegative matrix factorization)
1-sparse constraint OneSparseConstraint (e.g., for orthogonal NNMF)
unit 1-sparse constraint UnitOneSparseConstraint (e.g., for k-means)
simplex constraint SimplexConstraint
l1 regularization, combined with nonnegative constraint NonNegOneReg
fix features at values y0 FixedLatentFeaturesConstraint(y0)

Each of these losses and regularizers can be scaled (for example, to increase the importance of the loss relative to the regularizer) by calling mul!(loss, newscale). Users may also implement their own losses and regularizers, or adjust internal parameters of the losses and regularizers; see losses.jl and regularizers.jl for more details.

Example

For example, the following code forms a k-means model with k=5 on the 100x100 matrix A:

using LowRankModels
m, n, k = 100, 100, 5
losses = QuadLoss() # minimize squared distance to cluster centroids
rx = UnitOneSparseConstraint() # each row is assigned to exactly one cluster
ry = ZeroReg() # no regularization on the cluster centroids
glrm = GLRM(A, losses, rx, ry, k)

To fit the model, call

X, Y, ch = fit!(glrm)

which runs an alternating directions proximal gradient method on glrm to find the X and Y minimizing the objective function. (ch gives the convergence history; see Technical details below for more information.)

The losses argument can also be an array of loss functions, with one for each column (in order). For example, for a data set with 3 columns, you could use

losses = Loss[QuadLoss(), LogisticLoss(), HingeLoss()]

Similarly, the ry argument can be an array of regularizers, with one for each column (in order). For example, for a data set with 3 columns, you could use

ry = Regularizer[QuadReg(1), QuadReg(10), FixedLatentFeaturesConstraint([1.,2.,3.])]

This regularizes the first two columns of Y with ||Y[:,1]||^2 + 10||Y[:,2]||^2 and constrains the third (and last) column of Y to be equal to [1,2,3].

More examples here.

Missing data

If not all entries are present in your data table, just tell the GLRM which observations to fit the model to by listing tuples of their indices in obs, e.g., if obs=[(1,2), (5,3)], exactly two entries have been observed. Then initialize the model using

GLRM(A, losses, rx, ry, k, obs=obs)

If A is a DataFrame and you just want the model to ignore any entry that is missing, you can use

obs = observations(A)
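For a plain matrix that stores missing values directly, the list of observed index tuples can be built by hand. This is a minimal sketch using only base Julia; it makes no claim about how observations(A) is implemented.

```julia
A = [1.0 missing 3.0;
     missing 5.0 6.0]

# Collect the (row, column) indices of the entries that are present.
obs = [(i, j) for j in axes(A, 2) for i in axes(A, 1) if !ismissing(A[i, j])]
```

The resulting obs has the form described above (a list of index tuples) and can be passed through the obs keyword.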

Standard low rank models

Low rank models can easily be used to fit standard models such as PCA, k-means, and nonnegative matrix factorization. The following functions are available:

pca: principal components analysis
qpca: quadratically regularized principal components analysis
rpca: robust principal components analysis
nnmf: nonnegative matrix factorization
kmeans: k-means

See the code for usage. Any keyword argument valid for a GLRM object, such as an initial value for X or Y or a list of observations, can also be used with these standard low rank models.

Scaling and offsets

If you choose, LowRankModels.jl can add an offset to your model and scale the loss functions and regularizers so all columns have the same pull in the model. Simply call

glrm = GLRM(A, losses, rx, ry, k, offset=true, scale=true)

This transformation generalizes standardization, a common preprocessing technique applied before PCA. (For more about offsets and scaling, see the code or the paper.)
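For reference, classical standardization (the special case that the offset and scaling generalize) can be sketched as follows. It uses only the Statistics standard library and is not the package's scaling code.

```julia
using Statistics

A = [1.0 10.0; 2.0 20.0; 3.0 60.0]
mu = mean(A, dims = 1)            # per-column offsets
sigma = std(A, dims = 1)          # per-column scales
Astd = (A .- mu) ./ sigma         # each column now has mean 0 and std 1
```

After this transformation every column exerts a comparable pull on a quadratic loss, which is the effect the offset and scale options aim for with general losses.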

You can also add offsets and scalings to previously unscaled models:

Add an offset to the model (by applying no regularization to the last row of the matrix Y, and enforcing that the last column of X be all 1s) using

add_offset!(glrm)

Scale the loss functions and regularizers by calling

equilibrate_variance!(glrm)

Scale only the columns associated with QuadLoss or HuberLoss loss functions by calling

prob_scale!(glrm)

Fitting DataFrames

Perhaps all this sounds like too much work. Perhaps you happen to have a DataFrame df lying around that you'd like a low rank (e.g., k=2) model for. For example,

import RDatasets
df = RDatasets.dataset("psych", "msq")

Never fear! Just call

glrm, labels = GLRM(df, k)
X, Y, ch = fit!(glrm)

This will fit a GLRM with rank k to your data, using QuadLoss for real-valued columns, HingeLoss for Boolean columns, and OrdinalHingeLoss for integer columns, with a small amount of quadratic regularization (QuadReg), and with scaling and an offset added to the model as described in Scaling and offsets above. It returns the column labels for the columns it fit, along with the model. Right now, all other data types are ignored. NaN values are treated as missing values (missings) and ignored in the fit.

The full call signature is

function GLRM(df::DataFrame, k::Int;
              losses = Loss[], rx = QuadReg(.01), ry = QuadReg(.01),
              offset = true, scale = false,
              prob_scale = true, NaNs_to_NAs = true)

You can modify the losses or regularizers, or turn off offsets or scaling, using these keyword arguments.

Or to specify a map from data types to losses, define a new loss_map from datatypes to losses (like probabilistic_losses, below):

probabilistic_losses = Dict{Symbol, Any}(
    :real        => QuadLoss,
    :bool        => LogisticLoss,
    :ord         => MultinomialOrdinalLoss,
    :cat         => MultinomialLoss
)

and input an array of datatypes (one for each column of your data frame): GLRM(A, k, datatypes; loss_map = loss_map). The full call signature is

function GLRM(df::DataFrame, k::Int, datatypes::Array{Symbol,1};
              loss_map = probabilistic_losses,
              rx = QuadReg(.01), ry = QuadReg(.01),
              offset = true, scale = false, prob_scale = true,
              transform_data_to_numbers = true, NaNs_to_NAs = true)

You can modify the losses or regularizers, or turn off offsets or scaling, using these keyword arguments.

To fit a data frame with categorical values, you can use the function expand_categoricals! to turn categorical columns into a Boolean column for each level of the categorical variable. For example, expand_categoricals!(df, [:gender]) will replace the gender column with a column corresponding to gender=male, a column corresponding to gender=female, and other columns corresponding to labels outside the gender binary, if they appear in the data set.

You can use the model to get some intuition for the data set. For example, try plotting the columns of Y with the labels; you might see that similar features are close to each other!

Fitting Sparse Matrices

If you have a very large, sparsely observed dataset, then you may want to encode your data as a sparse matrix. By default, LowRankModels interprets the sparse entries of a sparse matrix as missing entries (i.e. NA values). There is no need to pass the indices of observed entries (obs) -- this is done automatically when GLRM(A::SparseMatrixCSC,...) is called. In addition, calling fit!(glrm) when glrm.A is a sparse matrix will use the sparse variant of the proximal gradient descent algorithm, fit!(glrm, SparseProxGradParams(); kwargs...).

If, instead, you'd like to interpret the sparse entries as zeros, rather than missing or NA entries, use:

glrm = GLRM(...; sparse_na=false)

In this case, the dataset is dense in terms of observations, but sparse in terms of nonzero values. Thus, it may make more sense to fit the model with the vanilla proximal gradient descent algorithm, fit!(glrm, ProxGradParams(); kwargs...).
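The difference between the two readings of a sparse matrix can be illustrated with the SparseArrays standard library. This sketch only shows which entries count as observed under each interpretation; the variable names are illustrative and it does not call the package.

```julia
using SparseArrays

A = sparse([1, 2, 3], [1, 3, 2], [4.0, 5.0, 6.0], 3, 3)

# Stored entries of the sparse matrix, read as the observed entries.
rows, cols, vals = findnz(A)
obs_sparse = collect(zip(rows, cols))       # 3 observations out of 9 entries

# Reading implicit zeros as observed zeros gives a dense observation set.
obs_dense = vec([(i, j) for i in 1:size(A, 1), j in 1:size(A, 2)])
```

With the first reading the work per iteration scales with the number of stored entries; with the second it scales with the full size of the matrix.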

Parallel fitting (experimental)

LowRankModels makes use of Julia v0.5's new multithreading functionality to fit models in parallel. To fit a LowRankModel in parallel using multithreading, simply set the number of threads from the command line before starting Julia: e.g.,

export JULIA_NUM_THREADS=4
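To confirm that the setting took effect, you can check the thread count from inside the Julia session. Threads.nthreads() is part of base Julia.

```julia
# Run inside Julia after launching it with JULIA_NUM_THREADS set.
n = Threads.nthreads()
println("Julia is running with ", n, " thread(s)")
```

If this prints 1, the environment variable was not visible to the Julia process when it started.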

Technical details

Optimization

The function fit! uses an alternating directions proximal gradient method to minimize the objective. This method is not guaranteed to converge to the optimum, or even to a local minimum. If your code is not converging or is converging to a model you dislike, there are a number of parameters you can tweak.
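To make the idea concrete, here is a simplified sketch of alternating proximal gradient steps for one special case: fully observed data, quadratic loss, and a nonnegativity constraint on both factors. The function names are hypothetical and the fixed 1/L step size is an assumption chosen for simplicity; the package's fit! uses its own step size rules and supports general losses and regularizers.

```julia
using LinearAlgebra, Random

# Objective for the fully observed, quadratic-loss case.
objective(A, X, Y) = sum(abs2, X' * Y - A)

# Prox of a nonnegativity constraint: projection onto the nonnegative orthant.
prox_nonneg(U) = max.(U, 0.0)

# One alternating pass: a proximal gradient step on X, then one on Y.
# Each step size is 1/L, with L the Lipschitz constant of that partial gradient.
function alternating_step(A, X, Y)
    Lx = 2 * max(opnorm(Y)^2, eps())
    X = prox_nonneg(X - (2 * Y * (X' * Y - A)') / Lx)
    Ly = 2 * max(opnorm(X)^2, eps())
    Y = prox_nonneg(Y - (2 * X * (X' * Y - A)) / Ly)
    return X, Y
end

# Repeat the alternating pass and record the objective after each one.
function run_fit(A, X, Y; iters = 50)
    history = [objective(A, X, Y)]
    for _ in 1:iters
        X, Y = alternating_step(A, X, Y)
        push!(history, objective(A, X, Y))
    end
    return X, Y, history
end

Random.seed!(1)
A = rand(6, 2) * rand(2, 5)                 # nonnegative rank-2 data
X, Y, history = run_fit(A, rand(2, 6), rand(2, 5))
```

With these step sizes the recorded objective never increases, but as noted above the iterates may still stall at a point that is not a global optimum, which is why the starting point matters.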

Warm start

The algorithm starts with glrm.X and glrm.Y as the initial estimates for X and Y. If these are not given explicitly, they will be initialized randomly. If you have a good guess for a model, try setting them explicitly. If you think that you're getting stuck in a local minimum, try reinitializing your GLRM (so as to construct a new initial random point) and see if the model you obtain improves.

The function fit! sets the fields glrm.X and glrm.Y after fitting the model. This is particularly useful if you want to use the model you generate as a warm start for further iterations. If you prefer to preserve the original glrm.X and glrm.Y (e.g., for cross validation), you should call the function fit, which does not mutate its arguments.

You can even start with an easy-to-optimize loss function, run fit!, change the loss function (glrm.losses = newlosses), and keep going from your warm start by calling fit! again to fit the new loss functions.

Initialization

If you don't have a good guess at a warm start for your model, you might try one of the initializations provided in LowRankModels.

init_svd! initializes the model as the truncated SVD of the matrix of observed entries, with unobserved entries filled in with zeros. This initialization is known to result in provably good solutions for a number of "PCA-like" problems. See our paper for details.
init_kmeanspp! initializes the model using a modification of the kmeans++ algorithm for data sets with missing entries; see our paper for details. This works well for fitting clustering models, and may help in achieving better fits for nonnegative matrix factorization problems as well.
init_nndsvd! initializes the model using a modification of the NNDSVD algorithm as implemented by the NMF package. This modification handles data sets with missing entries by replacing missing entries with zeros. Optionally, by setting the argument max_iters=n with n>0, it will iteratively replace missing entries by their values as imputed by the NNDSVD, and call NNDSVD again on the new matrix. (This procedure is similar to the soft impute method of Mazumder, Hastie and Tibshirani for matrix completion.)
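The idea behind the SVD initialization can be sketched with the LinearAlgebra standard library: fill unobserved entries with zeros, take a truncated SVD, and split it into the two factors. The helper svd_start is hypothetical, and splitting the singular values evenly between X and Y is an assumption; init_svd! may distribute them differently.

```julia
using LinearAlgebra

# Truncated-SVD starting point with unobserved entries set to zero.
# Returns X (k×m) and Y (k×n) such that X' * Y is the best rank-k
# approximation of the zero-filled matrix.
function svd_start(A, obs, k)
    Z = zeros(size(A))
    for (i, j) in obs
        Z[i, j] = A[i, j]
    end
    F = svd(Z)
    s = sqrt.(F.S[1:k])
    X = Matrix((F.U[:, 1:k] .* s')')    # k×m, columns of U scaled by sqrt of singular values
    Y = s .* F.Vt[1:k, :]               # k×n, rows of Vt scaled the same way
    return X, Y
end

A = [1.0 2.0 3.0; 2.0 4.0 6.0; 1.0 0.0 1.0]   # a rank-2 matrix
obs = [(i, j) for j in 1:3 for i in 1:3]      # all entries observed
X, Y = svd_start(A, obs, 2)
```

When every entry is observed and k equals the rank of A, X' * Y reproduces A up to rounding; with missing entries it is only a starting guess.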

Parameters

As mentioned earlier, LowRankModels uses alternating proximal gradient descent to derive estimates of X and Y. This can be done by two slightly different procedures: (A) compute the full reconstruction, X' * Y, to compute the gradient and objective function; (B) only compute the model estimate for entries of A that are observed. The first method is likely preferred when there are few missing entries for A because of hardware level optimizations (e.g. chunking the operations so they just fit in various caches). The second method is likely preferred when there are many missing entries of A.

To fit with the first (dense) method:

fit!(glrm, ProxGradParams(); kwargs...)

To fit with the second (sparse) method:

fit!(glrm, SparseProxGradParams(); kwargs...)

The first method is used by default if glrm.A is a standard matrix/array. The second method is used by default if glrm.A is a SparseMatrixCSC.

ProxGradParams() and SparseProxGradParams() run these respective methods with the default parameters:

stepsize: The step size controls the speed of convergence. Small step sizes will slow convergence, while large ones will cause divergence. stepsize should be of order 1.
abs_tol: The algorithm stops when the decrease in the objective per iteration is less than abs_tol*length(obs).
rel_tol: The algorithm stops when the decrease in the objective per iteration is less than rel_tol.
max_iter: The algorithm also stops if the maximum number of rounds max_iter has been reached.
min_stepsize: The algorithm also stops if stepsize decreases below this limit.
inner_iter: specifies how many proximal gradient steps to take on X before moving on to Y (and vice versa).

The default parameters are: ProxGradParams(stepsize=1.0;max_iter=100,inner_iter=1,abs_tol=0.00001,rel_tol=0.0001,min_stepsize=0.01*stepsize)

Convergence

ch gives the convergence history so that the success of the optimization can be monitored; ch.objective stores the objective values, and ch.times captures the times these objective values were achieved. Try plotting this to see if you just need to increase max_iter to converge to a better model.

Imputation

After fitting a GLRM, you can use it to impute values of A in four different ways:

impute(glrm) gives the maximum likelihood estimates for each entry
impute_missing(glrm) imputes missing entries and leaves observed entries unchanged
sample(glrm) gives a draw from the posterior distribution, conditioned on the fit values of X and Y, for each entry
sample_missing(glrm) samples missing entries and leaves observed entries unchanged
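For intuition, the "impute missing, keep observed" behavior can be sketched for the quadratic-loss case, where the value minimizing the loss is the reconstruction itself. The helper fill_missing is a hypothetical name and this is not the package's impute_missing.

```julia
# Fill in unobserved entries from the low rank reconstruction X' * Y,
# leaving observed entries untouched (quadratic-loss case only).
function fill_missing(A, X, Y)
    Z = X' * Y
    return [ismissing(A[i, j]) ? Z[i, j] : A[i, j]
            for i in axes(A, 1), j in axes(A, 2)]
end

A = [1.0 missing; 3.0 4.0]
X = [1.0 2.0]                 # k = 1, m = 2
Y = [1.0 2.0]                 # k = 1, n = 2
Afilled = fill_missing(A, X, Y)
```

For other losses the imputed value is generally a function of the reconstruction rather than the reconstruction itself (for example, a category label for a categorical loss).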

Cross validation

A number of useful functions are available to help you check whether a given low rank model overfits to the training data set. These functions should help you choose adequate regularization for your model.

Cross validation

cross_validate(glrm::GLRM, nfolds=5, params=Params(); verbose=false, use_folds=None, error_fn=objective, init=None): performs n-fold cross validation and returns average loss among all folds. More specifically, splits observations in glrm into nfolds groups, and builds new GLRMs, each with one group of observations left out. Fits each GLRM to the training set (the observations revealed to each GLRM) and returns the average loss on the test sets (the observations left out of each GLRM).

Optional arguments:
- use_folds: build use_folds new GLRMs instead of nfolds new GLRMs, each with 1/nfolds of the entries left out. (use_folds defaults to nfolds.)
- error_fn: use a custom error function to evaluate the fit, rather than the objective. For example, one might use the imputation error by setting error_fn = error_metric.
- init: initialize the fit using a particular procedure. For example, consider init=init_svd!. See Initialization for more options.

cv_by_iter(glrm::GLRM, holdout_proportion=.1, params=Params(1,1,.01,.01), niters=30; verbose=true): computes the test error and train error of the GLRM as it is trained. Splits the observations into a training set (1-holdout_proportion of the original observations) and a test set (holdout_proportion of the original observations). Performs params.maxiter iterations of the fitting algorithm on the training set niters times, and returns the test and train error as a function of iteration.
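The splitting step of n-fold cross validation can be sketched as follows. The helper make_folds is hypothetical and only illustrates how a list of observed index tuples might be partitioned; it is not the code used by cross_validate.

```julia
using Random

# Split a list of observed index tuples into nfolds roughly equal groups.
function make_folds(obs, nfolds, rng)
    shuffled = shuffle(rng, obs)
    return [shuffled[f:nfolds:end] for f in 1:nfolds]
end

obs = [(i, j) for j in 1:4 for i in 1:5]          # 20 observed entries
folds = make_folds(obs, 5, MersenneTwister(0))

for f in 1:5
    test = folds[f]                                            # held-out entries
    train = reduce(vcat, [folds[g] for g in 1:5 if g != f])    # entries revealed to the model
    # fit a model on `train`, then evaluate the chosen error on `test`
end
```

Each observation lands in exactly one test fold, so averaging the test error over the folds uses every observed entry once.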

Regularization paths

regularization_path(glrm::GLRM; params=Params(), reg_params=exp10.(range(2,stop=-2,length=5)), holdout_proportion=.1, verbose=true, ch::ConvergenceHistory=ConvergenceHistory("reg_path")): computes the train and test error for GLRMs varying the scaling of the regularization through any scaling factor in the array reg_params.
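The default reg_params expression is a logarithmically spaced grid, which can be inspected directly in base Julia:

```julia
reg_params = exp10.(range(2, stop = -2, length = 5))
# five scalings, running from strong (100) down to weak (0.01) regularization
```

Ordering the grid from strong to weak regularization means each fit can serve as a warm start for the next, less regularized one.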

Utilities

get_train_and_test(obs, m, n, holdout_proportion=.1): splits observations obs into a train and test set. m and n must be at least as large as the maximal value of the first or second elements of the tuples in observations, respectively. Returns observed_features and observed_examples for both train and test sets.

ScikitLearn

This library implements the ScikitLearn.jl interface. These models are available: SkGLRM, PCA, QPCA, NNMF, KMeans, RPCA. See their docstrings for more information (e.g. ?QPCA). All models support the ScikitLearnBase.fit! and ScikitLearnBase.transform interface. Examples:

## Apply PCA to the iris dataset
using LowRankModels
import ScikitLearnBase
using RDatasets    # may require Pkg.add("RDatasets")

A = Matrix(dataset("datasets", "iris")[:, [:SepalLength, :SepalWidth, :PetalLength, :PetalWidth]])
ScikitLearnBase.fit_transform!(PCA(k=3, max_iter=500), A)

## Fit K-Means to a fake dataset of two Gaussians
using LowRankModels
import ScikitLearnBase

# Generate two disjoint Gaussians with 100 and 50 points
gaussian1 = randn(100, 2) .+ 5
gaussian2 = randn(50, 2) .- 10
# Merge them into a single dataset
A = vcat(gaussian1, gaussian2)

model = ScikitLearnBase.fit!(LowRankModels.KMeans(), A)
# Count how many points are assigned to each Gaussians (should be 100 and 50)
Set(sum(ScikitLearnBase.transform(model, A), dims=1))

These models can be used inside a ScikitLearn pipeline, and every hyperparameter can be tuned with GridSearchCV.

Citing this package

If you use LowRankModels for published work, we encourage you to cite the software.

Use the following BibTeX citation:

@article{glrm,
    title = {Generalized Low Rank Models},
    author = {Udell, Madeleine and Horn, Corinne and Zadeh, Reza and Boyd, Stephen},
    doi = {10.1561/2200000055},
    year = {2016},
    archivePrefix = "arXiv",
    eprint = {1410.0342},
    primaryClass = "stat-ml",
    journal = {Foundations and Trends in Machine Learning},
    number = {1},
    volume = {9},
    issn = {1935-8237},
    url = {http://dx.doi.org/10.1561/2200000055},
}

Comments

Penalties/regularization of imputed values

In many experimental contexts (RNA-sequencing is a good example), ground truth data that are below a detection threshold are not observed due to technical error. Thus, we have a data matrix A with many NA entries. However, we suspect many of these NA entries to be small, though not necessarily zero.

For each observed entry A_ij we have a loss function: L[A_ij, dot(x_i,y_j)]

Maybe for each unobserved entry we could add some regularization: R_a[dot(x_i,y_j)]

Where R_a is a function defined by the user... Perhaps R_a[z] = z^2 or R_a[z] = abs(z)

Aside: For the RNA sequencing application, something like R_a[z] = sqrt(z), z > 0 would be interesting (though not convex). Actually even something non-monotonic would be interesting for reasons I won't go into. These are probably too weird/specialized to include, but I would be curious.

opened by ahwillia 14

ScikitLearn.jl wrap
get-params wasn't working out with the current type definition. I've decided to wrap GLRM instead, it's going to be more consistent with the rest of the ScikitLearn.jl algorithms, and allows me to expose more hyperparameters.

I branched from dataframe-ux this time.

PCA works fine, but I'm having issues with K-Means. I've translated this example, and instead of getting a nice Voronoi diagram, I get this:

Clearly not random, but not quite Voronoi either! To get this picture, I call transform(kmeans, X) where X is a two-column matrix containing all (x,y) coordinates. transform uses FixedLatentFeaturesConstraint. Is there any reason why this would be incompatible with K-Means? Its output doesn't look right, even with abs_tol=1.e-20 (it converges in ~20 iterations)

2430x10 Array{Float64,2}:
  0.0       0.0        0.0      …  0.0      0.0      0.0      1.0
  0.0       0.0        0.0         0.0      0.0      0.0      0.0
  0.0       0.0        0.0         0.0      0.0      0.0      0.0
 -0.570026  0.714604  -2.94676     2.10088  2.1646  -1.42798  0.191742
  0.0       0.0        0.0         0.0      0.0      0.0      0.0
  0.0       0.0        0.0      …  0.0      0.0      0.0      0.0
  0.0       0.0        0.0         0.0      0.0      0.0      0.0
  0.0       0.0        0.0         0.0      0.0      0.0      0.0
  0.0       0.0        0.0         0.0      0.0      0.0      1.0
  0.0       0.0        0.0         0.0      0.0      0.0      0.0
  ...

(10 columns because there are 10 digits). Those 10 rows correspond to neighboring pixels. They should all be classified in the same cluster. But even ignoring the anomaly on row 4, the other rows are not consistent at all.

It would be straight-forward to implement transform using the distance from each centroid, but if there's something wrong with transform, it's better to fix that instead. Any ideas?
opened by cstjean 9

initial implementation of nndsvd

Here is a quick implementation of nndsvd.

We should probably develop a few test cases to make sure this is working as expected before merging. But it seems to be behaving reasonably in my hands so far...

Note: Sorry for the duplicate pull request. I'll delete the other one.

opened by ahwillia 8

Working on randomly generated data returns 0 element Arrays for labels and glrm.ry

Hi again,

I was trying to debug my analysis by working with a randomly generated dataset:

A = rand(100,2)*rand(2,100)
A = convert(DataFrame, A)
glrm, labels = GLRM(A, 2)

However the corresponding labels and glrm.ry objects are empty:

julia> labels
0-element Array{Symbol,1}

julia> glrm.ry
0-element Array{Regularizer,1}

It would be great if you could help me understand why this is happening?

Also could you suggest a base dataset that could be used to understand the different aspects of the LowRankModels code like the loss and regularizer options? I tried using the "psych" Dataset referred to in the README and this randomly generated matrix and ran into issues with them both.

Thanks a lot, Nandana

opened by NandanaSengupta 6

WIP: Support for the ScikitLearn API

I followed your suggestion. The result is not the prettiest, but it works. If it looks reasonable, I'll add the remaining functions and support NMF/KMeans/etc.

I'd like to understand why you copy Regularizer and Loss objects. Are they modified anywhere?

opened by cstjean 5

Compute X given Y

Once a model has been fit to a matrix A, is there any way to fit it to another matrix holding Y constant? For example, if factor analysis is part of a pipeline that ends with an SVM classifier, the cross-validation code should learn the feature matrix Y on the training set, and compute the data matrix X on the test set, given Y.

opened by cstjean 5

Change loss type names

Changes all loss function types to have CamelCase names, preventing naming conflicts with StatsBase. See: https://github.com/madeleineudell/LowRankModels.jl/issues/41

I can also rename all the regularization types to CamelCase if desired...

opened by ahwillia 5

Regularizers for Soft K-means

I've implemented some new regularizers that let you move between a soft and hard clustering. Here is a quick demo/explanation:

http://nbviewer.ipython.org/github/ahwillia/notebooks/blob/master/code_pubs/2015_08_11_regularized_soft_kmeans.ipynb

Is this too specialized to include in the repository? The code is pretty short so I've just pasted it below. Let me know if you want me to open a PR.

## indicator of vectors in the simplex, plus a penalty on Shannon entropy
## (intuition: soft k-means with encouraged sparseness)
type entr_simplex<:Regularizer
    scale::Float64
end
function evaluate(r::entr_simplex,a::AbstractArray)
    evaluate(simplex(),a) == Inf && return Inf # simplex constraint
    b = length(a) # base for entropy calculation (normalizes: 0<=entropy<=1)
    return r.scale*entropy(a,b) # penalize entropy if constraint satisfied
end
function prox!(r::entr_simplex,u::AbstractArray,alpha::Number)
    prox!(simplex(),u,alpha) # first project onto unit simplex
    b = length(u)    # base for entropy calculation (normalizes: 0<=entropy<=1)
    g = -log(b,u)-1  # gradient of entropy

    # project entropy gradient onto simplex, ignoring Infs
    gs = g - mean(g[g.!=Inf]) 
    for i = 1:length(u)
        u[i] = (g[i]==Inf) ? 0.0 : u[i] - r.scale*alpha*gs[i]
    end

    # make sure we didn't step off the simplex before returning
    prox!(simplex(),u,alpha)
end
entr_simplex() = entr_simplex(1)

## indicator of vectors in the simplex, plus a penalty on
## the l1 distance to the center of the simplex
## (intuition: soft k-means with encouraged sparseness)
type dist_simplex<:Regularizer
    scale::Float64
    k::Int # rank of model (store up front)
    d::Float64 # distance from corner to center (calculate up front)
end
function evaluate(r::dist_simplex,a::AbstractArray)
    evaluate(simplex(),a) == Inf && return Inf # simplex constraint
    dist = sum(abs(a-ones(r.k)/r.k)) # distance from center
    return r.scale*(1 - (dist/r.d))  # penalize dist from corners
end
function prox!(r::dist_simplex,u::AbstractArray,alpha::Number)
    prox!(simplex(),u,alpha) # first project onto unit simplex

    # Calculate gradient
    shrink_step = -r.scale*alpha/(r.k-1)
    imax = indmax(u)
    for i = 1:length(u)
        if i == imax
            u[i] += r.scale*alpha
        else
            u[i] += shrink_step
        end
    end

    # make sure we didn't step off the simplex before returning
    prox!(simplex(),u,alpha)
end
function dist_simplex(s::Float64,k::Int)
    # k is the rank of the data
    e = zeros(k); e[1] = 1.0  # corner of simplex
    c = ones(k)/k             # center of simplex
    return dist_simplex(s,k,sum(abs(c-e)))
end
dist_simplex(k::Int) = dist_simplex(1.0,k)

opened by ahwillia 5

Use Sparse Matrices

I have a few applications that involve very large, very sparse (incomplete), matrices.

At the moment it appears that the entire data matrix, A, is estimated/reconstructed from X'*Y on each gradient step (see here). However, if many entries of A are missing, then only a subset of the X'*Y entries need to be computed to evaluate the loss function.

Would it make sense to make a new file sparse_proxgrad.jl in src/algorithms/ that is called when glrm.A is a sparse matrix?

opened by ahwillia 5

@threads causes massive slowdown

I find, while running Julia v1.2 and LowRankModels v1.0.1, that the @threads annotation here sometimes causes massive slowdowns (3x in execution time, with very low CPU utilization). My problem size is m,n,k = 200,150,10 and nthreads() == 2. Removing the annotation makes everything much faster.

opened by baggepinnen 4

One regularizer per column for X?

I've noticed that while one can pass an array of regularizers for Y (i.e. one per column of Y), the same is not possible for X. Is there a reason for that? Is there any interest in generalizing the code in that direction? Maybe I could help!

opened by amanoel 4

Compatibility and dependency updates

I tried to make the minimal changes here to make this package work with recent versions of Julia and package dependencies. Currently this package drastically downgrades DataFrames and probably other packages on installation. All tests seem to pass for me.

opened by kshedden 1

Use a more recent version of DataFrames

DataFrames has now released their v1.0, but this repo still locks its dependency to some prior versions, which can lead to compat issues with other packages. It would be great to relax that dependency.

opened by eperim 1

support `missing` in matrix rather than having to give `obs`

Since Julia 1.0 has missing built in, it would be nice to support it naturally, rather than having to pass in obs explicitly.

Example using pca.

julia> data = map(x->rand()>0.2 ? x : missing, rand(5, 5)*rand(5,5))
5×5 Matrix{Union{Missing, Float64}}:
 1.06069   1.08958   missing  1.67055   1.20225
 0.751405  1.1944   1.31671    missing  1.23701
 1.33475   1.62695   missing  2.05511   1.52327
 0.909519  1.18476  1.70125   1.93304   1.3857
  missing  1.07582  1.23387   1.57813   1.09896

julia> pca(data, 3)
ERROR: TypeError: non-boolean (Missing) used in boolean context
Stacktrace:
 [1] GLRM(A::Matrix{Union{Missing, Float64}}, losses::Vector{Loss}, rx::Vector{Regularizer}, ry::Vector{Regularizer}, k::Int64; X::Matrix{Float64}, Y::Matrix{Float64}, obs::Nothing, observed_features::Vector{UnitRange{Int64}}, observed_examples::Vector{UnitRange{Int64}}, offset::Bool, scale::Bool, checknan::Bool, sparse_na::Bool)
   @ LowRankModels ~/JuliaEnvs/LowRankModels.jl/src/glrm.jl:66
 [2] GLRM(A::Matrix{Union{Missing, Float64}}, losses::Vector{Loss}, rx::Vector{Regularizer}, ry::Vector{Regularizer}, k::Int64)
   @ LowRankModels ~/JuliaEnvs/LowRankModels.jl/src/glrm.jl:38
 [3] #GLRM#172
   @ ~/JuliaEnvs/LowRankModels.jl/src/utilities/conveniencemethods.jl:48 [inlined]
 [4] GLRM
   @ ~/JuliaEnvs/LowRankModels.jl/src/utilities/conveniencemethods.jl:48 [inlined]
 [5] #pca#107
   @ ~/JuliaEnvs/LowRankModels.jl/src/simple_glrms.jl:8 [inlined]
 [6] pca(A::Matrix{Union{Missing, Float64}}, k::Int64)
   @ LowRankModels ~/JuliaEnvs/LowRankModels.jl/src/simple_glrms.jl:6
 [7] top-level scope
   @ REPL[34]:1

It is possible to list all the obs manually, and then it works:

julia> obs = [Tuple(ind) for ind in CartesianIndices(data) if !(data[ind] isa Missing)];

julia> pca(data, 3; obs=obs)
GLRM(Union{Missing, Float64}[1.0606880432831138 1.089579677823029 … 1.6705515670551134 1.2022467329602542; 0.7514051967634282 1.1944014349828096 … missing 1.2370061047120438; … ; 0.9095188136002703 1.1847633026389974 … 1.9330409876767132 1.38570067320635; missing 1.0758247860288004 … 1.5781294994670754 1.0989562029066084], Loss[QuadLoss(1.0, RealDomain()), QuadLoss(1.0, RealDomain()), QuadLoss(1.0, RealDomain()), QuadLoss(1.0, RealDomain()), QuadLoss(1.0, RealDomain())], Regularizer[ZeroReg(), ZeroReg(), ZeroReg(), ZeroReg(), ZeroReg()], Regularizer[ZeroReg(), ZeroReg(), ZeroReg(), ZeroReg(), ZeroReg()], 3, [[1, 2, 4, 5], [1, 2, 3, 5], [1, 2, 4, 5], [1, 2, 3, 4, 5], [2, 3, 4, 5]], [[1, 2, 3, 4], [1, 2, 3, 4, 5], [2, 4, 5], [1, 3, 4, 5], [1, 2, 3, 4, 5]], [-0.6386482530950347 -1.018777729621554 … -1.9382553356751036 -1.053052636035794; -0.14517788192107056 -1.2458123989543037 … -2.2837186272581693 -1.5608357286351906; 0.2540552369948007 -1.4235247052609152 … 0.2976611081413319 0.7210816049116382], [0.4108564480213463 -1.0415581335449255 … 0.36006578018169627 0.11669723598708971; 0.05544466402953537 0.7168696680990544 … 0.19156334692424293 -0.48792765211129047; -0.8552253694850784 -1.5810995605873126 … -0.8580967626857172 -0.4419037908246037])

but we should make it happen automatically.

@jiahao said he has a fix for this.

opened by oxinabox 0
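
The manual workaround in this issue can be wrapped in a small helper so that the observed entries are collected in one call. This is only a sketch: `observed_entries` is a hypothetical name, not a LowRankModels function, and the fix mentioned in the issue is not shown here.

```julia
# Hypothetical helper (not part of LowRankModels): collect the (row, column)
# index of every non-missing entry, as a vector of tuples like the `obs`
# list built by hand in the issue.
observed_entries(A::AbstractMatrix) = [Tuple(ind) for ind in findall(!ismissing, A)]

data = [1.0 missing 3.0;
        missing 5.0 6.0]

obs = observed_entries(data)

# With the package loaded, the result would be passed as in the workaround:
#   pca(data, 3; obs=obs)
```

`findall` walks the matrix in column-major order, so the tuples come out in the same order as the comprehension over `CartesianIndices` used in the issue.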

fix NaNtoMissing

Running the tests was erroring with:

ERROR: LoadError: LoadError: LoadError: type CategoricalValue has no field value
Stacktrace:
  [1] getproperty(x::CategoricalValue{String, UInt8}, f::Symbol)
    @ Base ./Base.jl:33
  [2] isnan(x::CategoricalValue{String, UInt8})
    @ LowRankModels ~/JuliaEnvs/LowRankModels.jl/src/fit_dataframe.jl:275
  [3] (::LowRankModels.var"#147#148"{DataFrame, Int64})(::Tuple{Int64, CategoricalValue{String, UInt8}})
    @ LowRankModels ./none:0
  [4] iterate
    @ ./generator.jl:47 [inlined]
  [5] collect(itr::Base.Generator{Base.Iterators.Enumerate{CategoricalVector{String, UInt8, String, CategoricalValue{String, UInt8}, Union{}}}, LowRankModels.var"#147#148"{DataFrame, Int64}})
    @ Base ./array.jl:678
  [6] NaNs_to_Missing!(df::DataFrame)
    @ LowRankModels ~/JuliaEnvs/LowRankModels.jl/src/fit_dataframe.jl:281
  [7] GLRM(df::DataFrame, k::Int64; losses::Vector{Loss}, rx::QuadReg, ry::QuadReg, offset::Bool, scale::Bool, prob_scale::Bool, NaNs_to_Missing::Bool)
    @ LowRankModels ~/JuliaEnvs/LowRankModels.jl/src/fit_dataframe_w_type_imputation.jl:17
  [8] GLRM(df::DataFrame, k::Int64)
    @ LowRankModels ~/JuliaEnvs/LowRankModels.jl/src/fit_dataframe_w_type_imputation.jl:15
  [9] top-level scope
    @ ~/JuliaEnvs/LowRankModels.jl/examples/fit_rdataset.jl:9
 [10] include(fname::String)
    @ Base.MainInclude ./client.jl:444
 [11] top-level scope
    @ ~/JuliaEnvs/LowRankModels.jl/examples/runexamples.jl:17
 [12] include(fname::String)
    @ Base.MainInclude ./client.jl:444
 [13] top-level scope
    @ ~/JuliaEnvs/LowRankModels.jl/test/runtests.jl:34
 [14] include(fname::String)
    @ Base.MainInclude ./client.jl:444
 [15] top-level scope
    @ none:6
in expression starting at /Users/oxinabox/JuliaEnvs/LowRankModels.jl/examples/fit_rdataset.jl:9
in expression starting at /Users/oxinabox/JuliaEnvs/LowRankModels.jl/examples/runexamples.jl:17
in expression starting at /Users/oxinabox/JuliaEnvs/LowRankModels.jl/test/runtests.jl:34
ERROR: Package LowRankModels errored during testing

Looking at the code for NaNtoMissing it was a bit funky. I made it less funky.

Now the tests pass without errors for me.

opened by oxinabox 1
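
The stack trace shows `isnan` being applied to a `CategoricalValue`, which cannot be NaN. The change made in the PR is not shown here; the sketch below only illustrates one way to guard such a conversion, using a hypothetical function `nans_to_missing` on a plain vector rather than a DataFrame column.

```julia
# Hypothetical stand-in for a NaN-to-missing step (not the package's code).
function nans_to_missing(col::AbstractVector)
    # Only floating-point values can be NaN; every other value,
    # including `missing` and strings, is passed through unchanged.
    is_nan(x) = x isa AbstractFloat && isnan(x)
    return [is_nan(x) ? missing : x for x in col]
end

nums = nans_to_missing([1.0, NaN, 3.0])   # the NaN becomes missing
strs = nans_to_missing(["a", "b"])        # non-numeric data is left alone
```

Checking the type before calling `isnan` means that columns of categorical or string values never reach the numeric test.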

Applying model to new data
Hi there, this looks like a great package. I'm particularly interested in the ability to fit LRMs to datasets with missing data (or in my case, outliers that need to be masked). I have a quick question that may be pretty basic, but an answer would help me to apply the code to my own data. Apologies if I've missed something in the documentation. I'm also fairly new to Julia.

If I fit a PCA model to a set of training data A (following your example):

loss = QuadLoss()
r = ZeroReg()
n_comp = 1
glrm = GLRM(A, loss, r, r, n_comp)
X, Y, ch = fit!(glrm)

how do I then apply the same model to a new set of data B? I would like to keep X fixed and obtain new values Y_b that give the best fit of X to B. That is, I would like to project the observations in B onto the PCA components found from A.

There are other PCA packages in Julia that will do this (e.g., the reconstruct function in MultivariateStats), but they don't seem to be able to handle missing data or sparse arrays.

Thanks in advance! Any help is appreciated!
opened by gdbeck 2
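
A minimal sketch of the projection asked about in this issue, written outside the package and assuming quadratic loss with no regularization, as in the question. In the convention of the overview (data modeled as X'*Y, with X indexed by rows and Y by columns), new observations that share the columns of A keep Y fixed and need a new row factor. `project_rows` is a hypothetical helper, not a LowRankModels function.

```julia
using LinearAlgebra

# Hypothetical helper. Y is the k×n factor learned from A; B holds new rows
# with the same n columns and may contain `missing`. Returns the k×(rows of B)
# factor Xb that minimizes the squared error on the observed entries, so that
# Xb' * Y approximates B.
function project_rows(Y::AbstractMatrix, B::AbstractMatrix)
    k = size(Y, 1)
    Xb = zeros(k, size(B, 1))
    for i in axes(B, 1)
        obs = findall(!ismissing, B[i, :])       # observed columns in row i
        isempty(obs) && continue                 # nothing to fit: leave zeros
        b = Float64[B[i, j] for j in obs]
        Xb[:, i] = permutedims(Y[:, obs]) \ b    # least squares on observed entries
    end
    return Xb
end

Y = [1.0 2.0 3.0]                 # rank-1 factor: k = 1, n = 3
B = [2.0 missing 6.0;             # 2 * Y with one entry masked
     missing 1.0 1.5]             # 0.5 * Y with one entry masked
Xb = project_rows(Y, B)
```

Each row is solved independently, so masked entries simply drop out of that row's least-squares problem. Other losses or regularizers would need an iterative solver instead of the backslash operator.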

Releases(v1.1.1)

v1.1.1(Nov 12, 2020)
LowRankModels v1.1.1

Diff since v1.1.0

Closed issues:

Register new version (#110)

v1.1.0(Oct 23, 2020)
LowRankModels v1.1.0

Diff since v1.0.2

Closed issues:

Register new version (#103)

Merged pull requests:

Install TagBot as a GitHub Action (#105) (@JuliaTagBot)

Add Array() initialize.jl line 16 (#108) (@kathy908000)

Modernize package (#109) (@madeleineudell)

v1.0.2(Nov 9, 2019)
v1.0.2 (2019-11-09)

Diff since v1.0.1

Closed issues:

Register v1.0.2 (#102)

Register v1.0.1 (#101)

@threads causes massive slowdown (#98)

Register v1.0.0 (compatible with Julia 1.0+) (#92)

A non linear GLRM is good for visualization? (#89)

How to run distributely? (#88)

expand_categoricals function not working (#84)

Specifying MultinomialLoss() for multiple columns (#81)

Merged pull requests:

fix julia1 errors (#93) (matbesancon)

v1.0.1(Jul 23, 2019)
v1.0.1 (2019-07-23)

Diff since v1.0.0

Closed issues:

What Julia version is this package compatible for? (#90)

Merged pull requests:

fix Symbol typo (#86) (ramcha24)

v1.0.0(Jul 18, 2019)

This release updates functionality to work on Julia 1.0+

One minor syntax change: mul! has replaced scale!, in keeping with new conventions in Julia 1.0+.
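
For readers updating their own code, the sketch below shows the standard-library calls that took over from `scale!` in Julia 0.7 and later. It uses only `LinearAlgebra`; how the package itself calls these functions is not shown here.

```julia
using LinearAlgebra

A = [1.0 2.0; 3.0 4.0]
C = similar(A)

mul!(C, A, 2.0)   # write 2.0 * A into C without allocating a new array
rmul!(A, 0.5)     # scale A in place by 0.5
```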
v0.2.4(Jun 4, 2019)

This release adds functionality related to missing value imputation using the new Missing type. This is also the last version of LowRankModels that will be compatible with Julia v0.6.
v0.2.3(Mar 30, 2018)

This version fixes some numerical problems in fitting the Poisson Loss, and removes defunct requirements for version numbers of dependencies.
v0.2.2(Nov 1, 2017)

This release updates LowRankModels to remove warnings on Julia v0.6.
v0.2.0(Nov 1, 2017)

This release fixes deprecations for Julia v0.6.

Owner

Madeleine Udell

