scikit-fda: Functional Data Analysis in Python

Overview

Functional Data Analysis, or FDA, is the field of statistics that analyses data that depend on a continuous parameter.

This package offers classes, methods and functions to support FDA in Python. It includes a wide range of tools to work with functional data and its representation, to perform exploratory analysis and preprocessing, and to carry out other tasks such as inference, classification, regression or clustering of functional data. See the documentation for further information on the features included in the package.
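
A minimal usage sketch (hedged: the fetch_growth loader, the plot method and the BSpline basis class are assumed from recent versions of the package, not prescribed by this README):

import skfda

# Load the Berkeley Growth Study dataset as an FDataGrid.
X, y = skfda.datasets.fetch_growth(return_X_y=True)

# Exploratory plot of the functional observations.
X.plot()

# Represent the same data in a B-spline basis.
X_basis = X.to_basis(skfda.representation.basis.BSpline(n_basis=10))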

Documentation

The documentation is available at fda.readthedocs.io/en/stable/, which includes detailed information on the different modules, classes and methods of the package, along with several examples showing different functionalities.

The documentation of the latest version, corresponding to the develop version of the package, can be found at fda.readthedocs.io/en/latest/.

Installation

Currently, scikit-fda is available for Python 3.6 and 3.7, regardless of the platform. The stable version can be installed via PyPI:

pip install scikit-fda

Installation from source

It is possible to install the latest version of the package, available in the develop branch, by cloning this repository and doing a manual installation.

git clone https://github.com/GAA-UAM/scikit-fda.git
pip install ./scikit-fda

Make sure that your default Python version is currently supported, or invoke the python and pip commands with an explicit version, such as python3.6:

git clone https://github.com/GAA-UAM/scikit-fda.git
python3.6 -m pip install ./scikit-fda

Requirements

scikit-fda depends on the usual scientific Python stack (numpy, scipy, scikit-learn and matplotlib, among others), as well as on FDA-specific packages such as fdasrsf.

The dependencies are automatically installed.

Contributions

All contributions are welcome. You can help this project grow in multiple ways, from opening an issue to report a bug or suggest an improvement, to forking the repository and creating a pull request against the development branch.

The people involved at some point in the development of the package can be found in the contributors file.

License

The package is licensed under the BSD 3-Clause License. A copy of the license can be found along with the code.

Comments
  • Feature/improve visualization


    The plotting functions have been changed to avoid global state (via pyplot) whenever possible. Instead, the plotting functions in skfda create and return a new figure if none is passed.

    Pyplot is still used for Sphinx-gallery, as it has to scrape the produced images.

    All the examples have been updated to use the new API and the object-oriented functionality of Matplotlib. A sketch of the resulting usage follows.
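
    A hedged sketch of the new behavior (fetch_growth and the fig parameter are assumptions based on the package's plotting API, not taken verbatim from this PR):

    import matplotlib.pyplot as plt
    import skfda

    X, _ = skfda.datasets.fetch_growth(return_X_y=True)

    # Without arguments, plot() creates and returns a new Matplotlib Figure.
    fig = X.plot()

    # Alternatively, the caller passes an existing figure to draw on.
    fig2 = plt.figure()
    X.plot(fig=fig2)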

    opened by vnmabus 17
  • Retrieve coefficients for function reconstruction


    Hey there!

    I fitted a KNN FDataGrid to my input data. It actually looks pretty good so far, and I now would like to "export" it so I can represent the function as numerical values (preferably in a numpy array).

    I saw that you offer some bases that can be used to "export" the underlying representation. Could you elaborate on which basis should be used when? My data represents a demand/supply curve. I tried the BSpline one, but it only constructs something close to a sine wave, which doesn't really represent my data.

    Here is an image of the graph itself: [image of the fitted demand/supply curve]

    Is there some way to get the raw representation instead of transforming it to another basis?

    opened by mainrs 16
  • add inverse_transform #297 method to FPCA


    Reference issue: https://github.com/GAA-UAM/scikit-fda/issues/297#issue-775429060

    What does it address? The inverse_transform method projects the functional principal scores (coefficient values with respect to the eigenfunction basis) of a sample (or set of samples) back to the input space. In other words, FPCA.inverse_transform(FPCA.transform()) should approximate the identity map.

    Empirical tests: I've only tested the inverse_transform method on synthetic datasets (both FDataGrid and FDataBasis) and assessed the results in terms of 'identity regression', i.e. with QQ-plots of the distribution of the residuals, where residuals = X_input.value - X_recovered.value, and where:

    • X_recovered is computed as FPCA.inverse_transform(FPCA.transform(X_input)),
    • value is the .coefficients or .data_matrix attribute, depending on the input data format.

    X_input is generated according to:

    cov = Exponential(length_scale=0.5)
    n_grid = 100
    n_samples = 50
    
    fd = make_gaussian_process(
        start=0,
        stop=4,
        n_samples=n_samples,
        n_features=n_grid,
        mean=lambda t: np.power(t, 2) + 5,
        cov=cov,
    )
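
    A minimal sketch of the round-trip check described above (hedged: the FPCA import path is the one used around this version of the package, and fit_transform comes from the sklearn-style API):

    from skfda.preprocessing.dim_reduction.projection import FPCA

    fpca = FPCA(n_components=3)
    scores = fpca.fit_transform(fd)                # functional principal scores
    fd_recovered = fpca.inverse_transform(scores)  # back to the input space

    # Residuals of the 'identity regression' on grid values.
    residuals = fd.data_matrix - fd_recovered.data_matrix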
    

    Below are the resulting QQ-plots (computed with scipy.stats.probplot) when `X_recovered` is computed with 3 principal components:

    • When `fd` is left in FDataGrid format: slope ~ 0.675, intercept ~ -3e-17 and R^2 ~ 0.9994. Note that the residuals are computed in terms of functional data values. [QQ-plot: FDataGrid]

    • When `fd` is mapped to FDataBasis format with a 4th-order BSpline basis of cardinality 50: slope ~ 1.08, intercept ~ -5e-17 and R^2 ~ 0.9996. Note that the residuals are computed in terms of the coefficients in the (here BSpline) basis. [QQ-plot: FDataBasis]

    I can enhance the PR and provide further empirical results, just tell me :)

    opened by Clej 12
  • ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject


    Scikit-fda is successfully installed, but when I try to import it, I receive the following error:

    ValueError                                Traceback (most recent call last)
    <ipython-input-17-97c44e0d3210> in <module>
    ----> 1 import skfda
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/__init__.py in <module>
         35 from .representation._functional_data import concatenate
         36 
    ---> 37 from . import representation, datasets, preprocessing, exploratory, misc, ml, \
         38     inference
         39 
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/datasets/__init__.py in <module>
          5                              fetch_weather, fetch_aemet,
          6                              fetch_octane, fetch_gait)
    ----> 7 from ._samples_generators import (make_gaussian,
          8                                   make_gaussian_process,
          9                                   make_sinusoidal_process,
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/datasets/_samples_generators.py in <module>
          7 from .. import FDataGrid
          8 from .._utils import _cartesian_product
    ----> 9 from ..misc import covariances
         10 from ..preprocessing.registration import normalize_warping
         11 from ..representation.interpolation import SplineInterpolation
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/misc/__init__.py in <module>
    ----> 1 from . import covariances, kernels, metrics
          2 from . import operators
          3 from . import regularization
          4 from ._math import (log, log2, log10, exp, sqrt, cumsum,
          5                     inner_product, inner_product_matrix)
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/misc/covariances.py in <module>
          7 import sklearn.gaussian_process.kernels as sklearn_kern
          8 
    ----> 9 from ..exploratory.visualization._utils import _create_figure, _figure_to_svg
         10 
         11 
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/exploratory/__init__.py in <module>
          1 from . import depth
    ----> 2 from . import outliers
          3 from . import stats
          4 from . import visualization
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/exploratory/outliers/__init__.py in <module>
          4 )
          5 from ._iqr import IQROutlierDetector
    ----> 6 from .neighbors_outlier import LocalOutlierFactor
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/exploratory/outliers/neighbors_outlier.py in <module>
          2 from sklearn.base import OutlierMixin
          3 
    ----> 4 from ...misc.metrics import lp_distance
          5 from ...ml._neighbors_base import (
          6     KNeighborsMixin,
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/misc/metrics.py in <module>
          6 
          7 from .._utils import _pairwise_commutative
    ----> 8 from ..preprocessing.registration import normalize_warping, ElasticRegistration
          9 from ..preprocessing.registration._warping import _normalize_scale
         10 from ..preprocessing.registration.elastic import SRSF
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/preprocessing/__init__.py in <module>
    ----> 1 from . import registration
          2 from . import smoothing
          3 from . import dim_reduction
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/preprocessing/registration/__init__.py in <module>
         14 from ._warping import invert_warping, normalize_warping
         15 
    ---> 16 from .elastic import ElasticRegistration
         17 
         18 from . import validation, elastic
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/preprocessing/registration/elastic.py in <module>
          1 
    ----> 2 from fdasrsf.utility_functions import optimum_reparam
          3 import scipy.integrate
          4 from sklearn.base import BaseEstimator, TransformerMixin
          5 from sklearn.utils.validation import check_is_fitted
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/fdasrsf/__init__.py in <module>
         20 del sys
         21 
    ---> 22 from .time_warping import fdawarp, align_fPCA, align_fPLS, pairwise_align_bayes
         23 from .plot_style import f_plot, rstyle, plot_curve, plot_reg_open_curve, plot_geod_open_curve, plot_geod_close_curve
         24 from .utility_functions import smooth_data, optimum_reparam, f_to_srsf, gradient_spline, elastic_distance, invertGamma, srsf_to_f
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/fdasrsf/time_warping.py in <module>
          7 import numpy as np
          8 import matplotlib.pyplot as plt
    ----> 9 import fdasrsf.utility_functions as uf
         10 import fdasrsf.fPCA as fpca
         11 import fdasrsf.geometry as geo
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/fdasrsf/utility_functions.py in <module>
         17 from joblib import Parallel, delayed
         18 import numpy.random as rn
    ---> 19 import optimum_reparamN2 as orN2
         20 import optimum_reparam_N as orN
         21 import cbayesian as bay
    
    src/optimum_reparamN2.pyx in init optimum_reparamN2()
    
    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
    

    I'm using the standard import:

    import skfda
    

    I don't know why this is happening; any help is greatly appreciated.

    Version information

    • OS: MacOS
    • Python version: 3.7.9
    • scikit-fda version: 0.5
    • Version of other packages involved: [numpy: 1.19.2, scipy: 1.6.0, matplotlib: 3.3.4 , conda: 4.9.2 ]
    bug 
    opened by adeyemi-lawal 10
  • ModuleNotFoundError: No module named 'optimum_reparam'


    Hi, I just forked/cloned the repository and I found a module which is not listed in the dependencies. It's called "optimum_reparam" and it's imported right here (and maybe somewhere else). I failed to find the module myself; could someone provide details about this module in the README requirements section, please?

    Thanks for this initiative!

    opened by alejandro-ariza 8
  • Problem compiling binaries in macOS


    Describe the bug: Problem compiling C code from fdasrsf:

    error: $MACOSX_DEPLOYMENT_TARGET mismatch: now "10.14" but "10.15" during configure

    In the fdasrsf package, the variable is fixed with os.environ['MACOSX_DEPLOYMENT_TARGET'] = '10.14' in the setup.py. After cloning the repository and removing that line I get the same error. I also tried to export the environment variable before running the installation, with no success.
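
    For reference, the attempted workaround looked like this (a sketch only; the value to export depends on the SDK version shown in the error message):

    export MACOSX_DEPLOYMENT_TARGET=10.15
    pip install scikit-fda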

    Complete trace:

    Building wheel for fdasrsf (PEP 517) ... error
    ERROR: Command errored out with exit status 1:
     command: /Users/pablomm/scikit_fda_test2/venv/bin/python /Users/pablomm/scikit_fda_test2/venv/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /var/folders/tk/bl_pbdpj6kb7r3s584k85jxw0000gn/T/tmpo2_rkkrs
     cwd: /private/var/folders/tk/bl_pbdpj6kb7r3s584k85jxw0000gn/T/pip-install-ym8nls3t/fdasrsf_a923c378ec5a4ec689f1d7459d35c8c7
    Complete output (38 lines):
    generating build/_DP.c (already up-to-date)
    running bdist_wheel
    running build
    running build_py
    creating build/lib.macosx-10.15-x86_64-3.8
    creating build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/plot_style.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/curve_stats.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/curve_functions.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/geodesic.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/utility_functions.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/regression.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/tolerance.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/pcr_regression.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/__init__.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/fPCA.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/umap_metric.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/time_warping.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/geometry.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/curve_regression.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/rbfgs.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/fPLS.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/boxplots.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    running build_ext
    cythoning src/optimum_reparamN2.pyx to src/optimum_reparamN2.c
    cythoning src/fpls_warp.pyx to src/fpls_warp.c
    cythoning src/mlogit_warp.pyx to src/mlogit_warp.c
    cythoning src/ocmlogit_warp.pyx to src/ocmlogit_warp.c
    cythoning src/oclogit_warp.pyx to src/oclogit_warp.c
    cythoning src/optimum_reparam_N.pyx to src/optimum_reparam_N.c
    cythoning src/cbayesian.pyx to src/cbayesian.cpp
    building 'optimum_reparamN2' extension
    creating build/temp.macosx-10.15-x86_64-3.8
    creating build/temp.macosx-10.15-x86_64-3.8/src
    clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk -I/private/var/folders/tk/bl_pbdpj6kb7r3s584k85jxw0000gn/T/pip-build-env-qnbgsf0y/overlay/lib/python3.8/site-packages/numpy/core/include -I/usr/local/include -I/usr/local/opt/openssl@1.1/include -I/usr/local/opt/sqlite/include -I/usr/local/opt/tcl-tk/include -I/Users/pablomm/scikit_fda_test2/venv/include -I/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/include/python3.8 -c src/optimum_reparamN2.c -o build/temp.macosx-10.15-x86_64-3.8/src/optimum_reparamN2.o
    not modified: 'build/_DP.c'
    error: $MACOSX_DEPLOYMENT_TARGET mismatch: now "10.14" but "10.15" during configure


    ERROR: Failed building wheel for fdasrsf
    Building wheel for findiff (setup.py) ... done
    Created wheel for findiff: filename=findiff-0.8.9-py3-none-any.whl size=29218 sha256=c2bc96e93c195fb5c2a1acbf932b5311e8e94e571edf74929eca6e019c66532d
    Stored in directory: /Users/pablomm/Library/Caches/pip/wheels/df/48/68/71cc95b16d5f7c5115a009f92f9a5a3896fb2ece31228b0aa5
    Successfully built findiff
    Failed to build fdasrsf
    ERROR: Could not build wheels for fdasrsf which use PEP 517 and cannot be installed directly

    To Reproduce

    virtualenv venv
    source venv/bin/activate
    python --version # 3.8.8 and also tried 3.9.2
    pip install scikit-fda
    

    It also happens when the package is installed from source with python setup.py install.

    Version information

    • OS: macOS catalina 10.15.7
    • Python version: 3.9.2 and 3.8.8
    • scikit-fda version: stable master (0.5) and develop
    • gcc and g++ version: 12.0.0
    bug 
    opened by pablomm 7
  • Importing from skfda import FDataGrid gives you ValueError


    Describe the bug: When importing FDataGrid from skfda, the following error is raised:

    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

    To Reproduce: We are installing scikit-fda==0.3 with the following Docker image:

    FROM python:3.7-slim-buster
    
    # ------------------------------------------------------------------------------------Basic Linux Packages------------------------------------------------------------------------------------
    RUN apt-get update \
        && apt-get install -y --no-install-recommends \
        ca-certificates \
        cmake \
        build-essential \
        gcc \
        g++ \
        git \
        wget \
        curl \
        libffi-dev \
        python-dev \
        unixodbc-dev \
        && rm -rf /var/lib/apt/lists/*
    

    Expected behavior: To be able to import the library.

    Version information

    • OS: Linux
    • Python version: python:3.7-slim-buster
    • scikit-fda version: 0.3
    • Version of other packages involved: numpy==1.16.2, scipy==1.3.1, matplotlib==3.1.1

    Additional context: Since 01/02/2021 we have not been able to import the library. Before that date, we could import it without problems with the same requirements and environment.

    Error Thread:

    src/hvl/utils/model_utils.py:12: in <module>
        from skfda import FDataGrid
    /usr/local/lib/python3.7/site-packages/skfda/__init__.py:36: in <module>
        from . import representation, datasets, preprocessing, exploratory, misc, ml
    /bin/bash failed with return code: 1
    /usr/local/lib/python3.7/site-packages/skfda/datasets/__init__.py:6: in <module>
    return code: 1
        from ._samples_generators import (make_gaussian_process,
    /usr/local/lib/python3.7/site-packages/skfda/datasets/_samples_generators.py:8: in <module>
        from ..misc import covariances
    /usr/local/lib/python3.7/site-packages/skfda/misc/__init__.py:2: in <module>
        from . import covariances, kernels, metrics
    /usr/local/lib/python3.7/site-packages/skfda/misc/covariances.py:9: in <module>
        from ..exploratory.visualization._utils import _create_figure, _figure_to_svg
    /usr/local/lib/python3.7/site-packages/skfda/exploratory/__init__.py:4: in <module>
        from . import visualization
    /usr/local/lib/python3.7/site-packages/skfda/exploratory/visualization/__init__.py:1: in <module>
        from . import clustering, representation
    /usr/local/lib/python3.7/site-packages/skfda/exploratory/visualization/clustering.py:11: in <module>
        from ...ml.clustering.base_kmeans import FuzzyKMeans
    /usr/local/lib/python3.7/site-packages/skfda/ml/__init__.py:2: in <module>
        from . import classification, clustering, regression
    /usr/local/lib/python3.7/site-packages/skfda/ml/classification/__init__.py:3: in <module>
        from ..._neighbors import (KNeighborsClassifier, RadiusNeighborsClassifier,
    /usr/local/lib/python3.7/site-packages/skfda/_neighbors/__init__.py:11: in <module>
        from .unsupervised import NearestNeighbors
    >   ???
    E   ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
    
    deps/fdasrsf/optimum_reparam.pyx:1: ValueError
    
    bug 
    opened by JavierHernandezMontes 6
  • Feature/local outlier factor


    • Created LocalOutlierFactor (which wraps scikit-learn multivariate version)
    • Example in gallery of detection of outliers
    • New real dataset employed in the example (fetch_octane)
    • Tests and doctests added
    enhancement pending theory 
    opened by pablomm 6
  • Error when calling L2Regularization


    I have tried different python versions, but calling:

    regularization=L2Regularization(LinearDifferentialOperator())

    As in the docs, this results in the following error:

    TypeError: __init__() takes 1 positional argument but 2 were given

    opened by dcbrien 5
  • Feature/neighbors


    Added the following neighbors estimators:

    • NearestNeighbors
    • KNeighborsClassifier
    • RadiusNeighborsClassifier
    • NearestCentroids
    • KNeighborsScalarRegressor
    • RadiusNeighborsScalarRegressor
    • KNeighborsFunctionalRegressor
    • RadiusNeighborsFunctionalRegressor

    I wrote some examples for KNeighborsClassifier, RadiusNeighborsClassifier and KNeighborsScalarRegressor; I will write others for the regressors with functional response and the nearest centroids, but in another PR.

    Also, I had to modify lp_distance to accept the infinity case, and the mean to accept a list of weights.

    There are some little things which can be improved after merging #101. A usage sketch of the new estimators follows.
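
    A hedged usage sketch for one of the new estimators (the dataset loader and final import path are assumptions; the estimators follow the sklearn API, as described above):

    import skfda
    from skfda.ml.classification import KNeighborsClassifier

    # Functional covariates and class labels.
    X, y = skfda.datasets.fetch_growth(return_X_y=True)

    # Fit and predict exactly as with a multivariate sklearn estimator.
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X, y)
    print(clf.predict(X[:5]))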

    opened by pablomm 5
  • Transformer to reduce the image dimension


    Created a transformer which receives a multivariate function from R^n -> R^d and applies a vectorial norm to reduce the image dimension to 1.

    There are two versions of the transformer: a procedural one and an sklearn-style transformer.

    I put the functions in skfda.preprocessing.dim_reduction, but maybe there is a better place.

    opened by pablomm 5
  • Does BasisSmoother rely on smoothing_parameter values?


    Describe the bug: It seems to me that changing smoothing_parameter doesn't affect the result of BasisSmoother, particularly the score.

    To Reproduce: I'm using something similar to the example in the docs:

    t = np.linspace(0, 1, 5)
    x = np.sin(2 * np.pi * t) + np.cos(2 * np.pi * t) + 2
    fd = skfda.FDataGrid(data_matrix=x, grid_points=t)
    basis = skfda.representation.basis.FourierBasis((0, 1), n_basis=3)
    smoother = skfda.preprocessing.smoothing.BasisSmoother(basis,smoothing_parameter=100)
    fd_smooth = smoother.fit_transform(fd)
    fd_smooth.data_matrix.round(2)
    smoother.score(fd,fd)
    

    Expected behavior: Changing smoothing_parameter should affect the score, but it doesn't seem to. I'm using the current GitHub version.

    Version information

    • OS: [e.g. MacOs 13.0]
    • Python version: [e.g. 3.9.15]
    • scikit-fda version: [e.g. latest]
    • Version of other packages involved [e.g. numpy 1.23.4]


    bug 
    opened by mikeaalv 2
  • Allow BasisSmoother to accept irregularly sampled data


    Alternatively, I have implemented a separate SparseBasisSmoother class which encapsulates the functionality of BasisSmoother in order to apply it to irregularly sampled data, in the format given in Issue #325.

    opened by opintosant 1
  • Data structure for discretized sparse data


    Is your feature request related to a problem? Please describe: We need an FData subclass capable of storing discretized functions where:

    • The points in which the functions are measured are NOT in a grid (especially relevant for high-dimensional functions such as surfaces).
    • The points in which each function is measured are different, and probably a different number of points per observation.
    • There are typically few points measured per observation (sparse instead of dense).

    This structure is necessary for efficient storage and computation with this kind of data.

    Describe the solution you'd like: The proposal for the implementation is to have three arrays:

    • A first array containing the (concatenated) input points for the observations. For example, for three functional observations $f_1$, $f_2$, $f_3$, evaluated at $f_1(1, 2) = (3, 4, 5), f_1(2, 4) = (1, 7, 3), f_2(0, 0) = (1, 0, 1), f_3(0, 1) = (2, 2, 2), f_3(3, 5) = (1, 0, 0)$, this array would contain [(1, 2), (2, 4), (0, 0), (0, 1), (3, 5)].
    • A second array with the corresponding values. In the previous example it would be [(3, 4, 5), (1, 7, 3), (1, 0, 1), (2, 2, 2), (1, 0, 0)].
    • A third array with the indexes where the points of each observation start. In this case [0, 2, 3].

    This approach has a compact representation and also allows vectorization to be applied: the indices can be used with the reduceat method of NumPy ufuncs, as in the sketch below.
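
    A minimal NumPy sketch of the proposed storage, using the example values above (the per-observation sum is just one instance of a reduceat-style vectorized reduction):

    import numpy as np

    # Concatenated input points of f_1, f_2, f_3.
    points = np.array([(1, 2), (2, 4), (0, 0), (0, 1), (3, 5)])
    # Corresponding output values, aligned with `points`.
    values = np.array([(3, 4, 5), (1, 7, 3), (1, 0, 1), (2, 2, 2), (1, 0, 0)])
    # Index at which each observation starts.
    starts = np.array([0, 2, 3])

    # Vectorized per-observation reduction with a NumPy ufunc:
    # sums[0] aggregates the two points of f_1, sums[1] the single point
    # of f_2, and sums[2] the two points of f_3.
    sums = np.add.reduceat(values, starts, axis=0)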

    Describe alternatives you've considered: A more high-level API should also be exposed and used when possible.

    enhancement 
    opened by vnmabus 1
  • Fix regularized FPCA with FDataGrid


    The issue

    When using FPCA with regularization on FDataGrid objects, the obtained components are orthogonal with respect to the usual inner product, which is not the one that should be used under regularization. As specified in Ramsay, J. O. and Silverman, B. W., Functional Data Analysis (page 178), the components should not be orthogonal; instead they should fulfill

    $$ \int \xi_j(s) \xi_k(s) ds + \lambda \int D^2 \xi_j(s) D^2 \xi_k(s) ds = 0, \quad (j\ne k) $$

    where $\xi_i$ are the loadings, $\lambda$ is the regularization parameter and $D^2$ is the second-order differential operator.

    Steps to reproduce

    The following code shows the problem:

    X, y = skfda.datasets.fetch_phoneme(return_X_y=True)
    X = X[:300]
    y = y[:300]
    
    n_components = 4
    
    fpca_regression = FPCA(
        n_components=n_components,
        regularization=L2Regularization(
            LinearDifferentialOperator(1),
            regularization_parameter=20,
        ),
    )
    
    fpca_regression.fit(X, y)
    
    matrix = skfda.misc.inner_product_matrix(fpca_regression.components_)
    print("Grid regularized:\n", matrix)
    
    
    grid_points = X.grid_points
    X = X.to_basis(Fourier(n_basis=10))
    
    fpca_regression = FPCA(
        n_components=n_components,
        regularization=L2Regularization(
            LinearDifferentialOperator(1),
            regularization_parameter=0.25,
        ),
    )
    fpca_regression.fit(X, y)
    
    
    matrix = skfda.misc.inner_product_matrix(fpca_regression.components_)
    print("Basis regularized:\n", matrix)
    

    Output:

    Grid regularized:
     [[ 1.00000000e+00  6.96057795e-16  9.19403442e-17 -2.91433544e-16]
     [ 6.71554826e-16  1.00000000e+00  1.51354623e-16 -2.83627288e-16]
     [ 6.93889390e-18  1.55257751e-16  1.00000000e+00  1.49619900e-16]
     [-2.43294968e-16 -2.56305394e-16  1.47451495e-16  1.00000000e+00]]
    Basis regularized:
     [[ 0.99393024 -0.0180874  -0.01324601  0.0030897 ]
     [-0.0180874   0.79784464 -0.06858017  0.07861877]
     [-0.01324601 -0.06858017  0.74805033  0.09757001]
     [ 0.0030897   0.07861877  0.09757001  0.70994851]]
    

    In the grid regularized case, the output is the identity matrix, even though the regularization coefficient has been set to a very high value (20). In contrast, the basis regularized case outputs a matrix very different from the identity.

    Current implementation

    The relevant extract of the code is the following:

    basis = FDataGrid(
        data_matrix=np.identity(n_points_discretization),
        grid_points=X.grid_points,
    )
    regularization_matrix = compute_penalty_matrix(
        basis_iterable=(basis,),
        regularization_parameter=1,
        regularization=self.regularization,
    )
    
    basis_matrix = basis.data_matrix[..., 0]
    if regularization_matrix is not None:
        basis_matrix += regularization_matrix
    
    fd_data = np.linalg.solve(
        basis_matrix.T,
        fd_data.T,
    ).T
    
    # see docstring for more information
    final_matrix = fd_data @ np.sqrt(weights_matrix)
    
    pca = PCA(n_components=self.n_components)
    pca.fit(final_matrix)
    self.components_ = X.copy(
        data_matrix=np.transpose(
            np.linalg.solve(
                np.sqrt(weights_matrix),
                np.transpose(pca.components_),
            ),
        ),
        sample_names=(None,) * self.n_components,
    )
    

    We can see that the code "matches" the TFG (undergraduate thesis) corresponding to this functionality. In this thesis, it is stated that the problem is equivalent to solving the multivariate PCA on the following matrix:

    $$ \mathbf{V}=\frac{\mathbf{X}\mathbf{W}^{1/2}}{\sqrt{N}(\mathbf{I}_M + \mathbf{P}_k)} $$

    This formula has been extracted directly from the TFG, where the matrix operation is expressed as a fraction. Note, however, that the correct equation would have the inverse of the denominator multiplying the numerator from the right side. In the TFG, this equation is justified by stating that maximizing the penalized variance is equivalent to the following eigenvalue problem:

    $$ \frac{\mathbf{W}^{1/2} (\mathbf{X}^\intercal \mathbf{X})\mathbf{W}^{1/2} v_j(t)} {N (\mathbf{I}_M + \mathbf{P}_k)^\intercal(\mathbf{I}_M + \mathbf{P}_k)} = \lambda_j v_j(t) $$

    Then, the previous equation can be seen as the usual eigenvalue problem for the covariance matrix of the data, but with a modified covariance matrix. However, even though the previous equation is similar to $\mathbf{V}^\intercal \mathbf{V}$, there are issues with this notation and analysis.

    fda.usc implementation

    The implementation in fda.usc is the following:

     if (lambda > 0) {
        if (is.vector(P)) {
          P <- P.penalty(tt, P)
        }
        dimp <- dim(P)
        if (!(dimp[1] == dimp[2] & dimp[1] == J)) {
          stop("Incorrect matrix dimension P")
        }
        M <- solve(diag(J) + lambda * P)
        Xcen.fdata$data <- Xcen.fdata$data %*% t(M)
      }
    

    where

    • diag(J) is the identity matrix of size J
    • P is the penalty matrix

    Therefore, the implementation in fda.usc is equivalent to the current implementation and also returns the identity matrix as the inner product matrix. The issue (moviedo5/fda.usc/#6) has been opened in the fda.usc repository.

    Proposed solution

    Mathematical analysis

    Non-regularized FPCA summary

    In the usual PCA, the goal is to maximize the variance of the projections onto the principal components, subject to two restrictions. That is to say, maximize

    $$ \frac{1}{N} \mathbf{\phi^\intercal X^\intercal X\phi} $$

    subject to:

    • $\mathbf{\phi^\intercal \phi}=1$
    • The loadings are orthogonal $\mathbf{\phi_i^\intercal \phi_j}=0$ if $i\neq j$

    Regularized FPCA

    The regularized version of FPCA aims to maximize the penalized sample variance. If we want to penalize the second derivative, we can define it as:

    $$ psv(\phi) = \frac{ \frac 1N \sum_{n=1}^N \left(\int \phi x_n\right)^2 } { |\phi|^2 + \lambda p(\phi,2) }, $$

    where $p(\phi,2)$ is the penalty function for the second derivative applied to $\phi$. The restrictions now are:

    • $|\phi|^2=1$
    • $\int \phi_i \phi_j + \lambda \int D^2\phi_i D^2\phi_j= 0$

    For simplicity, we will ignore the quadrature for now and assume $\mathbf{W=Id}$ (the integration weights). Then, we can make the following substitutions:

    $$ psv(\phi) = \frac{ \frac 1N \sum_{n=1}^N \left(\int \phi x_n\right)^2 } { |\phi|^2 + \lambda p(\phi,2) } \approx \frac{ \frac 1N \boldsymbol \phi^\intercal \mathbf X^\intercal \mathbf X \boldsymbol \phi }{ \boldsymbol \phi^\intercal \boldsymbol \phi + \lambda \boldsymbol \phi^\intercal \mathbf P_2^\intercal \mathbf P_2 \boldsymbol \phi } = \frac{ \frac 1N \boldsymbol {\phi}^\intercal \mathbf X^\intercal \mathbf X \boldsymbol \phi }{ \boldsymbol \phi^\intercal \left(\mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2\right)\boldsymbol \phi }, $$

    where $\mathbf P_2$ is the penalty matrix for the second derivative, $\boldsymbol \phi$ is the vector of $\phi$ evaluated in the grid points and $\mathbf X$ is the usual data matrix. That is to say, the functions evaluated in the grid points.

    We can also obtain:

    $$ 0 = \int \phi_i \phi_j + \lambda \int D^2\phi_i D^2\phi_j = \boldsymbol \phi_i^\intercal \boldsymbol \phi_j + \lambda \boldsymbol \phi_i^\intercal \mathbf P_2^\intercal \mathbf P_2 \boldsymbol \phi_j = \boldsymbol \phi_i^\intercal \left(\mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2\right)\boldsymbol \phi_j $$

    In light of these relationships, we can see that we can transform the problem back to the usual PCA problem using the Cholesky decomposition $\mathbf{L} \mathbf{L}^\intercal = \mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2$. Then, we can apply the change of variable $\boldsymbol \phi = \mathbf{(L^\intercal)^{-1}} \boldsymbol \psi$. Using the Cholesky decomposition, we can obtain the following equation.

    $$ psv(\phi) = \frac{ \frac 1N \boldsymbol \psi^\intercal \mathbf{L}^{-1} \mathbf X^\intercal \mathbf X \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi }{ \boldsymbol \psi^\intercal \mathbf {L}^{-1} \left(\mathbf{LL^\intercal}\right) \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi }= \frac{ \frac 1N \boldsymbol \psi ^\intercal \Big(\mathbf X \mathbf {(L^\intercal)}^{-1}\Big)^\intercal \Big(\mathbf X \mathbf {(L^\intercal)}^{-1}\Big) \boldsymbol \psi } { \boldsymbol \psi^\intercal \boldsymbol \psi } $$

    Then the constraints become:

    $$ 1 = |\boldsymbol \phi|^2 = |(\mathbf L^\intercal)^{-1} \boldsymbol \psi|^2 = \boldsymbol \psi^\intercal \mathbf L^{-1} (\mathbf L^{-1})^\intercal \boldsymbol \psi $$

    $$ 0 = \boldsymbol \psi_i^\intercal \mathbf {L}^{-1} \left(\mathbf{LL^\intercal}\right) \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi_j = \boldsymbol \psi_i^\intercal \boldsymbol \psi_j $$

    Therefore, we can see that the problem is equivalent to the usual PCA problem with the following changes:

    • The data matrix $\mathbf X$ is replaced by $\mathbf X \mathbf {(L^\intercal)}^{-1}$.
    • Use the change of variable $\boldsymbol \phi = \mathbf{(L^\intercal)^{-1}} \boldsymbol \psi$ so that loadings obtained are orthogonal with respect to the new inner product given by $\mathbf{L}^{-1} (\mathbf L^{-1})^\intercal$.

    Note that after the change of variable, the loadings will no longer have norm 1 with respect to the original inner product. However, this can easily be fixed by normalizing them at the end, as in the sketch below.
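
    A minimal NumPy sketch of this change of variable (hedged: P2 denotes a precomputed penalty matrix for the chosen operator, and all names are illustrative rather than the actual code in the pull request):

    import numpy as np

    def regularized_pca_grid(X, P2, lam, n_components):
        n, m = X.shape
        # Cholesky factor of the penalized inner product: Id + lam * P2^T P2 = L L^T.
        L = np.linalg.cholesky(np.eye(m) + lam * P2.T @ P2)
        # Transformed data matrix X (L^T)^{-1}, via a solve instead of an explicit inverse.
        Xt = np.linalg.solve(L, X.T).T
        # Usual PCA on the transformed, centered data.
        Xt = Xt - Xt.mean(axis=0)
        _, _, vt = np.linalg.svd(Xt / np.sqrt(n), full_matrices=False)
        psi = vt[:n_components]  # orthonormal loadings in the transformed space
        # Undo the change of variable: phi = (L^T)^{-1} psi.
        phi = np.linalg.solve(L.T, psi.T).T
        # Renormalize to unit norm in the original space.
        return phi / np.linalg.norm(phi, axis=1, keepdims=True)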

    Progress so far

    This Cholesky decomposition approach has been implemented in the attached pull request. Note, however, that the code has not been optimized yet (it still uses np.linalg.inv instead of np.linalg.solve, for example), nor has it been tested extensively.

    Regularization parameter effect

    The main issue remaining is that the regularization parameter $\lambda$ does not seem to have the same effect in the grid implementation as in the basis implementation. The reason may be related to the penalty matrix. To reproduce the issue, run the following code:

    X, y = skfda.datasets.fetch_phoneme(return_X_y=True)
    X = X[:300]
    y = y[:300]
    
    
    def fit_plot(ax, title, regularization_parameter=0):
        fpca_regression = FPCA(
            n_components=4,
            regularization=L2Regularization(
                LinearDifferentialOperator(1),
                regularization_parameter=regularization_parameter,
            ),
        )
        fpca_regression.fit(X, y)
        fpca_regression.components_.plot(axes=ax)
        ax.set_title(title)
        ax.set_xlabel("")
        ax.set_ylabel("")
    
    
    fig, ax = plt.subplots(3, 2, figsize=(10, 10))
    
    fit_plot(ax[0, 0], "Grid not regularized")
    fit_plot(ax[0, 1], "Grid regularized", regularization_parameter=2.5)
    
    grid_points = X.grid_points
    X = X.to_basis(Fourier(n_basis=10))
    
    fit_plot(ax[1, 0], "Basis not regularized")
    fit_plot(ax[1, 1], "Basis regularized", regularization_parameter=0.125)
    
    # Convert back to grid using the same points as before
    X = X.to_grid(grid_points)
    
    fit_plot(ax[2, 0], "Grid from basis not regularized")
    fit_plot(ax[2, 1], "Grid from basis regularized", regularization_parameter=2.5)
    
    plt.show()
    

    [Figure: component plots for the six cases titled in the code above]

    As can be seen, the regularization parameter has less of an effect in the basis implementation than in the grid implementation. However, by adjusting the parameters, the results do seem to match almost perfectly.

    Next steps

    • [ ] Fix the regularization parameter effect issue.
    • [ ] Either change the tests so that they do not compare against fda.usc in this particular case or wait for them to respond to the issue (moviedo5/fda.usc/#6). Taking advantage of the migration from unittest to pytest, it would be possible to use pytest mechanisms to prevent these tests from being executed while documenting the reason why they should not be executed.
    • [ ] Optimize and revise the implementation in #485
    • [ ] Investigate the fact that the results in FDataBasis might not have norm 1.
    opened by Ddelval 0
Releases (0.7.1)
  • 0.7.1 (Jan 25, 2022)

  • 0.7 (Jan 25, 2022)

    • TikhonovRegularization renamed to L2Regularization. The old name is deprecated. See #409 for details.
    • Add the option for MagnitudeShapePlot to plot the inlier ellipse.
    • Additional bug fixes.
  • 0.6.1 (Dec 23, 2021)

  • 0.6 (Sep 12, 2021)

    Some highlights of this version:

    • Add typing support for most functionalities of the package. Now you can use Mypy to check for errors.
    • New depth-based classifiers
    • New visualization functions
      • Parametric plot
      • Outliergram
      • Multiple interactive plots
    • New historical functional regression model
    • Other minor additions and fixes
  • 0.5 (Dec 31, 2020)

    Some highlights of the new functionality:

    • Depth functions refactored
    • Removed C code and added fdasrsf as a dependency
    • Added new variable selection methods
    • Added Maximum Depth classifier
    • Optimized FDataGrid inner product
  • 0.4 (Jul 23, 2020)

    Adds functional component analysis, ANOVA and Hotelling tests, basis representation for multivariate and vector-valued functions, and Tikhonov regularization.

Owner
Grupo de Aprendizaje Automático - Universidad Autónoma de Madrid
Machine Learning Group at Universidad Autónoma de Madrid