scikit-fda: Functional Data Analysis in Python

Overview

Functional Data Analysis, or FDA, is the field of statistics that analyses data that depend on a continuous parameter.

This package offers classes, methods and functions to support FDA in Python. It includes a wide range of tools to work with functional data and its representation, to perform exploratory analysis and preprocessing, and to carry out other tasks such as inference, classification, regression or clustering of functional data. See the documentation for further information on the features included in the package.
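
A minimal usage sketch (hedged: the fetch_growth loader, the plot method and the BSpline basis class are assumed from recent versions of the package, not prescribed by this README):

import skfda

# Load the Berkeley Growth Study dataset as an FDataGrid.
X, y = skfda.datasets.fetch_growth(return_X_y=True)

# Exploratory plot of the functional observations.
X.plot()

# Represent the same data in a B-spline basis.
X_basis = X.to_basis(skfda.representation.basis.BSpline(n_basis=10))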

Documentation

The documentation is available at fda.readthedocs.io/en/stable/, which includes detailed information on the different modules, classes and methods of the package, along with several examples showing different functionalities.

The documentation of the latest version, corresponding to the develop version of the package, can be found at fda.readthedocs.io/en/latest/.

Installation

Currently, scikit-fda is available for Python 3.6 and 3.7, regardless of the platform. The stable version can be installed via PyPI:

pip install scikit-fda

Installation from source

It is possible to install the latest version of the package, available in the develop branch, by cloning this repository and doing a manual installation.

git clone https://github.com/GAA-UAM/scikit-fda.git
pip install ./scikit-fda

Make sure that your default Python version is currently supported, or invoke the python and pip commands with an explicit version, such as python3.6:

git clone https://github.com/GAA-UAM/scikit-fda.git
python3.6 -m pip install ./scikit-fda

Requirements

scikit-fda depends on the usual scientific Python stack (numpy, scipy, scikit-learn and matplotlib, among others), as well as on FDA-specific packages such as fdasrsf.

The dependencies are automatically installed.

Contributions

All contributions are welcome. You can help this project grow in multiple ways, from opening an issue to report a bug or suggest an improvement, to forking the repository and creating a pull request against the development branch.

The people involved at some point in the development of the package can be found in the contributors file.

License

The package is licensed under the BSD 3-Clause License. A copy of the license can be found along with the code.

Comments
  • Feature/improve visualization


    The plotting functions have been changed to avoid global state (via pyplot) whenever possible. Instead, the plotting functions in skfda create and return a new figure if none is passed.

    Pyplot is still used for Sphinx-gallery, as it has to scrape the produced images.

    All the examples have been updated to use the new API and the object-oriented functionality of Matplotlib. A sketch of the resulting usage follows.
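
    A hedged sketch of the new behavior (fetch_growth and the fig parameter are assumptions based on the package's plotting API, not taken verbatim from this PR):

    import matplotlib.pyplot as plt
    import skfda

    X, _ = skfda.datasets.fetch_growth(return_X_y=True)

    # Without arguments, plot() creates and returns a new Matplotlib Figure.
    fig = X.plot()

    # Alternatively, the caller passes an existing figure to draw on.
    fig2 = plt.figure()
    X.plot(fig=fig2)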

    opened by vnmabus 17
  • Retrieve coefficients for function reconstruction


    Hey there!

    I fitted a KNN FDataGrid to my input data. It actually looks pretty good so far, and I now would like to "export" it so I can represent the function as numerical values (preferably in a numpy array).

    I saw that you offer some bases that can be used to "export" the underlying representation. Could you elaborate on which basis should be used when? My data represents a demand/supply curve. I tried the BSpline one, but it only constructs something close to a sine wave, which doesn't really represent my data.

    Here is an image of the graph itself: [image of the fitted demand/supply curve]

    Is there some way to get the raw representation instead of transforming it to another basis?

    opened by mainrs 16
  • add inverse_transform #297 method to FPCA


    Reference issue: https://github.com/GAA-UAM/scikit-fda/issues/297#issue-775429060

    What does it address? The inverse_transform method projects the functional principal scores (coefficient values with respect to the eigenfunction basis) of a sample (or set of samples) back to the input space. In other words, FPCA.inverse_transform(FPCA.transform()) should approximate the identity map.

    Empirical tests: I've only tested the inverse_transform method on synthetic datasets (both FDataGrid and FDataBasis) and assessed the results in terms of 'identity regression', i.e. with QQ-plots of the distribution of the residuals, where residuals = X_input.value - X_recovered.value, and where:

    • X_recovered is computed as FPCA.inverse_transform(FPCA.transform(X_input)),
    • value is the .coefficients or .data_matrix attribute, depending on the input data format.

    X_input is generated according to:

    cov = Exponential(length_scale=0.5)
    n_grid = 100
    n_samples = 50
    
    fd = make_gaussian_process(
        start=0,
        stop=4,
        n_samples=n_samples,
        n_features=n_grid,
        mean=lambda t: np.power(t, 2) + 5,
        cov=cov,
    )
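
    A minimal sketch of the round-trip check described above (hedged: the FPCA import path is the one used around this version of the package, and fit_transform comes from the sklearn-style API):

    from skfda.preprocessing.dim_reduction.projection import FPCA

    fpca = FPCA(n_components=3)
    scores = fpca.fit_transform(fd)                # functional principal scores
    fd_recovered = fpca.inverse_transform(scores)  # back to the input space

    # Residuals of the 'identity regression' on grid values.
    residuals = fd.data_matrix - fd_recovered.data_matrix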
    

    Below are the resulting QQ-plots (computed with scipy.stats.probplot) when `X_recovered` is computed with 3 principal components:

    • When `fd` is left in FDataGrid format: slope ~ 0.675, intercept ~ -3e-17 and R^2 ~ 0.9994. Note that the residuals are computed in terms of functional data values. [QQ-plot: FDataGrid]

    • When `fd` is mapped to FDataBasis format with a 4th-order BSpline basis of cardinality 50: slope ~ 1.08, intercept ~ -5e-17 and R^2 ~ 0.9996. Note that the residuals are computed in terms of the coefficients in the (here BSpline) basis. [QQ-plot: FDataBasis]

    I can enhance the PR and provide further empirical results, just tell me :)

    opened by Clej 12
  • ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject


    Scikit-fda is successfully installed, but when I try to import it, I receive the following error:

    ValueError                                Traceback (most recent call last)
    <ipython-input-17-97c44e0d3210> in <module>
    ----> 1 import skfda
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/__init__.py in <module>
         35 from .representation._functional_data import concatenate
         36 
    ---> 37 from . import representation, datasets, preprocessing, exploratory, misc, ml, \
         38     inference
         39 
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/datasets/__init__.py in <module>
          5                              fetch_weather, fetch_aemet,
          6                              fetch_octane, fetch_gait)
    ----> 7 from ._samples_generators import (make_gaussian,
          8                                   make_gaussian_process,
          9                                   make_sinusoidal_process,
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/datasets/_samples_generators.py in <module>
          7 from .. import FDataGrid
          8 from .._utils import _cartesian_product
    ----> 9 from ..misc import covariances
         10 from ..preprocessing.registration import normalize_warping
         11 from ..representation.interpolation import SplineInterpolation
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/misc/__init__.py in <module>
    ----> 1 from . import covariances, kernels, metrics
          2 from . import operators
          3 from . import regularization
          4 from ._math import (log, log2, log10, exp, sqrt, cumsum,
          5                     inner_product, inner_product_matrix)
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/misc/covariances.py in <module>
          7 import sklearn.gaussian_process.kernels as sklearn_kern
          8 
    ----> 9 from ..exploratory.visualization._utils import _create_figure, _figure_to_svg
         10 
         11 
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/exploratory/__init__.py in <module>
          1 from . import depth
    ----> 2 from . import outliers
          3 from . import stats
          4 from . import visualization
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/exploratory/outliers/__init__.py in <module>
          4 )
          5 from ._iqr import IQROutlierDetector
    ----> 6 from .neighbors_outlier import LocalOutlierFactor
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/exploratory/outliers/neighbors_outlier.py in <module>
          2 from sklearn.base import OutlierMixin
          3 
    ----> 4 from ...misc.metrics import lp_distance
          5 from ...ml._neighbors_base import (
          6     KNeighborsMixin,
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/misc/metrics.py in <module>
          6 
          7 from .._utils import _pairwise_commutative
    ----> 8 from ..preprocessing.registration import normalize_warping, ElasticRegistration
          9 from ..preprocessing.registration._warping import _normalize_scale
         10 from ..preprocessing.registration.elastic import SRSF
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/preprocessing/__init__.py in <module>
    ----> 1 from . import registration
          2 from . import smoothing
          3 from . import dim_reduction
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/preprocessing/registration/__init__.py in <module>
         14 from ._warping import invert_warping, normalize_warping
         15 
    ---> 16 from .elastic import ElasticRegistration
         17 
         18 from . import validation, elastic
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/skfda/preprocessing/registration/elastic.py in <module>
          1 
    ----> 2 from fdasrsf.utility_functions import optimum_reparam
          3 import scipy.integrate
          4 from sklearn.base import BaseEstimator, TransformerMixin
          5 from sklearn.utils.validation import check_is_fitted
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/fdasrsf/__init__.py in <module>
         20 del sys
         21 
    ---> 22 from .time_warping import fdawarp, align_fPCA, align_fPLS, pairwise_align_bayes
         23 from .plot_style import f_plot, rstyle, plot_curve, plot_reg_open_curve, plot_geod_open_curve, plot_geod_close_curve
         24 from .utility_functions import smooth_data, optimum_reparam, f_to_srsf, gradient_spline, elastic_distance, invertGamma, srsf_to_f
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/fdasrsf/time_warping.py in <module>
          7 import numpy as np
          8 import matplotlib.pyplot as plt
    ----> 9 import fdasrsf.utility_functions as uf
         10 import fdasrsf.fPCA as fpca
         11 import fdasrsf.geometry as geo
    
    ~/opt/anaconda3/envs/elephant/lib/python3.7/site-packages/fdasrsf/utility_functions.py in <module>
         17 from joblib import Parallel, delayed
         18 import numpy.random as rn
    ---> 19 import optimum_reparamN2 as orN2
         20 import optimum_reparam_N as orN
         21 import cbayesian as bay
    
    src/optimum_reparamN2.pyx in init optimum_reparamN2()
    
    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
    

    I'm using the standard import:

    import skfda
    

    I don't know why this is happening; any help is greatly appreciated.

    Version information

    • OS: MacOS
    • Python version: 3.7.9
    • scikit-fda version: 0.5
    • Version of other packages involved: [numpy: 1.19.2, scipy: 1.6.0, matplotlib: 3.3.4 , conda: 4.9.2 ]
    bug 
    opened by adeyemi-lawal 10
  • ModuleNotFoundError: No module named 'optimum_reparam'


    Hi, I just forked/cloned the repository and I found a module which is not listed in the dependencies. It's called "optimum_reparam" and it's imported right here (and maybe somewhere else). I failed to find the module myself; could someone provide details about this module in the README requirements section, please?

    Thanks for this initiative!

    opened by alejandro-ariza 8
  • Problem compiling binaries in macOS


    Describe the bug: Problem compiling C code from fdasrsf:

    error: $MACOSX_DEPLOYMENT_TARGET mismatch: now "10.14" but "10.15" during configure

    In the fdasrsf package, the variable is fixed with os.environ['MACOSX_DEPLOYMENT_TARGET'] = '10.14' in the setup.py. After cloning the repository and removing that line I get the same error. I also tried to export the environment variable before running the installation, with no success.
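
    For reference, the attempted workaround looked like this (a sketch only; the value to export depends on the SDK version shown in the error message):

    export MACOSX_DEPLOYMENT_TARGET=10.15
    pip install scikit-fda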

    Complete trace:

    Building wheel for fdasrsf (PEP 517) ... error
    ERROR: Command errored out with exit status 1:
     command: /Users/pablomm/scikit_fda_test2/venv/bin/python /Users/pablomm/scikit_fda_test2/venv/lib/python3.8/site-packages/pip/_vendor/pep517/_in_process.py build_wheel /var/folders/tk/bl_pbdpj6kb7r3s584k85jxw0000gn/T/tmpo2_rkkrs
     cwd: /private/var/folders/tk/bl_pbdpj6kb7r3s584k85jxw0000gn/T/pip-install-ym8nls3t/fdasrsf_a923c378ec5a4ec689f1d7459d35c8c7
    Complete output (38 lines):
    generating build/_DP.c (already up-to-date)
    running bdist_wheel
    running build
    running build_py
    creating build/lib.macosx-10.15-x86_64-3.8
    creating build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/plot_style.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/curve_stats.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/curve_functions.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/geodesic.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/utility_functions.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/regression.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/tolerance.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/pcr_regression.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/__init__.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/fPCA.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/umap_metric.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/time_warping.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/geometry.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/curve_regression.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/rbfgs.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/fPLS.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    copying fdasrsf/boxplots.py -> build/lib.macosx-10.15-x86_64-3.8/fdasrsf
    running build_ext
    cythoning src/optimum_reparamN2.pyx to src/optimum_reparamN2.c
    cythoning src/fpls_warp.pyx to src/fpls_warp.c
    cythoning src/mlogit_warp.pyx to src/mlogit_warp.c
    cythoning src/ocmlogit_warp.pyx to src/ocmlogit_warp.c
    cythoning src/oclogit_warp.pyx to src/oclogit_warp.c
    cythoning src/optimum_reparam_N.pyx to src/optimum_reparam_N.c
    cythoning src/cbayesian.pyx to src/cbayesian.cpp
    building 'optimum_reparamN2' extension
    creating build/temp.macosx-10.15-x86_64-3.8
    creating build/temp.macosx-10.15-x86_64-3.8/src
    clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -isysroot /Library/Developer/CommandLineTools/SDKs/MacOSX10.15.sdk -I/private/var/folders/tk/bl_pbdpj6kb7r3s584k85jxw0000gn/T/pip-build-env-qnbgsf0y/overlay/lib/python3.8/site-packages/numpy/core/include -I/usr/local/include -I/usr/local/opt/openssl@1.1/include -I/usr/local/opt/sqlite/include -I/usr/local/opt/tcl-tk/include -I/Users/pablomm/scikit_fda_test2/venv/include -I/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/include/python3.8 -c src/optimum_reparamN2.c -o build/temp.macosx-10.15-x86_64-3.8/src/optimum_reparamN2.o
    not modified: 'build/_DP.c'
    error: $MACOSX_DEPLOYMENT_TARGET mismatch: now "10.14" but "10.15" during configure


    ERROR: Failed building wheel for fdasrsf
    Building wheel for findiff (setup.py) ... done
    Created wheel for findiff: filename=findiff-0.8.9-py3-none-any.whl size=29218 sha256=c2bc96e93c195fb5c2a1acbf932b5311e8e94e571edf74929eca6e019c66532d
    Stored in directory: /Users/pablomm/Library/Caches/pip/wheels/df/48/68/71cc95b16d5f7c5115a009f92f9a5a3896fb2ece31228b0aa5
    Successfully built findiff
    Failed to build fdasrsf
    ERROR: Could not build wheels for fdasrsf which use PEP 517 and cannot be installed directly

    To Reproduce

    virtualenv venv
    source venv/bin/activate
    python --version # 3.8.8 and also tried 3.9.2
    pip install scikit-fda
    

    It also happens when the package is installed from source with python setup.py install.

    Version information

    • OS: macOS catalina 10.15.7
    • Python version: 3.9.2 and 3.8.8
    • scikit-fda version: stable master (0.5) and develop
    • gcc and g++ version: 12.0.0
    bug 
    opened by pablomm 7
  • Importing from skfda import FDataGrid gives you ValueError


    Describe the bug: When importing FDataGrid from skfda, the following error is raised:

    ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

    To Reproduce: We are installing scikit-fda==0.3 with the following Docker image:

    FROM python:3.7-slim-buster
    
    # ------------------------------------------------------------------------------------Basic Linux Packages------------------------------------------------------------------------------------
    RUN apt-get update \
        && apt-get install -y --no-install-recommends \
        ca-certificates \
        cmake \
        build-essential \
        gcc \
        g++ \
        git \
        wget \
        curl \
        libffi-dev \
        python-dev \
        unixodbc-dev \
        && rm -rf /var/lib/apt/lists/*
    

    Expected behavior: To be able to import the library.

    Version information

    • OS: Linux
    • Python version: python:3.7-slim-buster
    • scikit-fda version: 0.3
    • Version of other packages involved: numpy==1.16.2, scipy==1.3.1, matplotlib==3.1.1

    Additional context: Since 01/02/2021 we have not been able to import the library. Before that date, we could import it without problems with the same requirements and environment.

    Error Thread:

    src/hvl/utils/model_utils.py:12: in <module>
        from skfda import FDataGrid
    /usr/local/lib/python3.7/site-packages/skfda/__init__.py:36: in <module>
        from . import representation, datasets, preprocessing, exploratory, misc, ml
    /bin/bash failed with return code: 1
    /usr/local/lib/python3.7/site-packages/skfda/datasets/__init__.py:6: in <module>
    return code: 1
        from ._samples_generators import (make_gaussian_process,
    /usr/local/lib/python3.7/site-packages/skfda/datasets/_samples_generators.py:8: in <module>
        from ..misc import covariances
    /usr/local/lib/python3.7/site-packages/skfda/misc/__init__.py:2: in <module>
        from . import covariances, kernels, metrics
    /usr/local/lib/python3.7/site-packages/skfda/misc/covariances.py:9: in <module>
        from ..exploratory.visualization._utils import _create_figure, _figure_to_svg
    /usr/local/lib/python3.7/site-packages/skfda/exploratory/__init__.py:4: in <module>
        from . import visualization
    /usr/local/lib/python3.7/site-packages/skfda/exploratory/visualization/__init__.py:1: in <module>
        from . import clustering, representation
    /usr/local/lib/python3.7/site-packages/skfda/exploratory/visualization/clustering.py:11: in <module>
        from ...ml.clustering.base_kmeans import FuzzyKMeans
    /usr/local/lib/python3.7/site-packages/skfda/ml/__init__.py:2: in <module>
        from . import classification, clustering, regression
    /usr/local/lib/python3.7/site-packages/skfda/ml/classification/__init__.py:3: in <module>
        from ..._neighbors import (KNeighborsClassifier, RadiusNeighborsClassifier,
    /usr/local/lib/python3.7/site-packages/skfda/_neighbors/__init__.py:11: in <module>
        from .unsupervised import NearestNeighbors
    >   ???
    E   ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject
    
    deps/fdasrsf/optimum_reparam.pyx:1: ValueError
    
    bug 
    opened by JavierHernandezMontes 6
  • Feature/local outlier factor


    • Created LocalOutlierFactor (which wraps scikit-learn multivariate version)
    • Example in gallery of detection of outliers
    • New real dataset employed in the example (fetch_octane)
    • Tests and doctests added
    enhancement pending theory 
    opened by pablomm 6
  • Error when calling L2Regularization


    I have tried different python versions, but calling:

    regularization=L2Regularization(LinearDifferentialOperator())

    As in the docs, this results in the following error:

    TypeError: __init__() takes 1 positional argument but 2 were given

    opened by dcbrien 5
  • Feature/neighbors


    Added the following neighbors estimators:

    • NearestNeighbors
    • KNeighborsClassifier
    • RadiusNeighborsClassifier
    • NearestCentroids
    • KNeighborsScalarRegressor
    • RadiusNeighborsScalarRegressor
    • KNeighborsFunctionalRegressor
    • RadiusNeighborsFunctionalRegressor

    I wrote some examples for KNeighborsClassifier, RadiusNeighborsClassifier and KNeighborsScalarRegressor; I will write others for the regressors with functional response and the nearest centroids, but in another PR.

    Also, I had to modify lp_distance to accept the infinity case, and the mean to accept a list of weights.

    There are some little things which can be improved after merging #101. A usage sketch of the new estimators follows.
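
    A hedged usage sketch for one of the new estimators (the dataset loader and final import path are assumptions; the estimators follow the sklearn API, as described above):

    import skfda
    from skfda.ml.classification import KNeighborsClassifier

    # Functional covariates and class labels.
    X, y = skfda.datasets.fetch_growth(return_X_y=True)

    # Fit and predict exactly as with a multivariate sklearn estimator.
    clf = KNeighborsClassifier(n_neighbors=5)
    clf.fit(X, y)
    print(clf.predict(X[:5]))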

    opened by pablomm 5
  • Transformer to reduce the image dimension


    Created a transformer which receives a multivariate function from R^n -> R^d and applies a vectorial norm to reduce the image dimension to 1.

    There are two versions of the transformer: a procedural one and an sklearn-style transformer.

    I put the functions in skfda.preprocessing.dim_reduction, but maybe there is a better place.

    opened by pablomm 5
  • Does BasisSmoother rely on smoothing_parameter values?


    Describe the bug: It seems to me that changing smoothing_parameter doesn't affect the result of BasisSmoother, particularly the score.

    To Reproduce: I'm using something similar to the example in the docs:

    t = np.linspace(0, 1, 5)
    x = np.sin(2 * np.pi * t) + np.cos(2 * np.pi * t) + 2
    fd = skfda.FDataGrid(data_matrix=x, grid_points=t)
    basis = skfda.representation.basis.FourierBasis((0, 1), n_basis=3)
    smoother = skfda.preprocessing.smoothing.BasisSmoother(basis,smoothing_parameter=100)
    fd_smooth = smoother.fit_transform(fd)
    fd_smooth.data_matrix.round(2)
    smoother.score(fd,fd)
    

    Expected behavior: Changing smoothing_parameter should affect the score, but it doesn't seem to. I'm using the current GitHub version.

    Version information

    • OS: [e.g. MacOs 13.0]
    • Python version: [e.g. 3.9.15]
    • scikit-fda version: [e.g. latest]
    • Version of other packages involved [e.g. numpy 1.23.4]


    bug 
    opened by mikeaalv 2
  • Allow BasisSmoother to accept irregularly sampled data


    Alternatively, I have implemented a separate SparseBasisSmoother class which encapsulates the functionality of BasisSmoother in order to apply it to irregularly sampled data, in the format given in Issue #325.

    opened by opintosant 1
  • Data structure for discretized sparse data


    Is your feature request related to a problem? Please describe: We need an FData subclass capable of storing discretized functions where:

    • The points in which the functions are measured are NOT in a grid (especially relevant for high-dimensional functions such as surfaces).
    • The points in which each function is measured are different, and probably a different number of points per observation.
    • There are typically few points measured per observation (sparse instead of dense).

    This structure is necessary for efficient storage and computation with this kind of data.

    Describe the solution you'd like: The proposal for the implementation is to have three arrays:

    • A first array containing the (concatenated) input points for the observations. For example, for three functional observations $f_1$, $f_2$, $f_3$, evaluated at $f_1(1, 2) = (3, 4, 5), f_1(2, 4) = (1, 7, 3), f_2(0, 0) = (1, 0, 1), f_3(0, 1) = (2, 2, 2), f_3(3, 5) = (1, 0, 0)$, this array would contain [(1, 2), (2, 4), (0, 0), (0, 1), (3, 5)].
    • A second array with the corresponding values. In the previous example it would be [(3, 4, 5), (1, 7, 3), (1, 0, 1), (2, 2, 2), (1, 0, 0)].
    • A third array with the indexes where the points of each observation start. In this case [0, 2, 3].

    This approach has a compact representation and also allows vectorization to be applied: the indices can be used with the reduceat method of NumPy ufuncs, as in the sketch below.
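
    A minimal NumPy sketch of the proposed storage, using the example values above (the per-observation sum is just one instance of a reduceat-style vectorized reduction):

    import numpy as np

    # Concatenated input points of f_1, f_2, f_3.
    points = np.array([(1, 2), (2, 4), (0, 0), (0, 1), (3, 5)])
    # Corresponding output values, aligned with `points`.
    values = np.array([(3, 4, 5), (1, 7, 3), (1, 0, 1), (2, 2, 2), (1, 0, 0)])
    # Index at which each observation starts.
    starts = np.array([0, 2, 3])

    # Vectorized per-observation reduction with a NumPy ufunc:
    # sums[0] aggregates the two points of f_1, sums[1] the single point
    # of f_2, and sums[2] the two points of f_3.
    sums = np.add.reduceat(values, starts, axis=0)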

    Describe alternatives you've considered: A more high-level API should also be exposed and used when possible.

    enhancement 
    opened by vnmabus 1
  • Fix regularized FPCA with FDataGrid


    The issue

    When using FPCA with regularization on FDataGrid objects, the obtained components are orthogonal with respect to the usual inner product, which is not the one that should be used under regularization. As specified in Ramsay, J. O. and Silverman, B. W., Functional Data Analysis (page 178), the components should not be orthogonal; instead they should fulfill

    $$ \int \xi_j(s) \xi_k(s) ds + \lambda \int D^2 \xi_j(s) D^2 \xi_k(s) ds = 0, \quad (j\ne k) $$

    where $\xi_i$ are the loadings, $\lambda$ is the regularization parameter and $D^2$ is the second-order differential operator.

    Steps to reproduce

    The following code shows the problem:

    X, y = skfda.datasets.fetch_phoneme(return_X_y=True)
    X = X[:300]
    y = y[:300]
    
    n_components = 4
    
    fpca_regression = FPCA(
        n_components=n_components,
        regularization=L2Regularization(
            LinearDifferentialOperator(1),
            regularization_parameter=20,
        ),
    )
    
    fpca_regression.fit(X, y)
    
    matrix = skfda.misc.inner_product_matrix(fpca_regression.components_)
    print("Grid regularized:\n", matrix)
    
    
    grid_points = X.grid_points
    X = X.to_basis(Fourier(n_basis=10))
    
    fpca_regression = FPCA(
        n_components=n_components,
        regularization=L2Regularization(
            LinearDifferentialOperator(1),
            regularization_parameter=0.25,
        ),
    )
    fpca_regression.fit(X, y)
    
    
    matrix = skfda.misc.inner_product_matrix(fpca_regression.components_)
    print("Basis regularized:\n", matrix)
    

    Output:

    Grid regularized:
     [[ 1.00000000e+00  6.96057795e-16  9.19403442e-17 -2.91433544e-16]
     [ 6.71554826e-16  1.00000000e+00  1.51354623e-16 -2.83627288e-16]
     [ 6.93889390e-18  1.55257751e-16  1.00000000e+00  1.49619900e-16]
     [-2.43294968e-16 -2.56305394e-16  1.47451495e-16  1.00000000e+00]]
    Basis regularized:
     [[ 0.99393024 -0.0180874  -0.01324601  0.0030897 ]
     [-0.0180874   0.79784464 -0.06858017  0.07861877]
     [-0.01324601 -0.06858017  0.74805033  0.09757001]
     [ 0.0030897   0.07861877  0.09757001  0.70994851]]
    

    In the grid regularized case, the output is the identity matrix, even though the regularization coefficient has been set to a very high value (20). In contrast, the basis regularized case outputs a matrix very different from the identity.

    Current implementation

    The relevant extract of the code is the following:

    basis = FDataGrid(
        data_matrix=np.identity(n_points_discretization),
        grid_points=X.grid_points,
    )
    regularization_matrix = compute_penalty_matrix(
        basis_iterable=(basis,),
        regularization_parameter=1,
        regularization=self.regularization,
    )
    
    basis_matrix = basis.data_matrix[..., 0]
    if regularization_matrix is not None:
        basis_matrix += regularization_matrix
    
    fd_data = np.linalg.solve(
        basis_matrix.T,
        fd_data.T,
    ).T
    
    # see docstring for more information
    final_matrix = fd_data @ np.sqrt(weights_matrix)
    
    pca = PCA(n_components=self.n_components)
    pca.fit(final_matrix)
    self.components_ = X.copy(
        data_matrix=np.transpose(
            np.linalg.solve(
                np.sqrt(weights_matrix),
                np.transpose(pca.components_),
            ),
        ),
        sample_names=(None,) * self.n_components,
    )
    

    We can see that the code "matches" the TFG (undergraduate thesis) corresponding to this functionality. In this thesis, it is stated that the problem is equivalent to solving the multivariate PCA on the following matrix:

    $$ \mathbf{V}=\frac{\mathbf{X}\mathbf{W}^{1/2}}{\sqrt{N}(\mathbf{I}_M + \mathbf{P}_k)} $$

    This formula has been extracted directly from the TFG, where the matrix operation is expressed as a fraction. Note, however, that the correct equation would have the inverse of the denominator multiplying the numerator from the right side. In the TFG, this equation is justified by stating that maximizing the penalized variance is equivalent to the following eigenvalue problem:

    $$ \frac{\mathbf{W}^{1/2} (\mathbf{X}^\intercal \mathbf{X})\mathbf{W}^{1/2} v_j(t)} {N (\mathbf{I}_M + \mathbf{P}_k)^\intercal(\mathbf{I}_M + \mathbf{P}_k)} = \lambda_j v_j(t) $$

    Then, the previous equation can be seen as the usual eigenvalue problem for the covariance matrix of the data, but with a modified covariance matrix. However, even though the previous equation is similar to $\mathbf{V}^\intercal \mathbf{V}$, there are issues with this notation and analysis.

    fda.usc implementation

    The implementation in fda.usc is the following:

     if (lambda > 0) {
        if (is.vector(P)) {
          P <- P.penalty(tt, P)
        }
        dimp <- dim(P)
        if (!(dimp[1] == dimp[2] & dimp[1] == J)) {
          stop("Incorrect matrix dimension P")
        }
        M <- solve(diag(J) + lambda * P)
        Xcen.fdata$data <- Xcen.fdata$data %*% t(M)
      }
    

    where

    • diag(J) is the identity matrix of size J
    • P is the penalty matrix

    Therefore, the implementation in fda.usc is equivalent to the current implementation and also returns the identity matrix as the inner product matrix. The issue (moviedo5/fda.usc/#6) has been opened in the fda.usc repository.

    Proposed solution

    Mathematical analysis

    Non-regularized FPCA summary

    In the usual PCA, the goal is to maximize the variance of the projections onto the principal components, subject to two restrictions. That is to say, maximize

    $$ \frac{1}{N} \mathbf{\phi^\intercal X^\intercal X\phi} $$

    subject to:

    • $\mathbf{\phi^\intercal \phi}=1$
    • The loadings are orthogonal $\mathbf{\phi_i^\intercal \phi_j}=0$ if $i\neq j$

    Regularized FPCA

    The regularized version of FPCA aims to maximize the penalized sample variance. If we want to penalize the second derivative, we can define it as:

    $$ psv(\phi) = \frac{ \frac 1N \sum_{n=1}^N \left(\int \phi x_n\right)^2 } { |\phi|^2 + \lambda p(\phi,2) }, $$

    where $p(\phi,2)$ is the penalty function for the second derivative applied to $\phi$. The restrictions now are:

    • $|\phi|^2=1$
    • $\int \phi_i \phi_j + \lambda \int D^2\phi_i D^2\phi_j= 0$

    For simplicity, we will ignore the quadrature for now and assume $\mathbf{W=Id}$ (the integration weights). Then, we can make the following substitutions:

    $$ psv(\phi) = \frac{ \frac 1N \sum_{n=1}^N \left(\int \phi x_n\right)^2 } { |\phi|^2 + \lambda p(\phi,2) } \approx \frac{ \frac 1N \boldsymbol \phi^\intercal \mathbf X^\intercal \mathbf X \boldsymbol \phi }{ \boldsymbol \phi^\intercal \boldsymbol \phi + \lambda \boldsymbol \phi^\intercal \mathbf P_2^\intercal \mathbf P_2 \boldsymbol \phi } = \frac{ \frac 1N \boldsymbol {\phi}^\intercal \mathbf X^\intercal \mathbf X \boldsymbol \phi }{ \boldsymbol \phi^\intercal \left(\mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2\right)\boldsymbol \phi }, $$

    where $\mathbf P_2$ is the penalty matrix for the second derivative, $\boldsymbol \phi$ is the vector of $\phi$ evaluated in the grid points and $\mathbf X$ is the usual data matrix. That is to say, the functions evaluated in the grid points.

    We can also obtain:

    $$ 0 = \int \phi_i \phi_j + \lambda \int D^2\phi_i D^2\phi_j = \boldsymbol \phi_i^\intercal \boldsymbol \phi_j + \lambda \boldsymbol \phi_i^\intercal \mathbf P_2^\intercal \mathbf P_2 \boldsymbol \phi_j = \boldsymbol \phi_i^\intercal \left(\mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2\right)\boldsymbol \phi_j $$

    In light of these relationships, we can see that we can transform the problem back to the usual PCA problem using the Cholesky decomposition $\mathbf{L} \mathbf{L}^\intercal = \mathbf{Id} + \lambda \mathbf P_2^\intercal \mathbf P_2$. Then, we can apply the change of variable $\boldsymbol \phi = \mathbf{(L^\intercal)^{-1}} \boldsymbol \psi$. Using the Cholesky decomposition, we can obtain the following equation.

    $$ psv(\phi) = \frac{ \frac 1N \boldsymbol \psi^\intercal \mathbf{L}^{-1} \mathbf X^\intercal \mathbf X \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi }{ \boldsymbol \psi^\intercal \mathbf {L}^{-1} \left(\mathbf{LL^\intercal}\right) \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi }= \frac{ \frac 1N \boldsymbol \psi ^\intercal \Big(\mathbf X \mathbf {(L^\intercal)}^{-1}\Big)^\intercal \Big(\mathbf X \mathbf {(L^\intercal)}^{-1}\Big) \boldsymbol \psi } { \boldsymbol \psi^\intercal \boldsymbol \psi } $$

    Then the constraints become:

    $$ 1 = |\boldsymbol \phi|^2 = |(\mathbf L^\intercal)^{-1} \boldsymbol \psi|^2 = \boldsymbol \psi^\intercal \mathbf L^{-1} (\mathbf L^{-1})^\intercal \boldsymbol \psi $$

    $$ 0 = \boldsymbol \psi_i^\intercal \mathbf {L}^{-1} \left(\mathbf{LL^\intercal}\right) \mathbf{(L^\intercal)}^{-1} \boldsymbol \psi_j = \boldsymbol \psi_i^\intercal \boldsymbol \psi_j $$

    Therefore, we can see that the problem is equivalent to the usual PCA problem with the following changes:

    • The data matrix $\mathbf X$ is replaced by $\mathbf X \mathbf {(L^\intercal)}^{-1}$.
    • Use the change of variable $\boldsymbol \phi = \mathbf{(L^\intercal)^{-1}} \boldsymbol \psi$ so that loadings obtained are orthogonal with respect to the new inner product given by $\mathbf{L}^{-1} (\mathbf L^{-1})^\intercal$.

    Note that after the change of variable, the loadings will no longer have norm 1 with respect to the original inner product. However, this can easily be fixed by normalizing them at the end, as in the sketch below.
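
    A minimal NumPy sketch of this change of variable (hedged: P2 denotes a precomputed penalty matrix for the chosen operator, and all names are illustrative rather than the actual code in the pull request):

    import numpy as np

    def regularized_pca_grid(X, P2, lam, n_components):
        n, m = X.shape
        # Cholesky factor of the penalized inner product: Id + lam * P2^T P2 = L L^T.
        L = np.linalg.cholesky(np.eye(m) + lam * P2.T @ P2)
        # Transformed data matrix X (L^T)^{-1}, via a solve instead of an explicit inverse.
        Xt = np.linalg.solve(L, X.T).T
        # Usual PCA on the transformed, centered data.
        Xt = Xt - Xt.mean(axis=0)
        _, _, vt = np.linalg.svd(Xt / np.sqrt(n), full_matrices=False)
        psi = vt[:n_components]  # orthonormal loadings in the transformed space
        # Undo the change of variable: phi = (L^T)^{-1} psi.
        phi = np.linalg.solve(L.T, psi.T).T
        # Renormalize to unit norm in the original space.
        return phi / np.linalg.norm(phi, axis=1, keepdims=True)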

    Progress so far

    This Cholesky decomposition approach has been implemented in the attached pull request. Note, however, that the code has not been optimized yet (it still uses np.linalg.inv instead of np.linalg.solve, for example), nor has it been tested extensively.

    Regularization parameter effect

    The main issue remaining is that the regularization parameter $\lambda$ does not seem to have the same effect in the grid implementation as in the basis implementation. The reason may be related to the penalty matrix. To reproduce the issue, run the following code:

    X, y = skfda.datasets.fetch_phoneme(return_X_y=True)
    X = X[:300]
    y = y[:300]
    
    
    def fit_plot(ax, title, regularization_parameter=0):
        fpca_regression = FPCA(
            n_components=4,
            regularization=L2Regularization(
                LinearDifferentialOperator(1),
                regularization_parameter=regularization_parameter,
            ),
        )
        fpca_regression.fit(X, y)
        fpca_regression.components_.plot(axes=ax)
        ax.set_title(title)
        ax.set_xlabel("")
        ax.set_ylabel("")
    
    
    fig, ax = plt.subplots(3, 2, figsize=(10, 10))
    
    fit_plot(ax[0, 0], "Grid not regularized")
    fit_plot(ax[0, 1], "Grid regularized", regularization_parameter=2.5)
    
    grid_points = X.grid_points
    X = X.to_basis(Fourier(n_basis=10))
    
    fit_plot(ax[1, 0], "Basis not regularized")
    fit_plot(ax[1, 1], "Basis regularized", regularization_parameter=0.125)
    
    # Convert back to grid using the same points as before
    X = X.to_grid(grid_points)
    
    fit_plot(ax[2, 0], "Grid from basis not regularized")
    fit_plot(ax[2, 1], "Grid from basis regularized", regularization_parameter=2.5)
    
    plt.show()
    

    [Figure: component plots for the six cases titled in the code above]

    As can be seen, the regularization parameter has less of an effect in the basis implementation than in the grid implementation. However, by adjusting the parameters, the results do seem to match almost perfectly.

    Next steps

    • [ ] Fix the regularization parameter effect issue.
    • [ ] Either change the tests so that they do not compare against fda.usc in this particular case or wait for them to respond to the issue (moviedo5/fda.usc/#6). Taking advantage of the migration from unittest to pytest, it would be possible to use pytest mechanisms to prevent these tests from being executed while documenting the reason why they should not be executed.
    • [ ] Optimize and revise the implementation in #485
    • [ ] Investigate the fact that the results in FDataBasis might not have norm 1.
    opened by Ddelval 0
Releases (0.7.1)
  • 0.7.1 (Jan 25, 2022)

  • 0.7 (Jan 25, 2022)

    • TikhonovRegularization renamed to L2Regularization. The old name is deprecated. See #409 for details.
    • Add the option for MagnitudeShapePlot to plot the inlier ellipse.
    • Additional bug fixes.
  • 0.6.1 (Dec 23, 2021)

  • 0.6 (Sep 12, 2021)

    Some highlights of this version:

    • Add typing support for most functionalities of the package. Now you can use Mypy to check for errors.
    • New depth-based classifiers
    • New visualization functions
      • Parametric plot
      • Outliergram
      • Multiple interactive plots
    • New historical functional regression model
    • Other minor additions and fixes
  • 0.5 (Dec 31, 2020)

    Some highlights of the new functionality:

    • Depth functions refactored
    • Removed C code and added fdasrsf as a dependency
    • Added new variable selection methods
    • Added Maximum Depth classifier
    • Optimized FDataGrid inner product
  • 0.4 (Jul 23, 2020)

    Adds functional component analysis, ANOVA and Hotelling tests, basis representation for multivariate and vector-valued functions, and Tikhonov regularization.

Owner
Grupo de Aprendizaje Automático - Universidad Autónoma de Madrid
Machine Learning Group at Universidad Autónoma de Madrid