Title: | Clustering and Prediction using Multi-Task Gaussian Processes with Common Mean |
---|---|
Description: | An implementation for the multi-task Gaussian processes with common mean framework. Two main algorithms, called 'Magma' and 'MagmaClust', are available to perform predictions for supervised learning problems, in particular for time series or any functional/continuous data applications. The corresponding articles has been respectively proposed by Arthur Leroy, Pierre Latouche, Benjamin Guedj and Servane Gey (2022) <doi:10.1007/s10994-022-06172-1>, and Arthur Leroy, Pierre Latouche, Benjamin Guedj and Servane Gey (2023) <https://jmlr.org/papers/v24/20-1321.html>. Theses approaches leverage the learning of cluster-specific mean processes, which are common across similar tasks, to provide enhanced prediction performances (even far from data) at a linear computational cost (in the number of tasks). 'MagmaClust' is a generalisation of 'Magma' where the tasks are simultaneously clustered into groups, each being associated to a specific mean process. User-oriented functions in the package are decomposed into training, prediction and plotting functions. Some basic features (classic kernels, training, prediction) of standard Gaussian processes are also implemented. |
Authors: | Arthur Leroy [aut, cre] |
Maintainer: | Arthur Leroy <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.2.1 |
Built: | 2025-02-19 06:01:16 UTC |
Source: | https://github.com/arthurleroy/magmaclustr |
Allocate training data into the most probable cluster
data_allocate_cluster(trained_model)
data_allocate_cluster(trained_model)
trained_model |
A list, containing the information coming from a
MagmaClust model, previously trained using the
|
The original dataset used to train the MagmaClust model, with additional 'Cluster' and associated 'Proba' columns, indicating the most probable cluster for each individual/task at the end of the training procedure.
TRUE
TRUE
Expand a grid of inputs
expand_grid_inputs(Input, ...)
expand_grid_inputs(Input, ...)
Input |
A vector of inputs. |
... |
As many vector of covariates as desired. We advise to give explicit names when using the function. |
A tibble containing all the combination of values of the parameters.
TRUE
TRUE
Generate a set of random hyper-parameters, specific to the chosen type of kernel, under the format that is used in Magma.
hp( kern = "SE", list_ID = NULL, list_hp = NULL, noise = FALSE, common_hp = FALSE )
hp( kern = "SE", list_ID = NULL, list_hp = NULL, noise = FALSE, common_hp = FALSE )
kern |
A function, or a character string indicating the chosen type of kernel among:
In case of a custom kernel function, the argument |
list_ID |
A vector, associating an |
list_hp |
A vector of characters, providing the name of each
hyper-parameter, in case where |
noise |
A logical value, indicating whether a 'noise' hyper-parameter should be included. |
common_hp |
A logical value, indicating whether the set of hyper-parameters is assumed to be common to all individuals. |
A tibble, providing a set of random hyper-parameters associated with
the kernel specified through the argument kern
.
TRUE
TRUE
Compute the parameters of the hyper-posterior Gaussian distribution of the
mean process in Magma (similarly to the expectation step of the EM
algorithm used for learning). This hyper-posterior distribution, evaluated
on a grid of inputs provided through the grid_inputs
argument, is a
key component for making prediction in Magma, and is required in the function
pred_magma
.
hyperposterior( trained_model = NULL, data = NULL, hp_0 = NULL, hp_i = NULL, kern_0 = NULL, kern_i = NULL, prior_mean = NULL, grid_inputs = NULL, pen_diag = 1e-10 )
hyperposterior( trained_model = NULL, data = NULL, hp_0 = NULL, hp_i = NULL, kern_0 = NULL, kern_i = NULL, prior_mean = NULL, grid_inputs = NULL, pen_diag = 1e-10 )
trained_model |
A list, containing the information coming from a
Magma model, previously trained using the |
data |
A tibble or data frame. Required columns: 'Input',
'Output'. Additional columns for covariates can be specified.
The 'Input' column should define the variable that is used as
reference for the observations (e.g. time for longitudinal data). The
'Output' column specifies the observed values (the response
variable). The data frame can also provide as many covariates as desired,
with no constraints on the column names. These covariates are additional
inputs (explanatory variables) of the models that are also observed at
each reference 'Input'. Recovered from |
hp_0 |
A named vector, tibble or data frame of hyper-parameters
associated with |
hp_i |
A tibble or data frame of hyper-parameters
associated with |
kern_0 |
A kernel function, associated with the mean GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
kern_i |
A kernel function, associated with the individual GPs. ("SE",
"PERIO" and "RQ" are aso available here). Recovered from
|
prior_mean |
Hyper-prior mean parameter of the mean GP. This argument, can be specified under various formats, such as:
|
grid_inputs |
A vector or a data frame, indicating the grid of additional reference inputs on which the mean process' hyper-posterior should be evaluated. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
A list gathering the parameters of the mean processes' hyper-posterior distributions, namely:
mean: A tibble, the hyper-posterior mean parameter
evaluated at each training Input
.
cov: A matrix, the covariance parameter for the hyper-posterior distribution of the mean process.
pred: A tibble, the predicted mean and variance at
Input
for the mean process' hyper-posterior
distribution under a format that allows the direct
visualisation as a GP prediction.
TRUE
TRUE
Recompute the E-step of the VEM algorithm in MagmaClust for a new set of
reference Input
. Once training is completed, it can be necessary to
evaluate the hyper-posterior distributions of the mean processes at specific
locations, for which we want to make predictions. This process is directly
implemented in the pred_magmaclust
function but the user
might want to use hyperpost_clust
for a tailored control of
the prediction procedure.
hyperposterior_clust( trained_model = NULL, data = NULL, mixture = NULL, hp_k = NULL, hp_i = NULL, kern_k = NULL, kern_i = NULL, prior_mean_k = NULL, grid_inputs = NULL, pen_diag = 1e-10 )
hyperposterior_clust( trained_model = NULL, data = NULL, mixture = NULL, hp_k = NULL, hp_i = NULL, kern_k = NULL, kern_i = NULL, prior_mean_k = NULL, grid_inputs = NULL, pen_diag = 1e-10 )
trained_model |
A list, containing the information coming from a
Magma model, previously trained using the |
data |
A tibble or data frame. Required columns: |
mixture |
A tibble or data frame, indicating the mixture probabilities
of each cluster for each individual. Required column: |
hp_k |
A tibble or data frame of hyper-parameters
associated with |
hp_i |
A tibble or data frame of hyper-parameters
associated with |
kern_k |
A kernel function, associated with the mean GPs. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
kern_i |
A kernel function, associated with the individual GPs. ("SE",
"LIN", PERIO" and "RQ" are also available here). Recovered from
|
prior_mean_k |
The set of hyper-prior mean parameters (m_k) for the K mean GPs, one value for each cluster. cluster. This argument can be specified under various formats, such as:
|
grid_inputs |
A vector or a data frame, indicating the grid of additional reference inputs on which the mean process' hyper-posterior should be evaluated. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
A list containing the parameters of the mean processes' hyper-posterior distribution, namely:
mean: A list of tibbles containing, for each cluster, the
hyper-posterior mean parameters evaluated at each
Input
.
cov: A list of matrices containing, for each cluster, the hyper-posterior covariance parameter of the mean process.
mixture: A tibble, indicating the mixture probabilities in each cluster for each individual.
TRUE
TRUE
The MagmaClustR package implements two main algorithms, called Magma and MagmaClust, using a multi-task GPs model to perform predictions for supervised learning problems. Theses approaches leverage the learning of cluster-specific mean processes, which are common across similar tasks, to provide enhanced prediction performances (even far from data) at a linear computational cost (in the number of tasks). MagmaClust is a generalisation of Magma where the tasks are simultaneously clustered into groups, each being associated to a specific mean process. User-oriented functions in the package are decomposed into training, prediction and plotting functions. Some basic features of standard GPs are also implemented.
For a quick introduction to MagmaClustR, please refer to the README at https://github.com/ArthurLeroy/MagmaClustR
Arthur Leroy, Pierre Pathe and Pierre Latouche
Maintainer: Arthur Leroy - [email protected]
Arthur Leroy, Pierre Latouche, Benjamin Guedj, and Servane Gey.
MAGMA: Inference and Prediction with Multi-Task Gaussian Processes.
Machine Learning, 2022,
https://link.springer.com/article/10.1007/s10994-022-06172-1
Arthur Leroy, Pierre Latouche, Benjamin Guedj, and Servane Gey.
Cluster-Specific Predictions with Multi-Task Gaussian Processes.
Journal of Machine Learning Research, 2023,
https://jmlr.org/papers/v24/20-1321.html
set.seed(4242)
data_magma <- simu_db(M = 11, N = 10, K = 1)
magma_train <- data_magma %>% subset(ID %in% 1:10)
magma_test <- data_magma %>% subset(ID == 11) %>% head(7)
magma_model <- train_magma(data = magma_train)
magma_pred <- pred_magma(data = magma_test, trained_model = magma_model,
grid_inputs = seq(0, 10, 0.01))
set.seed(4242)
data_magmaclust <- simu_db(M = 4, N = 10, K = 3)
list_ID = unique(data_magmaclust$ID)
magmaclust_train <- data_magmaclust %>% subset(ID %in% list_ID[1:11])
magmaclust_test <- data_magmaclust %>% subset(ID == list_ID[12]) %>%
head(5)
magmaclust_model <- train_magmaclust(data = magmaclust_train)
magmaclust_pred <- pred_magmaclust(data = magmaclust_test,
trained_model = magmaclust_model, grid_inputs = seq(0, 10, 0.01))
Maintainer: Arthur Leroy [email protected] (ORCID)
Authors:
Pierre Latouche [email protected]
Other contributors:
Pierre Pathé [email protected] [contributor]
Alexia Grenouillat [email protected] [contributor]
Hugo Lelievre [email protected] [contributor]
Useful links:
Report bugs at https://github.com/ArthurLeroy/MagmaClustR/issues
Display raw data under the Magma format as smoothed curves.
plot_db(data, cluster = FALSE, legend = FALSE)
plot_db(data, cluster = FALSE, legend = FALSE)
data |
A data frame or tibble with format : ID, Input, Output. |
cluster |
A boolean indicating whether data should be coloured by cluster. Requires a column named 'Cluster'. |
legend |
A boolean indicating whether the legend should be displayed. |
Graph of smoothed curves of raw data.
TRUE
TRUE
Create a GIF animation displaying how Magma or classic GP predictions evolve and improve when the number of data points increase.
plot_gif( pred_gp, x_input = NULL, data = NULL, data_train = NULL, prior_mean = NULL, y_grid = NULL, heatmap = FALSE, prob_CI = 0.95, size_data = 3, size_data_train = 1, alpha_data_train = 0.5, export_gif = FALSE, path = "gif_gp.gif", ... )
plot_gif( pred_gp, x_input = NULL, data = NULL, data_train = NULL, prior_mean = NULL, y_grid = NULL, heatmap = FALSE, prob_CI = 0.95, size_data = 3, size_data_train = 1, alpha_data_train = 0.5, export_gif = FALSE, path = "gif_gp.gif", ... )
pred_gp |
A tibble, typically coming from the |
x_input |
A vector of character strings, indicating which input should be displayed. If NULL(default) the 'Input' column is used for the x-axis. If providing a 2-dimensional vector, the corresponding columns are used for the x-axis and y-axis. |
data |
(Optional) A tibble or data frame. Required columns: 'Input', 'Output'. Additional columns for covariates can be specified. The 'Input' column should define the variable that is used as reference for the observations (e.g. time for longitudinal data). The 'Output' column specifies the observed values (the response variable). The data frame can also provide as many covariates as desired, with no constraints on the column names. These covariates are additional inputs (explanatory variables) of the models that are also observed at each reference 'Input'. |
data_train |
(Optional) A tibble or data frame, containing the training
data of the Magma model. The data set should have the same format as the
|
prior_mean |
(Optional) A tibble or a data frame, containing the 'Input' and associated 'Output' prior mean parameter of the GP prediction. |
y_grid |
A vector, indicating the grid of values on the y-axis for which
probabilities should be computed for heatmaps of 1-dimensional
predictions. If NULL (default), a vector of length 50 is defined, ranging
between the min and max 'Output' values contained in |
heatmap |
A logical value indicating whether the GP prediction should be represented as a heatmap of probabilities for 1-dimensional inputs. If FALSE (default), the mean curve and associated 95% CI are displayed. |
prob_CI |
A number between 0 and 1 (default is 0.95), indicating the level of the Credible Interval associated with the posterior mean curve. |
size_data |
A number, controlling the size of the |
size_data_train |
A number, controlling the size of the
|
alpha_data_train |
A number, between 0 and 1, controlling transparency
of the |
export_gif |
A logical value indicating whether the animation should be exported as a .gif file. |
path |
A character string defining the path where the GIF file should be exported. |
... |
Any additional parameters that can be passed to the function
|
Visualisation of a Magma or GP prediction (optional: display data points, training data points and the prior mean function), where data points are added sequentially for visualising changes in prediction as information increases.
TRUE
TRUE
Display Magma or classic GP predictions. According to the dimension of the inputs, the graph may be a mean curve + Credible Interval or a heatmap of probabilities.
plot_gp( pred_gp, x_input = NULL, data = NULL, data_train = NULL, prior_mean = NULL, y_grid = NULL, heatmap = FALSE, samples = FALSE, nb_samples = 50, plot_mean = TRUE, alpha_samples = 0.3, prob_CI = 0.95, size_data = 3, size_data_train = 1, alpha_data_train = 0.5 ) plot_magma( pred_gp, x_input = NULL, data = NULL, data_train = NULL, prior_mean = NULL, y_grid = NULL, heatmap = FALSE, samples = FALSE, nb_samples = 50, plot_mean = TRUE, alpha_samples = 0.3, prob_CI = 0.95, size_data = 3, size_data_train = 1, alpha_data_train = 0.5 )
plot_gp( pred_gp, x_input = NULL, data = NULL, data_train = NULL, prior_mean = NULL, y_grid = NULL, heatmap = FALSE, samples = FALSE, nb_samples = 50, plot_mean = TRUE, alpha_samples = 0.3, prob_CI = 0.95, size_data = 3, size_data_train = 1, alpha_data_train = 0.5 ) plot_magma( pred_gp, x_input = NULL, data = NULL, data_train = NULL, prior_mean = NULL, y_grid = NULL, heatmap = FALSE, samples = FALSE, nb_samples = 50, plot_mean = TRUE, alpha_samples = 0.3, prob_CI = 0.95, size_data = 3, size_data_train = 1, alpha_data_train = 0.5 )
pred_gp |
A tibble or data frame, typically coming from
|
x_input |
A vector of character strings, indicating which input should be displayed. If NULL (default) the 'Input' column is used for the x-axis. If providing a 2-dimensional vector, the corresponding columns are used for the x-axis and y-axis. |
data |
(Optional) A tibble or data frame. Required columns: 'Input', 'Output'. Additional columns for covariates can be specified. This argument corresponds to the raw data on which the prediction has been performed. |
data_train |
(Optional) A tibble or data frame, containing the training
data of the Magma model. The data set should have the same format as the
|
prior_mean |
(Optional) A tibble or a data frame, containing the 'Input' and associated 'Output' prior mean parameter of the GP prediction. |
y_grid |
A vector, indicating the grid of values on the y-axis for which
probabilities should be computed for heatmaps of 1-dimensional
predictions. If NULL (default), a vector of length 50 is defined, ranging
between the min and max 'Output' values contained in |
heatmap |
A logical value indicating whether the GP prediction should be represented as a heatmap of probabilities for 1-dimensional inputs. If FALSE (default), the mean curve and associated Credible Interval are displayed. |
samples |
A logical value indicating whether the GP prediction should be represented as a collection of samples drawn from the posterior. If FALSE (default), the mean curve and associated Credible Interval are displayed. |
nb_samples |
A number, indicating the number of samples to be drawn from the predictive posterior distribution. For two-dimensional graphs, only one sample can be displayed. |
plot_mean |
A logical value, indicating whether the mean prediction
should be displayed on the graph when |
alpha_samples |
A number, controlling transparency of the sample curves. |
prob_CI |
A number between 0 and 1 (default is 0.95), indicating the level of the Credible Interval associated with the posterior mean curve. If this this argument is set to 1, the Credible Interval is not displayed. |
size_data |
A number, controlling the size of the |
size_data_train |
A number, controlling the size of the
|
alpha_data_train |
A number, between 0 and 1, controlling transparency
of the |
Visualisation of a Magma or GP prediction (optional: display data
points, training data points and the prior mean function). For 1-D
inputs, the prediction is represented as a mean curve and its associated
95% Credible Interval, as a collection of samples drawn from the
posterior if samples
= TRUE, or as a heatmap of probabilities if
heatmap
= TRUE. For 2-D inputs, the prediction is represented as a
heatmap, where each couple of inputs on the x-axis and y-axis are
associated with a gradient of colours for the posterior mean values,
whereas the uncertainty is indicated by the transparency (the narrower is
the Credible Interval, the more opaque is the associated colour, and vice
versa)
TRUE
TRUE
Display MagmaClust predictions. According to the dimension of the inputs, the graph may be a mean curve (dim inputs = 1) or a heatmap (dim inputs = 2) of probabilities. Moreover, MagmaClust can provide credible intervals only by visualising cluster-specific predictions (e.g. for the most probable cluster). When visualising the full mixture-of-GPs prediction, which can be multimodal, the user should choose between the simple mean function or the full heatmap of probabilities (more informative but slower).
plot_magmaclust( pred_clust, cluster = "all", x_input = NULL, data = NULL, data_train = NULL, col_clust = FALSE, prior_mean = NULL, y_grid = NULL, heatmap = FALSE, samples = FALSE, nb_samples = 50, plot_mean = TRUE, alpha_samples = 0.3, prob_CI = 0.95, size_data = 3, size_data_train = 1, alpha_data_train = 0.5 )
plot_magmaclust( pred_clust, cluster = "all", x_input = NULL, data = NULL, data_train = NULL, col_clust = FALSE, prior_mean = NULL, y_grid = NULL, heatmap = FALSE, samples = FALSE, nb_samples = 50, plot_mean = TRUE, alpha_samples = 0.3, prob_CI = 0.95, size_data = 3, size_data_train = 1, alpha_data_train = 0.5 )
pred_clust |
A list of predictions, typically coming from
|
cluster |
A character string, indicating which cluster to plot from. If 'all' (default) the mixture of GPs prediction is displayed as a mean curve (1-D inputs) or a mean heatmap (2-D inputs). Alternatively, if the name of one cluster is provided, the classic mean curve + credible interval is displayed (1-D inputs), or a heatmap with colour gradient for the mean and transparency gradient for the Credible Interval (2-D inputs). |
x_input |
A vector of character strings, indicating which input should be displayed. If NULL (default) the 'Input' column is used for the x-axis. If providing a 2-dimensional vector, the corresponding columns are used for the x-axis and y-axis. |
data |
(Optional) A tibble or data frame. Required columns: |
data_train |
(Optional) A tibble or data frame, containing the training
data of the MagmaClust model. The data set should have the same format as
the |
col_clust |
A boolean indicating whether backward points are coloured
according to the individuals or to their most probable cluster. If one
wants to colour by clusters, a column |
prior_mean |
(Optional) A list providing, for each cluster, a
tibble containing prior mean parameters of the prediction. This argument
typically comes as an outcome |
y_grid |
A vector, indicating the grid of values on the y-axis for which
probabilities should be computed for heatmaps of 1-dimensional
predictions. If NULL (default), a vector of length 50 is defined, ranging
between the min and max 'Output' values contained in |
heatmap |
A logical value indicating whether the GP mixture should be represented as a heatmap of probabilities for 1-dimensional inputs. If FALSE (default), the mean curve (and associated Credible Interval if available) are displayed. |
samples |
A logical value indicating whether the GP mixture should be represented as a collection of samples drawn from the posterior. If FALSE (default), the mean curve (and associated Credible Interval if available) are displayed. |
nb_samples |
A number, indicating the number of samples to be drawn from the predictive posterior distribution. For two-dimensional graphs, only one sample can be displayed. |
plot_mean |
A logical value, indicating whether the mean prediction
should be displayed on the graph when |
alpha_samples |
A number, controlling transparency of the sample curves. |
prob_CI |
A number between 0 and 1 (default is 0.95), indicating the level of the Credible Interval associated with the posterior mean curve. If this this argument is set to 1, the Credible Interval is not displayed. |
size_data |
A number, controlling the size of the |
size_data_train |
A number, controlling the size of the
|
alpha_data_train |
A number, between 0 and 1, controlling transparency
of the |
Visualisation of a MagmaClust prediction (optional: display data
points, training data points and the prior mean functions). For 1-D
inputs, the prediction is represented as a mean curve (and its associated
95% Credible Interval for cluster-specific predictions), or as a heatmap
of probabilities if heatmap
= TRUE. In the case of MagmaClust,
the heatmap representation should be preferred for clarity, although the
default display remains mean curve for quicker execution. For 2-D inputs,
the prediction is represented as a heatmap, where each couple of inputs on
the x-axis and y-axis are associated with a gradient of colours for the
posterior mean values, whereas the uncertainty is indicated by the
transparency (the narrower is the Credible Interval, the more opaque is
the associated colour, and vice versa). As for 1-D inputs, Credible
Interval information is only available for cluster-specific predictions.
TRUE
TRUE
Display samples drawn from the posterior of a GP, Magma or MagmaClust prediction. According to the dimension of the inputs, the graph may represent curves or a heatmap.
plot_samples( pred = NULL, samples = NULL, nb_samples = 50, x_input = NULL, plot_mean = TRUE, alpha_samples = 0.3 )
plot_samples( pred = NULL, samples = NULL, nb_samples = 50, x_input = NULL, plot_mean = TRUE, alpha_samples = 0.3 )
pred |
A list, typically coming from |
samples |
A tibble or data frame, containing the samples generated from
a GP, Magma, or MagmaClust prediction. Required columns: |
nb_samples |
A number, indicating the number of samples to be drawn from the predictive posterior distribution. For two-dimensional graphs, only one sample can be displayed. |
x_input |
A vector of character strings, indicating which 'column' should be displayed in the case of multidimensional inputs. If NULL(default) the Input' column is used for the x-axis. If providing a 2-dimensional vector, the corresponding columns are used for the x-axis and the y-axis. |
plot_mean |
A logical value, indicating whether the mean prediction should be displayed on the graph. |
alpha_samples |
A number, controlling transparency of the sample curves. |
Graph of samples drawn from a posterior distribution of a GP, Magma, or MagmaClust prediction.
TRUE
TRUE
Generate a Magma or classic GP prediction under a format that is compatible
with a further GIF visualisation of the results. For a Magma prediction,
either the trained_model
or hyperpost
argument is required.
Otherwise, a classic GP prediction is applied and the prior mean can be
specified through the mean
argument.
pred_gif( data, trained_model = NULL, grid_inputs = NULL, hyperpost = NULL, mean = NULL, hp = NULL, kern = "SE", pen_diag = 1e-10 )
pred_gif( data, trained_model = NULL, grid_inputs = NULL, hyperpost = NULL, mean = NULL, hp = NULL, kern = "SE", pen_diag = 1e-10 )
data |
A tibble or data frame. Required columns: 'Input', 'Output'. Additional columns for covariates can be specified. The 'Input' column should define the variable that is used as reference for the observations (e.g. time for longitudinal data). The 'Output' column specifies the observed values (the response variable). The data frame can also provide as many covariates as desired, with no constraints on the column names. These covariates are additional inputs (explanatory variables) of the models that are also observed at each reference 'Input'. |
trained_model |
A list, containing the information coming from a
Magma model, previously trained using the |
grid_inputs |
The grid of inputs (reference Input and covariates) values
on which the GP should be evaluated. Ideally, this argument should be a
tibble or a data frame, providing the same columns as |
hyperpost |
A list, containing the elements 'mean' and 'cov', the
parameters of the hyper-posterior distribution of the mean process.
Typically, this argument should from a previous learning using
|
mean |
Mean parameter of the GP. This argument can be specified under various formats, such as:
|
hp |
A named vector, tibble or data frame of hyper-parameters
associated with |
kern |
A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
A tibble, representing Magma or GP predictions as two column 'Mean'
and 'Var', evaluated on the grid_inputs
. The column 'Input' and
additional covariates columns are associated to each predicted values. An
additional 'Index' column is created for the sake of GIF creation using
the function plot_gif
TRUE
TRUE
Compute the posterior distribution of a standard GP, using the formalism of Magma. By providing observed data, the prior mean and covariance matrix (by defining a kernel and its associated hyper-parameters), the mean and covariance parameters of the posterior distribution are computed on the grid of inputs that has been specified. This predictive distribution can be evaluated on any arbitrary inputs since a GP is an infinite-dimensional object.
pred_gp( data = NULL, grid_inputs = NULL, mean = NULL, hp = NULL, kern = "SE", get_full_cov = FALSE, plot = TRUE, pen_diag = 1e-10 )
pred_gp( data = NULL, grid_inputs = NULL, mean = NULL, hp = NULL, kern = "SE", get_full_cov = FALSE, plot = TRUE, pen_diag = 1e-10 )
data |
A tibble or data frame. Required columns: 'Input', 'Output'. Additional columns for covariates can be specified. The 'Input' column should define the variable that is used as reference for the observations (e.g. time for longitudinal data). The 'Output' column specifies the observed values (the response variable). The data frame can also provide as many covariates as desired, with no constraints on the column names. These covariates are additional inputs (explanatory variables) of the models that are also observed at each reference 'Input'. If NULL, the prior GP is returned. |
grid_inputs |
The grid of inputs (reference Input and covariates) values
on which the GP should be evaluated. Ideally, this argument should be a
tibble or a data frame, providing the same columns as |
mean |
Mean parameter of the GP. This argument can be specified under various formats, such as:
|
hp |
A named vector, tibble or data frame of hyper-parameters
associated with |
kern |
A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
get_full_cov |
A logical value, indicating whether the full posterior covariance matrix should be returned. |
plot |
A logical value, indicating whether a plot of the results is automatically displayed. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
A tibble, representing the GP predictions as two column 'Mean' and
'Var', evaluated on the grid_inputs
. The column 'Input' and
additional covariates columns are associated to each predicted values.
If the get_full_cov
argument is TRUE, the function returns a list,
in which the tibble described above is defined as 'pred' and the full
posterior covariance matrix is defined as 'cov'.
TRUE
TRUE
Compute the posterior predictive distribution in Magma. Providing data of any new individual/task, its trained hyper-parameters and a previously trained Magma model, the predictive distribution is evaluated on any arbitrary inputs that are specified through the 'grid_inputs' argument.
pred_magma( data = NULL, trained_model = NULL, grid_inputs = NULL, hp = NULL, kern = "SE", hyperpost = NULL, get_hyperpost = FALSE, get_full_cov = FALSE, plot = TRUE, pen_diag = 1e-10 )
pred_magma( data = NULL, trained_model = NULL, grid_inputs = NULL, hp = NULL, kern = "SE", hyperpost = NULL, get_hyperpost = FALSE, get_full_cov = FALSE, plot = TRUE, pen_diag = 1e-10 )
data |
A tibble or data frame. Required columns: 'Input',
'Output'. Additional columns for covariates can be specified.
The 'Input' column should define the variable that is used as
reference for the observations (e.g. time for longitudinal data). The
'Output' column specifies the observed values (the response
variable). The data frame can also provide as many covariates as desired,
with no constraints on the column names. These covariates are additional
inputs (explanatory variables) of the models that are also observed at
each reference 'Input'. If NULL, the mean process from
|
trained_model |
A list, containing the information coming from a
Magma model, previously trained using the |
grid_inputs |
The grid of inputs (reference Input and covariates) values
on which the GP should be evaluated. Ideally, this argument should be a
tibble or a data frame, providing the same columns as |
hp |
A named vector, tibble or data frame of hyper-parameters
associated with |
kern |
A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hyperpost |
A list, containing the elements 'mean' and 'cov', the
parameters of the hyper-posterior distribution of the mean process.
Typically, this argument should come from a previous learning using
|
get_hyperpost |
A logical value, indicating whether the hyper-posterior distribution of the mean process should be returned. This can be useful when planning to perform several predictions on the same grid of inputs, since recomputation of the hyper-posterior can be prohibitive for high dimensional grids. |
get_full_cov |
A logical value, indicating whether the full posterior covariance matrix should be returned. |
plot |
A logical value, indicating whether a plot of the results is automatically displayed. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
A tibble, representing Magma predictions as two column 'Mean' and
'Var', evaluated on the grid_inputs
. The column 'Input' and
additional covariates columns are associated to each predicted values.
If the get_full_cov
or get_hyperpost
arguments are TRUE,
the function returns a list, in which the tibble described above is
defined as 'pred_gp' and the full posterior covariance matrix is
defined as 'cov', and the hyper-posterior distribution of the mean process
is defined as 'hyperpost'.
TRUE
TRUE
Compute the posterior predictive distribution in MagmaClust. Providing data from any new individual/task, its trained hyper-parameters and a previously trained MagmaClust model, the multi-task posterior distribution is evaluated on any arbitrary inputs that are specified through the 'grid_inputs' argument. Due to the nature of the model, the prediction is defined as a mixture of Gaussian distributions. Therefore the present function computes the parameters of the predictive distribution associated with each cluster, as well as the posterior mixture probabilities for this new individual/task.
pred_magmaclust( data = NULL, trained_model = NULL, grid_inputs = NULL, mixture = NULL, hp = NULL, kern = "SE", hyperpost = NULL, prop_mixture = NULL, get_hyperpost = FALSE, get_full_cov = TRUE, plot = TRUE, pen_diag = 1e-10 )
pred_magmaclust( data = NULL, trained_model = NULL, grid_inputs = NULL, mixture = NULL, hp = NULL, kern = "SE", hyperpost = NULL, prop_mixture = NULL, get_hyperpost = FALSE, get_full_cov = TRUE, plot = TRUE, pen_diag = 1e-10 )
data |
A tibble or data frame. Required columns: |
trained_model |
A list, containing the information coming from a
MagmaClust model, previously trained using the
|
grid_inputs |
The grid of inputs (reference Input and covariates) values
on which the GP should be evaluated. Ideally, this argument should be a
tibble or a data frame, providing the same columns as |
mixture |
A tibble or data frame, indicating the mixture probabilities
of each cluster for the new individual/task.
If NULL, the |
hp |
A named vector, tibble or data frame of hyper-parameters
associated with |
kern |
A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hyperpost |
A list, containing the elements |
prop_mixture |
A tibble or a named vector of the mixture proportions.
Each name of column or element should refer to a cluster. The value
associated with each cluster is a number between 0 and 1. If both
|
get_hyperpost |
A logical value, indicating whether the hyper-posterior distributions of the mean processes should be returned. This can be useful when planning to perform several predictions on the same grid of inputs, since recomputation of the hyper-posterior can be prohibitive for high dimensional grids. |
get_full_cov |
A logical value, indicating whether the full posterior covariance matrices should be returned. |
plot |
A logical value, indicating whether a plot of the results is automatically displayed. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
A list of GP prediction results composed of:
pred: As sub-list containing, for each cluster:
pred_gp: A tibble, representing the GP predictions as two
column Mean
and Var
, evaluated on the
grid_inputs
. The column Input
and additional
covariates columns are associated with each predicted values.
proba: A number, the posterior probability associated with this cluster.
cov (if get_full_cov
= TRUE): A matrix, the full
posterior covariance matrix associated with this cluster.
mixture: A tibble, indicating the mixture probabilities of each cluster for the predicted individual/task.
hyperpost (if get_hyperpost
= TRUE): A list,
containing the hyper-posterior distributions information useful
for visualisation purposes.
TRUE
TRUE
Indicates the most probable cluster
proba_max_cluster(mixture)
proba_max_cluster(mixture)
mixture |
A tibble or data frame containing mixture probabilities. |
A tibble, retaining only the most probable cluster. The column
Cluster
indicates the the cluster's name whereas Proba
refers to its associated probability. If ID
is initially
a column of mixture
(optional), the function returns the most
probable cluster for all the different ID
values.
TRUE
TRUE
Modify the original grid of inputs to make it more 'regular' (in the sense that the interval between each observation is constant, or corresponds to a specific pattern defined by the user). In particular, this function can also be used to summarise several data points into one, at a specific location. In this case, the output values are averaged according to the 'summarise_fct' argument.
regularize_data( data, size_grid = 30, grid_inputs = NULL, summarise_fct = base::mean ) regularise_data( data, size_grid = 30, grid_inputs = NULL, summarise_fct = base::mean )
regularize_data( data, size_grid = 30, grid_inputs = NULL, summarise_fct = base::mean ) regularise_data( data, size_grid = 30, grid_inputs = NULL, summarise_fct = base::mean )
data |
A tibble or data frame. Required columns: |
size_grid |
An integer, which indicates the number of equispaced points each column must contain. Each original input value will be collapsed to the closest point of the new regular grid, and the associated outputs are averaged using the 'summarise_fct' function. This argument is used when 'grid_inputs' is left to 'NULL'. Default value is 30. |
grid_inputs |
A data frame, corresponding to a pre-defined grid of
inputs according to which we want to regularise a dataset. Column names
must be similar to those appearing in |
summarise_fct |
A character string or a function. If several similar inputs are associated with different outputs, the user can choose the summarising function for the output among the following: min, max, mean, median. A custom function can be defined if necessary. Default is "mean". |
A data frame, where input columns have been regularised as desired.
data = tibble::tibble(ID = 1, Input = 0:100, Output = -50:50) ## Define a 1D input grid of 10 points regularize_data(data, size_grid = 10) ## Define a 1D custom grid my_grid = tibble::tibble(Input = c(5, 10, 25, 50, 100)) regularize_data(data, grid_inputs = my_grid) ## Define a 2D input grid of 5x5 points data_2D = cbind(ID = 1, expand.grid(Input=1:10, Input2=1:10), Output = 1:100) regularize_data(data_2D, size_grid = 5) ## Define a 2D custom input grid my_grid_2D = MagmaClustR::expand_grid_inputs(c(2, 4, 8), 'Input2' = c(3, 5)) regularize_data(data_2D, grid_inputs = my_grid_2D)
data = tibble::tibble(ID = 1, Input = 0:100, Output = -50:50) ## Define a 1D input grid of 10 points regularize_data(data, size_grid = 10) ## Define a 1D custom grid my_grid = tibble::tibble(Input = c(5, 10, 25, 50, 100)) regularize_data(data, grid_inputs = my_grid) ## Define a 2D input grid of 5x5 points data_2D = cbind(ID = 1, expand.grid(Input=1:10, Input2=1:10), Output = 1:100) regularize_data(data_2D, size_grid = 5) ## Define a 2D custom input grid my_grid_2D = MagmaClustR::expand_grid_inputs(c(2, 4, 8), 'Input2' = c(3, 5)) regularize_data(data_2D, grid_inputs = my_grid_2D)
Draw samples from a posterior GP/Magma distribution
sample_gp(pred_gp, nb_samples = 50) sample_magma(pred_gp, nb_samples = 50)
sample_gp(pred_gp, nb_samples = 50) sample_magma(pred_gp, nb_samples = 50)
pred_gp |
A list, typically coming from
|
nb_samples |
A number, indicating the number of samples to be drawn from the predictive posterior distribution. For two-dimensional graphs, only one sample can be displayed. |
A tibble or data frame, containing the samples generated from
a GP prediction. Format: Input
, Sample
, Output
.
TRUE
TRUE
Draw samples from a MagmaClust posterior distribution
sample_magmaclust(pred_clust, nb_samples = 50)
sample_magmaclust(pred_clust, nb_samples = 50)
pred_clust |
A list, typically coming from
|
nb_samples |
A number, indicating the number of samples to be drawn from the predictive posterior distribution. For two-dimensional graphs, only one sample can be displayed. |
A tibble or data frame, containing the samples generated from
a GP prediction. Format: Cluster
, Proba
, Input
,
Sample
, Output
.
TRUE
TRUE
In MagmaClust, as for any clustering method, the number K of clusters has to be provided as an hypothesis of the model. This function implements a model selection procedure, by maximising a variational BIC criterion, computed for different values of K. A heuristic for a fast approximation of the procedure is proposed as well, although the corresponding models would not be properly trained.
select_nb_cluster( data, fast_approx = TRUE, grid_nb_cluster = 1:10, ini_hp_k = NULL, ini_hp_i = NULL, kern_k = "SE", kern_i = "SE", plot = TRUE, ... )
select_nb_cluster( data, fast_approx = TRUE, grid_nb_cluster = 1:10, ini_hp_k = NULL, ini_hp_i = NULL, kern_k = "SE", kern_i = "SE", plot = TRUE, ... )
data |
A tibble or data frame. Columns required: |
fast_approx |
A boolean, indicating whether a fast approximation should
be used for selecting the number of clusters. If TRUE, each Magma or
MagmaClust model will perform only one E-step of the training, using
the same fixed values for the hyper-parameters ( |
grid_nb_cluster |
A vector of integer, corresponding to grid of values that will be tested for the number of clusters. |
ini_hp_k |
A tibble or data frame of hyper-parameters associated with
|
ini_hp_i |
A tibble or data frame of hyper-parameters associated with
|
kern_k |
A kernel function associated to the mean processes. |
kern_i |
A kernel function associated to the individuals/tasks. |
plot |
A boolean indicating whether the plot of V-BIC values for all numbers of clusters should displayed. |
... |
Any additional argument that could be passed to
|
A list, containing the results of model selection procedure for selecting the optimal number of clusters thanks to a V-BIC criterion maximisation. The elements of the list are:
best_k: An integer, indicating the resulting optimal number of clusters
seq_vbic: A vector, corresponding to the sequence of the V-BIC values
associated with the models trained for each provided cluster's number in
grid_nb_cluster
.
trained_models: A list, named by associated number of clusters, of
Magma or MagmaClust models that have been trained (or approximated if
fast_approx
= T) during the model selection procedure.
TRUE
TRUE
Simulate a complete training dataset, which may be representative of various applications. Several flexible arguments allow adjustment of the number of individuals, of observed inputs, and the values of many parameters controlling the data generation.
simu_db( M = 10, N = 10, K = 1, covariate = FALSE, grid = seq(0, 10, 0.05), grid_cov = seq(0, 10, 0.5), common_input = TRUE, common_hp = TRUE, add_hp = FALSE, add_clust = FALSE, int_mu_v = c(4, 5), int_mu_l = c(0, 1), int_i_v = c(1, 2), int_i_l = c(0, 1), int_i_sigma = c(0, 0.2), lambda_int = c(30, 40), m_int = c(0, 10), lengthscale_int = c(30, 40), m0_slope = c(-5, 5), m0_intercept = c(-50, 50) )
simu_db( M = 10, N = 10, K = 1, covariate = FALSE, grid = seq(0, 10, 0.05), grid_cov = seq(0, 10, 0.5), common_input = TRUE, common_hp = TRUE, add_hp = FALSE, add_clust = FALSE, int_mu_v = c(4, 5), int_mu_l = c(0, 1), int_i_v = c(1, 2), int_i_l = c(0, 1), int_i_sigma = c(0, 0.2), lambda_int = c(30, 40), m_int = c(0, 10), lengthscale_int = c(30, 40), m0_slope = c(-5, 5), m0_intercept = c(-50, 50) )
M |
An integer. The number of individual per cluster. |
N |
An integer. The number of observations per individual. |
K |
An integer. The number of underlying clusters. |
covariate |
A logical value indicating whether the dataset should include an additional input covariate named 'Covariate'. |
grid |
A vector of numbers defining a grid of observations (i.e. the reference inputs). |
grid_cov |
A vector of numbers defining a grid of observations (i.e. the covariate reference inputs). |
common_input |
A logical value indicating whether the reference inputs are common to all individual. |
common_hp |
A logical value indicating whether the hyper-parameters are common to all individual. If TRUE and K>1, the hyper-parameters remain different between the clusters. |
add_hp |
A logical value indicating whether the values of hyper-parameters should be added as columns in the dataset. |
add_clust |
A logical value indicating whether the name of the clusters should be added as a column in the dataset. |
int_mu_v |
A vector of 2 numbers, defining an interval of admissible values for the variance hyper-parameter of the mean process' kernel. |
int_mu_l |
A vector of 2 numbers, defining an interval of admissible values for the lengthscale hyper-parameter of the mean process' kernel. |
int_i_v |
A vector of 2 numbers, defining an interval of admissible values for the variance hyper-parameter of the individual process' kernel. |
int_i_l |
A vector of 2 numbers, defining an interval of admissible values for the lengthscale hyper-parameter of the individual process' kernel. |
int_i_sigma |
A vector of 2 numbers, defining an interval of admissible values for the noise hyper-parameter. |
lambda_int |
A vector of 2 numbers, defining an interval of admissible values for the lambda parameter of the 2D exponential. |
m_int |
A vector of 2 numbers, defining an interval of admissible values for the mean of the 2D exponential. |
lengthscale_int |
A vector of 2 numbers, defining an interval of admissible values for the lengthscale parameter of the 2D exponential. |
m0_slope |
A vector of 2 numbers, defining an interval of admissible values for the slope of m0. |
m0_intercept |
A vector of 2 numbers, defining an interval of admissible values for the intercept of m0. |
A full dataset of simulated training data.
## Generate a dataset with 3 clusters of 4 individuals, observed at 10 inputs data = simu_db(M = 4, N = 10, K = 3) ## Generate a 2-D dataset with an additional input 'Covariate' data = simu_db(covariate = TRUE) ## Generate a dataset where input locations are different among individuals data = simu_db(common_input = FALSE) ## Generate a dataset with an additional column indicating the true clusters data = simu_db(K = 3, add_clust = TRUE)
## Generate a dataset with 3 clusters of 4 individuals, observed at 10 inputs data = simu_db(M = 4, N = 10, K = 3) ## Generate a 2-D dataset with an additional input 'Covariate' data = simu_db(covariate = TRUE) ## Generate a dataset where input locations are different among individuals data = simu_db(common_input = FALSE) ## Generate a dataset with an additional column indicating the true clusters data = simu_db(K = 3, add_clust = TRUE)
A subset of data from reported performances of French swimmers during 100m freestyle competitions between 2002 and 2016. See https://link.springer.com/article/10.1007/s10994-022-06172-1 and https://www.mdpi.com/2076-3417/8/10/1766 for dedicated description and analysis.
swimmers
swimmers
swimmers
A data frame with 76,832 rows and 4 columns:
Indentifying number associated to each swimmer
Age in years
Performance in seconds on a 100m freestyle event
Competition gender
https://ffn.extranat.fr/webffn/competitions.php?idact=nat
Learning hyper-parameters of any new individual/task in Magma
is
required in the prediction procedure. This function can also be used to learn
hyper-parameters of a simple GP (just let the hyperpost
argument set
to NULL, and use prior_mean
instead). When using within Magma
,
by providing data for the new individual/task, the hyper-posterior mean and
covariance parameters, and initialisation values for the hyper-parameters,
the function computes maximum likelihood estimates of the hyper-parameters.
train_gp( data, prior_mean = NULL, ini_hp = NULL, kern = "SE", hyperpost = NULL, pen_diag = 1e-10 )
train_gp( data, prior_mean = NULL, ini_hp = NULL, kern = "SE", hyperpost = NULL, pen_diag = 1e-10 )
data |
A tibble or data frame. Required columns: |
prior_mean |
Mean parameter of the GP. This argument can be specified under various formats, such as:
|
ini_hp |
A named vector, tibble or data frame of hyper-parameters
associated with the |
kern |
A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hyperpost |
A list, containing the elements 'mean' and 'cov',
the parameters of the hyper-posterior distribution of the mean process.
Typically, this argument should come from a previous learning using
|
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
A tibble, containing the trained hyper-parameters for the kernel of the new individual/task.
TRUE
TRUE
Learning hyper-parameters and mixture probabilities of any new
individual/task is required in MagmaClust
in the prediction procedure.
By providing data for the new individual/task, the hyper-posterior mean and
covariance parameters, the mixture proportions, and initialisation values for
the hyper-parameters, train_gp_clust
uses an EM algorithm to compute
maximum likelihood estimates of the hyper-parameters and hyper-posterior
mixture probabilities of the new individual/task.
train_gp_clust( data, prop_mixture = NULL, ini_hp = NULL, kern = "SE", hyperpost = NULL, pen_diag = 1e-10, n_iter_max = 25, cv_threshold = 0.001 )
train_gp_clust( data, prop_mixture = NULL, ini_hp = NULL, kern = "SE", hyperpost = NULL, pen_diag = 1e-10, n_iter_max = 25, cv_threshold = 0.001 )
data |
A tibble or data frame. Required columns: |
prop_mixture |
A tibble or a named vector. Each name of column or element should refer to a cluster. The value associated with each cluster is a number between 0 and 1, corresponding to the mixture proportions. |
ini_hp |
A tibble or data frame of hyper-parameters
associated with |
kern |
A kernel function, defining the covariance structure of the GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
hyperpost |
A list, containing the elements |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
n_iter_max |
A number, indicating the maximum number of iterations of the EM algorithm to proceed while not reaching convergence. |
cv_threshold |
A number, indicating the threshold of the likelihood gain under which the EM algorithm will stop. |
A list, containing the results of the EM algorithm used during the prediction step of MagmaClust. The elements of the list are:
hp: A tibble of optimal hyper-parameters for the new individual's GP.
mixture: A tibble of mixture probabilities for the new individual.
TRUE
TRUE
The hyper-parameters and the hyper-posterior distribution involved in Magma
can be learned thanks to an EM algorithm implemented in train_magma
.
By providing a dataset, the model hypotheses (hyper-prior mean parameter and
covariance kernels) and initialisation values for the hyper-parameters, the
function computes maximum likelihood estimates of the HPs as well as the
mean and covariance parameters of the Gaussian hyper-posterior distribution
of the mean process.
train_magma( data, prior_mean = NULL, ini_hp_0 = NULL, ini_hp_i = NULL, kern_0 = "SE", kern_i = "SE", common_hp = TRUE, grid_inputs = NULL, pen_diag = 1e-10, n_iter_max = 25, cv_threshold = 0.001, fast_approx = FALSE )
train_magma( data, prior_mean = NULL, ini_hp_0 = NULL, ini_hp_i = NULL, kern_0 = "SE", kern_i = "SE", common_hp = TRUE, grid_inputs = NULL, pen_diag = 1e-10, n_iter_max = 25, cv_threshold = 0.001, fast_approx = FALSE )
data |
A tibble or data frame. Required columns: |
prior_mean |
Hyper-prior mean parameter (m_0) of the mean GP. This argument can be specified under various formats, such as:
|
ini_hp_0 |
A named vector, tibble or data frame of hyper-parameters
associated with |
ini_hp_i |
A tibble or data frame of hyper-parameters
associated with |
kern_0 |
A kernel function, associated with the mean GP. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
kern_i |
A kernel function, associated with the individual GPs. ("SE", "PERIO" and "RQ" are also available here). |
common_hp |
A logical value, indicating whether the set of hyper-parameters is assumed to be common to all individuals. |
grid_inputs |
A vector, indicating the grid of additional reference inputs on which the mean process' hyper-posterior should be evaluated. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
n_iter_max |
A number, indicating the maximum number of iterations of the EM algorithm to proceed while not reaching convergence. |
cv_threshold |
A number, indicating the threshold of the likelihood gain
under which the EM algorithm will stop. The convergence condition is
defined as the difference of likelihoods between two consecutive steps,
divided by the absolute value of the last one
( |
fast_approx |
A boolean, indicating whether the EM algorithm should stop after only one iteration of the E-step. This advanced feature is mainly used to provide a faster approximation of the model selection procedure, by preventing any optimisation over the hyper-parameters. |
The user can specify custom kernel functions for the argument
kern_0
and kern_i
. The hyper-parameters used in the kernel
should have explicit names, and be contained within the hp
argument. hp
should typically be defined as a named vector or a
data frame. Although it is not mandatory for the train_magma
function to run, gradients can be provided within kernel function
definition. See for example se_kernel
to create a custom
kernel function displaying an adequate format to be used in Magma.
A list, gathering the results of the EM algorithm used for training in Magma. The elements of the list are:
hp_0: A tibble of the trained hyper-parameters for the mean process' kernel.
hp_i: A tibble of all the trained hyper-parameters for the individual processes' kernels.
hyperpost: A sub-list gathering the parameters of the mean processes' hyper-posterior distributions, namely:
mean: A tibble, the hyper-posterior mean parameter
(Output
) evaluated at each training reference Input
.
cov: A matrix, the covariance parameter for the hyper-posterior distribution of the mean process.
pred: A tibble, the predicted mean and variance at Input
for the mean process' hyper-posterior distribution under a format
that allows the direct visualisation as a GP prediction.
ini_args: A list containing the initial function arguments and values
for the hyper-prior mean, the hyper-parameters. In particular, if
those arguments were set to NULL, ini_args
allows us to retrieve
the (randomly chosen) initialisations used during training.
seq_loglikelihood: A vector, containing the sequence of log-likelihood values associated with each iteration.
converged: A logical value indicated whether the EM algorithm converged or not.
training_time: Total running time of the complete training.
TRUE
TRUE
The hyper-parameters and the hyper-posterior distributions involved in
MagmaClust can be learned thanks to a VEM algorithm implemented in
train_magmaclust
. By providing a dataset, the model hypotheses
(hyper-prior mean parameters, covariance kernels and number of clusters) and
initialisation values for the hyper-parameters, the function computes
maximum likelihood estimates of the HPs as well as the mean and covariance
parameters of the Gaussian hyper-posterior distributions of the mean
processes.
train_magmaclust( data, nb_cluster = NULL, prior_mean_k = NULL, ini_hp_k = NULL, ini_hp_i = NULL, kern_k = "SE", kern_i = "SE", ini_mixture = NULL, common_hp_k = TRUE, common_hp_i = TRUE, grid_inputs = NULL, pen_diag = 1e-10, n_iter_max = 25, cv_threshold = 0.001, fast_approx = FALSE )
train_magmaclust( data, nb_cluster = NULL, prior_mean_k = NULL, ini_hp_k = NULL, ini_hp_i = NULL, kern_k = "SE", kern_i = "SE", ini_mixture = NULL, common_hp_k = TRUE, common_hp_i = TRUE, grid_inputs = NULL, pen_diag = 1e-10, n_iter_max = 25, cv_threshold = 0.001, fast_approx = FALSE )
data |
A tibble or data frame. Columns required: |
nb_cluster |
A number, indicating the number of clusters of individuals/tasks that are assumed to exist among the dataset. |
prior_mean_k |
The set of hyper-prior mean parameters (m_k) for the K mean GPs, one value for each cluster. cluster. This argument can be specified under various formats, such as:
|
ini_hp_k |
A tibble or data frame of hyper-parameters
associated with |
ini_hp_i |
A tibble or data frame of hyper-parameters
associated with |
kern_k |
A kernel function, associated with the mean GPs. Several popular kernels (see The Kernel Cookbook) are already implemented and can be selected within the following list:
|
kern_i |
A kernel function, associated with the individual GPs. (See
details above in |
ini_mixture |
Initial values of the probability to belong to each
cluster for each individual ( |
common_hp_k |
A boolean indicating whether hyper-parameters are common among the mean GPs. |
common_hp_i |
A boolean indicating whether hyper-parameters are common among the individual GPs. |
grid_inputs |
A vector, indicating the grid of additional reference inputs on which the mean processes' hyper-posteriors should be evaluated. |
pen_diag |
A number. A jitter term, added on the diagonal to prevent numerical issues when inverting nearly singular matrices. |
n_iter_max |
A number, indicating the maximum number of iterations of the VEM algorithm to proceed while not reaching convergence. |
cv_threshold |
A number, indicating the threshold of the likelihood gain
under which the VEM algorithm will stop. The convergence condition is
defined as the difference of elbo between two consecutive steps,
divided by the absolute value of the last one
( |
fast_approx |
A boolean, indicating whether the VEM algorithm should stop after only one iteration of the VE-step. This advanced feature is mainly used to provide a faster approximation of the model selection procedure, by preventing any optimisation over the hyper-parameters. |
The user can specify custom kernel functions for the argument
kern_k
and kern_i
. The hyper-parameters used in the kernel
should have explicit names, and be contained within the hp
argument. hp
should typically be defined as a named vector or a
data frame. Although it is not mandatory for the train_magmaclust
function to run, gradients be can provided within kernel function
definition. See for example se_kernel
to create a custom
kernel function displaying an adequate format to be used in
MagmaClust.
A list, containing the results of the VEM algorithm used in the training step of MagmaClust. The elements of the list are:
hp_k: A tibble containing the trained hyper-parameters for the mean process' kernel and the mixture proportions for each cluster.
hp_i: A tibble containing the trained hyper-parameters for the individual processes' kernels.
hyperpost: A sub-list containing the parameters of the mean processes' hyper-posterior distribution, namely:
mean: A list of tibbles containing, for each cluster, the
hyper-posterior mean parameters evaluated at each Input
.
cov: A list of matrices containing, for each cluster, the hyper-posterior covariance parameter of the mean process.
mixture: A tibble, indicating the mixture probabilities in each cluster for each individual.
ini_args: A list containing the initial function arguments and values
for the hyper-prior means, the hyper-parameters. In particular, if
those arguments were set to NULL, ini_args
allows us to retrieve
the (randomly chosen) initialisations used during training.
seq_elbo: A vector, containing the sequence of ELBO values associated with each iteration.
converged: A logical value indicated whether the algorithm converged.
training_time: Total running time of the complete training.
TRUE
TRUE
A subset of data from the GUSTO project (https://www.gusto.sg/) collecting the weight over time of several children in Singapore. See https://arxiv.org/abs/2011.07866 for dedicated description and analysis.
weight
weight
weight
A data frame with 3,629 rows and 4 columns:
Indentifying number associated to each child
Biological gender
Age in months
Weight in kilograms