The ClustersSelector class

Objects of this class are used to select optimal cluster expansion models, i.e. the optimal set of clusters, for given training data and clusters pool.

Initialization and methods

class clusterx.clusters_selector.ClustersSelector(basis='trigonometric', selector_type='identity', **selector_opts)

Clusters selector class

After initializing a ClustersSelector object, the actual selection of clusters is performed by calling the method ClustersSelector.select_clusters().

Different optimality criteria can be used; the criterion is chosen through the selector_type parameter.

Parameters:

basis: string, default = "trigonometric"

Cluster basis used during the optimization task. For details and allowed values, read the documentation in clusterx.correlations.CorrelationsCalculator

selector_type: string, default = "identity"

One of "identity", "subsets_cv", "lasso_cv", or "lasso_on_residual".

  • "identity": no selection is performed, and the optimal clusters pool is the same as the argument "cpool" of the select_clusters() method.

  • "subsets_cv": a cross-validation optimization is performed on models defined by subsets of the clusters pool.

  • "lasso_cv": a cross-validation optimization is performed to find the optimal sparsity parameter.

  • "lasso_on_residual": a lasso selection is performed on the residual between a model defined by "set0" (see below) and the output. The final model contains the union of the clusters in "set0" and those selected by lasso on the residual.

  • Deprecated options: "lasso", "linreg". Old "lasso" is identical to "lasso_cv" and old "linreg" is identical to "subsets_cv"
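Both "subsets_cv" and "lasso_cv" rely on cross-validation over the training structures. As a rough illustration of what the leave-one-out and n-fold splits controlled by the general options cv_type, cv_shuffle and cv_random_state (described below) look like, here is a minimal pure-Python sketch. The helper cv_splits and its argument names are hypothetical and are not part of clusterx; the actual splitting inside the library may differ in detail.

```python
import random


def cv_splits(n_samples, cv_type="loo", shuffle=False, random_state=None):
    """Yield (train_indices, test_indices) pairs.

    Hypothetical helper for illustration only; not a clusterx function.
    """
    indices = list(range(n_samples))
    if cv_type == "loo":
        n_folds = n_samples  # every structure is left out exactly once
    elif cv_type.endswith("-fold"):
        n_folds = int(cv_type.split("-")[0])  # "5-fold" -> 5
        if shuffle:
            # A fixed integer seed reproduces the same shuffle on every call;
            # None gives a different shuffle each time.
            random.Random(random_state).shuffle(indices)
    else:
        raise ValueError(f"unknown cv_type: {cv_type!r}")
    if not 2 <= n_folds <= n_samples:
        raise ValueError("need between 2 and n_samples folds")

    # Distribute the samples over the folds as evenly as possible.
    base, extra = divmod(n_samples, n_folds)
    start = 0
    for k in range(n_folds):
        size = base + (1 if k < extra else 0)
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


if __name__ == "__main__":
    for train, test in cv_splits(10, "5-fold", shuffle=True, random_state=0):
        print("train:", train, "test:", test)
```

Each split would then be used to fit a candidate model on the training indices and to score it on the held-out indices.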

selector_opts: dictionary of selector options

The possible key-value pairs in this dictionary are either general options or options specific to the chosen selector_type.

  • General options:

    • fit_intercept: boolean (default True).

    • cv_type: String. Either "loo" for leave-one-out CV, "5-fold" for 5-fold CV, or "10-fold" for 10-fold CV. Default: "loo".

    • cv_shuffle: Boolean. If cv_type is n-fold, this tells whether to shuffle the data for CV. Default: False.

    • cv_random_state: int or None. If cv_shuffle is True, this seeds the shuffling. If None, the shuffling differs on every call. If an integer is passed, the same integer always produces the same shuffling, so results are reproducible.

    • standardize: Boolean. If True, standardize correlations. Default: False.

  • If selector_type is "lasso_cv": the selector_opts dict keys are:

    • sparsity_max: positive real, maximal sparsity parameter (default: 1)

    • sparsity_min: positive real, minimal sparsity parameter (default: 0.01)

    • sparsity_step: positive real, optional. If set to 0.0, a logarithmic grid from sparsity_max to sparsity_min is created automatically.

    • max_iter: integer, maximum number of iterations for the LASSO algorithm.

    • tol: small positive real, tolerance of LASSO solution.

    • sparsity_scale: either "log" or "piece_log".

  • If selector_type is "subsets_cv": the selector_opts dict keys are:

    • clusters_sets: one of "size", "combinations", and "size+combinations".

      • "size": Clusters sub_pools of increasing size are extracted from the initial pool, and cross validation selects the optimal sub-pool.

      • "combinations": All possible combinations of clusters from the pool are considered. This can be very computationally demanding.

      • "size+combinations": A fixed pool of clusters up to certain size (see set0 parameter below) is always kept and the combinations are searched only for subsets of nclmax (see below) clusters.

        • set0: array with two elements [int, float]. If clusters_sets is set to "size+combinations", this defines the fixed pool of clusters, on top of which a combinatorial search is performed. The first element is the maximum number of cluster points and the second element is the maximum radius of the clusters in the fixed subpool.

        • nclmax: integer. If clusters_sets is set to "size+combinations", this indicates the maximum number of clusters in the combinatorial subsets of clusters to be searched for (on top of the fixed subpool, see above).

    • alpha: float (default: 0). For the subset selection, a ridge estimator from the scikit-learn library is used. This parameter determines the regularization strength. If set to 0 (default), the estimator switches to ordinary least squares.

    • alphas: list or numpy array of floats (default: []). If len(alphas) > 1, a RidgeCV estimator from the scikit-learn library is used for the subset selection. This parameter gives the list of regularization strengths.
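To make the "size+combinations" option more concrete, the sketch below enumerates candidate cluster sets for a toy pool. It is only an illustration under stated assumptions: the Cluster tuple, split_pool and candidate_sets are hypothetical names, not clusterx objects; a cluster is assumed to belong to the fixed subpool when both its number of points and its radius are within the set0 limits; and the fixed subpool alone is assumed to count as a candidate.

```python
from itertools import combinations
from typing import NamedTuple


class Cluster(NamedTuple):
    """Toy stand-in for a cluster; not the clusterx Cluster class."""
    label: str
    npoints: int
    radius: float


def split_pool(pool, set0):
    """Split a pool into the fixed subpool and the remaining clusters.

    Assumption: a cluster is 'fixed' if npoints <= set0[0] and radius <= set0[1].
    """
    max_points, max_radius = set0
    fixed = [c for c in pool if c.npoints <= max_points and c.radius <= max_radius]
    rest = [c for c in pool if c not in fixed]
    return fixed, rest


def candidate_sets(fixed, rest, nclmax):
    """Yield the fixed subpool extended by every combination of
    0 to nclmax clusters taken from the remaining clusters."""
    for n in range(nclmax + 1):
        for extra in combinations(rest, n):
            yield list(fixed) + list(extra)


pool = [
    Cluster("1pt", 1, 0.0),
    Cluster("2pt-a", 2, 2.5),
    Cluster("2pt-b", 2, 3.5),
    Cluster("3pt-a", 3, 2.5),
    Cluster("3pt-b", 3, 3.5),
    Cluster("4pt", 4, 3.5),
]

if __name__ == "__main__":
    fixed, rest = split_pool(pool, set0=[2, 3.0])
    for candidate in candidate_sets(fixed, rest, nclmax=2):
        print([c.label for c in candidate])
```

In this toy pool the fixed subpool holds two clusters and four remain, so nclmax=2 gives 1 + 4 + 6 = 11 candidate sets. An unrestricted "combinations" search over all six clusters would have to visit 63 non-empty subsets, and that number doubles with every cluster added to the pool.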

Notes:

  • Besides the keys indicated above, the selector_opts dict may contain the key "method" (deprecated), which overrides the argument selector_type.
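The precedence described in this note, together with the deprecated selector names listed under selector_type, can be sketched as follows. The function resolve_selector_type is a hypothetical illustration of the documented behaviour, not the library's own code; in particular, raising ValueError for unknown names is an assumption.

```python
# Deprecated names and their documented replacements.
_DEPRECATED = {"lasso": "lasso_cv", "linreg": "subsets_cv"}
_ALLOWED = ("identity", "subsets_cv", "lasso_cv", "lasso_on_residual")


def resolve_selector_type(selector_type="identity", **selector_opts):
    """Return the effective selector type (illustrative helper, not clusterx API)."""
    # The deprecated "method" key, if present, overrides selector_type.
    name = selector_opts.get("method", selector_type)
    # Map deprecated names onto their current equivalents.
    name = _DEPRECATED.get(name, name)
    if name not in _ALLOWED:
        # Assumption: unknown names are rejected.
        raise ValueError(f"unknown selector type: {name!r}")
    return name


if __name__ == "__main__":
    print(resolve_selector_type("identity", method="lasso"))
    print(resolve_selector_type("linreg"))
```

New code should pass selector_type directly with one of the current names, so that neither the override nor the mapping is needed.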

display_info()

Display information about the optimal model on screen.

get_optimal_cpool()

Return optimal ClustersPool object

get_optimal_cpool_array()

Return optimal array of clusters

select_clusters(sset, cpool, prop, comat=None)

Select clusters

Returns a subpool containing the optimal set of clusters.

Parameters:

sset: StructuresSet object

The structures set corresponding to the training data.

cpool: ClustersPool object

The clusters pool from which the optimal model is selected.

prop: string

Label of the property (must be present in sset) for which the optimal set of clusters is to be selected.

comat: 2D numpy array (default: None)

If the correlation matrix was precalculated, it can be passed here.