Clusters selector class
Objects of this class are used to select optimal cluster expansion models, i.e. the optimal set of clusters, for given training data and clusters pool.
After initializing a ClustersSelector object, the actual selection of clusters is achieved by calling the method ClustersSelector.select_clusters().
Different optimality criteria can be used.
Parameters:
basis: string, default = "trigonometric"
Cluster basis used during the optimization task. For details and allowed values, read the documentation in clusterx.correlations.CorrelationsCalculator.
selector_type: string, default = "identity"
Can be "identity", "subsets_cv", "lasso_cv", or "lasso_on_residual".
"identity": no selection is performed, and the optimal clusters pool is the same as the argument "cpool" in the select_clusters() method.
"subsets_cv": a cross-validation optimization is performed on models defined by subsets of the clusters pool.
"lasso_cv": a cross-validation optimization is performed to find the optimal sparsity parameter.
"lasso_on_residual": a lasso selection is performed on the residual between a model defined by set0 (see below) and the output. The final model contains the union of the clusters in "set0" and those selected by lasso on the residual.
Deprecated options: "lasso" and "linreg". The old "lasso" is identical to "lasso_cv", and the old "linreg" is identical to "subsets_cv".
selector_opts: dictionary of selector options
The possible key-value pairs in this dictionary correspond either to general options or to options specific to the chosen selector_type.
General options:
fit_intercept: boolean (default: True).
cv_type: string. Either "loo", for leave-one-out CV; "5-fold", for 5-fold CV; or "10-fold", for 10-fold CV. Default: "loo".
cv_shuffle: boolean. If cv_type is n-fold, this tells whether to shuffle the data for CV. Default: False.
cv_random_state: int or None. If cv_shuffle is True, this controls the ordering of the shuffling. If None, the shuffling is different in every call. If an integer is passed, the same shuffling is obtained for the same integer, so results are reproducible.
standardize: boolean. If True, standardize correlations. Default: False.
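The interplay of cv_type, cv_shuffle and cv_random_state described above can be sketched in plain Python. The helper cv_splits below is hypothetical and only mimics the documented meaning of the three options; it is not the library's internal code, and the way samples are assigned to folds is an assumption of the sketch.

```python
import random


def cv_splits(n_samples, cv_type="loo", cv_shuffle=False, cv_random_state=None):
    """Return a list of (train_indices, test_indices) pairs.

    Hypothetical helper: it mimics the documented meaning of cv_type,
    cv_shuffle and cv_random_state, not the library's internal code.
    """
    indices = list(range(n_samples))
    if cv_type == "loo":
        n_folds = n_samples  # leave-one-out: one test sample per fold
    elif cv_type in ("5-fold", "10-fold"):
        n_folds = int(cv_type.split("-")[0])
        if cv_shuffle:
            # A fixed integer seed gives the same permutation on every call;
            # None seeds from the system, so each call shuffles differently.
            random.Random(cv_random_state).shuffle(indices)
    else:
        raise ValueError(f"unknown cv_type: {cv_type!r}")
    if n_folds > n_samples:
        raise ValueError("more folds than samples")

    # Strided assignment (an assumption): fold sizes differ by at most one.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    splits = []
    for test in folds:
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        splits.append((train, test))
    return splits


loo = cv_splits(6)
five = cv_splits(20, cv_type="5-fold", cv_shuffle=True, cv_random_state=0)
print(len(loo), len(five))
```

Calling the helper twice with the same integer cv_random_state returns identical splits, which is the reproducibility property the option is meant to provide.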
If selector_type is "lasso_cv", the selector_opts dict keys are:
sparsity_max: positive real, maximal sparsity parameter (default: 1).
sparsity_min: positive real, minimal sparsity parameter (default: 0.01).
sparsity_step: positive real, optional. If set to 0.0, a logarithmic grid from sparsity_max to sparsity_min is automatically created.
max_iter: integer, maximum number of iterations for the LASSO algorithm.
tol: small positive real, tolerance of the LASSO solution.
sparsity_scale: either "log" or "piece_log".
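To make the role of sparsity_max, sparsity_min and sparsity_step concrete, here is a minimal sketch of how a grid of candidate sparsity parameters could be built. The function sparsity_grid and its n_log argument are hypothetical: the documentation does not state how many points the automatic logarithmic grid contains, and treating a nonzero sparsity_step as a linear step is likewise an assumption.

```python
import math


def sparsity_grid(sparsity_max=1.0, sparsity_min=0.01, sparsity_step=0.0, n_log=10):
    """Return candidate sparsity parameters from sparsity_max down to sparsity_min.

    n_log (number of points on the logarithmic grid) is an assumption of this
    sketch; the documentation does not say how many points are generated.
    """
    if sparsity_min <= 0 or sparsity_max < sparsity_min:
        raise ValueError("need 0 < sparsity_min <= sparsity_max")
    if sparsity_step == 0.0:
        if n_log < 2:
            raise ValueError("n_log must be at least 2")
        # Logarithmic grid: equal spacing in log10(sparsity).
        lo, hi = math.log10(sparsity_min), math.log10(sparsity_max)
        return [10 ** (hi - k * (hi - lo) / (n_log - 1)) for k in range(n_log)]
    # Assumed behaviour for a nonzero step: a linear grid with that step.
    n = int(math.floor((sparsity_max - sparsity_min) / sparsity_step + 1e-9)) + 1
    return [sparsity_max - k * sparsity_step for k in range(n)]


grid = sparsity_grid(1.0, 0.01, 0.0, n_log=5)
print(grid)
```

Each value on such a grid would then be scored by cross-validation, and the one with the lowest CV error taken as the optimal sparsity parameter.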
If selector_type is "subsets_cv", the selector_opts dict keys are:
clusters_sets: one of "size", "combinations", and "size+combinations".
"size": cluster sub-pools of increasing size are extracted from the initial pool, and cross-validation selects the optimal sub-pool.
"combinations": all possible combinations of clusters from the pool are considered. This can be very computationally demanding.
"size+combinations": a fixed pool of clusters up to a certain size (see the set0 parameter below) is always kept, and the combinations are searched only for subsets of nclmax (see below) clusters.
set0: array with two elements, [int, float]. If clusters_sets is set to "size+combinations", this indicates the size of the fixed pool of clusters, above which a combinatorial search is performed. The first element of the array indicates the maximum number of cluster points, and the second element the maximum radius, for the fixed sub-pool.
nclmax: integer. If clusters_sets is set to "size+combinations", this indicates the maximum number of clusters in the combinatorial subsets of clusters to be searched for (on top of the fixed sub-pool, see above).
alpha: float (default: 0). For the subset selection, a ridge estimator from the scikit-learn library is used. This parameter determines the regularization strength. If set to 0 (default), the estimator switches to ordinary least squares.
alphas: list or numpy array of float (default: []). If len(alphas) > 1, a RidgeCV estimator from the scikit-learn library is used for the subset selection. This parameter determines the list of regularization strengths.
Notes:
Besides the keys indicated above, the **selector_opts dict may contain the key "method" (deprecated), which overrides the argument selector_type.
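The precedence rule in this note, together with the deprecated names "lasso" and "linreg" listed under selector_type, can be summarized in a short sketch. The function resolve_selector_type is hypothetical and written only to illustrate the documented behaviour; it is not part of the library.

```python
DEPRECATED_ALIASES = {"lasso": "lasso_cv", "linreg": "subsets_cv"}
ALLOWED = ("identity", "subsets_cv", "lasso_cv", "lasso_on_residual")


def resolve_selector_type(selector_type="identity", **selector_opts):
    """Return the effective selector type (hypothetical helper)."""
    # The deprecated "method" key, if present, overrides selector_type.
    name = selector_opts.get("method", selector_type)
    # Map deprecated names onto their current equivalents.
    name = DEPRECATED_ALIASES.get(name, name)
    if name not in ALLOWED:
        raise ValueError(f"unknown selector type: {name!r}")
    return name


print(resolve_selector_type("identity", method="linreg"))  # prints subsets_cv
```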
Display on screen information about the optimal model
Return optimal ClustersPool object
Return optimal array of clusters
Select clusters
Returns a subpool containing the optimal set of clusters.
Parameters:
sset: StructuresSet object
The structures set corresponding to the training data.
cpool: ClustersPool object
The clusters pool from which the optimal model is selected.
prop: string
Property label (must be in sset) of the property for which the optimal set of clusters is to be selected.
comat: 2D numpy array (default: None)
If the correlation matrix was precalculated, you can give it here.
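As a rough usage sketch, the options documented above might be collected as follows. The keys come from the option lists in this section, while all values, the property label "energy", and the names sset and cpool are illustrative placeholders. The constructor call is shown only in comments because it needs the library and real training data; passing the options as keyword arguments is an assumption based on the **selector_opts notation.

```python
# Keys are taken from the documented option lists; values are illustrative only.
general_opts = {
    "fit_intercept": True,
    "cv_type": "5-fold",
    "cv_shuffle": True,
    "cv_random_state": 0,  # fixed seed for reproducible shuffling
    "standardize": False,
}
lasso_opts = {
    "sparsity_max": 1.0,
    "sparsity_min": 0.01,
    "sparsity_step": 0.0,  # 0.0 requests the automatic logarithmic grid
    "max_iter": 10000,     # illustrative value
    "tol": 1e-6,           # illustrative value
    "sparsity_scale": "log",
}

selector_kwargs = {
    "basis": "trigonometric",
    "selector_type": "lasso_cv",
    **general_opts,
    **lasso_opts,
}

# Assumed call pattern (not executed here; sset, cpool and "energy" are
# placeholders for a StructuresSet, a ClustersPool and a property label):
#
#   selector = ClustersSelector(**selector_kwargs)
#   optimal_pool = selector.select_clusters(sset, cpool, "energy")
print(sorted(selector_kwargs))
```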