Feature selection

Feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Stylometry and DNA microarray analysis are two cases where feature selection is used. Feature selection should be distinguished from feature extraction.^[1]

Feature selection techniques are used for several reasons:

The central premise when using a feature selection technique is that the data contains some features that are either redundant or irrelevant, and can thus be removed without incurring much loss of information.^[10] Redundant and irrelevant are two distinct notions, since one relevant feature may be redundant in the presence of another relevant feature with which it is strongly correlated.^[11]

Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features. Feature selection techniques are often used in domains where there are many features and comparatively few samples (or data points).

Wrapper methods use a predictive model to score feature subsets. Each new subset is used to train a model, which is tested on a hold-out set. Counting the number of mistakes made on that hold-out set (the error rate of the model) gives the score for that subset. As wrapper methods train a new model for each subset, they are very computationally intensive, but usually provide the best performing feature set for that particular type of model or typical problem.
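The scoring step of a wrapper can be sketched as follows. This is a minimal illustration, not a reference implementation: the nearest-centroid classifier, the toy data, and the names `fit_centroids`, `predict` and `holdout_error` are all assumptions made for the example. Each candidate subset trains a fresh model and is scored by its error rate on the hold-out set.

```python
from itertools import combinations

def fit_centroids(rows, labels, subset):
    """Train a nearest-centroid model using only the columns in `subset`."""
    sums, counts = {}, {}
    for row, label in zip(rows, labels):
        acc = sums.setdefault(label, [0.0] * len(subset))
        for k, j in enumerate(subset):
            acc[k] += row[j]
        counts[label] = counts.get(label, 0) + 1
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def predict(centroids, row, subset):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def dist(center):
        return sum((row[j] - c) ** 2 for j, c in zip(subset, center))
    return min(sorted(centroids), key=lambda lab: dist(centroids[lab]))

def holdout_error(train, test, subset):
    """Wrapper score: error rate on the hold-out set for one feature subset."""
    (x_tr, y_tr), (x_te, y_te) = train, test
    model = fit_centroids(x_tr, y_tr, subset)  # a new model per subset
    mistakes = sum(predict(model, row, subset) != y for row, y in zip(x_te, y_te))
    return mistakes / len(y_te)

# Toy data (hypothetical): feature 0 separates the classes, feature 1 is noise.
x_train = [[0.0, 5.0], [0.2, 1.0], [1.0, 4.0], [0.9, 0.5]]
y_train = [0, 0, 1, 1]
x_test = [[0.1, 0.7], [0.3, 4.5], [0.8, 4.8], [1.1, 1.2]]
y_test = [0, 0, 1, 1]

subsets = [s for r in (1, 2) for s in combinations(range(2), r)]
scores = {s: holdout_error((x_train, y_train), (x_test, y_test), s) for s in subsets}
best_subset = min(scores, key=scores.get)
```

On this toy data the subset containing only feature 0 makes no hold-out mistakes, while any subset that includes the noisy feature 1 misclassifies half of the hold-out points.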

Filter methods use a proxy measure instead of the error rate to score a feature subset. This measure is chosen to be fast to compute, while still capturing the usefulness of the feature set. Common measures include the mutual information,^[11] the pointwise mutual information,^[12] the Pearson product-moment correlation coefficient, Relief-based algorithms,^[13] and inter/intra class distance or the scores of significance tests for each class/feature combination.^[12]^[14] Filters are usually less computationally intensive than wrappers, but they produce a feature set which is not tuned to a specific type of predictive model.^[15] This lack of tuning means a feature set from a filter is more general than the set from a wrapper, usually giving lower prediction performance than a wrapper. However, the feature set doesn't contain the assumptions of a prediction model, and so is more useful for exposing the relationships between the features.

Many filters provide a feature ranking rather than an explicit best feature subset, and the cut-off point in the ranking is chosen via cross-validation. Filter methods have also been used as a preprocessing step for wrapper methods, allowing a wrapper to be used on larger problems. One other popular approach is the Recursive Feature Elimination algorithm,^[16] commonly used with support vector machines to repeatedly construct a model and remove features with low weights.
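A filter's ranking step might look like the sketch below, which scores each feature by the absolute Pearson correlation with the target and sorts the features accordingly. The function names and the toy data are assumptions for illustration; no model is trained, which is what makes the approach cheap.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / (sx * sy)

def rank_features(rows, target):
    """Return (feature indices sorted best first, per-feature |correlation|)."""
    columns = list(zip(*rows))
    scores = [abs(pearson(col, target)) for col in columns]
    order = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return order, scores

# Toy data (hypothetical): column 0 tracks the target, column 1 is constant,
# column 2 is only weakly related.
rows = [[1.0, 7.0, 3.0], [2.0, 7.0, 1.0], [3.0, 7.0, 4.0], [4.0, 7.0, 2.0]]
target = [1.1, 1.9, 3.2, 3.9]

ranking, scores = rank_features(rows, target)
top_k = ranking[:1]  # the cut-off would normally be chosen by cross-validation
```

Because each feature is scored on its own, a ranking like this cannot detect that two highly ranked features are redundant with each other.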

Embedded methods are a catch-all group of techniques which perform feature selection as part of the model construction process. The exemplar of this approach is the LASSO method for constructing a linear model, which penalizes the regression coefficients with an L1 penalty, shrinking many of them to zero. Any features which have non-zero regression coefficients are 'selected' by the LASSO algorithm. Improvements to the LASSO include Bolasso, which bootstraps samples;^[17] elastic net regularization, which combines the L1 penalty of LASSO with the L2 penalty of ridge regression; and FeaLect, which scores all the features based on combinatorial analysis of regression coefficients.^[18] AEFS further extends LASSO to nonlinear scenarios with autoencoders.^[19] These approaches tend to be between filters and wrappers in terms of computational complexity.
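The way an L1 penalty drives coefficients to exactly zero can be illustrated with a small coordinate-descent sketch. The objective scaling `(1/2n)*||y - Xw||^2 + alpha*||w||_1`, the absence of an intercept, the fixed number of sweeps and the toy data are assumptions chosen to keep the example short; production solvers differ in many details.

```python
def soft_threshold(value, threshold):
    """Shrink `value` toward zero by `threshold`; return exactly 0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(rows, y, alpha, sweeps=200):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = len(rows), len(rows[0])
    w = [0.0] * p
    for _ in range(sweeps):
        for j in range(p):
            rho = 0.0  # correlation of column j with the residual ignoring feature j
            z = 0.0    # squared norm of column j
            for row, target in zip(rows, y):
                partial = sum(row[k] * w[k] for k in range(p) if k != j)
                rho += row[j] * (target - partial)
                z += row[j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (z / n) if z > 0 else 0.0
    return w

# Toy data (hypothetical): y depends on feature 0 only; feature 1 is unrelated.
rows = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y = [2.0, 4.0, 6.0, 8.0]

weights = lasso_coordinate_descent(rows, y, alpha=0.5)
selected = [j for j, w_j in enumerate(weights) if w_j != 0.0]
```

Here the coefficient of feature 1 is set to exactly zero by the soft-thresholding step, so only feature 0 is 'selected'; its coefficient is also shrunk slightly below the unpenalized value of 2.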

A feature selection algorithm can be seen as the combination of a search technique for proposing new feature subsets, along with an evaluation measure which scores the different feature subsets. The simplest algorithm is to test each possible subset of features finding the one which minimizes the error rate. This is an exhaustive search of the space, and is computationally intractable for all but the smallest of feature sets. The choice of evaluation metric heavily influences the algorithm, and it is these evaluation metrics which distinguish between the three main categories of feature selection algorithms: wrappers, filters and embedded methods.^[11]
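The exhaustive strategy can be sketched as a search loop that is independent of the evaluation measure. In the example below, `toy_error` is a hypothetical lower-is-better measure standing in for an error rate; the search itself only enumerates subsets and keeps the best one.

```python
from itertools import combinations

def exhaustive_search(n_features, score):
    """Evaluate every non-empty subset; return (best_subset, best_score, evaluations).

    `score` is any evaluation measure where lower is better (e.g. an error rate).
    """
    best_subset, best_score, evaluations = None, float("inf"), 0
    for size in range(1, n_features + 1):
        for subset in combinations(range(n_features), size):
            evaluations += 1
            value = score(subset)
            if value < best_score:
                best_subset, best_score = subset, value
    return best_subset, best_score, evaluations

# Hypothetical measure: features 1 and 3 are useful, every feature costs a little.
def toy_error(subset):
    useful = {1, 3}
    return 1.0 - 0.4 * len(useful & set(subset)) + 0.01 * len(subset)

best, err, evaluated = exhaustive_search(4, toy_error)
```

With n features there are 2^n - 1 non-empty subsets, so the 15 evaluations needed for 4 features grow to more than a billion for 30 features, which is why exhaustive search is restricted to very small feature sets.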

In traditional regression analysis, the most popular form of feature selection is stepwise regression, which is a wrapper technique. It is a greedy algorithm that adds the best feature (or deletes the worst feature) at each round. The main control issue is deciding when to stop the algorithm. In machine learning, this is typically done by cross-validation. In statistics, some criteria are optimized. This leads to the inherent problem of nesting. More robust methods have been explored, such as branch and bound and piecewise linear network.
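The forward variant of this greedy procedure can be sketched as below. The stopping rule (stop when no single addition improves the score by at least `min_gain`) and the toy evaluation measure are assumptions for illustration; in practice the score would come from cross-validation or a statistical criterion.

```python
def forward_selection(n_features, score, min_gain=1e-9):
    """Greedy forward selection: add the single best feature each round.

    `score(subset)` is lower-is-better; stop when no addition improves the
    current best score by at least `min_gain`.
    """
    selected = []
    best_score = float("inf")
    history = []
    remaining = set(range(n_features))
    while remaining:
        candidate, candidate_score = min(
            ((f, score(tuple(sorted(selected + [f])))) for f in sorted(remaining)),
            key=lambda pair: pair[1],
        )
        if best_score - candidate_score < min_gain:
            break  # no remaining feature helps enough
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
        history.append((tuple(selected), best_score))
    return selected, best_score, history

# Hypothetical measure: features 0 and 2 reduce the error, each feature has a small cost.
weights = {0: 0.3, 2: 0.2}

def toy_error(subset):
    return 1.0 - sum(weights.get(f, 0.0) for f in subset) + 0.01 * len(subset)

chosen, final_error, steps = forward_selection(4, toy_error)
```

A feature that has been added is never reconsidered, so every subset visited contains the previous one. This is the nesting behaviour referred to above, and it is the reason a greedy path can miss the best subset of a given size.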

Search approaches include:

- Exhaustive^[20]
- Best first
- Simulated annealing
- Genetic algorithm^[21]
- Greedy forward selection^[22]^[23]^[24]
- Greedy backward elimination^[25]
- Particle swarm optimization^[26]
- Targeted projection pursuit
- Scatter search^[27]
- Variable neighborhood search^[28]^[29]

Subset selection evaluates a subset of features as a group for suitability. Subset selection algorithms can be broken up into wrappers, filters, and embedded methods. Wrappers use a search algorithm to search through the space of possible features and evaluate each subset by running a model on the subset. Wrappers can be computationally expensive and have a risk of overfitting to the model. Filters are similar to wrappers in the search approach, but instead of evaluating against a model, a simpler filter is evaluated. Embedded techniques are embedded in, and specific to, a model.

Many popular search approaches use greedy hill climbing, which iteratively evaluates a candidate subset of features, then modifies the subset and evaluates whether the new subset is an improvement over the old. Evaluation of the subsets requires a scoring metric that grades a subset of features. Exhaustive search is generally impractical, so at some stopping point defined by the implementor (or operator), the subset of features with the highest score discovered up to that point is selected as the satisfactory feature subset. The stopping criterion varies by algorithm; possible criteria include a subset score exceeding a threshold or a program's maximum allowed run time being surpassed.
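Greedy hill climbing with the stopping criteria just mentioned can be sketched as follows. The neighbourhood (subsets that differ by adding or removing one feature), the parameter names `target_score` and `max_seconds`, and the toy scoring function are assumptions made for this example.

```python
import time

def hill_climb(n_features, score, start=(), target_score=None, max_seconds=1.0):
    """Greedy hill climbing over feature subsets (higher `score` is better).

    Each step evaluates every subset that differs from the current one by adding
    or removing a single feature and moves to the best of them if it improves.
    Stops at a local optimum, when `target_score` is reached, or on timeout.
    """
    current = frozenset(start)
    current_score = score(current)
    deadline = time.monotonic() + max_seconds
    while time.monotonic() < deadline:
        if target_score is not None and current_score >= target_score:
            break  # score threshold reached
        neighbours = [current ^ {f} for f in range(n_features)]
        best = max(neighbours, key=score)
        best_score = score(best)
        if best_score <= current_score:
            break  # no neighbour improves: local optimum
        current, current_score = best, best_score
    return sorted(current), current_score

# Hypothetical scoring metric: features 1 and 2 help, every feature has a small cost.
def toy_score(subset):
    return 0.5 * (1 in subset) + 0.3 * (2 in subset) - 0.05 * len(subset)

subset, value = hill_climb(4, toy_score)
```

Passing `target_score=0.4` to the same call stops the search as soon as the subset containing only feature 1 is reached, which illustrates how the stopping criterion determines which subset is returned.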

Alternative search-based techniques are based on targeted projection pursuit which finds low-dimensional projections of the data that score highly: the features that have the largest projections in the lower-dimensional space are then selected.


Two popular filter metrics for classification problems are correlation and mutual information, although neither is a true metric or 'distance measure' in the mathematical sense, since they fail to obey the triangle inequality and thus do not compute any actual 'distance'; they should rather be regarded as 'scores'. These scores are computed between a candidate feature (or set of features) and the desired output category. There are, however, true metrics that are a simple function of the mutual information.^[30]
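A mutual-information score between a discrete feature and the class label can be estimated from counts, as in the sketch below. The toy data is hypothetical. The example also includes the variation of information, H(X,Y) - I(X;Y), as one known function of mutual information that does satisfy the metric axioms; whether this is the specific metric the cited source has in mind is an assumption.

```python
from collections import Counter
from math import log2

def entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())

def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), estimated from paired discrete samples."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))

def variation_of_information(xs, ys):
    """H(X,Y) - I(X;Y): a function of mutual information that is a true metric."""
    return entropy(list(zip(xs, ys))) - mutual_information(xs, ys)

labels     = [0, 0, 1, 1, 0, 0, 1, 1]
copy_feat  = [0, 0, 1, 1, 0, 0, 1, 1]   # determines the label exactly
noise_feat = [0, 1, 0, 1, 0, 1, 0, 1]   # independent of the label in this sample

mi_scores = {
    "copy_feat": mutual_information(copy_feat, labels),
    "noise_feat": mutual_information(noise_feat, labels),
}
```

The informative feature receives a score of one bit and the independent feature a score of zero, so a filter based on this score would rank `copy_feat` first.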

Other available filter metrics include:

Optimality criteria

The choice of optimality criteria is difficult as there are multiple objectives in a feature selection task. Many common criteria incorporate a measure of accuracy, penalised by the number of features selected. Examples include Akaike information criterion (AIC) and Mallows's C_p, which have a penalty of 2 for each added feature. AIC is based on information theory, and is effectively derived via the maximum entropy principle.^[31]^[32]

Other criteria are the Bayesian information criterion (BIC), which uses a penalty of $\sqrt{\log n}$ for each added feature; minimum description length (MDL), which asymptotically uses $\sqrt{\log n}$; Bonferroni / RIC, which use $\sqrt{2\log p}$; maximum dependency feature selection; and a variety of new criteria that are motivated by the false discovery rate (FDR), which use something close to $\sqrt{2\log\frac{p}{q}}$. These square-root values are thresholds on the t-statistic of an added feature; on the same scale, the AIC penalty of 2 corresponds to a threshold of $\sqrt{2}$. A maximum entropy rate criterion may also be used to select the most relevant subset of features.^[33]
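The effect of different penalties can be seen by comparing AIC and BIC on a sequence of nested models. The sketch below assumes a Gaussian linear model, for which the criteria can be written (up to an additive constant) as n ln(RSS/n) plus a penalty of 2 per feature for AIC and ln n per feature for BIC. The residual sums of squares are made-up values chosen for illustration.

```python
from math import log

def aic(rss, n, k):
    """AIC for a Gaussian linear model, up to an additive constant: n*ln(RSS/n) + 2k."""
    return n * log(rss / n) + 2 * k

def bic(rss, n, k):
    """BIC under the same assumptions: n*ln(RSS/n) + k*ln(n)."""
    return n * log(rss / n) + k * log(n)

# Hypothetical residual sums of squares for nested models fitted to n = 100 samples.
n = 100
candidates = {1: 250.0, 2: 180.0, 3: 174.0, 4: 173.5}  # number of features -> RSS

best_by_aic = min(candidates, key=lambda k: aic(candidates[k], n, k))
best_by_bic = min(candidates, key=lambda k: bic(candidates[k], n, k))
```

With these numbers AIC prefers the three-feature model, whereas BIC, whose per-feature penalty ln(100) is larger than 2, stops at two features.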

Structure learning

Filter feature selection is a specific case of a more general paradigm called structure learning. Feature selection finds the relevant feature set for a specific target variable, whereas structure learning finds the relationships between all the variables, usually by expressing these relationships as a graph. The most common structure learning algorithms assume the data is generated by a Bayesian network, and so the structure is a directed graphical model. The optimal solution to the filter feature selection problem is the Markov blanket of the target node, and in a Bayesian network, there is a unique Markov blanket for each node.^[34]
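Once a network structure is known, reading off the Markov blanket of the target is straightforward: it consists of the target's parents, its children, and the other parents of those children. The sketch below assumes the graph is given as a mapping from each node to its set of parents; the example network is hypothetical.

```python
def markov_blanket(parents, target):
    """Markov blanket of `target` in a DAG given as {node: set of parent nodes}.

    The blanket is the union of the target's parents, its children, and the
    other parents of those children.
    """
    children = {node for node, pars in parents.items() if target in pars}
    blanket = set(parents.get(target, set())) | children
    for child in children:
        blanket |= set(parents[child])  # co-parents of each child
    blanket.discard(target)
    return blanket

# Hypothetical network: A -> T, T -> C, S -> C, C -> D, and U is isolated.
dag = {"A": set(), "S": set(), "U": set(), "T": {"A"}, "C": {"T", "S"}, "D": {"C"}}

selected = markov_blanket(dag, "T")
```

For target `T` the blanket contains `A`, `C` and `S`; node `D` is excluded because it is shielded from `T` by `C`, and `U` is excluded because it is unconnected.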

Regularized trees

The features from a decision tree or a tree ensemble have been shown to be redundant. A method called regularized tree^[45] can be used for feature subset selection. Regularized trees penalize the use of a variable that is similar to the variables selected at previous tree nodes when splitting the current node. Regularized trees need to build only one tree model (or one tree ensemble model) and thus are computationally efficient.

Regularized trees naturally handle numerical and categorical features, interactions and nonlinearities. They are invariant to attribute scales (units) and insensitive to outliers, and thus require little data preprocessing such as normalization. Regularized random forest (RRF)^[46] is one type of regularized tree. The guided RRF is an enhanced RRF which is guided by the importance scores from an ordinary random forest.
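The penalty idea can be illustrated with a toy split-selection rule. This sketch assumes one simple formulation, in which the gain of any feature not yet used at an earlier node is multiplied by a factor `penalty` in (0, 1], so a new feature is chosen only when it is clearly better than the features already in use. The function names, the penalty value and the gains are hypothetical, and this is not the RRF package's actual code.

```python
def regularized_gain(gain, feature, used_features, penalty=0.8):
    """Return the split gain after penalising features not used at earlier nodes.

    Assumption: a multiplicative penalty in (0, 1]; features already in
    `used_features` keep their full gain.
    """
    return gain if feature in used_features else penalty * gain

def choose_split(gains, used_features, penalty=0.8):
    """Pick the feature with the largest regularised gain and record it as used."""
    best = max(
        sorted(gains),
        key=lambda f: regularized_gain(gains[f], f, used_features, penalty),
    )
    if regularized_gain(gains[best], best, used_features, penalty) > 0:
        used_features.add(best)
    return best

used = {"age"}
# Hypothetical raw gains at one node: "age_months" is nearly redundant with "age".
node_gains = {"age": 0.30, "age_months": 0.31, "income": 0.10}

split_feature = choose_split(node_gains, used)
```

Without the penalty the nearly redundant `age_months` would win by a small margin; with it, the already selected `age` is reused and the set of selected features stays compact.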

Wrapper methods have two main disadvantages:

- The increasing overfitting risk when the number of observations is insufficient.
- The significant computation time when the number of variables is large.

Some learning algorithms perform feature selection as part of their overall operation. These include:

- $l_{1}$-regularization techniques, such as sparse regression, LASSO, and $l_{1}$-SVM
- Regularized trees,^[45] e.g. regularized random forest implemented in the RRF package^[46]
- Decision tree^[72]
- Memetic algorithm
- Random multinomial logit (RMNL)
- Auto-encoding networks with a bottleneck-layer
- Submodular feature selection^[73]^[74]^[75]
- Local learning based feature selection.^[76] Compared with traditional methods, it does not involve any heuristic search, can easily handle multi-class problems, and works for both linear and nonlinear problems. It is also supported by a strong theoretical foundation. Numeric experiments showed that the method can achieve a close-to-optimal solution even when data contains >1M irrelevant features.
- Recommender system based on feature selection.^[77] The feature selection methods are introduced into recommender system research.

See also

- Cluster analysis
- Data mining
- Dimensionality reduction
- Feature extraction
- Hyperparameter optimization
- Model selection
- Relief (feature selection)

Further reading

- Guyon, Isabelle; Elisseeff, Andre (2003). "An Introduction to Variable and Feature Selection". Journal of Machine Learning Research. 3: 1157–1182.
- Harrell, F. (2001). Regression Modeling Strategies. Springer. ISBN 0-387-95232-2.
- Liu, Huan; Motoda, Hiroshi (1998). Feature Selection for Knowledge Discovery and Data Mining. Springer. ISBN 0-7923-8198-X.
- Liu, Huan; Yu, Lei (2005). "Toward Integrating Feature Selection Algorithms for Classification and Clustering". IEEE Transactions on Knowledge and Data Engineering. 17 (4): 491–502. doi:10.1109/TKDE.2005.66. S2CID 1607600.
