Feature engineering
Feature engineering, a preprocessing step in supervised machine learning and statistical modeling,[1] transforms raw data into a more effective set of inputs. Each input comprises several attributes, known as features. By providing models with relevant information, feature engineering significantly enhances their predictive accuracy and decision-making capability.[2][3][4]
Beyond machine learning, the principles of feature engineering are applied in various scientific fields, including physics. For example, physicists construct dimensionless numbers such as the Reynolds number in fluid dynamics, the Nusselt number in heat transfer, and the Archimedes number in sedimentation. They also develop first approximations of solutions, such as analytical solutions for the strength of materials in mechanics.[5]
Clustering
One application of feature engineering has been the clustering of feature-objects or sample-objects in a dataset. In particular, feature engineering based on matrix and tensor decompositions has been used extensively for data clustering under non-negativity constraints on the feature coefficients. Examples include Non-negative Matrix Factorization (NMF),[6] Non-negative Matrix Tri-Factorization (NMTF),[7] and Non-negative Tensor Decomposition/Factorization (NTF/NTD).[8] The non-negativity constraints on the coefficients of the feature vectors mined by these algorithms yield a part-based representation, and the resulting factor matrices exhibit natural clustering properties. Several extensions of these methods have been reported in the literature, including orthogonality-constrained factorization for hard clustering and manifold learning to overcome inherent issues with these algorithms.
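A minimal sketch of NMF-based clustering is shown below, assuming scikit-learn is available and using a placeholder non-negative data matrix; the number of components and the assignment rule (largest coefficient per sample) are illustrative choices, not a prescribed procedure.

# Sketch: clustering via non-negative matrix factorization (NMF).
# Assumes scikit-learn; X is placeholder non-negative data,
# one row per sample and one column per feature.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = rng.random((100, 20))          # placeholder non-negative data

k = 3                              # number of latent parts / clusters (assumed)
model = NMF(n_components=k, init="nndsvd", max_iter=500, random_state=0)
W = model.fit_transform(X)         # sample coefficients, shape (100, k)
H = model.components_              # part-based feature factors, shape (k, 20)

# Each sample is assigned to the latent factor with the largest coefficient,
# which is the natural clustering read off from the factor matrix W.
labels = W.argmax(axis=1)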
Another class of feature engineering algorithms leverages a common hidden structure across multiple inter-related datasets to obtain a consensus (common) clustering scheme. An example is the Multi-view Classification based on Consensus Matrix Decomposition (MCMD) algorithm,[9] which mines a common clustering scheme across multiple datasets. The algorithm outputs two types of class labels (scale-variant and scale-invariant clustering), is computationally robust to missing information, can detect shape- and scale-based outliers, and can handle high-dimensional data effectively. Coupled matrix and tensor decompositions are popularly used in multi-view feature engineering.[10]
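The MCMD algorithm itself is not reproduced here; the following is a simplified NumPy sketch of the shared-factor idea behind such consensus methods, in which two placeholder views X1 and X2 over the same samples are jointly factorized with a common factor W using standard multiplicative updates. All variable names and sizes are assumptions for illustration.

# Sketch: a shared latent factor across two related datasets (views).
# A simplified joint-NMF illustration, not the MCMD algorithm itself.
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 3
X1 = rng.random((n, 20))           # placeholder view 1 (same samples as view 2)
X2 = rng.random((n, 30))           # placeholder view 2

W = rng.random((n, k))             # consensus sample factor shared by both views
H1 = rng.random((k, 20))
H2 = rng.random((k, 30))
eps = 1e-9                         # guards against division by zero

# Multiplicative updates minimizing ||X1 - W H1||^2 + ||X2 - W H2||^2
for _ in range(200):
    H1 *= (W.T @ X1) / (W.T @ W @ H1 + eps)
    H2 *= (W.T @ X2) / (W.T @ W @ H2 + eps)
    W *= (X1 @ H1.T + X2 @ H2.T) / (W @ (H1 @ H1.T + H2 @ H2.T) + eps)

# The consensus clustering is read from the shared factor W.
consensus_labels = W.argmax(axis=1)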
Feature engineering in machine learning and statistical modeling involves selecting, creating, transforming, and extracting data features. Key steps include creating features from existing data, transforming or imputing missing or invalid features, reducing data dimensionality with methods such as Principal Components Analysis (PCA), Independent Component Analysis (ICA), and Linear Discriminant Analysis (LDA), and selecting the most relevant features for model training based on importance scores and correlation matrices.[11]
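As an illustration of the dimensionality-reduction step, the sketch below applies PCA from scikit-learn to a placeholder feature matrix; the data, the scaling step, and the choice of 10 components are assumptions made for the example only.

# Sketch: reducing data dimensionality with Principal Components Analysis (PCA).
# Assumes scikit-learn; X is a placeholder feature matrix.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50))                 # placeholder data: 200 samples, 50 features

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale
pca = PCA(n_components=10)                     # keep the top 10 components (assumed)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)                         # (200, 10)
print(pca.explained_variance_ratio_.sum())     # fraction of variance retained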
Features vary in significance.[12] Even relatively insignificant features may contribute to a model. Feature selection can reduce the number of features to prevent a model from becoming too specific to the training data set (overfitting).[13]
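A minimal feature-selection sketch, assuming scikit-learn and a synthetic classification dataset: each feature is scored against the target with a univariate test, and only the highest-scoring features are retained. The dataset sizes and the choice of k are illustrative assumptions.

# Sketch: keeping only the most relevant features to reduce overfitting risk.
# Assumes scikit-learn; X and y are synthetic placeholder classification data.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

X, y = make_classification(n_samples=300, n_features=40, n_informative=5,
                           random_state=0)

# Score each feature against the target and keep the 10 highest-scoring ones.
selector = SelectKBest(score_func=f_classif, k=10)
X_selected = selector.fit_transform(X, y)

print(X_selected.shape)                        # (300, 10)
print(selector.get_support(indices=True))      # indices of retained features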
Feature explosion occurs when the number of identified features becomes too large for effective model estimation or optimization. Common causes include the use of feature templates (which generate many features automatically rather than coding individual features) and feature combinations that cannot be represented by a linear system.
Feature explosion can be limited via techniques such as regularization, kernel methods, and feature selection.[14]
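As one illustration of limiting feature explosion through regularization, the following sketch fits an L1-regularized (Lasso) regression with scikit-learn on a synthetic dataset; the penalty drives the coefficients of uninformative features to exactly zero, effectively performing feature selection. The dataset and the penalty strength are assumptions for the example.

# Sketch: L1 regularization (Lasso) as one way to limit feature explosion,
# by driving the coefficients of uninformative features to exactly zero.
# Assumes scikit-learn; the data below is a synthetic regression problem.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=100, n_informative=8,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
kept = np.flatnonzero(lasso.coef_)             # features with non-zero coefficients

print(f"{kept.size} of {X.shape[1]} features kept")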
Feature stores
A feature store is where features are stored and organized for the explicit purpose of being used either to train models (by data scientists) or to make predictions (by applications that have a trained model). It is a central location where groups of features can be created or updated from multiple data sources, and where new datasets can be built from those feature groups for training models or for applications that do not compute the features themselves but simply retrieve them when they are needed to make predictions.[39]
A feature store includes the ability to store code used to generate features, apply the code to raw data, and serve those features to models upon request. Useful capabilities include feature versioning and policies governing the circumstances under which features can be used.[40]
Feature stores can be standalone software tools or built into machine learning platforms.
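The sketch below illustrates, in miniature, the core interface such a store exposes: registering the code that generates a feature and serving the computed features on request. The class and method names are hypothetical and do not correspond to any particular product's API.

# Sketch: the core interface of a feature store, in miniature.
# The class and method names are hypothetical, not any vendor's API.
from dataclasses import dataclass, field
from typing import Callable, Dict, List

import numpy as np
import pandas as pd


@dataclass
class MiniFeatureStore:
    """Stores feature-generating code and serves computed features on request."""
    transforms: Dict[str, Callable[[pd.DataFrame], pd.Series]] = field(default_factory=dict)

    def register(self, name: str, fn: Callable[[pd.DataFrame], pd.Series]) -> None:
        # Keep the code used to generate the feature, keyed by feature name.
        self.transforms[name] = fn

    def get_features(self, raw: pd.DataFrame, names: List[str]) -> pd.DataFrame:
        # Apply the registered code to raw data and serve the requested features.
        return pd.DataFrame({name: self.transforms[name](raw) for name in names})


# Usage: training pipelines and serving applications retrieve identical features.
store = MiniFeatureStore()
store.register("amount_log", lambda df: np.log1p(df["amount"]))
raw = pd.DataFrame({"amount": [10.0, 250.0, 0.0]})
features = store.get_features(raw, ["amount_log"])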
Alternatives
Feature engineering can be a time-consuming and error-prone process, as it requires domain expertise and often involves trial and error.[41][42] Deep learning algorithms may be used to process a large raw dataset without having to resort to feature engineering.[43] However, deep learning algorithms still require careful preprocessing and cleaning of the input data.[44] In addition, choosing the right architecture, hyperparameters, and optimization algorithm for a deep neural network can be a challenging and iterative process.[45]