
Supervised learning

Supervised learning (SL) is a paradigm in machine learning in which a model is trained on input objects (for example, vectors of predictor variables) paired with desired output values (also known as human-labeled supervisory signals). The training data is processed to build a function that maps new inputs to expected output values.[1] In an optimal scenario, the algorithm will correctly determine the output values for unseen instances. This requires the learning algorithm to generalize from the training data to unseen situations in a "reasonable" way (see inductive bias). This statistical quality of an algorithm is measured through the generalization error.
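For illustration, the following minimal sketch (not from the article; the data and names are hypothetical) trains a least-squares linear model on labeled pairs and then estimates its generalization error on held-out, unseen instances:

    import numpy as np

    rng = np.random.default_rng(0)

    # Synthetic labeled data: inputs x and supervisory signals y = 2x + 1 + noise.
    x = rng.uniform(-1.0, 1.0, size=100)
    y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=x.shape)

    # Split into training data and unseen (held-out) instances.
    x_train, y_train = x[:80], y[:80]
    x_test, y_test = x[80:], y[80:]

    # Learn a mapping from inputs to outputs: least-squares fit of y ~ w*x + b.
    A = np.column_stack([x_train, np.ones_like(x_train)])
    w, b = np.linalg.lstsq(A, y_train, rcond=None)[0]

    # Apply the learned function to unseen inputs.
    y_pred = w * x_test + b

    # Estimate the generalization error as mean squared error on held-out data.
    print("estimated generalization error:", np.mean((y_pred - y_test) ** 2))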


Generative training

The training methods described above are discriminative training methods, because they seek a function f that discriminates well between the different output values (see discriminative model). For the special case where f(x, y) = P(x, y) is a joint probability distribution and the loss function is the negative log likelihood, −∑_i log P(x_i, y_i), a risk minimization algorithm is said to perform generative training, because f can be regarded as a generative model that explains how the data were generated. Generative training algorithms are often simpler and more computationally efficient than discriminative training algorithms. In some cases, the solution can be computed in closed form, as in naive Bayes and linear discriminant analysis.
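As a rough sketch of closed-form generative training (an illustration under assumed synthetic data, not the article's own example), Gaussian naive Bayes can be fit by estimating the class priors P(y) and the per-class feature means and variances for P(x | y) directly from the training set:

    import numpy as np

    rng = np.random.default_rng(1)

    # Two Gaussian classes in 2-D feature space.
    X = np.vstack([rng.normal(0.0, 1.0, (50, 2)), rng.normal(3.0, 1.0, (50, 2))])
    y = np.array([0] * 50 + [1] * 50)

    # Generative training in closed form: estimate P(y) and P(x | y).
    classes = np.unique(y)
    priors = np.array([np.mean(y == c) for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) + 1e-9 for c in classes])

    def predict(x):
        # Score each class by log P(y) + log P(x | y) under the naive
        # (independent features) Gaussian model; pick the most probable class.
        log_joint = np.log(priors) + np.sum(
            -0.5 * np.log(2 * np.pi * variances)
            - (x - means) ** 2 / (2 * variances),
            axis=1,
        )
        return classes[np.argmax(log_joint)]

    print(predict(np.array([0.1, -0.2])))  # expected: 0
    print(predict(np.array([2.9, 3.2])))   # expected: 1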


There are several ways in which the standard supervised learning problem can be generalized:

Semi-supervised learning or weak supervision: the desired output values are provided only for a subset of the training data. The remaining data is unlabeled or imprecisely labeled.

Active learning: Instead of assuming that all of the training examples are given at the start, active learning algorithms interactively collect new examples, typically by making queries to a human user. Often, the queries are based on unlabeled data, a scenario that combines semi-supervised learning with active learning (see the sketch after this list).

Structured prediction: When the desired output value is a complex object, such as a parse tree or a labeled graph, standard methods must be extended.

Learning to rank: When the input is a set of objects and the desired output is a ranking of those objects, the standard methods must again be extended.
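A minimal pool-based active-learning sketch (the logistic model, uncertainty-sampling query strategy, and simulated annotator below are all illustrative assumptions, not part of the article):

    import numpy as np

    rng = np.random.default_rng(2)

    # Unlabeled pool; the "human user" is simulated by a hidden labeling rule.
    pool = rng.uniform(-2, 2, size=(200, 1))
    def oracle(X):
        return (X[:, 0] > 0.3).astype(float)  # simulated annotator

    def fit_logistic(X, y, steps=500, lr=0.5):
        # Tiny 1-D logistic regression trained by gradient descent.
        w, b = 0.0, 0.0
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-(w * X[:, 0] + b)))
            w -= lr * np.mean((p - y) * X[:, 0])
            b -= lr * np.mean(p - y)
        return w, b

    # Start with a few labeled seeds, then iteratively query the most
    # uncertain pool point (predicted probability closest to 0.5).
    labeled_idx = list(rng.choice(len(pool), size=5, replace=False))
    for _ in range(10):
        X, y = pool[labeled_idx], oracle(pool[labeled_idx])
        w, b = fit_logistic(X, y)
        p_pool = 1.0 / (1.0 + np.exp(-(w * pool[:, 0] + b)))
        uncertainty = -np.abs(p_pool - 0.5)
        uncertainty[labeled_idx] = -np.inf  # never re-query labeled points
        labeled_idx.append(int(np.argmax(uncertainty)))

    print("queried inputs:", np.round(pool[labeled_idx[5:], 0], 2))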




Other factors to consider

Heterogeneity of the data. If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms are easier to apply than others. Many algorithms, including support-vector machines, linear regression, logistic regression, neural networks, and nearest neighbor methods, require that the input features be numerical and scaled to similar ranges (e.g., to the [-1, 1] interval). Methods that employ a distance function, such as nearest neighbor methods and support-vector machines with Gaussian kernels, are particularly sensitive to this. An advantage of decision trees is that they easily handle heterogeneous data.
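The short sketch below (with hypothetical data) shows why distance-based methods are sensitive to feature scales, and how min-max rescaling each feature to the [-1, 1] interval restores comparable contributions:

    import numpy as np

    # Two features on very different scales: income (~1e4) and age (~1e1).
    X = np.array([[52000.0, 25.0],
                  [51000.0, 60.0],
                  [90000.0, 26.0]])

    # Unscaled distances are dominated by the income feature: rows 0 and 1
    # look "close" even though their ages differ by 35 years.
    print(np.linalg.norm(X[0] - X[1]), np.linalg.norm(X[0] - X[2]))

    # Min-max rescale each feature to the [-1, 1] interval.
    lo, hi = X.min(axis=0), X.max(axis=0)
    X_scaled = 2 * (X - lo) / (hi - lo) - 1

    # After scaling, both features contribute comparably, and row 0's
    # nearest neighbor changes.
    print(np.linalg.norm(X_scaled[0] - X_scaled[1]),
          np.linalg.norm(X_scaled[0] - X_scaled[2]))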

Redundancy in the data. If the input features contain redundant information (e.g., highly correlated features), some learning algorithms (e.g., linear regression, logistic regression, and distance-based methods) will perform poorly because of numerical instabilities. These problems can often be solved by imposing some form of regularization.
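For example (synthetic data; the ridge penalty is one assumed form of regularization), ordinary least squares becomes numerically unstable when a feature is nearly duplicated, while an L2-regularized solve stays well behaved:

    import numpy as np

    rng = np.random.default_rng(3)

    # Feature 2 is a nearly exact copy of feature 1 (redundant information).
    x1 = rng.normal(size=200)
    x2 = x1 + rng.normal(scale=1e-6, size=200)
    X = np.column_stack([x1, x2])
    y = 3.0 * x1 + rng.normal(scale=0.1, size=200)

    # Ordinary least squares: the near-singular X^T X yields large,
    # unstable coefficients that merely cancel each other out.
    w_ols = np.linalg.solve(X.T @ X, X.T @ y)

    # Ridge regression: adding lambda * I regularizes the system, and the
    # weight is shared sensibly across the duplicated features.
    lam = 1.0
    w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

    print("OLS coefficients:  ", w_ols)    # large opposite-signed values
    print("ridge coefficients:", w_ridge)  # roughly [1.5, 1.5]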

Presence of interactions and non-linearities. If each of the features makes an independent contribution to the output, then algorithms based on linear functions (e.g., linear regression, logistic regression, support-vector machines, naive Bayes) and distance functions (e.g., nearest neighbor methods, support-vector machines with Gaussian kernels) generally perform well. However, if there are complex interactions among features, then algorithms such as decision trees and neural networks work better, because they are specifically designed to discover these interactions. Linear methods can also be applied, but the engineer must manually specify the interactions when using them.
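A sketch of manually specifying an interaction (the multiplicative target below is an assumed example): adding the engineered feature x1*x2 lets a linear method capture what the plain features cannot:

    import numpy as np

    rng = np.random.default_rng(4)

    # The target depends on the *interaction* x1 * x2, not on x1 or x2 alone.
    x1 = rng.uniform(-1, 1, 500)
    x2 = rng.uniform(-1, 1, 500)
    y = 4.0 * x1 * x2 + rng.normal(scale=0.1, size=500)

    # A plain linear model on [x1, x2] cannot represent the interaction ...
    X_plain = np.column_stack([x1, x2, np.ones_like(x1)])
    w_plain = np.linalg.lstsq(X_plain, y, rcond=None)[0]
    mse_plain = np.mean((X_plain @ w_plain - y) ** 2)

    # ... but manually adding the engineered feature x1 * x2 fixes this.
    X_inter = np.column_stack([x1, x2, x1 * x2, np.ones_like(x1)])
    w_inter = np.linalg.lstsq(X_inter, y, rcond=None)[0]
    mse_inter = np.mean((X_inter @ w_inter - y) ** 2)

    print("MSE without interaction term:", mse_plain)  # ~ variance of 4*x1*x2
    print("MSE with interaction term:   ", mse_inter)  # ~ noise level (0.01)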


Approaches and algorithms

Analytical learning

Artificial neural network

Backpropagation

Boosting (meta-algorithm)

Bayesian statistics

Case-based reasoning

Decision tree learning

Inductive logic programming

Gaussian process regression

Genetic programming

Group method of data handling

Kernel estimators

Learning automata

Learning classifier systems

Learning vector quantization

Minimum message length (decision trees, decision graphs, etc.)

Multilinear subspace learning

Naive Bayes classifier

Maximum entropy classifier

Conditional random field

Nearest neighbor algorithm

Probably approximately correct (PAC) learning

Ripple down rules, a knowledge acquisition methodology

Symbolic machine learning algorithms

Subsymbolic machine learning algorithms

Support vector machines


Minimum complexity machines (MCM)

Random forests

Ensembles of classifiers

Ordinal classification

Data pre-processing

Handling imbalanced datasets

Statistical relational learning

Proaftn, a multicriteria classification algorithm

Applications

Bioinformatics

Cheminformatics

Quantitative structure–activity relationship

Database marketing

Handwriting recognition

Information retrieval

Learning to rank

Information extraction

Object recognition in computer vision

Optical character recognition

Spam detection

Pattern recognition

Speech recognition

Supervised learning is a special case of downward causation in biological systems

Landform classification using satellite imagery[7]

Spend classification in procurement processes[8]

General issues

Computational learning theory

Inductive bias

Overfitting (machine learning)

(Uncalibrated) class membership probabilities

Unsupervised learning

Version spaces

See also

List of datasets for machine learning research

Machine Learning Open Source Software (MLOSS)