Ensemble learning explained

In statistics and machine learning, ensemble methods use multiple learning algorithms to obtain better predictive performance than could be obtained from any of the constituent learning algorithms alone.[1] [2] [3] Unlike a statistical ensemble in statistical mechanics, which is usually infinite, a machine learning ensemble consists of only a concrete finite set of alternative models, but typically allows for much more flexible structure to exist among those alternatives.

Overview

Supervised learning algorithms search through a hypothesis space to find a suitable hypothesis that will make good predictions for a particular problem.[4] Even if this space contains hypotheses that are very well suited to the problem, it may be very difficult to find a good one. Ensembles combine multiple hypotheses to form one that should, in theory, be better.

Ensemble learning trains two or more machine learning algorithms on a specific classification or regression task. The algorithms within the ensemble model are generally referred to as "base models", "base learners", or "weak learners" in the literature. These base models can be constructed using a single modelling algorithm or several different algorithms. The idea is to train a diverse set of weak models on the same modelling task, such that each weak learner on its own has poor predictive ability (i.e., high bias), while the outcome and error values across the weak learners exhibit high variance. Fundamentally, an ensemble learning model trains at least two high-bias (weak) and high-variance (diverse) models to be combined into a better-performing model. The set of weak models, which would not produce satisfactory predictive results individually, is combined or averaged to produce a single high-performing, accurate, and low-variance model to fit the task as required.

Ensemble learning typically refers to bagging (bootstrap aggregating), boosting, or stacking/blending techniques, which induce high variance among the base models. Bagging creates diversity by generating random samples from the training observations and fitting the same model to each sample; such ensembles are also known as homogeneous parallel ensembles. Boosting follows an iterative process, sequentially training each base model on the up-weighted errors of the previous base model and producing an additive model that reduces the final model's errors; this is also known as sequential ensemble learning. Stacking or blending consists of different base models, each trained independently (i.e. diverse/high variance), that are combined into the ensemble model, producing a heterogeneous parallel ensemble. Common applications of ensemble learning include random forests (an extension of bagging), boosted tree models, and gradient boosted tree models. Models in applications of stacking are generally more task-specific, such as combining clustering techniques with other parametric and/or non-parametric techniques.[5]

The broader term of multiple classifier systems also covers hybridization of hypotheses that are not induced by the same base learner.

Evaluating the prediction of an ensemble typically requires more computation than evaluating the prediction of a single model. In one sense, ensemble learning may be thought of as a way to compensate for poor learning algorithms by performing a lot of extra computation. On the other hand, the alternative is to do a lot more learning with one non-ensemble model. For the same increase in compute, storage, or communication resources, spreading that increase across two or more methods may improve overall accuracy more than devoting it all to a single method. Fast algorithms such as decision trees are commonly used in ensemble methods (e.g., random forests), although slower algorithms can benefit from ensemble techniques as well.

By analogy, ensemble techniques have also been used in unsupervised learning scenarios, for example in consensus clustering or anomaly detection.

Ensemble theory

Empirically, ensembles tend to yield better results when there is a significant diversity among the models.[6] [7] Many ensemble methods, therefore, seek to promote diversity among the models they combine.[8] [9] Although perhaps non-intuitive, more random algorithms (like random decision trees) can be used to produce a stronger ensemble than very deliberate algorithms (like entropy-reducing decision trees).[10] Using a variety of strong learning algorithms, however, has been shown to be more effective than using techniques that attempt to dumb-down the models in order to promote diversity.[11] It is possible to increase diversity in the training stage of the model using correlation for regression tasks [12] or using information measures such as cross entropy for classification tasks.[13]

Theoretically, the diversity concept can be justified because the lower bound of the error rate of an ensemble system can be decomposed into accuracy, diversity, and one other term.[14]

The geometric framework

Ensemble learning, including both regression and classification tasks, can be explained using a geometric framework.[15] Within this framework, the output of each individual classifier or regressor for the entire dataset can be viewed as a point in a multi-dimensional space. Additionally, the target result is also represented as a point in this space, referred to as the "ideal point."

The Euclidean distance is used as the metric to measure both the performance of a single classifier or regressor (the distance between its point and the ideal point) and the dissimilarity between two classifiers or regressors (the distance between their respective points). This perspective transforms ensemble learning into a deterministic problem.

For example, within this geometric framework, it can be proved that averaging the outputs (scores) of all base classifiers or regressors gives a result that is equal to or better than the average performance of the individual models. It can also be proved that, if the optimal weighting scheme is used, a weighted averaging approach performs at least as well as the best individual classifier or regressor in the ensemble, and can outperform all of them.
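The first of these claims can be checked numerically. The following is a minimal sketch, not taken from the cited paper: the score vectors are made-up values, and the helper names are hypothetical. It treats each model's scores as a point, measures Euclidean distance to the ideal point, and compares the distance of the averaged output with the mean of the individual distances.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length score vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def average_outputs(outputs):
    """Component-wise mean of several models' score vectors."""
    n = len(outputs)
    return [sum(col) / n for col in zip(*outputs)]

# Hypothetical scores of three base models on a four-instance dataset.
ideal = [1.0, 0.0, 1.0, 1.0]          # the "ideal point" (true targets)
models = [
    [0.9, 0.4, 0.6, 0.8],
    [0.6, 0.1, 0.9, 0.5],
    [0.8, 0.3, 0.4, 0.9],
]

individual = [euclidean(m, ideal) for m in models]
mean_individual = sum(individual) / len(individual)
ensemble = euclidean(average_outputs(models), ideal)

print("individual distances:", [round(d, 3) for d in individual])
print("mean of individual distances:", round(mean_individual, 3))
print("distance of averaged output:", round(ensemble, 3))
```

The averaged output can never be farther from the ideal point than the mean individual distance, because the Euclidean norm is convex. It may still be farther than the single best model, which is why the second claim requires an optimal weighting.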

Ensemble size

While the number of component classifiers in an ensemble has a great impact on the accuracy of prediction, only a limited number of studies address this problem. Determining the ensemble size a priori is even more crucial for online ensemble classifiers, given the volume and velocity of big data streams. Statistical tests have mostly been used to determine the proper number of components. More recently, a theoretical framework suggested that there is an ideal number of component classifiers for an ensemble, such that having more or fewer classifiers than this number would reduce accuracy. This is called "the law of diminishing returns in ensemble construction." The framework shows that using the same number of independent component classifiers as class labels gives the highest accuracy.[16] [17]

Common types of ensembles

Bayes optimal classifier

See main article: Bayes classifier.

The Bayes optimal classifier is a classification technique. It is an ensemble of all the hypotheses in the hypothesis space. On average, no other ensemble can outperform it.[18] The Naive Bayes classifier is a version of this that assumes the data are conditionally independent given the class, which makes the computation more feasible. Each hypothesis is given a vote proportional to the likelihood that the training dataset would be sampled from a system if that hypothesis were true. To facilitate training data of finite size, the vote of each hypothesis is also multiplied by the prior probability of that hypothesis. The Bayes optimal classifier can be expressed with the following equation:

y = \underset{c_j \in C}{\mathrm{argmax}} \sum_{h_i \in H} P(c_j|h_i) P(T|h_i) P(h_i)

where y is the predicted class, C is the set of all possible classes, H is the hypothesis space, P refers to a probability, and T is the training data. As an ensemble, the Bayes optimal classifier represents a hypothesis that is not necessarily in H. The hypothesis represented by the Bayes optimal classifier, however, is the optimal hypothesis in ensemble space (the space of all possible ensembles consisting only of hypotheses in H).

This formula can be restated using Bayes' theorem, which says that the posterior is proportional to the likelihood times the prior:

P(h_i|T) \propto P(T|h_i) P(h_i)

hence,

y = \underset{c_j \in C}{\mathrm{argmax}} \sum_{h_i \in H} P(c_j|h_i) P(h_i|T)
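As a small worked illustration of the posterior-weighted vote, the sketch below evaluates the sum over hypotheses of P(c|h) P(h|T) for each class and returns the class with the largest total. The three hypotheses and their posterior probabilities are illustrative numbers, and the function name is hypothetical.

```python
def bayes_optimal_class(classes, hypotheses):
    """Return the class c maximising sum over h of P(c | h) * P(h | T).

    `hypotheses` is a list of (posterior, class_probs) pairs, where
    `posterior` is P(h | T) and `class_probs` maps each class to P(c | h).
    """
    def score(c):
        return sum(post * probs.get(c, 0.0) for post, probs in hypotheses)
    return max(classes, key=score)

# Illustrative numbers: three hypotheses with posteriors 0.4, 0.3 and 0.3.
hypotheses = [
    (0.4, {"+": 1.0, "-": 0.0}),   # the single most probable hypothesis says "+"
    (0.3, {"+": 0.0, "-": 1.0}),
    (0.3, {"+": 0.0, "-": 1.0}),
]
prediction = bayes_optimal_class(["+", "-"], hypotheses)
print(prediction)  # "-": combined posterior mass 0.6 outweighs 0.4
```

The example shows why the ensemble's answer can differ from that of the single most probable hypothesis: the vote is taken over the whole hypothesis space.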

Bootstrap aggregating (bagging)

See main article: Bootstrap aggregating. Bootstrap aggregation (bagging) involves training an ensemble on bootstrapped data sets. A bootstrapped set is created by sampling from the original training data set with replacement. Thus, a bootstrap set may contain a given example zero, one, or multiple times. Ensemble members can also have limits on the features (e.g., nodes of a decision tree), to encourage exploration of diverse features.[19] The variance of local information in the bootstrap sets and feature considerations promote diversity in the ensemble, and can strengthen the ensemble.[20] To reduce overfitting, a member can be validated using the out-of-bag set (the examples that are not in its bootstrap set).[21]

Inference is done by voting on the predictions of the ensemble members, called aggregation. For example, in an ensemble of four decision trees, the query example is classified by each tree. If three of the four predict the positive class, the ensemble's overall classification is positive. Random forests are a common application of bagging.
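The bootstrap-and-vote procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a reference implementation: the base model is a hypothetical one-dimensional threshold rule (a decision stump that assumes class 1 lies at larger x), the toy data are invented, and all function names are chosen for this example.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) examples from data with replacement."""
    return [rng.choice(data) for _ in data]

def fit_stump(sample):
    """Fit a 1-D threshold classifier that predicts 1 when x > threshold."""
    labels = {y for _, y in sample}
    if len(labels) == 1:                      # degenerate sample: constant model
        only = labels.pop()
        return lambda x: only
    xs = sorted({x for x, _ in sample})
    candidates = [(a + b) / 2 for a, b in zip(xs, xs[1:])]
    def errors(t):
        return sum((x > t) != bool(y) for x, y in sample)
    best = min(candidates, key=errors)        # threshold with fewest mistakes
    return lambda x: int(x > best)

def majority_vote(votes):
    """Aggregate member predictions by plurality."""
    return Counter(votes).most_common(1)[0][0]

def bagging_fit(data, n_members, seed=0):
    """Train one stump per bootstrap sample."""
    rng = random.Random(seed)
    return [fit_stump(bootstrap_sample(data, rng)) for _ in range(n_members)]

def bagging_predict(members, x):
    """Classify x by letting every member vote."""
    return majority_vote([m(x) for m in members])

# Toy data: class 0 for x in 0..9, class 1 for x in 20..29.
data = [(x, 0) for x in range(10)] + [(x, 1) for x in range(20, 30)]
members = bagging_fit(data, n_members=25)
print(bagging_predict(members, 2), bagging_predict(members, 27))
```

Calling majority_vote([1, 1, 1, 0]) reproduces the four-tree case described above: three positive votes outweigh one negative vote.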

Boosting

See main article: Boosting (meta-algorithm). Boosting involves training successive models by emphasizing training data mis-classified by previously learned models. Initially, all data (D1) has equal weight and is used to learn a base model M1. The examples mis-classified by M1 are assigned a weight greater than correctly classified examples. This boosted data (D2) is used to train a second base model M2, and so on. Inference is done by voting.

In some cases, boosting has yielded better accuracy than bagging, but it also tends to overfit more. The most common implementation of boosting is AdaBoost, but some newer algorithms are reported to achieve better results.
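The step from D1 to D2 can be made concrete with the reweighting rule used by AdaBoost. The sketch below is an illustration under that assumption (the text above does not fix a particular formula); the function name and the ten-example scenario are hypothetical.

```python
import math

def adaboost_reweight(weights, mistakes):
    """One AdaBoost-style reweighting step.

    weights  -- current example weights (summing to 1)
    mistakes -- booleans, True where the current base model was wrong
    Returns (alpha, new_weights): the model's vote strength and the
    renormalised weights used to train the next base model.
    """
    eps = sum(w for w, wrong in zip(weights, mistakes) if wrong)
    if not 0.0 < eps < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1.0 - eps) / eps)
    # Misclassified examples are scaled up, correctly classified ones down.
    scaled = [w * math.exp(alpha if wrong else -alpha)
              for w, wrong in zip(weights, mistakes)]
    total = sum(scaled)
    return alpha, [s / total for s in scaled]

# D1: ten examples with equal weight; M1 misclassifies the last three.
d1 = [0.1] * 10
wrong = [False] * 7 + [True] * 3
alpha, d2 = adaboost_reweight(d1, wrong)
print(round(alpha, 4), [round(w, 4) for w in d2])
```

With this rule the misclassified examples carry exactly half of the total weight in D2, so the next base model cannot do well by repeating the mistakes of M1.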

Bayesian model averaging

Bayesian model averaging (BMA) makes predictions by averaging the predictions of models weighted by their posterior probabilities given the data.[22] BMA is known to generally give better answers than a single model, obtained, e.g., via stepwise regression, especially where very different models have nearly identical performance in the training set but may otherwise perform quite differently.

The question with any use of Bayes' theorem is the prior, i.e., the probability (perhaps subjective) that each model is the best to use for a given purpose. Conceptually, BMA can be used with any prior. The R packages ensembleBMA and BMA[23] use the prior implied by the Bayesian information criterion (BIC), following Raftery (1995). The R package BAS supports the use of the priors implied by the Akaike information criterion (AIC) and other criteria over the alternative models, as well as priors over the coefficients.[24]

The difference between BIC and AIC is the strength of preference for parsimony. BIC's penalty for model complexity is \ln(n)k, while AIC's is 2k. Large-sample asymptotic theory establishes that if there is a best model, then with increasing sample sizes, BIC is strongly consistent, i.e., will almost certainly find it, while AIC may not, because AIC may continue to place excessive posterior probability on models that are more complicated than they need to be. On the other hand, AIC and AICc are asymptotically "efficient" (i.e., minimum mean square prediction error), while BIC is not.[25]
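The two penalties, and the way a BIC value can be turned into a model weight, can be sketched as follows. The weighting uses the common approximation that a model's posterior probability is proportional to exp(-BIC/2) under equal prior model probabilities; that approximation, the function names, and the BIC values are assumptions made for this illustration, not the API of any of the R packages mentioned above.

```python
import math

def bic_penalty(n, k):
    """BIC complexity penalty ln(n) * k for n observations and k parameters."""
    return math.log(n) * k

def aic_penalty(k):
    """AIC complexity penalty 2 * k."""
    return 2 * k

def bma_weights_from_bic(bics):
    """Approximate posterior model probabilities as exp(-BIC / 2), normalised."""
    best = min(bics)                       # subtract the minimum for stability
    raw = [math.exp(-(b - best) / 2.0) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]

def bma_predict(predictions, weights):
    """Average the models' predictions, weighted by model probability."""
    return sum(p * w for p, w in zip(predictions, weights))

print(bic_penalty(100, 3), aic_penalty(3))
weights = bma_weights_from_bic([100.0, 102.0, 110.0])   # hypothetical BIC values
print([round(w, 3) for w in weights])
print(bma_predict([1.0, 2.0, 4.0], weights))
```

Because ln(n) exceeds 2 once n is at least 8, BIC penalises each extra parameter more heavily than AIC for all but the smallest samples, which is the stronger preference for parsimony described above.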

Haussler et al. (1994) showed that when BMA is used for classification, its expected error is at most twice the expected error of the Bayes optimal classifier.[26] Burnham and Anderson (1998, 2002) contributed greatly to introducing a wider audience to the basic ideas of Bayesian model averaging and popularizing the methodology.[27] The availability of software, including other free open-source packages for R beyond those mentioned above, helped make the methods accessible to a wider audience.[28]

Bayesian model combination

Bayesian model combination (BMC) is an algorithmic correction to Bayesian model averaging (BMA). Instead of sampling each model in the ensemble individually, it samples from the space of possible ensembles (with model weights drawn randomly from a Dirichlet distribution having uniform parameters). This modification overcomes the tendency of BMA to converge toward giving all the weight to a single model. Although BMC is somewhat more computationally expensive than BMA, it tends to yield dramatically better results. BMC has been shown to be better on average (with statistical significance) than BMA and bagging.[29]

Use of Bayes' law to compute model weights requires computing the probability of the data given each model. Typically, none of the models in the ensemble are exactly the distribution from which the training data were generated, so all of them correctly receive a value close to zero for this term. This would work well if the ensemble were big enough to sample the entire model-space, but this is rarely possible. Consequently, each pattern in the training data will cause the ensemble weight to shift toward the model in the ensemble that is closest to the distribution of the training data. It essentially reduces to an unnecessarily complex method for doing model selection.

The possible weightings for an ensemble can be visualized as lying on a simplex. At each vertex of the simplex, all of the weight is given to a single model in the ensemble. BMA converges toward the vertex that is closest to the distribution of the training data. By contrast, BMC converges toward the point where this distribution projects onto the simplex. In other words, instead of selecting the one model that is closest to the generating distribution, it seeks the combination of models that is closest to the generating distribution.

The results from BMA can often be approximated by using cross-validation to select the best model from a bucket of models. Likewise, the results from BMC may be approximated by using cross-validation to select the best ensemble combination from a random sampling of possible weightings.
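The approximation described in the last sentence can be sketched as follows. This is a simplified illustration, not the BMC algorithm itself: weight vectors are drawn from a Dirichlet distribution with all parameters equal to 1, a single held-out set stands in for cross-validation, and the labels, predicted probabilities, and function names are hypothetical.

```python
import math
import random

def sample_simplex(n_models, rng):
    """Draw weights from a Dirichlet distribution with all parameters 1."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n_models)]
    total = sum(draws)
    return [d / total for d in draws]

def log_likelihood(weights, model_probs, labels):
    """Log-likelihood of binary labels under the weighted mixture of models."""
    ll = 0.0
    for i, y in enumerate(labels):
        p1 = sum(w * probs[i] for w, probs in zip(weights, model_probs))
        ll += math.log(p1 if y == 1 else 1.0 - p1)
    return ll

def select_combination(model_probs, labels, n_samples=200, seed=0):
    """Pick the sampled weighting that scores best on held-out data."""
    rng = random.Random(seed)
    candidates = [sample_simplex(len(model_probs), rng) for _ in range(n_samples)]
    return max(candidates, key=lambda w: log_likelihood(w, model_probs, labels))

# Hypothetical held-out labels and each model's predicted P(y = 1).
labels = [1, 0, 1, 1, 0, 1]
model_probs = [
    [0.9, 0.6, 0.8, 0.4, 0.3, 0.7],   # model A
    [0.5, 0.1, 0.6, 0.9, 0.4, 0.6],   # model B
]
weights = select_combination(model_probs, labels)
print([round(w, 3) for w in weights])
```

Restricting the candidates to the vertices of the simplex (all weight on one model) would turn the same loop into the model-selection approximation of BMA.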

Bucket of models

A "bucket of models" is an ensemble technique in which a model selection algorithm is used to choose the best model for each problem. When tested with only one problem, a bucket of models can produce no better results than the best model in the set, but when evaluated across many problems, it will typically produce much better results, on average, than any model in the set.

The most common approach used for model-selection is cross-validation selection (sometimes called a "bake-off contest"). It is described with the following pseudo-code:

For each model m in the bucket:
    Do c times: (where 'c' is some constant)
        Randomly divide the training dataset into two sets: A and B
        Train m with A
        Test m with B
Select the model that obtains the highest average score

Cross-Validation Selection can be summed up as: "try them all with the training set, and pick the one that works best".[30]
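A runnable version of the pseudo-code might look like the sketch below. The two "models" in the bucket (a majority-class predictor and a nearest-neighbour rule), the toy data, and all names are hypothetical and serve only to exercise the selection loop.

```python
import random

def cross_validation_select(models, data, c=10, seed=0):
    """Return (best_name, average_scores) over c random train/test splits.

    models -- dict mapping a name to a train function; train(A) returns a
              predictor, i.e. a function from x to a predicted label
    data   -- list of (x, y) pairs
    """
    rng = random.Random(seed)
    averages = {}
    for name, train in models.items():
        scores = []
        for _ in range(c):
            shuffled = data[:]
            rng.shuffle(shuffled)
            half = len(shuffled) // 2
            a, b = shuffled[:half], shuffled[half:]     # A trains, B tests
            predict = train(a)
            scores.append(sum(predict(x) == y for x, y in b) / len(b))
        averages[name] = sum(scores) / c
    return max(averages, key=averages.get), averages

def train_majority(a):
    """Always predict the most frequent label in the training set."""
    ones = sum(y for _, y in a)
    label = int(ones * 2 >= len(a))
    return lambda x: label

def train_nearest(a):
    """Predict the label of the nearest training input."""
    return lambda x: min(a, key=lambda pair: abs(pair[0] - x))[1]

# Toy data: label 1 for positive x, label 0 for negative x.
data = [(x, int(x > 0)) for x in range(-10, 11) if x != 0]
best, averages = cross_validation_select(
    {"majority": train_majority, "nearest": train_nearest}, data)
print(best, averages)
```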

Gating is a generalization of Cross-Validation Selection. It involves training another learning model to decide which of the models in the bucket is best-suited to solve the problem. Often, a perceptron is used for the gating model. It can be used to pick the "best" model, or it can be used to give a linear weight to the predictions from each model in the bucket.

When a bucket of models is used with a large set of problems, it may be desirable to avoid training some of the models that take a long time to train. Landmark learning is a meta-learning approach that seeks to solve this problem. It involves training only the fast (but imprecise) algorithms in the bucket, and then using the performance of these algorithms to help determine which slow (but accurate) algorithm is most likely to do best.[31]

Amended Cross-Entropy Cost: An Approach for Encouraging Diversity in Classification Ensemble

The most common approach for training a classifier is to use the cross-entropy cost function. However, one would like to train an ensemble of models that are diverse, so that combining them provides the best results.[32] [33] Assume a simple ensemble that averages K classifiers. Then the amended cross-entropy cost is

e^k = H(p, q^k) - \frac{\lambda}{K} \sum_{j} H(q^j, q^k)

where e^k is the cost function of the k-th classifier, q^k is the probability output by the k-th classifier, p is the true probability that we need to estimate, and \lambda is a parameter between 0 and 1 that defines the diversity that we would like to establish. When \lambda = 0, we want each classifier to do its best regardless of the ensemble, and when \lambda = 1, we would like the classifiers to be as diverse as possible.
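The cost can be evaluated directly for a small ensemble. The sketch below assumes the form H(p, q^k) - (λ/K) Σ_j H(q^j, q^k), with H(a, b) the cross-entropy between discrete distributions and the sum taken over all K classifiers; the distributions and function names are made up for illustration.

```python
import math

def cross_entropy(p, q):
    """H(p, q) = -sum_i p_i * log(q_i) for two discrete distributions."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def amended_cost(p, qs, k, lam):
    """Cost of the k-th classifier: H(p, q^k) - (lam / K) * sum_j H(q^j, q^k)."""
    K = len(qs)
    diversity = sum(cross_entropy(qj, qs[k]) for qj in qs)
    return cross_entropy(p, qs[k]) - (lam / K) * diversity

# Hypothetical true distribution and outputs of K = 3 classifiers.
p = [1.0, 0.0, 0.0]
qs = [
    [0.7, 0.2, 0.1],
    [0.5, 0.3, 0.2],
    [0.6, 0.1, 0.3],
]
for lam in (0.0, 0.5, 1.0):
    print(lam, [round(amended_cost(p, qs, k, lam), 4) for k in range(len(qs))])
```

With λ = 0 the value reduces to the ordinary cross-entropy of each classifier; larger λ subtracts a growing reward for disagreeing with the other members.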

Stacking

Stacking (sometimes called stacked generalization) involves training a model to combine the predictions of several other learning algorithms. First, all of the other algorithms are trained using the available data. Then a combiner algorithm (final estimator) is trained to make a final prediction, using as additional inputs either all the predictions of the other algorithms (base estimators) or cross-validated predictions from the base estimators, which can prevent overfitting.[34] If an arbitrary combiner algorithm is used, then stacking can theoretically represent any of the ensemble techniques described in this article, although, in practice, a logistic regression model is often used as the combiner.

Stacking typically yields performance better than any single one of the trained models.[35] It has been successfully used on both supervised learning tasks (regression,[36] classification and distance learning [37]) and unsupervised learning (density estimation).[38] It has also been used to estimate bagging's error rate.[39] It has been reported to outperform Bayesian model averaging.[40] The two top performers in the Netflix competition utilized blending, which may be considered a form of stacking.[41]
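The cross-validated variant can be sketched on a toy regression problem. The example is deliberately simplified: the two base estimators (a mean predictor and a nearest-neighbour rule) are hypothetical, and the final estimator is a one-parameter convex blend chosen by grid search rather than the logistic regression mentioned above.

```python
def fit_mean(train):
    """Base estimator 1: always predict the mean training target."""
    mean = sum(y for _, y in train) / len(train)
    return lambda x: mean

def fit_nearest(train):
    """Base estimator 2: predict the target of the nearest training input."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

def out_of_fold_predictions(fit, data, n_folds):
    """Predict every example with a model that never saw it in training."""
    preds = [None] * len(data)
    for fold in range(n_folds):
        train = [d for i, d in enumerate(data) if i % n_folds != fold]
        model = fit(train)
        for i, (x, _) in enumerate(data):
            if i % n_folds == fold:
                preds[i] = model(x)
    return preds

def squared_error(preds, data):
    """Sum of squared differences between predictions and targets."""
    return sum((p - y) ** 2 for p, (_, y) in zip(preds, data))

def fit_blend_weight(preds_a, preds_b, data, steps=100):
    """Final estimator: the convex blend w*a + (1-w)*b with least error."""
    def error(w):
        blended = [w * a + (1 - w) * b for a, b in zip(preds_a, preds_b)]
        return squared_error(blended, data)
    return min((s / steps for s in range(steps + 1)), key=error)

# Toy data: a linear trend with an alternating offset.
data = [(x, 0.5 * x + (1 if x % 2 else -1)) for x in range(12)]
oof_mean = out_of_fold_predictions(fit_mean, data, n_folds=4)
oof_near = out_of_fold_predictions(fit_nearest, data, n_folds=4)
w = fit_blend_weight(oof_mean, oof_near, data)

# Refit the base estimators on all data and combine them with the learned weight.
base_mean, base_near = fit_mean(data), fit_nearest(data)
stacked = lambda x: w * base_mean(x) + (1 - w) * base_near(x)
print(w, stacked(5.5))
```

Training the combiner on out-of-fold predictions means it never sees a base estimator's prediction for an example that estimator was fitted on, which is the overfitting safeguard described above.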

Voting

Voting is another form of ensembling. See e.g. Weighted majority algorithm (machine learning).
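A minimal sketch of the weighted majority algorithm on binary predictions is given below. The penalty factor of 0.5 and the three-expert scenario are illustrative choices, and the function name is hypothetical.

```python
def weighted_majority(expert_predictions, outcomes, beta=0.5):
    """Run the weighted majority algorithm on binary predictions.

    expert_predictions -- one list of 0/1 predictions per round, one per expert
    outcomes           -- the true 0/1 label revealed after each round
    beta               -- factor in (0, 1) applied to experts that erred
    Returns (ensemble predictions, final expert weights).
    """
    n_experts = len(expert_predictions[0])
    weights = [1.0] * n_experts
    decisions = []
    for preds, truth in zip(expert_predictions, outcomes):
        vote_one = sum(w for w, p in zip(weights, preds) if p == 1)
        vote_zero = sum(w for w, p in zip(weights, preds) if p == 0)
        decisions.append(1 if vote_one >= vote_zero else 0)
        # Shrink the weight of every expert that was wrong this round.
        weights = [w * beta if p != truth else w
                   for w, p in zip(weights, preds)]
    return decisions, weights

rounds = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]   # three experts, three rounds
truth = [1, 1, 0]
decisions, weights = weighted_majority(rounds, truth)
print(decisions, weights)  # [0, 1, 0] [1.0, 0.25, 0.125]
```

The first expert is never wrong and keeps its full weight, so from the second round onward its vote outweighs the other two combined.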

Ensemble learning applications

In recent years, growing computational power has made it possible to train large ensembles in a reasonable time frame, and the number of ensemble learning applications has grown accordingly.[47] Some of the applications of ensemble classifiers include:

Remote sensing

See main article: Remote sensing.

Land cover mapping

Land cover mapping is one of the major applications of Earth observation satellite sensors, using remote sensing and geospatial data to identify the materials and objects located on the surface of target areas. Generally, the classes of target materials include roads, buildings, rivers, lakes, and vegetation.[48] Several ensemble learning approaches, based on artificial neural networks,[49] kernel principal component analysis (KPCA),[50] decision trees with boosting,[51] random forest[52] and automatic design of multiple classifier systems,[53] have been proposed to efficiently identify land cover objects.

Change detection

Change detection is an image analysis problem that consists of identifying places where the land cover has changed over time. Change detection is widely used in fields such as urban growth, forest and vegetation dynamics, land use and disaster monitoring.[54] The earliest applications of ensemble classifiers in change detection were designed with majority voting,[55] Bayesian model averaging,[56] and the maximum posterior probability.[57] Given the growth of satellite data over time, the past decade has seen more use of time series methods for continuous change detection from image stacks.[58] One example is a Bayesian ensemble changepoint detection method called BEAST, with software available as the package Rbeast for R, Python, and Matlab.[59]

Computer security

Distributed denial of service

Distributed denial of service is one of the most threatening cyber-attacks that may happen to an internet service provider. By combining the output of single classifiers, ensemble classifiers reduce the total error of detecting and discriminating such attacks from legitimate flash crowds.[60]

Malware Detection

Classification of malware such as computer viruses, computer worms, trojans, ransomware and spyware using machine learning techniques is inspired by the document categorization problem.[61] Ensemble learning systems have proven effective in this area.[62] [63]

Intrusion detection

An intrusion detection system monitors a computer network or computer systems to identify intruder code, in a manner similar to an anomaly detection process. Ensemble learning successfully helps such monitoring systems reduce their total error.[64] [65]

Face recognition

Face recognition, which has recently become one of the most popular research areas in pattern recognition, deals with the identification or verification of a person from their digital images.[66]

Hierarchical ensembles based on Gabor Fisher classifier and independent component analysis preprocessing techniques are some of the earliest ensembles employed in this field.[67] [68] [69]

Emotion recognition

See main article: Emotion recognition.

Speech recognition is mainly based on deep learning: most industry players in this field, such as Google, Microsoft and IBM, report that the core technology of their speech recognition is based on this approach. Speech-based emotion recognition, however, can also achieve satisfactory performance with ensemble learning.[70] [71]

Ensemble learning is also being used successfully in facial emotion recognition.[72] [73] [74]

Fraud detection

Fraud detection deals with the identification of bank fraud, such as money laundering, credit card fraud and telecommunication fraud, which have vast domains of research and applications of machine learning. Because ensemble learning improves the robustness of the normal behavior modelling, it has been proposed as an efficient technique to detect such fraudulent cases and activities in banking and credit card systems.[75] [76]

Financial decision-making

The accuracy of business failure prediction is a crucial issue in financial decision-making. Therefore, different ensemble classifiers have been proposed to predict financial crises and financial distress.[77] Also, in the trade-based manipulation problem, where traders attempt to manipulate stock prices through buying and selling activities, ensemble classifiers are required to analyze changes in stock market data and detect suspicious symptoms of stock price manipulation.[77]

Medicine

Ensemble classifiers have been successfully applied in neuroscience, proteomics and medical diagnosis, for example in the detection of neuro-cognitive disorders (e.g. Alzheimer's disease or myotonic dystrophy) based on MRI datasets,[78] [79] [80] and in cervical cytology classification.[81] [82]

Notes and References

  1. Opitz, D.; Maclin, R. (1999). Popular ensemble methods: An empirical study. Journal of Artificial Intelligence Research, 11, 169–198. doi:10.1613/jair.614. arXiv:1106.0257.
  2. Polikar, R. (2006). Ensemble based systems in decision making. IEEE Circuits and Systems Magazine, 6(3), 21–45. doi:10.1109/MCAS.2006.1688199.
  3. Rokach, L. (2010). Ensemble-based classifiers. Artificial Intelligence Review, 33(1–2), 1–39. doi:10.1007/s10462-009-9124-7. hdl:11323/1748.
  4. Blockeel, H. (2011). Hypothesis Space. In Encyclopedia of Machine Learning, pp. 511–513. doi:10.1007/978-0-387-30164-8_373. ISBN 978-0-387-30768-8. https://lirias.kuleuven.be/handle/123456789/298291
  5. Mienye, Ibomoiye Domor; Sun, Yanxia (2022). A Survey of Ensemble Learning: Concepts, Algorithms, Applications and Prospects.
  6. Kuncheva, L.
  7. Sollich, P.; Krogh, A. (1996). Learning with ensembles: How overfitting can be useful. Advances in Neural Information Processing Systems, 8, 190–196.
  8. Brown, G.; Wyatt, J.; Harris, R.; Yao, X. (2005). Diversity creation methods: a survey and categorisation. Information Fusion, 6(1), 5–20.
  9. Adeva, J. J. García; Cerviño, Ulises; Calvo, R. (December 2005). Accuracy and Diversity in Ensembles of Text Categorisers. CLEI Journal, 8(2), 1:1–1:12. doi:10.19153/cleiej.8.2.1. Retrieved 1 November 2024.
  10. Ho, T. (1995). Random Decision Forests. Proceedings of the Third International Conference on Document Analysis and Recognition, pp. 278–282.
  11. Gashler, M.; Giraud-Carrier, C.; Martinez, T. (2008). Decision Tree Ensemble: Small Heterogeneous is Better Than Large Homogeneous. In 2008 Seventh International Conference on Machine Learning and Applications, pp. 900–905. doi:10.1109/ICMLA.2008.154. ISBN 978-0-7695-3495-4. http://axon.cs.byu.edu/papers/gashler2008icmla.pdf
  12. Liu, Y.; Yao, X. (December 1999). Ensemble learning via negative correlation. Neural Networks, 12(10), 1399–1404. doi:10.1016/S0893-6080(99)00073-8. ISSN 0893-6080.
  13. Shoham, Ron; Permuter, Haim (2019). Amended Cross-Entropy Cost: An Approach for Encouraging Diversity in Classification Ensemble (Brief Announcement). In Cyber Security Cryptography and Machine Learning, Lecture Notes in Computer Science, vol. 11527, pp. 202–207. doi:10.1007/978-3-030-20951-3_18. ISBN 978-3-030-20950-6.
  14. Morishita, Terufumi, et al. (2022). Rethinking Fano's Inequality in Ensemble Learning. International Conference on Machine Learning.
  15. Wu, S.; Li, J.; Ding, W. (2023). A geometric framework for multiclass ensemble classifiers. Machine Learning, 112(12), 4929–4958.
  16. Bonab, Hamed R.; Can, Fazli (2016). A Theoretical Framework on the Ideal Number of Classifiers for Online Ensembles in Data Streams. CIKM. USA: ACM. p. 2053.
  17. Bonab, Hamed; Can, Fazli (2017). Less is More: A Comprehensive Framework for the Number of Components of Ensemble Classifiers. arXiv:1709.02925 [cs.LG].
  18. Tom M. Mitchell
  19. Salman, R., Alzaatreh, A., Sulieman, H., & Faisal, S. (2021). A Bootstrap Framework for Aggregating within and between Feature Selection Methods. Entropy (Basel, Switzerland), 23(2), 200.
  20. Breiman, L., Bagging Predictors, Machine Learning, 24(2), pp.123-140, 1996.
  21. Brodeur, Z. P., Herman, J. D., & Steinschneider, S. (2020). Bootstrap aggregation and cross-validation methods to reduce overfitting in reservoir control policy search. Water Resources Research, 56, e2020WR027184.
  22. e.g.,
  23. .
  24. .
  25. , ch. 4.
  26. Haussler, David; Kearns, Michael; Schapire, Robert E. (1994). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. Machine Learning, 14, 83–113. doi:10.1007/bf00993163.
  27. and .
  28. The Wikiversity article on Searching R Packages mentions several ways to find available packages for something like this. For example, "sos::findFn('')" from within R will search for help files in contributed packages that includes the search term and open two tabs in the default browser. The first will list all the help files found sorted by package. The second summarizes the packages found, sorted by the apparent strength of the match.
  29. Monteith, Kristine; Carroll, James; Seppi, Kevin; Martinez, Tony (2011). Turning Bayesian Model Averaging into Bayesian Model Combination. Proceedings of the International Joint Conference on Neural Networks IJCNN'11, pp. 2657–2663.
  30. Dzeroski, Saso; Zenko, Bernard (2004). Is Combining Classifiers Better than Selecting the Best One. Machine Learning, pp. 255–273.
  31. Bensusan, Hilan; Giraud-Carrier, Christophe (2000). Discovering Task Neighbourhoods through Landmark Learning Performances. In Principles of Data Mining and Knowledge Discovery, Lecture Notes in Computer Science, vol. 1910, pp. 325–330. doi:10.1007/3-540-45372-5_32. ISBN 978-3-540-41066-9. https://link.springer.com/content/pdf/10.1007/3-540-45372-5_32.pdf
  32. Shoham, Ron; Permuter, Haim (2019). Amended Cross-Entropy Cost: An Approach for Encouraging Diversity in Classification Ensemble (Brief Announcement). In Cyber Security Cryptography and Machine Learning, Lecture Notes in Computer Science, vol. 11527, pp. 202–207. doi:10.1007/978-3-030-20951-3_18. ISBN 978-3-030-20950-6.
  33. Shoham, Ron; Permuter, Haim (2020). Amended Cross Entropy Cost: Framework For Explicit Diversity Encouragement. arXiv:2007.08140 [cs.LG].
  34. 1.11. Ensemble methods (web site).
  35. Wolpert (1992). Stacked Generalization. Neural Networks, 5(2), 241–259. doi:10.1016/s0893-6080(05)80023-1.
  36. Breiman, Leo (1996). Stacked regressions. Machine Learning, 24, 49–64. doi:10.1007/BF00117832.
  37. Ozay, M.; Yarman Vural, F. T. (2013). A New Fuzzy Stacked Generalization Technique and Analysis of its Performance. arXiv:1204.0171 [cs.LG].
  38. Smyth, Padhraic; Wolpert, David (1999). Linearly Combining Density Estimators via Stacking. Machine Learning, 36(1), 59–83. doi:10.1023/A:1007511322260.
  39. Wolpert, David H.; MacReady, William G. (1999). An Efficient Method to Estimate Bagging's Generalization Error. Machine Learning, 35(1), 41–55. doi:10.1023/A:1007519102914.
  40. Clarke, B. (2003). Bayes model averaging and stacking when model approximation error cannot be ignored. Journal of Machine Learning Research, pp. 683–712.
  41. Sill, J.; Takacs, G.; Mackey, L.; Lin, D. (2009). Feature-Weighted Linear Stacking. arXiv:0911.0460 [cs.LG].
  42. Amini, Shahram M.; Parmeter, Christopher F. (2011). Bayesian model averaging in R. Journal of Economic and Social Measurement, 36(4), 253–287. doi:10.3233/JEM-2011-0350.
  43. BMS: Bayesian Model Averaging Library (web site). The Comprehensive R Archive Network. 2015-11-24. Retrieved September 9, 2016.
  44. BAS: Bayesian Model Averaging using Bayesian Adaptive Sampling (web site). The Comprehensive R Archive Network. Retrieved September 9, 2016.
  45. BMA: Bayesian Model Averaging (web site). The Comprehensive R Archive Network. Retrieved September 9, 2016.
  46. Classification Ensembles (web site). MATLAB & Simulink. Retrieved June 8, 2017.
  47. Woźniak . Michał . Graña . Manuel . Corchado . Emilio . A survey of multiple classifier systems as hybrid systems . Information Fusion . March 2014 . 16 . 3–17 . 10.1016/j.inffus.2013.04.006. 10366/134320 . 11632848 . free .
  48. Rodriguez-Galiano . V.F. . Ghimire . B. . Rogan . J. . Chica-Olmo . M. . Rigol-Sanchez . J.P. . An assessment of the effectiveness of a random forest classifier for land-cover classification . ISPRS Journal of Photogrammetry and Remote Sensing . January 2012 . 67 . 93–104 . 10.1016/j.isprsjprs.2011.11.002. 2012JPRS...67...93R .
  49. Giacinto . Giorgio . Roli . Fabio . Design of effective neural network ensembles for image classification purposes . Image and Vision Computing . August 2001 . 19 . 9–10 . 699–707 . 10.1016/S0262-8856(01)00045-2. 10.1.1.11.5820 .
  50. Book: Xia . Junshi . Yokoya . Naoto . Iwasaki . Yakira . 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) . A novel ensemble classifier of hyperspectral and LiDAR data using morphological features . March 2017 . 6185–6189 . 10.1109/ICASSP.2017.7953345. 978-1-5090-4117-6 . 40210273 .
  51. Mochizuki . S. . Murakami . T. . Accuracy comparison of land cover mapping using the object-oriented image classification with machine learning algorithms . 33rd Asian Conference on Remote Sensing 2012, ACRS 2012 . November 2012 . 1 . 126–133.
  52. Liu . Dan . Toman . Elizabeth . Fuller . Zane . Chen . Gang . Londo . Alexis . Zhang . Xuesong . Zhao . Kaiguang . Integration of historical map and aerial imagery to characterize long-term land-use change and landscape dynamics: An object-based analysis via Random Forests . Ecological Indicators . 2018 . 95 . 1 . 595–605 . 10.1016/j.ecolind.2018.08.004 . 2018EcInd..95..595L . 92025959 .
  53. Book: Giacinto . G. . Roli . F. . Fumera . G. . Proceedings 15th International Conference on Pattern Recognition. ICPR-2000 . Design of effective multiple classifier systems by clustering of classifiers . 2 . September 2000 . 160–163 . 10.1109/ICPR.2000.906039. 978-0-7695-0750-7 . 10.1.1.11.5328 . 2625643 .
  54. Du . Peijun . Liu . Sicong . Xia . Junshi . Zhao . Yindi . Information fusion techniques for change detection from multi-temporal remote sensing images . Information Fusion . January 2013 . 14 . 1 . 19–27 . 10.1016/j.inffus.2012.05.003.
  55. Bruzzone et al. (2002) define this rule as "The data class that receives the largest number of votes is taken as the class of the input pattern". This is simple majority voting, more accurately described as plurality voting.
  56. Zhao . Kaiguang . Wulder . Michael A . Hu . Tongxi . Bright . Ryan . Wu . Qiusheng . Qin . Haiming . Li . Yang . Detecting change-point, trend, and seasonality in satellite time series data to track abrupt changes and nonlinear dynamics: A Bayesian ensemble algorithm . Remote Sensing of Environment . 2019 . 232 . 111181 . 10.1016/j.rse.2019.04.034 . 2019RSEnv.23211181Z . 201310998 . free . 11250/2651134 . free .
  57. Bruzzone . Lorenzo . Cossu . Roberto . Vernazza . Gianni . Combining parametric and non-parametric algorithms for a partially unsupervised classification of multitemporal remote-sensing images . Information Fusion . December 2002 . 3 . 4 . 289–297 . 10.1016/S1566-2535(02)00091-X.
  58. Mugiraneza . Theodomir . Nascetti . Andrea . Ban . Yifang . Continuous monitoring of urban land cover change trajectories with Landsat time series and LandTrendr-Google Earth Engine cloud computing . Remote Sensing . 2020 . 12 . 18 . 2883 . 10.3390/rs12182883 . 2020RemS...12.2883M . free .
  59. Web site: Li . Yang . Zhao . Kaiguang . Hu . Tongxi . Zhang . Xuesong . BEAST: A Bayesian Ensemble Algorithm for Change-Point Detection and Time Series Decomposition .
  60. Raj Kumar . P. Arun . Selvakumar . S. . Distributed denial of service attack detection using an ensemble of neural classifier . Computer Communications . July 2011 . 34 . 11 . 1328–1341 . 10.1016/j.comcom.2011.01.012.
  61. Shabtai . Asaf . Moskovitch . Robert . Elovici . Yuval . Glezer . Chanan . Detection of malicious code by applying machine learning classifiers on static features: A state-of-the-art survey . Information Security Technical Report . February 2009 . 14 . 1 . 16–29 . 10.1016/j.istr.2009.03.003.
  62. Book: Zhang . Boyun . Yin . Jianping . Hao . Jingbo . Zhang . Dingxing . Wang . Shulin . Autonomic and Trusted Computing . Malicious Codes Detection Based on Ensemble Learning . 4610 . 2007 . 468–477 . 10.1007/978-3-540-73547-2_48 . Lecture Notes in Computer Science . 978-3-540-73546-5 .
  63. Menahem . Eitan . Shabtai . Asaf . Rokach . Lior . Elovici . Yuval . Improving malware detection by applying multi-inducer ensemble . Computational Statistics & Data Analysis . February 2009 . 53 . 4 . 1483–1494 . 10.1016/j.csda.2008.10.015. 10.1.1.150.2722 .
  64. Book: Locasto . Michael E. . Wang . Ke . Keromytis . Angelos D. . Stolfo . Salvatore J. . Recent Advances in Intrusion Detection . FLIPS: Hybrid Adaptive Intrusion Prevention . 3858 . 2005 . 82–101 . 10.1007/11663812_5. Lecture Notes in Computer Science . 978-3-540-31778-4 . 10.1.1.60.3798 .
  65. Giacinto . Giorgio . Perdisci . Roberto . Del Rio . Mauro . Roli . Fabio . Intrusion detection in computer networks by a modular ensemble of one-class classifiers . Information Fusion . January 2008 . 9 . 1 . 69–82 . 10.1016/j.inffus.2006.10.002. 10.1.1.69.9132 .
  66. Book: Mu . Xiaoyan . Lu . Jiangfeng . Watta . Paul . Hassoun . Mohamad H. . 2009 International Joint Conference on Neural Networks . Weighted voting-based ensemble classifiers with application to human face recognition and voice recognition . 2168–2171 . July 2009 . 10.1109/IJCNN.2009.5178708. 978-1-4244-3548-7 . 18850747 .
  67. Book: Su . Yu . Shan . Shiguang . Chen . Xilin . Gao . Wen . 7th International Conference on Automatic Face and Gesture Recognition (FGR06) . Hierarchical ensemble of Gabor Fisher classifier for face recognition . 91–96 . April 2006 . 10.1109/FGR.2006.64. 978-0-7695-2503-7 . 1513315 .
  68. Book: Su . Y. . Shan . S. . Chen . X. . Gao . W. . 18th International Conference on Pattern Recognition (ICPR'06) . Patch-Based Gabor Fisher Classifier for Face Recognition . September 2006 . 2 . 528–531 . 10.1109/ICPR.2006.917. 978-0-7695-2521-1 . 5381806 .
  69. Book: Liu . Yang . Lin . Yongzheng . Chen . Yuehui . 2008 Congress on Image and Signal Processing . Ensemble Classification Based on ICA for Face Recognition . July 2008 . 144–148 . 10.1109/CISP.2008.581 . 978-0-7695-3119-9 . 16248842 .
  70. Book: Rieger . Steven A. . Muraleedharan . Rajani . Ramachandran . Ravi P. . The 9th International Symposium on Chinese Spoken Language Processing . Speech based emotion recognition using spectral feature extraction and an ensemble of KNN classifiers . 2014 . 589–593 . 10.1109/ISCSLP.2014.6936711. 978-1-4799-4219-0 . 31370450 .
  71. Book: Krajewski . Jarek . Batliner . Anton . Kessel . Silke . 2010 20th International Conference on Pattern Recognition . Comparing Multiple Classifiers for Speech-Based Detection of Self-Confidence - A Pilot Study . October 2010 . 3716–3719 . 10.1109/ICPR.2010.905. 978-1-4244-7542-1 . 15431610 .
  72. Rani . P. Ithaya . Muneeswaran . K. . Recognize the facial emotion in video sequences using eye and mouth temporal Gabor features . Multimedia Tools and Applications . 25 May 2016 . 76 . 7 . 10017–10040 . 10.1007/s11042-016-3592-y. 20143585 .
  73. Rani . P. Ithaya . Muneeswaran . K. . Facial Emotion Recognition Based on Eye and Mouth Regions . International Journal of Pattern Recognition and Artificial Intelligence . August 2016 . 30 . 7 . 1655020 . 10.1142/S021800141655020X.
  74. Rani . P. Ithaya . Muneeswaran . K . Emotion recognition based on facial components . Sādhanā . 28 March 2018 . 43 . 3 . 10.1007/s12046-018-0801-6. free .
  75. Louzada . Francisco . Ara . Anderson . Bagging k-dependence probabilistic networks: An alternative powerful fraud detection tool . Expert Systems with Applications . October 2012 . 39 . 14 . 11583–11592 . 10.1016/j.eswa.2012.04.024.
  76. Sundarkumar . G. Ganesh . Ravi . Vadlamani . A novel hybrid undersampling method for mining unbalanced datasets in banking and insurance . Engineering Applications of Artificial Intelligence . January 2015 . 37 . 368–377 . 10.1016/j.engappai.2014.09.019.
  77. Kim . Yoonseong . Sohn . So Young . Stock fraud detection using peer group analysis . Expert Systems with Applications . August 2012 . 39 . 10 . 8986–8992 . 10.1016/j.eswa.2012.02.025.
  78. Savio . A. . García-Sebastián . M.T. . Chyzyk . D. . Hernandez . C. . Graña . M. . Sistiaga . A. . López de Munain . A. . Villanúa . J. . Neurocognitive disorder detection based on feature vectors extracted from VBM analysis of structural MRI . Computers in Biology and Medicine . August 2011 . 41 . 8 . 600–610 . 10.1016/j.compbiomed.2011.05.010. 21621760 .
  79. Book: Ayerdi . B. . Savio . A. . Graña . M. . Natural and Artificial Computation in Engineering and Medical Applications . Meta-ensembles of Classifiers for Alzheimer's Disease Detection Using Independent ROI Features . 7931 . June 2013 . Part 2 . 122–130 . 10.1007/978-3-642-38622-0_13. Lecture Notes in Computer Science . 978-3-642-38621-3 .
  80. Gu . Quan . Ding . Yong-Sheng . Zhang . Tong-Liang . An ensemble classifier based prediction of G-protein-coupled receptor classes in low homology . Neurocomputing . April 2015 . 154 . 110–118 . 10.1016/j.neucom.2014.12.013.
  81. Xue . Dan . Zhou . Xiaomin . Li . Chen . Yao . Yudong . Rahaman . Md Mamunur . Zhang . Jinghua . Chen . Hao . Zhang . Jinpeng . Qi . Shouliang . Sun . Hongzan . 2020 . An Application of Transfer Learning and Ensemble Learning Techniques for Cervical Histopathology Image Classification . IEEE Access . 8 . 104603–104618 . 10.1109/ACCESS.2020.2999816 . 219689893 . 2169-3536 . free . 2020IEEEA...8j4603X .
  82. Manna . Ankur . Kundu . Rohit . Kaplun . Dmitrii . Sinitca . Aleksandr . Sarkar . Ram . December 2021 . A fuzzy rank-based ensemble of CNN models for classification of cervical cytology . Scientific Reports . en . 11 . 1 . 14538 . 10.1038/s41598-021-93783-8 . 2045-2322 . 8282795 . 34267261 . 2021NatSR..1114538M .