Isolation forest

Isolation Forest is an algorithm for data anomaly detection using binary trees. It was developed by Fei Tony Liu in 2008.[1] It has linear time complexity and a low memory footprint, which makes it well suited to high-volume data.[2][3] It is based on the assumption that because anomalies are few and different from other data, they can be isolated using few partitions. Like decision tree algorithms, it does not perform density estimation. Unlike decision tree algorithms, it uses only path length to output an anomaly score, and does not use leaf node statistics of class distribution or target value.

Isolation Forest is fast because it splits the data space using randomly selected attributes and split points. The anomaly score is inversely related to the path length: because anomalies are few and different, they require fewer splits to be isolated.

History

The Isolation Forest (iForest) algorithm was initially proposed by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou in 2008. In 2010, an extension of the algorithm, SCiForest, was published to address clustered and axis-parallel anomalies.[4] In 2012 the same authors showed that iForest has linear time complexity, a small memory requirement, and is applicable to high-dimensional data.

Isolation trees

The premise of the Isolation Forest algorithm is that anomalous data points are easier to separate from the rest of the sample. In order to isolate a data point, the algorithm recursively generates partitions on the sample by randomly selecting an attribute and then randomly selecting a split value between the minimum and maximum values allowed for that attribute. An example of random partitioning in a 2D dataset of normally distributed points is shown in the first figure for a non-anomalous point and in the second one for a point that is more likely to be an anomaly. It is apparent from the pictures how anomalies require fewer random partitions to be isolated, compared to normal points.

Recursive partitioning can be represented by a tree structure named Isolation Tree, while the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, to reach a terminating node starting from the root. For example, the path length of point x_i in the first figure is greater than the path length of x_j in the second figure.

Let X = \{x_1, ..., x_n\} be a set of d-dimensional points and X' \subset X. An Isolation Tree (iTree) is defined as a data structure with the following properties:
  1. for each node T in the tree, T is either an external node with no child, or an internal node with one “test” and exactly two child nodes (T_l and T_r);
  2. a test at node T consists of an attribute q and a split value p such that the test q < p determines the traversal of a data point to either T_l or T_r.

In order to build an iTree, the algorithm recursively divides X' by randomly selecting an attribute q and a split value p, until either
  1. the node has only one instance, or
  2. all data at the node have the same values.
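A minimal sketch of this construction, and of the resulting path length, is given below using NumPy; the names ITreeNode, build_itree and path_length are illustrative helpers, not from the original paper.

import numpy as np

class ITreeNode:
    """A node of an iTree: internal nodes carry a test (q, p), external nodes do not."""
    def __init__(self, size=0, q=None, p=None, left=None, right=None):
        self.size = size    # number of points isolated at this external node
        self.q = q          # split attribute (internal nodes only)
        self.p = p          # split value (internal nodes only)
        self.left = left    # T_l: subtree for points with x[q] < p
        self.right = right  # T_r: subtree for points with x[q] >= p

def build_itree(X, rng):
    # Stop when the node holds a single instance or all remaining points are identical.
    if len(X) <= 1 or np.all(X == X[0]):
        return ITreeNode(size=len(X))
    q = rng.integers(X.shape[1])                   # randomly selected attribute
    lo, hi = X[:, q].min(), X[:, q].max()
    p = rng.uniform(lo, hi)                        # randomly selected split value
    mask = X[:, q] < p
    if not mask.any() or mask.all():               # degenerate split on a constant attribute
        return ITreeNode(size=len(X))
    return ITreeNode(q=q, p=p,
                     left=build_itree(X[mask], rng),
                     right=build_itree(X[~mask], rng))

def path_length(x, node, depth=0):
    # Number of edges traversed from the root to the external node that isolates x.
    if node.q is None:
        return depth
    child = node.left if x[node.q] < node.p else node.right
    return path_length(x, child, depth + 1)

# Example: a forest is an ensemble of iTrees, each built on a random subsample of the data.
# rng = np.random.default_rng(0)
# X = rng.normal(size=(256, 2))
# forest = [build_itree(X, rng) for _ in range(100)]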

When the iTree is fully grown, each point in X is isolated at one of the external nodes. Intuitively, the anomalous points are those (easier to isolate, hence) with the smaller path length in the tree, where the path length h(x_i) of point x_i \in X is defined as the number of edges x_i traverses from the root node to reach an external node.

A probabilistic explanation of iTree is provided in the original iForest paper.

Anomaly detection

Anomaly detection with Isolation Forest is done as follows:

  1. Use the training dataset to build some number of iTrees
  2. For each data point in the test set:
    1. Pass it through all the iTrees, counting the path length for each tree
    2. Assign an “anomaly score” to the instance
    3. Label the point as “anomaly” if its score is greater than a predefined threshold, which depends on the domain
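A hedged sketch of this workflow using scikit-learn's IsolationForest follows; the synthetic data and the 0.55 threshold are illustrative assumptions, not recommendations.

import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 2))                    # mostly "normal" training points
X_test = np.vstack([rng.normal(size=(10, 2)),           # normal-looking test points
                    rng.uniform(-6, 6, size=(5, 2))])   # points likely to be anomalies

forest = IsolationForest(n_estimators=100, random_state=0).fit(X_train)

# score_samples returns the negated anomaly score, so flip the sign:
# higher values now mean "more anomalous" (roughly the s(x, m) of the next section).
scores = -forest.score_samples(X_test)
threshold = 0.55                                        # domain-dependent cut-off
labels = np.where(scores > threshold, "anomaly", "normal")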

Anomaly score

The algorithm for computing the anomaly score of a data point is based on the observation that the structure of iTrees is equivalent to that of Binary Search Trees (BST): a termination at an external node of the iTree corresponds to an unsuccessful search in the BST. Therefore, the estimation of the average h(x) for external node terminations is the same as that of the unsuccessful searches in a BST, that is[5]

c(m) = \begin{cases} 2H(m-1) - \dfrac{2(m-1)}{n} & \text{for } m > 2 \\ 1 & \text{for } m = 2 \\ 0 & \text{otherwise} \end{cases}

where n is the test set size, m is the sample set size and H is the harmonic number, which can be estimated by H(i) = \ln(i) + \gamma, where \gamma \approx 0.5772156649 is the Euler–Mascheroni constant.
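As a small illustration of these definitions, c(m) can be computed with the harmonic-number approximation above; the helper names below are ours, not from the paper.

import numpy as np

EULER_GAMMA = 0.5772156649

def harmonic(i):
    # H(i) estimated as ln(i) + gamma
    return np.log(i) + EULER_GAMMA

def c(m, n):
    # Average path length of an unsuccessful BST search, following the case split above.
    if m > 2:
        return 2 * harmonic(m - 1) - 2 * (m - 1) / n
    if m == 2:
        return 1.0
    return 0.0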

Above, c(m) is the average h(x) given m, so we can use it to normalize h(x) and obtain an estimate of the anomaly score for a given instance x:

s(x, m) = 2^{-\frac{E(h(x))}{c(m)}}

where E(h(x)) is the average value of h(x) from a collection of iTrees. For any data point x:
- if s is close to 1, then x is very likely an anomaly
- if s is smaller than 0.5, then x is likely normal
- if all points in the sample score around 0.5, then it is likely that they are all normal
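Continuing the sketches above (path_length and c are the hypothetical helpers defined earlier), the normalized score of a single instance can be computed from its path lengths across the forest:

import numpy as np

def anomaly_score(x, forest, m, n):
    # E(h(x)): average path length of x over all iTrees in the forest.
    e_h = np.mean([path_length(x, tree) for tree in forest])
    # s(x, m) = 2^(-E(h(x)) / c(m))
    return 2.0 ** (-e_h / c(m, n))

Scores close to 1 flag likely anomalies, while scores well below 0.5 suggest normal points.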

Properties

Parameter Selection

Selecting appropriate parameters is the key to the performance of the Isolation Forest algorithm. Each of the parameters influences anomaly detection differently. Key parameters include:

Number of Trees: The number of trees in the ensemble. More trees improve the detection of unusual patterns but increase computational cost. A commonly used range is 100 to 300 trees, which balances accuracy against computational efficiency.

Subsample Size: The size of the sample used to construct each tree. Smaller sample sizes reduce computational complexity, while larger sample sizes capture more of the data's variability. A common value for the subsample size is 256, though the optimal value can vary depending on the dataset's characteristics.

Contamination Factor: An estimate of the proportion of outliers in the dataset. The higher the contamination factor, the more points are flagged as anomalies; the lower it is, the fewer points are flagged. It is usually determined from domain knowledge or by cross-validation. Choosing an appropriate contamination value is important to avoid excessive false positives and false negatives.

Tree Depth: The maximum depth to which each tree is grown. Shallower trees are faster to train, but they can make it difficult for the model to find anomalies, especially in complex datasets. Finding the right balance between depth and training speed is crucial. For example, shallow trees may perform well for simple datasets but struggle with high-dimensional or noisy data.

Split Selection Strategy: Splits are selected randomly between the minimum and maximum values of each feature in the sample, which keeps the algorithm efficient and reduces overfitting. For more complex data, other split-selection methods can be used.

Random Seed: Setting a fixed random seed ensures that the results of the algorithm are reproducible across runs. This is important when comparing models or tuning hyperparameters, because it removes differences that are due only to randomness; it also keeps results stable across repeated runs or cross-validation folds.[7][8]

The table below summarizes parameter selection based on dataset characteristics:

Guidelines for Selecting Key Parameters

Parameter | Small Datasets | Large Datasets | High-Dimensional Data | Imbalanced Datasets
Number of Trees | 50–100 | 200+ | Higher values for better accuracy | Increase if many anomalies are expected
Subsample Size | 100+ | 512+ | Reduce to avoid overfitting | Larger subsample size for better accuracy
Contamination Factor | Set based on domain knowledge | Set based on domain knowledge | Default (None) | Increase for many anomalies
Tree Depth | Shallow (1–2) | Deeper (max features) | Limit to prevent overfitting | Set lower for generalization
Split Selection Strategy | Random splits | Random splits | Random splits | Default, unless anomalies are complex
Random Seed | Set for reproducibility | Set for reproducibility | Set for reproducibility | Set for consistency
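In scikit-learn, these parameters map roughly onto the arguments of IsolationForest as in the hedged example below; the values shown are illustrative, not recommendations, and tree depth is not set directly but derived from the subsample size.

from sklearn.ensemble import IsolationForest

model = IsolationForest(
    n_estimators=200,       # number of trees
    max_samples=256,        # subsample size used to build each tree
    contamination=0.05,     # expected proportion of anomalies; use "auto" if unknown
    max_features=1.0,       # fraction of features drawn to build each tree
    random_state=42,        # fixed seed for reproducibility
)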

SCiForest

SCiForest (Isolation Forest with Split-selection Criterion) is an extension of the original Isolation Forest algorithm. It specifically targets clustered anomalies by selecting a split criterion and finding hyper-planes that are more likely to produce good results. SCiForest introduces random hyper-planes that are non-axis-parallel to the original attributes. Since it is not necessary to have the optimal hyper-plane in every node, with a sufficient number of randomly generated hyper-planes a good enough hyper-plane will emerge. Therefore, the resulting model is considerably effective due to the aggregate power of the ensemble learner.
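The idea can be sketched as follows: at each node, several random non-axis-parallel hyper-planes are generated and the one that best separates the data is kept. The dispersion-reduction criterion below is a simplified stand-in for the split-selection criterion of the SCiForest paper, and the function name is illustrative.

import numpy as np

def best_random_hyperplane(X, n_trials=10, rng=None):
    """Try several random non-axis-parallel splits and keep the most promising one."""
    rng = rng or np.random.default_rng(0)
    best = None
    for _ in range(n_trials):
        normal = rng.normal(size=X.shape[1])            # random oblique direction
        proj = X @ normal                               # project the points onto it
        split = rng.uniform(proj.min(), proj.max())     # random split point on the projection
        left, right = proj[proj < split], proj[proj >= split]
        if len(left) == 0 or len(right) == 0:
            continue
        # Simplified stand-in criterion: how much the split reduces dispersion.
        gain = (proj.std() - 0.5 * (left.std() + right.std())) / (proj.std() + 1e-12)
        if best is None or gain > best[0]:
            best = (gain, normal, split)
    return best   # (gain, hyperplane normal, split value), or None if no valid split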

SCIForest: Subspace Clustering Isolation Forest Implementation

SCIForest (Subspace Clustering Isolation Forest) is an anomaly detection algorithm that combines subspace clustering with the isolation-based approach of Isolation Forest (iForest). Its implementation is tailored for high-dimensional datasets, addressing challenges such as irrelevant features, noise, and feature redundancy. Below is a detailed explanation of its theoretical implementation.

Steps in SCIForest Implementation

SCIForest consists of the following key steps:

1. Subspace Selection

Features are grouped into clusters (e.g., using KMeans or hierarchical clustering) to identify related subsets. Random subspaces are then sampled from these clusters, allowing SCIForest to focus on smaller, meaningful feature groups rather than the full feature space. This reduces the impact of irrelevant or noisy dimensions.[4][6]
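A minimal sketch of this step, assuming features are clustered with KMeans on the transposed data matrix; the helper name and defaults are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def sample_subspaces(X, n_feature_clusters=3, n_subspaces=5, rng=None):
    rng = rng or np.random.default_rng(0)
    # Cluster the features (columns) by their profiles across samples.
    feature_labels = KMeans(n_clusters=n_feature_clusters, n_init=10,
                            random_state=0).fit_predict(X.T)
    subspaces = []
    for _ in range(n_subspaces):
        cluster = rng.integers(n_feature_clusters)                 # pick a feature cluster
        members = np.flatnonzero(feature_labels == cluster)
        size = rng.integers(1, len(members) + 1)                   # random subset of it
        subspaces.append(rng.choice(members, size=size, replace=False))
    return subspaces   # each entry is an array of column indices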

2. Isolation Tree Construction

Within each selected subspace, isolation trees are constructed. These trees isolate points through random recursive splitting: anomalous points, being sparse or distinct, are isolated more quickly (with shorter path lengths) compared to normal points.[2][4]

3. Anomaly Scoring

For each data point, the isolation depth h(x) is calculated for all trees across all subspaces. The anomaly score S(x) for a data point x is defined as:

S(x) = \frac{1}{n} \sum_{i=1}^{n} h_i(x)

where h_i(x) is the path length for data point x in the i-th tree and n is the total number of isolation trees. Points with lower average path lengths S(x) are more likely to be anomalies.[3][4]

4. Thresholding

The final anomaly scores are compared against a predefined threshold \theta to classify data points. If S(x) > \theta, the point is classified as anomalous; otherwise, it is normal. The threshold \theta can be adjusted based on application requirements or the contamination rate (the expected proportion of anomalies).[6]

Summary of SCIForest Steps

Overview of SCIForest Implementation

Step | Description | Key Insights
Subspace Selection | Clusters features and samples random subsets. | Focuses on relevant feature subsets, reducing the impact of noise and irrelevant features. [4]
Isolation Tree Construction | Constructs trees by recursively splitting points in subspaces. | Anomalies are isolated with shorter path lengths. [2]
Anomaly Scoring | Aggregates isolation depths across trees and subspaces. | Provides a robust anomaly detection mechanism. [3]
Thresholding | Compares scores to a threshold for classification. | Enables customization based on specific application needs. [6]
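Putting the four steps together, a hedged end-to-end sketch might look as follows, reusing the hypothetical sample_subspaces helper above and standing in scikit-learn's IsolationForest for the per-subspace isolation trees (so the aggregated score here is the library's negated score, with higher values meaning more anomalous).

import numpy as np
from sklearn.ensemble import IsolationForest

def sciforest_scores(X, subspaces, n_trees_per_subspace=50):
    # Step 2: build an isolation forest inside each selected subspace.
    forests = [IsolationForest(n_estimators=n_trees_per_subspace,
                               random_state=i).fit(X[:, cols])
               for i, cols in enumerate(subspaces)]
    # Step 3: aggregate anomaly scores across subspaces.
    return np.mean([-f.score_samples(X[:, cols])
                    for f, cols in zip(forests, subspaces)], axis=0)

# Step 4: thresholding, e.g. flag the top 5% of scores as anomalies.
# scores = sciforest_scores(X, sample_subspaces(X))
# anomalies = scores > np.quantile(scores, 0.95)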

Extended Isolation Forest

Extended Isolation Forest (Extended IF or EIF) is another extension of the original Isolation Forest algorithm. Extended IF uses rotated trees in different planes, similarly to SCiForest, and selects random values to split the data, such as a random slope and a random intercept.

The standard Isolation Forest requires two pieces of information, those being 1) a random feature or coordinate, and 2) a random value for the feature from the range of available values in the data. The Extended IF also requires only two pieces of information, this time being 1) a random slope for the branch cut, and 2) a random intercept for the branch cut which is chosen from the range of available values of the training data. This makes the Extended IF simpler than using rotation trees.[9]
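A small illustrative sketch of such a branch cut, assuming NumPy: the random slope is a random direction vector and the random intercept is a point drawn from the range of the training data; each point goes left or right depending on which side of the resulting hyperplane it falls.

import numpy as np

def random_branch_cut(X, rng=None):
    rng = rng or np.random.default_rng(0)
    slope = rng.normal(size=X.shape[1])                       # random slope (hyperplane normal)
    intercept = rng.uniform(X.min(axis=0), X.max(axis=0))     # random point within the data range
    goes_left = (X - intercept) @ slope < 0                   # side of the hyperplane
    return goes_left, slope, intercept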

The figure depicts a score map of a regular Isolation Forest in comparison to an Extended Isolation Forest for a sinusoid-shaped data set. It shows clearly the improvement made by the Extended Isolation Forest, whose scores follow the shape of the data much more closely, while the regular Isolation Forest fails to capture the sinusoid shape of the data and to evaluate the anomaly scores properly. The regular Isolation Forest shapes the anomaly scores into a rectangular region and simply assumes that any region near the sinusoid data points is not anomalous. In comparison, the EIF evaluates anomaly scores with more detail and, unlike its predecessor, is able to detect anomalies that lie close to the sinusoid shape of the data but are still anomalous. The original EIF publication also includes this comparison for a single-blob-shaped data set and a two-blob-shaped data set, and additionally compares the EIF results to an Isolation Forest using rotation trees.

Improvements in Extended Isolation Forest

The Extended Isolation Forest enhances the traditional Isolation Forest algorithm by addressing some of its limitations, particularly in handling high-dimensional data and improving anomaly detection accuracy. Key improvements in EIF include:

Enhanced Splitting Mechanism: Unlike traditional Isolation Forest, which uses random axis-aligned splits, EIF uses hyperplanes for splitting data. This approach allows for more flexible and accurate partitioning of the data space, which is especially useful in high-dimensional datasets.

Improved Anomaly Scoring: EIF refines the anomaly scoring process by considering the distance of data points from the hyperplane used in splitting. This provides a more granular and precise anomaly score, leading to better differentiation between normal and anomalous points.

Handling of High-Dimensional Data: The use of hyperplanes also improves EIF's performance in high-dimensional spaces. Traditional Isolation Forest can suffer from the curse of dimensionality in such scenarios, but EIF mitigates this issue by creating more meaningful and informative partitions in the data space.[10]

Open source implementations

The original implementation by Fei Tony Liu is Isolation Forest in R.

Other implementations (in alphabetical order):

Other variations of Isolation Forest algorithm implementations:

Python implementation with Scikit-learn

The isolation forest algorithm is commonly used by data scientists through the version made available in the scikit-learn library. The snippet below shows a brief implementation of an isolation forest, with explanations given directly in the comments.

import pandas as pd
from sklearn.ensemble import IsolationForest

# 'data.csv' is a file containing samples as rows and features as columns,
# plus a column labeled 'Class' with a binary classification of the samples.
df = pd.read_csv('data.csv')
X = df.drop(columns=['Class'])
y = df['Class']

# Determine how many samples will be outliers based on the classification
outlier_fraction = len(df[df['Class'] == 1]) / float(len(df[df['Class'] == 0]))

# Create and fit the model; parameters can be optimized
model = IsolationForest(n_estimators=100, contamination=outlier_fraction, random_state=42)
model.fit(X)

In this snippet we can observe the simplicity of a standard implementation of the algorithm. The only input the user needs to adjust is the outlier fraction, which determines the percentage of samples to be classified as outliers; here it is derived from the ratio of positive to negative samples according to a given classification. Most of the other steps are standard for any decision-tree-based technique available through scikit-learn: the user simply needs to split the target variable from the features and fit the model after defining it with a given number of estimators (or trees).
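Continuing that snippet, the fitted model's predictions and scores can then be inspected, for example:

# -1 marks predicted outliers, 1 marks inliers
predictions = model.predict(X)

# Raw scores from the fitted forest: the lower, the more anomalous
scores = model.score_samples(X)

print((predictions == -1).sum(), "points flagged as outliers out of", len(X))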

This snippet is a shortened adapted version of an implementation explored by GeeksforGeeks, which can be accessed for further explorations.[13]

See also

Notes and References

  1. Liu, Fei Tony. "First Isolation Forest implementation on Sourceforge" (web site).
  2. Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (December 2008). "Isolation Forest". 2008 Eighth IEEE International Conference on Data Mining, pp. 413–422. doi:10.1109/ICDM.2008.17. ISBN 978-0-7695-3502-9. S2CID 6505449.
  3. Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (2012). "Isolation-Based Anomaly Detection". ACM Transactions on Knowledge Discovery from Data 6: 3:1–3:39. doi:10.1145/2133360.2133363. S2CID 207193045.
  4. Liu, Fei Tony; Ting, Kai Ming; Zhou, Zhi-Hua (September 2010). "On Detecting Clustered Anomalies Using SCiForest". Joint European Conference on Machine Learning and Knowledge Discovery in Databases – ECML PKDD 2010. Lecture Notes in Computer Science 6322, pp. 274–290. doi:10.1007/978-3-642-15883-4_18. ISBN 978-3-642-15882-7.
  5. Shaffer, Clifford A. (2011). Data Structures & Algorithm Analysis in Java (3rd Dover ed.). Mineola, NY: Dover Publications. ISBN 9780486485812. OCLC 721884651.
  6. Talagala, Priyanga Dilini; Hyndman, Rob J.; Smith-Miles, Kate (12 August 2019). "Anomaly Detection in High Dimensional Data". arXiv:1908.04000 [stat.ML].
  7. "numpy.random.randn — NumPy v2.1 Manual". numpy.org. Retrieved 2024-11-21.
  8. "numpy.random.uniform — NumPy v2.1 Manual". numpy.org. Retrieved 2024-11-21.
  9. Hariri, Sahand; Kind, Matias Carrasco; Brunner, Robert J. (April 2021). "Extended Isolation Forest". IEEE Transactions on Knowledge and Data Engineering 33 (4): 1479–1489. doi:10.1109/TKDE.2019.2947676. arXiv:1811.02141. ISSN 1558-2191. S2CID 53236735.
  10. Hariri, Sahand; Kind, Matias Carrasco; Brunner, Robert J. (2021-04-01). "Extended Isolation Forest". IEEE Transactions on Knowledge and Data Engineering 33 (4): 1479–1489. doi:10.1109/TKDE.2019.2947676. arXiv:1811.02141. ISSN 1041-4347.
  11. Verbus, James (13 August 2019). "Detecting and preventing abuse on LinkedIn using isolation forests". LinkedIn Engineering Blog. Retrieved 2023-07-02.
  12. Cortes, David (2019). "Distance approximation using Isolation Forests". arXiv:1910.12362 [stat.ML].
  13. GeeksforGeeks. "What is Isolation Forest?". Accessed November 19, 2024.