Isolation Forest is an algorithm for data anomaly detection using binary trees. It was developed by Fei Tony Liu in 2008.[1] It has a linear time complexity and a low memory use, which works well for high-volume data.[2] [3] It is based on the assumption that because anomalies are few and different from other data, they can be isolated using few partitions. Like decision tree algorithms, it does not perform density estimation. Unlike decision tree algorithms, it uses only path length to output an anomaly score, and does not use leaf node statistics of class distribution or target value.
Isolation Forest is fast because it splits the data space, randomly selecting an attribute and split point. The anomaly score is inversely associated with the path-length because anomalies need fewer splits to be isolated, because they are few and different.
The Isolation Forest (iForest) algorithm was initially proposed by Fei Tony Liu, Kai Ming Ting and Zhi-Hua Zhou in 2008. In 2012 the same authors showed that iForest has linear time complexity, a small memory requirement, and is applicable to high-dimensional data. In 2010, an extension of the algorithm, SCiforest, was published to address clustered and axis-paralleled anomalies.[4]
The premise of the Isolation Forest algorithm is that anomalous data points are easier to separate from the rest of the sample. In order to isolate a data point, the algorithm recursively generates partitions on the sample by randomly selecting an attribute and then randomly selecting a split value between the minimum and maximum values allowed for that attribute.An example of random partitioning in a 2D dataset of normally distributed points is shown in the first figure for a non-anomalous point and in the second one for a point that is more likely to be an anomaly. It is apparent from the pictures how anomalies require fewer random partitions to be isolated, compared to normal points.
Recursive partitioning can be represented by a tree structure named Isolation Tree, while the number of partitions required to isolate a point can be interpreted as the length of the path, within the tree, to reach a terminating node starting from the root. For example, the path length of point
xi
xj
Let
X=\{x1,...,xn\}
X'\subsetX
T
T
Tl
Tr
T
q
p
q<p
Tl
Tr
In order to build an iTree, the algorithm recursively divides
X'
q
p
When the iTree is fully grown, each point in
X
h(xi)
xi\inX
xi
A probabilistic explanation of iTree is provided in the original iForest paper.
Anomaly detection with Isolation Forest is done as follows:
The algorithm for computing the anomaly score of a data point is based on the observation that the structure of iTrees is equivalent to that of Binary Search Trees (BST): a termination to an external node of the iTree corresponds to an unsuccessful search in the BST. Therefore, the estimation of average
h(x)
c(m)=\begin{cases}2H(m-1)-
2(m-1) | |
n |
&form>2\ 1&form=2\ 0&otherwise\end{cases}
where
n
m
H
H(i)=ln(i)+\gamma
\gamma=0.5772156649
Above,
c(m)
h(x)
m
h(x)
| ||||
s(x,m)=2 |
where
E(h(x))
h(x)
x
s
1
x
s
0.5
x
0.5
Selecting appropriate parameters is the key to the performance of the Isolation Forest algorithm. Each of the parameters influences anomaly detection differently. Key parameters include:
Number of Trees : The number of trees in the ensemble. More trees make it better at finding unusual patterns but also raise the cost of computing. Commonly used range is 100 to 300 which balances accuracy with how efficiently it uses computing power.
Subsample Size : The size of the sample that is used to construct each tree. Smaller sample sizes reduce computational complexity, while larger sample sizes capture more data variability. A common value for subsample size is 256, though the optimal value can vary depending on the dataset's characteristics.
Contamination Factor : An estimate of the proportion of outliers in the dataset. The higher the contamination factor, the more points are flagged as anomalies, and the lower it is, the fewer points it marks. It is usually determined by knowledge of the field or cross-validation. Finding the right value for contamination is important to avoid false positives and false negatives.
Tree Depth : This refers to the number of features that can be considered for a split in each tree. The smaller the tree depth, the faster the training is; however, it can make it difficult for the model to find anomalies, especially in complex datasets. Finding the right balance between depth and training speed is crucial. For example, shallow trees may perform well for simple datasets but struggle with high-dimensional or noisy data.
Split Selection Strategy : Splits are selected randomly between the minimum and maximum values of each feature in the sample, while maintaining efficiency and reducing overfitting. In the case of more complex data, other methods can be used.
Random Seed : Setting a fixed random seed ensures that the results from the algorithm will be reproducible over multiple runs. This is very important when comparing models or tuning hyperparameters because it avoids differences due to randomness. The stability of the results in each stage of running many iterations or cross-validation is essential.[7] [8]
The table below shows the summary of parameter selection based on dataset characteristics:
Parameter | Small Datasets | Large Datasets | High-Dimensional Data | Imbalanced Datasets | |
---|---|---|---|---|---|
Number of Trees | 50–100 | 200+ | Higher values for better accuracy | Increase if many anomalies are expected | |
Subsample Size | 100+ | 512+ | Reduce to avoid overfitting | Larger subsample size for better accuracy | |
Contamination Factor | Set based on domain knowledge | Set based on domain knowledge | Default (None) | Increase for many anomalies | |
Tree Depth | Shallow (1–2) | Deeper (max features) | Limit to prevent overfitting | Set lower for generalization | |
Split Selection Strategy | Random splits | Random splits | Random splits | Default, unless anomalies are complex | |
Random Seed | Set for reproducibility | Set for reproducibility | Set for reproducibility | Set for consistency |
SCiForest (Isolation Forest with Split-selection Criterion) is an extension of the original Isolation Forest algorithm. It targets specifically clustered anomalies through selecting a split criterion and finding hyper-planes that are more likely to produce good results. SCiForest introduces random hyper-planes which are non-axis-parallel to the originalattributes. Since it is not necessary to have the optimal hyper-plane in every node, with sufficient trials of randomly generated hyper-planes a good enough hyper-plane will emerge. Therefore, the resulting model is considerably effective due to aggregate power of the ensemble learner.
SCIForest (Subspace Clustering Isolation Forest) is an anomaly detection algorithm that combines subspace clustering with the isolation-based approach of Isolation Forest (iForest). Its implementation is tailored for high-dimensional datasets, addressing challenges such as irrelevant features, noise, and feature redundancy. Below is a detailed explanation of its theoretical implementation.
SCIForest consists of the following key steps:
Features are grouped into clusters (e.g., using KMeans or hierarchical clustering) to identify related subsets. Random subspaces are then sampled from these clusters, allowing SCIForest to focus on smaller, meaningful feature groups rather than the full feature space. This reduces the impact of irrelevant or noisy dimensions.[4][6]
Within each selected subspace, isolation trees are constructed. These trees isolate points through random recursive splitting:
Anomalous points, being sparse or distinct, are isolated more quickly (shorter path lengths) compared to normal points. [2][4]
For each data point, the isolation depth (
h(x)
S(x)
x
S(x)=
1 | |
n |
n | |
\sum | |
i=1 |
hi(x)
Where:
hi(x)
x
i
n
Points with lower average path lengths (
S(x)
The final anomaly scores are compared against a predefined threshold
\theta
S(x)>\theta
\theta
Subspace Selection | Clusters features and samples random subsets. | Focuses on relevant feature subsets, reducing the impact of noise and irrelevant features. [4] | |
Isolation Tree Construction | Constructs trees by recursively splitting points in subspaces. | Anomalies are isolated with shorter path lengths. [2] | |
Anomaly Scoring | Aggregates isolation depths across trees and subspaces. | Provides a robust anomaly detection mechanism. [3] | |
Thresholding | Compares scores to a threshold for classification. | Enables customization based on specific application needs. [6] |
Extended Isolation Forest (Extended IF or EIF) is another extension of the original Isolation Forest algorithm. Extended IF uses rotated trees in different planes, similarly to SCiForest and random values are selected to split the data, such as a random slope or intercept.
The standard Isolation Forest requires two pieces of information, those being 1) a random feature or coordinate, and 2) a random value for the feature from the range of available values in the data. The Extended IF also requires only two pieces of information, this time being 1) a random slope for the branch cut, and 2) a random intercept for the branch cut which is chosen from the range of available values of the training data. This makes the Extended IF simpler than using rotation trees.[9]
The figure depicts a score map of a regular Isolation Forest in comparison to an Extended Isolation Forest for a sinusoidal-shaped data-set. This image allows us to clearly observe the improvement made by the Extended Isolation Forest in evaluating the scores much more accurately when compared to the shape of the data. While the regular isolation forest fails in capturing the sinusoid shape of the data and properly evaluating the anomaly scores. The regular Isolation Forest shapes the anomaly scores into a rectangular shape and simply assumes that any region nearby the sinusoid data point is not to be anomalous. In comparison, the EIF is more accurate in evaluating anomaly scores with more detail and unlike its predecessor, the EIF is able to detect anomalies that are close to the sinusoid shape of the data but are still anomalous. The original EIF publication includes also this comparison with a single-blob-shaped data-set and a two-blob-shaped data-set, also comparing the EIF results to isolation forest using rotation trees.
The Extended Isolation Forest enhances the traditional Isolation Forest algorithm by addressing some of its limitations, particularly in handling high-dimensional data and improving anomaly detection accuracy. Key improvements in EIF include:
Enhanced Splitting Mechanism: Unlike traditional Isolation Forest, which uses random axis-aligned splits, EIF uses hyperplanes for splitting data. This approach allows for more flexible and accurate partitioning of the data space, which is especially useful in high-dimensional datasets.
Improved Anomaly Scoring: EIF refines the anomaly scoring process by considering the distance of data points from the hyperplane used in splitting. This provides a more granular and precise anomaly score, leading to better differentiation between normal and anomalous points.
Handling of High-Dimensional Data: The use of hyperplanes also improves EIF's performance in high-dimensional spaces. Traditional Isolation Forest can suffer from the curse of dimensionality in such scenarios, but EIF mitigates this issue by creating more meaningful and informative partitions in the data space.[10]
Original implementation by Fei Tony Liu is Isolation Forest in R.
Other implementations (in alphabetical order):
Other variations of Isolation Forest algorithm implementations:
The isolation forest algorithm is commonly used by data scientists through the version made available in the scikit-learn library. The snippet below depicts a brief implementation of an isolation forest, with direct explanations with comments.
df = pd.read_csv('data.csv')X = df.drop(columns=['Class'])y = df['Class']
outlier_fraction = len(df[df['Class']
0]))
model = IsolationForest(n_estimators=100, contamination=outlier_fraction, random_state=42)model.fit(df)
In this snippet we can observe the simplicity of a standard implementation of the algorithm. The only requirement data that the user needs to adjust is the outlier fraction in which the user determines a percentage of the samples to be classifier as outliers. This can be commonly done by selection a group among the positive and negative samples according to a giving classification. Most of the other steps are pretty standard to any decision tree based technique made available through scikit-learn, in which the user simply needs to split the target variable from the features and fit the model after it is defined with a giving number of estimators (or trees).
This snippet is a shortened adapted version of an implementation explored by GeeksforGeeks, which can be accessed for further explorations.[13]