DecisionTreeAndRandomForest
Documentation for DecisionTreeAndRandomForest.
DecisionTreeAndRandomForest.DecisionTree
DecisionTreeAndRandomForest.Leaf
DecisionTreeAndRandomForest.Node
DecisionTreeAndRandomForest.RandomForest
Base.show
Base.show
DecisionTreeAndRandomForest.build_tree
DecisionTreeAndRandomForest.calculate_entropy
DecisionTreeAndRandomForest.calculate_gini
DecisionTreeAndRandomForest.calculate_variance
DecisionTreeAndRandomForest.fit!
DecisionTreeAndRandomForest.fit!
DecisionTreeAndRandomForest.get_split_criterions
DecisionTreeAndRandomForest.gini_impurity
DecisionTreeAndRandomForest.information_gain
DecisionTreeAndRandomForest.predict
DecisionTreeAndRandomForest.predict
DecisionTreeAndRandomForest.predict_single
DecisionTreeAndRandomForest.split_gini
DecisionTreeAndRandomForest.split_ig
DecisionTreeAndRandomForest.split_node
DecisionTreeAndRandomForest.split_variance
DecisionTreeAndRandomForest.variance_reduction
DecisionTreeAndRandomForest.weighted_entropy
DecisionTreeAndRandomForest.weighted_gini
DecisionTreeAndRandomForest.weighted_variance
DecisionTreeAndRandomForest.DecisionTree
— Type
Represents a DecisionTree.
Fields
max_depth::Int64
: Controls the maximum depth of the tree. If -1, the DecisionTree is of unlimited depth.
min_samples_split::Int64
: Controls the minimum number of samples required to split a node.
num_features::Int64
: Controls the number of features to consider for each split. If -1, all features are used.
split_criterion::Function
: Contains the split criterion function.
root::Union{Missing, DecisionTreeAndRandomForest.Node, DecisionTreeAndRandomForest.Leaf}
: Contains the root node of the DecisionTree.
DecisionTreeAndRandomForest.Leaf
— Type
Represents a Leaf in the DecisionTree structure.
Fields
values::AbstractVector
: Contains the label values of the training samples that reached this leaf.
DecisionTreeAndRandomForest.Node
— Type
Represents a Node in the DecisionTree structure.
Fields
left::Union{DecisionTreeAndRandomForest.Node, DecisionTreeAndRandomForest.Leaf}
: Points to the left child.
right::Union{DecisionTreeAndRandomForest.Node, DecisionTreeAndRandomForest.Leaf}
: Points to the right child.
feature_index::Int64
: Stores the index of the selected feature.
split_value::Any
: Stores the value on which the data is split.
DecisionTreeAndRandomForest.RandomForest
— Type
Represents a RandomForest.
Fields
trees::Vector{DecisionTree}
: Contains the vector of DecisionTree structures.
max_depth::Int64
: Contains the maximum depth of each tree. If -1, the trees are of unlimited depth.
min_samples_split::Int64
: Contains the minimum number of samples required to split a node.
split_criterion::Function
: Contains the split criterion function.
number_of_trees::Int64
: Contains the number of trees in the RandomForest structure.
subsample_percentage::Float64
: Contains the percentage of the dataset to use for training each tree.
num_features::Int64
: Contains the number of features to use when finding the best split. If -1, all features are used.
Base.show
— Method
show(io, tree)
This function recursively prints the structure of the DecisionTree, providing information about each node and leaf. It's primarily used for debugging and visualizing the tree's structure.
Arguments
io::IO
: The IO context to print the tree structure to.
tree::DecisionTree
: The DecisionTree to print.
Returns
Nothing
: This function prints the structure of the DecisionTree.
Base.show
— Method
show(io, forest)
This function recursively prints the structure of the RandomForest. It's primarily used for debugging and visualizing the forest structure.
Arguments
io::IO
: The IO context to print the forest structure to.
forest::RandomForest
: The RandomForest to be printed.
Returns
Nothing
: This function prints the structure of the RandomForest.
DecisionTreeAndRandomForest.build_tree
— Function
build_tree(
data,
labels,
max_depth,
min_samples_split,
num_features,
split_criterion
)
build_tree(
data,
labels,
max_depth,
min_samples_split,
num_features,
split_criterion,
depth
)
This function recursively builds a DecisionTree by iteratively splitting the data based on the provided split_criterion. The process continues until the maximum depth is reached, the number of samples in a node falls below min_samples_split, or all labels in a node are the same.
Arguments
data::AbstractMatrix
: The training data.
labels::AbstractVector
: The labels for the training data.
max_depth::Int
: The maximum depth of the tree.
min_samples_split::Int
: The minimum number of samples required to split a node.
num_features::Int
: The number of features to consider for each split. If -1, all features are used.
split_criterion::Function
: The function used to determine the best split at each node.
depth::Int=0
: The current depth of the tree (used recursively).
Returns
Union{Node, Leaf}
: A Node if the data was split further, or a Leaf if a stopping condition was met at that point of the recursion.
DecisionTreeAndRandomForest.calculate_entropy
— Method
calculate_entropy(y)
Calculates the entropy of a vector of labels y.
Arguments
y::AbstractVector
: A vector of labels.
Returns
Float64
: The entropy of the vector.
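The computation can be sketched as follows; this is a minimal illustration of the standard entropy formula (the base-2 logarithm is an assumption, since the docstring does not specify a base), not the package's actual source:
```julia
using StatsBase: countmap

# Entropy of a label vector: H(y) = -sum(p_i * log2(p_i)) over the
# class frequencies p_i. Base-2 logarithm is assumed here.
function entropy_sketch(y::AbstractVector)
    n = length(y)
    return -sum(c / n * log2(c / n) for c in values(countmap(y)))
end

entropy_sketch(["a", "a", "b", "b"])  # 1.0: a maximally mixed two-class node
```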
DecisionTreeAndRandomForest.calculate_gini
— Method
calculate_gini(y)
This function calculates the Gini impurity of a set of labels, which measures the homogeneity of the labels within a node. A lower Gini impurity indicates a more homogeneous set of labels.
Arguments
y::AbstractVector
: A vector of labels.
Returns
Float64
: The Gini impurity of the labels.
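As a minimal sketch (not the package's source), the quantity computed is one minus the sum of squared class frequencies:
```julia
using StatsBase: countmap

# Gini impurity: G(y) = 1 - sum(p_i^2) over the class frequencies p_i.
function gini_sketch(y::AbstractVector)
    n = length(y)
    return 1.0 - sum((c / n)^2 for c in values(countmap(y)))
end

gini_sketch(["a", "a", "b", "b"])  # 0.5: maximally mixed two-class node
gini_sketch(["a", "a", "a", "a"])  # 0.0: perfectly homogeneous node
```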
DecisionTreeAndRandomForest.calculate_variance
— Method
calculate_variance(y)
Calculates the sample variance of a given set of labels, using the standard formula for sample variance.
Arguments
y::AbstractVector
: A vector of numerical labels for which the variance is to be computed.
Returns
Float64
: The sample variance of the input label vector y.
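For reference, the standard sample variance divides the sum of squared deviations from the mean by n - 1; a minimal sketch (assuming n > 1):
```julia
# Sample variance: sum((y_i - mean)^2) / (n - 1).
function variance_sketch(y::AbstractVector)
    n = length(y)
    μ = sum(y) / n
    return sum((v - μ)^2 for v in y) / (n - 1)
end

variance_sketch([1.0, 2.0, 3.0, 4.0])  # ≈ 1.6667, matching Statistics.var
```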
DecisionTreeAndRandomForest.fit!
— Method
fit!(tree, data, labels)
This function builds the tree structure of the DecisionTree by calling the build_tree function.
Arguments
tree::DecisionTree
: The DecisionTree to fit.
data::AbstractMatrix
: The training data.
labels::AbstractVector
: The labels for the training data.
Returns
Nothing
: This function modifies the tree in-place.
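A hypothetical end-to-end sketch of training a tree: the DecisionTree constructor and its argument order are assumed from the documented fields, and the exported names (DecisionTree, fit!, split_gini, predict) are assumptions not confirmed by this page.
```julia
using DecisionTreeAndRandomForest

X = rand(100, 4)              # 100 samples, 4 features
y = rand(["yes", "no"], 100)  # toy classification labels

# Assumed convenience constructor:
# DecisionTree(max_depth, min_samples_split, num_features, split_criterion)
tree = DecisionTree(5, 2, -1, split_gini)
fit!(tree, X, y)
ŷ = predict(tree, X)
```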
DecisionTreeAndRandomForest.fit!
— Method
fit!(forest, data, labels)
This function trains each individual tree in the RandomForest by calling the fit! function on each DecisionTree within the forest.trees vector. The num_features parameter from the RandomForest object is used to control the number of features considered for each split during training.
Arguments
forest::RandomForest
: The RandomForest to be trained.
data::AbstractMatrix
: The training data.
labels::AbstractVector
: The labels for the training data.
Returns
Nothing
: This function modifies the forest in-place.
DecisionTreeAndRandomForest.get_split_criterions
— Function
get_split_criterions()
get_split_criterions(task)
Retrieves the implemented split criteria that can be used.
Arguments
task::String
: A String that indicates for which task the splitting criteria should be retrieved. Can be "classification" or "regression". If omitted, all available criteria are returned.
Returns
Tuple{Function}
: A tuple containing the implemented functions.
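A short usage sketch, assuming the function is accessible in the current scope:
```julia
# All available criteria, then only those suited to regression:
all_criteria = get_split_criterions()
reg_criteria = get_split_criterions("regression")
```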
DecisionTreeAndRandomForest.gini_impurity
— Function
gini_impurity(X, y)
gini_impurity(X, y, num_features_to_use)
Finds the best split point for a decision tree node using Gini impurity.
Arguments
X::AbstractMatrix
: A matrix of features, where each row is a data point and each column is a feature.
y::AbstractVector
: A vector of labels corresponding to the data points.
num_features_to_use::Int
: The number of features to consider when looking for the best split. If -1, all features are considered.
Returns
Tuple{Int, T}
: A tuple containing the index of the best feature and the best split value.
DecisionTreeAndRandomForest.information_gain
— Function
information_gain(X, y)
information_gain(X, y, num_features_to_use)
Finds the best split point for a decision tree node using information gain.
Arguments
X::AbstractMatrix
: A matrix of features.
y::AbstractVector
: A vector of labels.
num_features_to_use::Int=-1
: The number of features to consider for each split. If -1, all features are used.
Returns
best_feature::Int
: The index of the best feature to split on.
best_threshold::Real
: The threshold value for the best split.
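Conceptually, the quantity scored for a candidate split can be expressed with the helpers documented on this page (assuming they are importable from the package):
```julia
using DecisionTreeAndRandomForest: calculate_entropy, weighted_entropy

# Information gain of a candidate split: parent entropy minus the
# size-weighted entropy of the two children.
split_gain(y, y_left, y_right) =
    calculate_entropy(y) - weighted_entropy(y_left, y_right)
```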
DecisionTreeAndRandomForest.predict
— Method
predict(tree, data)
This function traverses the tree structure of the DecisionTree for each datapoint in data. It follows the decision rules based on the split criteria and feature values. In case of a classification problem, the prediction is the most frequent label (mode) among the labels in the leaf node. In case of a regression problem, the prediction is the average of those values (mean).
Arguments
tree::DecisionTree
: The trained DecisionTree.
data::AbstractMatrix
: The datapoints to predict.
Returns
Vector
: A vector of predictions for each datapoint in data.
DecisionTreeAndRandomForest.predict
— Method
predict(forest, data)
This function predicts the labels for each datapoint in the input dataset using the trained RandomForest. It uses each individual tree in the forest and combines the predictions using either the most frequent label (classification task) or the average of the predictions (regression task).
Arguments
forest::RandomForest
: The trained RandomForest.
data::AbstractMatrix
: The dataset for which to make predictions.
Returns
AbstractVector
: A vector of predictions for each datapoint in data.
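A hypothetical end-to-end sketch: the RandomForest constructor and its argument order are assumed from the documented fields and are not confirmed by this page.
```julia
using DecisionTreeAndRandomForest

X = rand(100, 4)
y = rand(["yes", "no"], 100)

# Assumed convenience constructor: RandomForest(max_depth, min_samples_split,
# split_criterion, number_of_trees, subsample_percentage, num_features)
forest = RandomForest(5, 2, split_gini, 10, 0.8, -1)
fit!(forest, X, y)
ŷ = predict(forest, X)
```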
DecisionTreeAndRandomForest.predict_single
— Method
predict_single(tree, sample)
This function traverses the tree structure of the DecisionTree for the datapoint in sample. It follows the decision rules based on the split criteria and feature values. If the leaf node contains numerical values, it is treated as a regression problem, and the prediction is the average of those values. If the leaf node contains categorical labels, it is treated as a classification problem, and the prediction is the most frequent label (mode) among the labels in the leaf node.
Arguments
tree::DecisionTree
: The trained DecisionTree.
sample::AbstractVector
: The sample to predict.
Returns
Any
: The prediction for the given sample.
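The leaf aggregation rule described above can be sketched as follows (an illustration of the numeric-versus-categorical dispatch, not the package's source):
```julia
using Statistics: mean
using StatsBase: mode

# Numeric leaf values -> regression (mean); anything else -> classification (mode).
aggregate_leaf(values::AbstractVector) =
    eltype(values) <: Number ? mean(values) : mode(values)

aggregate_leaf([1.0, 2.0, 3.0])  # 2.0 (regression)
aggregate_leaf(["a", "b", "b"])  # "b" (classification)
```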
DecisionTreeAndRandomForest.split_gini
— Function
split_gini(X, y)
split_gini(X, y, num_features)
This function is a wrapper for find_best_split to be used as the split criterion in the build_tree function.
Arguments
X::AbstractMatrix
: A matrix of features, where each row is a data point and each column is a feature.
y::AbstractVector
: A vector of labels corresponding to the data points.
num_features::Int
: The number of features to consider when looking for the best split. If -1, all features are considered.
Returns
Tuple{Int, T}
: A tuple containing the index of the best feature and the best split value.
DecisionTreeAndRandomForest.split_ig
— Function
split_ig(data, labels)
split_ig(data, labels, num_features)
This function is a wrapper for best_split to be used as the split criterion in the build_tree function.
Arguments
data::AbstractMatrix
: A matrix of features, where each row is a data point and each column is a feature.
labels::AbstractVector
: A vector of labels corresponding to the data points.
num_features::Int
: The number of features to consider for each split.
Returns
Tuple{Int, Real}
: A tuple containing the index of the best feature and the best split value.
DecisionTreeAndRandomForest.split_node
— Method
split_node(X, y, index, value)
This function splits the labels into two subsets based on the provided feature and value. It handles both numerical and categorical features.
Arguments
X::AbstractMatrix{T}
: A matrix of features, where each row is a data point and each column is a feature.
y::AbstractVector
: A vector of labels corresponding to the data points.
index::Int
: The index of the feature to split on.
value::T
: The value to split the feature on.
Returns
Tuple{AbstractVector, AbstractVector}
: A tuple containing the left and right sets of labels.
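A minimal sketch of the numeric-versus-categorical rule described above; the exact comparison operators used by the package are an assumption:
```julia
# Numerical feature: left subset takes rows with value <= threshold.
# Categorical feature: left subset takes rows equal to the split value.
function split_masks(X::AbstractMatrix, index::Int, value)
    col = @view X[:, index]
    left = value isa Number ? (col .<= value) : (col .== value)
    return left, .!left
end

X = [1.0 10.0; 2.0 20.0; 3.0 30.0]
left, right = split_masks(X, 1, 2.0)  # left = [true, true, false]
```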
DecisionTreeAndRandomForest.split_variance
— Function
split_variance(X, y)
split_variance(X, y, num_features)
This function is a wrapper for find_best_split_vr to be used as the split criterion in the build_tree function.
Arguments
X::AbstractMatrix
: A matrix of features, where each row is a data point and each column is a feature.
y::AbstractVector
: A vector of labels corresponding to the data points.
num_features::Int
: The number of features to consider for each split.
Returns
Tuple{Int, T}
: A tuple containing the index of the best feature and the best split value.
DecisionTreeAndRandomForest.variance_reduction
— Function
variance_reduction(X, y)
variance_reduction(X, y, num_features_to_use)
Finds the best split point for a decision tree node using variance reduction.
Arguments
X::AbstractMatrix
: A matrix of features, where each row is a data point and each column is a feature.
y::AbstractVector
: A vector of labels corresponding to the data points.
num_features_to_use::Int=-1
: The number of features to consider for each split. If -1, all features are used.
Returns
Tuple{Int, Any}
: A tuple containing the index of the best feature and the best split value.
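Analogous to information gain, the quantity maximized can be expressed with the documented helpers (assuming they are importable from the package):
```julia
using DecisionTreeAndRandomForest: calculate_variance, weighted_variance

# Variance reduction of a candidate split: parent variance minus the
# size-weighted variance of the two children.
variance_gain(y, y_left, y_right) =
    calculate_variance(y) - weighted_variance(y_left, y_right)
```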
DecisionTreeAndRandomForest.weighted_entropy
— Method
weighted_entropy(y_left, y_right)
Calculates the entropy of the left and right subsets and returns the weighted sum of the two entropies.
Arguments
y_left::AbstractVector{T}
: The labels vector for the left split.
y_right::AbstractVector{T}
: The labels vector for the right split.
Returns
Float64
: The weighted entropy of the split.
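The weighting is by subset size; a minimal sketch of the scheme, which applies equally to weighted_gini and weighted_variance (an illustration, not the package's source):
```julia
# Size-weighted combination of any per-subset impurity measure:
# (n_left / n) * impurity(y_left) + (n_right / n) * impurity(y_right)
function weighted_impurity(impurity::Function, y_left::AbstractVector, y_right::AbstractVector)
    n = length(y_left) + length(y_right)
    return (length(y_left) / n) * impurity(y_left) +
           (length(y_right) / n) * impurity(y_right)
end

# Example with a trivial impurity (fraction of "b" labels):
weighted_impurity(y -> count(==("b"), y) / length(y), ["a", "a"], ["b", "b"])  # 0.5
```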
DecisionTreeAndRandomForest.weighted_gini
— Method
weighted_gini(y_left, y_right)
Calculates the Gini impurity of the left and right subsets and returns the weighted sum of the two impurities.
Arguments
y_left::AbstractVector{T}
: A vector of labels for the left subset of the data.
y_right::AbstractVector{T}
: A vector of labels for the right subset of the data.
Returns
Float64
: The weighted Gini impurity of the split.
DecisionTreeAndRandomForest.weighted_variance
— Method
weighted_variance(y_left, y_right)
Calculates the variance of the left and right subsets and returns the weighted sum of the two variances.
Arguments
y_left::AbstractVector{T}
: A vector of labels for the left subset of the data.
y_right::AbstractVector{T}
: A vector of labels for the right subset of the data.
Returns
Float64
: The weighted variance of the split.