DecisionTreeAndRandomForest
Documentation for DecisionTreeAndRandomForest.
DecisionTreeAndRandomForest.DecisionTree
DecisionTreeAndRandomForest.Leaf
DecisionTreeAndRandomForest.Node
DecisionTreeAndRandomForest.RandomForest
Base.show
Base.show
DecisionTreeAndRandomForest.build_tree
DecisionTreeAndRandomForest.calculate_entropy
DecisionTreeAndRandomForest.calculate_gini
DecisionTreeAndRandomForest.calculate_variance
DecisionTreeAndRandomForest.fit!
DecisionTreeAndRandomForest.fit!
DecisionTreeAndRandomForest.get_split_criterions
DecisionTreeAndRandomForest.gini_impurity
DecisionTreeAndRandomForest.information_gain
DecisionTreeAndRandomForest.predict
DecisionTreeAndRandomForest.predict
DecisionTreeAndRandomForest.predict_single
DecisionTreeAndRandomForest.split_gini
DecisionTreeAndRandomForest.split_ig
DecisionTreeAndRandomForest.split_node
DecisionTreeAndRandomForest.split_variance
DecisionTreeAndRandomForest.variance_reduction
DecisionTreeAndRandomForest.weighted_entropy
DecisionTreeAndRandomForest.weighted_gini
DecisionTreeAndRandomForest.weighted_variance
DecisionTreeAndRandomForest.DecisionTree
— Type
Represents a DecisionTree.
Fields
max_depth::Int64
: Controls the maximum depth of the tree. If -1, the DecisionTree is of unlimited depth.
min_samples_split::Int64
: Controls the minimum number of samples required to split a node.
num_features::Int64
: Controls the number of features to consider for each split. If -1, all features are used.
split_criterion::Function
: Contains the split criterion function.
root::Union{Missing, DecisionTreeAndRandomForest.Node, DecisionTreeAndRandomForest.Leaf}
: Contains the root node of the DecisionTree.
DecisionTreeAndRandomForest.Leaf
— Type
Represents a Leaf in the DecisionTree structure.
Fields
values::AbstractVector
: Contains the label values of the training samples that reached this leaf.
DecisionTreeAndRandomForest.Node
— Type
Represents a Node in the DecisionTree structure.
Fields
left::Union{DecisionTreeAndRandomForest.Node, DecisionTreeAndRandomForest.Leaf}
: Points to the left child.
right::Union{DecisionTreeAndRandomForest.Node, DecisionTreeAndRandomForest.Leaf}
: Points to the right child.
feature_index::Int64
: Stores the index of the selected feature.
split_value::Any
: Stores the value on which the data is split.
DecisionTreeAndRandomForest.RandomForest
— Type
Represents a RandomForest.
Fields
trees::Vector{DecisionTree}
: Contains the vector of DecisionTree structures.
max_depth::Int64
: Contains the maximum depth of each tree. If -1, the trees are of unlimited depth.
min_samples_split::Int64
: Contains the minimum number of samples required to split a node.
split_criterion::Function
: Contains the split criterion function.
number_of_trees::Int64
: Contains the number of trees in the RandomForest structure.
subsample_percentage::Float64
: Contains the percentage of the dataset to use for training each tree.
num_features::Int64
: Contains the number of features to use when finding the best split. If -1, all features are used.
Base.show
— Method
show(io, tree)
This function recursively prints the structure of the DecisionTree, providing information about each node and leaf. It's primarily used for debugging and visualizing the tree's structure.
Arguments
io::IO
: The IO context to print the tree structure to.
tree::DecisionTree
: The DecisionTree to print.
Returns
Nothing
: This function prints the structure of the DecisionTree.
Base.show
— Method
show(io, forest)
This function recursively prints the structure of the RandomForest. It's primarily used for debugging and visualizing the forest structure.
Arguments
io::IO
: The IO context to print the forest structure to.
forest::RandomForest
: The RandomForest to be printed.
Returns
Nothing
: This function prints the structure of the RandomForest.
DecisionTreeAndRandomForest.build_tree
— Function
build_tree(
data,
labels,
max_depth,
min_samples_split,
num_features,
split_criterion
)
build_tree(
data,
labels,
max_depth,
min_samples_split,
num_features,
split_criterion,
depth
)
This function recursively builds a DecisionTree by iteratively splitting the data based on the provided split_criterion. The process continues until the maximum depth is reached, the number of samples in a node falls below min_samples_split, or all labels in a node are the same.
Arguments
data::AbstractMatrix
: The training data.
labels::AbstractVector
: The labels for the training data.
max_depth::Int
: The maximum depth of the tree.
min_samples_split::Int
: The minimum number of samples required to split a node.
num_features::Int
: The number of features to consider for each split. If -1, all features are used.
split_criterion::Function
: The function used to determine the best split at each node.
depth::Int=0
: The current depth of the tree (used recursively).
Returns
Union{Node, Leaf}
: A Node if the data was split further, or a Leaf if a stopping condition was met at that point of the recursion.
DecisionTreeAndRandomForest.calculate_entropy
— Method
calculate_entropy(y)
Calculates the entropy of a vector of labels y.
Arguments
y::AbstractVector
: A vector of labels.
Returns
Float64
: The entropy of the vector.
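The computation can be sketched as follows; this is a minimal illustration of the standard entropy formula (the base-2 logarithm is an assumption, since the docstring does not specify a base), not the package's actual source:
```julia
using StatsBase: countmap

# Entropy of a label vector: H(y) = -sum(p_i * log2(p_i)) over the
# class frequencies p_i. Base-2 logarithm is assumed here.
function entropy_sketch(y::AbstractVector)
    n = length(y)
    return -sum(c / n * log2(c / n) for c in values(countmap(y)))
end

entropy_sketch(["a", "a", "b", "b"])  # 1.0: a maximally mixed two-class node
```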
DecisionTreeAndRandomForest.calculate_gini
— Method
calculate_gini(y)
This function calculates the Gini impurity of a set of labels, which measures the homogeneity of the labels within a node. A lower Gini impurity indicates a more homogeneous set of labels.
Arguments
y::AbstractVector
: A vector of labels.
Returns
Float64
: The Gini impurity of the labels.
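As a minimal sketch (not the package's source), the quantity computed is one minus the sum of squared class frequencies:
```julia
using StatsBase: countmap

# Gini impurity: G(y) = 1 - sum(p_i^2) over the class frequencies p_i.
function gini_sketch(y::AbstractVector)
    n = length(y)
    return 1.0 - sum((c / n)^2 for c in values(countmap(y)))
end

gini_sketch(["a", "a", "b", "b"])  # 0.5: maximally mixed two-class node
gini_sketch(["a", "a", "a", "a"])  # 0.0: perfectly homogeneous node
```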
DecisionTreeAndRandomForest.calculate_variance
— Method
calculate_variance(y)
Calculates the sample variance of a given set of labels, using the standard formula for sample variance.
Arguments
y::AbstractVector
: A vector of numerical labels for which the variance is to be computed.
Returns
Float64
: The sample variance of the input label vector y.
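For reference, the standard sample variance divides the sum of squared deviations from the mean by n - 1; a minimal sketch (assuming n > 1):
```julia
# Sample variance: sum((y_i - mean)^2) / (n - 1).
function variance_sketch(y::AbstractVector)
    n = length(y)
    μ = sum(y) / n
    return sum((v - μ)^2 for v in y) / (n - 1)
end

variance_sketch([1.0, 2.0, 3.0, 4.0])  # ≈ 1.6667, matching Statistics.var
```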
DecisionTreeAndRandomForest.fit!
— Method
fit!(tree, data, labels)
This function builds the tree structure of the DecisionTree by calling the build_tree function.
Arguments
tree::DecisionTree
: The DecisionTree to fit.
data::AbstractMatrix
: The training data.
labels::AbstractVector
: The labels for the training data.
Returns
Nothing
: This function modifies the tree in-place.
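A hypothetical end-to-end sketch of training a tree: the DecisionTree constructor and its argument order are assumed from the documented fields, and the exported names (DecisionTree, fit!, split_gini, predict) are assumptions not confirmed by this page.
```julia
using DecisionTreeAndRandomForest

X = rand(100, 4)              # 100 samples, 4 features
y = rand(["yes", "no"], 100)  # toy classification labels

# Assumed convenience constructor:
# DecisionTree(max_depth, min_samples_split, num_features, split_criterion)
tree = DecisionTree(5, 2, -1, split_gini)
fit!(tree, X, y)
ŷ = predict(tree, X)
```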
DecisionTreeAndRandomForest.fit!
— Method
fit!(forest, data, labels)
This function trains each individual tree in the RandomForest by calling the fit! function on each DecisionTree within the forest.trees vector. The num_features parameter from the RandomForest object is used to control the number of features considered for each split during training.
Arguments
forest::RandomForest
: The RandomForest to be trained.
data::AbstractMatrix
: The training data.
labels::AbstractVector
: The labels for the training data.
Returns
Nothing
: This function modifies the forest in-place.
DecisionTreeAndRandomForest.get_split_criterions
— Function
get_split_criterions()
get_split_criterions(task)
Retrieves the implemented split criteria that can be used.
Arguments
task::String
: A String that indicates for which task the splitting criteria should be retrieved. Can be "classification" or "regression". If omitted, all available criteria are returned.
Returns
Tuple{Function}
: A tuple containing the implemented functions.
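A short usage sketch, assuming the function is accessible in the current scope:
```julia
# All available criteria, then only those suited to regression:
all_criteria = get_split_criterions()
reg_criteria = get_split_criterions("regression")
```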
DecisionTreeAndRandomForest.gini_impurity
— Function
gini_impurity(X, y)
gini_impurity(X, y, num_features_to_use)
Finds the best split point for a decision tree node using Gini impurity.
Arguments
X::AbstractMatrix
: A matrix of features, where each row is a data point and each column is a feature.
y::AbstractVector
: A vector of labels corresponding to the data points.
num_features_to_use::Int
: The number of features to consider when looking for the best split. If -1, all features are considered.
Returns
Tuple{Int, T}
: A tuple containing the index of the best feature and the best split value.
DecisionTreeAndRandomForest.information_gain
— Function
information_gain(X, y)
information_gain(X, y, num_features_to_use)
Finds the best split point for a decision tree node using information gain.
Arguments
X::AbstractMatrix
: A matrix of features.
y::AbstractVector
: A vector of labels.
num_features_to_use::Int=-1
: The number of features to consider for each split. If -1, all features are used.
Returns
best_feature::Int
: The index of the best feature to split on.
best_threshold::Real
: The threshold value for the best split.
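Conceptually, the quantity scored for a candidate split can be expressed with the helpers documented on this page (assuming they are importable from the package):
```julia
using DecisionTreeAndRandomForest: calculate_entropy, weighted_entropy

# Information gain of a candidate split: parent entropy minus the
# size-weighted entropy of the two children.
split_gain(y, y_left, y_right) =
    calculate_entropy(y) - weighted_entropy(y_left, y_right)
```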
DecisionTreeAndRandomForest.predict
— Method
predict(tree, data)
This function traverses the tree structure of the DecisionTree for each datapoint in data. It follows the decision rules based on the split criteria and feature values. In case of a classification problem, the prediction is the most frequent label (mode) among the labels in the leaf node. In case of a regression problem, the prediction is the average of those values (mean).
Arguments
tree::DecisionTree
: The trained DecisionTree.
data::AbstractMatrix
: The datapoints to predict.
Returns
Vector
: A vector of predictions for each datapoint in data.
DecisionTreeAndRandomForest.predict
— Method
predict(forest, data)
This function predicts the labels for each datapoint in the input dataset using the trained RandomForest. It uses each individual tree in the forest and combines the predictions using either the most frequent label (classification task) or the average of the predictions (regression task).
Arguments
forest::RandomForest
: The trained RandomForest.
data::AbstractMatrix
: The dataset for which to make predictions.
Returns
AbstractVector
: A vector of predictions for each datapoint in data.
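A hypothetical end-to-end sketch: the RandomForest constructor and its argument order are assumed from the documented fields and are not confirmed by this page.
```julia
using DecisionTreeAndRandomForest

X = rand(100, 4)
y = rand(["yes", "no"], 100)

# Assumed convenience constructor: RandomForest(max_depth, min_samples_split,
# split_criterion, number_of_trees, subsample_percentage, num_features)
forest = RandomForest(5, 2, split_gini, 10, 0.8, -1)
fit!(forest, X, y)
ŷ = predict(forest, X)
```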
DecisionTreeAndRandomForest.predict_single
— Method
predict_single(tree, sample)
This function traverses the tree structure of the DecisionTree for the datapoint in sample. It follows the decision rules based on the split criteria and feature values. If the leaf node contains numerical values, it is treated as a regression problem, and the prediction is the average of those values. If the leaf node contains categorical labels, it is treated as a classification problem, and the prediction is the most frequent label (mode) among the labels in the leaf node.
Arguments
tree::DecisionTree
: The trained DecisionTree.
sample::AbstractVector
: The sample to predict.
Returns
Any
: The prediction for the given sample.
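The leaf aggregation rule described above can be sketched as follows (an illustration of the numeric-versus-categorical dispatch, not the package's source):
```julia
using Statistics: mean
using StatsBase: mode

# Numeric leaf values -> regression (mean); anything else -> classification (mode).
aggregate_leaf(values::AbstractVector) =
    eltype(values) <: Number ? mean(values) : mode(values)

aggregate_leaf([1.0, 2.0, 3.0])  # 2.0 (regression)
aggregate_leaf(["a", "b", "b"])  # "b" (classification)
```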
DecisionTreeAndRandomForest.split_gini
— Function
split_gini(X, y)
split_gini(X, y, num_features)
This function is a wrapper for find_best_split to be used as the split criterion in the build_tree function.
Arguments
X::AbstractMatrix
: A matrix of features, where each row is a data point and each column is a feature.
y::AbstractVector
: A vector of labels corresponding to the data points.
num_features::Int
: The number of features to consider when looking for the best split. If -1, all features are considered.
Returns
Tuple{Int, T}
: A tuple containing the index of the best feature and the best split value.
DecisionTreeAndRandomForest.split_ig
— Function
split_ig(data, labels)
split_ig(data, labels, num_features)
This function is a wrapper for best_split to be used as the split criterion in the build_tree function.
Arguments
data::AbstractMatrix
: A matrix of features, where each row is a data point and each column is a feature.
labels::AbstractVector
: A vector of labels corresponding to the data points.
num_features::Int
: The number of features to consider for each split.
Returns
Tuple{Int, Real}
: A tuple containing the index of the best feature and the best split value.
DecisionTreeAndRandomForest.split_node
— Method
split_node(X, y, index, value)
This function splits the labels into two subsets based on the provided feature and value. It handles both numerical and categorical features.
Arguments
X::AbstractMatrix{T}
: A matrix of features, where each row is a data point and each column is a feature.
y::AbstractVector
: A vector of labels corresponding to the data points.
index::Int
: The index of the feature to split on.
value::T
: The value to split the feature on.
Returns
Tuple{AbstractVector, AbstractVector}
: A tuple containing the left and right sets of labels.
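A minimal sketch of the numeric-versus-categorical rule described above; the exact comparison operators used by the package are an assumption:
```julia
# Numerical feature: left subset takes rows with value <= threshold.
# Categorical feature: left subset takes rows equal to the split value.
function split_masks(X::AbstractMatrix, index::Int, value)
    col = @view X[:, index]
    left = value isa Number ? (col .<= value) : (col .== value)
    return left, .!left
end

X = [1.0 10.0; 2.0 20.0; 3.0 30.0]
left, right = split_masks(X, 1, 2.0)  # left = [true, true, false]
```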
DecisionTreeAndRandomForest.split_variance
— Function
split_variance(X, y)
split_variance(X, y, num_features)
This function is a wrapper for find_best_split_vr to be used as the split criterion in the build_tree function.
Arguments
X::AbstractMatrix
: A matrix of features, where each row is a data point and each column is a feature.
y::AbstractVector
: A vector of labels corresponding to the data points.
num_features::Int
: The number of features to consider for each split.
Returns
Tuple{Int, T}
: A tuple containing the index of the best feature and the best split value.
DecisionTreeAndRandomForest.variance_reduction
— Function
variance_reduction(X, y)
variance_reduction(X, y, num_features_to_use)
Finds the best split point for a decision tree node using variance reduction.
Arguments
X::AbstractMatrix
: A matrix of features, where each row is a data point and each column is a feature.
y::AbstractVector
: A vector of labels corresponding to the data points.
num_features_to_use::Int=-1
: The number of features to consider for each split. If -1, all features are used.
Returns
Tuple{Int, Any}
: A tuple containing the index of the best feature and the best split value.
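Analogous to information gain, the quantity maximized can be expressed with the documented helpers (assuming they are importable from the package):
```julia
using DecisionTreeAndRandomForest: calculate_variance, weighted_variance

# Variance reduction of a candidate split: parent variance minus the
# size-weighted variance of the two children.
variance_gain(y, y_left, y_right) =
    calculate_variance(y) - weighted_variance(y_left, y_right)
```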
DecisionTreeAndRandomForest.weighted_entropy
— Method
weighted_entropy(y_left, y_right)
Calculates the entropy of the left and right subsets and returns the weighted sum of the two entropies.
Arguments
y_left::AbstractVector{T}
: The labels vector for the left split.
y_right::AbstractVector{T}
: The labels vector for the right split.
Returns
Float64
: The weighted entropy of the split.
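The weighting is by subset size; a minimal sketch of the scheme, which applies equally to weighted_gini and weighted_variance (an illustration, not the package's source):
```julia
# Size-weighted combination of any per-subset impurity measure:
# (n_left / n) * impurity(y_left) + (n_right / n) * impurity(y_right)
function weighted_impurity(impurity::Function, y_left::AbstractVector, y_right::AbstractVector)
    n = length(y_left) + length(y_right)
    return (length(y_left) / n) * impurity(y_left) +
           (length(y_right) / n) * impurity(y_right)
end

# Example with a trivial impurity (fraction of "b" labels):
weighted_impurity(y -> count(==("b"), y) / length(y), ["a", "a"], ["b", "b"])  # 0.5
```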
DecisionTreeAndRandomForest.weighted_gini
— Method
weighted_gini(y_left, y_right)
Calculates the Gini impurity of the left and right subsets and returns the weighted sum of the two impurities.
Arguments
y_left::AbstractVector{T}
: A vector of labels for the left subset of the data.
y_right::AbstractVector{T}
: A vector of labels for the right subset of the data.
Returns
Float64
: The weighted Gini impurity of the split.
DecisionTreeAndRandomForest.weighted_variance
— Method
weighted_variance(y_left, y_right)
Calculates the variance of the left and right subsets and returns the weighted sum of the two variances.
Arguments
y_left::AbstractVector{T}
: A vector of labels for the left subset of the data.
y_right::AbstractVector{T}
: A vector of labels for the right subset of the data.
Returns
Float64
: The weighted variance of the split.