DecisionTreeAndRandomForest
Documentation for DecisionTreeAndRandomForest.
DecisionTreeAndRandomForest.DecisionTree — Type

Represents a DecisionTree.
Fields
max_depth::Int64: Controls the maximum depth of the tree. If -1, the DecisionTree is of unlimited depth.
min_samples_split::Int64: Controls the minimum number of samples required to split a node.
num_features::Int64: Controls the number of features to consider for each split. If -1, all features are used.
split_criterion::Function: Contains the split criterion function.
root::Union{Missing, DecisionTreeAndRandomForest.Node, DecisionTreeAndRandomForest.Leaf}: Contains the root node of the DecisionTree.
DecisionTreeAndRandomForest.Leaf — Type

Represents a Leaf in the DecisionTree structure.
Fields
values::AbstractVector: Contains the labels of the training samples that ended up in this leaf.
DecisionTreeAndRandomForest.Node — Type

Represents a Node in the DecisionTree structure.
Fields
left::Union{DecisionTreeAndRandomForest.Node, DecisionTreeAndRandomForest.Leaf}: Points to the left child.
right::Union{DecisionTreeAndRandomForest.Node, DecisionTreeAndRandomForest.Leaf}: Points to the right child.
feature_index::Int64: Stores the index of the selected feature.
split_value::Any: Stores the value at which the data is split.
DecisionTreeAndRandomForest.RandomForest — Type

Represents a RandomForest.
Fields
trees::Vector{DecisionTree}: Contains the vector of DecisionTree structures.
max_depth::Int64: Contains the maximum depth of the tree. If -1, the DecisionTree is of unlimited depth.
min_samples_split::Int64: Contains the minimum number of samples required to split a node.
split_criterion::Function: Contains the split criterion function.
number_of_trees::Int64: Contains the number of trees in the RandomForest structure.
subsample_percentage::Float64: Contains the percentage of the dataset to use for training each tree.
num_features::Int64: Contains the number of features to use when finding the best split. If -1, all the features are used.
Base.show — Method

show(io, tree)
This function recursively prints the structure of the DecisionTree, providing information about each node and leaf. It's primarily used for debugging and visualizing the tree's structure.
Arguments
io::IO: The IO context to print the tree structure.
tree::DecisionTree: The DecisionTree to print.
Returns
Nothing: This function prints the structure of the DecisionTree.
Base.show — Method

show(io, forest)
This function recursively prints the structure of the RandomForest. It's primarily used for debugging and visualizing the forest structure.
Arguments
io::IO: The IO context to print the Forest structure.
forest::RandomForest: The RandomForest to be printed.
Returns
Nothing: This function prints the structure of the RandomForest.
DecisionTreeAndRandomForest.build_tree — Function

build_tree(
data,
labels,
max_depth,
min_samples_split,
num_features,
split_criterion
)
build_tree(
data,
labels,
max_depth,
min_samples_split,
num_features,
split_criterion,
depth
)
This function recursively builds a DecisionTree by splitting the data at each node according to the provided split_criterion. The process continues until the maximum depth is reached, the number of samples in a node falls below min_samples_split, or all labels in a node are the same.
Arguments
data::AbstractMatrix: The training data.
labels::AbstractVector: The labels for the training data.
max_depth::Int: The maximum depth of the tree.
min_samples_split::Int: The minimum number of samples required to split a node.
num_features::Int: The number of features to consider for each split. If -1, all features are used.
split_criterion::Function: The function used to determine the best split at each node.
depth::Int=0: The current depth of the tree (used recursively).
Returns
Union{Node, Leaf}: A Leaf if one of the stopping conditions above is met, otherwise a Node whose subtrees are built recursively.
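To make the recursion concrete, here is a minimal sketch of the procedure described above. This is hypothetical code, not the package's implementation: the Leaf and Node constructors are assumed to take their documented fields in order, and a simple numeric less-than comparison is assumed for splitting.

function build_tree_sketch(data, labels, max_depth, min_samples_split,
                           num_features, split_criterion, depth=0)
    # Stopping conditions: depth limit reached, too few samples, or a pure node.
    if (max_depth != -1 && depth >= max_depth) ||
       size(data, 1) < min_samples_split ||
       length(unique(labels)) <= 1
        return Leaf(labels)
    end
    # Ask the split criterion for the best (feature, value) pair.
    feature, value = split_criterion(data, labels, num_features)
    mask = data[:, feature] .< value   # assumed numeric comparison
    left = build_tree_sketch(data[mask, :], labels[mask], max_depth,
                             min_samples_split, num_features, split_criterion, depth + 1)
    right = build_tree_sketch(data[.!mask, :], labels[.!mask], max_depth,
                              min_samples_split, num_features, split_criterion, depth + 1)
    return Node(left, right, feature, value)
end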
DecisionTreeAndRandomForest.calculate_entropy — Method

calculate_entropy(y)
Calculates the entropy of a vector of labels y.
Arguments
y::AbstractVector: A vector of labels.
Returns
Float64: The entropy of the vector.
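Entropy here is presumably the usual Shannon entropy over the label proportions, H(y) = -sum_i p_i * log2(p_i). A hypothetical sketch (not the package's code):

function entropy_sketch(y::AbstractVector)
    n = length(y)
    counts = Dict{eltype(y),Int}()
    for label in y
        counts[label] = get(counts, label, 0) + 1   # count occurrences of each label
    end
    return -sum(c / n * log2(c / n) for c in values(counts))
end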
DecisionTreeAndRandomForest.calculate_gini — Method

calculate_gini(y)
This function calculates the Gini impurity of a set of labels, which measures the homogeneity of the labels within a node. A lower Gini impurity indicates a more homogeneous set of labels.
Arguments
y::AbstractVector: A vector of labels.
Returns
Float64: The Gini impurity of the labels.
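The Gini impurity of a label set y is conventionally G(y) = 1 - sum_i p_i^2, where p_i is the proportion of label i. A hypothetical sketch:

function gini_sketch(y::AbstractVector)
    n = length(y)
    # One proportion per distinct label.
    props = (count(==(label), y) / n for label in unique(y))
    return 1.0 - sum(p^2 for p in props)
end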
DecisionTreeAndRandomForest.calculate_variance — Method

calculate_variance(y)
Calculate the sample variance of a given set of labels. It uses the standard formula for sample variance.
Arguments
y::AbstractVector: A vector of numerical labels for which the variance is to be computed.
Returns
Float64: The sample variance of the input label vector y.
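The standard sample variance is s^2 = sum_i (y_i - ybar)^2 / (n - 1). Julia's Statistics.var uses the same n - 1 denominator by default, so an equivalent sketch (assuming calculate_variance follows this convention) is simply:

using Statistics

variance_sketch(y::AbstractVector) = var(y)   # sum((y .- mean(y)).^2) / (length(y) - 1)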
DecisionTreeAndRandomForest.fit! — Method

fit!(tree, data, labels)
This function builds the tree structure of the DecisionTree by calling the build_tree function.
Arguments
tree::DecisionTree: The DecisionTree to fit.
data::AbstractMatrix: The training data.
labels::AbstractVector: The labels for the training data.
Returns
Nothing: This function modifies the tree in-place.
DecisionTreeAndRandomForest.fit! — Method

fit!(forest, data, labels)
This function trains each individual tree in the RandomForest by calling fit! on each DecisionTree within the forest.trees vector. The num_features parameter from the RandomForest object is used to control the number of features considered for each split during training.
Arguments
forest::RandomForest: The RandomForest to be trained.
data::AbstractMatrix: The training data.
labels::AbstractVector: The labels for the training data.
Returns
Nothing: This function modifies the forest in-place.
DecisionTreeAndRandomForest.get_split_criterions — Function

get_split_criterions()
get_split_criterions(task)
Retrieves the implemented split criteria that can be used.
Arguments
task::String: A String that indicates for which task the split criteria should be retrieved. Can be "classification" or "regression". Defaults to returning all available criteria.
Returns
Tuple{Function}: A tuple containing the implemented functions.
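Usage might look like the following; the exact contents of the returned tuples depend on which criteria the package exports (e.g. split_gini, split_ig, and split_variance documented below):

using DecisionTreeAndRandomForest

all_criteria = get_split_criterions()                    # every implemented criterion
clf_criteria = get_split_criterions("classification")    # criteria suited to classification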
DecisionTreeAndRandomForest.gini_impurity — Function

gini_impurity(X, y)
gini_impurity(X, y, num_features_to_use)
Finds the best split point for a decision tree node using gini impurity.
Arguments
X::AbstractMatrix: A matrix of features, where each row is a data point and each column is a feature.
y::AbstractVector: A vector of labels corresponding to the data points.
num_features_to_use::Int: The number of features to consider when looking for the best split. If -1, all features are considered.
Returns
Tuple{Int, T}: A tuple containing the index of the best feature and the best split value.
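A hedged sketch of what such a search might do: scan candidate features and split values, score each candidate split with weighted_gini (documented below), and keep the best one. The exhaustive scan and numeric less-than comparison are assumptions.

function best_split_sketch(X::AbstractMatrix, y::AbstractVector)
    best_feature, best_value, best_score = 0, nothing, Inf
    for j in 1:size(X, 2)                        # candidate features
        for v in unique(X[:, j])                 # candidate split values
            mask = X[:, j] .< v
            (any(mask) && !all(mask)) || continue   # skip splits that leave one side empty
            score = weighted_gini(y[mask], y[.!mask])
            if score < best_score
                best_feature, best_value, best_score = j, v, score
            end
        end
    end
    return best_feature, best_value
end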
DecisionTreeAndRandomForest.information_gain — Function

information_gain(X, y)
information_gain(X, y, num_features_to_use)
Finds the best split point for a decision tree node using information gain.
Arguments
X::AbstractMatrix: A matrix of features.
y::AbstractVector: A vector of labels.
num_features_to_use::Int=-1: The number of features to consider for each split. If -1, all features are used.
Returns
best_feature::Int: The index of the best feature to split on.
best_threshold::Real: The threshold value for the best split.
DecisionTreeAndRandomForest.predict — Method

predict(tree, data)
This function traverses the tree structure of the DecisionTree for each datapoint in data. It follows the decision rules based on the split criteria and feature values. In case of a classification problem, the prediction is the most frequent label (mode) among the labels in the leaf node. In case of a regression problem, the prediction is the average of those values (mean).
Arguments
tree::DecisionTree: The trained DecisionTree.
data::AbstractMatrix: The datapoints to predict.
Returns
Vector: A vector of predictions for each datapoint in data.
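A hedged end-to-end example. The DecisionTree constructor call below assumes the default field-order constructor implied by the documented fields (max_depth, min_samples_split, num_features, split_criterion, root); the package may instead expose keyword or convenience constructors.

using DecisionTreeAndRandomForest

X = [1.0 2.0; 1.5 1.8; 5.0 8.0; 6.0 9.0]   # 4 samples, 2 features
y = ["a", "a", "b", "b"]

# Unlimited depth, split nodes with at least 2 samples, consider all features.
tree = DecisionTree(-1, 2, -1, split_gini, missing)   # constructor form assumed
fit!(tree, X, y)
predict(tree, X)   # a Vector of predicted labels, here ideally ["a", "a", "b", "b"]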
DecisionTreeAndRandomForest.predict — Method

predict(forest, data)
This function predicts the label for each datapoint in the input dataset using the trained RandomForest. It queries each individual tree in the forest and combines the predictions, using either the most frequent label (classification task) or the average of the predictions (regression task).
Arguments
forest::RandomForest: The trained RandomForest.
data::AbstractMatrix: The dataset for which to make predictions.
Returns
AbstractVector: A vector of predictions for each datapoint in data.
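Continuing the example above for the forest, again assuming the default field-order constructor implied by the documented fields (trees, max_depth, min_samples_split, split_criterion, number_of_trees, subsample_percentage, num_features):

# 10 trees, each trained on an 80% subsample, considering all features.
forest = RandomForest(DecisionTree[], -1, 2, split_gini, 10, 0.8, -1)   # constructor form assumed
fit!(forest, X, y)
predict(forest, X)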
DecisionTreeAndRandomForest.predict_single — Method

predict_single(tree, sample)
This function traverses the tree structure of the DecisionTree for the datapoint in sample. It follows the decision rules based on the split criteria and feature values. If the leaf node contains numerical values, it is treated as a regression problem, and the prediction is the average of those values. If the leaf node contains categorical labels, it is treated as a classification problem, and the prediction is the most frequent label (mode) among the labels in the leaf node.
Arguments
tree::DecisionTree: The trained DecisionTree.
sample::AbstractVector: The sample to predict.
Returns
The predicted value or label for sample: the mean of the leaf's values for regression, or their mode for classification.
DecisionTreeAndRandomForest.split_gini — Function

split_gini(X, y)
split_gini(X, y, num_features)
This function is a wrapper for find_best_split to be used as the split criterion in the build_tree function.
Arguments
X::AbstractMatrix: A matrix of features, where each row is a data point and each column is a feature.
y::AbstractVector: A vector of labels corresponding to the data points.
num_features::Int: The number of features to consider when looking for the best split. If -1, all features are considered.
Returns
Tuple{Int, T}: A tuple containing the index of the best feature and the best split value.
DecisionTreeAndRandomForest.split_ig — Function

split_ig(data, labels)
split_ig(data, labels, num_features)
This function is a wrapper for best_split to be used as the split criterion in the build_tree function.
Arguments
data::AbstractMatrix: A matrix of features, where each row is a data point and each column is a feature.
labels::AbstractVector: A vector of labels corresponding to the data points.
num_features::Int: The number of features to consider for each split.
Returns
Tuple{Int, Real}: A tuple containing the index of the best feature and the best split value.
DecisionTreeAndRandomForest.split_node — Method

split_node(X, y, index, value)
This function splits the labels into two subsets based on the provided feature and value. It handles both numerical and categorical features.
Arguments
X::AbstractMatrix{T}: A matrix of features, where each row is a data point and each column is a feature.
y::AbstractVector: A vector of labels corresponding to the data points.
index::Int: The index of the feature to split on.
value::T: The value to split the feature on.
Returns
Tuple{AbstractVector, AbstractVector}: A tuple containing the left and right sets of labels.
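A hedged sketch of the numeric-versus-categorical handling described above; the exact comparison operators the package uses are assumptions:

function split_node_sketch(X::AbstractMatrix, y::AbstractVector, index::Int, value)
    feature = X[:, index]
    # Numeric features: threshold split; categorical features: equality split.
    mask = value isa Number ? feature .< value : feature .== value
    return y[mask], y[.!mask]   # left and right label subsets
end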
DecisionTreeAndRandomForest.split_variance — Function

split_variance(X, y)
split_variance(X, y, num_features)

This function is a wrapper for find_best_split_vr to be used as the split criterion in the build_tree function.
Arguments
X::AbstractMatrix: A matrix of features, where each row is a data point and each column is a feature.
y::AbstractVector: A vector of labels corresponding to the data points.
num_features::Int: The number of features to consider for each split.
Returns
Tuple{Int, T}: A tuple containing the index of the best feature and the best split value.
DecisionTreeAndRandomForest.variance_reduction — Function

variance_reduction(X, y)
variance_reduction(X, y, num_features_to_use)
Finds the best split point for a decision tree node using variance reduction.
Arguments
X::AbstractMatrix: A matrix of features, where each row is a data point and each column is a feature.
y::AbstractVector: A vector of labels corresponding to the data points.
num_features_to_use::Int=-1: The number of features to consider for each split. If -1, all features are used.
Returns
Tuple{Int, Any}: A tuple containing the index of the best feature and the best split value.
DecisionTreeAndRandomForest.weighted_entropy — Method

weighted_entropy(y_left, y_right)
Calculates the entropy of the left and right subsets and returns the weighted sum of the two entropies.
Arguments
y_left::AbstractVector{T}: The labels vector for the left split.
y_right::AbstractVector{T}: The labels vector for the right split.
Returns
Float64: The weighted entropy of the split.
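The weighting is presumably by subset size, the same pattern that weighted_gini and weighted_variance below follow. A generic hypothetical sketch:

function weighted_impurity_sketch(impurity::Function, y_left, y_right)
    n = length(y_left) + length(y_right)
    return length(y_left) / n * impurity(y_left) +
           length(y_right) / n * impurity(y_right)
end

# e.g. weighted_impurity_sketch(entropy_sketch, ["a", "a"], ["a", "b"])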
DecisionTreeAndRandomForest.weighted_gini — Method

weighted_gini(y_left, y_right)
Calculates the Gini impurity of the left and right subsets and returns the weighted sum of the two impurities.
Arguments
y_left::AbstractVector{T}: A vector of labels for the left subset of the data.
y_right::AbstractVector{T}: A vector of labels for the right subset of the data.
Returns
Float64: The weighted gini impurity of the split.
DecisionTreeAndRandomForest.weighted_variance — Method

weighted_variance(y_left, y_right)
Calculates the variance of the left and right subsets and returns the weighted sum of the two variances.
Arguments
y_left::AbstractVector{T}: A vector of labels for the left subset of the data.
y_right::AbstractVector{T}: A vector of labels for the right subset of the data.
Returns
Float64: The weighted variance of the split.