DecisionTreeAndRandomForest

Documentation for DecisionTreeAndRandomForest.

DecisionTreeAndRandomForest.DecisionTree (Type)

Represents a DecisionTree.

Fields

  • max_depth::Int64: Controls the maximum depth of the tree. If -1, the DecisionTree is of unlimited depth.

  • min_samples_split::Int64: Controls the minimum number of samples required to split a node.

  • num_features::Int64: Controls the number of features to consider for each split. If -1, all features are used.

  • split_criterion::Function: Contains the split criterion function.

  • root::Union{Missing, DecisionTreeAndRandomForest.Node, DecisionTreeAndRandomForest.Leaf}: Contains the root node of the DecisionTree.

source
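
A minimal construction sketch, assuming a default positional constructor in the field order above and an exported split_gini criterion (both assumptions, not confirmed API):

# Hypothetical: positional constructor in field order
# (max_depth, min_samples_split, num_features, split_criterion, root);
# `missing` marks an untrained tree.
tree = DecisionTree(5, 2, -1, split_gini, missing)
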
DecisionTreeAndRandomForest.Node (Type)

Represents a Node in the DecisionTree structure.

Fields

  • left::Union{DecisionTreeAndRandomForest.Node, DecisionTreeAndRandomForest.Leaf}: Points to the left child.

  • right::Union{DecisionTreeAndRandomForest.Node, DecisionTreeAndRandomForest.Leaf}: Points to the right child.

  • feature_index::Int64: Stores the index of the selected feature.

  • split_value::Any: Stores the value on which the data is split.

source
DecisionTreeAndRandomForest.RandomForest (Type)

Represents a RandomForest.

Fields

  • trees::Vector{DecisionTree}: Contains the vector of DecisionTree structures.

  • max_depth::Int64: Contains the maximum depth of the tree. If -1, the DecisionTree is of unlimited depth.

  • min_samples_split::Int64: Contains the minimum number of samples required to split a node.

  • split_criterion::Function: Contains the split criterion function.

  • number_of_trees::Int64: Contains the number of trees in the RandomForest structure.

  • subsample_percentage::Float64: Contains the percentage of the dataset to use for training each tree.

  • num_features::Int64: Contains the number of features to use when finding the best split. If -1, all the features are used.

source
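
Analogously, a hedged construction sketch for the forest, again assuming the default positional constructor in the field order above and an initially empty trees vector:

# Hypothetical: positional constructor in field order (trees, max_depth,
# min_samples_split, split_criterion, number_of_trees,
# subsample_percentage, num_features).
forest = RandomForest(DecisionTree[], 5, 2, split_gini, 100, 0.8, -1)
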
Base.show (Method)
show(io, tree)

This function recursively prints the structure of the DecisionTree, providing information about each node and leaf. It's primarily used for debugging and visualizing the tree's structure.

Arguments

  • io::IO: The IO context to print the tree structure.
  • tree::DecisionTree: The DecisionTree to print.

Returns

  • Nothing: This function prints the structure of the DecisionTree.
source
Base.show (Method)
show(io, forest)

This function recursively prints the structure of the RandomForest. It's primarily used for debugging and visualizing the forest's structure.

Arguments

  • io::IO: The IO context to print the Forest structure.
  • forest::RandomForest: The RandomForest to be printed.

Returns

  • Nothing: This function prints the structure of the RandomForest.
source
DecisionTreeAndRandomForest.build_tree (Function)
build_tree(
    data,
    labels,
    max_depth,
    min_samples_split,
    num_features,
    split_criterion
)
build_tree(
    data,
    labels,
    max_depth,
    min_samples_split,
    num_features,
    split_criterion,
    depth
)

This function recursively builds a DecisionTree by repeatedly splitting the data according to the provided split_criterion. The process continues until the maximum depth is reached, the number of samples in a node falls below min_samples_split, or all labels in a node are the same.

Arguments

  • data::AbstractMatrix: The training data.
  • labels::AbstractVector: The labels for the training data.
  • max_depth::Int: The maximum depth of the tree.
  • min_samples_split::Int: The minimum number of samples required to split a node.
  • num_features::Int: The number of features to consider at each split. If -1, all features are used.
  • split_criterion::Function: The function used to determine the best split at each node.
  • depth::Int=0: The current depth of the tree (used recursively).

Returns

  • Union{Node, Leaf}: The root of the subtree built at the current level: a Node if the data was split further, or a Leaf if a stopping condition was met.
source
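
A hedged usage sketch of a direct call (normally build_tree is invoked through fit!; the toy data below is purely illustrative):

X = [1.0 2.0; 1.5 1.8; 5.0 8.0; 6.0 9.0]   # 4 samples, 2 features
y = ["a", "a", "b", "b"]

# Unlimited depth (-1), at least 2 samples per split, all features (-1).
root = build_tree(X, y, -1, 2, -1, split_gini)
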
DecisionTreeAndRandomForest.calculate_gini (Method)
calculate_gini(y)

This function calculates the Gini impurity of a set of labels, which measures the homogeneity of the labels within a node. A lower Gini impurity indicates a more homogeneous set of labels.

Arguments

  • y::AbstractVector: A vector of labels.

Returns

  • Float64: The Gini impurity of the labels.
source
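
For reference, a self-contained sketch of the formula, G(y) = 1 - Σ_c p_c^2 with p_c the relative frequency of class c in y; this mirrors the definition, not necessarily the package's implementation:

# Gini impurity: 1 minus the sum of squared class frequencies.
function gini_sketch(y)
    n = length(y)
    counts = Dict{eltype(y),Int}()
    for label in y
        counts[label] = get(counts, label, 0) + 1
    end
    return 1.0 - sum((c / n)^2 for c in values(counts))
end

gini_sketch(["a", "a", "b", "b"])  # 0.5, maximally impure for two classes
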
DecisionTreeAndRandomForest.calculate_variance (Method)
calculate_variance(y)

Calculates the sample variance of a given set of labels, using the standard formula for sample variance.

Arguments

  • y::AbstractVector: A vector of numerical labels for which the variance is to be computed.

Returns

  • Float64: The sample variance of the input label vector y.
source
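
The standard sample-variance formula, as a one-line reference sketch (not the package's code):

using Statistics: mean

# Sample variance: sum of squared deviations divided by (n - 1).
sample_variance(y) = sum((y .- mean(y)).^2) / (length(y) - 1)

sample_variance([1.0, 2.0, 3.0])  # 1.0
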
DecisionTreeAndRandomForest.fit! (Method)
fit!(tree, data, labels)

This function builds the tree structure of the DecisionTree by calling the build_tree function.

Arguments

  • tree::DecisionTree: The DecisionTree to fit.
  • data::AbstractMatrix: The training data.
  • labels::AbstractVector: The labels for the training data.

Returns

  • Nothing: This function modifies the tree in-place.
source
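
A hedged end-to-end sketch, reusing the hypothetical positional constructor from above:

X = [1.0 2.0; 1.5 1.8; 5.0 8.0; 6.0 9.0]
y = ["a", "a", "b", "b"]

tree = DecisionTree(-1, 2, -1, split_gini, missing)  # hypothetical constructor
fit!(tree, X, y)                    # builds the tree in place
predict(tree, [1.2 1.9; 5.5 8.5])   # expected: ["a", "b"]
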
DecisionTreeAndRandomForest.fit! (Method)
fit!(forest, data, labels)

This function trains each individual tree in the RandomForest by calling fit! on each DecisionTree in the forest.trees vector. The num_features parameter of the RandomForest is used to control the number of features considered for each split during training.

Arguments

  • forest::RandomForest: The RandomForest to be trained.
  • data::AbstractMatrix: The training data.
  • labels::AbstractVector: The labels for the training data.

Returns

  • Nothing: This function modifies the forest in-place.
source
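
The forest workflow looks analogous, reusing X and y from the sketch above (same constructor assumption; the trees vector is assumed to be populated during fitting):

forest = RandomForest(DecisionTree[], -1, 2, split_gini, 10, 0.8, -1)
fit!(forest, X, y)   # trains all 10 trees on subsamples of the data
predict(forest, X)   # majority vote across trees (classification)
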
DecisionTreeAndRandomForest.get_split_criterions (Function)
get_split_criterions()
get_split_criterions(task)

Retrieves the implemented split criteria that can be used.

Arguments

  • task::String: A String indicating the task for which split criteria should be retrieved. Can be "classification" or "regression". If omitted, all available criteria are returned.

Returns

  • Tuple{Function}: A tuple containing the implemented criterion functions.
source
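
A hedged usage sketch (the exact functions returned are package-defined):

get_split_criterions()                  # all implemented criteria
get_split_criterions("classification")  # e.g. Gini- and entropy-based splits
get_split_criterions("regression")      # e.g. variance reduction
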
DecisionTreeAndRandomForest.gini_impurity (Function)
gini_impurity(X, y)
gini_impurity(X, y, num_features_to_use)

Finds the best split point for a decision tree node using Gini impurity.

Arguments

  • X::AbstractMatrix: A matrix of features, where each row is a data point and each column is a feature.
  • y::AbstractVector: A vector of labels corresponding to the data points.
  • num_features_to_use::Int: The number of features to consider when looking for the best split. If -1, all features are considered.

Returns

  • Tuple{Int, T}: A tuple containing the index of the best feature and the best split value.
source
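
The shape of such a search can be sketched as an exhaustive scan over features and observed values, scoring each candidate with the split_node and weighted_gini functions documented in this section (an assumed outline, not the package's exact implementation):

function best_gini_split_sketch(X, y)
    best_feature, best_value, best_score = 0, nothing, Inf
    for j in axes(X, 2), v in unique(X[:, j])
        y_left, y_right = split_node(X, y, j, v)
        (isempty(y_left) || isempty(y_right)) && continue
        score = weighted_gini(y_left, y_right)  # lower is better
        if score < best_score
            best_feature, best_value, best_score = j, v, score
        end
    end
    return best_feature, best_value
end
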
DecisionTreeAndRandomForest.information_gain (Function)
information_gain(X, y)
information_gain(X, y, num_features_to_use)

Finds the best split point for a decision tree node using information gain.

Arguments

  • X::AbstractMatrix: A matrix of features.
  • y::AbstractVector: A vector of labels.
  • num_features_to_use::Int=-1: The number of features to consider for each split. If -1, all features are used.

Returns

  • best_feature::Int: The index of the best feature to split on.
  • best_threshold::Real: The threshold value for the best split.
source
DecisionTreeAndRandomForest.predict (Method)
predict(tree, data)

This function traverses the tree structure of the DecisionTree for each datapoint in data, following the decision rules based on the split criteria and feature values. For a classification problem, the prediction is the most frequent label (mode) among the labels in the leaf node; for a regression problem, it is the mean of the values in the leaf node.

Arguments

  • tree::DecisionTree: The trained DecisionTree.
  • data::AbstractMatrix: The datapoints to predict.

Returns

  • Vector: A vector of predictions for each datapoint in data.
source
DecisionTreeAndRandomForest.predict (Method)
predict(forest, data)

This function predicts the label for each datapoint in the input dataset using the trained RandomForest: it queries each individual tree in the forest and combines the predictions, taking either the most frequent label (classification task) or the average of the predictions (regression task).

Arguments

  • forest::RandomForest: The trained RandomForest.
  • data::AbstractMatrix: The dataset for which to make predictions.

Returns

  • AbstractVector: A vector of predictions for each datapoint in data.
source
DecisionTreeAndRandomForest.predict_single (Method)
predict_single(tree, sample)

This function traverses the tree structure of the DecisionTree for the datapoint in sample, following the decision rules based on the split criteria and feature values. If the leaf node contains numerical values, it is treated as a regression problem, and the prediction is the average of those values. If the leaf node contains categorical labels, it is treated as a classification problem, and the prediction is the most frequent label (mode) among the labels in the leaf node.

Arguments

  • tree::DecisionTree: The trained DecisionTree.
  • sample::AbstractVector: The sample to predict.

Returns

  • The prediction for sample: the most frequent label (classification) or the mean of the leaf values (regression).
source
DecisionTreeAndRandomForest.split_gini (Function)
split_gini(X, y)
split_gini(X, y, num_features)

This function is a wrapper for find_best_split to be used as the split criterion in the build_tree function.

Arguments

  • X::AbstractMatrix: A matrix of features, where each row is a data point and each column is a feature.
  • y::AbstractVector: A vector of labels corresponding to the data points.
  • num_features::Int: The number of features to consider when looking for the best split. If -1, all features are considered.

Returns

  • Tuple{Int, T}: A tuple containing the index of the best feature and the best split value.
source
DecisionTreeAndRandomForest.split_ig (Function)
split_ig(data, labels)
split_ig(data, labels, num_features)

This function is a wrapper for best_split to be used as the split criterion in the build_tree function.

Arguments

  • data::AbstractMatrix: A matrix of features, where each row is a data point and each column is a feature.
  • labels::AbstractVector: A vector of labels corresponding to the data points.
  • num_features::Int: The number of features to consider for each split.

Returns

  • Tuple{Int, Real}: A tuple containing the index of the best feature and the best split value.
source
DecisionTreeAndRandomForest.split_node (Method)
split_node(X, y, index, value)

This function splits the labels into two subsets based on the provided feature and value. It handles both numerical and categorical features.

Arguments

  • X::AbstractMatrix{T}: A matrix of features, where each row is a data point and each column is a feature.
  • y::AbstractVector: A vector of labels corresponding to the data points.
  • index::Int: The index of the feature to split on.
  • value::T: The value to split the feature on.

Returns

  • Tuple{AbstractVector, AbstractVector}: A tuple containing the left and right sets of labels.
source
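
A sketch of the split rule this implies (the exact comparison operators are an assumption):

# Numerical features: left if x ≤ value; categorical: left if x == value.
goes_left(x, value) = x isa Number ? x <= value : x == value

X = [1.0 2.0; 3.0 4.0; 5.0 6.0]
y = ["a", "b", "b"]
split_node(X, y, 1, 3.0)  # e.g. (["a", "b"], ["b"]) under a ≤ split
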
DecisionTreeAndRandomForest.split_variance (Function)
split_variance(X, y)
split_variance(X, y, num_features)

This function is a wrapper for find_best_split_vr to be used as the split criterion in the build_tree function.

Arguments

  • X::AbstractMatrix: A matrix of features, where each row is a data point and each column is a feature.
  • y::AbstractVector: A vector of labels corresponding to the data points.
  • num_features::Int: The number of features to consider for each split.

Returns

  • Tuple{Int, T}: A tuple containing the index of the best feature and the best split value.
source
DecisionTreeAndRandomForest.variance_reduction (Function)
variance_reduction(X, y)
variance_reduction(X, y, num_features_to_use)

Finds the best split point for a decision tree node using variance reduction.

Arguments

  • X::AbstractMatrix: A matrix of features, where each row is a data point and each column is a feature.
  • y::AbstractVector: A vector of labels corresponding to the data points.
  • num_features_to_use::Int=-1: The number of features to consider for each split. If -1, all features are used.

Returns

  • Tuple{Int, Any}: A tuple containing the index of the best feature and the best split value.
source
DecisionTreeAndRandomForest.weighted_entropy (Method)
weighted_entropy(y_left, y_right)

Calculates the entropy of the left and right subsets and returns the weighted sum of the two entropies.

Arguments

  • y_left::AbstractVector{T}: The labels vector for the left split.
  • y_right::AbstractVector{T}: The labels vector for the right split.

Returns

  • Float64: The weighted entropy of the split.
source
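
For reference, the underlying formula as a sketch, assuming base-2 entropy, H(y) = -Σ_c p_c * log2(p_c), weighted by subset size; not necessarily the package's exact code:

function entropy_sketch(y)
    n = length(y)
    counts = Dict{eltype(y),Int}()
    for label in y
        counts[label] = get(counts, label, 0) + 1
    end
    return -sum((c / n) * log2(c / n) for c in values(counts))
end

function weighted_entropy_sketch(y_left, y_right)
    n = length(y_left) + length(y_right)
    return (length(y_left) / n) * entropy_sketch(y_left) +
           (length(y_right) / n) * entropy_sketch(y_right)
end
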
DecisionTreeAndRandomForest.weighted_gini (Method)
weighted_gini(y_left, y_right)

Calculates the Gini impurity of the left and right subsets and returns the weighted sum of the two impurities.

Arguments

  • y_left::AbstractVector{T}: A vector of labels for the left subset of the data.
  • y_right::AbstractVector{T}: A vector of labels for the right subset of the data.

Returns

  • Float64: The weighted Gini impurity of the split.
source
DecisionTreeAndRandomForest.weighted_variance (Method)
weighted_variance(y_left, y_right)

Calculates the variance of the left and right subsets and returns the weighted sum of the two variances.

Arguments

  • y_left::AbstractVector{T}: A vector of labels for the left subset of the data.
  • y_right::AbstractVector{T}: A vector of labels for the right subset of the data.

Returns

  • Float64: The weighted variance of the split.
source