library.phases.phases_implementation.dataset.split.strategies.base module

class library.phases.phases_implementation.dataset.split.strategies.base.Split(dataset)

Bases: ABC

plot_per_set_distribution(features: list[str], save_plots: bool = False, save_path: str = None)

Plots the distribution of the given features for the training, validation, and test sets. This is useful for checking that the three sets have similar statistical distributions. Note: for high-dimensionality datasets this can be computationally expensive, so consider passing only a subset of the features.

Parameters:

features: list[str]

The names of the features to plot

save_plots: bool

Whether to save the plots

save_path: str

The path to save the plots
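
As a cheap numeric complement to the plots, the per-set comparison can be sketched with summary statistics. This is only an illustration and not part of the library: the helper name summarize_per_set and the layout of the sets (a dict mapping a set name to a list of row dicts) are assumptions made for this example.

```python
import statistics


def summarize_per_set(sets, features):
    """Return {feature: {set_name: (mean, stdev)}} for the requested features.

    `sets` maps a set name to a list of row dicts (hypothetical layout,
    chosen only for this illustration).
    """
    summary = {}
    for feature in features:
        summary[feature] = {}
        for set_name, rows in sets.items():
            values = [row[feature] for row in rows]
            mean = statistics.fmean(values)
            # Sample standard deviation needs at least two points.
            spread = statistics.stdev(values) if len(values) > 1 else 0.0
            summary[feature][set_name] = (mean, spread)
    return summary


# Toy data standing in for the training, validation and test sets.
sets = {
    "train": [{"age": 30}, {"age": 40}, {"age": 50}],
    "validation": [{"age": 35}, {"age": 45}],
    "test": [{"age": 38}, {"age": 42}],
}

summary = summarize_per_set(sets, ["age"])
for feature, per_set in summary.items():
    for set_name, (mean, spread) in per_set.items():
        print(f"{feature} {set_name:<10} mean={mean:.2f} stdev={spread:.2f}")
```

Similar means and spreads across the three sets are a rough indication that the split did not skew a feature. The plots remain the better tool for spotting differences in shape that summary statistics hide.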

abstractmethod split_data(y_column: str, otherColumnsToDrop: list[str] = [], train_size: float = 0.8, validation_size: float = 0.1, test_size: float = 0.1, plot_distribution: bool = True, **kwargs)
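
The default sizes in the signature (0.8, 0.1, 0.1) suggest a three-way partition into training, validation, and test sets. The following is a minimal sketch of one way a concrete strategy might cut the data by proportion, assuming a simple shuffled split. It is not the library's implementation: the helper name three_way_split and the seed parameter are hypothetical, and the handling of y_column, otherColumnsToDrop, and plot_distribution is deliberately left out.

```python
import random


def three_way_split(rows, train_size=0.8, validation_size=0.1, test_size=0.1, seed=0):
    """Shuffle `rows` and cut them into train/validation/test by proportion.

    Hypothetical helper for illustration only; a real strategy could just as
    well be stratified, time-based, or grouped.
    """
    total = train_size + validation_size + test_size
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"sizes must sum to 1.0, got {total}")

    shuffled = list(rows)
    # A dedicated Random instance keeps the split reproducible
    # without touching the global random state.
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_size)
    n_validation = round(len(shuffled) * validation_size)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    # The test set takes the remainder, so no row is dropped by rounding.
    test = shuffled[n_train + n_validation:]
    return train, validation, test


train, validation, test = three_way_split(range(100))
print(len(train), len(validation), len(test))
```

Giving the remainder to the test set guarantees that every row lands in exactly one set, even when the proportions do not divide the row count evenly.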