library.phases.phases_implementation.data_preprocessing.outliers_bounds module

class library.phases.phases_implementation.data_preprocessing.outliers_bounds.OutliersBounds(dataset: Dataset)[source]

Bases: object

bound_checking() None[source]

Apply numeric BOUNDS to dataset.df and remove rare violators.

The global constant BOUNDS must map column names to (min, max) tuples. For each column, the helper will:

  • Drop rows that lie outside the interval when they represent less than 0.5% of the total dataset.

  • Keep (but record) them for manual analysis otherwise.

Return type:

None
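The rule described above can be sketched as follows. This is a minimal illustration, not the library's implementation: the BOUNDS values, the apply_bounds helper, and its return shape are hypothetical, and the sketch assumes inclusive bounds, a 0.5% share measured against the original row count, and that missing values are not violations.

```python
import pandas as pd

# Hypothetical bounds; the real BOUNDS constant is defined by the library.
BOUNDS = {"age": (0, 120), "speed": (0.0, 300.0)}
RARE_FRACTION = 0.005  # the 0.5% cutoff described above


def apply_bounds(df, bounds, rare_fraction=RARE_FRACTION):
    """Return (cleaned_df, kept_violations) for the given (min, max) bounds."""
    total_rows = len(df)  # share is measured against the original row count
    kept_violations = {}
    for column, (lower, upper) in bounds.items():
        if column not in df.columns:
            continue
        # NaN compares False on both sides, so it is not flagged here.
        out_of_range = (df[column] < lower) | (df[column] > upper)
        n_violations = int(out_of_range.sum())
        if n_violations == 0:
            continue
        if n_violations / total_rows < rare_fraction:
            # Rare violators: drop the rows.
            df = df.loc[~out_of_range]
        else:
            # Frequent violators: keep the rows, record their index labels.
            kept_violations[column] = df.index[out_of_range].tolist()
    return df, kept_violations


df = pd.DataFrame({
    "age": [30] * 999 + [500],               # 1 of 1000 rows out of range
    "speed": [50.0] * 900 + [999.0] * 100,   # 100 of 1000 rows out of range
})
cleaned, kept = apply_bounds(df, BOUNDS)
```

In this toy dataset the single out-of-range age is rare enough to be dropped, while the out-of-range speed values are too frequent to drop and are recorded for review instead.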

compare_distributions_grid(original_df: DataFrame, cleaned_df: DataFrame, columns: list[str] | None = None, bins: int = 50, max_features: int = 20) None[source]

Side-by-side histograms to compare original vs. cleaned features.

Parameters:
  • original_df (pandas.DataFrame) – Dataset before preprocessing.

  • cleaned_df (pandas.DataFrame) – Dataset after preprocessing.

  • columns (list[str] | None) – Subset of columns to display. Defaults to the first max_features numeric columns.

  • bins (int) – Number of histogram bins.

  • max_features (int) – Maximum number of features to plot.

Return type:

None
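A side-by-side comparison of this kind could look roughly like the sketch below. It is an assumption-laden illustration built on matplotlib, not the method's actual code: plot_before_after is a hypothetical name, the figure is returned rather than shown, and max_features is assumed to cap an explicit columns list as well.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


def plot_before_after(original_df, cleaned_df, columns=None, bins=50, max_features=20):
    """Draw one row per feature: original on the left, cleaned on the right."""
    if columns is None:
        # Default: numeric columns only, in their existing order.
        columns = original_df.select_dtypes(include="number").columns.tolist()
    columns = list(columns)[:max_features]
    if not columns:
        raise ValueError("no columns to plot")
    fig, axes = plt.subplots(
        len(columns), 2, figsize=(10, 3 * len(columns)), squeeze=False
    )
    for row, column in enumerate(columns):
        axes[row, 0].hist(original_df[column].dropna(), bins=bins)
        axes[row, 0].set_title(f"{column} (original)")
        axes[row, 1].hist(cleaned_df[column].dropna(), bins=bins)
        axes[row, 1].set_title(f"{column} (cleaned)")
    fig.tight_layout()
    return fig


rng = np.random.default_rng(0)
original = pd.DataFrame({
    "a": rng.normal(size=200),
    "b": rng.normal(size=200),
    "label": ["x"] * 200,  # non-numeric, skipped by the default selection
})
cleaned = original[original["a"].abs() < 2]
fig = plot_before_after(original, cleaned, bins=20)
```

Returning the figure leaves it to the caller whether to display it or write it to disk with fig.savefig.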

get_outliers(detection_type: str = 'iqr', threshold: float = 1.5, save_plots: bool = False, save_path: str = None) str[source]

Detects outliers, removes them from X_train, and returns a summary.

Parameters:
  • detection_type (str) – Method used to detect outliers (‘iqr’ or ‘percentile’).

  • threshold (float) – Multiplier for IQR used to define outlier bounds.

  • save_plots (bool) – Whether to save distribution plots of the outlier features.

  • save_path (str | None) – Location where the plots are saved when save_plots is True.

Returns:

Summary of the outlier detection operation.

Return type:

str
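For the 'iqr' detection type, the conventional rule flags values below Q1 - threshold * IQR or above Q3 + threshold * IQR. The sketch below illustrates that rule on a training frame. The helper names iqr_bounds and drop_iqr_outliers and the summary wording are hypothetical, and the library's own handling (including the 'percentile' option) may differ.

```python
import pandas as pd


def iqr_bounds(series, threshold=1.5):
    """Return (lower, upper) fences based on the interquartile range."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr


def drop_iqr_outliers(X_train, threshold=1.5):
    """Drop rows with an outlier in any numeric column; return (cleaned, summary)."""
    numeric_columns = X_train.select_dtypes(include="number").columns
    keep = pd.Series(True, index=X_train.index)
    counts = {}
    for column in numeric_columns:
        lower, upper = iqr_bounds(X_train[column], threshold)
        is_outlier = (X_train[column] < lower) | (X_train[column] > upper)
        counts[column] = int(is_outlier.sum())
        keep &= ~is_outlier
    cleaned = X_train.loc[keep]
    removed = len(X_train) - len(cleaned)
    summary = (
        f"Removed {removed} of {len(X_train)} rows; "
        f"outliers per column: {counts}"
    )
    return cleaned, summary


X_train = pd.DataFrame({"x": [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]})
cleaned, summary = drop_iqr_outliers(X_train, threshold=1.5)
print(summary)
```

A larger threshold widens the fences and removes fewer rows; 1.5 is the usual box-plot convention and matches the default in the signature above.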