library.phases.phases_implementation.data_preprocessing.outliers_bounds module
- class library.phases.phases_implementation.data_preprocessing.outliers_bounds.OutliersBounds(dataset: Dataset)
Bases: object
- bound_checking() → None
Apply numeric BOUNDS to dataset.df and remove rare violators.
The global constant BOUNDS must map column names to (min, max) tuples. For each column, the helper will:
- Drop rows that lie outside the interval when they represent < 0.5% of the total dataset
- Keep (but record) them for manual analysis otherwise
- Return type:
None
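The rule described for bound_checking can be sketched as a standalone function. This is a minimal illustration, not the library's implementation: the function name apply_bounds, the example BOUNDS entries, and the choice to return the recorded violations as a dict of row indices are all assumptions made for the example.

```python
import pandas as pd

# Hypothetical bounds; the real BOUNDS constant is defined elsewhere in the library.
BOUNDS = {"age": (0, 120), "height_cm": (30, 250)}
RARE_FRACTION = 0.005  # the 0.5% cutoff described above


def apply_bounds(df, bounds, rare_fraction=RARE_FRACTION):
    """Return (cleaned_df, kept_violations) following the rule described above."""
    total_rows = len(df)  # fractions are measured against the original row count
    cleaned = df
    kept_violations = {}  # column name -> row indices left in place for review
    for column, (lower, upper) in bounds.items():
        if column not in cleaned.columns:
            continue
        values = cleaned[column]
        outside = (values < lower) | (values > upper)  # NaN is never flagged
        n_outside = int(outside.sum())
        if n_outside == 0:
            continue
        if n_outside / total_rows < rare_fraction:
            # Rare violators: drop the offending rows.
            cleaned = cleaned.loc[~outside]
        else:
            # Frequent violators: keep the rows but record them.
            kept_violations[column] = cleaned.index[outside].tolist()
    return cleaned, kept_violations
```

Unlike the documented method, which returns None and works on dataset.df, this sketch returns its results so the behavior is easy to inspect.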
- compare_distributions_grid(original_df: DataFrame, cleaned_df: DataFrame, columns: list[str] | None = None, bins: int = 50, max_features: int = 20) → None
Side-by-side histograms to compare original vs. cleaned features.
- Parameters:
original_df (pandas.DataFrame) – Dataset before preprocessing.
cleaned_df (pandas.DataFrame) – Dataset after preprocessing.
columns (list[str] | None) – Subset of columns to display. Defaults to first max_features numeric columns.
bins (int) – Number of histogram bins.
max_features (int) – Maximum number of features to plot.
- Return type:
None
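A side-by-side histogram grid of this kind could look roughly like the sketch below. It is an assumption-laden illustration rather than the method's actual code: plot_before_after is a hypothetical name, the figure is returned instead of shown, and max_features is applied to an explicit columns list as well as to the default selection.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd


def plot_before_after(original_df, cleaned_df, columns=None, bins=50, max_features=20):
    """Draw one row per feature: original on the left, cleaned on the right."""
    if columns is None:
        # Default to the numeric columns of the original frame.
        columns = original_df.select_dtypes(include="number").columns.tolist()
    # Only plot columns present in both frames, capped at max_features (assumption).
    columns = [c for c in columns if c in cleaned_df.columns][:max_features]
    if not columns:
        raise ValueError("no columns to plot")
    fig, axes = plt.subplots(
        len(columns), 2, figsize=(10, 3 * len(columns)), squeeze=False
    )
    for row, column in enumerate(columns):
        axes[row, 0].hist(original_df[column].dropna(), bins=bins)
        axes[row, 0].set_title(f"{column} (original)")
        axes[row, 1].hist(cleaned_df[column].dropna(), bins=bins)
        axes[row, 1].set_title(f"{column} (cleaned)")
    fig.tight_layout()
    return fig
```

Calling fig.savefig("comparison.png") on the returned figure writes the grid to disk; the file name here is only an example.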
- get_outliers(detection_type: str = 'iqr', threshold: float = 1.5, save_plots: bool = False, save_path: str = None) → str
Detects outliers, removes them from X_train, and returns a summary.
- Parameters:
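detection_type (str) – Method used to detect outliers (‘iqr’ or ‘percentile’).
threshold (float) – Multiplier for the IQR used to define outlier bounds.
save_plots (bool) – Whether to produce distribution plots of the outlier features (documented as plot in the docstring; the signature names it save_plots).
save_path (str) – Not described in the docstring; presumably the location where plots are written when save_plots is True.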
- Returns:
Summary of the outlier detection operation.
- Return type:
str
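For the ‘iqr’ detection type, the usual rule flags values below Q1 − threshold × IQR or above Q3 + threshold × IQR. The sketch below applies that standard rule to a training frame and builds a short summary string. The names iqr_bounds and drop_iqr_outliers, the row-wise removal across all numeric columns, and the summary wording are assumptions for illustration; the library's own logic may differ, and the ‘percentile’ method is not covered here.

```python
import pandas as pd


def iqr_bounds(series, threshold=1.5):
    """Return the (lower, upper) limits of the standard IQR rule."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return q1 - threshold * iqr, q3 + threshold * iqr


def drop_iqr_outliers(X_train, threshold=1.5):
    """Remove rows with an outlier in any numeric column; return (cleaned, summary)."""
    numeric = X_train.select_dtypes(include="number")
    is_outlier = pd.Series(False, index=X_train.index)
    for column in numeric.columns:
        lower, upper = iqr_bounds(numeric[column], threshold)
        # A row is flagged as soon as one of its numeric values falls outside the limits.
        is_outlier |= (numeric[column] < lower) | (numeric[column] > upper)
    cleaned = X_train.loc[~is_outlier]
    summary = (
        f"Removed {int(is_outlier.sum())} of {len(X_train)} rows "
        f"using IQR x {threshold}"
    )
    return cleaned, summary
```

A larger threshold widens the accepted interval, so fewer rows are removed.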