library.phases.phases_implementation.data_preprocessing.outliers_bounds module

class library.phases.phases_implementation.data_preprocessing.outliers_bounds.OutliersBounds(dataset: Dataset)

Bases: object

bound_checking() → None

Apply numeric BOUNDS to dataset.df and remove rare violators.

The global constant BOUNDS must map column names to (min, max) tuples. For each column, the helper will:

- Drop rows that lie outside the interval when they represent < 0.5% of the total dataset
- Keep (but record) them for manual analysis otherwise

Return type:

None
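The rule described for bound_checking can be sketched as follows. This is a minimal illustration and not the library's implementation: the BOUNDS contents, the apply_bounds helper, and the rare_fraction parameter are hypothetical names chosen for the example, and missing values are assumed not to count as violations.

```python
import pandas as pd

# Hypothetical bounds table; the real BOUNDS constant is not shown on this page.
BOUNDS = {"age": (0, 120), "score": (0.0, 1.0)}


def apply_bounds(df, bounds, rare_fraction=0.005):
    """Drop rare out-of-bounds rows; record frequent ones for manual review."""
    total = len(df)  # fractions are measured against the original row count
    flagged = {}
    for col, (low, high) in bounds.items():
        if col not in df.columns:
            continue
        # Assumption: NaN values are not treated as violations.
        outside = df[col].notna() & ~df[col].between(low, high)
        n_outside = int(outside.sum())
        if n_outside == 0:
            continue
        if n_outside / total < rare_fraction:
            df = df.loc[~outside]  # rare violators: remove the rows
        else:
            flagged[col] = df.index[outside].tolist()  # frequent: keep and record
    return df, flagged


df = pd.DataFrame({
    "age": [30] * 998 + [-5, 400],        # 0.2% of rows violate the bounds
    "score": [0.5] * 900 + [7.0] * 100,   # 10% of rows violate the bounds
})
cleaned, flagged = apply_bounds(df, BOUNDS)
```

In this example the two "age" violators are dropped because they fall under the 0.5% cut-off, while the "score" violators stay in the frame and their index labels are recorded for inspection.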

compare_distributions_grid(original_df: DataFrame, cleaned_df: DataFrame, columns: list[str] | None = None, bins: int = 50, max_features: int = 20) → None

Side-by-side histograms to compare original vs. cleaned features.

Parameters:

original_df (pandas.DataFrame) – Dataset before processing.
cleaned_df (pandas.DataFrame) – Dataset after processing.
columns (list[str] | None) – Subset of columns to display. Defaults to the first max_features numeric columns.
bins (int) – Number of histogram bins.
max_features (int) – Maximum number of features to plot.

Return type:

None
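As a rough sketch of the data preparation behind such a comparison, the snippet below selects the default columns and bins both frames on the same edges so the two histograms are directly comparable. The shared_histograms helper is hypothetical, it only computes counts rather than drawing the grid, and applying max_features to an explicit columns list is an assumption.

```python
import numpy as np
import pandas as pd


def shared_histograms(original_df, cleaned_df, columns=None, bins=50, max_features=20):
    """Return {column: (edges, counts_before, counts_after)} using shared bin edges."""
    if columns is None:
        # Default: numeric columns of the original frame, in their existing order.
        columns = original_df.select_dtypes(include="number").columns.tolist()
    # Assumption: the max_features cap also applies to an explicit column list.
    columns = [c for c in columns if c in cleaned_df.columns][:max_features]

    result = {}
    for col in columns:
        before = original_df[col].dropna().to_numpy()
        after = cleaned_df[col].dropna().to_numpy()
        if before.size == 0:
            continue
        # Bin edges come from the original data so both histograms share an x-axis.
        edges = np.histogram_bin_edges(before, bins=bins)
        counts_before, _ = np.histogram(before, bins=edges)
        counts_after, _ = np.histogram(after, bins=edges)
        result[col] = (edges, counts_before, counts_after)
    return result


original = pd.DataFrame({"x": [0, 1, 2, 3, 100], "label": list("abcde")})
cleaned = original.iloc[:4]
hists = shared_histograms(original, cleaned, bins=10)
```

Each entry can then be drawn as a pair of bar charts, one per frame, with any plotting library.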

get_outliers(detection_type: str = 'iqr', threshold: float = 1.5, save_plots: bool = False, save_path: str = None) → str

Detects outliers, removes them from X_train, and returns a summary.

Parameters:

detection_type (str) – Method used to detect outliers ('iqr' or 'percentile').
threshold (float) – Multiplier applied to the IQR to define outlier bounds.
save_plots (bool) – Whether to save distribution plots of the outlier features.

Returns:

Summary of the outlier detection operation.

Return type:

str
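For reference, the conventional IQR rule that the threshold parameter suggests can be sketched as below. This is an illustration under assumptions, not the method's actual code: remove_iqr_outliers is a hypothetical helper, the summary wording is invented for the example, and the 'percentile' option is not covered.

```python
import pandas as pd


def remove_iqr_outliers(X_train, threshold=1.5):
    """Drop rows with any numeric value outside [Q1 - t*IQR, Q3 + t*IQR]."""
    numeric = X_train.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - threshold * iqr
    upper = q3 + threshold * iqr

    # Compare each column against its own bounds; a row is an outlier
    # if any of its numeric values falls outside them.
    outside = numeric.lt(lower, axis="columns") | numeric.gt(upper, axis="columns")
    is_outlier = outside.any(axis=1)

    cleaned = X_train.loc[~is_outlier]
    summary = f"Removed {int(is_outlier.sum())} of {len(X_train)} rows (IQR x {threshold})"
    return cleaned, summary


X_train = pd.DataFrame({"a": [1, 2, 3, 4, 100], "b": [10, 11, 12, 13, 14]})
cleaned, summary = remove_iqr_outliers(X_train, threshold=1.5)
```

A larger threshold widens the accepted interval and removes fewer rows; 1.5 is the customary default for this rule.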
