Authors: MohammadReza EskandariNasab (Utah State University), Shah Muhammad Hamdi (Utah State University), Soukaina Filali Boubrahimi (Utah State University)
Solar Energetic Particle (SEP) prediction is challenging because of extreme class imbalance and heterogeneous multi-instrument time-series measurements. This study evaluates the impact of data preparation techniques on SEP classification using a pre-flare dataset of 17,794 samples, each with 288 timesteps and 10 physical features drawn from OMNI and GOES observations. Each sample represents a 24-hour pre-flare window at five-minute cadence. The dataset contains 17,625 non-SEP events and only 169 SEP events (roughly 104:1), making imbalance handling essential. We compare normalization methods, temporal-window reduction, borderline cleaning, random undersampling, minority-class oversampling, and SEP data augmentation using SMOTE, TimeGAN, and AVATAR. The goal is to identify which preprocessing and sampling strategies most improve SEP detection while preserving meaningful pre-flare physical patterns. This work provides a systematic benchmark for building more reliable machine-learning pipelines for rare SEP event prediction.
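
To make the dataset dimensions and two of the compared preparation steps concrete, the following is a minimal NumPy sketch, not the authors' pipeline. The sample counts, timestep count, feature count, and cadence come from the abstract; the function names (`make_labels`, `random_undersample`, `zscore_per_feature`), the 1:1 undersampling ratio, and the choice of a global per-feature z-score are illustrative assumptions, since the abstract does not specify which normalization variants or sampling ratios were tested.

```python
import numpy as np

# Counts and dimensions stated in the abstract.
N_NON_SEP, N_SEP = 17_625, 169
N_STEPS, N_FEATURES = 288, 10
CADENCE_MINUTES = 5


def make_labels():
    """Label vector with 0 = non-SEP, 1 = SEP (ordering is an assumption)."""
    return np.concatenate([
        np.zeros(N_NON_SEP, dtype=np.int64),
        np.ones(N_SEP, dtype=np.int64),
    ])


def random_undersample(y, ratio=1.0, seed=0):
    """Keep every minority sample and ratio * n_minority majority samples.

    Returns sorted indices into y. The ratio is a hypothetical parameter.
    """
    rng = np.random.default_rng(seed)
    minority = np.flatnonzero(y == 1)
    majority = np.flatnonzero(y == 0)
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    kept = rng.choice(majority, size=n_keep, replace=False)
    return np.sort(np.concatenate([minority, kept]))


def zscore_per_feature(X, eps=1e-8):
    """Standardize each feature over all samples and timesteps.

    X has shape (samples, timesteps, features). This is one possible
    normalization, assumed here for illustration only.
    """
    mean = X.mean(axis=(0, 1), keepdims=True)
    std = X.std(axis=(0, 1), keepdims=True)
    return (X - mean) / (std + eps)


y = make_labels()
idx = random_undersample(y, ratio=1.0)
imbalance_ratio = N_NON_SEP / N_SEP                 # non-SEP samples per SEP sample
window_hours = N_STEPS * CADENCE_MINUTES / 60       # length of one pre-flare window
```

With a 1:1 ratio the undersampled set keeps all 169 SEP samples plus 169 randomly chosen non-SEP samples. In practice any such resampling and the normalization statistics would be fitted on the training split only, so that test-set information does not leak into preprocessing.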
