Impacts of Data Preprocessing and Sampling Techniques on Solar Flare Prediction from Multivariate Time Series Data of Photospheric Magnetic Field Parameters

Authors: MohammadReza EskandariNasab (Utah State University), Shah Muhammad Hamdi (Utah State University), Soukaina Filali Boubrahimi (Utah State University)

Accurate prediction of solar flares is crucial because of the risks they pose to astronauts, space equipment, and satellite communication systems. Our research improves solar flare prediction by applying data preprocessing and sampling techniques to the SWAN-SF dataset, a rich source of multivariate time series data of solar active regions. Our study combines four key methodologies. First, we address over 10 million missing values in the SWAN-SF dataset with our imputation technique, Fast Pearson Correlation-based K-nearest neighbors imputation (FPCKNN imputation). Second, we propose a normalization technique tailored for time series data, called LSBZM normalization, which combines several transformations (Log, Square Root, Box-Cox, Z-score, and Min-Max) to scale the dataset’s 24 attributes (photospheric magnetic field parameters) uniformly while addressing issues such as skewness. Third, we explore the ‘Near Decision Boundary Sample Removal’ technique, which improves classification performance on the dataset by addressing class overlap. Finally, we thoroughly evaluate diverse over-sampling and under-sampling methods, including SMOTE, ADASYN, Gaussian Noise Injection, TimeGAN, Tomek-links, and Random Under Sampling, to counter the severe class imbalance in the SWAN-SF dataset, notably a 60:1 ratio of minor (C, B, and FQ) to major (X and M) flaring events in binary classification. To demonstrate the effectiveness of our methods, we use eight classification algorithms, including advanced deep learning-based architectures. Our analysis shows outstanding True Skill Statistic (TSS) scores, underscoring the importance of data preprocessing and sampling in time series-based solar flare prediction.
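To illustrate why TSS is a natural metric under this kind of imbalance, the following minimal sketch computes it from binary labels using the standard definition, TSS = TP/(TP+FN) - FP/(FP+TN). It is not the evaluation code used in this study. The function name `true_skill_statistic` and the toy labels are hypothetical, and the sketch assumes the major-flare class is the rare positive class (label 1) at a 60:1 imbalance.

```python
def true_skill_statistic(y_true, y_pred):
    """Return TSS = TP/(TP+FN) - FP/(FP+TN) for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    if tp + fn == 0 or fp + tn == 0:
        raise ValueError("TSS is undefined unless both classes are present")
    return tp / (tp + fn) - fp / (fp + tn)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Hypothetical toy data: 10 positive (major-flare) samples, 600 negative ones.
y_true = [1] * 10 + [0] * 600

# A trivial classifier that never predicts a major flare.
always_negative = [0] * 610

# An imperfect classifier: detects 8 of 10 positives, raises 60 false alarms.
imperfect = [1] * 8 + [0] * 2 + [1] * 60 + [0] * 540

acc_trivial = accuracy(y_true, always_negative)
tss_trivial = true_skill_statistic(y_true, always_negative)
acc_imperfect = accuracy(y_true, imperfect)
tss_imperfect = true_skill_statistic(y_true, imperfect)

print(f"trivial:   accuracy={acc_trivial:.3f}  TSS={tss_trivial:.3f}")
print(f"imperfect: accuracy={acc_imperfect:.3f}  TSS={tss_imperfect:.3f}")
```

On this toy data the trivial classifier has the higher accuracy but a TSS of zero, while the classifier that actually detects most positive samples has lower accuracy and a clearly positive TSS. Because TSS is the difference of two per-class rates, it does not reward simply predicting the majority class.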