
Detecting wide binaries using machine learning algorithms

Ashesh, Kaur, Aashish

We present a machine learning (ML) framework for the detection of wide binary star systems using Gaia DR3 data. By training supervised ML models on established wide binary catalogues, we efficiently classify wide binaries and employ clustering and nearest neighbour search to pair candidate systems. Our approach incorporates data preprocessing techniques such as SMOTE, correlation analysis, and PCA, and achieves high accuracy and recall in the task of wide binary classification. The resulting publicly available code enables rapid, scalable, and customizable analysis of wide binaries, complementing conventional analyses and providing a valuable resource for future astrophysical studies.

Detecting Wide Binaries Using Machine Learning Algorithms

Basic Information

Paper ID: 2506.19942
Title: Detecting wide binaries using machine learning algorithms
Authors: Amoy Ashesh (Indian Institute of Technology Patna & Trinity College Dublin), Harsimran Kaur (Indian Institute of Technology Patna), Sandeep Aashish (Indian Institute of Technology Patna)
Classification: astro-ph.GA gr-qc
Publication Date: Version of October 17, 2025
Paper Link: https://arxiv.org/abs/2506.19942

Abstract

This paper proposes a machine learning framework for detecting wide binary systems using Gaia DR3 data. By training supervised machine learning models on established wide binary catalogs, the researchers efficiently classify wide binaries and employ clustering and nearest neighbor search to pair candidate systems. The methodology integrates data preprocessing techniques including SMOTE, correlation analysis, and PCA, achieving high accuracy and recall rates in wide binary classification tasks. The publicly available code provided enables rapid, scalable, and customizable analysis of wide binaries, offering an effective complement to traditional analytical methods and providing valuable resources for future astrophysical research.

Research Background and Motivation

Problem Definition

Wide binary systems are pairs of gravitationally bound stars with separations ranging from thousands to tens of thousands of astronomical units. Because their internal accelerations are very low, they serve as natural laboratories for testing modified gravity theories and searching for deviations from standard gravity.

Research Significance

Astrophysical Value: Wide binaries can be used to study stellar evolution, dynamics, and galactic structure
Gravity Theory Testing: Low-acceleration environments may reveal signatures of modified gravity effects
Gaia Data Opportunity: Gaia DR3 provides high-precision, all-sky astrometric data for more than a billion sources

Limitations of Existing Methods

Computational Complexity: Traditional statistical methods rely on Monte Carlo simulations and complex probabilistic analysis, incurring high computational costs
Noise and Contamination: Identifying true gravitationally-bound pairs and detecting their dynamical anomalies is complicated by noise, contamination, and data scale
Chance Alignments: As separation distance increases, the number of chance alignments increases, presenting challenges for accurate identification

Research Motivation

Machine learning offers a scalable alternative: supervised classifiers can efficiently separate wide binary members from a noisy background population, and clustering combined with nearest neighbor search can then pair the candidates, providing tools for probing new physics.

Core Contributions

Machine Learning Framework: First application of machine learning-assisted search to wide binary classification in the Gaia DR3 dataset
Data Preprocessing Pipeline: Integration of preprocessing techniques including SMOTE balancing, correlation analysis, and PCA
Multi-Algorithm Comparison: Systematic evaluation of multiple supervised learning algorithms' performance
Public Tool: Provision of customizable public code (https://github.com/DespCAP/G-ML)
High-Performance Classification: Achievement of high accuracy (99.8%) and recall rate (92.3%) in wide binary classification tasks

Methodology Details

Task Definition

Input: Stellar records from raw Gaia DR3 data
Output: Binary classification labels (wide binary system membership) + binary pairing
Constraint: Supervised learning based on the wide binary catalog established by El-Badry et al.

Model Architecture

1. Data Preprocessing Module

SMOTE Balancing: Addresses data imbalance (wide binaries comprise only ~1% of raw data)
Correlation Analysis: Quantifies linear relationships between features using Pearson correlation coefficient
Feature Selection: Removes positional information (right ascension, declination) to avoid overfitting
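
As a rough illustration of the correlation-analysis step, the sketch below computes the Pearson coefficient between two feature columns in plain Python. The column names (parallax, pmra) and their values are made up for the example; the feature set actually used in the paper is not reproduced here.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant column")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical feature columns (illustrative values only).
parallax = [1.0, 2.0, 3.0, 4.0, 5.0]
pmra = [2.1, 3.9, 6.2, 8.0, 9.8]
r = pearson(parallax, pmra)
```

A coefficient close to +1 or -1 suggests that two features carry largely redundant linear information, which is the kind of relationship this step is meant to expose before model training.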

2. Machine Learning Classifiers

The study tested multiple algorithms:

Random Forest Classifier (RFC): Ensemble learning-based, showing best performance
Logistic Regression (LR): Linear classifier with probabilistic output
Support Vector Machine (SVM): High-dimensional separation using RBF kernel
Decision Tree Classifier (DTC): Tree-structured decision making
K-Nearest Neighbors (KNN): Non-parametric method based on proximity
Naive Bayes (NB): Probabilistic classifier

3. Pairing Module

K-means Clustering: Clustering based on spatial position (RA, Dec) and parallax to reduce computational complexity
Nearest Neighbor Search: Binary pairing search in 3D Euclidean space

Technical Innovations

1. SMOTE Balancing Strategy

The original class distribution is highly imbalanced (494,664 other stars vs 5,336 wide binaries). SMOTE generates synthetic minority-class samples by interpolating between existing minority samples and their nearest neighbors, which raises RFC recall from 0.8% to 92.3%.
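
The interpolation idea behind SMOTE can be sketched in a few lines. This is a simplified, standard-library illustration and not the authors' implementation; the function name smote_sketch, the choice k=2, and the toy 2D points are assumptions made for the example.

```python
import random

def smote_sketch(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points by interpolating between minority
    samples and one of their k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        u = rng.random()  # interpolation fraction in [0, 1)
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Toy minority class: the four corners of the unit square.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_sketch(minority, n_new=6)
```

Every synthetic point lies on a line segment joining two real minority samples, so the minority class is densified without duplicating records exactly. A maintained library implementation, such as the SMOTE class in imbalanced-learn, would normally be preferred over a hand-rolled version like this one.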

2. 3D Spatial Pairing Algorithm

Nearest neighbor search using 3D Cartesian coordinates:

D3D = √[(xA - xB)² + (yA - yB)² + (zA - zB)²]
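
A minimal sketch of how the Cartesian coordinates entering D3D could be obtained from Gaia-style columns. It assumes the simple distance estimate d = 1000 / parallax (parallax in milliarcseconds, d in parsecs), which ignores parallax uncertainty; the summary does not state which distance estimator the paper uses, and the function names are illustrative.

```python
import math

def to_cartesian(ra_deg, dec_deg, parallax_mas):
    """Convert (RA, Dec, parallax) to Cartesian (x, y, z) in parsecs.

    Assumption: distance = 1000 / parallax, with parallax in mas.
    """
    if parallax_mas <= 0:
        raise ValueError("parallax must be positive")
    d = 1000.0 / parallax_mas
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (d * math.cos(dec) * math.cos(ra),
            d * math.cos(dec) * math.sin(ra),
            d * math.sin(dec))

def distance_3d(a, b):
    """3D Euclidean distance between two Cartesian points (the D3D formula)."""
    return math.dist(a, b)

# Two hypothetical stars along the same line of sight, about 100 pc away.
star_a = to_cartesian(10.0, 20.0, 10.0)
star_b = to_cartesian(10.0, 20.0, 9.99)
separation_pc = distance_3d(star_a, star_b)
```

Because the two example stars share a sky position, their 3D separation is simply the difference of their distances, roughly 0.1 pc. The same example shows how a parallax difference of only 0.01 mas maps to a separation of that size, which is the error-propagation concern noted under Limitations.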

3. Hierarchical Processing Strategy

Clustering is performed first to partition the search space, followed by nearest neighbor search within each cluster. This avoids the O(n²) cost of comparing every star with every other star.
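
The two-stage strategy can be sketched as follows, assuming cluster labels are already available from a prior step such as K-means. The brute-force inner search and the mutual-nearest-neighbour pairing rule are assumptions chosen to keep the example short; the summary does not specify how the paper turns nearest neighbours into final pairs.

```python
import math
from collections import defaultdict

def nearest_within_clusters(points, labels):
    """For each point, find its nearest neighbour among points with the same label.

    points: list of (x, y, z) tuples; labels: one cluster label per point.
    Returns {index: (neighbour_index, distance)}; singleton clusters are skipped.
    """
    groups = defaultdict(list)
    for idx, label in enumerate(labels):
        groups[label].append(idx)

    nearest = {}
    for members in groups.values():
        for i in members:
            best = None
            for j in members:
                if i == j:
                    continue
                d = math.dist(points[i], points[j])
                if best is None or d < best[1]:
                    best = (j, d)
            if best is not None:
                nearest[i] = best
    return nearest

def mutual_pairs(nearest):
    """Keep pairs (i, j) in which each star is the other's nearest neighbour."""
    return sorted((i, j) for i, (j, _) in nearest.items()
                  if i < j and nearest[j][0] == i)

# Hypothetical positions (pc) and cluster labels.
points = [(0, 0, 0), (0.1, 0, 0), (5, 5, 5), (5, 5, 5.2), (9, 9, 9)]
labels = [0, 0, 1, 1, 1]
pairs = mutual_pairs(nearest_within_clusters(points, labels))
```

With k clusters of similar size, the number of distance evaluations drops from about n² to about n²/k. One trade-off of this design is that a true pair split across a cluster boundary would be missed.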

Experimental Setup

Dataset

Source: Raw Gaia DR3 data
Annotation: El-Badry et al.'s wide binary catalog as ground truth
Scale: Total of 500,000 records, with 5,336 wide binaries labeled
Split: 80:20 training-test ratio

Selection Criteria

Based on El-Badry et al.'s standards:

Projected Separation Condition: s ≤ 1 pc
Parallax Condition: |ϖ₁ - ϖ₂| < b·√(σ_ϖ,1² + σ_ϖ,2²), where ϖ is the parallax of each star, σ_ϖ its uncertainty, and b a tolerance factor
Proper Motion Condition: Proper motion differences must satisfy Keplerian orbital constraints
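
The first two criteria can be expressed as simple checks, sketched below. The default b = 3 is only an illustrative value, and the projected separation uses the standard small-angle relation s [AU] = 1000 · θ [arcsec] / ϖ [mas]; the function names are hypothetical. The proper motion condition is omitted because the summary does not give its explicit form.

```python
import math

AU_PER_PC = 206264.806  # astronomical units in one parsec

def projected_separation_pc(theta_arcsec, parallax_mas):
    """Projected separation in parsecs from angular separation and parallax."""
    if parallax_mas <= 0:
        raise ValueError("parallax must be positive")
    s_au = 1000.0 * theta_arcsec / parallax_mas
    return s_au / AU_PER_PC

def parallaxes_consistent(plx1, plx2, sigma1, sigma2, b=3.0):
    """True if |plx1 - plx2| < b * sqrt(sigma1^2 + sigma2^2); all values in mas."""
    return abs(plx1 - plx2) < b * math.hypot(sigma1, sigma2)

def passes_basic_cuts(theta_arcsec, plx1, plx2, sigma1, sigma2, b=3.0):
    """Apply the separation (s <= 1 pc) and parallax conditions to one candidate pair."""
    s_pc = projected_separation_pc(theta_arcsec, 0.5 * (plx1 + plx2))
    return s_pc <= 1.0 and parallaxes_consistent(plx1, plx2, sigma1, sigma2, b)

# Hypothetical pair: 10 arcsec apart at about 100 pc, i.e. roughly 1000 AU.
ok = passes_basic_cuts(10.0, 10.0, 10.05, 0.02, 0.02)
```

Using the mean parallax of the two stars to set the distance is a simplification made for this example.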

Evaluation Metrics

Accuracy: Proportion of correct predictions
Recall: True positive identification capability
F1 Score: Harmonic mean of precision and recall
Confusion Matrix: Detailed classification performance analysis

Implementation Details

Number of Clusters: K-means set to 10 clusters
Distance Metric: 3D Euclidean distance
Feature Selection: Excludes positional information, retains physical features

Experimental Results

Main Results

Performance Comparison Table

Algorithm	Precision	Recall	F1 Score	Accuracy
RFC (Original)	0.375	0.008	0.016	0.989
RFC (SMOTE)	0.917	0.923	0.920	0.998

Classification Analysis

Algorithm	True Positives	True Positive Rate (%)	Misclassifications	Misclassification Rate (%)
RFC (Original)	9	0.82	1099	100.5
RFC (SMOTE)	1009	92.31	175	16.01
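
The tabulated values can be cross-checked with a few lines of arithmetic. The sketch below assumes a 100,000-star test set (20% of 500,000) containing 1,093 true wide binaries, with 15 and 91 false positives for the two models. These counts are not stated in the summary; they are inferred from the reported precision, true positive, and misclassification figures. Under this reading, the misclassification rate is (false positives + false negatives) divided by the number of true wide binaries, which is why it can exceed 100%.

```python
def metrics(tp, fp, fn, total):
    """Classification metrics from confusion-matrix counts."""
    tn = total - tp - fp - fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "accuracy": (tp + tn) / total,
        # Misclassifications relative to the number of true positives-class
        # members (tp + fn), matching the convention inferred from the table.
        "misclassification_rate_pct": 100.0 * (fp + fn) / (tp + fn),
    }

TEST_SIZE = 100_000   # assumed: 20% of 500,000 records
POSITIVES = 1_093     # inferred number of wide binaries in the test set

# False-positive counts (15 and 91) are inferred, not quoted from the paper.
original = metrics(tp=9, fp=15, fn=POSITIVES - 9, total=TEST_SIZE)
smote = metrics(tp=1009, fp=91, fn=POSITIVES - 1009, total=TEST_SIZE)

for name, m in (("RFC (Original)", original), ("RFC (SMOTE)", smote)):
    print(name, {key: round(value, 3) for key, value in m.items()})
```

The original model's 98.9% accuracy alongside 0.8% recall shows why accuracy alone is misleading on data where only about 1% of records are positives.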

Ablation Study

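Comparing RFC with and without SMOTE balancing shows a large effect:

Recall Improvement: From 0.8% to 92.3%
Misclassification Rate Reduction: From 100.5% to 16.0% (the figures are consistent with counting false positives plus false negatives relative to the number of true wide binaries, which is why the rate can exceed 100%)
F1 Score Enhancement: From 0.016 to 0.920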

Algorithm Comparison Analysis

Random Forest: Best performance, achieving 99.8% accuracy after SMOTE balancing
Decision Tree: Second-best choice with 90.0% recall
Bagging Classifier: Third place with 83.9% recall
Other Algorithms: Poorer performance on imbalanced data

Clustering and Pairing Results

Successfully partitioned predicted wide binaries into 10 spatial clusters
Effectively identified binary pairing relationships within each cluster
Provided quantitative measurements of local stellar density

Related Work

Traditional Methods

Statistical Methods: El-Badry et al. used Monte Carlo simulations to exclude chance alignments
Proper Motion Analysis: Chanamé and Gould introduced proper motion information to improve identification accuracy
Parallax Constraints: Andrews et al. utilized parallax and radial velocity

Machine Learning Applications

Stellar Classification: Cody et al.'s application on SIMBAD database
Black Hole Accretion States: Sreehari and Nandi's classification studies
Gravitational Wave Detection: Koloniari et al.'s parameter estimation

Advantages of This Work

First Systematic Approach: First ML framework specifically for Gaia DR3 wide binaries
End-to-End Solution: Complete pipeline from classification to pairing
Open Source Tool: Provision of reusable code resources

Conclusions and Discussion

Main Conclusions

Technical Feasibility: Machine learning methods demonstrate excellent performance in wide binary detection
SMOTE Criticality: Data balancing techniques are crucial for performance improvement
Random Forest Optimality: Shows best performance among multiple algorithms
Practical Value: Provides rapid, scalable analysis tools

Limitations

Annotation Quality Dependence: Model performance is limited by training data quality
Distance Uncertainty: Error propagation exists in 3D distance calculations
Feature Engineering: May overlook important physical features
Generalization Ability: Performance across different sky regions requires verification

Future Directions

Anomaly Detection: Extend ML to supervised anomaly detection problems
Gravity Theory Testing: Identify anomalous wide binaries deviating from Newtonian gravity
Multi-Source Data Fusion: Integrate additional observational data to improve performance
Deep Learning: Explore more complex neural network architectures

In-Depth Evaluation

Strengths

Methodological Innovation: First systematic application of ML to Gaia DR3 wide binary detection
Technical Comprehensiveness: Integration of multiple preprocessing and classification techniques
Excellent Performance: Significant improvements in key metrics
Practical Value: Open-source tools promote field development
Sufficient Experimentation: Multi-algorithm comparison and detailed performance analysis

Weaknesses

Theoretical Analysis: Lacks theoretical guarantees for ML methods in astrophysical applications
Verification Scope: Validated only on a single catalog; generalization requires confirmation
Physical Interpretation: Insufficient explanation of physical meaning behind ML decisions
Noise Modeling: Insufficient consideration of observational noise effects

Impact

Academic Contribution: Provides new perspectives for astronomical big data analysis
Practical Value: Tools can be directly applied to research practice
Reproducibility: Open-source code ensures result reproducibility
Field Advancement: Promotes ML applications in astrophysics

Applicable Scenarios

Large-Scale Astronomical Surveys: Applicable to large datasets like Gaia
Rapid Screening: Initial screening of candidate wide binary systems
Auxiliary Analysis: Verification alongside traditional methods
Educational Research: Example of ML applications in astronomy

References

El-Badry et al. (2021) - Foundational work on wide binary catalog construction
Chawla et al. (2002) - Original SMOTE technique paper
Breiman (2001) - Random Forest algorithm
Baron (2019) - Survey of machine learning applications in astronomy

Overall Assessment: This is a technically solid and highly practical application paper. The authors successfully apply machine learning techniques to a specific astrophysical problem, achieving significant performance improvements. While relatively limited in theoretical innovation, its open-source tools and systematic methodology make substantial contributions to the field. This work provides an important foundation for subsequent gravity theory testing and anomalous wide binary detection.