Detecting wide binaries using machine learning algorithms

Ashesh, Kaur, Aashish
We present a machine learning (ML) framework for the detection of wide binary star systems using Gaia DR3 data. By training supervised ML models on established wide binary catalogues, we efficiently classify wide binaries and employ clustering and nearest neighbour search to pair candidate systems. Our approach incorporates data preprocessing techniques such as SMOTE, correlation analysis, and PCA, and achieves high accuracy and recall in the task of wide binary classification. The resulting publicly available code enables rapid, scalable, and customizable analysis of wide binaries, complementing conventional analyses and providing a valuable resource for future astrophysical studies.

Detecting Wide Binaries Using Machine Learning Algorithms

Basic Information

  • Paper ID: 2506.19942
  • Title: Detecting wide binaries using machine learning algorithms
  • Authors: Amoy Ashesh (Indian Institute of Technology Patna & Trinity College Dublin), Harsimran Kaur (Indian Institute of Technology Patna), Sandeep Aashish (Indian Institute of Technology Patna)
  • Classification: astro-ph.GA gr-qc
  • Publication Date: Version of October 17, 2025
  • Paper Link: https://arxiv.org/abs/2506.19942

Abstract

This paper proposes a machine learning framework for detecting wide binary systems using Gaia DR3 data. By training supervised machine learning models on established wide binary catalogs, the researchers efficiently classify wide binaries and employ clustering and nearest neighbor search to pair candidate systems. The methodology integrates data preprocessing techniques including SMOTE, correlation analysis, and PCA, achieving high accuracy and recall rates in wide binary classification tasks. The publicly available code provided enables rapid, scalable, and customizable analysis of wide binaries, offering an effective complement to traditional analytical methods and providing valuable resources for future astrophysical research.

Research Background and Motivation

Problem Definition

Wide binary systems are pairs of gravitationally bound stars with separations ranging from thousands to tens of thousands of astronomical units. Their internal accelerations are very low, which makes them natural laboratories for testing modified gravity theories and for probing deviations from standard gravity.

Research Significance

  1. Astrophysical Value: Wide binaries can be used to study stellar evolution, dynamics, and galactic structure
  2. Gravity Theory Testing: Low-acceleration environments may reveal signatures of modified gravity effects
  3. Gaia Data Opportunity: Gaia DR3 provides unprecedented high-precision data covering the entire Milky Way

Limitations of Existing Methods

  1. Computational Complexity: Traditional statistical methods rely on Monte Carlo simulations and complex probabilistic analysis, incurring high computational costs
  2. Noise and Contamination: Identifying true gravitationally-bound pairs and detecting their dynamical anomalies is complicated by noise, contamination, and data scale
  3. Chance Alignments: As separation distance increases, the number of chance alignments increases, presenting challenges for accurate identification

Research Motivation

Machine learning methods offer a scalable alternative: supervised classifiers can separate wide binary candidates from a noisy background population, and clustering combined with nearest neighbor search can then pair the candidates efficiently. Together these provide tools that may help in the search for new physics.

Core Contributions

  1. Machine Learning Framework: First application of machine learning-assisted search to wide binary classification in the Gaia DR3 dataset
  2. Data Preprocessing Pipeline: Integration of preprocessing techniques including SMOTE balancing, correlation analysis, and PCA
  3. Multi-Algorithm Comparison: Systematic evaluation of multiple supervised learning algorithms' performance
  4. Public Tool: Provision of customizable public code (https://github.com/DespCAP/G-ML)
  5. High-Performance Classification: Achievement of high accuracy (99.8%) and recall rate (92.3%) in wide binary classification tasks

Methodology Details

Task Definition

  • Input: Stellar records from raw Gaia DR3 data
  • Output: Binary classification labels (wide binary system membership) + binary pairing
  • Constraint: Supervised learning based on the wide binary catalog established by El-Badry et al.

Model Architecture

1. Data Preprocessing Module

  • SMOTE Balancing: Addresses data imbalance (wide binaries comprise only ~1% of raw data)
  • Correlation Analysis: Quantifies linear relationships between features using Pearson correlation coefficient
  • Feature Selection: Removes positional information (right ascension, declination) to avoid overfitting
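
The correlation-analysis step can be illustrated with a minimal sketch. It computes the Pearson coefficient between feature columns and flags strongly correlated pairs. The feature names, the toy values, and the 0.95 threshold are all hypothetical; this summary does not state which threshold or columns the authors used.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def highly_correlated_pairs(features, threshold=0.95):
    """Return (name_a, name_b, r) for pairs with |r| above the assumed threshold."""
    names = sorted(features)
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            r = pearson(features[a], features[b])
            if abs(r) > threshold:
                pairs.append((a, b, r))
    return pairs

# Toy values only; these are not Gaia measurements.
toy = {
    "parallax": [1.0, 2.0, 3.0, 4.0],
    "parallax_scaled": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated copy
    "pmra": [0.5, -1.0, 0.7, -0.2],
}
redundant = highly_correlated_pairs(toy)
```

Here `redundant` contains only the `parallax` / `parallax_scaled` pair, which is the kind of redundancy a correlation analysis is meant to expose before training.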

2. Machine Learning Classifiers

The study tested multiple algorithms:

  • Random Forest Classifier (RFC): Ensemble learning-based, showing best performance
  • Logistic Regression (LR): Linear classifier with probabilistic output
  • Support Vector Machine (SVM): High-dimensional separation using RBF kernel
  • Decision Tree Classifier (DTC): Tree-structured decision making
  • K-Nearest Neighbors (KNN): Non-parametric method based on proximity
  • Naive Bayes (NB): Probabilistic classifier

3. Pairing Module

  • K-means Clustering: Clustering based on spatial position (RA, Dec) and parallax to reduce computational complexity
  • Nearest Neighbor Search: Binary pairing search in 3D Euclidean space

Technical Innovations

1. SMOTE Balancing Strategy

The original data distribution is highly imbalanced (494,664 other records versus 5,336 wide binaries). SMOTE generates synthetic minority-class samples by interpolating between existing minority samples, which raises the Random Forest recall from 0.8% to 92.3%.
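
The interpolation idea behind SMOTE can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not the authors' code: the function name `smote`, the neighbour count `k`, and the toy points are assumptions. In practice the imbalanced-learn package offers a `SMOTE` class, though this summary does not say which implementation the authors used.

```python
import math
import random

def smote(minority, n_synthetic, k=5, seed=0):
    """Generate synthetic minority samples by SMOTE-style interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_synthetic):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

# Toy two-feature minority class; values are illustrative only.
minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, n_synthetic=6, k=2)
```

Each synthetic point lies on a line segment between two real minority samples, so the minority class grows without simply duplicating records.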

2. 3D Spatial Pairing Algorithm

Nearest neighbor search using 3D Cartesian coordinates:

$D_{3D} = \sqrt{(x_A - x_B)^2 + (y_A - y_B)^2 + (z_A - z_B)^2}$
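
As a worked illustration of this distance, the sketch below converts (RA, Dec, parallax) to Cartesian coordinates and evaluates the formula. It assumes the distance in parsecs is simply 1000 divided by the parallax in milliarcseconds, which ignores parallax uncertainty; this summary does not state how the authors performed the conversion, and the function names are hypothetical.

```python
import math

def to_cartesian(ra_deg, dec_deg, parallax_mas):
    """Convert (RA, Dec, parallax) to Cartesian coordinates in parsecs.

    Assumes distance = 1000 / parallax, ignoring parallax uncertainty.
    """
    d = 1000.0 / parallax_mas
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (
        d * math.cos(dec) * math.cos(ra),
        d * math.cos(dec) * math.sin(ra),
        d * math.sin(dec),
    )

def distance_3d(star_a, star_b):
    """3D Euclidean distance in parsecs between two (ra, dec, parallax) tuples."""
    xa, ya, za = to_cartesian(*star_a)
    xb, yb, zb = to_cartesian(*star_b)
    return math.sqrt((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2)

# Two stars on the same sight line at 100 pc and 200 pc.
separation_pc = distance_3d((0.0, 0.0, 10.0), (0.0, 0.0, 5.0))
```

Because the line-of-sight component depends on inverting the parallax, its error usually dominates the 3D distance, which is the error propagation issue noted under Limitations.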

3. Hierarchical Processing Strategy

K-means clustering first partitions the candidates into spatial groups, and nearest neighbor search then runs only within each cluster. This shrinks the search space and reduces the cost relative to the O(n²) all-pairs comparison.
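
The cluster-then-search strategy can be sketched as follows. This is a simplified illustration under stated assumptions: it uses a plain Lloyd's k-means with a naive initialisation, clusters directly on toy Cartesian coordinates rather than on RA, Dec, and parallax, and uses brute-force search inside each cluster. The function names are hypothetical.

```python
import math

def kmeans(points, k, n_iter=20):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centroids = list(points[:k])  # naive initialisation: first k points (assumption)
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: each point joins its closest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def nearest_within_clusters(points, labels):
    """Map each point index to the index of its nearest same-cluster neighbour."""
    groups = {}
    for i, lab in enumerate(labels):
        groups.setdefault(lab, []).append(i)
    nearest = {}
    for members in groups.values():
        for i in members:
            rivals = [j for j in members if j != i]
            if rivals:
                nearest[i] = min(
                    rivals, key=lambda j: math.dist(points[i], points[j])
                )
    return nearest

# Toy coordinates in parsecs: two tight pairs far from each other.
points = [(0.0, 0.0, 0.0), (100.0, 0.0, 0.0), (0.1, 0.0, 0.0), (100.0, 0.2, 0.0)]
labels = kmeans(points, k=2)
pairs = nearest_within_clusters(points, labels)
```

With k clusters of roughly equal size, the brute-force distance evaluations drop from about n² to about n²/k. One trade-off of this design is that a true pair split across a cluster boundary would be missed.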

Experimental Setup

Dataset

  • Source: Raw Gaia DR3 data
  • Annotation: El-Badry et al.'s wide binary catalog as ground truth
  • Scale: Total of 500,000 records, with 5,336 wide binaries labeled
  • Split: 80:20 training-test ratio

Selection Criteria

Based on El-Badry et al.'s standards:

  1. Projected Separation Condition: s ≤ 1 pc
  2. Parallax Condition: $|\varpi_1 - \varpi_2| < b\sqrt{\sigma_{\varpi,1}^2 + \sigma_{\varpi,2}^2}$
  3. Proper Motion Condition: Proper motion differences must satisfy Keplerian orbital constraints
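
The first two criteria can be expressed as a short check. This is a sketch under stated assumptions: the projected separation uses the small-angle relation s [AU] = 1000 × θ [arcsec] / parallax [mas], the tolerance `b` is left as a parameter whose default of 3 is only a placeholder (this summary does not give its value), and the proper motion condition is omitted because its exact form is not given here. The function names are hypothetical.

```python
import math

AU_PER_PC = 206264.806  # astronomical units in one parsec

def projected_separation_au(theta_arcsec, parallax_mas):
    """Projected separation in AU from angular separation and parallax."""
    return 1000.0 * theta_arcsec / parallax_mas

def parallax_consistent(plx1, plx2, err1, err2, b=3.0):
    """Check |plx1 - plx2| < b * sqrt(err1^2 + err2^2); b=3 is a placeholder."""
    return abs(plx1 - plx2) < b * math.hypot(err1, err2)

def passes_cuts(theta_arcsec, plx1, plx2, err1, err2, b=3.0):
    """Apply the separation (s <= 1 pc) and parallax conditions to one pair."""
    s_au = projected_separation_au(theta_arcsec, plx1)
    return s_au <= AU_PER_PC and parallax_consistent(plx1, plx2, err1, err2, b)

# Toy pair: 10 arcsec apart at ~100 pc, parallaxes agreeing within errors.
candidate_ok = passes_cuts(10.0, 10.0, 10.05, 0.02, 0.02)
```

In this toy case the pair sits at a projected separation of 1000 AU and its parallax difference of 0.05 mas is within the allowed 3 × 0.028 mas, so it passes both cuts.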

Evaluation Metrics

  • Accuracy: Proportion of correct predictions
  • Recall: True positive identification capability
  • F1 Score: Harmonic mean of precision and recall
  • Confusion Matrix: Detailed classification performance analysis
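
These metrics all follow from the four confusion-matrix counts. The sketch below uses hypothetical counts for illustration, then checks that the precision and recall reported in this summary for the Random Forest reproduce the reported F1 scores. The function names are assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 if both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1_score(precision, recall),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical counts, not taken from the paper.
example = classification_metrics(tp=90, fp=10, fn=30, tn=870)

# Consistency check on the reported Random Forest values.
f1_smote = round(f1_score(0.917, 0.923), 3)     # reported F1: 0.920
f1_original = round(f1_score(0.375, 0.008), 3)  # reported F1: 0.016
```

The check also shows why accuracy alone is misleading on imbalanced data: a classifier can reach high accuracy while its recall, and therefore its F1 score, stays near zero.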

Implementation Details

  • Number of Clusters: K-means set to 10 clusters
  • Distance Metric: 3D Euclidean distance
  • Feature Selection: Excludes positional information, retains physical features

Experimental Results

Main Results

Performance Comparison Table

Algorithm        Precision   Recall   F1 Score   Accuracy
RFC (Original)   0.375       0.008    0.016      0.989
RFC (SMOTE)      0.917       0.923    0.920      0.998

Classification Analysis

Algorithm        True Positives   True Positive Rate (%)   Misclassifications   Misclassification Rate (%)
RFC (Original)   9                0.82                     1099                 100.5
RFC (SMOTE)      1009             92.31                    175                  16.01

Ablation Study

The SMOTE balancing technique shows significant effects:

  • Recall Improvement: From 0.8% to 92.3%
  • Misclassification Rate Reduction: From 100.5% to 16.0%
  • F1 Score Enhancement: From 0.016 to 0.920
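
A misclassification rate above 100% cannot be a fraction of all test records, so the rates appear to be normalised by the number of true wide binaries in the test set. That reading is an inference from the reported numbers, not something this summary states. The short calculation below checks it, using the SMOTE result of 1009 true positives at a 92.31% true positive rate, and misclassification counts of 1099 and 175.

```python
def infer_positive_count(true_positives, true_positive_rate):
    """Estimate the number of true wide binaries from TP and TP rate."""
    return round(true_positives / true_positive_rate)

def rate_percent(count, n_positive):
    """Express a count as a percentage of the inferred positive count."""
    return 100.0 * count / n_positive

# Inferred, not reported: about 1093 wide binaries in the test set.
n_positive = infer_positive_count(1009, 0.9231)

rate_smote = round(rate_percent(175, n_positive), 2)      # reported: 16.01
rate_original = round(rate_percent(1099, n_positive), 1)  # reported: 100.5
```

Both reported rates are reproduced under this assumption, which suggests that the misclassification counts include false positives as well as false negatives.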

Algorithm Comparison Analysis

  1. Random Forest: Best performance, achieving 99.8% accuracy after SMOTE balancing
  2. Decision Tree: Second-best choice with 90.0% recall
  3. Bagging Classifier: Third place with 83.9% recall
  4. Other Algorithms: Poorer performance on imbalanced data

Clustering and Pairing Results

  • Successfully partitioned predicted wide binaries into 10 spatial clusters
  • Effectively identified binary pairing relationships within each cluster
  • Provided quantitative measurements of local stellar density

Related Work

Traditional Methods

  1. Statistical Methods: El-Badry et al. used Monte Carlo simulations to exclude chance alignments
  2. Proper Motion Analysis: Chanamé and Gould introduced proper motion information to improve identification accuracy
  3. Parallax Constraints: Andrews et al. utilized parallax and radial velocity

Machine Learning Applications

  1. Stellar Classification: Cody et al.'s application on SIMBAD database
  2. Black Hole Accretion States: Sreehari and Nandi's classification studies
  3. Gravitational Wave Detection: Koloniari et al.'s parameter estimation

Advantages of This Work

  1. First Systematic Approach: First ML framework specifically for Gaia DR3 wide binaries
  2. End-to-End Solution: Complete pipeline from classification to pairing
  3. Open Source Tool: Provision of reusable code resources

Conclusions and Discussion

Main Conclusions

  1. Technical Feasibility: Machine learning methods demonstrate excellent performance in wide binary detection
  2. SMOTE Criticality: Data balancing techniques are crucial for performance improvement
  3. Random Forest Optimality: Shows best performance among multiple algorithms
  4. Practical Value: Provides rapid, scalable analysis tools

Limitations

  1. Annotation Quality Dependence: Model performance is limited by training data quality
  2. Distance Uncertainty: Error propagation exists in 3D distance calculations
  3. Feature Engineering: May overlook important physical features
  4. Generalization Ability: Performance across different sky regions requires verification

Future Directions

  1. Anomaly Detection: Extend ML to supervised anomaly detection problems
  2. Gravity Theory Testing: Identify anomalous wide binaries deviating from Newtonian gravity
  3. Multi-Source Data Fusion: Integrate additional observational data to improve performance
  4. Deep Learning: Explore more complex neural network architectures

In-Depth Evaluation

Strengths

  1. Methodological Innovation: First systematic application of ML to Gaia DR3 wide binary detection
  2. Technical Comprehensiveness: Integration of multiple preprocessing and classification techniques
  3. Excellent Performance: Significant improvements in key metrics
  4. Practical Value: Open-source tools promote field development
  5. Sufficient Experimentation: Multi-algorithm comparison and detailed performance analysis

Weaknesses

  1. Theoretical Analysis: Lacks theoretical guarantees for ML methods in astrophysical applications
  2. Verification Scope: Validated only on a single catalog; generalization requires confirmation
  3. Physical Interpretation: Insufficient explanation of physical meaning behind ML decisions
  4. Noise Modeling: Insufficient consideration of observational noise effects

Impact

  1. Academic Contribution: Provides new perspectives for astronomical big data analysis
  2. Practical Value: Tools can be directly applied to research practice
  3. Reproducibility: Open-source code ensures result reproducibility
  4. Field Advancement: Promotes ML applications in astrophysics

Applicable Scenarios

  1. Large-Scale Astronomical Surveys: Applicable to large datasets like Gaia
  2. Rapid Screening: Initial screening of candidate wide binary systems
  3. Auxiliary Analysis: Verification alongside traditional methods
  4. Educational Research: Example of ML applications in astronomy

References

  1. El-Badry et al. (2021) - Foundational work on wide binary catalog construction
  2. Chawla et al. (2002) - Original SMOTE technique paper
  3. Breiman (2001) - Random Forest algorithm
  4. Baron (2019) - Survey of machine learning applications in astronomy

Overall Assessment: This is a technically solid and highly practical application paper. The authors successfully apply machine learning techniques to a specific astrophysical problem, achieving significant performance improvements. While relatively limited in theoretical innovation, its open-source tools and systematic methodology make substantial contributions to the field. This work provides an important foundation for subsequent gravity theory testing and anomalous wide binary detection.