Robust Reinforcement Learning Differential Game Guidance

Master’s Thesis — Low-Thrust Multi-Body Dynamical Environments

Ali Bani Asad
Department of Aerospace Engineering, Sharif University of Technology
Supervised by: Dr. Hadi Nobahari · September 2025



📋 Abstract

This research presents a zero-sum multi-agent reinforcement learning (MARL) framework for robust spacecraft guidance in the challenging Earth-Moon three-body dynamical system. The work addresses critical problems in low-thrust spacecraft guidance under significant environmental uncertainties through a novel differential game formulation.

Multi-Agent RL: Robust spacecraft guidance in three-body problem

🎯 Key Contributions

🏆 Key Results

The zero-sum MARL approach demonstrates superior robustness, with MA-TD3 delivering the best overall performance.


🏗️ Repository Structure

master-thesis/
├── 📚 Report/                      # LaTeX thesis document
│   ├── thesis.tex                  # Main thesis file
│   ├── Chapters/                   # 8 chapters (Introduction → Conclusion)
│   ├── bibs/                       # Bibliography
│   └── plots/                      # Result plots and figures
│
├── 📜 Paper/                       # Conference paper (IEEE format)
│
├── 💻 Code/
│   ├── Python/
│   │   ├── Algorithms/             # DDPG, TD3, SAC, PPO implementations
│   │   ├── Environment/            # Three-body problem dynamics (TBP.py)
│   │   ├── TBP/                    # Single-agent training
│   │   ├── Robust_eval/            # Robustness testing (Standard & ZeroSum)
│   │   ├── Benchmark/              # OpenAI Gym environments
│   │   └── utils/                  # Utility functions
│   │
│   ├── C/                          # C++ real-time inference (PyTorch models)
│   ├── ROS2/                       # ROS2 packages for hardware integration
│   ├── Simulink/                   # MATLAB Simulink models
│   └── ros_legacy/                 # Legacy ROS1 implementation
│
├── 🖼️ Figure/                      # Visualizations (TBP, HIL)
├── 🎓 Presentation/                # Defense slides (Beamer)
└── 📖 Proposal/                    # Research proposal

Full Repository: https://github.com/alibaniasad1999/master-thesis


🔬 Research Methodology

Problem Formulation

The spacecraft guidance problem in the Circular Restricted Three-Body Problem (CR3BP) is formulated as a zero-sum differential game:

🎯 Player 1: Guidance Agent

Minimizes trajectory deviation and fuel consumption

⚔️ Player 2: Disturbance Agent

Maximizes trajectory deviation (models worst-case uncertainties)

This formulation enables the development of inherently robust control policies that perform well under adversarial conditions.
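As a minimal sketch of the zero-sum coupling (the function name, weights, and cost terms here are illustrative, not the thesis reward function): both players optimise a single stage cost with opposite signs, so any gain for the disturbance agent is exactly a loss for the guidance agent.

```python
import numpy as np

def zero_sum_rewards(deviation, delta_v, w_dev=1.0, w_fuel=0.1):
    """Hypothetical stage cost for the zero-sum game.

    Player 1 (guidance) minimises trajectory deviation plus fuel use;
    Player 2 (disturbance) receives the exact negation, so the two
    rewards always sum to zero.
    """
    cost = w_dev * np.linalg.norm(deviation) + w_fuel * np.linalg.norm(delta_v)
    r_guidance = -cost      # Player 1 maximises -cost (i.e. minimises cost)
    r_disturbance = cost    # Player 2 maximises the same cost
    return r_guidance, r_disturbance
```

Because the rewards are negations of each other, a single critic value suffices in principle: one player's Q-value is the other's with flipped sign.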


🤖 Multi-Agent RL Algorithms

| Algorithm | Type | Key Features | Best For |
|-----------|------|--------------|----------|
| MA-DDPG | Off-policy, deterministic | Simple, efficient, strong baseline | Fast prototyping |
| MA-TD3 | Off-policy, deterministic | Target smoothing, delayed updates, clipped double Q | Best overall performance |
| MA-SAC | Off-policy, stochastic | Maximum entropy, automatic temperature tuning | Exploration-heavy tasks |
| MA-PPO | On-policy, stochastic | Trust-region optimisation, stable updates | Sparse rewards |

Training Strategy


📊 Key Results

Performance Comparison

| Algorithm | Trajectory Error (m) | Fuel Consumption (m/s) | Success Rate (%) | Robustness |
|-----------|---------------------|------------------------|------------------|------------|
| PID Control | 8,432 ± 2,156 | 45.2 ± 8.3 | 72.4 | ⭐⭐ |
| DDPG | 1,234 ± 892 | 28.7 ± 5.2 | 84.6 | ⭐⭐⭐ |
| TD3 | 967 ± 654 | 26.4 ± 4.1 | 88.2 | ⭐⭐⭐⭐ |
| SAC | 1,045 ± 721 | 27.8 ± 4.8 | 86.9 | ⭐⭐⭐⭐ |
| PPO | 1,398 ± 978 | 31.2 ± 6.3 | 81.5 | ⭐⭐⭐ |
| MA-DDPG | 892 ± 423 | 25.1 ± 3.2 | 91.7 | ⭐⭐⭐⭐ |
| MA-TD3 🏆 | 687 ± 312 | 23.4 ± 2.8 | 95.3 | ⭐⭐⭐⭐⭐ |
| MA-SAC | 734 ± 367 | 24.2 ± 3.1 | 93.8 | ⭐⭐⭐⭐⭐ |
| MA-PPO | 856 ± 445 | 26.7 ± 3.9 | 90.4 | ⭐⭐⭐⭐ |

Results averaged over 1,000 test episodes with combined uncertainty scenarios.


Trajectory Tracking Performance: TD3

Standard vs Zero-Sum MA-TD3 Comparison

Figure panels: Standard TD3 trajectory · Zero-Sum MA-TD3 trajectory · Standard TD3 trajectory with control forces · Zero-Sum MA-TD3 trajectory with control forces

Key Observation: MA-TD3 maintains tighter trajectories with smaller control-effort fluctuations than standard TD3 in the same scenarios.


Robustness Analysis Under Uncertainty

Comparative Performance: All Four Algorithms

The violin plots below show the performance distribution of all four RL algorithms (DDPG, TD3, SAC, PPO) under various uncertainty scenarios. Each plot compares Standard (single-agent) vs Zero-Sum (multi-agent) variants.

Zero-Sum Multi-Agent RL - All Algorithms Combined

Figure panels (zero-sum variants): actuator disturbance · sensor noise · initial condition shift · time delay · model mismatch · partial observation

Standard Single-Agent RL - All Algorithms Combined

Figure panels (standard variants): actuator disturbance · sensor noise · initial condition shift · time delay · model mismatch · partial observation

🎯 Key Findings

✅ Zero-sum MARL outperforms single-agent RL across every metric
✅ MA-TD3 achieves ~30% lower trajectory error than standard TD3
✅ Robustness improves across all uncertainty classes, with narrower performance distributions
✅ Real-time deployment is feasible, with sub-5 ms inference latency


🚀 Getting Started

Prerequisites

Software Requirements:

Hardware Requirements:

Installation

1. Clone the Repository

git clone https://github.com/alibaniasad1999/master-thesis.git
cd master-thesis

2. Set Up Python Environment

# Create virtual environment
python -m venv venv

# Activate virtual environment
source venv/bin/activate  # Linux/macOS
# or
venv\Scripts\activate  # Windows

# Install dependencies
pip install -r requirements.txt

3. Verify Installation

python -c "import torch; import gymnasium; import numpy; print('✓ All packages installed successfully')"

💡 Usage Guide

Training RL Agents

Single-Agent Training (Baseline)

cd Code/Python/TBP/SAC
jupyter notebook SAC_TBP.ipynb

Follow the notebook to:

  1. Configure environment parameters
  2. Set hyperparameters
  3. Train the agent
  4. Evaluate performance
  5. Save trained models

Zero-Sum Multi-Agent Training

cd Code/Python/TBP/SAC/ZeroSum
jupyter notebook Zero_Sum_SAC_TBP.ipynb

The notebook demonstrates:

  1. Zero-sum game setup
  2. Alternating training procedure
  3. Nash equilibrium convergence
  4. Robustness evaluation

Robustness Evaluation

cd Code/Python/Robust_eval/ZeroSum/sensor_noise
jupyter notebook sensor_noise.ipynb

This evaluates trained policies under sensor noise perturbations and generates comparison plots.
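The sensor-noise perturbation itself can be sketched in a few lines (illustrative; the evaluation notebooks may implement it differently):

```python
import numpy as np

def add_sensor_noise(obs, sigma, rng=None):
    """Perturb an observation with zero-mean Gaussian noise of std `sigma`.

    Applied to each observation before it reaches the policy, this
    emulates imperfect navigation sensors during evaluation.
    """
    rng = np.random.default_rng() if rng is None else rng
    return obs + rng.normal(0.0, sigma, size=np.shape(obs))
```

Sweeping `sigma` over a range and re-running the trained policy yields the per-noise-level performance distributions shown in the violin plots.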

C++ Inference (Real-Time Deployment)

cd Code/C
mkdir build && cd build
cmake ..
make
./main

The C++ implementation loads PyTorch traced models for fast inference.
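A traced model of the kind the C++ side consumes can be exported from Python roughly as follows (the network here is a stand-in, and the file name is hypothetical; the real actor comes from training):

```python
import torch
import torch.nn as nn

# Stand-in for a trained actor network (6-D state -> 3-D bounded action).
actor = nn.Sequential(
    nn.Linear(6, 64), nn.ReLU(),
    nn.Linear(64, 3), nn.Tanh(),
)
actor.eval()

# Trace with an example input so TorchScript records the forward graph.
example_state = torch.zeros(1, 6)
traced = torch.jit.trace(actor, example_state)
traced.save("actor_traced.pt")  # loadable from C++ via torch::jit::load
```

Tracing freezes the forward pass into a serialized graph, so the C++ binary needs only libtorch at runtime, with no Python interpreter in the loop.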

ROS2 Integration

cd Code/ROS2
colcon build
source install/setup.bash
ros2 launch tbp_rl_controler tbp_system.launch.py

This launches:


🎓 Why Multi-Agent & Zero-Sum?

Many real-world control problems are effectively games:

  • 🎯 Pursuit-evasion – interceptor vs. target aircraft, autonomous car vs. pedestrian prediction
  • 🌪️ Disturbance rejection – controller vs. nature; treat wind-gusts or hardware faults as an adversary
  • 🤝 Competitive resource allocation – multiple robots vying for the same power or bandwidth budget

Model-free RL removes the need for hand-crafted opponent models; differential-game extensions push agents toward robust Nash equilibria rather than brittle one-shot optima.


📚 Reinforcement Learning — Quick Primer

Reinforcement learning (RL) is a paradigm in which an agent discovers an optimal control policy by interacting with an environment and maximising cumulative reward.

Fundamentals

At each discrete time $t$ the agent:

  1. Observes a state $s_t$
  2. Acts using policy $a_t\sim\pi_\theta(\,\cdot\,\mid s_t)$
  3. Receives a reward $r_t$ and the next state $s_{t+1}$

This loop repeats until a terminal condition resets the episode.
Conceptually, the environment, agent, and action align with the classic control terms plant, controller, and control input.
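The observe–act–reward loop above can be sketched with a toy plant (a hypothetical 1-D regulation task, not the thesis CR3BP environment):

```python
import numpy as np

class ToyEnv:
    """Minimal stand-in environment: drive a scalar state to the origin."""
    def reset(self):
        self.x = 1.0
        return np.array([self.x])

    def step(self, action):
        self.x += float(action)           # plant dynamics
        reward = -abs(self.x)             # closer to the origin is better
        done = abs(self.x) < 1e-2         # terminal condition
        return np.array([self.x]), reward, done

env = ToyEnv()
state, total_return = env.reset(), 0.0
for t in range(50):
    action = -0.5 * state[0]              # a fixed proportional "policy"
    state, reward, done = env.step(action)
    total_return += reward
    if done:
        break
```

In RL the hand-written proportional rule would be replaced by a learned policy $\pi_\theta$, updated to increase the accumulated reward.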


The agent–environment process in a Markov decision process

Mathematically, the problem is cast as a Markov Decision Process (MDP)
$\langle S,A,P,r,q_0,\gamma\rangle$: state space $S$, action space $A$, transition kernel $P$, reward function $r$, initial-state distribution $q_0$, and discount factor $\gamma\in[0,1)$.

The agent seeks to maximise the expected return

\[G_t=\sum_{k=t+1}^{T}\gamma^{\,k-t-1}r_k.\]
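Evaluated from the tail of an episode, the return obeys the recursion $G_t = r_{t+1} + \gamma G_{t+1}$, which a short helper makes concrete (illustrative, not from the repository):

```python
def discounted_return(rewards, gamma):
    """G_t = sum_{k=t+1}^T gamma^{k-t-1} r_k, given rewards r_{t+1..T}.

    Computed backwards via the recursion G_t = r_{t+1} + gamma * G_{t+1}.
    """
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

For example, rewards `[1, 1, 1]` with `gamma = 0.5` give `1 + 0.5*(1 + 0.5*1) = 1.75`.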

Value functions formalise “how good” a state or action is:

\[\begin{aligned} V^\pi(s_t) &=\mathbb E_\pi\Bigl[G_t\mid s_t\Bigr],\\[2pt] Q^\pi(s_t,a_t) &=\mathbb E_\pi\Bigl[G_t\mid s_t,a_t\Bigr]. \end{aligned}\]

📖 Algorithms

Actor–Critic Frameworks & Neural Networks

Most modern continuous-control algorithms are actor–critic: an actor network outputs the policy, while a critic network estimates a value function. Both are neural networks (fully connected, ReLU) trained with gradient-based updates.
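A minimal PyTorch version of such an actor–critic pair might look like this (layer widths and state/action dimensions are hypothetical, not the thesis architecture):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Deterministic policy: state -> bounded action."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # actions in [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q-function: (state, action) -> scalar value estimate."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```

The critic scores state–action pairs; the actor is updated to output actions the critic scores highly.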

Figure panels: actor network (policy) · critic network (value)


DDPG Algorithm (Deep Deterministic Policy Gradient, Multi-Agent Zero-Sum)


TD3 Algorithm (Twin Delayed Deep Deterministic Policy Gradient, Multi-Agent Zero-Sum)
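TD3's two signature tricks, target-policy smoothing and the clipped double-Q minimum, appear in how the Bellman target is formed. A sketch of that one step (networks, replay buffer, and hyperparameters assumed to exist elsewhere; this is illustrative, not the thesis code):

```python
import torch

def td3_target(critic1_t, critic2_t, actor_t, next_s, r, done,
               gamma=0.99, noise_std=0.2, noise_clip=0.5, act_limit=1.0):
    """Compute the TD3 Bellman target y = r + gamma * (1 - done) * min(Q1', Q2').

    Target-policy smoothing: clipped Gaussian noise on the target action.
    Clipped double Q: take the minimum of the two target critics to
    counteract overestimation bias.
    """
    with torch.no_grad():
        noise = (torch.randn_like(actor_t(next_s)) * noise_std
                 ).clamp(-noise_clip, noise_clip)
        next_a = (actor_t(next_s) + noise).clamp(-act_limit, act_limit)
        q_min = torch.min(critic1_t(next_s, next_a),
                          critic2_t(next_s, next_a))
        return r + gamma * (1.0 - done) * q_min
```

Both critics regress toward this shared target, while the actor is updated less frequently (delayed updates) against one critic only.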


SAC Algorithm (Soft Actor-Critic, Multi-Agent Zero-Sum)


PPO Algorithm (Proximal Policy Optimisation, Multi-Agent Zero-Sum)


Notation Key


🎯 Reproducibility

Reproduce Training Results

# Train MA-TD3 agent
cd Code/Python/TBP/TD3/ZeroSum
jupyter notebook Zero_Sum_TD3_TBP.ipynb
# Execute all cells

Reproduce Evaluation Results

# Run robustness evaluation
cd Code/Python/Robust_eval/ZeroSum/All_in_one/actuator_disturbance
jupyter notebook all_in_one.ipynb

Random Seeds

All experiments use fixed random seeds for reproducibility:


📚 Citation

If you use this work in your research, please cite:

@mastersthesis{baniasad2025robust,
  author       = {Ali Bani Asad},
  title        = {Robust Reinforcement Learning Differential Game Guidance 
                  in Low-Thrust, Multi-Body Dynamical Environments},
  school       = {Sharif University of Technology},
  year         = {2025},
  address      = {Tehran, Iran},
  month        = {September},
  type         = {Master's Thesis},
  note         = {Department of Aerospace Engineering}
}

📧 Contact

Ali Bani Asad
Department of Aerospace Engineering, Sharif University of Technology
📧 ali_baniasad@ae.sharif.edu
🔗 @alibaniasad1999

Supervisor — Dr. Hadi Nobahari
📧 nobahari@sharif.edu


🙏 Acknowledgments

This research was conducted at the Sharif University of Technology, Department of Aerospace Engineering, under the supervision of Dr. Hadi Nobahari and the advisory of Dr. Seyed Ali Emami Khooansari.

Special thanks to:


📜 License

This project is licensed under the MIT License - see the LICENSE file for details.



⭐ If you find this research useful, please consider giving it a star! ⭐

Made with ❤️ at Sharif University of Technology