Enhancing Safety via Deep Reinforcement Learning in Trajectory Planning for Agile Flights in Unknown Environments
 Abstract
- Motivation 
- An increased need to swiftly evade obstacles and adapt trajectories under hard real-time constraints.
- The planner must generate viable paths that prevent collisions while maintaining high speeds with minimal tracking errors.
 
 
 - Method 
- The proposed method combines a supervised learning approach, serving as the teacher policy, with deep reinforcement learning (DRL), serving as the student policy.
- Train the teacher policy using a path planning algorithm that prioritizes safety while minimizing jerk and flight time.
 - Use this policy to guide the learning of the student policy in various unknown environments.
 
 
 
 Introduction
- Previous works’ limitations
- Require 
- Known environments.
 - Extensive environmental information for reliable outcomes, which is seldom available in real-world missions.
 
 
 - Proposed Method 
- A trajectory planning method for generating agile flight trajectories in unknown environments solely based on data from onboard sensors.
 - The framework integrates two neural networks: 
- The teacher policy 
- Supervised learning.
 - Incorporates a geometry-based trajectory planning strategy enriched with a heuristic to optimize flight time and enhance safety.
 
 - The student policy 
- Deep Q-Network with Prioritized Experience Replay (DQN-PER).
 
 
 
 
Methodology
- The primary aim 
- The primary aim of our proposed approach is to facilitate the safe and agile navigation of UAVs in unknown environments, leveraging 3D LiDAR for environmental perception.
 - Our strategy involves establishing the UAV trajectory with a minimal flight time based on real-time sensory data while prioritizing safety.
 
 - The proposed privileged reinforcement learning framework 
- The teacher policy 
- Deep Feedforward Neural Network (DFNN).
 - Trained using a proficient expert algorithm.
 - Provides optimal action insights across diverse environments.
 - Evaluates the student policy by providing the DRL reward.
 
 - The student policy 
- DQN-PER.
 - Operates solely on data observable by the UAV’s 3D LiDAR around its current pose.
 
 - Both networks output the next ideal waypoint to be followed, chosen from six motion directions (\(F\), \(B\), \(R\), \(L\), \(U\), \(D\): forward, backward, right, left, up, down).
 - The distilled knowledge from the teacher policy is integrated into the student policy, functioning without privileged information.
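
How the teacher evaluates the student is sketched below under an assumed agreement-based reward; the reward magnitudes, the `teacher_logits` input, and the terminal bonuses are illustrative assumptions, not the paper's exact reward design:

```python
import numpy as np

def teacher_reward(teacher_logits: np.ndarray, student_action: int,
                   reached_goal: bool, collided: bool) -> float:
    """Hypothetical DRL reward: the teacher scores the student's waypoint choice.

    Action indices 0-5 correspond to the outputs (F, B, R, L, U, D).
    All reward magnitudes here are illustrative assumptions.
    """
    if collided:
        return -10.0                       # strong penalty for hitting an obstacle
    if reached_goal:
        return 10.0                        # bonus for reaching the goal node
    teacher_action = int(np.argmax(teacher_logits))
    return 1.0 if student_action == teacher_action else -0.1
```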
 
 
 Problem Definition
- Our focus is on formulating a practical and reliable trajectory planning strategy, aiming to optimize three key objectives: 
- Minimizing jerk.
 - Reducing flight time.
 - Enhancing safety by maximizing the distance to the obstacles in the environment.
 
 - The goal is to minimize equations (1), (2), and (3) while maximizing (4) to enhance safety and ensure collision-free trajectories for high-speed flights (a code sketch of these cost terms follows the list):
- Goal Distance (1): \(D_{\text{goal}} = \sum_{i=0}^{n} \lVert \mathbf{p}_{\text{goal}} - \mathbf{p}_i \rVert\)
 - Next Step Distance (2): \(D_{\text{next}} = \sum_{i=0}^{n-1} \lVert \mathbf{p}_{i+1} - \mathbf{p}_i \rVert\)
 - Jerk Cost (3): \(D_{\text{jerk}} = \int \left\lVert \frac{d^3 \mathbf{p}}{dt^3} \right\rVert^2 dt\)
 - Obstacle Distance (4): \(D_{\text{obs}} = \max_i \sum_{j=0}^{k} \lVert \mathbf{p}_i - \mathbf{O}_j \rVert\)
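
These four terms can be evaluated on a discrete waypoint sequence. A minimal NumPy sketch, assuming waypoints sampled at a fixed time step `dt` and a finite-difference jerk approximation (names are illustrative):

```python
import numpy as np

def trajectory_costs(path: np.ndarray, goal: np.ndarray,
                     obstacles: np.ndarray, dt: float = 0.1):
    """path: (n, 3) waypoints; goal: (3,); obstacles: (k, 3) obstacle centers."""
    d_goal = np.sum(np.linalg.norm(goal - path, axis=1))             # Eq. (1)
    d_next = np.sum(np.linalg.norm(np.diff(path, axis=0), axis=1))   # Eq. (2)
    jerk = np.diff(path, n=3, axis=0) / dt**3                        # 3rd finite difference; needs >= 4 waypoints
    d_jerk = float(np.sum(jerk ** 2) * dt)                           # Eq. (3), discretized integral
    d_obs = max(np.sum(np.linalg.norm(p - obstacles, axis=1)) for p in path)  # Eq. (4)
    return d_goal, d_next, d_jerk, d_obs
```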
 
 
Expert
- Bidirectional A* algorithm 
- We integrate the objective terms into a single heuristic for trajectory generation (a scoring sketch follows the lists below): \(f(i) = (D_{\text{goal}} + D_{\text{obs}} + D_{\text{next}}) \times D_{\text{jerk}}\)
 
 - Input: 
- The entire environment map (privileged information).
 - Global position.
 - Goal node.
 
 - Output: 
- The ideal next waypoint from each node along the path.
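
As a hedged illustration of how \(f(i)\) could rank the six candidate moves during the search, the sketch below makes a greedy one-step choice; it is not the full bidirectional A*, and the grid step size and the jerk fallback are assumptions:

```python
import numpy as np

# Six unit grid moves corresponding to the output actions (F, B, R, L, U, D);
# the 1 m step size is an assumption for illustration.
MOVES = np.array([[1, 0, 0], [-1, 0, 0], [0, 1, 0],
                  [0, -1, 0], [0, 0, 1], [0, 0, -1]], dtype=float)

def expert_score(chain: np.ndarray, cand: np.ndarray, goal: np.ndarray,
                 obstacles: np.ndarray, dt: float = 0.1) -> float:
    """Score a candidate node with f(i) = (D_goal + D_obs + D_next) * D_jerk.

    chain: (m, 3) nodes already on the path; cand, goal: (3,); obstacles: (k, 3).
    """
    d_goal = np.linalg.norm(goal - cand)
    d_next = np.linalg.norm(cand - chain[-1])
    d_obs = np.sum(np.linalg.norm(cand - obstacles, axis=1))
    path = np.vstack([chain, cand[None]])
    if len(path) >= 4:  # a third derivative needs at least four samples
        d_jerk = float(np.sum((np.diff(path, n=3, axis=0) / dt**3) ** 2) * dt)
    else:
        d_jerk = 1.0    # neutral multiplier early in the search (assumption)
    return float((d_goal + d_obs + d_next) * d_jerk)

def best_next_waypoint(chain: np.ndarray, goal: np.ndarray,
                       obstacles: np.ndarray) -> np.ndarray:
    """Greedy one-step choice among the six moves (illustration, not full A*)."""
    candidates = chain[-1] + MOVES
    scores = [expert_score(chain, c, goal, obstacles) for c in candidates]
    return candidates[int(np.argmin(scores))]
```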
 
 
Teacher Policy
- The policy is trained across multiple randomly generated scenarios. 
- Rainforests.
 - Mazes.
 - Disaster Areas.
 
 - To ensure precision, a distinct model is learned for each scenario, resulting in 30 trained models: 10 for each of the three environment types.
 - Input: 
- Goal node.
 - Global position.
 - Orientation.
 - The surrounding environment within a 5 × 5 × 2 meter range.
 
 - Output: 
- (\(F\), \(B\), \(R\), \(L\), \(U\), \(D\)).
 - The subsequent ideal action determined by our expert.
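
A minimal PyTorch sketch of such a DFNN; the hidden sizes, the Euler-angle orientation encoding, and the flattened 50-cell occupancy input for the 5 × 5 × 2 m neighborhood are assumptions for illustration:

```python
import torch
import torch.nn as nn

class TeacherPolicy(nn.Module):
    """DFNN mapping (goal, position, orientation, local map) to six action logits."""

    def __init__(self, local_grid_cells: int = 50, hidden: int = 128):
        super().__init__()
        # 3 (goal) + 3 (position) + 3 (orientation, e.g. Euler angles) + local map;
        # the 50-cell flattened encoding of the 5 x 5 x 2 m neighborhood and the
        # layer sizes are assumptions for illustration.
        in_dim = 3 + 3 + 3 + local_grid_cells
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 6),  # logits over (F, B, R, L, U, D)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# Supervised training against the expert's action labels:
# loss = nn.CrossEntropyLoss()(model(states), expert_actions)
```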
 
 
Student Policy
- The student policy efficiently produces real-time, collision-free trajectories for agile flights, relying solely on onboard sensor measurements (a DQN-PER replay sketch follows the input/output lists below).
 - Input: 
- Obstacle positions within a 5 × 5 × 2 meter range, obtained from the 3D LiDAR.
 - UAV position.
 - Orientation.
 - Goal node.
 
 - Output:
- (\(F\), \(B\), \(R\), \(L\), \(U\), \(D\)).
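
DQN-PER denotes a Deep Q-Network trained with Prioritized Experience Replay. Below is a simplified, list-based sketch of such a replay buffer; a production implementation would use a sum-tree, and all hyperparameters are illustrative:

```python
import numpy as np

class PrioritizedReplay:
    """Simplified proportional PER buffer (a production version uses a sum-tree)."""

    def __init__(self, capacity: int = 50_000, alpha: float = 0.6):
        self.capacity, self.alpha = capacity, alpha
        self.buffer, self.priorities = [], []

    def push(self, transition, td_error: float = 1.0) -> None:
        """Store (state, action, reward, next_state, done) with priority |δ|^α."""
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size: int, beta: float = 0.4):
        """Sample transitions proportionally to priority, with IS weights."""
        probs = np.asarray(self.priorities) / np.sum(self.priorities)
        idx = np.random.choice(len(self.buffer), batch_size, p=probs)
        weights = (len(self.buffer) * probs[idx]) ** (-beta)
        weights /= weights.max()  # normalize for stability
        return [self.buffer[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors) -> None:
        for i, err in zip(idx, td_errors):
            self.priorities[i] = (abs(err) + 1e-6) ** self.alpha
```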
 
 - During flight, the UAV records obstacle positions, triggering policy execution upon identifying new obstacles.
 - The policy generates a new trajectory from the recorded obstacle positions, producing a single waypoint per iteration.
 - Multiple policy runs are conducted to create the final trajectory.
 - Once a trajectory is generated, Bézier curves transform the anticipated waypoint sequence into a comprehensive state representation (a smoothing sketch follows).
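
As a sketch of this Bézier step, the snippet below evaluates a single Bézier curve whose control points are the stacked waypoints; the actual curve degree, segmentation, and sampling density are assumptions:

```python
import numpy as np
from math import comb

def bezier_curve(waypoints: np.ndarray, samples: int = 100) -> np.ndarray:
    """Evaluate a Bézier curve whose control points are the planned waypoints.

    waypoints: (n+1, 3) stacked policy outputs; returns a (samples, 3) smooth path.
    """
    n = len(waypoints) - 1
    t = np.linspace(0.0, 1.0, samples)[:, None]            # parameter column (samples, 1)
    curve = np.zeros((samples, waypoints.shape[1]))
    for i, p in enumerate(waypoints):
        bernstein = comb(n, i) * t**i * (1 - t)**(n - i)   # Bernstein basis polynomial
        curve += bernstein * p
    return curve
```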
 
 Results
- The proposed method is evaluated based on: 
- Flight time (seconds).
 - Processing time (seconds).
 - High-speed duration: the time the UAV is at or above 80% of its maximum speed.
 - RMSE from the guidance trajectory (meters; a computation sketch follows this list).
 - Success rate (%).
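
For concreteness, the tracking RMSE can be computed as below, assuming the executed and guidance trajectories are sampled at matching timestamps:

```python
import numpy as np

def tracking_rmse(executed: np.ndarray, guidance: np.ndarray) -> float:
    """RMSE between flown positions and the guidance trajectory, both (n, 3)."""
    errors = np.linalg.norm(executed - guidance, axis=1)  # per-sample Euclidean error
    return float(np.sqrt(np.mean(errors ** 2)))
```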
 
 - Baseline: 
- The standard bidirectional A* in an unknown environment.
 - Our expert operating in an unknown environment.
 - The original DQN-PER.
 - Our teacher policy.
 
 - Testing in simulation demonstrates noteworthy advancements, including an 80% reduction in tracking error, a 31% decrease in flight time, a 19% increase in high-speed duration, and a success rate improvement from 50% to 100%, as compared to baseline methods.
 