One Subgoal at a Time: Zero-Shot Generalization to Arbitrary Linear Temporal Logic Requirements in Multi-Task Reinforcement Learning

NeurIPS 2025
1 Division of Systems Engineering, Boston University     2 Department of Electrical and Computer Engineering, Boston University

Abstract

Generalizing to complex and temporally extended task objectives and safety constraints remains a critical challenge in reinforcement learning (RL). Linear temporal logic (LTL) offers a unified formalism to specify such requirements, yet existing methods are limited in their ability to handle nested long-horizon tasks and safety constraints, and cannot identify situations in which a subgoal is not satisfiable and an alternative should be sought. In this paper, we introduce GenZ-LTL, a method that enables zero-shot generalization to arbitrary LTL specifications. GenZ-LTL leverages the structure of Büchi automata to decompose an LTL task specification into sequences of reach-avoid subgoals. In contrast to the current state-of-the-art method, which conditions on subgoal sequences, we show that it is more effective to achieve zero-shot generalization by solving these reach-avoid problems one subgoal at a time through proper safe RL formulations. In addition, we introduce a novel subgoal-induced observation reduction technique that can mitigate the exponential complexity of subgoal-state combinations under realistic assumptions. Empirical results show that GenZ-LTL substantially outperforms existing methods in zero-shot generalization to unseen LTL specifications.

Method Overview


Overview of GenZ-LTL.

In this work, we propose a method, called GenZ-LTL, for learning to satisfy arbitrary linear temporal logic specifications. During training, we sample reach-avoid subgoals, each consisting of a reach subgoal and a set of avoid subgoals. To reduce complexity, we apply a subgoal-induced observation reduction, which simplifies the state representation when possible. We then train a general subgoal-conditioned policy together with its value functions. At test time, given a target LTL task specification, we first construct an equivalent Büchi automaton and then extract a set of reach-avoid subgoals based on the current automaton state. Each candidate subgoal is evaluated using the learned value functions, and the optimal one is selected and executed.
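As a concrete illustration of this test-time procedure, the sketch below shows one way the subgoal-selection loop could look for a finite-horizon (co-safe) specification. All helper names (`buchi.out`, `reduce_obs`, `value_fn`, `policy`, and an `env.step` that returns the currently true propositions) are hypothetical stand-ins for the components described above, not the released implementation.

```python
def select_subgoal(obs, q, buchi, value_fn, reduce_obs):
    """Score every reach-avoid subgoal available from automaton state q with the
    learned value function and return the best (subgoal, next automaton state)."""
    best, best_value = None, float("-inf")
    for reach, avoid, q_next in buchi.out(q):        # outgoing Büchi transitions
        subgoal = (reach, frozenset(avoid))
        value = value_fn(reduce_obs(obs, subgoal), subgoal)
        if value > best_value:
            best, best_value = (subgoal, q_next), value
    return best


def run_episode(env, buchi, policy, value_fn, reduce_obs, max_steps=1000):
    """Execute one episode under a finite-horizon (co-safe) LTL specification."""
    obs = env.reset()
    q = buchi.initial_state
    subgoal, q_next = select_subgoal(obs, q, buchi, value_fn, reduce_obs)
    for _ in range(max_steps):
        action = policy(reduce_obs(obs, subgoal), subgoal)
        obs, true_props, done = env.step(action)     # true_props: propositions holding now
        reach, avoid = subgoal
        if true_props & avoid:                       # an avoid proposition became true
            return False                             # specification violated
        if reach in true_props:                      # reach proposition achieved
            q = q_next                               # advance the automaton
            if buchi.is_accepting(q):
                return True                          # co-safe specification satisfied
            subgoal, q_next = select_subgoal(obs, q, buchi, value_fn, reduce_obs)
        if done:
            break
    return False
```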

Subgoal-induced observation reduction

Illustration of the subgoal-induced observation reduction (for LiDAR observations).
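The figure conveys the idea for LiDAR observations: only the channels relevant to the current subgoal are kept, so the policy input does not grow with the number of atomic propositions. Below is a minimal sketch of one plausible reduction; the per-proposition scan layout and the max-aggregation over avoid channels are assumptions made for illustration, not necessarily the exact reduction used in the paper.

```python
import numpy as np

def reduce_lidar_obs(lidar, reach, avoid, ego=None):
    """Subgoal-induced observation reduction for per-proposition LiDAR scans.

    lidar : dict mapping each proposition (e.g. 'blue', 'green') to its LiDAR
            scan, a (num_bins,) array of proximities in [0, 1].
    reach : proposition to reach; avoid : iterable of propositions to avoid.

    Returns a fixed-size observation with one channel for the reach target and
    one aggregated channel for all avoid targets, so the input dimension is
    independent of the number of propositions in the specification.
    """
    reach_chan = lidar[reach]
    if avoid:
        # element-wise max proximity over avoid scans = nearest avoid object per bin
        avoid_chan = np.max(np.stack([lidar[p] for p in avoid]), axis=0)
    else:
        avoid_chan = np.zeros_like(reach_chan)
    channels = [reach_chan, avoid_chan]
    if ego is not None:                  # e.g. the agent's own velocity / heading
        channels.append(np.asarray(ego))
    return np.concatenate(channels)
```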

Main Results

LTL Specifications

LTL specifications used for evaluation. For \(\varphi_{9-16}\) and \(\psi_{4-6}\), b, g, m, and y denote blue, green, magenta, and yellow, respectively.

Finite Horizon Tasks Results

Finite Horizon Tasks: success rate \(\eta_s\), violation rate \(\eta_v\), and average number of steps \(\mu\) to satisfy the specifications.

Infinite Horizon Tasks Results

Infinite Horizon Tasks: violation rate \(\eta_v\) and average number of visits to accepting states \(\mu_{\mathrm{acc}}\).

As shown in the tables, for finite-horizon tasks, GenZ-LTL consistently outperforms the baselines in both success rate and violation rate. In most tasks, it also learns more efficient policies, requiring fewer steps to satisfy a given specification. For infinite-horizon tasks, GenZ-LTL again achieves lower violation rates and a higher average number of visits to accepting states, demonstrating improved efficiency over the baselines. We also evaluate zero-shot generalization as the complexity of the LTL specifications and the environment increases; GenZ-LTL continues to significantly outperform the baselines in these settings.


Visualizations

We present visualizations of the policies learned by GenZ-LTL in the Zone environment. Our method consistently achieves the desired behavior under both complex finite-horizon and infinite-horizon specifications. A sketch of translating one of these specifications into its Büchi automaton follows the list of formulas.

Visualization 1

!green U ((blue | magenta) & (!green U yellow))

Visualization 2

!(magenta | yellow) U (blue & (!green U (yellow & F (green & (!blue U magenta)))))

Visualization 3

FG yellow & G !(green | blue | magenta)

Visualization 4

GF blue & GF green & G !(yellow | magenta)

Visualization 5

F (green & (!(blue | yellow) U (magenta & (!yellow U blue)))) & F (!green U yellow)

Visualization 6

GF blue & GF yellow & GF green & G !magenta
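Each specification above can be translated into an equivalent Büchi automaton with an off-the-shelf LTL tool before reach-avoid subgoals are extracted from its transitions. The snippet below does this for the last formula using the Spot library's Python bindings; using Spot here is our assumption for illustration, not necessarily the toolchain behind GenZ-LTL.

```python
import spot

# Translate the specification of Visualization 6 into a state-based Büchi automaton.
formula = spot.formula("GF blue & GF yellow & GF green & G !magenta")
aut = spot.translate(formula, "BA")   # "BA": request state-based Büchi acceptance

# Print the transitions; each outgoing edge of an automaton state corresponds to a
# reach-avoid subgoal: propositions that must hold (reach) vs. must not hold (avoid).
bdict = aut.get_dict()
print("initial state:", aut.get_init_state_number())
for s in range(aut.num_states()):
    for edge in aut.out(s):
        label = spot.bdd_format_formula(bdict, edge.cond)
        print(f"{s} -> {edge.dst}  on  {label}  acc={edge.acc}")
```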