PESSIMISTIC OFFLINE REINFORCEMENT LEARNING SYSTEM AND METHOD

    Publication No.: US20240037445A1

    Publication date: 2024-02-01

    Application No.: US17969129

    Filing date: 2022-10-19

    CPC classification number: G06N20/00 G06N7/005

    Abstract: Systems and methods for pessimistic offline reinforcement learning are described herein. In one example, a method for performing offline reinforcement learning:

    - determines whether sampled states are out of distribution;
    - assigns high probability weights to the sampled states that are out of distribution;
    - generates a fitted Q-function by solving an optimization problem with a minimization term and a maximization term;
    - uses the fitted Q-function to estimate a Q-value, that is, the overall expected reward assuming the agent is in the present state and performs the present action; and
    - updates the policy according to an existing reinforcement learning algorithm.

    The minimization term penalizes the overall expected reward when the present state is out of distribution. The maximization term cancels the minimization term when the present state is in distribution.
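The abstract does not specify the out-of-distribution test, the weights, or the exact optimization problem. What follows is a minimal Python sketch under stated assumptions and is not the patented method. It assumes a tabular Q-function and a hypothetical count-based OOD test, under which a state is OOD if it never appears as a source state in the offline dataset. It also assumes a regularizer modeled loosely on conservative Q-learning (CQL), alpha * sum_s (w_min(s) - w_max(s)) * mean_a Q(s, a), which is added to a fitted Q-iteration Bellman loss. The minimization weight is high on OOD states. The maximization weight equals the minimization weight on in-distribution states and is zero on OOD states, so the two terms cancel exactly in distribution. Estimating a Q-value is then a lookup `q[s][a]`. The function names (`detect_ood`, `state_weights`, `fit_q`, `greedy_policy`), the parameters (`alpha`, `high`, `low`, `min_count`), and `TOY_DATASET` are all hypothetical. Greedy action selection stands in for "an existing reinforcement learning algorithm".

```python
from collections import Counter

# One offline transition: (state, action, reward, next_state) in a tabular toy setting.
Transition = tuple[int, int, float, int]


def detect_ood(dataset: list[Transition], states, min_count: int = 1) -> dict[int, bool]:
    """Hypothetical OOD test: a state is out of distribution if it appears as a
    source state in the dataset fewer than `min_count` times."""
    counts = Counter(s for s, _, _, _ in dataset)
    return {s: counts[s] < min_count for s in states}


def state_weights(is_ood: dict[int, bool], high: float = 1.0, low: float = 0.1):
    """Return (w_min, w_max). OOD states get a high weight in the minimization
    term and zero in the maximization term; in-distribution states get the same
    low weight in both, so their contributions cancel exactly."""
    w_min = {s: high if ood else low for s, ood in is_ood.items()}
    w_max = {s: 0.0 if ood else low for s, ood in is_ood.items()}
    return w_min, w_max


def fit_q(dataset: list[Transition], n_states: int, n_actions: int,
          alpha: float = 1.0, gamma: float = 0.9, lr: float = 0.1,
          outer_iters: int = 30, inner_steps: int = 20) -> list[list[float]]:
    """Fitted Q-iteration with a pessimistic, CQL-style state regularizer
    (an assumed objective, not the patented one), minimized by gradient descent."""
    q = [[0.0] * n_actions for _ in range(n_states)]
    w_min, w_max = state_weights(detect_ood(dataset, range(n_states)))
    for _ in range(outer_iters):
        # Freeze bootstrapped targets r + gamma * max_a' Q(s', a') for this round.
        targets = [r + gamma * max(q[s2]) for _, _, r, s2 in dataset]
        for _ in range(inner_steps):
            grad = [[0.0] * n_actions for _ in range(n_states)]
            # Gradient of the mean squared Bellman error on dataset transitions.
            for (s, a, _, _), y in zip(dataset, targets):
                grad[s][a] += 2.0 * (q[s][a] - y) / len(dataset)
            # Gradient of alpha * sum_s (w_min[s] - w_max[s]) * mean_a Q(s, a):
            # the w_min part pushes Q down, the w_max part pushes it back up.
            for s in range(n_states):
                net = w_min[s] - w_max[s]
                for a in range(n_actions):
                    grad[s][a] += alpha * net / n_actions
            for s in range(n_states):
                for a in range(n_actions):
                    q[s][a] -= lr * grad[s][a]
    return q


def greedy_policy(q: list[list[float]]) -> list[int]:
    """Stand-in policy update: act greedily w.r.t. the fitted Q (ties -> lowest action)."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


# Toy data: action 1 in state 0 pays more but leads to state 2, never seen as a source state.
TOY_DATASET: list[Transition] = [
    (0, 0, 0.5, 1),
    (0, 1, 1.0, 2),
    (1, 0, 0.0, 1),
    (1, 1, 0.0, 1),
]

if __name__ == "__main__":
    print(greedy_policy(fit_q(TOY_DATASET, 3, 2, alpha=0.0)))  # [1, 0, 0]
    print(greedy_policy(fit_q(TOY_DATASET, 3, 2, alpha=1.0)))  # [0, 0, 0]
```

On this toy dataset, the unpenalized fit (`alpha=0.0`) prefers action 1 in state 0 because its immediate reward is higher. The penalized fit pushes the Q-values of the unseen state 2 down. That pessimism propagates through the bootstrapped target, and the greedy policy switches to the in-distribution action. Q-values for in-distribution states are unchanged by the regularizer. Nothing but the penalty acts on a state that has no dataset transitions, so its Q-values keep falling as training runs longer. In this sketch, `alpha` and the step budget therefore control how pessimistic such states become.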
