-
Publication No.: US20240037445A1
Publication Date: 2024-02-01
Application No.: US17969129
Filing Date: 2022-10-19
Applicant: Denso International America, Inc., DENSO CORPORATION
Inventor: Minglei Huang, Jinning Li, Chen Tang, Masayoshi Tomizuka, Wei Zhan
Abstract: Systems and methods for pessimistic offline reinforcement learning are described herein. In one example, a method for performing offline reinforcement learning determines when sampled states are out of distribution, assigns high probability weights to the sampled states that are out of distribution, generates a fitted Q-function by solving an optimization problem with a minimization term and a maximization term, estimates a Q-value using the fitted Q-function by estimating the overall expected reward assuming the agent is in the present state and performs a present action, and updates the policy according to an existing reinforcement learning algorithm. The minimization term penalizes an overall expected reward when a present state is out of distribution. The maximization term cancels the minimization term when the present state is an in-distribution state.
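The interplay of the minimization and maximization terms in the abstract above can be illustrated with a small tabular sketch. This is only an illustration under stated assumptions, not the patented method: the Q-table, the per-state weights `ood_weights`, and the helper names `state_value` and `pessimistic_penalty` are all hypothetical, and the abstract does not say how out-of-distribution states are detected, so the weights are simply given as input here.

```python
def state_value(q_row):
    """Greedy state value: the best Q-value over the available actions."""
    return max(q_row)


def pessimistic_penalty(q, sampled_states, data_states, ood_weights):
    """Hypothetical regularizer added to a Q-fitting loss.

    min_term: weighted mean value over sampled states; minimizing it pushes
              values down, most strongly where the weight is high (assumed
              to be the out-of-distribution states).
    max_term: mean value over states seen in the dataset; maximizing it
              offsets the push-down for in-distribution states.
    """
    total_weight = sum(ood_weights[s] for s in sampled_states)
    min_term = sum(ood_weights[s] * state_value(q[s]) for s in sampled_states) / total_weight
    max_term = sum(state_value(q[s]) for s in data_states) / len(data_states)
    return min_term - max_term


# Toy Q-table: states "a" and "b" are in the dataset, "c" is not.
q = {"a": [1.0, 2.0], "b": [0.0, 3.0], "c": [5.0, 4.0]}
data_states = ["a", "b"]

# Sampled states all in distribution, uniform weights: the two terms cancel.
in_dist_penalty = pessimistic_penalty(q, ["a", "b"], data_states, {"a": 1.0, "b": 1.0})

# An out-of-distribution state with a high weight and an optimistic value is penalized.
ood_penalty = pessimistic_penalty(
    q, ["a", "b", "c"], data_states, {"a": 0.1, "b": 0.1, "c": 0.8}
)
```

In a full training loop this penalty would presumably be scaled by a coefficient and added to a standard Bellman-error loss before the policy update; that coefficient and the loss form are not specified in the abstract.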
-
Publication No.: US20240336277A1
Publication Date: 2024-10-10
Application No.: US18439222
Filing Date: 2024-02-12
Inventor: Minglei Huang, Wei Zhan, Masayoshi Tomizuka, Chen Tang, Jinning Li
CPC classification number: B60W60/001, G06V20/56
Abstract: A method and system for controlling a device includes training a low-level policy and a low-level value function on a static data set of goal-conditioned episodes, forming a trained low-level policy and a trained goal-conditioned value function. A high-level goal planner is then trained to produce high-level goals with sub-goals corresponding to a plurality of future time steps, using the low-level value function to maximize a cumulative reward over the sub-goals so that the sub-goals are reachable by the low-level policy. The method further includes obtaining an observation of a device, generating an executable action using the low-level policy and the high-level goal planner, and operating the device with the executable action.
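The two-level structure described in the abstract above can be sketched on a toy one-dimensional problem. Everything here is an assumption made for illustration: the patent trains the policy, value function, and planner from data, whereas this sketch uses hand-written stand-ins (`goal_value`, `low_level_action`) and a brute-force planner (`plan_subgoals`) with a hypothetical reachability threshold `min_value`.

```python
from itertools import product


def goal_value(state, goal):
    """Stand-in for a goal-conditioned value function: negative steps to reach the goal."""
    return -abs(goal - state)


def low_level_action(state, goal):
    """Stand-in for a low-level policy: move one unit toward the goal."""
    if goal > state:
        return 1
    if goal < state:
        return -1
    return 0


def plan_subgoals(state, candidates, reward, horizon, min_value):
    """Pick the sub-goal sequence with the highest cumulative reward,
    keeping only sequences in which every hop is judged reachable
    (goal_value of the hop is at least min_value)."""
    best_plan, best_return = None, float("-inf")
    for plan in product(candidates, repeat=horizon):
        previous, feasible = state, True
        for goal in plan:
            if goal_value(previous, goal) < min_value:
                feasible = False
                break
            previous = goal
        if not feasible:
            continue
        total = sum(reward(goal) for goal in plan)
        if total > best_return:
            best_plan, best_return = plan, total
    return best_plan


# Start at position 0; reward is highest at position 5; each hop may span at most 2 units.
plan = plan_subgoals(
    state=0,
    candidates=range(7),
    reward=lambda goal: -abs(goal - 5),
    horizon=3,
    min_value=-2,
)
action = low_level_action(0, plan[0])  # executable action toward the first sub-goal
```

Exhaustive enumeration is used only to keep the sketch short; it grows exponentially with the horizon, which is one reason a learned planner would be preferred in practice.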
-