-
Publication No.: US20210397959A1
Publication Date: 2021-12-23
Application No.: US17354991
Filing Date: 2021-06-22
Applicant: Google LLC
Inventor: Olivier Claude Pietquin, Léonard Hussenot Desenonges, Robert Dadashi-Tazehozi, Matthieu Florent Geist
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a neural network used to select actions to be performed by an agent interacting with an environment, where the actions cause the environment to transition between states. One of the methods includes: obtaining a transition generated as a result of the agent interacting with the environment; processing a bonus input using a bonus estimation neural network to generate an exploration bonus estimate that encourages the agent to explore the environment in accordance with an expert exploration strategy, i.e., the strategy that would be adopted by an expert agent; generating a modified reward from the reward included in the transition and the exploration bonus estimate; and determining an update to the current parameter values of the neural network to optimize a reinforcement learning objective function that maximizes the returns received by the agent, computed with respect to the modified reward.
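The core step the abstract describes is shaping the environment reward with a learned exploration bonus before running a standard RL update. Below is a minimal PyTorch sketch of that step; the names (BonusNet, alpha), the network shape, and the additive combination are illustrative assumptions, not the patent's exact formulation.

```python
import torch
import torch.nn as nn

class BonusNet(nn.Module):
    """Bonus estimation network: maps a transition's (observation, action)
    pair to a scalar exploration-bonus estimate."""
    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([obs, act], dim=-1)).squeeze(-1)

def modified_reward(reward: torch.Tensor, obs: torch.Tensor,
                    act: torch.Tensor, bonus_net: BonusNet,
                    alpha: float = 0.1) -> torch.Tensor:
    """Combine the transition's reward with the exploration bonus estimate.

    The additive mix and the weight `alpha` are assumptions; the abstract
    only states that a modified reward is generated from the two quantities.
    """
    with torch.no_grad():
        bonus = bonus_net(obs, act)
    return reward + alpha * bonus
```

The action-selection network is then updated with any standard RL objective (e.g., a TD loss), using `modified_reward` in place of the raw environment reward.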
-
Publication No.: US20250124256A1
Publication Date: 2025-04-17
Application No.: US18486792
Filing Date: 2023-10-13
Applicant: Google LLC
IPC: G06N3/0455, G06N3/092
Abstract: An example method is provided for training a student machine-learned sequence processing model, the method comprising: obtaining a respective input; obtaining, from the student machine-learned sequence processing model, a respective output corresponding to the respective input; generating a multiscale refinement objective configured to jointly distill knowledge from a teacher machine-learned sequence processing model and reinforce preferred behavior of the student machine-learned sequence processing model, wherein the multiscale refinement objective comprises: a first component based on a divergence metric characterizing, for the respective input, a comparison of a plurality of predictions of the student machine-learned sequence processing model to a plurality of predictions of the teacher machine-learned sequence processing model; and a second component based on a reinforcement learning signal associated with the respective output; and updating the student machine-learned sequence processing model based on the multiscale refinement objective.
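One way to read the two components is as a token-level distillation divergence plus a sequence-level reinforcement term. The PyTorch sketch below instantiates that reading with a KL divergence and a REINFORCE-style surrogate; both estimator choices, the function name, and the weight `beta` are assumptions for illustration, since the abstract specifies neither.

```python
import torch
import torch.nn.functional as F

def multiscale_refinement_loss(student_logits: torch.Tensor,
                               teacher_logits: torch.Tensor,
                               output_log_prob: torch.Tensor,
                               reward: torch.Tensor,
                               beta: float = 1.0) -> torch.Tensor:
    """Joint distillation + reinforcement objective (illustrative).

    student_logits, teacher_logits: [batch, seq, vocab] per-token predictions.
    output_log_prob: [batch] log-probability of the sampled output under
        the student model.
    reward: [batch] reinforcement learning signal for each output.
    """
    # First component: a divergence metric comparing the student's per-token
    # predictions to the (frozen) teacher's (KL chosen as one concrete metric).
    kl = F.kl_div(
        F.log_softmax(student_logits, dim=-1),
        F.softmax(teacher_logits.detach(), dim=-1),
        reduction="batchmean",
    )
    # Second component: a REINFORCE-style surrogate that reinforces
    # preferred behavior (higher-reward outputs become more likely).
    rl_term = -(reward * output_log_prob).mean()
    return kl + beta * rl_term
```

The student is then updated by taking gradient steps on this combined objective, exactly as for any other differentiable loss.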
-
Publication No.: US20230093451A1
Publication Date: 2023-03-23
Application No.: US17947985
Filing Date: 2022-09-19
Applicant: Google LLC
Inventor: Robert Dadashi-Tazehozi, Olivier Claude Pietquin, Léonard Hussenot Desenonges, Matthieu Florent Geist, Anton Raichuk, Damien Vincent, Sertan Girgin
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for controlling agents. In particular, an agent can be controlled using a discretization neural network that generates a state-dependent discretization of an original action space and a policy neural network that selects an action from the state-dependent discretization rather than from the original action space.
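A minimal sketch of the two-network arrangement follows: a discretization network proposes K candidate actions for the current state, and a policy network selects among those candidates instead of searching the original continuous space. The network shapes, the value of K, and the categorical selection are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DiscretizationNet(nn.Module):
    """Maps a state to K candidate actions: a state-dependent
    discretization of the original continuous action space."""
    def __init__(self, obs_dim: int, act_dim: int, k: int = 8):
        super().__init__()
        self.k, self.act_dim = k, act_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, k * act_dim),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).view(-1, self.k, self.act_dim)  # [B, K, act_dim]

class PolicyNet(nn.Module):
    """Scores the K state-dependent candidates; the agent acts with a
    sample from the resulting distribution over the discrete set."""
    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, obs: torch.Tensor, candidates: torch.Tensor) -> torch.Tensor:
        b, k, _ = candidates.shape
        obs_rep = obs.unsqueeze(1).expand(-1, k, -1)        # [B, K, obs_dim]
        scores = self.net(torch.cat([obs_rep, candidates], dim=-1)).squeeze(-1)
        return torch.distributions.Categorical(logits=scores).sample()  # [B]
```

At act time: `candidates = disc_net(obs)`, `idx = policy(obs, candidates)`, `action = candidates[torch.arange(len(obs)), idx]`; the policy's choice is over K candidates rather than the full continuous space.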
-
Publication No.: US20210390409A1
Publication Date: 2021-12-16
Application No.: US17347264
Filing Date: 2021-06-14
Applicant: Google LLC
Inventor: Matthieu Florent Geist, Nino Vieillard, Olivier Claude Pietquin
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for training a neural network used to select actions to be performed by an agent interacting with an environment, where the actions cause the environment to transition between states. One of the methods includes training the neural network on one or more transitions selected from a replay memory, including: generating, using the neural network, an action selection output for the current observation; determining, based on the action selection output and the current action performed by the agent in response to the current observation, a state-action target for the current observation; determining a gradient of a temporal difference (TD) loss function with respect to the parameters of the neural network, wherein the TD loss function comprises a first term that depends on the state-action target for the current observation; and adjusting the current parameter values of the neural network based on the gradient.
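A state-action target that combines the network's own action-selection output with the action actually taken has the shape of a Munchausen-style augmented target, a line of work published by these inventors; the PyTorch sketch below shows one such TD loss for a discrete action space under that assumption. The log-policy bonus, temperature `tau`, and weight `alpha` are illustrative, not a claim about the patent's exact formulation.

```python
import torch
import torch.nn.functional as F

def munchausen_style_td_loss(q_net, target_q_net, batch,
                             gamma: float = 0.99, tau: float = 0.03,
                             alpha: float = 0.9) -> torch.Tensor:
    """TD loss whose first term depends on a state-action target built from
    the network's action-selection output (illustrative)."""
    obs, act, rew, next_obs, done = batch  # tensors from the replay memory
    with torch.no_grad():
        # Action-selection output for the current observation: a softmax
        # policy derived from the Q-values.
        log_pi = F.log_softmax(q_net(obs) / tau, dim=-1)
        # State-action target: the reward is augmented with a term that
        # depends on the current action the agent actually performed.
        log_pi_a = log_pi.gather(1, act.unsqueeze(1)).squeeze(1)
        next_q = target_q_net(next_obs)
        next_pi = F.softmax(next_q / tau, dim=-1)
        next_log_pi = F.log_softmax(next_q / tau, dim=-1)
        soft_v = (next_pi * (next_q - tau * next_log_pi)).sum(-1)
        target = rew + alpha * tau * log_pi_a + gamma * (1.0 - done) * soft_v
    q_sa = q_net(obs).gather(1, act.unsqueeze(1)).squeeze(1)
    # The gradient of this loss w.r.t. the network parameters drives the
    # adjustment of the current parameter values.
    return F.mse_loss(q_sa, target)
```

Published Munchausen formulations additionally clip the log-policy bonus for numerical stability; that detail is omitted here for brevity.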