-
公开(公告)号:US20240265690A1
公开(公告)日:2024-08-08
申请号:US18544840
申请日:2023-12-19
Applicant: NVIDIA Corporation
Inventor: Animashree Anandkumar , Linxi Fan , Zhiding Yu , Chaowei Xiao , Shikun Liu
CPC classification number: G06V10/82 , G06V10/811
Abstract: A vision-language model learns skills and domain knowledge via distinct and separate task-specific neural networks, referred to as experts. Each expert is independently optimized for a specific task, facilitating the use of domain-specific data and architectures that are not feasible with a single large neural network trained for multiple tasks. The vision-language model implemented as an ensemble of pre-trained experts and is more efficiently trained compared with the single large neural network. During training, the vision-language model integrates specialized skills and domain knowledge, rather than trying to simultaneously learn multiple tasks, resulting in effective multi-modal learning.