Publication No.: US20250060944A1
Publication Date: 2025-02-20
Application No.: US18449498
Filing Date: 2023-08-14
Applicant: Salesforce, Inc.
Inventor: Shruthan Radhakrishna, Hadi Minooei, Yazdan Jamshidi
Abstract: An automated data extraction pipeline for large language model (LLM) training may include extracting a set of code segments from a set of natural language question-answer (Q&A) combinations that each include a provided input, a provided output, and a provided code segment formatted to transform the provided input into the provided output. The data extraction pipeline may then generate a predicted output from a question portion of a first natural language Q&A combination using a first LLM. A first extracted code segment from the extracted set of code segments may then be executed to generate a first actual output of the first extracted code segment. One or more data samples may then be generated for training a second LLM based on a comparison of the first actual output to the predicted output. The second LLM may then be trained using the one or more data samples.
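The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the claimed implementation, and every name in it is hypothetical: the `<code>...</code>` delimiter for code segments, the convention that a segment reads its input from a variable `data` and writes to `result`, the `predict` callable standing in for the first LLM, and the `outputs_match` field recording the comparison. How the comparison actually shapes the training samples is not stated in the abstract, so the sketch only records it.

```python
import re
from typing import Callable, Optional

# Assumption: code segments in an answer are wrapped in <code>...</code> tags.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def extract_code(answer: str) -> Optional[str]:
    """Return the first <code>...</code> segment in an answer, or None."""
    match = CODE_RE.search(answer)
    return match.group(1).strip() if match else None


def run_segment(code: str, provided_input: object) -> object:
    """Execute a segment and return its actual output.

    Assumed convention: the segment reads `data` and assigns `result`.
    Untrusted code would need a sandbox in any real pipeline.
    """
    namespace = {"data": provided_input}
    exec(code, namespace)
    return namespace.get("result")


def build_sample(qa: dict, predict: Callable[[str], str]) -> Optional[dict]:
    """Build one candidate training sample from a Q&A combination."""
    code = extract_code(qa["answer"])
    if code is None:
        return None  # nothing to execute
    predicted = predict(qa["question"])  # stand-in for the first LLM
    try:
        actual = run_segment(code, qa["input"])
    except Exception:
        return None  # skip segments that fail to run
    return {
        "question": qa["question"],
        "code": code,
        "actual_output": actual,
        "predicted_output": predicted,
        # Assumed comparison: textual equality of the two outputs.
        "outputs_match": repr(actual) == predicted.strip(),
    }


# Usage with a stub predictor in place of a real model.
qa = {
    "question": "How do I double every number in [1, 2, 3]?",
    "answer": "Use a comprehension: <code>result = [x * 2 for x in data]</code>",
    "input": [1, 2, 3],
}
sample = build_sample(qa, predict=lambda question: "[2, 4, 6]")
```

Samples built this way would then feed the training of the second LLM; the selection or weighting rule applied to them is left open here because the abstract does not specify it.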