AUTOMATED DATA EXTRACTION PIPELINE FOR LARGE LANGUAGE MODEL TRAINING

    Publication No.: US20250060944A1

    Publication Date: 2025-02-20

    Application No.: US18449498

    Filing Date: 2023-08-14

    Abstract: An automated data extraction pipeline for large language model (LLM) training may include extracting a set of code segments from a set of natural language question-answer (Q&A) combinations that each include a provided input, a provided output, and a provided code segment formatted to transform the provided input into the provided output. The data extraction pipeline may then generate a predicted output from a question portion of a first natural language Q&A combination using a first LLM. A first extracted code segment from the extracted set of code segments may then be executed to generate a first actual output of the first extracted code segment. One or more data samples may then be generated for training a second LLM based on a comparison of the first actual output to the predicted output. The second LLM may then be trained using the one or more data samples.
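The flow described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the patent: the `QAPair` type, the `<code>...</code>` delimiters around the code segment, the `predict_output` callable standing in for the first LLM, and the rule of keeping a sample only when the actual output matches the predicted output. The abstract states only that samples are generated "based on a comparison" of the two outputs.

```python
import contextlib
import io
import re
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class QAPair:
    """Hypothetical container for one natural language Q&A combination."""
    question: str  # question text, including the provided input and output
    answer: str    # answer text containing the provided code segment


# Assumption: code segments are wrapped in <code>...</code> tags.
CODE_RE = re.compile(r"<code>(.*?)</code>", re.DOTALL)


def extract_code(answer: str) -> Optional[str]:
    """Return the first code segment in the answer, or None if there is none."""
    match = CODE_RE.search(answer)
    return match.group(1).strip() if match else None


def run_code(code: str) -> str:
    """Execute a code segment and return whatever it printed to stdout.

    Note: exec() is NOT a sandbox. A real pipeline would need an isolated
    environment and a timeout before running untrusted code.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()


def build_samples(
    pairs: List[QAPair],
    predict_output: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Build training samples for the second LLM.

    predict_output stands in for the first LLM: it maps a question to the
    output it predicts. Assumed rule: keep a sample only when the executed
    code's actual output equals the predicted output.
    """
    samples = []
    for pair in pairs:
        code = extract_code(pair.answer)
        if code is None:
            continue  # no code segment to verify
        predicted = predict_output(pair.question).strip()
        try:
            actual = run_code(code)
        except Exception:
            continue  # code that fails to run yields no sample
        if actual == predicted:
            samples.append(
                {"prompt": pair.question, "completion": code, "output": actual}
            )
    return samples
```

In this sketch, `build_samples(pairs, predict_output)` returns a list of dictionaries that could serve as fine-tuning records for the second LLM. Other comparison policies, such as keeping mismatches as negative examples, would fit the abstract's wording equally well.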
