MULTI-STAGE MACHINE LEARNING MODEL TRAINING FOR KEY-VALUE EXTRACTION

    Publication Number: US20240221407A1

    Publication Date: 2024-07-04

    Application Number: US18149795

    Application Date: 2023-01-04

    Abstract: Techniques for multi-stage training of a machine learning model to extract key-value pairs from documents are disclosed. A system first trains a machine learning model on a set of training data that includes unlabeled documents of various document categories; this initial stage identifies relationships among tokens (words, numbers, and punctuation) in the documents and sets the model's parameters to an initial state. The system then re-trains the model on a set of training data that includes a particular category of documents while excluding other categories. This second stage is supervised: the training data is labeled to identify key-value pairs in the documents, and the system modifies the model's parameters based on the characteristics of the training data set drawn from the particular category.
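
    The two-stage flow can be pictured with a minimal Python sketch on invented data: a TF-IDF vectorizer stands in for the unsupervised first stage (learning token statistics across all document categories and fixing the initial parameters), and a logistic-regression classifier stands in for the supervised second stage on a single category. None of this reflects the patent's actual architecture.

        # Minimal two-stage sketch; documents, labels, and models are
        # illustrative stand-ins.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression

        # Stage 1: unlabeled documents spanning several categories.
        unlabeled_docs = [
            "Invoice No. 1234 Total: $56.00",
            "Purchase Order 9876 Qty: 3",
            "Receipt #555 Amount Due: $12.50",
        ]
        vectorizer = TfidfVectorizer().fit(unlabeled_docs)  # initial parameters

        # Stage 2: labeled lines from one category only (invoices); labels
        # mark whether a line carries a key-value pair of interest.
        invoice_lines = ["Invoice No. 1234", "Thank you for your business",
                         "Total: $56.00", "Page 1 of 1"]
        labels = [1, 0, 1, 0]

        X = vectorizer.transform(invoice_lines)    # reuse stage-1 parameters
        clf = LogisticRegression().fit(X, labels)  # adjust with supervision
        print(clf.predict(vectorizer.transform(["Amount Due: $12.50"])))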

    GENERATING SEMANTICALLY REPETITION-FREE LLM TEXT

    Publication Number: US20250094687A1

    Publication Date: 2025-03-20

    Application Number: US18758441

    Application Date: 2024-06-28

    Abstract: Techniques for generating repetition-free text using a large language model (LLM) are provided. In one technique, textual content that was generated by an LLM is accessed, where the textual content comprises a plurality of sub-components including a first sub-component and a second sub-component. A first embedding that represents the first sub-component is generated and a second embedding that represents the second sub-component is generated. Based on a similarity between the first embedding and the second embedding, it is determined whether the second sub-component is repetitious with respect to the first sub-component. In response to determining that the second sub-component is repetitious with respect to the first sub-component, at least a portion of the second sub-component is removed from the textual content.
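
    A minimal sketch of the similarity test follows, assuming sentence-level sub-components and a toy bag-of-words embedding in place of a learned sentence encoder; the 0.8 threshold is an arbitrary illustration.

        import re
        from collections import Counter

        import numpy as np

        def tokens(s: str) -> list[str]:
            return re.findall(r"[a-z]+", s.lower())

        def embed(s: str, vocab: list[str]) -> np.ndarray:
            # Toy embedding: term counts over a shared vocabulary.
            counts = Counter(tokens(s))
            return np.array([counts[w] for w in vocab], dtype=float)

        def cosine(a: np.ndarray, b: np.ndarray) -> float:
            denom = np.linalg.norm(a) * np.linalg.norm(b)
            return float(a @ b / denom) if denom else 0.0

        def drop_repetitions(parts: list[str], threshold: float = 0.8) -> list[str]:
            vocab = sorted({w for p in parts for w in tokens(p)})
            kept: list[str] = []
            for part in parts:
                e = embed(part, vocab)
                # Keep a sub-component only if it is not close to any kept one.
                if all(cosine(e, embed(k, vocab)) < threshold for k in kept):
                    kept.append(part)
            return kept

        llm_text = ["The report is finished.",
                    "The report is now finished.",
                    "Next steps are listed below."]
        print(drop_repetitions(llm_text))  # the near-duplicate is removed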

    NARRATIVE POINT OF VIEW MODIFICATION FOR CONTENT GENERATED BY A MACHINE-LEARNED MODEL

    Publication Number: US20250094686A1

    Publication Date: 2025-03-20

    Application Number: US18758321

    Application Date: 2024-06-28

    Abstract: Techniques for modifying a narrative point of view for content generated by a machine-learned model, such as a large language model (LLM), are provided. In one technique, a first textual content that was generated by an LLM is accessed. A narrative point of view (NPOV) detection operation is performed on a first portion of the first textual content to identify a first NPOV corresponding to the first portion of the first textual content. Based on an output, of the NPOV detection operation, that indicates that the first NPOV does not meet one or more NPOV criteria, the first portion of the first textual content is modified to generate a modified textual content. The modified textual content is submitted to the LLM, causing the LLM to generate a second textual content.
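
    The control flow might look like the sketch below, assuming a crude pronoun-based NPOV detector and a hypothetical call_llm() stand-in for the real model.

        import re

        def detect_npov(text: str) -> str:
            # Crude proxy: first-person pronouns imply a first-person narrative.
            return "first" if re.search(r"\b(i|we|my|our)\b", text.lower()) else "third"

        def call_llm(prompt: str) -> str:
            # Hypothetical stand-in for the real LLM call.
            return "The team completed the migration ahead of schedule."

        first_text = "I completed the migration ahead of schedule."
        required_npov = "third"

        if detect_npov(first_text) != required_npov:  # NPOV criterion not met
            prompt = f"Rewrite in the {required_npov} person: {first_text}"
            second_text = call_llm(prompt)            # resubmit to the LLM
            print(second_text)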

    PSEUDO-LABELLING BASED BOOTSTRAPPING FOR SEMI SUPERVISED LEARNING

    Publication Number: US20250068983A1

    Publication Date: 2025-02-27

    Application Number: US18237234

    Application Date: 2023-08-23

    Abstract: In some implementations, the techniques may include receiving an accuracy target for one or more machine learning models and training the models on a labeled training set. The techniques may include, until the accuracy of the models satisfies the accuracy target: sampling a set of unlabeled data to obtain a random training set of unlabeled data; labeling the random training set using the models to produce a pseudo-labeled training set; correcting the labels on a random subset of the pseudo-labeled training set; training the models on the labeled training set, the corrected random subset, and the pseudo-labeled training set; and evaluating the accuracy of the models using an evaluation set of labeled data. The one or more models can be deployed based at least in part on the models satisfying the accuracy target.
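
    The loop reads naturally as code. The sketch below runs the bootstrap on synthetic scikit-learn data; the correction step is simulated with held-back true labels where a real system would route that random subset to annotators, and the sample sizes and 0.90 target are invented.

        import numpy as np
        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression

        X, y = make_classification(n_samples=1200, n_informative=5, random_state=0)
        X_lab, y_lab = X[:100], y[:100]               # labeled training set
        X_unlab, y_hidden = X[100:1000], y[100:1000]  # unlabeled pool
        X_eval, y_eval = X[1000:], y[1000:]           # labeled evaluation set

        target = 0.90
        rng = np.random.default_rng(0)
        model = LogisticRegression().fit(X_lab, y_lab)

        for _ in range(20):                            # bounded bootstrap rounds
            if model.score(X_eval, y_eval) >= target:  # accuracy target reached
                break
            pick = rng.choice(len(X_unlab), size=200, replace=False)
            pseudo = model.predict(X_unlab[pick])      # pseudo-label the sample
            fix = rng.choice(200, size=40, replace=False)
            pseudo[fix] = y_hidden[pick][fix]          # correct a random subset
            model = LogisticRegression().fit(
                np.vstack([X_lab, X_unlab[pick]]),
                np.concatenate([y_lab, pseudo]))

        print("deploy" if model.score(X_eval, y_eval) >= target else "keep training")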

    LAYOUT AWARE MULTI-MODAL NETWORKS FOR DOCUMENT UNDERSTANDING

    Publication Number: US20240420496A1

    Publication Date: 2024-12-19

    Application Number: US18210498

    Application Date: 2023-06-15

    Abstract: Techniques for layout-aware multi-modal networks for document understanding are provided. In one technique, word data representations that were generated based on words extracted from an image of a document are identified. Based on the image, table features of one or more tables in the document are determined, and one or more table data representations generated based on those table features are identified. The word data representations and the table data representations are input into a machine-learned model to generate a document data representation for the document, and a task is performed based on the document data representation. In a related technique, layout data representations take the place of the table data representations: a set of layout features of the document is determined based on the image, and one or more layout data representations generated from those features are identified and input into the machine-learned model.
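
    A minimal fusion sketch follows, with random vectors standing in for the word and table data representations and a single linear projection standing in for the multi-modal network; all shapes and the mean-pooling choice are illustrative assumptions.

        import numpy as np

        rng = np.random.default_rng(0)
        word_reps = rng.normal(size=(12, 64))  # one vector per extracted word
        table_reps = rng.normal(size=(2, 64))  # one vector per detected table
        # (the related technique would pass layout-feature vectors instead)

        inputs = np.vstack([word_reps, table_reps])  # joint input sequence
        W = rng.normal(size=(64, 32))                # stand-in "network"
        doc_rep = np.tanh(inputs @ W).mean(axis=0)   # document representation

        # Downstream task head, e.g., document-category scores.
        task_logits = doc_rep @ rng.normal(size=(32, 5))
        print("predicted category:", int(task_logits.argmax()))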

    GENERATING SYNTHETIC TRAINING DATA INCLUDING DOCUMENT IMAGES WITH KEY-VALUE PAIRS

    Publication Number: US20240177511A1

    Publication Date: 2024-05-30

    Application Number: US18058982

    Application Date: 2022-11-28

    CPC classification number: G06V30/19147 G06V30/153 G06V30/41

    Abstract: Automated techniques are disclosed for generating a large volume of diverse training data that can be used to train machine learning (ML) models to extract key-value (KV) pairs from document images. Given a single input document image and associated annotation data, a synthetic data generation system automatically generates a large number of diverse synthetic training datapoints, each including a synthetic document image and associated annotation data. The generated synthetic training datapoints can be used to train and improve the performance of ML models for extracting KV pairs from document images. In certain implementations, multiple synthetic datapoints are generated by varying the values associated with a key for a content item within the input document image.
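
    The value-variation idea can be sketched in a few lines, assuming a hypothetical annotation format; a real pipeline would also re-render the synthetic document image at the recorded bounding box.

        import random

        # Hypothetical annotation for one content item in the input image.
        base = {"key": "Invoice No.", "value": "1234", "bbox": [40, 20, 120, 34]}

        def synthesize(base: dict, n: int) -> list[dict]:
            points = []
            for _ in range(n):
                new_value = str(random.randint(1000, 9999))  # vary the value
                points.append({**base, "value": new_value})
            return points

        for datapoint in synthesize(base, 3):
            print(datapoint)  # one synthetic training datapoint per variation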

    RESPONDING TO HALLUCINATIONS IN GENERATIVE LARGE LANGUAGE MODELS

    Publication Number: US20250094866A1

    Publication Date: 2025-03-20

    Application Number: US18678914

    Application Date: 2024-05-30

    Abstract: Techniques are provided for correcting hallucinations produced by generative large language models (LLMs). In one technique, a computing system accesses first output generated by an LLM. The computing system identifies, within the first output, a plurality of assertions. The computing system determines that a first assertion in the plurality of assertions is false. The computing system generates a prompt that indicates that the first assertion is false. The computing system submits the prompt as input to the LLM. The computing system then accesses second output generated by the LLM, where the second output includes a second assertion that is different from the first assertion and corresponds to it.
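
    The re-prompting loop might look like the following sketch, in which call_llm() and is_false() are hypothetical stand-ins for the real model and for assertion verification.

        def call_llm(prompt: str) -> str:
            # Hypothetical LLM that returns a corrected statement on re-prompt.
            return "The Eiffel Tower is in Paris."

        def is_false(assertion: str) -> bool:
            # Stand-in verifier; a real system would consult a trusted source.
            return "Berlin" in assertion

        first_output = "The Eiffel Tower is in Berlin. It opened in 1889."
        assertions = [s.strip() + "." for s in first_output.split(".") if s.strip()]

        for assertion in assertions:
            if is_false(assertion):
                # Tell the LLM which assertion is false and resubmit.
                prompt = f"The statement '{assertion}' is false. Restate it correctly."
                second_output = call_llm(prompt)
                print("corrected:", second_output)
                break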

    FINE-TUNING A LARGE LANGUAGE MODEL (LLM) TO REDUCE THE INSTABILITY OF LLM OUTPUTS TO VARIATIONS IN PROMPTS

    Publication Number: US20250094814A1

    Publication Date: 2025-03-20

    Application Number: US18824570

    Application Date: 2024-09-04

    Abstract: Techniques are provided for fine-tuning large language models (LLMs) to reduce the instability of LLM outputs to variations in prompts. In one technique, a plurality of prompts is stored. For each prompt of the plurality of prompts, a plurality of variants of that prompt is generated. A prompt-generating LLM is fine-tuned based on that prompt and the plurality of variants. Each variant-prompt association (where the variant is generated based on the prompt and has an identical or similar meaning) is a training sample that is used to train or fine-tune the prompt-generating LLM, which is configured to generate standardized prompts based on input prompts. In another technique, a response-generating LLM is fine-tuned based on sets of training samples, each training sample in a set comprising a different variant of a prompt and a response that the response-generating LLM generated based on the prompt.
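
    Construction of the variant-prompt training pairs can be sketched as follows, with make_variants() as a hypothetical paraphraser standing in for an LLM; the resulting pairs would feed a fine-tuning job for the prompt-generating LLM.

        def make_variants(prompt: str) -> list[str]:
            # Hypothetical paraphraser; a real system would use an LLM here.
            return [prompt.lower(),
                    prompt + " Please.",
                    "Could you " + prompt[0].lower() + prompt[1:]]

        stored_prompts = ["Summarize the attached report.",
                          "List three risks in this plan."]

        training_samples = []
        for prompt in stored_prompts:
            for variant in make_variants(prompt):
                # Each variant maps back to its canonical prompt, teaching a
                # prompt-generating LLM to emit standardized prompts.
                training_samples.append({"input": variant, "target": prompt})

        print(len(training_samples), "fine-tuning samples")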

    ENSURING THAT LANGUAGE MODELS FOLLOW INSTRUCTIONS INDICATED IN PROMPTS

    Publication Number: US20250094865A1

    Publication Date: 2025-03-20

    Application Number: US18629917

    Application Date: 2024-04-08

    Abstract: Techniques for ensuring that language models follow instructions indicated in prompts are provided. In one technique, a first language model generates a response based on a prompt. A set of instructions in the prompt is identified. For each instruction in the set, a second language model determines whether the response indicates that the first language model followed the instruction. In another technique, for each prompt of a plurality of prompts: (1) a first language model generates a response based on the prompt; (2) multiple instructions are identified based on the prompt; (3) a second language model generates, based on the multiple instructions, an output that indicates whether the first language model followed each instruction; and (4) the prompt, the response, and the multiple instructions are stored as a training instance. The first language model is fine-tuned based on the training instances.
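
    A toy version of the two-model check is sketched below; call_llm() and call_judge() are hypothetical stand-ins, with a containment heuristic in place of the second language model.

        def call_llm(prompt: str) -> str:
            # Hypothetical response-generating model.
            return "1. apples 2. pears 3. plums"

        def call_judge(instruction: str, response: str) -> bool:
            # Hypothetical judge model; a heuristic stands in for it here.
            if "three" in instruction.lower():
                return len(response.split(".")) > 3
            if "numbered" in instruction.lower():
                return response.lstrip().startswith("1.")
            return True

        prompt = "List three fruits. Use a numbered list."
        response = call_llm(prompt)
        instructions = [s.strip() + "." for s in prompt.split(".") if s.strip()]

        if all(call_judge(i, response) for i in instructions):
            # Keep (prompt, response, instructions) as a fine-tuning instance.
            training_instance = {"prompt": prompt, "response": response,
                                 "instructions": instructions}
            print(training_instance)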

    LANGUAGE MODEL SUMMARIZATION USING SEMANTICAL CLUSTERING

    Publication Number: US20250094716A1

    Publication Date: 2025-03-20

    Application Number: US18657308

    Application Date: 2024-05-07

    Abstract: Techniques for language model (LM) summarization using semantical clustering are provided. In one technique, a plurality of concepts reflected in text data is identified. A plurality of concept clusters is generated based on similarity among the plurality of concepts. Thus, some concept clusters may include multiple concepts. For each concept cluster of the plurality of concept clusters, an LM generates a summary of the text corresponding to that concept cluster. A summary response of the text data is generated by aggregating the summary of each concept cluster of the plurality of concept clusters. In another technique, an LM generates a summary based on text data. A first set of concepts reflected in the summary is identified and a second set of concepts reflected in the text data is identified. A difference between the two sets may indicate that the summary is missing one or more concepts.
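
    The cluster-then-summarize flow is sketched below, with capitalized terms standing in for extracted concepts and a first-sentence stub standing in for the LM summarizer; a real system would cluster embeddings and call the model. The final line shows the second technique's coverage check.

        import re
        from collections import defaultdict

        def concepts(text: str) -> set[str]:
            # Toy concept extractor: capitalized terms stand in for concepts.
            return set(re.findall(r"\b[A-Z][a-z]+\b", text))

        def summarize(text: str) -> str:
            # Hypothetical LM summary stub: keep the first sentence.
            return text.split(".")[0] + "."

        document = ("Revenue grew in Europe. Europe added two offices. "
                    "Hiring slowed in Asia. Asia margins held steady.")
        sentences = [s.strip() + "." for s in document.split(".") if s.strip()]

        clusters = defaultdict(list)  # group sentences sharing a concept
        for s in sentences:
            cs = concepts(s)
            clusters[sorted(cs)[0] if cs else "misc"].append(s)

        summary = " ".join(summarize(" ".join(g)) for g in clusters.values())
        print(summary)

        # Coverage check: concepts in the source but absent from the summary.
        print("missing concepts:", concepts(document) - concepts(summary))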
