Invention Application
- Patent Title: TOKENIZING DATA AND TRAINING LARGE CODE LANGUAGE MODELS
- Application No.: US18749448
- Application Date: 2024-06-20
- Publication No.: US20240427992A1
- Publication Date: 2024-12-26
- Inventor: Carmit Sahar, Daniel Yellin, Stojancho Ganchev, Zohar Fox
- Applicant: Aurora Labs Ltd.
- Applicant Address: IL Tel Aviv
- Assignee: Aurora Labs Ltd.
- Current Assignee: Aurora Labs Ltd.
- Current Assignee Address: IL Tel Aviv
- Main IPC: G06F40/284
- IPC: G06F40/284

Abstract:
Disclosed herein are techniques for creating and using tokens representing portions of programming code. Techniques include identifying a first body of programming code associated with a hardware or software source attribute; associating a plurality of tokens with respective portions of the first body of programming code; configuring model input data for training a code language processing model customized in accordance with the hardware or software source attribute, the model input data comprising the plurality of tokens; and training, using the model input data, the code language processing model to analyze at least a part of the first body of programming code or a part of a second body of programming code, thus producing a customized and trained code language processing model in accordance with the hardware or software source attribute.
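
The steps named in the abstract (select code by source attribute, associate tokens with portions of that code, assemble model input, train) can be illustrated with a toy sketch. This is not the method claimed in the application: the regex tokenizer, the bigram counter standing in for a code language processing model, the attribute labels `ecu_a` and `ecu_b`, and all function names are assumptions made purely for illustration.

```python
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer: identifiers, integer literals, or any single symbol.
TOKEN_PATTERN = re.compile(r"[A-Za-z_]\w*|\d+|\S")

def tokenize(code):
    """Associate tokens with portions of a body of code."""
    return TOKEN_PATTERN.findall(code)

def build_model_input(code_bodies, source_attribute):
    """Keep only bodies tagged with the given attribute and tokenize them."""
    return [tokenize(body) for attr, body in code_bodies if attr == source_attribute]

def train_bigram_model(token_sequences):
    """Count token-to-next-token transitions (a stand-in for real training)."""
    model = defaultdict(Counter)
    for tokens in token_sequences:
        for current, following in zip(tokens, tokens[1:]):
            model[current][following] += 1
    return model

def predict_next(model, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    followers = model.get(token)
    return followers.most_common(1)[0][0] if followers else None

# Hypothetical corpus: (source attribute, code) pairs.
corpus = [
    ("ecu_a", "int speed = read_sensor(1);"),
    ("ecu_a", "int rpm = read_sensor(2);"),
    ("ecu_b", "float temp = poll(3);"),
]

model_input = build_model_input(corpus, "ecu_a")  # only the ecu_a bodies
model = train_bigram_model(model_input)           # model customized to ecu_a
print(predict_next(model, "="))
```

Because the model is built only from the `ecu_a` bodies, it has never seen tokens that occur solely in `ecu_b` code, which is the sense in which the sketch is customized to one source attribute.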