TOKENIZING DATA AND TRAINING LARGE CODE LANGUAGE MODELS

Invention Application

US20240427992A1 TOKENIZING DATA AND TRAINING LARGE CODE LANGUAGE MODELS (Legal status: Active)

Patent Title: TOKENIZING DATA AND TRAINING LARGE CODE LANGUAGE MODELS
Application No.: US18749448

Application Date: 2024-06-20
Publication No.: US20240427992A1

Publication Date: 2024-12-26
Inventor: Carmit Sahar, Daniel Yellin, Stojancho Ganchev, Zohar Fox
Applicant: Aurora Labs Ltd.
Applicant Address: IL Tel Aviv
Assignee: Aurora Labs Ltd.
Current Assignee: Aurora Labs Ltd.
Current Assignee Address: IL Tel Aviv
Main IPC: G06F40/284
IPC: G06F40/284

Abstract:

Disclosed herein are techniques for creating and using tokens representing portions of programming code. Techniques include identifying a first body of programming code associated with a hardware or software source attribute; associating a plurality of tokens with respective portions of the first body of programming code; configuring model input data for training a code language processing model customized in accordance with the hardware or software source attribute, the model input data comprising the plurality of tokens; and training, using the model input data, the code language processing model to analyze at least a part of the first body of programming code or a part of a second body of programming code, thus producing a customized and trained code language processing model in accordance with the hardware or software source attribute.
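The four steps named in the abstract (identify a body of code tied to a source attribute, associate tokens with its portions, configure model input data, train a model) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the application: the regex tokenizer, the `build_model_input` and `train_bigram` helper names, the `"vendor_x_ecu"` attribute value, and the use of a simple bigram counter as a stand-in for the code language processing model.

```python
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer: identifiers, integer literals, or any single
# non-space character (operators, punctuation).
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Split a body of programming code into a list of token strings."""
    return TOKEN_RE.findall(code)


def build_model_input(code, source_attribute):
    """Associate tokens with portions of the code and package them,
    together with the source attribute, as model input data."""
    vocab = {}
    token_ids = []
    for tok in tokenize(code):
        # Assign each distinct token the next free integer id.
        token_ids.append(vocab.setdefault(tok, len(vocab)))
    return {
        "source_attribute": source_attribute,
        "vocab": vocab,
        "token_ids": token_ids,
    }


def train_bigram(model_input):
    """Stand-in 'model': count which token id follows which."""
    counts = defaultdict(Counter)
    ids = model_input["token_ids"]
    for prev_id, next_id in zip(ids, ids[1:]):
        counts[prev_id][next_id] += 1
    return counts


def predict_next(model, model_input, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    vocab = model_input["vocab"]
    followers = model.get(vocab.get(token))
    if not followers:
        return None
    inverse = {i: t for t, i in vocab.items()}
    best_id, _ = followers.most_common(1)[0]
    return inverse[best_id]


# Hypothetical first body of code, tagged with an assumed source attribute.
first_body = "int a = 1; int b = 2;"
model_input = build_model_input(first_body, source_attribute="vendor_x_ecu")
model = train_bigram(model_input)
```

With this toy input, `predict_next(model, model_input, "a")` returns `"="`, since `=` is the only token observed after `a`. A real system would replace the bigram counter with a neural language model and would likely use a learned subword or syntax-aware tokenizer; the sketch only shows how the source attribute travels with the tokenized input so that training can be specialized per source.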

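IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models: see G06N)
G06F40/00	Handling natural language data (speech analysis or synthesis, speech recognition: see G10L)
G06F40/20	.Natural language analysis (semantic analysis of natural language: see G06F40/30)
G06F40/279	..Recognition of textual entities
G06F40/284	...Lexical analysis, e.g. tokenisation or collocates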