TOKENIZING DATA AND TRAINING LARGE CODE LANGUAGE MODELS
Abstract:
Disclosed herein are techniques for creating and using tokens representing portions of programming code. Techniques include identifying a first body of programming code associated with a hardware or software source attribute; associating a plurality of tokens with respective portions of the first body of programming code; configuring model input data for training a code language processing model customized in accordance with the hardware or software source attribute, the model input data comprising the plurality of tokens; and training, using the model input data, the code language processing model to analyze at least a part of the first body of programming code or a part of a second body of programming code, thus producing a customized and trained code language processing model in accordance with the hardware or software source attribute.
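The four steps named in the abstract (identify a body of code with a source attribute, associate tokens with its portions, configure model input, train) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the disclosure: the regex tokenizer, the `<src:...>` tag used to carry the source attribute, the attribute value `ecu_vendor_a`, the assembly-like sample code, and the helper names. A toy bigram counter stands in for the code language processing model, which the abstract does not specify.

```python
import re
from collections import Counter, defaultdict

# Hypothetical tokenizer: identifiers, integers, or any single non-space symbol.
TOKEN_RE = re.compile(r"[A-Za-z_]\w*|\d+|\S")


def tokenize(code):
    """Associate tokens with portions of a body of programming code."""
    return TOKEN_RE.findall(code)


def build_model_input(code, source_attribute):
    """Configure model input: the tokens, prefixed with an assumed tag
    that records the hardware or software source attribute."""
    return [f"<src:{source_attribute}>"] + tokenize(code)


def train_bigram(tokens):
    """Stand-in 'training': count which token follows which."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def predict_next(model, token):
    """Return the most frequent follower of `token`, or None if unseen."""
    followers = model.get(token)
    return followers.most_common(1)[0][0] if followers else None


# Hypothetical first body of code tied to a made-up source attribute.
first_body = "MOV R1, R2\nADD R1, R3\nMOV R4, R1"
model_input = build_model_input(first_body, "ecu_vendor_a")
model = train_bigram(model_input)

# Analyze part of a second body of code with the trained stand-in model.
second_body_tokens = tokenize("ADD")
print(predict_next(model, second_body_tokens[-1]))
```

In a real system the bigram counter would be replaced by a neural language model and the tokenizer by one fitted to the code in question; the sketch only shows how the source attribute could travel with the tokens from input configuration through training.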