Systems and methods for classifying data objects

    Publication No.: US11755626B1

    Publication date: 2023-09-12

    Application No.: US17390289

    Filing date: 2021-07-30

    Applicant: SPLUNK Inc.

    IPC classification: G06F16/28 G06F16/22 G06F16/93

    Abstract: A computer-implemented method is disclosed that includes operations of receiving a document to be classified, performing pre-processing operations on the document resulting in generation of a tokenized document, performing word embedding operations on the tokenized document resulting in generation of a vectorized document, performing text similarity operations on the vectorized document and each of one or more vectorized topics resulting in a set of one or more similarity scores, wherein a first similarity score indicates a level of similarity between the vectorized document and a first vectorized topic, and wherein each vectorized topic represents one of a predetermined set of topics, and classifying the document into one of the predetermined set of topics based on the set of one or more similarity scores. Performing the word embedding operations includes mapping each token of the remaining subset to a multi-dimensional vector, with each multi-dimensional vector representing a semantic meaning of a token.
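
The pipeline this abstract describes (tokenize, embed, score against topic vectors, pick a topic) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the patent: the toy 3-dimensional `EMBEDDINGS` table, the `STOP_WORDS` set, the `TOPIC_VECTORS`, averaging token vectors to form the document vector, and cosine similarity as the similarity score.

```python
import math
import re

# Hypothetical 3-dimensional embeddings; a real system would load trained vectors.
EMBEDDINGS = {
    "server": [0.9, 0.1, 0.0],
    "crash": [0.8, 0.2, 0.1],
    "invoice": [0.1, 0.9, 0.0],
    "payment": [0.0, 0.8, 0.2],
}

# Hypothetical stop-word list used during pre-processing.
STOP_WORDS = {"the", "a", "an", "after", "of", "for", "is"}

# Hypothetical vectorized topics, one per predetermined topic.
TOPIC_VECTORS = {
    "operations": [1.0, 0.0, 0.0],
    "billing": [0.0, 1.0, 0.0],
}


def tokenize(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def embed(tokens):
    """Map each known token to its vector and average them (an assumed pooling choice)."""
    vectors = [EMBEDDINGS[t] for t in tokens if t in EMBEDDINGS]
    if not vectors:
        raise ValueError("no known tokens to embed")
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(text):
    """Return the best-scoring topic and the full set of similarity scores."""
    doc_vec = embed(tokenize(text))
    scores = {topic: cosine(doc_vec, vec) for topic, vec in TOPIC_VECTORS.items()}
    best = max(scores, key=scores.get)
    return best, scores
```

Calling `classify("The server crash after payment")` returns the highest-scoring topic name together with a dictionary of per-topic scores, so the caller can also inspect how close the runner-up topics were.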

    Machine-learning techniques for evaluating suitability of candidate datasets for target applications

    Publication No.: US11704598B2

    Publication date: 2023-07-18

    Application No.: US17929394

    Filing date: 2022-09-02

    Applicant: Adobe Inc.

    IPC classification: G06N20/00 G06F16/22 G06F16/28

    Abstract: Techniques disclosed herein relate generally to evaluating and selecting candidate datasets for use by software applications, such as selecting candidate datasets for training machine-learning models used in software applications. Various machine-learning and other data science techniques are used to identify unique entities in a candidate dataset that are likely to be part of target entities for a software application. A merit attribute is then determined for the candidate dataset based on the number of unique entities that are likely to be part of the target entities, and weights associated with these unique entities. The merit attribute is used to identify the most efficient or most cost-effective candidate dataset for the software application.
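
One way to picture the merit attribute is as a weighted count of the unique entities judged likely to be target entities. The sketch below is an illustration under stated assumptions, not the patented computation: the `match_probability` lookup stands in for whatever model estimates the likelihood, and the threshold, the default weight of 1.0, and the merit-per-cost ranking are all hypothetical choices.

```python
def merit(candidate_entities, match_probability, weights, threshold=0.5):
    """Sum the weights of unique entities likely to be target entities.

    match_probability and weights are hypothetical lookups keyed by entity ID.
    """
    unique = set(candidate_entities)  # duplicates in the dataset count once
    likely = {e for e in unique if match_probability.get(e, 0.0) >= threshold}
    return sum(weights.get(e, 1.0) for e in likely)  # assumed default weight: 1.0


def best_dataset(candidates, match_probability, weights, costs):
    """Pick the candidate with the highest merit per unit cost (an assumed criterion)."""
    return max(
        candidates,
        key=lambda name: merit(candidates[name], match_probability, weights) / costs[name],
    )


# Illustrative data only.
candidates = {"A": ["u1", "u1", "u2"], "B": ["u2", "u3", "u4"]}
match_probability = {"u1": 0.9, "u2": 0.8, "u3": 0.2, "u4": 0.7}
weights = {"u1": 2.0, "u2": 1.0, "u4": 1.0}
costs = {"A": 3.0, "B": 1.0}
```

With this data, dataset A has the higher raw merit, but dataset B wins once cost is taken into account, which is the kind of trade-off a "most cost-effective" selection implies.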

    FINGERPRINT-BASED DATA CLASSIFICATION
    Invention publication

    Publication No.: US20230177071A1

    Publication date: 2023-06-08

    Application No.: US17541704

    Filing date: 2021-12-03

    IPC classification: G06F16/28 G06F16/22 G06N20/00

    Abstract: Systems and methods are provided for automated classification of data using fingerprints. In embodiments, a method includes: generating, by a computing device based on predetermined rules, a fingerprint of a data column in a data set to be classified, the fingerprint comprising dimensions, wherein each of the dimensions is assigned an attribute representing a characteristic of data in the data column; determining, by the computing device, that the fingerprint matches one or more target fingerprints by comparing the fingerprint to the target fingerprints, wherein each target fingerprint is associated with a class and includes dimensions, and each dimension is assigned an attribute representing a characteristic of data in the class; and assigning, by the computing device, one or more classes to the data column based on the one or more target fingerprints, thereby generating classified data.
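
The fingerprint idea can be sketched as a small dictionary of dimensions, each holding an attribute derived from the column by fixed rules, which is then compared with per-class target fingerprints. The three dimensions (`charset`, `length`, `contains_at`), the two target classes, and the exact-equality matching rule are all assumptions chosen for illustration; the abstract does not specify them.

```python
def fingerprint(values):
    """Build a rule-based fingerprint for one data column (hypothetical dimensions)."""
    values = [str(v) for v in values]
    lengths = {len(v) for v in values}
    if all(v.isdigit() for v in values):
        charset = "digits"
    elif all(v.isalpha() for v in values):
        charset = "letters"
    else:
        charset = "mixed"
    return {
        "charset": charset,
        "length": "fixed" if len(lengths) == 1 else "variable",
        "contains_at": all("@" in v for v in values),
    }


# Hypothetical target fingerprints, each associated with a class.
TARGETS = {
    "zip_code": {"charset": "digits", "length": "fixed", "contains_at": False},
    "email": {"charset": "mixed", "length": "variable", "contains_at": True},
}


def classify_column(values):
    """Return every class whose target fingerprint equals the column's fingerprint."""
    fp = fingerprint(values)
    return [cls for cls, target in TARGETS.items() if target == fp]
```

For example, `classify_column(["94105", "10001"])` matches the `zip_code` target, while a column of plain words matches neither target and yields an empty list. A practical matcher would likely tolerate partial matches rather than require equality on every dimension.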