CLUSTERING NUMERICAL VALUES USING LOGARITHMIC BINNING

    Publication No.: US20240202214A1

    Publication Date: 2024-06-20

    Application No.: US18067770

    Filing Date: 2022-12-19

    Inventor: Rajesh Bordawekar

    IPC Classification: G06F16/28 G06F16/242

    CPC Classification: G06F16/285 G06F16/2433

    Abstract: Clustering data points of a relational database having special data types is performed by establishing logarithmic bins in which the data is collected. Special data types include (i) zero; (ii) positive and negative values; (iii) infinity (positive and negative); (iv) not-a-number values (NaNs); (v) out-of-range values; and (vi) IEEE DECFloat (decimal floating-point) values. The numerical data points are mapped to bins according to their values and redistributed among the bins based on the median bin value. An occupancy-based partitioning process ensures that each bin holds no more than a predefined threshold percentage of the data. Assigning data bins to clusters facilitates predicting which cluster an input value falls into when responding to database queries.
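
A minimal Python sketch of the binning idea described in this abstract, not the patented method itself. The function names (`log_bin`, `split_overfull`, `cluster`), the base-10 bins, the bin labels, and the 50% default threshold are all assumptions made for illustration. Out-of-range and DECFloat handling are omitted, and splitting an over-full bin at its median is only one plausible reading of the redistribution and occupancy-based partitioning steps.

```python
import math
from collections import defaultdict


def log_bin(x: float) -> str:
    """Return a bin label for x; special values get dedicated bins."""
    if math.isnan(x):
        return "nan"
    if math.isinf(x):
        return "+inf" if x > 0 else "-inf"
    if x == 0:
        return "zero"
    sign = "pos" if x > 0 else "neg"
    # Decade of the magnitude: 5 -> 0, 50 -> 1, 0.02 -> -2.
    exponent = math.floor(math.log10(abs(x)))
    return f"{sign}:1e{exponent}"


def split_overfull(bins, total, max_fraction=0.5):
    """Split any ordinary bin holding more than max_fraction of the data."""
    out = {}
    for label, values in bins.items():
        ordinary = label.startswith(("pos", "neg"))
        overfull = len(values) / total > max_fraction
        if ordinary and overfull and len(set(values)) > 1:
            ordered = sorted(values)
            mid = len(ordered) // 2  # assumed split point: the median position
            out[label + ":lo"] = ordered[:mid]
            out[label + ":hi"] = ordered[mid:]
        else:
            out[label] = values
    return out


def cluster(values, max_fraction=0.5):
    """Map values to logarithmic bins, then split over-occupied bins once."""
    bins = defaultdict(list)
    for v in values:
        bins[log_bin(v)].append(v)
    return split_overfull(dict(bins), len(values), max_fraction)
```

For example, `cluster([1, 2, 3, 4, 50, 0.0])` places four of the six values in the same decade, so that bin exceeds the assumed 50% threshold and is split into a lower and an upper half.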

    SCALABLE COUNT BASED INTERPRETABILITY FOR DATABASE ARTIFICIAL INTELLIGENCE (AI)

    Publication No.: US20240045866A1

    Publication Date: 2024-02-08

    Application No.: US17817428

    Filing Date: 2022-08-04

    Abstract: Systems, computer-implemented methods, or computer program products facilitate receiving the results of a semantic structured query language (SQL) query and employing sparse hash-table-based sketches to interpret those results. A computing component stores a first space-efficient data structure sketch in a compressed, serialized form. The computing component can load a second space-efficient data structure sketch along with the first space-efficient data structure sketch and can compute one or more interpretability scores by extracting co-occurrence information from the first space-efficient data structure sketch. The second space-efficient data structure sketch can include a sketch for containment checks.
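
As a rough illustration of the count-based idea in this abstract, the Python sketch below keeps pair co-occurrence counts in a sparse hash table, stores them in compressed serialized form, and uses a plain set as the containment check. All function names, the `a|b` key encoding (which assumes tokens contain no `|`), the JSON-plus-zlib storage format, and the use of a raw count as the score are assumptions; the abstract does not specify any of them.

```python
import json
import zlib
from collections import Counter
from itertools import combinations


def build_cooccurrence_sketch(rows):
    """Count how often each unordered pair of tokens shares a row."""
    counts = Counter()
    for row in rows:
        for a, b in combinations(sorted(set(row)), 2):
            counts[f"{a}|{b}"] += 1  # only observed pairs are stored (sparse)
    return counts


def serialize(sketch) -> bytes:
    """Store the sketch in a compressed serialized form."""
    return zlib.compress(json.dumps(sketch, sort_keys=True).encode("utf-8"))


def deserialize(blob: bytes):
    """Load a sketch previously produced by serialize()."""
    return Counter(json.loads(zlib.decompress(blob).decode("utf-8")))


def interpretability_score(sketch, vocabulary, query_token, result_token):
    """Return the pair's co-occurrence count, or 0 if a token is unknown."""
    # The vocabulary set plays the role of the containment-check sketch.
    if query_token not in vocabulary or result_token not in vocabulary:
        return 0
    a, b = sorted((query_token, result_token))
    return sketch.get(f"{a}|{b}", 0)
```

A production system would more likely use a probabilistic structure such as a Bloom filter for the containment check; a set is used here only to keep the sketch short.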

    Comparing time series data using context-based similarity

    Publication No.: US11244224B2

    Publication Date: 2022-02-08

    Application No.: US15926109

    Filing Date: 2018-03-20

    Abstract: A first observation window in a first time series is identified. The first observation window is preceded by a first portion of the first time series. A neural network is trained using the first portion of the first time series and the first observation window, and weights are extracted from the middle layers of the neural network. A first feature vector is generated based on the weights. A second observation window in a second time series is identified, where the second observation window is preceded by a first portion of the second time series. A second feature vector associated with the second observation window is determined. The second feature vector is based at least in part on the first set of weights. A similarity between the first and second observation windows is determined based on comparing the first feature vector and the second feature vector.

    BUILDING A WORD EMBEDDING MODEL TO CAPTURE RELATIONAL DATA SEMANTICS

    Publication No.: US20210124724A1

    Publication Date: 2021-04-29

    Application No.: US16665364

    Filing Date: 2019-10-28

    Inventor: Rajesh Bordawekar

    Abstract: A computer-implemented method according to one embodiment includes identifying a relational database; determining columns of interest within the relational database; creating an unordered group of string tokens for each row of the relational database, utilizing the determined columns of interest; assigning weights for one or more columns within the relational database to one or more string tokens within each unordered group of string tokens to create a plurality of weighted unordered groups of string tokens; and determining a meaning vector for an identifier of each row of the relational database, utilizing the plurality of weighted unordered groups of string tokens.
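
The preprocessing steps in this abstract (columns of interest, unordered token groups, per-column weights) can be sketched in Python as follows. This covers only token creation and weighting, not the training of meaning vectors. The function name `row_to_weighted_tokens`, the `column_value` token format, and the default weight of 1.0 are assumptions for illustration.

```python
def row_to_weighted_tokens(row, columns_of_interest, column_weights):
    """Turn one row into an unordered group of weighted string tokens.

    row: dict mapping column name to value.
    columns_of_interest: columns that should contribute tokens.
    column_weights: dict mapping column name to weight (default 1.0).
    """
    tokens = {}
    for col in columns_of_interest:
        value = row.get(col)
        if value is None:
            continue  # missing values contribute no token
        # Prefix with the column name so equal values in different
        # columns stay distinct tokens.
        token = f"{col}_{str(value).strip().replace(' ', '_')}"
        tokens[token] = column_weights.get(col, 1.0)
    return tokens
```

A dict is used because the group is unordered: only token membership and weight matter, not position. The resulting weighted groups could then be passed to an embedding trainer that accepts per-token weights.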

    RECORD CORRECTION AND COMPLETION USING DATA SOURCED FROM CONTEXTUALLY SIMILAR RECORDS

    Publication No.: US20200159853A1

    Publication Date: 2020-05-21

    Application No.: US16197137

    Filing Date: 2018-11-20

    IPC Classification: G06F17/30 G06F17/27

    Abstract: From a first attribute-value pair in a record, new data comprising a first token is created. From each token, new data including a corresponding vector is computed using a processor and a memory. From the record, a target row is selected, wherein a target attribute-value pair in the target row includes a value requiring correction. Using a similarity measure, a set of most similar rows to the target row is determined, wherein each row in the set of most similar rows to the target row has a corresponding similarity measure above a threshold similarity measure and wherein each row in the set of most similar rows includes the target attribute. From values corresponding to the target attribute in the set of most similar rows, a replacement value is determined. The value requiring correction in the target row is replaced with the replacement value.
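
The correction flow in this abstract can be approximated with the short Python sketch below. It deliberately swaps in Jaccard similarity over attribute-value tokens in place of the vector-based similarity the abstract describes, and it picks the replacement by majority vote. Both choices, along with the function names and the 0.5 threshold, are assumptions rather than the patented method.

```python
from collections import Counter


def tokens(row, skip):
    """Build 'attribute=value' tokens for a row, ignoring one attribute."""
    return {f"{k}={v}" for k, v in row.items() if k != skip and v is not None}


def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_replacement(rows, target_index, target_attr, threshold=0.5):
    """Suggest a value for target_attr from rows similar to the target row."""
    target_tokens = tokens(rows[target_index], target_attr)
    votes = Counter()
    for i, row in enumerate(rows):
        # Skip the target itself and rows lacking the target attribute.
        if i == target_index or row.get(target_attr) is None:
            continue
        if jaccard(target_tokens, tokens(row, target_attr)) > threshold:
            votes[row[target_attr]] += 1
    # Majority vote among sufficiently similar rows; None if there are none.
    return votes.most_common(1)[0][0] if votes else None
```

The target attribute is excluded from the similarity computation so that the value being corrected cannot influence which rows count as similar.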

    Provisioning service requests in a computer system

    Publication No.: US10217053B2

    Publication Date: 2019-02-26

    Application No.: US14747062

    Filing Date: 2015-06-23

    Abstract: Disclosed is a system, computer program product, and method for provisioning a new service request. The computer-implemented method begins with receiving a new service request for computational resources in a computing system. The required computational resources are memory usage, storage usage, processor usage, or a combination thereof to fulfill the new service request. Next, a sandbox computing environment is used to operate the new service request. The sandbox computing environment is used to isolate the computing system. The sandbox computing environment produces current computational resource usage data to fulfill the new service request in the sandbox computing environment. The current sandbox computational resource usage data and historical computational resource usage data are both used by a machine learning module to create a prediction of the computational resources that will be required in the computing system to fulfill the new service request.

    Parallelized in-place radix sorting

    Publication No.: US09892149B2

    Publication Date: 2018-02-13

    Application No.: US14750363

    Filing Date: 2015-06-25

    IPC Classification: G06F17/30

    Abstract: Methods for sorting a data set. A data storage is divided into a plurality of buckets that are each associated with a respective key value. A plurality of stripes is identified in each bucket. At least one data stripe set is defined that has one stripe within each respective bucket. An in-place partial bucket radix sort is performed on data items contained within one data stripe set with a first processor using an initial radix. Incorrectly sorted data items are then grouped in each bucket into a respective incorrect data item group within each bucket. A radix sort is then performed using the initial radix on the items within the respective incorrect data item group. A first level sorted output is produced.
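
For orientation, the Python sketch below shows a sequential in-place most-significant-digit radix sort: the bucket permutation step that this abstract builds on. It does not reproduce the patent's stripes, multiple processors, or the separate repair pass over incorrectly sorted items. The function name, the 8-bit radix, and the restriction to non-negative integers below 2**32 are assumptions.

```python
def radix_sort_inplace(items, lo=0, hi=None, shift=24, radix_bits=8):
    """Sort items[lo:hi] in place by MSD radix sort.

    Assumes non-negative integers below 2 ** (shift + radix_bits),
    which is 2 ** 32 with the default arguments.
    """
    if hi is None:
        hi = len(items)
    if hi - lo < 2 or shift < 0:
        return
    num_buckets = 1 << radix_bits
    mask = num_buckets - 1

    # Count how many items fall into each bucket for the current digit.
    counts = [0] * num_buckets
    for i in range(lo, hi):
        counts[(items[i] >> shift) & mask] += 1

    # starts[b] is the first index of bucket b; ends[b] is one past its last.
    starts = [lo] * num_buckets
    for b in range(1, num_buckets):
        starts[b] = starts[b - 1] + counts[b - 1]
    ends = [s + c for s, c in zip(starts, counts)]

    # Swap each misplaced item into the next free slot of its own bucket.
    next_free = list(starts)
    for b in range(num_buckets):
        while next_free[b] < ends[b]:
            value = items[next_free[b]]
            dest = (value >> shift) & mask
            if dest == b:
                next_free[b] += 1
            else:
                j = next_free[dest]
                items[next_free[b]], items[j] = items[j], value
                next_free[dest] += 1

    # Recurse into each bucket on the next, less significant digit.
    for b in range(num_buckets):
        radix_sort_inplace(items, starts[b], ends[b],
                           shift - radix_bits, radix_bits)
```

Because the in-place permutation is not stable, the sketch recurses from the most significant digit downward instead of running least-significant-digit passes.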

    INTERPRETATION OF RESULTS OF A SEMANTIC QUERY OVER A STRUCTURED DATABASE

    Publication No.: US20220269686A1

    Publication Date: 2022-08-25

    Application No.: US17184303

    Filing Date: 2021-02-24

    Abstract: Systems, computer-implemented methods and/or computer program products to facilitate interpretation of a result of execution of a query over a structured database are provided. According to an embodiment, a system can comprise a memory that stores computer executable components and a processor that executes the computer executable components stored in the memory. The computer executable components can comprise a determination component that determines a result of execution of a query over a structured database. The computer executable components also can comprise an interpretation component that interprets data underlying the result of execution of the query to determine one or more reasons that the result is provided in response to the query.

    COMMUNICATION-EFFICIENT DATA PARALLEL ENSEMBLE BOOSTING

    Publication No.: US20220180253A1

    Publication Date: 2022-06-09

    Application No.: US17114644

    Filing Date: 2020-12-08

    IPC Classification: G06N20/20 G06F9/52 G06N5/00

    Abstract: Data-parallel ensemble training using gradient boosted trees includes training an ensemble of trees. The training includes splitting a training dataset into several data portions. Each data portion is assigned to a respective thread group from a set of thread groups. The training further includes executing a stage in which each thread group, in parallel, trains a respective ensemble of decision trees. Executing the stage includes performing, by each thread group, in parallel, machine learning operations for the respective ensemble of decision trees using the data portion assigned to that thread group. Further, each thread group validates, in parallel, the respective ensemble of decision trees using a data portion assigned to another thread group. Execution of the stage is repeated until a predetermined threshold is satisfied. Further, a prediction is inferred using the ensemble of decision trees that is formed from the respective ensembles of trees from each of the thread groups.