Abstract:
Techniques for deploying a vector index on multiple nodes of a cluster are provided. In one technique, an instruction is received to create a vector index on a set of vectors that is stored in a vector database that is connected to the multiple nodes. In response, an HNSW index is created based on the set of vectors and the HNSW index is stored on each node. In response to receiving a vector query, a node processes the vector query against its copy of the HNSW index. In another technique, each node retrieves, from a vector database, a respective subset of a set of vectors and generates, based on the respective subset, a respective HNSW index. A vector query is transmitted to each node, which traverses its HNSW index to generate results of the vector query. The results from each node are combined to generate final results.
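The second technique (one index per subset, results combined) can be sketched as follows. This is a minimal illustration under stated assumptions: each node's HNSW traversal is replaced by an exact scan over its subset, and the names `shard_search` and `distributed_search` are hypothetical, not taken from the abstract.

```python
import heapq
import math

def shard_search(shard, query, k):
    """Return the k nearest (distance, vector_id) pairs held by one node.

    Assumption: an exact scan stands in for traversing the node's HNSW index.
    """
    scored = [(math.dist(vec, query), vid) for vid, vec in shard.items()]
    return heapq.nsmallest(k, scored)

def distributed_search(shards, query, k):
    """Send the query to every node, then combine the partial results."""
    partials = [shard_search(shard, query, k) for shard in shards]
    # The final top-k is the k smallest distances across all partial lists.
    return heapq.nsmallest(k, (hit for part in partials for hit in part))

# Two nodes, each holding a subset of the vectors.
shards = [
    {"a": (0.0, 0.0), "b": (5.0, 5.0)},
    {"c": (1.0, 0.0), "d": (9.0, 9.0)},
]
top2 = distributed_search(shards, (0.0, 0.0), 2)
```

Because every node returns its own k best candidates, the merged list cannot miss a vector that belongs in the global top k, provided each node's local search is exact.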
Abstract:
Techniques are provided for optimizing workload performance by automatically discovering and implementing performance optimizations for in-memory units (IMUs). A system maintains a set of IMUs for processing database operations in a database. The system obtains database workload information for the database and filters that information to identify database operations that may benefit from performance optimizations. The system analyzes the database operations to identify a set of performance optimizations and ranks the performance optimizations based on their potential benefit. The system selects a subset of the performance optimizations, based on their ranking, and generates new versions of IMUs that reflect the performance optimizations. The system performs verification tests on the new versions of IMUs and analyzes the tests to determine whether the new versions of IMUs yield expected performance benefits. The system then categorizes the new versions of IMUs into a first set of IMUs to be retained and a second set of IMUs to be discarded. The system then makes the first set of IMUs available to the current workload and discards the second set of IMUs.
Abstract:
The present invention relates to join acceleration. In an embodiment, a computer receives a request for a relational join of build data rows with probe data rows. Based on the request for the relational join, a particular kind of data map from many kinds of data map that can implement the relational join is dynamically selected. Based on the build data rows, an instance of the particular kind of data map is populated. A response is sent for the request for the relational join that is based on the probe data rows and the instance of the particular kind of data map.
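The select-populate-probe flow can be sketched as follows. The abstract does not say which kinds of data map exist or how one is chosen, so the two kinds here (a directly indexed array and a hash map) and the density rule in `choose_map_kind` are assumptions made purely for illustration.

```python
def choose_map_kind(build_keys):
    """Hypothetical rule: small, dense, non-negative integer keys get an array."""
    if build_keys and all(isinstance(k, int) and k >= 0 for k in build_keys):
        if max(build_keys) < 4 * len(build_keys):
            return "array"
    return "hash"

def build_map(kind, build_rows):
    """Populate an instance of the chosen kind of map from the build rows."""
    if kind == "array":
        table = [[] for _ in range(max(k for k, _ in build_rows) + 1)]
        for key, payload in build_rows:
            table[key].append(payload)
        return table
    table = {}
    for key, payload in build_rows:
        table.setdefault(key, []).append(payload)
    return table

def lookup(kind, table, key):
    """Return the build payloads matching one probe key."""
    if kind == "array":
        in_range = isinstance(key, int) and 0 <= key < len(table)
        return table[key] if in_range else []
    return table.get(key, [])

def join(build_rows, probe_rows):
    """Equi-join of (key, payload) build rows with (key, payload) probe rows."""
    kind = choose_map_kind([k for k, _ in build_rows])
    table = build_map(kind, build_rows)
    return [(key, b, p) for key, p in probe_rows for b in lookup(kind, table, key)]

joined = join([(1, "x"), (2, "y"), (2, "z")], [(2, "p"), (3, "q")])
```

Both kinds of map give the same join result; the selection only changes how the lookup is performed.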
Abstract:
Herein are techniques that concurrently populate entries in a compressed sparse row (CSR) encoding of a type of edge of a heterogeneous graph. In an embodiment, a computer obtains a mapping of a relational schema to a graph data model. The relational schema defines vertex tables that correspond to vertex types in the graph data model, and edge tables that correspond to edge types in the graph data model. Each edge type is associated with a source vertex type and a target vertex type. For each vertex type, a sequence of persistent identifiers of vertices is obtained. Based on the mapping and for a CSR representation of each edge type, a source array is populated that, for a same vertex ordering as the sequence of persistent identifiers for the source vertex type, is based on counts of edges of the edge type that originate from vertices of the source vertex type. For the CSR, the computer populates, in parallel and based on said mapping, a destination array that contains canonical offsets as sequence positions within the sequence of persistent identifiers of the vertices.
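A small sequential sketch may help make the two CSR arrays concrete for a single edge type. It assumes the source array stores running totals (prefix sums) of the per-vertex edge counts, which is the usual CSR layout but is not spelled out in the abstract. The function name `build_csr` and the sample identifiers are hypothetical, and the parallel population described above is not attempted here.

```python
def build_csr(source_ids, target_ids, edges):
    """Build (source_array, destination_array) for one edge type.

    source_ids / target_ids: persistent vertex identifiers in sequence order.
    edges: (source_id, target_id) pairs, as rows of an edge table.
    """
    src_pos = {pid: i for i, pid in enumerate(source_ids)}
    dst_pos = {pid: i for i, pid in enumerate(target_ids)}

    # Count the edges leaving each source vertex, then prefix-sum the counts.
    source_array = [0] * (len(source_ids) + 1)
    for src, _ in edges:
        source_array[src_pos[src] + 1] += 1
    for i in range(len(source_ids)):
        source_array[i + 1] += source_array[i]

    # Write each edge's target position into its source vertex's slice.
    destination_array = [0] * len(edges)
    cursor = source_array[:-1]  # next free slot per source vertex (a copy)
    for src, dst in edges:
        i = src_pos[src]
        destination_array[cursor[i]] = dst_pos[dst]
        cursor[i] += 1
    return source_array, destination_array

source_array, destination_array = build_csr(
    ["p10", "p20", "p30"],
    ["c7", "c8"],
    [("p20", "c8"), ("p10", "c7"), ("p20", "c7")],
)
```

The edges of the vertex at sequence position `i` are `destination_array[source_array[i]:source_array[i + 1]]`, and each entry is a position within the target type's identifier sequence rather than a persistent identifier.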
Abstract:
Herein are techniques for dynamic aggregation of results of a database request, including concurrent grouping of result items in memory based on quasi-dense keys. Each of many computational threads concurrently performs as follows. A hash code is calculated that represents a particular natural grouping key (NGK) for an aggregate result of a database request. Based on the hash code, the thread detects that a set of distinct NGKs that are already stored in the aggregate result does not contain the particular NGK. A distinct dense grouping key for the particular NGK is statefully generated. The dense grouping key is bound to the particular NGK. Based on said binding, the particular NGK is added to the set of distinct NGKs in the aggregate result.
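The binding of natural grouping keys to dense grouping keys can be sketched as follows. The abstract does not describe how threads synchronize, so this sketch assumes a single coarse lock. The class name `DenseKeyAggregator` and the choice of SUM as the aggregate are also hypothetical. Python's `dict` computes the hash code of the NGK internally.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class DenseKeyAggregator:
    """Maps each distinct NGK to a dense integer key and sums values per key."""

    def __init__(self):
        self._lock = threading.Lock()  # assumption: one coarse lock
        self.dense_of = {}             # NGK -> dense grouping key
        self.sums = []                 # aggregate state, indexed by dense key

    def add(self, ngk, value):
        with self._lock:
            dense = self.dense_of.get(ngk)
            if dense is None:
                # NGK not seen yet: generate the next dense key and bind it.
                dense = len(self.sums)
                self.dense_of[ngk] = dense
                self.sums.append(0)
            self.sums[dense] += value

    def result(self):
        return {ngk: self.sums[d] for ngk, d in self.dense_of.items()}

rows = [("red", 1), ("blue", 2), ("red", 3), ("green", 4), ("blue", 5)] * 100
agg = DenseKeyAggregator()
with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(lambda row: agg.add(*row), rows))
totals = agg.result()
```

Because dense keys are handed out consecutively from zero, the aggregate state fits in a plain array with no gaps, whatever the natural keys look like.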
Abstract:
Methods and apparatuses for determining set-membership using Single Instruction Multiple Data (“SIMD”) architecture are presented herein. Specifically, methods and apparatuses are discussed for determining, in parallel, whether multiple values in a first set of values are members of a second set of values. Many of the methods and systems discussed herein are applied to determining whether one or more rows in a dictionary-encoded column of a database table satisfy one or more conditions based on the dictionary-encoded column. However, the methods and systems discussed herein may apply to many applications executed on a SIMD processor using set-membership tests.
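The dictionary-encoded case can be illustrated with a scalar sketch. Plain Python does not issue SIMD instructions, so this shows only the membership logic that a SIMD implementation would apply to many codes at once: the set of qualifying dictionary codes is packed into a bitmap, and each row's code is tested against it. The function names and sample data are hypothetical.

```python
def matching_codes(dictionary, predicate_values):
    """Pack the dictionary codes whose values satisfy the predicate into a bitmap."""
    wanted = set(predicate_values)
    bitmap = 0
    for code, value in enumerate(dictionary):
        if value in wanted:
            bitmap |= 1 << code
    return bitmap

def filter_rows(column_codes, bitmap):
    """Test each row's code against the bitmap (one row at a time here)."""
    return [(bitmap >> code) & 1 == 1 for code in column_codes]

dictionary = ["CA", "NY", "TX", "WA"]   # code i stands for dictionary[i]
column_codes = [0, 2, 2, 1, 3, 0]       # the dictionary-encoded column
bitmap = matching_codes(dictionary, ["CA", "WA"])
hits = filter_rows(column_codes, bitmap)
```

The condition is evaluated once per dictionary entry rather than once per row, which leaves a cheap bit test as the per-row work.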
Abstract:
Techniques are described herein for maintaining two copies of the same semi-structured data, where each copy is organized in a different format. One copy is in a first-format that may be convenient for storage, but inefficient for query processing. For example, the first-format may be a textual format that needs to be parsed every time a query needs to access individual data items within a semi-structured object. The database system intelligently loads semi-structured first-format data into volatile memory and, while doing so, converts the semi-structured first-format data to a second-format. Because the data in volatile memory is in the second-format, processing queries against the second-format data both avoids disk I/O and increases the efficiency of the queries themselves. For example, the parsing that may be necessary to run a query against a cached copy of the first-format data is avoided.
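As a rough illustration, assume the first-format is JSON text and the second-format is an already parsed in-memory object. The class `InMemoryStore` and its methods are hypothetical. The abstract does not name either format.

```python
import json

class InMemoryStore:
    """Keeps stored text (first-format) plus a parsed copy (second-format)."""

    def __init__(self, stored_text):
        self._text = stored_text   # doc_id -> JSON text, as persisted
        self._parsed = {}          # doc_id -> parsed object held in memory
        self.parse_count = 0       # how many times text had to be parsed

    def _parse(self, doc_id):
        self.parse_count += 1
        return json.loads(self._text[doc_id])

    def load(self, doc_id):
        """Convert once, at load time, from first-format to second-format."""
        self._parsed[doc_id] = self._parse(doc_id)

    def get_field(self, doc_id, field):
        doc = self._parsed.get(doc_id)
        if doc is None:
            doc = self._parse(doc_id)  # not loaded: parse on every access
        return doc[field]

store = InMemoryStore({"d1": '{"name": "Ada", "age": 36}'})
store.load("d1")
name = store.get_field("d1", "name")
age = store.get_field("d1", "age")
```

After `load`, any number of field accesses reuse the parsed copy, so `parse_count` stays at 1.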
Abstract:
Techniques for automatically selecting a type of vector index are provided. In one technique, in response to determining to generate a vector index based on a base table that stores a plurality of vectors, a number of the plurality of vectors is identified. Based at least on the number of the plurality of vectors, a particular type of vector index is identified from among a plurality of types of vector indexes. Examples of the plurality of types include an HNSW index and an IVF index. A vector index of the particular type is generated for the base table. Another criterion in identifying a type of vector index to generate is the number of neighbors that is a parameter in generating a certain type of vector index.
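A selection rule of this kind might look like the sketch below. The abstract names only the number of vectors and the number of neighbors as criteria. The memory-budget comparison, the `dims` parameter, and the size estimate (float32 vectors plus roughly two 4-byte links per neighbor) are assumptions added here to make the example concrete.

```python
def choose_index_type(num_vectors, num_neighbors, dims, memory_budget_bytes):
    """Hypothetical rule: use HNSW if its estimated size fits in memory, else IVF."""
    bytes_per_vector = dims * 4           # assumption: float32 components
    bytes_per_links = num_neighbors * 2 * 4  # assumption: ~2 links per neighbor
    estimated_hnsw_bytes = num_vectors * (bytes_per_vector + bytes_per_links)
    return "HNSW" if estimated_hnsw_bytes <= memory_budget_bytes else "IVF"

small = choose_index_type(1_000, 32, 128, 10**9)
large = choose_index_type(10**9, 32, 128, 10**9)
```

Raising either the vector count or the neighbor count increases the estimate, so both can tip the choice from the graph-based index to the partition-based one.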
Abstract:
Techniques for processing top-K queries are provided. In one technique, a database statement is received that requests top-K results related to a database object and that indicates two columns thereof: a first column by which to partition a result set and a second column by which to order the result set. A buffer is generated. For each of multiple rows in the database object: a first key value that is associated with a first value in the first column of said each row is identified; a second key value that is associated with a second value in the second column of said each row is identified; a slot in the buffer is identified based on the first key value and the second key value; and the slot in the buffer may be updated based on the second key value. A response to the database statement is generated based on the buffer.
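A simplified sketch of a buffer-based top-K per partition follows. It does not reproduce the slot-addressing scheme of the abstract. Instead, it assumes the buffer holds one bounded min-heap per partition value, updated only when a row's ordering value beats the current K-th best. The function name and sample rows are hypothetical.

```python
import heapq
from collections import defaultdict

def top_k_per_partition(rows, k):
    """rows: (partition_value, order_value) pairs; returns the k largest per partition."""
    buffer = defaultdict(list)  # partition value -> min-heap of at most k values
    for part, order_val in rows:
        heap = buffer[part]
        if len(heap) < k:
            heapq.heappush(heap, order_val)
        elif order_val > heap[0]:
            # Beats the smallest retained value: replace it.
            heapq.heapreplace(heap, order_val)
        # Otherwise the row cannot be in the top k and the buffer is untouched.
    return {part: sorted(heap, reverse=True) for part, heap in buffer.items()}

rows = [("east", 10), ("west", 7), ("east", 30),
        ("east", 20), ("west", 9), ("west", 1)]
result = top_k_per_partition(rows, 2)
```

The buffer never holds more than K values per partition, so memory stays bounded by the number of partitions times K regardless of how many rows are scanned.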