Abstract:
Techniques are disclosed for synchronizing a primary data system with an auxiliary data system that processes data for the primary data system. In one embodiment, how current the primary data system and the auxiliary data system are is determined. Requests sent from the primary data system that were not processed by the auxiliary data system are determined. The requests are resent to the auxiliary data system for processing.
Abstract:
Provided are techniques for partitioning a physical index into one or more physical partitions; assigning each of the one or more physical partitions to a node in a cluster of nodes; for each received document, assigning an assigned-doc-ID comprising an integer document identifier; and, in response to assigning the assigned-doc-ID to a document, determining a cut-off of assignment of new documents to a current virtual-index-epoch comprising a first set of physical partitions and placing the new documents into a new virtual-index-epoch comprising a second set of physical partitions by inserting each new document to a specific one of the physical partitions in the second set using one or more functions that direct the placement based on one of the assigned-doc-id, a field value derived from a set of fields obtained from the document, and a combination of the assigned-doc-id and the field value.
Abstract:
A dynamic Bloom filter comprises a cascaded set of Bloom filters. The system estimates or guesses a cardinality of input items, selects a number of hash functions based on the desired false positive rate, and allocates memory for an initial Bloom filter based on the estimated cardinality and desired false positive rate. The system inserts items into the initial Bloom filter and counts the bits set as they are inserted. If the number of bits set in the current Bloom filter reaches a predetermined target, the system declares the current Bloom filter full. The system recursively generates additional Bloom filters as needed for items remaining after the initial Bloom filter is filled; items are checked to eliminate duplicates. Each of the set of Bloom filters is individually queried to identify a positive or negative in response to a query. When the system is configured such that the false positive rate of each successive Bloom filter is decreased by one half, the system guarantees a false positive rate of at most twice the desired false positive rate.
Abstract:
Provided are techniques for partitioning a physical index into one or more physical partitions; assigning each of the one or more physical partitions to a node in a cluster of nodes; for each received document, assigning an assigned-doc-ID comprising an integer document identifier; and, in response to assigning the assigned-doc-ID to a document, determining a cut-off of assignment of new documents to a current virtual-index-epoch comprising a first set of physical partitions and placing the new documents into a new virtual-index-epoch comprising a second set of physical partitions by inserting each new document to a specific one of the physical partitions in the second set using one or more functions that direct the placement based on one of the assigned-doc-id, a field value derived from a set of fields obtained from the document, and a combination of the assigned-doc-id and the field value.
Abstract:
Systems, methods and computer program products are disclosed for managing availability of a slave components executing in a distributed system. A master server and a heartbeat protocol may be provided. The heartbeat protocol may define state machines for the master server and/or the slave components. The master server and/or the slave components may act in accordance with the state machines to manage availability of the slave components in the distributed system.
Abstract:
A cache-aware Bloom filter system segments a bit vector of a cache-aware Bloom filter into fixed-size blocks. The system hashes an item to be inserted into the cache-aware Bloom filter to identify one of the fixed-size blocks as a selected block for receiving the item and hashes the item k times to generate k hashed values for encoding the item for insertion in the in the selected block. The system sets bits within the selected block with addresses corresponding to the k hashed values such that accessing the item in the cache-aware Bloom filter requires accessing only the selected block to check the k hashed values. The size of the fixed-size block corresponds to a cache-line size of an associated computer architecture on which the cache-aware Bloom filter is installed.
Abstract:
A cache-aware Bloom filter system segments a bit vector of a cache-aware Bloom filter into fixed-size blocks. The system hashes an item to be inserted into the cache-aware Bloom filter to identify one of the fixed-size blocks as a selected block for receiving the item and hashes the item k times to generate k hashed values for encoding the item for insertion in the in the selected block. The system sets bits within the selected block with addresses corresponding to the k hashed values such that accessing the item in the cache-aware Bloom filter requires accessing only the selected block to check the k hashed values. The size of the fixed-size block corresponds to a cache-line size of an associated computer architecture on which the cache-aware Bloom filter is installed.
Abstract:
In a data mining system, data is gathered into a data store using, e.g., a Web crawler. The data is classified into entities. Data miners use rules to process the entities and append respective keys to the entities representing characteristics of the entities as derived from expert rules embodied in the miners. With these keys, characteristics of entities as defined by disparate expert authors of the data miners are identified for use in responding to complex data requests from customers.
Abstract:
A space-efficient system and method for generating an approximate &phgr;-quantile data element of a data set in a single pass over the data set, without a priori knowledge of the size of the data set. The approximate &phgr;-quantile is guaranteed to lie within a user-specified approximation error &egr; of the true quantile being sought with a probability of at least 1−&dgr;, with &dgr; being a user-defined probability of failure. B buffers, each having a capacity of k elements, initially are filled with elements from the data set, with the values of b and k depending on approximation error e and the probability &dgr;. The buffers are then collapsed into an output buffer, with the remaining buffers then being refilled with elements, collapsed (along with the previous output buffer), and so on until the entire data set has been processed and a single output remains. The element of the output corresponding to the approximate quantile is then output as the approximate quantile. In later iterations (when the height of the tree is at least equal to a predetermined height that depends on &dgr; and &egr;), the data is sampled non-uniformly to populate the buffers to render the desired performance. Parallel processors can be used, with the final output buffers of the processors being sent to a collecting processor P0 as input buffers to the collecting processor P0.
Abstract:
In a data mining system, data is gathered into a data store using, e.g., a Web crawler. The data is classified into entities. Data miners use rules to process the entities and append respective keys to the entities representing characteristics of the entities as derived from rules embodied in the miners. With these keys, characteristics of entities as defined by disparate expert authors of the data miners are identified for use in responding to complex data requests from customers.