Index partition maintenance over monotonically addressed document sequences
    2.
    Granted patent
    Index partition maintenance over monotonically addressed document sequences (status: in force)

    Publication number: US08738673B2

    Publication date: 2014-05-27

    Application number: US12875615

    Application date: 2010-09-03

    CPC classification number: G06F17/30233 G06F17/30584 G06F17/30631

    Abstract: Provided are techniques for partitioning a physical index into one or more physical partitions; assigning each of the one or more physical partitions to a node in a cluster of nodes; for each received document, assigning an assigned-doc-ID comprising an integer document identifier; and, in response to assigning the assigned-doc-ID to a document, determining a cut-off of assignment of new documents to a current virtual-index-epoch comprising a first set of physical partitions and placing the new documents into a new virtual-index-epoch comprising a second set of physical partitions by inserting each new document to a specific one of the physical partitions in the second set using one or more functions that direct the placement based on one of the assigned-doc-id, a field value derived from a set of fields obtained from the document, and a combination of the assigned-doc-id and the field value.
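The epoch cut-off described in this abstract can be sketched as follows. This is a minimal illustration, not the patented implementation: the class names `Epoch` and `VirtualIndex`, the partition labels, and the modulo placement function are all assumptions made for the example. The abstract only says that placement is directed by functions of the assigned-doc-ID, a field value, or both.

```python
from bisect import bisect_right
from dataclasses import dataclass, field


@dataclass
class Epoch:
    start_doc_id: int       # first assigned-doc-ID covered by this epoch
    partitions: list        # physical partitions that make up the epoch


@dataclass
class VirtualIndex:
    epochs: list = field(default_factory=list)
    next_doc_id: int = 0    # monotonically increasing integer document ID

    def start_epoch(self, partitions):
        # Cut-off: every ID assigned from now on belongs to the new epoch.
        self.epochs.append(Epoch(self.next_doc_id, list(partitions)))

    def epoch_for(self, doc_id):
        # Epochs are ordered by start ID, so a binary search finds the owner.
        starts = [e.start_doc_id for e in self.epochs]
        return self.epochs[bisect_right(starts, doc_id) - 1]

    def partition_for(self, doc_id):
        # Hypothetical placement function: doc ID modulo the partition count.
        epoch = self.epoch_for(doc_id)
        return epoch.partitions[doc_id % len(epoch.partitions)]

    def insert(self):
        # Assign the next ID and report where the document is placed.
        doc_id = self.next_doc_id
        self.next_doc_id += 1
        return doc_id, self.partition_for(doc_id)
```

Because the doc ID alone determines the epoch, a lookup for an old document still resolves to the partition set that was current when it was inserted, even after new partitions have been added.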


    Generating and using a dynamic bloom filter
    3.
    Granted patent
    Generating and using a dynamic bloom filter (status: lapsed)

    Publication number: US08209368B2

    Publication date: 2012-06-26

    Application number: US12134148

    Application date: 2008-06-05

    CPC classification number: G06F12/0864

    Abstract: A dynamic Bloom filter comprises a cascaded set of Bloom filters. The system estimates or guesses a cardinality of input items, selects a number of hash functions based on the desired false positive rate, and allocates memory for an initial Bloom filter based on the estimated cardinality and desired false positive rate. The system inserts items into the initial Bloom filter and counts the bits set as they are inserted. If the number of bits set in the current Bloom filter reaches a predetermined target, the system declares the current Bloom filter full. The system recursively generates additional Bloom filters as needed for items remaining after the initial Bloom filter is filled; items are checked to eliminate duplicates. Each of the set of Bloom filters is individually queried to identify a positive or negative in response to a query. When the system is configured such that the false positive rate of each successive Bloom filter is decreased by one half, the system guarantees a false positive rate of at most twice the desired false positive rate.
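A cascaded filter of the kind this abstract describes might look like the sketch below. Several details are assumptions rather than anything stated in the abstract: the standard sizing formulas for the bit count and number of hash functions, SHA-256 double hashing, the "half of the bits set" threshold for declaring a filter full, and reusing the same capacity estimate for every filter in the cascade.

```python
import hashlib
import math


class BloomFilter:
    def __init__(self, capacity, fp_rate):
        # Standard sizing: k = log2(1/p), m = -n ln p / (ln 2)^2.
        self.k = max(1, math.ceil(math.log2(1 / fp_rate)))
        self.m = max(8, math.ceil(-capacity * math.log(fp_rate) / math.log(2) ** 2))
        self.bits = bytearray((self.m + 7) // 8)
        self.bits_set = 0              # counted as bits flip from 0 to 1
        self.target = self.m // 2      # assumed "full" threshold

    def _positions(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item):
        for pos in self._positions(item):
            byte, mask = pos >> 3, 1 << (pos & 7)
            if not self.bits[byte] & mask:
                self.bits[byte] |= mask
                self.bits_set += 1

    def __contains__(self, item):
        return all(self.bits[p >> 3] & (1 << (p & 7)) for p in self._positions(item))

    @property
    def full(self):
        return self.bits_set >= self.target


class DynamicBloomFilter:
    def __init__(self, estimated_items, fp_rate):
        self.capacity = estimated_items
        self.next_rate = fp_rate
        self.filters = [BloomFilter(estimated_items, fp_rate)]

    def __contains__(self, item):
        # Each filter in the cascade is queried individually.
        return any(item in f for f in self.filters)

    def add(self, item):
        if item in self:               # duplicate check against earlier filters
            return
        if self.filters[-1].full:
            self.next_rate /= 2        # halve the rate for each new filter
            self.filters.append(BloomFilter(self.capacity, self.next_rate))
        self.filters[-1].add(item)
```

With rates p, p/2, p/4, and so on, the union bound over the cascade sums to less than 2p, which is the guarantee the abstract states.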


    INDEX PARTITION MAINTENANCE OVER MONOTONICALLY ADDRESSED DOCUMENT SEQUENCES
    4.
    Patent application
    INDEX PARTITION MAINTENANCE OVER MONOTONICALLY ADDRESSED DOCUMENT SEQUENCES (status: in force)

    Publication number: US20120059823A1

    Publication date: 2012-03-08

    Application number: US12875615

    Application date: 2010-09-03

    CPC classification number: G06F17/30233 G06F17/30584 G06F17/30631

    Abstract: Provided are techniques for partitioning a physical index into one or more physical partitions; assigning each of the one or more physical partitions to a node in a cluster of nodes; for each received document, assigning an assigned-doc-ID comprising an integer document identifier; and, in response to assigning the assigned-doc-ID to a document, determining a cut-off of assignment of new documents to a current virtual-index-epoch comprising a first set of physical partitions and placing the new documents into a new virtual-index-epoch comprising a second set of physical partitions by inserting each new document to a specific one of the physical partitions in the second set using one or more functions that direct the placement based on one of the assigned-doc-id, a field value derived from a set of fields obtained from the document, and a combination of the assigned-doc-id and the field value.


    System and method for generating a cache-aware bloom filter
    6.
    Granted patent
    System and method for generating a cache-aware bloom filter (status: lapsed)

    Publication number: US08032732B2

    Publication date: 2011-10-04

    Application number: US12134125

    Application date: 2008-06-05

    CPC classification number: G06F17/10

    Abstract: A cache-aware Bloom filter system segments a bit vector of a cache-aware Bloom filter into fixed-size blocks. The system hashes an item to be inserted into the cache-aware Bloom filter to identify one of the fixed-size blocks as a selected block for receiving the item and hashes the item k times to generate k hashed values for encoding the item for insertion in the selected block. The system sets bits within the selected block with addresses corresponding to the k hashed values such that accessing the item in the cache-aware Bloom filter requires accessing only the selected block to check the k hashed values. The size of the fixed-size block corresponds to a cache-line size of an associated computer architecture on which the cache-aware Bloom filter is installed.
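The block layout in this abstract can be illustrated with the sketch below. It assumes a 64-byte block (a common cache-line size), SHA-256 as the hash source, and double hashing to derive the k in-block offsets; none of these specifics come from the abstract. A Python `bytearray` gives no alignment guarantee, so the code shows only the addressing scheme, not the cache behaviour.

```python
import hashlib


class BlockedBloomFilter:
    def __init__(self, num_blocks, k, block_bytes=64):
        self.num_blocks = num_blocks
        self.k = k
        self.block_bytes = block_bytes        # assumed cache-line size
        self.block_bits = block_bytes * 8
        self.bits = bytearray(num_blocks * block_bytes)

    def _locate(self, item):
        digest = hashlib.sha256(item.encode()).digest()
        # First hash selects the block that receives the item.
        block = int.from_bytes(digest[:8], "big") % self.num_blocks
        # k further hash values address bits inside that block only.
        h1 = int.from_bytes(digest[8:16], "big")
        h2 = int.from_bytes(digest[16:24], "big") | 1
        offsets = [(h1 + i * h2) % self.block_bits for i in range(self.k)]
        return block, offsets

    def add(self, item):
        block, offsets = self._locate(item)
        base = block * self.block_bytes
        for off in offsets:
            self.bits[base + (off >> 3)] |= 1 << (off & 7)

    def __contains__(self, item):
        block, offsets = self._locate(item)
        base = block * self.block_bytes
        return all(self.bits[base + (off >> 3)] & (1 << (off & 7)) for off in offsets)
```

Both insertion and lookup touch bytes from a single block, which is the property the abstract relies on to keep each access within one cache line.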


    SYSTEM AND METHOD FOR GENERATING A CACHE-AWARE BLOOM FILTER
    7.
    Patent application
    SYSTEM AND METHOD FOR GENERATING A CACHE-AWARE BLOOM FILTER (status: under examination, published)

    Publication number: US20080155229A1

    Publication date: 2008-06-26

    Application number: US11614790

    Application date: 2006-12-21

    CPC classification number: G06F17/10

    Abstract: A cache-aware Bloom filter system segments a bit vector of a cache-aware Bloom filter into fixed-size blocks. The system hashes an item to be inserted into the cache-aware Bloom filter to identify one of the fixed-size blocks as a selected block for receiving the item and hashes the item k times to generate k hashed values for encoding the item for insertion in the selected block. The system sets bits within the selected block with addresses corresponding to the k hashed values such that accessing the item in the cache-aware Bloom filter requires accessing only the selected block to check the k hashed values. The size of the fixed-size block corresponds to a cache-line size of an associated computer architecture on which the cache-aware Bloom filter is installed.


    Single pass space efficient system and method for generating an approximate quantile in a data set having an unknown size
    9.
    Granted patent
    Single pass space efficient system and method for generating an approximate quantile in a data set having an unknown size (status: lapsed)

    Publication number: US06343288B1

    Publication date: 2002-01-29

    Application number: US09268089

    Application date: 1999-03-12

    Abstract: A space-efficient system and method for generating an approximate φ-quantile data element of a data set in a single pass over the data set, without a priori knowledge of the size of the data set. The approximate φ-quantile is guaranteed to lie within a user-specified approximation error ε of the true quantile being sought with a probability of at least 1−δ, with δ being a user-defined probability of failure. b buffers, each having a capacity of k elements, initially are filled with elements from the data set, with the values of b and k depending on approximation error ε and the probability δ. The buffers are then collapsed into an output buffer, with the remaining buffers then being refilled with elements, collapsed (along with the previous output buffer), and so on until the entire data set has been processed and a single output remains. The element of the output corresponding to the approximate quantile is then output as the approximate quantile. In later iterations (when the height of the tree is at least equal to a predetermined height that depends on δ and ε), the data is sampled non-uniformly to populate the buffers to render the desired performance. Parallel processors can be used, with the final output buffers of the processors being sent to a collecting processor P0 as input buffers to the collecting processor P0.
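The fill-and-collapse loop in this abstract can be sketched roughly as below. This is a simplified illustration under stated assumptions: `collapse` and `approximate_quantile` are hypothetical names, b and k are passed in directly rather than derived from ε and δ, the selection offset is fixed at the midpoint of each weight stride, and the non-uniform sampling and parallel stages are left out entirely.

```python
import heapq
import math


def collapse(buffers, k):
    """Merge (weight, sorted_items) buffers into one weighted buffer of k items."""
    total = sum(w for w, _ in buffers)
    merged = heapq.merge(*[[(x, w) for x in items] for w, items in buffers])
    out, seen, target = [], 0, (total + 1) // 2
    for value, weight in merged:
        seen += weight                  # position in the weight-expanded order
        while len(out) < k and target <= seen:
            out.append(value)           # keep every total-th expanded element
            target += total
    return total, out


def approximate_quantile(stream, phi, b=4, k=64):
    buffers, current = [], []
    for x in stream:
        current.append(x)
        if len(current) == k:
            buffers.append((1, sorted(current)))
            current = []
            if len(buffers) == b:
                # All buffers full: collapse them (including any previous
                # output buffer) into a single heavier output buffer.
                buffers = [collapse(buffers, k)]
    # Final answer: weighted rank lookup over whatever remains.
    weighted = [(x, w) for w, items in buffers for x in items]
    weighted += [(x, 1) for x in current]
    weighted.sort()
    n = sum(w for _, w in weighted)
    rank, seen = max(1, math.ceil(phi * n)), 0
    for value, weight in weighted:
        seen += weight
        if seen >= rank:
            return value
    return None                         # empty stream
```

Memory stays at roughly b·k elements regardless of stream length, since each collapse replaces all full buffers with one buffer whose weight records how many input elements each retained item stands for.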


    KNOWLEDGE-BASED DATA MINING SYSTEM
    10.
    Patent application
    KNOWLEDGE-BASED DATA MINING SYSTEM (status: under examination, published)

    Publication number: US20120259890A1

    Publication date: 2012-10-11

    Application number: US13526424

    Application date: 2012-06-18

    CPC classification number: G06F16/951 G06F2216/03

    Abstract: In a data mining system, data is gathered into a data store using, e.g., a Web crawler. The data is classified into entities. Data miners use rules to process the entities and append respective keys to the entities representing characteristics of the entities as derived from rules embodied in the miners. With these keys, characteristics of entities as defined by disparate expert authors of the data miners are identified for use in responding to complex data requests from customers.

