System and method for analyzing data records

    Publication No.: US09830357B2

    Publication date: 2017-11-28

    Application No.: US15226795

    Filing date: 2016-08-02

    Applicant: GOOGLE INC.

    Abstract: A method processes data records. The method partitions the data records into groups and assigns each group to a respective process of a first plurality of processes, which execute in parallel. For each group, the assigned process extracts information from the data records, applies a script with information processing commands applied sequentially to produce intermediate values, stores the intermediate values in a respective intermediate data structure, and updates the status of the group to indicate completion. When the predefined threshold percentage of the data records are completed, the process assigns each group to a respective second process as a backup. When each of the groups has been completed by at least one process (either the original or the backup), the method executes a second plurality of processes to aggregate intermediate values from the intermediate data structures to produce output data. The aggregation includes intermediate values only once for each group.
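
The flow in this abstract (partition, parallel per-group processing, backup assignment past a completion threshold, one-time aggregation) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the patent: the helper names `partition`, `process_group`, and `run`, the use of threads in place of processes, word counting as the "script", and the choice to start backups only for groups that are still unfinished.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def partition(records, num_groups):
    """Split the records into num_groups round-robin groups."""
    return [records[i::num_groups] for i in range(num_groups)]


def process_group(group):
    """Hypothetical stand-in for the script: count words in each record."""
    counts = Counter()
    for record in group:
        counts.update(record.split())
    return counts


def run(records, num_groups=4, backup_threshold=0.5):
    groups = partition(records, num_groups)
    intermediate = {}  # group id -> intermediate values (first finisher wins)
    with ThreadPoolExecutor(max_workers=max(1, num_groups)) as pool:
        futures = {pool.submit(process_group, g): gid
                   for gid, g in enumerate(groups)}
        pending = set(futures)
        backups_started = False
        while pending and len(intermediate) < num_groups:
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                # setdefault keeps only the first result per group, so a
                # group is never counted twice even if its backup also ends.
                intermediate.setdefault(futures[fut], fut.result())
            completed = sum(len(groups[gid]) for gid in intermediate)
            if (not backups_started and records
                    and completed / len(records) >= backup_threshold):
                backups_started = True
                for gid, g in enumerate(groups):
                    if gid not in intermediate:  # assumption: unfinished only
                        backup = pool.submit(process_group, g)
                        futures[backup] = gid
                        pending.add(backup)
    # Aggregation phase: each group's intermediate values are included once.
    total = Counter()
    for gid in range(num_groups):
        total.update(intermediate[gid])
    return total
```

Calling `run(["a b", "b c", "c c a"], num_groups=2)` aggregates the per-group counts into a single `Counter`. Keying the intermediate results by group id is what enforces the "only once for each group" property in this sketch.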

    Distributing Data on Distributed Storage Systems
    2.
    Type: Invention application
    Status: Under examination (published)

    Publication No.: US20160299815A1

    Publication date: 2016-10-13

    Application No.: US15180896

    Filing date: 2016-06-13

    Applicant: Google Inc.

    Abstract: A method of distributing data in a distributed storage system includes receiving a file, dividing the received file into chunks, and determining a distribution of the chunks among storage devices of the distributed storage system based on a maintenance hierarchy of the distributed storage system. The maintenance hierarchy includes maintenance levels, and each maintenance level includes one or more maintenance units. Each maintenance unit has an active state and an inactive state. Moreover, each storage device is associated with a maintenance unit. The determining of the distribution of the chunks includes identifying a random selection of the storage devices matching a number of chunks of the file and being capable of maintaining accessibility of the file when one or more maintenance units are in an inactive state. The method also includes distributing the chunks to storage devices of the distributed storage system according to the determined distribution.
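
A simplified Python sketch of the placement idea in this abstract: pick storage devices at random, one per chunk, and accept the selection only if the file would stay accessible with a maintenance unit inactive. The sketch flattens the maintenance hierarchy to a single level, and the function name `choose_devices`, the `max_per_unit` limit, and the rejection-sampling loop are all illustrative assumptions, not the patented method.

```python
import random
from collections import Counter


def choose_devices(device_units, num_chunks, max_per_unit,
                   rng=None, max_tries=1000):
    """Randomly pick num_chunks distinct devices.

    device_units maps device name -> maintenance unit name.
    max_per_unit is the assumed number of chunks the file can lose and
    still be readable, so no single unit may hold more than that many.
    """
    rng = rng or random.Random()
    devices = list(device_units)
    if num_chunks > len(devices):
        raise ValueError("not enough storage devices for the chunks")
    for _ in range(max_tries):
        picked = rng.sample(devices, num_chunks)
        per_unit = Counter(device_units[d] for d in picked)
        # Accept only if taking any one unit offline loses few enough chunks.
        if all(count <= max_per_unit for count in per_unit.values()):
            return picked
    raise RuntimeError("no valid placement found")
```

For example, with six devices spread over three racks and `max_per_unit=1`, any returned selection of three devices places each chunk in a different rack.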

    Prioritizing data reconstruction in distributed storage systems
    3.
    Type: Granted invention patent
    Status: Granted (in force)

    Publication No.: US09535790B2

    Publication date: 2017-01-03

    Application No.: US15054780

    Filing date: 2016-02-26

    Applicant: Google Inc.

    Abstract: A method of prioritizing data for recovery in a distributed storage system includes, for each stripe of a file having chunks, determining whether the stripe comprises high-availability chunks or low-availability chunks and determining an effective redundancy value for each stripe. The effective redundancy value is based on the chunks and any system domains associated with the corresponding stripe. The distributed storage system has a system hierarchy including system domains. Chunks of a stripe associated with a system domain in an active state are accessible, whereas chunks of a stripe associated with a system domain in an inactive state are inaccessible. The method also includes reconstructing substantially immediately inaccessible, high-availability chunks having an effective redundancy value less than a threshold effective redundancy value and reconstructing the inaccessible low-availability and other inaccessible high-availability chunks, after a threshold period of time.
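
The prioritization described in this abstract can be sketched in Python as a two-queue split. The abstract does not define how the effective redundancy value is computed, so the formula below (accessible chunks minus the chunks needed to read the stripe) is purely an assumption for illustration, as are the names `Stripe`, `effective_redundancy`, and `schedule`.

```python
from dataclasses import dataclass


@dataclass
class Stripe:
    name: str
    high_availability: bool
    chunk_domains: list  # system domain holding each chunk of the stripe
    min_chunks: int      # assumed: chunks needed to read the stripe


def effective_redundancy(stripe, inactive_domains):
    """Assumed definition: spare accessible chunks beyond the minimum."""
    accessible = sum(1 for d in stripe.chunk_domains
                     if d not in inactive_domains)
    return accessible - stripe.min_chunks


def schedule(stripes, inactive_domains, threshold):
    """Split stripes with inaccessible chunks into immediate and deferred."""
    immediate, deferred = [], []
    for s in stripes:
        if not any(d in inactive_domains for d in s.chunk_domains):
            continue  # nothing to reconstruct for this stripe
        if (s.high_availability
                and effective_redundancy(s, inactive_domains) < threshold):
            immediate.append(s.name)  # reconstruct right away
        else:
            deferred.append(s.name)   # reconstruct after the waiting period
    return immediate, deferred
```

In this sketch only high-availability stripes that are close to becoming unreadable jump the queue; everything else waits, on the expectation that an inactive domain may come back before the threshold period ends.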

    Efficient data reads from distributed storage systems
    4.
    Type: Granted invention patent
    Status: Granted (in force)

    Publication No.: US09514015B2

    Publication date: 2016-12-06

    Application No.: US15079095

    Filing date: 2016-03-24

    Applicant: Google Inc.

    Abstract: A method of distributing data in a distributed storage system includes receiving a file into non-transitory memory and dividing the received file into chunks. The chunks are data-chunks and non-data chunks. The method also includes grouping one or more of the data chunks and one or more of the non-data chunks in a group. One or more chunks of the group is capable of being reconstructed from other chunks of the group. The method also includes distributing the chunks of the group to storage devices of the distributed storage system based on a hierarchy of the distributed storage system. The hierarchy includes maintenance domains having active and inactive states, each storage device associated with a maintenance domain, the chunks of a group are distributed across multiple maintenance domains to maintain the ability to reconstruct chunks of the group when a maintenance domain is in an inactive state.
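
As an illustration of a group whose chunks can be rebuilt from one another, the Python sketch below uses a single XOR parity chunk as the "non-data chunk" and places every chunk of the group in a different maintenance domain. XOR parity is a swapped-in stand-in chosen for brevity; the abstract does not say which code is used, and the names `make_group`, `place`, and `reconstruct` are hypothetical.

```python
def xor_bytes(a, b):
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_group(data, chunk_size):
    """Split data into zero-padded data chunks and append one parity chunk."""
    chunks = [data[i:i + chunk_size].ljust(chunk_size, b"\0")
              for i in range(0, len(data), chunk_size)]
    parity = bytes(chunk_size)
    for chunk in chunks:
        parity = xor_bytes(parity, chunk)
    return chunks + [parity]


def place(group, domains):
    """Assign each chunk of the group to a distinct maintenance domain."""
    if len(domains) < len(group):
        raise ValueError("need one maintenance domain per chunk")
    return dict(zip(domains, group))


def reconstruct(placement, inactive_domain):
    """Rebuild the chunk held by the inactive domain from the survivors."""
    survivors = [c for d, c in placement.items() if d != inactive_domain]
    rebuilt = bytes(len(survivors[0]))
    for chunk in survivors:
        rebuilt = xor_bytes(rebuilt, chunk)
    return rebuilt
```

Because each domain holds exactly one chunk of the group, taking any single domain offline removes only one chunk, which the XOR of the remaining chunks restores.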

    System and Method For Analyzing Data Records
    5.
    Type: Invention application
    Status: Granted (in force)

    Publication No.: US20160342657A1

    Publication date: 2016-11-24

    Application No.: US15226795

    Filing date: 2016-08-02

    Applicant: GOOGLE INC.

    Abstract: A method processes data records. The method partitions the data records into groups and assigns each group to a respective process of a first plurality of processes, which execute in parallel. For each group, the assigned process extracts information from the data records, applies a script with information processing commands applied sequentially to produce intermediate values, stores the intermediate values in a respective intermediate data structure, and updates the status of the group to indicate completion. When the predefined threshold percentage of the data records are completed, the process assigns each group to a respective second process as a backup. When each of the groups has been completed by at least one process (either the original or the backup), the method executes a second plurality of processes to aggregate intermediate values from the intermediate data structures to produce output data. The aggregation includes intermediate values only once for each group.

    Efficient Data Reads From Distributed Storage Systems

    Publication No.: US20160203066A1

    Publication date: 2016-07-14

    Application No.: US15079095

    Filing date: 2016-03-24

    Applicant: Google Inc.

    Abstract: A method of distributing data in a distributed storage system includes receiving a file into non-transitory memory and dividing the received file into chunks. The chunks are data-chunks and non-data chunks. The method also includes grouping one or more of the data chunks and one or more of the non-data chunks in a group. One or more chunks of the group is capable of being reconstructed from other chunks of the group. The method also includes distributing the chunks of the group to storage devices of the distributed storage system based on a hierarchy of the distributed storage system. The hierarchy includes maintenance domains having active and inactive states, each storage device associated with a maintenance domain, the chunks of a group are distributed across multiple maintenance domains to maintain the ability to reconstruct chunks of the group when a maintenance domain is in an inactive state.

    Quota-based resource scheduling
    9.
    Type: Granted invention patent

    Publication No.: US09781054B1

    Publication date: 2017-10-03

    Application No.: US14810187

    Filing date: 2015-07-27

    Applicant: Google Inc.

    CPC classification number: H04L47/762 H04L47/821

    Abstract: The present disclosure relates to dynamically scheduling resource requests in a distributed system based on usage quotas. One example method includes identifying usage information for a distributed system including atoms, each atom representing a distinct item used by users of the distributed system; determining that a usage quota associated with the distributed system has been exceeded based on the usage information, the usage quota representing an upper limit for a particular type of usage of the distributed system; receiving a first request for a particular atom requiring invocation of the particular type of usage represented by the usage quota; determining that a second request for a different type of usage of the particular atom is waiting to be processed; and processing the second request for the particular atom before processing the first request.
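
The reordering step in this abstract can be sketched in Python as a small queue policy: when the request at the head of the queue needs a usage type whose quota is exceeded, a waiting request for the same atom with a different usage type is processed first. The request layout (dicts with `id`, `atom`, `type`), the `usage` and `quotas` mappings, and the function name `next_request` are assumptions made for this example.

```python
from collections import deque


def next_request(queue, usage, quotas):
    """Return the next request to process, or None if the queue is empty.

    usage maps usage type -> current usage; quotas maps usage type -> limit.
    """
    if not queue:
        return None
    head = queue[0]
    over_quota = usage.get(head["type"], 0) > quotas.get(head["type"],
                                                         float("inf"))
    if over_quota:
        # Look for a later request on the same atom with another usage type.
        for i, req in enumerate(queue):
            if i and req["atom"] == head["atom"] and req["type"] != head["type"]:
                del queue[i]
                return req
    # No quota problem, or nothing to run ahead of it: serve the head.
    return queue.popleft()
```

With a queue of a write on atom `x`, a write on atom `y`, and a read on atom `x`, and the write quota exceeded, this policy serves the read on `x` first and then falls back to arrival order.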
