-
1.
Publication number: US09697277B2
Publication date: 2017-07-04
Application number: US15208677
Filing date: 2016-07-13
Applicant: International Business Machines Corporation
Inventor: Andrey Balmin , Vuk Ercegovac , Peter J. Haas , Liping Peng , John Sismanis
IPC: G06F17/30
CPC classification number: G06F17/30598 , G06F17/30324 , G06F17/30486 , G06F17/3053 , G06F17/30536 , G06F17/30867
Abstract: A computer-implemented method includes partitioning a plurality of records into a plurality of splits. Each split includes at least a portion of the plurality of records. The method further includes providing at least one split of the plurality of splits to a mapper. The mapper scans the input data set, transforms each input record using a map function, and extracts a grouping key in parallel. The method further includes assigning at least a portion of the records of the at least one split to a group, each assignment being based on a stratum of the assigned record, and filtering the records of the group, each filtering being based on a comparison of a weight of a record to a local threshold of the mapper. The method further includes shuffling the group to a reducer and providing a stratified sampling of the plurality of records based on the group.
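The abstract does not say how record weights or the local threshold are computed. The following is a minimal single-process sketch under stated assumptions: each record receives a uniform random weight, each stratum keeps its k smallest-weight records (bottom-k sampling), and a mapper's local threshold is the k-th smallest weight it has seen so far for that stratum. The names `mapper`, `reducer`, `stratified_sample`, `key_fn`, and `k` are hypothetical and do not come from the patent.

```python
import heapq
import random
from collections import defaultdict


def mapper(split, key_fn, k, rng):
    """Group records by stratum, keeping only candidates under a local threshold."""
    # Per stratum: a max-heap (weights negated) of the k smallest weights so far.
    heaps = defaultdict(list)
    for seq, record in enumerate(split):
        stratum = key_fn(record)
        weight = rng.random()  # assumption: uniform random weight per record
        heap = heaps[stratum]
        if len(heap) < k:
            heapq.heappush(heap, (-weight, seq, record))
        elif weight < -heap[0][0]:  # local threshold: k-th smallest weight so far
            heapq.heapreplace(heap, (-weight, seq, record))
    for stratum, heap in heaps.items():
        for neg_weight, _, record in heap:
            yield stratum, (-neg_weight, record)


def reducer(candidates, k):
    """Keep the k smallest-weight candidates of one stratum."""
    smallest = heapq.nsmallest(k, candidates, key=lambda c: c[0])
    return [record for _, record in smallest]


def stratified_sample(records, key_fn, k, num_splits=4, seed=0):
    """Partition into splits, run a mapper per split, shuffle by stratum, reduce."""
    rng = random.Random(seed)
    splits = [records[i::num_splits] for i in range(num_splits)]
    groups = defaultdict(list)  # stands in for the shuffle step
    for split in splits:
        for stratum, candidate in mapper(split, key_fn, k, rng):
            groups[stratum].append(candidate)
    return {stratum: reducer(cands, k) for stratum, cands in groups.items()}
```

Under these assumptions each mapper emits at most k candidates per stratum, so the shuffle volume depends on k and the number of mappers rather than on the input size. A stratum with fewer than k records is returned in full.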
-
2.
Publication number: US20150186493A1
Publication date: 2015-07-02
Application number: US14141635
Filing date: 2013-12-27
Applicant: International Business Machines Corporation
Inventor: Andrey Balmin , Vuk Ercegovac , Peter J. Haas , Liping Peng , John Sismanis
IPC: G06F17/30
CPC classification number: G06F17/30598 , G06F17/30324 , G06F17/30486 , G06F17/3053 , G06F17/30536 , G06F17/30867
Abstract: Stratified sampling of a plurality of records is performed. A plurality of records are partitioned into a plurality of splits, wherein each split includes at least a portion of the plurality of records. At least one split of the plurality of splits is provided to a mapper. The mapper assigns at least a portion of the records of the at least one split to a group based on a stratum of the assigned records, and filters the records of the group based on a comparison of the weights of the records to a local threshold of the mapper. The mapper updates the local threshold of the mapper by communicating with a coordinator. The mapper shuffles the group to a reducer, where the reducer filters the records of the group based on the weights of the records. The reducer provides a stratified sampling of the plurality of records based on the group.
-
3.
Publication number: US09697274B2
Publication date: 2017-07-04
Application number: US14141635
Filing date: 2013-12-27
Applicant: International Business Machines Corporation
Inventor: Andrey Balmin , Vuk Ercegovac , Peter J. Haas , Liping Peng , John Sismanis
IPC: G06F17/30
CPC classification number: G06F17/30598 , G06F17/30324 , G06F17/30486 , G06F17/3053 , G06F17/30536 , G06F17/30867
Abstract: Stratified sampling of a plurality of records is performed. A plurality of records are partitioned into a plurality of splits, wherein each split includes at least a portion of the plurality of records. At least one split of the plurality of splits is provided to a mapper. The mapper assigns at least a portion of the records of the at least one split to a group based on a stratum of the assigned records, and filters the records of the group based on a comparison of the weights of the records to a local threshold of the mapper. The mapper updates the local threshold of the mapper by communicating with a coordinator. The mapper shuffles the group to a reducer, where the reducer filters the records of the group based on the weights of the records. The reducer provides a stratified sampling of the plurality of records based on the group.
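The abstract does not describe the protocol between mapper and coordinator. One possible reading, sketched here purely as an assumption, is that each mapper periodically reports the k-th smallest weight it holds for a stratum and receives back the tightest threshold any mapper has reported. That threshold is safe to apply everywhere, because a mapper already holding k records at or below it guarantees the final k-record sample lies at or below it too. `Coordinator`, `exchange`, `map_stratum`, and `sync_every` are hypothetical names, and the coordinator is an in-process object rather than a network service.

```python
import heapq


class Coordinator:
    """Tracks, per stratum, the smallest threshold any mapper has reported."""

    def __init__(self):
        self._thresholds = {}

    def exchange(self, stratum, local_threshold):
        # Report a local threshold and get back the tightest one known so far.
        best = min(local_threshold, self._thresholds.get(stratum, float("inf")))
        self._thresholds[stratum] = best
        return best


def map_stratum(weighted_records, k, coordinator, stratum, sync_every=100):
    """Filter (weight, record) pairs of one stratum, syncing the threshold periodically."""
    kept = []  # max-heap (weights negated) holding at most k candidates
    threshold = float("inf")  # last threshold received from the coordinator
    for n, (weight, record) in enumerate(weighted_records, start=1):
        if weight < threshold:
            if len(kept) < k:
                heapq.heappush(kept, (-weight, n, record))
            elif weight < -kept[0][0]:
                heapq.heapreplace(kept, (-weight, n, record))
        if n % sync_every == 0:
            # Only a mapper holding k candidates has a meaningful local threshold.
            local = -kept[0][0] if len(kept) == k else float("inf")
            threshold = coordinator.exchange(stratum, local)
    # Drop candidates that a tighter, coordinator-supplied threshold has ruled out.
    return [(-neg_w, rec) for neg_w, _, rec in kept if -neg_w <= threshold]
```

A reducer would still select the k smallest weights from the union of mapper outputs. The coordinator in this sketch only reduces how many candidates each mapper retains and shuffles.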
-
4.
Publication number: US20160321350A1
Publication date: 2016-11-03
Application number: US15208677
Filing date: 2016-07-13
Applicant: International Business Machines Corporation
Inventor: Andrey Balmin , Vuk Ercegovac , Peter J. Haas , Liping Peng , John Sismanis
IPC: G06F17/30
CPC classification number: G06F17/30598 , G06F17/30324 , G06F17/30486 , G06F17/3053 , G06F17/30536 , G06F17/30867
Abstract: A computer-implemented method includes partitioning a plurality of records into a plurality of splits. Each split includes at least a portion of the plurality of records. The method further includes providing at least one split of the plurality of splits to a mapper. The mapper scans the input data set, transforms each input record using a map function, and extracts a grouping key in parallel. The method further includes assigning at least a portion of the records of the at least one split to a group, each assignment being based on a stratum of the assigned record, and filtering the records of the group, each filtering being based on a comparison of a weight of a record to a local threshold of the mapper. The method further includes shuffling the group to a reducer and providing a stratified sampling of the plurality of records based on the group.
-