System and method for joining skewed datasets in a distributed computing environment
摘要:
Disclosed is a method and system for joining datasets in a distributed computing environment. The system comprises a memory 206 and a processor 202. The processor 202 identifies a skewed dataset from two or more datasets to be joined. The processor 202 identifies a replication parameter from a configuration file. The processor 202 then assigns a randomly assigned machine number to each chunk of the skewed dataset owned by the nodes/machines involved in the join operation. The processor 202 forms copies of the non-skewed dataset equal to the replication parameter and adds the copy number to each sample of the copy of the non-skewed dataset formed. Further, the processor 202 merges each non-skewed dataset into the final copy of the non-skewed dataset, forming a single non skewed dataset. The processor 202 then repeats these steps for all the non-skewed datasets involved in the join operation resulting in generation of merged copies of all the non-skewed datasets and then performs the joining operation.
信息查询
0/0