-
Publication No.: US10659523B1
Publication Date: 2020-05-19
Application No.: US14286724
Filing Date: 2014-05-23
Applicant: Amazon Technologies, Inc.
Inventor: Rejith George Joseph, Tin-Yu Lee, Scott Michael Le Grand, Saurabh Dileep Baji
Abstract: At the request of a customer, a distributed computing service provider may create multiple clusters under a single customer account, and may isolate them from each other. For example, various isolation mechanisms (or combinations of isolation mechanisms) may be applied when creating the clusters to isolate a given cluster of compute nodes from network traffic from compute nodes of other clusters (e.g., by creating the clusters in different VPCs); to restrict access to data, metadata, or resources that are within the given cluster of compute nodes or that are associated with the given cluster of compute nodes by compute nodes of other clusters in the distributed computing system (e.g., using an instance metadata tag and/or a storage system prefix); and/or to restrict access to application programming interfaces of the distributed computing service by the given cluster of compute nodes (e.g., using an identity and access manager).
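Illustrative sketch (not from the patent): one way the instance metadata tag and storage system prefix mentioned in the abstract could combine to restrict data access to a node's own cluster. The names `ClusterNode`, `cluster_tag`, `storage_prefix`, and `may_access`, and the `clusters/<tag>/` key layout, are hypothetical assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterNode:
    """A compute node carrying a hypothetical cluster-identifying metadata tag."""
    node_id: str
    cluster_tag: str


def storage_prefix(cluster_tag: str) -> str:
    # Assumed layout: each cluster owns one key prefix ending in "/",
    # so the prefix for "c1" can never match keys that belong to "c10".
    return f"clusters/{cluster_tag}/"


def may_access(node: ClusterNode, object_key: str) -> bool:
    """Allow access only to keys under the prefix of the node's own cluster."""
    return object_key.startswith(storage_prefix(node.cluster_tag))
```

The trailing slash in the prefix is the one design choice that matters here: without it, a plain prefix comparison would let cluster `c1` read keys written by cluster `c10`.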
-
Publication No.: US10936432B1
Publication Date: 2021-03-02
Application No.: US14495408
Filing Date: 2014-09-24
Applicant: Amazon Technologies, Inc.
Inventor: Tin-Yu Lee, Rejith George Joseph, Scott Michael Le Grand, Saurabh Dileep Baji
Abstract: Methods, systems, and computer-readable media for implementing a fault-tolerant parallel computation framework are disclosed. Execution of an application comprises execution of a plurality of processes in parallel. Process states for the processes are stored during the execution of the application. The processes use a message passing interface for exchanging messages with one another. The messages are exchanged and the process states are stored at a plurality of checkpoints during execution of the application. A final successful checkpoint is determined after the execution of the application is terminated. The final successful checkpoint represents the most recent checkpoint at which the processes exchanged messages successfully. Execution of the application is resumed from the final successful checkpoint using the process states stored at the final successful checkpoint.
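Illustrative sketch (not from the patent): selecting the final successful checkpoint as the most recent one at which every process both exchanged messages successfully and has a stored state. The `Checkpoint` record, its fields, and the function name are hypothetical; the abstract does not specify how checkpoints are represented.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Checkpoint:
    index: int                    # position of the checkpoint in execution order
    exchange_ok: Dict[int, bool]  # process rank -> message exchange succeeded
    states: Dict[int, bytes]      # process rank -> stored process state


def final_successful_checkpoint(
    checkpoints: List[Checkpoint], ranks: List[int]
) -> Optional[Checkpoint]:
    """Return the newest checkpoint that is complete for every rank, if any."""
    for cp in sorted(checkpoints, key=lambda c: c.index, reverse=True):
        complete = all(
            cp.exchange_ok.get(rank, False) and rank in cp.states
            for rank in ranks
        )
        if complete:
            return cp
    return None  # no usable checkpoint: the application must restart from scratch
```

Execution would then resume by handing each rank its entry from the returned checkpoint's `states`.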
-
Publication No.: US10148736B1
Publication Date: 2018-12-04
Application No.: US14281582
Filing Date: 2014-05-19
Applicant: Amazon Technologies, Inc.
Inventor: Tin-Yu Lee, Rejith George Joseph, Scott Michael Le Grand, Saurabh Dileep Baji, Peter Sirota
Abstract: A client may submit a job to a service provider that processes a large data set and that employs a message passing interface (MPI) to coordinate the collective execution of the job on multiple compute nodes. The framework may create a MapReduce cluster (e.g., within a VPC) and may generate a single key pair for the cluster, which may be downloaded by nodes in the cluster and used to establish secure node-to-node communication channels for MPI messaging. A single node may be assigned as a mapper process and may launch the MPI job, which may fork its commands to other nodes in the cluster (e.g., nodes identified in a hostfile associated with the MPI job), according to the MPI interface. A rankfile may be used to synchronize the MPI job and another MPI process used to download portions of the data set to respective nodes in the cluster.
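Illustrative sketch (not from the patent): building the hostfile that identifies the cluster nodes an MPI job may fork commands to. It assumes Open MPI-style hostfile syntax (`<host> slots=<n>`), which the abstract does not name; the function name and host names are hypothetical.

```python
from typing import Dict


def build_hostfile(slots_by_host: Dict[str, int]) -> str:
    """Render one '<host> slots=<n>' line per cluster node, sorted by host name."""
    lines = []
    for host, slots in sorted(slots_by_host.items()):
        if slots < 1:
            raise ValueError(f"host {host!r} must offer at least one slot")
        lines.append(f"{host} slots={slots}")
    return "\n".join(lines) + "\n"
```

The resulting text would be written to a file on the node that launches the MPI job and passed to the MPI launcher.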
-
Publication No.: US10133646B1
Publication Date: 2018-11-20
Application No.: US15468708
Filing Date: 2017-03-24
Applicant: Amazon Technologies, Inc.
Inventor: Rejith George Joseph, Tin-Yu Lee, Bandish N. Chheda, Scott Michael Le Grand, Saurabh Dileep Baji
Abstract: A method for providing fault tolerance in a distributed file system of a service provider may include launching at least one data storage node on at least a first virtual machine instance (VMI) running on one or more servers of the service provider and storing file data. At least one data management node may be launched on at least a second VMI running on the one or more servers of the service provider. The at least second VMI may be associated with a dedicated IP address and the at least one data management node may store metadata information associated with the file data in a network storage attached to the at least second VMI. Upon detecting a failure of the at least second VMI, the at least one data management node may be re-launched on at least a third VMI running on the one or more servers.
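Illustrative sketch (not from the patent): re-launching the data management node on a third VMI after the second VMI fails. The sketch assumes that the dedicated IP address and the attached network storage are moved to the replacement instance so that clients and metadata remain reachable; the abstract states only that the node is re-launched. The `VMI` class and `fail_over` function are hypothetical, in-memory stand-ins.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VMI:
    """In-memory stand-in for a virtual machine instance."""
    name: str
    healthy: bool = True
    dedicated_ip: Optional[str] = None     # address clients use to reach the node
    metadata_volume: Optional[str] = None  # network storage holding file metadata
    runs_management_node: bool = False


def fail_over(current: VMI, replacement: VMI) -> VMI:
    """Return the VMI hosting the management node after a health check."""
    if current.healthy:
        return current  # nothing to do while the current instance is up
    # Assumption: the IP and volume follow the management node to the new VMI.
    replacement.dedicated_ip, current.dedicated_ip = current.dedicated_ip, None
    replacement.metadata_volume, current.metadata_volume = current.metadata_volume, None
    current.runs_management_node = False
    replacement.runs_management_node = True
    return replacement
```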
-
Publication No.: US09612924B1
Publication Date: 2017-04-04
Application No.: US14314969
Filing Date: 2014-06-25
Applicant: Amazon Technologies, Inc.
Inventor: Rejith George Joseph, Tin-Yu Lee, Bandish N. Chheda, Scott Michael Le Grand, Saurabh Dileep Baji
CPC classification number: G06F11/2007, G06F9/45533, G06F9/45558, G06F2009/45579, G06F2009/45591
Abstract: A method for providing fault tolerance in a distributed file system of a service provider may include launching at least one data storage node on at least a first virtual machine instance (VMI) running on one or more servers of the service provider and storing file data. At least one data management node may be launched on at least a second VMI running on the one or more servers of the service provider. The at least second VMI may be associated with a dedicated IP address and the at least one data management node may store metadata information associated with the file data in a network storage attached to the at least second VMI. Upon detecting a failure of the at least second VMI, the at least one data management node may be re-launched on at least a third VMI running on the one or more servers.