Abstract:
Generally described, systems and methods are provided for monitoring and detecting causes of failures of network paths. The system collects performance information from a plurality of nodes and links in a network, aggregates the collected performance information across paths in the network, processes the aggregated performance information for detecting failures on the paths, analyzes each of the detected failures to determine at least one root cause, and initiates a remedial workflow for the at least one root cause determined. In some aspects, processing the aggregated information may include performing a statistical regression analysis or otherwise solving a set of equations for the performance indications on each of a plurality of paths. In another aspect, the system may also include an interface which makes available for display one or more of the network topology, the collected and aggregated performance information, and indications of the detected failures in the topology.
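The regression step can be illustrated with a small sketch. It assumes, purely for illustration, that a path's measured latency is the sum of the latencies of the links it traverses, and it solves the resulting equations by ordinary least squares. The function name, the topology matrix, and the sample numbers are hypothetical and are not taken from the abstract.

```python
def solve_least_squares(paths, measurements):
    """Estimate per-link values x that minimize ||A x - b||^2.

    paths[p][l] is 1 if path p traverses link l, else 0 (matrix A).
    measurements[p] is the value observed on path p (vector b).
    """
    n = len(paths[0])
    # Build the normal equations (A^T A) x = A^T b as an augmented matrix.
    aug = [
        [sum(row[i] * row[j] for row in paths) for j in range(n)]
        + [sum(row[i] * b for row, b in zip(paths, measurements))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("links are not identifiable from these paths")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * p for a, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


# Hypothetical topology: 4 monitored paths over 3 links.
PATHS = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 0, 1],
    [1, 1, 1],
]
LATENCY_MS = [3.0, 12.0, 11.0, 13.0]  # assumed aggregated path measurements

link_latency = solve_least_squares(PATHS, LATENCY_MS)
# The link with the largest estimated contribution is the root-cause candidate.
suspect = max(range(len(link_latency)), key=lambda i: link_latency[i])
```

In this made-up data set the third link (index 2) accounts for most of the latency on every path that crosses it, so it would be the candidate handed to a remedial workflow.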
Abstract:
Operating profiles for consumers of computing resources may be automatically determined based on an analysis of actual resource usage measurements and other operating metrics. Measurements may be taken while a consumer, such as a virtual machine instance, uses computing resources, such as those provided by a host. A profile may be dynamically determined based on those measurements. Profiles may be generalized such that groups of consumers with similar usage profiles are associated with a single profile. Assignment decisions may be made based on the profiles, and computing resources may be reallocated or oversubscribed if the profiles indicate that the consumers are unlikely to fully utilize the resources reserved for them. Oversubscribed resources may be monitored, and consumers may be transferred to different resource providers if contention for resources is too high.
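One way to picture profile-based oversubscription is the sketch below. It assumes each consumer reserves one full CPU core, that a profile is the consumer's high-percentile measured usage rounded up to a coarse bucket, and that a host accepts a new consumer while the sum of profiles fits within its physical capacity. The bucket size, percentile, and all names are illustrative assumptions.

```python
import math


def percentile(values, pct):
    """Simple rank-based percentile over a list of samples."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[index]


def assign_profile(cpu_samples, bucket=0.25):
    """Generalize a consumer to a shared profile: p95 usage rounded up to a bucket."""
    p95 = percentile(cpu_samples, 95)
    return max(1, math.ceil(p95 / bucket)) * bucket


def can_place(host_capacity, resident_profiles, new_profile):
    """Admit a consumer if the expected (not reserved) usage still fits on the host."""
    return sum(resident_profiles) + new_profile <= host_capacity


# Hypothetical usage measurements, as fractions of one reserved core.
idle = [0.05, 0.10, 0.20, 0.10, 0.15]
web = [0.30, 0.45, 0.40, 0.20, 0.35]
batch = [0.90, 0.80, 0.95, 0.70, 0.85]

profiles = [assign_profile(s) for s in (idle, web, batch)]
# Three one-core reservations fit on a two-core host because the profiles sum to 1.75.
fits_three = can_place(2.0, profiles[:2], profiles[2])
# A fourth consumer with a 0.5 profile would push expected usage to 2.25 cores.
fits_four = can_place(2.0, profiles, 0.5)
```

A real system would also monitor contention on the oversubscribed host and migrate consumers when it rises, which this sketch does not model.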
Abstract:
A service provider can maintain one or more host computing devices that can be accessed as host computing device resources by customers. A hosting platform includes components arranged in a manner to limit modifications to software or firmware on hardware components. In some aspects, the hosting platform may include a master latch that indicates whether the components may be configured, and the master latch may be set once and only reset upon completion of a power cycle. In another aspect, the hosting platform can implement management functions for establishing control plane functions between the host computing device and the service provider that are independent of the customer. Additionally, the management functions can be utilized to present different hardware or software attributes of the host computing device.
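The master latch behavior can be modeled in software as follows. This is only a conceptual sketch: the abstract describes a platform-level latch, and the class names, the `flash` method, and the version strings here are hypothetical.

```python
class MasterLatch:
    """Once set, stays set until a power cycle completes."""

    def __init__(self):
        self._set = False

    def set(self):
        # Setting an already-set latch has no further effect.
        self._set = True

    def power_cycle(self):
        # The only event that clears the latch.
        self._set = False

    @property
    def configurable(self):
        return not self._set


class FirmwareComponent:
    """A component whose firmware can change only while the latch is clear."""

    def __init__(self, name, version, latch):
        self.name = name
        self.version = version
        self._latch = latch

    def flash(self, new_version):
        if not self._latch.configurable:
            raise PermissionError(f"{self.name}: master latch is set")
        self.version = new_version


latch = MasterLatch()
bios = FirmwareComponent("bios", "1.0", latch)
bios.flash("1.1")  # allowed: the latch has not been set yet
latch.set()        # e.g. before handing the host to a customer
try:
    bios.flash("untrusted")
    blocked = False
except PermissionError:
    blocked = True
```

After `latch.power_cycle()` the component becomes configurable again, mirroring the reset-on-power-cycle rule.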
Abstract:
An asset health monitoring system (AHMS) can assign a confidence indicator to some or all of the monitored computing assets in a data center, such as computing systems or networking devices. In response to drops in the confidence indicators, the AHMS can automatically initiate testing of computing assets in order to raise confidence that the asset will perform correctly. Further, the AHMS can automatically initiate remediation procedures for computing assets that fail the confidence testing. By automatically triggering testing of assets and/or remediation procedures, the AHMS can increase reliability for the data center by preemptively identifying problems.
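The test-then-remediate loop might look like the sketch below. The abstract does not say what causes a confidence indicator to drop, so the linear decay with time since the last test, the 0.7 threshold, and the field names are all assumptions made for illustration.

```python
def current_confidence(asset, decay_per_hour=0.01):
    """Assumed model: confidence decays linearly with time since the last test."""
    return max(0.0, asset["last_score"] - decay_per_hour * asset["hours_since_test"])


def check_asset(asset, run_test, threshold=0.7):
    """Return the action taken for one asset: 'ok', 'retested', or 'remediate'."""
    if current_confidence(asset) >= threshold:
        return "ok"
    # Confidence has dropped: trigger testing to try to raise it again.
    if run_test(asset):
        asset["last_score"] = 1.0
        asset["hours_since_test"] = 0
        return "retested"
    # The asset failed confidence testing: hand it to remediation.
    return "remediate"


fresh = {"name": "switch-1", "last_score": 1.0, "hours_since_test": 10}
stale = {"name": "server-7", "last_score": 1.0, "hours_since_test": 50}
faulty = {"name": "server-9", "last_score": 1.0, "hours_since_test": 50}

actions = [
    check_asset(fresh, run_test=lambda a: True),
    check_asset(stale, run_test=lambda a: True),
    check_asset(faulty, run_test=lambda a: False),
]
```

Here `run_test` stands in for whatever burn-in or diagnostic suite an operator would actually run.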
Abstract:
Approaches are disclosed for enabling owners of virtual computing resources to specify one or more constraints for their virtual machines and/or virtual networks, with respect to metrics such as cost, latency, throughput, network bandwidth, power usage, server availability, data redundancy, correlated failure susceptibility, and other such metrics. A customer can declare a set of constraints with metrics goals for their virtual machine instance or network of instances, and the service provider can optimize the placement (e.g., host selection) and various settings (e.g., hardware and software settings) to satisfy the specified constraints. The satisfaction of customer-specified constraints may need to take into account what other virtual machine instances are performing in the shared resource environment.
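Constraint-driven placement can be sketched as a filter followed by an optimization. The example assumes each candidate host advertises a few metrics and that the customer supplies upper bounds, lower bounds, and one metric to minimize. The host records, metric names, and the `place` function are hypothetical, and the sketch ignores the effect of neighboring instances that the abstract mentions.

```python
def place(hosts, constraints):
    """Pick the feasible host that minimizes the customer's chosen metric."""
    feasible = [
        h for h in hosts
        if all(h[m] <= limit for m, limit in constraints.get("max", {}).items())
        and all(h[m] >= floor for m, floor in constraints.get("min", {}).items())
    ]
    if not feasible:
        return None  # no host satisfies the declared constraints
    return min(feasible, key=lambda h: h[constraints["minimize"]])["name"]


HOSTS = [
    {"name": "host-a", "cost": 0.10, "latency_ms": 30, "bandwidth_gbps": 10},
    {"name": "host-b", "cost": 0.08, "latency_ms": 80, "bandwidth_gbps": 10},
    {"name": "host-c", "cost": 0.12, "latency_ms": 20, "bandwidth_gbps": 25},
]

wanted = {
    "max": {"latency_ms": 50},
    "min": {"bandwidth_gbps": 10},
    "minimize": "cost",
}
chosen = place(HOSTS, wanted)
```

With these made-up numbers, `host-b` is cheapest but violates the latency bound, so the cheapest feasible host, `host-a`, is selected.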
Abstract:
A set of techniques is described for monitoring and analyzing crashes and other malfunctions in a multi-tenant computing environment (e.g., a cloud computing environment). The computing environment may host many applications that are executed on different computing resource combinations. The combinations may include varying types and versions of hardware or software resources. A monitoring service is deployed to gather statistical data about the failures occurring in the computing environment. The statistical data is then analyzed to identify abnormally high failure patterns. The failure patterns may be associated with particular computing resource combinations being used to execute particular types of applications. Based on these failure patterns, suggestions can be issued to a user to execute the application using a different computing resource combination. Alternatively, the failure patterns may be used to modify or update the various resources in order to correct the potential malfunctions caused by those resources.
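A minimal sketch of the failure-pattern analysis follows. It assumes each observed run is recorded as a resource combination plus a failed flag, and it treats a combination as abnormal when its failure rate exceeds a multiple of the environment-wide rate. The 2x factor, the minimum sample size, and the combination labels are illustrative assumptions.

```python
from collections import Counter


def failure_rates(runs):
    """Return ({combination: failure rate}, {combination: run count})."""
    totals, failures = Counter(), Counter()
    for combo, failed in runs:
        totals[combo] += 1
        failures[combo] += int(failed)
    return {c: failures[c] / totals[c] for c in totals}, totals


def abnormal_combinations(runs, factor=2.0, min_runs=5):
    """Flag combinations whose failure rate is well above the overall rate."""
    rates, totals = failure_rates(runs)
    overall = sum(int(failed) for _, failed in runs) / len(runs)
    return sorted(
        c for c, rate in rates.items()
        if totals[c] >= min_runs and rate > factor * overall
    )


def suggest_alternative(runs, flagged):
    """Suggest the non-flagged combination with the lowest failure rate."""
    rates, _ = failure_rates(runs)
    candidates = [c for c in rates if c not in flagged]
    return min(candidates, key=lambda c: rates[c]) if candidates else None


# Hypothetical monitoring data: (hardware, software) combination and outcome.
A, B, C = ("hw-a", "os-1"), ("hw-b", "os-1"), ("hw-a", "os-2")
RUNS = (
    [(A, True)] + [(A, False)] * 9
    + [(B, True)] * 6 + [(B, False)] * 4
    + [(C, False)] * 10
)

flagged = abnormal_combinations(RUNS)
suggestion = suggest_alternative(RUNS, flagged)
```

In this sample, combination B fails 60% of the time against an overall rate of about 23%, so it is flagged and users would be pointed to combination C.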
Abstract:
Disclosed are various embodiments of a computing device for validating the configuration of components of a component assembly. The computing device serves a boot image executable by a component of the component assembly. Expected configuration data associated with the component is identified by the computing device, and actual configuration data associated with the component is obtained by the computing device. The computing device determines a validation response for the component assembly based at least in part upon a comparison of the expected configuration data and the actual configuration data.
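The comparison step can be sketched as a dictionary diff. The sketch assumes the expected and actual configuration data are flat key/value records and that the validation response lists every mismatched field. The field names and values are hypothetical.

```python
def validate(expected, actual):
    """Compare expected and actual configuration and build a validation response."""
    mismatches = {
        key: (want, actual.get(key))
        for key, want in expected.items()
        if actual.get(key) != want
    }
    return {"valid": not mismatches, "mismatches": mismatches}


# Hypothetical data: expected values from an inventory record, actual values
# reported by the component after it boots the served image.
EXPECTED = {"bios": "2.1", "nic_firmware": "7.0", "dimm_count": 8}
ACTUAL = {"bios": "2.1", "nic_firmware": "6.5", "dimm_count": 8}

response = validate(EXPECTED, ACTUAL)
```

Each mismatch is reported as an `(expected, actual)` pair so that an operator can see which value to correct.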
Abstract:
Systems and methods are disclosed that facilitate the updating of target host computing devices based on versioning information. A set of host computing devices is provisioned with a local computing device management component. Each local computing device management component periodically transmits a request to a host computing device management component to determine whether version information associated with the respective host computing device corresponds to version filter information. Based on a processing of the version filter information with the current version information of the host computing device, the host computing device management component can facilitate the implementation of updates to the requesting host computing device.
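The version-filter check might be sketched as below. The abstract does not define the filter format, so this example assumes a filter holds an inclusive range of dotted versions eligible for update plus a target version. The field names and the `handle_request` function are hypothetical.

```python
def parse_version(text):
    """Turn '1.4.2' into (1, 4, 2) so that versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def matches_filter(current, version_filter):
    """True if the host's current version falls inside the filter's range."""
    low = parse_version(version_filter["min"])
    high = parse_version(version_filter["max"])
    return low <= parse_version(current) <= high


def handle_request(host_id, current, version_filter):
    """Central component's reply to a host's periodic version check."""
    if matches_filter(current, version_filter):
        return {"host": host_id, "update_to": version_filter["target"]}
    return {"host": host_id, "update_to": None}


# Assumed filter: hosts on 1.2.0 through 1.9.9 should move to 2.0.0.
FILTER = {"min": "1.2.0", "max": "1.9.9", "target": "2.0.0"}

reply_old = handle_request("host-17", "1.10.0", FILTER)
reply_targeted = handle_request("host-18", "1.4.2", FILTER)
reply_current = handle_request("host-19", "2.0.0", FILTER)
```

Parsing versions into integer tuples avoids the string-comparison pitfall where "1.10.0" would sort before "1.9.9". Here 1.10.0 is correctly treated as newer than the filter's upper bound and is left alone.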