Abstract:
Generally described, systems and methods are provided for monitoring and detecting causes of failures of network paths. The system collects performance information from a plurality of nodes and links in a network, aggregates the collected performance information across paths in the network, processes the aggregated performance information to detect failures on the paths, analyzes each of the detected failures to determine at least one root cause, and initiates a remedial workflow for the determined root cause. In some aspects, processing the aggregated information may include performing a statistical regression analysis or otherwise solving a set of equations for the performance indications on each of a plurality of paths. In another aspect, the system may also include an interface that makes available for display one or more of the network topology, the collected and aggregated performance information, and indications of the detected failures in the topology.
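As a rough illustration of the equation-solving aspect, the sketch below estimates per-link loss rates from aggregated per-path measurements with a least-squares solve and flags high-loss links as root-cause candidates. The incidence matrix, loss figures, and 5% threshold are assumptions for the example, not values from the source.

```python
# Hypothetical sketch: estimate per-link loss rates from aggregated per-path
# measurements by solving A x = b in the least-squares sense, where A[i][j] = 1
# if path i traverses link j and b[i] is the observed loss rate on path i.
import numpy as np

A = np.array([            # path-to-link incidence matrix (assumed topology)
    [1, 1, 0, 0],         # path 0 traverses links 0 and 1
    [0, 1, 1, 0],         # path 1 traverses links 1 and 2
    [0, 0, 1, 1],         # path 2 traverses links 2 and 3
    [1, 0, 0, 1],         # path 3 traverses links 0 and 3
], dtype=float)
b = np.array([0.02, 0.11, 0.10, 0.01])   # aggregated per-path loss rates

link_loss, *_ = np.linalg.lstsq(A, b, rcond=None)

# Links whose estimated loss exceeds a threshold become root-cause candidates.
for link, loss in enumerate(link_loss):
    if loss > 0.05:
        print(f"link {link}: estimated loss {loss:.3f}, initiating remediation")
```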
Abstract:
Host machines and other devices performing synchronized operations can be dispersed across multiple racks in a data center to provide additional buffer capacity and to reduce the likelihood of congestion. The level of dispersion can depend on factors such as the level of oversubscription, as it can be undesirable in a highly connected network to push excessive host traffic into the aggregation fabric. As oversubscription levels increase, the amount of dispersion can be reduced and two or more host machines can be clustered on a given rack, or otherwise connected through the same edge switch. By clustering a portion of the machines, some of the host traffic can be redirected by the respective edge switch without entering the aggregation fabric. When provisioning hosts for a customer, application, or synchronized operation, for example, the levels of clustering and dispersion can be balanced to minimize the likelihood of congestion throughout the network.
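A minimal sketch of one way such a placement decision could look: pick a per-rack cluster size from the oversubscription ratio of the aggregation fabric, then group hosts accordingly. The thresholds and cluster sizes are illustrative assumptions, not values from the source.

```python
# Hypothetical sketch: higher oversubscription favors clustering (traffic stays
# behind the edge switch); low oversubscription favors dispersion across racks.
def hosts_per_rack(oversubscription: float) -> int:
    if oversubscription <= 1.0:   # non-blocking fabric: fully disperse
        return 1
    if oversubscription <= 3.0:   # mild oversubscription: small clusters
        return 2
    return 4                      # heavy oversubscription: keep traffic at the edge

def place_hosts(total_hosts: int, oversubscription: float) -> list[list[int]]:
    """Group host ids into per-rack clusters of the chosen size."""
    size = hosts_per_rack(oversubscription)
    return [list(range(i, min(i + size, total_hosts)))
            for i in range(0, total_hosts, size)]

print(place_hosts(8, oversubscription=4.0))   # [[0, 1, 2, 3], [4, 5, 6, 7]]
```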
Abstract:
Efficient and highly-scalable network solutions are provided that each utilize deployment units based on Clos networks, but in an environment such as a data center or an Internet Protocol-based network. Each of the deployment units can include multiple stages of devices, where connections between devices are only made between stages and the deployment units are highly connected. In some embodiments, the level of connectivity between two stages can be reduced, providing available connections to add edge switches and additional host connections while keeping the same number of between-tier connections. In some embodiments, where deployment units (or other network groups) can be used at different levels to connect other deployment units, the edges of the deployment units can be fused to reduce the number of devices per host connection.
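The connectivity-reduction trade-off can be made concrete with a little arithmetic, under assumed port counts: halving the number of links between each pair of stage devices frees ports for twice as many edge switches while the total number of between-tier links stays the same. The specific device counts below are illustrative only.

```python
# Illustrative arithmetic for a two-stage deployment unit with fully connected
# stages: every edge switch links to every spine switch.
def between_tier_links(edge_switches: int, spine_switches: int,
                       links_per_pair: int) -> int:
    return edge_switches * spine_switches * links_per_pair

full    = between_tier_links(edge_switches=12, spine_switches=12, links_per_pair=2)
reduced = between_tier_links(edge_switches=24, spine_switches=12, links_per_pair=1)
print(full, reduced)   # 288 288: same link count, twice the edge switches
```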
Abstract:
Systems and methods for handling resources in a computer system differently in certain situations, such as catastrophic events, based upon a layer assigned to the resource within the system. The layer can be based, for example, on criticality of the resource to the system. Services or computing device resources can be physically segregated in accordance with layers, and can be managed in accordance with the segregation. As an example, critical layers can be fenced off or otherwise made unavailable except to users with secure clearance or authorization. In addition, a light or other indicator can be provided for indicating that a datacenter component is in a particular layer. The indicators can be at a device level, rack level, and/or room or area level.
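One way to picture the layering policy is a small table mapping each layer to a required clearance and an indicator color; the layer names, clearance levels, and colors below are illustrative assumptions, not taken from the source.

```python
# Hypothetical sketch: tag datacenter components with a criticality layer and
# gate access accordingly.
LAYER_POLICY = {
    "critical":     {"clearance": "secure",   "indicator": "red"},
    "important":    {"clearance": "standard", "indicator": "yellow"},
    "non-critical": {"clearance": "standard", "indicator": "green"},
}

def may_access(component_layer: str, user_clearance: str) -> bool:
    required = LAYER_POLICY[component_layer]["clearance"]
    return required != "secure" or user_clearance == "secure"

print(may_access("critical", "standard"))     # False: critical layer is fenced off
print(LAYER_POLICY["critical"]["indicator"])  # "red" light at device or rack level
```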
Abstract:
Systems and methods are described for testing computing resources. In one embodiment, a request is received for testing a computing configuration. A set of computing settings that can be implemented on one or more computing devices is searched. An initial test population for testing the computing configuration is determined. The initial test population is iteratively updated based on test results and a fitness function.
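The iterative update lends itself to an evolutionary reading: keep the fitter part of the test population and mutate it to produce the next generation. The sketch below is a minimal genetic-style loop under that assumed interpretation; the fitness function and the integer encoding of computing settings are placeholders.

```python
import random

def fitness(settings: list[int]) -> float:
    # Placeholder: in practice this would configure devices with `settings`,
    # run the tests, and score the observed results.
    return -sum((s - 5) ** 2 for s in settings)

def evolve(population: list[list[int]], generations: int = 20) -> list[int]:
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: len(population) // 2]
        children = [[gene + random.choice([-1, 0, 1])
                     for gene in random.choice(survivors)]
                    for _ in range(len(population) - len(survivors))]
        population = survivors + children   # the updated test population
    return max(population, key=fitness)

initial = [[random.randint(0, 10) for _ in range(4)] for _ in range(8)]
print(evolve(initial))   # drifts toward the optimum setting of 5 per slot
```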
Abstract:
Systems and methods are described for testing computing resources. In one embodiment, a search space of computing settings is analyzed in accordance with weighted data that maps computing performance parameters to the computing settings. A subset of the computing settings is selected to generate a test population to optimize at least one computing performance parameter. One or more computing devices in a computing environment are configured in accordance with the test population, and the test conditions are iteratively updated based on test results from the test population and a fitness function.
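The weighted selection step could look like the following sketch: score each candidate setting against weighted performance parameters and seed the test population with the top scorers. The weight map, parameter names, and candidate values are illustrative assumptions.

```python
# Hypothetical sketch of seeding a test population from a weighted search space.
weights = {"throughput": 0.6, "latency": 0.3, "power": 0.1}

candidates = {
    "setting_a": {"throughput": 0.9, "latency": 0.4, "power": 0.7},
    "setting_b": {"throughput": 0.5, "latency": 0.9, "power": 0.9},
    "setting_c": {"throughput": 0.8, "latency": 0.8, "power": 0.2},
}

def score(params: dict[str, float]) -> float:
    return sum(weights[name] * value for name, value in params.items())

test_population = sorted(candidates, reverse=True,
                         key=lambda name: score(candidates[name]))[:2]
print(test_population)   # the subset of settings to configure on devices first
```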
Abstract:
Operating profiles for consumers of computing resources may be automatically determined based on an analysis of actual resource usage measurements and other operating metrics. Measurements may be taken while a consumer, such as a virtual machine instance, uses computing resources, such as those provided by a host. A profile may be dynamically determined based on those measurements. Profiles may be generalized such that groups of consumers with similar usage profiles are associated with a single profile. Assignment decisions may be made based on the profiles, and computing resources may be reallocated or oversubscribed if the profiles indicate that the consumers are unlikely to fully utilize the resources reserved for them. Oversubscribed resources may be monitored, and consumers may be transferred to different resource providers if contention for resources is too high.
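A minimal sketch of the oversubscription decision, assuming profiles that carry a mean and standard deviation of measured usage: the two-sigma margin and the 80% capacity ceiling below are assumptions for the example, not values from the source.

```python
# Illustrative sketch: decide whether a host can be oversubscribed given the
# measured usage profiles of its consumers.
def can_oversubscribe(host_capacity: float, profiles: list[dict]) -> bool:
    expected = sum(p["mean_usage"] + 2 * p["stddev"] for p in profiles)
    return expected < 0.8 * host_capacity   # leave headroom against contention

profiles = [
    {"mean_usage": 1.0, "stddev": 0.2},   # e.g. vCPU-equivalents per instance
    {"mean_usage": 0.5, "stddev": 0.1},
    {"mean_usage": 0.8, "stddev": 0.3},
]
print(can_oversubscribe(host_capacity=5.0, profiles=profiles))   # True
```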
Abstract:
A service provider can maintain one or more host computing devices that can be accessed as host computing device resources by customers. A hosting platform includes components arranged in a manner to limit modifications to software or firmware on hardware components. In some aspects, the hosting platform may include a master latch that indicates whether the components may be configured, and the master latch may be set once and only reset upon completion of a power cycle. In another aspect, the hosting platform can implement management functions for establishing control plane functions between the host computing device and the service provider that are independent of the customer. Additionally, the management functions can also be utilized to present different hardware or software attributes of the host computing device.
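The set-once latch semantics can be illustrated in software, as in the sketch below; the class and method names are hypothetical stand-ins for the hardware behavior.

```python
class MasterLatch:
    def __init__(self) -> None:
        self._configurable = True    # components may be configured after power-up

    def lock(self) -> None:
        """Set the latch once; further configuration is disabled."""
        self._configurable = False

    def power_cycle(self) -> None:
        """Only a completed power cycle resets the latch."""
        self._configurable = True

    def configure(self, component: str) -> bool:
        return self._configurable    # refuse modification once latched

latch = MasterLatch()
latch.lock()
print(latch.configure("firmware"))   # False: locked until the next power cycle
```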
Abstract:
Approaches for automatically backing up data from volatile memory to persistent storage in the event of a power outage, blackout, or other such failure are described. The approaches can be implemented on a computing device that includes a motherboard, central processing unit (CPU), a main power source, volatile memory (e.g., random access memory (RAM)), an alternate power source, and circuitry (e.g., a specialized application-specific integrated circuit (ASIC)) for performing the backup of volatile memory to a persistent storage device. In the event of a power failure of the main power source, the alternate power source is configured to supply power to the specialized ASIC for backing up the data in the volatile memory. For example, when power failure is detected, the ASIC can read the data from the DIMM socket using power supplied from the alternate power source and write that data to a persistent storage device.
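A hypothetical sketch of the backup flow: on loss of main power, switch to the alternate source and copy each volatile-memory region to persistent storage. The Region class and the in-memory "store" below stand in for the ASIC's hardware operations on the DIMM and the persistent device.

```python
import io

class Region:
    def __init__(self, offset: int, data: bytes) -> None:
        self.offset, self.data = offset, data

def switch_to_alternate_power() -> None:
    print("alternate power engaged")   # stand-in for the power-mux hardware

def backup_on_power_failure(dimm_regions: list[Region], store: io.BytesIO) -> None:
    switch_to_alternate_power()
    for region in dimm_regions:        # read from the DIMM, write to storage
        store.seek(region.offset)
        store.write(region.data)
    store.flush()                      # data must be durable before power is lost

store = io.BytesIO(bytes(16))
backup_on_power_failure([Region(0, b"dirty"), Region(8, b"pages")], store)
print(store.getvalue())
```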
Abstract:
An asset health monitoring system (AHMS) can assign a confidence indicator to some or all of the monitored computing assets in a data center, such as computing systems or networking devices. In response to drops in the confidence indicators, the AHMS can automatically initiate testing of computing assets in order to raise confidence that the asset will perform correctly. Further, the AHMS can automatically initiate remediation procedures for computing assets that fail the confidence testing. By automatically triggering testing of assets and/or remediation procedures, the AHMS can increase reliability for the data center by preemptively identifying problems.
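The trigger logic can be sketched as a simple threshold check: test any asset whose confidence score drops below a threshold, and remediate assets that fail the test. The threshold value and the stubbed test below are assumptions for the example.

```python
# Illustrative sketch of confidence-driven testing and remediation.
def run_confidence_test(asset: str) -> bool:
    return asset != "switch-07"        # placeholder for real diagnostics

def check_assets(confidence: dict[str, float], threshold: float = 0.7) -> None:
    for asset, score in confidence.items():
        if score < threshold:          # low confidence: preemptively test
            if run_confidence_test(asset):
                confidence[asset] = 1.0   # passed: restore confidence
            else:
                print(f"{asset}: failed confidence test, starting remediation")

check_assets({"host-01": 0.9, "switch-07": 0.4, "host-12": 0.6})
```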