Abstract:
Aspects of the technology provide improvements to a Serverless Computing (SLC) workflow by determining when and how to optimize SLC jobs for computing in a Distributed Computing Framework (DCF). DCF optimization can be performed by abstracting SLC tasks into different workflow configurations to determined optimal arrangements for execution in a DCF environment. A process of the technology can include steps for receiving an SLC job including one or more SLC tasks, executing one or more of the tasks to determine a latency metric and a throughput metric for the SLC tasks, and determining if the SLC tasks should be converted to a Distributed Computing Framework (DCF) format based on the latency metric and the throughput metric. Systems and machine-readable media are also provided.
Abstract:
Approaches are disclosed for improving performance of logical disks. A logical disk can comprise several storage devices. In an object storage system (OSS), when a logical disk stores a file, fragments of the file are stored distributed across the storage devices. Each of the fragments of the file is asymmetrically stored in (write) and retrieved from (read) the storage devices. The performance of the logical disk is improved by reconfiguring one or more of the storage devices based on an influence that each of the storage devices has on performance of the logical disk and the asymmetric read and write operations of each of the storage devices. For example, latency of the logical disk can be reduced by reconfiguring one or more of the plurality of storage disks based on a proportion of the latency of the logical device that is attributable to each of the plurality of storage devices.
Abstract:
Embodiments include obtaining at least one system metric of a distributed storage system, generating one or more recovery parameters based on the at least one system metric, identifying at least one policy associated with data stored in a storage node of a plurality of storage nodes in the distributed storage system, and generating a recovery plan for the data based on the one or more recovery parameters and the at least one policy. In more specific embodiments, the recovery plan includes a recovery order for recovering the data. Further embodiments include initiating a recovery process to copy replicas of the data from a second storage node to a new storage node, wherein the replicas of the data are copied according to the recovery order indicated in the recovery plan.
Abstract:
Embodiments include receiving an indication of a data storage module to be associated with a tenant of a distributed storage system, allocating a partition of a disk for data of the tenant, creating a first association between the data storage module and the disk partition, creating a second association between the data storage module and the tenant, and creating rules for the data storage module based on one or more policies configured for the tenant. Embodiments further include receiving an indication of a type of subscription model selected for the tenant, and selecting the disk partition to be allocated based, at least in part, on the subscription model selected for the tenant. More specific embodiments include generating a storage map indicating the first association between the data storage module and the disk partition and indicating the second association between the data storage module and the tenant.
Abstract:
A method can include receiving, at a workflow controller, a machine learning workflow, the machine learning workflow associated with a first task and a second task. The first task is training a machine learning model and the second task is deploying the model. The method can include segmenting, by the workflow controller, the machine learning workflow into a first sub-workflow associated with the first task and a second sub-workflow associated with the second task, assigning a first workflow agent to the first sub-workflow and assigning a second workflow agent to the second sub-workflow, selecting, by the first workflow agent and based on first resources needed to perform the first task, a first cluster for performing the first task and selecting, by the second workflow agent and based on second resources needed to perform the second task, a second cluster for performing the second task.
Abstract:
A method for accelerating data operations across a plurality of nodes of one or more clusters of a distributed computing environment. Rack awareness information characterizing the plurality of nodes is retrieved and a non-volatile memory (NVM) capability of each node is determined. A write operation is received at a management node of the plurality of nodes and one or more of the rack awareness information and the NVM capability of the plurality of nodes are analyzed to select one or more nodes to receive at least a portion of the write operation, wherein at least one of the selected nodes has an NVM capability. A multicast group for the write operation is then generated wherein the selected nodes are subscribers of the multicast group, and the multicast group is used to perform hardware accelerated read or write operations at one or more of the selected nodes.
Abstract:
Aspects of the disclosed technology relate to ways to determine the optimal storage of data structures across different memory device is associated with physically disparate network nodes. In some aspects, a process of the technology can include steps for receiving a first retrieval request for a first object, searching a local PMEM device for the first object based on the first retrieval request, in response to a failure to find the first object on the local PMEM device, transmitting a second retrieval request to a remote node, wherein the second retrieval request is configured to cause the remote node to retrieve the first object from a remote PMEM device. Systems and machine-readable media are also provided.
Abstract:
Aspects of the subject technology relate to ways to determine the optimal storage of data structures in a hierarchy of memory types. In some aspects, a process of the technology can include steps for determining a latency cost for each of a plurality of fields in an object, identifying at least one field having a latency cost that exceeds a predetermined threshold, and determining whether to store the at least one field to a first memory device or a second memory device based on the latency cost. Systems and machine-readable media are also provided.
Abstract:
Systems, methods, and computer-readable media for storing data in a data storage system using a child table. In some examples, a trickle update to first data in a parent table is received at a data storage system storing the first data in the parent table. A child table storing second data can be created in persistent memory for the parent table. Subsequently the trickle update can be stored in the child table as part of the second data stored in the child table. The second data including the trickle update stored in the child table can be used to satisfy, at least in part, one or more data queries for the parent table using the child table.
Abstract:
Embodiments include receiving an indication of a data storage module to be associated with a tenant of a distributed storage system, allocating a partition of a disk for data of the tenant, creating a first association between the data storage module and the disk partition, creating a second association between the data storage module and the tenant, and creating rules for the data storage module based on one or more policies configured for the tenant. Embodiments further include receiving an indication of a type of subscription model selected for the tenant, and selecting the disk partition to be allocated based, at least in part, on the subscription model selected for the tenant. More specific embodiments include generating a storage map indicating the first association between the data storage module and the disk partition and indicating the second association between the data storage module and the tenant.