Abstract:
Methods and systems for a connector interface in a data pipeline are disclosed. A pipeline comprising two data source nodes and an activity node is configured. Each data source node represents data from a different data source, and the activity node represents a workflow activity that uses the data as input. Two connectors which implement the same connector interface are triggered. In response, data is acquired at each connector from the corresponding data source through the connector interface. The data is sent from the connectors to the activity node through the connector interface. The workflow activity is performed using the acquired data.
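The mechanism described above can be sketched as follows. This is a minimal illustration only: the class and method names (`Connector`, `acquire`, `ActivityNode`) are assumptions, not identifiers from the patent.

```python
from abc import ABC, abstractmethod

# Hypothetical connector interface; both connectors implement the same one.
class Connector(ABC):
    @abstractmethod
    def acquire(self):
        """Acquire data from the underlying data source."""

class DatabaseConnector(Connector):
    def __init__(self, rows):
        self.rows = rows
    def acquire(self):
        return list(self.rows)

class FileConnector(Connector):
    def __init__(self, lines):
        self.lines = lines
    def acquire(self):
        return list(self.lines)

class ActivityNode:
    """Workflow activity that consumes connector data through the shared interface."""
    def __init__(self, *connectors):
        self.connectors = connectors
    def run(self):
        # Trigger each connector; data flows to the activity via the interface.
        inputs = [c.acquire() for c in self.connectors]
        return [item for batch in inputs for item in batch]

pipeline = ActivityNode(DatabaseConnector([1, 2]), FileConnector([3, 4]))
print(pipeline.run())  # → [1, 2, 3, 4]
```

Because both data source nodes expose the same interface, the activity node needs no source-specific logic.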
Abstract:
Methods and systems for cost-minimizing job scheduling are disclosed. A definition of a task is received. The definition comprises a need-by time. The need-by time comprises a deadline for completion of execution of the task. An estimated duration to complete the execution of the task is determined for each of a plurality of computing resources. One or more of the computing resources are selected based on an estimated cost of completing the execution using the computing resources. The execution of the task is initiated at a scheduled time using the selected one or more computing resources. The scheduled time is earlier than the need-by time by at least the estimated duration.
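The scheduling rule above can be sketched in a few lines: pick the resource with the lowest estimated cost, then start execution at least the estimated duration before the need-by time. The resource tuples and rates are illustrative assumptions, not values from the patent.

```python
from datetime import datetime, timedelta

# Hypothetical resource model: (name, estimated duration, cost per hour).
resources = [
    ("small", timedelta(hours=4), 0.10),
    ("large", timedelta(hours=1), 0.50),
]

def schedule(need_by, resources):
    # Estimated cost of completing the task = hours of runtime * hourly rate.
    def cost(resource):
        _, duration, rate = resource
        return (duration.total_seconds() / 3600) * rate
    name, duration, _ = min(resources, key=cost)
    # Scheduled time is earlier than the need-by time by at least the duration.
    scheduled_time = need_by - duration
    return name, scheduled_time

need_by = datetime(2024, 1, 1, 12, 0)
name, start = schedule(need_by, resources)
print(name, start)  # "small" (cost 0.40 vs 0.50) starting 4 hours before the deadline
```

Note the trade-off the abstract implies: the cheaper resource is slower, so cost minimization pushes the scheduled time earlier rather than missing the deadline.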
Abstract:
Methods and systems for optimization of task execution are disclosed. A definition of a task is received. A plurality of parameter values for execution of the task are selected based on an execution history for a plurality of prior tasks performed for a plurality of clients. The plurality of parameter values are selected to optimize one or more execution constraints for the execution of the task. The execution of the task is initiated using one or more computing resources configured with the selected parameter values.
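A minimal sketch of this selection step, assuming the execution history records parameter values and an observed constraint (here, runtime) per prior run; the field names are made up for illustration.

```python
# Hypothetical multi-client execution history for prior runs of similar tasks.
history = [
    {"params": {"workers": 2}, "runtime_s": 120},
    {"params": {"workers": 4}, "runtime_s": 70},
    {"params": {"workers": 8}, "runtime_s": 90},
]

def select_parameters(history):
    # Choose the parameter values that best satisfied the execution
    # constraint (minimum runtime) across prior tasks.
    best = min(history, key=lambda run: run["runtime_s"])
    return best["params"]

print(select_parameters(history))  # → {'workers': 4}
```

The computing resources would then be configured with the selected values before execution is initiated.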
Abstract:
Methods and systems for using a scheduler in a data pipeline are disclosed. A plurality of objects in a first layer are created, each representing a respective regularly scheduled task. A plurality of objects in a second layer are created, each representing a respective scheduled instance of a regularly scheduled task. It is determined whether each object in the second layer is ready to execute. For at least one object in the second layer, it is determined if the object has received notifications from any objects on which it depends. For each object that is ready to execute, the regularly scheduled task associated with the object is performed. For each object that is not ready to execute, the object is put to sleep.
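The two-layer scheme above can be sketched as follows: first-layer objects describe recurring tasks, second-layer objects are scheduled instances that sleep until they have received notifications from every instance they depend on. All class and method names are illustrative assumptions.

```python
class RecurringTask:          # first layer: a regularly scheduled task
    def __init__(self, name):
        self.name = name

class TaskInstance:           # second layer: one scheduled instance
    def __init__(self, task, depends_on=()):
        self.task = task
        self.pending = set(depends_on)   # instances we still await
        self.asleep = False
        self.done = False

    def notify(self, dependency):
        # A dependency signals completion; remove it from the pending set.
        self.pending.discard(dependency)

    def tick(self):
        if self.pending:      # not ready to execute: put the object to sleep
            self.asleep = True
        else:                 # ready: perform the associated scheduled task
            self.asleep = False
            self.done = True

a = TaskInstance(RecurringTask("extract"))
b = TaskInstance(RecurringTask("transform"), depends_on=[a])
b.tick()        # dependency outstanding, so b sleeps
a.tick()        # a has no dependencies, so it runs
b.notify(a)     # notification from the completed dependency
b.tick()        # now ready
print(a.done, b.done)  # → True True
```

A production scheduler would wake sleeping instances on notification rather than polling, but the ready/sleep decision is the same.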
Abstract:
Methods and systems for task timeouts as a function of input data size are disclosed. A definition of a task is received. The definition of the task indicates a set of input data for the task. A timeout duration for the task is determined based on the set of input data. The timeout duration varies with one or more characteristics of the set of input data. The execution of the task is initiated. The execution of the task is stopped if the execution of the task exceeds the timeout duration.
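One simple way the timeout could vary with input size is a linear model over the data volume; the base and per-megabyte coefficients below are made-up illustrative values, not figures from the patent.

```python
# Timeout duration as a function of input data size: a fixed base allowance
# plus a per-megabyte term, so larger inputs get proportionally more time.
def timeout_seconds(input_bytes, base=60, secs_per_mb=2):
    return base + secs_per_mb * (input_bytes / 1_000_000)

print(timeout_seconds(500_000_000))  # 60 + 2 * 500 = 1060.0 seconds
```

Execution would then be stopped (and possibly flagged as failed) once its elapsed time exceeds this duration.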
Abstract:
Techniques are described for managing distributed execution of programs, including by dynamically scaling a cluster of multiple computing nodes performing ongoing distributed execution of a program, such as to increase and/or decrease computing node quantity. An architecture may be used that has core nodes that each participate in a distributed storage system for the distributed program execution, and that has one or more other auxiliary nodes that do not participate in the distributed storage system. Furthermore, as part of performing the dynamic scaling of a cluster, computing nodes that are only temporarily available may be selected and used, such as computing nodes that might be removed from the cluster during the ongoing program execution to be put to other uses and that may also be available for a different fee (e.g., a lower fee) than other computing nodes that are available throughout the ongoing use of the cluster.
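The core/auxiliary split and the use of cheaper, temporarily available nodes can be sketched as below. The classes, fields, and prices are assumptions for illustration; the key invariant is that only auxiliary nodes, which hold no distributed storage, are removed when scaling down.

```python
class Node:
    def __init__(self, name, core, hourly_cost, interruptible=False):
        self.name = name
        self.core = core                  # participates in the distributed storage system
        self.hourly_cost = hourly_cost
        self.interruptible = interruptible  # may be reclaimed for other uses

class Cluster:
    def __init__(self, core_nodes):
        self.nodes = list(core_nodes)

    def scale_up(self, pool, count):
        # Prefer the cheapest capacity (often interruptible) for auxiliary work.
        for node in sorted(pool, key=lambda n: n.hourly_cost)[:count]:
            self.nodes.append(node)

    def scale_down(self):
        # Only auxiliary nodes are safe to drop: they store no cluster data.
        self.nodes = [n for n in self.nodes if n.core]

cluster = Cluster([Node("core-1", True, 0.40)])
pool = [Node("spot-1", False, 0.10, True), Node("ondemand-1", False, 0.40)]
cluster.scale_up(pool, 1)
print([n.name for n in cluster.nodes])  # → ['core-1', 'spot-1']
cluster.scale_down()
print([n.name for n in cluster.nodes])  # → ['core-1']
```

Keeping storage on core nodes is what makes the auxiliary nodes interchangeable and safe to reclaim mid-execution.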