-
Publication No.: US12277134B1
Publication date: 2025-04-15
Application No.: US18478274
Filing date: 2023-09-29
Applicant: Amazon Technologies, Inc.
Inventor: Daniel Opincariu, Rajasuba Subramanian, Arnab Dutta, Deepan Chakravarthy Vijayarangam, Ranil Pavithran Muzhangathu, Anas Fattahi
Abstract: In a data lake, a control data object is defined. The control object defines the processes and relationships of processes associated with a data set in the data lake. The control has states that are tied to and adapt in response to state changes of the associated data set. A control can have a control type. The system automatically carries forward enabled processes from one data set version to the next data set version. The system uses the control definition to execute processes, such as compaction or data quality scans, on data sets in the data lake.
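The control object in this abstract can be pictured with a minimal sketch. Everything named here (`Control`, `Process`, `DatasetState`, `ControlState`, `carry_forward`, `execution_order`, the `depends_on` field) is a hypothetical illustration and is not taken from the patent, which does not disclose an API in the abstract. The sketch assumes a control is suspended whenever its data set is not active, and that process relationships are simple "runs after" dependencies.

```python
from dataclasses import dataclass, field
from enum import Enum


class DatasetState(Enum):
    ACTIVE = "active"
    DEPRECATED = "deprecated"


class ControlState(Enum):
    ENABLED = "enabled"
    SUSPENDED = "suspended"


@dataclass(frozen=True)
class Process:
    name: str                 # e.g. "compaction" or "data_quality_scan"
    enabled: bool = True
    depends_on: tuple = ()    # names of processes that must run first


@dataclass
class Control:
    dataset_id: str
    dataset_version: int
    control_type: str
    processes: list = field(default_factory=list)
    state: ControlState = ControlState.ENABLED

    def on_dataset_state_change(self, new_state):
        # Assumption: the control follows the data set's state.
        if new_state is DatasetState.ACTIVE:
            self.state = ControlState.ENABLED
        else:
            self.state = ControlState.SUSPENDED

    def carry_forward(self):
        # Only enabled processes move to the next data set version.
        kept = [p for p in self.processes if p.enabled]
        return Control(self.dataset_id, self.dataset_version + 1,
                       self.control_type, kept, self.state)

    def execution_order(self):
        # Names of enabled processes, dependencies first.
        if self.state is not ControlState.ENABLED:
            return []
        pending = {p.name: p for p in self.processes if p.enabled}
        order = []
        while pending:
            ready = [name for name, p in pending.items()
                     if all(d not in pending for d in p.depends_on)]
            if not ready:
                raise ValueError("cyclic process dependencies")
            for name in ready:
                order.append(name)
                del pending[name]
        return order


control = Control(
    dataset_id="sales",
    dataset_version=1,
    control_type="maintenance",
    processes=[
        Process("data_quality_scan", depends_on=("compaction",)),
        Process("compaction"),
        Process("vacuum", enabled=False),
    ],
)
next_control = control.carry_forward()
```

Here `next_control` belongs to version 2 of the data set and keeps only the two enabled processes, which mirrors the "carries forward enabled processes" behaviour in the abstract under the assumptions stated above.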
-
Publication No.: US09563687B1
Publication date: 2017-02-07
Application No.: US14540648
Filing date: 2014-11-13
Applicant: AMAZON TECHNOLOGIES, INC.
Inventor: Arnab Dutta, Ramanathan Muthiah, Srinivasan V. Rajagopalan
IPC: G06F17/30
CPC classification number: G06F17/30306, G06F17/30339
Abstract: Techniques are described for employing a graph-based analysis to determine a configuration of datasets to be stored on data storage systems in a data warehouse environment. Associations between datasets may be determined based on the parsing of join statements or other types of statements in jobs that are executed on the data storage systems. A graph may be generated that describes the associations among datasets. A greedy breadth-first traversal of the graph may be performed to determine sets of associated datasets. A utilization metric describing a weight of storing the datasets may be determined and employed to identify a data storage system on which to store a set of associated datasets, given the storage and processing capacity of the data storage system.
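The graph-based placement in this abstract can be sketched as follows. This is one possible reading, not the patented method: the function names, the use of join frequency as an edge weight, the size `budget` per group, and the use of total dataset size as the utilization metric are all assumptions made for illustration. Only storage capacity is modelled; processing capacity is left out.

```python
from collections import Counter, defaultdict, deque


def build_graph(join_pairs):
    """Undirected graph; edge weight = how often two datasets are joined."""
    weights = Counter(frozenset(p) for p in join_pairs if p[0] != p[1])
    graph = defaultdict(dict)
    for pair, weight in weights.items():
        a, b = tuple(pair)
        graph[a][b] = weight
        graph[b][a] = weight
    return graph


def associated_sets(graph, sizes, budget):
    """Greedy breadth-first traversal (assumed interpretation).

    Starts from the largest unvisited dataset and visits the most
    frequently joined neighbours first, adding a neighbour only while
    the group's total size stays within `budget`.
    """
    visited, groups = set(), []
    for start in sorted(sizes, key=sizes.get, reverse=True):
        if start in visited:
            continue
        visited.add(start)
        group, used, queue = [start], sizes[start], deque([start])
        while queue:
            node = queue.popleft()
            neighbours = sorted(graph.get(node, {}).items(),
                                key=lambda kv: -kv[1])
            for neighbour, _weight in neighbours:
                if neighbour not in visited and used + sizes[neighbour] <= budget:
                    visited.add(neighbour)
                    group.append(neighbour)
                    used += sizes[neighbour]
                    queue.append(neighbour)
        groups.append(group)
    return groups


def place(groups, sizes, capacities):
    """Put each group on the system with the most free capacity that fits."""
    free = dict(capacities)
    placement = {}
    for group in sorted(groups, key=lambda g: -sum(sizes[d] for d in g)):
        need = sum(sizes[d] for d in group)  # hypothetical utilization metric
        target = max((s for s in free if free[s] >= need),
                     key=free.get, default=None)
        if target is None:
            raise ValueError(f"no storage system can hold {group}")
        free[target] -= need
        placement[tuple(group)] = target
    return placement


# Hypothetical join statements parsed from executed jobs.
join_pairs = [("orders", "customers"), ("orders", "customers"),
              ("orders", "items"), ("clicks", "sessions")]
sizes = {"orders": 50, "customers": 20, "items": 40,
         "clicks": 30, "sessions": 10}

graph = build_graph(join_pairs)
groups = associated_sets(graph, sizes, budget=80)
placement = place(groups, sizes, {"A": 100, "B": 90})
```

With these sample inputs, `orders` and `customers` end up in the same group because they are joined most often, while `items` is split off because adding it would exceed the budget.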
-