Abstract:
In one embodiment, in response to a request received from a client for retrieving a data object stored in a storage system, a root key is obtained from the request. The data object is represented by metadata in a hierarchical structure having a plurality of levels. Each level includes a plurality of nodes, and each node is one of a root node, a leaf node, and an intermediate node. The hierarchical structure of metadata associated with the data object is traversed in a top-down approach to decrypt each of a plurality of nodes in the hierarchical structure using a key provided from its parent node, starting from the root node and proceeding to the leaf nodes, including decrypting the root node using the root key. Decrypted data associated with the plurality of nodes is transmitted to the client.
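A minimal sketch of the top-down traversal, assuming each node stores its children's keys wrapped under its own key. The `EncNode` type, the `decrypt_tree` helper, and the XOR-based `toy_cipher` are hypothetical stand-ins chosen for illustration; the toy cipher is not secure and only keeps the example self-contained.

```python
import hashlib
from dataclasses import dataclass, field


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher (NOT secure): XOR with a SHA-256 counter keystream.

    Applying it twice with the same key returns the original data.
    """
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


@dataclass
class EncNode:
    ciphertext: bytes                                        # encrypted node content
    wrapped_child_keys: list = field(default_factory=list)   # child keys encrypted under this node's key
    children: list = field(default_factory=list)


def decrypt_tree(node: EncNode, key: bytes) -> list:
    """Decrypt this node with `key`, then each child with the key its parent provides."""
    plain = [toy_cipher(key, node.ciphertext)]
    for wrapped, child in zip(node.wrapped_child_keys, node.children):
        child_key = toy_cipher(key, wrapped)   # unwrap the child's key
        plain.extend(decrypt_tree(child, child_key))
    return plain


# Build a two-level example by hand (assumed layout: one root, two leaves).
root_key, key_a, key_b = b"root-key", b"leaf-key-a", b"leaf-key-b"
leaf_a = EncNode(toy_cipher(key_a, b"segment A"))
leaf_b = EncNode(toy_cipher(key_b, b"segment B"))
root = EncNode(
    toy_cipher(root_key, b"root metadata"),
    [toy_cipher(root_key, key_a), toy_cipher(root_key, key_b)],
    [leaf_a, leaf_b],
)
decrypted = decrypt_tree(root, root_key)
```

Only the root key travels with the request in this sketch; every other key is recovered while descending, so a wrong root key yields no usable plaintext at any level.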
Abstract:
Mechanisms for predicting a garbage collection (GC) duration are described herein. In one embodiment, the mechanisms include receiving a first set of features determined based on current operating status and prior GC statistics of a first storage system. The mechanisms further include predicting a GC duration of a first GC process being performed at the first storage system by applying a predictive model to the first set of features, wherein the predictive model was generated based on a second set of features received periodically from a plurality of storage systems.
Abstract:
A garbage collector of a storage system traverses a namespace of a file system of the storage system to verify data integrity of segments. The namespace identifies files that are represented by segments arranged in multiple levels in a hierarchy, where an upper level segment includes one or more references to one or more lower level segments, and at least one segment is referenced by multiple files. Traversing the namespace includes computing and verifying checksums of all segments in a level-by-level manner, where checksums of an upper level are verified before any checksums of a lower level are verified. Once all checksums of all levels have been verified, a garbage collection process is performed on the segments stored in the storage system.
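The level-by-level verification can be sketched as follows, assuming segments live in a dictionary keyed by an identifier and that CRC-32 serves as the checksum. The `make_segment` and `verify_levels` helpers and the data layout are hypothetical; the abstract does not specify a checksum algorithm.

```python
import zlib


def make_segment(payload: bytes, children=()):
    """Return (payload, stored checksum, child ids) for a segment."""
    return (payload, zlib.crc32(payload), list(children))


def verify_levels(segments: dict, file_roots: list) -> list:
    """Verify checksums one level at a time, upper levels first.

    Returns segment ids in the order they were verified and raises
    ValueError on the first checksum mismatch. A segment shared by
    several files is verified only once.
    """
    order, seen = [], set()
    level = list(file_roots)
    while level:
        next_level = []
        for seg_id in level:
            if seg_id in seen:
                continue
            seen.add(seg_id)
            payload, stored_crc, children = segments[seg_id]
            if zlib.crc32(payload) != stored_crc:
                raise ValueError(f"checksum mismatch in segment {seg_id!r}")
            order.append(seg_id)
            next_level.extend(children)   # visited only after this level is done
        level = next_level
    return order


# Two files; segment "s" is referenced by both.
segments = {
    "f1": make_segment(b"file one metadata", ["a", "s"]),
    "f2": make_segment(b"file two metadata", ["s"]),
    "a": make_segment(b"unique data"),
    "s": make_segment(b"shared data"),
}
verified = verify_levels(segments, ["f1", "f2"])
```

Garbage collection would start only after `verify_levels` returns without raising, which mirrors the ordering in the abstract.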
Abstract:
Exemplary methods for verifying data integrity for garbage collection with limited memory include maintaining a data structure that includes a plurality of entries, each entry storing states of a group of segments compressed therein. In response to receiving a request for transitioning a first segment from a first state to a second state, the methods include retrieving a first entry value of a first entry associated with the first segment, and generating a second entry value based on the first entry value, the first state, the second state, and a value obtained from a first lookup table based on the first segment. The methods also include writing back the second entry value to the first entry of the data structure. In one embodiment, in response to determining that all entries of the data structure have reached a predetermined final state, a garbage collection process is performed on the segments stored in the storage system.
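One possible reading of the compressed entry, shown as a sketch: three states per segment are packed five to a byte in base 3, and the lookup table holds the positional weight of each segment within its entry. The state encoding, the `StateTable` class, and the packing scheme are assumptions made for illustration; the abstract does not fix any of them.

```python
NUM_STATES = 3    # assumed encoding: 0 = unverified, 1 = in progress, 2 = verified
PER_ENTRY = 5     # 3**5 = 243 values, so five segment states fit in one byte
FINAL_STATE = 2
# Lookup table: positional weight of a segment inside its entry.
WEIGHT = [NUM_STATES ** i for i in range(PER_ENTRY)]


class StateTable:
    def __init__(self, num_segments: int):
        self.num_segments = num_segments
        self.entries = bytearray(-(-num_segments // PER_ENTRY))  # ceiling division

    def get(self, seg: int) -> int:
        """Decode the state of one segment from its packed entry."""
        return self.entries[seg // PER_ENTRY] // WEIGHT[seg % PER_ENTRY] % NUM_STATES

    def transition(self, seg: int, first_state: int, second_state: int) -> None:
        """Move `seg` from first_state to second_state with one read and one write."""
        idx = seg // PER_ENTRY
        first_value = self.entries[idx]
        if self.get(seg) != first_state:
            raise ValueError("segment is not in the expected state")
        # New entry value from the old value, both states, and the table weight.
        second_value = first_value + (second_state - first_state) * WEIGHT[seg % PER_ENTRY]
        self.entries[idx] = second_value

    def all_final(self) -> bool:
        return all(self.get(s) == FINAL_STATE for s in range(self.num_segments))


table = StateTable(7)
table.transition(3, 0, 1)   # segment 3 sits at weight 27 inside entry 0
```

With this packing, seven segments need two bytes instead of seven, and a transition never has to decode the neighboring segments in the same entry.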
Abstract:
Techniques for sanitizing a storage system are described herein. In one embodiment, for each of the fingerprints representing data chunks stored in a first container of the storage system, a lookup operation in a live bit vector based on the fingerprint is performed to determine whether a corresponding data chunk is live. In one embodiment, a bit in a copy bit vector (CBV) corresponding to the data chunk is populated based on the lookup operation. In one embodiment, after all of the bits corresponding to the data chunks of the first container have been populated in the CBV, data chunks represented by the CBV are copied from the first container to a second container, and records of the data chunks in the first container are erased.
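A sketch of the per-container pass, assuming a container is a list of `(fingerprint, chunk)` pairs and that a dictionary maps each fingerprint to its position in the live bit vector. The `sanitize_container` helper and the dictionary lookup are hypothetical; how a fingerprint maps to a bit position is not stated in the abstract.

```python
def sanitize_container(first: list, live_bits: list, bit_index: dict):
    """Populate a copy bit vector (CBV) for one container, then copy forward.

    first     : list of (fingerprint, chunk) pairs; cleared by this call
    live_bits : live bit vector, one bool per known fingerprint
    bit_index : fingerprint -> position in live_bits (assumed lookup scheme)
    Returns (cbv, second) where `second` holds only the live chunks.
    """
    cbv = [False] * len(first)
    for slot, (fingerprint, _chunk) in enumerate(first):
        pos = bit_index.get(fingerprint)
        cbv[slot] = pos is not None and live_bits[pos]
    # Copy only after every bit for this container has been populated.
    second = [entry for entry, bit in zip(first, cbv) if bit]
    first.clear()   # stands in for erasing the records in the first container
    return cbv, second


first_container = [("fp1", b"alpha"), ("fp2", b"beta"), ("fp3", b"gamma")]
bit_index = {"fp1": 0, "fp2": 1, "fp3": 2}
live_bits = [True, False, True]
cbv, second_container = sanitize_container(first_container, live_bits, bit_index)
```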
Abstract:
A computer-implemented method is disclosed. The method starts with determining a first container of a storage system is invalid. The method continues with the storage system setting a data recovery state for the first container to be en-queue, which indicates that data of at least one of the data segments needs to be recovered from the first container, and executing a process to recover any container having an en-queue data recovery state, and for each of the containers, to recover any valid data segment from the corresponding container. The process includes scanning the data segments of the first container to find valid data segments, moving or replicating the valid data segments to a second container, and setting the data recovery state for the first container to be complete once all the valid data segments are moved or replicated to the second container.
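The recovery flow can be sketched as a small state machine. The `RecoveryState` enum, the `Container` class, and the CRC-32 validity check are assumptions for illustration; the abstract names the en-queue and complete states but does not say how a valid segment is recognized.

```python
import zlib
from enum import Enum


class RecoveryState(Enum):
    NONE = "none"
    ENQUEUE = "en-queue"
    COMPLETE = "complete"


class Container:
    def __init__(self, segments):
        self.segments = list(segments)   # (payload, stored checksum) pairs
        self.state = RecoveryState.NONE


def is_valid(segment) -> bool:
    """Assumed validity test: the stored CRC-32 matches the payload."""
    payload, stored_crc = segment
    return zlib.crc32(payload) == stored_crc


def recover(containers, second: Container) -> None:
    """Recover valid segments from every container in the en-queue state."""
    for container in containers:
        if container.state is not RecoveryState.ENQUEUE:
            continue
        second.segments.extend(s for s in container.segments if is_valid(s))
        container.state = RecoveryState.COMPLETE


good = (b"intact", zlib.crc32(b"intact"))
also_good = (b"also intact", zlib.crc32(b"also intact"))
corrupt = (b"damaged", 0)               # checksum deliberately wrong
invalid = Container([good, corrupt, also_good])
healthy = Container([good])
spare = Container([])

invalid.state = RecoveryState.ENQUEUE   # container was found to be invalid
recover([invalid, healthy], spare)
```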
Abstract:
A garbage collector of a storage system traverses a namespace of a file system of the storage system in a breadth-first manner to identify segments that are alive. The namespace includes information identifying files that are represented by segments arranged in a plurality of levels in a hierarchy, where an upper level segment includes one or more references to one or more lower level segments, and at least one segment is referenced by multiple files. All live segments of an upper level are identified before any live segments of a lower level are identified. Once all live segments of all levels have been identified, the live segments are copied from their original storage locations to a new storage location, and a storage space associated with the original storage locations is reclaimed.
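A sketch of breadth-first enumeration followed by copy-forward, assuming references are kept in a dictionary from segment id to child ids. The `mark_live_levels` and `collect` helpers and the in-memory `store` are hypothetical simplifications.

```python
def mark_live_levels(refs: dict, file_roots: list) -> list:
    """Return the live segments as a list of sets, one set per level.

    Every segment of a level is identified before any segment of the
    next level; a segment shared by several files appears once.
    """
    levels, live = [], set()
    level = set(file_roots)
    while level:
        levels.append(level)
        live |= level
        level = {child for seg in level for child in refs.get(seg, ())} - live
    return levels


def collect(store: dict, refs: dict, file_roots: list):
    """Copy live segments to a new store and report what was reclaimed."""
    live = set().union(*mark_live_levels(refs, file_roots))
    new_store = {seg: data for seg, data in store.items() if seg in live}
    reclaimed = sorted(set(store) - live)
    return new_store, reclaimed


store = {"f1": b"m1", "f2": b"m2", "a": b"A", "s": b"S", "dead": b"X"}
refs = {"f1": ["a", "s"], "f2": ["s"]}      # "s" is shared by both files
levels = mark_live_levels(refs, ["f1", "f2"])
new_store, reclaimed = collect(store, refs, ["f1", "f2"])
```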
Abstract:
A first set of garbage collection (GC) features and non-GC features associated with a storage system are received, the first set of features being associated with a predetermined start date and a time window. A learning equation is generated having a plurality of vectors of GC features and a plurality of vectors of non-GC features. For a current iteration representing a current GC process, it is determined whether a first prior GC process was started within the time window. An entry of vectors of the non-GC features of the learning equation is populated based on corresponding feature values of the first set of non-GC features, in response to determining that the first prior GC process was started within the time window. A predetermined regression algorithm is applied to the learning equation to generate a GC duration predictive model to predict a GC duration of a subsequent GC process.
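A sketch of how such a learning equation might be assembled and fitted, assuming ordinary least squares as the regression algorithm and one GC feature plus one non-GC feature per row. The field names (`start_day`, `gc_copied_tb`, `ingest_tb`, `duration_h`), the sample values, and the choice of least squares are all illustrative assumptions; the abstract only says a predetermined regression algorithm is used.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_least_squares(rows, targets):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)


def build_learning_equation(runs, window_days):
    """One row per GC run whose prior run started within the time window."""
    rows, targets = [], []
    for prior, current in zip(runs, runs[1:]):
        if current["start_day"] - prior["start_day"] > window_days:
            continue   # prior GC is outside the window, so skip this iteration
        rows.append([1.0, prior["gc_copied_tb"], current["ingest_tb"]])
        targets.append(current["duration_h"])
    return rows, targets


def predict(weights, features):
    return sum(w * x for w, x in zip(weights, features))


# Made-up history; durations follow 2 + 3 * prior_gc + 0.5 * ingest exactly.
runs = [
    {"start_day": 0, "gc_copied_tb": 1.0, "ingest_tb": 4.0, "duration_h": 9.0},
    {"start_day": 7, "gc_copied_tb": 2.0, "ingest_tb": 2.0, "duration_h": 6.0},
    {"start_day": 14, "gc_copied_tb": 4.0, "ingest_tb": 6.0, "duration_h": 11.0},
    {"start_day": 21, "gc_copied_tb": 3.0, "ingest_tb": 1.0, "duration_h": 14.5},
    {"start_day": 60, "gc_copied_tb": 5.0, "ingest_tb": 3.0, "duration_h": 20.0},
    {"start_day": 67, "gc_copied_tb": 1.0, "ingest_tb": 8.0, "duration_h": 21.0},
]
rows, targets = build_learning_equation(runs, window_days=10)
weights = fit_least_squares(rows, targets)
next_duration = predict(weights, [1.0, 1.0, 5.0])   # hypothetical next GC run
```

The run that starts on day 60 contributes no row because its prior run began 39 days earlier, outside the 10-day window used here.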
Abstract:
In one embodiment, metadata of a data object to be stored in a storage system is received, where the metadata is in a hierarchical structure having multiple levels, each level having multiple nodes and each node being one of a root node, a leaf node, and an intermediate node. Each leaf node represents a deduplicated segment associated with the data object. The hierarchical structure is traversed to encrypt each of the nodes in a bottom-up approach, starting from the leaf nodes, using different keys. A child key for encrypting content of a child node is stored in a parent node that references the child node, and the child key is encrypted by a parent key associated with the parent node. The encrypted content of the nodes is then stored in one or more storage units of the storage system in a deduplicated manner.
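A sketch of the bottom-up pass, assuming a post-order traversal and a fresh random key per node. The `Node` type, the `encrypt_tree` helper, the random key generation, and the XOR-based `toy_cipher` are hypothetical; the abstract does not say how keys are derived, and the toy cipher is not secure.

```python
import hashlib
import secrets
from dataclasses import dataclass, field


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher (NOT secure): XOR with a SHA-256 counter keystream."""
    stream = b""
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return bytes(a ^ b for a, b in zip(data, stream))


@dataclass
class Node:
    content: bytes
    children: list = field(default_factory=list)
    ciphertext: bytes = b""
    wrapped_child_keys: list = field(default_factory=list)


def encrypt_tree(node: Node) -> bytes:
    """Encrypt children first, then this node; return this node's key.

    Each child key is stored in the parent, encrypted under the parent's key.
    """
    child_keys = [encrypt_tree(child) for child in node.children]
    key = secrets.token_bytes(16)   # assumed: a fresh random key per node
    node.wrapped_child_keys = [toy_cipher(key, ck) for ck in child_keys]
    node.ciphertext = toy_cipher(key, node.content)
    return key


leaf_a, leaf_b = Node(b"segment A"), Node(b"segment B")
root = Node(b"root metadata", [leaf_a, leaf_b])
root_key = encrypt_tree(root)   # the only key that must be kept outside the tree
```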
Abstract:
Techniques for sanitizing a storage system are described herein. In one embodiment, for each file stored in the storage system, a list of fingerprints representing data chunks of the file is obtained. In such an embodiment, for each of the fingerprints, a first container storing a data chunk corresponding to the fingerprint is identified, and a storage location of the first container in which the data chunk is stored is determined. In one embodiment, a bit in a copy bit vector (CBV) is populated based on the identified container and the storage location. In one embodiment, after all of the bits corresponding to the data chunks of the first container have been populated in the CBV, data chunks represented by the CBV are copied from the first container to a second container, and records of the data chunks in the first container are erased.
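A sketch of the file-driven variant, assuming an index that maps each fingerprint to a `(container id, slot)` pair. The `build_cbv` and `copy_forward` helpers, the index layout, and the sample data are hypothetical.

```python
def build_cbv(files: dict, index: dict, containers: dict) -> dict:
    """Walk every file's fingerprints and set one CBV bit per referenced chunk.

    files      : file name -> list of fingerprints
    index      : fingerprint -> (container id, slot within the container)
    containers : container id -> list of chunks
    """
    cbv = {cid: [False] * len(chunks) for cid, chunks in containers.items()}
    for fingerprints in files.values():
        for fp in fingerprints:
            container_id, slot = index[fp]
            cbv[container_id][slot] = True
    return cbv


def copy_forward(chunks: list, bits: list) -> list:
    """Copy chunks whose CBV bit is set, then erase the source container."""
    second = [chunk for chunk, bit in zip(chunks, bits) if bit]
    chunks.clear()
    return second


containers = {"c1": [b"alpha", b"beta", b"gamma"]}
index = {"fa": ("c1", 0), "fb": ("c1", 1), "fc": ("c1", 2)}
files = {"x": ["fa", "fc"], "y": ["fc"]}     # no file references "fb" any more
cbv = build_cbv(files, index, containers)
second_container = copy_forward(containers["c1"], cbv["c1"])
```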