ResponseToReview
The OGF asked an (anonymous) expert in the field of grid storage to review the GFD.129 document.

Following the review, the group decided that it was not necessary to revise GFD.129, but that it was definitely worthwhile to respond to the review. A draft was discussed in a handful of emails and at OGF meetings.

The reviewer's comments are in italics, lightly reformatted so that they can be annotated. Hopefully this is sufficiently readable.

Before we start, the group would like to express its gratitude to the reviewer for the review.

Abbreviations

  • SE = Storage Element (an SE provides an SRM interface, along with data transfer and an information interface)
  • SRM = the Storage Resource Manager protocol

The Annotated Review

I reviewed both documents. They provide a storage-based perspective on resource management, suitable for a system administrator. The SRM standards document focuses on the implementation semantics, defining the sets of operations, the criteria under which the operations will be executed, and the state parameters that are used to control the operations. The user experience document lists the breadth of implementation (number of sites) and describes the multiple types of systems that have implemented an SRM interface.

"Both documents" refer to the standards document itself and the experiences document.

Given that over 10 PBs of data are managed by storage systems that use an SRM interface, the documents should be moved to a full recommendation.

That said, the documents do overlook some essential problems that need to be addressed:

1) The SRM Standards document lists two sets of policies, one set for the data (retention, access latency, overwrite) and one set for the storage systems (space reservation, replication, staging, space lifetime).

This is correct: the client submits requests for specific file properties and specific properties of the space in which the file should be stored. These are only requests, though. The SE is permitted to "upgrade" the space. (If the user pays for the quality of space, this may not be a desirable option...)
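
For concreteness, the request side looks roughly like the following Python sketch. It is a minimal illustration only: the field names follow the style of the SRM v2.2 request parameters (retention policy, access latency, file storage type), but the dictionaries are hand-written for this page rather than taken from any particular client library.

    # Illustrative only: the properties a client *requests* for a file and for
    # the space it should live in.  Field names follow the SRM v2.2 style; the
    # dictionaries are hand-written for illustration, not a real client binding.

    space_request = {
        "retentionPolicyInfo": {                       # requested space quality
            "retentionPolicy": "REPLICA",              # REPLICA, OUTPUT or CUSTODIAL
            "accessLatency": "ONLINE",                 # ONLINE or NEARLINE
        },
        "desiredSizeOfGuaranteedSpace": 10 * 1024**3,  # 10 GB
        "desiredLifetimeOfReservedSpace": 86400,       # seconds
    }

    file_request = {
        "targetSURL": "srm://se.example.org/dteam/demo/file1",
        "desiredFileStorageType": "PERMANENT",         # VOLATILE, DURABLE or PERMANENT
        "desiredFileLifetime": 86400,                  # ignored for PERMANENT files
    }

    # The SE treats both sets of properties as requests: it may honour them
    # exactly, or it may "upgrade" the space or the file to something at
    # least as good.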

When there is a single storage resource, the policies can be made consistent with the data lifetime set less than or equal to the space reservation lifetime. When multiple storage systems are integrated into a single Storage element, it is not obvious that both the data policies and the storage policies can be met. The simple example is the migration of data between two storage systems with different policies.

An SE can manage more than one storage resource. It will normally try to meet the client's request exactly; if it cannot, it may "upgrade" the file, i.e. choose storage that is at least as good as what was requested (for example, use permanent space instead of durable space, but never the other way around).
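
One way to express this "at least as good" rule is sketched below in Python. The ordering (volatile below durable below permanent) is simply our reading of the example above; the function is illustrative and is not taken from any SRM implementation.

    # Illustrative only: the "at least as good" rule the SE applies when it
    # cannot meet a request exactly.  The ordering reflects the example above
    # (volatile < durable < permanent); it is not normative text.

    STORAGE_TYPE_RANK = {"VOLATILE": 0, "DURABLE": 1, "PERMANENT": 2}

    def acceptable_upgrade(requested, offered):
        """Return True if the offered storage is at least as good as requested.

        Permanent space may stand in for durable space, but never the reverse.
        """
        return STORAGE_TYPE_RANK[offered] >= STORAGE_TYPE_RANK[requested]

    assert acceptable_upgrade("DURABLE", "PERMANENT")        # upgrade: allowed
    assert not acceptable_upgrade("PERMANENT", "DURABLE")    # downgrade: refused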

A chart that explains how policies are resolved between data and storage is needed.

This is a good idea. We do have some old slides explaining the upgrade process between files and spaces. (TODO: link to this presentation.)

A mechanism to verify that the actual data storage location has policies that are consistent with the desired data policies is needed.

This is also a good suggestion, and much of it is already possible. The actual storage properties of a file can be determined via srmLs (in its detailed, 'ls -l'-style mode it returns a TMetaDataPathDetail object per file); similarly, srmGetSpaceMetadata() returns the corresponding information about a space.
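
The Python sketch below shows roughly what such a check could look like from the client side. It assumes a hypothetical wrapper object, srm, that exposes the SRM v2.2 operations by name; real clients go through the SOAP/WSDL bindings, so the call syntax will differ, but the operations and the fields consulted are those of the specification.

    # Sketch only: checking that a file's actual storage properties match what
    # was requested.  "srm" is a hypothetical wrapper around the SRM v2.2 SOAP
    # operations; the field names follow the spec, the call syntax is invented.

    def verify_placement(srm, surl, space_token, wanted_retention, wanted_latency):
        # srmLs with the full detailed listing returns a TMetaDataPathDetail
        # per file, including retentionPolicyInfo, fileStorageType, lifetimes.
        ls_reply = srm.srmLs(arrayOfSURLs=[surl], fullDetailedList=True)
        detail = ls_reply.details[0]

        # srmGetSpaceMetadata returns the corresponding information per space.
        space_reply = srm.srmGetSpaceMetadata(arrayOfSpaceTokens=[space_token])
        space = space_reply.spaceData[0]

        file_ok = (detail.retentionPolicyInfo.retentionPolicy == wanted_retention
                   and detail.retentionPolicyInfo.accessLatency == wanted_latency)
        space_ok = (space.retentionPolicyInfo.retentionPolicy == wanted_retention
                    and space.retentionPolicyInfo.accessLatency == wanted_latency)
        return file_ok and space_ok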

2) Multiple SRM systems can be federated, with data moved between SRM systems. The issue of policy consistency must also be addressed for data that are distributed between multiple SRM accessible storage elements. The SRM approach assumes that the application will manage consistency properties of data deposited into storage. When data is deposited into multiple SRMs, the user experience document noted that policies do vary across SRMs, and that the application will have to invoke different options depending upon the SRM that is being used. This is a very heavy requirement to put upon an application.

This is quite perceptive: the SE is the grid interface to storage; it is left to the client - or, as in gLite, to higher-level data management services - to request consistent policies across SEs. In practice, the disparity does not matter much, because Permanent space is nearly always requested, and clients choose the desired space, with the appropriate access latency and retention policy, from information published in the information systems.

The reviewer is right that discrepancies may arise when these conventional methods are bypassed. For example, if the user "manually" places data into an SE via a local access protocol, bypassing the data management tools and the SRM layer, the default expiration mode is deleteWhenExpired ("volatile"), which can obviously surprise a user. Nevertheless, implementations do not gratuitously choose different defaults from each other; the protocol must grant implementations some leeway because they genuinely need to manage different types of resources.

We should emphasise that a consistent interface, so that the client need not be aware of which implementation it is talking to, has always been a design goal of the SRM protocol. We have not always been successful, but mostly we have.

3) A second perspective is needed for the standards document, based upon what a user of the system should do for effective interaction (storage and retrieval of data). For each operation, there are a large number of error conditions that need to be checked to decide which subsequent operation to perform. The error conditions may require re-submission of an operation (system temporarily not working), separate queuing of a status request, or use of another Storage Element. The logic behind these choices is not presented.

There is definitely some truth in this statement. We did in fact start producing such a document, long before the document which is now GFD.129 was prepared for OGF.

However, in practice, the interoperation testing carried out between the SRM implementers and the implementers of the client tools has been so effective that such a document was never really needed. This includes the two sets of test suites which try to provoke certain types of error conditions, checking that they get the right error codes in return.

We are thus in a situation where we can say "if in doubt, do what the test suites or the grid clients do," as they are now all broadly consistent.

The document needs to provide a usage example that includes:

  • the command sequence that a user needs to execute to request a space reservation, verify space reservation, request file staging, verify staging, and initiate transfer
  • the logic for responding to possible errors that can be returned for each command

I would expect that a sequence of commands and error recovery logic can be created that would be used by all applications that access SRM-based storage elements.

This is correct: the information can be found in the test suites. For example, at OGF26 in Catania we investigated in detail how the S2 test suite groups requests and handles sequences of requests which achieve higher-level goals - e.g. creating spaces and putting files in them. This includes the error handling.

In short, the information is available.

The goal is a clean way for an application to interact with SRM-controlled storage resources, one that includes both the recovery logic and the sequence of asynchronous operations that need to be invoked to perform the desired operations.
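
To make the shape of such an interaction concrete, here is a minimal Python sketch of the put sequence with simple retry logic, distilled from what the test suites and the grid clients do. It assumes the same hypothetical srm wrapper as above and a placeholder transfer() function; the operation names and status codes (srmReserveSpace, srmPrepareToPut, srmStatusOfPutRequest, srmPutDone, SRM_REQUEST_QUEUED, ...) are those of the SRM v2.2 specification, but the control flow is illustrative, not normative.

    # Sketch only: the asynchronous put sequence with basic error handling,
    # in the spirit of what the S2 test suite and the grid clients do.
    # "srm" is the same hypothetical wrapper as above; transfer() stands in
    # for the actual data movement over the returned transfer URL (TURL).
    import time

    RETRYABLE = {"SRM_INTERNAL_ERROR"}                       # treat as transient
    IN_PROGRESS = {"SRM_REQUEST_QUEUED", "SRM_REQUEST_INPROGRESS"}

    def put_file(srm, surl, local_file, transfer, retries=3):
        for _ in range(retries):
            # 1. Reserve space.  (Reservation can itself be asynchronous; the
            #    polling for that case is omitted here for brevity.)
            rsv = srm.srmReserveSpace(
                retentionPolicyInfo={"retentionPolicy": "REPLICA",
                                     "accessLatency": "ONLINE"},
                desiredSizeOfGuaranteedSpace=1024**3)
            if rsv.returnStatus.statusCode in RETRYABLE:
                time.sleep(30)                               # transient: retry
                continue
            if rsv.returnStatus.statusCode != "SRM_SUCCESS":
                raise RuntimeError("reservation refused: "
                                   + rsv.returnStatus.explanation)

            # 2. Ask for an upload slot; srmPrepareToPut is asynchronous.
            put = srm.srmPrepareToPut(arrayOfFileRequests=[{"targetSURL": surl}],
                                      targetSpaceToken=rsv.spaceToken)
            token = put.requestToken

            # 3. Poll until the SE has prepared a transfer URL.
            status = put.returnStatus.statusCode
            while status in IN_PROGRESS:
                time.sleep(5)
                put = srm.srmStatusOfPutRequest(requestToken=token)
                status = put.returnStatus.statusCode

            if status in RETRYABLE:
                continue                                     # retry whole request
            if status != "SRM_SUCCESS":
                raise RuntimeError("put refused: " + put.returnStatus.explanation)

            # 4. Move the data over the returned TURL, then confirm the upload.
            turl = put.arrayOfFileStatuses[0].transferURL
            transfer(local_file, turl)
            srm.srmPutDone(requestToken=token, arrayOfSURLs=[surl])
            return
        raise RuntimeError("gave up after %d attempts" % retries)

The get (read) sequence follows the same pattern with srmPrepareToGet, srmStatusOfGetRequest and srmReleaseFiles; the point is that the polling and retry logic is the same few lines in every client.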

 



