This is a static archive of the previous Open Grid Forum GridForge content management system saved from host forge.ogf.org file /sf/wiki/do/viewPage/projects.pgi-wg/wiki/DataStaging at Thu, 03 Nov 2022 00:04:37 GMT

wiki2073: DataStaging

Data Staging within EGEE

A significant number of jobs require data as input to the computation activity. Small files can be transferred with the job and are described using the InputSandbox and OutputSandbox attributes, whose values are the names of the files to be transported. Typically both the computational and data resources are distributed geographically. This results in either the jobs being steered to a computational resource located near the data resource or vice versa. To achieve this it is essential to describe the data a specific job requires and where that data is located. This requirement is currently met by the InputData attribute in JDL as used in EGEE. Specific files are described using a Storage URL (SURL), which provides information about the location of a file. SURLs are typically resolved into Transport URLs (TURLs), which give the information necessary to retrieve a physical file, including hostname, path, protocol and port.

SURL: srm://storage.host/my_namespace/my_file.txt
TURL: gsiftp://diskserver.host:2811/internal_path/internal_file_name

Currently supported protocols in EGEE include: gsiftp, rfio, dcap, file
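The components a TURL carries (protocol, hostname, port, path) can be pulled apart with a generic URL parser. The sketch below uses Python's standard `urllib.parse` on the two example URLs given above. The `describe` helper, its return layout, and the `is_transport_protocol` flag are illustrative assumptions and not part of any EGEE tool.

```python
from urllib.parse import urlsplit

# Transport protocols listed above as supported in EGEE.
TRANSPORT_PROTOCOLS = {"gsiftp", "rfio", "dcap", "file"}


def describe(url):
    """Split a SURL or TURL into its components.

    Illustrative helper only (an assumption, not an EGEE API).
    """
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "hostname": parts.hostname,
        "port": parts.port,  # None when the URL names no port
        "path": parts.path,
        # True when the scheme is one of the transport protocols above
        "is_transport_protocol": parts.scheme in TRANSPORT_PROTOCOLS,
    }


surl = describe("srm://storage.host/my_namespace/my_file.txt")
turl = describe("gsiftp://diskserver.host:2811/internal_path/internal_file_name")
```

Only the TURL names a port and a transport protocol; the SURL identifies the file within the storage namespace and still has to be resolved before the file can be fetched.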

Logical Files

Since there may be many replicas of a particular file, the locations of these replicas are recorded in file catalogs. As soon as a file is registered in a catalog it is assigned a Global Unique Identifier (GUID). To simplify use, Logical File Names (LFNs) are usually assigned to a GUID and are used instead. The resolution from LFNs and GUIDs to SURLs/TURLs is achieved by querying the catalog. To do this, the location of the catalog and its type must be known. This is currently done using the ReplicaCatalog attribute, whose value is the URL of the catalog.

Currently supported protocols in EGEE include: lfn, guid

The LCG File Catalog (LFC) is used as the replica catalog.
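The relationship between LFN, GUID and replica SURLs can be sketched as a pair of lookups. The class below is a toy in-memory stand-in written for illustration; it is not the LFC API, and the class name, method names, the `/grid/myvo/...` LFN and the `other.host` replica are all assumptions.

```python
import uuid


class ToyReplicaCatalog:
    """In-memory stand-in for a replica catalog (hypothetical; not the LFC API)."""

    def __init__(self):
        self._guid_by_lfn = {}    # LFN  -> GUID
        self._surls_by_guid = {}  # GUID -> list of replica SURLs

    def register(self, lfn, surl):
        """Register a file: assign a GUID, attach the LFN and the first replica."""
        guid = str(uuid.uuid4())
        self._guid_by_lfn[lfn] = guid
        self._surls_by_guid[guid] = [surl]
        return guid

    def add_replica(self, guid, surl):
        """Record a further replica of an already registered file."""
        self._surls_by_guid[guid].append(surl)

    def resolve(self, name):
        """Resolve 'lfn:...' or 'guid:...' to the list of replica SURLs."""
        scheme, _, value = name.partition(":")
        if scheme == "lfn":
            guid = self._guid_by_lfn[value]
        elif scheme == "guid":
            guid = value
        else:
            raise ValueError(f"unsupported scheme: {scheme!r}")
        return list(self._surls_by_guid[guid])


catalog = ToyReplicaCatalog()
guid = catalog.register("/grid/myvo/my_file.txt",
                        "srm://storage.host/my_namespace/my_file.txt")
catalog.add_replica(guid, "srm://other.host/my_namespace/my_file.txt")
replicas = catalog.resolve("lfn:/grid/myvo/my_file.txt")
```

Resolving by LFN or by GUID yields the same replica list, because the LFN is only an alias for the GUID.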

Logical Data Sets

A Logical Data Set (LDS) is a set of physical files that needs to be available as an atomic unit of data for a physics analysis job. It can be fully replicated to sites, and data catalogs are used to keep track of the locations. The resolution from LDSs to SURLs/TURLs is achieved by querying the catalog. A VO may provide custom dataset catalogs that define their own query mechanisms and interfaces, and hence Generic Query Strings may need to be used. These queries are resolved by the catalog, which returns the locations of the requested datasets.

Currently supported protocols in EGEE include: lds

The DLI interface is used to handle the case where catalogs may present different interfaces.

Note: Within EGEE, the Data Location Interface (DLI) provides an API which contains plugins to different catalogues. In order to simplify the situation where several different implementations exist, a standardized query interface to such catalogs would be required.
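The idea of one query interface in front of several catalog implementations can be sketched with a small plugin scheme. Everything below is hypothetical: the `Catalog` base class, the two plugins, the `locate` dispatcher and the `key=value` query syntax are assumptions made for illustration and do not reproduce the actual DLI API.

```python
from abc import ABC, abstractmethod


class Catalog(ABC):
    """Common query interface every catalog plugin implements (hypothetical)."""

    @abstractmethod
    def locations(self, query):
        """Return the SURLs that match the query string."""


class FileCatalog(Catalog):
    """Plugin for single logical files, keyed by LFN."""

    def __init__(self, replicas):
        self._replicas = replicas  # LFN -> list of SURLs

    def locations(self, query):
        return list(self._replicas.get(query, []))


class DatasetCatalog(Catalog):
    """Plugin for a VO-specific dataset catalog with its own query syntax.

    The 'key=value' syntax is an assumed example of a generic query string.
    """

    def __init__(self, datasets):
        self._datasets = datasets  # list of {"attrs": {...}, "surls": [...]}

    def locations(self, query):
        key, _, value = query.partition("=")
        surls = []
        for dataset in self._datasets:
            if dataset["attrs"].get(key) == value:
                surls.extend(dataset["surls"])
        return surls


def locate(input_data, plugins):
    """Route an InputData entry such as 'lds:run=42' to the plugin for its scheme."""
    scheme, sep, query = input_data.partition(":")
    if not sep or scheme not in plugins:
        raise ValueError(f"no catalog plugin for {input_data!r}")
    return plugins[scheme].locations(query)


plugins = {
    "lfn": FileCatalog({"/grid/myvo/a.txt": ["srm://storage.host/my_namespace/a.txt"]}),
    "lds": DatasetCatalog([
        {"attrs": {"run": "42"},
         "surls": ["srm://storage.host/my_namespace/a.txt",
                   "srm://storage.host/my_namespace/b.txt"]},
    ]),
}
dataset_files = locate("lds:run=42", plugins)
```

The caller only ever uses `locate`; each catalog keeps its own query mechanism behind the shared `locations` method, which is the role the note above assigns to a standardized interface.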

Glossary

Acronym Description
LFN Logical File Name
LDS Logical Data Set
GUID Global Unique Identifier
SURL Storage URL
VO Virtual Organization
 



