This is a static archive of the previous Open Grid Forum GridForge content management system saved from host forge.ogf.org file /sf/discussion/do/listPosts/projects.ggf-editor/discussion.exp_drmaa_v1_0_condor_experience.topc4047 at Thu, 03 Nov 2022 23:21:02 GMT SourceForge : Post

Forum Topic - suspend vs. cancel: (2 Items)
suspend vs. cancel
The DRMAA HOLD operation seems to be implemented with the condor_hold command-line tool, which cancels jobs in the 
vanilla universe. Is this behaviour covered by the DRMAA specification? As I understand it, a job that is put on HOLD 
must be suspended in every case. Is cancelling the job, which is the opposite of suspending it, a valid implementation 
for long-running jobs?
Re: suspend vs. cancel
Thanks for your comment. You have found one of the crucial issues with API support for job suspension. This feature 
always demands some support from either the application or the operating system. We wanted this feature in the API, 
even if the underlying system does not really support it.

We ended up arguing that DRMAA is a pure API specification, with a state model reflected in the API functions. 
Therefore, as long as the behaviour of the API conforms to the DRMAA state transition model, it is not important how 
the implementation fulfills this promise. The resume operation in the Condor vanilla universe restarts the job, but it 
also ensures that the job is running again when the call returns. This fulfills the API promise that the job has to 
return to the running state after the resume call. 
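
To make the argument concrete, here is a minimal sketch of an implementation that keeps the state model intact even 
though the backend can only kill and restart a process. All names (JobState, RestartingBackend, Job) are hypothetical 
and are not part of the DRMAA API or of Condor, and the state model is reduced to two states for illustration.

```python
from enum import Enum


class JobState(Enum):
    # Simplified, illustrative subset of a job state model.
    RUNNING = "running"
    USER_SUSPENDED = "user_suspended"


class RestartingBackend:
    """Hypothetical backend that cannot pause a process, only kill and restart it."""

    def __init__(self):
        self.process_alive = True
        self.starts = 1  # the job was started once at submission

    def kill(self):
        self.process_alive = False

    def start(self):
        self.process_alive = True
        self.starts += 1


class Job:
    """Exposes suspend/resume semantics on top of a kill/restart backend."""

    def __init__(self, backend):
        self.backend = backend
        self.state = JobState.RUNNING

    def suspend(self):
        if self.state is not JobState.RUNNING:
            raise RuntimeError("suspend is only valid from RUNNING")
        # The backend cancels the process, but the caller only sees the state.
        self.backend.kill()
        self.state = JobState.USER_SUSPENDED

    def resume(self):
        if self.state is not JobState.USER_SUSPENDED:
            raise RuntimeError("resume is only valid from USER_SUSPENDED")
        # Restart the process so the job is running again when the call returns.
        self.backend.start()
        self.state = JobState.RUNNING
```

From the caller's point of view only the state transitions are observable; the fact that the process was restarted 
rather than paused shows up only in the backend's own bookkeeping.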

There are scenarios from the Condor guys showing that this behaviour is quite OK. The application catches the 
cancellation signal and performs its own checkpointing before terminating. After the restart, it detects its own 
checkpoint state and recovers by itself. This leads to the expected behaviour, even though the operating system, the 
cluster system and the DRMAA implementation performed a job cancellation.
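
The self-checkpointing scenario described above could look roughly like the following sketch. It assumes that the 
cancellation arrives as SIGTERM and that the job's work can be split into numbered steps; the checkpoint file format 
and every function name here are made up for illustration.

```python
import json
import os
import signal

stop_requested = False


def request_stop(signum, frame):
    # Signal handler: only record the request, the main loop does the work.
    global stop_requested
    stop_requested = True


def install_handler():
    # Assumption: the cluster system delivers SIGTERM on cancellation.
    signal.signal(signal.SIGTERM, request_stop)


def load_checkpoint(path):
    # Return the next step to run, or 0 when no checkpoint exists.
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)["next_step"]
    return 0


def save_checkpoint(path, next_step):
    # Write to a temporary file first so a crash cannot leave a partial file.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"next_step": next_step}, f)
    os.replace(tmp, path)


def run(path, total_steps, on_step=None):
    """Run the remaining steps; return True when finished, False when stopped."""
    global stop_requested
    stop_requested = False
    step = load_checkpoint(path)
    while step < total_steps:
        if stop_requested:
            save_checkpoint(path, step)
            return False
        if on_step is not None:
            on_step(step)
        step += 1
    # Finished: remove a stale checkpoint so a later run starts from scratch.
    if os.path.exists(path):
        os.remove(path)
    return True
```

A real job would call install_handler() once at startup and then run("job.ckpt", total_steps). After being cancelled 
and restarted, the same call continues from the saved step instead of starting over.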

 
 

