SourceForge : View Wiki Page: Require

wiki1843: Require

Composability

The group seems to agree that using elementary dfdl constructs to specify complex dfdl constructs is a Good Thing. We also noted that it's the complex dfdl constructs that make the language intuitive so that users will find it simple to use.

If we're to eat our own cooking, we should:

use xml schema to describe logical structure
use dfdl annotations to describe external representation
use assertions to explain when to use it and what it does

Initiator, separator and terminator are high-level constructs chosen for their expressive power and user convenience. It should be possible to substitute more primitive constructs to achieve the same effect, though the result will probably be more verbose and less convenient. Here is an idea for such a construct: a new property 'dfdl:require'.

Consider this ISO 8601 duration specification: _P1Y2M3DT4H5M6.7S_

The following EBNF grammar may not capture the standard exactly, but it is close enough for this discussion.

duration = "P", [dateDuration], ["T", timeDuration];
dateDuration = [years], [months], [days];
years = integer, "Y";
months = integer, "M";
days = integer, "D";
timeDuration = [hours], [minutes], [seconds];
hours = integer, "H";
minutes = integer, "M";
seconds = real, "S";
integer = [sign], digit, {digit};
sign = "-";
real = integer, [fraction];
fraction = ".", digit, {digit};
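As a rough cross-check of the grammar above, here is a minimal Python sketch that transcribes it into a regular expression and returns the fields that are present. The names `DURATION` and `parse_duration` are made up for this illustration, and returning a flat dictionary (rather than the dateDuration/timeDuration hierarchy) is a simplification, not part of the proposal.

```python
import re
from decimal import Decimal

# integer = [sign], digit, {digit};  real = integer, [fraction];
INTEGER = r"-?\d+"
REAL = INTEGER + r"(?:\.\d+)?"

# duration = "P", [dateDuration], ["T", timeDuration];
# Every field is optional, exactly as in the EBNF.
DURATION = re.compile(
    rf"P(?:(?P<years>{INTEGER})Y)?(?:(?P<months>{INTEGER})M)?(?:(?P<days>{INTEGER})D)?"
    rf"(?:T(?:(?P<hours>{INTEGER})H)?(?:(?P<minutes>{INTEGER})M)?(?:(?P<seconds>{REAL})S)?)?"
)

def parse_duration(text):
    """Return a dict of the fields present in text, or raise ValueError."""
    m = DURATION.fullmatch(text)
    if m is None:
        raise ValueError(f"not a duration: {text!r}")
    result = {}
    for name, raw in m.groupdict().items():
        if raw is not None:
            # seconds is the only 'real' field; the others are integers
            result[name] = Decimal(raw) if name == "seconds" else int(raw)
    return result
```

Note that "M" is disambiguated purely by position: before the "T" it means months, after it minutes.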

Here's an XML representation of the specimen, using an element hierarchy to convey the logical structure.

<duration xmlns="urn:dfdl-example">
	<dateDuration>
		<years>1</years>
		<months>2</months>
		<days>3</days>
	</dateDuration>
	<timeDuration>
		<hours>4</hours>
		<minutes>5</minutes>
		<seconds>6.7</seconds>
	</timeDuration>
</duration>

This could be represented by an XML schema like this (informal sketch for brevity):

element name="duration"
  element name="dateDuration" minOccurs="0"
    element name="years" type="integer"  minOccurs="0"
    element name="months" type="integer" minOccurs="0"
    element name="days" type="integer" minOccurs="0"
  element name="timeDuration"  minOccurs="0"
    element name="hours" type="integer" minOccurs="0"
    element name="minutes" type="integer" minOccurs="0"
    element name="seconds" type="decimal" minOccurs="0"

Now add the convenient and powerful dfdl constructs that describe the physical tags

element name="duration" dfdl:initiator="P"
  element name="dateDuration" minOccurs="0"
    element name="years" type="integer"  minOccurs="0" dfdl:terminator="Y"
    element name="months" type="integer" minOccurs="0" dfdl:terminator="M"
    element name="days" type="integer" minOccurs="0" dfdl:terminator="D"
  element name="timeDuration"  minOccurs="0" dfdl:initiator="T"
    element name="hours" type="integer" minOccurs="0" dfdl:terminator="H"
    element name="minutes" type="integer" minOccurs="0" dfdl:terminator="M"
    element name="seconds" type="decimal" minOccurs="0" dfdl:terminator="S"

This ties the tags and values closely together, and uses two constructs with different semantics. The tags aren't present in the internal representation but are always present in the external representation. If the parser doesn't find the expected terminator, the field isn't there.

Now suppose we have a new lower-level construct 'dfdl:require' whose purpose is only to describe such mandatory external values without any structural implications. We need to expand the XML schema to show the tags explicitly.

element name="duration"
  element name="tag" type="string" dfdl:require="P"
  element name="dateDuration" minOccurs="0"
    element name="years" minOccurs="0"
      element name="value" type="integer"
      element name="tag" type="string" dfdl:require="Y"
    element name="months" minOccurs="0"
      element name="value" type="integer"
      element name="tag" type="string" dfdl:require="M"
    element name="days" minOccurs="0"
      element name="value" type="integer"
      element name="tag" type="string" dfdl:require="D"
  element name="timeDuration"  minOccurs="0"
    element name="tag" type="string" dfdl:require="T"
    element name="hours" minOccurs="0"
      element name="value" type="integer"
      element name="tag" type="string" dfdl:require="H"
    element name="minutes" minOccurs="0"
      element name="value" type="integer"
      element name="tag" type="string" dfdl:require="M"
    element name="seconds" minOccurs="0"
      element name="value" type="decimal"
      element name="tag" type="string" dfdl:require="S"
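To make the intended parsing behaviour of 'dfdl:require' concrete, here is a small Python sketch of a value element followed by a required tag element. All function names (`require`, `parse_integer`, `parse_tagged`) are hypothetical, and treating a failed requirement as "the enclosing optional element is absent" is my reading of the proposal rather than something it spells out.

```python
DIGITS = "0123456789"

def require(data, pos, literal):
    """Consume a mandatory literal at pos; return the new position, or None."""
    end = pos + len(literal)
    return end if data[pos:end] == literal else None

def parse_integer(data, pos):
    """Parse [sign], digit, {digit}; return (value, new_pos), or None."""
    end = pos
    if data[end:end + 1] == "-":
        end += 1
    first_digit = end
    while end < len(data) and data[end] in DIGITS:
        end += 1
    if end == first_digit:
        return None  # no digits found
    return int(data[pos:end]), end

def parse_tagged(data, pos, tag):
    """Parse 'value' then the required 'tag'; None means the element is absent."""
    parsed = parse_integer(data, pos)
    if parsed is None:
        return None
    value, after_value = parsed
    after_tag = require(data, after_value, tag)
    if after_tag is None:
        return None  # requirement failed, so the parser backs out
    return value, after_tag
```

For example, `parse_tagged("1Y2M", 0, "Y")` consumes "1Y", while `parse_tagged("1Y2M", 2, "Y")` fails because the tag found is "M".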

To retain the original simple description for the XML version (internal representation) we can hide the external representation and pick out just the values.

element name="duration"
  element name="external" dfdl:hidden="true"
    element name="tag" type="string" dfdl:require="P"
    element name="dateDuration" minOccurs="0"
      element name="years" minOccurs="0"
        element name="value" type="integer"
        element name="tag" type="string" dfdl:require="Y"
      element name="months" minOccurs="0"
        element name="value" type="integer"
        element name="tag" type="string" dfdl:require="M"
      element name="days" minOccurs="0"
        element name="value" type="integer"
        element name="tag" type="string" dfdl:require="D"
    element name="timeDuration"  minOccurs="0"
      element name="tag" type="string" dfdl:require="T"
      element name="hours" minOccurs="0"
        element name="value" type="integer"
        element name="tag" type="string" dfdl:require="H"
      element name="minutes" minOccurs="0"
        element name="value" type="integer"
        element name="tag" type="string" dfdl:require="M"
      element name="seconds" minOccurs="0"
        element name="value" type="decimal"
        element name="tag" type="string" dfdl:require="S"
  element name="dateDuration" minOccurs="0"
    element name="years" type="integer"  minOccurs="0" 
        dfdl:inputValueCalc="{../../external/dateDuration/years/value}"
    element name="months" type="integer" minOccurs="0" 
        dfdl:inputValueCalc="{../../external/dateDuration/months/value}"
    element name="days" type="integer" minOccurs="0" 
        dfdl:inputValueCalc="{../../external/dateDuration/days/value}"
  element name="timeDuration"  minOccurs="0"
    element name="hours" type="integer" minOccurs="0" 
        dfdl:inputValueCalc="{../../external/timeDuration/hours/value}"
    element name="minutes" type="integer" minOccurs="0" 
        dfdl:inputValueCalc="{../../external/timeDuration/minutes/value}"
    element name="seconds" type="decimal" minOccurs="0" 
        dfdl:inputValueCalc="{../../external/timeDuration/seconds/value}"

On output, dfdl:require means the same as 'dfdl:outputValueCalc' (*see below). I haven't worked it right through, but to finish the job it might be enough to annotate each of the value elements in the 'external' structure like this:

	    element name="years" minOccurs="0"
	      element name="value" type="integer" dfdl:outputValueCalc="{../../../../dateDuration/years}"
	      element name="tag" type="string" dfdl:require="Y"
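The output direction can be sketched in Python as follows: each value is taken from the logical structure and each tag is emitted as a constant. The `LAYOUT` table and `unparse_duration` are invented for this illustration, and representing the logical structure as nested dictionaries is an assumption.

```python
from decimal import Decimal

# (group name, required tag before the group, [(field name, required tag after value)])
LAYOUT = [
    ("dateDuration", "", [("years", "Y"), ("months", "M"), ("days", "D")]),
    ("timeDuration", "T", [("hours", "H"), ("minutes", "M"), ("seconds", "S")]),
]

def unparse_duration(logical):
    """Write the external form: required tags are constants, values come from logical data."""
    out = ["P"]  # the tag required at the start of every duration
    for group, group_tag, fields in LAYOUT:
        if group not in logical:
            continue  # optional group absent, so its tags are not written either
        out.append(group_tag)
        for name, tag in fields:
            if name in logical[group]:
                out.append(f"{logical[group][name]}{tag}")
    return "".join(out)

specimen = {
    "dateDuration": {"years": 1, "months": 2, "days": 3},
    "timeDuration": {"hours": 4, "minutes": 5, "seconds": Decimal("6.7")},
}
```

With this table, `unparse_duration(specimen)` reproduces the specimen string from the start of the page.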

What does all this achieve?

A verbose and cumbersome alternative representation for the neat and convenient dfdl annotations initiator, separator and terminator. Crucially, I hope that the transformation between them might be susceptible to formal description.
A new property that can be used in its own right to describe fixed external markup that has no place in the internal data. Significantly, it has no structural overtones. It may be that XML Schema's facet FIXED will do this, but I haven't convinced myself of that yet.

aside: I should like to drop the trailing abbreviation and call this pair 'inputValue' and 'outputValue'.

Comments by Mike Beckerle 2007-08-08

I had been assuming that dfdl:require was just the xs:fixed facet, i.e., we don't need our own keyword in dfdl for this.

There's a special case though where we would need our own keyword/property which is when the fixed value is one of a number of choices, so we need to be able to express either a list of possibles, or a regexp that matches acceptable data.

Here's a way to handle this though, without any new keyword.

    <xs:element name="separator" type="string" dfdl:hidden="true">
       <...>
           <dfdl:assert>regexpMatch($., ",|;")</dfdl:assert>
       <...>
     </xs:element>

In the above, notice how the assertion uses a predicate test to insist that the value of this "separator" is compatible with the regexp which matches either commas or semicolons.

That is, we can use an assertion expression as a way to stipulate any kind of requirement on what "fixed" data has to be like.
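Here is a minimal Python analogue of that assertion, for illustration only. It assumes that `regexpMatch` is meant to match the whole value (hence `re.fullmatch`), and `assert_matches` is a made-up helper name.

```python
import re

def assert_matches(value, pattern):
    """Return value if the whole of it matches pattern; otherwise raise ValueError."""
    if re.fullmatch(pattern, value) is None:
        raise ValueError(f"{value!r} does not match {pattern!r}")
    return value

# The "separator" element accepts either a comma or a semicolon.
SEPARATOR_PATTERN = ",|;"
```

In a speculative parser, the `ValueError` would correspond to abandoning the current alternative rather than stopping the whole parse.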

This would be sufficient for the specification anyway where we're explaining things in terms of a speculative parsing idiom.

The scheme of describing markup as structural elements like this is exactly the direction I wanted to take things.

btw: the way outputValueCalc is used above looks right. A logical element uses inputValueCalc to reach into representation data and compute the logical value. A representational (typically hidden) element uses outputValueCalc to reach into logical data and compute the representation's value for output. I think people dislike having to write these on two different elements: even when one is an easy inverse of the other, the symmetry isn't obvious because the two pieces of expression "code" aren't adjacent. I don't think that's such a big deal, though.

Comment by Simon Parker 2007-8-13

Having reviewed the XSD value constraint 'fixed', I agree that it does almost all of what dfdl:require does. The important differences are:

XSD doesn't offer an alternative to the attribute form, so some values will require complex escaping.
XSD doesn't support expression or pattern values, so only constant values can be fixed.

These obstacles are probably not sufficient to justify introducing a new dfdl property, particularly with dfdl:assert to constrain input and dfdl:outputValueCalc to specify output.

However, a future version might choose to supplement the standard XSD attribute 'fixed' with a new dfdl property 'dfdl:fixed', defined to be identical except for the value syntax and the element alternative.

Comment by Mike Beckerle 2007-08-14

Still a big issue I think:

Suppose I say element 'x' is terminated by ";", like so:

    <element name="x" type="string" dfdl:lengthKind="delimited" dfdl:terminator=";" dfdl:appliesTo="thisOnly"/>

Semantically, if we say that's equivalent to:

   <sequence>
       <element name="x" type="string"/> <!-- note: lengthKind is removed -->
       <sequence>
       <annotation><appinfo><dfdl:hidden>
          <element name="x_terminator" type="string" fixed=";"/>
       </dfdl:hidden></appinfo></annotation>
       </sequence>
   </sequence>

Now, we've expressed this terminator as a fixed data string that follows the original, but consider that element 'x' now has nothing at all on it about how it is terminated. What we've expressed here is effectively a "squeeze" of the element x before the constant semicolon element x_terminator, but this isn't well defined. We need to somehow express that 'x' has length up to the occurrence of an unescaped, unquoted semicolon. But we don't have a way to do that. That's what dfdl:lengthKind="delimited" with dfdl:terminator=";" means.

Furthermore, by removing these annotations, we potentially leave the modified element declaration for 'x' exposed to inheriting different instructions from the scope. So really we have to put terminator="" (empty string) to turn off delimited termination, or we have to explain that we're rewriting into a lowered form of DFDL where there is no concept of terminators anymore.

We could say that when an element is unconstrained, as 'x' is in the rewrite above, it takes its value from the shortest representation that allows the requirements (fixed values or assertions) of the subsequent elements to be satisfied. This is in effect a description of what it means for something to be terminated by a delimiter, without having to explain what the delimiter itself is. This is a basic amendment to how speculative parsing works when there are elements in sequence, some of which are constrained and others of which are not.
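The "shortest representation" rule can be sketched in Python as below. This is only an illustration under simplifying assumptions: the following element is a single fixed string, escaping and quoting are ignored, and `squeeze` is a hypothetical name.

```python
def squeeze(data, pos, following_fixed):
    """Give the unconstrained element the shortest value that lets the
    following fixed element match.

    Returns (value, position after the fixed element), or None if no
    feasible split exists.
    """
    # Try candidate lengths in increasing order, so the first hit is the shortest.
    for end in range(pos, len(data) + 1):
        if data.startswith(following_fixed, end):
            return data[pos:end], end + len(following_fixed)
    return None
```

For example, `squeeze("abc;def;", 0, ";")` gives 'x' the value "abc" and stops after the first semicolon, which is the behaviour of a terminator without naming one.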

Is this ok?

Comment by Simon Parker, 2007-8-15

Yes, this is OK by me.

The 'squeeze' is clear enough. Absence of 'length' implies 'delimited' or more generally 'constrained by other structures'.

When defining a high-level construct in terms of low-level constructs, perhaps we can impose on ourselves the discipline of dealing only with 'flattened' constructs.

The 'shortest feasible representation' is intuitive enough.

Version	Version Comment	Created By
Version 7		Simon Parker - 08/15/2007
Version 6		Michael J Beckerle - 08/14/2007
Version 5		Michael J Beckerle - 08/14/2007
Version 4		Simon Parker - 08/13/2007
Version 3		Michael J Beckerle - 08/08/2007
Version 2		Simon Parker - 08/08/2007
Version 1		Simon Parker - 08/08/2007