
[Omega-devel] StatDataML

9 messages · David James, Friedrich Leisch, A.J. Rossini +3 more

#
Hi,

I just had a very quick look at the StatDataML proposal --- nice
work!   At the risk of showing my ignorance, I want to mention 
my first impressions.

My first impression is that defining datasets in terms of
arrays and lists is a bit too high a level.  What about
simpler vectors, scalars? (I know that R/S don't have scalars,
but other systems/applications do.)  Can we think of a core
set of "basic" data types (factors, strings, integers, etc.)
from which to build on other, possibly recursive types (perhaps
similar to corba's IDL basic data types or S's datadump?).  
Would it make sense to imagine, say xlispstat/python/java applications 
reading  and interpreting an StatDataML document without serious difficulties? 

My gut feeling (which is often wrong) is that the DTD should make
the data self-describing:  e.g., the factor "machineId" has 
levels (or defining set) "Stepper1", "Stepper2", ... "Stepper20", 
even though the particular dataset at hand has only a
subset of those.  Similarly, perhaps allowing  units and classes
to be included in the dataset (in the case of currency, it is certainly 
a number, perhaps single precision, perhaps not, with specific units 
dollars, euros, pesos, etc.)
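
A minimal sketch of what "self-describing" could mean for the factor and units examples above. The class and field names (Factor, Measurement, levels, codes, units) are assumptions made for illustration; they are not taken from the StatDataML proposal.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Factor:
    """A categorical variable that carries its full defining set of levels."""
    name: str
    levels: Sequence[str]  # complete defining set, not just the observed values
    codes: Sequence[int]   # 0-based indices into `levels`, one per observation

    def values(self) -> list:
        """Observed values, decoded from the codes."""
        return [self.levels[c] for c in self.codes]

    def unused_levels(self) -> list:
        """Levels that are defined but do not occur in this dataset."""
        seen = set(self.codes)
        return [lev for i, lev in enumerate(self.levels) if i not in seen]


@dataclass(frozen=True)
class Measurement:
    """A numeric vector annotated with units (hypothetical field names)."""
    name: str
    values: Sequence[float]
    units: Optional[str] = None


# The factor knows about all twenty steppers although only three occur.
machine_id = Factor(
    name="machineId",
    levels=[f"Stepper{i}" for i in range(1, 21)],
    codes=[0, 2, 2, 19],
)
price = Measurement(name="price", values=[9.99, 12.50], units="euros")
```

Storing codes against an explicit level set is one way to keep the defining set even when a particular dataset uses only a subset of it.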

More long-term, how about application-defined data?  An application may have
its own set of data objects that fully exploits contextual 
information that could be extremely useful to capture and 
communicate.  Also, do the data have to be in ASCII format?  What about 
(possibly mime-encoded) images? sound?

As I mentioned, these are questions coming from my lack of experience
with XML, but may be worth raising now rather than later :-)

David A. James
Statistics Research, Room 2C-253            Phone:  (908) 582-3082       
Bell Labs, Lucent Technologies              Fax:    (908) 582-3340
Murray Hill, NJ 09794-0636
-----------------------------------------------------------------------------
Erich.Neuwirth@univie.ac.at, hothorn@ci.tuwien.ac.at, baier@ci.tuwien.ac.at, 
Christian.Buchta@wu-wien.ac.at
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
2 days later
#
DJ> Hi,
DJ> I just had a very quick look at the StatDataML proposal --- nice
DJ> work!   At the risk of showing my ignorance, I want to mention 
DJ> my first impressions.

DJ> My first impression is that defining datasets in terms of
DJ> arrays and lists is a bit too high a level.  What about
DJ> simpler vectors, scalars? (I know that R/S don't have scalars,
DJ> but other systems/applications do.)  Can we think of a core
DJ> set of "basic" data types (factors, strings, integers, etc.)
DJ> from which to build on other, possibly recursive types (perhaps
DJ> similar to corba's IDL basic data types or S's datadump?).

Hmm, basically we have that ... just that I don't see why it's
necessary to differentiate between a vector (=1-dimensional array) and
higher dimensions, i.e., introduce different tags for it. But if many
others feel like this is necessary: I don't have a strong opinion about
it, we just wanted to keep the thing as simple as possible.

Regarding data types: Torsten and I just discussed that we want to
keep the mode of an array as abstract as possible such that
applications can use the internal representation that fits the data
best.

IMO the following modes will be necessary to represent statistical
data:

logical, nominal, ordinal, integer, real, complex



DJ> Would it make sense to imagine, say xlispstat/python/java applications 
DJ> reading  and interpreting an StatDataML document without serious difficulties? 

Sure! What's the difference?


DJ> My gut feeling (which is often wrong) is that the DTD should make
DJ> the data self-describing:  e.g., the factor "machineId" has 
DJ> levels (or defining set) "Stepper1", "Stepper2", ... "Stepper20", 
DJ> even though the particular dataset at hand has only a
DJ> subset of those.  Similarly, perhaps allowing  units and classes
DJ> to be included in the dataset (in the case of currency, it is certainly 
DJ> a number, perhaps single precision, perhaps not, with specific units 
DJ> dollars, euros, pesos, etc.)

DJ> More long-term, how about application-defined data?  An application may have
DJ> its own set of data objects that fully exploits contextual 
DJ> information that could be extremely useful to capture and 
DJ> communicate.

We definitely need (and want) any user to be able to extend
StatDataML, i.e., define new classes. There should be a set of
standard classes (like dataframe or time series), but also interfaces
for defining new classes.

The current idea (in R) is to have the following: If the SDML object
has a class and there exists a conversion function for that particular
class then use it, otherwise do the default thing.

The conversion function shouldn't do too much, probably mostly renaming
some slots and re-organizing the structure (as classes on different
systems will probably have different structures).
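
The dispatch rule described above (use a class-specific conversion function if one exists, otherwise do the default thing) could be sketched as follows. This is an illustration in Python rather than the R implementation; the registry, the dict-based object layout and the slot names "data" and "values" are all assumptions.

```python
from typing import Any, Callable, Dict

# Hypothetical registry: class name -> conversion function.
CONVERTERS: Dict[str, Callable[[dict], Any]] = {}


def register(class_name: str):
    """Decorator that registers a conversion function for one class."""
    def decorator(func):
        CONVERTERS[class_name] = func
        return func
    return decorator


def default_conversion(obj: dict) -> Any:
    """The 'default thing': hand back the generic structure unchanged."""
    return obj


def convert(obj: dict) -> Any:
    """Use the class-specific converter if one is registered, else the default."""
    converter = CONVERTERS.get(obj.get("class"), default_conversion)
    return converter(obj)


@register("timeseries")
def convert_timeseries(obj: dict) -> dict:
    # Mostly renaming: the local system calls the slot "values", not "data".
    out = dict(obj)
    out["values"] = out.pop("data")
    return out


ts = convert({"class": "timeseries", "data": [1.0, 2.0], "start": 1990})
other = convert({"class": "unknown", "data": [3]})
```

Here ts has its "data" slot renamed to "values", while the object of an unregistered class passes through untouched.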



DJ> Also, do the data have to be in ASCII format?  What about 
DJ> (possibly mime-encoded) images? sound?

Hmm, haven't thought about that yet. 


DJ> As I mentioned, these are questions coming from my lack of experience
DJ> with XML, but may be worth raising now rather than later :-)

YES!!! That's why we called it ``proposal'' rather than
``StatDataML version 1.0'' :-)

Best,
Fritz

PS: We are also no XML experts!
#
FL> IMO the following modes will be necessary to represent statistical
FL> data:

FL> logical, nominal, ordinal, integer, real, complex

Sorry, forgot character.
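
With character added, the list has seven modes. A rough sketch of how an application might map these abstract modes to its own internal representation; the enum, the Python types chosen and the accepted spellings for logical values are assumptions, not part of the proposal.

```python
from enum import Enum


class Mode(Enum):
    LOGICAL = "logical"
    NOMINAL = "nominal"
    ORDINAL = "ordinal"
    INTEGER = "integer"
    REAL = "real"
    COMPLEX = "complex"
    CHARACTER = "character"


# One possible mapping to native types; another application may choose
# a different internal representation for the same abstract mode.
NATIVE_TYPE = {
    Mode.LOGICAL: bool,
    Mode.NOMINAL: str,
    Mode.ORDINAL: str,
    Mode.INTEGER: int,
    Mode.REAL: float,
    Mode.COMPLEX: complex,
    Mode.CHARACTER: str,
}


def parse_value(mode: Mode, text: str):
    """Convert the text of one array entry to the native type for its mode."""
    text = text.strip()
    if mode is Mode.LOGICAL:
        return text.lower() in ("true", "1")  # assumed spellings
    return NATIVE_TYPE[mode](text)
```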

.fritz
#
DJ> Also, do the data have to be in ASCII format?  What about
DJ> (possibly mime-encoded) images? sound?

You can insert _SOME_ base64 encoded binary files.  There are
problems, depending on the image encoding format used
(gif/tiff/png/xbm, I'm not sure which are problematic; see the xml-dev
archives for more info).   I think the solutions were simple.
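
As a sketch of the base64 approach, the following embeds arbitrary bytes as ASCII text inside an XML element and recovers them. The element name "binary" and the "encoding" attribute are made up for illustration, and the format-specific problems mentioned above are not addressed here.

```python
import base64
import xml.etree.ElementTree as ET


def embed_binary(tag: str, payload: bytes) -> ET.Element:
    """Wrap raw bytes in an element whose text is base64-encoded ASCII."""
    elem = ET.Element(tag, {"encoding": "base64"})
    elem.text = base64.b64encode(payload).decode("ascii")
    return elem


def extract_binary(elem: ET.Element) -> bytes:
    """Decode the base64 text of an element back into raw bytes."""
    return base64.b64decode(elem.text or "")


raw = bytes(range(256))  # stand-in for image or sound data
xml_text = ET.tostring(embed_binary("binary", raw), encoding="unicode")
recovered = extract_binary(ET.fromstring(xml_text))
```

The serialized document stays pure ASCII, at the cost of roughly a third more space than the raw bytes.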

There are XML-based proposals for images that are quite interesting (I
think SDL?  or SML?  Look under the XML computational chemistry
section) Basically, it's a flexible vector-graphics format, that
scales on the fly and has other nice features...

Have you looked at MathML integration for providing output?  Frank
Harrell raised quite an interesting question regarding literate
programming, and this (the SDML) could be one aspect of it.

Still haven't looked at it; maybe Friday I'll find some time...

best,
-tony
1 day later
#
  [from your draft]
  The description element
  <!ELEMENT description (title?, source?, date?, version?, comment?)>

  <!ELEMENT title (#PCDATA) >
  <!ELEMENT source (#PCDATA) >
  <!ELEMENT date (#PCDATA) >
  <!ELEMENT version (#PCDATA) >
  <!ELEMENT comment (#PCDATA) >

  The description element itself consists of five elements (title, source,
date, comment, version) which are simple strings and include no other
elements. It is used to provide meta-information about a dataset that is
typically not needed for computations on the data itself.

One of the key issues I worry about is "data set provenance" and I encourage
my clients to be able to trace back to one or more sources to inform
themselves of the quality of the data set (how much reliance they can place
in what they are about to see or manipulate, etc).

1. easy links to an originator would be good e.g. is there any way to use
hypertext in any of the arguments for email or web site referrals?

2. what limits (any?) are there typically on the length of the strings in
the arguments of DESCRIPTION?

3. In S or R or Omega, a statistical dataset's description is always
available by command or menu item (I'm guessing).  Are the arguments
typically available now for any kind of search or filtering or action by my
statistical application (e.g. can I ask the statistical application to only
allow me to work on datasets from a certain source? I guess this is possible
and would be addressed by the application developers, not the developers of
the XML standard but maybe there is something required of the XML standard
to make it possible to hook to applications?)

4. is the list of arguments fixed at five? Or could one allow for multiple
comment1, comment2, ...

[If one can search or otherwise operate on the string in the "comment" field
then I guess you don't need to extend the list of arguments.  That leads to
another question:  how extensible or upgradeable is StatDataML envisioned to
be?  I can imagine a "Data Quality Stamp or Certification" being relevant
within certain communities and it would be nice to have that in the
meta-data description, perhaps as a separate argument.]

Thanks again for your draft proposal on StatDataML, this is a potentially
very important contribution.

Regards
Kevin Little, Ph.D.
Informing Ecological Design, LLC
2213 West Lawn Avenue
Madison, WI  53711
tel 608.251.4355 fax 608.251.0399
email klittle@iecodesign.com


#
As long as all elements contain strings, "source" would be the right place
to save a URL.

Interesting question; I think there is no limit.

In our alpha implementation, readSDML returns the object it read with an
additional "SDMLdescription" attribute. One can then check whether, e.g.,
the source is valid in a particular application.

<!ELEMENT description (title?, source?, date?, version?, comment*)>

allows comments until the end of time (will be added in StatDataML.dtd,
thanks for the hint).

Something like an RSA key included in the description? Will think about
this!
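
A sketch of how an application might read a description that follows the revised content model (any number of comment elements) and then filter on the source. This is not the readSDML implementation; the sample values, the function names and the prefix-based trust check are assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up sample following the revised content model (comment*).
SAMPLE = """\
<description>
  <title>Stepper yields</title>
  <source>http://example.org/data/steppers</source>
  <version>0.1</version>
  <comment>First comment</comment>
  <comment>Second comment</comment>
</description>
"""


def read_description(xml_text: str) -> dict:
    """Collect the optional fields and the list of comments."""
    root = ET.fromstring(xml_text)
    fields = ("title", "source", "date", "version")
    desc = {tag: root.findtext(tag) for tag in fields}  # None if absent
    desc["comments"] = [c.text or "" for c in root.findall("comment")]
    return desc


def source_is_trusted(desc: dict, trusted_prefixes) -> bool:
    """Application-level policy: accept only sources under known prefixes."""
    source = desc.get("source") or ""
    return any(source.startswith(p) for p in trusted_prefixes)


desc = read_description(SAMPLE)
```

The filtering lives entirely in the application; the format only has to expose the source as a plain string.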

Torsten

#
I would think of any certification as being part of the "communication
protocol". I could see a place in the data format for indicating the
source, description, and possibly a special field for the source's
assessment of the quality. If you really want certification (=
authentication) you should have the whole thing encrypted so that it
needs to be unencrypted with the source's public key, but you have to do
that outside the format description (and you need PKI infrastructure to
do it correctly).

Paul Gilbert


#
The idea was not encryption but having something that lets us be sure that
the dataset has not been manipulated. There is a W3C project on XML
signatures at http://www.w3.org/Signature/; maybe one can use something
like this.

Additionally, Fritz and I discussed the helpful mail by Kevin and decided
to add a properties element to the description instead of more
comments. One can save "proprietary", that is, application-specific,
extensions as properties of the document (which is, in our opinion,
closer to the framework and saves some lines of code).

We suggest using the extension *.sdml for StatDataML files; any protests?

Torsten


#
Torsten
This is called authentication. You can add a checksum to the data
format, or something like that, but that just helps stop accidental
manipulation. I think it is useful and may be the way to go, but don't
think of it as any kind of certification or authentication. If you
really want authentication, then the whole format (data, description and
all)  needs to be encrypted by the source with their private key, and it
can then be unencrypted by anyone with the source's public key. The
encryption does not prevent anyone from reading it, since anyone can get
the source's public key. It just prevents manipulation. To do this
properly you then need to be sure that you can get the true source's
public key (not some bogus key provided by whoever is providing you with
manipulated data). That is where Public Key Infrastructure (PKI) is
necessary.
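
The distinction between a checksum and authentication can be illustrated with a short sketch. SHA-256 is an assumed choice of checksum and the sample data are made up; the point is only that anyone who alters the data can also recompute the checksum.

```python
import hashlib


def checksum(data: bytes) -> str:
    """Checksum of the document bytes (SHA-256 is an assumed choice)."""
    return hashlib.sha256(data).hexdigest()


def verify(data: bytes, expected: str) -> bool:
    """True if the data still match the checksum shipped with them."""
    return checksum(data) == expected


original = b"<dataset>1 2 3</dataset>"
digest = checksum(original)

# Accidental corruption: the stored checksum no longer matches.
corrupted = b"<dataset>1 2 8</dataset>"
accidental_detected = not verify(corrupted, digest)

# Deliberate manipulation: whoever changes the data can replace the
# checksum as well, so the forged pair verifies without complaint.
forged_digest = checksum(corrupted)
forgery_passes = verify(corrupted, forged_digest)
```

Closing that gap requires a secret the manipulator does not have, which is where the private key and the PKI come in.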

Needless to say, this gets a bit messy, but there are some important
points. The key cannot be part of the message (data format), otherwise
you cannot unlock the key. That is, the authentication has to be part of
what I loosely call the "communications protocol."  I believe these
services are provided by CORBA. If you really want authentication then
you should be looking at the communications protocol, and that is
separate from the data format.

Paul Gilbert
