When is interactive data visualization useful to use?

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110211/eb109f67/attachment.pl>
----------------------------------------
From: tal.galili at gmail.com
Date: Fri, 11 Feb 2011 08:26:16 +0200
To: r-help at r-project.org
Subject: [R] When is *interactive* data visualization useful to use?

Hello all,

Before getting to my question, I would like to apologize for asking this
question here. My question is not directly an R question, however, I still
find the topic relevant to R community of users - especially due to only *
partial* (current) support for interactive data visualization (see here:
http://cran.r-project.org/web/views/Graphics.html were with iplots we are
waiting for iplots extreme, and with rggobi, it currently can not run with R
2.12 and windows 7 OS).
I guess I would just mention a few related issues that are central to R
that I have encountered. This is not well organized but if there is a point
here I'm suggesting that maybe the thing to do is make R work better with streaming
data and provide a way to pipe text data to and from other graphically oriented 
tools that could be taken from many unrelated sources. 

One issue is the concept of streaming for dealing with unlimited data 
and the other is playing nice with the other tools. I recently encountered
your concerns with R ( a few days ago) wondering if interactive may be a good
way to survey some plot I had- many thousands of points that were hard to
explore without interactive zoom seemed to be a natural for this. Often people
here complain about memory limits with large data sets and it is not unreasonable
to work with indefinitely long data streams and examine results in real time.
I had encountered this in the past, IIRC I wanted to watch histograms from a monte
carlo simulation and wanted to know right away if things were going wrong.

Probably you would want to consider R capabilities along with those of
related tools and means for sharing data. Even complex models or data
are normally reducible to text that can piped around to various tools so
having a feature like this in any tools or packages is important.

If you want to author fixed results but let the viewer interact with them,
maybe look at things like PDF once there are more open source tools for dealing with it. 
I have grown up hating PDF but apparently the viewers can offer reasonable
interactivity with properly authored PDF files. The "Standard" is hardly well 
supported with open source tools
and many features of the standard get referred to "only available if you buy this from Adobe."
This creates two issues, one just being cost and annoyance but the other is ability
to check results. If you suspect something is wrong with open source you are can always
look and taking someone's word for software correctness, well, take a look at the credit
rating agencies LOL. And there is always a concern for an attitude problem with this too as
web designers seem to think that " well we created a huge
brand-name file that is also a 'standard' if it is that big from a big 
company there must be lots of information in all those bytes" as if they get paid by 
the megabyte when often just a csv file would be more important to R users.

If you really want professional graphics with good interactivity and are willing
to dig a little as part of a larger survey, I'd be curious to know if there is anything
that can be extracted from all the interactive games LOL...
And now for my question:

While preparing for a talk I will give soon, I recently started digging into
two major (Free) tools for interactive data visualization:
GGobi
and mondrian  - both offer a great range of
capabilities (even if they're a bit buggy).

I wish to ask for your help in articulating (both to myself, and for my
future audience) *When is it helpful to use interactive plots? Either for
data exploration (for ourselves) and data presentation (for a "client")?*

For when explaining the data to a client, I can see the value of animation
for:

- Using "identify/linking/brushing" for seeing which data point in the
graph is what.
- Presenting a sensitivity analysis of the data (e.g: "if we remove this
point, here is what we will get)
- Showing the effect of different groups in the data (e.g: "let's look at
our graphs for males and now for the females")
- Showing the effect of time (or age, or in general, offering another
dimension to the presentation)

For when exploring the data ourselves, I can see the value of
identify/linking/brushing when exploring an outlier in a dataset we are
working on.

But other then these two examples, I am not sure what other practical use
[[elided Hotmail spam]]
It could be argued that the interactive part is good for exploring (For
example) a different behavior of different groups/clusters in the data. But
when (in practice) I approached such situation, what I tended to do was to
run the relevant statistical procedures (and post-hoc tests) - and what I
found to be significant I would then plot with colors clearly dividing the
data to the relevant groups. From what I've seen, this is a safer approach
then "wondering around" the data (which could easily lead to data dredging
(were the scope of the multiple comparison needed for correction is not even
clear).

I'd be very happy to read your experience/thoughts on this matter.

Thanks in advance,
Tal

----------------Contact
Details:-------------------------------------------------------
Contact me: Tal.Galili at gmail.com | 972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------

[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Dear Tal, dear list,

I think the importance of interactive graphics has a lot do with how visual your 
scientific discipline works. I'm spectroscopist, and I think we are very 
visually oriented: if I think of a spectrum I mentally see a graph.

So for that kind of work, I need a lot of interaction (type: plot, change a bit, 
plot again), e.g.
One example is the removal of spikes from Raman spectra (caused e.g. by cosmic 
rays hitting the detector). It is fairly easy to compute a list of suspicious 
signals. It is already much more complicated to find the actual beginning and 
end of the spike. And it is really difficult not to have false positives by some 
automatic procedure, because the spectra can look very different for different 
samples. It would just take me far longer to find a computational description of 
what is a spike than interactively accepting/rejecting the automatically marked 
suspicions. Even though it feels like slave work ;-)

Roughly the same applies for the choice of pre-processing like baseline 
correction. A number of different physical causes can produce different kinds of 
baselines, and usually you don't know which process contributes to what extent. 
In practice, experience suggests a method, I apply it and look whether the 
result looks as expected. I'm not aware of any performance measure that would 
indicate success here.

The next point where interaction is needed pops up as my data has e.g. spatial 
and spectral dimensions. So do the models usually: e.g. in a PCA, the loadings 
would usually capture the spectroscopic direction, whereas the scores belong to 
the spatial domain. So I have "connected" graphs: the spatial distribution 
(intensity map, score map, etc.), and the spectra (or loadings).
As soon as I have such connections I wish for interactive visualization:
I go back and forth between the plots: what is the spectrum that belongs to this 
region of the map? Where on the sample are high intensities of this band? What 
is the substance behind that: if it is x, the intensities at that other spectral 
band should correlate. And then I want to compare this to the scatterplot (pairs 
plot of the PCA score) or to a dendrogram of HCA...

Also, exploration is not just prerequisite for models, but it frequently is 
already the very proper scientific work (particularly in basic science). The 
more so, if you include exploring the models: Now, which of the bands are 
actually used by my predictive models? Which samples do get their predictions 
because of which spectral feature?
And, the "statistical outliers" may very well be just the interesting part of 
the sample. And the outlier statistics cannot interprete the data in terms of 
interesting ./. crap.

For presentation* of results, I personally think that most of the time a careful 
selection of static graphs is much better than live interaction.
*The thing where you talk to an audience far awayf from your work computer. As 
opposed to sitting down with your client/colleague and analysing the data together.
It could be argued that the interactive part is good for exploring (For
example) a different behavior of different groups/clusters in the data. But
when (in practice) I approached such situation, what I tended to do was to
run the relevant statistical procedures (and post-hoc tests)
As long as the relevant measure exists, sure.
Yet as a non-statistician, my work is focused on the physical/chemical 
interpretation. Summary statistics are one set of tools for me, and interactive 
visualisation is another set of tools (overlapping though).

I may want to subtract the influence of the overall unchanging sample matrix 
(that would be the minimal intensity for each wavelength). But the minimum 
spectrum is too noisy. So I use a quantile. Which one? Depends on the data. I'll 
have a look at a series (say, the 2nd to 10th percentile) and decide trading off 
noise and whether any new signals appear. I honestly think there's nothing 
gained if I sit down and try to write a function scoring the similarity to the 
minimum spectrum and the noise level: the more so as it just shifts the need for 
a decision (How much noise outweighs what intensity of real signal being 
subtracted?). It is a decision I need to take. With number or with eye. And 
after all, my professional training was thought to enable me taking this 
decision, and I'm paid (also) for being able to take this decision efficiently 
(i.e. making a reasonably good choice within not too long time).

After all, it may also have to do with a complaint a colleague from a 
computational data analysis group once had. He said the bad thing with us 
spectroscopists is that our problems are either so easy that there's no fun in 
solving them, or they are too hard to solve.
- and what I
found to be significant I would then plot with colors clearly dividing the
data to the relevant groups. From what I've seen, this is a safer approach
then "wondering around" the data (which could easily lead to data dredging
(were the scope of the multiple comparison needed for correction is not even
clear).
Sure, yet:
- Isn't that what validation was invented for (I mean with a proper, new, 
[double] blind test set after you decided your parameters)?
- Summarizing a whole data set into a few numbers, without having looked at the 
data itself may not be safe, either:
- The few comparisons shouldn't come at the cost of risking a bad modeling 
modelling strategy and fitting parameters because the data was not properly 
examined.

My 2 ct,

Claudia (who in practice warns far more frequently of multiple comparisons and 
validation sets being compromised (not independent) than of too few data 
exploration ;-) )
Claudia Beleites
Dipartimento dei Materiali e delle Risorse Naturali
Universit? degli Studi di Trieste
Via Alfonso Valerio 6/a
I-34127 Trieste

phone: +39 0 40 5 58-37 68
email: cbeleites at units.it
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Dear Tal, dear list,

I think the importance of interactive graphics has a lot do with how
visual your scientific discipline works. I'm spectroscopist, and I think
we are very visually oriented: if I think of a spectrum I mentally see a
graph.

So for that kind of work, I need a lot of interaction (type: plot,
change a bit, plot again), e.g.
One example is the removal of spikes from Raman spectra (caused e.g. by
cosmic rays hitting the detector). It is fairly easy to compute a list
of suspicious signals. It is already much more complicated to find the
actual beginning and end of the spike. And it is really difficult not to
have false positives by some automatic procedure, because the spectra
can look very different for different samples. It would just take me far
longer to find a computational description of what is a spike than
interactively accepting/rejecting the automatically marked suspicions.
Even though it feels like slave work ;-)

Roughly the same applies for the choice of pre-processing like baseline
correction. A number of different physical causes can produce different
kinds of baselines, and usually you don't know which process contributes
to what extent. In practice, experience suggests a method, I apply it
and look whether the result looks as expected. I'm not aware of any
performance measure that would indicate success here.

The next point where interaction is needed pops up as my data has e.g.
spatial and spectral dimensions. So do the models usually: e.g. in a
PCA, the loadings would usually capture the spectroscopic direction,
whereas the scores belong to the spatial domain. So I have "connected"
graphs: the spatial distribution (intensity map, score map, etc.), and
the spectra (or loadings).
As soon as I have such connections I wish for interactive visualization:
I go back and forth between the plots: what is the spectrum that belongs
to this region of the map? Where on the sample are high intensities of
this band? What is the substance behind that: if it is x, the
intensities at that other spectral band should correlate. And then I
want to compare this to the scatterplot (pairs plot of the PCA score) or
to a dendrogram of HCA...

Also, exploration is not just prerequisite for models, but it frequently
is already the very proper scientific work (particularly in basic
science). The more so, if you include exploring the models: Now, which
of the bands are actually used by my predictive models? Which samples do
get their predictions because of which spectral feature?
And, the "statistical outliers" may very well be just the interesting
part of the sample. And the outlier statistics cannot interprete the
data in terms of interesting ./. crap.

For presentation* of results, I personally think that most of the time a
careful selection of static graphs is much better than live interaction.
*The thing where you talk to an audience far awayf from your work
computer. As opposed to sitting down with your client/colleague and
analysing the data together.

It could be argued that the interactive part is good for exploring (For
example) a different behavior of different groups/clusters in the
data. But
when (in practice) I approached such situation, what I tended to do
was to
run the relevant statistical procedures (and post-hoc tests)
As long as the relevant measure exists, sure.
Yet as a non-statistician, my work is focused on the physical/chemical
interpretation. Summary statistics are one set of tools for me, and
interactive visualisation is another set of tools (overlapping though).

I may want to subtract the influence of the overall unchanging sample
matrix (that would be the minimal intensity for each wavelength). But
the minimum spectrum is too noisy. So I use a quantile. Which one?
Depends on the data. I'll have a look at a series (say, the 2nd to 10th
percentile) and decide trading off noise and whether any new signals
appear. I honestly think there's nothing gained if I sit down and try to
write a function scoring the similarity to the minimum spectrum and the
noise level: the more so as it just shifts the need for a decision (How
much noise outweighs what intensity of real signal being subtracted?).
It is a decision I need to take. With number or with eye. And after all,
my professional training was thought to enable me taking this decision,
and I'm paid (also) for being able to take this decision efficiently
(i.e. making a reasonably good choice within not too long time).

After all, it may also have to do with a complaint a colleague from a
computational data analysis group once had. He said the bad thing with
us spectroscopists is that our problems are either so easy that there's
no fun in solving them, or they are too hard to solve.

- and what I
found to be significant I would then plot with colors clearly dividing
the
data to the relevant groups. From what I've seen, this is a safer
approach
then "wondering around" the data (which could easily lead to data
dredging
(were the scope of the multiple comparison needed for correction is
not even
clear).
Sure, yet:
- Isn't that what validation was invented for (I mean with a proper,
new, [double] blind test set after you decided your parameters)?
- Summarizing a whole data set into a few numbers, without having looked
at the data itself may not be safe, either:
- The few comparisons shouldn't come at the cost of risking a bad
modeling modelling strategy and fitting parameters because the data was
not properly examined.

My 2 ct,

Claudia (who in practice warns far more frequently of multiple
comparisons and validation sets being compromised (not independent) than
of too few data exploration ;-) )
These are very interesting and valid points. But which tools are
recommended / usefull for interactive graphs for data evaluation? I
somehow have difficulties getting my head around ggobi, and haven't yet
tried out mondian (but I will). Are there any other ones (as we are ion
the R list - which integrate with R) which can be recommended?

Rainer

- -- 
Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
Biology, UCT), Dipl. Phys. (Germany)

Centre of Excellence for Invasion Biology
Natural Sciences Building
Office Suite 2039
Stellenbosch University
Main Campus, Merriman Avenue
Stellenbosch
South Africa

Tel:        +33 - (0)9 53 10 27 44
Cell:       +27 - (0)8 39 47 90 42
Fax (SA):   +27 - (0)8 65 16 27 82
Fax (D) :   +49 - (0)3 21 21 25 22 44
Fax (FR):   +33 - (0)9 58 10 27 44
email:      Rainer at krugs.de

Skype:      RMkrug
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk1Y+SYACgkQoYgNqgF2egohOQCeNdhPw6hJ+Ikd3QyDkHE0J47q
oSkAn1uzat8G70Nq78iOsCCedCYPmZGf
=d7jP
-----END PGP SIGNATURE-----
There are some interactive graphics tools in the TeachingDemos package (tkBrush allows brushing, tkexamp helps you create your own interactive graphics, etc.).

There are also the iplots package, the rgl package (spinning in 3 dimonsions), 'tkrplot' package, the fgui package, the playwith package, and the rpanel package that may be of interest.

Just a few things to look at.
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111

[snip]

> These are very interesting and valid points. But which tools are
> recommended / usefull for interactive graphs for data evaluation? I
> somehow have difficulties getting my head around ggobi, and haven't yet
> tried out mondian (but I will). Are there any other ones (as we are ion
> the R list - which integrate with R) which can be recommended?
> 
> Rainer
> 
> >
> 
> 
> - --
> Rainer M. Krug, PhD (Conservation Ecology, SUN), MSc (Conservation
> Biology, UCT), Dipl. Phys. (Germany)
> 
> Centre of Excellence for Invasion Biology
> Natural Sciences Building
> Office Suite 2039
> Stellenbosch University
> Main Campus, Merriman Avenue
> Stellenbosch
> South Africa
> 
> Tel:        +33 - (0)9 53 10 27 44
> Cell:       +27 - (0)8 39 47 90 42
> Fax (SA):   +27 - (0)8 65 16 27 82
> Fax (D) :   +49 - (0)3 21 21 25 22 44
> Fax (FR):   +33 - (0)9 58 10 27 44
> email:      Rainer at krugs.de
> 
> Skype:      RMkrug
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.10 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
> 
> iEYEARECAAYFAk1Y+SYACgkQoYgNqgF2egohOQCeNdhPw6hJ+Ikd3QyDkHE0J47q
> oSkAn1uzat8G70Nq78iOsCCedCYPmZGf
> =d7jP
> -----END PGP SIGNATURE-----
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-
> guide.html
> and provide commented, minimal, self-contained, reproducible code.
Tal,

One interactive capability that I have repeatedly wished for (but
never taken the time to develop with the existing R tools) is the
ability to interactively zoom in on and out of a data set, and to
interactively create "call-outs of sections of the data. Much of the
data that I deal with takes the form of time series where both the
full data and small section carry meaningful information.

Some of the capabilities of Deducer approach interactive graphing,
such as adjusting alpha values or smoothers, though the updates don't
happen in quite real-time.

- Tom
Hello all,

Before getting to my question, I would like to apologize for asking this
question here. ?My question is not directly an R question, however, I still
find the topic relevant to R community of users ?- especially due to only *
partial* (current) support for interactive data visualization (see here:
http://cran.r-project.org/web/views/Graphics.html ?were with iplots we are
waiting for iplots extreme, and with rggobi, it currently can not run with R
2.12 and windows 7 OS).

And now for my question:

While preparing for a talk I will give soon, I recently started digging into
two major (Free) tools for interactive data visualization:
GGobi<http://www.ggobi.org/>
?and mondrian <http://rosuda.org/mondrian/> - both offer a great range of
capabilities (even if they're a bit buggy).

I wish to ask for your help in articulating (both to myself, and for my
future audience) *When is it helpful to use interactive plots? Either for
data exploration (for ourselves) and data presentation (for a "client")?*

For when explaining the data to a client, I can see the value of animation
for:

 ? - Using "identify/linking/brushing" for seeing which data point in the
 ? graph is what.
 ? - Presenting a sensitivity analysis of the data (e.g: "if we remove this
 ? point, here is what we will get)
 ? - Showing the effect of different groups in the data (e.g: "let's look at
 ? our graphs for males and now for the females")
 ? - Showing the effect of time (or age, or in general, offering another
 ? dimension to the presentation)

For when exploring the data ourselves, I can see the value of
identify/linking/brushing when exploring an outlier in a dataset we are
working on.

But other then these two examples, I am not sure what other practical use
these techniques offer. Especially for our own data exploration!

It could be argued that the interactive part is good for exploring (For
example) a different behavior of different groups/clusters in the data. But
when (in practice) I approached such situation, what I tended to do was to
run the relevant statistical procedures (and post-hoc tests) - and what I
found to be significant I would then plot with colors clearly dividing the
data to the relevant groups. From what I've seen, this is a safer approach
then "wondering around" the data (which could easily lead to data dredging
(were the scope of the multiple comparison needed for correction is not even
clear).

I'd be very happy to read your experience/thoughts on this matter.

Thanks in advance,
Tal

----------------Contact
Details:-------------------------------------------------------
Contact me: Tal.Galili at gmail.com | ?972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------

 ? ? ? ?[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org?mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Tal,

One interactive capability that I have repeatedly wished for (but
never taken the time to develop with the existing R tools) is the
ability to interactively zoom in on and out of a data set,

I believe that you can do this with playwith. See this [1]. Regards
Liviu

[1] http://code.google.com/p/playwith/wiki/Screenshots#Time_series_plot_(Lattice)

and to
interactively create "call-outs of sections of the data. Much of the
data that I deal with takes the form of time series where both the
full data and small section carry meaningful information.

Some of the capabilities of Deducer approach interactive graphing,
such as adjusting alpha values or smoothers, though the updates don't
happen in quite real-time.

- Tom

On Friday, February 11, 2011, Tal Galili <tal.galili at gmail.com> wrote:
Hello all,

Before getting to my question, I would like to apologize for asking this
question here. ?My question is not directly an R question, however, I still
find the topic relevant to R community of users ?- especially due to only *
partial* (current) support for interactive data visualization (see here:
http://cran.r-project.org/web/views/Graphics.html ?were with iplots we are
waiting for iplots extreme, and with rggobi, it currently can not run with R
2.12 and windows 7 OS).

And now for my question:

While preparing for a talk I will give soon, I recently started digging into
two major (Free) tools for interactive data visualization:
GGobi<http://www.ggobi.org/>
?and mondrian <http://rosuda.org/mondrian/> - both offer a great range of
capabilities (even if they're a bit buggy).

I wish to ask for your help in articulating (both to myself, and for my
future audience) *When is it helpful to use interactive plots? Either for
data exploration (for ourselves) and data presentation (for a "client")?*

For when explaining the data to a client, I can see the value of animation
for:

?? - Using "identify/linking/brushing" for seeing which data point in the
?? graph is what.
?? - Presenting a sensitivity analysis of the data (e.g: "if we remove this
?? point, here is what we will get)
?? - Showing the effect of different groups in the data (e.g: "let's look at
?? our graphs for males and now for the females")
?? - Showing the effect of time (or age, or in general, offering another
?? dimension to the presentation)

For when exploring the data ourselves, I can see the value of
identify/linking/brushing when exploring an outlier in a dataset we are
working on.

But other then these two examples, I am not sure what other practical use
these techniques offer. Especially for our own data exploration!

It could be argued that the interactive part is good for exploring (For
example) a different behavior of different groups/clusters in the data. But
when (in practice) I approached such situation, what I tended to do was to
run the relevant statistical procedures (and post-hoc tests) - and what I
found to be significant I would then plot with colors clearly dividing the
data to the relevant groups. From what I've seen, this is a safer approach
then "wondering around" the data (which could easily lead to data dredging
(were the scope of the multiple comparison needed for correction is not even
clear).

I'd be very happy to read your experience/thoughts on this matter.

Thanks in advance,
Tal

----------------Contact
Details:-------------------------------------------------------
Contact me: Tal.Galili at gmail.com | ?972-52-7275845
Read me: www.talgalili.com (Hebrew) | www.biostatistics.co.il (Hebrew) |
www.r-statistics.com (English)
----------------------------------------------------------------------------------------------

?? ? ? ?[[alternative HTML version deleted]]

______________________________________________
R-help at r-project.org?mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Do you know how to read?
http://www.alienetworks.com/srtest.cfm
http://goodies.xfce.org/projects/applications/xfce4-dict#speed-reader
Do you know how to write?
http://garbl.home.comcast.net/~garbl/stylemanual/e.htm#e-mail