
NA, where no NA should (could!) be!

14 messages · oliver, Sarah Goslee, David Winsemius +5 more

#
Hello,

I'm working on my weblog script again and have run into a problem.




This code:

===========================
weblog <- read_weblog("web.log")
weblog_by_date <- split(weblog, weblog$date)

#for ( i in names(weblog_by_day) ) { print(i); print(weblog_by_day$i) }
for ( datum in names(weblog_by_date) )
{
  print(datum)
  selected <- weblog_by_date[[datum]]

  res_size_by_host <- tapply( selected$size, selected$host, sum)
  mycat <- function(a,b) cat(paste(a, "==>", b, "\n"))
  mapply( mycat, selected$size, selected$host )
  print( res_size_by_host )
}
===========================



produces this result (only a part is shown!):

=======================================
  124.0.210.117   145.253.3.244   160.91.44.155   174.36.196.98    193.47.80.48
             NA              NA              NA              NA              NA
  200.212.63.51   200.87.53.234   208.80.194.30   208.80.194.35   208.80.194.46
             NA             294              NA            5774              NA
  208.80.194.49   209.17.171.58   210.207.57.39  211.171.202.85   211.43.212.94
=======================================

There are no NA values in the data, because the function read_weblog()
replaces every NA with 0.

So there should be no way to produce NAs!

How can this happen?


Ciao,
   Oliver
#
I think we need the reproducible example requested in
the posting guide.

Sarah

On Sat, Dec 20, 2008 at 4:42 PM, Oliver Bandel
<oliver at first.in-berlin.de> wrote:

  
    
#
Sarah Goslee <sarah.goslee <at> gmail.com> writes:
====================
for ( datum in names(weblog_by_date) )
{ 
  print(datum)
  selected <- weblog_by_date[[datum]]

  res_size_by_host <- tapply( selected$size, selected$host, sum) 
  mycat <- function(a,b) cat(paste(a, "==>", b, "\n"))
  mapply( mycat, selected$size, selected$host )
  print( res_size_by_host )

  print( "is there any NA?!")
  print( any( is.na(selected$size)) )

}
====================



At the end of the printouts, it gives me:

=======================
 94.101.145.110     94.23.3.220 
             NA              NA 
[1] "is there any NA?!"
[1] FALSE
=======================


Strange, eh?!

Ciao,
   Oliver
#
Oliver Bandel wrote:
Why do so many people have such trouble with the word "reproducible"? We 
can't reproduce that without access to weblog_by_date!

Anyway, I think it is tapply that is behaving unexpectedly for you:

 > x <- factor(1,levels=1:2)
 > tapply(1,x,sum)
  1  2
  1 NA

which is kind of surprising since the sum over an empty set is usually 
zero. However, that _is_ what the documentation for tapply says:

      When 'FUN' is present, 'tapply' calls 'FUN' for each cell that has
      any data in it.  If 'FUN' returns a single atomic value for each
      such cell (e.g., functions 'mean' or 'var') and when 'simplify' is
      'TRUE', 'tapply' returns a multi-way array containing the values,
      and 'NA' for the empty cells.

A passable workaround is

 > sapply(split(1,x),sum)
1 2
1 0
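
The same effect can be reproduced with a small data frame shaped like the weblog. This is only a sketch: the column names date, host and size follow the original post, but the values are made up, and stringsAsFactors = TRUE is set explicitly so that host is a factor.

```r
# Made-up weblog data; host becomes a factor with three levels.
weblog <- data.frame(
  date = c("2008-12-19", "2008-12-19", "2008-12-20"),
  host = c("10.0.0.1", "10.0.0.2", "10.0.0.3"),
  size = c(100, 200, 300),
  stringsAsFactors = TRUE
)

# Each piece keeps ALL levels of host, even hosts absent on that date.
selected <- split(weblog, weblog$date)[["2008-12-20"]]

# There is no NA in the data itself ...
has_na <- any(is.na(selected$size))

# ... but tapply fills the cells of the unused levels with NA.
na_result <- tapply(selected$size, selected$host, sum)
print(na_result)

# split() yields empty vectors for unused levels; sum() of an empty vector is 0.
zero_result <- sapply(split(selected$size, selected$host), sum)
print(zero_result)
```

Recent versions of R also give tapply a default argument that sets the value used for empty cells; check ?tapply in your installation before relying on it.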

  
    
#
Quoting Peter Dalgaard <p.dalgaard at biostat.ku.dk>:
[...]


Creating test data takes more time: I have to change the original
IP addresses to fake addresses before posting it here.
Also, I doubt a *.zip file would be accepted, but that would
have been the next thing I wanted to try.

If it is not possible to send binary attachments, then it will not be
possible to send test data here, because the lines in
the logfile are longer than what my current webmailer allows me to
send without breaking them.

I also hoped that people would know the traps and could help by just looking
at the code and knowing where to look for the problem.

As you have now shown, this is possible, because you knew where to look
for the problem, which shows me that you are very experienced in R.
See above: the problem is providing such data and needing time to create
it.
[...]

Thank you.

This looks like the solution for that simple case.


I hope I can adapt it to my data structure.

The problem here is that there are no empty cells
in my data. There is always a numeric value of
0 or greater, because I threw out every NA and
substituted 0 for it.

The data is inside a data frame.
How can there be an empty cell in a data frame?
There are no NAs and no NaNs...
...and the factors must be new each time,
because the data is created anew,
and I also used rm(selected) to be sure that no
factors are left over from the last access...

Did I overlook something?

Ciao,
   Oliver

P.S.: I will try to attach my zip-file now... it contains
      the complete code and a changed weblog (changed IP-addresses).
      I hope the list accepts it.
#
Oliver Bandel <oliver <at> first.in-berlin.de> writes:

[...]
[...]

As expected, this did not work.

But I found where the problem might be located...


With a small test file I tried this after
running my script:

===============================================
[1] "22.99.44.101"   "266.249.71.143" "5.66.61.230"    "66.29.1.13"    
[5] "7.6.1.20"       "7.6.14.240"
[1] "7.6.1.20"    "7.6.14.240"  "5.66.61.230"
===============================================

So, somehow there are unused levels inside.
Using drop=TRUE in split() did not help.
So I have to fix it somewhere later.

But when I do it before tapply, there is an error message...
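
A sketch of how the unused levels can be made visible, and of why drop=TRUE in split() makes no difference here: drop only removes unused levels of the grouping factor passed to split(), not of other factor columns in the data frame. The data below is made up, in the layout of the original post.

```r
# Made-up data in the layout of the original post.
weblog <- data.frame(
  date = c("2008-12-19", "2008-12-19", "2008-12-20"),
  host = c("10.0.0.1", "10.0.0.2", "10.0.0.3"),
  size = c(100, 200, 300),
  stringsAsFactors = TRUE
)

# drop = TRUE only affects unused levels of the grouping factor (date).
pieces   <- split(weblog, weblog$date, drop = TRUE)
selected <- pieces[["2008-12-20"]]

all_levels  <- levels(selected$host)                # every level the factor knows
used_levels <- as.character(unique(selected$host))  # only the hosts actually present
unused      <- setdiff(all_levels, used_levels)     # levels that produce empty cells

print(all_levels)
print(used_levels)
print(unused)
```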

Ciao,
   Oliver
#
Hello,


ok, I found the problem!



I now have:

    res_size_by_host <- tapply( selected$size, factor(selected$host), sum)


instead of

  
  res_size_by_host <- tapply( selected$size, selected$host, sum)


and now it works.

IMHO this is strange, because selected$host is already a factor!


I don't know why this must be done...
...one of the R experts might know...
...and may explain it...?!


Ciao,
   Oliver
#
On Dec 20, 2008, at 6:26 PM, Oliver Bandel wrote:

            
It does not take an expert. All you need to do is read the help page.   
Dalgaard already diagnosed the problem. Look at his example and see  
what your "solution" does to it.

 > x <- factor(1,levels=1:2)
 >  tapply(1,x,sum)
  1  2
  1 NA

 > x <- factor(1,levels=1:2)
 >  tapply(1,factor(x),sum)
1
1

The function, factor, applied to a factor with unused levels discards  
those levels.

 From the factor help page:
"Normally the 'levels' used as an attribute of the result are the  
reduced set of levels after removing those in exclude, but this can be  
altered by supplying labels."

Since NA is the default for exclude, that results in the "trimming  
down" that you see with the application of factor(.)
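
A minimal sketch of that trimming, using the factor from the example above. droplevels() is shown as an alternative; it exists in current versions of R but may postdate this thread, so treat its availability as an assumption about your installation.

```r
x <- factor(1, levels = 1:2)
print(levels(x))        # the factor still carries the unused level "2"

# Re-applying factor() rebuilds the levels from the values actually present.
y <- factor(x)
print(levels(y))

# droplevels() does the same and also accepts a whole data frame.
z <- droplevels(x)
print(levels(z))

# With the unused level gone, tapply has no empty cell left to fill with NA.
res <- tapply(1, y, sum)
print(res)
```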
#
Oliver Bandel wrote:
I already told you: You have empty levels in selected$host. Look at the 
example I showed you.
#
Peter Dalgaard <p.dalgaard <at> biostat.ku.dk> writes:
In a strict sense, the example is "reproducible" as opposed to "spurious".
Reproducible research means that you can get the same results when you buy 
an ultracentrifuge, high-grade chemicals, a safety lab, and a technician 
with a golden hand :)

We should probably use "self-running" instead, or whatever a 
native speaker would suggest as an alternative. Even in German I do not know 
of a better word; it should mean "can be pasted into rterm and gives the 
same result".

Dieter
#
On Sun, Dec 21, 2008 at 5:42 AM, Dieter Menne
<dieter.menne at menne-biomed.de> wrote:
I think reproducible is the correct word and its meaning should be clear from
both its conventional meaning, see link, and the context in which it is used:
http://en.wikipedia.org/wiki/Reproducibility

It is surprising how many posters disregard this basic requirement for a post,
clearly stated at the bottom of each message to r-help.
#
On 21/12/2008 7:57 AM, Gabor Grothendieck wrote:
I don't find it surprising.   Putting together a good bug report 
requires several skills that need to be learned.  I suspect medical 
doctors and auto mechanics also work with poor reports of what's wrong. 
  I do sometimes find it frustrating (as I imagine doctors and auto 
mechanics do), but probably not as frustrating as the posters find it.
Now really, who reads repetitive stuff at the bottom of messages?  The 
dividing line clearly indicates that it's some formal requirement, not 
meant to be read.

Duncan Murdoch
#
On Sun, Dec 21, 2008 at 8:52 AM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
I think most people do read it since most posts ask in a reproducible way
and the whole idea of repetition, as in advertising, is that such
repetition can be
effective.
#
Gabor Grothendieck wrote:
Well, the footer
says 'reproducible code', but code is what you really want to get, not
to reproduce ;)

vQ