Hello,
again I'm on my weblog-script... having problems...
This code:
===========================
weblog <- read_weblog("web.log")
weblog_by_date <- split(weblog, weblog$date)
#for ( i in names(weblog_by_day) ) { print(i); print(weblog_by_day$i) }
for ( datum in names(weblog_by_date) )
{
  print(datum)
  selected <- weblog_by_date[[datum]]
  res_size_by_host <- tapply( selected$size, selected$host, sum)
  mycat <- function(a,b) cat(paste(a, "==>", b, "\n"))
  mapply( mycat, selected$size, selected$host )
  print( res_size_by_host )
}
===========================
produces this result (only a part is shown!):
=======================================
 124.0.210.117  145.253.3.244  160.91.44.155  174.36.196.98   193.47.80.48
            NA             NA             NA             NA             NA
 200.212.63.51  200.87.53.234  208.80.194.30  208.80.194.35  208.80.194.46
            NA            294             NA           5774             NA
 208.80.194.49  209.17.171.58  210.207.57.39 211.171.202.85  211.43.212.94
=======================================
There are no "NA"-values, because the function read_weblog()
replaces all NA by 0.
So there should be no way to produce NA's!
How can this happen?
Ciao,
Oliver
NA, where no NA should (could!) be!
14 messages · oliver, Sarah Goslee, David Winsemius +5 more
I think we need the reproducible example requested in the posting guide.

Sarah

On Sat, Dec 20, 2008 at 4:42 PM, Oliver Bandel <oliver at first.in-berlin.de> wrote:
[...]

Sarah Goslee http://www.functionaldiversity.org
Sarah Goslee <sarah.goslee <at> gmail.com> writes:
I think we need the reproducible example requested in the posting guide.
====================
for ( datum in names(weblog_by_date) )
{
  print(datum)
  selected <- weblog_by_date[[datum]]
  res_size_by_host <- tapply( selected$size, selected$host, sum)
  mycat <- function(a,b) cat(paste(a, "==>", b, "\n"))
  mapply( mycat, selected$size, selected$host )
  print( res_size_by_host )
  print( "is there any NA?!")
  print( any( is.na(selected$size)) )
}
====================
At the end of the printouts, it gives me:
=======================
94.101.145.110 94.23.3.220
NA NA
[1] "is there any NA?!"
[1] FALSE
=======================
Strange, eh?!

Ciao,
Oliver
Oliver Bandel wrote:
[...]
Why do so many people have such trouble with the word "reproducible"? We
can't reproduce that without access to weblog_by_date!
Anyways I think it is tapply that is behaving unexpectedly to you:

> x <- factor(1,levels=1:2)
> tapply(1,x,sum)
 1  2
 1 NA

which is kind of surprising since the sum over an empty set is usually
zero. However, that _is_ what the documentation for tapply says:

     When 'FUN' is present, 'tapply' calls 'FUN' for each cell that has
     any data in it.  If 'FUN' returns a single atomic value for each
     such cell (e.g., functions 'mean' or 'var') and when 'simplify' is
     'TRUE', 'tapply' returns a multi-way array containing the values,
     and 'NA' for the empty cells.

A passable workaround is

> sapply(split(1,x),sum)
1 2
1 0
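
The same effect can be sketched on a small data frame. This is only an illustration: the `selected` data frame below is a made-up stand-in for one day's slice of the weblog, not the poster's data, and the host names are hypothetical. It shows that `size` can be free of NA while `tapply()` still reports NA for a host level that has no rows.

```r
# Hypothetical stand-in for one day's slice of the weblog (not the poster's data).
selected <- data.frame(
  host = factor(c("a", "a", "b"), levels = c("a", "b", "c")),  # "c" is an unused level
  size = c(10, 20, 5)
)

# The data itself holds no NA.
has_na <- any(is.na(selected$size))
print(has_na)

# tapply() makes one cell per factor level; the empty cell for "c" is filled with NA.
by_tapply <- tapply(selected$size, selected$host, sum)
print(by_tapply)

# split() + sapply() calls sum() on the empty group as well, and sum(numeric(0)) is 0.
by_split <- sapply(split(selected$size, selected$host), sum)
print(by_split)
```

So the NA in the printout describes an empty group, not a missing value in the data.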
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
Quoting Peter Dalgaard <p.dalgaard at biostat.ku.dk>:
Why do so many people have such trouble with the word "reproducible"?
[...]

To create test data I need more time; I have to change the original IP addresses to fake addresses before posting it here.

Also, I doubt a *.zip file would be accepted, but this would have been the next thing I wanted to try. If it is not possible to send binary attachments, then it will not be possible to send test data here, because the lines in the logfile are longer than what my current webmailer allows me to send without breaking them.

Also, I hoped that people know the traps and can help by just looking at the code, knowing where to look for the problem. As you have now shown, this is possible: you knew where to look for the problem, which shows me that you are very experienced in R.
We can't reproduce that without access to weblog_by_date!
See above: problem of providing such data and needing time for creating it.
Anyways I think it is tapply that is behaving unexpectedly to you:

> x <- factor(1,levels=1:2)
> tapply(1,x,sum)
 1  2
 1 NA

which is kind of surprising since the sum over an empty set is usually
zero. However, that _is_ what the documentation for tapply says:

     When 'FUN' is present, 'tapply' calls 'FUN' for each cell that has
     any data in it.  If 'FUN' returns a single atomic value for each
     such cell (e.g., functions 'mean' or 'var') and when 'simplify' is
     'TRUE', 'tapply' returns a multi-way array containing the values,
     and 'NA' for the empty cells.

a passable workaround is

> sapply(split(1,x),sum)
1 2
1 0
[...]
Thank you.
This looks like the solution for that simple case.
I hope I can adapt it to my data structure.
The problem here is, that there are no empty cells
in my data. There is always a numeric value of
0 or greater, because I threw out any "NA" and
substituted it with 0.
The data is inside a data-frame.
How can there be an empty cell in a data-frame?
There are no NAs and no NaNs...
...and the factors must be new each time,
because the data will be created newly,
and I had also used rm(selected) to be sure there are no
factors stored from the last access...
Did I overlook something?
Ciao,
Oliver
P.S.: I will try to attach my zip-file now... it contains
the complete code and a changed weblog (changed IP-addresses).
I hope the list accepts it.
Oliver Bandel <oliver <at> first.in-berlin.de> writes: [...]
P.S.: I will try to attach my zip-file now... it contains
the complete code and a changed weblog (changed IP-addresses).
I hope the list accepts it.
[...]
As assumed, this did not work.

But I found where the problem might be located...

With a small test file I tried this after running my script:
===============================================
> levels(selected$host)
[1] "22.99.44.101"   "266.249.71.143" "5.66.61.230"    "66.29.1.13"
[5] "7.6.1.20"       "7.6.14.240"
> levels(selected$host) <- factor(selected$host)
> levels(selected$host)
[1] "7.6.1.20"    "7.6.14.240"  "5.66.61.230"
===============================================
So, somehow there are unused levels inside.
Using drop=TRUE in split() did not help.
So I have to fix it somewhere later.
But when I do it before tapply, there will be an error message...

Ciao,
Oliver
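
A minimal sketch of where such unused levels can come from, assuming a made-up three-row `weblog` (the column names follow the thread, the values are hypothetical). As far as the documentation of split() goes, `drop` concerns unused levels of the splitting factor, here `date`; a factor column such as `host` inside each piece keeps the full level set of the original data frame.

```r
# Hypothetical miniature weblog; column names follow the thread, values are made up.
weblog <- data.frame(
  date = factor(c("d1", "d1", "d2")),
  host = factor(c("h1", "h2", "h3")),
  size = c(100, 200, 300)
)

# drop = TRUE affects unused levels of the splitting factor (date),
# not the levels of other factor columns in the data frame.
weblog_by_date <- split(weblog, weblog$date, drop = TRUE)
selected <- weblog_by_date[["d1"]]

# host still carries all three levels, although h3 never occurs on d1.
levels_before <- levels(selected$host)
print(levels_before)
print(table(selected$host))   # h3 appears with a count of 0

# Rebuilding the factor keeps only the levels that actually occur.
selected$host <- factor(selected$host)
levels_after <- levels(selected$host)
print(levels_after)

# With the unused level gone, tapply() has no empty cell left to fill with NA.
res_size_by_host <- tapply(selected$size, selected$host, sum)
print(res_size_by_host)
```

Note that the sketch assigns the rebuilt factor back to the column itself (`selected$host <- factor(selected$host)`), not to `levels(selected$host)`.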
Hello,
ok, I found the problem!
I now have:
res_size_by_host <- tapply( selected$size, factor(selected$host), sum)
instead of
res_size_by_host <- tapply( selected$size, selected$host, sum)
and now it works.
IMHO this is strange, because selected$host is already a factor!
I don't know, why this must be done...
...someone of the R-experts might know it...
...and may explain it...?!
Ciao,
Oliver
On Dec 20, 2008, at 6:26 PM, Oliver Bandel wrote:
[...]
It does not take an expert. All you need to do is read the help page. Dalgaard already diagnosed the problem. Look at his example and see what your "solution" does to it.

> x <- factor(1,levels=1:2)
> tapply(1,x,sum)
 1  2
 1 NA
> x <- factor(1,levels=1:2)
> tapply(1,factor(x),sum)
1
1

The function, factor, applied to a factor with unused levels discards those levels. From the factor help page:

"Normally the 'levels' used as an attribute of the result are the reduced set of levels after removing those in exclude, but this can be altered by supplying labels."

Since NA is the default for exclude, that results in the "trimming down" that you see with the application of factor(.)
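
The level trimming can be checked directly on the same one-element example. A small sketch: `refactored` and `dropped` are names chosen here for illustration, and `droplevels()` is assumed to be available, which as far as I know only holds for R versions released after this thread.

```r
x <- factor(1, levels = 1:2)       # Peter Dalgaard's example: level "2" is unused

refactored <- factor(x)            # re-applying factor() keeps only the levels in use
dropped    <- droplevels(x)        # same effect; assumes an R version that has droplevels()

print(levels(x))
print(levels(refactored))
print(levels(dropped))

with_unused    <- tapply(1, x, sum)            # two cells, the empty one is NA
without_unused <- tapply(1, refactored, sum)   # one cell, no NA
print(with_unused)
print(without_unused)
```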
David Winsemius

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Oliver Bandel wrote:
[...]
I already told you: You have empty levels in selected$host. Look at the example I showed you.
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907
Peter Dalgaard <p.dalgaard <at> biostat.ku.dk> writes:
Why do so many people have such trouble with the word "reproducible"? We can't reproduce that without access to weblog_by_date!
In a strict sense, the example is "reproducible" as opposed to "spurious". Reproducible research means that you can get the same results when you buy an ultracentrifuge, high-grade chemicals, a safety lab, and a technician with a golden hand :)

We should probably use "self-running" instead, or whatever a native speaker would suggest as an alternative. Even in German I do not know of a better word; it should be "that can be pasted into rterm and give the same result".

Dieter
On Sun, Dec 21, 2008 at 5:42 AM, Dieter Menne
<dieter.menne at menne-biomed.de> wrote:
[...]
I think reproducible is the correct word and its meaning should be clear from both its conventional meaning, see link, and the context in which it's used:

http://en.wikipedia.org/wiki/Reproducibility

It is surprising how many posters disregard this basic requirement for a post, clearly stated at the bottom of each message to r-help.
On 21/12/2008 7:57 AM, Gabor Grothendieck wrote:
[...]
It is surprising how many posters disregard this basic requirement for a post,
I don't find it surprising. Putting together a good bug report requires several skills that need to be learned. I suspect medical doctors and auto mechanics also work with poor reports of what's wrong. I do sometimes find it frustrating (as I imagine doctors and auto mechanics do), but probably not as frustrating as the posters find it.
clearly stated at the bottom of each message to r-help.
Now really, who reads repetitive stuff at the bottom of messages? The dividing line clearly indicates that it's some formal requirement, not meant to be read.

Duncan Murdoch
On Sun, Dec 21, 2008 at 8:52 AM, Duncan Murdoch <murdoch at stats.uwo.ca> wrote:
[...]
Now really, who reads repetitive stuff at the bottom of messages? The dividing line clearly indicates that it's some formal requirement, not meant to be read.
I think most people do read it since most posts ask in a reproducible way and the whole idea of repetition, as in advertising, is that such repetition can be effective.
Gabor Grothendieck wrote:
I think reproducible is the correct word and its meaning should be clear from both its conventional meaning, see link, and the context in which it's used:
http://en.wikipedia.org/wiki/Reproducibility
It is surprising how many posters disregard this basic requirement for a post, clearly stated at the bottom of each message to r-help.

well, the foot

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

says 'reproducible code', but code is what you really want to get, not to reproduce ;)

vQ