Hi there,

I was wondering if anybody can explain to me why boxplot ends up with different results in the following case:

I have some integer data as a vector, and I compare the boxplot stats with those of the same data divided by a factor. I've attached a csv file containing both data sets (d1, d2). The factor is 34.16667.

If I run the boxplot function on d1 I get the following stats:

0.848... 0.907... 0.936... 0.965... 1.024...

For d2 I get these stats:

29 31 32 33 36

If I convert the stats of d1 with the factor, I get

29 31 32 33 35

Obviously different for the upper whisker. But why???

Antje

Attachment: data.csv
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20091120/be8a7751/attachment-0001.pl>
boxplot question
4 messages · Peter Ehlers, Antje
If there's been an answer to this, I've missed it. Here's my take.
Antje:

Three comments:

1. I think your 'factor' is actually 205/6, not 34.16667.

2. This looks like another case of FAQ 7.31:

  # Let's take your d2 and create d1; I'll call them x and y:
  x <- rep(c(29:38, 40), c(7, 24, 50, 71, 24, 12, 14, 7, 13, 5, 1))
  y <- x * 6 / 205
  # x is your d2, sorted
  # y is your d1, sorted

  # The critical values are x[202:203] and y[202:203];
  x[201:204]
  #[1] 35 35 36 36

  # The boxplot stats are:
  sx <- boxplot.stats(x)$stats
  sy <- boxplot.stats(y)$stats

  # Calculate potential extent of upper whisker:
  ux <- sx[4] + (sx[4] - sx[2]) * 1.5  #36
  uy <- sy[4] + (sy[4] - sy[2]) * 1.5  #1.053658536585366

  # Is y[203] <= uy?
  y[203] <= uy  #[1] FALSE #!!!
  y[202] <= uy  #[1] TRUE

  # For x:
  x[203] <= ux  #[1] TRUE

And there's your answer: for y the whisker goes to y[202], not y[203], due to the inevitable imprecision in machine calculation.

3. Last comment: I would not use boxplots for data like this.

-Peter Ehlers
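One way to make the whisker rule less sensitive to this kind of rounding is to compare against the fence with a small tolerance. The following is only a minimal sketch under stated assumptions, not how boxplot.stats() itself behaves: upper_whisker_tol() and its tol argument are hypothetical names introduced here for illustration, and the default tolerance is an arbitrary choice.

```r
# Data from the example above: x is d2, y is d1 (x divided by 205/6).
x <- rep(c(29:38, 40), c(7, 24, 50, 71, 24, 12, 14, 7, 13, 5, 1))
y <- x * 6 / 205

# Hypothetical helper (not part of base R): end of the upper whisker,
# using the same hinge-based fence as above, but with a small tolerance
# so that values equal to the fence up to rounding error count as inside.
upper_whisker_tol <- function(v, coef = 1.5, tol = sqrt(.Machine$double.eps)) {
  h <- fivenum(v)                        # min, lower hinge, median, upper hinge, max
  fence <- h[4] + coef * (h[4] - h[2])   # potential extent of the upper whisker
  max(v[v <= fence + tol * max(1, abs(fence))])
}

wx <- upper_whisker_tol(x)
wy <- upper_whisker_tol(y)

# Both whiskers should now end at the same observation on the d2 scale.
wx
wy * 205 / 6
```

Whether a tolerance is appropriate depends on the data: it deliberately treats values within rounding distance of the fence as inside, which is a modelling choice rather than a correction of boxplot.stats().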
------------------------------------------------------------------------
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
2 days later
Hi Peter,

thanks a lot for your explanation! Now I understand the difference. I was using the boxplot statistics to filter outliers from my data. Do you have any suggestions for what to use instead? (I was trying to improve the estimates of mean and sd by iteratively removing outliers flagged by the boxplot stats...)

Antje
Removing outliers to 'improve ...' is always problematic. Perhaps you should not use mean or sd? Consider robust alternatives, e.g. median/IQR. This very much depends on the purpose of the analysis. See the task view on Robust Statistical Methods.

For outliers, there's pkg:outliers. I haven't used it. There seems to be quite a bit more: I got 277 hits from:

  library(sos)
  ???"outlier"

-Peter Ehlers
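As a rough illustration of the median/IQR suggestion, the sketch below compares classical and robust summaries on the reconstructed d2 data, using only base R (median(), IQR(), mad()). The contaminated vector x_bad and the added value 400 are hypothetical, made up here to show the effect of one wild observation; dividing the IQR by 1.349 as a stand-in for sd assumes roughly normal data.

```r
# Reconstructed d2 data from earlier in the thread.
x <- rep(c(29:38, 40), c(7, 24, 50, 71, 24, 12, 14, 7, 13, 5, 1))

# Classical location/scale versus robust counterparts.
classical <- c(mean = mean(x), sd = sd(x))
robust    <- c(median = median(x), IQR = IQR(x), MAD = mad(x))

# Under approximate normality, IQR / 1.349 is on the same scale as sd
# (assumption: the bulk of the data is roughly normal).
robust_sd <- IQR(x) / 1.349

# Hypothetical contamination: append one wild value and recompute.
x_bad <- c(x, 400)
classical_bad <- c(mean = mean(x_bad), sd = sd(x_bad))
robust_bad    <- c(median = median(x_bad), IQR = IQR(x_bad), MAD = mad(x_bad))

rbind(classical, classical_bad)
rbind(robust, robust_bad)
```

The robust summaries are unchanged by the single extreme value, while mean and sd both shift, so no iterative outlier removal is needed to stabilise them.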