Hello,

I'm encountering the following problem: in a package for survival analysis, a data.frame is created in which one column is built by applying unique() to the event times, while other columns are built by running table() on the event times and the treatment arm.

When event times are very close together, they are merged into the same factor level on coercion to factor, while unique() returns both values, leading to columns of different lengths.

Try this to reproduce:

    x <- c(1, 1 + .Machine$double.eps)
    unique(x)
    table(x)

Is there a general best practice for dealing with such issues? Should calling table() on floats be avoided in general? What can one use instead? One could easily iterate over the unique values and compare each against the whole vector, but that is N*N comparisons, versus N*log(N) when sorting first and exploiting that the vector is sorted.

I think for my purposes I'll round to a hundredth of a day before calling the function, but any advice on avoiding this issue and writing more fault-tolerant code is greatly appreciated.

All the best,
Tobias
as.factor and floating point numbers
5 messages · Tobias Fellinger, Andrew Simmons, Timothy Aaron Ebert +1 more
R converts floats to strings with ~15 digits of accuracy, specifically to avoid differentiating between 1 and 1 + .Machine$double.eps; small differences like that are assumed to be rounding errors and unimportant. So, if you want all digits when making your factor, you could write: `as.factor(format(x, digits = 17L))`
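To illustrate the difference, here is a small sketch comparing the default coercion with the format()-based one (the variable names are illustrative):

```r
x <- c(1, 1 + .Machine$double.eps)

# Default coercion collapses both values into one level, because
# as.character() rounds to ~15 significant digits first.
f_default <- as.factor(x)
length(levels(f_default))  # 1

# Formatting with 17 significant digits keeps them distinct, since
# 17 digits are enough to round-trip any double exactly.
f_full <- as.factor(format(x, digits = 17L))
length(levels(f_full))     # 2
```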
On Wed, Jan 25, 2023 at 4:03 AM Tobias Fellinger <tobby at htu.at> wrote:
Another option is to convert all times to base units, or to the sample rate of the analog-to-digital converter: if that is 100 milliseconds, then use milliseconds rather than fractions of an hour or day. This approach might not help if the range of values spans more than 16 digits, which is slightly finer resolution than one year sampled in microseconds.

Tim

-----Original Message-----
From: R-help <r-help-bounces at r-project.org> On Behalf Of Andrew Simmons
Sent: Wednesday, January 25, 2023 4:13 AM
To: Tobias Fellinger <tobby at htu.at>
Cc: r-help at r-project.org
Subject: Re: [R] as.factor and floating point numbers
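Tim's base-unit conversion can be sketched as follows; the day-to-millisecond scale and the example values are assumptions for illustration:

```r
# Hypothetical example: event times stored as fractions of a day.
# Converting to whole milliseconds (an assumed base unit) yields exact
# integers, which print and tabulate reliably.
times_days <- c(0.5, 0.5 + 1e-8)   # two distinct times, ~1 ms apart

ms_per_day <- 24 * 60 * 60 * 1000
times_ms   <- round(times_days * ms_per_day)

table(times_ms)  # two cells: 43200000 and 43200001
```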
Hello Tobias,

A factor is basically a way to get a character vector to behave like an integer vector. It consists of integer values from 1 to nlev, plus a character vector of levels specifying a level name for each value. This means that factors only really make sense for characters, and anything that is not a character will be coerced to one; two values that map to the same string under as.character() will be treated as the same.

Now this is probably reasonable most of the time, as numeric values will usually represent metric data, which tends to make little sense as a factor. But if we do want this, we can easily build our own factors from floats, and even write a convenience wrapper around table, as shown in the appended file.

Best regards,
Valentin

On Wednesday, 25 January 2023 at 10:03:01 CET, Tobias Fellinger wrote:
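The appended file was not preserved in the archive; a minimal sketch of the idea might look like this (the names `float_factor` and `float_table` are assumptions): exact matching against the unique doubles builds the codes, so distinct values never share a level.

```r
# Build a factor whose codes come from exact double comparison, not
# from string rounding, so values eps apart stay in separate levels.
float_factor <- function(x) {
  ux <- sort(unique(x))                 # exact comparison of doubles
  structure(match(x, ux),
            levels = format(ux, digits = 17L),
            class  = "factor")
}

# Convenience wrapper: tabulate via the exact-level factor.
float_table <- function(x) table(float_factor(x))

x <- c(1, 1 + .Machine$double.eps, 1)
float_table(x)  # two cells with counts 2 and 1
```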
Hello,

I'll reply in one mail to all. Thank you for your suggestions.

I already tried Andrew's solution of increasing the digits. In the most extreme case I encountered I had to use the maximum number of digits format() allows, but it worked. Tim's solution is also a good workaround, but here I would have to know a lot about the user input. Valentin's solution works and is surely the safest of the options, but it is somewhat more than I need: my case does not really need the levels, just the counts of every unique value across another variable.

After thinking about it a little longer I came up with another solution that works well for my purposes: I use table() on the ranks. Since in my case the vector has no duplicates and is already sorted, I can apply table() to the ranks of the vector and get the counts in the right order.

Thanks everyone,
Tobias
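The rank trick can be sketched as follows: ranks are small exact integers, so tabulating them avoids the string rounding entirely.

```r
x <- c(1, 1 + .Machine$double.eps)

# Raw doubles: both values print as "1" and land in one cell.
table(x)        # one cell with count 2

# Ranks: distinct doubles get distinct integer ranks, in sorted order,
# so each value keeps its own cell.
table(rank(x))  # two cells, labelled 1 and 2
```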
On Wednesday, 25 January 2023 at 20:59:16 CET, Valentin Petzel wrote: