misbehavior with extract_numeric() from tidyr

7 messages · Arnaud Gaboury, Jim Lemon, William Dunlap +1 more

#
R 3.2.0 on Linux
--------------------------------

library(tidyr)

playerStats <- c("LVL 10", "5,671,448 AP l6,000,000 AP",
"Unique Portals Visited 1,038",
"XM Collected 15,327,123 XM", "Hacks 14,268", "Resonators Deployed 11,126",
"Links Created 1,744", "Control Fields Created 294",
"Mind Units Captured 2,995,484 MUs",
"Longest Link Ever Created 75 km", "Largest Control Field 189,731 MUs",
"XM Recharged 3,006,364 XM", "Portals Captured 1,204",
"Unique Portals Captured 486",
"Resonators Destroyed 12,481", "Portals Neutralized 1,240",
"Enemy Links Destroyed 3,169",
"Enemy Control Fields Destroyed 1,394", "Distance Walked 230 km",
"Max Time Portal Held 240 days", "Max Time Link Maintained 15 days",
"Max Link Length x Days 276 km-days", "Max Time Field Held 4days",
"Largest Field MUs x Days 83,226 MU-days")

-----------------------------------------------------------------------------------------------
 extract_numeric(playerStats)
 [1]             10 56714486000000           1038       15327123
   14268          11126           1744            294        2995484
[10]             75         189731        3006364           1204
     486          12481           1240           3169           1394
[19]            230            240             15             NA
       4             NA

------------------------------------------------------------------------------------------------
 playerStats[c(22,24)]
[1] "Max Link Length x Days 276 km-days"
[2] "Largest Field MUs x Days 83,226 MU-days"
--------------------------------------------------------------------------------------------

I do not understand why these two elements return NA when
extract_numeric() works well for all the others.

Is there a wrong setting in my environment?

Thank you for any hints.
#
On Mon, Apr 20, 2015 at 9:10 AM, arnaud gaboury
<arnaud.gaboury at gmail.com> wrote:
-------------------------------------------------------------------------
 as.numeric(gsub("[^0-9]", "",playerStats))
 [1]             10 56714486000000           1038       15327123
   14268          11126           1744            294        2995484
[10]             75         189731        3006364           1204
     486          12481           1240           3169           1394
[19]            230            240             15            276
       4          83226
--------------------------------------------------------------------

The above command does the job, but I still cannot figure out why
extract_numeric() returns two NAs.
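
One caveat about the digits-only approach, sketched below in base R: it fuses separate numbers in the same string (as element 2 of the output above already shows) and it drops signs and decimal points. The helper name digits_only() is made up for this illustration and the second input string is an invented example.

```r
# Hypothetical helper wrapping the gsub() call used above
digits_only <- function(x) as.numeric(gsub("[^0-9]", "", x))

# Two separate numbers in one string are concatenated into one
fused <- digits_only("5,671,448 AP l6,000,000 AP")
print(fused == 56714486000000)

# The minus sign and the decimal point are both discarded
lost <- digits_only("-3.5 km")
print(lost)
```

So this works for the strings here only because each of them (apart from the second) holds a single non-negative integer.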

  
    
#
Hi arnaud,
At a guess, it is the two hyphens that are present in those strings. I
think that the function you are using interprets them as subtraction
operators and since the string following the hyphen would produce NA,
the result would be NA.

Jim


On Mon, Apr 20, 2015 at 7:46 PM, arnaud gaboury
<arnaud.gaboury at gmail.com> wrote:
#
On Mon, Apr 20, 2015 at 12:09 PM, Jim Lemon <drjimlemon at gmail.com> wrote:
I was thinking of 'x' as being the culprit (interpreted as multiplication),
but you are right indeed:

library(stringr)  # str_replace() comes from stringr, not tidyr
noHyphens <- str_replace(playerStats[c(22,24)],'-','')
 extract_numeric(noHyphens)
[1]   276 83226


in fact:
---------------------------------------------------------
 extract_numeric
function (x)
{
    as.numeric(gsub("[^0-9.-]+", "", as.character(x)))
}
<environment: namespace:tidyr>
---------------------------------------------------------

Is there any particular reason for the hyphen in the gsub() pattern? Why
not remove it?

Thank you very much, Jim.

  
    
#
The hyphen without a following digit confuses tidyr::extract_numeric().
E.g.,
   > extract_numeric("23 ft-lbs")
   Warning message:
   In extract_numeric("23 ft-lbs") : NAs introduced by coercion
   [1] NA
   > extract_numeric("23 ft*lbs")
   [1] 23
Contact the BugReports address for the package
   > packageDescription("tidyr")$BugReports
   [1] "https://github.com/hadley/tidyr/issues"
or package's maintainer
   > maintainer("tidyr")
   [1] "Hadley Wickham <hadley at rstudio.com>"
to report problems in a user-contributed package.
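
To see why a hyphen without a following digit ends in NA, it may help to look at the intermediate string. The sketch below reuses the pattern from the extract_numeric() body quoted earlier in the thread, in base R only; keep_numeric_chars() is a name invented for this illustration.

```r
# Same pattern as in the quoted extract_numeric() body, without the as.numeric() step
keep_numeric_chars <- function(x) gsub("[^0-9.-]+", "", as.character(x))

x <- c("23 ft-lbs", "23 ft*lbs", "Max Link Length x Days 276 km-days")

# The hyphen survives the substitution, leaving a trailing "-"
stripped <- keep_numeric_chars(x)
print(stripped)

# A string such as "23-" is not a valid number, so coercion gives NA
parsed <- suppressWarnings(as.numeric(stripped))
print(parsed)
```

The hyphen is not evaluated as a subtraction. It is simply kept in the string, and the leftover text can no longer be parsed as a number.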



Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Mon, Apr 20, 2015 at 12:10 AM, arnaud gaboury <arnaud.gaboury at gmail.com>
wrote:

  
  
#
On Mon, Apr 20, 2015 at 6:09 PM, William Dunlap <wdunlap at tibco.com> wrote:

            
See [0] for the reason for the minus sign in the regex. It is not a bug but
a feature request. I am honestly very surprised the maintainer decided to go
with such a simple solution for negative numbers.

[0]https://github.com/hadley/tidyr/issues/20
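
A small base R sketch of the trade-off, assuming an invented input string: keeping the hyphen in the character class preserves a leading minus sign, while dropping it silently turns negative values positive. Both helper names are hypothetical.

```r
# Pattern with the hyphen kept (as in the quoted extract_numeric() body)
with_minus <- function(x) as.numeric(gsub("[^0-9.-]+", "", x))

# Same pattern with the hyphen removed from the character class
without_minus <- function(x) as.numeric(gsub("[^0-9.]+", "", x))

x <- "balance: -42.5 EUR"
kept <- with_minus(x)
dropped <- without_minus(x)
print(c(kept, dropped))
```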


  
    
#
On Mon, Apr 20, 2015 at 1:57 PM, arnaud gaboury
<arnaud.gaboury at gmail.com> wrote:
Any heuristic is going to fail in some circumstances. If you want to
be sure it's doing what you want for your use case, write the regular
expression yourself.
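
For this thread's data, one possible hand-written version is sketched below in base R. It is only an illustration under stated assumptions, not tidyr code: extract_first_number() is a made-up name, it returns only the first number found in each string, it treats commas as thousands separators, and it accepts a minus sign only when a digit follows it directly.

```r
# Hypothetical helper: first number in each string, or NA if there is none
extract_first_number <- function(x) {
  x <- as.character(x)
  # optional minus, a digit, further digits or commas, optional decimal part
  pattern <- "-?[0-9][0-9,]*(\\.[0-9]+)?"
  m <- regexpr(pattern, x)
  hit <- as.vector(m) != -1L
  out <- rep(NA_character_, length(x))
  out[hit] <- regmatches(x, m)   # regmatches() returns matched elements only
  as.numeric(gsub(",", "", out)) # drop thousands separators before coercion
}

res <- extract_first_number(c(
  "Max Link Length x Days 276 km-days",
  "Largest Field MUs x Days 83,226 MU-days",
  "23 ft-lbs",
  "temp -5.5 C",
  "no digits here"
))
print(res)
```

This has its own blind spots. For "5,671,448 AP l6,000,000 AP" it would return only the first value, which may or may not be what is wanted.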

Hadley