Profiling question: string formatting extremely slow

Try it this way. It took less than 1 second for 50,000 values:
> system.time({
+     x <- sample(50000)  # test data
+     x[sample(50000, 10000)] <- 'asdfasdf'  # character strings
+     which.num <- grep("^[ 0-9]+$", x)  # find numbers
+     # zero-pad numbers to 18 characters
+     x[which.num] <- sprintf("%018.0f", as.numeric(x[which.num]))
+     x[-which.num] <- toupper(x[-which.num])
+ })
   user  system elapsed
   0.25    0.00    0.25

> head(x, 30)
 [1] "000000000000026550" "000000000000019100" "000000000000045961" "000000000000031473" "000000000000005031" "000000000000012266"
 [7] "000000000000034418" "000000000000042279" "000000000000041193" "ASDFASDF"           "000000000000005760" "000000000000035659"
[13] "ASDFASDF"           "000000000000008420" "000000000000042220" "ASDFASDF"           "000000000000039903" "000000000000032234"
[19] "000000000000024125" "000000000000032970" "000000000000006814" "000000000000000215" "ASDFASDF"           "000000000000045239"
[25] "ASDFASDF"           "ASDFASDF"           "000000000000043065" "ASDFASDF"           "000000000000007642" "000000000000019196"

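On the profiler part of the question: R ships with a sampling profiler, Rprof, and summaryRprof tabulates its output (see ?Rprof). The following is only a minimal sketch of how it might be used on a per-element loop. format_one is a simplified, hypothetical stand-in for pn_formatting, not the original function, and the test data mirrors the example above.

```r
# Sketch: profile a per-element formatting loop with R's built-in sampling profiler.
# `format_one` is a hypothetical, simplified stand-in for pn_formatting.
format_one <- function(pn) {
  pn <- gsub("^\\s+|\\s+$", "", pn)        # trim blanks at both ends
  if (grepl("^[0-9]+$", pn)) {
    sprintf("%018.0f", as.numeric(pn))     # zero-pad numeric part numbers
  } else {
    toupper(pn)                            # uppercase everything else
  }
}

set.seed(1)
pn <- as.character(sample(50000))          # test data, as in the example above
pn[sample(50000, 10000)] <- "asdfasdf"

prof_file <- tempfile(fileext = ".out")
Rprof(prof_file, interval = 0.005)         # start sampling the call stack
res <- vapply(pn, format_one, character(1), USE.NAMES = FALSE)
Rprof(NULL)                                # stop sampling

prof <- summaryRprof(prof_file)
head(prof$by.self)                         # time attributed to each function itself
```

The by.self table ranks functions by the time spent in their own code, which is usually the quickest way to spot a bottleneck; by.total includes time spent in callees.
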
On Wed, Mar 18, 2009 at 12:16 PM, Olivier Boudry wrote:
Hi all,

I'm using R to find duplicates in a set of 6 files containing Part Number
information. Before applying intersect to identify the duplicates,
I need to normalize the P/Ns: convert the P/N to uppercase if it is
alphanumeric, and zero-pad it to 18 characters if it is numeric.

When I apply the pn_formatting function (see code below) to the "Part Number"
column of the data.frame (character vectors up to 18 characters long), it
consumes a lot of memory: my computer (Windows XP SP3) starts to swap, CPU
usage drops to zero, and the run takes hours. Part Number columns can
have from 7'000 to 80'000 records, and I have never had the patience to wait
for more than 17'000 records to complete.

Is there a way to find out which of the functions used below is the
bottleneck: as.integer, is.na, sub, paste, nchar, or toupper? Is there a
profiler for R and, if so, where can I find documentation on how to
use it?

The code:

# String contains digits only (can be converted to an integer)
digits_only <- function(x) { suppressWarnings(!is.na(as.integer(x))) }

# Remove blanks at both ends of a string
trim <- function (x) {
  sub("^\\s+((.*\\S)\\s+)?$", "\\2", x)
}

# P/N formatting
pn_formatting <- function(pn_in) {

  pn_out <- trim(pn_in)
  if (digits_only(pn_out)) {

    # Zero padding
    pn_out <- paste("000000000000000000", pn_out, sep="")
    pn_len <- nchar(pn_out)
    pn_out <- substr(pn_out, pn_len - 17, pn_len)

  } else {
    # Uppercase
    pn_out <- toupper(pn_out)
  }
  pn_out
}

Thanks,

Olivier.
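
The pn_formatting function above handles one string per call. As a sketch only, the same normalization could be done on a whole character vector with string operations alone. normalize_pn and its width argument are hypothetical names, not from the thread, and "numeric" is assumed here to mean digits only after trimming. Staying with strings avoids two numeric limits: as.integer returns NA for values above .Machine$integer.max, and doubles cannot represent every 18-digit number exactly.

```r
# Hypothetical helper: normalize a character vector of part numbers in one pass.
normalize_pn <- function(pn, width = 18) {
  pn <- trimws(pn)                           # strip blanks at both ends
  is_num <- grepl("^[0-9]+$", pn)            # digits only: no sign, no decimal point
  out <- toupper(pn)                         # default: uppercase
  pad <- pmax(0, width - nchar(pn[is_num]))  # number of zeros each numeric P/N needs
  out[is_num] <- paste0(strrep("0", pad), pn[is_num])
  out
}

out <- normalize_pn(c(" 123 ", "ab-12x", "123456789012345678"))
print(out)
```

Because every step is vectorized, the cost is a handful of calls over the whole column rather than several calls per record. strrep requires R 3.3.0 or later.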

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?