regular expression strikes again - R-help

Tue, Jul 9, 2013 2:45 AM #

Dear experts in regexpr.

I have this

dput(test[500:510])
c("pH 9,36 2", "pH 9,36 3", "pH 9,66 1", "pH 9,66 2", "pH 9,66 3", 
"pH 10,04 1", "pH 10,04 2", "pH 10,04 3", "RGLP 144006 pH 6,13 1", 
"RGLP 144006 pH 6,13 2", "RGLP 144006 pH 6,13 3")

and I want something like this

gsub("^.*([[:digit:]],[[:digit:]]*).*$", "\\1", test[500:510])
 [1] "9,36" "9,36" "9,66" "9,66" "9,66" "0,04" "0,04" "0,04" "6,13" "6,13"
[11] "6,13"

but with 10,04 values instead of 0,04.

I tried
gsub("^.*([[:digit:]]+,[[:digit:]]*).*$", "\\1", test[500:510])

or other variations but without any success.

Please help.

Regards
Petr

Peter Dalgaard

Tue, Jul 9, 2013 2:58 AM #

On Jul 9, 2013, at 11:45 , PIKAL Petr wrote:

Presumably the ^.* is too greedy. Perhaps add a space? I.e.,

gsub("^.* ([[:di......

Peter Dalgaard, Professor
Center for Statistics, Copenhagen Business School
Solbjerg Plads 3, 2000 Frederiksberg, Denmark
Phone: (+45)38153501
Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com

PIKAL Petr

Tue, Jul 9, 2013 3:19 AM #

Thanks, it works to some extent. 

The test comes from a file which is not filled in properly. If I use your suggestion I get correct values for the numbers with two digits before the ",", but entries which do not have a space before the number are returned unchanged.

c("Cl Tio2 ph 5,8 1", "Cl Tio2 ph 5,8 2", "Cl Tio2 ph 5,8 3", 
"pH5,57 1", "pH5,57 2", "pH5,57 3", "pH4,8 1", "pH4,8 2", "pH4,8 3", 
"pH4,12 1", "pH 9,36 2", "pH 9,36 3", "pH 9,66 1", "pH 9,66 2", 
"pH 9,66 3", "pH 10,04 1", "pH 10,04 2", "pH 10,04 3", "RGLP 144006 pH 6,13 1", 
"RGLP 144006 pH 6,13 2", "RGLP 144006 pH 6,13 3")

[1] "5,8"      "5,8"      "5,8"      "pH5,57 1" "pH5,57 2" "pH5,57 3"
 [7] "pH4,8 1"  "pH4,8 2"  "pH4,8 3"  "pH4,12 1" "9,36"     "9,36"    
[13] "9,66"     "9,66"     "9,66"     "10,04"    "10,04"    "10,04"   
[19] "6,13"     "6,13"     "6,13"

Basically I would like to get one or two digits before comma and two digits after comma.

Thanks anyway
Petr

Jan T. Kim

Tue, Jul 9, 2013 4:16 AM #

On Tue, Jul 09, 2013 at 09:45:55AM +0000, PIKAL Petr wrote:

The "1" in "10,04" is matched by ".*". In your example, all floating
comma numbers you're trying to extract are preceded by "pH ", so
replacing ".*" with ".*pH " should do what you want.
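
A minimal sketch of the point above, using three strings taken from the original post (the variable names are only illustrative). It contrasts the pattern from the question with one anchored on the literal "pH ", and assumes every value really is preceded by "pH ":

```r
x <- c("pH 9,36 2", "pH 10,04 1", "RGLP 144006 pH 6,13 1")

# The leading ".*" is greedy: it swallows the "1" of "10,04" and leaves
# only a single digit for the capture group.
greedy <- sub("^.*([[:digit:]]+,[[:digit:]]*).*$", "\\1", x)

# Requiring the literal "pH " right before the group stops ".*" from
# eating into the number.
anchored <- sub("^.*pH ([[:digit:]]+,[[:digit:]]*).*$", "\\1", x)

print(greedy)    # "9,36"  "0,04"  "6,13"
print(anchored)  # "9,36"  "10,04" "6,13"
```

Strings without the exact text "pH " (for example "pH5,57 1") would not match the anchored pattern and would come back unchanged.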

I'd be wary about that variation of having "RGLP 144006" in some
cases, though, it might be better to clean up this rubbish earlier
on (and it would be ideal to never have it generated in the first
place). Regular expressions can be useful to separate some chaff
from the wheat, but relying on that too much comes with a risk of
extracting something that is valid in some syntactic / technical
sense but not correct semantically. If you can't be 100% certain
that the number you want is (1) always preceded by "pH ", (2)
always a floating comma number and (3) will always contain an
integer and a fractional part (i.e. you'll never get ",09" rather
than "0,09", or "10" rather than "10,0"), you have to be prepared
for more difficulties, and you may want to consider a more systematic
approach to parsing your input.
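
One possible shape for such a more systematic approach, as a sketch only: extract the value with regexpr() and regmatches(), and return NA wherever the expected form is absent, so that problem rows are flagged rather than passed through silently. The pattern, the sample strings (including the made-up "pH 7 2") and the variable names are assumptions for illustration.

```r
x <- c("pH 10,04 1", "pH5,57 1", "pH 7 2", "RGLP 144006 pH 6,13 1")

# Assumed form: "pH" in any case, optional spaces, 1-2 digits, a comma,
# then 1-2 digits.
pat <- "[pP][hH] *[[:digit:]]{1,2},[[:digit:]]{1,2}"

m <- regexpr(pat, x)                      # -1 where nothing matches
ph_txt <- rep(NA_character_, length(x))
ph_txt[m > 0] <- regmatches(x, m)         # only matching elements are returned

# Drop the "pH" prefix and turn the decimal comma into a decimal point.
ph <- as.numeric(sub(",", ".", sub("^[pP][hH] *", "", ph_txt)))

print(ph)                 # 10.04  5.57    NA  6.13
print(x[is.na(ph)])       # rows that need manual inspection
```

Here a non-conforming entry shows up as NA instead of as a plausible-looking but wrong number.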

Best regards, Jan

+- Jan T. Kim -------------------------------------------------------+
 |             email: jttkim at gmail.com                                |
 |             WWW:   http://www.jtkim.dreamhosters.com/              |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*

Peter Dalgaard

Tue, Jul 9, 2013 4:50 AM #

On Jul 9, 2013, at 12:19 , PIKAL Petr wrote:

Then maybe

[1] "5,8"   "5,8"   "5,8"   "5,57"  "5,57"  "5,57"  "4,8"   "4,8"   "4,8"  
[10] "4,12"  "9,36"  "9,36"  "9,66"  "9,66"  "9,66"  "10,04" "10,04" "10,04"
[19] "6,13"  "6,13"  "6,13"
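
The command that produced the output above is not preserved in this archive. As a sketch only, and not necessarily what was posted, one pattern that gives the same values on a subset of the data requires a non-digit character immediately before the number, so the greedy ".*" cannot split "10,04":

```r
x <- c("Cl Tio2 ph 5,8 1", "pH5,57 1", "pH4,12 1",
       "pH 9,36 2", "pH 10,04 1", "RGLP 144006 pH 6,13 1")

# [^[:digit:]] must match the character just before the captured number,
# which may be a space or the "H" of "pH".
res <- sub("^.*[^[:digit:]]([[:digit:]]+,[[:digit:]]+).*$", "\\1", x)

print(res)  # "5,8" "5,57" "4,12" "9,36" "10,04" "6,13"
```

This assumes the number never sits at the very start of the string, since the pattern needs one character in front of it.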

arun

Tue, Jul 9, 2013 5:05 AM #

Hi,
Maybe this helps:

gsub(".*\\w+\\s+(.*)\\s+.*","\\1",test)
#[1] "9,36"  "9,36"  "9,66"  "9,66"  "9,66"  "10,04" "10,04" "10,04" "6,13"
#[10] "6,13"  "6,13"

A.K.

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

jim holtman

Tue, Jul 9, 2013 9:33 AM #

An embedded and charset-unspecified text was scrubbed...
Name: not available
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20130709/2b8716d2/attachment.pl>