Pattern match

7 messages · Neeti, Dennis Murphy, neetika nath +1 more

#
Hi ALL,

I have a very simple question regarding pattern matching. Could anyone tell me
how I can use R to retrieve a string pattern from a text file? For example,
my file contains the following information:

SpeciesCommon=(Human);SpeciesScientific=(Homo
sapiens);ReactiveCentres=(N,C,C,C,+
H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+
255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
eciesScientific=(Achromobacter
cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+

and I want to extract the "SpeciesScientific=(...)" information from this file.
The problem is in the 3rd line, where the word SpeciesScientific is split by a +.

Could anyone help me please?
Thank you


--
View this message in context: http://r.789695.n4.nabble.com/Pattern-match-tp3463625p3463625.html
Sent from the R help mailing list archive at Nabble.com.
#
Hi:

This is a bit of a roundabout approach; I'm sure that folks with regex
expertise will trump this in a heartbeat. I modified the last piece of
the string a bit to accommodate the approach below. Depending on where
the strings have line breaks, you may have some odd '\n' characters
inserted.

# Step 1: read the input as a single character string
u <- "SpeciesCommon=(Human);SpeciesScientific=(Homo
sapiens);ReactiveCentres=(N,C,C,C,+H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond=(255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);SpeciesScientific=(Achromobacter
cycloclastes);SpeciesCommon=(Bacteria);Reactive=(Ce+)"

# Step 2: Split input lines by the ';' delimiter and then use lapply()
# to split variable names from values.
# This results in a nested list for ulist2.
ulist <- strsplit(u, ';')
ulist2 <- lapply(ulist, function(s) strsplit(s, '='))

# Step 3: Break out the results into a matrix whose first column is
# the variable name and whose second column is the value (with parens
# included). This avoids dealing with nested lists.
v <- matrix(unlist(ulist2), ncol = 2, byrow = TRUE)

# Step 4: Strip off the parens
w <- apply(v, 2, function(s) gsub('([\\(\\)])', '', s))
colnames(w) <- c('Name', 'Value')
w
      Name                 Value
 [1,] "SpeciesCommon"      "Human"
 [2,] "SpeciesScientific"  "Homo sapiens"
 [3,] "ReactiveCentres"    "N,C,C,C,+H,O,C,C,C,C,O,H"
 [4,] "BondInvolved"       "C-H"
 [5,] "EzCatDBID"          "S00343"
 [6,] "BondFormed"         "O-H,O-H"
 [7,] "Bond"               "255B"
 [8,] "Cofactors"          "CuII,CU,501,A,CuII,CU,502,A"
 [9,] "CatalyticSwissProt" "P25006"
[10,] "SpeciesScientific"  "Achromobacter\ncycloclastes"
[11,] "SpeciesCommon"      "Bacteria"
[12,] "Reactive"           "Ce+"

# Step 5: Subset out the values of the SpeciesScientific variables
subset(as.data.frame(w), Name == 'SpeciesScientific', select = 'Value')
                         Value
2                 Homo sapiens
10 Achromobacter\ncycloclastes


One possible 'advantage' of this approach is that if you have a number
of string records of this type, you can create nested lists for each
string and then manipulate the lists to get what you need. Hopefully
you can use some of these ideas for other purposes as well.
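
For comparison, here is a minimal regex-based sketch of the same extraction. It assumes that a '+' at the end of a line is only a continuation marker (so the '+' is dropped and the next line is glued on directly), and that the value of interest contains no ')' of its own. The two input lines are copied from the example in the question; the object names (txt, joined, species) are made up for illustration.

```r
# Two wrapped lines from the example; the trailing '+' is assumed to be
# a line-continuation marker, not part of the data.
txt <- c(
  "255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+",
  "eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+"
)

# Drop the trailing '+' from each line, then glue the pieces back together
# so that "Sp" + "eciesScientific" becomes one word again.
joined <- paste(sub("\\+$", "", txt), collapse = "")

# Find every "SpeciesScientific=(...)" up to the first closing paren.
m <- regmatches(joined, gregexpr("SpeciesScientific=\\([^)]*\\)", joined))[[1]]

# Keep only the text between the parentheses.
species <- sub("^SpeciesScientific=\\((.*)\\)$", "\\1", m)
species
```

Because the match stops at the first ')', this would cut short a value such as Cofactors=(Cu(II),...), so it is only suitable for fields whose values have no nested parentheses.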

Dennis
On Wed, Apr 20, 2011 at 10:17 AM, Neeti <nikkihathi at gmail.com> wrote:
#
On Apr 21, 2011, at 5:27 AM, neetika nath wrote:

            
You might consider attaching the original file named with an extension  
of `.txt`, since your verbal description does not match your included  
example. What I see after the various servers have passed this around  
and inserted line-ends is the string `SpeciesScientific` in the first  
line, rather than in the third.

-- 
David
#
Thank you for your message. Please see the attached file for a template/test
dataset from my file.
On Thu, Apr 21, 2011 at 1:30 PM, David Winsemius <dwinsemius at comcast.net> wrote:

            
-------------- next part --------------
--
$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION
lyticCATH=(3.40.50.360);BondOrderChanged=(C-N,1,C=N,2,C=C,2,C-C,1,C-C,1,C=C,2,C-+
C,1,C=C,2,C=C,2,C-C,1,C-C,1,C=C,2,C=O,2,C-O,1,C=O,2,C-O,1);CatalyticResidues=(Gl+
y149A,Tyr155A,His161A);Cofactors=(FAD,FAD,601,none);CatalyticSwissProt=(P15559);+
SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+
H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+

--
$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION
$DATUM CatalyticCATH=(2.60.40.420);CatalyticResidues=(Asp98A,His135A,Cys136A,His+
255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+
ntres=(N,O,H,Cu);BondFormed=(O-H);BondCleaved=(O-N);PreviousEC=(1.7.99.3,1.9.3.2+
);Return=(Yes);CreatedBy=(GLH,GJB,DEA);DLU=(24102008);MID=(M0004);KEGG=(R00785).

--
$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION
$DATUM OverallComment=(The reference states that this mechanism was elucidated a+
t low pH. This enzyme specifically removes basic or hydrophobic amino acid resid+
ues from the C-terminus of the peptide substrate.);CatalyticCATH=(3.40.50.1820);+
CatalyticResidues=(Gly53A,Ser146A,Tyr147A,Asp338B,His397B);CatalyticSwissProt=(P+
08819);SpeciesCommon=(Wheat);SpeciesScientific=(Triticum aestivum);ReactiveCentr+
es=(N,H,O,C);EzCatDBID=(S00374);BondFormed=(N-H,C-O);BondCleaved=(C-N,O-H);Retur+
n=(Yes);DLU=(24102008);MID=(M0005);CreatedBy=(GLH,GJB,DEA).
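
For a file laid out like the attachment above, one possible way to pull out a single field is sketched below. It is only a sketch under several assumptions: lines starting with '--' or '$DTYPE' carry no data, '$DATUM ' is a prefix that can be dropped, a trailing '+' marks a wrapped line, and the wanted value contains no ')'. The function name extract_field and the file name in the comment are hypothetical.

```r
# Hypothetical helper: return all values of one field from lines read
# from an annotation file in the layout shown in the attachment.
extract_field <- function(lines, field = "SpeciesScientific") {
  # Skip record separators and $DTYPE header lines.
  dat <- lines[!grepl("^(--|\\$DTYPE)", lines)]
  # Remove the '$DATUM ' prefix and the trailing '+' continuation marker.
  dat <- sub("^\\$DATUM ", "", dat)
  dat <- sub("\\+$", "", dat)
  # Glue everything into one string so that split words are rejoined.
  joined <- paste(dat, collapse = "")
  # Match field=(value) up to the first closing paren.
  pat <- paste0(field, "=\\(([^)]*)\\)")
  hits <- regmatches(joined, gregexpr(pat, joined))[[1]]
  sub(pat, "\\1", hits)
}

# One record copied from the attachment; with a real file one would use
# something like lines <- readLines("annotations.txt") (hypothetical name).
demo <- c(
  "--",
  "$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION",
  "$DATUM CatalyticCATH=(2.60.40.420);CatalyticResidues=(Asp98A,His135A,Cys136A,His+",
  "255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+",
  "eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+",
  "ntres=(N,O,H,Cu);BondFormed=(O-H);BondCleaved=(O-N);PreviousEC=(1.7.99.3,1.9.3.2+",
  ");Return=(Yes);CreatedBy=(GLH,GJB,DEA);DLU=(24102008);MID=(M0004);KEGG=(R00785)."
)
species <- extract_field(demo)
species
```

Gluing all data lines into one string means the result does not depend on where a record happens to be wrapped or truncated, at the cost of losing track of which record each value came from.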