Hi all,

I have a very simple question about pattern matching. Could anyone tell me how I can use R to retrieve a string pattern from a text file? For example, my file contains the following information:

SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+
H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+
255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+

I want to extract the "SpeciesScientific = (...)" information from this file. The problem is in the 3rd line, where the word SpeciesScientific is split by the "+".

Could anyone help me please?
Thank you

--
View this message in context: http://r.789695.n4.nabble.com/Pattern-match-tp3463625p3463625.html
Sent from the R help mailing list archive at Nabble.com.
Pattern match
7 messages · Neeti, Dennis Murphy, neetika nath +1 more
Hi:
This is a bit of a roundabout approach; I'm sure that folks with regex
expertise will trump this in a heartbeat. I modified the last piece of
the string a bit to accommodate the approach below. Depending on where
the strings have line breaks, you may have some odd '\n' characters
inserted.
# Step 1: read the input as a single character string
u <- "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond=(255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);SpeciesScientific=(Achromobacter
cycloclastes);SpeciesCommon=(Bacteria);Reactive=(Ce+)"
# Step 2: Split input lines by the ';' delimiter and then use lapply()
# to split variable names from values.
# This results in a nested list for ulist2.
ulist <- strsplit(u, ';')
ulist2 <- lapply(ulist, function(s) strsplit(s, '='))
# Step 3: Break out the results into a matrix whose first column is
# the variable name and whose second column is the value (with parens
# included). This avoids dealing with nested lists.
v <- matrix(unlist(ulist2), ncol = 2, byrow = TRUE)
# Step 4: Strip off the parens
w <- apply(v, 2, function(s) gsub('([\\(\\)])', '', s))
colnames(w) <- c('Name', 'Value')
w
      Name                 Value
 [1,] "SpeciesCommon"      "Human"
 [2,] "SpeciesScientific"  "Homo sapiens"
 [3,] "ReactiveCentres"    "N,C,C,C,+H,O,C,C,C,C,O,H"
 [4,] "BondInvolved"       "C-H"
 [5,] "EzCatDBID"          "S00343"
 [6,] "BondFormed"         "O-H,O-H"
 [7,] "Bond"               "255B"
 [8,] "Cofactors"          "CuII,CU,501,A,CuII,CU,502,A"
 [9,] "CatalyticSwissProt" "P25006"
[10,] "SpeciesScientific"  "Achromobacter\ncycloclastes"
[11,] "SpeciesCommon"      "Bacteria"
[12,] "Reactive"           "Ce+"
# Step 5: Subset out the values of the SpeciesScientific variables
subset(as.data.frame(w), Name == 'SpeciesScientific', select = 'Value')
                         Value
2                 Homo sapiens
10 Achromobacter\ncycloclastes
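The '\n' in row 10 comes from the line break inside the input string. Below is a minimal sketch of one way to tidy it up, assuming that any run of whitespace inside a value should collapse to a single space. The vector `vals` is a hypothetical stand-in for the Value column of `w`, not part of the original post.

```r
# Hypothetical stand-in for the Value column of w shown above
vals <- c("Homo sapiens", "Achromobacter\ncycloclastes")

# Collapse any run of whitespace (including newlines) to a single space
clean <- gsub("[[:space:]]+", " ", vals)
clean
```

The same gsub() call could be applied to `u` before splitting, which would keep the stray newline out of the matrix in the first place.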
One possible 'advantage' of this approach is that if you have a number
of string records of this type, you can create nested lists for each
string and then manipulate the lists to get what you need. Hopefully
you can use some of these ideas for other purposes as well.
Dennis
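The idea of handling several records at once might be sketched as follows. This is only an illustration under stated assumptions: `parse_record()` and the two shortened `records` are hypothetical names and data, each record is assumed to contain exactly one SpeciesScientific field, and only the outer parentheses are removed so that inner ones such as Cu(II) survive (unlike the gsub() in Step 4, which strips every parenthesis).

```r
# Parse one "Name=(Value);Name=(Value)" record into a two-column data frame.
parse_record <- function(u) {
  fields <- strsplit(u, ";", fixed = TRUE)[[1]]
  # Split each field at the first '=' only, so values containing '=' survive
  name  <- sub("=.*$", "", fields)
  value <- sub("^[^=]*=", "", fields)
  # Drop only the outer parentheses, keeping inner ones such as Cu(II)
  value <- sub("^\\((.*)\\)$", "\\1", value)
  data.frame(Name = name, Value = value, stringsAsFactors = FALSE)
}

# Two hypothetical records, shortened from the example in this thread
records <- c(
  "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens)",
  "Cofactors=(Cu(II),CU,501,A);SpeciesScientific=(Achromobacter cycloclastes)"
)

# One data frame per record, then one species name per record
parsed <- lapply(records, parse_record)
species <- vapply(
  parsed,
  function(d) d$Value[d$Name == "SpeciesScientific"],
  character(1)
)
species
```

Keeping one data frame per record makes it easy to pull out other fields later with the same subsetting pattern.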
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110421/13a9d695/attachment.pl>
On Apr 21, 2011, at 5:27 AM, neetika nath wrote:
Thank you, Dennis. Yes, the problem is the input file. I have an .rdf file, and its format is the same as what I posted earlier. If I open the file in Notepad++, the lines are broken with CR+LF characters. Is there any suggestion for retrieving the SpeciesScientific information without changing the input file?
You might consider attaching the original file named with an extension of `.txt`, since your verbal description does not match your included example. What I see after the various servers have passed this around and inserted line-ends is the string `SpeciesScientific` in the first line, rather than in the third. -- David
David Winsemius, MD
West Hartford, CT
Thank you for your message. Please see the attached file for a template/test dataset taken from my file.
-------------- next part --------------
--
$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION
lyticCATH=(3.40.50.360);BondOrderChanged=(C-N,1,C=N,2,C=C,2,C-C,1,C-C,1,C=C,2,C-+
C,1,C=C,2,C=C,2,C-C,1,C-C,1,C=C,2,C=O,2,C-O,1,C=O,2,C-O,1);CatalyticResidues=(Gl+
y149A,Tyr155A,His161A);Cofactors=(FAD,FAD,601,none);CatalyticSwissProt=(P15559);+
SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+
H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+
--
$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION
$DATUM CatalyticCATH=(2.60.40.420);CatalyticResidues=(Asp98A,His135A,Cys136A,His+
255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+
eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+
ntres=(N,O,H,Cu);BondFormed=(O-H);BondCleaved=(O-N);PreviousEC=(1.7.99.3,1.9.3.2+
);Return=(Yes);CreatedBy=(GLH,GJB,DEA);DLU=(24102008);MID=(M0004);KEGG=(R00785).
--
$DTYPE ROOT:OVERALL REACTION(1):OVERALL REACTION ANNOTATION
$DATUM OverallComment=(The reference states that this mechanism was elucidated a+
t low pH. This enzyme specifically removes basic or hydrophobic amino acid resid+
ues from the C-terminus of the peptide substrate.);CatalyticCATH=(3.40.50.1820);+
CatalyticResidues=(Gly53A,Ser146A,Tyr147A,Asp338B,His397B);CatalyticSwissProt=(P+
08819);SpeciesCommon=(Wheat);SpeciesScientific=(Triticum aestivum);ReactiveCentr+
es=(N,H,O,C);EzCatDBID=(S00374);BondFormed=(N-H,C-O);BondCleaved=(C-N,O-H);Retur+
n=(Yes);DLU=(24102008);MID=(M0005);CreatedBy=(GLH,GJB,DEA).
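Given a file laid out like the sample above, one possible way to get at the SpeciesScientific values without editing the file is to remove the trailing '+' from each line, glue the lines back together, and only then apply a regular expression. The sketch below is not a tested solution for the real .rdf file: it assumes every wrapped line ends in '+', and it uses four lines copied from the original question in place of the file. In practice the `lines` vector would come from readLines() on the actual file, which is documented to accept LF, CRLF or CR line endings.

```r
# Four lines copied from the question; a real run would use readLines() on
# the .rdf file instead of this hard-coded vector.
lines <- c(
  "SpeciesCommon=(Human);SpeciesScientific=(Homo sapiens);ReactiveCentres=(N,C,C,C,+",
  "H,O,C,C,C,C,O,H);BondInvolved=(C-H);EzCatDBID=(S00343);BondFormed=(O-H,O-H);Bond+",
  "255B);Cofactors=(Cu(II),CU,501,A,Cu(II),CU,502,A);CatalyticSwissProt=(P25006);Sp+",
  "eciesScientific=(Achromobacter cycloclastes);SpeciesCommon=(Bacteria);ReactiveCe+"
)

# Remove the trailing '+' continuation marker and glue the pieces together,
# so that words split across lines (Sp+ / eciesScientific) are whole again.
joined <- paste(sub("\\+$", "", lines), collapse = "")

# Find every SpeciesScientific=(...) field, then keep only the value inside
# the parentheses.
hits <- regmatches(joined, gregexpr("SpeciesScientific=\\([^)]*\\)", joined))[[1]]
species <- sub("^SpeciesScientific=\\((.*)\\)$", "\\1", hits)
species
```

The pattern `[^)]*` stops at the first closing parenthesis, which is fine for species names but would truncate values with nested parentheses such as the Cofactors field. Note also that this joins all lines, including the `$DTYPE` header lines, which is harmless for this particular field but may matter for others.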
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110422/7d379556/attachment.pl>
An embedded and charset-unspecified text was scrubbed... Name: not available URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20110422/254352c6/attachment.pl>