Hello r-help members,

the solutions which Sarah Goslee and arun sent to me in such a prompt and helpful manner work well with the examples I cut from the data.frame I'm analyzing. Thank you very much for that! I incorporated them into my R-script and discovered that it still doesn't work properly, unfortunately. I have no idea why that's the case.

You see, I want to extract country names from the contents of tab-delimited text files.

This is an example of the data I'm using: http://pastebin.com/mYZNDXg6
This is the script I'm using to import the data: http://pastebin.com/Z10UUH3z (It requires the text files to be in a folder which doesn't contain any other .txt files.)
This is the script I'm using to extract the country names: http://pastebin.com/G37fuPba

This is the string that's in the relevant field of the first record I'm working on:

[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany

This is the incorrect result my extraction script gives me for the first record:

> C1s[1]
 [1] "[ENGEL, KATHRIN M. Y." "KRISTIN" "TORSTEN"
 [4] "GERMANY" "DANIEL" "LESCA MIRIAM"
 [7] "GERMANY" "ANKE" "MATTHIAS"
[10] "MATTHIAS" "GERMANY" "KERSTIN"
[13] "GERMANY" "GERMANY" "[SCHEIDT, HOLGER A."
[16] "JUERGEN" "GERMANY" "HUMBOLDT"
[19] "GERMANY"

For some reason the first and sixth pair of the eight square brackets are not removed ... Do you understand why?

Instead I'd like to get this result, though:

> C1s[1]
[1] "GERMANY" "GERMANY" "GERMANY"
[4] "GERMANY" "GERMANY" "GERMANY"
[7] "HUMBOLDT" "GERMANY"

What am I doing wrong? What are the errors in my R-script? Would anybody be so kind as to take a look and help me out, please?

Thank you very much in advance!

Faithfully yours,
Sabina Arndt
How to remove square brackets, etc. from address strings?
12 messages · Sabina Arndt, Sarah Goslee, Rui Barradas
Part of your problem is that your regexes have spaces in them, so that's what you're matching. A small reproducible example would be more useful. I'm not feeling inclined to wade through all your linked files on Friday evening, but see if this helps:
testdata <- "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, New Zealand; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, USA; [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"
results <- gsub("\\[.*?\\]", "", testdata)
results <- unlist(strsplit(results, ";"))
results <- sapply(results, function(x)sub("^.*, ([A-Za-z ]*)$", "\\1", x))
names(results) <- NULL
results
[1] "New Zealand" "USA" "Germany" "Germany" "Germany" "Germany" "Germany" "Germany"

Sarah
On Fri, May 25, 2012 at 4:31 PM, Sabina Arndt <sabina.arndt at hotmail.de> wrote:
[...]
Sarah Goslee http://www.functionaldiversity.org
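The three regex steps in the reply above can be followed one at a time on a shorter string. This is only a sketch: the string `addr` is a made-up, abbreviated address in the same format, and the names `greedy`, `no_authors`, `parts` and `countries` are introduced here for illustration.

```r
# Shortened, made-up address string in the same format as the data in the thread.
addr <- "[Engel, K.; Schulz, A.] Univ Leipzig, Fac Med, Leipzig, New Zealand; [Augustin, M.] Ingenium AG, Martinsried, Germany"

# For comparison: a greedy ".*" runs from the first "[" to the LAST "]",
# which swallows the whole first address.
greedy <- gsub("\\[.*\\]", "", addr)

# Step 1: the lazy ".*?" stops at the first "]" it meets, so every [...]
# author block is removed on its own.
no_authors <- gsub("\\[.*?\\]", "", addr)

# Step 2: the only semicolons left are the ones separating addresses.
parts <- unlist(strsplit(no_authors, ";"))

# Step 3: keep only the text after the last ", " of each address.
# sub() is vectorised, so no sapply() (and no names() clean-up) is needed here.
countries <- sub("^.*, ([A-Za-z ]*)$", "\\1", parts)

print(greedy)
print(countries)
```

Because the character class in step 3 contains a space, multi-word names such as "New Zealand" are kept whole.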
1 day later
Hello r-help members,

I'm very grateful for the reply which Sarah Goslee sent to me in such a prompt and helpful manner. It took me some time, but with a few amendments her suggestion now works not only for an example but for my entire data file as well:

> results
[1] "GERMANY" "GERMANY" "GERMANY" "GERMANY"
[5] "GERMANY" "GERMANY" "GERMANY" "GERMANY"
...

Thank you very much for that, dear Sarah!

All these names actually belong to the very first record, though, which contains eight addresses instead of only one:

> testdata[1]
[1] "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"

> results[1]
[1] "GERMANY"

How can I put the country names back into their original lines / order? This is an example of the correct result I'd like to receive:

> results[1]
[1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY"

How can I achieve this result? I think counting the semicolons outside square brackets, i.e. the ones before a "[" but after a "]", would be helpful in this regard, but I'm not sure how to do that, unfortunately. These semicolons directly follow the country names, like this, e.g.: "... Germany; [..."
If I add "+ 1" to their number it results in the number of addresses for each record / line.

Thank you very much in advance!

Faithfully yours,
Sabina Arndt

On 26.05.2012 00:19, Sarah Goslee wrote:
[...]
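One possible way to keep the countries grouped by record, as asked in the message above, is to skip `unlist()` and work on the list that `strsplit()` returns. The sketch below assumes two made-up, abbreviated records; `records`, `per_record`, `n_semicolons` and `n_addresses` are names introduced here, and `toupper()` is only a guess at one of the "amendments" mentioned in the message.

```r
# Two made-up records in the thread's format; the second has a single address.
records <- c(
  "[Engel, K.; Schulz, A.] Univ Leipzig, Leipzig, Germany; [Augustin, M.] Ingenium AG, Martinsried, Germany",
  "[Smith, J.] Univ Otago, Dunedin, New Zealand"
)

# Remove the [...] author blocks first, so that the remaining semicolons are
# exactly the ones "outside square brackets".
no_authors <- gsub("\\[.*?\\]", "", records)

# Count those semicolons by deleting every other character.
n_semicolons <- nchar(gsub("[^;]", "", no_authors))

# strsplit() on a vector returns a list with one element per record, so the
# grouping survives as long as the result is not unlist()ed.
per_record <- strsplit(no_authors, ";")
countries <- lapply(per_record, function(p) {
  toupper(sub("^.*, ([A-Za-z ]*)$", "\\1", p))
})

# Addresses per record: semicolons outside brackets plus one.
n_addresses <- lengths(countries)

print(countries)
print(n_addresses)
```

With this layout `countries[[i]]` holds all countries of record `i`, so no separate step is needed to restore the original order.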
Hello,
Though I've not been following this thread, it seems like a regular expressions problem.
In the code below, I've created a 'testdata' variable based on your post.
# create a vector with two elements.
x <- "[Engel, Kathrin M. Y.; Schroeck, ... etc ...
y <- gsub("Germany", "Portugal", x)
testdata <- c(x, y)
# 's' is a list of character vectors, each element's final word is a country
s <- strsplit(testdata, ";[[:space:]]+\\[")
lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
If this isn't it, sorry for the intrusion.
Rui Barradas
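Since `x` is abbreviated in the message above, here is the same idea on a complete but made-up short string. It is a sketch only: the addresses are invented, and `last_word` and `last_field` are names introduced for illustration. It also shows one limitation of taking the final word, which matters for multi-word country names.

```r
# Made-up, abbreviated record; the second address ends in a two-word country.
x <- "[Engel, K.; Schulz, A.] Univ Leipzig, Leipzig, Germany; [Augustin, M.] Ingenium AG, Martinsried, New Zealand"
y <- gsub("Germany", "Portugal", x)
testdata <- c(x, y)

# Split only where a semicolon is followed by whitespace and "[", i.e. between
# addresses. Semicolons between author names are not followed by "[".
s <- strsplit(testdata, ";[[:space:]]+\\[")

# Final word of each address, as in the reply above.
last_word <- lapply(s, function(p) sapply(strsplit(p, "[[:blank:]]"), tail, 1))

# Alternative: everything after the last comma, which keeps "New Zealand" whole.
last_field <- lapply(s, function(p) sub("^.*,[[:space:]]*", "", p))

print(last_word)
print(last_field)
```

On this input the final-word version returns "Zealand" for the second address, while the after-the-last-comma version returns "New Zealand".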
On 27-05-2012 17:29, Sabina Arndt wrote:
[...]
______________________________________________ R-help at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Hello r-help members,

thank you very much for your reply, Rui Barradas. Unfortunately, I'm not sure if I understand it correctly: I don't know how to create the vector's second element y that way. The pattern you used has to be extracted from the address strings first. This is more complex, as I'd tried to explain in my previous posts. It finally seems to work now.

Do you happen to have any idea on how I could put the country names back into their original lines / order, though?

Thank you very much in advance!

Faithfully yours,
Sabina Arndt

On 27.05.2012 19:04, Rui Barradas wrote:
[...]
Hello,

On 27-05-2012 22:12, Sabina Arndt wrote:
> Hello r-help members, thank you very much for your reply, Rui Barradas. Unfortunately, I'm not sure if I understand it correctly: I don't know how to create the vector's second element y that way. The pattern you used has to be extracted from the address strings first. This is more complex, as I'd tried to explain in my previous posts. It finally seems to work now.

Your data file has more than one line. I've called it "sabrina.txt" and then processed with:

x <- readLines("sabrina.txt")
s <- strsplit(x, ";[[:space:]]\\[")
r <- lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
length(r)
[1] 21

So a vector 'y' and 19 others would have been created.

> Do you happen to have any idea on how I could put the country names back into their original lines / order, though?

r[[21]] <- NULL
r[[20]] <- r[[20]][ -length(r[[20]]) ]
r1 <- lapply(r, function(x) x[nchar(x) > 0])
# keep the non-NULL elements; logical indexing also works when nothing is NULL
# (r1[ -which(...) ] would return an empty list in that case)
country.list <- r1[ !sapply(r1, is.null) ]
# clean up
rm(s, r, r1)
# See what we have
country.list

As far as I can tell they're in the original order. But what do you mean by "back into their original lines"?

> Thank you very much in advance!

Any time, glad to help.

Rui Barradas
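One reading of "back into their original lines" is that each input line should keep its own set of countries. The sketch below illustrates that under stated assumptions: the file is a temporary one written with made-up records (not the data from the thread), and `by_line` is a name introduced here. It uses `vapply()` in place of `sapply()` so that an empty line gives an empty character vector.

```r
# Write made-up records to a temporary file; the blank line stands in for an
# empty line in a real export.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "[Engel, K.; Schulz, A.] Univ Leipzig, Leipzig, Germany; [Augustin, M.] Ingenium AG, Martinsried, Germany",
  "",
  "[Silva, R.] Univ Lisboa, Lisboa, Portugal"
), tmp)

x <- readLines(tmp)
s <- strsplit(x, ";[[:space:]]+\\[")

# One list element per input line; each holds the final word of every address.
r <- lapply(s, function(p) {
  vapply(strsplit(p, "[[:blank:]]"), tail, character(1), 1)
})

# r[[i]] belongs to line i, so the line order is preserved. Collapsing each
# element gives one string per line that fits into a data.frame column.
by_line <- data.frame(
  line = seq_along(x),
  countries = vapply(r, paste, character(1), collapse = "; "),
  stringsAsFactors = FALSE
)
unlink(tmp)

print(by_line)
```

Keeping the list `r` (or the `countries` column) aligned with `x` means row `i` can be matched to record `i` of the imported data without any re-sorting.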
Faithfully yours, Sabina Arndt Am 27.05.2012 19:04, schrieb Rui Barradas:
Hello,
Though I've not been following this thread, it seems like a regular
expressions problem.
In the code below, I've created a 'testdata' variable based on your
post.
# create a vector with two elements.
x <- "[Engel, Kathrin M. Y.; Schroeck, ... etc ...
y <- gsub("Germany", "Portugal", x)
testdata <- c(x, y)
# 's' is a list of character vectors, each element's final word is a
country
s <- strsplit(testdata, ";[[:space:]]+\\[")
lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
If this isn't it, sorry for the intrusion.
Rui Barradas
Em 27-05-2012 17:29, Sabina Arndt escreveu:
Hello r-help members, I'm very grateful for the reply which Sarah Goslee sent to me in such a prompt and helpful manner. It took me some time, but with a few amendments her suggestion now works not only for an example but for my entire data file as well:
results
[1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" [5] "GERMANY" "GERMANY" "GERMANY" "GERMANY" ... Thank you very much for that, dear Sarah! All these names actually belong to the very first record, though, which contains eight addresses instead of only one:
testdata[1]
[1] "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"
results[1]
[1] "GERMANY" How can I put the country names back into their original lines / order? This is an example of the correct result I'd like to receive:
results[1]
[1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" How can I achieve this result? I think counting the semicolons outside square brackets - i.e. the ones before a "[" but behind a "]" would be helpful in this regard, but I'm not sure how to do that, unfortunately. These semicolons directly follow the country names, like this, e.g.: "... Germany; [..." If I add "+ 1" to their number it results in the number of addresses for each record / line. Thank you very much in advance! Faithfully yours, Sabina Arndt Am 26.05.2012 00:19, schrieb Sarah Goslee:
Part of your problem is that your regexes have spaces in them, so that's what you're matching. A small reproducible example would be more useful. I'm not feeling inclined to wade through all your linked files on Friday evening, but see if this helps:
testdata<- "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg,
Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem,
Leipzig, New Zealand; [Teupser, Daniel; Holdt, Lesca Miriam;
Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem&
Mol Diagnost, Leipzig, USA; [Toenjes, Anke; Kern, Matthias;
Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept
Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter]
Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig,
Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst
Pharmacol& Toxicol, Leipzig, Germany; [Scheidt, Holger A.;
Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med
Phys& Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt
Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin]
Ingenium Pharmaceut AG, Martinsried, Germany"
results <- gsub("\\[.*?\\]", "", testdata)
results <- unlist(strsplit(results, ";"))
results <- sapply(results, function(x) sub("^.*, ([A-Za-z ]*)$", "\\1", x))
names(results) <- NULL
results
[1] "New Zealand" "USA" "Germany" "Germany" "Germany" "Germany" "Germany" "Germany"
Sarah
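A side note on the pattern used above: "\\[.*?\\]" depends on the lazy quantifier ".*?" so that each bracketed author list is removed on its own. The following is only a minimal sketch of that difference. The string 'addr' is made up for illustration and is not taken from the data, and perl = TRUE is passed merely to make the quantifier semantics explicit.

```r
# Hypothetical miniature of one address field (illustration only).
addr <- "[A, B; C, D] Univ X, Leipzig, Germany; [E, F] Univ Y, Boston, USA"

# Greedy: ".*" runs from the first "[" to the LAST "]",
# so the whole first address is swallowed as well.
greedy <- gsub("\\[.*\\]", "", addr, perl = TRUE)

# Lazy: ".*?" stops at the nearest "]", so only the author lists go.
lazy <- gsub("\\[.*?\\]", "", addr, perl = TRUE)

# A negated character class gives the same result without a lazy quantifier.
negated <- gsub("\\[[^]]*\\]", "", addr)

print(greedy)
print(lazy)
```

The negated class "[^]]*" may be the more portable choice, since it does not depend on how a given regex engine treats "?" after "*".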
On Fri, May 25, 2012 at 4:31 PM, Sabina Arndt <sabina.arndt at hotmail.de> wrote:
Hello r-help members, the solutions which Sarah Goslee and arun sent to me in such a prompt and helpful manner work well with the examples I cut from the data.frame I'm analyzing. Thank you very much for that! I incorporated them into my R-script and discovered that it still doesn't work properly, unfortunately. I have no idea why that's the case. You see, I want to extract country names from the contents of tab-delimited text files. This is an example of the data I'm using: http://pastebin.com/mYZNDXg6 This is the script I'm using to import the data: http://pastebin.com/Z10UUH3z (It requires the text files to be in a folder which doesn't contain any other .txt files.) This is the script I'm using to extract the country names: http://pastebin.com/G37fuPba This is the string that's in the relevant field of the first record I'm working on: [Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem& Mol Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol& Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys& Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany This is the incorrect result my extraction script gives me for the first record:
C1s[1]
 [1] "[ENGEL, KATHRIN M. Y." "KRISTIN" "TORSTEN"
 [4] "GERMANY" "DANIEL" "LESCA MIRIAM"
 [7] "GERMANY" "ANKE" "MATTHIAS"
[10] "MATTHIAS" "GERMANY" "KERSTIN"
[13] "GERMANY" "GERMANY" "[SCHEIDT, HOLGER A."
[16] "JUERGEN" "GERMANY" "HUMBOLDT"
[19] "GERMANY"
For some reason the first and sixth pair of the eight square brackets are not removed ... Do you understand why? Instead I'd like to get this result, though:
C1s[1]
[1] "GERMANY" "GERMANY" "GERMANY"
[4] "GERMANY" "GERMANY" "GERMANY"
[7] "HUMBOLDT" "GERMANY"
What am I doing wrong? What are the errors in my R-script? Would anybody be so kind as to take a look and help me out, please?

Thank you very much in advance!

Faithfully yours,
Sabina Arndt
1 day later
Hello,

The error message means that 'x' is not a character vector. Can't you try it with only the text in the link you've posted, http://pastebin.com/mYZNDXg6 ? I'm asking because I've just checked it and it doesn't give any error.

On 29-05-2012 12:39, Sabina Arndt wrote:
Hello r-help members, thank you very much for your reply, Rui Barradas.
Your data file has more than one line.
Yes, each line is a new record and I read several such data files into one data.frame.
This is probably why it gives you that error. Process just one file, like I've said, then say something. (Moreover, it makes sense to solve the problems with a smaller set and then move on to the larger one.) Rui Barradas
I've called it "sabrina.txt" and then processed with:
x<- readLines("sabrina.txt")
s<- strsplit(x, ";[[:space:]]\\[")
Thank you; but this gives me an error message:
Error in strsplit(x, ";[[:space:]]\\[") : non-character argument
So I cannot check the rest of your suggestion, unfortunately.
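For readers hitting the same message: strsplit() raises "non-character argument" whenever it is handed something other than a character vector, and a factor column of a data.frame is a common case. The sketch below assumes that this is what happened here, which the thread does not confirm. The data.frame 'df' and its column 'C1' are hypothetical stand-ins.

```r
# Hypothetical stand-in for the imported data. With stringsAsFactors = TRUE
# (the read.table() default before R 4.0.0) text columns become factors.
df <- data.frame(
  C1 = c("[A, B] Univ X, Leipzig, Germany; [C, D] Univ Y, Boston, USA",
         "[E, F] Univ Z, Paris, France"),
  stringsAsFactors = TRUE
)

is_char <- is.character(df$C1)   # FALSE: the column is a factor

# strsplit() refuses anything that is not a character vector;
# tryCatch() captures the error message instead of stopping.
msg <- tryCatch(
  strsplit(df$C1, ";[[:space:]]+\\["),
  error = function(e) conditionMessage(e)
)

# Converting explicitly makes the same call work, one list element per record.
s <- strsplit(as.character(df$C1), ";[[:space:]]+\\[")
n_addresses <- vapply(s, length, integer(1))

print(msg)
print(n_addresses)
```

Checking class(x) or str(x) right before the strsplit() call is a quick way to see whether this applies.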
Do you happen to have any idea on how I could put the country names back into their original lines / order, though?
...
As far as I can tell they're in the original order. But what do you mean by "back into their original lines"?
Each line of my data.frame represents a record, except for the first one, which is the header. Each record has different addresses in the field / column I'm analyzing. In fact, the records vary in the number of addresses they feature (the first has eight, the second only one, etc.). I don't want a simple list of all the country names but a new field in my data.frame which contains, for each record, the country name(s) extracted from the addresses of that very same record. I'd like to measure the number of elements after applying strsplit() to each string. I tried:
...
results<- strsplit(results, ";")
numbers<- sapply(results, length)
results<- unlist(results)
...
But this doesn't seem to work, because:
numbers
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 ... Does anybody know how I would achieve these results instead:
numbers[1]
[1] 8
numbers[2]
[1] 1
results[1]
[1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY"
results[2]
[1] "GERMANY"

Thank you very much in advance!

Faithfully yours,
Sabina Arndt

PS: I updated the subject of my message to reflect the progress I've made thanks to your replies. I hope this is appropriate and makes things clearer.
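One way to read the question above is that the counts have to be taken while the result is still a list with one element per record, i.e. before any unlist(). The following is a minimal sketch under that assumption. The vector 'records' is made up, and all variable names are illustrative rather than taken from the original script.

```r
# Made-up address fields, one string per record (illustration only).
records <- c(
  "[A, B; C, D] Univ X, Leipzig, Germany; [E, F] Univ Y, Wellington, New Zealand",
  "[G, H] Univ Z, Boston, USA"
)

# 1. Drop the bracketed author lists, including the semicolons inside them.
no_authors <- gsub("\\[[^]]*\\]", "", records)

# 2. Split each record on the remaining semicolons; the result stays a list.
parts <- strsplit(no_authors, ";", fixed = TRUE)

# 3. Keep whatever follows the last comma of each address.
countries <- lapply(parts, function(p) toupper(trimws(sub("^.*,", "", p))))

# Count BEFORE any unlist(): one number per record.
numbers <- vapply(countries, length, integer(1))

# One string per record, suitable for a new data.frame column.
country_field <- vapply(countries, paste, character(1), collapse = "; ")

print(numbers)
print(country_field)
```

Here countries[[1]] holds both countries of the first record and numbers[1] is 2, so the per-record grouping survives. Collapsing to one string per record is only one option; keeping 'countries' as a list works too.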
On 27.05.2012 19:04, Rui Barradas wrote:
Hello,
Though I've not been following this thread, it seems like a regular expressions problem.
In the code below, I've created a 'testdata' variable based on your post.
# create a vector with two elements.
x<- "[Engel, Kathrin M. Y.; Schroeck, ... etc ...
y<- gsub("Germany", "Portugal", x)
testdata<- c(x, y)
# 's' is a list of character vectors, each element's final word is a country
s<- strsplit(testdata, ";[[:space:]]+\\[")
lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
If this isn't it, sorry for the intrusion.
Rui Barradas
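A caveat on taking the final word: it works for one-word countries such as "Germany" but would truncate a name like "New Zealand", which appears in Sarah Goslee's example. A hedged alternative is to keep everything after the last comma. The vector 'address' below is made up and mimics the pieces left after splitting on "; [".

```r
# Made-up pieces as they look after splitting a record on "; [".
address <- c(
  "[A, B] Univ X, Leipzig, Germany",
  "C, D] Univ Y, Wellington, New Zealand"
)

# Final blank-separated word: fine for "Germany", wrong for "New Zealand".
last_word <- sapply(strsplit(address, "[[:blank:]]"), tail, 1)

# Everything after the last comma: keeps multi-word country names intact.
after_comma <- trimws(sub("^.*,", "", address))

print(last_word)
print(after_comma)
```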
On 27-05-2012 17:29, Sabina Arndt wrote:
Hello r-help members, I'm very grateful for the reply which Sarah Goslee sent to me in such a prompt and helpful manner. It took me some time, but with a few amendments her suggestion now works not only for an example but for my entire data file as well:
results
[1] "GERMANY" "GERMANY" "GERMANY" "GERMANY"
[5] "GERMANY" "GERMANY" "GERMANY" "GERMANY"
...
Thank you very much for that, dear Sarah! All these names actually belong to the very first record, though, which contains eight addresses instead of only one:
testdata[1]
[1] "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; [Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, Inst Lab Med Clin Chem& Mol Diagnost, Leipzig, Germany; [Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst Pharmacol& Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys& Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"
results[1]
[1] "GERMANY"

How can I put the country names back into their original lines / order? This is an example of the correct result I'd like to receive:
results[1]
[1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY"

How can I achieve this result? I think counting the semicolons outside the square brackets, i.e. the ones after a "]" but before the next "[", would be helpful in this regard, but unfortunately I'm not sure how to do that. These semicolons directly follow the country names, e.g.: "... Germany; [..." Adding 1 to their number gives the number of addresses for each record / line.

Thank you very much in advance!

Faithfully yours,
Sabina Arndt
Hello again,

See comments inline.

On 29-05-2012 16:28, Sabina Arndt wrote:
Hello, thank you very much for your reply, Rui Barradas. OK, I did what you said:
x<- readLines("sabina.txt")
s<- strsplit(x, ";[[:space:]]\\[")
r<- lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
length(r)
[1] 20
I don't know why your result here was 21 since the file consists of only 20 lines.
Don't worry. When I copied the file it probably included some junk character at the end.
r[[21]] <- NULL
r[[20]] <- r[[20]][ -length(r[[20]]) ]
r1 <- lapply(r, function(x) x[nchar(x) > 0])
country.list <- r1[ -which(sapply(r1, function(x) is.null(x))) ]
rm(s, r, r1)
country.list
list()
I also tried this:
r[[20]] <- NULL
r[[19]] <- r[[19]][ -length(r[[19]]) ]
r1 <- lapply(r, function(x) x[nchar(x) > 0])
country.list <- r1[ -which(sapply(r1, function(x) is.null(x))) ]
rm(s, r, r1)
country.list
list()
But the result was the same. For some reason this seems to be empty. But if I try this before "country.list<- r1[ -which(sapply(r1, function(x) is.null(x))) ]":
It should be. The error is mine: I had been experimenting with the data, since 'r' has some empty strings in its elements, and in my workspace everything had already been converted either to non-empty strings or to NULLs. This is how to do it:
r1 <- lapply(r, function(x) x[nchar(x) > 0])
r1 <- lapply(r1, function(x) if(length(x)) x else NULL) # second pass
country.list <- r1[ -which(sapply(r1, is.null)) ]
country.list
r[[18]]
[1] "England" "Scotland" "Germany" "Germany" "England" "WOS:000296579800006"
This is almost correct, but the last country name of this record is missing and has been replaced by the value of the record's very last column / field. Do you know how to correct this?
After removing the NULLs, the list numbers in my workspace are different, but you could remove unwanted values along the lines of
bad <- length(r[[18]])
r[[18]] <- r[[18]][ -bad ]
Note that you could do this to 'country.list' instead; it might be simpler.
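The stray "WOS:..." value reported above is what one would expect if the whole tab-delimited line, rather than only the address field, is split on blanks: "[[:blank:]]" matches tabs as well as spaces, so the final token of the last address becomes the record's last column. A minimal sketch under that assumption follows. The record 'line' is made up, and the position of the address field (2) is hypothetical, not taken from the real files.

```r
# Made-up record with three tab-separated fields; the position of the
# address field (2 here) is an assumption, not taken from the real files.
line <- paste(
  "J",
  "[A, B] Univ X, London, England; [C, D] Univ Y, Leipzig, Germany",
  "WOS:000296579800006",
  sep = "\t"
)

# Splitting the whole line on blanks makes the last token the final column.
last_token <- tail(strsplit(line, "[[:blank:]]")[[1]], 1)

# Isolating the address field first keeps the other columns out of the way.
fields  <- strsplit(line, "\t", fixed = TRUE)[[1]]
address <- fields[2]
parts   <- strsplit(gsub("\\[[^]]*\\]", "", address), ";", fixed = TRUE)[[1]]
countries <- trimws(sub("^.*,", "", parts))

print(last_token)
print(countries)
```

With the address field isolated, no trailing value needs to be removed afterwards and the last country of the record is kept.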
In addition to that there are some additional adjustments I need to apply to the country names before output since there are many different versions of US addresses, e.g. (See 000296579800006.). I'm not sure I understand your function correctly, do you think the edits I mentioned could be fit in there as well?
If it all works correctly, adjustments can be made, if not it might be premature. I don't know. See how it goes, so far.
Thank you very much for bearing with me! I swear I usually am not that dumb! Faithfully yours, Sabina Arndt
You're welcome, Rui Barradas
Date: Tue, 29 May 2012 13:00:36 +0100
From: ruipbarradas at sapo.pt
To: sabina.arndt at hotmail.de
CC: r-help at r-project.org
Subject: Re: [R] Relist strings? [Was: How to remove square brackets, etc. from address strings?]
What is the result after each step? Could you use dput to post them?
dput(head(r))
dput(head(r1)) # after the first lapply
Copy the output of those instructions and paste it here. I'm asking because I've tried it with your dataset and it worked.

Rui Barradas

On 29-05-2012 21:06, Sabina Arndt wrote:
Hello, thank you very much for your reply, Rui Barradas.
In my workspace everything was converted either to non-empty strings or NULLs. This is how to do it.
r1 <- lapply(r, function(x) x[nchar(x) > 0])
r1 <- lapply(r1, function(x) if(length(x)) x else NULL) # second pass
country.list <- r1[ -which(sapply(r1, is.null)) ]
country.list
Thank you. I tried it but I've still got the same result as before, unfortunately:
x<- readLines("sabina.txt")
s<- strsplit(x, ";[[:space:]]\\[")
r<- lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
length(r)
[1] 20
r[[20]] <- NULL
r[[19]] <- r[[19]][ -length(r[[19]]) ]
r1 <- lapply(r, function(x) x[nchar(x) > 0])
r1 <- lapply(r1, function(x) if(length(x)) x else NULL) # second pass
country.list <- r1[ -which(sapply(r1, is.null)) ]
country.list
list()
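One plausible explanation for the empty list() above is a known indexing pitfall: when no element is NULL, which() returns integer(0), its negation is still integer(0), and indexing with integer(0) selects nothing at all. Whether this is what happened with the real data is an assumption. The lists 'r1' and 'r2' below are made-up stand-ins.

```r
# Made-up stand-in for 'r1' in which no element is NULL.
r1 <- list(c("GERMANY", "GERMANY"), "USA")

# which() finds nothing, so the index is -integer(0), i.e. integer(0),
# and indexing with integer(0) selects NOTHING instead of dropping nothing.
idx <- which(sapply(r1, is.null))
emptied <- r1[-idx]

# A logical index has no such edge case.
kept <- r1[!sapply(r1, is.null)]

# Filter() additionally drops zero-length elements such as character(0).
r2 <- list("GERMANY", character(0), c("ENGLAND", "SCOTLAND"))
non_empty <- Filter(length, r2)

print(length(emptied))
print(length(kept))
print(non_empty)
```

With the logical index or Filter(), the separate "second pass" that turns empty elements into NULL would no longer be needed.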
After removing the NULLs, the list numbers in my workspace are different, but you could remove unwanted values along the lines of
bad <- length(r[[18]])
r[[18]] <- r[[18]][ -bad ]
Note that you could do this to 'country.list' instead; it might be simpler.
OK, which step would this be in the order above? And how / where do I get the country name that was incorrectly replaced by the Article Identifier?
If it all works correctly, adjustments can be made, if not it might be premature. I don't know.
I see.
See how it goes, so far.
See above, please.
You're welcome,
Thank you!

Faithfully yours,
Sabina Arndt

PS: I cut the old messages to save list space.