
How to remove square brackets, etc. from address strings?

12 messages · Sabina Arndt, Sarah Goslee, Rui Barradas

#
Hello r-help members,

the solutions which Sarah Goslee and arun sent to me in such a prompt 
and helpful manner work well with the examples I cut from the data.frame 
I'm analyzing. Thank you very much for that!
I incorporated them into my R-script and discovered that it still 
doesn't work properly, unfortunately. I have no idea why that's the case.
You see, I want to extract country names from the contents of 
tab-delimited text files. This is an example of the data I'm using: 
http://pastebin.com/mYZNDXg6
This is the script I'm using to import the data: 
http://pastebin.com/Z10UUH3z (It requires the text files to be in a 
folder which doesn't contain any other .txt files.)
This is the script I'm using to extract the country names: 
http://pastebin.com/G37fuPba
This is the string that's in the relevant field of the first record I'm 
working on:

[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; Schulz, 
Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; [Teupser, 
Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, Fac Med, 
Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; [Toenjes, Anke; 
Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] Univ Leipzig, Fac 
Med, Dept Internal Med, Leipzig, Germany; [Dietrich, Kerstin; Kovacs, 
Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr Clin Res, Leipzig, 
Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf Boehm Inst 
Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.; Schiller, 
Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys & Biophys, 
Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst Anim Sci, 
D-10099 Berlin, Germany; [Augustin, Martin] Ingenium Pharmaceut AG, 
Martinsried, Germany

This is the incorrect result my extraction script gives me for the first 
record:

 > C1s[1]
  [1] "[ENGEL,  KATHRIN M. Y." "KRISTIN"                "TORSTEN"
  [4] "GERMANY"                "DANIEL"                 "LESCA MIRIAM"
  [7] "GERMANY"                "ANKE"                   "MATTHIAS"
[10] "MATTHIAS"               "GERMANY"                "KERSTIN"
[13] "GERMANY"                "GERMANY"                "[SCHEIDT,  HOLGER A."
[16] "JUERGEN"                "GERMANY"                "HUMBOLDT"
[19] "GERMANY"

For some reason the first and sixth pair of the eight square brackets 
are not removed ... Do you understand why?
Instead I'd like to get this result, though:

 > C1s[1]
  [1] "GERMANY"        "GERMANY"        "GERMANY"
  [4] "GERMANY"        "GERMANY"        "GERMANY"
  [7] "HUMBOLDT"        "GERMANY"

What am I doing wrong? What are the errors in my R-script?
Would anybody be so kind as to take a look and help me out, please?
Thank you very much in advance!

Faithfully yours,

Sabina Arndt
#
Part of your problem is that your regexes have spaces in them, so
that's what you're matching.
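
To illustrate the point about spaces, here is a minimal sketch. The address string and both patterns are made up for this example and are not taken from the linked script. A pattern that contains literal spaces only matches where those spaces occur in the text:

```r
# Hypothetical address string in the same format as the data in this thread.
addr <- paste0(
  "[Engel, Kathrin M. Y.; Schulz, Angela] Univ Leipzig, Leipzig, Germany; ",
  "[Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"
)

# Pattern with literal spaces after "[" and before "]": the text never
# contains "[ ", so nothing matches and the string comes back unchanged.
with_spaces <- gsub("\\[ [^]]* \\]", "", addr)

# Same pattern without the spaces: every bracketed author group is removed.
no_spaces <- gsub("\\[[^]]*\\]", "", addr)

identical(with_spaces, addr)          # TRUE: the brackets are still there
grepl("[", no_spaces, fixed = TRUE)   # FALSE: no bracket left
```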

A small reproducible example would be more useful. I'm not feeling
inclined to wade through all your linked files on Friday evening, but
see if this helps:
[1] "New Zealand" "USA"         "Germany"     "Germany"     "Germany"
   "Germany"     "Germany"     "Germany"


Sarah
On Fri, May 25, 2012 at 4:31 PM, Sabina Arndt <sabina.arndt at hotmail.de> wrote:

  
    
1 day later
#
Hello r-help members,

I'm very grateful for the reply which Sarah Goslee sent to me in such a 
prompt and helpful manner.
It took me some time, but with a few amendments her suggestion now works 
not only for an example but for my entire data file as well:

 > results
   [1] "GERMANY"         "GERMANY"         "GERMANY"        "GERMANY"
   [5] "GERMANY"         "GERMANY"         "GERMANY"        "GERMANY"
...

Thank you very much for that, dear Sarah!

All these names actually belong to the very first record, though, which 
contains eight addresses instead of only one:

 > testdata[1]
   [1] "[Engel, Kathrin M. Y.; Schroeck, Kristin; Schoeneberg, Torsten; 
Schulz, Angela] Univ Leipzig, Fac Med, Inst Biochem, Leipzig, Germany; 
[Teupser, Daniel; Holdt, Lesca Miriam; Thiery, Joachim] Univ Leipzig, 
Fac Med, Inst Lab Med Clin Chem & Mol Diagnost, Leipzig, Germany; 
[Toenjes, Anke; Kern, Matthias; Blueher, Matthias; Stumvoll, Michael] 
Univ Leipzig, Fac Med, Dept Internal Med, Leipzig, Germany; [Dietrich, 
Kerstin; Kovacs, Peter] Univ Leipzig, Fac Med, Interdisciplinary Ctr 
Clin Res, Leipzig, Germany; [Kruegel, Ute] Univ Leipzig, Fac Med, Rudolf 
Boehm Inst Pharmacol & Toxicol, Leipzig, Germany; [Scheidt, Holger A.; 
Schiller, Juergen; Huster, Daniel] Univ Leipzig, Fac Med, Inst Med Phys 
& Biophys, Leipzig, Germany; [Brockmann, Gudrun A.] Humboldt Univ, Inst 
Anim Sci, D-10099 Berlin, Germany; [Augustin, Martin] Ingenium 
Pharmaceut AG, Martinsried, Germany"
 > results[1]
   [1] "GERMANY"

How can I put the country names back into their original lines / order?
This is an example of the correct result I'd like to receive:

 > results[1]
   [1] "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" "GERMANY" 
"GERMANY" "GERMANY"

How can I achieve this result?

I think counting the semicolons outside square brackets - i.e. the ones 
before a "[" but behind a "]" would be helpful in this regard, but I'm 
not sure how to do that, unfortunately. These semicolons directly follow 
the country names, like this, e.g.: "... Germany; [..."
If I add "+ 1" to their number it results in the number of addresses for 
each record / line.
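
The counting idea described above could be sketched roughly as follows. The function name `count_addresses` and the sample records are hypothetical; the sketch assumes that author groups are always enclosed in square brackets and that addresses themselves contain no semicolons.

```r
count_addresses <- function(x) {
  # Drop the bracketed author groups so that only the semicolons
  # separating addresses remain.
  no_authors <- gsub("\\[[^]]*\\]", "", x)
  # Count the remaining semicolons in each string. regmatches() returns a
  # zero-length vector for strings without any semicolon.
  hits <- regmatches(no_authors, gregexpr(";", no_authors, fixed = TRUE))
  lengths(hits) + 1L
}

# Hypothetical records: two addresses in the first, one in the second.
records <- c(
  "[A, B.; C, D.] Univ X, City, Germany; [E, F.] Inst Y, Town, Germany",
  "[G, H.] Univ Z, City, Portugal"
)
count_addresses(records)
# [1] 2 1
```

Note that an empty string would also be counted as one address here, so empty lines may need to be handled separately.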

Thank you very much in advance!

Faithfully yours,

Sabina Arndt


On 26.05.2012 00:19, Sarah Goslee wrote:
#
Hello,

Though I've not been following this thread, it seems like a regular 
expressions problem.
In the code below, I've created a 'testdata' variable based on your post.

# create a vector with two elements.
x <- "[Engel, Kathrin M. Y.; Schroeck, ... etc ...
y <- gsub("Germany", "Portugal", x)
testdata <- c(x, y)

# 's' is a list of character vectors, each element's final word is a country
s <- strsplit(testdata, ";[[:space:]]+\\[")
lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))
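
For readers who want to run this without the full data, here is a self-contained sketch of the same split-then-take-the-last-word approach. The shortened address string is made up for illustration. The helper `after_last_comma` is a swapped-in alternative, not part of the code above; it is shown only because the last-word approach would return "Zealand" for "New Zealand".

```r
# Shortened, hypothetical record in the same format as the original data.
x <- paste0(
  "[Engel, Kathrin M. Y.; Schulz, Angela] Univ Leipzig, Fac Med, Leipzig, Germany; ",
  "[Augustin, Martin] Ingenium Pharmaceut AG, Martinsried, Germany"
)
y <- gsub("Germany", "Portugal", x, fixed = TRUE)
testdata <- c(x, y)

# Split each record at "; [", i.e. at the semicolons between addresses.
s <- strsplit(testdata, ";[[:space:]]+\\[")

# Take the last blank-separated word of each address.
countries <- lapply(s, function(addr) {
  vapply(strsplit(addr, "[[:blank:]]"), tail, character(1), n = 1)
})
countries

# Alternative (an assumption, not from the thread): keep everything after
# the last comma, which also handles multi-word country names.
after_last_comma <- function(addr) sub(".*,[[:space:]]*", "", addr)
after_last_comma("Univ Auckland, Auckland, New Zealand")
```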


If this isn't it, sorry for the intrusion.

Rui Barradas

On 27-05-2012 17:29, Sabina Arndt wrote:
#
Hello r-help members,

thank you very much for your reply, Rui Barradas.

Unfortunately, I'm not sure if I understand it correctly: I don't know 
how to create the vector's second element y that way. The pattern you 
used has to be extracted from the address strings first. This is more 
complex as I'd tried to explain in my previous posts. It finally seems 
to work now.

Do you happen to have any idea on how I could put the country names back 
into their original lines / order, though?

Thank you very much in advance!
Faithfully yours,

Sabina Arndt


On 27.05.2012 19:04, Rui Barradas wrote:
#
Hello,

On 27-05-2012 22:12, Sabina Arndt wrote:
Your data file has more than one line. I've called it "sabrina.txt" and 
then processed with:

x <- readLines("sabrina.txt")

s <- strsplit(x, ";[[:space:]]\\[")
r <- lapply(s, function(x) sapply(strsplit(x, "[[:blank:]]"), tail, 1))

length(r)
[1] 21

So a vector 'y' and 19 others would have been created.
r[[21]] <- NULL
r[[20]] <- r[[20]][ -length(r[[20]]) ]
r1 <- lapply(r, function(x) x[nchar(x) > 0])
# logical index: -which() would drop every element if none were NULL
country.list <- r1[ !sapply(r1, is.null) ]
# clean up
rm(s, r, r1)

# See what we have
country.list


As far as I can tell they're in the original order. But what do you mean 
by "back into their original lines"?
Any time, glad to help.
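
One possible reading of "original lines" is that element i of the result should belong to line i of the input. A minimal sketch under that assumption follows; the values in `country.list` are made up. Keeping zero-length elements preserves the positions, whereas dropping them renumbers the list.

```r
# Hypothetical result: one character vector of countries per input line.
# The second line yielded no country and is kept as a zero-length vector.
country.list <- list(c("GERMANY", "GERMANY"), character(0), "PORTUGAL")

# Number of countries found on each line.
n_per_line <- lengths(country.list)

# One string per input line, e.g. to store next to the original text.
per_line <- vapply(country.list, paste, character(1), collapse = "; ")

# Long format: every country together with the line it came from.
long <- data.frame(
  line = rep(seq_along(country.list), n_per_line),
  country = unlist(country.list),
  stringsAsFactors = FALSE
)
long
```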

Rui Barradas
1 day later
#
Hello,

The error message means that 'x' is not a character vector. Can't you 
try it only with the text in the link you've posted, 
http://pastebin.com/mYZNDXg6 ?

I'm asking this because I've just checked it and it doesn't give any error.

On 29-05-2012 12:39, Sabina Arndt wrote:
This is probably why it gives you that error. Process just one file, 
like I've said, then say something.
(Moreover, it makes sense to solve the problems with a smaller set, then 
move on to the larger one.)

Rui Barradas
#
Hello, again.

See comments inline

On 29-05-2012 16:28, Sabina Arndt wrote:
Don't worry. When I copied the file it probably included some junk 
character at the end.
It should be. The error is that I've made some experiments with the 
data, since 'r' has some empty strings in its elements.
In my workspace everything was converted either to non-empty strings or 
NULLs.
This is how to do it.


r1 <- lapply(r, function(x) x[nchar(x) > 0])
r1 <- lapply(r1, function(x) if(length(x)) x else NULL)  # second pass
country.list <- r1[ !sapply(r1, is.null) ]  # logical index is safe even when nothing is NULL
country.list
After removing the nulls, in my workspace the list numbers are 
different, but you could remove unwanted values along the lines of

bad <- length(r[[18]])        # position of the unwanted last value
r[[18]] <- r[[18]][ -bad ]    # negative index drops that position
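
A small self-contained sketch of the clean-up steps, using a made-up list `r` in place of the real data. It also shows why indexing with `-which(...)` can be risky: when nothing matches, `which()` returns an empty vector and the negative index selects nothing at all.

```r
# Hypothetical stand-in for 'r': some empty strings, one element that is
# entirely empty, and a trailing junk value in the last element.
r <- list(c("Germany", "", "Germany"), "", c("Portugal", "junk"))

# First pass: drop empty strings inside each element.
r1 <- lapply(r, function(x) x[nzchar(x)])

# Drop elements that ended up with no values at all.
non_empty <- r1[lengths(r1) > 0]

# Remove the unwanted last value of the last element.
k <- length(non_empty)
non_empty[[k]] <- head(non_empty[[k]], -1)
non_empty

# Pitfall: no element of r1 is NULL, so which() returns integer(0) and the
# negative index yields an empty list instead of leaving r1 untouched.
pitfall <- r1[-which(sapply(r1, is.null))]
length(pitfall)
# [1] 0
```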

Note that you could do this to 'country.list', it might be simpler.
If it all works correctly, adjustments can be made, if not it might be 
premature. I don't know.
See how it goes, so far.
You're welcome,

Rui Barradas
#
What is the result after each step? Could you use dput to post them?

dput(head(r))
dput(head(r1))  # after the first lapply

Copy the output of those instructions and paste them here.
I'm asking this because I've tried with your dataset and it worked.
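
As a brief illustration of why dput output is convenient to post: it prints an object as R code that recreates it. The list `r` below is made up for this sketch.

```r
# Hypothetical intermediate result.
r <- list(c("Germany", "Germany"), character(0))

# dput() writes the object as R code; capture it as text here so it can
# be inspected or pasted into a message.
txt <- capture.output(dput(head(r)))
txt

# Anyone can rebuild the same object by evaluating the pasted text.
restored <- eval(parse(text = txt))
identical(restored, r)
# [1] TRUE
```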

Rui Barradas


On 29-05-2012 21:06, Sabina Arndt wrote: