matching similar character strings

4 messages · A M Lavezzi, arun, Jeff Newmiller

#
Dear Arun,
please excuse me for this late reply, we had to stop working on this
temporarily.

Let me reproduce here two examples of rows from F1 and F2 (sorry, but
with dput() I am not able to produce a clear example)
Nome.azienda                   Indirizzo
17     Alterego             Via Edmondo De Amicis, 18

On row 17 of F1 we have a firm named ("Nome.azienda") 'Alterego' whose
address ("Indirizzo") is 'Via Edmondo De Amicis, 18'

Below I reproduce the portion of F2 with information on the street
mentioned in F1_ex$Indirizzo.
   CODICE  STRADA  AREADICIRCOLAZIONE  NUMBER1  BARRATO1  NUMBER2  BARRATO2  SECTION
1  15620   VIA     DE AMICIS EDMONDO         1                  5               1288
2  15620   VIA     DE AMICIS EDMONDO         2                 34               1261
3  15620   VIA     DE AMICIS EDMONDO         7                 17               1287
4  15620   VIA     DE AMICIS EDMONDO        36                 62               1264
5  15620   VIA     DE AMICIS EDMONDO        37                 37               1287
6  15620   VIA     DE AMICIS EDMONDO        64                 84               1262


Line 1 says that the portion of VIA DE AMICIS EDMONDO
("STRADA"+"AREADICIRCOLAZIONE"), with street numbers between 1 and 5
belongs to SECTION 1288 (these are census sections). ("BARRATO1" and
"BARRATO2" refer to the letter in street numbers such as 12/A, 28/D,
etc. In the present example they are empty)

Line 2 says that the portion of VIA DE AMICIS EDMONDO, with street
numbers between 2 and 34 belongs to SECTION 1261,

etc.

Our problem is to assign SECTION 1261 to 'Alterego', exploiting the
information on its address. The problem is that the syntax of the
street address in F1 is different from the syntax in F2.
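
To make the mismatch concrete, here is a minimal sketch of one possible normalisation, assuming the two files differ only in word order, case and punctuation. The helpers street_key() and street_number() are hypothetical names introduced for illustration; they are not from F1, F2 or any package.

```r
# Hypothetical helper: build an order-independent key from a street name by
# upper-casing it, dropping digits and punctuation, and sorting the words.
street_key <- function(x) {
  x <- toupper(x)
  x <- gsub("[^[:alpha:] ]", " ", x)       # keep only letters and spaces
  tokens <- strsplit(trimws(x), " +")      # one vector of words per address
  vapply(tokens,
         function(tk) paste(sort(tk, method = "radix"), collapse = " "),
         character(1))
}

# Hypothetical helper: keep only the digits of the address.
# Assumes the street name itself contains no digits.
street_number <- function(x) as.numeric(gsub("\\D+", "", x))

f1_address <- "Via Edmondo De Amicis, 18"           # syntax used in F1
f2_street  <- paste("VIA", "DE AMICIS EDMONDO")     # STRADA + AREADICIRCOLAZIONE

street_key(f1_address)                              # "AMICIS DE EDMONDO VIA"
street_key(f1_address) == street_key(f2_street)     # TRUE
street_number(f1_address)                           # 18
```

Sorting the words sidesteps having to know the word order used in F2, but it only works when both files contain the same set of words.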

Hope I have clarified the issue

thanks a lot
Mario
On Fri, Jun 21, 2013 at 5:25 PM, arun <smartpink111 at yahoo.com> wrote:

  
    
#
Dear Mario,
Not sure if this is what you wanted:
F1_ex <- read.table(text="
Nome.azienda;Indirizzo
17;Alterego;Via Edmondo De Amicis, 18
18;Alterego;Via Edmondo De Amicis, 65
",sep=";",header=TRUE,stringsAsFactors=FALSE)

F2_ex <- read.table(text="
CODICE;STRADA;AREADICIRCOLAZIONE;NUMBER1;BARRATO1;NUMBER2;BARRATO2;SECTION
1;15620;VIA;DE AMICIS EDMONDO;1;;5;;1288
2;15620;VIA;DE AMICIS EDMONDO;2;;34;;1261
3;15620;VIA;DE AMICIS EDMONDO;7;;17;;1287
4;15620;VIA;DE AMICIS EDMONDO;36;;62;;1264
5;15620;VIA;DE AMICIS EDMONDO;37;;37;;1287
6;15620;VIA;DE AMICIS EDMONDO;64;;84;;1262
",sep=";",header=TRUE,stringsAsFactors=FALSE)
library(stringr)
# street name reordered into the F2 word order: "VIA DE AMICIS EDMONDO"
vec1 <- sapply(lapply(toupper(str_trim(gsub("[0-9,]","",F1_ex[,2]))),word,c(1,3,4,2)),paste,collapse=" ")
# street number
vec2 <- as.numeric(gsub("\\D+","",F1_ex[,2]))
# overwrite Nome.azienda with the SECTION of the F2 row whose range contains the number
F1_ex[,1] <- F2_ex[sapply(vec2,function(x) which((x>F2_ex[,4] & x< F2_ex[,6]) & paste(F2_ex[,2],F2_ex[,3])%in%vec1)),"SECTION"]
F1_ex
#   Nome.azienda                 Indirizzo
#17         1261 Via Edmondo De Amicis, 18
#18         1262 Via Edmondo De Amicis, 65
A.K.





----- Original Message -----
From: A M Lavezzi <mario.lavezzi at unipa.it>
To: r-help <r-help at r-project.org>
Cc: 
Sent: Tuesday, July 2, 2013 10:22 AM
Subject: Re: [R] matching similar character strings

3 days later
#
Dear Arun,

thank you so much! The code you suggest captures what we have in mind.
However, what we are looking for is something a bit more general
(sorry: I realised that maybe this was not so clear from the
beginning).

In particular:

- in F1_ex the address in the "Indirizzo" field could be spelled more
irregularly (e.g. "Via De Amicis 18", "V. De Amicis 18", "Via E. De
Amicis 18", etc.)

- in F2 the classification of the portions of the street is based on
odd and even numbers. For example, if we had number "15" in F1 it
should be matched to row 3 and not to row 2 of F2. (I actually provided
a wrong example with number 65: row 2 of F1_ex is currently matched to
row 6 of F2_ex, which contains even numbers. Moreover, there are no odd
street numbers higher than 37 in this street.)
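
The two requirements above could be sketched as follows. This is only an illustration under stated assumptions, not a tested solution for the full F1 and F2 files: the helpers name_tokens() and find_section() are hypothetical, range bounds are treated as inclusive, the parity of a row is taken from NUMBER1, and abbreviations ending in a dot ("V.", "E.") are simply ignored.

```r
# F2 rows for the street used in this thread.
F2_ex <- data.frame(
  STRADA = "VIA",
  AREADICIRCOLAZIONE = "DE AMICIS EDMONDO",
  NUMBER1 = c(1, 2, 7, 36, 37, 64),
  NUMBER2 = c(5, 34, 17, 62, 37, 84),
  SECTION = c(1288, 1261, 1287, 1264, 1287, 1262),
  stringsAsFactors = FALSE
)

# Hypothetical helper: upper-case words of an address, without digits,
# punctuation, or abbreviations that end in a dot.
name_tokens <- function(x) {
  tk <- strsplit(toupper(x), "[^[:alpha:].]+")[[1]]
  tk[nzchar(tk) & !grepl("\\.$", tk)]
}

# Hypothetical helper: SECTION of the single F2 row that matches the address
# on street words, number range and odd/even parity; NA if none or several.
find_section <- function(address, f2) {
  number <- as.numeric(gsub("\\D+", "", address))
  wanted <- name_tokens(address)
  street_ok <- vapply(
    strsplit(paste(f2$STRADA, f2$AREADICIRCOLAZIONE), " +"),
    function(tk) all(wanted %in% tk),     # every F1 word occurs in the F2 name
    logical(1)
  )
  in_range    <- number >= f2$NUMBER1 & number <= f2$NUMBER2
  same_parity <- number %% 2 == f2$NUMBER1 %% 2
  hit <- which(street_ok & in_range & same_parity)
  if (length(hit) == 1) f2$SECTION[hit] else NA_real_
}

addresses <- c("Via Edmondo De Amicis, 18", "Via De Amicis 15",
               "V. De Amicis 18", "Via E. De Amicis 65")
sections <- vapply(addresses, find_section, numeric(1),
                   f2 = F2_ex, USE.NAMES = FALSE)
sections
# 1261 1287 1261 NA
```

Number 65 gets NA because no odd-numbered portion of the street reaches it. Matching on a subset of words becomes ambiguous when two streets share the same words, and it does not handle misspellings; for those cases approximate matching with base R's agrep() or adist() may be worth exploring.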

Thank you very much once again

Mario
On Wed, Jul 3, 2013 at 6:47 AM, arun <smartpink111 at yahoo.com> wrote:

  
    
#
The intent of this list is to help you help yourself. If you spend the time to take Arun's ideas and run with them, then you will do us and yourself a favor. We will benefit because you can then help others with similar problems, and you will benefit because you don't need to worry about failure to communicate or delays from us. Even if you cannot completely figure it out on your own yet, giving us your attempt every time you post can demonstrate your participation and help clarify what your intent is. But asking us to just "do it" for you is not in the spirit of this list.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
--------------------------------------------------------------------------- 
Sent from my phone. Please excuse my brevity.
A M Lavezzi <mario.lavezzi at unipa.it> wrote: