Message-ID: <74055798-AC8E-401B-B9EA-EE4CE834EFF5@comcast.net>
Date: 2016-12-19T22:01:15Z
From: David Winsemius
Subject: Regex to stop at first capital letter after sequence
In-Reply-To: <CAM-xyZjHkTfndpF2_uiKOr2iPJOKPM=woTBThcVktz_Qvcf0=g@mail.gmail.com>
> On Dec 19, 2016, at 1:25 PM, Omar André Gonzáles Díaz <oma.gonzales at gmail.com> wrote:
>
> I have the following strings:
>
> [1] "PPA 06 - Promo Vasito" [2] "PPA 05 - Cuentos"
> [3] "PPA 04 - Promo vasito" [4] "PPA 03 - Promoción escolar"
> [5] "PPA - Saluda a tu pediatra" [6] "PPL - Dia del Pediatra"
>
> *Desired result*:
>
> [1] "Promo Vasito" "Cuentos" "Promo vasito"
>
> [4] "Promoción escolar" "Saluda a tu pediatra" "Dia del Pediatra"
All this assumes you are passing a character vector to sub. Your subject line and example, taken together, are a bit underspecified. Here are two solutions: one delivers everything beginning with the last capital after the (last) dash, and the other delivers everything after, but not including, the last <dash><spc> sequence:
> sub("^.+[-].+(?=[A-Z])", "" , dat, perl=TRUE) # need perl=TRUE for PCRE look-ahead
[1] "Vasito" "Cuentos"
[3] "Promo vasito" "Promoción escolar"
[5] "Saluda a tu pediatra" "Pediatra"
The match above is greedy; below, '(?U)' makes it ungreedy, so it stops at the first capital after the dash:
> sub("(?U)^.+[-].+(?=[A-Z])", "" , dat, perl=TRUE)
[1] "Promo Vasito" "Cuentos"
[3] "Promo vasito" "Promoción escolar"
[5] "Saluda a tu pediatra" "Dia del Pediatra"
> sub("^.+[-][ ]", "" , dat) # character classes to define sequence.
[1] "Promo Vasito" "Cuentos"
[3] "Promo vasito" "Promoción escolar"
[5] "Saluda a tu pediatra" "Dia del Pediatra"
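For anyone who wants to run these, here is a self-contained sketch. The name dat comes from the calls above, but its contents are my reconstruction from the strings in the question (with the accented "o" written as a \u escape so the script does not depend on file encoding), so treat that part as an assumption:

```r
# Sample input reconstructed from the strings quoted in the question (assumed).
dat <- c("PPA 06 - Promo Vasito", "PPA 05 - Cuentos",
         "PPA 04 - Promo vasito", "PPA 03 - Promoci\u00f3n escolar",
         "PPA - Saluda a tu pediatra", "PPL - Dia del Pediatra")

# Greedy quantifiers: strip everything up to the LAST capital after the dash.
greedy <- sub("^.+[-].+(?=[A-Z])", "", dat, perl = TRUE)

# Ungreedy via (?U): strip only up to the FIRST capital after the dash.
lazy <- sub("(?U)^.+[-].+(?=[A-Z])", "", dat, perl = TRUE)

# No look-ahead needed: strip everything through the last "- " sequence.
after_dash <- sub("^.+[-][ ]", "", dat)

print(greedy)
print(lazy)
print(after_dash)
```

On this input the ungreedy and the dash-based versions agree; they would differ if the text after the dash did not begin with a capital letter.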
>
>
> *First attempt*:
>
> After this line:
>
> mead_nov$`Nombre del anuncio` <- gsub("(PPA.*)([A-Z].*)", "\\2",
> mead_nov$`Nombre del anuncio`)
>
> I get these:
>
> [1] "Vasito" [2] "Cuentos" [3] "Promo vasito"
> [4] "Promoción escolar" [5] "Saluda a tu pediatra" [6] "PPL - Dia del Pediatra"
>
>
> *Second attempt:*
>
> mead_nov$`Nombre del anuncio` <- gsub("(PPA|PPL.*)([A-Z].*)", "\\2",
> mead_nov$`Nombre del anuncio`)
>
> I get this:
>
> [1] "PPA 06 - Promo Vasito" [2] "PPA 05 - Cuentos"
> [3] "PPA 04 - Promo vasito" [4] "PPA 03 - Promoción escolar"
> [5] "PPA - Saluda a tu pediatra" [6] "Pediatra"
>
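On the second attempt: in "(PPA|PPL.*)" the alternation splits the whole group into either "PPA" or "PPL.*", so the PPA strings can only match when a capital immediately follows "PPA", which never happens here. A minimal sketch of one way to regroup it, assuming the prefix is always PPA or PPL and the wanted text starts at the first capital after that prefix. The negated class [^A-Z]* is my substitution and not part of your attempts, and dat is again a reconstruction of your strings:

```r
# Sample input reconstructed from the strings quoted in the question (assumed).
dat <- c("PPA 06 - Promo Vasito", "PPA 05 - Cuentos",
         "PPA 04 - Promo vasito", "PPA 03 - Promoci\u00f3n escolar",
         "PPA - Saluda a tu pediatra", "PPL - Dia del Pediatra")

# Group only the two prefixes, then consume non-capitals up to the first capital.
# perl = TRUE is used for consistency with the look-ahead patterns.
res <- sub("^(PPA|PPL)[^A-Z]*", "", dat, perl = TRUE)

print(res)
```

Because [^A-Z]* cannot cross a capital letter, there is no greedy-versus-ungreedy question to settle in this version.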
>
> Thank you for your help.
David Winsemius
Alameda, CA, USA