R2.1.0: Bug in list.files

14 messages · Steve Roberts, Suresh Krishna, Romain Francois +5 more

#
R2.0.1 (MS Windows)
[1] "P:/SARsoftware/Rlibraries/gnlm_0.1.zip"
[2] "P:/SARsoftware/Rlibraries/lms2_0.2.zip"


R2.1.0:
Error in list.files(path, pattern, all.files, full.names, recursive) : 
        invalid 'pattern' regular expression

Bug? Or have I missed something?

Steve.
  Dr Steve Roberts 
  steve.roberts at manchester.ac.uk

Senior Lecturer in Medical Statistics,
CMMCH NHS Trust and University of Manchester Biostatistics Group,
0161 275 5192/5764 / 0161 276 5785
#
Steve Roberts wrote:

            
You missed the NEWS entry, which tells you:

     o   The regular expression code is now based on that in glibc 2.3.3.
     It has stricter conformance to POSIX, so metachars such as
     { } + * may need to be escaped where before they did not
     (but could have been).


Probably you want

  list.files(pattern = "\\.zip$", full.names = TRUE)

Uwe Ligges
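
A minimal sketch of what the suggested pattern changes, assuming a scratch directory with made-up contents (only the two .zip names come from the thread; `unzip.exe` and `zip_notes.txt` are hypothetical):

```r
# Scratch directory with illustrative files; nothing here touches real data.
d <- file.path(tempdir(), "listfiles_demo")
dir.create(d, showWarnings = FALSE)
files <- c("gnlm_0.1.zip", "lms2_0.2.zip", "unzip.exe", "zip_notes.txt")
invisible(file.create(file.path(d, files)))

# Unanchored regular expression: matches "zip" anywhere in the name.
loose <- list.files(d, pattern = "zip")

# Escaped dot plus end-of-string anchor: only names ending in ".zip".
strict <- list.files(d, pattern = "\\.zip$")

print(loose)
print(strict)
```

Here `loose` picks up all four names, while `strict` keeps only the two archives.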
#
Is that the entire story? I tried this with yesterday's patched version 
(Windows XP) and found:

 > list.files(getwd(),"*.txt",full=T)
Error in list.files(path, pattern, all.files, full.names, recursive) :
         invalid 'pattern' regular expression

 > list.files(getwd(),'.txt',full=T)
[1] "C:/Documents and Settings/suresh/BDE_SysInfo.txt"
[2] "C:/Documents and Settings/suresh/dxva_sig.txt"

Replacing "*.txt" with '*.txt' seems to do "something".

-s.
Uwe Ligges wrote:
#
On 12.05.2005 10:30, Steve Roberts wrote:
That has something to do with regular expressions; try something like:
Romain

  
    
#
Oops, my fault. I missed typing the key '*' character in the second version.

Apologies!

suresh
Suresh Krishna wrote:
#
Suresh Krishna wrote:

            
No! Replacing "*.txt" with ".txt" does something (you do not intend)!

Please read about regular expressions (!!!) and try to understand that
".txt" also finds "Not_a_txt_file.xls" ....

Uwe Ligges
#
Romain Francois wrote:

            
Which also finds the file "unzip.exe".
Please, folks, do read about regular expressions!

Uwe Ligges
#
On 12.05.2005 10:48, Suresh Krishna wrote:
Well, that's not exactly what you did; without the * in the first call, 
the result would have been the same.

Romain
#
Uwe Ligges wrote:

            
The confusion here is between regular expressions and wildcard 
expansion known as 'globbing'. The two things are very different, and 
use characters such as '*' '.' and '?' in different ways.

  There's added confusion when people come from a DOS background, where 
commands did their own thing when given '*' as a parameter. The DOS command:

  RENAME *.FOO *.BAR

  did what seems obvious, renaming all the .FOO files to .BAR, but on a 
Unix machine doing this with 'mv' can be destructive!

  In short (and slightly simplified), a '*' when expanded as a wildcard 
in a glob matches any string, whereas a '*' in a regular expression 
(regexp), matches the previous character 0 or more times. This is why 
"*.zip" is flagged as invalid now - there's no character before the "*".

  That should be enough clues to send you on your way.

  Baz
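
The two readings of '*' can be compared side by side. This is a rough sketch on made-up strings; it assumes the `glob2rx` that ships in the utils package of current R installations, used here only to translate the glob form:

```r
x <- c("a", "ab", "abbb", "abc")

# Regular expression: "b*" means zero or more "b" characters.
as_regex <- grepl("^ab*$", x)

# Glob: "ab*" means "ab" followed by any string; glob2rx() translates it.
as_glob <- grepl(utils::glob2rx("ab*"), x)

print(data.frame(x, as_regex, as_glob))
```

Read as a regular expression, the pattern accepts "a" but rejects "abc"; read as a glob, it is the other way round.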
#
Yes I missed the NEWS entry - or rather didn't realise its significance. 
So the "bug" was in the previous version and my old code which worked 
but shouldn't have.

Thanks for the replies - rapid and to the point as usual.

Steve.


Date sent:      	Thu, 12 May 2005 10:45:03 +0200
From:           	Uwe Ligges <ligges at statistik.uni-dortmund.de>
Organization:   	Fachbereich Statistik, Universitaet Dortmund
To:             	steve.roberts at manchester.ac.uk
Copies to:      	R-help at stat.math.ethz.ch
Subject:        	Re: [R] R2.1.0: Bug in list.files
#
Note that sfsmisc::glob2rx is a handy function that will convert glob style
wildcard expressions to regular expressions.
On 5/12/05, Steve Roberts <steve.roberts at manchester.ac.uk> wrote:
#
On Thu, 12 May 2005, Suresh Krishna wrote:
It finds any file name containing the substring "txt" beginning anywhere 
except the first letter. Now, this is exactly what "*.txt" used to do, so in 
that sense it is equivalent, but it probably isn't what you wanted. The 
pattern argument to list.files isn't a Windows wildcard expression. It 
never has been a Windows wildcard expression. It just so happens that 
".txt" is also a valid regular expression, but one that means something 
different from the Windows wildcard expression "*.txt".


 	-thomas
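
A small sketch of that point on made-up names (the strings are illustrative, apart from the "Not_a_txt_file.xls" example mentioned earlier in the thread):

```r
x <- c("notes.txt", "txt", "Not_a_txt_file.xls", "notes.txt.bak")

# "." matches any single character, so "txt" must be preceded by something.
dot_any <- grepl(".txt", x)

# Literal dot and end anchor: the name must end in ".txt".
dot_literal <- grepl("\\.txt$", x)

print(data.frame(x, dot_any, dot_literal))
```

Only "notes.txt" satisfies the anchored pattern, whereas the bare ".txt" pattern rejects nothing but the name that starts with "txt".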
#

        
    BaRow> Uwe Ligges wrote:
    >> Please read about regular expressions (!!!) and try to
    >> understand that ".txt" also finds "Not_a_txt_file.xls"
    >> ....


    BaRow>   The confusion here is between regular expressions
    BaRow> and wildcard expansion known as 'globbing'. The two
    BaRow> things are very different, and use characters such as
    BaRow> '*' '.' and '?' in different ways.

Exactly. I devised a "glob" to "regexp" function many
years ago in order to help newbies make the transition.

That function, nowadays called 'glob2rx', has been part of our
(CRAN) package "sfsmisc" and is hence available to all via
 
       install.packages("sfsmisc")
       library("sfsmisc")

But it's quite simple (though not trivial to read for the
inexperienced because of the many escapes ("\") needed),
and it may be helpful to see its code on R-help, below.
This topic has also led me to add 2 (obvious in hindsight)
optional logical arguments to the function, so that it now looks like

glob2rx <- function(pattern, trim.head = FALSE, trim.tail = TRUE)
{
    ## Purpose: Change "ls" aka "wildcard" aka "globbing" _pattern_ to
    ##	      Regular Expression (as in grep, perl, emacs, ...)
    ## -------------------------------------------------------------------------
    ## Author: Martin Maechler ETH Zurich, ~ 1991
    ##	       New version using [g]sub() : 2004
    p <- gsub('\\.','\\\\.', paste('^', pattern, '$', sep=''))
    p <- gsub('\\?',	 '.',  gsub('\\*',  '.*', p))
    ## these are trimming '.*$' and '^.*' - in most cases only for esthetics
    if(trim.tail) p <- sub("\\.\\*\\$$", '', p)
    if(trim.head) p <- sub("\\^\\.\\*",  '', p)
    p
}


So those confused newbies (and DOS long timers!)
could use

      list.files(myloc, glob2rx("*.zip"), full=TRUE)

            ## (yes, make a habit of using 'TRUE', not 'T' ..)

The current example code, BTW, has

    stopifnot(glob2rx("abc.*") == "^abc\\.",
               glob2rx("a?b.*") == "^a.b\\.",
               glob2rx("a?b.*", trim.tail=FALSE) == "^a.b\\..*$",
               glob2rx("*.doc") == "^.*\\.doc$",
               glob2rx("*.doc", trim.head=TRUE) == "\\.doc$",
               glob2rx("*.t*")  == "^.*\\.t",
               glob2rx("*.t??") == "^.*\\.t..$"
     )
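
The last of these conversions can be tried against actual files. The following is only a sketch: the file names are made up, and it assumes the `glob2rx` in the utils package of current R installations behaves like the function shown above:

```r
# Scratch directory with hypothetical file names.
d <- file.path(tempdir(), "glob2rx_demo")
dir.create(d, showWarnings = FALSE)
invisible(file.create(file.path(d, c("a.txt", "b.tex", "c.text", "d.t"))))

# "*.t??" as a glob: a dot, a "t", then exactly two more characters.
rx <- utils::glob2rx("*.t??")
hits <- list.files(d, pattern = rx)

print(rx)
print(hits)
```

With these names, only "a.txt" and "b.tex" should come back; "c.text" has one character too many after the "t" and "d.t" has too few.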


Martin Maechler,
ETH Zurich


#
I think glob2rx is of sufficient interest and sufficiently small
that it would be nice to have in the core of R without having to 
install and load sfsmisc.
On 5/12/05, Martin Maechler <maechler at stat.math.ethz.ch> wrote: