Why does the lexical analyzer drop comments?

It happens in the token function in gram.c: 

    c = SkipSpace();
    if (c == '#') c = SkipComment();

and then SkipComment goes like that: 

static int SkipComment(void)
{
    int c;
    while ((c = xxgetc()) != '\n' && c != R_EOF) ;
    if (c == R_EOF) EndOfFile = 2;
    return c;
}

which effectively drops comments.

Would it be possible to keep the information somewhere?

The source code says this:

 *  The function yylex() scans the input, breaking it into
 *  tokens which are then passed to the parser.  The lexical
 *  analyser maintains a symbol table (in a very messy fashion).

so my question is: could we use this symbol table to keep track of, say, COMMENT tokens?

Why would I even care about that? I'm writing a package that will
perform syntax highlighting of R source code based on the output of the
parser, and it seems a waste to drop the comments.

Also, when you print a function to the R console, you don't get the comments, and some of them might be useful to the user.

Am I mad if I contemplate looking into this?

Comments are syntactically the same as whitespace.  You don't want them 
to affect the parsing.

If you're doing syntax highlighting, you can determine the whitespace by
looking at the srcref records, and then parse that to determine what 
isn't being counted as tokens.  (I think you'll find a few things there 
besides whitespace, but it is a fairly limited set, so shouldn't be too 
hard to recognize.)

The Rd parser is different, because in an Rd file, whitespace is 
significant, so it gets kept.

Duncan Murdoch
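
The gap-scanning idea above can be sketched in C. This is only an
illustration under stated assumptions, not code from R: the Span type and
comments_in_gaps() are hypothetical names, and the byte ranges stand in for
whatever regions the srcref records mark as code. Text outside those ranges
is scanned, and anything starting with '#' is treated as a comment that runs
to the end of the line.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a srcref: a half-open byte range [start, end)
   of the source text that the parser counted as code. */
typedef struct {
    size_t start;
    size_t end;
} Span;

/* Scan the text NOT covered by any span and report '#' comments.
   Spans must be sorted by start and must not overlap.
   The start offset of each comment is stored in out[] (at most max_out
   entries are written); the total number of comments found is returned. */
size_t comments_in_gaps(const char *src, const Span *spans, size_t nspans,
                        size_t *out, size_t max_out)
{
    assert(src != NULL && (spans != NULL || nspans == 0));
    assert(out != NULL || max_out == 0);

    size_t len = strlen(src), pos = 0, found = 0;

    /* There are nspans + 1 gaps: before the first span, between
       consecutive spans, and after the last one. */
    for (size_t i = 0; i <= nspans; i++) {
        size_t gap_end = (i < nspans) ? spans[i].start : len;
        while (pos < gap_end) {
            if (src[pos] == '#') {
                if (found < max_out) out[found] = pos;
                found++;
                /* a comment runs to the end of the line */
                while (pos < gap_end && src[pos] != '\n') pos++;
            } else {
                pos++; /* plain whitespace (or something else to classify) */
            }
        }
        if (i < nspans) pos = spans[i].end; /* jump over the code */
    }
    return found;
}
```

For the text "x <- 1 # set x\ny <- 2\n" with code spans [0, 6) and [15, 21),
the function reports one comment starting at offset 7. A '#' inside a string
literal is not reported, because the literal lies inside a code span. The
sketch can only see comments that fall between the recorded spans.
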

Duncan Murdoch wrote:
> Comments are syntactically the same as whitespace.  You don't want them
> to affect the parsing.

Well, you might, but there is quite some madness lying that way.

Back in the bronze age, we did actually try to keep comments attached to 
(AFAIR) the preceding token. One problem is that the elements of the 
parse tree typically involve multiple tokens, and if comments after 
different tokens get stored in the same place something is not going 
back where it came from when deparsing. So we had problems with comments 
moving from one end of a loop to the other and the like.

You could try extending the scheme by encoding which part of a syntactic 
structure the comment belongs to, but consider for instance how many 
places in a function call you can stick in a comment.

f #here
( #here
a #here (possibly)
= #here
1 #this one belongs to the argument, though
) #but here as well

______________________________________________
R-devel at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
   c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907

Peter Dalgaard wrote:
> Back in the bronze age, we did actually try to keep comments attached
> to (AFAIR) the preceding token. [...] So we had problems with
> comments moving from one end of a loop to the other and the like.

Ouch. That helps to picture the kind of madness ...

Another way could be to record comments separately (similarly to the srcfile
attribute, for example) instead of dropping them entirely, but I guess
this is the same as Duncan's idea, which is easier to set up.

Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr

Peter Dalgaard wrote:
> f #here
> ( #here
> a #here (possibly)
> = #here
> 1 #this one belongs to the argument, though
> ) #but here as well

Coming back on this. I actually get two expressions:

 > p <- parse( "/tmp/parsing.R")
 > str( p )
length 2 expression(f, (a = 1))
 - attr(*, "srcref")=List of 2
  ..$ :Class 'srcref'  atomic [1:6] 1 1 1 1 1 1
  .. .. ..- attr(*, "srcfile")=Class 'srcfile' <environment: 0x95c3c00>
  ..$ :Class 'srcref'  atomic [1:6] 2 1 6 1 1 1
  .. .. ..- attr(*, "srcfile")=Class 'srcfile' <environment: 0x95c3c00>
 - attr(*, "srcfile")=Class 'srcfile' <environment: 0x95c3c00>

But anyway, if I drop the first comment, then I get one expression with 
some srcref information:

 > p <- parse( "/tmp/parsing.R")
 > str( p )
length 1 expression(f(a = 1))
 - attr(*, "srcref")=List of 1
  ..$ :Class 'srcref'  atomic [1:6] 1 1 5 1 1 1
  .. .. ..- attr(*, "srcfile")=Class 'srcfile' <environment: 0x9bca314>
 - attr(*, "srcfile")=Class 'srcfile' <environment: 0x9bca314>

However, as far as I can see, there is only srcref information for the
expression as a whole and nothing below that level. So I am not sure I can
implement Duncan's proposal without more detailed information from the
parser: I can only check whether a stretch of whitespace is actually a
comment if it lies between two expressions that each have a srcref.

Would it be sensible then to retain the comments and their srcref 
information, but separate from the tokens used for the actual parsing, 
in some other attribute of the output of parse ?

Romain
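
The idea of retaining comments in a separate structure can be sketched as a
small scanner in C. This is an assumption-laden illustration, not the code
in gram.c: Scanner, CommentRef, scanner_init() and next_significant() are
hypothetical names, newlines are treated as blanks for brevity (R's real
lexer treats them as significant), and string literals are not recognised,
so a '#' inside a string would be recorded by mistake.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_COMMENTS 64

/* Hypothetical record of one comment: where it starts and how long it is. */
typedef struct {
    int line;       /* 1-based line of the '#' */
    int col;        /* 1-based column of the '#' */
    size_t offset;  /* byte offset of the '#' in the input */
    size_t length;  /* number of bytes up to, not including, the newline */
} CommentRef;

typedef struct {
    const char *src;  /* NUL-terminated input */
    size_t pos;       /* offset of the next unread character */
    int line, col;    /* 1-based position of src[pos] */
    CommentRef comments[MAX_COMMENTS];
    size_t ncomments;
} Scanner;

void scanner_init(Scanner *s, const char *src)
{
    assert(s != NULL && src != NULL);
    s->src = src;
    s->pos = 0;
    s->line = 1;
    s->col = 1;
    s->ncomments = 0;
}

/* Skip blanks and comments, but remember each comment in s->comments
   instead of discarding it. The parser never sees the comments.
   Returns the next significant character, or 0 at end of input. */
int next_significant(Scanner *s)
{
    assert(s != NULL);
    for (;;) {
        char c = s->src[s->pos];
        if (c == '\0') return 0;
        if (c == '#') {
            size_t start = s->pos;
            int col = s->col;
            /* advance to the newline (or end of input) without consuming it */
            while (s->src[s->pos] != '\0' && s->src[s->pos] != '\n') {
                s->pos++;
                s->col++;
            }
            if (s->ncomments < MAX_COMMENTS) {
                CommentRef *r = &s->comments[s->ncomments++];
                r->line = s->line;
                r->col = col;
                r->offset = start;
                r->length = s->pos - start;
            }
            continue;
        }
        s->pos++;
        if (c == '\n') { s->line++; s->col = 1; continue; }
        s->col++;
        if (c == ' ' || c == '\t') continue;
        return c;
    }
}
```

Run over "f #here\n( #here\na\n)", repeated calls return 'f', '(', 'a', ')'
and then 0, while the side table holds two comments, each 5 bytes long and
starting at column 3 of lines 1 and 2. The token stream is unchanged by the
comments, which is the property Duncan asks for; where the table would be
attached to the output of parse is left open here.
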
Romain Francois

Romain Francois wrote:
> [...] there is only srcref information for that expression as a
> whole, it does not go beyond [...]

Currently srcrefs are only attached to whole statements.  Since your 
source only included one or two statements, you only get one or two 
srcrefs.  It would not be hard to attach a srcref to every
subexpression; there hasn't been a need for that before, so for the
sake of efficiency I didn't do it.

However, it might make sense for you to have your own parser, based on 
the grammar in R's parser, but handling white space differently. 
Certainly it would make sense to do that before making changes to the 
base R one.  The whole source is in src/main/gram.y; if you're not 
familiar with Bison, I can give you a hand.

Duncan Murdoch

Duncan Murdoch wrote:
> Currently srcrefs are only attached to whole statements.  Since your
> source only included one or two statements, you only get one or two
> srcrefs.  It would not be hard to attach a srcref to every
> subexpression; there hasn't been a need for that before, so I didn't
> do it just for the sake of efficiency.

I understand that. I wanted to make sure I did not miss something.

> However, it might make sense for you to have your own parser, based on
> the grammar in R's parser, but handling white space differently.
> Certainly it would make sense to do that before making changes to the
> base R one.  The whole source is in src/main/gram.y; if you're not
> familiar with Bison, I can give you a hand.

Thank you, I appreciate your help. Having my own parser is the option I
am slowly converging to.
I'll start with reading the Bison documentation. Besides the Bison documents,
is there R-specific documentation on how the R parser was written?

Romain Francois

Romain Francois wrote:
> I'll start with reading bison documentation. Besides bison documents, is
> there R specific documentation on how the R parser was written ?

I don't think so.

Duncan Murdoch
Hi Romain,

I've been thinking for quite a long time on how to keep comments when
parsing R code and finally got a trick with inspiration from one of my
friends, i.e. to mask the comments in special assignments to "cheat" R
parser:

# keep.comment: whether to keep the comments or not
# keep.blank.line: preserve blank lines or not?
# begin.comment and end.comment: special identifiers that mark the original
#     comments as 'begin.comment = "#[ comments ]end.comment"'
#     and these marks will be removed after the modified code is parsed
tidy.source <- function(source = "clipboard", keep.comment = TRUE,
    keep.blank.line = FALSE, begin.comment, end.comment, ...) {
    # parse and deparse the code
    tidy.block = function(block.text) {
        exprs = parse(text = block.text)
        n = length(exprs)
        res = character(n)
        for (i in seq_len(n)) {
            dep = paste(deparse(exprs[i]), collapse = "\n")
            res[i] = substring(dep, 12, nchar(dep) - 1)
        }
        return(res)
    }
    text.lines = readLines(source, warn = FALSE)
    if (keep.comment) {
        # identifier for comments
        identifier = function() paste(sample(LETTERS), collapse = "")
        if (missing(begin.comment))
            begin.comment = identifier()
        if (missing(end.comment))
            end.comment = identifier()
        # remove leading and trailing white spaces
        text.lines = gsub("^[[:space:]]+|[[:space:]]+$", "",
            text.lines)
        # make sure the identifiers are not in the code
        # or the original code might be modified
        while (length(grep(sprintf("%s|%s", begin.comment, end.comment),
            text.lines))) {
            begin.comment = identifier()
            end.comment = identifier()
        }
        head.comment = substring(text.lines, 1, 1) == "#"
        # add identifiers to comment lines to cheat R parser
        if (any(head.comment)) {
            text.lines[head.comment] = gsub("\"", "\'",
                text.lines[head.comment])
            text.lines[head.comment] = sprintf("%s=\"%s%s\"",
                begin.comment, text.lines[head.comment], end.comment)
        }
        # keep blank lines?
        blank.line = text.lines == ""
        if (any(blank.line) && keep.blank.line)
            text.lines[blank.line] = sprintf("%s=\"%s\"", begin.comment,
                end.comment)
        text.tidy = tidy.block(text.lines)
        # remove the identifiers
        text.tidy = gsub(sprintf("%s = \"|%s\"", begin.comment,
            end.comment), "", text.tidy)
    }
    else {
        text.tidy = tidy.block(text.lines)
    }
    cat(paste(text.tidy, collapse = "\n"), "\n", ...)
    invisible(text.tidy)
}

The above function can deal with comments which are in single lines, e.g.

f = tempfile()
writeLines('
  # rotation of the word "Animation"
# in a loop; change the angle and color
# step by step
for (i in 1:360) {
# redraw the plot again and again
plot(1,ann=FALSE,type="n",axes=FALSE)
# rotate; use rainbow() colors
text(1,1,"Animation",srt=i,col=rainbow(360)[i],cex=7*i/360)
# pause for a while
Sys.sleep(0.01)}
', f)

Then parse the code file 'f':

tidy.source(f)

# rotation of the word 'Animation'
# in a loop; change the angle and color
# step by step
for (i in 1:360) {
    # redraw the plot again and again
    plot(1, ann = FALSE, type = "n", axes = FALSE)
    # rotate; use rainbow() colors
    text(1, 1, "Animation", srt = i, col = rainbow(360)[i], cex = 7 *
        i/360)
    # pause for a while
    Sys.sleep(0.01)
}

Of course this function has some limitations: it does not support
inline comments or comments which are inside incomplete code lines.
Peter's example

f #here
( #here
a #here (possibly)
= #here
1 #this one belongs to the argument, though
) #but here as well

will be parsed as

f
(a = 1)

I'm quite interested in syntax highlighting of R code and saw your
previous discussions in other posts (with Jose Quesada, etc). I'd
like to do something for your package if I could be of some help.

Regards,
Yihui
--
Yihui Xie <xieyihui at gmail.com>
Phone: +86-(0)10-82509086 Fax: +86-(0)10-82509086
Mobile: +86-15810805877
Homepage: http://www.yihui.name
School of Statistics, Room 1037, Mingde Main Building,
Renmin University of China, Beijing, 100872, China


Hi,

Thank you for this (inspired) trick. I am currently in the process of 
extracting the parser from R (i.e. the gram.y file) and making a
custom parser using the same grammar but structuring the output in a 
different manner, more suitable for what the syntax highlighter will need.

You will find the project here: 
http://r-forge.r-project.org/projects/highlight/
Feel free to "request to join" on the project if you feel you can make 
useful contributions.

At the moment, I am concentrating efforts deep down in the parser code, 
but there are other challenges:
- once the expressions are parsed, we will need something that
investigates function calls to find evidence of where each function is
defined (by the user, in a package, ...). This is tricky, and unless you
actually evaluate the code, there will be some errors.
- once the evidence is collected, other functions (renderers) will have
the task of rendering it using html, latex, rtf, ansi escape codes, ...
The idea here is to design the system so that other packages can
implement custom renderers to format the evidence in their own markup
language.

Romain
Hi Romain,

I've been thinking for quite a long time on how to keep comments when
parsing R code and finally got a trick with inspiration from one of my
friends, i.e. to mask the comments in special assignments to "cheat" R
parser

# keep.comment: whether to keep the comments or not
# keep.blank.line: preserve blank lines or not?
# begin.comment and end.comment: special identifiers that mark the orignial
#     comments as 'begin.comment = "#[ comments ]end.comment"'
#     and these marks will be removed after the modified code is parsed
tidy.source <- function(source = "clipboard", keep.comment = TRUE,
    keep.blank.line = FALSE, begin.comment, end.comment, ...) {
    # parse and deparse the code
    tidy.block = function(block.text) {
        exprs = parse(text = block.text)
        n = length(exprs)
        res = character(n)
        for (i in 1:n) {
            dep = paste(deparse(exprs[i]), collapse = "\n")
            res[i] = substring(dep, 12, nchar(dep) - 1)
        }
        return(res)
    }
    text.lines = readLines(source, warn = FALSE)
    if (keep.comment) {
        # identifier for comments
        identifier = function() paste(sample(LETTERS), collapse = "")
        if (missing(begin.comment))
            begin.comment = identifier()
        if (missing(end.comment))
            end.comment = identifier()
        # remove leading and trailing white spaces
        text.lines = gsub("^[[:space:]]+|[[:space:]]+$", "",
            text.lines)
        # make sure the identifiers are not in the code
        # or the original code might be modified
        while (length(grep(sprintf("%s|%s", begin.comment, end.comment),
            text.lines))) {
            begin.comment = identifier()
            end.comment = identifier()
        }
        head.comment = substring(text.lines, 1, 1) == "#"
        # add identifiers to comment lines to cheat R parser
        if (any(head.comment)) {
            text.lines[head.comment] = gsub("\"", "\'",
text.lines[head.comment])
            text.lines[head.comment] = sprintf("%s=\"%s%s\"",
                begin.comment, text.lines[head.comment], end.comment)
        }
        # keep blank lines?
        blank.line = text.lines == ""
        if (any(blank.line) & keep.blank.line)
            text.lines[blank.line] = sprintf("%s=\"%s\"", begin.comment,
                end.comment)
        text.tidy = tidy.block(text.lines)
        # remove the identifiers
        text.tidy = gsub(sprintf("%s = \"|%s\"", begin.comment,
            end.comment), "", text.tidy)
    }
    else {
        text.tidy = tidy.block(text.lines)
    }
    cat(paste(text.tidy, collapse = "\n"), "\n", ...)
    invisible(text.tidy)
}
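
The masking trick at the heart of tidy.source() can be seen in isolation with a single comment line. This is only a minimal sketch: the fixed markers BEGINMARK and ENDMARK are hypothetical stand-ins for the random identifiers generated above, and keep.source = FALSE is set explicitly so that deparse() shows only what the parser actually kept.

```r
# Hypothetical fixed markers; tidy.source() draws random ones instead.
begin <- "BEGINMARK"
end <- "ENDMARK"

line <- "# step by step"

# Disguise the comment as the assignment of a string, so the parser keeps it.
masked <- sprintf("%s=\"%s%s\"", begin, line, end)

# Round trip through the parser and the deparser.
expr <- parse(text = masked, keep.source = FALSE)[[1]]
dep <- paste(deparse(expr), collapse = "\n")

# Strip the markers again to recover the comment text.
restored <- gsub(sprintf("%s = \"|%s\"", begin, end), "", dep)
print(restored)
```

A comment that itself contains a double quote would end the masking string early, which is why the function first replaces double quotes in comment lines with single quotes.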

The above function can deal with comments that occupy whole lines, e.g.

f = tempfile()
writeLines('
  # rotation of the word "Animation"
# in a loop; change the angle and color
# step by step
for (i in 1:360) {
# redraw the plot again and again
plot(1,ann=FALSE,type="n",axes=FALSE)
# rotate; use rainbow() colors
text(1,1,"Animation",srt=i,col=rainbow(360)[i],cex=7*i/360)
# pause for a while
Sys.sleep(0.01)}
', f)

Then parse the code file 'f':

tidy.source(f)

# rotation of the word 'Animation'
# in a loop; change the angle and color
# step by step
for (i in 1:360) {
    # redraw the plot again and again
    plot(1, ann = FALSE, type = "n", axes = FALSE)
    # rotate; use rainbow() colors
    text(1, 1, "Animation", srt = i, col = rainbow(360)[i], cex = 7 *
        i/360)
    # pause for a while
    Sys.sleep(0.01)
}

Of course this function has some limitations: it does not support
inline comments or comments which are inside incomplete code lines.
Peter's example

f #here
( #here
a #here (possibly)
= #here
1 #this one belongs to the argument, though
) #but here as well

will be parsed as

f
(a = 1)
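
The same result can be reproduced with parse() and deparse() alone, without tidy.source(). A small sketch (keep.source = FALSE is set explicitly so that only what the parser retains gets deparsed):

```r
code <- c(
    "f #here",
    "( #here",
    "a #here (possibly)",
    "= #here",
    "1 #this one belongs to the argument, though",
    ") #but here as well"
)

# The lexer skips everything from '#' to the end of each line.
exprs <- parse(text = code, keep.source = FALSE)

# Deparse each top-level expression separately.
out <- vapply(as.list(exprs),
    function(e) paste(deparse(e), collapse = "\n"), character(1))

print(length(exprs))
cat(out, sep = "\n")
```

Note that the input yields two top-level expressions rather than the call f(a = 1): at top level, f on a line of its own is already a complete expression, so the parenthesised part is parsed separately.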

I'm quite interested in syntax highlighting of R code and saw your
previous discussions in other posts (with Jose Quesada, etc.). I'd
like to do something for your package if I can be of some help.

Regards,
Yihui
--
Yihui Xie <xieyihui at gmail.com>
Phone: +86-(0)10-82509086 Fax: +86-(0)10-82509086
Mobile: +86-15810805877
Homepage: http://www.yihui.name
School of Statistics, Room 1037, Mingde Main Building,
Renmin University of China, Beijing, 100872, China

--
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr

At the moment, I am concentrating efforts deep down in the parser code, but
there are other challenges:
- once the expressions are parsed, we will need something that inspects
them for evidence about function calls, to get an idea of where each function
is defined (by the user, in a package, ...). This is tricky, and unless you
actually evaluate the code, some errors will be made.
Are you aware of Luke Tierney's codetools package?  That would seem to
be the place to start.

Hadley
http://had.co.nz/
Yep. The plan is to combine the more verbose information coming out of the
modified parser with the same guessing machinery that checkUsage uses.
Another side effect is that we could link the problems reported by
checkUsage (no visible binding for global variable "y", ...) to actual
locations in the file (in that case, the place where the variable y is
used). At the moment this is not possible, because the parser only locates
entire expressions (semantic groupings) and not tokens.

 > f <- function( x = 2) {
+ y + 2
+ }
 > checkUsage( f )
<anonymous>: no visible binding for global variable ‘y’
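
The gap between expression-level and token-level locations can be illustrated as follows. This is a sketch for readers on later versions of R: the second half assumes R >= 3.0.0, where utils::getParseData() exposes the parser's token table, something that was not available when this thread was written. The srcref half shows what the parser records for each top-level expression.

```r
code <- c(
    "f <- function(x = 2) {",
    "  y + 2  # y is not defined",
    "}"
)
exprs <- parse(text = code, keep.source = TRUE)

# One srcref per top-level expression: it only tells us "lines 1 to 3".
refs <- attr(exprs, "srcref")
expr.lines <- as.integer(refs[[1]])[c(1, 3)]   # first and last line

# Token table (assumes R >= 3.0.0): one row per token, comments included.
pd <- utils::getParseData(exprs)
y.use <- pd[pd$token == "SYMBOL" & pd$text == "y", c("line1", "col1")]
comments <- pd$text[pd$token == "COMMENT"]

print(expr.lines)
print(y.use)
print(comments)
```

With token-level rows like these, a message about the variable y could be tied to the line and column where y is used rather than to the whole function definition.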

Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr