[Bioc-devel] A handful of check to follow up on R CMD BiocCheck
Hi Martin, Thanks for spending the time and effort to refactor and integrate my code into BiocCheck. I didn't intend to give you so much to think about ! For the record, I used the "checkFunctionLengths" method (now starting at line 829 of checks.R) as a template to offer code somewhat homogenous with what I could see in place already. I believe that function should also be refactored using your new "Context" and "handleContext" helpers, right? Side note: Far from trying to compete with the BiocCheck Bioconductor package, I couldn't help but update my "follow-up" "BiocCheckTools" Python code <https://github.com/kevinrue/BiocCheckTools/blob/master/README.md> in the meantime (sorry for unoriginal name, I'll take suggestions!) taking Kasper's advice on board: - Single master file which runs everything and gives a report. - *Bonus*: each module can be run separately (line_chars; tab_width). - All line numbers are reported - Content of the line is not displayed - I didn't mark up TABs, but I indicate how many spaces they are made of (when not a multiple of 4) All the best, Kevin On Wed, Nov 9, 2016 at 11:31 PM, Martin Morgan <
martin.morgan at roswellpark.org> wrote:
On 11/03/2016 08:14 AM, Kevin RUE wrote:
Apologies for the additional spam, for two reasons:
* The diff files that I've previously sent had the base and modified
versions swapped. This new one fixes that.
* This new diff file (always relative to the code I cloned from
Bioconductor-mirror) also fixes a bug whereby the updated code would
fail on packages that do not break any of the three guidelines.
Hi Kevin -- thanks for these, I incorporated a version in 1.11.1.
The report is for the first 6 lines of offenders; the code for finding all
offenders is mentioned in the vignette
BiocCheck:::checkFormatting("path/to/YourPackage", nlines=Inf)
Marcel Ramos mentions the lintr (CRAN) and goodpractice (github only,
https://github.com/MangoTheCat/goodpractice) packages as useful resources
for these types of questions; also formatR (CRAN).
For the record, it was fun to work through your code. The basic pattern
was to read lines and flag those that violated a (usually simple) test, add
these to a data frame, then if after parsing all files the data frame had
any rows report these to the user.
I standardized the columns of the data frame across checks to facilitate
re-use, and imposed the standard by using a constructor -- non-exported
Context(). I took a little bit of pain to make the constructor work (return
a data frame with the correct columns but no rows) when there are no lines
of code flagged to be added to the context, so that I could use it without
have to insert a bunch of conditional `if (...)` in my code that make it
both harder to understand and harder to test. Also, the constructor is
vectorized, whereas your implementation was iterative.
For reporting to the user, I wrote a small helper handleContext() that I
could re-use across the different types of checks (and likely elsewhere in
the package) and also maintain the separation of what is reported to the
user (e.g., in handleContext()) from how it is reported to the user --
.msg, .verbatim, etc in the utils.R file. Probably the reporting could have
been refactored further.
I feel guilty about not writing unit tests; the package does include
tests, including a handy package-generator function to make packages that
are invalid in various ways. It seems like a better way to go would be to
further refactor the body of the checkFormatting() function so that one
could for instance invoke a function checkLineLengths(pkg, file, lines) and
get back a data.frame created by Context(); it would then be easy to test
these for various 'lines' without having to figure out how to make test
packages. But this seemed of course like too much additional work for this
feature (I'll probably regret not expending the effort at a future date).
I also found myself using the horrible 'copy-and-append' paradigm
ctxt = Context()
for (fl in files) {
...
ctxt = rbind(ctxt, Context(...)
}
which makes a copy of (the increasingly large) ctxt each time through the
loop and therefore scales quadratically (when each file has some problems)
with the number of files. I'm counting on the number of files to be small
(e.g., less than 10000, when the copy-and-append pattern starts to be
expensive even for trivial operations; mzR has ~8750 files, though many of
these are not checked for formatting) so that this pattern does not become
a bottleneck. It was still painful to use...
Martin
Best,
Kevin
On Thu, Nov 3, 2016 at 11:49 AM, Kevin RUE <kevinrue67 at gmail.com
<mailto:kevinrue67 at gmail.com>> wrote:
Hi all,
Please find attached the diff relative to the code that I cloned
from Bioconductor-mirror yesterday (please ignore the previous diff
file).
Basically three new features:
* As per previous email: display up to the first 6 lines that are
over 80 characters long
* *New*: display up to the first 6 lines that are not indented by
a multiple of 4 spaces
* *New*: display up to the first 6 lines that use TAB instead of 4
spaces for indentation
I also attach the output of the updated code
<https://github.com/kevinrue/BiocCheck>, to illustrate the changes.
Notes:
* For demonstration purpose, I indented a handful of lines from
the checks.R file itself with TAB characters. I assume that's
OK, as some lines were already longer than 80 characters and not
indented by a multiple of 4 spaces.
All the best,
Kevin
On Wed, Nov 2, 2016 at 10:00 PM, Kevin RUE <kevinrue67 at gmail.com
<mailto:kevinrue67 at gmail.com>> wrote:
Me again :)
Please find attached the first patch to print the first 6 lines
over 80 characters long. (I'll get to the tabulation offenders
next).
Note that all the offending lines are stored in the "df.length"
data.frame. How about an option like "fullReport=c(FALSE, TRUE)"
that print *all* the offending lines?
The data.frame also stores the content of the lines for the
record, but does not print them. I think Kasper is right:
filename and line should be enough to track down the line.
All the best,
Kevin
On Wed, Nov 2, 2016 at 8:08 PM, Kevin RUE <kevinrue67 at gmail.com
<mailto:kevinrue67 at gmail.com>> wrote:
Thanks for the feedback!
I also tend to prefer *all* the lines being reported (or to
be honest, that was really true when I had lots of them; a
problem that I largely mitigated by fixing all of them once
and subsequently paying more attention while developing).
Printing the content of the offending line somewhat helps me
spot the line faster (more so for tab issues). But I must
admit that showing the whole line is somewhat "overkill". I
just started thinking of a compromise being to only show the
first N characters of the line, with N being 80 minus the
number of characters necessary to print the filename and
line number.
Thanks Martin for pointing out the lines in BiocCheck. (Now
I feel bad for not having checked sooner.. hehe!)
I think the idea of BiocCheck showing the first 6 offenders
in BiocCheck quite nice, as I rarely have more since I use
using the RStudio "Tools > Global Options > Code > Display >
Show Margin > Margin column: 80" feature.
I'll give a go at both approaches (developing BiocCheck and
my own scripts)
Cheers,
Kevin
On Wed, Nov 2, 2016 at 7:41 PM, Kasper Daniel Hansen
<kasperdanielhansen at gmail.com
<mailto:kasperdanielhansen at gmail.com>> wrote:
I would prefer all line numbers reported, but on the
other hand I am indifferent wrt. the content of the
line, unless (say) TABs are marked up somehow.
Kasper
On Wed, Nov 2, 2016 at 3:17 PM, Martin Morgan
<martin.morgan at roswellpark.org
<mailto:martin.morgan at roswellpark.org>> wrote:
On 11/02/2016 02:49 PM, Kevin RUE wrote:
Dear all,
Just thought I'd share a handful of scripts that
I wrote to follow up on
certain NOTE messages thrown by R CMD BiocCheck.
https://github.com/kevinrue/BiocCheckTools
<https://github.com/kevinrue/BiocCheckTools>
They're very simple, but I occasionally find
them quite convenient.
Apologies if something similar already exists
somewhere :)
Maybe consider creating a diff against the source
code that, e.g., reported the first 6 offenders? The
relevant lines are near
https://github.com/Bioconducto
r-mirror/BiocCheck/blob/master/R/checks.R#L1081
<https://github.com/Bioconduct
or-mirror/BiocCheck/blob/master/R/checks.R#L1081>
Martin
All the best,
Kevin
[[alternative HTML version deleted]]
_______________________________________________
Bioc-devel at r-project.org
<mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
<https://stat.ethz.ch/mailman/listinfo/bioc-devel
This email message may contain legally privileged
and/or...{{dropped:2}}
_______________________________________________
Bioc-devel at r-project.org
<mailto:Bioc-devel at r-project.org> mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel
<https://stat.ethz.ch/mailman/listinfo/bioc-devel>
This email message may contain legally privileged and/or confidential information. If you are not the intended recipient(s), or the employee or agent responsible for the delivery of this message to the intended recipient(s), you are hereby notified that any disclosure, copying, distribution, or use of this email message is prohibited. If you have received this message in error, please notify the sender immediately by e-mail and delete this email message from your computer. Thank you.