[R-pkg-devel] Strategy for dealing with websites serving HTTP 403 only when validated by 'R CMD check'

Tue, Mar 3, 2026 12:59 AM

> Henrik, yes, that's quite annoying 

indeed, and similarly for all of us who want to have
resourceful help pages (and other R package documentation).


    >  - they respond with a
    > 403 which *does* have html content which the browsers
    > display and that page contains JavaScript code which calls
    > a CGI script on their server which bounces to CF's server
    > which after 7(!) more requests finally sets the challenge
    > cookie and re-directs back to winehq.org. However, what is
    > truly annoying, the response is the same whether the
    > resource exists or not, so there is no way to verify the
    > URL. I'm somewhat shocked that they rely on the browsers
    > showing the error page and hijack it to quickly re-direct
    > from it so the user isn't even aware that the server
    > responded with an error.

    > More practically, I don't see that we can do anything
    > about it. Those URLs are truly responding with an
    > error, so short of emulating a full browser with
    > JavaScript (they also do fingerprinting etc. so it's
    > distinctly non-trivial - by design) there is no way to
    > verify them. Given the amount of shenanigans that page
    > does with the user's browser I'd say your approach is
    > probably good since the user won't accidentally click on
    > the link then :). But more seriously, this is a problem
    > since the idea behind checking URLs is a good one - they
    > do disappear or change quite often, so not checking them
    > is not an answer, either.

    > One special-case approach for cases like you mentioned
    > (i.e. where you want to check a top-domain as opposed to a
    > specific resource) is to use a resource that is guaranteed
    > (by design) to be accessible by direct requests, so for
    > example robots.txt. So for top-level URLs, we could fall
    > back to checking https://winehq.org/robots.txt which does
    > work (since most sites do want those to be directly
    > accessible). However, it doesn't help with URLs containing
    > specific paths as those will be still blocked.

    > Cheers, Simon

Thank you, Simon. Nice idea with `robots.txt`, but as you
mention this will only apply to a relatively tiny fraction of our
URLs in R (package) documentation. ... or the 'R CMD check' R
functions could always try  https://<toplevel>/robots.txt  if
the  https://<toplevel>/<morestuff> URL gives a 403 ... ? ...
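
To make the fallback idea concrete, here is a minimal sketch, written in
Python rather than R purely for illustration. It is not how 'R CMD check'
works today; the helper names `head_status`, `robots_url` and `check_url`
are hypothetical, and the rule "accept the URL if robots.txt answers 200"
is an assumption about what such a fallback might look like.

```python
from urllib.error import HTTPError
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen


def head_status(url, timeout=10):
    """Return the HTTP status of a HEAD request, or None if no response."""
    request = Request(url, method="HEAD")  # headers only, no body download
    try:
        with urlopen(request, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        return err.code  # the server answered, but with an error status
    except OSError:
        return None  # DNS failure, refused connection, timeout, ...


def robots_url(url):
    """Map any URL to the robots.txt at the top level of the same host."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def check_url(url, get_status=head_status):
    """Check a URL; on a 403, retry against the host's robots.txt.

    Returns (status, how), where how says which request decided the result.
    Assumption: a 200 for robots.txt only shows that the host is alive,
    not that the original path exists.
    """
    status = get_status(url)
    if status != 403:
        return status, "direct"
    fallback = get_status(robots_url(url))
    if fallback == 200:
        return fallback, "robots.txt fallback"
    return status, "direct"
```

Passing the status function as an argument keeps the decision logic
testable without network access. As Simon notes, a successful fallback
says nothing about <morestuff> itself, so a real check would probably
still want to report that the URL was only verified at the host level.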

OTOH, isn't this rather a "world-wide" challenge/problem: 
"
 I want to check if an https URL is "valid" (i.e., not invalid),
 but I don't need to get any other data from its http server.
"
for which e.g. the W3C (World Wide Web Consortium) or others
should have provided recommendations or even protocols and tools?

Martin

    >> On 3/03/2026, at 18:08, Henrik Bengtsson
    >> <henrik.bengtsson at gmail.com> wrote:
    >>
    >> I've started to get:
    >> 
    >> * checking CRAN incoming feasibility ... NOTE
    >> Found the following (possibly) invalid URLs:
    >>   URL: https://www.winehq.org/
    >>     From: inst/doc/parallelly-22-wine-workers.html
    >>     Status: 403
    >>     Message: Forbidden
    >> 
    >> when R CMD check:ing 'parallelly'. The page
    >> <https://www.winehq.org/> works fine in the web browser,
    >> but it is blocked (by Cloudflare) elsewhere, e.g.
    >> 
    >> $ curl --silent --head https://www.winehq.org/ | head -1
    >> HTTP/2 403
    >> 
    >> and
    >> 
    >> $ wget https://www.winehq.org/
    >> --2026-03-02 21:01:12--  https://www.winehq.org/
    >> Resolving www.winehq.org (www.winehq.org)... 104.26.8.100, 172.67.69.38, 104.26.9.100, ...
    >> Connecting to www.winehq.org (www.winehq.org)|104.26.8.100|:443... connected.
    >> HTTP request sent, awaiting response... 403 Forbidden
    >> 2026-03-02 21:01:12 ERROR 403: Forbidden.
    >> 
    >> I can only guess, but I suspect that
    >> <https://www.winehq.org/> started to do this to protect
    >> against AI-scraping bots, or similar. I can imagine more
    >> websites doing the same.
    >> 
    >> To avoid having to deal with this check NOTE everywhere
    >> (e.g. locally, CI, and on CRAN submission), my current
    >> strategy is to switch from \url{https://www.winehq.org/}
    >> to \code{https://www.winehq.org/} in the docs. Does
    >> anyone else have a better idea?
    >> 
    >> /Henrik
