Message-ID: <27046.41726.702188.663853@stat.math.ethz.ch>
Date: 2026-03-03T08:59:42Z
From: Martin Maechler
Subject: [R-pkg-devel]  Strategy for dealing with websites serving HTTP 403 only when validated by 'R CMD check'
In-Reply-To: <fake-VM-id.5b81630d93aa39c4d069671047dadfac@talos.iv>

>>>>> Simon Urbanek 
>>>>>     on Tue, 3 Mar 2026 20:01:13 +1300 writes:

    > Henrik, yes, that's quite annoying 

indeed, and similar for all those of us who do want to have
resourceful help pages (and other R package documentation).


    >  - they respond with a
    > 403 which *does* have html content which the browsers
    > display and that page contains JavaScript code which calls
    > a CGI script on their server which bounces to CF's server
    > which after 7(!) more requests finally sets the challenge
    > cookie and re-directs back to winehq.org. However, what is
    > truly annoying is that the response is the same whether the
    > resource exists or not, so there is no way to verify the
    > URL. I'm somewhat shocked that they rely on the browsers
    > showing the error page and hijack it to quickly re-direct
    > from it so the user isn't even aware that the server
    > responded with an error.

    > More practically, I don't see that we can do anything
    > about it. Those URLs truly are responding with an
    > error, so short of emulating a full browser with
    > JavaScript (they also do fingerprinting etc. so it's
    > distinctly non-trivial - by design) there is no way to
    > verify them. Given the amount of shenanigans that page
    > does with the user's browser I'd say your approach is
    > probably good since the user won't accidentally click on
    > the link then :). But more seriously, this is a problem
    > since the idea behind checking URLs is a good one - they
    > do disappear or change quite often, so not checking them
    > is not an answer, either.

    > One special-case approach for cases like you mentioned
    > (i.e. where you want to check a top-domain as opposed to a
    > specific resource) is to use a resource that is guaranteed
    > (by design) to be accessible by direct requests, so for
    > example robots.txt. So for top-level URLs, we could fall
    > back to checking https://winehq.org/robots.txt which does
    > work (since most sites do want those to be directly
    > accessible). However, it doesn't help with URLs containing
    > specific paths as those will be still blocked.

    > Cheers, Simon

Thank you, Simon. Nice idea with `robots.txt`, but as you
mention, this will only apply to a relatively tiny fraction of the
URLs in R (package) documentation. ... or the 'R CMD check' R
functions could always try  https://<toplevel>/robots.txt  if
the  https://<toplevel>/<morestuff> URL gives a 403 ... ? ...
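That fallback could be sketched as a small shell helper (a sketch only;
the function name is illustrative and this is not part of 'R CMD check'):

```shell
# Sketch of the proposed fallback: if https://<toplevel>/<morestuff>
# answers 403, derive https://<toplevel>/robots.txt and probe that
# URL instead of the original one.
robots_fallback() {
  url="$1"
  scheme="${url%%://*}"      # e.g. "https"
  rest="${url#*://}"         # host plus any path
  host="${rest%%/*}"         # drop the path, keep the host
  printf '%s://%s/robots.txt\n' "$scheme" "$host"
}

robots_fallback "https://www.winehq.org/about/"
# -> https://www.winehq.org/robots.txt
```

One could then probe the derived URL with the same kind of HEAD
request Henrik shows below, e.g.
`curl --silent --head "$(robots_fallback https://www.winehq.org/some/page)"`.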

OTOH, isn't this rather a "world-wide" challenge/problem: 
"
 I want to check if an https URL is "valid" (i.e., not invalid),
 but I don't need to get any other data from its http server.
"
for which e.g. the W3C (World Wide Web Consortium) or others
should have provided recommendations or even protocols and tools?

Martin

    >> On 3/03/2026, at 18:08, Henrik Bengtsson
    >> <henrik.bengtsson at gmail.com> wrote:
    >> 
    >> I've started to get:
    >> 
    >> * checking CRAN incoming feasibility ... NOTE
    >> Found the following (possibly) invalid URLs:
    >>   URL: https://www.winehq.org/
    >>     From: inst/doc/parallelly-22-wine-workers.html
    >>     Status: 403
    >>     Message: Forbidden
    >> 
    >> when R CMD check:ing 'parallelly'. The page
    >> <https://www.winehq.org/> works fine in the web browser,
    >> but it is blocked (by Cloudflare) elsewhere, e.g.
    >> 
    >> $ curl --silent --head https://www.winehq.org/ | head -1
    >> HTTP/2 403
    >> 
    >> and
    >> 
    >> $ wget https://www.winehq.org/
    >> --2026-03-02 21:01:12--  https://www.winehq.org/
    >> Resolving www.winehq.org (www.winehq.org)... 104.26.8.100, 172.67.69.38, 104.26.9.100, ...
    >> Connecting to www.winehq.org (www.winehq.org)|104.26.8.100|:443... connected.
    >> HTTP request sent, awaiting response... 403 Forbidden
    >> 2026-03-02 21:01:12 ERROR 403: Forbidden.
    >> 
    >> I can only guess, but I suspect that
    >> <https://www.winehq.org/> started to do this to protect
    >> against AI-scraping bots, or similar. I can imagine more
    >> websites doing the same.
    >> 
    >> To avoid having to deal with this check NOTE everywhere
    >> (e.g. locally, CI, and on CRAN submission), my current
    >> strategy is to switch from \url{https://www.winehq.org/}
    >> to \code{https://www.winehq.org/} in the docs. Does
    >> anyone else have a better idea?
    >> 
    >> /Henrik
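[Henrik's workaround amounts to an Rd markup change along these lines
(illustrative fragment; `\code{}` renders the URL as code rather than
as a hyperlink, so the URL checker does not validate it, but readers
also lose the clickable link):]

    % Before: a hyperlink, validated by the URL checker -> 403 NOTE
    \url{https://www.winehq.org/}

    % After: plain code markup, not treated as a URL to verify
    \code{https://www.winehq.org/}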