[R-pkg-devel] Strategy for dealing with websites serving HTTP 403 only when validated by 'R CMD check'
Simon Urbanek
on Tue, 3 Mar 2026 20:01:13 +1300 writes:
> Henrik, yes, that's quite annoying
indeed, and similar for all those of us who do want to have
resourceful help pages (and other R package documentation).
> - they respond with a
> 403 which *does* have html content which the browsers
> display and that page contains JavaScript code which calls
> a CGI script on their server which bounces to CF's server
> which after 7(!) more requests finally sets the challenge
> cookie and re-directs back to winehq.org. However, what is
> truly annoying, the response is the same whether the
> resource exists or not, so there is no way to verify the
> URL. I'm somewhat shocked that they rely on the browsers
> showing the error page and hijack it to quickly re-direct
> from it so the user isn't even aware that the server
> responded with an error.
> More practically, I don't see that we can do anything
> about it. Those URLs are truly responding with an
> error, so short of emulating a full browser with
> JavaScript (they also do fingerprinting etc. so it's
> distinctly non-trivial - by design) there is no way to
> verify them. Given the amount of shenanigans that page
> does with the user's browser I'd say your approach is
> probably good since the user won't accidentally click on
> the link then :). But more seriously, this is a problem
> since the idea behind checking URLs is a good one - they
> do disappear or change quite often, so not checking them
> is not an answer, either.
> One special-case approach for cases like you mentioned
> (i.e. where you want to check a top-domain as opposed to a
> specific resource) is to use a resource that is guaranteed
> (by design) to be accessible by direct requests, so for
> example robots.txt. So for top-level URLs, we could fall
> back to checking https://winehq.org/robots.txt which does
> work (since most sites do want those to be directly
> accessible). However, it doesn't help with URLs containing
> specific paths as those will still be blocked.
> Cheers, Simon
Thank you, Simon. Nice idea with `robots.txt`, but as you
mention this will only apply to a relatively tiny fraction of our
URLs in R (package) documentation. ... or the 'R CMD check' R
functions could always try https://<toplevel>/robots.txt if
the https://<toplevel>/<morestuff> URL gives a 403 ... ? ...
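
To make that fallback idea concrete, here is a minimal sketch in Python (not the actual 'R CMD check' code). The names robots_url(), check_url() and get_status are hypothetical; get_status stands for any function that returns the HTTP status code for a URL, e.g. a wrapper around a HEAD request. As Simon noted, the 403 is the same whether or not the resource exists, so a successful fallback only shows that the site is reachable, not that the specific page is there.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url(url):
    """Return scheme://host/robots.txt for the given URL."""
    parts = urlsplit(url)
    # Keep scheme and host, drop path, query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def check_url(url, get_status):
    """Classify a URL using a caller-supplied get_status(url) -> int.

    Returns "ok" for a 2xx/3xx answer, "site-reachable" when the URL
    gives 403 but robots.txt on the same host answers 200, and
    "invalid" otherwise.
    """
    status = get_status(url)
    if 200 <= status < 400:
        return "ok"
    if status == 403 and get_status(robots_url(url)) == 200:
        # Only the host is confirmed here, not the resource itself.
        return "site-reachable"
    return "invalid"


# Example with canned status codes (assumed values, no network access):
canned = {
    "https://www.winehq.org/": 403,
    "https://www.winehq.org/robots.txt": 200,
}
print(check_url("https://www.winehq.org/", lambda u: canned.get(u, 404)))
```

Reporting "site-reachable" separately from "ok" would let a checker downgrade such a URL to a milder message instead of silently accepting it.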
OTOH, isn't this rather a "world-wide" challenge/problem:
"
I want to check if an https URL is "valid" (i.e., not invalid),
but I don't need to get any other data from its http server.
"
for which e.g. the W3C (World Wide Web Consortium) or others
should have provided recommendations or even protocols and tools?
Martin
>> On 3/03/2026, at 18:08, Henrik Bengtsson
>> <henrik.bengtsson at gmail.com> wrote:
>>
>> I've started to get:
>>
>> * checking CRAN incoming feasibility ... NOTE
>> Found the following (possibly) invalid URLs:
>>   URL: https://www.winehq.org/
>>     From: inst/doc/parallelly-22-wine-workers.html
>>     Status: 403
>>     Message: Forbidden
>>
>> when R CMD check:ing 'parallelly'. The page
>> <https://www.winehq.org/> works fine in the web browser,
>> but it is blocked (by Cloudflare) elsewhere, e.g.
>>
>> $ curl --silent --head https://www.winehq.org/ | head -1
>> HTTP/2 403
>>
>> and
>>
>> $ wget https://www.winehq.org/
>> --2026-03-02 21:01:12--  https://www.winehq.org/
>> Resolving www.winehq.org (www.winehq.org)... 104.26.8.100, 172.67.69.38, 104.26.9.100, ...
>> Connecting to www.winehq.org (www.winehq.org)|104.26.8.100|:443... connected.
>> HTTP request sent, awaiting response... 403 Forbidden
>> 2026-03-02 21:01:12 ERROR 403: Forbidden.
>>
>> I can only guess, but I suspect that
>> <https://www.winehq.org/> started to do this to protect
>> against AI-scraping bots, or similar. I can imagine more
>> websites doing the same.
>>
>> To avoid having to deal with this check NOTE everywhere
>> (e.g. locally, CI, and on CRAN submission), my current
>> strategy is to switch from \url{https://www.winehq.org/}
>> to \code{https://www.winehq.org/} in the docs. Does
>> anyone else have a better idea?
>>
>> /Henrik