Message-ID: <CAAS8PALefpenrk4BJghCFdJqLk8JVo4QWxZhYkBydy+XtAE8-A@mail.gmail.com>
Date: 2026-03-03T09:35:15Z
From: Greg Hunt
Subject: [R-pkg-devel] Strategy for dealing with websites serving HTTP 403 only when validated by 'R CMD check'
In-Reply-To: <27046.41726.702188.663853@stat.math.ethz.ch>
Martin,
The W3C did: the HTTP HEAD verb does something like what you want, but
the point of a service like Cloudflare is to keep problematic workload
(anything that is not a person clicking a link) off the target server
entirely. Details such as individual resources are only resolved once you
reach the server, after getting past Cloudflare.
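As a rough sketch of the two ideas floated in this thread (a bare HEAD
probe, plus the robots.txt fallback Simon mentioned), something like the
following would do it. This is illustrative Python, not part of any R or
'R CMD check' tooling, and the function names are made up:

```python
# Sketch: check a URL with HEAD, and derive a /robots.txt fallback URL
# for sites that answer every request with 403 (as Cloudflare-fronted
# sites often do). Hypothetical helpers, not real 'R CMD check' code.
import http.client
from urllib.parse import urlsplit, urlunsplit

def robots_fallback_url(url: str) -> str:
    """Drop the path and point at the site's top-level robots.txt."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

def head_status(url: str, timeout: float = 10.0) -> int:
    """Issue an HTTP HEAD request and return only the status code."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection
                if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        return conn.getresponse().status
    finally:
        conn.close()

print(robots_fallback_url("https://www.winehq.org/wiki/Download"))
# -> https://www.winehq.org/robots.txt
```

Of course, as noted above, a challenge-serving CDN will return 403 to a
bare HEAD just as readily as to a GET, so this only helps where the
fallback resource is deliberately left open.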
Greg
On Tue, 3 Mar 2026 at 19:59, Martin Maechler <maechler at stat.math.ethz.ch>
wrote:
> >>>>> Simon Urbanek
> >>>>> on Tue, 3 Mar 2026 20:01:13 +1300 writes:
>
> > Henrik, yes, that's quite annoying
>
> indeed, and similar for all those of us who do want to have
> resourceful help pages (and other R package documentation).
>
>
> > - they respond with a
> > 403 which *does* have html content which the browsers
> > display and that page contains JavaScript code which calls
> > a CGI script on their server which bounces to CF's server
> > which after 7(!) more requests finally sets the challenge
> > cookie and re-directs back to winehq.org. However, what is
> > truly annoying, the response is the same whether the
> > resource exists or not, so there is no way to verify the
> > URL. I'm somewhat shocked that they rely on the browsers
> > showing the error page and hijack it to quickly re-direct
> > from it so the user isn't even aware that the server
> > responded with an error.
>
> > More practically, I don't see that we can do anything
> > about it. Those URLs are truly responding with an
> > error, so short of emulating a full browser with
> > JavaScript (they also do fingerprinting etc. so it's
> > distinctly non-trivial - by design) there is no way to
> > verify them. Given the amount of shenanigans that page
> > does with the user's browser I'd say your approach is
> > probably good since the user won't accidentally click on
> > the link then :). But more seriously, this is a problem
> > since the idea behind checking URLs is a good one - they
> > do disappear or change quite often, so not checking them
> > is not an answer, either.
>
> > One special-case approach for cases like you mentioned
> > (i.e. where you want to check a top-domain as opposed to a
> > specific resource) is to use a resource that is guaranteed
> > (by design) to be accessible by direct requests, so for
> > example robots.txt. So for top-level URLs, we could fall
> > back to checking https://winehq.org/robots.txt which does
> > work (since most sites do want those to be directly
> > accessible). However, it doesn't help with URLs containing
> > specific paths as those will be still blocked.
>
> > Cheers, Simon
>
> Thank you, Simon. Nice idea with `robots.txt`, but as you
> mention this will only apply to a relatively tiny fraction of our
> URLs in R (package) documentation. ... or the 'R CMD check' R
> functions could always try https://<toplevel>/robots.txt if
> the https://<toplevel>/<morestuff> URL gives a 403 ... ? ...
>
> OTOH, isn't this rather a "world-wide" challenge/problem:
> "
> I want to check if an https URL is "valid" (i.e., reachable),
> but I don't need to get any other data from its http server.
> "
> for which e.g. the W3C (World Wide Web Consortium) or others
> should have provided recommendations or even protocols and tools?
>
> Martin
>
> >> On 3/03/2026, at 18:08, Henrik Bengtsson
> >> <henrik.bengtsson at gmail.com> wrote:
> >>
> >> I've started to get:
> >>
> >> * checking CRAN incoming feasibility ... NOTE Found the
> >> following (possibly) invalid URLs: URL:
> >> https://www.winehq.org/ From:
> >> inst/doc/parallelly-22-wine-workers.html Status: 403
> >> Message: Forbidden
> >>
> >> when R CMD check:ing 'parallelly'. The page
> >> <https://www.winehq.org/> works fine in the web browser,
> >> but it is blocked (by Cloudflare) elsewhere, e.g.
> >>
> >> $ curl --silent --head https://www.winehq.org/ | head -1
> >> HTTP/2 403
> >>
> >> and
> >>
> >> $ wget https://www.winehq.org/
> >> --2026-03-02 21:01:12-- https://www.winehq.org/
> >> Resolving www.winehq.org (www.winehq.org)... 104.26.8.100, 172.67.69.38, 104.26.9.100, ...
> >> Connecting to www.winehq.org (www.winehq.org)|104.26.8.100|:443... connected.
> >> HTTP request sent, awaiting response... 403 Forbidden
> >> 2026-03-02 21:01:12 ERROR 403: Forbidden.
> >>
> >> I can only guess, but I suspect that
> >> <https://www.winehq.org/> started to do this to protect
> >> against AI-scraping bots, or similar. I can imagine more
> >> websites doing the same.
> >>
> >> To avoid having to deal with this check NOTE everywhere
> >> (e.g. locally, CI, and on CRAN submission), my current
> >> strategy is to switch from \url{https://www.winehq.org/}
> >> to \code{https://www.winehq.org/} in the docs. Does
> >> anyone else have a better idea?
> >>
> >> /Henrik
>
> ______________________________________________
> R-package-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel
>