Message-ID: <CAAS8PALefpenrk4BJghCFdJqLk8JVo4QWxZhYkBydy+XtAE8-A@mail.gmail.com>
Date: 2026-03-03T09:35:15Z
From: Greg Hunt
Subject: [R-pkg-devel] Strategy for dealing with websites serving HTTP 403 only when validated by 'R CMD check'
In-Reply-To: <27046.41726.702188.663853@stat.math.ethz.ch>
Martin,
The W3C did: the HTTP HEAD verb does something like what you want, but
the point of a service like Cloudflare is to keep problematic workload
(anything that is not a person clicking a link) off the target server
entirely. Details such as individual resources are only resolved once you
reach the server, after getting past Cloudflare.
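As a rough sketch of the two ideas floated in this thread (a bare HEAD
probe, plus the robots.txt fallback Simon mentioned), something like the
following would do it. This is illustrative Python, not part of any R or
'R CMD check' tooling, and the function names are made up:

```python
# Sketch: check a URL with HEAD, and derive a /robots.txt fallback URL
# for sites that answer every request with 403 (as Cloudflare-fronted
# sites often do). Hypothetical helpers, not real 'R CMD check' code.
import http.client
from urllib.parse import urlsplit, urlunsplit

def robots_fallback_url(url: str) -> str:
    """Drop the path and point at the site's top-level robots.txt."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

def head_status(url: str, timeout: float = 10.0) -> int:
    """Issue an HTTP HEAD request and return only the status code."""
    parts = urlsplit(url)
    conn_cls = (http.client.HTTPSConnection
                if parts.scheme == "https"
                else http.client.HTTPConnection)
    conn = conn_cls(parts.netloc, timeout=timeout)
    try:
        conn.request("HEAD", parts.path or "/")
        return conn.getresponse().status
    finally:
        conn.close()

print(robots_fallback_url("https://www.winehq.org/wiki/Download"))
# -> https://www.winehq.org/robots.txt
```

Of course, as noted above, a challenge-serving CDN will return 403 to a
bare HEAD just as readily as to a GET, so this only helps where the
fallback resource is deliberately left open.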
Greg
On Tue, 3 Mar 2026 at 19:59, Martin Maechler <maechler at stat.math.ethz.ch>
wrote:
> >>>>> Simon Urbanek
> >>>>> on Tue, 3 Mar 2026 20:01:13 +1300 writes:
>
> > Henrik, yes, that's quite annoying
>
> indeed, and similar for all those of us who do want to have
> resourceful help pages (and other R package documentation).
>
>
> > - they respond with a
> > 403 which *does* have html content which the browsers
> > display and that page contains JavaScript code which calls
> > a CGI script on their server which bounces to CF's server
> > which after 7(!) more requests finally sets the challenge
> > cookie and re-directs back to winehq.org. However, what is
> > truly annoying, the response is the same whether the
> > resource exists or not, so there is no way to verify the
> > URL. I'm somewhat shocked that they rely on the browsers
> > showing the error page and hijack it to quickly re-direct
> > from it so the user isn't even aware that the server
> > responded with an error.
>
> > More practically, I don't see that we can do anything
> > about it. Those URLs are truly responding with an
> > error, so short of emulating a full browser with
> > JavaScript (they also do fingerprinting etc. so it's
> > distinctly non-trivial - by design) there is no way to
> > verify them. Given the amount of shenanigans that page
> > does with the user's browser I'd say your approach is
> > probably good since the user won't accidentally click on
> > the link then :). But more seriously, this is a problem
> > since the idea behind checking URLs is a good one - they
> > do disappear or change quite often, so not checking them
> > is not an answer, either.
>
> > One special-case approach for cases like you mentioned
> > (i.e. where you want to check a top-domain as opposed to a
> > specific resource) is to use a resource that is guaranteed
> > (by design) to be accessible by direct requests, so for
> > example robots.txt. So for top-level URLs, we could fall
> > back to checking https://winehq.org/robots.txt which does
> > work (since most sites do want those to be directly
> > accessible). However, it doesn't help with URLs containing
> > specific paths as those will be still blocked.
>
> > Cheers, Simon
>
> Thank you, Simon. Nice idea with `robots.txt`, but as you
> mention this will only apply to a relatively tiny fraction of our
> URLs in R (package) documentation. ... or the 'R CMD check' R
> functions could always try https://<toplevel>/robots.txt if
> the https://<toplevel>/<morestuff> URL gives a 403 ... ? ...
>
> OTOH, isn't this rather a "world-wide" challenge/problem:
> "
> I want to check if an https URL is "valid" (i.e., reachable),
> but I don't need to get any other data from its http server.
> "
> for which e.g. the W3C (World Wide Web Consortium) or others
> should have provided recommendations or even protocols and tools?
>
> Martin
>
> >> On 3/03/2026, at 18:08, Henrik Bengtsson
> >> <henrik.bengtsson at gmail.com> wrote:
> >>
> >> I've started to get:
> >>
> >> * checking CRAN incoming feasibility ... NOTE Found the
> >> following (possibly) invalid URLs: URL:
> >> https://www.winehq.org/ From:
> >> inst/doc/parallelly-22-wine-workers.html Status: 403
> >> Message: Forbidden
> >>
> >> when R CMD check:ing 'parallelly'. The page
> >> <https://www.winehq.org/> works fine in the web browser,
> >> but it is blocked (by Cloudflare) elsewhere, e.g.
> >>
> >> $ curl --silent --head https://www.winehq.org/ | head -1
> >> HTTP/2 403
> >>
> >> and
> >>
> >> $ wget https://www.winehq.org/
> >> --2026-03-02 21:01:12-- https://www.winehq.org/
> >> Resolving www.winehq.org (www.winehq.org)... 104.26.8.100, 172.67.69.38, 104.26.9.100, ...
> >> Connecting to www.winehq.org (www.winehq.org)|104.26.8.100|:443... connected.
> >> HTTP request sent, awaiting response... 403 Forbidden
> >> 2026-03-02 21:01:12 ERROR 403: Forbidden.
> >>
> >> I can only guess, but I suspect that
> >> <https://www.winehq.org/> started to do this to protect
> >> against AI-scraping bots, or similar. I can imagine more
> >> websites doing the same.
> >>
> >> To avoid having to deal with this check NOTE everywhere
> >> (e.g. locally, CI, and on CRAN submission), my current
> >> strategy is to switch from \url{https://www.winehq.org/}
> >> to \code{https://www.winehq.org/} in the docs. Does
> >> anyone else have a better idea?
> >>
> >> /Henrik
>
> ______________________________________________
> R-package-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel
>