Example for the forbes.com robots.txt: https://www.forbes.com/robots.txt

They have blocked all paths for GPTBot:

```
User-agent: GPTBot
Disallow: /
```
However, for the URL https://www.forbes.com/test, the following returns true:

```java
public boolean canCrawl(String url, String userAgent, String robotsBody)
        throws MalformedURLException {
    SimpleRobotRulesParser robotParser = new SimpleRobotRulesParser();
    robotParser.setExactUserAgentMatching(false);
    BaseRobotRules robotRules =
            robotParser.parseContent(
                    "https://www.forbes.com/robots.txt",
                    robotsBody.getBytes(StandardCharsets.UTF_8),
                    "text/plain",
                    Collections.singletonList(userAgent));
    return robotRules.isAllowed(url);
}
```
The function is called as:

```java
boolean canCrawl = canCrawl("https://www.forbes.com/test", "GPTBot", "<robots body>");
```

I verified this behaviour with https://github.com/samclarke/robots-parser and https://github.com/google/robotstxt; they both return false, which seems correct.
If the user agent had been written as gptbot in robots.txt, the parser would return false for both the GPTBot and gptbot user agents. There seems to be a case-sensitive user-agent comparison somewhere in the code base.
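For reference, here is a minimal, self-contained sketch (deliberately not using crawler-commons) of the case-insensitive user-agent matching that the other two parsers appear to implement. The class name `TinyRobots`, the method `isAllowed`, and the simplified group handling are all assumptions for illustration, not the library's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class TinyRobots {
    // Returns true if the given path is allowed for the user agent.
    // User-agent lines are compared case-insensitively, which is the
    // behaviour expected here (gptbot should match "User-agent: GPTBot").
    static boolean isAllowed(String robotsBody, String userAgent, String path) {
        boolean inMatchingGroup = false;
        List<String> disallows = new ArrayList<>();
        for (String raw : robotsBody.split("\n")) {
            String line = raw.trim();
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // The key point: case-insensitive comparison of agent names.
                inMatchingGroup = value.equals("*") || value.equalsIgnoreCase(userAgent);
            } else if (inMatchingGroup && field.equals("disallow") && !value.isEmpty()) {
                disallows.add(value);
            }
        }
        // Simplified prefix matching; real parsers also handle Allow,
        // wildcards, and longest-match precedence.
        for (String prefix : disallows) {
            if (path.startsWith(prefix)) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        String robots = "User-agent: GPTBot\nDisallow: /\n";
        System.out.println(TinyRobots.isAllowed(robots, "GPTBot", "/test")); // false
        System.out.println(TinyRobots.isAllowed(robots, "gptbot", "/test")); // false
    }
}
```

With this matching, /test is disallowed for both spellings of the agent name, which matches what robots-parser and google/robotstxt return.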