Google Is the Only Search Engine That Works on Reddit Now Thanks to AI Deal

Hot Potato@lemmy.world · 4 months ago

Google Is the Only Search Engine That Works on Reddit Now Thanks to AI Deal

TimeSquirrel@kbin.melroy.org · 4 months ago

A lot of Fediverse admins are just normal people like you and me with a budget, and disallowing bots and spiders helps save bandwidth, and the budget.

Admiral Patrick@dubvee.org · 4 months ago

Yep. I block all bots to my instance.

Most are parasitic (GPTBot, ImageSift bot, Yandex, etc) but I’ve even blocked Google’s crawler (and its ActivityPub cralwer bot) since it now feeds their LLM models. Most of my content can be found anyway because instances it federated to don’t block those, but the bandwidth and processing savings are what I’m in it for.

Mac@federation.red · 4 months ago

Teach me oh wise one

Admiral Patrick@dubvee.org · 4 months ago

Kinda long, so I’m putting it in spoilers. This applies to Nginx, but you can probably adapt it to other reverse proxies.

Create a file to hold the mappings and store it somewhere you can include it from your other configs. I named mine map-bot-user-agents.conf

Here, I’m doing a regex comparison against the user agent ($http_user_agent) and mapping it to either a 0 (default/false) or 1 (true) and storing that value in the variable $ua_disallowed. The run-on string at the bottom was inherited from another admin I work with, and I never bothered to split it out.

'map-bot-user-agents.conf'

# Map bot user agents
map $http_user_agent $ua_disallowed {
    default 		0;
    "~CCBot"		1;
    "~ClaudeBot"	1;
    "~VelenPublicWebCrawler"	1;
    "~WellKnownBot"	1;
    "~Synapse (bot; +https://github.com/matrix-org/synapse)" 1;
    "~python-requests"	1;
    "~bitdiscovery"	1;
    "~bingbot"		1;
    "~SemrushBot" 	1;
    "~Bytespider" 	1;
    "~AhrefsBot" 	1;
    "~AwarioBot"	1;
    "~GPTBot" 		1;
    "~DotBot"	 	1;
    "~ImagesiftBot"	1;
    "~Amazonbot"	1;
    "~GuzzleHttp" 	1;
    "~DataForSeoBot" 	1;
    "~StractBot"	1;
    "~Googlebot"	1;
    "~Barkrowler"	1;
    "~SeznamBot"	1;
    "~FriendlyCrawler"	1;
    "~facebookexternalhit" 1;
    "~*(?i)(80legs|360Spider|Aboundex|Abonti|Acunetix|^AIBOT|^Alexibot|Alligator|AllSubmitter|Apexoo|^asterias|^attach|^BackDoorBot|^BackStreet|^BackWeb|Badass|Bandit|Baid|Baiduspider|^BatchFTP|^Bigfoot|^Black.Hole|^BlackWidow|BlackWidow|^BlowFish|Blow|^BotALot|Buddy|^BuiltBotTough|
^Bullseye|^BunnySlippers|BBBike|^Cegbfeieh|^CheeseBot|^CherryPicker|^ChinaClaw|^Cogentbot|CPython|Collector|cognitiveseo|Copier|^CopyRightCheck|^cosmos|^Crescent|CSHttp|^Custo|^Demon|^Devil|^DISCo|^DIIbot|discobot|^DittoSpyder|Download.Demon|Download.Devil|Download.Wonder|^dragonfl
y|^Drip|^eCatch|^EasyDL|^ebingbong|^EirGrabber|^EmailCollector|^EmailSiphon|^EmailWolf|^EroCrawler|^Exabot|^Express|Extractor|^EyeNetIE|FHscan|^FHscan|^flunky|^Foobot|^FrontPage|GalaxyBot|^gotit|Grabber|^GrabNet|^Grafula|^Harvest|^HEADMasterSEO|^hloader|^HMView|^HTTrack|httrack|HTT
rack|htmlparser|^humanlinks|^IlseBot|Image.Stripper|Image.Sucker|imagefetch|^InfoNaviRobot|^InfoTekies|^Intelliseek|^InterGET|^Iria|^Jakarta|^JennyBot|^JetCar|JikeSpider|^JOC|^JustView|^Jyxobot|^Kenjin.Spider|^Keyword.Density|libwww|^larbin|LeechFTP|LeechGet|^LexiBot|^lftp|^libWeb|
^likse|^LinkextractorPro|^LinkScan|^LNSpiderguy|^LinkWalker|msnbot|MSIECrawler|MJ12bot|MegaIndex|^Magnet|^Mag-Net|^MarkWatch|Mass.Downloader|masscan|^Mata.Hari|^Memo|^MIIxpc|^NAMEPROTECT|^Navroad|^NearSite|^NetAnts|^Netcraft|^NetMechanic|^NetSpider|^NetZIP|^NextGenSearchBot|^NICErs
PRO|^niki-bot|^NimbleCrawler|^Nimbostratus-Bot|^Ninja|^Nmap|nmap|^NPbot|Offline.Explorer|Offline.Navigator|OpenLinkProfiler|^Octopus|^Openfind|^OutfoxBot|Pixray|probethenet|proximic|^PageGrabber|^pavuk|^pcBrowser|^Pockey|^ProPowerBot|^ProWebWalker|^psbot|^Pump|python-requests\/|^Qu
eryN.Metasearch|^RealDownload|Reaper|^Reaper|^Ripper|Ripper|Recorder|^ReGet|^RepoMonkey|^RMA|scanbot|SEOkicks-Robot|seoscanners|^Stripper|^Sucker|Siphon|Siteimprove|^SiteSnagger|SiteSucker|^SlySearch|^SmartDownload|^Snake|^Snapbot|^Snoopy|Sosospider|^sogou|spbot|^SpaceBison|^spanne
r|^SpankBot|Spinn4r|^Sqworm|Sqworm|Stripper|Sucker|^SuperBot|SuperHTTP|^SuperHTTP|^Surfbot|^suzuran|^Szukacz|^tAkeOut|^Teleport|^Telesoft|^TurnitinBot|^The.Intraformant|^TheNomad|^TightTwatBot|^Titan|^True_Robot|^turingos|^TurnitinBot|^URLy.Warning|^Vacuum|^VCI|VidibleScraper|^Void
EYE|^WebAuto|^WebBandit|^WebCopier|^WebEnhancer|^WebFetch|^Web.Image.Collector|^WebLeacher|^WebmasterWorldForumBot|WebPix|^WebReaper|^WebSauger|Website.eXtractor|^Webster|WebShag|^WebStripper|WebSucker|^WebWhacker|^WebZIP|Whack|Whacker|^Widow|Widow|WinHTTrack|^WISENutbot|WWWOFFLE|^
WWWOFFLE|^WWW-Collector-E|^Xaldon|^Xenu|^Zade|^Zeus|ZmEu|^Zyborg|SemrushBot|^WebFuck|^MJ12bot|^majestic12|^WallpapersHD)" 1;

}

Once you have a mapping file setup, you’ll need to do something with it. This applies at the virtual host level and should go inside the server block of your configs (except the include for the mapping config.).

This assumes your configs are in conf.d/ and are included from nginx.conf.

The map-bot-user-agents.conf is included above the server block (since it’s an http level config item) and inside server, we look at the $ua_disallowedvalue where 0=false and 1=true (the values are set in the map).

You could also do the mapping in the base nginx.conf since it doesn’t do anything on its own.

If the $ua_disallowed value is 1 (true), we immediately return an HTTP 444. The 444 status code is an Nginx thing, but it basically closes the connection immediately and wastes no further time/energy processing the request. You could, optionally, redirect somewhere, return a different status code, or return some pre-rendered LLM-generated gibberish if your bot list is configured just for AI crawlers (because I’m a jerk like that lol).

Example site1.conf


include conf.d/includes/map-bot-user-agents.conf;

server {
  server_name  example.com;
  ...
  # Deny disallowed user agents
  if ($ua_disallowed) { 
    return 444;
  }
 
  location / {
    ...
  }

}

Mac@federation.red · edit-2 4 months ago

So I would need to add this to every subdomain conf file I have? Preciate you!

Admiral Patrick@dubvee.org · edit-2 4 months ago

I just include the map-bot-user-agents.conf in my base nginx.conf so it’s available to all of my virtual hosts.

When I want to enforce the bot blocking on one or more virtual host (some I want to leave open to bots, others I don’t), I just include a deny-disallowed.conf in the server block of those.

deny-disallowed.conf

  # Deny disallowed user agents
  if ($ua_disallowed) { 
    return 444;
  }

site.conf

server {
  server_name example.com;
   ...
  include conf.d/includes/deny-disallowed.conf;

  location / {
    ...
  }
}

Mac@federation.red · 4 months ago

Okay yeah I was thinking my base domain conf but that’s even better.

MCasq_qsaCJ_234@lemmy.zip · 4 months ago

I have two questions. How much do those bots consume your bandwidth? And by blocking search robots, do you stop being present in the search results or are you still present, but they do not show the content in question?

I ask these questions because I don’t know much about the topic when managing a website or an instance of the fediverse.

Cyborganism@lemmy.ca · 4 months ago

Could it be possible to have one major global instance that aggregates everything so it can be indexed by search engines? Would that work? Or do I not fully understand how federation works?

wholookshere@lemmy.blahaj.zone · 4 months ago

That would defeat the purpose of federation.

It becomes a central choke point of moderation. Who gets to decide what instances are part of global and which ones aren’t. Because a free for all isn’t going to end well. And then you’re back at Reddit.

WanderingVentra@lemm.ee · edit-2 4 months ago

I wonder if you could have an instance federated to every other instance just for archived purposes, to save the data on every other instance’s post and comment. Because copies of posts and comments are saved to federated instances, too, right? Or do I understand the tech wrong?

So it could have an admin team but no users, to prevent people worried about spammers and bots joining that instance to get around defederation rules. Maybe it just has a bot that crawls Lemmy, looking for instances to federate to. Could that work?

4am@lemm.ee · 4 months ago

You’re describing Meta’s plan but yes that could work.

WanderingVentra@lemm.ee · 4 months ago

Godamnit Meta… Lol

MCasq_qsaCJ_234@lemmy.zip · 4 months ago

I prefer the Internet Archive plan than the Meta Plan.

rbits@lemm.ee · 4 months ago

Right, but having a centralised search index thingy is better than none at all. Maybe there could be something where it’s a joint effort from admins from many of the biggest servers, idk if that would work.

barsoap@lemm.ee · edit-2 4 months ago

Lemmy search already is quite excellent… at least here on lemm.ee, we don’t have many communities but tons of users subscribed to probably about everything on the lemmyverse so the servers have it all.

It might be interesting to team up with something like YaCy: Instances could operate as YaCy peers for everything they have. That is, integrate a p2p search protocol into ActivityPub itself so that also smaller instances can find everything. Ordinary YaCy instances, doing mostly web crawling, can in turn use posts here as interesting starting points.

Omniraptor@lemm.ee · 4 months ago

I just wish lemmy search itself wasn’t broken…

dumbass@leminal.space · 4 months ago

Gotta keep some things that feel like reddit.

HobbitFoot @thelemmy.club · 4 months ago

All a spider needs is an instance to download everything.

Amanda@aggregatet.org · 4 months ago

I was worrying about precisely this. I’d be ok with blocking search engines if there was a better way of searching but AFAICT there isn’t federated search of any kind?

chrischryse@lemmy.world · 4 months ago

Really? I thought they were free and didn’t affect bandwidth.

thejml@lemm.ee · 4 months ago

Any data transit costs money. Both in the data transit itself and in the increased server resources to respond to the web queries in the first place.

chrischryse@lemmy.world · 4 months ago

Ah that makes sense not really familiar iwth this stuff so didn’t think it’s that intensive lol

darkkite@lemmy.ml · 4 months ago

bots take resources to serve just like any regular user

Damage@feddit.it · 4 months ago

They usually only index text though