• r00ty@kbin.life
    link
    fedilink
    arrow-up
    17
    ·
    9 days ago

    If you’re running nginx I am using the following:

    if ($http_user_agent ~* "SemrushBot|Semrush|AhrefsBot|MJ12bot|YandexBot|YandexImages|MegaIndex.ru|BLEXbot|BLEXBot|ZoominfoBot|YaK|VelenPublicWebCrawler|SentiBot|Vagabondo|SEOkicks|SEOkicks-Robot|mtbot/1.1.0i|SeznamBot|DotBot|Cliqzbot|coccocbot|python|Scrap|SiteCheck-sitecrawl|MauiBot|Java|GumGum|Clickagy|AspiegelBot|Yandex|TkBot|CCBot|Qwantify|MBCrawler|serpstatbot|AwarioSmartBot|Semantici|ScholarBot|proximic|GrapeshotCrawler|IAScrawler|linkdexbot|contxbot|PlurkBot|PaperLiBot|BomboraBot|Leikibot|weborama-fetcher|NTENTbot|Screaming Frog SEO Spider|admantx-usaspb|Eyeotabot|VoluumDSP-content-bot|SirdataBot|adbeat_bot|TTD-Content|admantx|Nimbostratus-Bot|Mail.RU_Bot|Quantcastboti|Onespot-ScraperBot|Taboolabot|Baidu|Jobboerse|VoilaBot|Sogou|Jyxobot|Exabot|ZGrab|Proximi|Sosospider|Accoona|aiHitBot|Genieo|BecomeBot|ConveraCrawler|NerdyBot|OutclicksBot|findlinks|JikeSpider|Gigabot|CatchBot|Huaweisymantecspider|Offline Explorer|SiteSnagger|TeleportPro|WebCopier|WebReaper|WebStripper|WebZIP|Xaldon_WebSpider|BackDoorBot|AITCSRoboti|Arachnophilia|BackRub|BlowFishi|perl|CherryPicker|CyberSpyder|EmailCollector|Foobot|GetURL|httplib|HTTrack|LinkScan|Openbot|Snooper|SuperBot|URLSpiderPro|MAZBot|EchoboxBot|SerendeputyBot|LivelapBot|linkfluence.com|TweetmemeBot|LinkisBot|CrowdTanglebot|ClaudeBot|Bytespider|ImagesiftBot|Barkrowler|DataForSeoBo|Amazonbot|facebookexternalhit|meta-externalagent|FriendlyCrawler|GoogleOther|PetalBot|Applebot") { return 403; }

    That will block those that actually use recognisable user agents. I add any I find as I go on. It will catch a lot!

    I also have a huuuuuge IP based block list (generated by adding all ranges returned from looking up the following AS numbers):

    AS45102 (Alibaba cloud) AS136907 (Huawei SG) AS132203 (Tencent) AS32934 (Facebook)

    Since these guys run or have run bots that impersonate real browser agents.

    There are various tools online to return prefix/ip lists for an autonomous system number.

    I put both into a single file and include it into my web site config files.

    EDIT: Just to add, keeping on top of this is a full time job! EDIT 2: Removed Mojeek bot as it seems to be a normal web crawler.

    • ctag@lemmy.sdf.orgOP
      link
      fedilink
      English
      arrow-up
      5
      ·
      9 days ago

      Thank you for the detailed reply.

      keeping on top of this is a full time job!

      I guess that’s why I’m interested in a tooling based solution. My selfhosting is small-fry junk, but a lot of others like me are hosting entire fedi communities or larger websites.

      • r00ty@kbin.life
        link
        fedilink
        arrow-up
        5
        ·
        9 days ago

        Yeah, I probably should look to see if there’s any good plugins that do this on some community submission basis. Because yes, it’s a pain to keep up with whatever trick they’re doing next.

        And unlike web crawlers that generally check a url here and there, AI bots absolutely rip through your sites like something rabid.

        • Admiral Patrick@dubvee.org
          link
          fedilink
          English
          arrow-up
          3
          ·
          9 days ago

          AI bots absolutely rip through your sites like something rabid.

          SemrushBot being the most rabid from my experience. Just will not take “fuck off” as an answer.

          That looks pretty much like how I’m doing it, also as an include for each virtual host. The only difference is I don’t even bother with a 403. I just use Nginx’s 444 “response” to immediately close the connection.

          Are you doing the IP blocks also in Nginx or lower at the firewall level? Currently I’m doing it at firewall level since many of those will also attempt SSH brute forces (good luck since I only use keys, but still…)

          • r00ty@kbin.life
            link
            fedilink
            arrow-up
            4
            ·
            9 days ago

            So on my mbin instance, it’s on cloudflare. So I filter the AS numbers there. Don’t even reach my server.

            On the sites that aren’t behind cloudflare. Yep it’s on the nginx level. I did consider firewall level. Maybe just make a specific chain for it. But since I was blocking at the nginx level I just did it there for now. I mean it keeps them off the content, but yes it does tell them there’s a website there to leech if they change their tactics for example.

            You need to block the whole ASN too. Those that are using chrome/firefox UAs change IP every 5 minutes from a random other one in their huuuuuge pools.

    • Atemu@lemmy.ml
      link
      fedilink
      English
      arrow-up
      2
      ·
      9 days ago

      I’d suspect the bots would just try again with a masked user agent when they receive a 403.

      I think the best strategy would be to feed the bots shit that looks like real content.