• wjs018@piefed.social · edited · 15 days ago

    Really great piece. We’ve seen many popular Lemmy instances struggle under the recent scraping waves, and it’s hardly the first time it’s happened. I also have some firsthand experience with the second part of this article, about AI-generated bug reports/vulnerabilities for open source projects.

    I help maintain a Python library and got a bug report a couple of weeks back describing a type-checking issue, with a bit of additional information. It didn’t strictly follow the bug report template we use, but it was well organized enough, so I spent some time digging into it and found no way to reproduce the issue at all. Thankfully, the lead maintainer was able to spot the report for what it was and just closed it, saving me from further efforts to diagnose it (after an hour or two had already been burned).

    • Dave@lemmy.nz · 15 days ago

      AI scrapers are a massive issue for Lemmy instances. I’m gonna try some things from this article, because enough of them identify themselves with user agents that I hadn’t even thought about the ones lying about it.

      I guess a bonus (?) is that with 1000 Lemmy instances, the bots get the Lemmy content 1000 times so our input has 1000 times the weighting of reddit.

      • wjs018@piefed.social · 15 days ago

        The lead maintainer’s theory (he is an actual software developer, I just dabble) is that it might be a type of reinforcement learning:

        • Get your LLM to create what it thinks are valid bug reports/issues
        • Monitor the outcome of those issues (closed immediately, discussion, eventual pull request)
        • Use those outcomes to assign how “good” or “bad” that generated issue was
        • Use that scoring as a way to feed back into the model to influence it to create more “good” issues

        If this is what’s happening, then it’s essentially offloading your LLM’s reinforcement learning scoring to open source maintainers.
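If that theory holds, the loop described above could be sketched as follows. This is purely illustrative: the outcome labels, reward values, and function names are all invented for the sake of the sketch, not anything observed.

```python
# Hypothetical sketch of the feedback loop described above: map the outcome
# of each LLM-generated issue to a reward score that could be fed back into
# reinforcement learning. All names and reward values here are assumptions.

OUTCOME_REWARD = {
    "closed_immediately": -1.0,  # maintainers spotted it as invalid
    "discussion": 0.5,           # plausible enough that humans engaged with it
    "pull_request": 1.0,         # the report led to an actual fix
}

def score_generated_issue(outcome: str) -> float:
    """Return the reward signal for one generated issue's outcome."""
    return OUTCOME_REWARD.get(outcome, 0.0)

# A batch of generated issues then yields a list of rewards for fine-tuning:
outcomes = ["closed_immediately", "discussion", "pull_request"]
rewards = [score_generated_issue(o) for o in outcomes]
print(rewards)  # [-1.0, 0.5, 1.0]
```

The expensive part of this loop, judging each issue, is exactly the work the maintainers end up doing for free.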

  • Fijxu@programming.dev · 14 days ago

    AI scraping is so cancerous. I host a public RedLib instance (redlib.nadeko.net), and due to BingBot and the Amazon bots, my instance was always rate limited because the number of requests they make is insane. What makes me angrier is that these fuckers use free, privacy-respecting services to access Reddit and scrape it. THEY CAN’T BE SO GREEDY. Hopefully, blocking their user agents works fine ;)
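For bots that do identify themselves, user-agent blocking can be as simple as a substring check at the proxy or application layer. A minimal sketch in Python, assuming you can inspect the `User-Agent` header of each request; the substring list is an example, not a complete blocklist:

```python
# Minimal sketch of user-agent blocking for self-identifying crawlers.
# The substrings below are examples; real deployments maintain longer lists
# and combine this with rate limiting, since many scrapers lie about their UA.

BLOCKED_UA_SUBSTRINGS = ("bingbot", "amazonbot", "gptbot", "ccbot")

def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent matches a known crawler substring."""
    ua = user_agent.lower()
    return any(bot in ua for bot in BLOCKED_UA_SUBSTRINGS)

print(is_blocked("Mozilla/5.0 (compatible; Bingbot/2.0)"))           # True
print(is_blocked("Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0"))   # False
```

As the comments above note, this only catches honest bots; scrapers spoofing a browser UA need IP- or behavior-based blocking instead.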

  • db0@lemmy.dbzer0.com · 15 days ago

    Yep, it hit many Lemmy servers as well, including mine. I had to block multiple Alibaba subnets to get things back to normal. But I’m expecting the next spam wave.

  • MonkderVierte@lemmy.ml · edited · 14 days ago

    Assuming we could build a new internet from the ground up, what would be the solution? IPFS for load-balancing?

  • grue@lemmy.world · 14 days ago

    ELI5 why the AI companies can’t just clone the git repos and do all the slicing and dicing (running git blame etc.) locally instead of running expensive queries on the projects’ servers?

    • green@feddit.nl · 14 days ago

      Too many people overestimate the actual capabilities of these companies.

      I really do not like saying this because it lacks a lot of nuance, but 90% of programmers are not skilled in their profession. This is not to say they are stupid (though they likely are, see cat-v/harmful) but they do not care about efficiency nor gracefulness - as long as the job gets done.

      You assume they are using source control (which is unironically unlikely), you assume they know they can run a server locally (which I pray they do), and you assume their deadlines allow them to think about actual solutions to problems (which they probably don’t).

      Yes, they get paid a lot of money. But that does not say much about skill in an age of apathy and lawlessness.

  • 𝕸𝖔𝖘𝖘@infosec.pub · 12 days ago

    Fail2ban should add all those scraper IPs, and we need to just flat-out block them. Or send them to those mazes. Or redirect them to themselves lol
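A hypothetical fail2ban jail along those lines might look like the fragment below. The jail name, filter, log path, and thresholds are all assumptions for illustration, not a drop-in config; the matching filter regex would still need to be written for your access-log format.

```ini
# Hypothetical jail.local fragment: ban IPs that flood the web server.
# Names, paths, and thresholds are illustrative assumptions only.
[scraper-flood]
enabled  = true
port     = http,https
filter   = scraper-flood
logpath  = /var/log/nginx/access.log
# allow up to 300 matched requests per 60 s window before banning
maxretry = 300
findtime = 60
# ban offenders for a day
bantime  = 86400
```

The catch, as discussed downthread, is that residential-proxy scrapers rotate through so many IPs that per-IP bans need constant upkeep.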

  • Buelldozer@lemmy.today · 15 days ago

    I too read Drew DeVault’s article the other day and I’m still wondering how the hell these companies have access to “tens of thousands” of unique IP addresses. Seriously, how the hell do they have access to so many IP addresses that SysAdmins are resorting to banning entire countries to make it stop?

    • festus@lemmy.ca · 14 days ago

      There are residential IP providers that offer services to scrapers, etc., which involves them having thousands of IPs available from the same IP ranges as real users. They route traffic through these IPs via malware, hacked routers, “free” VPN clients, etc. If you block the IP range for one of these addresses, you’ll also block real users.

      • Buelldozer@lemmy.today · 14 days ago

        > There are residential IP providers that offer services to scrapers, etc., which involves them having thousands of IPs available from the same IP ranges as real users.

        Now that makes sense. I hadn’t considered rogue ISPs.

        • festus@lemmy.ca · 14 days ago

          It’s not even necessarily the ISPs that are doing it. In many cases they don’t like this, because their users start getting blocked on websites; it’s bad actors piggybacking on legitimate users’ connections without those users’ knowledge.

      • Buelldozer@lemmy.today · 14 days ago

        Sure, network blocking like this has been a thing for decades, but it still requires ongoing manual intervention, which is what these SysAdmins are complaining about.

  • fjordo@feddit.uk · 15 days ago

    I wish these companies would realise that acting like this is a very fast way to get scraping outlawed altogether, which is a shame because it can be genuinely useful (archival, automation, etc).

    • jol@discuss.tchncs.de · 15 days ago

      How can you outlaw something a company on another continent is doing? And especially when they are getting better at disguising themselves as normal traffic? What will happen is that politicians will see this as another reason to push for everyone having their ID associated with their Internet traffic.

  • klu9@lemmy.ca · 15 days ago

    The Linux Mint forums have been knocked offline multiple times over the last few months, to the point where the admins had to block all Chinese and Brazilian IPs for a while.

    • deeferg@lemmy.world · 15 days ago

      This is the first I’ve heard about Brazil in this type of cyber attack. Is it re-routed traffic going there or are there a large number of Brazilian bot farms now?

      • klu9@lemmy.ca · 15 days ago

        I don’t know why/how, just know that the admins saw the servers were being overwhelmed by traffic from Brazilian IPs and blocked it for a while.