        _______               __                   _______
       |   |   |.---.-..----.|  |--..-----..----. |    |  |.-----..--.--.--..-----.
       |       ||  _  ||  __||    < |  -__||   _| |       ||  -__||  |  |  ||__ --|
       |___|___||___._||____||__|__||_____||__|   |__|____||_____||________||_____|
                                                              on Gopher (unofficial)
  HTML Visit Hacker News on the Web
       
       
       COMMENT PAGE FOR:
  HTML   AI scrapers request commented scripts
       
       
        throw_me_uwu wrote 17 hours 54 min ago:
        > most likely trying to non-consensually collect content for training
        LLMs
        
        No, it's just background internet scanning noise
       
          lucasluitjes wrote 17 hours 33 min ago:
          This.
          
          If you were writing a script to mass-scan the web for
          vulnerabilities, you would want to collect as many http endpoints as
          possible. JS files, regardless of whether they're commented out or
          not, are a great way to find endpoints in modern web applications.
          
          If you were writing a scraper to collect source code to train LLMs
          on, I doubt you would care as much about a commented-out JS file. I'm
          not sure you'd even want to train on random low-quality JS served by
           websites. Is anyone familiar with LLM training data collection able
           to comment on this?
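           
           As a rough illustration (Python, with a deliberately loose regex
           and a made-up sample snippet, none of it taken from any real
           scanner), harvesting endpoints from whatever JS a scanner can
           fetch looks something like this:
           
             import re
             
             # Deliberately loose pattern; real scanners use far larger rule sets.
             ENDPOINT_RE = re.compile(
                 r"""["'](/(?:api|v\d+|admin|graphql)[^"'\s]*)["']""")
             
             def extract_endpoints(js_source):
                 """Return candidate HTTP endpoints referenced in JS source."""
                 return set(ENDPOINT_RE.findall(js_source))
             
             sample = 'fetch("/api/v1/users"); // old: get("/admin/export")'
             print(extract_endpoints(sample))  # both paths, in arbitrary set order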
       
        lrpe wrote 20 hours 31 min ago:
        If you want humans to read your website, I would suggest making your
        website readable to humans. Green on blue is both hideous and painful.
       
        sokoloff wrote 1 day ago:
        Well, if they’re going to request commented out scripts, serve them
        up some very large scripts…
       
         renegat0x0 wrote 1 day ago:
         Most web scrapers, even the illegal ones, exist for... business. So
         they scrape Amazon, or shops. So yeah, most unwanted traffic is from
         big tech, or from bad actors sniffing for vulnerabilities.
         
         I know a thing or two about web scraping.
         
         Some sites return a 404 status code as protection, so that you skip
         them; my crawler responds, like a hammer, by trying several faster
         crawling methods (curl_cffi).
         
         Zip bombs are also not a problem for me. Reading the Content-Length
         header is enough to skip the page/file, and I apply a byte limit to
         check that the response is not too big for me. For other cases a
         read timeout is enough.
         
         Oh, and did you know that the requests timeout is not really a
         timeout for reading the page? A server can spoonfeed you bytes, one
         after another, and the timeout will never fire.
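         
         A minimal sketch of both defenses, assuming Python's requests
         library; the size cap, deadline and chunk size are arbitrary
         example values:
         
           import time
           import requests  # assumed dependency; curl_cffi is similar
           
           MAX_BYTES = 5 * 1024 * 1024   # refuse bodies larger than 5 MiB
           DEADLINE = 30.0               # wall-clock cap for a whole download
           
           def fetch_limited(url):
               # timeout=(connect, read) only bounds each socket read, so a
               # server drip-feeding one byte at a time never trips it.
               with requests.get(url, stream=True, timeout=(5, 10)) as resp:
                   declared = resp.headers.get("Content-Length")
                   if declared and int(declared) > MAX_BYTES:
                       return None  # skip without downloading the body
                   body, start = bytearray(), time.monotonic()
                   for chunk in resp.iter_content(chunk_size=65536):
                       body.extend(chunk)
                       if len(body) > MAX_BYTES:
                           return None  # header lied, or was missing
                       if time.monotonic() - start > DEADLINE:
                           return None  # our own end-to-end deadline
                   return bytes(body)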
        
         That is why I created my own crawling system [1] to mitigate these
         problems and to have one consistent means of running Selenium. It is
         based on the library at [2].
        
  HTML  [1]: https://github.com/rumca-js/crawler-buddy
  HTML  [2]: https://github.com/rumca-js/webtoolkit
       
          1vuio0pswjnm7 wrote 15 hours 3 min ago:
           Is there a difference between "scraping" and "crawling"?
       
          Mars008 wrote 1 day ago:
           Looks like it's time for in-browser scrapers. They will be
           indistinguishable from the server's side. With an AI driver they
           can even pass human tests.
       
            bartread wrote 19 hours 33 min ago:
            Not a new idea. For years now, on the occasions I’ve needed to
            scrape, I’ve used a set of ViolentMonkey scripts. I’ve even
            considered creating an extension, but have never really needed it
            enough to do the extra work.
            
            But this is why lots of sites implement captchas and other
            mechanisms to detect, frustrate, or trap automated activity -
            because plenty of bots run in browsers too.
       
            eur0pa wrote 20 hours 14 min ago:
            you mean OpenAI Atlas?
       
            overfeed wrote 22 hours 23 min ago:
            > Looks like it's time for in-browser scrappers.
            
            If scrapers were as well-behaved as humans, website operators
            wouldn't bother to block them[1]. It's the abuse that motivates the
             animus and action. As the fine article spelled out, scrapers are
             greedy in many ways, one of which is trying to slurp down as many
             URLs as possible without wasting bytes. Not enough people know
             about Common Crawl, or know how to write multithreaded scrapers
             with high utilization across domains without suffocating any
             single one. If your scraper is a URL FIFO or stack in a loop,
             you're just DoSing one domain at a time.
            
            1. The most successful scrapers avoid standing out in any way
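             
             For that last point, a toy sketch of per-host round-robin
             scheduling, assuming Python; the delay value and the fetch
             callable are placeholders:
             
               import time
               from collections import defaultdict, deque
               from urllib.parse import urlparse
               
               DELAY = 5.0  # seconds between hits to the same host
               
               def crawl(seed_urls, fetch):
                   # Rotate across hosts so no single domain absorbs the
                   # whole crawl; `fetch` performs the actual request.
                   queues = defaultdict(deque)   # one FIFO per host
                   next_ok = defaultdict(float)  # next allowed hit per host
                   for url in seed_urls:
                       queues[urlparse(url).netloc].append(url)
                   while any(queues.values()):
                       for host, q in list(queues.items()):
                           if not q or time.monotonic() < next_ok[host]:
                               continue  # empty, or still cooling down
                           fetch(q.popleft())
                           next_ok[host] = time.monotonic() + DELAY
                       time.sleep(0.1)  # don't busy-wait between passes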
       
              Mars008 wrote 21 hours 58 min ago:
              The question is who runs them? There are only a few big companies
              like MS, Google, OpenAI, Anthropic. But from the posts here it
              looks like hordes of buggy scrapers run by enthusiasts.
       
                luckylion wrote 15 hours 59 min ago:
                 Ad companies (even the small ones), "brand protection"
                 companies, IP lawyers looking for images used without a
                 license, brand marketing companies, and, where it matters,
                 your competitors, etc.
       
                iamacyborg wrote 19 hours 28 min ago:
                Lots of “data” companies out there that want to sell you
                scraped data sets.
       
          hnav wrote 1 day ago:
           Content-Length is computed after Content-Encoding is applied.
       
            ahoka wrote 16 hours 28 min ago:
            If it’s present at all.
       
        mikeiz404 wrote 1 day ago:
         Two thoughts on poisoning unwanted LLM training-data traffic:
        
        1) A coordinated effort among different sites will have a much greater
        chance of poisoning the data of a model so long as they can avoid any
        post scraping deduplication or filtering.
        
        2) I wonder if copyright law can be used to amplify the cost of
        poisoning here. Perhaps if the poisoned content is something which has
        already been shown to be aggressively litigated against then the
        copyright owner will go after them when the model can be shown to
        contain that banned data. This may open up site owners to the legal
        risk of distributing this content though… not sure. A cooperative
        effort with a copyright holder may sidestep this risk but they would
        have to have the means and want to litigate.
       
          Anamon wrote 16 hours 9 min ago:
          As for 1, it would be great to have this as a plugin for WordPress
          etc. that anyone could simply install and enable. Pre-processing
          images to dynamically poison them on each request should be fun, and
          also protect against a deduplication defense. I'd certainly install
          that.
       
        hexage1814 wrote 1 day ago:
        I like and support web scrapers. It is even funnier when the site
        owners don't like it
       
          1gn15 wrote 1 day ago:
          Thank you <3
       
          ang_cire wrote 1 day ago:
          Yep. Robots.txt is a framework intended for performance, not a legal
          or ethical imperative.
          
          If you want to control how someone accesses something, the onus is on
          you to put access controls in place.
          
          The people who put things on a public, un-restricted server and then
          complain that the public accessed it in an un-restricted way might be
          excusable if it's some geocities-esque Mom and Pop site that has no
          reason to know better, but 'cryptography dog' ain't that.
       
            Anamon wrote 15 hours 57 min ago:
            What controls do you suggest?
            
            Saying that a handful of mass copyright infringers with billion
            dollar investors are simply part of the "public" like every regular
            visitor is seriously distorting the issue here.
            
            Sites with a robots.txt banning bots are only "unrestricted" in a
            strictly technical sense. They are clearly setting terms of use
             that these rogue bots are violating. Besides, robots.txt is
             legally binding in certain jurisdictions; it's not just a polite
             plea. And if we decide that anything not technically prevented is
             legal, then we're also legitimising botnets, DDoS attacks, and a
             lot more. Hacking into a corporate system through a
             misconfiguration or vulnerability is also illegal, despite the
             fact that the defenses failed.
            
            Finally, we all know that the only purpose these bots are scraping
            for is mass copyright infringement. That's another layer where the
            "if it's accessible, it's fair game" logic falls apart. I can
            download a lot of publicly accessible art, music, or software, but
            that doesn't mean I can do with those files whatever I want. The
            only reason these AI companies haven't been sued out of existence
            yet, like they should've been, is that it's trickier to prove
            provenance than if they straight up served the unmodified files.
       
            ordu wrote 19 hours 17 min ago:
             It is an antisocial state of mind. We have locks and security
             systems to keep people from stealing, but if everyone agreed not
             to steal, we could put that effort into something better. The
             ideal approach doesn't work for stealing, and now it doesn't work
             for HTTP either. It just raises costs for society with no lasting
             benefit for anyone: site owners figure out ways to restrict
             access, and the pages they do not want scraped still don't get
             scraped.
             
             A healthy society relies on cooperation between its members. It
             relies on them accepting some rules that limit their behavior.
             We agreed not to kill each other, and now I can go outside
             without weapons and body armor.
       
        stevage wrote 1 day ago:
        The title is confusing, should be "commented-out".
       
          pimlottc wrote 1 day ago:
          Agree, I thought maybe this was going to be a script to block AI
          scrapers or something like that.
       
            zahlman wrote 1 day ago:
            I thought it was going to be AI scraper operators getting annoyed
            that they have to run reasoning models on the scraped data to make
            use of it.
       
        bigbuppo wrote 1 day ago:
        Sounds like you should give the bots exactly what they want... a 512MB
        file of random data.
       
          AlienRobot wrote 15 hours 38 min ago:
          512 MB of saying your service is the best service.
       
          kelseyfrog wrote 1 day ago:
          That's leaving a lot of opportunity on the table.
          
          The real money is in monetizing ad responses to AI scrapers so that
          LLMs are biased toward recommending certain products. The stealth
          startup I've founded does exactly this. Ad-poisoning-as-a-service is
          a huge untapped market.
       
            bigbuppo wrote 1 day ago:
            Now that's a paid subscription I can get behind, especially if it
            suggests that Meta should cut Rob Schneider a check for
            $200,000,000,000 to make more movies.
       
              kelseyfrog wrote 1 day ago:
              Contact info in bio. Always looking to make more happy customers.
       
          aDyslecticCrow wrote 1 day ago:
           A scraper sinkhole of randomly generated, inter-linked files filled
           with AI poison could work. No human would click that link, so it
           leads to the "exclusive club".
       
            oytis wrote 1 day ago:
            Outbound traffic normally costs more than inbound one, so the
            asymmetry is set up wrong here. Data poisoning is probably the way.
       
              zahlman wrote 1 day ago:
              > Outbound traffic normally costs more than inbound one, so the
              asymmetry is set up wrong here.
              
              That's what zip bombs are for.
       
          kelnos wrote 1 day ago:
          Most people have to pay for their bandwidth, though.  That's a lot of
          data to send out over and over.
       
            jcheng wrote 1 day ago:
            512MB file of incredibly compressible data, then?
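               
               One way to build such a file, sketched in Python (the filename
               and sizes are arbitrary): pre-compress a long run of zeroes
               once and serve the result verbatim with a Content-Encoding:
               gzip header, so the requester pays for the inflation.
               
                 import gzip
                 
                 CHUNK = b"\0" * (1024 * 1024)  # 1 MiB of zeroes
                 with gzip.open("decoy.js.gz", "wb", compresslevel=9) as f:
                     for _ in range(512):       # 512 MiB before compression
                         f.write(CHUNK)
                 # Roughly half a megabyte on disk; whoever insists on
                 # decompressing it gets the full 512 MB.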
       
              QuadmasterXLII wrote 1 day ago:
              Could I recommend [1] ?
              
               50:1 compression ratio, but it's legitimately an implementation
               of a Rubik's cube. I wasn't making it as any sort of trap, I
               just wasn't thinking about file size, so any rule that filters
               it out is going to have a nasty false positive rate.
              
  HTML        [1]: https://cubes.hgreer.com/ssg/output.html
       
        bakql wrote 1 day ago:
        >These were scrapers, and they were most likely trying to
        non-consensually collect content for training LLMs.
        
        "Non-consensually", as if you had to ask for permission to perform a
        GET request to an open HTTP server.
        
        Yes, I know about weev. That was a travesty.
       
          smsm42 wrote 1 day ago:
           Are you still trying to pretend that accessing an HTTP server once
           and burying it under an avalanche of never-stopping bot crawlers
           are the same thing? And that spam is the same as "sending an
           email" and should be treated the same? I thought in this day and
           age we were past that.
       
            1gn15 wrote 1 day ago:
            If you're trying to say DDoS, just say that.
       
              smsm42 wrote 1 day ago:
              DDoS is a very specific type of attack. To be abusive, you don't
              have to do exactly that - it could be any type of DoS, and in
              fact it doesn't even have to deny all service - it could just
              impose excessive costs, for example.
       
          malfist wrote 1 day ago:
           If I set out a bowl of candy for trick-or-treaters, that doesn't
           mean I'm okay with the first adult strolling by and taking
           everything.
       
            righthand wrote 1 day ago:
            Then cutting up the candy and taping candy together in the most
            statistically pleasing way and finally selling all of the stolen
            frankenstein’s monster candy as innovative new candy and the
            future of humanity.
       
            dylan604 wrote 1 day ago:
             And if they do, you have no recourse, just like with scrapers.
             With the candy example, you spend your time sitting near the
             candy bowl supervising; for servers, we have various anti-bot
             supervisors. However, some asshat with no scruples can still just
             walk right up to your bowl, empty the contents into their bag,
             and walk away even with you sitting right there. Unless you're
             willing to commit violence, there's nothing stopping them. Now
             you're the assailant and the asshat is the victim. You still
             lose.
       
          grayhatter wrote 1 day ago:
           If you're lying in the requests you send to trick my server into
           returning the content you want, instead of what I would want to
           return to web scrapers, that's non-consensual.
           
           You don't need my permission to send a GET request, I completely
           agree. In fact, by having a publicly accessible webserver, there's
           implied consent that I'm willing to accept reasonable and valid
           GET requests.
           
           But I have configured my server to spend its resources the way I
           want. You don't like how my server works, so you configure your
           bot to lie. If you get what you want only because you're willing
           to lie, where's the implied consent?
       
            wqaatwt wrote 16 hours 7 min ago:
             Is somebody concealing or obfuscating information a browser would
             normally send, for privacy or other reasons, also "lying" by that
             standard? Or someone using a VPN?
       
              grayhatter wrote 12 hours 59 min ago:
               Someone using a VPN is not lying. The intent of a user agent is
               to identify the software sending the request. The IP address
               isn't sent by the browser and isn't part of the HTTP request;
               it's part of the routing information required to deliver the
               packet back to the client. If a client sent its "real" IP
               address as an HTTP header and I tried to respond to that IP
               instead of the IP address from the TCP packet, the response
               would never arrive.
               
               There's a difference between sending no data and sending false
               data. I don't block requests without HTTP referrers for that
               very reason.
       
                wqaatwt wrote 12 hours 51 min ago:
                 IIRC Firefox (and I assume other browsers) does send fake
                 data when using privacy/no-tracking mode.
       
                  grayhatter wrote 12 hours 40 min ago:
                   You're incorrect. I've never seen any browser, on its own,
                   lie about its user agent. (I can set a custom string and
                   lie with it, but that's not the agent doing it.)
                   
                   Do you have a specific/concrete example in mind? Or are you
                   mistaking a feature from something other than a mainstream
                   browser?
       
                    gkbrk wrote 6 hours 5 min ago:
                    Firefox sends an incorrect version and operating system on
                    its User-Agent when the privacy settings are turned on.
                    
                    IIRC it defaults to a Windows user agent even when you use
                    it on other operating systems.
       
            batch12 wrote 1 day ago:
             Browser user agents have a history of being lies from the
             earliest days of usage. Official browsers lied about what they
             were, and still do.
       
              grayhatter wrote 12 hours 42 min ago:
               Can you give a single example of a browser with a user agent
               that lies about its real origin?
               
               The best I can come up with is the Tor Browser, which will
               reduce the number of bits of information it returns, but I
               don't consider that to be misleading. It's a custom build of
               Firefox that discloses it is Firefox, and otherwise behaves
               exactly as I would expect Firefox to behave.
       
              jraph wrote 20 hours 15 min ago:
               Lies in user agent strings were for bypassing bugs, poor
               workarounds, and assumptions that became wrong; they are
               nothing like what we are talking about.
       
                batch12 wrote 12 hours 44 min ago:
                Yes, the client wanted the server to deliver content it had
                intended for a different client, regardless of what the service
                operator wanted, so it lied using its user agent. Exact same
                thing we are talking about. The difference is that people don't
                want companies to profit off of their content. That's fair. In
                this case, they should maybe consider some form of real
                authentication, or if the bot is abusive, some kind of rate
                limiting control.
       
                  jraph wrote 10 hours 39 min ago:
                  Add "assumptions that became wrong" to "intended" and the
                  perspective radically changes, to the point that omitting
                  this part from my comment changes everything.
                  
                  I would even add:
                  
                  > the client wanted the server to deliver content it had
                  intended for a different client
                  
                  In most cases, the webmaster intended their work to look
                  good, not really to send different content to different
                   clients. That latter part is a technical means, a workaround.
                  The intent of bringing the ok version to the end user was
                  respected… even better with the user agent lies!
                  
                  > The difference is that people don't want companies to
                  profit off of their content.
                  
                   Indeed¹, and also they don't want terrible bots to bring down
                  their servers.
                  
                  1: well, my open source work explicitly allows people to
                  profit off of it - as long as the license is respected
                  (attribution, copyleft, etc)
       
                  grayhatter wrote 12 hours 33 min ago:
                  > Yes, the client wanted the server to deliver content it had
                  intended for a different client, regardless of what the
                  service operator wanted, so it lied using its user agent.
                  
                   I would actually argue it's not nearly the same type of
                   misconfiguration. The reason scripts that have never been a
                   browser omit their real identity is to evade bot detection.
                   The reason browsers pack their UA with so much legacy data
                   is misconfigured servers. The server owner wants to send
                   data to users and their browsers, but through incompetence
                   they've made a mistake. Browsers adapted by including extra
                   strings in the UA to account for the expectations of
                   incorrectly configured servers. Extra strings being the
                   critical part: Googlebot's UA is an example of this being
                   done correctly.
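                   
                   For what it's worth, Google documents a reverse-then-forward
                   DNS check for telling real Googlebot traffic apart from user
                   agents that merely claim to be it; a rough Python sketch
                   (minimal error handling):
                   
                     import socket
                     
                     def is_verified_googlebot(ip):
                         # e.g. crawl-66-249-66-1.googlebot.com
                         try:
                             host = socket.gethostbyaddr(ip)[0]
                             ok = (".googlebot.com", ".google.com")
                             if not host.endswith(ok):
                                 return False
                             # Forward lookup must map back to the same IP
                             return ip in socket.gethostbyname_ex(host)[2]
                         except (socket.herror, socket.gaierror):
                             return False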
       
                gkbrk wrote 15 hours 46 min ago:
                A server returning HTML for Chrome but not cURL seems like a
                bug, no?
                
                This is why there are so many libraries to make requests that
                look like they came from browser, to work around buggy servers
                or server operators with wrong assumptions.
       
                  grayhatter wrote 12 hours 50 min ago:
                  > A server returning HTML for Chrome but not cURL seems like
                  a bug, no?
                  
                  tell me you've never heard of [1] without telling me. :P
                  
                  It would absolutely be a bug iff this site returned html to
                  curl.
                  
                  > This is why there are so many libraries to make requests
                  that look like they came from browser, to work around buggy
                  servers or server operators with wrong assumptions.
                  
                   This is a shallow take; the best counterexample is how
                   Googlebot has no problem identifying itself both in and
                   out of the user agent. Do note that user agent packing is
                   distinctly different from a fake user agent selected
                   randomly from a list of the most common ones.
                   
                   The existence of many libraries intended to help conceal
                   the truth about a request doesn't feel like proof that's
                   what everyone should be doing. It feels more like proof
                   that most people only want to serve traffic to browsers
                   and real users, and that it's the bots and scripts that
                   are the fuckups.
                  
  HTML            [1]: https://wttr.in/
       
                    batch12 wrote 12 hours 37 min ago:
                    Googlebot has no problem identifying itself because Google
                    knows that you want it to index your site if you want
                    visitors. It doesn't identify itself to give you the option
                    to block it. It identifies itself so you don't.
       
                      grayhatter wrote 12 hours 25 min ago:
                       I care much less about being indexed by Google than you
                       might think.
                      
                      Google bot doesn't get blocked from my server primarily
                      because it's a *very* well behaved bot. It sends a lot of
                      requests, but it's very kind, and has never acted in a
                      way that could overload my server. It respects
                      robots.txt, and identifies itself multiple times.
                      
                      Google bot doesn't get blocked, because it's a well
                      behaved bot that eagerly follows the rules. I wouldn't
                      underestimate how far that goes towards the reason it
                      doesn't get blocked. Much more than the power gained by
                      being google search.
       
          j2kun wrote 1 day ago:
          You should not have to ask for permission, but you should have to
          honestly set your user-agent. (In my opinion, this should be the law
          and it should be enforced)
       
            gkbrk wrote 6 hours 1 min ago:
            > In my opinion, this should be the law and it should be enforced
            
            You think people should go to prison if they go to their browser
            settings and change their user agent?
       
          davesque wrote 1 day ago:
          I mean, it costs money to host content. If you are hosting content
           for bots, fine, but if the money you're paying to host it is meant to
          benefit human users (the reason for robots.txt) then yeah, you ought
          to ask permission. Content might also be copyrighted. Honestly, I
          don't even know why I'm bothering to mention these things because it
          just feels obvious. LLM scrapers obviously want as much data as they
          can get, whether or not they act like assholes (ignoring robots.txt)
          or criminals (ignoring copyright) to get it.
       
          codyb wrote 1 day ago:
          The sign on the door said "no scrapers", which as far as I know is
          not a protected class.
       
          jraph wrote 1 day ago:
          When I open an HTTP server to the public web, I expect and welcome
          GET requests in general.
          
          However,
          
          (1) there's a difference between (a) a regular user browsing my
          websites and (b) robots DDoSing them. It was never okay to hammer a
          webserver. This is not new, and it's for this reason that curl has
          had options to throttle repeated requests to servers forever. In real
           life, there are many instances of things being offered for free; it's
          usually not okay to take it all. Yes, this would be abuse. And no,
          the correct answer to such a situation would not be "but it was free,
          don't offer it for free if you don't want it to be taken for free".
          Same thing here.
          
          (2) there's a difference between (a) a regular user reading my
          website or even copying and redistributing my content as long as the
          license of this work / the fair use or related laws are respected,
          and (b) a robot counterfeiting it (yeah, I agree with another
          commenter, theft is not the right word, let's call a spade a spade)
          
           (3) well-behaved robots are expected to respect robots.txt. This
           is not the law; this is about being respectful. It is only fair
           that badly behaved robots get called out.
          
          Well behaved robots do not usually use millions of residential IPs
          through shady apps to "Perform a get request to an open HTTP server".
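           
           For reference, the courtesy check itself is only a few lines; a
           sketch using Python's urllib.robotparser, where the site URL and
           user agent string are placeholders:
           
             from urllib import robotparser
             
             rp = robotparser.RobotFileParser()
             rp.set_url("https://example.org/robots.txt")
             rp.read()  # fetch and parse the file once
             
             agent = "ExampleCrawler/1.0"
             if rp.can_fetch(agent, "https://example.org/some/page"):
                 pass  # polite to fetch; also honor rp.crawl_delay(agent)
             else:
                 pass  # skip it; don't retry via residential proxies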
       
            Razengan wrote 22 hours 31 min ago:
            > And no, the correct answer to such a situation would not be "but
            it was free, don't offer it for free if you don't want it to be
            taken for free".
            
             The answer to THAT could be: "It is free, but leave some for
             others, you greedy fuck."
       
            Aloisius wrote 1 day ago:
            > Well behaved robots do not usually use millions of residential
            IPs
            
             Some antivirus and parental control software will scan links
             sent to someone from their machine (or from access
             points/routers).
            
            Even some antivirus services will fetch links from residential IPs
            in order to detect malware from sites configured to serve malware
            only to residential IPs.
            
            Actually, I'm not entirely sure how one would tell the difference
            between a user software scanning links to detect adult
            content/malware/etc, randos crawling the web searching for personal
            information/vulnerable sites/etc. and these supposed "AI crawlers"
            just from access logs.
            
            While I'm certainly not going to dismiss the idea that these are
            poorly configured crawlers at some major AI company, I haven't seen
            much in the way of evidence that is the case.
       
              kijin wrote 1 day ago:
              Occasionally fetching a link will probably go unnoticed.
              
              If your antivirus software hammers the same website several times
              a second for hours on end, in a way that is indistinguishable
              from an "AI crawler", then maybe it's really misbehaving and
              should be stopped from doing so.
       
                Aloisius wrote 1 day ago:
                 Legitimate software that scans links is often well behaved,
                 in isolation. It's when that software is installed on
                 millions of computers that, in aggregate, it can behave
                 poorly. This isn't
                particularly new though. RSS software used to blow up small
                websites that couldn't handle it. Now with some browsers
                speculatively loading links, you can be hammered simply because
                you're linked to from a popular site even if no one actually
                clicks on the link.
                
                Personally, I'm skeptical of blaming everything on AI scrapers.
                Everything people are complaining about has been happening for
                decades - mostly by people searching for website
                vulnerabilities/sensitive info who don't care if they're
                misbehaving, sometimes by random individuals who want to
                archive a site or are playing with a crawler and don't see why
                they should slow them down.
                
                Even the techniques for poisoning aggressive or impolite
                crawlers are at least 30 years old.
       
                  kijin wrote 23 hours 25 min ago:
                  Yes, and sysadmins have been quietly banning those
                  misbehaving programs for the last 30 years.
                  
                  The only thing that seems to have changed is that today's
                  thread is full of people who think they have some sort of
                  human right to access any website by any means possible,
                  including their sloppy vibe-coded crawler. In the past, IIRC,
                  people used to be a little more apologetic about consuming
                  other people's resources and did their best to fly below the
                  radar.
                  
                  It's my website. I have every right to block anyone at any
                  time for any reason whatsoever. Whether or not your use case
                  is "legitimate" is beside the point.
       
                    ToucanLoucan wrote 10 hours 40 min ago:
                    The entitlement of so many modern vibe coders (or as we
                    called them before, script kiddies) is absolutely off the
                    charts. Just because there is not a rule or law expressly
                    against what you're doing doesn't mean it's perfectly fine
                    to do. Websites are hosted by and funded by people, and if
                    your shitty scraper racks up a ton of traffic on one of my
                    sites, I may end up on the hook for that. I am perfectly
                    within both my rights and ethical boundaries to block your
                    IP(s).
                    
                    And just to not leave it merely implied, I don't give a
                    rats ass if that slows down your "innovation." Go away.
       
            Cervisia wrote 1 day ago:
            > robots.txt. This is not the law
            
            In Germany, it is the law. § 44b UrhG says (translated):
            
            (1) Text and data mining is the automated analysis of one or more
            digital or digitized works to obtain information, in particular
            about patterns, trends, and correlations.
            
            (2) Reproductions of lawfully accessible works for text and data
            mining are permitted. These reproductions must be deleted when they
            are no longer needed for text and data mining.
            
            (3) Uses pursuant to paragraph 2, sentence 1, are only permitted if
            the rights holder has not reserved these rights. A reservation of
            rights for works accessible online is only effective if it is in
            machine-readable form.
       
              luckylion wrote 15 hours 53 min ago:
              I doubt robots.txt would fit. robots.txt allows or disallows
              access, but it does not state any claim. You can license content
              you don't own, put it on your website, and then exclude it in
              robots.txt without that implying any claims of rights to that
              content.
       
              klntsky wrote 21 hours 26 min ago:
              >  A reservation of rights for works accessible online is only
              effective if it is in machine-readable form.
              
              What if MY machine can't read it though?
       
                Y-bar wrote 19 hours 48 min ago:
                That’s your problem.
                
                A solution has been offered and you can adhere to it, or stop
                doing that thing which causes problems for many of us.
       
          sdenton4 wrote 1 day ago:
           The problem is that serving content costs money. LLM scraping is
           essentially DDoSing content meant for human consumption. DDoSing
           sucks.
       
            dylan604 wrote 1 day ago:
             Running the scraping bots costs money too.
       
              meepmorp wrote 17 hours 19 min ago:
              > Won’t somebody please think of the parasites?
       
              QuadmasterXLII wrote 1 day ago:
              what?
       
            2OEH8eoCRo0 wrote 1 day ago:
            Scraping is legal. DDoSing isn't.
            
            We should start suing these bad actors. Why do techies forget that
            the legal system exists?
       
              herbst wrote 17 hours 17 min ago:
               Facebook and Bing are sometimes 80% of my daily hits and don't
               respect my IP bans and other bot filtering at all. You think I
               can just sue them and have any chance of winning before going
               broke?
       
              ColinWright wrote 1 day ago:
               There is no way that you can sue the people responsible for
               DDoSing your system. Even if you can find them ... and you
               won't ... they're as likely as not outside your jurisdiction
               (they might be in Russia, or China, or Bolivia, or anywhere),
               and they will have a lot more money than you.
               
               People here on HN are laughing at the UK's Online Safety Act
               for trying to impose restrictions on people in other countries,
               and yet now you're implying that similar restrictions can be
               placed on people in other countries over whom you have neither
               power nor control.
       
          arccy wrote 1 day ago:
          yeah all open HTTP servers are fair game for DDoS because well it's
          open right?
       
          Lionga wrote 1 day ago:
           So if a house is not locked I can take whatever I want?
       
            Ylpertnodi wrote 1 day ago:
             Yes, but you may get caught, and then suffer 'consequences'.
             I can drive well over 220 km/h on the autobahn (Germany, Europe),
             and also in France (also in Europe).
             One is acceptable, the other will get me Royale-ly fucked.
             If they can catch me.
       
          munk-a wrote 1 day ago:
           I think there's a massive shift in what the letter of the law needs
           to be to match the intent. The letter hasn't changed and this is
           all still quite legal - but there is a significant difference
           between what webscraping was doing to impact creative lives five
           years ago and today. It was always possible for artists to have
           their content stolen and for creative works to be reposted - but
           there were enough IP laws around image sharing (which AI
           disingenuously steps around), and other creative work wasn't
           monetarily efficient to scrape.
          
          I think there is a really different intent to an action to read
          something someone created (which is often a form of marketing) and to
          reproduce but modify someone's creative output (which competes
          against and starves the creative of income).
          
          The world changed really quickly and our legal systems haven't kept
          up.  It is hurting real people who used to have small side
          businesses.
       
          isodev wrote 1 day ago:
          Ah yes, the “it’s ok because I can” school of thought. As if
          that was ever true.
       
          Calavar wrote 1 day ago:
          I agree. It always surprises me when people are indignant about
          scrapers ignoring robots.txt and throw around words like "theft" and
          "abuse."
          
          robots.txt is a polite request to please not scrape these pages
          because it's probably not going to be productive. It was never meant
          to be a binding agreement, otherwise there would be a stricter
          protocol around it.
          
          It's kind of like leaving a note for the deliveryman saying please
          don't leave packages on the porch. It's fine for low stakes
          situations, but if package security is of utmost importance to you,
          you should arrange to get it certified or to pick it up at the
          delivery center. Likewise if enforcing a rule of no scraping is of
          utmost importance you need to require an API token or some other form
          of authentication before you serve the pages.
       
            bigiain wrote 23 hours 28 min ago:
            > robots.txt is a polite request to please not scrape these pages
            
             At the same time, an HTTP GET request is a polite request to
             respond with the expected content. There is no binding agreement
            that my webserver sends you the webpage you asked for. I am at
            liberty to enforce my no-scraping rules however I see fit. I get to
            choose whether I'm prepared to accept the consequences of a "real
            user" tripping my web scraping detection thresholds and getting
            firewalled or served nonsense or zipbombed (or whatever
             countermeasure I choose). Perhaps that'll drive away a reader (or
             customer) who opens 50 tabs to my site all at once; perhaps
             Google will send a badly behaved bot and miss indexing some of my
             pages or even deindex my site. For my personal site I'm 100% OK
             with
            those consequences. For work's website I still use countermeasures
            but set the thresholds significantly more conservatively. For
            production webapps I use different but still strict thresholds and
            different countermeasures.
            
            Anybody who doesn't consider typical AI company's webscraping
            behaviour over the last few years to qualify as "abuse" has
            probably never been responsible for a website with any volume of
            vaguely interesting text or any reasonable number of backlinks from
            popular/respected sites.
       
              overfeed wrote 22 hours 1 min ago:
               It may be naivete, but I love the standards-based open web as a
               software platform and as a fabric that connects people.
               It makes my blood boil that some solipsistic, predatory
               bastards are eager to turn the internet into a dark forest.
       
            smsm42 wrote 1 day ago:
            "Theft" may be wrong, but "abuse" certainly is not. Human
            interactions in general, and the web in particular, are built on
            certain set of conventions and common behaviors. One of them is
            that most sites are for consuming information at human paces and
            volumes, not downloading their content wholesale. There are
            specialized sites that are fine with that, but they say it upfront.
            Average, especially hobbyist site, is not that. People who do not
            abide by it are certainly abusing it.
            
            > Likewise if enforcing a rule of no scraping is of utmost
            importance you need to require an API token or some other form of
            authentication before you serve the pages.
            
            Yes, and if the rule of not dumping a ton of manure on your
            driveway is so important to you, you should live in a gated
            community and hire round-the-clock security. Some people do, but
            living in a society where the only way to not wake up with a ton of
            manure in your driveway is to spend excessive resources on security
            is not the world that I would prefer to live in. And I don't see
            why people would spend time to prove this is the only possible and
            normal world - it's certainly not the case, we can do better.
       
              o11c wrote 23 hours 32 min ago:
              Theft is correct but for a different reason.
              
              The #1 reason for all AI scrapers is to replace the content they
              are scraping. This means no "fair use" defense to the copyright
              infringement they inevitably commit.
       
            grayhatter wrote 1 day ago:
            > I  agree. It always surprises me when people are indignant about
            scrapers ignoring robots.txt and throw around words like "theft"
            and "abuse."
            
             This feels like the kind of argument someone would make as to why
             they aren't required to return their shopping cart to the bay.
            
            > robots.txt is a polite request to please not scrape these pages
            because it's probably not going to be productive. It was never
            meant to be a binding agreement, otherwise there would be a
            stricter protocol around it.
            
             Well, no. That's an overly simplistic description which fits your
             argument, but it doesn't accurately represent reality. Yes,
             robots.txt was created as a hint for robots, but it was never
             expected to be non-binding. The important detail, the one that
             explains why it's called robots.txt, is that the web server
             exists to serve the requests of humans. Robots are welcome too,
             but please follow these rules.
             
             You can tell your description is completely inaccurate and
             non-representative of the expectations of the web as a whole,
             because every popular LLM scraper goes out of its way to both
             follow and announce that it follows robots.txt.
            
            > It's kind of like leaving a note for the deliveryman saying
            please don't leave packages on the porch.
            
            It's nothing like that, it's more like a note that says no
            soliciting, or please knock quietly because the baby is sleeping.
            
            > It's fine for low stakes situations, but if package security is
            of utmost importance to you, you should arrange to get it certified
            or to pick it up at the delivery center.
            
             Or, people could just not be assholes? Yes, I get it, in the
             reality we live in there are assholes. But the problem, as I see
             it, is not just the assholes but the people who act as apologists
             for this clearly deviant behavior.
            
            > Likewise if enforcing a rule of no scraping is of utmost
            importance you need to require an API token or some other form of
            authentication before you serve the pages.
            
            Because it's your fault if you don't, right? That's victim blaming.
            I want to be able to host free, easy to access content for humans,
            but someone with more money, and more compute resources than I
            have, gets to overwhelm my server because they don't care... And
            that's my fault, right?
            
            I guess that's a take...
            
            There's a huge difference between suggesting mitigations for
            dealing with someone abusing resources, and excusing the abuse of
            resources, or implying that I should expect my server to be abused,
            instead of frustrated about the abuse.
       
            watwut wrote 1 day ago:
             If you ignore a polite request, then it is perfectly OK to give
             you as much false data as possible. You have shown yourself not
             interested in good-faith cooperation, and that means other people
             can and should treat you as a jerk.
       
            kelnos wrote 1 day ago:
            > robots.txt is a polite request to please not scrape these pages
            
            People who ignore polite requests are assholes, and we are well
            within our rights to complain about them.
            
            I agree that "theft" is too strong (though I think you might be
            presenting a straw man there), but "abuse" can be perfectly apt: a
            crawler hammering a server, requesting the same pages over and
            over, absolutely is abuse.
            
            > Likewise if enforcing a rule of no scraping is of utmost
            importance you need to require an API token or some other form of
            authentication before you serve the pages.
            
            That's a shitty world that we shouldn't have to live in.
       
              DoctorOetker wrote 10 hours 11 min ago:
              Whenever one forms a sentence, it is worthwhile to try to form a
              sentence that you believe to be generally true.
              
              If someone politely requests you to suck their genitalia, and you
              ignore that request, does that make you an asshole?
       
              wslh wrote 1 day ago:
              > People who ignore polite requests are assholes, and we are well
              within our rights to complain about them.
              
               If you are building a new search engine and the robots.txt only
               includes Google, are you an asshole for indexing the
               information?
       
                kijin wrote 1 day ago:
                Yes, because the site owner has clearly and explicitly
                requested that you don't scrape their site, fully accepting the
                consequence that their site will not appear in any search
                engine other than Google.
                
                Whatever impact your new search engine or LLM might have in the
                world is irrelevant to their wishes.
       
            mxkopy wrote 1 day ago:
            The metaphor doesn’t work. It’s not the security of the package
            that’s in question, but something like whether the delivery
            person is getting paid enough or whether you’re supporting them
            getting replaced by a robot. The issue is in the context, not the
            protocol.
       
            bigbuppo wrote 1 day ago:
            Seriously. Did you see what that web server was wearing? I mean,
            sure it said "don't touch me" and started screaming for help and
            blocked 99.9% of our IP space, but we got more and they didn't
            block that so clearly they weren't serious. They were asking for
            it. It's their fault. They're not really victims.
       
              jMyles wrote 1 day ago:
              Sexual consent is sacred.  This metaphor is in truly bad taste.
              
              When you return a response with a 200-series status code, you've
              granted consent.  If you don't want to grant consent, change the
              logic of the server.
       
                mvc wrote 16 hours 42 min ago:
                Future rapist right here.
       
                LexGray wrote 1 day ago:
                 Perhaps bad taste, but bots could also genuinely and
                 purposely be violating the most private or traumatizing
                 moments a vulnerable person has, in any exploitative way they
                 care to. I am not sure bad taste is enough of an excuse not
                 to discuss the issue, as many people do in fact use the
                 internet for sexual things. If anything, consent should be
                 MORE important here because it is easier to document and
                 verify.
                 
                 A vast hoard of personal information exists, and most of it
                 never had and never will have proper consent, knowledge, or
                 protection.
       
                  jMyles wrote 12 hours 29 min ago:
                  > the most private or traumatizing moments a vulnerable
                  person has
                  
                  ...and in this hypothetical, this person is serving them via
                  an unauthenticated http server and hoping that clients will
                  respect robots.txt?
       
                    bigbuppo wrote 10 hours 43 min ago:
                    Robots are supposed to behave. It was a solved problem 30
                    years ago until AI bros unsolved it. Any entity that does
                    not obey robots.txt is by definition a malicious actor.
       
                Larrikin wrote 1 day ago:
                >I don't like how your metaphor is an effective metaphor for
                the situation so it's in bad taste.
       
                  bigbuppo wrote 10 hours 40 min ago:
                  They also conveniently missed the point that it was about
                  victim blaming.
       
                  jack_pp wrote 1 day ago:
                   If you absolutely want a sexual metaphor, it's more like
                   you snuck into the world-record attempt for how many sexual
                   partners a woman can take in 24 hours, and even though you
                   aren't on the list you still got to smash.
                   
                   The solution is the same: implement better security.
       
                    bigbuppo wrote 1 day ago:
                    Thank you for finding the right metaphor. If there is a
                    sign out front that has a list of individuals that should
                    go away but they continue, they're in a lot of legal
                    trouble. If they show a fake ID to the event organizers
                    that are handling all the paperwork, that is also something
                    that will land them in prison.
       
                jraph wrote 1 day ago:
                > When you return a response with a 200-series status code,
                you've granted consent. If you don't want to grant consent,
                change the logic of the server.
                
                "If you don't consent to me entering your house, change its
                logic so that picking the door's lock doesn't let me open the
                door"
                
                Yeah, well…
                
                 As if the LLM scrapers didn't try everything under the sun,
                 like using millions of different residential IPs, to prevent
                 admins from "changing the logic of the server" so it doesn't
                 "return a response with a 200-series status code" when they
                 don't agree to this scraping.
                
                As if there weren't broken assumptions that make "When you
                return a response with a 200-series status code, you've granted
                consent" very false.
                
                As if technical details were good carriers of human intents.
       
                  ryandrake wrote 1 day ago:
                  The locked door is a ridiculous analogy when it comes to the
                  open web. Pretty much all "door" analogies are flawed, but
                  sure let's imagine your web server has a door. If you want to
                  actually lock the door, you're more than welcome to put an
                  authentication gate around your content. A web server that
                  accepts a GET request and replies 2xx is distinctly NOT
                  "locked" in any way.
       
                    jraph wrote 1 day ago:
                    Any analogy is flawed and you can kill most analogies very
                    fast. They are meant to illustrate a point hopefully
                    efficiently, not to be mathematically true. They are not to
                    everyone's taste, me included in most cases. They are
                    mostly fine as long as they are not used to make a point,
                    but only to illustrate it.
                    
                    I agree with this criticism of this analogy, I actually had
                    this flaw in mind from the start. There are other flaws I
                    have in mind as well.
                    
                    I have developed more without the analogy in the remaining
                    of the comment. How about we focus on the crux of the
                    matter?
                    
                    > A web server that accepts a GET request and replies 2xx
                    is distinctly NOT "locked" in any way
                    
                     The point is that these scrapers use tricks so that it's
                     difficult not to grant them access. What is unreasonable
                     here is to think that 200 means consent, especially
                     knowing about the tricks.
                    
                    Edit:
                    
                    > you're more than welcome to put an authentication gate
                    around your content.
                    
                    I don't want to. Adding auth so LLM providers don't abuse
                    my servers and the work I meant to share publicly is not a
                    workable solution.
       
                      ryandrake wrote 1 day ago:
                      People need to have a better mental model of what it
                      means to host a public web site, and what they are
                      actually doing when they run the web server and point it
                      at a directory of files. They're not just serving those
                      files to customers. They're not just serving them to
                      members. They're not just serving them to human beings.
                      They're not even necessarily serving files to web
                      browsers. They're serving files to every IP address (no
                      matter what machine is attached to it) that is capable of
                      opening a socket and sending GET. There's no such
                      distinct thing as a scraper--and if your mental model
                      tries to distinguish between a scraper and a human user,
                      you're going to be disappointed.
                      
                      As the web server operator, you can try to figure out if
                      there's a human behind the IP, and you might be right or
                      wrong. You can try to figure out if it's a web browser,
                      or if it's someone typing in curl from a command line, or
                      if it's a massively parallel automated system, and you
                      might be right or wrong. You can try to guess what
                      country the IP is in, and you might be right or wrong.
                      But if you really want to actually limit access to the
                      content, you shouldn't be publishing that content
                      publicly.
       
                        tremon wrote 9 hours 49 min ago:
                        The CFAA wants to have a word. The fact that a server
                        responds with a 200 OK has no bearing on the legality
                        of your request, there's plenty of precedent by now.
       
                        bigbuppo wrote 1 day ago:
                        How about AI companies just act ethically and obey
                        norms?
       
                        Retric wrote 1 day ago:
                        > They're serving files to every IP address (no matter
                        what machine is attached to it) that is capable of
                        opening a socket and sending GET.
                        
                        Legally in the US a “public” web server can have
                        any set of usage restrictions it feels like even
                        without a login screen.  Private property doesn’t
                        automatically give permission to do anything even if
                        there happens to be a driveway from the public road
                        into the middle of it.
                        
                        The law cares about authorized access, not the
                        specific technical implementation of access.  Which
                        has caused serious legal trouble for many people when
                        they make seemingly reasonable assumptions, say that
                        access to someURL/A12.jpg also gives them permission
                        to fetch someURL/A13.jpg etc.
       
                          jMyles wrote 1 day ago:
                          ...but the matter of "what the law cares about" is
                          not really the point of contention here - what
                          matters here is what happens in the real world.
                          
                          In the real world, these requests are being made, and
                          servers are generating responses.  So the way to
                          change that is to change the logic of the servers.
       
                            Retric wrote 1 day ago:
                            > In the real world, these requests are being made,
                            and servers are generating responses.
                            
                            Except that’s not the end of the story.
                            
                            If you’re running a scraper and risking serious
                            legal consequences when you piss off someone
                            running a server enough, then it suddenly matters a
                            great deal independent of what was going on up to
                            that point. Having already made these requests
                            you’ve just lost control of the situation.
                            
                            That’s the real world we’re all living in, you
                            can hope the guy running a server is going to play
                            ball but that’s simply not under your control. 
                            Which is the real reason large established
                            companies care about robots.txt etc.
       
                        oytis wrote 1 day ago:
                        Technically, you are not serving anything - it's just
                        voltage levels going up and down with no meaning at
                        all.
       
                        jraph wrote 1 day ago:
                        > There's no such distinct thing as a scraper--and if
                        your mental model tries to distinguish between a
                        scraper and a human user, you're going to be
                        disappointed.
                        
                        I disagree. If your mental model doesn't allow
                        conceptualizing (abusive) scrapers, it is too
                        simplistic to be useful for understanding and dealing
                        with reality.
                        
                        But I'd like to re-state the frame / the concern: it's
                        not about any bot or any scraper, it is about the
                        despicable behavior of LLM providers and their awful
                        scrapers.
                        
                        I'm personally fine with bots accessing my web servers,
                        there are many legitimate use cases for this.
                        
                        > But if you really want to actually limit access to
                        the content, you shouldn't be publishing that content
                        publicly.
                        
                        It is not about denying access to the content to some
                        and allowing access to others.
                        
                        It is about having to deal with abuses.
                        
                        Is a world in which people stop sharing their work
                        publicly because of these abuses desirable? Hell no.
       
                      jack_pp wrote 1 day ago:
                      Here's my analogy: it's like you own a museum and you
                      require entrance by a "secret" password (your user agent
                      filtering or whatnot). The problem is the password is
                      the same for everyone, so would you be surprised when
                      someone figures it out or gets it from a friend and they
                      visit your museum? Either require a fee (processing
                      power, captcha, etc.) or make a private password (auth).
                      
                      It is inherently a cat-and-mouse game that you CHOOSE to
                      play. Implement throttling, auth, captchas, JavaScript
                      challenges, or whatever else, whenever a client consumes
                      too many resources on your server. If the client still
                      chooses to jump through the hoops you implemented, then
                      I don't see any issue. If you still have an issue, then
                      implement more hoops until you're satisfied.
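                      
                      A minimal sketch of the per-client throttling idea, in
                      Python (the rate numbers and names are illustrative,
                      nothing here is prescribed by the thread):
                      
                        # Token-bucket limiter keyed by client IP.
                        import time
                        
                        RATE = 1.0    # refill rate, requests per second
                        BURST = 10.0  # short bursts tolerated
                        buckets = {}  # ip -> (tokens, last seen)
                        
                        def allow(ip, now=None):
                            now = now or time.monotonic()
                            tokens, last = buckets.get(ip, (BURST, now))
                            tokens = min(BURST, tokens + (now - last) * RATE)
                            if tokens < 1.0:
                                buckets[ip] = (tokens, now)
                                return False  # e.g. answer with HTTP 429
                            buckets[ip] = (tokens - 1.0, now)
                            return True
                      
                      Rotating residential IPs defeats a per-IP key, of
                      course, which is why the cat-and-mouse framing keeps
                      coming up.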
       
                        jraph wrote 1 day ago:
                        > Either require a fee (processing power, captcha etc)
                        or make a private password (auth)
                        
                        Well, I shouldn't have to work or make things worse for
                        everybody because the LLM bros decided to screw us.
                        
                        > It is inherently a cat and mouse game that you CHOOSE
                        to play
                        
                        No, let's not reverse the roles and blame the victims
                        here. We sysadmins and authors are willing to share our
                        work publicly to the world but never asked for it to be
                        abused.
       
                          jack_pp wrote 1 day ago:
                          That's like saying you shouldn't have to sanitize
                          your database inputs because you never asked for
                          people to SQL inject your database. This stance is
                          truly mind boggling to me
       
                            catlifeonmars wrote 20 hours 29 min ago:
                            It’s both. You should sanitize your inputs
                            because there are bad actors, but you also
                            categorize attempts to sql inject as abuse and
                            there is legal recourse.
       
                            jraph wrote 20 hours 55 min ago:
                            Would you defend attackers using SQL injections?
                            Because it feels like people here, including you,
                            are defending the LLM scrapers against sysadmins
                            and authors who dare to share their work publicly.
                            
                            Ensuring basic security and robustness of a piece
                            of software is simply not remotely comparable to
                            countering the abuse these LLM companies carry
                            out.
                            
                            But that's not even the point. And preventing SQL
                            injections (through healthy programming practices)
                            doesn't make things worse for any legitimate user
                            either.
       
            whimsicalism wrote 1 day ago:
            There's an evolving morality around the internet that is very, very
            different from the pseudo-libertarian rule of the jungle I was
            raised with. Interesting to see things change.
       
              bigbuppo wrote 10 hours 12 min ago:
              You're very much wrong. Two of the key tenets of libertarianism
              are that your rights end where my nose begins, and respect for
              property rights. If your AI bot is causing problems for me, then
              you should be compensating me for the damage or other expense
              you caused. But the AI bros think they should be able to take
              anything they want whenever they want without compensation, and
              they'll use every single shady behavior they can to make that
              happen. In other words, they're robber barons.
       
              hdgvhicv wrote 1 day ago:
              Based on the comments here, the polite world of the internet
              where people obeyed unwritten best practices is certainly over,
              in favour of "grab what you can, might makes right".
       
                whimsicalism wrote 1 day ago:
                that was never the internet. the old internet was
                “information wants to be free, good luck if you want to
                restrict my access or resharing”
       
              sethhochberg wrote 1 day ago:
              The evolutionary force is really just "everyone else showed up at
              the party". The Internet has gone from a capital-I thing that was
              hard to access, to a little-i internet that was easier to access
              and well known but still largely distinct from the real world, to
              now... just the real world in virtual form. Internet morality
              mirrors real world morality.
              
              For the most part, everybody is participating now, and that
              brings all of the challenges of any other space with everyone's
              competing interests colliding - but fewer established systems of
              governance.
       
            hsbauauvhabzb wrote 1 day ago:
            How else do you tell the bot you do not wish to be scraped? Your
            analogy is lacking - you didn’t order a package, you never wanted
            a package, and the postman is taking something, not leaving it, and
            you’ve explicitly left a sign saying ‘you are not welcome
            here’.
       
              stray wrote 1 day ago:
              You require something the bot won't have that a human would.
              
              Anybody may watch the demo screen of an arcade game for free, but
              you have to insert a quarter to play — and you can have even
              greater access with a key.
              
              > and you’ve explicitly left a sign saying ‘you are not
              welcome here’
              
              And the sign said
              "Long-haired freaky people
              Need not apply"
              So I tucked my hair up under my hat
              And I went in to ask him why
              He said, "You look like a fine upstandin' young man
              I think you'll do"
              So I took off my hat and said, "Imagine that
              Huh, me workin' for you"
       
                michaelt wrote 1 day ago:
                > You require something the bot won't have that a human would.
                
                Is this why the “open web” is showing me a captcha or two,
                along with their cookie banner and newsletter pop up these
                days?
       
                  bigbuppo wrote 10 hours 30 min ago:
                  Up until people started making a big stink about CAPTCHAs
                  being used for unpaid labor at scale, uh, well they had two
                  purposes.
       
              nkrisc wrote 1 day ago:
              Put your content behind authentication if you don’t want it to
              be requested by just anyone.
       
                kelnos wrote 1 day ago:
                But I do want my content accessible to "just anyone", as long
                as they are humans.  I don't want it accessible to bots.
                
                You are free to say "well, there is no mechanism to do that",
                and I would agree with you.  That's the problem!
       
                  nkrisc wrote 13 hours 36 min ago:
                  Even abusive crawlers and scrapers are acting as agents of
                  real humans, just as your browser is acting as your agent. I
                  don't even know how you could reliably draw a reasonable line
                  in the sand between the two without putting some group of
                  people on the wrong side of the line.
                  
                  I suppose the ultimate solution would be browsers and
                  operating systems and hardware manufacturers co-operating to
                  implement some system that somehow cryptographically signs
                  HTTP requests which attests that it was triggered by an
                  actual, physical interaction with a computing device by a
                  human.
                  
                  Though you don't have to think for very long to come up with
                  all kinds of collateral damage that would cause and how bad
                  actors could circumvent it anyway.
                  
                  All in all, this whole issue seems more like a legal problem
                  than a technical one.
       
                    bigbuppo wrote 10 hours 26 min ago:
                    Or the AI people could just stop being abusive jerks.
                    That's an even easier solution.
       
                      nkrisc wrote 8 hours 12 min ago:
                      That would be easier. Too bad it won't ever happen.
       
                      9rx wrote 9 hours 28 min ago:
                      While that is probably good advice in general, the
                      earlier commenter wanted even the abusive jerks to have
                      access to his content.
                      
                      He just doesn't want tools humans use to access content
                      to be used in association with his content.
                      
                      What he failed to realize is that if you eliminate the
                      tools, the human cannot access the content anyway. They
                      don't have the proper biological interfaces. Had he
                      realized that, he'd have come to notice that simply
                      turning off his server fully satisfies the constraints.
       
                  9rx wrote 1 day ago:
                  > as long as they are humans. I don't want it accessible to
                  bots.
                  
                  A curious position. There isn't a secondary species using
                  the internet. There are only humans. Unless you foresee
                  some kind of alien invasion or earthworm uprising, nothing
                  other than humans will ever access your content. Rejecting
                  the tools humans use to bridge their biological gaps is
                  rather nonsensical.
                  
                  > You are free to say "well, there is no mechanism to do
                  that", and I would agree with you. That's the problem!
                  
                  I suppose it would be pretty neat if humans were born with
                  some kind of internet-like telepathy ability, but lacking
                  that mechanism isn't any kind of real problem. Humans are
                  well adept at using tools and have successfully used tools
                  for millennia. The internet itself is a tool! Which, like
                  before, makes rejecting the human use of tools nonsensical.
       
                  1gn15 wrote 1 day ago:
                  What the hell? That is incredibly discriminatory. Fuck off. I
                  support those that counter those discriminatory mechanisms.
       
                    Anamon wrote 16 hours 17 min ago:
                    Discriminatory against bots? That doesn't even make any
                    sense.
       
                      bigbuppo wrote 10 hours 25 min ago:
                      They probably have stock options.
       
              davsti4 wrote 1 day ago:
              It's simple, and I'll quote myself: "robots.txt isn't the law".
       
                bigbuppo wrote 10 hours 28 min ago:
                Violating norms makes you an abusive jerk at best.
       
                ColinWright wrote 1 day ago:
                Quoting Cervisia :
                
                > robots.txt. This is not the law
                
                In Germany, it is the law. § 44b UrhG says (translated):
                
                (1) Text and data mining is the automated analysis of one or
                more digital or digitized works to obtain information, in
                particular about patterns, trends, and correlations.
                
                (2) Reproductions of lawfully accessible works for text and
                data mining are permitted. These reproductions must be deleted
                when they are no longer needed for text and data mining.
                
                (3) Uses pursuant to paragraph 2, sentence 1, are only
                permitted if the rights holder has not reserved these rights. A
                reservation of rights for works accessible online is only
                effective if it is in machine-readable form.
                
                --
                
  HTML          [1]: https://news.ycombinator.com/item?id=45776825
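                
                For reference, the machine-readable reservation most commonly
                pointed to in these discussions is robots.txt; a blanket "no
                crawlers" policy of the kind the article's site declares is
                typically just:
                
                  User-agent: *
                  Disallow: /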
       
              Calavar wrote 1 day ago:
              If you are serving web pages, you are soliciting GET requests,
              kind of like ordering a package is soliciting a delivery.
              
              "Taking" versus "giving" is neither here nor there for this
              discussion. The question is are you expressing a preference on
              etiquette versus a hard rule that must be followed. I personally
              believe robots.txt is the former, and I say that as someone who
              serves more pages than they scrape
       
                pluto_modadic wrote 1 day ago:
                ignoring a rate limit gets you blocked.
       
                  hsbauauvhabzb wrote 1 day ago:
                  Scrapers actively bypass this by rotating IP addresses.
       
                davesque wrote 1 day ago:
                If I order a package from a company selling a good, am I
                inviting all that company's competitors to show up at my
                doorstep to try and outbid the delivery person from the
                original company when they arrive, and maybe they all show up
                at the same time and cause my porch to collapse? No, because my
                front porch is a limited resource for which I paid for an
                intended purpose. Is it illegal for those other people to show
                up? Maybe not by the letter of the law.
       
                kelnos wrote 1 day ago:
                > If you are serving web pages, you are soliciting GET requests
                
                So what's the solution?  How do I host a website that welcomes
                human visitors, but rejects all scrapers?
                
                There is no mechanism!    The best I can do is a cat-and-mouse
                arms race where I try to detect the traffic I don't want, and
                block it, while the people generating the traffic keep getting
                more sophisticated about hiding from my detection.
                
                No, putting up a paywall is not a reasonable response to this.
                
                > The question is are you expressing a preference on etiquette
                versus a hard rule that must be followed.
                
                Well, there really aren't any hard rules that must be followed,
                because there are no enforcement mechanisms outside of going
                nuclear (requiring login).  Everything is etiquette.  And I
                agree that robots.txt is also etiquette, and it is super messed
                up that we tolerate "AI" companies stomping all over that
                etiquette.
                
                Do we maybe want laws that say everyone must respect
                robots.txt?  Maybe?  But then people will just move their
                scrapers to a jurisdiction without those laws.    And I'm sure
                someone could make the argument that robots.txt doesn't apply
                to them because they spoofed a browser user-agent (or another
                user-agent that a site explicitly allows).  So perhaps we have
                a new mechanism, or new laws, or new... something.
                
                But this all just highlights the point I'm making here: there
                is no reasonable mechanism (no, login pages and http auth don't
                count) for site owners to restrict access to their site based
                on these sorts of criteria.  And that's a problem.
       
                andoando wrote 1 day ago:
                Well yes, this is exactly what's happening as of now. But
                there SHOULD be a way to upload content without giving
                scrapers access to it.
       
                munk-a wrote 1 day ago:
                I disagree strongly here - though not from a technical
                perspective.  There's absolutely a legal concept of making
                your work available for viewing without making it available
                for copying, and AI scraping (while we can technically phrase
                it as just viewing a bunch of times) is effectively copying.
                
                Lets say a large art hosting site realizes how damaging AI
                training on their data can be - should they respond by adding a
                paywall before any of their data is visible?  If that paywall
                is added (let's just say $5/mo) can most of the artists
                currently on their site afford to stay there?  Can they afford
                it if their potential future patrons are limited to just those
                folks who can pay $5/mo?  Would the scraper be able to afford a
                one time cost of $5 to scrape all of that data?
                
                I think, as much as they are a deeply flawed concept, this is
                a case where EULAs, or an assumption of no access for training
                unless explicitly granted, actually enforced through the
                legal system, are required.  There are a lot of small
                businesses and side projects that are dying because of these
                models, and I think that creative outlet has societal value we
                would benefit from preserving.
       
                  jMyles wrote 1 day ago:
                  >  There's absolutely a legal concept of making your work
                  available for viewing without making it available for copying
                  
                  This "legal concept" is enforceable through legacy systems of
                  police and violence.  The internet does not recognize it. 
                  How much more obvious can this get?
                  
                  If we stumble down the path of attempting to apply this legal
                  framework, won't some jurisdiction arise with no IP
                  protections whatsoever and just come to completely dominate
                  the entire economy of the internet?
                  
                  If I can spin up a server in copyleftistan with a complete
                  copy of every album and film ever made, available for free
                  download, why would users in copyrightistan use the locked
                  down services of their domestic economy?
       
                    kelnos wrote 1 day ago:
                    > legacy systems of police and violence
                    
                    You use "legacy" as if these systems are obsolete and on
                    their way out.    They're not.  They're here to stay, and
                    will remain dominant, for better or worse.  Calling them
                    "legacy" feels a bit childish, as if you're trying to
                    ignore reality and base arguments on your preferred vision
                    of how things should be.
                    
                    > The internet does not recognize it.
                    
                    Sure it does.  Not universally, but there are a lot of
                    things governments and law enforcement can do to control
                    what people see and do on the internet.
                    
                    > If we stumble down the path of attempting to apply this
                    legal framework, won't some jurisdiction arise with no IP
                    protections whatsoever and just come to completely dominate
                    the entire economy of the internet?
                    
                    No, of course not, that's silly.  That only really works on
                    the margins.  Any other country would immediately slap
                    economic sanctions on that free-for-all jurisdiction and
                    cripple them.  If that fails, there's always a military
                    response they can resort to.
                    
                    > If I can spin up a server in copyleftistan with a
                    complete copy of every album and film ever made, available
                    for free download, why would users in copyrightistan use
                    the locked down services of their domestic economy?
                    
                    Because the governments of all the copyrightistans will
                    block all traffic going in and out of copyleftistan.  While
                    this may not stop determined, technically-adept people, it
                    will work for the most part.  As I said, this sort of thing
                    only really works on the margins.
       
                      jMyles wrote 1 day ago:
                      I guess I'm more optimistic about the future of the human
                      condition.
                      
                      > You use "legacy" as if these systems are obsolete and
                      on their way out. They're not.
                      
                      I have serious doubts that nation states will still exist
                      in 500 years.  I feel quite certain that they'll be gone
                      in 10,000.  And I think it's generally good to build an
                      internet for those time scales.
                      
                      > base arguments on your preferred vision of how things
                      should be.
                      
                      I hope we all build toward our moral compass; I don't
                      mean for arguments to fall into fallacies on this basis,
                      but yeah I think our internet needs to be resilient
                      against the waxing and waning of the affairs of state.
                      I don't
                      know if that's childish... Maybe we need to have a more
                      child-like view of things?  The internet _is_ a child in
                      the sense of its maturation timeframe.
                      
                      > there are a lot of things governments and law
                      enforcement can do to control what people see and do on
                      the internet.
                      
                      Of course there are things that governments do.  But are
                      they effective?  I just returned from a throatsinging
                      retreat in Tuva - a fairly remote part of Siberia.  The
                      Russian government has apparently quietly begun to censor
                      quite a few resources on the internet, and it has caused
                      difficulty in accessing the traditional music of the
                      Tuvan people.  And I was very happily astonished to find
                      that everybody I ran into, including a shaman
                      grandmother, was fairly adept at routing around this
                      censorship using a VPN and/or SSH tunnel.
                      
                      I think the internet is doing a wonderful job at routing
                      around censorship - better than any innovation ever
                      discovered by humans so far.
                      
                      > Any other country would immediately slap economic
                      sanctions on that free-for-all jurisdiction and cripple
                      them. If that fails, there's always a military response
                      they can resort to.
                      
                      Again, maybe I'm just more optimistic, but I think that
                      on longer time frames, the sober elder statesmen/women
                      will prevail and realize that violence is not an
                      appropriate response to bytes transiting the wire that
                      they wish weren't.
                      
                      And at the end of the day, I don't think governments even
                      have the power here - the content creators do.    I
                      distribute my music via free channels because that's the
                      easiest way to reach my audience, and because, given the
                      high availability of compelling free content, there's
                      just no way I can make enough money on publishing to even
                      concern myself with silly restrictions.
                      
                      It seems to me that I'm ahead of the curve in this area,
                      not behind it.    But I'm certainly open to being convinced
                      otherwise.
       
                        dns_snek wrote 17 hours 7 min ago:
                        > Again, maybe I'm just more optimistic, but I think
                        that on longer time frames, the sober elder
                        statesmen/women will prevail and realize that violence
                        is not an appropriate response to bytes transiting the
                        wire that they wish weren't.
                        
                        Your framing is off because this notion of fairness or
                        morality isn't something they concern themselves with.
                        They're using violence because if they didn't, they
                        would be allowing other entities to gain wealth and
                        power at their expense. I don't think it's much more
                        complex than that.
                        
                        See how differently these same bytes are treated in the
                        hands of Aaron Swartz vs OpenAI. One threatened to
                        empower humanity at the expense of reducing profits for
                        a few rich men, so he got crucified for it. The other
                        is hoping to make humans redundant, concentrate the
                        distribution of wealth even further, and strengthen the
                        US world dominance, so all of the right wheels get
                        greased for them and they get a license to kill -
                        figuratively and literally.
       
                          jMyles wrote 12 hours 28 min ago:
                          I mean... I agree with everything you've said here. 
                          I'm not sure what makes you think I've mis-framed the
                          stakes.
       
                yuliyp wrote 1 day ago:
                Having a front door physically allows anyone on the street to
                come to knock on it. Having a "no soliciting" sign is an
                instruction clarifying that not everybody is welcome. Having a
                web site should operate in a similar fashion. The robots.txt is
                the equivalent of such a sign.
       
                  czscout wrote 1 day ago:
                  And a no soliciting sign is no more cosmically binding than
                  robots.txt. It's a request, not an enforceable command.
       
                    hsbauauvhabzb wrote 1 day ago:
                    Tell me you work in an ethically bankrupt industry without
                    telling me you work in an ethically bankrupt industry.
       
                  halJordan wrote 1 day ago:
                  No soliciting signs are polite requests that no one has to
                  follow, and door to door salesman regularly walk right past
                  them.
                  
                  No one is calling for the criminalization of door-to-door
                  sales and no one is worried about how much door-to-door sales
                  increases water consumption.
       
                    distances wrote 17 hours 28 min ago:
                    > No one is calling for the criminalization of door-to-door
                    sales
                    
                    Door-to-door sales absolutely are banned in many
                    jurisdictions.
       
                    duskdozer wrote 21 hours 1 min ago:
                    >No one is calling for the criminalization of door-to-door
                    sales
                    
                    Ok, I am, right now.
                    
                    It seems like there are two sides here that are talking
                    past one another: "people will do X and you accept it if
                    you do not actively prevent it, if you can" and "X is bad
                    behavior that should be stopped and shouldn't be the burden
                    of individuals to stop". As someone who leans to the
                    latter, the former just sounds like restating the problem
                    being complained about.
       
                    ahtihn wrote 1 day ago:
                    If a company was sending hundreds of salesmen to knock at a
                    door one after the other, I'm pretty sure they could
                    successfully get sued for harassment.
       
                      hsbauauvhabzb wrote 1 day ago:
                      Can’t Americans literally shoot each other for
                      trespassing?
       
                        dragonwriter wrote 1 day ago:
                        Generally, legally, no, not just for ignoring a “no
                        soliciting” sign.
       
                          hsbauauvhabzb wrote 22 hours 49 min ago:
                          But they’re presumably trespassing.
       
                            dragonwriter wrote 13 hours 7 min ago:
                            And, despite what ideas you may get from the media,
                            mere trespass without imminent threat to life is
                            not a justification for deadly force.
                            
                            There are some states where the considerations for
                            self defense do not include a duty to retreat if
                            possible, either in general (“stand your ground"
                            law) or specifically in the home (“castle
                            doctrine"), but all the other requirements
                            (imminent threat of certain kinds of serious harm,
                            proportional force) for self-defense remain part of
                            the law in those states, and trespassing by/while
                            disregarding a "no soliciting" sign would not, by
                            itself, satisfy those requirements.
       
                    oytis wrote 1 day ago:
                    >  door to door salesman regularly walk right past them.
                    
                    Oh, now I understand why Americans can't see a problem
                    here.
       
              bakql wrote 1 day ago:
              Stop your http server if you do not wish to receive http
              requests.
       
                bigbuppo wrote 10 hours 29 min ago:
                Ah yes, and unplug the mail server to stop all spam. Great
                idea!
       
                vkou wrote 1 day ago:
                Turn off your phone if you don't want to receive robo-dialed
                calls and unsolicited texts 300 times a day.
                
                Fence off your yard if you don't want people coming by and
                dumping a mountain of garbage on it every day.
                
                You can certainly choose to live in a society that thinks these
                are acceptable solutions. I think it's bullshit, and we'd all
                be better off if anyone doing these things would be breaking
                rocks with their teeth in a re-education camp, until they learn
                how to be a decent human being.
       
          XenophileJKO wrote 1 day ago:
          What about people using an LLM as their web client? Are you now
          saying the website owner should be able to dictate what client I use
          and how it must behave?
       
            grayhatter wrote 1 day ago:
            Yes? I'd suggest that you understand that's not an unreasonable
            expectation either.
            
            Your browser has a bug: if you leave my webpage open in a tab,
            it's going to close the connection, reconnect (new TLS handshake
            and everything), and re-request that page without any cache tag,
            every second, every day, for as long as you have the tab open.
            
            That feels kinda problematic, right?
            
            Web servers block well-formed clients all the time, and I agree
            with you, that's dumb. But servers should be allowed to serve only
            the traffic they wish. If you want to use some LLM client, but the
            way that client behaves puts undue strain on my server, what
            should I do, just accept that your client, and by proxy you, are
            being an asshole?
            
            You shouldn't put your rules on my webserver, exactly as much as
            my webserver shouldn't put my rules on yours. But I believe that,
            ethically, we should both attempt to respect and follow the rules
            of the other, blocking traffic when it starts to behave abusively.
            It's not complex: just try to be nice and help the other as much
            as you reasonably can.
       
            aDyslecticCrow wrote 1 day ago:
            > Are you now saying the website owner should be able to dictate
            what client I use and how it must behave?
            
            Already pretty well established with ad-blocking, actually. It's
            a pretty similar case even. AIs don't click ads, so why should we
            accept their traffic? If something disproportionately loads the
            server without contributing to the funding of the site, it gets
            blocked.
            
            The server can set whatever rules it wants. If the maintainer hates
            google and wants to block all chrome users, it can do so.
       
              XenophileJKO wrote 1 day ago:
              That was kind of what I was really hinting at, as the HN
              community tends to embrace things like ad blockers and archive
              links on stories, but god forbid someone read a site using an
              LLM.
       
                aDyslecticCrow wrote 1 day ago:
                I use adblock myself, and don't feel bad for using it (it's a
                security and privacy tool). But I don't blame websites that
                kick me out for it; hosting costs money.
                
                Server owners should have all the right to set the terms of
                their server access. Better tools to control LLMs and scrapers
                are all good in my book.
                
                I really wish ad platforms were better at managing malware,
                trackers and fraud, though. It is rather difficult to fully
                argue for website owner authority with how bad ads actually
                are for the user.
       
                1gn15 wrote 1 day ago:
                Humans are usually hypocritical. They support whatever they
                personally use while opposing whatever inconveniences them,
                even though they're basically the same thing.
                
                This whole thing has made me hate humans, so so much. Robots
                are much better.
       
        sharkjacobs wrote 1 day ago:
        Fun to see practical applications of interesting research[1]
        
  HTML  [1]: https://news.ycombinator.com/item?id=45529587
       
        OhMeadhbh wrote 1 day ago:
        I blame modern CS programs that don't teach kids about parsing.  The
        last time I looked at some scraping code, the dev was using regexes to
        "parse" html to find various references.
        
        Maybe that's a way to defend against bots that ignore robots.txt:
        include a reference to a honeypot HTML file with garbage text, but
        put the link to it in a comment.
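        
        A rough sketch of that honeypot idea in Python (the decoy path and
        responses are made up for illustration):
        
          # Serve a page whose only "commented-out" link points at a decoy
          # path; any client requesting the decoy is treating raw markup as
          # a link source and can be logged or blocked.
          from http.server import BaseHTTPRequestHandler, HTTPServer
          
          PAGE = (b"<!doctype html><html><body>"
                  b"<p>Real content.</p>"
                  b'<!-- <a href="/honeypot-9f3b.html">old</a> -->'
                  b"</body></html>")
          
          class Handler(BaseHTTPRequestHandler):
              def do_GET(self):
                  if self.path.startswith("/honeypot-"):
                      print("likely bot:", self.client_address[0],
                            self.headers.get("User-Agent"))
                      self.send_response(403)
                      self.end_headers()
                      return
                  self.send_response(200)
                  self.send_header("Content-Type", "text/html")
                  self.end_headers()
                  self.wfile.write(PAGE)
          
          HTTPServer(("", 8080), Handler).serve_forever()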
       
          mrweasel wrote 18 hours 23 min ago:
          You don't need to teach parsing; that won't help much anyway. We
          need to teach people to be good netizens again. I'd argue that it
          was always viewed as reasonable to scrape content, as long as you
          didn't misrepresent the content as your own and you scraped
          responsibly, backing off if the server started to slow down, or
          simply not crawling too fast to begin with.
          
          Currently we have at least three problems:
          
          1) Companies have no issue with not providing sources and not linking
          back.
          
          2) There are too many scrapers; even if they all behaved, some
          sites would struggle to handle all of them.
          
          3) Scrapers go full throttle 24/7, expecting the sites to
          rate-limit them if they are going too fast. They hammer a site
          into the ground, wait until it's back and hammer it again,
          grabbing what they can before it crashes once more.
          
          There's no longer a sense of the internet being for all of us and
          that we need to make room for each other. Website / human generated
          content exists as a resource to be strip mined.
       
          mikeiz404 wrote 1 day ago:
          It’s been some time since I have dealt with web scrapers, but it
          takes fewer resources to run a regex than it does to parse the DOM
          (which may have syntactically incorrect parts anyway). This can
          add up when running many scraping requests in parallel. So,
          depending on your goals, using a regex can be much preferable.
       
          vaylian wrote 1 day ago:
          The people who do this type of scraping to feed their AI are probably
          also using AI to write their scraper.
       
          ericmcer wrote 1 day ago:
          How would you recommend doing it? If I was just trying to pull <a>
          tag links out, I feel like treating it like text and using a regex
          would be way more efficient than a full-on HTML parser like JSDom
          or something.
       
            singron wrote 1 day ago:
            You don't need javascript to parse HTML. Just use an HTML parser.
            They are very fast. HTML isn't a regular language, so you can't
            parse it with regular expressions.
            
            Obligatory:
            
  HTML      [1]: https://stackoverflow.com/questions/1732348/regex-match-op...
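            
            To illustrate with the Python standard library (one parser among
            many, and evidently not what the bots in the article use):
            
              # Collect href/src attributes; links that only appear inside
              # <!-- comments --> never reach handle_starttag, so they are
              # not extracted.
              from html.parser import HTMLParser
              
              class LinkExtractor(HTMLParser):
                  def __init__(self):
                      super().__init__()
                      self.links = []
              
                  def handle_starttag(self, tag, attrs):
                      for name, value in attrs:
                          if name in ("href", "src") and value:
                              self.links.append(value)
              
              p = LinkExtractor()
              p.feed('<a href="/real">x</a> <!-- <a href="/old">y</a> -->')
              print(p.links)  # ['/real']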
       
              zahlman wrote 1 day ago:
              The point is: if you're trying to find all the URLs within the
              page source, it doesn't really matter to you what tags they're
              in, or how the document is structured, or even whether they're
              given as link targets or in the readable text or just what.
       
          tuwtuwtuwtuw wrote 1 day ago:
          Do you think that if some CS programs taught parsing, the authors of
          the bot would parse the HTML to properly extract links, instead of
          just doing plain text search?
          
          I doubt it.
       
        latenightcoding wrote 1 day ago:
        When I used to crawl the web, battle-tested Perl regexes were more
        reliable than anything else; commented-out URLs would have been added
        to my queue.
       
          rightbyte wrote 1 day ago:
          DOM navigation for fetching some data is for tryhards. Using a
          regex to grab the correct paragraph or div or whatever is fine and
          is more robust against things moving around on the page.
       
            horseradish7k wrote 1 day ago:
            But not when crawling. You don't know the page format in advance -
            you don't even know what the page contains!
       
            chaps wrote 1 day ago:
            Doing both is fine! Just, once you've figured out your regex and
            such, hardening/generalizing demands DOM iteration. It sucks but
            it is what it is.
       
        Noumenon72 wrote 1 day ago:
        It doesn't seem that abusive. I don't comment things out thinking "this
        will keep robots from reading this".
       
          mostlysimilar wrote 1 day ago:
          The article mentions using this as a means of detecting bots, not as
          a complaint that it's abusive.
          
          EDIT: I was chastised, here's the original text of my comment: Did
          you read the article or just the title? They aren't claiming it's
          abusive. They're saying it's a viable signal to detect and ban bots.
       
            ang_cire wrote 1 day ago:
            They call the scrapers "malicious", so they are definitely
            complaining about them.
            
            > A few of these came from user-agents that were obviously
            malicious:
            
            (I love the idea that they consider any python or go request to be
            a malicious scraper...)
       
            woodrowbarlow wrote 1 day ago:
            the first few words of the article are:
            
            > Last Sunday I discovered some abusive bot behaviour [...]
       
              foobarbecue wrote 1 day ago:
              Yeah but the abusive behavior is ignoring robots.txt and scraping
              to train AI. Following commented URLs was not the crime, just
              evidence inadvertently left behind.
       
              mostlysimilar wrote 1 day ago:
              > The robots.txt for the site in question forbids all crawlers,
              so they were either failing to check the policies expressed in
              that file, or ignoring them if they had.
       
            pseudalopex wrote 1 day ago:
            Please don't comment on whether someone read an article. "Did you
            even read the article? It mentions that" can be shortened to "The
            article mentions that".[1]
            
  HTML      [1]: https://news.ycombinator.com/newsguidelines.html
       
          michael1999 wrote 1 day ago:
          Crawlers ignoring robots.txt is abusive.  That they then start
          scanning all docs for commented-out URLs just adds to the pile of
          scummy behaviour.
       
            tveyben wrote 1 day ago:
            Human behavior is interesting - me, me, me…
       
        rokkamokka wrote 1 day ago:
        I'm not overly surprised, it's probably faster to search the text for
        http/https than parse the DOM
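        
        For the sake of illustration, the plaintext approach is roughly this
        (a sketch in Python, not whatever the actual scrapers run):
        
          # Harvest anything URL-shaped from the raw markup; since the DOM
          # is never built, URLs inside <!-- comments --> are picked up too.
          import re
          
          html = ('<a href="https://example.com/page">link</a>\n'
                  '<!-- <script src="https://example.com/x.js">'
                  '</script> -->')
          
          URL_RE = re.compile(r'https?://[^\s"\'<>]+')
          print(URL_RE.findall(html))
          # ['https://example.com/page', 'https://example.com/x.js']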
       
          marginalia_nu wrote 19 hours 51 min ago:
          The regex approach is certainly easier to implement, but honestly
          static DOM parsing is pretty cheap too, just quite fiddly to get
          right.  You're probably gonna be limited by network congestion (or
          ephemeral ports) before you run out of CPU time doing this type of
          crawling.
       
          embedding-shape wrote 1 day ago:
          Not "probably": searching through plaintext (which they seem to be
          doing) vs iterating over the DOM involve vastly different amounts
          of work in terms of resources used and performance, so "probably"
          is way underselling the difference :)
       
            franktankbank wrote 1 day ago:
            Reminds me of the shortcut that works for the happy path but is
            utterly fucked by real data.  This is an interesting trap, can it
            easily be avoided without walking the DOM?
       
              embedding-shape wrote 1 day ago:
              Yes, parse out HTML comments, which is also kind of trivial if
              you've ever done any sort of parsing: listen for "<!--" and
              "-->". But then again, these people are using AI to build
              scrapers, so I wouldn't put too much pressure on them to
              produce high-quality software.
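              
              Something along these lines (a naive sketch; note the caveat
              in the reply below):
              
                # Drop <!-- ... --> spans before harvesting URLs, so
                # commented-out links are ignored.
                import re
                
                def strip_comments(html):
                    return re.sub(r'<!--.*?-->', '', html, flags=re.S)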
       
                jcheng wrote 1 day ago:
                It's not quite as trivial as that; one could start the page
                with a <script> tag that contains "<!--", and that would hide
                all the content from your scraper but not from real browsers.
                
                But I think it's moot, parsing HTML is not very expensive if
                you don't have to actually render it.
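                
                Concretely, with the naive stripper sketched above (purely
                illustrative):
                
                  import re
                  
                  page = ('<script>var s = "<!--";</script>'
                          '<p>real content</p>'
                          '<!-- a real comment -->')
                  print(re.sub(r'<!--.*?-->', '', page, flags=re.S))
                  # -> '<script>var s = "'  (the real content is gone)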
       
                stevage wrote 1 day ago:
                Lots of other ways to include URLs in an HTML document that
                wouldn't be visible to a real user, though.
       
       
   DIR <- back to front page