   _____ __________  ______
  / ___// ____/ __ \/ ____/______  ______ ___  ____
  \__ \/ __/ / / / / / __/ ___/ / / / __ `__ \/ __ \
 ___/ / /___/ /_/ / /_/ / /  / /_/ / / / / / / /_/ /
/____/_____/\____/\____/_/   \__,_/_/ /_/ /_/ .___/
                                           /_/

╭⋟─────────────────────────────────────────────────────────────────────╮
|                                                                      |
|  TITLE: LLMs and Gopher                                              |
|                                                                      |
|  DATE: May 14, 2025                                                  |
|                                                                      |
|  AUTHOR: grump@seogrump.com                                          |
|                                                                      |
╰─────────────────────────────────────────────────────────────────────⋞╯

I was thinking about LLMs yesterday, because it's impossible not to if
you're in my line of work. I was working on content for a client that
likes really long, detailed articles, and they prefer to provide
outlines listing all of the things they'd like me to cover. I can tell
that the outlines are AI-generated. That's not a bad thing in itself;
organization is one thing that LLMs do very well.

This particular outline, however, contained a phrase that I know
originated from me. I see my own phrases in AI content fairly often,
but this is the first time I've ever seen one in an assignment from a
client. It was a pretty strange feeling.

One of the reasons I've enjoyed my experiences with gopher so much is
that it feels like a "safe" place on the Internet, not yet polluted by
AI. Like so many other creators, I'm pretty peeved about my work being
scraped for someone else's gain.

It occurred to me, though, that there's actually nothing stopping AI
companies from scraping gopher for content. Gopher is a simple
protocol. Ask ChatGPT how to create a gophermap, and it'll give you the
right answer.

There's vastly more content on the web, of course, but AI companies are
becoming increasingly desperate for more human-generated content with
which to feed their models. That's why Microsoft, Apple and all the
other Big Tech companies are making their AI training opt-out rather
than opt-in. It's becoming harder to find authentic human content that
hasn't been scraped already.

The web is more gummed up with AI-generated content than a lot of
people realize. Even former bastions of authentic human-written content
like Reddit are now full of AI text. LLMs are notoriously bad at
telling the difference between AI- and human-generated text -- and if
you train an AI model on AI-generated text, it eventually leads to the
collapse of the model (garbage in, garbage out).

Does the volume of text on gopher compare to what's on the web?
Obviously not. It's a vast trove of authentic human content, though,
and scraping it would be trivial. I wonder if it's already happened.

╰─────────────────────────────────────────────────────────────────────⋞╯