@hongminhee I think you're giving the AI companies too much credit and goodwill. The technology itself may have its uses, but their total lack of respect is not the only problem. The energy required for training and inference, and the environmental impact that comes with it, will not be addressed by freeing the models.

But... that's probably worth another blog post. Nevertheless, I'd like to address a few things here:

OpenAI and Anthropic have already scraped what they need.

They did not. I'm receiving 3+ million requests a day from Anthropic's ClaudeBot, and about 1 million a day from OpenAI: see the stats on @iocaine. If they had already scraped everything they need, they wouldn't keep crawling this aggressively, would they?

They need new content to "improve" the models. They need fresh data to scrape now more than ever, because as the internet fills up with slop, legitimate human work to train on becomes even more desirable.

Heck, I've seen scraping waves where my sites received over 100 million requests in a single day! I do not know which companies were responsible (though I have my suspicions), but they most definitely do not have all the data they need.

And I'm hosting small, tiny things, nothing of much importance. I imagine juicier targets like Codeberg receive a whole lot more of this.
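For context on how numbers like these can be gathered: below is a minimal sketch that tallies requests per known AI crawler from a standard Combined Log Format access log. The log path and the exact crawler list are illustrative assumptions on my part, not the actual setup behind the stats above (those come from iocaine).

```python
#!/usr/bin/env python3
"""Tally requests per AI crawler from a web server access log.

A minimal sketch: assumes Combined Log Format, where the user agent
is the last double-quoted field on each line. The log path and the
crawler list are assumptions for illustration only.
"""
import re
from collections import Counter

LOG_PATH = "/var/log/nginx/access.log"  # assumption: adjust to your server

# Substrings that identify some well-known AI crawlers in User-Agent headers.
AI_CRAWLERS = ["ClaudeBot", "GPTBot", "CCBot", "Google-Extended", "Bytespider"]

# The user agent is the final "..." field in Combined Log Format.
UA_RE = re.compile(r'"([^"]*)"\s*$')

counts = Counter()
with open(LOG_PATH, encoding="utf-8", errors="replace") as log:
    for line in log:
        match = UA_RE.search(line)
        if not match:
            continue
        user_agent = match.group(1)
        for bot in AI_CRAWLERS:
            if bot in user_agent:
                counts[bot] += 1
                break

for bot, n in counts.most_common():
    print(f"{bot}: {n} requests")
```

Run daily against a rotated log and the per-crawler request counts fall out directly; it says nothing about requests from bots that spoof browser user agents, which is its own problem.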

GitHub already has everyone's code.

GitHub has a lot of code, but far from everyone's. And we shouldn't give them more to exploit. Just because they've exploited us for the past 10 years doesn't mean we should "accept reality" and let them continue.

Then, you go and ponder licensing: it doesn't matter. See the beginning of my blog post:

None of the major models keep attribution properly, and their creators and the proponents of these “tools” assert that they do not need to. By the nature of the training, the models recycle and remix, and no substantial code is emitted from the original as-is, only small parts that are not copyrightable in themselves. As such, the outputs do not constitute derived works, and no attribution is necessary.

No matter how you word your licensing, as long as they can argue that training emits only uncopyrightable fragments through remixing and recycling, your license is irrelevant.

You can try explicitly allowing training only when the weights are released - they will not care. Once they deem your code uncopyrightable, they can do whatever they want, as they already do.

You assume these companies behave ethically. They do not.

See the recentish Anthropic vs Authors case: Anthropic was fined not because they violated copyright, but because they sourced the books illegally. The copyright violation was dismissed.

Why do you think applying a different license would help, when there's existing legal precedent that it does fuck all?

Also, releasing the weights is... insufficient. Important, but insufficient. To free a model, you also need the training data; it should be considered part of the model's source, because without it you cannot reproduce the model.

Good luck with that. There is no scenario where surrendering to this "new reality" plays out well.

It really is quite simple: as we do not negotiate with fascists, we do not negotiate with AI companies either.

@algernon @iocaine Thank you for taking the time to engage with my piece and for sharing your concrete experience with aggressive crawling. The scale you describe, 3+ million daily requests from ClaudeBot alone, makes the problem tangible in a way abstract discussion doesn't.

Where we agree: AI companies don't behave ethically. I don't assume they do, and I certainly don't expect them to voluntarily follow rules out of goodwill. The environmental costs you mention are real and serious concerns that I share. And your point about needing training data alongside weights for true reproducibility is well-taken—I should have been more explicit about that.

On whether they've “scraped everything”

I overstated this point. When I said they've already scraped what they need, I was making a narrower claim than I stated: that the major corporations have already accumulated sufficient training corpora that individual developers withdrawing their code won't meaningfully degrade those models. Your traffic numbers actually support this—if they're still crawling that aggressively, it means they have the resources and infrastructure to get what they want regardless of individual resistance.

But you raise an important nuance I hadn't fully considered: the value of fresh human-generated content in an internet increasingly filled with synthetic output. That's a real dynamic worth taking seriously.

On licensing strategy

I hear your skepticism about licensing, and the Anthropic case you cite is instructive. But I think we may be drawing different conclusions from it. Yes, the copyright claim was dismissed while the illegal sourcing claim succeeded—but this tells me that legal framing matters. The problem isn't that law is irrelevant; it's that current licenses don't adequately address this use case.

I'm not suggesting a new license because I believe companies will voluntarily comply. I'm suggesting it because it changes the legal terrain. Right now, they can argue—as you note—that training doesn't create derivative works and thus doesn't trigger copyleft obligations. A training-specific copyleft wouldn't eliminate violations, but it would make them explicit rather than ambiguous. It would create clearer grounds for legal action and community pressure.

You might say this is naïve optimism about the law, but I'd point to the GPL's history. It, too, faced the critique that corporations would simply ignore it. They didn't always comply voluntarily, but the license created the framework for both legal action and social norms that, over time, did shape behavior. Imperfectly, yes, but meaningfully.

The strategic question I'm still wrestling with

Here's where I'm genuinely uncertain: even if we grant that licensing won't stop corporate AI companies (and I largely agree it won't, at least not immediately), what's the theory of victory for the withdrawal strategy?

My concern—and I raise this not as a gotcha but as a genuine question—is that OpenAI and Anthropic already have their datasets. They have the resources to continue acquiring what they need. Individual developers blocking crawlers may slow them marginally, but it won't stop them. What it will do, I fear, is starve open source AI development of high-quality training data.

The companies you're fighting have billions in funding, massive datasets, and legal teams. Open-weight projects like Llama or Mistral, and the broader ecosystem of researchers trying to build non-corporate alternatives, don't. If the F/OSS community treats AI training as inherently unethical and withdraws its code from that use, aren't we effectively conceding the field to exactly the corporations we oppose?

This isn't about “accepting reality” in the sense of surrender. It's about asking: what strategy actually weakens corporate AI monopolies versus what strategy accidentally strengthens them? I worry that withdrawal achieves the latter.

On environmental costs and publicization

Freeing model weights alone doesn't solve environmental costs, I agree. But I'd argue that publicization of models does address this, though perhaps I didn't make the connection clear enough.

Right now we have competitive redundancy: every major company training similar models independently, duplicating compute costs. If models were required to be open and collaborative development were the norm, we'd see less wasteful duplication. This is one reason why treating LLMs as public infrastructure rather than private property matters: not just for access, but for efficiency.

The environmental argument actually cuts against corporate monopolization, not for it.

A final thought

I'm not advocating negotiation with AI companies in the sense of compromise or appeasement. I'm advocating for a different field of battle. Rather than fighting to keep them from training (which I don't believe we can win), I'm suggesting we fight over the terms: demanding that what's built from our commons remains part of the commons.

You invoke the analogy of not negotiating with fascists. I'd push back gently on that framing—not because these corporations aren't doing real harm, but because the historical anti-fascist struggle wasn't won through withdrawal. It was won through building alternative power bases, through organization, through creating the structures that could challenge and eventually supplant fascist power.

That's what I'm trying to articulate: not surrender to a “new reality,” but the construction of a different one—one where the productive forces of AI are brought under collective rather than private control.

I may be wrong about the best path to get there. But I think we share the destination.
