Code generation tools: a degenerative disease for free and open source software

There have been some conversations in the Scientific Python community over the past several months about what, if anything, should be done regarding contributions that utilize code generation tools. In preparation for next week's Developer Summit, Matthew Brett opened an issue to begin drafting a SPEC to help the community navigate this terrain. What follows is my contribution on that thread.

Any project that accepts code generated by tools that violate license terms dilutes and invalidates its own license.

These include approximately all of the automated tools in use today, since even training sets that exclude GPL-licensed code nevertheless violate the terms of the BSD and MIT licenses, which require attribution.

I think one way to address this would be attestations in the form of checkboxes and fill-in-the-blank fields, akin to those in Contributor License Agreements (CLAs), that state either "I did not use code-generation tools in the production of the code submitted here" or "The code-generation tools I used were trained only on public-domain code, code under no-attribution licenses such as zero-clause BSD or zero-clause MIT, or code within this project", with a fill-in-the-blank requirement to specify the models that were used. In particular, the attestation should reject, in no uncertain terms, contributions that utilized any of the popular tools in use today, which include GitHub Copilot, ChatGPT, Claude, Llama, Mistral, Replit Code, StarCoder, and Code Llama.
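As a rough illustration of how such an attestation might be checked mechanically, here is a minimal Python sketch. It assumes the attestation appears in a pull request description as Markdown checkboxes plus a "Models used:" line; that layout, the exact wording of the two statements, and names such as `check_attestation` are hypothetical choices for this example, not part of any existing SPEC or tool. The rejected-tool list is taken from the names above.

```python
import re
from dataclasses import dataclass

# Hypothetical attestation wording, paraphrased from the post.
# A real project would pick its own text.
NO_TOOLS = "I did not use code-generation tools"
RESTRICTED_TRAINING = "The code-generation tools I used were trained only on"

# Tools the post says should be rejected outright (matched case-insensitively;
# "llama" also covers "Code Llama" and "CodeLlama").
REJECTED_TOOLS = ("copilot", "chatgpt", "claude", "llama",
                  "mistral", "replit code", "starcoder")

# A checked Markdown checkbox: "- [x] some statement"
CHECKED = re.compile(r"^[ \t]*[-*][ \t]*\[[xX]\][ \t]*(.+)$", re.MULTILINE)
# Assumed fill-in-the-blank line: "Models used: <names>"
MODELS = re.compile(r"^[ \t]*Models used:[ \t]*(.*)$",
                    re.MULTILINE | re.IGNORECASE)


@dataclass
class Result:
    accepted: bool
    reason: str


def check_attestation(pr_body: str) -> Result:
    """Check that exactly one attestation is ticked and the blank is filled."""
    checked = [m.group(1).strip() for m in CHECKED.finditer(pr_body)]
    no_tools = any(c.startswith(NO_TOOLS) for c in checked)
    restricted = any(c.startswith(RESTRICTED_TRAINING) for c in checked)

    # Both ticked or neither ticked is an invalid attestation.
    if no_tools == restricted:
        return Result(False, "exactly one attestation box must be checked")
    if no_tools:
        return Result(True, "no code-generation tools used")

    match = MODELS.search(pr_body)
    models = match.group(1).strip() if match else ""
    if not models:
        return Result(False, "models used must be listed")

    lowered = models.lower()
    for tool in REJECTED_TOOLS:
        if tool in lowered:
            return Result(False, f"rejected tool listed: {tool}")
    return Result(True, f"attested models: {models}")
```

A check like this only confirms that the form was filled in consistently. It cannot verify that the attestation is true, which is why the statement itself, as with a CLA, is what carries the weight.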

Projects without such a direct approach are open to drifting and sliding into a post-legal world. It doesn't matter what license they use, as they lack the means to assert their own license due to a lack of provenance of their incoming contributions.

You can find this post on LinkedIn, Bsky, Twitter, or Mastodon.