In 1847, French composer Ernest Bourget stirred controversy over copyright by refusing to pay for his drink at a café, arguing that because the café’s musicians had played his music, the café owed him. Every epoch introduces new questions about copyright, from photography and recorded music to radio, home videos, and now, generative AI.
AI models such as ChatGPT and Midjourney discern patterns from massive datasets. Though OpenAI’s exact data sources for GPT-4 remain undisclosed, similar projects draw on varied resources, from Common Crawl’s public web archive to Project Gutenberg’s books and archives of scientific papers. While much of this content is freely accessible, much of it is also copyrighted.
Unlike traditional data retrieval, these AI systems don’t merely reproduce a single piece of data from their training set. Instead, they generate output from patterns learned across vast numbers of examples. As Tim O’Reilly once stated, ‘data isn’t oil – data is sand,’ underscoring how little any individual data point matters in the aggregate.
Current copyright laws struggle to accommodate these models, because they don’t reproduce singular items but generate new content based on an aggregate. As with the Ernest Bourget incident, a new way of using creative work raises complex issues, here about AI, copyright, and even brand identity when AI recreates or imitates known styles or personas.
The ‘freely available’ nature of the training data adds another layer of complexity. As reported by the FT, OpenAI and Google are in discussions with newspaper publishers about paying for access to training data. While news platforms may not have a valid argument for payment based on appearing in search results, they have a point if AI can synthesize news from multiple sources and so divert traffic away from the original sites. This could extend to other specialized domains as well.
In essence, OpenAI is using massive amounts of data not to build a database but to try to automate the creation of intelligence. Framing this purely as a copyright issue may therefore be overly simplistic. The uncharted waters of AI and copyright are yet to be fully navigated, leaving us with more questions than answers.