AI program that can imitate Shakespeare replicated by two graduates
When OpenAI developed software powerful enough to write hyper-realistic fake news, it decided not to release it to the public for fear it could be exploited. But then, two graduates built a version of the program using publicly available code, prompting debate over ownership in the murky world of open-source software
Consider these two passages: “For in that sleep of death, what dreams may come when we have shuffled off this mortal coil.” And: “If death, in some obscure and distant hour, strikes me still as I slept, if I yet dream.”
While William Shakespeare wrote the first, an ultra-powerful artificial intelligence (AI) program imitated the Bard’s style in the second – and yet, the two are almost indistinguishable.
The program, called GPT-2, was developed by AI lab OpenAI in February 2019. It can write hyper-realistic text in any format, using nothing but a few words as a prompt. However, having perfected the software, OpenAI elected not to release it over concerns that it could be used to mass-produce fake news or propaganda. Instead, the company released a watered-down version of the program, along with a scientific paper explaining the significance of what it had created.
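As an illustration of how little input such a system needs, the sketch below shows prompt-based generation in Python. It assumes the openly available “gpt2” checkpoint and the Hugging Face transformers library; the article does not specify any particular tooling.

# A minimal sketch of prompt-driven text generation, not OpenAI's own
# release. The "gpt2" checkpoint and the transformers library are
# assumptions; the article names no specific tooling.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

# A few words are enough to prompt a full passage of generated text.
prompt = "For in that sleep of death"
result = generator(prompt, max_new_tokens=40, num_return_sequences=1)
print(result[0]["generated_text"])

Each run continues the prompt differently; the realism of the output depends on the size of the model and the volume of text it was trained on.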
OpenAI considered its decision to be watertight: it had protected the public from the program’s potentially nefarious impact. It did not foresee that two computer scientists from Brown University would be able to recreate the program and publish it online for anyone to download – but that’s exactly what Aaron Gokaslan and Vanya Cohen did.
The duo, aged 23 and 24, recreated OpenAI’s text-generation software using code available in the public domain. They then trained it on millions of web pages, using $50,000 worth of free cloud computing from Google. Unlike OpenAI, they, along with other members of the AI community, do not believe the program poses a danger to society.
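In broad strokes, that training step amounts to standard causal-language-model training over a large web-text corpus. The sketch below is illustrative only: the file name, model size and hyperparameters are assumptions, not the pair’s actual configuration.

# Illustrative causal-language-model training on scraped web text.
# "webtext.txt", the model size and all hyperparameters are assumed
# for this sketch; they are not Gokaslan and Cohen's actual setup.
from transformers import (DataCollatorForLanguageModeling, GPT2LMHeadModel,
                          GPT2TokenizerFast, Trainer, TrainingArguments)
from datasets import load_dataset

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Stand-in for the millions of web pages used as training data.
dataset = load_dataset("text", data_files={"train": "webtext.txt"})["train"]
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="gpt2-webtext", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False))
trainer.train()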
“OpenAI’s claim that it is somehow dangerous to release the code is just a new iteration of the ‘security through obscurity’ argument,” said Mark Taylor, CEO of Sirius and founder of the UK’s Open Source Consortium. “The argument has rightly been ridiculed widely, and has by definition been disproved by the two graduates who replicated it.”
OpenAI’s decision not to release the full program was further criticised for running counter to the collaborative attitude that typically pervades the development of new code. By releasing a slimmed-down version of the code and declaring the program dangerous, the company not only invited potential copycat versions, but also ignited a debate about the power that comes with ownership of such a formidable tool.
Joint effort
Ownership of any kind of code can be attributed in two ways: through either a proprietary or an open-source licence. “The key difference is who is allowed to see it and modify it,” explained Taylor. “Proprietary licences treat code as ‘secret sauce’ and only allow the owner to modify and improve it, whereas open-source licences take the scientific approach of not only allowing others to modify, but also encouraging peer review.”
The latter type is preferred by developers, particularly within the AI community, as the nascent nature of much of the technology means it is highly unlikely to be perfect first time round. Making code public allows others to modify it and improves its quality over time. “No matter how good an engineer, people are fallible, and if other experts can check their work, errors can be identified and corrected,” Taylor told The New Economy. “One never knows where or from whom the next genius idea will come… Access to, and the ability to modify or contribute to, source code means a product can be rapidly improved and even transformed into class-leading software.”
Cohen also argued in an interview with Wired that, had OpenAI released the full version of the code rather than declaring it dangerous, it would not have drawn so much attention to its invention. He told the magazine that his recreation “allows everyone to have an important conversation about security, and researchers to help secure against future potential abuses”.
Power sharing
Nevertheless, some argue that freely releasing code can create public safety issues, as someone with malicious motives could weaponise it. But Taylor rejects this argument. “All tools can be exploited or used for a detrimental purpose,” he told The New Economy. “Tools themselves are neutral – it is the intent of the user that matters.” In other words, he doesn’t see potential safety concerns as a justifiable reason for keeping code out of the public domain, especially when releasing it has numerous other innovation-related benefits.
What’s more, the alternative to open source is for companies to keep in-house teams of data scientists operating entirely within the limitations of one firm. This raises concerns about the potentially dangerous implications of concentrating so much power in the hands of a single company. “History tells us that centralising power rarely works out well,” said Taylor. “Proponents of this sort of proprietary ownership are effectively a priesthood, imploring others to trust [that they will use the code responsibly].” When code is in the public domain, anyone can see how it is being used, making it much easier to identify cases where it is being put to malicious use.
Since Cohen and Gokaslan released their recreation of GPT-2 into the public domain, OpenAI has announced it is aware of at least five other copies of the program. It’s unclear whether the firm plans to take any action toward removing these versions, or whether it even has the power to. “If you train a program using someone else’s data, the original owner of that data would normally have a claim to at least part ownership of it,” explained John-Paul Rooney, a partner and patent attorney at intellectual property firm Withers & Rogers. “However, if the data was made freely available as part of an open-source agreement, then its original owner would forfeit any right of ownership, unless there were specific conditions of use written into the open-source agreement. In this case, it seems likely that OpenAI was aware that it was forgoing any right of ownership to programs produced as a result of releasing its datasets.”
Given that it willingly released a slimmed-down version of the GPT-2 code, it is highly unlikely that OpenAI could successfully pursue any legal recourse over the replications. The firm has also said that it plans to release the full version of the program itself at a later date, undermining its earlier statements about safety concerns. This raises the question of why it held the program back in the first place. The answer could simply lie in the media coverage that has surrounded the staged release: by taking a non-traditional path, the firm has piqued the interest of the AI community, meaning more developers have viewed and worked on the code than if the company had quietly released the entire program straight off the bat. Perhaps that was OpenAI’s plan all along.