Open internet, web scraping and AI: the unbreakable link
Last year, the Internet Archive (IA), a nonprofit digital archive, lost in federal district court (Hachette v. Internet Archive) against four major publishers, who sued the IA over its decision to act as a digital library during the pandemic and let more than one person borrow a single copy of a book at a time.
Whether it was an ethical decision, and who is right in this battle, remains to be seen: the publishers, who are using the existing provisions of copyright law to their advantage, or the IA, which argues that current copyright law is outdated and does not meet the demands of digital societies. The IA appealed its loss to the Second Circuit Court, a decision supported by many authors themselves.
However, the IA case points to a broader issue: the struggle to keep information accessible on a free and open internet. In recent years, this mission has become increasingly complicated as lawsuits mount against artificial intelligence companies that collect web data for algorithmic training, contextual advertising services that analyze public data to understand the content of various sites, and even nonprofit organizations that collect web data for socially driven purposes. Earlier this year, X sued the Center for Countering Digital Hate and lost the case.
Although on the surface it is presented as a battle over data ownership, it is usually a battle over the distribution of the monetary gains that a growing digital economy provides. Without rethinking current compensation mechanisms, this battle could end in nothing more than a fragmented society, a proliferation of disinformation and biased, primitive AI solutions.
The philosophy of the open internet
The concept of the open web is a broad blend of ideas based on the basic principles of information as a public good, people’s right to share it, and the importance of data neutrality. Its proponents promote equal access to the Internet as a way to spread knowledge globally, primarily through nonprofits such as the Creative Commons, open-source science and coding, open licensing, and archival organizations such as the aforementioned IA.
The open internet has its disadvantages. A simple example would be that cybercrime could benefit significantly from open source encryption, while open access to digital content could encourage piracy. But crime also proliferates in closed social systems. Making the Internet less accessible would therefore hardly solve this problem.
Open access to information, on the other hand, has been the key driver of human civilization, from the time our hominid ancestors developed language, to the Gutenberg Revolution and the rise of the World Wide Web.
The argument for access to public web data
The Internet Archive is the embodiment of the open Internet and free access to data. With the Wayback Machine’s archive of 410 billion web pages, tens of millions of books, images, and audio recordings, and more than 200,000 software programs (including historical applications), it is a vast historical repository, a sociocultural phenomenon, and an educational project with a mission to spread knowledge to remote locations.
Users can upload content to the IA, but the lion’s share is collected from the Internet by web crawlers: automated programs that traverse the web and store the content of the websites they visit. The IA crawlers only collect publicly available data, which means that information behind logins or paywalls is left out.
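To make the idea of a crawler concrete, here is a minimal sketch of one that follows links within a single site and stores only pages that respond as publicly accessible. It is a toy illustration, not the IA's actual crawler: the `fetch` callable, the in-memory `SITE` dictionary and the `example.org` URLs are assumptions made up for this example, and treating any non-200 status as "not public" is a simplification.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl that archives only publicly accessible pages.

    `fetch(url)` is a hypothetical callable returning (status_code, html).
    """
    archive = {}
    seen = {start_url}
    queue = deque([start_url])
    host = urlparse(start_url).netloc

    while queue and len(archive) < max_pages:
        url = queue.popleft()
        status, html = fetch(url)
        if status != 200:
            # Assumption: login walls and paywalls answer with a non-200
            # status (for example 401 or 403), so the page is skipped.
            continue
        archive[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            link = urljoin(url, href)
            # Stay on the same host and never queue a URL twice.
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return archive


# A made-up three-page site: two public pages and one behind a login.
SITE = {
    "https://example.org/": (
        200,
        '<a href="/about">About</a> <a href="/members">Members</a>',
    ),
    "https://example.org/about": (200, '<a href="/">Home</a>'),
    "https://example.org/members": (401, ""),
}


def fake_fetch(url):
    """Stand-in for a real HTTP client; unknown URLs return 404."""
    return SITE.get(url, (404, ""))


archive = crawl("https://example.org/", fake_fetch)
print(sorted(archive))
```

Passing `fetch` in as a parameter keeps the sketch runnable without network access. A production crawler would swap in a real HTTP client and would typically also honor robots.txt rules and rate limits.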
There are several ways in which free data repositories like the IA benefit critical social missions. The IA is used for scientific research, to access old court documents and even as evidence in legal proceedings. It can also be used to support investigative journalism and the fight against disinformation.
AI in the echo chambers
A relatively new use case that requires open access to large amounts of public web data, including historical repositories, is training artificial intelligence (AI, not to be confused with the IA) algorithms. Making AI training and test data as diverse as possible is a prerequisite not only for developing increasingly complex systems, but also for keeping AI algorithms less biased, avoiding hallucinations and improving accuracy.
As my colleague has argued, if training datasets are based primarily on data that is synthetic or too homogeneous, the system will tend to accentuate specific patterns (including biases) inherent in the underlying datasets, creating echo chambers and making AI output primitive and less reliable. Moreover, probabilistic algorithms would form closed epistemic systems in which the abundance of ideas, theories and other representations of the real world would slowly disappear.
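The loss of variety described above can be illustrated with a deliberately simple simulation. This is a toy model, not a claim about how any real AI system behaves: each integer stands for one distinct "idea", and "training on the previous system's output" is approximated by sampling with replacement from the previous corpus. The function names, corpus size and seed are assumptions chosen for the example.

```python
import random


def next_generation(corpus, rng):
    """Build a new corpus by sampling, with replacement, from the old one.

    This stands in for a system trained only on the previous system's output.
    """
    return [rng.choice(corpus) for _ in corpus]


def distinct_ideas_over_time(n_ideas=1000, generations=10, seed=0):
    """Return the number of distinct ideas left after each generation."""
    rng = random.Random(seed)
    corpus = list(range(n_ideas))  # each integer is one distinct "idea"
    counts = [len(set(corpus))]
    for _ in range(generations):
        corpus = next_generation(corpus, rng)
        counts.append(len(set(corpus)))
    return counts


counts = distinct_ideas_over_time()
print(counts)
```

Because every generation can only repeat what the previous one contained, the count of distinct ideas never grows and in practice shrinks with each round. New human-created data is the only thing in this toy model that could restore the lost variety.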
Unfortunately, the biggest challenge for AI developers today is gaining open access to abundant human-created data. AI companies have faced a huge social and legal backlash over their use of publicly available web data, partly related to data privacy concerns and partly to concerns about data ownership and copyright.
On the one hand, the argument that AI companies developing popular commercial AI solutions should compensate content owners (be they photographers, writers, designers, or scientists) for the use of their work sounds perfectly legitimate. On the other hand, it leaves AI developers in a stalemate.
For one thing, web content is virtually limitless, and much of it can be considered “technically copyrighted” without any clear rights being assigned to it. Content actively produced by millions of Internet users is the best example of this phenomenon. Typically, none of them claim their public posts as copyrighted material, and it would be impossible to identify all potential copyright holders. Moreover, compensating them would mean negotiating terms with every one of them, an effort of such magnitude that it would make commercial AI development unfeasible.
Recognizing the complicated nature of the situation, some major data owners (often called “gatekeepers”) have rushed to monetize their resources. The BBC announced it is “in talks with technology companies to sell access to its content archive to use as AI training data,” and other publishers are considering similar revenue diversification models.
However, this solution could still make the costs of AI development prohibitive, especially for smaller companies. Without rethinking the current compensation mechanisms and the established copyright regime that currently favors the big players, the movement towards more intelligent, reliable and responsible AI systems could remain stuck in the realm of science fiction for years to come.
The rapid expansion of the Internet has drastically changed the way people live their daily lives in recent decades. First, we started consuming digital information: reading books, watching movies, listening to music and talking to each other through our gadgets. Nowadays it’s not just us, but also robots creating digital art, collecting all kinds of information and ‘reading’ it online, trying to make sense of the content that humans have created.
However, the established copyright regime and resulting compensation mechanisms are not fast enough to adapt, causing problems for several participants in the digital economy: companies that collect public web information, historical repositories that store internet data for future generations, and AI developers that need to make robots smart and, more importantly, reliable. As the case of the Internet Archive shows, even the concept of a digital library is still legally problematic.
With existing technological capabilities, open access to publicly available web data is the only way to improve the quality of AI results. AI tools that can better process and distribute information would in turn make information more accessible and useful to a wider audience. However, if AI developers are forced to pay for all the data they use, there may no longer be a business case for further developing these systems.
This article was produced as part of Ny BreakingPro’s Expert Insights channel, where we profile the best and brightest minds in today’s technology industry. The views expressed here are those of the author and are not necessarily those of Ny BreakingPro or Future plc. If you are interested in contributing, you can read more here: https://www.techradar.com/news/submit-your-story-to-techradar-pro