Are we entering a “Data Winter”?
The urgent need to prevent further decline in opening up data for reuse in the public interest
Stefaan G. Verhulst
Introduction
In an era where data drives decision-making, the accessibility of data for public interest purposes has never been more crucial. Whether shaping public policy, responding to disasters, or empowering research, data plays a pivotal role in our understanding of complex social, environmental, and economic issues. In 2015, I introduced the concept of Data Collaboratives to advance new and innovative partnerships between the public and private sectors that could make data more accessible for public interest purposes. More recently, I have been advocating for a reimagined approach to data stewardship to make data collaboration more systematic, agile, sustainable, and responsible.
Despite many advances toward data stewardship (especially during COVID-19), and despite the creation of several important data collaboratives (e.g., the Industry Data for Society Partnership), the project of opening access to data is proving increasingly challenging. Indeed, unless we step up our efforts in 2024, we may be entering a prolonged data winter — analogous to previous Artificial Intelligence winters, marked by reduced funding and interest in AI research — in which data assets that could be leveraged for the common good are instead frozen and immobilized. Recent developments, such as a decline in access to social media data for research and the growing privatization of climate data, along with a decrease in open data policy activity, signal a worrying trend. This blog takes stock of these developments and, building on recent expert commentary, raises a number of concerns about the current state of data accessibility and its implications for the public interest. We conclude by calling for a new Decade of Data — one marked by a reinvigorated commitment to open data and data reuse for the public interest.
A Digital Dark Age? Decreased Access to Social Media Data for Research
In a recent thought-provoking article in Wired, Gina Neff warns that we may be entering a new era in data accessibility. She sheds light on a notable change in the data accessibility landscape, marked by restrictions on social media platform data for computational social science. Social media, once a goldmine for researchers seeking insights into societal trends, political movements, and human behavior, is now entering a phase of restricted access, ushering in what Neff describes as a “grim digital dark age.”
To fully grasp the implications of this transformation, we need to understand that, in their early days, platforms like Twitter and Facebook were more than just social networking sites; they were rich, real-time libraries of public sentiment and behavior. Researchers relied on the resulting data to gain a deeper understanding of diverse phenomena, from political crises to epidemic patterns and natural disasters. The landscape in 2024, however, stands in stark contrast to this once-open data environment. In particular, the introduction of large language models (LLMs) and AI-generated content on these platforms has not only raised concerns about the accuracy and reliability of information but also led to a fundamental shift in how data is accessed, shared, and valued. Instead of fulfilling a public-good function for research and evidence-based decision-making, social media data has largely become fodder for LLMs.
Likewise, the decision by Elon Musk to end free access to Twitter’s API dealt a significant blow to the research community. This move made it increasingly challenging, if not impossible, for researchers to obtain essential data on topics like public health, disaster response, and economic activity. It stands as a stark reminder that the modern internet, contrary to its perceived openness and democratic nature, is increasingly controlled by a small number of gatekeepers who dictate data access.
All told, the notion of closer cooperation with platform companies to harness data for the public good seems increasingly illusory, despite efforts by organizations such as the European Digital Media Observatory to create an independent intermediary body to support research on digital platforms (as mandated by the Digital Services Act). Even when researchers are granted access, the use of policies such as Meta’s “independence by permission” reveals a troubling dynamic: control over what types of questions can be asked (and who can ask them) continues to rest with the companies. This control significantly limits the scope and independence of research, undermining the role of data in serving the public interest.
The Privatization and Hoarding of Climate Data
Another recent article, by Justin S. Mankin, brings to light a further concerning trend, this one in the realm of climate data. In recent years, climate data has gradually shifted from a public good to a commodity governed by market forces. This privatization of climate data not only raises ethical questions but also poses a significant threat to equitable access to information and, more broadly, to the public good.
Climate data, once largely the domain of public research institutions and widely accessible, is increasingly being seen as a lucrative asset in the private sector. Venture capitalists and major corporations are investing heavily in climate analytics, recognizing the growing demand for detailed information about climate risks. This commodification of climate science is creating a market where climate data and risk models are treated as products to be bought and sold.
Mankin’s concern lies in the implications of this trend for social equity and justice. With private firms dominating the climate information space, there’s a growing divide between those who can afford such data and insights and those who cannot. This divide could lead to scenarios where the wealthy are better equipped to adapt to climate risks, while the less affluent are left vulnerable. Such a scenario not only perpetuates existing inequalities but could also undermine the very essence of using climate data as a tool for widespread societal benefit.
While recognizing the role of the private sector in contributing to climate information, Mankin warns against an overreliance on it for climate adaptation information. The risk is that the privatization and commodification of climate data might hollow out publicly provided climate science, leading to a situation in which the rich pay with money and the poor pay with their lives. This potential outcome highlights the critical need for maintaining climate data as a public good, ensuring its accessibility and utility for all, irrespective of financial capability.
Generative AI-nxiety and the Potential Decrease in Data Accessibility
The overall turn against data accessibility has been worsened by the emergence of Generative AI, which has brought with it a new wave of apprehension and tightening of restrictions. This apprehension (sometimes termed “Generative AI-nxiety”) has significant implications for data sharing, particularly in the context of public interest reuse and the development of foundational AI models.
One of the primary concerns revolves around the unauthorized use and potential misuse of data for training generative AI models. Such concerns, highlighted by The New York Times lawsuit against OpenAI and Microsoft, are not entirely unfounded; worries over the unauthorized use of sensitive or proprietary information may well be valid.
However, the trouble is that this “AI-nxiety” now extends to legitimate and non-discriminatory reuses of data, too, and is starting to stunt data accessibility for public interest purposes. The apprehension that shared data could be used to train AI models, whose outputs might then inform decisions with unforeseen consequences, is leading to a more guarded approach to data sharing. This trend poses a significant challenge for researchers and organizations that rely on open data to address societal issues, develop public policies, and advance scientific understanding.
Generative AI represents a double-edged sword in the context of data accessibility. On the one hand, it offers immense potential for innovation and democratizing access to data and knowledge. On the other hand, the fear of misuse and the lack of clear regulatory frameworks around AI data reuse are contributing to a more restrictive data environment. This paradox highlights the need for a balanced approach that safeguards against misuse while not stifling data flows essential for the public interest and social good.
Stalling of Legislative and Policy Activity Toward Open Data
The advent of a potential data winter is not being hastened by technological and market forces alone. Recent developments on the policy and legal front also point to a worrying stagnation in open data policy and regulation.
Despite growing recognition of the importance of data for public interest purposes, there has been a noticeable lack of recent progress in the development and implementation of open data policies. For instance, our repository on open data policies and regulations has not recorded any meaningful advances in the past year. Likewise, the European Data Act, which seemed to hold considerable promise for opening up business data, serves as a case study of recent legislative shortcomings. As assessed by Ingo Dachwitz, the Act has been largely disappointing, failing to live up to its potential. One of the main issues has been the lack of a clear implementation pathway, leaving a gap between legislative intent and practical application. Notably, the current legislation ignores the need for a human infrastructure of data stewards who can balance the interests of private entities with the broader societal need for data accessibility. This gap not only hinders the accessibility of business data for the public interest but also reflects a broader trend of ineffective policy-making in the data domain.
Conclusion: From a Data Winter to a Decade of Data
As we navigate through the complexities of data accessibility in today’s world, it is evident that the battle to keep data open and accessible for public interest purposes is facing significant challenges. From the restrictive policies of social media giants to the privatization of climate data, the landscape of data collaboration is becoming increasingly restrictive and commodified. The introduction of generative AI has added a new dimension to this issue, fueling further concerns about an impending data winter. Legislative efforts, which should ideally provide a framework for equitable and open data access, have likewise been disappointing.
Despite these hurdles, the need for accessible data in the public interest has never been more critical. Data drives our understanding of complex global issues, informs policy decisions, and fuels scientific advancements. The ongoing challenges underscore the need for a balanced approach that safeguards community and proprietary interests without stifling the flow of data that can benefit society at large.
Looking ahead, it is imperative that stakeholders from various sectors — governments, private organizations, academia, and civil society — collaborate to halt this trend of enclosure and hoarding by calling for a Decade of Data and forging pathways to ensure data remains a tool for the public good. This requires not only effective and pragmatic legislation but also a cultural shift in how we perceive and value data, along with investment in human infrastructure, such as data stewards. The goal should be to create an ecosystem where data is not just a commodity to be traded but a resource that empowers communities and science and fosters a more informed, equitable world. All of this should go hand in hand with, and can be facilitated by, increased digital self-determination.
At the global level, we stand at a pivotal juncture with a unique opportunity to redefine the trajectory of data cooperation — not in the distant future, but in the coming months. This change hinges on the decisions to be made during the Global Digital Compact, where world leaders will determine the scope and nature of their collaboration on digital matters. The current landscape is marked by a trend towards isolation, with ‘small gardens’ surrounded by ‘ever higher walls’.
The Global Digital Compact presents an opportunity for world leaders to make a concerted effort towards enhancing digital cooperation, aiming to lower these barriers. It’s crucial to recognize data as a fundamental cornerstone of the AI era, not merely a byproduct. Such recognition underscores the need for a balanced approach that fosters open data exchange while ensuring robust privacy and security measures. By doing so, the Compact has the potential to lay the groundwork for a more interconnected and responsible digital future, where data collaboration and innovation go hand in hand with ethical considerations and global cooperation.
It’s not too late to move from a limiting, chilling data winter to an enabling, socially beneficial data decade.
Thanks to Akash Kapur and David Passarelli for their suggestions on an earlier draft.