It Was the Best of Times, It Was the Worst of Times: The Dual Realities of Data Access in the Age of Generative AI
(First Published in Industry Data for Society Partnership’s (IDSP) 2024 Year in Review)
Dr. Stefaan Verhulst
“It was the best of times, it was the worst of times… It was the spring of hope, it was the winter of despair.”
–Charles Dickens, A Tale of Two Cities
Charles Dickens’s famous line captures the contradictions of the present moment in the world of data. On the one hand, data has become central to addressing humanity’s most pressing challenges — climate change, healthcare, economic development, public policy, and scientific discovery. On the other hand, despite the unprecedented quantity of data being generated, significant obstacles remain to accessing and reusing it. As our digital ecosystems evolve, including the rapid advances in artificial intelligence, we find ourselves both on the verge of a golden era of open data and at risk of slipping deeper into a restrictive “data winter.”
This essay examines two concurrent realities: the challenges posed by growing restrictions on data reuse, and the countervailing potential brought by advancements in privacy-enhancing technologies (PETs), synthetic data, and data commons approaches. It argues that while current trends toward closed data ecosystems threaten innovation, new technologies and frameworks could lead to a “Fourth Wave of Open Data,” potentially ushering in a new era of data accessibility and collaboration.
Winter of Despair: The Growing Challenge of Data Access
Data has long been positioned as a critical resource for solving global problems. With the rise of artificial intelligence (AI) and big data analytics, the potential to turn vast amounts of information into actionable insights has never been greater. However, this potential is increasingly constrained by growing barriers to data access, creating a paradox where, despite the abundance of data, fewer entities can utilize it effectively for the public good.
One major challenge is the reduction of research access to data from social media platforms and other digital ecosystems. Social media data has been invaluable for tracking public health trends, understanding population movements, and analyzing public opinion. However, platforms are increasingly restricting access to these datasets, citing privacy concerns, competitive interests, and regulatory pressures. This can severely hamper research on everything from disaster response to political polarization. With control over such data in the hands of a few, the democratization of data access is under threat.
Climate data, another crucial resource, is facing a similar fate. Historically, open access to climate data — through satellite imagery, weather sensors, and environmental monitoring tools — has been vital for understanding climate change, predicting natural disasters, and shaping environmental policy. Yet, the privatization of climate data, driven by its growing commercial value, has created barriers for independent researchers and policymakers. In addition to rising costs, geopolitical tensions over data sovereignty have further restricted cross-border sharing of environmental data. This fragmentation threatens to undermine global efforts to combat climate change.
Further complicating matters is what might be termed “Generative AI-anxiety.” As generative AI models, such as large language models (LLMs), gain prominence, companies and institutions are increasingly reluctant to share their data for fear of it being used without authorization or compensation. Legal battles over copyright and data ownership highlight the growing tension between the need for data to train AI models and the needs of rightsholders. The chilling effect of these disputes has led to a reduction in data sharing, further limiting access to critical datasets that could benefit society.
Finally, the reduction of open government data initiatives represents another step backward. Once heralded as a way to increase transparency, accountability, and innovation, open data initiatives are now stalling or even reversing. The annual Open Data Barometer Report notes that many governments are scaling back their commitments to open data, citing budget constraints, political concerns, or the growing complexity of managing large datasets. The result is a shrinking pool of publicly available data, reducing opportunities for researchers, civic technologists, and policymakers to harness data for the public good.
Spring of Hope: Advances in Data Technologies
While the emerging “data winter” may seem bleak, a parallel trend offers reasons for optimism. Advances in privacy-enhancing technologies (PETs), the growing use of synthetic data, and the emergence of new data commons frameworks have the potential to counterbalance the restrictive forces at play. These innovations represent a “spring of hope” (to return to Charles Dickens) that could lead to a more open, collaborative, and equitable data ecosystem.
PETs, which include tools like differential privacy and data sandboxes, allow data to be used in a privacy-preserving manner, potentially helping unblock some of the restrictions discussed previously. These technologies enable the analysis of sensitive data without exposing individuals’ private information, thus addressing one of the primary concerns driving data restrictions. By enabling researchers and institutions to work with data without compromising privacy, PETs could unlock valuable datasets for public use, especially in fields like healthcare and social science, where privacy concerns are paramount.
Synthetic data, another promising PET-related development, similarly allows organizations to create artificial datasets that mimic real-world data while avoiding privacy risks. This approach can be particularly useful in industries where access to real data is limited due to ethical, legal, or practical constraints. For example, synthetic data can be used to train AI models in healthcare without violating patient confidentiality. While synthetic data is not without its challenges, such as ensuring that it accurately reflects the complexity of real-world data and does not amplify bias, it offers a way to continue innovating in data-driven fields, even as access to real data becomes more restricted.
Perhaps most promising is the emergence of data commons approaches: collaborative frameworks designed to pool and share data among stakeholders for mutual benefit. Data commons initiatives seek to balance the need for data access with concerns about privacy, intellectual property, and equity. By establishing clear governance structures and ethical guidelines for data sharing, data commons can foster a culture of openness and collaboration while ensuring that data is used responsibly. These frameworks are particularly important for addressing global challenges like climate change, where the free flow of data across borders and sectors is essential for coordinated action.
Finally, generative AI could also play a central role in opening data and insights. By leveraging AI to create user-friendly interfaces for interacting with complex datasets, we can make open data more accessible to a broader range of users. AI-powered tools can help users explore, analyze, and visualize data, lowering the technical barriers that have traditionally limited the use of open data. Perhaps most significantly, generative AI can help bridge the gap between raw data and actionable insights, turning vast amounts of unstructured information into meaningful, policy-relevant findings.
Towards a Fourth Wave of Open Data
These advances suggest the possibility of a “Fourth Wave of Open Data,” a new era where data is more accessible, conversational, and collaborative. Unlike previous waves of open data, which focused primarily on publishing datasets for public use, this new wave would emphasize making data ready for AI applications and ensuring it can be used in privacy-respecting, ethically sound ways.
However, realizing the potential of the Fourth Wave of Open Data will require concerted effort. Policymakers, technologists, and data stewards must work together to address the barriers to data access. Industry groups, such as the Industry Data for Society Partnership, also play a role in contributing to the technical and governance frameworks necessary to support responsible data sharing. This includes advancing standards for data quality and provenance, fostering interoperability across data systems, and ensuring that the benefits of open data are distributed equitably.
Conclusion: Best of Times, Worst of Times
In the Dickensian spirit, we find ourselves living in both the best of times and the worst of times when it comes to data. While the challenges of data access are real and growing, so too are the opportunities brought by new technologies and collaborative frameworks. It is essential to remember that the trajectory of our data ecosystem is not predetermined; rather, it will be shaped by the choices we make today.
It is within our power to ensure that the current data winter does not mark the end of data’s potential. Instead, we could be at the cusp of a new era, a time of unprecedented data access, insight, and innovation. Whether we remain in a winter of despair or move toward a spring of hope depends on us and on how we respond to the challenges and opportunities of the era.