Posted on

by

in

Open Data and Open Source AI: Charting a course to get more of both

While working to define Open Source AI, we realized that data governance is an unresolved issue. The Open Source Initiative organized a workshop to discuss data sharing and governance for AI training. The critical question posed to attendees was “How can we best govern and share data to power Open Source AI?” The main objective of this workshop was to establish specific approaches and strategies for both Open Source AI developers and other stakeholders.

The Workshop: Building bridges across “Open” streams  

Held on October 10-11, 2024, and hosted by Linagora’s Villa Good Tech, the OSI workshop brought together 20 experts from diverse fields and regions. Funded by the Alfred P. Sloan Foundation, the event focused on actionable steps to align open data practices with the goals of Open Source AI.  

Participants, listed below, comprised academics, civil society leaders, technologists, and representatives from organizations like Mozilla Foundation, Creative Commons, EleutherAI Institute and others. 

Ignatius Ezeani University of Lancaster  / Nigeria 

Masayuki Hatta Debian, Open Source Group Japan / Japan

Aviya Skowron EleutherAI Institute / Poland

Stefano Zacchiroli Software Heritage / Italy

Ricardo Torres Digital Public Goods Alliance / Mexico

Kristina Podnar Data and Trust Alliance / Croatia + USA

Joana Varon Coding Rights / Brazil

Renata Avila Open Knowledge Foundation / Guatemala

Alek Tarkowski Open Future / Poland

Maximilian Gantz  Mozilla Foundation / Germany

Stefaan Verhulst  GovLab / USA+ Belgium

Paul Keller Open Future / Germany

Thom Vaughan Common Crawl / UK  

Julie Hunter Linagora / USA 

Deshni Govender  GIZ FAIR Forward – AI for All / South Africa

Ramya Chandrasekhar CNRS / India

Anna Tumadóttir Creative Commons / Iceland  

Stefano Maffulli Open Source Initiative / Italy 

Over two days, the group worked to frame a cohesive approach to data governance. Alek Tarkowski and Paul Keller of the Open Future Foundation are working with OSI to complete the white paper summarizing the group’s work. In the meantime, here is a quick “tease”—just a few of the many topics that the group discussed:  

The streams of “open” merge, creating waves

AI is where Open Source software, open data, open knowledge, and open science meet in a new way. Since OpenAI released ChatGPT, what once were largely parallel tracks with occasional junctures are now a turbulent merger of streams, creating ripples in all of these disciplines and forcing us to reassess our principles: How do we merge these streams without eroding the principles of transparency and access that define openness?

We discovered in the process of defining Open Source AI that the basic freedoms we’ve put in the Open Source Definition and its foundation, the Free Software Definition, are still good and relevant. Open Source software has had decades to mature into a structured ecosystem with clear rules, tools, and legal frameworks. Same with Open Knowledge and Open Science: While rooted in age-old traditions, open knowledge and science have seen modern rejuvenation through platforms like Wikipedia and the Open Knowledge Foundation. Open data, however, feels less solid: often serving as a one-way pipeline from public institutions to private profiteers, is now dragged into a whole new territory. 

How are these principles of “open” interacting with each other, how are we going to merge Open Data with Open Source with Open Science and Open Knowledge in Open Source AI?

The broken social contract of data 

Data fuels AI. The sheer scale of data required to train models like ChatGPT reveals not just a technological challenge but also a societal dilemma. Much of this data comes from us—the blogs we write, the code we share, the information we give freely to platforms. 

OpenAI, for example, “slurps” all the data it can find, and much of it is what we willingly give: the blogs we write; the code we share; the pictures, emails and address books we keep in “the cloud”; and all the other information we give freely to platforms. 

We, the people, make the “data,” but what are we getting in exchange? OpenAI owns and controls the machine built with our data, and it grants us access via API, until it changes its mind. We are essentially being stripmined for a proprietary system that grants access at a price—until the owner decides otherwise.

We need a different future, one where data empowers communities, not just corporations. That starts with revisiting the principles of openness that underpin the open source, open science, and open knowledge movements. The question is: How do we take back control?  

Charting a path forward  

We want the machine for ourselves. We want machines that the people can own and control. We need to find a way to swing the pendulum back to our meaning of Open. And it’s all about the “data.”

The OSI’s work on the Open Source AI Definition provides a starting point. An Open Source AI machine is one that the people can meaningfully fork without having to ask for permission. For AI to truly be open, developers need access to the same tools and data as the original creators. That means transparent training processes, open filtering code, and, critically, open datasets. 

Group photo of the participants to the workshop on data governance, Paris, Oct 2024.

Next steps  

The white paper, expected in December, will synthesize the workshop’s discussions and propose concrete strategies for data governance in Open Source AI. Its goal is to lay the groundwork for an ecosystem where innovation thrives without sacrificing openness or equity.  

As the lines between “open” streams continue to blur, the choices we make now will define the future of AI. Will it be a tool controlled by a few, or a shared resource for all?  

The answer lies in how we navigate the waves of data and openness. Let’s get it right. 

Source: opensource.org