{"id":14006,"date":"2024-11-18T09:00:00","date_gmt":"2024-11-18T08:00:00","guid":{"rendered":"http:\/\/plus.maciejpiasecki.info\/index.php\/2024\/11\/18\/open-data-and-open-source-ai-charting-a-course-to-get-more-of-both\/"},"modified":"2024-11-21T21:06:46","modified_gmt":"2024-11-21T20:06:46","slug":"open-data-and-open-source-ai-charting-a-course-to-get-more-of-both","status":"publish","type":"post","link":"https:\/\/plus.maciejpiasecki.info\/index.php\/2024\/11\/18\/open-data-and-open-source-ai-charting-a-course-to-get-more-of-both\/","title":{"rendered":"Open Data and Open Source AI: Charting a course to get more of both"},"content":{"rendered":"<p>While working to define Open Source AI, we realized that data governance is an unresolved issue. The Open Source Initiative organized a workshop to discuss data sharing and governance for AI training. The critical question posed to attendees was \u201cHow can we best govern and share data to power Open Source AI?\u201d The main objective of this workshop was to establish specific approaches and strategies for both Open Source AI developers and other stakeholders.<\/p>\n<p>The Workshop: Building bridges across \u201cOpen\u201d streams\u00a0\u00a0<\/p>\n<p>Held on October 10-11, 2024, and hosted by Linagora\u2019s Villa Good Tech, the OSI workshop brought together 20 experts from diverse fields and regions. Funded by the Alfred P. 
Sloan Foundation, the event focused on actionable steps to align open data practices with the goals of Open Source AI.<\/p>\n<p>Participants, listed below, comprised academics, civil society leaders, technologists, and representatives from organizations like Mozilla Foundation, Creative Commons, EleutherAI Institute and others.<\/p>\n<p>Ignatius Ezeani, University of Lancaster \/ Nigeria<\/p>\n<p>Masayuki Hatta, Debian, Open Source Group Japan \/ Japan<\/p>\n<p>Aviya Skowron, EleutherAI Institute \/ Poland<\/p>\n<p>Stefano Zacchiroli, Software Heritage \/ Italy<\/p>\n<p>Ricardo Torres, Digital Public Goods Alliance \/ Mexico<\/p>\n<p>Kristina Podnar, Data and Trust Alliance \/ Croatia + USA<\/p>\n<p>Joana Varon, Coding Rights \/ Brazil<\/p>\n<p>Renata Avila, Open Knowledge Foundation \/ Guatemala<\/p>\n<p>Alek Tarkowski, Open Future \/ Poland<\/p>\n<p>Maximilian Gantz, Mozilla Foundation \/ Germany<\/p>\n<p>Stefaan Verhulst, GovLab \/ USA + Belgium<\/p>\n<p>Paul Keller, Open Future \/ Germany<\/p>\n<p>Thom Vaughan, Common Crawl \/ UK<\/p>\n<p>Julie Hunter, Linagora \/ USA<\/p>\n<p>Deshni Govender, GIZ FAIR Forward \u2013 AI for All \/ South Africa<\/p>\n<p>Ramya Chandrasekhar, CNRS \/ India<\/p>\n<p>Anna Tumad\u00f3ttir, Creative Commons \/ Iceland<\/p>\n<p>Stefano Maffulli, Open Source Initiative \/ Italy<\/p>\n<p>Over two days, the group worked to frame a cohesive approach to data governance. Alek Tarkowski and Paul Keller of the Open Future Foundation are working with OSI to complete the white paper summarizing the group\u2019s work. In the meantime, here is a quick \u201ctease\u201d of just a few of the many topics that the group discussed:<\/p>\n<p>The streams of \u201copen\u201d merge, creating waves<\/p>\n<p>AI is where Open Source software, open data, open knowledge, and open science meet in a new way. 
Since OpenAI released ChatGPT, what were once largely parallel tracks with occasional junctures have become a turbulent merger of streams, creating ripples in all of these disciplines and forcing us to reassess our principles: How do we merge these streams without eroding the principles of transparency and access that define openness?<\/p>\n<p>We discovered in the process of defining Open Source AI that the basic freedoms we\u2019ve put in the Open Source Definition and its foundation, the Free Software Definition, are still good and relevant. Open Source software has had decades to mature into a structured ecosystem with clear rules, tools, and legal frameworks. The same is true of open knowledge and open science: while rooted in age-old traditions, both have seen a modern rejuvenation through platforms like Wikipedia and the Open Knowledge Foundation. Open data, however, feels less solid: often serving as a one-way pipeline from public institutions to private profiteers, it is now being dragged into entirely new territory.<\/p>\n<p>How do these principles of \u201copen\u201d interact with each other, and how are we going to merge Open Data with Open Source, Open Science, and Open Knowledge in Open Source AI?<\/p>\n<p>The broken social contract of data<\/p>\n<p>Data fuels AI. The sheer scale of data required to train models like ChatGPT reveals not just a technological challenge but also a societal dilemma. Much of this data comes from us.<\/p>\n<p>OpenAI, for example, \u201cslurps\u201d all the data it can find, and much of it is what we willingly give: the blogs we write; the code we share; the pictures, emails and address books we keep in \u201cthe cloud\u201d; and all the other information we give freely to platforms.<\/p>\n<p>We, the people, make the \u201cdata,\u201d but what are we getting in exchange? 
OpenAI owns and controls the machine built with our data, and it grants us access via API until it changes its mind. We are essentially being strip-mined for a proprietary system that grants access at a price, until the owner decides otherwise.<\/p>\n<p>We need a different future, one where data empowers communities, not just corporations. That starts with revisiting the principles of openness that underpin the open source, open science, and open knowledge movements. The question is: How do we take back control?<\/p>\n<p>Charting a path forward<\/p>\n<p>We want the machine for ourselves. We want machines that the people can own and control. We need to find a way to swing the pendulum back to our meaning of Open. And it\u2019s all about the \u201cdata.\u201d<\/p>\n<p>The OSI\u2019s work on the Open Source AI Definition provides a starting point. An Open Source AI machine is one that the people can meaningfully fork without having to ask for permission. For AI to truly be open, developers need access to the same tools and data as the original creators. That means transparent training processes, open filtering code, and, critically, open datasets.<\/p>\n<p>Group photo of the participants in the workshop on data governance, Paris, Oct 2024.<\/p>\n<p>Next steps<\/p>\n<p>The white paper, expected in December, will synthesize the workshop\u2019s discussions and propose concrete strategies for data governance in Open Source AI. Its goal is to lay the groundwork for an ecosystem where innovation thrives without sacrificing openness or equity.<\/p>\n<p>As the lines between \u201copen\u201d streams continue to blur, the choices we make now will define the future of AI. Will it be a tool controlled by a few, or a shared resource for all?<\/p>\n<p>The answer lies in how we navigate the waves of data and openness. 
Let\u2019s get it right.<br \/>\n<br \/>\nSource: opensource.org<\/p>\n","protected":false},"excerpt":{"rendered":"<p>While working to define Open Source AI, we realized that data governance is an unresolved issue. The Open Source Initiative [&hellip;]<\/p>\n","protected":false},"author":61,"featured_media":0,"comment_status":"false","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-14006","post","type-post","status-publish","format-standard","hentry","category-mp"],"_links":{"self":[{"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/posts\/14006","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/users\/61"}],"replies":[{"embeddable":true,"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/comments?post=14006"}],"version-history":[{"count":1,"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/posts\/14006\/revisions"}],"predecessor-version":[{"id":14007,"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/posts\/14006\/revisions\/14007"}],"wp:attachment":[{"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/media?parent=14006"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/categories?post=14006"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/tags?post=14006"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}