{"id":13384,"date":"2023-07-27T09:34:23","date_gmt":"2023-07-27T07:34:23","guid":{"rendered":"http:\/\/plus.maciejpiasecki.info\/index.php\/2023\/07\/27\/takeaways-from-the-defining-open-ai-community-workshop\/"},"modified":"2023-07-27T22:05:51","modified_gmt":"2023-07-27T20:05:51","slug":"takeaways-from-the-defining-open-ai-community-workshop","status":"publish","type":"post","link":"https:\/\/plus.maciejpiasecki.info\/index.php\/2023\/07\/27\/takeaways-from-the-defining-open-ai-community-workshop\/","title":{"rendered":"Takeaways from the \u201cDefining Open AI\u201d community workshop"},"content":{"rendered":"<p>The Open Source Initiative is deep into a multi-stakeholder process to define machine learning systems that can be characterized as \u201cOpen Source.\u201d\u00a0<\/p>\n<p>About 40 people put their heads together for the first community discussion in an hour-long session I led at FOSSY 2023.\u00a0<\/p>\n<p>If you missed it, there are still plenty of ways to get involved. Send a proposal to speak at the online webinar series before August 4, 2023 and check out the timeline for upcoming in-person workshops. Get caught up with the recap from the kickoff meeting, too.<\/p>\n<p>Why data is the sticking point of machine learning<\/p>\n<p>The session started with a short presentation highlighting why we need to define \u201copen\u201d in the AI context and why we need to do it now.\u00a0<\/p>\n<p>Open Source gives users and developers the ability to decide for themselves how and where to use the technology without the need to engage with a third party. We want to get the same things for machine learning systems. We\u2019ll need to find our way there.<\/p>\n<p>First, we need to clarify that machine learning systems are a little different than classic software. For one, machine learning depends on data, lots of it. Developers can\u2019t rely on just their own laptops and knowledge to build new AI systems. 
The legal landscape is also a lot more complicated than for pure software: Data is covered by many different laws, which often vary widely between countries.<\/p>\n<p>After the initial meeting in San Francisco, it became clear that the most crucial question to ask (and try to answer) concerns data.<\/p>\n<p>At the Portland session, I asked one simple question:<\/p>\n<p>How tightly coupled should the original data and the ML models be?<\/p>\n<p>I started with the three pieces that go into a typical ML system:<\/p>\n<p>Software for training and testing, inference and analysis: The crowd readily agreed that all software written by a human and copyrightable must be Open Source for an ML system to be considered open.<\/p>\n<p>Model architecture with its weights and training parameters: These should be made available with terms and conditions that don\u2019t restrict who can use them or how they\u2019re used. There also shouldn\u2019t be restrictions on retraining the artifacts and redistributing them. The group wasn\u2019t as clearly in agreement on this point but did concur that resolving it is within reach.<\/p>\n<p>Raw data and prepared datasets, for training and testing: I started with the assumption that the original dataset is not the preferred form for making modifications to model\/weights and asked the group: Does that mean an \u201cOpen ML\u201d can ignore the original data? How much of the original dataset do we need in order to exercise the right to modify a model?<\/p>\n<p>This final question required people to get on the same page. Some AI developers in the room said that the original dataset is not necessary to modify a model. They added, though, that they would need a sufficiently precise description of the original data, along with other elements. 
That description would be needed both for technical reasons and for transparency, for example to evaluate bias.<\/p>\n<p>A few people took a different view, arguing that data is roughly equivalent to the source code of a model and the model to the binary, as if training were the equivalent of compilation. Some of their comments gave the impression that they were less familiar with developing ML systems.<\/p>\n<p>Other participants explained why the analogy to software\u2019s source-binary split doesn\u2019t hold water: Binary-only software cannot practically be modified, which is why the GNU GPLv3 defines source code as the preferred form of a work for making modifications to it. By contrast, AI models can be fine-tuned and retrained without the original dataset, if they\u2019re accompanied by other elements.<\/p>\n<p>During the session, folks were encouraged to contribute their thoughts on an Etherpad. Comments there touch on the cultural implications of public data, the importance of documenting data transparency and whether \u201copen with restrictions\u201d carve-outs will be necessary when it comes to personal or health data.<\/p>\n<p>What\u2019s next<\/p>\n<p>Remember: We want you to weigh in with a proposal to speak and invite you to participate in upcoming community workshops.<\/p>\n<p>For now, we\u2019ll leave you with this quote from the Portland session:<\/p>\n<p>\u201cI think I\u2019m coming to a position that AI maybe isn\u2019t open without open data or a really good description of the data used (based on the \u201cspigot\u201d example), but that there will be a significant number of use cases that aren\u2019t open, for various cultural reasons e.g. they may use other licenses, defined within those communities, but also aren\u2019t the kind of extractive commercial stuff that invokes puking either. 
Open isn\u2019t an exclusive synonym for \u2018good\u2019.\u201d<\/p>\n<p>Participants also debated the legality of training models on copyrighted and trademarked data, and voiced concerns about the output of generative AI systems.<\/p>\n<p>We have a long road ahead and must move quickly \u2013 join us on this important journey.<br \/>\nThe post Takeaways from the \u201cDefining Open AI\u201d community workshop appeared first on Voices of Open Source.<br \/>\n<img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/plus.maciejpiasecki.info\/wp-content\/uploads\/2023\/07\/AI-Def-meeting-2-1024x468-1.jpg\" width=\"1024\" height=\"468\"><br \/>\nSource: opensource.org<\/p>\n","protected":false},"excerpt":{"rendered":"<p>The Open Source Initiative is deep into a multi-stakeholder process to define machine learning systems that can be characterized as [&hellip;]<\/p>\n","protected":false},"author":53,"featured_media":13385,"comment_status":"false","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[3],"tags":[],"class_list":["post-13384","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-mp"],"_links":{"self":[{"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/posts\/13384","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/users\/53"}],"replies":[{"embeddable":true,"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/comments?post=13384"}],"version-history":[{"count":1,"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/posts\/13384\/revisions"}],"predecessor-ver
sion":[{"id":13386,"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/posts\/13384\/revisions\/13386"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/media\/13385"}],"wp:attachment":[{"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/media?parent=13384"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/categories?post=13384"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/plus.maciejpiasecki.info\/index.php\/wp-json\/wp\/v2\/tags?post=13384"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}