The intense hatred for “stealing” content might be blunted if all the subsequent work product goes to the public domain.
But what do you do when you start getting copyright-struck on your own works because someone else decided to steal them and claim ownership?
People talk about open source models, but there’s no such thing. They are all black boxes where you have no idea what went into them.
People talk about open source models, but there’s no such thing.
:-/
Source code isn’t real? Schematics and blue prints don’t exist?
The intense hatred for “stealing” content might be blunted if all the subsequent work product goes to the public domain
Fun fact…It does!
If letting AI train on other people’s works is unjust enrichment, then what the record labels did to creatives through the entire 20th century, taking ownership of their work through coercive contracting, is extra-unjust enrichment.
Not saying it isn’t, but it’s not new, and it’s bothersome that we’re only complaining loudly now.
Don’t misunderstand me, I really don’t want to defend record companies, but
legally they made deals and wrote contracts. It’s not really the same thing.
When the labels held an oligopoly on access to the public, it was absolutely coercive when the choice was between having your work published while you got screwed vs. never being known ever.
This is one of the reasons the labels were so resistant to music on the internet in the first place (which Thomas Dolby and David Bowie were experimenting with in the early 1990s), and why they hired US ICE to raid the Dotcom estate in New Zealand: it wasn’t just about MegaUpload being used for piracy sometimes. (PS: That fight is still going on, twelve years later.)
Yep. And the streaming tech bros’ collusion with the industry mobsters took it to another level. The people making the art are a mere annoyance to the jerks profiting from it. And yet the AI which they think saves them from this annoyance requires the art to be created in the first place. I guess the history of recorded music holds a fair amount to plunder. But art - and even pop music - is an expression and reflection of individuals and the wider zeitgeist: actual humanity. I don’t see what value is added when a person creates something semi-unique and a supercomputer burns massive amounts of energy to mimic it. At this stage all of this supposed AI is a marketing gimmick to sell things. Corporations once again showing their hostility to humanity.
It seems like it’s only copyright infringement when poor people take rich people’s stuff.
When it’s the other way round, it’s fair use.
I don’t think it is all that difficult to make “ethical” AI.
Simply refer to the sources you used and put everything, from the data used to the models and the weights, in the public domain.
It baffles me why they don’t. Wouldn’t it be much simpler?
Simply refer to the sources you used
Source: The Internet.
Most things are duplicated thousands of times on the Internet. So stating sources would very quickly become a bigger text than almost any answer from an AI.
But even disregarding that, as an example: stating, on a publicly available page documenting the AI, that you scraped Republican and Democratic party websites does not explain which, if any, was used to answer a political question.
Your proposal sounds simple, but is probably extremely hard to implement in a useful way.
They don’t do it because they claim that there isn’t enough public domain data… But let’s be honest, nobody has tried because nobody wants a machine that isn’t able to reference anything in the last 100 years.
You should read this letter by Katherine Klosek, the director of information policy and federal relations at the Association of Research Libraries.
Why are scholars and librarians so invested in protecting the precedent that training AI LLMs on copyright-protected works is a transformative fair use? Rachael G. Samberg, Timothy Vollmer, and Samantha Teremi (of UC Berkeley Library) recently wrote that maintaining the continued treatment of training AI models as fair use is “essential to protecting research,” including non-generative, nonprofit educational research methodologies like text and data mining (TDM). If fair use rights were overridden and licenses restricted researchers to training AI on public domain works, scholars would be limited in the scope of inquiries that can be made using AI tools. Works in the public domain are not representative of the full scope of culture, and training AI on public domain works would omit studies of contemporary history, culture, and society from the scholarly record, as Authors Alliance and LCA described in a recent petition to the US Copyright Office. Hampering researchers’ ability to interrogate modern in-copyright materials through a licensing regime would mean that research is less relevant and useful to the concerns of the day.
I would disagree, because I don’t see the research into AI as something of value to preserve.
So, you want an AI LLM trained to respond like a person from ~180 years ago, with their highly religious and cultural bias from a time so far removed from ours that you would feel offended by its answers, with no knowledge of anything from the past 100+ years? Would you be able to use such a thing in daily life?
Consider that even school textbooks are copyrighted, and people writing open source projects are sometimes offended by their OPEN SOURCE CODE being used to train AI. If you took the full “no offense taken” approach, you would basically cut away the AI model’s ability to learn basic human knowledge, or even to do the thing it’s actually “good” at.
The other part of the problem is that, legally speaking, forbidding training on copyrighted data opens up a huge window for companies with aggressive copyright protections to effectively end all fan works of something, or even to forbid people from making things with even a hint that the concept was conceived after once vaguely hearing about or seeing a copyrighted work. How do you legally prove you’ve never been exposed to, even briefly, and thus have never been influenced by, something that’s memetically and culturally everywhere, for example?
As for AI art and music, there are open source PD/CC-only models out there; I call them “vegan models”. CommonCanvas, for instance. The problem with these models is the lack of subject material available (only 10 million images, when there are far more than 10 million things to look at in the world, before even considering ways to combine them), and the lack of interest in doing the proper legwork to make sure the AI learns properly through good image tagging, which can take years to complete. Training AI is very expensive and time-consuming (especially the captioning part, since it is a human task!), and if you don’t have a literal supercomputer you can run for several months at tens of thousands of dollars per month, you aren’t going to make even a small model work in any reasonable amount of time. What makes the big art models good at what they do is both the size of the dataset and the captioning. You need a dataset in the billions.
For example, if you have never seen any kind of cat before, and no one tells you what a cat looks like, and no one tells you how biology works, and you get a single side-on image of a lion and are told that is a cat, will you be able to draw it from every angle? No, you won’t. You can guess and infer, but it may not be right. You have the advantage of many, many more data points to draw from in your mind, the human advantage. These AI models don’t have that. If you want an AI to draw a lion from every perspective, you need to show it lion images from every perspective so it knows what it looks like.
As for AI “tracing”, well, that’s not accurate either. AI models do not normally contain training image data in any reproducible form. They contain probability matrices of shapes and curves, which mathematically describe the probability of a certain shape in correlation with other concepts alongside it. Take a single one of these “neuron” matrices and graph it, and you get a mess of shapes and curves that vaguely resembles psychedelic abstract art of different parts of that concept… and sometimes other concepts too, because it can and often does use the same “neuron” for other concepts that are logically unrelated, but that make sense for something only interested in defining shapes.
Most importantly, AI models do not use the binary logic most people are used to from computers. It is not a definitive yes/no on anything. It is a floating point number, a varying scale of “maybe”, which allows it to combine concepts and be nuanced with them without being rigid. This is what makes the AI able to do more than be a tracing machine.
What this really comes down to is the human factor, the primal fear of “the machine” or “something greater” being able to outcompete the human. Media has given us the concept of rogue AI destroying civilization since the dawn of the machine age, and it is thoroughly ingrained in our culture that smart machines = evil, even though reality isn’t that far along yet. People forget how much support is required to keep a machine going. They don’t heal themselves or magically keep running forever.
If the only way your product can be useful is by stealing other people’s work, well then it’s not a well made product.
Because there’s not enough PD content there to train AI on.
Copyright law generally (yes, I know this varies country by country) gives the creator immediate ownership without any further requirements, which means every doodle, shitpost and hot take online is the property of its creator UNLESS they choose to license it in a way that would allow use.
Nobody does, and thus the data the AI needs simply doesn’t exist as PD content. That leaves someone training a model with only two choices: steal everything, or don’t do it.
You can see what choice has been universally made.
People were also a lot more open to their data being used for machine learning when it was applied to universally appreciated tasks like image classification or image upscaling: tasks no human would want to do manually and which threaten nobody.
The difference today is not the data used, but the threat from the use-case. Or, more accurately, people don’t mind their data being used if they know the outcome is of universal benefit.
Copying is not theft. Letting only massive and notoriously untransparent corporations control an emerging technology is.