GitHub's Copilot; Can the original author please stand up?
~ Richard Serra, 1973
This famous line referred to the 'free' broadcast television programs that almost everyone spent hours per week, if not hours per day, watching. (Before the internet and cable, people watched news and entertainment on free broadcast stations.) In the USA, almost all programming was supported by advertising. Early ads were simple, but as more psychological research was done, ads became more and more effective at influencing public perception. Entire genres of TV catered to specific groups. Early 'soap operas', shown mid-day and targeted at stay-at-home mothers, were and are full of ads for household cleaning products. Saturday morning cartoons were full of ads meant to get children to get their parents to buy sugary cereals and the latest toys. In fact, some cartoons were blatantly 30-minute ads for toys (GI Joe, Transformers, etc.).
Today, TV has become more targeted thanks to cable and satellite offering hundreds of channels for every market: channels just for sports fans, others just for women, children, sci-fi fans, etc. You can notice how each channel runs the same types of ads constantly; e.g., news channels run many ads related to medicine, since their viewers tend to be older folks concerned about their health.
I like to read newsletters for web developers (since I'm a webdev myself) to check out the latest articles, videos, and software related to the platforms and programming languages I'm interested in. Most of the newsletters I read have ads for products, services, and job offers that I am sometimes interested in. So it is a win-win-win: the newsletter author curates advertisers for a niche market and delivers content, both free and paid, that many people want to read.
There is another 'advertising' model for many of the commercial services that we developers use: the 'free tier' or 'freemium'. This model makes a limited service available to anyone with an email address, in the hope that you will buy a paid tier later. It lets users test products by spending only time. It also helps the service provider test their platform, gather contact info for later follow-ups, raise awareness, and increase community size. More on this later.
There were free code-sharing platforms long before GitHub arrived. In the pre-internet days, programmers would use dial-up modems to connect to a BBS where you could download both programs and source code. There was also 'sneaker-net', where people would copy and swap disks and tapes full of code. The early internet opened the door for FTP sites (often hosted by universities) to host archives of source code. Forums, IRC, and Usenet provided more 'social' avenues for people to share code, tips, and suggested revisions.
SourceForge was one of the larger portals that allowed programmers to share their source code and compiled programs, and to gather feedback to add features and fix bugs. Its user base grew as larger projects (like Linux distros and audio, video, and code editors) made SourceForge their home. It seems to be supported by ads, but I still wonder how deep in the red it runs. I still visit there on occasion, mostly for FileOptimizer.
GitHub had great timing: they took a semi-new but powerful revision tracking system (git), added a feedback forum (issues), hosting, and a mini social profile (to help personalize), and wrapped it all together with a back-end language that was quick to develop in (Ruby).
IMHO, their clean interface and generous ad-free freemium level attracted many, many programmers and entire organizations to host their code on GitHub. They made git fun to use and easy to hack in quick changes. Their search and categories allowed many people to quickly find code to solve problems they had, or even provide new ideas. GitHub became the primary portal for code hosting and sharing, enabling many programmers and companies to flourish.
Though not thought of as a 'social media site' like Facebook, Twitter, etc., it is a place where many cultures intermingle. It is very common for people from different countries, religions, beliefs, etc. to gather and create solutions together.
GitHub does have a paid tier that many pay for, but I doubt that alone was worth the $7.5 billion Microsoft paid for GitHub. It would be the user base and all the code there that made GH worth that much.
Most (but not all) code on GitHub is covered by some kind of license. MIT, Creative Commons, and GPL licenses all have 'strings attached'. Usually the requirement is that the copyright notice (and authors' names) is kept with copies of the code, but a license may include greater restrictions for commercial use... And these restrictions can carry over to derived programs. So if you use code that includes another library that depends on something with a 'no-commercial-use' license, and you want to use said code in your job or side-hustle, you may find yourself in a legal situation.
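To make the 'dependency of a dependency' problem concrete, here is a minimal sketch of how one might walk a dependency tree and list every package whose license deserves a second look. Everything in it is made up for illustration: the package names, the `DEPENDENCIES` and `LICENSES` tables, the `NEEDS_REVIEW` set, and the `find_flagged` function are all hypothetical, and which licenses actually matter for your situation is an assumption you would have to replace with your own list.

```python
from typing import Dict, List, Set

# Hypothetical dependency graph: package -> its direct dependencies.
DEPENDENCIES: Dict[str, List[str]] = {
    "my-app": ["http-helper", "date-utils"],
    "http-helper": ["tiny-parser"],
    "date-utils": [],
    "tiny-parser": [],
}

# Hypothetical license label for each package.
LICENSES: Dict[str, str] = {
    "my-app": "MIT",
    "http-helper": "MIT",
    "date-utils": "GPL-3.0",
    "tiny-parser": "CC-BY-NC-4.0",
}

# Labels this sketch treats as needing review before commercial use.
# This set is an assumption for the example, not an authoritative list.
NEEDS_REVIEW: Set[str] = {"GPL-3.0", "CC-BY-NC-4.0"}


def find_flagged(root: str) -> Dict[str, str]:
    """Return every direct or transitive dependency of `root`
    whose license label is in NEEDS_REVIEW."""
    flagged: Dict[str, str] = {}
    seen: Set[str] = set()
    stack = [root]
    while stack:
        pkg = stack.pop()
        if pkg in seen:
            continue  # already visited; also guards against cycles
        seen.add(pkg)
        license_name = LICENSES.get(pkg, "UNKNOWN")
        if pkg != root and license_name in NEEDS_REVIEW:
            flagged[pkg] = license_name
        # Queue this package's own dependencies so nested ones are checked too.
        stack.extend(DEPENDENCIES.get(pkg, []))
    return flagged


if __name__ == "__main__":
    for pkg, lic in sorted(find_flagged("my-app").items()):
        print(f"{pkg}: {lic}")
```

Note that `tiny-parser` is never imported by `my-app` directly; it only shows up because `http-helper` pulls it in, which is exactly the kind of surprise described above.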
Trust me, lawyers and company heads do not like legal ambiguity. I've lost several job opportunities as soon as I mention OSS.
I started to see sites and tools that scraped GitHub repos and presented that copyrighted code as uncopyrighted code, without attribution, as top search results, hmmm, about a year ago (?). But I've seen scraper sites directly quote entire StackOverflow answers long before that, so this wrong is nothing new.
But what is new is a major corporation (Microsoft/GitHub) doing the same thing. They call it 'AI trained', but like those 'find the traffic light' CAPTCHA tests, humans are training the AI, and the results are sometimes direct quotes of code, which is unauthorized. Is some of the code shown from fully 'all rights reserved' private codebases hosted on GitHub? One PM I received from a major contributor to npm (JavaScript libraries) said that others are 'seeing your code signature everywhere'.
(Another example)
Kabir Nagrecha (@kabirnagrecha):
Forgotten secrets of AGI revealed by CoPilot…it was the LSTM all along! twitter.com/awjuliani/stat…
16:23 PM - 02 Jul 2021
Arthur Juliani (@awjuliani):
Looks like according to co-pilot, LSTMs are all we need after all... (Must be those Schmidhuber papers in the GPT training dataset) https://t.co/HiFmdN8XHx
I'm not the only one concerned:
Tomislav Kraljic (@kraljic_t):
The majority of tech twitter is not speaking out against the unethical and unauthorized use of licensed source code of GitHub's Copilot Program because most are working for Microsoft or other FAANG companies that steal and profit of our your data.
I said what I said.
01:55 AM - 04 Jul 2021
Kelly Sommers (@kellabyte):
I heard GitHub is training their co-pilot with other peoples copyrighted source code thus allowing copyrighted source code to be injected into other peoples products.
There was a time I felt GitHub was on the ethical side of things but that’s starting to fade.
22:19 PM - 03 Jul 2021
I'm honestly surprised at the backlash I'm seeing to @github's copilot. People, including myself, are complaining that it infringes on their work.
What surprises me is that developers didn't seem to care about stealing everyone else's code without attribution until now.
03:05 AM - 04 Jul 2021
Hopefully Copilot is not giving away passwords... but maybe it is? Or at least encryption methods?
There is legal precedent in music that taking even a few bars can be a copyright violation. We should keep this in mind...
The sad fact is that those using code-grabbing tools like this may end up being more productive, leaving those who don't use said tools behind...
There are other GitHub clones, written in Go (Gitea) and V (Gitly).
If your code is on the internet, someone can take it. People are willing to steal; I know someone who had his hardware reverse-engineered and hundreds of clones made, putting him out of business.
The only real solution is not to share your code at all. That makes me a bit sad, but in a world where doctors can't share medical advice on Facebook anymore and scientists get their conversations censored, this is the world we live in. :(