Google don confirm say e dey train Bard on scraped web data too!
Pidgin / Black American Slang / English
On Monday, Gizmodo spot Google as dem update dia privacy policy to yan say dia different AI services like Bard and Cloud AI fit dey trained on public data wey di company don scrape from di web.
“Our privacy policy don always transparent say Google dey use information wey dey available for public for web train language models wey dey useful for services like Google Translate,” na wetin Google spokesperson, Christa Muldoon, yan The Verge. “Dis latest update just dey clear say services like Bard self join inside. As we dey develop our AI technologies, we dey follow privacy principles and safeguards, as we put dem for our AI Principles.”
After di update on July 1st, 2023, Google privacy policy con yan say “Google dey use information make we improve our services and develop new products, features, and technologies wey go benefit our users and di public.” Dem talk say dem fit “use information wey dey available for public help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.”
You fit see from di policy revision history say dis update add small clarity as to di services wey go use di data wey dem don collect. For example, di document con yan say di information fit use for “AI Models” instead of “language models,” wey give Google freedom to train and build systems beside LLMs on top di public data. But, na as you click di link wey dey inside di policy, na im you go fit open di part wey concern you wey dey talk about “publically accessible sources.”
Dis updated policy con specify say dem dey use “publicly available information” train Google’s AI products, but dem no yan how dem go take prevent copyrighted materials from enter inside di data pool. Some websites wey dey available for public get policy wey ban data collection or web scraping wey fit use train large language models and oda AI tools. E go dey interesting to see as dem go handle dis matter for di global regulations like GDPR wey dey protect people from dia data wey dem no give permission.
Di combination of these laws and competition for market don make makers of popular generative AI systems like OpenAI’s GPT-4 dey yan coded coded about where dem collect dia data wey dem use train dem and whether e include social media posts or copyrighted works wey artists and authors create.
Di matter of whether fair use doctrine extend reach dis kind application still dey inside area wey no clear. Di uncertainty don ginger lawsuits and make lawmakers for some countries bring stricter laws wey fit regulate how AI companies dey collect and use dia training data. E still raise questions about how dem dey process di data to make sure say e no dey cause dangerous failures inside AI systems. People wey dey responsible for sort out di plenty training data often dey work long hours under extreme conditions.
Gannett, wey be di biggest newspaper publisher for United States, don sue Google and dem parent company, Alphabet, talk say advancements for AI technology don help di search giant hold monopoly for digital ad market. Some people don even call Google’s AI search beta “plagiarism engines” and yan say e dey reduce website traffic.
Meanwhile, Twitter and Reddit, two social platforms wey get plenty public information, don make drastic changes to try prevent oda companies from just dey harvest dia data like dat. Di changes and limitations wey dem put for di platforms don cause serious wahala for Twitter and Reddit users wey dey complain say e don affect di core experience wey dem dey enjoy.
NOW IN BLACK AMERICAN SLANG
Google confirms it’s training Bard on scraped web data too!
On Monday, Gizmodo peeped dat Google updated its privacy policy to let us know dat its various AI services, like Bard and Cloud AI, may be trained on public data dat the company done scraped from the web.
“Our privacy policy been transparent ’bout how Google use publicly available info from the open web to train language models fo’ services like Google Translate,” said Google spokesperson Christa Muldoon to The Verge. “Dis latest update just be makin’ it clear dat newer services like Bard be included too. We incorporate privacy principles and safeguards into the development of our AI technologies, in line wit’ our AI Principles.”
After the update on July 1st, 2023, Google’s privacy policy now say dat “Google use info to improve our services and to develop new products, features, and technologies dat benefit our users and the public” and dat the company may “use publicly available info to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.”
Ya can peep from the policy’s revision history dat the update give some extra clarity ’bout the services dat go dey trained usin’ the collected data. Fo’ example, the document now say dat the info may be used fo’ “AI Models” instead of “language models,” givin’ Google more freedom to train and build systems beside LLMs on ya public data. And even dat note be tucked away unda an embedded link fo’ “publically accessible sources” undaneath the policy’s “Ya Local Information” tab dat ya gotta click to open the relevant section.
The updated policy specify dat “publicly available info” be used to train Google’s AI products but don’t say how (or if) the company go prevent copyrighted materials from bein’ included in dat data pool. Many publicly accessible websites got policies dat ban data collection or web scrapin’ fo’ the purpose of trainin’ large language models and otha AI tools. It gonna be interestin’ to peep how dis approach play out wit’ various global regulations like GDPR dat protect people against their data bein’ misused without they express permission, too.
A combination of these laws and increased market competition done made makers of popular generative AI systems like OpenAI’s GPT-4 hella cagey ’bout where they got the data used to train ’em and whether or not it include social media posts or copyrighted works by human artists and authors.
The matter of whether or not the fair use doctrine extend to this kinda application currently sittin’ in a legal gray area. The uncertainty done sparked various lawsuits and pushed lawmakers in some nations to introduce stricter laws dat be betta equipped to regulate how AI companies collect and use they trainin’ data. It also raise questions ’bout how this data bein’ processed to ensure it don’t contribute to dangerous failures within AI systems, wit’ the people tasked wit’ sortin’ through these vast pools of trainin’ data often subjected to long hours and extreme workin’ conditions.
Gannett, the largest newspaper publisher in the United States, be suin’ Google and its parent company, Alphabet, claimin’ dat advancements in AI technology done helped the search giant to hold a monopoly ova the digital ad market. Products like Google’s AI search beta also been called “plagiarism engines” and criticized fo’ starvin’ websites of traffic.
Meanwhile, Twitter and Reddit, two social platforms dat contain vast amounts of public info, done recently took drastic measures to try and prevent otha companies from freely harvestin’ they data. The API changes and limitations placed on the platforms done been met wit’ backlash by they respective communities, as anti-scrapin’ changes done negatively affected the core Twitter and Reddit user experiences.
NOW IN ENGLISH
Google confirms it’s training Bard on scraped web data too!
On Monday, Gizmodo spotted that Google updated its privacy policy to disclose that its various AI services, such as Bard and Cloud AI, may be trained on public data that the company has scraped from the web.
“Our privacy policy has long been transparent that Google uses publicly available information from the open web to train language models for services like Google Translate,” Google spokesperson Christa Muldoon told The Verge. “This latest update simply clarifies that newer services like Bard are also included. We incorporate privacy principles and safeguards into the development of our AI technologies, in line with our AI Principles.”
Following the update on July 1st, 2023, Google’s privacy policy now states that “Google uses information to improve our services and to develop new products, features, and technologies that benefit our users and the public” and that the company may “use publicly available information to help train Google’s AI models and build products and features like Google Translate, Bard, and Cloud AI capabilities.”
You can see from the policy’s revision history that the update provides some additional clarity about which services will be trained using the collected data. For example, the document now says that the information may be used for “AI Models” rather than “language models,” granting Google more freedom to train and build systems besides LLMs on your public data. Even that note is buried: you have to click an embedded link for “publically accessible sources” under the policy’s “Your Local Information” tab to open the relevant section.
The updated policy specifies that “publicly available information” is used to train Google’s AI products but doesn’t say how (or if) the company will prevent copyrighted materials from being included in that data pool. Many publicly accessible websites have policies in place that ban data collection or web scraping for the purpose of training large language models and other AI toolsets. It will also be interesting to see how this approach plays out under global regulations like GDPR, which protect people against their data being misused without their express permission.
A combination of these laws and increased market competition has made makers of popular generative AI systems like OpenAI’s GPT-4 extremely cagey about where they got the data used to train them and whether it includes social media posts or copyrighted works by human artists and authors.
Whether the fair use doctrine extends to this kind of application currently sits in a legal gray area. The uncertainty has sparked various lawsuits and pushed lawmakers in some nations to introduce stricter laws that are better equipped to regulate how AI companies collect and use their training data. It also raises questions about how this data is processed to ensure it doesn’t contribute to dangerous failures within AI systems. The people tasked with sorting through these vast pools of training data are often subjected to long hours and extreme working conditions.
Gannett, the largest newspaper publisher in the United States, is suing Google and its parent company, Alphabet, claiming that advancements in AI technology have helped the search giant to hold a monopoly over the digital ad market. Products like Google’s AI search beta have also been dubbed “plagiarism engines” and criticized for starving websites of traffic.
Meanwhile, Twitter and Reddit, two social platforms that contain vast amounts of public information, have recently taken drastic measures to try to prevent other companies from freely harvesting their data. The API changes and limitations placed on the platforms have been met with backlash from their respective communities, as anti-scraping changes have negatively affected the core Twitter and Reddit user experiences.