
Top AI Dataset Features Cryptocurrency Websites in its Datafeed

Published 22/04/2023, 15:00

  • Colossal Clean Crawled Corpus depends on multiple crypto platforms for data.
  • Analysis shows part of C4’s text snippets are extracted from crypto-based websites.
  • The presence of crypto sites in C4’s dataset could affect its level of bias.

Colossal Clean Crawled Corpus (C4), a leading AI training dataset, draws a notable share of its data from crypto platforms. An analysis shows that C4 includes millions of text snippets taken from crypto-based websites or web platforms closely related to cryptocurrency.

According to reports, the website of the U.S. Securities and Exchange Commission (sec.gov), which now contains a significant amount of crypto-related information, accounts for 36 million C4 tokens, representing 0.02% of the dataset. The site ranked 39th among the websites represented in C4.

Satoshi Nakamoto’s Bitcointalk.org accounted for 6.1 million C4 tokens, equivalent to 0.004% of the total tokens. It ranked 780th among the websites represented in the dataset.
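As a rough sanity check on the figures above, the token counts and percentage shares can be used to back out the dataset size they imply. The sketch below is illustrative only: the function name is hypothetical, the article does not state C4’s total token count, and the quoted percentages are rounded, so the results are approximate.

```python
def implied_total_tokens(site_tokens: float, share_percent: float) -> float:
    """Estimate the dataset size implied by a site's token count and its percentage share."""
    return site_tokens / (share_percent / 100.0)


# Figures as quoted in the article (percentages are rounded in the source).
sec_total = implied_total_tokens(36e6, 0.02)            # sec.gov
bitcointalk_total = implied_total_tokens(6.1e6, 0.004)  # Bitcointalk.org

print(f"Implied total from sec.gov: {sec_total / 1e9:.1f} billion tokens")
print(f"Implied total from Bitcointalk.org: {bitcointalk_total / 1e9:.1f} billion tokens")
```

Both estimates land in the range of roughly 150 to 180 billion tokens, so the two quoted shares are consistent with each other once rounding is taken into account.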

Other crypto sources in C4 include the crypto news website Cointelegraph and the token aggregation platform CoinMarketCap. These and six more related websites accounted for 0.008% of all C4 tokens, while other websites related to specific cryptocurrencies formed a negligible part of the dataset.

IPFS (ipfs.io) and Steemit (steemit.com) also featured prominently in C4’s dataset. IPFS ranked 16th, while Steemit ranked 594th. Neither site is directly involved in crypto, but both have strong ties to the crypto industry.

The presence of crypto-related platforms in C4’s training data reflects cryptocurrency’s encroachment into the mainstream. Crypto websites are represented heavily enough to influence models trained on C4, even though mainstream websites like Google and Facebook (NASDAQ:META) outrank them significantly.

C4 has faced criticism over pirated data and hate speech, despite reports of the dataset being “cleaned”. Its list for censoring specific content contains only 400 words, which suggests that controversial content could still remain within C4. The presence of crypto sites in its dataset could also affect its level of bias.

The post Top AI Dataset Features Cryptocurrency Websites in its Datafeed appeared first on Coin Edition.

