Yahoo releases massive machine learning dataset for researchers

Yahoo has introduced its new "Yahoo News Feed" dataset, which it says is the largest machine learning dataset ever released publicly. According to Yahoo, the dataset has been released for the academic research community, giving researchers who normally cannot access data at this scale the opportunity to work with it.

The dataset contains 110 billion events, which Yahoo says amounts to 13.5 TB uncompressed. According to the company, it consists of anonymized user-news item interaction data gathered from about 20 million users over a period of a few months early last year.

The Yahoo News Feed dataset contains anonymized data from users who interacted with various Yahoo properties, including Yahoo Movies, Yahoo News, and Yahoo Finance. The company is making it available through Yahoo Labs Webscope, its data-sharing program.

Yahoo is also including some demographic data, such as gender and age, as well as general geographic locations. This data is likewise anonymized. Yahoo said in a statement, "Our goals are to promote independent research in the fields of large-scale machine learning and recommender systems, and to help level the playing field between industrial and academic research."