We have listed some quality datasets for machine learning projects. You can refer to these datasets based on your project requirements and access them for free.
- Labelme – Data labeling for computer vision. Labelme is a large dataset of annotated images. It allows you to control your data labeling accuracy and generate high-quality training data.
- ImageNet – The de facto image dataset for new algorithms, organized according to the WordNet hierarchy, in which each node is depicted by hundreds or thousands of images.
- Large-scale Scene UNderstanding Challenge (LSUN) – The LSUN classification dataset has 10 categories of images. For training, each category contains a large number of images, ranging from around 120,000 to 3,000,000.
- Microsoft Common Objects in Context (MS COCO) – MS COCO is a large-scale object detection, segmentation, and captioning dataset.
- Columbia Object Image Library (COIL-100) Dataset – A dataset of 7,200 color images of 100 objects. Each object is photographed at regular angular intervals through a full 360-degree rotation.
- Visual Genome – Visual Genome is a very detailed visual knowledge base to connect structured image concepts to language. It contains Visual Question Answering data in a multi-choice setting.
- Google’s Open Images – Open Images by Google is one of the largest publicly available collections of annotated images. It consists of about 9 million image URLs, annotated with labels spanning over 6,000 categories.
- Labeled Faces in the Wild (LFW) – LFW consists of more than 13,000 labeled images of human faces, for use in developing applications involving facial recognition.
- Stanford Dogs Dataset – Contains 20,580 images and 120 different categories of dog breeds.
- Indoor Scene Recognition – A very specific and useful dataset of 15,620 images across 67 indoor categories. It is motivated by the observation that most image recognition models perform well on outdoor scenes but poorly in indoor settings.
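As a quick sanity check on the COIL-100 figures in the list above, the image count implies a fixed number of poses per object. The sketch below assumes the poses are evenly spaced over the full rotation; the variable names are illustrative and not part of the dataset.

```python
# Figures quoted in the list above for COIL-100.
TOTAL_IMAGES = 7200
NUM_OBJECTS = 100

# Assumption: every object has the same number of poses,
# evenly spaced over a full 360-degree rotation.
poses_per_object = TOTAL_IMAGES // NUM_OBJECTS
angle_step_degrees = 360 / poses_per_object

# Rotation angle (in degrees) for each pose of a single object.
pose_angles = [i * angle_step_degrees for i in range(poses_per_object)]

print(poses_per_object, angle_step_degrees)
```

Under that assumption, the counts work out to a whole number of poses per object, which is a useful check before writing a loader that indexes images by object and angle.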
Sentiment Analysis Datasets
- Multi-Domain Sentiment Dataset (version 2.0) – It consists of unprocessed and processed Amazon product reviews.
- IMDb Movie Reviews Dataset – A relatively small dataset for binary sentiment classification containing 50,000 movie reviews. It is divided into two sets (training and test) of 25,000 reviews each, and each set contains an equal number (50%) of positive and negative reviews.
- Stanford Sentiment Treebank – Dataset with sentiment annotations of 11,855 single sentences extracted from movie reviews.
- Sentiment140 – A popular dataset for analyzing sentiment about a brand, product, or topic on Twitter. It has over 1.6 million tweets with the emoticons removed.
- Twitter US Airline Sentiment – Contains tweets from February 2015 about problems with each major U.S. airline. The tweets are classified as positive, negative, or neutral.
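Several of the sentiment datasets above are described in terms of their label balance, for example the 50% positive and 50% negative split in IMDb. A minimal sketch of how such a balance could be checked after loading the labels is shown below; the `class_balance` helper and the toy label list are hypothetical stand-ins, not real review data.

```python
from collections import Counter


def class_balance(labels):
    """Return the fraction of examples carrying each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}


# Toy stand-in for one 25,000-review split (not real IMDb data).
toy_labels = ["pos"] * 12_500 + ["neg"] * 12_500
balance = class_balance(toy_labels)
print(balance)
```

The same helper works for three-way labels such as positive, negative, and neutral, since it makes no assumption about the number of classes.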
Natural Language Processing Datasets
- HotpotQA Dataset – HotpotQA is a question answering dataset collected from English Wikipedia. It provides strong supervision for supporting facts to enable more explainable question answering systems.
- Enron Email Dataset – Enron Dataset contains nearly 500,000 emails from more than 150 users. It is stored in a dictionary, where each key in the dictionary is the name of a person and the value is a dictionary containing all characteristics of that person.
- Amazon Customer Reviews Dataset – Contains around 35 million Amazon reviews spanning 18 years. The data includes product and user information, ratings, and the plain text review.
- Google Ngram Viewer – Google Ngram Viewer contains counted syntactic n-grams extracted from the English text of Google Books.
- Blog Authorship Corpus – A collection of 681,288 blog posts gathered from blogger.com. Each blog contains a minimum of 200 occurrences of commonly used English words.
- Wikilinks Dataset – A huge dataset of Wikipedia's network of internal links (wikilinks). The dataset contains almost 1.9 billion words from more than 4 million articles. You can search by word, phrase, or part of a paragraph.
- Project Gutenberg – Comprises an annotated list of Project Gutenberg eBooks, a library of over 60,000 free eBooks.
- Hansards Canadian Parliament Dataset – This dataset contains 1.3 million pairs of texts from the records of the 36th Canadian Parliament.
- Jeopardy Dataset – It is a CSV file containing 216,930 Jeopardy questions, answers and other data.
- SMS Spam Collection Dataset – A dataset of 5,574 SMS messages in English, each labeled as ham (legitimate) or spam.
- Yelp Reviews – An open dataset published by Yelp, it contains more than 5 million reviews.
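The nested-dictionary layout described for the Enron dataset in the list above can be pictured as follows. This is only a sketch: the names, feature keys, and the `feature_column` helper are hypothetical placeholders chosen for illustration, not actual records or field names from the dataset.

```python
# Hypothetical records: the outer key is a person's name and the
# inner dictionary holds that person's characteristics.
people = {
    "PERSON A": {"email_address": "a@example.com", "messages_sent": 120},
    "PERSON B": {"email_address": "b@example.com", "messages_sent": 45},
}


def feature_column(data, feature):
    """Collect one characteristic across all people, keyed by name."""
    return {name: info.get(feature) for name, info in data.items()}


sent = feature_column(people, "messages_sent")
most_active = max(sent, key=sent.get)
print(most_active)
```

Using `dict.get` means a characteristic that is missing for some person comes back as `None` instead of raising an error, which is convenient when records are incomplete.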
Autonomous Vehicles Datasets
- Berkeley DeepDrive BDD100k – A huge, diverse driving video database. It contains over 100,000 videos covering more than 1,100 hours of driving at different times of day and in different weather conditions. The annotated images come from the New York and San Francisco Bay areas.
- Baidu ApolloScape Dataset – An open source dataset that defines 26 semantic classes, such as cars, bicycles, pedestrians, buildings, and lampposts.
- Comma.ai Dataset – More than 33 hours of highway driving on California’s Highway 280. Details include car speed, acceleration, steering angle, and GPS coordinates.
- Oxford RobotCar Dataset – Over 100 repetitions of the same route through Oxford, UK, captured over a year. The dataset captures different combinations of weather, traffic, and pedestrians, along with long-term changes like construction and roadworks.
Dataset Finders
Here we have listed some of the most popular dataset finders. You can explore these platforms for the dataset of your choice and get started on your ML projects.
- Kaggle
- UC Irvine Machine Learning Repository
- OpenML
- VisualData Discovery
- Google Dataset Search
Kaggle
Kaggle offers a range of interesting externally contributed datasets. You can find all sorts of text, audio, numerical, and image datasets. Users can explore the datasets using targeted keywords and also publish their own datasets for machine learning and other topics.
UC Irvine Machine Learning Repository
One of the oldest sources of datasets on the web, and a great first stop when looking for interesting datasets. The UC Irvine Machine Learning Repository currently maintains over 600 machine learning datasets and offers a searchable interface for keyword-based search.
Although the data sets are user-contributed and therefore have different levels of cleanliness, the vast majority are clean. You can download data directly from the UCI Machine Learning repository, without registration.
OpenML
OpenML hosts over 21,000 datasets. It is a platform for sharing and organizing data and is regularly updated.
VisualData Discovery
You can discover machine vision datasets by category. You can input your search queries and filter the results as per your search criteria.
Google Dataset Search
This platform indexes over 25 million datasets, and you need some prior knowledge of what you are looking for to find suitable datasets for machine learning. Google’s Dataset Search redirects you to the host website, that is, the publisher’s site.
We hope this article gave you an idea about finding free datasets for machine learning projects.