machine-learning training data example
- Mall Customers Dataset
The Mall Customers dataset contains information about people visiting a shopping mall. The dataset has gender, customer ID, age, annual income, and spending score. It is used to gain insights from the data and to group customers based on their behavior.
- Iris Dataset
The Iris dataset is a simple, beginner-friendly dataset that contains information about flower petal and sepal sizes. The dataset has 3 classes with 50 instances in each class, so it contains 150 rows with only 4 columns.
- MNIST Dataset
This is a database of handwritten digits. It contains 60,000 training images and 10,000 testing images. This is an ideal dataset to start doing image classification, where you classify a digit from 0 to 9.
- The Boston Housing Dataset
This is a popular dataset used in pattern recognition. It contains information about different houses in Boston based on crime rate, tax, number of rooms, etc. It has 506 rows and 14 different variables in columns. You can use this dataset to predict house prices.
- Fake News Detection Dataset
It is a CSV file that has 7796 rows with 4 columns. The first column identifies the news item, the second is the title, the third is the news text, and the fourth is the label TRUE or FAKE.
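A file with this four-column layout can be read with Python's standard `csv` module. The following is a minimal sketch: the column names (`id`, `title`, `text`, `label`) and the three sample rows are assumptions invented for illustration, not taken from the actual file.

```python
import csv
import io
from collections import Counter

# Hypothetical sample mimicking the four-column layout described above.
# The header names and the rows are invented for illustration only.
SAMPLE_CSV = """id,title,text,label
1,Example headline A,Body of the first article.,FAKE
2,Example headline B,Body of the second article.,TRUE
3,Example headline C,Body of the third article.,FAKE
"""

def count_labels(csv_file):
    """Count how many rows carry each value in the 'label' column."""
    reader = csv.DictReader(csv_file)
    return Counter(row["label"] for row in reader)

label_counts = count_labels(io.StringIO(SAMPLE_CSV))
print(label_counts)
```

To try it on a real file, pass an opened file object (for example `open(path, newline="", encoding="utf-8")`) instead of the `io.StringIO` wrapper, and adjust the column name to whatever the file's header actually uses.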
- Wine quality dataset
The dataset contains different chemical information about wine. It has 4898 instances with 14 variables each. The dataset is good for classification and regression tasks. The model can be used to predict wine quality.
- SOCR data – Heights and Weights Dataset
This is a simple dataset to start with. It contains only the heights (inches) and weights (pounds) of 25,000 different individuals of 18 years of age. This dataset can be used to build a model that can predict the height or weight of a human.
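One simple model for this kind of data is a straight-line fit of weight against height. The sketch below implements ordinary least squares in plain Python; the five (height, weight) pairs are made-up values for illustration, not rows from the SOCR data, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up (height in inches, weight in pounds) pairs, for illustration only.
heights = [64.0, 66.0, 68.0, 70.0, 72.0]
weights = [110.0, 120.0, 130.0, 140.0, 150.0]

slope, intercept = fit_line(heights, weights)

# Predict the weight of a person who is 69 inches tall.
predicted = slope * 69.0 + intercept
print(round(predicted, 1))
```

With the real 25,000 rows the points would scatter around the line rather than sit exactly on it, so the fit would give an average trend, not an exact prediction.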
- Parkinson Dataset
Parkinson's is a nervous system disorder that affects movement. The dataset contains 195 records of people with 23 different attributes which contain biomedical measurements. The data is used to differentiate healthy people from people with Parkinson's disease.
- Titanic Dataset
On 15 April 1912, the "unsinkable" Titanic sank, killing 1502 passengers out of 2224. The dataset contains information like name, age, sex, number of family members aboard, etc. for 891 passengers in the training set and 418 passengers in the testing set.
- Uber Pickups Dataset
The dataset has information on around 4.5 million Uber pickups in New York City from April 2014 to September 2014 and 14 million more from January 2015 to June 2015. Users can perform data analysis and gather insights from the data.
- Chars74k Dataset
The dataset contains images of character symbols used in the English and Kannada languages. It has 64 classes (0-9, A-Z, a-z), 7.7k characters from natural images, 3.4k hand-drawn characters, and 62k computer-synthesized font characters.
- Credit Card Fraud Detection Dataset
The dataset contains transactions made by credit cards, labeled as fraudulent or genuine. This is important for companies that have transaction systems and want to build a model for detecting fraudulent activity.
- Chatbot Intents Dataset
The dataset is a JSON file that contains different tags like greetings, goodbye, hospital_search, pharmacy_search, etc. Each tag contains a list of patterns a user can ask and the responses a chatbot can give according to that pattern. The dataset is good for understanding how chatbot data works.
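The tag, pattern, and response layout can be sketched as follows. This is an illustrative guess at the structure, not the actual file: the key names (`intents`, `tag`, `patterns`, `responses`) and all sample texts are assumptions, and the exact-match lookup stands in for the classifier a real chatbot would train on the patterns.

```python
import json
import random

# Hypothetical intents file content; key names and texts are assumptions.
INTENTS_JSON = """
{
  "intents": [
    {"tag": "goodbye",
     "patterns": ["Bye", "See you later"],
     "responses": ["Goodbye!"]},
    {"tag": "hospital_search",
     "patterns": ["Find a hospital", "Hospital near me"],
     "responses": ["Please tell me your location.", "Which area are you in?"]}
  ]
}
"""

def find_tag(intents, user_text):
    """Return the tag whose patterns contain user_text (case-insensitive), else None."""
    wanted = user_text.strip().lower()
    for intent in intents:
        if any(pattern.lower() == wanted for pattern in intent["patterns"]):
            return intent["tag"]
    return None

def pick_response(intents, tag):
    """Return one randomly chosen response for the given tag, else None."""
    for intent in intents:
        if intent["tag"] == tag:
            return random.choice(intent["responses"])
    return None

intents = json.loads(INTENTS_JSON)["intents"]
tag = find_tag(intents, "bye")
print(tag, "->", pick_response(intents, tag))
```

Exact matching only recognizes the listed patterns word for word; a trained model is what lets a chatbot map unseen phrasings to the same tag.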
AI Datasets for Natural Language Processing
- Enron Email Dataset
The Enron dataset is popular in natural language processing. It contains around 0.5 million emails from over 150 users, most of whom were senior management at Enron. The size of the data is around 432 MB.
- The Yelp Dataset
Yelp made its dataset publicly available, but you have to fill out a form first to access the data. It contains 1.2 million tips by 1.6 million users, and over 1.2 million business attributes and photos for natural language processing tasks.
- Jeopardy Dataset
Jeopardy! is an American television game show in which general knowledge questions are asked with a twist. The dataset contains 200k+ questions and answers in a CSV or JSON file.
- Recommender Systems Dataset
This is a portal to a collection of rich datasets that were used in lab research projects at UCSD. It contains various datasets from popular websites, such as Goodreads book reviews, Amazon product reviews, bartending data, data from social media, etc., that are used in building recommender systems.
- UCI Spambase Dataset
Classifying emails as spam or non-spam is a very common and useful task. The dataset contains 4601 emails and 57 metadata attributes describing the emails. You can build models to filter out the spam.
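Many of the Spambase attributes are word-frequency percentages: how often a given word appears relative to all words in the message. A minimal sketch of computing features of that kind follows; the tracked word list, the sample message, and the `word_frequencies` helper are assumptions for illustration, and the tokenization is likely simpler than whatever produced the original dataset.

```python
import re

# Hypothetical subset of tracked words, chosen for illustration.
TRACKED_WORDS = ("free", "money", "meeting")

def word_frequencies(message, tracked=TRACKED_WORDS):
    """Return each tracked word's share of all words in the message, as a percentage."""
    # Lowercase the text and split it into alphanumeric tokens.
    words = re.findall(r"[a-z0-9]+", message.lower())
    total = len(words)
    if total == 0:
        return {word: 0.0 for word in tracked}
    return {word: 100.0 * words.count(word) / total for word in tracked}

features = word_frequencies("Free money now, claim your free prize today")
print(features)
```

A feature vector like this, computed per message, is the kind of input a spam classifier would be trained on.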
- Flickr 30k Dataset
The Flickr 30k dataset is similar to the Flickr 8k dataset but contains more labeled images. It has over 30,000 images and their captions. This dataset is used to build more accurate models than the Flickr 8k dataset allows.
- IMDB reviews
The large movie review dataset consists of movie reviews from the IMDB website, with over 25,000 reviews for training and 25,000 for the testing set.
- MS COCO dataset
Microsoft's COCO is a huge dataset for object detection, segmentation, and image captioning tasks. It has around 1.5 million labeled object instances. The dataset is great for building production-ready models.
- Flickr 8k Dataset
The Flickr 8k dataset contains 8000 images, and each image is labeled with 5 different captions. The dataset is used to build an image caption generator.