
Top 32 Datasets in Machine Learning

The Best Machine Learning Datasets

A dataset is one of the main building blocks of a machine learning model. Before we start with any algorithm, we need a proper understanding of the data. The machine learning datasets listed here are widely used for research purposes, and most of them are homogeneous.

We use data to train and evaluate our model, and it plays a vital role in the whole process. If our data is structured, low in noise, and properly cleaned, then our model will give good accuracy at evaluation time.

1. ImageNet:

Size of the Dataset: ~ 150 GB

  • Each record consists of bounding boxes and respective class labels
  • ImageNet provides 1000 images for each synset
  • URLs of the images are provided by ImageNet
  • Because it is a large-scale image dataset, it is very helpful for researchers

2. Coco dataset:

COCO stands for Common Objects in Context. It is a large-scale object detection, segmentation, and captioning dataset with 1.5 million object instances across 80 object categories.

COCO provides five types of annotations:
  • object detection
  • keypoint detection
  • stuff segmentation
  • panoptic segmentation
  • image captioning

In the COCO dataset, annotations are stored in a JSON file.

Features provided by the COCO dataset:
  • Object segmentation
  • Recognition in context
  • Superpixel stuff segmentation
  • 330K images (>200K labeled)
  • 1.5 million object instances
  • 80 object categories
  • 91 stuff categories
  • 5 captions per image
  • 250,000 people with key points
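
The COCO JSON annotations can be explored programmatically. The sketch below uses the pycocotools library; the annotation file name and local path are assumptions for illustration (any downloaded COCO instances JSON file would work).

    # Minimal sketch: reading COCO JSON annotations with pycocotools.
    # The local path to the annotation file is an assumption.
    from pycocotools.coco import COCO

    coco = COCO("annotations/instances_val2017.json")

    # List the 80 object categories.
    cats = coco.loadCats(coco.getCatIds())
    print([c["name"] for c in cats])

    # Fetch the object annotations (boxes, masks) for one image.
    img_id = coco.getImgIds()[0]
    anns = coco.loadAnns(coco.getAnnIds(imgIds=img_id))
    print(len(anns), "object instances in image", img_id)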

3. Iris Flower Dataset:

The Iris flower dataset is built for beginners who have just started learning machine learning techniques and algorithms. With the help of this data, you can build a simple project using machine learning algorithms. The dataset is small and needs no pre-processing. It covers three different types of iris flowers, Setosa, Versicolour, and Virginica, with their petal and sepal measurements stored in a 150×4 numpy.ndarray.

Features
  • The data consists of four attributes, i.e., sepal length in cm, sepal width in cm, petal length in cm, and petal width in cm.
  • This data has three classes.
  • Each class has 50 instances, and the classes are Virginica, Setosa, and Versicolor.
  • The characteristics of this data are multivariate.
  • All of the attributes in this data are real-valued.
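
As a quick illustration (not part of the original dataset description), a copy of the Iris data ships with scikit-learn, so it can be loaded in a couple of lines:

    # Minimal sketch: loading the Iris data with scikit-learn.
    from sklearn.datasets import load_iris

    iris = load_iris()
    X, y = iris.data, iris.target      # X has shape (150, 4), y has shape (150,)
    print(iris.feature_names)          # sepal/petal length and width, in cm
    print(iris.target_names)           # ['setosa' 'versicolor' 'virginica']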

4. Breast cancer Wisconsin (Diagnostic) Dataset:

Breast cancer Wisconsin (Diagnostic) Dataset is one of the most popular data sets for classification problems in machine learning. This data set is based on breast cancer analysis. Features for this data set are computed from a digitized image of a fine needle aspirate (FNA) of a breast mass. They describe the characteristics of the cell nuclei present in the image.

Features
  • Three types of attributes are mentioned in the data set, i.e., ID, diagnosis, and 30 real-valued input features.
  • For each cell nucleus, ten real-valued features are calculated, i.e., radius, texture, perimeter, area, etc.
  • Two main classes are specified in the data set to predict, i.e., benign and malignant.
  • A total of 569 instances are present in this data set, comprising 357 benign and 212 malignant cases.
Attribute Information:
  1.  ID number
  2.  Diagnosis (M = malignant, B = benign)
  3-32) Ten real-valued features are computed for each cell nucleus:
  • radius (mean of distances from the center to points on the perimeter)
  • texture (standard deviation of grey-scale values)
  • perimeter
  • area
  • smoothness (local variation in radius lengths)
  • compactness (perimeter^2 / area – 1.0)
  • concavity (severity of concave portions of the contour)
  • concave points (number of concave portions of the contour)
  • symmetry
  • fractal dimension (“coastline approximation” – 1)
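
This dataset also ships with scikit-learn, so a minimal classification sketch looks as follows (the train/test split and the choice of logistic regression are illustrative, not part of the dataset itself):

    # Minimal sketch: loading the 569-sample Wisconsin breast cancer data
    # and fitting a simple benign/malignant classifier.
    from sklearn.datasets import load_breast_cancer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    data = load_breast_cancer()
    X_train, X_test, y_train, y_test = train_test_split(
        data.data, data.target, test_size=0.2, random_state=0)

    clf = LogisticRegression(max_iter=5000).fit(X_train, y_train)
    print("test accuracy:", clf.score(X_test, y_test))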

5. Twitter sentiment Analysis Dataset:

Sentiment analysis is one of the most popular applications of natural language processing (NLP), and this dataset will help you build a sentiment analysis model. It is a text-processing dataset, and with its help you can start building your first NLP model.

Structure of the dataset:

There are three main columns in this dataset:

  • ItemID – ID of the tweet
  • Sentiment – sentiment label
  • SentimentText – text of the tweet
Features
  • The dataset contains three tones of data: neutral, positive, and negative.
  • The format of the dataset is CSV (comma-separated values).
  • The dataset is divided into two parts: train.csv and test.csv.
  • So, using this dataset, you do not need to split your data into training and evaluation parts.
  • All you need to do is build your model using train.csv and evaluate it using test.csv.
  • Two data fields are provided, i.e., ItemID (ID of the tweet) and SentimentText (text of the tweet).

6. MNIST dataset (handwritten data):

The MNIST dataset is built from images of handwritten digits. It is one of the most popular deep-learning image classification datasets and can be used for classical machine learning as well. The dataset has 60,000 instances for training and 10,000 instances for model evaluation.

This dataset is beginner-friendly and helps in understanding pattern recognition techniques in deep learning on real-world data. The data does not take much time to preprocess. A beginner who is keen to learn deep learning or machine learning can start their first project with it.

Size: ~50 MB

Number of Records: 70,000 images in 10 classes (including train and test part)

Features
  • The MNIST dataset is one of the best datasets for understanding and learning ML techniques and pattern recognition methods in deep learning on real-world data.
  • The dataset contains four files: train-images-idx3-ubyte.gz, train-labels-idx1-ubyte.gz, t10k-images-idx3-ubyte.gz, and t10k-labels-idx1-ubyte.gz.
  • The MNIST dataset is divided into two parts: a training set and a test set.
  • So, using this dataset, you do not need to split your data into training and evaluation parts.
  • All you need to do is build your model using the training set and evaluate it using the test set.
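
A minimal loading sketch, assuming TensorFlow/Keras is installed (the files are downloaded automatically on first use):

    # Minimal sketch: loading MNIST through tf.keras.
    from tensorflow.keras.datasets import mnist

    (x_train, y_train), (x_test, y_test) = mnist.load_data()
    print(x_train.shape, x_test.shape)   # (60000, 28, 28) (10000, 28, 28)
    x_train = x_train / 255.0            # scale pixel values to [0, 1]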

7. Fashion MNIST dataset:

The Fashion-MNIST dataset is also one of the most widely used datasets and is built from images of clothing items. It can be used for deep-learning image classification problems and for classical machine learning as well. The dataset has 60,000 instances for training and 10,000 instances for model evaluation.

This dataset is beginner-friendly and helps in understanding pattern recognition techniques in deep learning on real-world data. The data does not take much time to preprocess, and a beginner who is keen to learn deep learning or machine learning can start their first project with it. Fashion-MNIST was created as a drop-in replacement for the MNIST dataset. All the images are grayscale, and there are 10 classes.

Size: 30 MB

Number of Records: 70,000 images in 10 classes

Features

  • The Fashion-MNIST dataset is one of the best datasets for understanding and learning ML techniques and pattern recognition methods in deep learning on real-world data.
  • The Fashion-MNIST dataset is divided into two parts: a training set and a test set.
  • So, using this dataset, you do not need to split your data into training and evaluation parts.
  • All you need to do is build your model using the training set and evaluate it using the test set.
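
Fashion-MNIST loads exactly like MNIST; the sketch below again assumes TensorFlow/Keras, and the class names are the standard ten labels:

    # Minimal sketch: loading Fashion-MNIST through tf.keras.
    from tensorflow.keras.datasets import fashion_mnist

    (x_train, y_train), (x_test, y_test) = fashion_mnist.load_data()
    class_names = ["T-shirt/top", "Trouser", "Pullover", "Dress", "Coat",
                   "Sandal", "Shirt", "Sneaker", "Bag", "Ankle boot"]
    print(x_train.shape, class_names[y_train[0]])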

8.  Amazon review dataset:

The Amazon review dataset is also used for natural language processing purposes. Sentiment analysis is one of the most popular applications of NLP, and this dataset will help you build a sentiment analysis model. It is a text-processing dataset, and with its help you can build an NLP model on real product reviews.

This dataset contains ratings, review text, helpfulness votes, product metadata (descriptions, category information, price, brand, and image features), links between products, and also-viewed/also-bought graphs. In total it contains 142.8 million reviews spanning May 1996 to July 2014. This dataset gives you the essence of a real business problem and helps you understand the trend of sales over the years.

Features
  • The Amazon review dataset consists of Amazon product reviews.
  • It includes both product and user information, ratings, and review text.
  • Official paper: J. McAuley and J. Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. RecSys, 2013.
  • The data also contains some duplicate entries.

9. Spam SMS classifier dataset:

Detecting spam messages is an important task these days, so data scientists came up with datasets on which you can train a model to predict whether a message is spam. This dataset will help you train such a model. A machine learning classification algorithm can be used to build it, and the dataset is beginner-friendly and easy to understand. The spam SMS classifier dataset is a set of SMS messages labeled for SMS spam analysis.

Features
  • The spam SMS classifier dataset has 5,574 messages.
  • The messages are written in English.
  • Each line of the dataset contains one message.
  • The dataset has two columns: one gives the classification (spam or not) and the other contains the raw text.
  • The spam SMS classifier dataset is in CSV format (comma-separated values).

 

10. Spam-Mails Dataset: 

Detecting spam mail is an important task these days, so data scientists came up with datasets on which you can train a model to predict whether a mail is spam. This dataset will help you train such a model. A machine learning classification algorithm can be used to build it, and the dataset is beginner-friendly and easy to understand.

The spam mail dataset is a set of tagged messages. It includes a collection of 425 SMS spam messages that were manually extracted from the Grumbletext Web site, a UK forum where cell phone users make public claims about SMS spam messages. Most of these users were receiving a huge number of spam messages every day, and identifying those spam messages was a hard, time-consuming task that involved carefully scanning hundreds of web pages. The Grumbletext Web site is http://www.grumbletext.co.uk/.

It also contains a subset of 3,375 randomly chosen ham SMS messages from the NUS SMS Corpus (NSC), a dataset of about 10,000 legitimate messages collected for research at the Department of Computer Science at the National University of Singapore. The messages largely originate from Singaporeans, mostly students attending the university, who volunteered their messages knowing the contributions would be made publicly available.

The NUS SMS Corpus is available at http://www.comp.nus.edu.sg/~rpnlpir/downloads/corpora/smsCorpus/. Finally, there is a list of 450 SMS ham messages collected from Caroline Tag's PhD thesis.

  • Most of the dataset, about 86%, is not spam.
  • You need to split this dataset yourself; it does not come with a train/test division.

11. Youtube Dataset: 

The YouTube video dataset is based on information about YouTube videos and helps in building a video classification model with machine learning algorithms. YouTube-8M is a video dataset that consists of millions of YouTube video IDs, with high-quality machine-generated annotations derived from a large vocabulary of visual entities and with audio-visual features extracted from billions of frames and audio segments.

This dataset helps in learning machine learning as well as computer vision. It has improved the quality of its annotations and machine-generated labels, and it covers 6.1 million video IDs labeled with a vocabulary of 3,862 visual entities. All the videos are annotated with one or more labels (an average of 3 labels per video).

Features

  • This dataset has a large-scale labeled dataset with high-quality machine-generated annotations.
  • In this dataset, videos are sampled uniformly.
  • Each video in the Youtube dataset is associated with at least one entity from the target vocabulary.
  • The vocabulary of the dataset is available in CSV format (Comma-separated value)

12. CIFAR-10:

CIFAR-10 is also an image classification dataset consisting of images of various objects. With its help, we can perform many operations in machine learning and deep learning. CIFAR stands for Canadian Institute For Advanced Research. This dataset is one of the most commonly used datasets for machine learning research. The CIFAR-10 dataset has 60,000 32×32 color images in 10 different classes. Those classes are

  1. airplanes
  2. cars
  3. birds
  4. cats
  5. deer
  6. dogs
  7. frogs
  8. horses
  9. ships
  10. and trucks

Each of these classes has 6,000 images. CIFAR-10 is used to train computers to recognize objects with deep learning algorithms. The resolution of the images is 32×32, which is considered low, so it allows learners to experiment with different algorithms in less time. The CIFAR-10 dataset is beginner-friendly as well, and it is famously used with convolutional neural networks, a deep learning algorithm.

Features:

  • The CIFAR-10 dataset is one of the best datasets for understanding and learning ML techniques and object detection methods in deep learning on real-world data.
  • The CIFAR-10 dataset is divided into two parts: a training set and a test set.
  • So, using this dataset, you do not need to split your data into training and evaluation parts.
  • All you need to do is build your model using the training data and evaluate it using the test data.
  • In total, CIFAR-10 has 50,000 training images and 10,000 test images.
  • The dataset is divided into 6 parts – 5 training batches and 1 test batch.
  • Each batch has 10,000 images.

Size: 170 MB

Number of Records: 60,000 images in 10 classes
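
A minimal loading sketch, assuming TensorFlow/Keras is installed:

    # Minimal sketch: loading CIFAR-10 through tf.keras.
    from tensorflow.keras.datasets import cifar10

    (x_train, y_train), (x_test, y_test) = cifar10.load_data()
    print(x_train.shape)   # (50000, 32, 32, 3): 32x32 color images
    print(y_train.shape)   # (50000, 1): integer labels 0-9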

13.  IMDB reviews: 

The IMDB dataset is the Large Movie Review Dataset. Sentiment analysis is one of the most popular applications of natural language processing (NLP), and the IMDB movie review dataset will help you build a sentiment analysis model. The Large Movie Review Dataset has 25,000 highly polar movie reviews (positive or negative) for training. The IMDB dataset is often used for sentiment analysis with machine learning or deep learning algorithms. It was prepared by Stanford researchers in 2011.

This data set comes with a 50/50 split for training and testing. The paper that introduced it reported 88.89% accuracy. The IMDB data was used for a Kaggle competition titled “Bag of Words Meets Bags of Popcorn” from 2014 to early 2015; in that competition, accuracy above 97% was achieved, with the winners reaching 99%. IMDB is popular with movie lovers as well, and the dataset is mostly used for binary sentiment classification. In addition to the labeled training and test reviews, the dataset contains further unlabeled data for use.

Size: 80 MB

Number of Records: 25,000 highly polar movie reviews for training, and 25,000 for testing

Features:

  • The IMDB dataset is one of the best datasets for understanding and learning ML techniques and deep learning methods on real-world data.
  • The IMDB dataset is divided into two parts: a training set and a test set.
  • So, using this dataset, you do not need to split your data into training and evaluation parts.
  • All you need to do is build your model using the training data and evaluate it using the test data.
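
A convenient copy of the reviews ships with TensorFlow/Keras, already tokenised into integer word indices and already split 25,000/25,000; this sketch assumes that copy rather than the raw text distribution:

    # Minimal sketch: loading the Keras copy of the IMDB reviews.
    from tensorflow.keras.datasets import imdb

    (x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)
    print(len(x_train), len(x_test))   # 25000 25000
    print(y_train[:5])                 # 1 = positive, 0 = negative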

14. Sentiment 140:

The Sentiment140 dataset is built on Twitter data. Sentiment analysis is one of the most popular applications of natural language processing (NLP), and the Sentiment140 dataset will help you build a sentiment analysis model. It is a text-processing dataset, and with its help you can start building your first NLP model. Sentiment140 is beginner-friendly for starting a new natural language processing project. The emoticons have been pre-removed from the data, and it has six fields altogether:

  • polarity of the tweet
  • id of the tweet
  • date of the tweet
  • the query
  • username of the user who tweeted
  • text of the tweet

Features:

  • It has 1,600,000 tweets, which were extracted using the Twitter API.
  • The tweets are annotated as 0 = negative, 2 = neutral, or 4 = positive.
  • These annotations are used to detect the sentiment of a particular tweet.

Fields in the dataset:

  • target: the polarity of the tweet (0 = negative, 2 = neutral, 4 = positive)
  • ids: The id of the tweet ( 2087)
  • date: the date of the tweet (Sat May 16 23:58:44 UTC 2009)
  • flag: The query (lynx). If there is no query, then this value is NO_QUERY.
  • user: the user that tweeted (robotickilldozr)
  • text: the text of the tweet (Lyx is cool)

Size: 80 MB (Compressed)

Number of Records: 1,600,000 tweets
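
A minimal reading sketch with pandas; the file name is the one commonly distributed for Sentiment140, but treat the path as an assumption. The CSV has no header row, so the six field names listed above are supplied explicitly:

    # Minimal sketch: reading the Sentiment140 CSV with pandas.
    import pandas as pd

    cols = ["target", "ids", "date", "flag", "user", "text"]
    df = pd.read_csv("training.1600000.processed.noemoticon.csv",
                     names=cols, encoding="latin-1")
    print(df["target"].value_counts())   # polarity: 0 = negative, 4 = positive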

15. Facial image Dataset:

The facial image dataset is based on face images of both male and female subjects. Machine learning and deep learning algorithms can be trained on facial image datasets to detect gender and emotion. It has variation in the data, such as variation of background, scale, and expression.

Information about the dataset:

  • Total number of individuals: 395
  • Number of images per individual: 20
  • Total number of images: 7900
  • Gender:  contains images of male and female subjects
  • Race:  contains images of people of various racial origins
  • Age Range:  the images are mainly of first-year undergraduate students, so the majority of individuals are between 18-20 years old but some older individuals are also present.

Features

  • The dataset has four directories.
  • You can download the version of the dataset that matches your system requirements and needs.
  • All versions of the data are available as zipped archives.
  • There are 395 individuals in total, and each of them has 20 images.
  • The resolution of the images is 180×200 pixels, stored in 24-bit RGB JPEG format.

16. RED Wine Quality Dataset:

The red wine quality dataset is also popular and interesting for machine learning and deep learning enthusiasts. It is beginner-friendly, and you can easily apply machine learning algorithms to it. With the help of this dataset, you can train a model to predict wine quality.

This dataset contains the wines’ physicochemical properties. Both regression and classification approaches can be applied to it. The data is related to red and white variants of the Portuguese “Vinho Verde” wine. Because of privacy and logistic issues, only physicochemical (input) and sensory (output) variables are available (e.g., there is no data about grape types, wine brands, wine selling prices, etc.). The classes are ordered and not balanced (e.g., there are many more normal wines than excellent or poor ones).

Information about input variables based on physicochemical tests:

1 – Fixed acidity

2 – Volatile acidity

3 – Citric acid

4 – Residual sugar

5 – Chlorides

6 – Free sulfur dioxide

7 – Total sulfur dioxide

8 – Density

9 – pH

10 – Sulphates

11 – Alcohol

Output variable (based on sensory data):

12 – Quality (score between 0 and 10)

Features
  • There are two types of variables in the dataset, i.e., input and output variables.
  • Input variables are fixed acidity, volatile acidity, citric acid, residual sugar, and so forth.
  • The output variable is quality.
  • 12 attributes are present, and the attribute characteristics are real-valued.
  • The red wine data has 1,599 records and the white wine data has 4,898 records.
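
A minimal reading sketch with pandas; the file path is an assumption (the UCI wine-quality CSVs are semicolon-separated, which is why sep=";" is passed):

    # Minimal sketch: loading the red wine quality data with pandas.
    import pandas as pd

    red = pd.read_csv("winequality-red.csv", sep=";")
    print(red.shape)                                     # about (1599, 12) for red wines
    X, y = red.drop(columns="quality"), red["quality"]   # input and output variables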

17. The Wikipedia corpus:

The Wikipedia corpus consists of Wikipedia data only. It is a collection of the full text of Wikipedia and contains almost 1.9 billion words from more than 4 million articles. This dataset is used for natural language processing purposes. It is a very powerful dataset, and you can search by word, phrase, or part of a paragraph.

Size: 20 MB

Number of Records: 4,400,000 articles containing 1.9 billion words

Features

  • This dataset is large-scale and can be used for machine learning and natural language processing purposes.
  • As the dataset is big, it helps to train models well.
  • It has 4,400,000 articles containing 1.9 billion words.

18. Free Spoken digit dataset:

The Free Spoken Digit Dataset is a simple audio/speech dataset consisting of recordings of spoken English digits. The files are in WAV format at 8 kHz, and all the recordings are trimmed to have near-minimal silence at the beginning and end. This dataset was created to solve the task of identifying spoken digits in audio. A key point about the dataset is that it is open, so anyone can contribute to the repository, and it is expected to grow over time.

 Characteristics of the Dataset:

  • 4 speakers
  • 2,000 recordings (50 of each digit per speaker)
  • English pronunciations

File naming format: {digitLabel}_{speakerName}_{index}.wav, e.g., 7_jackson_32.wav (see the parsing sketch after the feature list below).

Features:

  • Open source
  • Helps to solve the spoken digit recognition problem
  • Anyone is allowed to contribute
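
Because the label is encoded directly in each file name, recovering it is a one-liner; the sketch below parses the naming convention shown above (the example path is hypothetical):

    # Minimal sketch: parsing {digitLabel}_{speakerName}_{index}.wav file names.
    import os

    def label_from_filename(path):
        # e.g. "7_jackson_32.wav" -> digit 7, speaker "jackson", index 32
        digit, speaker, index = os.path.basename(path).rsplit(".", 1)[0].split("_")
        return int(digit)

    print(label_from_filename("recordings/7_jackson_32.wav"))   # 7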

19. Boston House price dataset: 

The Boston house price dataset was collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts. This dataset is used to predict house prices based on a few attributes, so machine learning regression problems can be solved with it. The data has 506 cases in total.

Features:

  • There are 506 cases in total in the dataset.
  • There are 14 attributes in each case, like CRIM, AGE, TAX, and so forth.
  • The format of the dataset is CSV (comma-separated values).
  • Machine learning regression problems can be applied to the dataset.
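
A convenient copy of this data ships with TensorFlow/Keras; the sketch below assumes that copy, which exposes 13 input attributes as features and the median house price as the target:

    # Minimal sketch: loading the Boston housing data through tf.keras.
    from tensorflow.keras.datasets import boston_housing

    (x_train, y_train), (x_test, y_test) = boston_housing.load_data()
    print(x_train.shape, x_test.shape)   # 404 + 102 = 506 cases in total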

20. Pima Indian Diabetes dataset:

Artificial intelligence is now widely used in the healthcare and medical industry as well. This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. Diabetes is one of the most common and dangerous diseases, and its prevalence is increasing rapidly.

Diabetes is a chronic condition in which the body develops a resistance to insulin, the hormone that converts food into glucose. Diabetes affects many people worldwide, and Type 1 and Type 2 diabetes have different characteristics. The Pima Indian Diabetes dataset is used to predict diabetes based on certain diagnostic measurements, and a machine learning model trained on it can help patients and society detect the disease quickly.

This is one of the best datasets for building a diabetes prediction model. In particular, all patients here are females at least 21 years old of Pima Indian heritage. There are nine columns in total in the dataset:

  1. Pregnancies
  2. Glucose
  3. Blood pressure
  4. Skin thickness
  5. Insulin
  6. BMI
  7. DiabetesPedigreeFunction
  8. Age
  9. Outcome

Features:

  • The format of the dataset is CSV (comma-separated values).
  • All of the patients in this dataset are female and at least 21 years old.
  • There are several variables in the dataset, such as the number of pregnancies, BMI, insulin level, and age, plus one target variable.
  • It has a total of 768 rows and 9 columns.
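
A minimal reading sketch with pandas; the file name is an assumption (for example, the CSV commonly distributed as diabetes.csv), and the column names follow the nine columns listed above:

    # Minimal sketch: reading the Pima diabetes CSV and separating the target.
    import pandas as pd

    df = pd.read_csv("diabetes.csv")
    X = df.drop(columns="Outcome")   # the 8 diagnostic measurements
    y = df["Outcome"]                # 1 = diabetic, 0 = not diabetic
    print(df.shape)                  # (768, 9)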

21. Iris Dataset:

This famous (Fisher’s or Anderson’s) iris data set gives the measurements in centimeters of the variables sepal length and width and petal length and width, respectively, for 50 flowers from each of the 3 species of iris. The species are Iris setosa, versicolor, and virginica.

Format of the dataset:

iris is a data frame with 150 cases (rows) and 5 variables (columns) named Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.

22. Diamonds Dataset:

This is a dataset containing the prices and other attributes of almost 54,000 diamonds. The variables are as follows:

Price: price in US dollars ($326–$18,823)

Carat: weight of the diamond (0.2–5.01)

Cut: quality of the cut (Fair, Good, Very Good, Premium, Ideal)

Color: diamond color, from D (best) to J (worst)

Clarity: a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))

X: length in mm (0–10.74)

Y: width in mm (0–58.9)

Z: depth in mm (0–31.8)

Depth: total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)

Table: width of the top of the diamond relative to widest point (43–95)
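
The diamonds data is bundled with ggplot2 in R and also with the seaborn Python library, so it can be pulled straight into a DataFrame; a minimal sketch:

    # Minimal sketch: loading the diamonds data via seaborn.
    import seaborn as sns

    diamonds = sns.load_dataset("diamonds")
    print(diamonds.shape)                            # roughly (53940, 10)
    print(diamonds.groupby("cut")["price"].mean())   # average price per cut quality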

23. mtcars Dataset: (Motor Trend Car Road Tests)

This data was extracted from the 1974 Motor Trend US magazine and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973–74 models).

This dataset comprises the following columns:

mpg Miles/(US) gallon

cyl Number of cylinders

Disp Displacement (cu.in.)

hp Gross horsepower

Drat rear axle ratio

wt Weight (1000 lbs)

qsec 1/4 mile time

vs Engine (0 = V-shaped, 1 = straight)

am Transmission (0 = automatic, 1 = manual)

gear Number of forward gears

carb Number of carburetors
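
mtcars is an R dataset, but it can also be fetched into Python through the Rdatasets collection; a minimal sketch using statsmodels:

    # Minimal sketch: pulling mtcars from the Rdatasets collection.
    import statsmodels.api as sm

    mtcars = sm.datasets.get_rdataset("mtcars", "datasets").data
    print(mtcars.shape)                              # (32, 11)
    print(mtcars[["mpg", "cyl", "hp", "wt"]].head())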

24. Boston Dataset: Housing Values in Suburbs of Boston

The Boston data frame has 506 rows and 14 columns.

Description of columns:

Crim: per capita crime rate by town.

Zn: proportion of residential land zoned for lots over 25,000 sq. ft.

Indus: proportion of non-retail business acres per town.

Chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).

Nox: nitrogen oxide concentration (parts per 10 million).

Rm: average number of rooms per dwelling.

Age: proportion of owner-occupied units built before 1940.

Dis: weighted mean of distances to five Boston employment centers.

Rad: index of accessibility to radial highways.

Tax: full-value property-tax rate per $10,000.

Ptratio: pupil-teacher ratio by town.

Black: 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town.

Lstat: lower status of the population (percent).

Medv: median value of owner-occupied homes in $1000s.

25. Titanic Dataset: Survival of passengers on the Titanic

This data set provides information on the fate of passengers on the fatal maiden voyage of the ocean liner ‘Titanic’, summarized according to economic status (class), sex, age, and survival.

Format:

A 4-dimensional array resulting from cross-tabulating 2201 observations on 4 variables. The variables and their levels are as follows:

Class: 1st, 2nd, 3rd, Crew

Sex: Male, Female

Age: Child, Adult

Survived: No, Yes

Details about the event:

The sinking of the Titanic is a famous event, and new books are still being published about it. Many well-known facts, from the proportions of first-class passengers to the ‘women and children first’ policy, and the fact that that policy was not entirely successful in saving the women and children in third class, are reflected in the survival rates for the various classes of passenger.

26. Pima Indian Diabetes Dataset:

A population of women who were at least 21 years old, of Pima Indian heritage, and living near Phoenix, Arizona, was tested for diabetes according to World Health Organization criteria. The data was collected by the US National Institute of Diabetes and Digestive and Kidney Diseases.

This data frame comprises the following columns:

Npreg: number of pregnancies.

Glu: plasma glucose concentration in an oral glucose tolerance test.

Bp: diastolic blood pressure (mm Hg).

Skin: triceps skin fold thickness (mm).

Bmi: body mass index (weight in kg/(height in m)^2).

Ped: diabetes pedigree function.

Age: age in years.

Type: Yes or No, for diabetic according to WHO criteria.

27. Beavers Dataset:

This data set is part of a long study into body temperature regulation in beavers. Four adult female beavers were live-trapped and had a temperature-sensitive radio transmitter surgically implanted. Readings were taken every 10 minutes. The location of the beaver was also recorded and her activity level was dichotomized by whether she was in the retreat or outside of it since high-intensity activities only occur outside of the retreat.

This data frame contains the following columns:

Day: The day number. The data includes only data from day 307 and early 308.

Time: The time of day is formatted as hour-minute.

Temp: The body temperature in degrees Celsius.

Activ: The dichotomized activity indicator. 1 indicates that the beaver is outside of the retreat and therefore engaged in high-intensity activity.

28. Cars93 Dataset: Data from 93 Cars on Sale in the USA in 1993

The Cars93 data frame has 93 rows and 27 columns. Below is the description of the columns:

Manufacturer: Manufacturer of the vehicle

Model: Model of the vehicle

Type: a factor with levels “Small”, “Sporty”, “Compact”, “Midsize”, “Large” and “Van”.

Min.Price: Minimum Price (in $1,000): price for a basic version.

Price: Midrange Price (in $1,000): average of Min.Price and Max.Price.

Max.Price: Maximum Price (in $1,000): price for “a premium version”.

MPG.city: City MPG (miles per US gallon by EPA rating).

MPG.highway: Highway MPG.

AirBags: Air Bags standard. Factor: none, driver only, or driver & passenger.

DriveTrain: Drive train type: rear wheel, front wheel, or 4WD; (factor).

Cylinders: Number of cylinders (missing for Mazda RX-7, which has a rotary engine).

EngineSize: Engine size (liters).

Horsepower: Horsepower (maximum).

RPM: RPM (revs per minute at maximum horsepower).

Rev.per.mile: Engine revolutions per mile (in highest gear).

Man.trans.avail: Is a manual transmission version available? (yes or no, Factor).

Fuel.tank.capacity: Fuel tank capacity (US gallons).

Passengers: Passenger capacity (persons)

Length: Length (inches).

Wheelbase: Wheelbase (inches).

Width: Width (inches).

Turn.circle: U-turn space (feet).

Rear.seat.room: Rear seat room (inches) (missing for 2-seater vehicles).

Luggage.room: Luggage capacity (cubic feet) (missing for vans).

Weight: Weight (pounds).

Origin: Of non-USA or USA company origins? (factor).

Make: Combination of Manufacturer and Model (character).

29. Car-seats Dataset:

This is a simulated data set containing sales of child car seats at 400 different stores. So, it is a data frame with 400 observations on the following 11 variables:

Sales: Unit sales (in thousands) at each location

CompPrice: Price charged by competitor at each location

Income: Community income level (in thousands of dollars)

Advertising: Local advertising budget for company at each location (in thousands of dollars)

Population: Population size in the region (in thousands)

Price: The price the company charges for car seats at each site

ShelveLoc: A factor with levels Bad, Good, and Medium indicating the quality of the shelving location for the car seats at each site

Age: Average age of the local population

Education: Education level at each location

Urban: A factor with levels No and Yes to indicate whether the store is in an urban or rural location

US: A factor with levels No and Yes to indicate whether the store is in the US or not

30. sleep Dataset:

This is an updated and expanded version of the mammal sleep data set. It is a data set with 83 rows and 11 variables.

Name: common name

Genus: a taxonomic rank

Vore: carnivore, omnivore, or herbivore?

Order: a taxonomic rank

Conservation: the conservation status of the animal

Sleep_total: total amount of sleep, in hours

Sleep_rem: rem sleep, in hours

Sleep_cycle: length of sleep cycle, in hours

Awake: amount of time spent awake, in hours

Brainwt: brain weight in kilograms

Bodywt: body weight in kilograms

31. Cushing’s Dataset: Diagnostic Tests on Patients with Cushing’s Syndrome

Cushing’s syndrome is a hypertensive disorder associated with over-secretion of cortisol by the adrenal gland. The observations are urinary excretion rates of two steroid metabolites.

The Cushings data frame has 27 rows and 3 columns. The description of the columns is below:

Tetrahydrocortisone: urinary excretion rate (mg/24hr) of Tetrahydrocortisone.

Pregnanetriol: urinary excretion rate (mg/24hr) of Pregnanetriol.

Type: underlying type of syndrome, coded a (adenoma), b (bilateral hyperplasia), c (carcinoma), or u (unknown).

32. ToothGrowth Dataset:

The response is the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs. Each animal received one of three dose levels of vitamin C (0.5, 1, and 2 mg/day) by one of two delivery methods, orange juice or ascorbic acid (a form of vitamin C and coded as VC).

This is a data frame with 60 observations on 3 variables.

A dataset is the base and first step of building a machine learning application. Datasets are available in different formats like .txt, .csv, and many more. For supervised machine learning, a labeled training dataset is used, as the label works as a supervisor for the model. For unsupervised learning algorithms, no labels are required; the unsupervised model learns by itself, not from labels.

Please read the full article to understand which data set is preferable for your machine learning algorithm.
