
The key to getting good at applied machine learning is practicing on lots of different datasets. This is because each problem is different, requiring subtly different data preparation and modeling methods. In this post, you will discover 10 top standard machine learning datasets that you can use for practice.
Standard Datasets
Below is a list of the 10 datasets we’ll cover.
Each dataset is small enough to fit into memory and review in a spreadsheet. All datasets are tabular and have no explicitly missing values.
- Swedish Auto Insurance Dataset.
- Wine Quality Dataset.
- Pima Indians Diabetes Dataset.
- Sonar Dataset.
- Banknote Dataset.
- Iris Flowers Dataset.
- Abalone Dataset.
- Ionosphere Dataset.
- Wheat Seeds Dataset.
- Boston House Price Dataset.
1. Swedish Auto Insurance Dataset
The Swedish Auto Insurance Dataset involves predicting the total payment for all claims in thousands of Swedish Kronor, given the total number of claims.

It is a regression problem. It is comprised of 63 observations with 1 input variable and one output variable. The variable names are as follows:
- Number of claims.
- Total payment for all claims in thousands of Swedish Kronor.
The baseline performance of predicting the mean value is an RMSE of approximately 81 thousand Kronor.
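As a rough illustration of how such a baseline can be computed, the sketch below predicts the mean of the output variable for every row and reports the RMSE. The function name `baseline_rmse` and the sample payments are made up for this example and are not rows from the dataset. The score here is computed on the same values used to find the mean, so a result from a train/test split or cross-validation will differ somewhat.

```python
from math import sqrt


def baseline_rmse(targets):
    """Return the RMSE of predicting the mean of `targets` for every row."""
    mean = sum(targets) / len(targets)
    squared_errors = [(t - mean) ** 2 for t in targets]
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical payments in thousands of Kronor (illustrative values only).
payments = [10.0, 20.0, 30.0, 40.0]
print(baseline_rmse(payments))
```

Any model worth keeping on a regression problem should achieve a lower RMSE than this number.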
2. Wine Quality Dataset
The Wine Quality Dataset involves predicting the quality of white wines on a scale given chemical measures of each wine.
It is a multi-class classification problem, but could also be framed as a regression problem. The number of observations for each class is not balanced. There are 4,898 observations with 11 input variables and one output variable. The variable names are as follows:
- Fixed acidity.
- Volatile acidity.
- Citric acid.
- Residual sugar.
- Chlorides.
- Free sulfur dioxide.
- Total sulfur dioxide.
- Density.
- pH.
- Sulphates.
- Alcohol.
- Quality (score between 0 and 10).
The baseline performance of predicting the mean value is an RMSE of approximately 0.89 quality points (approximately 0.148 when the quality score is normalized to the range 0 to 1).
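To see what "not balanced" means in practice, it can help to tabulate the share of observations in each class before choosing a framing. The following is a minimal sketch; the helper `class_distribution` and the sample quality scores are hypothetical and do not come from the dataset.

```python
from collections import Counter


def class_distribution(labels):
    """Return the fraction of observations per class, ordered by class label."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in sorted(counts.items())}


# Hypothetical quality scores (illustrative values only).
quality = [5, 6, 6, 7, 6, 5, 6, 8]
print(class_distribution(quality))
```

A heavily skewed distribution is one reason to try the regression framing, or to report a metric other than plain accuracy.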
3. Pima Indians Diabetes Dataset
The Pima Indians Diabetes Dataset involves predicting the onset of diabetes within 5 years in Pima Indians given medical details.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 768 observations with 8 input variables and 1 output variable. Missing values are believed to be encoded with zero values. The variable names are as follows:
- Number of times pregnant.
- Plasma glucose concentration at 2 hours in an oral glucose tolerance test.
- Diastolic blood pressure (mm Hg).
- Triceps skinfold thickness (mm).
- 2-Hour serum insulin (mu U/ml).
- Body mass index (weight in kg/(height in m)^2).
- Diabetes pedigree function.
- Age (years).
- Class variable (0 or 1).
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 65%. Top results achieve a classification accuracy of approximately 77%.
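Because missing values are believed to be encoded as zeros, one possible preparation step is to mark those zeros as missing before modeling. The sketch below is one way to do that in plain Python. The choice of columns (glucose, blood pressure, skinfold thickness, insulin and body mass index, where a zero is not physically plausible) is an assumption, as are the names `SUSPECT_COLUMNS`, `mark_missing` and `count_missing`. The rows are made up and only follow the dataset's column order.

```python
# Column indexes where a zero is assumed to mean "missing" (an assumption,
# not something the dataset documents): glucose, blood pressure,
# skinfold thickness, insulin, body mass index.
SUSPECT_COLUMNS = (1, 2, 3, 4, 5)


def mark_missing(rows, columns=SUSPECT_COLUMNS):
    """Return a copy of `rows` with zeros in `columns` replaced by None."""
    return [
        [None if i in columns and value == 0 else value
         for i, value in enumerate(row)]
        for row in rows
    ]


def count_missing(rows):
    """Count the None values in each column."""
    return [sum(1 for row in rows if row[i] is None)
            for i in range(len(rows[0]))]


# Hypothetical rows in the dataset's column order (illustrative values only).
rows = [
    [2, 120, 70, 30, 0, 28.5, 0.5, 40, 1],
    [0, 0, 65, 0, 90, 31.0, 0.3, 25, 0],
    [4, 110, 0, 25, 0, 0.0, 0.7, 52, 1],
]
print(count_missing(mark_missing(rows)))
```

Zeros in the first column (number of times pregnant) and in the class column are left untouched, since zero is a valid value there.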
4. Sonar Dataset
The Sonar Dataset involves predicting whether an object is a mine or a rock given the strength of sonar returns at different angles.
It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 208 observations with 60 input variables and 1 output variable. The variable names are as follows:
- Sonar returns at different angles
- Class (M for mine and R for rock)
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 53%. Top results achieve a classification accuracy of approximately 88%.
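The majority-class baseline quoted for the classification datasets can be sketched as follows: find the most frequent class label and measure how often predicting it would be correct. The function name `majority_class_accuracy` and the sample labels are hypothetical, and the accuracy is measured on the same labels used to pick the class rather than on a held-out split.

```python
from collections import Counter


def majority_class_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Hypothetical class labels (illustrative values only).
labels = ["M", "M", "M", "R", "R"]
print(majority_class_accuracy(labels))
```

A classifier that cannot beat this accuracy has not learned anything useful from the input variables.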
5. Banknote Dataset
The Banknote Dataset involves predicting whether a given banknote is authentic given a number of measures taken from a photograph.

It is a binary (2-class) classification problem. The number of observations for each class is not balanced. There are 1,372 observations with 4 input variables and 1 output variable. The variable names are as follows:
- Variance of Wavelet Transformed image (continuous).
- Skewness of Wavelet Transformed image (continuous).
- Kurtosis of Wavelet Transformed image (continuous).
- Entropy of image (continuous).
- Class (0 for authentic, 1 for inauthentic).
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 56%.
6. Iris Flowers Dataset
The Iris Flowers Dataset involves predicting the flower species given measurements of iris flowers.
It is a multi-class classification problem. The number of observations for each class is balanced. There are 150 observations with 4 input variables and 1 output variable. The variable names are as follows:
- Sepal length in cm.
- Sepal width in cm.
- Petal length in cm.
- Petal width in cm.
- Class (Iris Setosa, Iris Versicolour, Iris Virginica).
The baseline performance of predicting the most prevalent class is a classification accuracy of approximately 33%, since each of the three classes accounts for one third of the observations.
