Find the Best Dataset in Machine Learning
While looking for a dataset it is useful to initially ponder who could gather the information you are keen on. For instance, might it be accumulated by an administration office; a not-for-profit or nongovernmental association or organization; or different specialists? You can then really take a look at the sites of likely information gatherers.
Likewise, you can search for distributions, articles, or government reports, that refer to the information, could you at any point tell where they procured the dataset?
You can find datasets accessible on the web by looking through Google or Open Information Vaults. Likewise, the library might approach the information as a feature of one of our memberships. At long last, you can have a go at messaging a solicitation for the information from the creators or specialists.
Steps to Constructing Your Dataset
To construct your dataset (and before doing data transformation), you should:
- Collect the raw data.
- Identify features and label sources.
- Select a sampling strategy.
- Split the data.
These steps depend a lot on how you’ve framed your ML problem. Use the self-check below to refresh your memory about problem framing and to check your assumptions about data collection.
-
Statista
Find statistics and consumer survey results from over 18,000 sources on over 60,000 topics. Download data sets in a choice of formats.
-
Kaggle
Explore, analyze, and share quality data. Learn more about data types, creating, and collaborating.
Major Academic Repositories
-
Dryad Digital Repository
Dryad is a nonprofit membership organization that is committed to making data available for research and educational reuse now and into the future. -
Harvard Dataverse
A free data repository open to all researchers from any discipline, both inside and outside of the Harvard community, where you can share, archive, cite, access, and explore research data. (Includes the data from other Dataverse repositories, e.g. UCSD) -
Zenodo
To ensure no one is left behind through lack of access to the necessary tools and resources, Zenodo makes the sharing, curation, and publication of data and software a reality for all researchers.
Additional Open Data Repositories
-
Figshare
Figshare is a repository where users can make all of their research outputs available in a citable, shareable, and discoverable manner. -
Google Dataset Search
Similar to how Google Scholar works, Dataset Search lets you find datasets wherever they’re hosted, whether it’s a publisher’s site, a digital library, or an author’s web page. -
ICPSR (endorsed by APA)
ICPSR maintains a data archive of more than 250,000 files of research in the social and behavioral sciences. -
openAFRICA
Searchable datasets maintained by Code for Africa, with support from the World Bank and AWS. “All the data on openAFRICA is intended to be actionable. The data is meant to help you change the world.” -
Registry of Research Data Repositories
Offers a way to search over 2,000 worldwide data repositories.
Local datasets
-
City of Seattle
The Open Data Program makes the data generated by the City of Seattle openly available to the public to increase the quality of life for our residents; increasing transparency, accountability, and comparability; promoting economic development and research; and improving internal performance management. -
King County
Welcome to King County’s Open Data portal, where you can discover, analyze, and connect with publicly available datasets published by King County agencies.
Additional datasets recommended by SPU faculty
-
Curate Science
A database of studies published in the sciences (and particularly social sciences/psychology); many with open data. Search by topic (e.g., personality, romantic relationships, etc.) and find papers with accompanying datasets. -
Iris Data Set
The UCI Machine Learning Repository is a collection of databases, domain theories, and data generators that are used by the machine learning community for the empirical analysis of machine learning algorithms.
The Iris Data Set contains three classes of 50 instances each, where each class refers to a type of iris plant. -
Kaggle
Inside Kaggle you’ll find all the code & data you need to do your data science work. Use over 50,000 public datasets and 400,000 public notebooks to conquer any analysis in no time. -
Kaggle – Titanic Survivor Data
This classic dataset includes training and testing data (to practice building models and seeing if they cross-validate) trying to predict whether people survived the Titanic disaster in 1912. -
Kaggle – World Happiness Data
This is a classic dataset for analysis. Over several years, it looks at national happiness and other data by country. It contains categorical and continuous data and has data nested within countries over time. -
Linear Regression Data Sets for Machine Learning
Here there are open linear regression datasets for free downloading. Some of the datasets include sample regression tasks to complete with the data. -
Project Implicit – Open Data for Implicit Racial Bias and More
Project Implicit was founded to foster the dissemination and application of implicit social cognition. The site contains many gigabytes of datasets looking at IAT data in psychology (fast, automatically learned associations with race, gender, etc.) and is an important repository of datasets for research in psychology.
Should I Trust This Data Source?
First, consider the overall reputation of the source for your data. At the end of the day, datasets are created by humans, and those humans may have specific agendas or biases that can translate through your work.
All the data sources we’ve listed here are reputable, but there are more sources for data that aren’t as reputable. The one caveat to our list here is that community-contributed collections, like data. world or GitHub, may vary in their quality. When in doubt about your data source’s reputation, compare it to similar sources that cover the same topic.
Is the Data Skewed?
Understanding how your dataset skews will help you choose the right data to analyze. It’s good to use visualizations to see how your dataset skews because it’s not always obvious by looking at the numbers alone.
For numeric columns, use a histogram to see which type of distribution each column has (normal, left, right, uniform, bimodal, etc.). Here’s a quick visualization of a few distributions you might see:
This visual understanding will help you avoid outliers and be aware of general trends as you perform your data analysis. Strict recommendations on what to do next depend on your dataset, but, overall, how it skews will provide a general idea of the data’s quality and hint at which columns to use in an analysis. You can then use this general idea to avoid misrepresenting your data.
For non-numeric columns, use a frequency table to check how many times a value appears. In particular, you might check if there is mostly one value present. If that’s the case, your analysis might be limited because the variety of values is small. Again, this is just to provide a general idea of data quality and indicate which relevant columns to use.
You can create these visuals and frequency tables within Excel or Google Sheets using CSVs, but you might want to look at a business intelligence (BI) tool for complex datasets.
Start Using Your Free Datasets
Once you have your data and are confident of its quality, it’s time to put it to work. You can go a long way with the likes of Excel, Google Sheets, and Google Data Studio, but if you want good practice for your data career, you should cut your teeth with the real deal: a BI platform like Chartio.
A BI platform will provide powerful data visualization capabilities for any dataset, from small CSVs to large datasets hosted in data warehouses, like Google BigQuery or Amazon Redshift. You can play with your data to create dashboards and even collaborate with others.