Find the Best Dataset for Machine Learning
When looking for a dataset, it helps to first consider who might collect the information you are interested in. For instance, might it be gathered by a government agency, a nonprofit or nongovernmental organization, or other researchers? You can then check the websites of likely data collectors.
You can also look for publications, articles, or government reports that cite the data. Can you tell where the authors obtained the dataset?
You can find datasets available online by searching Google or open data repositories. The library may also have access to the data through one of our subscriptions. Finally, you can try emailing a request for the data to the authors or researchers.
Steps to Constructing Your Dataset
To construct your dataset (and before doing data transformation), you should:
- Collect the raw data.
- Identify features and label sources.
- Select a sampling strategy.
- Split the data.
These steps depend a lot on how you’ve framed your ML problem.
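As a rough illustration of the sampling and splitting steps, here is a minimal sketch using only the Python standard library. The function name `sample_and_split`, the 50% sample fraction, the 20% test fraction, and the toy `raw` data are all assumptions made for this example. Simple random sampling is only one possible strategy, and the right choice depends on how your problem is framed.

```python
import random


def sample_and_split(rows, sample_fraction=0.5, test_fraction=0.2, seed=0):
    """Draw a simple random sample, then split it into train and test sets."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    sample_size = max(1, round(len(rows) * sample_fraction))
    # Sampling without replacement; the result is already in random order.
    sampled = rng.sample(rows, sample_size)
    test_size = round(len(sampled) * test_fraction)
    test_rows = sampled[:test_size]
    train_rows = sampled[test_size:]
    return train_rows, test_rows


# Hypothetical raw data: (feature, label) pairs.
raw = [(x, x % 2) for x in range(100)]

train, test = sample_and_split(raw)
print(len(train), len(test))
```

Keeping the test rows separate from the training rows from the start makes it harder to accidentally evaluate a model on data it has already seen.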
Major Academic Repositories
Additional Open Data Repositories
Additional datasets recommended by SPU faculty
Should I Trust This Data Source?
First, consider the overall reputation of the source for your data. At the end of the day, datasets are created by humans, and those humans may have specific agendas or biases that can carry over into your work.
All the data sources we’ve listed here are reputable, but many other data sources are less so. The one caveat to our list is that community-contributed collections, like data.world or GitHub, may vary in quality. When in doubt about your data source’s reputation, compare it to similar sources that cover the same topic.
Is the Data Skewed?
Understanding how your dataset skews will help you choose the right data to analyze. It’s good to use visualizations to see how your dataset skews because it’s not always obvious by looking at the numbers alone.
For numeric columns, use a histogram to see which type of distribution each column has (normal, left-skewed, right-skewed, uniform, bimodal, etc.).
This visual understanding will help you spot outliers and be aware of general trends as you perform your data analysis. Strict recommendations on what to do next depend on your dataset, but, overall, how it skews will provide a general idea of the data’s quality and hint at which columns to use in an analysis. You can then use this general idea to avoid misrepresenting your data.
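If you prefer a quick check in code before opening a spreadsheet, the histogram idea can be sketched with the Python standard library alone. The helper names `skewness` and `text_histogram`, the choice of five equal-width bins, and the `incomes` values are hypothetical and exist only for this example. The skewness number is a rough companion to the picture, not a replacement for it.

```python
import statistics


def skewness(values):
    """Population skewness: near 0 is symmetric, above 0 right-skewed, below 0 left-skewed."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return 0.0  # all values identical, so there is no skew to measure
    return sum(((v - mean) / std) ** 3 for v in values) / len(values)


def text_histogram(values, bins=5):
    """Print a rough text histogram with equal-width bins and return the bin counts."""
    low, high = min(values), max(values)
    width = (high - low) / bins or 1  # fall back to 1 when all values are equal
    counts = [0] * bins
    for v in values:
        # The maximum value would land one past the last bin, so clamp it.
        index = min(int((v - low) / width), bins - 1)
        counts[index] += 1
    for i, count in enumerate(counts):
        print(f"{low + i * width:8.1f} | {'#' * count}")
    return counts


# Hypothetical numeric column with one large value on the right.
incomes = [20, 22, 25, 27, 30, 31, 33, 35, 40, 120]

counts = text_histogram(incomes)
print("skewness:", round(skewness(incomes), 2))
```

Here most values fall in the first bin and a single value sits far to the right, which is the pattern to look for in a right-skewed column.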
For non-numeric columns, use a frequency table to check how many times a value appears. In particular, you might check if there is mostly one value present. If that’s the case, your analysis might be limited because the variety of values is small. Again, this is just to provide a general idea of data quality and indicate which relevant columns to use.
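A frequency table like the one described above can be sketched in a few lines of Python. The `frequency_table` helper, the `regions` column, and the 75% cutoff for "mostly one value" are assumptions chosen for illustration; pick a threshold that suits your own data.

```python
from collections import Counter


def frequency_table(values):
    """Return (value, count, share) rows sorted from most to least frequent."""
    counts = Counter(values)
    total = len(values)
    return [(value, count, count / total) for value, count in counts.most_common()]


# Hypothetical non-numeric column.
regions = ["north", "north", "north", "north", "south",
           "north", "north", "east", "north", "north"]

table = frequency_table(regions)
for value, count, share in table:
    print(f"{value:<8}{count:>4}{share:>8.0%}")

# Flag columns dominated by a single value (the cutoff is arbitrary).
top_value, top_count, top_share = table[0]
dominated = top_share > 0.75
if dominated:
    print(f"'{top_value}' makes up most of this column; analysis may be limited.")
```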
You can create these visuals and frequency tables within Excel or Google Sheets using CSVs, but you might want to look at a business intelligence (BI) tool for complex datasets.
Start Using Your Free Datasets
Once you have your data and are confident of its quality, it’s time to put it to work. You can go a long way with the likes of Excel, Google Sheets, and Google Data Studio, but if you want good practice for your data career, you should cut your teeth with the real deal: a BI platform like Chartio.
A BI platform will provide powerful data visualization capabilities for any dataset, from small CSVs to large datasets hosted in data warehouses, like Google BigQuery or Amazon Redshift. You can play with your data to create dashboards and even collaborate with others.