
This post lists free public datasets for data science projects. A large number of datasets are available, covering different areas: machine learning, presentation, data analysis, and visualization.

You can find information for:

Note
This post will be updated on a regular basis, so please suggest new ideas and datasets in the comment section below.

1. Dataset Sources

The table below lists dataset collections. Most of them offer advanced search by:

| # | site | description |
|---|------|-------------|
| 1 | https://www.kaggle.com/datasets/ | https://www.kaggle.com/docs/datasets |
| 2 | https://datasetsearch.research.google.com/ | Google Dataset Search |
| 3 | https://azure.microsoft.com/en-us/services/open-datasets/ | Azure Open Datasets |
| 4 | https://www.openml.org/search?type=data | 4325 datasets found (verified) |
| 5 | https://grouplens.org/datasets/ | several datasets |
| 6 | https://datahub.io/search | thousands of datasets |

2. Datasets samples

Several datasets used on this site to demonstrate data science basics are listed below:

| # | dataset | size (rows, columns) | link | description |
|---|---------|----------------------|------|-------------|
| 1 | the-movies-dataset | 45466, 24 | https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset | https://grouplens.org/datasets/movielens/latest/ |
| 2 | Food Recipes | 8009, 16 | https://www.kaggle.com/datasets/sarthak71/food-recipes | |

3. Datasets resources

4. Read Kaggle Datasets

To read Kaggle datasets we can use the Python library kaggle. A single file of a dataset can be downloaded from Python code with the method dataset_download_file:

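    import kaggle

    # Authenticate with the credentials from the Kaggle API token
    kaggle.api.authenticate()

    # Download a single file from the dataset into the data/ folder
    kaggle.api.dataset_download_file(
        'dorianlazar/medium-articles-dataset',
        file_name='medium_data.csv',
        path='data/',
    )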
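Repeated runs do not need to download the file again. The sketch below wraps the call above in a small helper that skips the download when the file is already on disk. The helper name `fetch_kaggle_file` is hypothetical and not part of the kaggle library. It assumes that API credentials are already configured (typically a `kaggle.json` token file) and that the file is saved under its original name, which may not hold if the API delivers it as an archive.

```python
import os


def fetch_kaggle_file(dataset, file_name, path="data/"):
    """Download one file of a Kaggle dataset unless it is already on disk.

    Hypothetical helper for illustration, not part of the kaggle package.
    """
    target = os.path.join(path, file_name)
    if os.path.exists(target):
        # The file is already present: skip the network call entirely.
        return target

    # Imported lazily because importing kaggle looks for API credentials.
    import kaggle

    kaggle.api.authenticate()
    kaggle.api.dataset_download_file(dataset, file_name=file_name, path=path)
    return target
```

It could be called as `fetch_kaggle_file('dorianlazar/medium-articles-dataset', 'medium_data.csv')`, and the returned path can then be passed to a CSV reader.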

5. Load Datasets by Python libraries

In this section we can find several useful datasets for different purposes like:

5.1 datasets - machine learning

The Python library datasets (from Hugging Face) offers a huge number of free and easy-to-use datasets. It can be installed by:

pip install datasets 

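To list all available datasets we can use the method datasets.list_datasets(). Note that recent releases of the library have deprecated this function in favor of huggingface_hub.list_datasets, so the call may fail on newer versions: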

    from datasets import list_datasets, load_dataset

    # Print the names of all available datasets
    print(list_datasets())

It returns the names of more than 7000 datasets.
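A list of several thousand names is hard to scan by eye. As a minimal sketch, it can be narrowed down by keyword with plain Python. `filter_names` is a hypothetical helper, and the short list below is only a stand-in for the real output of list_datasets().

```python
def filter_names(names, keyword):
    """Return the dataset names that contain keyword (case-insensitive)."""
    keyword = keyword.lower()
    return [name for name in names if keyword in name.lower()]


# Stand-in for the list of names returned by list_datasets().
sample_names = ["squad", "squad_v2", "imdb", "glue"]
print(filter_names(sample_names, "squad"))
```

With the real list, the same call would be `filter_names(list_datasets(), "squad")`.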

To load a dataset we can use the method datasets.load_dataset(dataset_name, **kwargs):

    # Download the SQuAD dataset (or load it from the local cache)
    squad_dataset = load_dataset('squad')
    squad_dataset

This gives us two datasets (splits): train and validation.

To access the dataset for training we can use: squad_dataset['train'] .
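Before working with a split it helps to know how many rows each one holds. The sketch below assumes that the object returned by load_dataset behaves like a dictionary whose values support `len()`. `split_sizes` is a hypothetical helper, and plain lists stand in for the real splits so that the example runs without downloading anything.

```python
def split_sizes(dataset_dict):
    """Map each split name to its number of rows."""
    return {name: len(split) for name, split in dataset_dict.items()}


# Plain lists stand in for the real splits (assumption for illustration).
toy_dataset = {"train": ["row1", "row2", "row3"], "validation": ["row1"]}
print(split_sizes(toy_dataset))
```

With the real data the call would be `split_sizes(squad_dataset)`.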

Finally, we can load it as a Pandas DataFrame by:

    import pandas as pd

    # Build a DataFrame from the training split
    pd.DataFrame(squad_dataset['train'])
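The conversion presumably works because iterating over a split yields one dictionary per row, which is a shape that pd.DataFrame accepts directly. The sketch below shows the same conversion on two hand-written rows. The field names and values are made up for illustration and are not taken from the SQuAD data.

```python
import pandas as pd

# Hand-written stand-in rows (made up); a real split yields similar dicts.
rows = [
    {"id": "q1", "question": "What is a dataset?", "title": "Example"},
    {"id": "q2", "question": "What is a split?", "title": "Example"},
]

# Each dictionary becomes one row; the keys become the column names.
df = pd.DataFrame(rows)
print(df.shape)
print(list(df.columns))
```

The datasets library also provides a `to_pandas()` method on a split, which may be the more convenient choice for large datasets.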

5.2 pandas - test datasets

Pandas offers multiple ways to download datasets with a single line of code. Let's cover a few of them, starting with the test data in the Pandas GitHub repository: