In this post we list free public datasets for Data Science projects. There is a large number of datasets covering different areas: machine learning, presentation, data analysis and visualization.
You can find information for:
Note
This post will be updated on a regular basis, so please suggest new ideas and datasets in the comment section below.
Below we can find a table of dataset collections. Most of them have advanced searching by:
# | site | description |
---|---|---|
1 | https://www.kaggle.com/datasets/ | https://www.kaggle.com/docs/datasets |
2 | https://datasetsearch.research.google.com/ | Google Dataset Search |
3 | https://azure.microsoft.com/en-us/services/open-datasets/ | Azure Open Datasets |
4 | https://www.openml.org/search?type=data | 4325 datasets found (verified) |
5 | https://grouplens.org/datasets/ | several datasets |
6 | https://datahub.io/search | thousands of datasets |
The datasets listed below are used on this site to demonstrate data science basics:
# | dataset | size (rows, columns) | link | description |
---|---|---|---|---|
1 | the-movies-dataset | 45466, 24 | https://www.kaggle.com/datasets/rounakbanik/the-movies-dataset | https://grouplens.org/datasets/movielens/latest/ |
2 | Food Recipes | 8009, 16 | https://www.kaggle.com/datasets/sarthak71/food-recipes | |
To read Kaggle datasets we can use the Python library kaggle. A single file from a Kaggle dataset can be downloaded in Python with the method dataset_download_file:
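import kaggle

# Authenticate with the credentials from ~/.kaggle/kaggle.json
# (or the KAGGLE_USERNAME / KAGGLE_KEY environment variables)
kaggle.api.authenticate()

# Download a single file of the dataset into the local folder data/
kaggle.api.dataset_download_file('dorianlazar/medium-articles-dataset', file_name='medium_data.csv', path='data/')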
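Depending on the version of the kaggle package and the size of the file, the download may be saved either as medium_data.csv or as a zipped medium_data.csv.zip. A minimal sketch of reading either form with Pandas follows; the helper name read_downloaded_csv is hypothetical and not part of the kaggle API:

```python
from pathlib import Path

import pandas as pd


def read_downloaded_csv(directory: str, file_name: str) -> pd.DataFrame:
    """Read file_name from directory, falling back to file_name + '.zip'."""
    plain = Path(directory) / file_name
    zipped = Path(directory) / (file_name + ".zip")

    if plain.exists():
        return pd.read_csv(plain)
    if zipped.exists():
        # pandas infers zip compression from the extension;
        # the archive must contain exactly one file
        return pd.read_csv(zipped)
    raise FileNotFoundError(f"{file_name} not found in {directory}")
```

After the download above, it could be called as read_downloaded_csv('data/', 'medium_data.csv').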
In this section we can find several useful datasets for different purposes like:
The Python library datasets (from Hugging Face) offers a large number of free, easy-to-use datasets. It can be installed with:
pip install datasets
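To list all available datasets we can use the function datasets.list_datasets() (recent releases of the library deprecate it in favor of huggingface_hub.list_datasets):
from datasets import list_datasets, load_dataset

print(list_datasets())
It returns the names of more than 7000 datasets.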
To load a dataset we can use the function datasets.load_dataset(dataset_name, **kwargs):
# Download (or reuse the cached copy of) the SQuAD dataset
squad_dataset = load_dataset('squad')
squad_dataset
This gives us two datasets (splits): train and validation.
To access the dataset for training we can use squad_dataset['train'].
Finally, we can load it as a Pandas DataFrame by:
import pandas as pd

pd.DataFrame(squad_dataset['train'])
Pandas offers multiple ways to download datasets with a single line of code. Let's cover a few of them, starting with the test data in the Pandas GitHub repository:
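A minimal sketch of reading one of those test files, assuming the CSV fixtures still live under pandas/tests/io/data/csv/ on the main branch of the pandas-dev/pandas repository. The base URL and both helper names are illustrative assumptions, not part of the Pandas API:

```python
import pandas as pd

# Assumed location of the CSV test fixtures; the repository layout may change
PANDAS_TEST_DATA = (
    "https://raw.githubusercontent.com/pandas-dev/pandas/"
    "main/pandas/tests/io/data/csv/"
)


def pandas_test_csv_url(file_name: str) -> str:
    """Build the raw GitHub URL for one of the pandas test CSV files."""
    return PANDAS_TEST_DATA + file_name


def load_pandas_test_csv(file_name: str) -> pd.DataFrame:
    """Download a pandas test CSV into a DataFrame (needs network access)."""
    return pd.read_csv(pandas_test_csv_url(file_name))
```

With network access, load_pandas_test_csv('iris.csv') should return the file as a DataFrame, provided the file is still present at that path.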
Next, we can load data with Pandas by scraping a table from Wikipedia:
pd.read_html('https://en.wikipedia.org/wiki/Population_growth')[2]
We can create random or fake datasets with Pandas by:
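One possible sketch, assuming NumPy's random generator is an acceptable source of fake values. The column names and the helper make_fake_dataset are hypothetical and chosen only for illustration:

```python
import numpy as np
import pandas as pd


def make_fake_dataset(n_rows: int = 100, seed: int = 0) -> pd.DataFrame:
    """Build a small DataFrame of random values with mixed column types."""
    rng = np.random.default_rng(seed)  # seeded, so the data is reproducible
    return pd.DataFrame({
        "id": np.arange(1, n_rows + 1),
        "date": pd.date_range("2023-01-01", periods=n_rows, freq="D"),
        "category": rng.choice(["a", "b", "c"], size=n_rows),
        "value": rng.normal(loc=50, scale=10, size=n_rows).round(2),
        "flag": rng.random(n_rows) > 0.5,
    })


df = make_fake_dataset(5)
print(df)
```

Passing the same seed yields the same rows, which keeps examples and tests repeatable.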
Seaborn offers free datasets which are good for visualization. With a single line of code we can get a DataFrame suitable for data wrangling and visualization:
import seaborn as sns

df = sns.load_dataset('flights')
All datasets available from the seaborn library are listed in: seaborn-data.
We can get sample datasets from scikit-learn with functions like load_iris:
from sklearn.datasets import load_iris

iris = load_iris()
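If a DataFrame is more convenient than the default Bunch object, load_iris accepts as_frame=True (available in scikit-learn 0.23 and later). A small sketch; the species column is added here only for illustration:

```python
from sklearn.datasets import load_iris

# as_frame=True returns pandas objects instead of NumPy arrays
iris = load_iris(as_frame=True)

# .frame holds the four feature columns plus the numeric 'target' column
df = iris.frame.copy()

# Map the numeric target (0, 1, 2) to the species names for readability
df["species"] = df["target"].map(dict(enumerate(iris.target_names)))

print(df.head())
```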
To find more sample datasets in scikit-learn we can use the following code:
from sklearn import datasets

dir(datasets)
This will list all available options like:
'load_sample_images', 'load_svmlight_file', 'load_svmlight_files', 'load_wine', 'make_biclusters', 'make_blobs', 'make_checkerboard', 'make_circles',
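The names in that listing follow a prefix convention: load_ for small bundled datasets, fetch_ for datasets downloaded on demand, and make_ for synthetic data generators. A short sketch that filters by prefix and generates a synthetic dataset; the helper list_dataset_helpers is hypothetical:

```python
from sklearn import datasets


def list_dataset_helpers(prefix: str) -> list[str]:
    """Return the names in sklearn.datasets that start with the given prefix."""
    return sorted(name for name in dir(datasets) if name.startswith(prefix))


print(list_dataset_helpers("load_"))
print(list_dataset_helpers("make_"))

# Generate 200 two-dimensional points grouped around 3 cluster centers
X, y = datasets.make_blobs(n_samples=200, centers=3, n_features=2, random_state=42)
print(X.shape, y.shape)
```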
You can find more about scikit-learn datasets at this link: sklearn.datasets: Datasets.
To load a dataset from the DataPrep library we can use the function load_dataset:
from dataprep.datasets import load_dataset

df = load_dataset("titanic")
To list the available datasets we can use:
from dataprep.datasets import get_dataset_names

get_dataset_names()
which returns several dataset names like:
['waste_hauler', 'wine-quality-red', 'countries', 'house_prices_train', 'iris', 'adult', 'covid19', 'titanic', 'patient_info', 'house_prices_test']
More information about dataprep datasets: Datasets DataPrep
In this article, we covered free dataset sources and discussed common ways to download datasets from them. Through practical examples, we learned how to download and use those datasets in Python and Pandas.
We covered different Python libraries which offer public datasets for learning. Finally, we covered how to create test datasets with fake data.
Those datasets and ideas should be sufficient for practicing and learning data science.