The best way to learn stats and data science is to actually do it. And to do it, you need datasets. One good place to find them is Kaggle. Here’s how I find, download, and explore Kaggle datasets.
Searching for datasets
You can search directly or by topic
Kaggle is a site that hosts data science competitions, but it’s the datasets that I’m really interested in. Users have uploaded a wide variety of datasets. They come from a variety of sources, such as public records and APIs; some are even simulated.
My go-to environment for analysis is what I call the Python data stack: IPython, Jupyter, NumPy, SciPy, pandas, Seaborn, and statsmodels.
The Kaggle website has a decent search function. I’ll either search for a dataset on a topic I’m interested in or browse by topic. Since I write about tech, the “computer science” topic usually surfaces a lot of interesting datasets.
From this list, for example, I think I’ll download one on screen time and mental health.
Every dataset on Kaggle has a page describing it, known as a “data card.” This spells out the columns of the dataset, as well as how it was generated.
Downloading the dataset
You can download and unzip directly, or use a CLI client
Now I’ll want to download the dataset. It’s a CSV file compressed into a ZIP archive. I could just download it from the website and unpack it, but Kaggle also provides a command-line client, which I’ve already installed via pip. You’ll need to set up an API key to use it; you can generate one in the website settings after making an account. Kaggle has documentation on how to use your API key with the client.
I’ll navigate to the directory where I keep my data, and call the Kaggle client with the partial URL of the page. Kaggle makes it easy to paste this in:
kaggle datasets download amar5693/screen-time-sleep-and-stress-analysis-dataset
I’ll unzip the newly downloaded dataset:
unzip screen-time-sleep-and-stress-analysis-dataset.zip
I can see that the name of the expanded CSV file is “ScreenTime vs MentalWellness.csv.”
Creating a Jupyter notebook
Now it’s time to get to work exploring the dataset
With the dataset downloaded and unpacked, I can now explore it. Exploring datasets is different from conventional programming because it favors interactive use. For interactive Python, I favor IPython for quick experiments and Jupyter notebooks for when I want to save my work for later.
Jupyter notebooks are useful for two things: keeping track of my progress and sharing results with other people. Jupyter notebooks let me mix code, visualizations, and commentary into one document.
My data science/data analysis/statistics toolbox used to live in a Mamba environment, but I found it annoyingly slow to update. I decided to try Pixi for a more streamlined approach. Pixi is primarily project-based, but it also allows me to set up a global environment for experimentation so I don’t have to switch into a new environment. It also updates much more quickly than Mamba does.
My usual workflow is:
- Start Jupyter
- Create a new notebook
- Import the data
- Analyze it using Python libraries, creating plots, running tests, and adding commentary on the results
I can launch a Jupyter notebook at the command line:
jupyter notebook
This will open Jupyter in the default browser and launch a file browser, starting in the directory where I launched Jupyter.
I’ll create a new notebook. A Python kernel runs in the background, ready to evaluate code. I click on the filename to rename it. With datasets from Kaggle, I like to include a link to the original URL so I can find it again later.
Text cells are in Markdown.
With these out of the way, I can import my Python statistical toolbox. I create a cell. By default, new cells are code cells.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_theme()
import statsmodels.formula.api as smf
from scipy import stats
Most of these just import common libraries under shortened names to make them easier to type. The “%matplotlib inline” command is a “magic command” that tells Jupyter to embed plots in the notebook; otherwise, they’d be displayed in a separate window. (Seaborn is a front end to Matplotlib for common statistical plots.) The “sns.set_theme()” line tells Seaborn to use its default theme, which makes plots easier to read.
Importing and visualizing the data
Descriptive stats, histograms, and scatterplots tell the data’s story
With the libraries imported, I can get to work by importing the data.
To do that, I load the data I downloaded into a pandas DataFrame, a rectangular array similar to a spreadsheet.
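As a minimal illustration of what a DataFrame is (the numbers here are invented purely for illustration; the column names just mirror the ones used later), one can be built directly from a Python dictionary:

```python
import pandas as pd

# Toy DataFrame: named columns, indexed rows, like a small spreadsheet.
# The values here are made up for illustration only.
toy = pd.DataFrame({
    "screen_time_hours": [2.0, 5.5, 8.0],
    "stress_level_0_10": [3, 6, 8],
})
print(toy.shape)  # (3, 2): three rows, two columns
```

Reading a CSV with pandas produces the same kind of object, with one column per CSV column.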
I’ll import the data using the relative path to where I downloaded it:
screen = pd.read_csv('data/ScreenTime vs MentalWellness.csv')
Next I’ll examine the first few lines of the DataFrame:
screen.head()
I’ll then compute descriptive statistics for the numerical columns in the DataFrame.
screen.describe()
These include the mean or average, the standard deviation, the minimum, the 25th percentile or lower quartile, the median or 50th percentile, the 75th percentile or upper quartile, and the maximum value.
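To make those terms concrete, here’s what describe() reports on a tiny hand-made series (five invented values, just for illustration):

```python
import pandas as pd

# Five invented values, to show the summary statistics describe() returns.
s = pd.Series([1, 2, 3, 4, 5], name="screen_time_hours")
summary = s.describe()

print(summary["mean"])  # 3.0 — the average
print(summary["50%"])   # 3.0 — the median
print(summary["25%"])   # 2.0 — the lower quartile
print(summary["75%"])   # 4.0 — the upper quartile
```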
I can look at the distribution of a column with a histogram. I’ll use Seaborn’s displot function.
Suppose I wanted to look at the distribution of hours of screen time in this sample. I would use something like
sns.displot(data=screen, x='screen_time_hours')
A histogram will tell me what kind of statistical tests I can use later. This histogram appears to be almost normally distributed, with a bell-shaped curve.
Many common statistical tests, such as the t-test or analysis of variance (ANOVA), assume the data are normally distributed. Using them with skewed datasets can throw off the results. Non-parametric tests can be more reliable in those circumstances.
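As a sketch of how I might check this — using synthetic data here rather than the Kaggle file — SciPy’s shapiro function tests for normality, and mannwhitneyu is a non-parametric alternative to the independent-samples t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=5, scale=1, size=200)   # bell-shaped
skewed_data = rng.exponential(scale=2, size=200)     # right-skewed

# Shapiro-Wilk test: a small p-value suggests the data are NOT normal.
p_skewed = stats.shapiro(skewed_data).pvalue  # tiny for skewed data

# When normality is doubtful, the Mann-Whitney U test compares two
# groups without assuming a normal distribution.
u_stat, u_p = stats.mannwhitneyu(normal_data, skewed_data)
```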
I’m also on the lookout for possible correlations. I can test these by making scatterplots of one variable vs. another.
To display the scatterplot of screen time vs. a self-rated stress score, I can use the relplot function:
sns.relplot(x="screen_time_hours", y="stress_level_0_10", data=screen)
The scatterplot shows a positive linear relationship, suggesting that the stress level rises with more screen time. I can plot the regression line over the scatterplot with the regplot function.
sns.regplot(x="screen_time_hours", y="stress_level_0_10", data=screen)
Finding correlations
Finding trends and making predictions
While there appears to be a correlation between screen time and stress, it’s only a correlation. Screen time doesn’t necessarily cause stress; the plot just shows there’s a relationship between the two.
Seaborn’s regression plot doesn’t give me the values of the linear function. To get the slope m and the y-intercept b for the classic slope-intercept equation y = mx + b that you might remember from high school, I have to use another library.
statsmodels is a library that has many common statistical tests and is particularly designed for regression analysis. It can get regressions using formulas inspired by R, another language popular for statistics and data analysis.
results = smf.ols('stress_level_0_10 ~ screen_time_hours', data=screen).fit()
results.summary()
This will put the regression results into a table in the Jupyter notebook. The table shows the values of the coefficients for the slope-intercept equation. Here, the y-intercept is approximately 2.86, and the slope is 0.59. I can also make a rough prediction of how stressed out a person would be based on their reported screen time by plugging it into the equation.
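As a sketch of what that prediction looks like in code — using synthetic data generated from roughly those coefficients, since the Kaggle CSV isn’t bundled here — the fitted model’s predict method plugs new values into the equation:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Kaggle data: stress generated from the
# coefficients reported above (intercept ~2.86, slope ~0.59) plus noise.
rng = np.random.default_rng(42)
hours = rng.uniform(1, 10, size=300)
stress = 2.86 + 0.59 * hours + rng.normal(0, 0.5, size=300)
df = pd.DataFrame({"screen_time_hours": hours, "stress_level_0_10": stress})

results = smf.ols("stress_level_0_10 ~ screen_time_hours", data=df).fit()

# Predict the stress level for a person reporting 6 hours of screen time.
new = pd.DataFrame({"screen_time_hours": [6.0]})
predicted = float(results.predict(new).iloc[0])
# Roughly 2.86 + 0.59 * 6 ≈ 6.4, give or take the noise.
```

The fitted intercept and slope themselves are also available directly via results.params.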
The hardest part of data analysis is finding it
Kaggle solves a major stumbling block in practicing data analysis: finding the data. I can now find interesting datasets quickly on Kaggle and explore them with my favorite Python libraries. This workflow lets me focus on analyzing the data instead of hunting for it.

