The best way to learn stats and data science is to actually do it. And to do it, you need datasets. One good place to find them is Kaggle. Here’s how I find, download, and explore Kaggle datasets.
Searching for datasets
You can search directly or by topic
Kaggle is a site that hosts data science competitions, but it’s the datasets that I’m really interested in. Users have uploaded a wide variety of datasets. They come from a variety of sources, such as public records and APIs; some are even simulated.
My go-to environment for analysis is what I call the Python data stack: IPython, Jupyter, NumPy, SciPy, pandas, Seaborn, and statsmodels.
The Kaggle website has a decent search function. I’ll either search for a dataset on a topic I’m interested in or browse by topic. Since I write about tech, the “computer science” topic usually surfaces a lot of interesting datasets.
From this list, for example, I think I’ll download one on screen time and mental health.
Every dataset on Kaggle has a page describing it, known as a “data card.” This spells out the columns of the dataset, as well as how it was generated.
Downloading the dataset
You can download and unzip directly, or use a CLI client
Now I’ll want to download the dataset. It’s a CSV file compressed into a ZIP archive. I could just download it from the website and unpack it, but Kaggle also provides a command-line client, which I’ve already installed via pip. You’ll need to set up an API key to use it; you can generate one in the website settings after making an account. Kaggle has documentation on how to use your API key with the client.
I’ll navigate to the directory where I keep my data, and call the Kaggle client with the partial URL of the page. Kaggle makes it easy to paste this in:
kaggle datasets download amar5693/screen-time-sleep-and-stress-analysis-dataset
I’ll unzip the newly downloaded dataset:
unzip screen-time-sleep-and-stress-analysis-dataset.zip
I can see that the name of the expanded CSV file is “ScreenTime vs MentalWellness.csv.”
Creating a Jupyter notebook
Now it’s time to get to work exploring the dataset
With the dataset downloaded and unpacked, I can now explore it. Exploring datasets is different from conventional programming because it favors interactive use. For interactive Python, I favor IPython for quick experiments and Jupyter notebooks for when I want to save my work for later.
Jupyter notebooks are useful for two things: keeping track of my progress and sharing results with other people. Jupyter notebooks let me mix code, visualizations, and commentary into one document.
My data science/data analysis/statistics toolbox used to live in a Mamba environment, but I found it annoyingly slow to update. I decided to try Pixi for a more streamlined approach. Pixi is primarily project-based, but it also allows me to set up a global environment for experimentation so I don’t have to switch into a new environment. It also updates much more quickly than Mamba does.
My usual workflow is:
- Start Jupyter
- Create a new notebook
- Import the data
- Analyze it using Python libraries, creating plots, running tests, and adding commentary on the results
I can launch a Jupyter notebook at the command line:
jupyter notebook
This will open Jupyter in the default browser and launch a file browser, starting in the directory where I launched Jupyter.
I’ll create a new notebook. A Python kernel runs in the background, ready to evaluate code. I click on the filename to rename it. With datasets from Kaggle, I like to include a link to the original URL so I can find it again later.
Text cells are in Markdown.
With these out of the way, I can import my Python statistical toolbox. I create a cell. By default, new cells are code cells.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_theme()
import statsmodels.formula.api as smf
from scipy import stats
Most of these just import common libraries under shortened names to make them easier to type. The “%matplotlib inline” command is a “magic command” that tells Jupyter to embed plots in the notebook; otherwise, they’d be displayed in a separate window. (Seaborn is a front end to Matplotlib for common statistical plots.) The “sns.set_theme()” line tells Seaborn to use its default theme, which makes plots easier to read.
Importing and visualizing the data
Descriptive stats, histograms, and scatterplots tell the data’s story
With the libraries imported, I can get to work by importing the data.
To do that, I load the data I downloaded into a pandas DataFrame, a rectangular array similar to a spreadsheet.
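As a minimal illustration of what a DataFrame is (the numbers here are invented purely for illustration; the column names just mirror the ones used later), one can be built directly from a Python dictionary:

```python
import pandas as pd

# Toy DataFrame: named columns, indexed rows, like a small spreadsheet.
# The values here are made up for illustration only.
toy = pd.DataFrame({
    "screen_time_hours": [2.0, 5.5, 8.0],
    "stress_level_0_10": [3, 6, 8],
})
print(toy.shape)  # (3, 2): three rows, two columns
```

Reading a CSV with pandas produces the same kind of object, with one column per CSV column.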
I’ll import the data using the relative path to where I downloaded it:
screen = pd.read_csv('data/ScreenTime vs MentalWellness.csv')
Next I’ll examine the first few lines of the DataFrame:
screen.head()
I’ll then compute descriptive statistics for the numerical columns in the DataFrame.
screen.describe()
These include the mean or average, the standard deviation, the minimum, the 25th percentile or lower quartile, the median or 50th percentile, the 75th percentile or upper quartile, and the maximum value.
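To make those terms concrete, here’s what describe() reports on a tiny hand-made series (five invented values, just for illustration):

```python
import pandas as pd

# Five invented values, to show the summary statistics describe() returns.
s = pd.Series([1, 2, 3, 4, 5], name="screen_time_hours")
summary = s.describe()

print(summary["mean"])  # 3.0 — the average
print(summary["50%"])   # 3.0 — the median
print(summary["25%"])   # 2.0 — the lower quartile
print(summary["75%"])   # 4.0 — the upper quartile
```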
I can look at the distribution of a column with a histogram. I’ll use Seaborn’s displot function.
Suppose I wanted to look at the distribution of hours of screen time in this sample. I would use something like
sns.displot(data=screen, x='screen_time_hours')
A histogram will tell me what kind of statistical tests I can use later. This histogram appears to be almost normally distributed, with a bell-shaped curve.
Many common statistical tests, such as the t-test or analysis of variance (ANOVA), assume the data are normally distributed. Using them with skewed datasets can throw off the results. Non-parametric tests can be more reliable in those circumstances.
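As a sketch of how I might check this — using synthetic data here rather than the Kaggle file — SciPy’s shapiro function tests for normality, and mannwhitneyu is a non-parametric alternative to the independent-samples t-test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(loc=5, scale=1, size=200)   # bell-shaped
skewed_data = rng.exponential(scale=2, size=200)     # right-skewed

# Shapiro-Wilk test: a small p-value suggests the data are NOT normal.
p_skewed = stats.shapiro(skewed_data).pvalue  # tiny for skewed data

# When normality is doubtful, the Mann-Whitney U test compares two
# groups without assuming a normal distribution.
u_stat, u_p = stats.mannwhitneyu(normal_data, skewed_data)
```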
I’m also on the lookout for possible correlations. I can test these by making scatterplots of one variable vs. another.
To display the scatterplot of screen time vs. a self-rated stress score, I can use the relplot function:
sns.relplot(x="screen_time_hours", y="stress_level_0_10", data=screen)
The scatterplot shows a positive linear relationship, suggesting that the stress level rises with more screen time. I can plot the regression line over the scatterplot with the regplot function.
sns.regplot(x="screen_time_hours", y="stress_level_0_10", data=screen)
Finding correlations
Finding trends and making predictions
While there appears to be a correlation between screen time and stress, it’s only a correlation. Screen time doesn’t necessarily cause stress; the plot just shows there’s a relationship between the two.
Seaborn’s regression plot doesn’t give me the values of the linear function. To get the slope m and the y-intercept b for the classic slope-intercept equation y = mx + b that you might remember from high school, I have to use another library.
statsmodels is a library that has many common statistical tests and is particularly designed for regression analysis. It can get regressions using formulas inspired by R, another language popular for statistics and data analysis.
results = smf.ols('stress_level_0_10 ~ screen_time_hours', data=screen).fit()
results.summary()
This will put the regression results into a table in the Jupyter notebook. The table shows the values of the coefficients for the slope-intercept equation. Here, the y-intercept is approximately 2.86, and the slope is 0.59. I can also make a rough prediction of how stressed out a person would be based on their reported screen time by plugging it into the equation.
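As a sketch of what that prediction looks like in code — using synthetic data generated from roughly those coefficients, since the Kaggle CSV isn’t bundled here — the fitted model’s predict method plugs new values into the equation:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the Kaggle data: stress generated from the
# coefficients reported above (intercept ~2.86, slope ~0.59) plus noise.
rng = np.random.default_rng(42)
hours = rng.uniform(1, 10, size=300)
stress = 2.86 + 0.59 * hours + rng.normal(0, 0.5, size=300)
df = pd.DataFrame({"screen_time_hours": hours, "stress_level_0_10": stress})

results = smf.ols("stress_level_0_10 ~ screen_time_hours", data=df).fit()

# Predict the stress level for a person reporting 6 hours of screen time.
new = pd.DataFrame({"screen_time_hours": [6.0]})
predicted = float(results.predict(new).iloc[0])
# Roughly 2.86 + 0.59 * 6 ≈ 6.4, give or take the noise.
```

The fitted intercept and slope themselves are also available directly via results.params.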
The hardest part of data analysis is finding it
Kaggle solves a major stumbling block in practicing data analysis: finding the data. I can now find interesting datasets quickly on Kaggle and explore them with my favorite Python libraries. This workflow lets me focus on analyzing the data instead of hunting for it.

