Lecture 14: Python & Data - Exploring with Pandas and Plotly Express
Welcome back, Python explorers! After learning about games and web scraping, today we dive into another powerful use of Python: working with data. We’ll learn how to analyze information and create insightful charts, skills that are useful in many fields!
1. Introduction: Working with Data
Data is all around us: scores from games, information about planets, survey results, and much more. Being able to understand and find patterns in data is a valuable skill. Python, with its powerful libraries, is an excellent tool for this.
What is Data Analysis?
Data analysis involves:
- Gathering data.
- Cleaning and organizing it.
- Asking questions about the data.
- Finding patterns and insights.
- Communicating those findings, often through visualizations like charts.
Our Goals for This Lecture:
- Introduce Pandas, a Python library for powerful data manipulation.
- Learn to load data from a CSV (Comma Separated Values) format.
- Perform basic data inspection and exploration using Pandas:
- Viewing the first few rows (
.head()
). - Getting a summary of the data (
.info()
). - Understanding the data’s dimensions (
.shape
). - Handling missing data (
.isnull().sum()
,fillna()
). - Grouping data (
.groupby()
). - Creating new columns from existing data.
- Viewing the first few rows (
- Introduce Plotly Express, a library for creating interactive charts.
- Create bar charts, scatter plots, and histograms to visualize our data.
- Work through a practical example using a dataset about planets.
Let’s get started with our data exploration tools!
2. Your Toolkit: Pandas & Plotly Express
To effectively work with data in Python, we’ll use two key libraries: Pandas and Plotly Express.
Tool #1: Pandas - For Data Organization & Analysis
- What is Pandas? Pandas is a fundamental Python library for data analysis. It provides data structures and tools designed to make working with structured data (like tables) intuitive and efficient.
- DataFrames: The core data structure in Pandas is the DataFrame. You can think of a DataFrame as a table, similar to a spreadsheet or a SQL table, where data is organized into rows and columns. Each column has a name (a header), and each row represents an observation or record.
Tool #2: Plotly Express - For Data Visualization
- What is Plotly Express? Plotly Express is a high-level interface for Plotly, a powerful graphing library. It allows you to create a wide variety of interactive charts (bar charts, scatter plots, histograms, etc.) with concise and easy-to-understand code.
- These charts are typically opened in a web browser and allow for interactivity like hovering to see data values, zooming, and panning.
Installing Pandas & Plotly Express:
These are external libraries, so if you haven’t used them before, you’ll need to install them. Open your terminal or command prompt and run:
pip install pandas plotly
(Plotly Express is included when you install the plotly
package).
Our Dataset: Planetary Information
For this lecture, we’ll use a dataset about planets in our solar system (and a fictional one for more data points!). This data includes new columns like PlanetType
and will intentionally have some missing values to demonstrate data cleaning.
The data we will use looks like this (and will be embedded directly in our Python script for this example):
PlanetName,Diameter_km,NumberOfMoons,Color,AverageTemp_C,PlanetType
Mercury,4879,0,Gray,-173,Rocky
Venus,12104,0,Yellowish,464,Rocky
Earth,12756,1,Blue,15,Rocky
Mars,6792,2,Red,,Rocky # Missing AverageTemp_C
Jupiter,142984,79,,,-145,Gas Giant # Missing Color
Saturn,120536,82,PaleGold,-178,Gas Giant
Uranus,51118,27,LightBlue,-214,Ice Giant
Neptune,49528,14,DarkBlue,-224,Ice Giant
PlanetNine,19000,3,Brown,50,Rocky
This dataset includes the planet’s name, diameter, number of moons, color, average temperature (Celsius), and its type (Rocky, Gas Giant, Ice Giant).
3. Loading and Inspecting Data with Pandas
The first step in any data analysis task is to load your data into a structure that your tools can work with. With Pandas, this means loading data into a DataFrame.
The Python Script (planet_explorer.py
):
We’ll be working with a Python script named planet_explorer.py
. Make sure this file is in a directory, for example, examples/data_detective/
.
# examples/data_detective/planet_explorer.py (Part 1: Loading & Inspecting)
import pandas as pd
import io # Needed for loading CSV data from a string
print("Planetary Data Analysis with Pandas & Plotly Express")
# For this educational example, CSV data is embedded as a string.
# In a typical scenario, you would load from a .csv file:
# df_planets = pd.read_csv("planets_data.csv")
# (Assuming 'planets_data.csv' is in the same directory)
csv_data_string = """PlanetName,Diameter_km,NumberOfMoons,Color,AverageTemp_C,PlanetType
Mercury,4879,0,Gray,-173,Rocky
Venus,12104,0,Yellowish,464,Rocky
Earth,12756,1,Blue,15,Rocky
Mars,6792,2,Red,,Rocky
Jupiter,142984,79,,,-145,Gas Giant
Saturn,120536,82,PaleGold,-178,Gas Giant
Uranus,51118,27,LightBlue,-214,Ice Giant
Neptune,49528,14,DarkBlue,-224,Ice Giant
PlanetNine,19000,3,Brown,50,Rocky
"""
# Use io.StringIO to allow Pandas to read the string as if it were a file
data_file = io.StringIO(csv_data_string)
# 1. Load data into a Pandas DataFrame
df_planets = pd.read_csv(data_file)
# 2. Initial Data Inspection: .head()
# The .head() method shows the first few rows of the DataFrame (default is 5).
# Useful for a quick glimpse of your data.
print("\n--- First 5 Rows of Planetary Data (Initial Load) ---")
print(df_planets.head())
# 3. DataFrame Summary: .info()
# The .info() method provides a concise summary of the DataFrame:
# - Index data type and number of entries.
# - Column names and their data types (e.g., int64, float64, object for strings).
# - Number of non-null values in each column (helps identify missing data).
# - Memory usage.
print("\n--- DataFrame Structure and Info ---")
df_planets.info()
# 4. DataFrame Dimensions: .shape
# The .shape attribute returns a tuple representing (number_of_rows, number_of_columns).
num_rows = df_planets.shape[0]
num_cols = df_planets.shape[1]
print(f"\nOur dataset has {num_rows} rows (planets) and {num_cols} columns (features).")
Explanation:
import pandas as pd
: Imports Pandas, aliased aspd
.import io
: This module is used here to treat our multilinecsv_data_string
as if it were a file, whichpd.read_csv()
can then process.csv_data_string
: Contains our planetary data, including the newPlanetType
column and some intentionally missing values (for Mars’s temperature and Jupiter’s color).df_planets = pd.read_csv(io.StringIO(csv_data_string))
: This line reads the CSV-formatted string into a Pandas DataFrame nameddf_planets
. If you hadplanets_data.csv
as a separate file in the same directory, you’d usedf_planets = pd.read_csv("planets_data.csv")
.df_planets.head()
: Displays the column headers and the first five rows.df_planets.info()
: Outputs a summary, very useful for seeing data types and counts of non-empty values per column. You’ll noticeAverageTemp_C
andColor
might show fewer non-null entries due to our missing data.df_planets.shape
: Tells us the dimensions of our table.
With the data loaded, we can move on to more detailed analysis.
4. Data Interrogation: Working with Pandas DataFrames
Now that our planetary data is loaded into the df_planets
DataFrame, let’s explore some common Pandas operations to analyze it.
a. Dealing with Missing Data
Real-world data is often messy and can have missing values. Pandas provides tools to find and handle them.
# planet_explorer.py (Continued)
print("\n--- Missing Data Check (Before Handling) ---")
# .isnull() creates a boolean DataFrame (True where data is missing)
# .sum() then counts True values per column.
print(df_planets.isnull().sum())
# Strategy 1: Fill missing numerical data with the mean
avg_temp_fill_value = df_planets['AverageTemp_C'].mean()
df_planets['AverageTemp_C'].fillna(avg_temp_fill_value, inplace=True)
# 'inplace=True' modifies the DataFrame directly.
# Strategy 2: Fill missing categorical data with a placeholder like 'Unknown'
df_planets['Color'].fillna('Unknown', inplace=True)
print("\n--- Data After Handling Missing Values (First 5 rows) ---")
print(df_planets.head())
print("\n--- Missing Data Check (After Handling)---")
print(df_planets.isnull().sum()) # Should show 0 for handled columns
df_planets.isnull().sum()
: Counts missing (NaN or None) values in each column.fillna(value, inplace=True)
: Fills missing values.- For
AverageTemp_C
(a numerical column), we fill missing values with the mean temperature of the other planets. - For
Color
(a text/categorical column), we fill missing values with the string “Unknown”. inplace=True
modifiesdf_planets
directly. Without it, you’d need to dodf_planets['Column'] = df_planets['Column'].fillna(value)
.
- For
b. Selecting Columns & Filtering Rows
# planet_explorer.py (Continued)
print("\n--- Specific Columns: Planet Names and Types ---")
print(df_planets[['PlanetName', 'PlanetType']])
print("\n--- Filtering: Rocky Planets ---")
rocky_planets_df = df_planets[df_planets['PlanetType'] == 'Rocky']
print(rocky_planets_df[['PlanetName', 'AverageTemp_C']])
- Selecting columns:
df[['Col1', 'Col2']]
. - Filtering rows:
df[df['ColumnName'] == 'Value']
creates a new DataFrame containing only rows that meet the condition.
c. Sorting Data
# planet_explorer.py (Continued)
print("\n--- Planets Sorted by Diameter (Largest First) ---")
sorted_by_diameter = df_planets.sort_values(by='Diameter_km', ascending=False)
print(sorted_by_diameter[['PlanetName', 'Diameter_km']])
sort_values(by='ColumnName', ascending=False)
sorts the DataFrame.ascending=True
would sort smallest to largest.
d. Grouping Data - groupby()
groupby()
is powerful for summarizing data by categories.
# planet_explorer.py (Continued)
print("\n--- Average Number of Moons by Planet Type ---")
avg_moons_by_type = df_planets.groupby('PlanetType')['NumberOfMoons'].mean().sort_values(ascending=False)
print(avg_moons_by_type)
print("\n--- Total Diameter by Planet Type (Illustrative) ---")
total_diameter_by_type = df_planets.groupby('PlanetType')['Diameter_km'].sum().sort_values(ascending=False)
print(total_diameter_by_type.apply(lambda x: f"{x:,.0f} km"))
df_planets.groupby('PlanetType')
: This groups rows by unique values in thePlanetType
column.['NumberOfMoons'].mean()
: Then, for each group, it selects theNumberOfMoons
column and calculates the mean..apply(lambda x: f"{x:,.0f} km")
is used for formatting the output string.
e. Creating New Columns from Old Ones
You can easily create new columns based on calculations from existing columns.
# planet_explorer.py (Continued)
print("\n--- Adding Planet Radius Column ---")
df_planets['Radius_km'] = df_planets['Diameter_km'] / 2
print(df_planets[['PlanetName', 'Diameter_km', 'Radius_km']].head())
df_planets['Radius_km'] = ...
: This creates a new column namedRadius_km
.- The values are calculated by taking the
Diameter_km
column and dividing by 2.
These operations provide a glimpse into Pandas’ capabilities for data wrangling and analysis.
5. Visualizing Data with Plotly Express
Tables of data are informative, but charts often make patterns and comparisons much clearer. Plotly Express helps us create interactive visualizations easily.
Importing Plotly Express:
# Add this to the top of your planet_explorer.py if not already there
import plotly.express as px
a. Bar Chart: Moons per Planet
# planet_explorer.py (Continued)
print("\n--- Visualizing the Data with Plotly Express ---")
print("Generating charts... these will typically open in your web browser.")
fig_bar_moons = px.bar(df_planets,
x='PlanetName',
y='NumberOfMoons',
title='Number of Moons per Planet',
color='PlanetType', # Color bars by PlanetType
labels={'PlanetName':'Planet',
'NumberOfMoons':'Known Moons',
'PlanetType':'Type of Planet'})
fig_bar_moons.show()
px.bar()
creates a bar chart. We specify the DataFrame, x-axis column, y-axis column, title, and how to color the bars (byPlanetType
). Thelabels
dictionary provides user-friendly names for axes and legends.fig.show()
displays the chart.
b. Scatter Plot: Diameter vs. Temperature
# planet_explorer.py (Continued)
fig_scatter_temp_dia = px.scatter(df_planets,
x='Diameter_km',
y='AverageTemp_C',
title='Planet Size vs. Avg. Temperature (Moons as Dot Size)',
color='PlanetType',
size='NumberOfMoons', # Dot size represents number of moons
hover_name='PlanetName', # Show planet name on mouse hover
labels={'Diameter_km':'Diameter (km)',
'AverageTemp_C':'Avg. Temp (°C)',
'PlanetType':'Type of Planet',
'NumberOfMoons':'Number of Moons'})
fig_scatter_temp_dia.show()
px.scatter()
creates a scatter plot.size='NumberOfMoons'
makes the size of each point (bubble) reflect the number of moons.hover_name='PlanetName'
shows the planet’s name when you mouse over a point.
c. More Chart Power: Histograms
Histograms show the distribution of a single numerical variable by grouping values into “bins.”
# planet_explorer.py (Continued)
fig_hist_diameter = px.histogram(df_planets,
x='Diameter_km',
title='Distribution of Planet Diameters',
nbins=5, # Suggests number of bins (groups) for the data
labels={'Diameter_km':'Diameter (km)'},
color='PlanetType', # Color bars within the histogram by type
marginal="rug") # Adds small lines (rug plot) to show individual data points
fig_hist_diameter.show()
px.histogram()
creates the histogram.nbins
suggests how many bars (bins) to divide the data into.marginal="rug"
adds a small plot along the axis showing individual data points.
d. Customizing Your Charts
Plotly Express charts are highly customizable. We’ve already used:
title
: Sets the chart title.labels
: Renames axes and legend titles (e.g.,labels={'old_name':'New Name', ...}
).color
: Assigns colors based on values in a column.size
: (For scatter plots) Assigns marker size based on values in a column.color_discrete_map
: (Used in the Python script forfig_bar_moons
andfig_scatter_temp_dia
) Allows you to specify exact colors for categories, e.g.,{'Rocky':'brown', 'Gas Giant':'orange', ...}
.
Plotly offers many more customization options for fonts, layouts, and trace properties if you want to dive deeper!
6. Recap & Further Exploration
You’ve now taken your first steps as a Python Data Analyst! You’ve seen how to load, inspect, clean, analyze, and visualize a dataset.
Key Skills Unlocked:
- Pandas for Data Handling:
- Loading data from CSVs (even from a string!).
- Using
.head()
,.info()
,.shape
for initial inspection. - Detecting missing data with
.isnull().sum()
. - Filling missing data with
fillna()
. - Selecting columns and filtering rows based on conditions.
- Sorting data with
.sort_values()
. - Grouping data with
.groupby()
for aggregate calculations (like mean or sum by category). - Creating new columns from existing data.
- Plotly Express for Visualization:
- Creating interactive bar charts (
px.bar()
), scatter plots (px.scatter()
), and histograms (px.histogram()
). - Customizing charts with titles, labels, colors, and sizes based on data.
- Using
fig.show()
to display charts.
- Creating interactive bar charts (
Where to Go From Here?
Data analysis is a vast and exciting field. You can apply these skills to:
- Analyze game statistics.
- Explore scientific datasets.
- Investigate social trends (using publicly available data).
- Even combine with web scraping to analyze data you collect from websites!
Homework: New Data Investigations!
-
Planet Analysis Deep Dive (in
planet_explorer.py
):- Missing Colors: In the
planet_explorer.py
script, Jupiter’s color is missing. Fill it with “Orange/Brown Bands” or another suitable description. Verify withdf_planets.info()
ordf_planets.isnull().sum()
that it’s filled. - Coldest Outliers: Find and print the names and average temperatures of the 2 coldest planets after filling missing temperature data.
- Planet Type Counts: How many planets of each
PlanetType
are there? (Hint:df_planets['PlanetType'].value_counts()
). Create a bar chart of these counts using Plotly Express. - Largest Rocky Planet: Find the name and diameter of the largest “Rocky” planet.
- Missing Colors: In the
-
Your Own Dataset Adventure:
- Create a new CSV file (e.g.,
my_favorite_books.csv
ormy_game_scores.csv
). - Include at least 4 columns (one text/categorical, at least two numerical) and 8-10 rows.
- Include at least one missing value intentionally in one of your numerical columns.
- Write a new Python script that:
a. Loads your CSV into a Pandas DataFrame.
b. Prints the.info()
and.head()
.
c. Handles the missing value in your numerical column (e.g., fill with mean or a specific number).
d. Performs at least onegroupby()
operation and prints the result (e.g., if you have book genres, find the average rating per genre).
e. Creates two different types of Plotly Express charts from your data (e.g., a scatter plot and a histogram or bar chart). Customize them with titles and labels.
- Create a new CSV file (e.g.,
-
(Challenge) Temperature Range by Planet Type:
- For each
PlanetType
, find the minimum and maximumAverageTemp_C
. - You might need to use
groupby()
along with the.agg()
(aggregate) method in Pandas, for example:df_planets.groupby('PlanetType')['AverageTemp_C'].agg(['min', 'max'])
. - Try to present this information clearly. Could you make a Plotly chart that shows these ranges? (This might require some research into Plotly’s capabilities for range plots or grouped bar charts showing min/max).
- For each
The more you practice, the more comfortable you’ll become with these tools. Data is everywhere, and Python gives you the power to understand it! Happy analyzing!