Lecture 14: Python & Data - Exploring with Pandas and Plotly Express

Lecture 14: Python & Data - Exploring with Pandas and Plotly Express

Welcome back, Python explorers! After learning about games and web scraping, today we dive into another powerful use of Python: working with data. We’ll learn how to analyze information and create insightful charts, skills that are useful in many fields!

1. Introduction: Working with Data

Data is all around us: scores from games, information about planets, survey results, and much more. Being able to understand and find patterns in data is a valuable skill. Python, with its powerful libraries, is an excellent tool for this.

What is Data Analysis?
Data analysis involves:

  • Gathering data.
  • Cleaning and organizing it.
  • Asking questions about the data.
  • Finding patterns and insights.
  • Communicating those findings, often through visualizations like charts.

Our Goals for This Lecture:

  • Introduce Pandas, a Python library for powerful data manipulation.
  • Learn to load data from a CSV (Comma Separated Values) format.
  • Perform basic data inspection and exploration using Pandas:
    • Viewing the first few rows (.head()).
    • Getting a summary of the data (.info()).
    • Understanding the data’s dimensions (.shape).
    • Handling missing data (.isnull().sum(), fillna()).
    • Grouping data (.groupby()).
    • Creating new columns from existing data.
  • Introduce Plotly Express, a library for creating interactive charts.
  • Create bar charts, scatter plots, and histograms to visualize our data.
  • Work through a practical example using a dataset about planets.

Let’s get started with our data exploration tools!


2. Your Toolkit: Pandas & Plotly Express

To effectively work with data in Python, we’ll use two key libraries: Pandas and Plotly Express.

Tool #1: Pandas - For Data Organization & Analysis

  • What is Pandas? Pandas is a fundamental Python library for data analysis. It provides data structures and tools designed to make working with structured data (like tables) intuitive and efficient.
  • DataFrames: The core data structure in Pandas is the DataFrame. You can think of a DataFrame as a table, similar to a spreadsheet or a SQL table, where data is organized into rows and columns. Each column has a name (a header), and each row represents an observation or record.

Tool #2: Plotly Express - For Data Visualization

  • What is Plotly Express? Plotly Express is a high-level interface for Plotly, a powerful graphing library. It allows you to create a wide variety of interactive charts (bar charts, scatter plots, histograms, etc.) with concise and easy-to-understand code.
  • These charts are typically opened in a web browser and allow for interactivity like hovering to see data values, zooming, and panning.

Installing Pandas & Plotly Express:
These are external libraries, so if you haven’t used them before, you’ll need to install them. Open your terminal or command prompt and run:

pip install pandas plotly

(Plotly Express is included when you install the plotly package).

Our Dataset: Planetary Information
For this lecture, we’ll use a dataset about planets in our solar system (and a fictional one for more data points!). This data includes new columns like PlanetType and will intentionally have some missing values to demonstrate data cleaning.

The data we will use looks like this (and will be embedded directly in our Python script for this example):

PlanetName,Diameter_km,NumberOfMoons,Color,AverageTemp_C,PlanetType
Mercury,4879,0,Gray,-173,Rocky
Venus,12104,0,Yellowish,464,Rocky
Earth,12756,1,Blue,15,Rocky
Mars,6792,2,Red,,Rocky  # Missing AverageTemp_C
Jupiter,142984,79,,,-145,Gas Giant # Missing Color
Saturn,120536,82,PaleGold,-178,Gas Giant
Uranus,51118,27,LightBlue,-214,Ice Giant
Neptune,49528,14,DarkBlue,-224,Ice Giant
PlanetNine,19000,3,Brown,50,Rocky

This dataset includes the planet’s name, diameter, number of moons, color, average temperature (Celsius), and its type (Rocky, Gas Giant, Ice Giant).


3. Loading and Inspecting Data with Pandas

The first step in any data analysis task is to load your data into a structure that your tools can work with. With Pandas, this means loading data into a DataFrame.

The Python Script (planet_explorer.py):
We’ll be working with a Python script named planet_explorer.py. Make sure this file is in a directory, for example, examples/data_detective/.

# examples/data_detective/planet_explorer.py (Part 1: Loading & Inspecting)
import pandas as pd
import io # Needed for loading CSV data from a string

print("Planetary Data Analysis with Pandas & Plotly Express")

# For this educational example, CSV data is embedded as a string.
# In a typical scenario, you would load from a .csv file:
# df_planets = pd.read_csv("planets_data.csv")
# (Assuming 'planets_data.csv' is in the same directory)

csv_data_string = """PlanetName,Diameter_km,NumberOfMoons,Color,AverageTemp_C,PlanetType
Mercury,4879,0,Gray,-173,Rocky
Venus,12104,0,Yellowish,464,Rocky
Earth,12756,1,Blue,15,Rocky
Mars,6792,2,Red,,Rocky
Jupiter,142984,79,,,-145,Gas Giant
Saturn,120536,82,PaleGold,-178,Gas Giant
Uranus,51118,27,LightBlue,-214,Ice Giant
Neptune,49528,14,DarkBlue,-224,Ice Giant
PlanetNine,19000,3,Brown,50,Rocky
"""

# Use io.StringIO to allow Pandas to read the string as if it were a file
data_file = io.StringIO(csv_data_string)

# 1. Load data into a Pandas DataFrame
df_planets = pd.read_csv(data_file)

# 2. Initial Data Inspection: .head()
# The .head() method shows the first few rows of the DataFrame (default is 5).
# Useful for a quick glimpse of your data.
print("\n--- First 5 Rows of Planetary Data (Initial Load) ---")
print(df_planets.head())

# 3. DataFrame Summary: .info()
# The .info() method provides a concise summary of the DataFrame:
# - Index data type and number of entries.
# - Column names and their data types (e.g., int64, float64, object for strings).
# - Number of non-null values in each column (helps identify missing data).
# - Memory usage.
print("\n--- DataFrame Structure and Info ---")
df_planets.info()

# 4. DataFrame Dimensions: .shape
# The .shape attribute returns a tuple representing (number_of_rows, number_of_columns).
num_rows = df_planets.shape[0]
num_cols = df_planets.shape[1]
print(f"\nOur dataset has {num_rows} rows (planets) and {num_cols} columns (features).")

Explanation:

  1. import pandas as pd: Imports Pandas, aliased as pd.
  2. import io: This module is used here to treat our multiline csv_data_string as if it were a file, which pd.read_csv() can then process.
  3. csv_data_string: Contains our planetary data, including the new PlanetType column and some intentionally missing values (for Mars’s temperature and Jupiter’s color).
  4. df_planets = pd.read_csv(io.StringIO(csv_data_string)): This line reads the CSV-formatted string into a Pandas DataFrame named df_planets. If you had planets_data.csv as a separate file in the same directory, you’d use df_planets = pd.read_csv("planets_data.csv").
  5. df_planets.head(): Displays the column headers and the first five rows.
  6. df_planets.info(): Outputs a summary, very useful for seeing data types and counts of non-empty values per column. You’ll notice AverageTemp_C and Color might show fewer non-null entries due to our missing data.
  7. df_planets.shape: Tells us the dimensions of our table.

With the data loaded, we can move on to more detailed analysis.


4. Data Interrogation: Working with Pandas DataFrames

Now that our planetary data is loaded into the df_planets DataFrame, let’s explore some common Pandas operations to analyze it.

a. Dealing with Missing Data
Real-world data is often messy and can have missing values. Pandas provides tools to find and handle them.

# planet_explorer.py (Continued)

print("\n--- Missing Data Check (Before Handling) ---")
# .isnull() creates a boolean DataFrame (True where data is missing)
# .sum() then counts True values per column.
print(df_planets.isnull().sum())

# Strategy 1: Fill missing numerical data with the mean
avg_temp_fill_value = df_planets['AverageTemp_C'].mean()
df_planets['AverageTemp_C'].fillna(avg_temp_fill_value, inplace=True)
# 'inplace=True' modifies the DataFrame directly.

# Strategy 2: Fill missing categorical data with a placeholder like 'Unknown'
df_planets['Color'].fillna('Unknown', inplace=True)

print("\n--- Data After Handling Missing Values (First 5 rows) ---")
print(df_planets.head())
print("\n--- Missing Data Check (After Handling)---")
print(df_planets.isnull().sum()) # Should show 0 for handled columns
  • df_planets.isnull().sum(): Counts missing (NaN or None) values in each column.
  • fillna(value, inplace=True): Fills missing values.
    • For AverageTemp_C (a numerical column), we fill missing values with the mean temperature of the other planets.
    • For Color (a text/categorical column), we fill missing values with the string “Unknown”.
    • inplace=True modifies df_planets directly. Without it, you’d need to do df_planets['Column'] = df_planets['Column'].fillna(value).

b. Selecting Columns & Filtering Rows

# planet_explorer.py (Continued)

print("\n--- Specific Columns: Planet Names and Types ---")
print(df_planets[['PlanetName', 'PlanetType']])

print("\n--- Filtering: Rocky Planets ---")
rocky_planets_df = df_planets[df_planets['PlanetType'] == 'Rocky']
print(rocky_planets_df[['PlanetName', 'AverageTemp_C']])
  • Selecting columns: df[['Col1', 'Col2']].
  • Filtering rows: df[df['ColumnName'] == 'Value'] creates a new DataFrame containing only rows that meet the condition.

c. Sorting Data

# planet_explorer.py (Continued)

print("\n--- Planets Sorted by Diameter (Largest First) ---")
sorted_by_diameter = df_planets.sort_values(by='Diameter_km', ascending=False)
print(sorted_by_diameter[['PlanetName', 'Diameter_km']])
  • sort_values(by='ColumnName', ascending=False) sorts the DataFrame. ascending=True would sort smallest to largest.

d. Grouping Data - groupby()
groupby() is powerful for summarizing data by categories.

# planet_explorer.py (Continued)

print("\n--- Average Number of Moons by Planet Type ---")
avg_moons_by_type = df_planets.groupby('PlanetType')['NumberOfMoons'].mean().sort_values(ascending=False)
print(avg_moons_by_type)

print("\n--- Total Diameter by Planet Type (Illustrative) ---")
total_diameter_by_type = df_planets.groupby('PlanetType')['Diameter_km'].sum().sort_values(ascending=False)
print(total_diameter_by_type.apply(lambda x: f"{x:,.0f} km"))
  • df_planets.groupby('PlanetType'): This groups rows by unique values in the PlanetType column.
  • ['NumberOfMoons'].mean(): Then, for each group, it selects the NumberOfMoons column and calculates the mean.
  • .apply(lambda x: f"{x:,.0f} km") is used for formatting the output string.

e. Creating New Columns from Old Ones
You can easily create new columns based on calculations from existing columns.

# planet_explorer.py (Continued)

print("\n--- Adding Planet Radius Column ---")
df_planets['Radius_km'] = df_planets['Diameter_km'] / 2
print(df_planets[['PlanetName', 'Diameter_km', 'Radius_km']].head())
  • df_planets['Radius_km'] = ...: This creates a new column named Radius_km.
  • The values are calculated by taking the Diameter_km column and dividing by 2.

These operations provide a glimpse into Pandas’ capabilities for data wrangling and analysis.


5. Visualizing Data with Plotly Express

Tables of data are informative, but charts often make patterns and comparisons much clearer. Plotly Express helps us create interactive visualizations easily.

Importing Plotly Express:

# Add this to the top of your planet_explorer.py if not already there
import plotly.express as px

a. Bar Chart: Moons per Planet

# planet_explorer.py (Continued)

print("\n--- Visualizing the Data with Plotly Express ---")
print("Generating charts... these will typically open in your web browser.")

fig_bar_moons = px.bar(df_planets,
                       x='PlanetName',
                       y='NumberOfMoons',
                       title='Number of Moons per Planet',
                       color='PlanetType', # Color bars by PlanetType
                       labels={'PlanetName':'Planet',
                               'NumberOfMoons':'Known Moons',
                               'PlanetType':'Type of Planet'})
fig_bar_moons.show()
  • px.bar() creates a bar chart. We specify the DataFrame, x-axis column, y-axis column, title, and how to color the bars (by PlanetType). The labels dictionary provides user-friendly names for axes and legends.
  • fig.show() displays the chart.

b. Scatter Plot: Diameter vs. Temperature

# planet_explorer.py (Continued)

fig_scatter_temp_dia = px.scatter(df_planets,
                                  x='Diameter_km',
                                  y='AverageTemp_C',
                                  title='Planet Size vs. Avg. Temperature (Moons as Dot Size)',
                                  color='PlanetType',
                                  size='NumberOfMoons',  # Dot size represents number of moons
                                  hover_name='PlanetName', # Show planet name on mouse hover
                                  labels={'Diameter_km':'Diameter (km)',
                                          'AverageTemp_C':'Avg. Temp (°C)',
                                          'PlanetType':'Type of Planet',
                                          'NumberOfMoons':'Number of Moons'})
fig_scatter_temp_dia.show()
  • px.scatter() creates a scatter plot.
  • size='NumberOfMoons' makes the size of each point (bubble) reflect the number of moons.
  • hover_name='PlanetName' shows the planet’s name when you mouse over a point.

c. More Chart Power: Histograms
Histograms show the distribution of a single numerical variable by grouping values into “bins.”

# planet_explorer.py (Continued)

fig_hist_diameter = px.histogram(df_planets,
                                 x='Diameter_km',
                                 title='Distribution of Planet Diameters',
                                 nbins=5, # Suggests number of bins (groups) for the data
                                 labels={'Diameter_km':'Diameter (km)'},
                                 color='PlanetType', # Color bars within the histogram by type
                                 marginal="rug") # Adds small lines (rug plot) to show individual data points
fig_hist_diameter.show()
  • px.histogram() creates the histogram.
  • nbins suggests how many bars (bins) to divide the data into.
  • marginal="rug" adds a small plot along the axis showing individual data points.

d. Customizing Your Charts
Plotly Express charts are highly customizable. We’ve already used:

  • title: Sets the chart title.
  • labels: Renames axes and legend titles (e.g., labels={'old_name':'New Name', ...}).
  • color: Assigns colors based on values in a column.
  • size: (For scatter plots) Assigns marker size based on values in a column.
  • color_discrete_map: (Used in the Python script for fig_bar_moons and fig_scatter_temp_dia) Allows you to specify exact colors for categories, e.g., {'Rocky':'brown', 'Gas Giant':'orange', ...}.
    Plotly offers many more customization options for fonts, layouts, and trace properties if you want to dive deeper!

6. Recap & Further Exploration

You’ve now taken your first steps as a Python Data Analyst! You’ve seen how to load, inspect, clean, analyze, and visualize a dataset.

Key Skills Unlocked:

  • Pandas for Data Handling:
    • Loading data from CSVs (even from a string!).
    • Using .head(), .info(), .shape for initial inspection.
    • Detecting missing data with .isnull().sum().
    • Filling missing data with fillna().
    • Selecting columns and filtering rows based on conditions.
    • Sorting data with .sort_values().
    • Grouping data with .groupby() for aggregate calculations (like mean or sum by category).
    • Creating new columns from existing data.
  • Plotly Express for Visualization:
    • Creating interactive bar charts (px.bar()), scatter plots (px.scatter()), and histograms (px.histogram()).
    • Customizing charts with titles, labels, colors, and sizes based on data.
    • Using fig.show() to display charts.

Where to Go From Here?
Data analysis is a vast and exciting field. You can apply these skills to:

  • Analyze game statistics.
  • Explore scientific datasets.
  • Investigate social trends (using publicly available data).
  • Even combine with web scraping to analyze data you collect from websites!

Homework: New Data Investigations!

  1. Planet Analysis Deep Dive (in planet_explorer.py):

    • Missing Colors: In the planet_explorer.py script, Jupiter’s color is missing. Fill it with “Orange/Brown Bands” or another suitable description. Verify with df_planets.info() or df_planets.isnull().sum() that it’s filled.
    • Coldest Outliers: Find and print the names and average temperatures of the 2 coldest planets after filling missing temperature data.
    • Planet Type Counts: How many planets of each PlanetType are there? (Hint: df_planets['PlanetType'].value_counts()). Create a bar chart of these counts using Plotly Express.
    • Largest Rocky Planet: Find the name and diameter of the largest “Rocky” planet.
  2. Your Own Dataset Adventure:

    • Create a new CSV file (e.g., my_favorite_books.csv or my_game_scores.csv).
    • Include at least 4 columns (one text/categorical, at least two numerical) and 8-10 rows.
    • Include at least one missing value intentionally in one of your numerical columns.
    • Write a new Python script that:
      a. Loads your CSV into a Pandas DataFrame.
      b. Prints the .info() and .head().
      c. Handles the missing value in your numerical column (e.g., fill with mean or a specific number).
      d. Performs at least one groupby() operation and prints the result (e.g., if you have book genres, find the average rating per genre).
      e. Creates two different types of Plotly Express charts from your data (e.g., a scatter plot and a histogram or bar chart). Customize them with titles and labels.
  3. (Challenge) Temperature Range by Planet Type:

    • For each PlanetType, find the minimum and maximum AverageTemp_C.
    • You might need to use groupby() along with the .agg() (aggregate) method in Pandas, for example: df_planets.groupby('PlanetType')['AverageTemp_C'].agg(['min', 'max']).
    • Try to present this information clearly. Could you make a Plotly chart that shows these ranges? (This might require some research into Plotly’s capabilities for range plots or grouped bar charts showing min/max).

The more you practice, the more comfortable you’ll become with these tools. Data is everywhere, and Python gives you the power to understand it! Happy analyzing!