Lecture 13: Python the Web Detective - Your First Internet Treasure Hunt!
Hello, Python detectives! We’ve had amazing adventures building games with Pygame. Now, get ready to put on your detective hats because we’re about to teach Python a new trick: exploring the internet and finding hidden treasures (of information, that is!).
1. The Internet’s Blueprints: What’s a Web Page?
When you visit a website, it’s like your computer receives a secret blueprint. This blueprint is usually written in a language called HTML (HyperText Markup Language). HTML tells your web browser (like Chrome or Firefox) exactly how to build the page you see – what text to show, where images go, which words should be links, and how it’s all arranged.
Think of HTML as the instruction manual for every webpage!
2. Web Scraping: Python’s Magnifying Glass
Imagine you’re a detective looking for clues on a website, maybe all the headlines about space discoveries. You could read through everything and write them down, but what if there are hundreds?
Web scraping is like giving Python a powerful magnifying glass and a super-fast notebook. Python can:
- Go to a webpage (fetch its HTML blueprint).
- Use its magnifying glass to scan the blueprint for specific clues (the information you want).
- Automatically copy those clues into its notebook.
So, you’re teaching Python to be your personal web detective, gathering information for you!
3. Playing Fair on the Internet: A Quick Detective’s Code
Detectives have a code of honor, and so do web scrapers! It’s super important to be a good web citizen.
- Check for a “Keep Out” Sign (`robots.txt`): Many websites have a file called `robots.txt` (like `www.example.com/robots.txt`). It tells bots (like your Python script) which parts of the site they shouldn’t visit. Always respect this!
- Don’t Knock Too Loud or Too Often (Go Slow): If you ask a website for information too many times, too quickly, it’s like knocking on a door non-stop! It can slow down the website for others. If you’re scraping a lot, add little pauses (`time.sleep(1)`) between your requests (see the small sketch just after this list).
- Use the Front Door if Available (APIs): Sometimes, websites have a special “front door” just for programs, called an API. If an API exists, it’s usually the most polite and efficient way to get data.
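To make the “go slow” rule concrete, here’s a minimal sketch of a polite fetching loop. The addresses are made-up placeholders, not a real case; the important part is the `time.sleep(1)` pause between requests:

```python
import time

import requests

# Hypothetical pages to check - placeholder addresses for illustration only
urls = [
    "http://example.com/page1",
    "http://example.com/page2",
    "http://example.com/page3",
]

for url in urls:
    response = requests.get(url)                      # Knock once...
    print(f"{url} -> status {response.status_code}")  # ...note the reply...
    time.sleep(1)                                     # ...then wait before knocking again
```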
For our first detective mission, we’ll use HTML clues we provide directly, so we don’t cause any fuss on live websites while learning.
4. Your Detective Training in This Lecture:
- Meeting your first informant: the `requests` library for fetching web pages.
- Learning to read blueprints: understanding basic HTML structure (your “X-Ray Vision”).
- Using your “Decoder Ring”: the `Beautiful Soup` library for making sense of HTML.
- Finding hidden messages: extracting specific text and links.
- Your First Case: “The Space News Scoop” – scraping headlines from a sample space news page!
Ready to start the investigation, detective?
5. Informant #1: `requests` - Fetching the Evidence!
To start our investigation, our Python detective needs to get the evidence – the HTML content of a web page. For this, we use a trusty informant: the Python library called `requests`.
What is `requests`?
The `requests` library makes it super easy to send HTTP requests. HTTP is like the secret handshake computers use to ask for web pages from web servers. We’ll mostly use a “GET” request, which is like saying, “Hey server, please GET me the evidence from this web address!”
Installing `requests` (If it’s not in your detective kit):
`requests` is an external library. If you haven’t used it before, open your terminal or command prompt and type:

```
pip install requests
```

This tells `pip` (Python’s package manager) to add `requests` to your toolkit.
How `requests` Gathers Intel (Conceptual):
When you use `requests.get(url)`, your Python script sends a message across the internet to the server holding the webpage. The server then sends back a response. This response is like a package containing the webpage’s content and a note about whether your request was successful.
Here’s how you might use it (we’ll use a real URL in our mini-project soon!):
```python
import requests

url = 'http://example.com'  # A placeholder for now

try:
    response = requests.get(url)  # Python sends the request!
    # Now, 'response' holds what the server sent back.
    # (We'll check the server's note next!)
except requests.exceptions.RequestException as e:
    print(f"Uh oh! Couldn't reach the server: {e}")
```
What’s the Server’s Reply? (Status Codes)
The `response.status_code` is a quick note from the server. It’s like a secret code:

- `200 OK`: “Affirmative, Detective! Here’s the evidence you asked for.” (This is what we want!)
- `404 Not Found`: “Sorry, Detective, that file seems to be missing from our archives.” (You might have the wrong web address.)
- `403 Forbidden`: “Access Denied, Detective! This area is off-limits.” (You might not have permission to see this page.)
- `500 Internal Server Error`: “Houston, we have a problem! Something went wrong on our end (the server).” (Not your fault; try again later.)

It’s always smart to check `if response.status_code == 200:` before you try to look at the evidence!
Inside the Evidence Bag (Response Content):
If the server gives the `200 OK`:

- `response.text`: This usually contains the main evidence – the HTML of the page – as a big Python string.
- `response.content`: This is the raw evidence, in bytes. Good for things like images (but we’re mostly after text).
- `response.json()`: Sometimes, especially with APIs (special data gateways), the server sends back data neatly organized in JSON format. `response.json()` cleverly turns this JSON into a Python dictionary or list, making it super easy to use!
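Putting the server’s note and the evidence bag together, a typical check-then-read pattern looks like this small sketch (using the same placeholder address as above):

```python
import requests

response = requests.get("http://example.com")

if response.status_code == 200:
    html = response.text  # The HTML blueprint, as one big string
    print(f"Got {len(html)} characters of evidence.")
    print(html[:80])      # Peek at the first 80 characters
else:
    print(f"No evidence today, Detective: status {response.status_code}")
```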
Important Note for Today’s Training:
For our main training exercise on analyzing HTML (the “Space News Scoop”), we’ll use sample HTML directly in our script. This ensures everyone can follow along without needing live internet or worrying about website changes. The `requests` skill is crucial for real-world cases, and we’ll use it for a quick live mini-project next!
6. Mini-Project: Quick First Mission - Fetching a Live Cat Fact!
Now that you know about the `requests` library, let’s try using it to get some live data from the internet! This is exciting because the information will be fresh each time (or often) you run the code.
Our Mission:
We’ll fetch a random cat fact from a public API. An API (Application Programming Interface) is like a special URL that websites provide for computer programs to get data in a clean, structured way.
The API We’ll Use: https://catfact.ninja/fact
This is a fun and simple API that gives you a random cat fact in a format called JSON. JSON (JavaScript Object Notation) is very common for APIs because it’s easy for computers (and humans!) to read. Python can easily turn JSON into dictionaries and lists.
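You can see that JSON-to-Python conversion without any internet at all. Here’s a tiny sketch using Python’s built-in `json` module on a made-up string shaped like the Cat Fact API’s reply:

```python
import json

# A made-up reply, shaped like what catfact.ninja sends back
json_text = '{"fact": "Cats sleep a lot.", "length": 17}'

data = json.loads(json_text)  # JSON string -> Python dictionary
print(type(data))             # <class 'dict'>
print(data["fact"])           # Cats sleep a lot.
```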
The Python Script to Fetch a Cat Fact:
Let’s write the code! We’ll call this `requests_cat_fact.py`.
```python
# requests_cat_fact.py
import requests  # Make sure you've done 'pip install requests'

# The URL of the Cat Fact API
cat_fact_api_url = "https://catfact.ninja/fact"

print("Contacting the Cat Fact API to fetch a purrfectly random fact...")

try:
    # Make the GET request to the API
    response = requests.get(cat_fact_api_url)

    # Check the status code from the server
    print(f"Server responded with Status Code: {response.status_code}")  # Let's see what the server said!

    if response.status_code == 200:  # 200 means "OK!"
        # The API sends data in JSON format.
        # response.json() automatically converts it into a Python dictionary.
        data = response.json()

        # Uncomment to see what the full 'data' dictionary looks like (for learning):
        # print("Full JSON response data:", data)

        # The cat fact is inside a key called "fact" in the dictionary
        cat_fact = data['fact']

        print("\n*** Your Cat Fact ***")
        print(cat_fact)
        print("*********************")
    elif response.status_code == 404:
        print("Oops! The Cat Fact API endpoint seems to be missing (404 Error).")
    else:
        # For other errors (like 500, 403, etc.)
        print(f"Uh oh! Something went wrong. The server responded with status code: {response.status_code}")
        print("Response content:", response.text)  # Show the error text, if any
except requests.exceptions.RequestException as e:
    # This handles network issues (like no internet) or a totally wrong URL
    print(f"A network error occurred: {e}")
    print("Please check your internet connection and the URL!")
```
Walking Through the Cat Fact Script:
- `import requests`: We tell Python we need the `requests` library.
- `cat_fact_api_url = "https://catfact.ninja/fact"`: We store the API’s web address.
- `try...except requests.exceptions.RequestException as e`: This is our safety net for big network problems (like your Wi-Fi being off).
- `response = requests.get(cat_fact_api_url)`: We send the request to the Cat Fact API.
- `print(f"Server responded with Status Code: {response.status_code}")`: We print the status code so you can see the server’s direct reply.
- `if response.status_code == 200:`: We only try to process the data if everything was OK.
- `data = response.json()`: This is super important! The Cat Fact API sends its data in JSON format. `response.json()` cleverly converts this JSON data directly into a Python dictionary. So, `data` will be something like `{'fact': 'Cats have over 20 muscles that control their ears.', 'length': 52}`.
- `cat_fact = data['fact']`: Since `data` is now a Python dictionary, we can access the value associated with the key `'fact'` to get our cat fact string. Then we print it out nicely!
- `elif response.status_code == 404:`: We have a specific message if the API URL isn’t found.
- `else:`: For any other non-200 status codes, we print a general error message along with the status code.
- The `except` block at the end catches more general network problems.
Your Turn to Try!
- Make sure you have `requests` installed (`pip install requests`).
- Save the code above into a new Python file named `requests_cat_fact.py` (or similar) in your project folder.
- Run the script from your terminal: `python requests_cat_fact.py`
- You should see a random cat fact printed! Try running it a few times to get different facts.
That was a fun look at fetching live data using `requests` with a JSON API! Many websites, however, present their information directly in HTML. To scrape those, we first need to understand a bit about HTML itself, which is our next topic. Then we’ll see how Beautiful Soup helps us navigate that HTML.
7. HTML X-Ray Vision: Understanding Web Page Blueprints
To find the clues hidden in a webpage, our Python Detective needs “X-Ray Vision” to see the structure underneath. That structure is HTML!
What are HTML Tags?
HTML uses “tags” to label different parts of a page. Think of them as special markers. Most tags come in pairs:
- An opening tag: `<tagname>` (e.g., `<p>` for a paragraph)
- A closing tag: `</tagname>` (e.g., `</p>`)
- The stuff between the opening and closing tags is the content of that element.
Common Clues (HTML Tags) You’ll Encounter:
- `<html>...</html>`: The main container for the whole HTML document.
- `<head>...</head>`: Contains secret-agent info about the page (like the title for the browser tab), not usually the visible clues.
  - `<title>...</title>`: The page’s title.
- `<body>...</body>`: Holds all the visible clues: text, images, links.
- `<h1>...</h1>`, `<h2>...</h2>`: Big headlines for important sections.
- `<p>...</p>`: A paragraph of text – often where you find juicy details!
- `<a>...</a>`: An “anchor” tag – this creates clickable links (hyperlinks). It often has an attribute called `href` that stores the link’s destination (the URL), like `<a href="http://www.example.com">Click here for more clues!</a>`.
- `<div>...</div>`: A “division” or a section. Detectives often find related clues grouped inside a `div`.
- `<ul>...</ul>` (unordered list) and `<ol>...</ol>` (ordered list): For lists of items.
  - `<li>...</li>`: Each “list item” in a list.
Special Magnifiers: Attributes like `class` and `id`
Tags can have extra information called “attributes.” Two very useful ones for detectives are:

- `class="some-name"`: Web designers use `class` to group similar items. You might find many articles with `class="news-story"`.
- `id="unique-name"`: An `id` is like a unique serial number for one specific item on the page.
Example Blueprint (Simple HTML):
```html
<!DOCTYPE html>
<html>
<head><title>Secret Agent Report</title></head>
<body>
  <h1>Case File: The Missing Catnip</h1>
  <p>A puzzling case of disappearing catnip has struck the city.</p>
  <div class="suspect-profile" id="fluffy">
    <h2>Suspect: Commander Fluffy</h2>
    <p class="description">Known for a love of tuna and intricate plots.</p>
    <a href="fluffy_details.html" class="profile-link">View Full Profile</a>
  </div>
</body>
</html>
```
Can you spot the `<h1>` headline? The `div` with a `class` AND an `id`? The paragraph with `class="description"`?
Your Browser’s Built-in Detective Kit:
Your web browser has “Developer Tools” that let you peek at any website’s HTML blueprint!
- Right-click on any part of a webpage and choose “Inspect” or “Inspect Element.”
- A special panel will open, showing the HTML. As you move your mouse over the HTML lines, it will often highlight the part on the page!
This is great for figuring out which tags and attributes hold the clues you need.
Knowing these HTML basics is like learning to read the map to the treasure. Next, we get our special decoder ring to make searching easy!
8. Decoder Ring: `Beautiful Soup` - Your HTML Navigator!
We’ve got the HTML blueprint (either from `requests` or a sample string). Now, how do we actually sift through it to find our specific clues? We need a special decoder tool: `Beautiful Soup`!
What is `Beautiful Soup`?
Beautiful Soup is a Python library that takes that messy HTML blueprint and turns it into a neatly organized structure that Python can understand and search. It’s like a magical decoder ring that makes HTML easy to work with!
Getting Your Decoder Ring (Installing `Beautiful Soup`):
Beautiful Soup needs two parts:

- The library itself: `beautifulsoup4`
- A “parser”: the engine that actually reads and understands the HTML. `lxml` is a popular and fast one.

Open your terminal/command prompt and type:

```
pip install beautifulsoup4
pip install lxml
```

(If `lxml` causes trouble, `html.parser` is a built-in Python option, but `lxml` is usually better. See the fallback sketch below.)
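Here’s a minimal sketch of that fallback idea, assuming you want your script to run whether or not `lxml` is installed:

```python
from bs4 import BeautifulSoup

html = "<p>Hello, detective!</p>"

try:
    soup = BeautifulSoup(html, 'lxml')           # Fast third-party parser
except Exception:                                # Raised if the lxml parser isn't available
    soup = BeautifulSoup(html, 'html.parser')    # Built-in fallback, no install needed

print(soup.p.get_text())  # -> Hello, detective!
```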
Making the Soup (Parsing HTML):
Let’s use a sample HTML string (like we did before) to see how it works.
```python
from bs4 import BeautifulSoup  # Import the class

sample_html_for_soup = """
<html><head><title>Detective Agency Files</title></head>
<body>
  <h1 id="case-title">The Case of the Missing Python</h1>
  <p class="lead-paragraph">Our lead detective is on the case.</p>
  <div class="evidence-locker">
    <p>Clue 1: A feather.</p>
    <p>Clue 2: A half-eaten cookie.</p>
    <a href="map.html" class="evidence-link">View Map</a>
  </div>
</body></html>
"""

# Create a BeautifulSoup object (our "soup"!)
# We give it the HTML and tell it to use the 'lxml' parser.
soup = BeautifulSoup(sample_html_for_soup, 'lxml')

# The 'soup' object now holds the parsed HTML, ready for searching!
# print(soup.prettify())  # Shows the HTML nicely indented
```
Finding Clues in the Soup:
Beautiful Soup lets us find HTML elements (our clues) in many ways:
- Find the First Instance of a Tag: `soup.find('tag_name')`
  This grabs the very first tag of that type it finds.

  ```python
  # Find the first <h1> tag
  main_headline_tag = soup.find('h1')
  if main_headline_tag:  # Always good to check if we found anything!
      print(main_headline_tag.get_text())  # Get the text inside the tag
  ```

- Find ALL Instances of a Tag: `soup.find_all('tag_name')`
  This gets all matching tags and returns them as a list.

  ```python
  # Find all <p> (paragraph) tags
  all_clue_paragraphs = soup.find_all('p')
  print(f"Found {len(all_clue_paragraphs)} paragraph clues.")
  for p_tag in all_clue_paragraphs:
      print(p_tag.get_text())
  ```

- Reading the Clue (Getting Text): `element.get_text()`
  Once you have an element (like `main_headline_tag`), use `.get_text()` to read the text written inside it.

- Checking for Fingerprints (Attributes): `element['attribute_name']`
  If a tag has attributes (like `href` in an `<a>` link), you get its value like looking up a word in a dictionary.

  ```python
  # Find the first <a> tag (link)
  map_link_tag = soup.find('a')
  if map_link_tag:
      print(f"Link text: {map_link_tag.get_text()}")
      print(f"Link leads to: {map_link_tag['href']}")  # Get the 'href' value
  ```

- Searching by Class Label: `soup.find_all('tag_name', class_='your_class_name')`
  This is super handy! Look for tags that have a specific `class` label. (Remember: `class_` has an underscore because `class` is a special word in Python.)

  ```python
  # Find the paragraph with class="lead-paragraph"
  lead_para = soup.find('p', class_='lead-paragraph')
  if lead_para:
      print(f"Lead Paragraph: {lead_para.get_text()}")
  ```

- Searching by Unique ID: `soup.find(id='your_id_name')`
  If an element has a unique `id`, this is a direct way to find it.

  ```python
  # Find the element with id="case-title"
  case_title_element = soup.find(id='case-title')
  if case_title_element:
      print(f"Case Title: {case_title_element.get_text()}")
  ```
With these methods, our Python Detective can navigate almost any HTML blueprint! The main steps are:
- Examine the HTML to identify unique tags, classes, or IDs around your target clues.
- Use `soup.find()` or `soup.find_all()` with the right parameters.
- Extract the text with `.get_text()` or grab attribute values.
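To see those steps working end to end, here’s a small sketch that reuses the “evidence locker” from our sample HTML above – first find the container, then search only inside it:

```python
from bs4 import BeautifulSoup

html = """
<div class="evidence-locker">
  <p>Clue 1: A feather.</p>
  <p>Clue 2: A half-eaten cookie.</p>
  <a href="map.html" class="evidence-link">View Map</a>
</div>
"""
soup = BeautifulSoup(html, 'lxml')

# Step 1 (done by eye): the clues live inside <div class="evidence-locker">.
locker = soup.find('div', class_='evidence-locker')  # Step 2: find the container
if locker:
    for p_tag in locker.find_all('p'):               # find_all() works on elements, too
        print("Clue found:", p_tag.get_text())
    link = locker.find('a', class_='evidence-link')
    if link and link.has_attr('href'):
        print("Map location:", link['href'])         # Step 3: extract the attribute value
```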
Now that you’re equipped with `requests` (for fetching) and `BeautifulSoup` (for decoding), you’re ready for your first big case: The Space News Scoop!
9. Project: “The Space News Scoop”
Alright, Detective! It’s time for your first major case. We’ve intercepted some HTML data that seems to be a list of breaking space news headlines. Your mission, should you choose to accept it, is to extract all the headlines, their summaries, and any links to the “full story.”
The Evidence File (Sample HTML):
This is the HTML data we’ll be working with. In a real case, you might get this from a website using `requests`, but for this mission, the data has been “provided by an anonymous source” (it’s right here in our script!).
```html
<!DOCTYPE html>
<html>
<head>
  <title>Space News Central</title>
</head>
<body>
  <h1>Today's Space Discoveries</h1>
  <div class="article">
    <h2>New Alien Signal Detected!</h2>
    <p class="summary">Scientists at SETI today announced the detection of a peculiar radio signal from the Gliese 581g system. Further analysis is required.</p>
    <a href="article1_details.html" class="read-more">Read Full Story</a>
  </div>
  <div class="article">
    <h2>Water Ice Confirmed on Mars Moon Phobos</h2>
    <p class="summary">Data from the Martian Moons eXploration (MMX) probe has confirmed the presence of water ice on Phobos, one of Mars's moons.</p>
    <a href="article2_details.html" class="read-more">Read Full Story</a>
  </div>
  <div class="article">
    <h2>Galaxy NGC 1300: A Barred Spiral Masterpiece</h2>
    <p class="summary">New images from the Hubble Space Telescope showcase the stunning details of the barred spiral galaxy NGC 1300, located 61 million light-years away.</p>
    <a href="article3_details.html" class="read-more">Read Full Story</a>
  </div>
  <div class="article">
    <h2>Private Company Announces Plans for Luxury Space Hotel</h2>
    <p class="summary">"Orbital Oasis Inc." unveiled their ambitious project to launch a luxury hotel into Earth orbit by 2030, promising breathtaking views.</p>
    <a href="article4_details.html" class="read-more">Read Full Story</a>
  </div>
</body>
</html>
```
Case Analysis (Examining the HTML Structure):
Detective, before we write code, let’s examine the structure of these clues:

- The main title of the page is in an `<h1>` tag.
- Each news item (headline, summary, link) is neatly contained within a `<div>` tag.
- Crucially, each of these “article” `<div>` tags has a `class` attribute with the value `"article"`. This is our primary way to find each news block!
- Inside each `<div class="article">`:
  - The headline is in an `<h2>` tag.
  - The summary is in a `<p>` tag with `class="summary"`.
  - The link to the full story is in an `<a>` tag with `class="read-more"`, and the actual URL is in its `href` attribute.
This structure is our guide for telling Beautiful Soup where to find the information.
The Python Script (`space_news_scraper.py`):
Now, let’s write our Python script to extract this information. Save the code below into a file such as `examples/web_scraping/space_news_scraper.py`.
```python
# space_news_scraper.py
from bs4 import BeautifulSoup

# Our sample HTML content for Space News
html_doc = """
<!DOCTYPE html>
<html>
<head>
  <title>Space News Central</title>
</head>
<body>
  <h1>Today's Space Discoveries</h1>
  <div class="article">
    <h2>New Alien Signal Detected!</h2>
    <p class="summary">Scientists at SETI today announced the detection of a peculiar radio signal from the Gliese 581g system. Further analysis is required.</p>
    <a href="article1_details.html" class="read-more">Read Full Story</a>
  </div>
  <div class="article">
    <h2>Water Ice Confirmed on Mars Moon Phobos</h2>
    <p class="summary">Data from the Martian Moons eXploration (MMX) probe has confirmed the presence of water ice on Phobos, one of Mars's moons.</p>
    <a href="article2_details.html" class="read-more">Read Full Story</a>
  </div>
  <div class="article">
    <h2>Galaxy NGC 1300: A Barred Spiral Masterpiece</h2>
    <p class="summary">New images from the Hubble Space Telescope showcase the stunning details of the barred spiral galaxy NGC 1300, located 61 million light-years away.</p>
    <a href="article3_details.html" class="read-more">Read Full Story</a>
  </div>
  <div class="article">
    <h2>Private Company Announces Plans for Luxury Space Hotel</h2>
    <p class="summary">"Orbital Oasis Inc." unveiled their ambitious project to launch a luxury hotel into Earth orbit by 2030, promising breathtaking views.</p>
    <a href="article4_details.html" class="read-more">Read Full Story</a>
  </div>
</body>
</html>
"""

# 1. Create a BeautifulSoup object
soup = BeautifulSoup(html_doc, 'lxml')  # Using the 'lxml' parser

# 2. Find all the article blocks
all_article_divs = soup.find_all('div', class_='article')

print("--- Space News Scoop ---")
print(f"Found {len(all_article_divs)} top stories:\n")

# 3. Loop through each article block
for article_div in all_article_divs:
    # Find the headline (h2)
    headline_element = article_div.find('h2')
    headline = headline_element.get_text().strip() if headline_element else "No headline found"

    # Find the summary (p with class="summary")
    summary_element = article_div.find('p', class_='summary')
    summary = summary_element.get_text().strip() if summary_element else "No summary found"

    # Find the link (a with class="read-more") and get its href
    link_element = article_div.find('a', class_='read-more')
    link_url = link_element['href'] if link_element and link_element.has_attr('href') else "#"

    # Print the extracted information
    print(f"HEADLINE: {headline}")
    print(f"SUMMARY: {summary}")
    print(f"READ MORE: {link_url}")
    print("-" * 30)  # Separator

print("\n--- End of Transmission ---")
```
Debriefing the `space_news_scraper.py` Script:

- `from bs4 import BeautifulSoup`: We import our special decoder tool.
- `html_doc = """..."""`: This multiline string holds our “intercepted” HTML data.
- `soup = BeautifulSoup(html_doc, 'lxml')`: We create our `BeautifulSoup` object, telling it to use the `lxml` parser to make sense of the HTML.
- `all_article_divs = soup.find_all('div', class_='article')`: This is key! We tell Beautiful Soup: “Find all the `<div>` tags that specifically have the `class` attribute set to `'article'`.” This gives us a list where each item is one news article’s block.
- `for article_div in all_article_divs:`: We loop through each article block we found.
- Inside the loop (for each article):
  - `headline_element = article_div.find('h2')`: Within the current `article_div`, we find the first (and only) `<h2>` tag.
  - `headline = headline_element.get_text().strip() if headline_element else "..."`: We get the text from the `<h2>` tag. `.strip()` cleans up any extra spaces. The `if headline_element` check makes sure we actually found an `<h2>` tag before trying to get its text, providing a default if not.
  - `summary_element = article_div.find('p', class_='summary')`: We find the `<p>` tag with `class="summary"`.
  - `summary = summary_element.get_text().strip() if summary_element else "..."`: Get its text, with a fallback.
  - `link_element = article_div.find('a', class_='read-more')`: We find the `<a>` tag with `class="read-more"`.
  - `link_url = link_element['href'] if link_element and link_element.has_attr('href') else "#"`: We get the value of the `href` attribute from the link tag, first checking that `link_element` exists and actually has an `href` attribute.
  - `print(...)`: We print out the extracted headline, summary, and link URL in a nice, readable format.
This mission shows the basic steps of most web scraping tasks: inspect the HTML, find patterns for the data you want, and then use Beautiful Soup to navigate and extract it!
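In a real case, you’d fetch the HTML with `requests` first and then hand it to Beautiful Soup. Here’s a hedged sketch of that combination – the URL is a placeholder, and it assumes the live page uses the same `class="article"` structure as our sample:

```python
import requests
from bs4 import BeautifulSoup

url = "http://example.com/space-news"  # Placeholder address, not a real news page

try:
    response = requests.get(url)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'lxml')  # Hand the live HTML to our decoder
        for article_div in soup.find_all('div', class_='article'):
            headline = article_div.find('h2')
            if headline:
                print("HEADLINE:", headline.get_text().strip())
    else:
        print(f"Server said: {response.status_code}")
except requests.exceptions.RequestException as e:
    print(f"Network problem: {e}")
```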
10. Running Your “Space News Scoop” & What’s Next!
You’ve now seen the HTML and the Python code for our “Space News Scoop.” Let’s see how to run it and what you’ve learned!
Running Your `space_news_scraper.py` Script:

- Save the Code: Make sure you’ve saved the Python code for the “Space News Scoop” into a file named `space_news_scraper.py`. It’s best to save it in a new folder, perhaps called `web_scraping_project`.
- Open Your Terminal or Command Prompt: Navigate to the folder where you saved `space_news_scraper.py`. (Remember commands like `cd` for “change directory.”)
- Run the Script: Type the following command and press Enter: `python space_news_scraper.py` (If you have multiple Python versions, you might need `python3 space_news_scraper.py`.)
Expected Output:
If everything is correct, you should see output in your terminal that looks something like this:
```
--- Space News Scoop ---
Found 4 top stories:

HEADLINE: New Alien Signal Detected!
SUMMARY: Scientists at SETI today announced the detection of a peculiar radio signal from the Gliese 581g system. Further analysis is required.
READ MORE: article1_details.html
------------------------------
HEADLINE: Water Ice Confirmed on Mars Moon Phobos
SUMMARY: Data from the Martian Moons eXploration (MMX) probe has confirmed the presence of water ice on Phobos, one of Mars's moons.
READ MORE: article2_details.html
------------------------------
HEADLINE: Galaxy NGC 1300: A Barred Spiral Masterpiece
SUMMARY: New images from the Hubble Space Telescope showcase the stunning details of the barred spiral galaxy NGC 1300, located 61 million light-years away.
READ MORE: article3_details.html
------------------------------
HEADLINE: Private Company Announces Plans for Luxury Space Hotel
SUMMARY: "Orbital Oasis Inc." unveiled their ambitious project to launch a luxury hotel into Earth orbit by 2030, promising breathtaking views.
READ MORE: article4_details.html
------------------------------

--- End of Transmission ---
```
Isn’t that cool? Your Python script read the HTML blueprint and extracted exactly what you wanted!
Recap: Your Web Scraping Toolkit!
In this lecture, you’ve learned the basics of some powerful Python tools for web scraping:
- Web Scraping Concepts: You understand what web scraping is and the importance of doing it ethically and responsibly.
- HTML Basics: You got a glimpse into how HTML structures web pages with tags and attributes.
- `requests` Library: You used `requests` to fetch live data from the internet (the Cat Fact API), though we used a local HTML string for our main scraping example today, for stability.
- `Beautiful Soup` Library:
  - How to install it (`pip install beautifulsoup4 lxml`).
  - How to create a `BeautifulSoup` object to parse HTML.
  - How to use methods like `find()` and `find_all()` to locate specific HTML elements.
  - How to search by tag name, `class_`, or `id`.
  - How to extract text content using `.get_text()` and attribute values (e.g., `element['href']`).
- Project Workflow: You saw how to analyze HTML to find patterns and then write Python code to extract data based on those patterns.
Homework & Your Next Scraping Adventures!
Ready to try some more?
- Add More Articles:
  - Open the `html_doc` string in `space_news_scraper.py`.
  - Add 2-3 more `<div class="article">...</div>` sections to the `html_doc` string with new headlines, summaries, and links.
  - Run your script. Does it automatically find and print the new articles too? (It should!)

- Add a “Category” to the Articles:
  - Modify the `html_doc` string. For each `<div class="article">`, add another paragraph for a category, like:

    ```html
    <div class="article">
        <h2>...</h2>
        <p class="summary">...</p>
        <p class="category">Astronomy</p>
        <a href="..." class="read-more">Read Full Story</a>
    </div>
    ```

  - Now, modify your `space_news_scraper.py` script to also find and print the category for each article.
  - Hint: You’ll need to add another `article_div.find('p', class_='category')` and then get its text.

- Create Your Own Simple HTML File:
  - Create a new file named `my_hobbies.html` (or similar) on your computer.
  - Write some simple HTML in it. For example, a list of your hobbies:

    ```html
    <!DOCTYPE html>
    <html><head><title>My Hobbies</title></head>
    <body>
      <h1>My Favorite Hobbies</h1>
      <ul>
        <li>Reading Books</li>
        <li>Playing Python Games</li>
        <li>Exploring Nature</li>
      </ul>
    </body></html>
    ```

  - Now, can you write a new Python script that:
    a. Reads the content of this `my_hobbies.html` file into a string? (Hint: use Python’s file reading: `with open('my_hobbies.html', 'r') as f: html_content = f.read()`)
    b. Uses Beautiful Soup to parse `html_content`?
    c. Finds all the `<li>` (list item) tags and prints out each hobby?
  - (If you get stuck, there’s a possible solution sketch just after this list.)

- (Optional - Advanced & Requires Internet) Scrape a Very Simple Live Page:
  - Important: Remember the ethical scraping rules! Only do this for practice on sites that clearly allow it or for very simple public data, and don’t make many requests quickly.
  - Find a very simple website that has some plain text or a clear list. (A good, safe start might be to fetch a plain text file from a public URL, just to practice using `requests.get()`.)
  - Try to use `import requests` and `requests.get('THE_URL').text` to fetch its HTML.
  - Then use Beautiful Soup to extract a specific piece of text from it.
  - This can be tricky because live websites are complex and change! Don’t get discouraged if it’s hard – the goal is just to try connecting `requests` with `Beautiful Soup`.
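Stuck on homework #3? Here’s one possible solution sketch. It assumes `my_hobbies.html` (from the example above) is saved in the same folder as the script:

```python
# hobby_scraper.py - one possible answer sketch for homework #3
from bs4 import BeautifulSoup

# a. Read the HTML file into a string
with open('my_hobbies.html', 'r') as f:
    html_content = f.read()

# b. Parse the string with Beautiful Soup
soup = BeautifulSoup(html_content, 'lxml')

# c. Find all <li> (list item) tags and print each hobby
for li_tag in soup.find_all('li'):
    print("Hobby:", li_tag.get_text())
```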
Web scraping is a powerful skill. It opens up a lot of possibilities for gathering data and automating tasks. Always remember to be responsible and respectful when you write scrapers for live websites.
Happy scraping!