Day 5: Web Scraping and Text-as-Data

Will Horne

We Made It!

A celebratory image marking the final day of the course.

Review: Comprehensions

Using the data below, write a comprehension that creates a dictionary with employee:performance_score as the key:value pair, and include only employees who have completed more than 10 projects.

import pandas as pd

data = {
    "employee": ["Alice", "Bob", "Charlie", "Diana", "Evan"],
    "projects_completed": [  5,    12,        9,      14,     7],
    "performance_score": [ 85,    92,       88,      95,    80]
}

df = pd.DataFrame(data)

Doing the Same with a loop

high_perf_loop = {}
for idx, row in df.iterrows():
    if row["projects_completed"] > 10:
        high_perf_loop[row["employee"]] = row["performance_score"]

df.iterrows() is something we haven’t seen before! On a pandas DataFrame, .iterrows() is a method that yields the rows one at a time as (index, row) pairs. idx is the row’s index label, and each row is a pandas Series.

iterrows

On the first iteration, idx = 0 and row corresponds to the first employee:

row["employee"]           # "Alice"
row["projects_completed"] # 5
row["performance_score"]  # 85

Since 5 > 10 is False, this row is skipped and the loop moves to the next employee.
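For comparison, here is one way the same filter could be written as a dictionary comprehension. It is a sketch rather than the only valid answer: it reuses the `> 10` condition from the loop, and the name `high_perf_comp` is just an illustrative choice.

```python
import pandas as pd

data = {
    "employee": ["Alice", "Bob", "Charlie", "Diana", "Evan"],
    "projects_completed": [5, 12, 9, 14, 7],
    "performance_score": [85, 92, 88, 95, 80],
}
df = pd.DataFrame(data)

# zip() walks the three columns in parallel, one employee at a time
high_perf_comp = {
    name: score
    for name, projects, score in zip(
        df["employee"], df["projects_completed"], df["performance_score"]
    )
    if projects > 10  # same filter as the loop version
}

print(high_perf_comp)  # only Bob and Diana pass the filter
```

Zipping the columns avoids building a Series for every row, which is what .iterrows() does, so it tends to be faster on large DataFrames.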

Review: Pandas

Take a few minutes to reload the data from yesterday. Get the correlation between feeling thermometers for Joe Biden and Anthony Fauci. Plot the distribution of one of the feeling thermometers, using whatever options you choose to make the plot look nice.

APIs v. Webscraping

  • APIs provide structured data (usually JSON)

    • parameters allow you to specify and retrieve exactly what you need
  • Data specifically stored for automated retrieval/consumption

  • APIs are usually well maintained

  • If you can achieve your goal with an API, use the API!
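To make the "parameters" point concrete, the sketch below builds an API request without sending it, so you can see how parameters end up in the URL. The endpoint `https://api.example.com/v1/speeches` and the parameter names `party` and `year` are invented for illustration; a real API documents its own.

```python
import requests

# Hypothetical endpoint and parameter names, for illustration only
BASE_URL = "https://api.example.com/v1/speeches"
params = {"party": "Democratic", "year": 2024}

# Prepare the request without sending it, to inspect the final URL
prepared = requests.Request("GET", BASE_URL, params=params).prepare()
print(prepared.url)

# With a real API you would send the request and decode the JSON body:
# response = requests.get(BASE_URL, params=params, timeout=10)
# records = response.json()
```

Passing a dictionary to `params` lets requests handle the URL encoding, which is less error-prone than gluing the query string together by hand.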

When APIs fail

  • Many sources you may want to access data from do not have an API, but do have a website

  • Sometimes there is an API but…

    • It has a rate limit

    • it is prohibitively expensive to access

    • it does not provide the data you need

  • If a webpage is served to your browser, you can just scrape it!

Scraping static webpages

  • Some web pages are more complex (for example, content rendered by JavaScript) and require a browser-automation tool such as Selenium, which is beyond the scope of an intro class

    • but you should have the tools to learn how to use it at this point!
  • For static websites, use the requests library

    • send a request with a URL

    • retrieve an HTML document

  • Use BeautifulSoup to pull data out of HTML

HTML Structure

<!DOCTYPE html>
<html>
<head>
    <title> Page Title </title>  <!-- title tag specifies the page title -->
</head>
<body>  <!-- body tag holds the main content -->
    <h1> This is a heading </h1>
    <p> This is a paragraph </p>
</body>
</html>

  • Most tags must be closed with a matching end tag that starts with a forward slash (/)

    • <tag> some content </tag>
  • tags can have attributes that provide additional information about the element’s behavior

    • <element attribute="value"> element content </element>

Example

<abbr id="anID" style="color:blue" title="Hypertext Markup Language"> HTML </abbr>

Important attributes to know

  • id provides a document-wide unique identifier for an element

  • class provides a way of classifying similar elements together

  • href specifies the URL of the page a link goes to
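As a quick preview of how attributes are read in Python, the sketch below parses the <abbr> example with BeautifulSoup (introduced on the next slide). The variable names `snippet` and `tag` are just illustrative.

```python
from bs4 import BeautifulSoup

snippet = '<abbr id="anID" style="color:blue" title="Hypertext Markup Language"> HTML </abbr>'
tag = BeautifulSoup(snippet, "html.parser").find("abbr")

print(tag.name)                   # the tag name
print(tag.attrs)                  # all attributes as a dictionary
print(tag["id"])                  # dictionary-style access; raises KeyError if missing
print(tag.get("class"))           # .get() returns None when the attribute is absent
print(tag.get_text(strip=True))   # the element content without surrounding spaces
```

Using `.get()` instead of square brackets is a safer default when scraping, since not every element on a page carries every attribute.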

Beautiful Soup

  • BeautifulSoup represents an HTML document as a navigable tree

  • We can identify and navigate to specific elements using tags and attributes

  • Excellent tutorials and documentation on the BeautifulSoup website

An Example

Let’s imagine we are working with the following HTML document, an excerpt from Alice in Wonderland:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.prettify())
<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Using Ids

Sometimes we are lucky and the HTML file has id attributes for its elements. If that is the case, we can look up an element directly by its id:

soup.find(id="link3")
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>
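When there is no convenient id, tag names and classes work too. The sketch below uses a shortened copy of the same document to collect every link with class "sister". Note the trailing underscore in `class_`, needed because `class` is a reserved word in Python.

```python
from bs4 import BeautifulSoup

# Shortened copy of the example document so this block runs on its own
html_doc = """
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# find_all returns a list of every matching element
sisters = soup.find_all("a", class_="sister")
names = [a.get_text() for a in sisters]
links = [a["href"] for a in sisters]
print(names)
print(links)

# The same element as soup.find(id="link3"), written as a CSS selector
same_tag = soup.select_one("a#link3")
print(same_tag)
```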

Extracting Text

Very frequently, we will want to extract just the text, without tags. Beautiful Soup makes this very easy!

print(soup.get_text())

The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...
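By default, get_text() glues text pieces together exactly as they appear, which can run words from neighbouring elements into each other. A small sketch of the `separator` and `strip` arguments, using a made-up two-paragraph snippet:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<p>Hello <b>world</b></p><p>Bye</p>", "html.parser")

# Default: pieces are concatenated with nothing in between
raw = soup.get_text()

# Strip each piece, then join the pieces with a single space
clean = soup.get_text(separator=" ", strip=True)

print(repr(raw))
print(repr(clean))
```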

Taking a Step Back - What is on a webpage?

  • Browsers have useful tools to guide scraping

  • navigate to a page you want to scrape

    • go to the object you want to target
  • right click and select “inspect”

  • Should bring up the HTML that pertains to that part of the webpage

Robots.txt

  • a way for websites to communicate with bots, scrapers, crawlers etc

  • robots.txt typically includes information on

    • sitemap

    • disallowed sections of the website

    • permissions for scrapers

  • Accessible through URL + /robots.txt

Robots.txt Syntax

  • User-agent names the bot (scraper, crawler) that the rules below it apply to

    • the * symbol means all scrapers
  • Disallow denotes which parts of the website are disallowed

    • no value means that everything is allowed

    • / means everything is disallowed

    • /folder/ means everything in that subfolder is disallowed
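Python's standard library can apply these rules for you. The sketch below feeds a small, made-up robots.txt to `urllib.robotparser` and asks whether particular URLs may be fetched. The bot names `MyScraper` and `BadBot` and the example.com paths are placeholders.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: everyone is kept out of /private/,
# and a bot called BadBot is kept out of everything
rules = """
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

allowed_public = parser.can_fetch("MyScraper", "https://example.com/public/index.html")
allowed_private = parser.can_fetch("MyScraper", "https://example.com/private/data.html")
allowed_badbot = parser.can_fetch("BadBot", "https://example.com/public/index.html")

print(allowed_public, allowed_private, allowed_badbot)
```

For a live site you would typically call `parser.set_url(...)` with the site's robots.txt address followed by `parser.read()` instead of `parse()`.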

Example Robots.txt

Let’s check out https://data.europa.eu/robots.txt

Why bother with robots.txt?

  • it’s good etiquette!

  • If not respected

    • IP address might be blocked

    • potentially legal consequences

Robots.txt not helpful?

Try looking for

  • /sitemap

  • /sitemap.xml

  • /sitemap_index.xml

  • etc, etc, etc…

Sitemaps

  • generally provided in the robots.txt, but may be separate

  • You might not be able to navigate to all pages from the home page

    • provides a map of what can be accessed
  • Can be parsed by Beautiful Soup
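A sitemap is XML, but its structure is simple enough that the same tools apply. The sketch below parses a tiny, made-up sitemap and pulls out the page URLs, which sit in <loc> tags. It uses "html.parser" to avoid an extra dependency; the XML declaration that real sitemaps start with is left out of the sample, and for real files the lxml-based "xml" parser is arguably the better fit.

```python
from bs4 import BeautifulSoup

# A made-up sitemap with two pages
sitemap_xml = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/page1</loc></url>
  <url><loc>https://example.com/page2</loc></url>
</urlset>
"""

soup = BeautifulSoup(sitemap_xml, "html.parser")

# Each <loc> element holds one page URL
page_urls = [loc.get_text(strip=True) for loc in soup.find_all("loc")]
print(page_urls)
```

The resulting list can be fed straight into a scraping loop.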

How not to crash a website

  • If you are scraping many pages from the same website, call time.sleep() between requests

    • i.e., place it in your for/while loop

    • Python will take a rest before sending the next request

  • Generally a few seconds is enough, but depends on use

Example

import requests
import time

# A list of URLs to scrape
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

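for url in urls:
    print(f"Requesting {url}")
    # timeout keeps one slow page from hanging the whole loop
    response = requests.get(url, timeout=10)

    if response.status_code == 200:
        # Process the response (e.g., parse HTML)
        print("  → Success! Got", len(response.text), "characters.")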
    else:
        print(f"  → Failed with status {response.status_code}")

    # Wait 2 seconds before making the next request
    time.sleep(2)

A real scrape: 2024 Democratic Party Platform

We’re going to scrape an actual document: the 2024 Democratic Party Platform, hosted at the American Presidency Project (Peters & Woolley) at UCSB.

It’s a great target because:

  • Static HTML — no JavaScript rendering surprises

  • The document text lives in a single, identifiable container

  • Stable URL, won’t disappear next semester

  • Cited regularly in political science research

Fetching and parsing

Use the now-familiar pattern from earlier today — request the page, then hand the HTML to BeautifulSoup:

import requests
from bs4 import BeautifulSoup

URL = "https://www.presidency.ucsb.edu/documents/2024-democratic-party-platform"

r = requests.get(URL)
r.status_code  # should be 200
soup = BeautifulSoup(r.text, "html.parser")

We now have a navigable tree of the entire page. Most of it is navigation, headers, and footers — we need to extract just the platform text.

Extracting the document text

If you inspect the page in your browser, the document body sits inside <div class="field-docs-content">. A single selector pulls everything we need:

content = soup.select_one("div.field-docs-content")
text = content.get_text(separator=" ", strip=True)

print(text[:300])               # first 300 characters
print(len(text), "characters total")

That’s the entire platform as one string. Now we need to format it for downstream analysis.
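One thing to watch for: select_one returns None when nothing matches, for instance if the site changes its layout, and calling get_text on None fails with a confusing AttributeError. A small defensive sketch, where the helper name `extract_text` and the sample HTML are invented for illustration:

```python
from bs4 import BeautifulSoup


def extract_text(html, selector):
    """Return the text of the first element matching `selector`.

    Raises ValueError with a readable message when nothing matches.
    """
    soup = BeautifulSoup(html, "html.parser")
    node = soup.select_one(selector)
    if node is None:
        raise ValueError(f"No element matches {selector!r}")
    return node.get_text(separator=" ", strip=True)


# Made-up page with a navigation block and a content block
page = (
    '<div class="nav">Menu</div>'
    '<div class="field-docs-content"><p>First.</p><p>Second.</p></div>'
)

print(extract_text(page, "div.field-docs-content"))
```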

Cleaning + formatting for analysis

The GroupAppeals classifier we’re about to use expects a DataFrame with one sentence per row, plus party / date / sentence_id metadata for traceability.

import pandas as pd
import re

# Split on sentence-ending punctuation (a quick-and-dirty approach)
sentences = re.split(r"(?<=[.!?])\s+", text)

# Drop empties and very short fragments
sentences = [s.strip() for s in sentences if len(s.strip()) > 20]

df = pd.DataFrame({
    "party": "Democratic",
    "date": "2024",
    "sentence_id": range(len(sentences)),
    "text": sentences,
})

df.head()
print(f"{len(df)} sentences")
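To see why the regex split is only "quick-and-dirty", it helps to run it on a toy string with an abbreviation. The sentence below is invented for illustration; the pattern and the 20-character filter are the ones used above.

```python
import re

# Made-up text containing an abbreviation with periods
toy = "We stand with U.S. workers. They built this country! Will we protect them?"

# Same pattern as above: split on whitespace that follows . ! or ?
parts = re.split(r"(?<=[.!?])\s+", toy)
print(parts)

# Same length filter as above
kept = [s.strip() for s in parts if len(s.strip()) > 20]
print(kept)
```

The period in "U.S." is treated as a sentence boundary, so the first sentence is cut in two, and both halves are then short enough to be dropped by the length filter. A dedicated sentence tokenizer (spaCy or NLTK, for instance) is likely to handle such cases better.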

Introducing GroupAppeals

GroupAppeals is a Python package (full disclosure: I’m a co-author) for analyzing how political parties reference and appeal to social groups in their communications.

It runs four fine-tuned transformer models:

  • Token extraction — identifies group references in the text (e.g., “workers,” “immigrants,” “families”)

  • Stance detection — positive, negative, or neutral toward the identified group

  • Policy detection — does the text propose a policy directed at that group?

  • Group classification — maps the reference to a semantic category

Built on transformer models trained on real party manifestos. See Dolinsky, Huber, & Horne (BJPS, forthcoming) for the methodology.

Installing GroupAppeals

In Colab:

!pip install groupappeals

Then import the pipeline and the helper function:

from groupappeals.fullpipeline import run_full_pipeline
from groupappeals.pre_and_post_processing import create_composite_id

First run will download the model weights (a few hundred MB) — be patient.

Running the pipeline

For a live demo we slice down to about 20 sentences so the run completes in a minute or two:

# Take a sample for the demo
sample = df.head(20).copy()
sample.to_csv("dnc_sample.csv", index=False)

# Run the full pipeline
results = run_full_pipeline(
    input_file="dnc_sample.csv",
    output_file="dnc_results.csv",
    create_composite_id=["party", "date", "sentence_id"],
)

print(f"Analyzed {len(results)} sentences")

For the full platform, you’d ideally run this on a GPU or as an overnight job.

Exploring results

The output includes the extracted group reference, the stance toward that group, and whether policy is directed at the group:

results[[
    "text_id",
    "Exact.Group.Text",  # the group reference
    "Stance_Clean",      # positive / negative / neutral
    "Policy_Clean",      # policy / no policy
    "Group1",            # semantic category
]].head(10)

Take a few minutes to read through the output. Which groups does the Democratic platform mention most? Are stances mostly positive, or is there a mix? Which groups are tied to actual policy proposals?
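A few pandas one-liners go a long way toward answering these questions. The sketch below runs on a small, made-up table so that it works on its own. The column names come from the output above, but the rows and the exact label strings ("positive", "policy", and so on) are assumptions; swap in your real `results` DataFrame and check its actual values.

```python
import pandas as pd

# Made-up rows standing in for real pipeline output
toy_results = pd.DataFrame({
    "Exact.Group.Text": ["workers", "families", "workers", "immigrants", "workers"],
    "Stance_Clean": ["positive", "positive", "positive", "neutral", "negative"],
    "Policy_Clean": ["policy", "no policy", "policy", "no policy", "no policy"],
})

# Which groups are mentioned most?
group_counts = toy_results["Exact.Group.Text"].value_counts()
print(group_counts)

# What share of mentions falls into each stance?
stance_share = toy_results["Stance_Clean"].value_counts(normalize=True)
print(stance_share)

# Which groups are tied to policy proposals?
policy_by_group = pd.crosstab(
    toy_results["Exact.Group.Text"], toy_results["Policy_Clean"]
)
print(policy_by_group)
```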

Where to go next

This was a very fast tour of text-as-data. Real toolkits worth learning:

  • spaCy — industrial-strength NLP pipeline (tokenization, POS tags, named entities, dependency parsing)

  • scikit-learn — classic ML for text classification (TF-IDF, logistic regression, SVMs)

  • HuggingFace Transformers — state-of-the-art language models with a friendly Python API

  • NLTK — older but still useful for tokenization, stemming, classic NLP tasks

  • The Manifesto Project — coded political party manifestos across 50+ countries, great for cross-national text analysis

You also have the ICPSR_Token_Classification_Example.ipynb notebook (in Slides_Images/) if you want to walk through training your own classifier.

Wrapping up the week

  • You started Monday not knowing Python. You’re ending Friday having scraped a real document and run it through a transformer classifier. That’s a lot of ground.

  • The hardest part is over. Everything after this is just more of the same patterns: read documentation, try things, hit errors, work through them.

  • Reach out if you ever need a hand: roberho@umich.edu

  • All slides are at will-horne.github.io/icpsr-2026 and on Canvas.

  • Good luck with the rest of your research.