Day 5: Web Scraping and Text-as-Data

Will Horne

We Made It!

A celebratory image marking the final day of the course.

Review: Pandas

Take a few minutes to reload the data from yesterday. Get the correlation between feeling thermometers for Joe Biden and Anthony Fauci. Plot the distribution of one of the feeling thermometers, using whatever options you choose to make the plot look nice.

Accessing Data from the Internet

Data (e.g. web pages) lives on servers
Browsers, apps, etc. are clients
Clients send requests to servers
Servers serve the necessary files to the user

Requests library

The requests library allows us to send requests to servers. This requires us to be working on a machine connected to the internet (obviously).

Let’s see a very simple example

import requests

r = requests.get("https://www.python.org")
r.status_code

What’s in the Response?

What happens if you run this?

print(r.text)

HTML Files

r.text returned the HTML code for the Python webpage, and it contains a ton of information.

style information, including links to CSS files
JavaScript scripts
HTML tags
classes, ids, toggle buttons, etc
navigation bars, sidebars, footers

Take Stock

Go to Wikipedia and load a page on a topic of your interest. What information is actually useful? What information is not worth obtaining?

We want methods for extracting useful, structured, data that we can use in analyses.

Parsing HTML Files

To parse an HTML document, we will need a parsing tool
- Software that recognizes the structure of HTML documents and allows us to extract what we want
beautifulsoup is a library that will allow us to do so
- but that is a topic for later today
Or, we need some other method to interact with the server and bypass this mess!
- so, let’s start with APIs

Requesting Structured Data with APIs

Why APIs?

Application Programming Interfaces (APIs) provide us access to structured data
Design is separate from content (unlike with an HTML file)
We can access the data directly

A needed detour

APIs most commonly return data in the JSON format, or occasionally in the XML format.
- Sometimes you can specify which format you prefer
To interact with APIs, we need to understand how the data that they return will be structured
- And how to manipulate it for our purposes

JSON Files

JSON (JavaScript Object Notation) files store structured data in a simple(ish) and human-readable way.

When working with API responses, we usually call .json() on the response object (e.g., r.json()) — the json module comes into play when reading/writing JSON files on disk.

Extremely popular for exchanging data with servers, storing metadata alongside data, etc.

JSON Structure

JSON files are built on two basic structures:

A collection of name: value pairs (becomes a Python dict)
An ordered list of values (becomes a Python list)

Whether a JSON becomes a dict or a list depends on the top-level structure. Most real APIs return one wrapping the other.
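As a quick check of this mapping, here is a minimal sketch that parses two small, made-up JSON strings with the standard-library json module. The variable names and values are illustrative only.

```python
import json

# A JSON object (name: value pairs) at the top level becomes a dict
as_dict = json.loads('{"name": "Wes", "pet": null}')

# A JSON array at the top level becomes a list
as_list = json.loads('[{"name": "Scott"}, {"name": "Katie"}]')

print(type(as_dict).__name__)   # dict
print(type(as_list).__name__)   # list
print(as_dict["pet"])           # JSON null is converted to Python None
```

Checking the type of the top-level object first tells you whether to index by key or by position.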

Nested JSON

Real-world JSON often nests these:

A list of dictionaries
A dictionary whose values are lists
Arbitrary combinations of the above

This flexibility is why JSON works for so many use cases — but it’s also why parsing it can sometimes feel difficult.

Example JSON

obj = """
{"name": "Wes",
 "cities_lived": ["Akron", "Nashville", "New York", "San Francisco"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 34, "hobbies": ["guitars", "soccer"]},
              {"name": "Katie", "age": 42, "hobbies": ["diving", "art"]}]
}
"""

Notice that this contains a variety of data types and some nesting. It is much more free-form and flexible than a .csv file.

Converting JSON to Python Objects

import json 
result = json.loads(obj)

result

type(result)

If we want to extract data into a DataFrame

import pandas as pd

# Keep only the name and age fields from each sibling dict
siblings = pd.DataFrame(result["siblings"], columns=["name", "age"])

siblings
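The DataFrame above drops the nested hobbies list. One possible way to keep it, sketched here in plain Python, is to build one row per sibling and hobby before handing the rows to pandas. The row layout (name, age, hobby) is an assumption made for illustration, not part of the original example.

```python
import json

obj = """
{"name": "Wes",
 "cities_lived": ["Akron", "Nashville", "New York", "San Francisco"],
 "pet": null,
 "siblings": [{"name": "Scott", "age": 34, "hobbies": ["guitars", "soccer"]},
              {"name": "Katie", "age": 42, "hobbies": ["diving", "art"]}]
}
"""
result = json.loads(obj)

rows = []
for sibling in result["siblings"]:       # a list of dicts
    for hobby in sibling["hobbies"]:     # each dict holds a list
        rows.append({"name": sibling["name"], "age": sibling["age"], "hobby": hobby})

for row in rows:
    print(row)
```

A list of flat dicts like rows can be passed straight to pd.DataFrame(rows).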

APIs

APIs are a very useful way to get data. Many government agencies and other common data sources have public APIs that we can access from Python.
- Can access with R, but support for Python more robust
Sometimes you will need a key to access data, particularly if data is sensitive or non-public

Get Requests: endpoints

When we want to get data from a server, we use a get request to an endpoint — a URL that the server publishes for programmatic access.

Two examples:

Wikipedia: https://en.wikipedia.org/w/api.php
YouTube channels: https://www.googleapis.com/youtube/v3/channels

Get Requests: parameters

Parameters get appended to the URL with ? and &:

?param1=value1&param2=value2&...

Example — articles containing “america”, sorted by publication date:

https://www.example.com/api/posts?query=america&sort=newest&types=articles

Possible parameters vary by API — always check the docs.
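To see how a parameter dictionary turns into the query string above, here is a small sketch using the standard library's urlencode. The endpoint is the placeholder from this slide, not a real API, and the second query value is made up.

```python
from urllib.parse import urlencode

base_url = "https://www.example.com/api/posts"   # placeholder endpoint from the slide
params = {"query": "america", "sort": "newest", "types": "articles"}

# Join the endpoint and the encoded parameters with "?"
url = base_url + "?" + urlencode(params)
print(url)

# Values containing spaces or symbols are escaped automatically
print(urlencode({"query": "united states", "sort": "newest"}))
```

The requests library does the same encoding when you pass a dictionary through its params argument, so you rarely need to build the string by hand.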

Get Requests: response format

Most APIs return JSON. Some return XML or other formats; a few let you specify with a parameter like &format=json.

JSON is much easier to work with in Python — if you have the choice, take it.

Finding parameters: the API Sandbox

Reading the Wikipedia API docs is rough. There are hundreds of modules, no obvious entry point.

The API Sandbox can help us out:

https://en.wikipedia.org/wiki/Special:ApiSandbox

Click together a query in the UI
See the raw JSON response immediately
Copy the generated URL or parameter dict into your code

Realistic Workflow: find a sandbox or find a working example on Stack Overflow, or ask ChatGPT/Claude.

Your Turn!

Using the API Sandbox at https://en.wikipedia.org/wiki/Special:ApiSandbox:

Set action to query
Set titles to Jimmy Carter
Under prop, find and select langlinkscount
Set format to json
Hit “Make request” — look at the response

Then translate that into a requests.get() call in Python. We’ll do it together on the next slide.

Parameters example

Using requests with a params dict is more readable than typing out the full URL, and avoids formatting mistakes.

import requests

endpoint = "https://en.wikipedia.org/w/api.php"

headers = {"User-Agent": "ICPSR-Python-Course/1.0 (your_email@example.com)"}

parameters = {
    "action": "query",
    "titles": "Jimmy_Carter",
    "prop":   "langlinkscount",
    "format": "json",
}

r = requests.get(endpoint, params=parameters, headers=headers)
r.status_code  # 200 is good. 4xx is your problem; 5xx is theirs.

When the API says no: User-Agent

Notice the headers argument. Wikipedia (and many other APIs) reject requests that use the default python-requests/X.Y.Z user agent — too many abusive scripts hide behind it.

Their etiquette policy asks for a descriptive UA with a contact
Without it, you get a 403 Forbidden before the API even processes the query
This is a real-world API lesson: rules change over time. I didn’t need the User-Agent argument last year!

If you get a 403, check the API’s user-agent and rate-limit policies before debugging your code further.
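As a rough triage aid, the sketch below maps a status code to a first thing to check. explain_status is a hypothetical helper written for this slide, and the hints are rules of thumb, not any API's official guidance.

```python
def explain_status(code):
    """Return a short hint for an HTTP status code (hypothetical helper)."""
    if 200 <= code < 300:
        return "success"
    if code == 403:
        return "forbidden: check the User-Agent policy and your credentials"
    if code == 429:
        return "too many requests: slow down and check the rate limit"
    if 400 <= code < 500:
        return "client error: check the URL and parameters"
    if 500 <= code < 600:
        return "server error: wait and retry later"
    return "other: see the API documentation"

for code in (200, 403, 429, 404, 503):
    print(code, "->", explain_status(code))
```

In a notebook you could call explain_status(r.status_code) right after a request.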

Convert the response to a Python dict

d = r.json()
type(d)   # dict
d

The structure of d mirrors the JSON. Now we just walk in to grab the value we want.

Drilling in

# Grab the single page-dict (more important when querying multiple pages)
page = next(iter(d['query']['pages'].values()))

# Extract the language-links count
count = page['langlinkscount']

print("Jimmy Carter's page is translated into", count, "languages")

Your Turn — page views

Modify the parameters to request page views instead of language link count, and extract the data.

parameters = {
    "action":  "query",
    "titles":  "Jimmy_Carter",
    "prop":    "pageviews",   # changed from langlinkscount
    "format":  "json",
}

r = requests.get(endpoint, params=parameters, headers=headers)  # keep the User-Agent header
d = r.json()
d

Pagination: the `continue` token

APIs limit how much data a single query returns. When there’s more to fetch, the response includes a continue key telling you where to pick up.

We’ll grab one batch first, see what comes back, then loop in the next slide.

Setting up the query

API_URL = "https://en.wikipedia.org/w/api.php"

params = {
    "action":  "query",
    "list":    "categorymembers",
    "cmtitle": "Category:Wikipedians_interested_in_history",
    "cmlimit": "50",
    "format":  "json",
}

The first batch

resp = requests.get(API_URL, params=params, headers=headers)  # headers defined earlier
resp.raise_for_status()      # raises on 4xx/5xx errors

data = resp.json()
print("Top-level keys:", list(data.keys()))

If data has a "continue" key, there’s more to fetch.

A while loop to pull all batches

all_members = []

while True:
    resp = requests.get(API_URL, params=params, headers=headers)
    resp.raise_for_status()
    data = resp.json()

    batch = data["query"]["categorymembers"]
    all_members.extend(batch)
    print(f"Fetched {len(batch)} (total: {len(all_members)})")

    if "continue" in data:
        params.update(data["continue"])
    else:
        break

print(f"Done! Total: {len(all_members)}")

How the loop works

Each pass fetches one batch and appends it to all_members
If data["continue"] exists, merge those parameters into params and loop again — Wikipedia tells us where to resume via that field
When data["continue"] is gone, break exits the loop
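The same loop can be traced offline. In this sketch, FAKE_PAGES and fake_get are hypothetical stand-ins for the API, and the responses are simplified: a real Wikipedia continue block carries more than one key.

```python
# Hypothetical canned responses, keyed by the continue token that requests them
FAKE_PAGES = {
    None: {"query": {"categorymembers": ["A", "B"]}, "continue": {"cmcontinue": "t1"}},
    "t1": {"query": {"categorymembers": ["C", "D"]}, "continue": {"cmcontinue": "t2"}},
    "t2": {"query": {"categorymembers": ["E"]}},   # no "continue": this is the last batch
}

def fake_get(params):
    """Return the canned response matching the current continue token."""
    return FAKE_PAGES[params.get("cmcontinue")]

params = {"action": "query", "list": "categorymembers"}
all_members = []

while True:
    data = fake_get(params)
    all_members.extend(data["query"]["categorymembers"])
    if "continue" not in data:
        break
    params.update(data["continue"])   # carry the resume token into the next request

print(all_members)
```

Printing params after the loop shows the last token that was merged in, which is a handy way to see what update does on each pass.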

Pulling From Multiple Pages

endpoint = 'https://en.wikipedia.org/w/api.php'
parameters = {'action': 'query',
              'titles': 'Jimmy_Carter|George_H._W._Bush',
              'prop': 'langlinkscount',
              'format': 'json'}

r = requests.get(endpoint, params=parameters, headers=headers)
r.status_code

d = r.json()
d
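With several titles in one request, the pages dictionary holds one entry per page. Here is a minimal sketch of pulling a count for each. The sample d is hand-written to mimic the response shape, and its page ids and counts are made up.

```python
# Hand-written sample shaped like the Wikipedia response; ids and counts are made up
d = {
    "query": {
        "pages": {
            "111": {"pageid": 111, "title": "Jimmy Carter", "langlinkscount": 10},
            "222": {"pageid": 222, "title": "George H. W. Bush", "langlinkscount": 20},
        }
    }
}

# Loop over every page dict instead of taking only the first one
counts = {
    page["title"]: page.get("langlinkscount")   # .get returns None if the key is missing
    for page in d["query"]["pages"].values()
}
print(counts)
```

Keying the result by title avoids relying on the order in which the API returns pages.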

API Keys: don’t paste them in your script

Many APIs (FRED, Census, OpenSecrets, OpenAI, …) require a personal API key to access data.

Pasting a key directly into your code is fine for a 30-second demo
But your notebook auto-saves to Drive — the key goes with it
If you ever share or commit the notebook, your key leaks
Public GitHub repos get scraped for keys within minutes

We need a way to use the key without it appearing in the notebook.

Colab: use Secrets

Click the key icon in the left sidebar. Add a new secret — name it FRED_API_KEY, paste the key as the value, and toggle notebook access.

from google.colab import userdata

API_KEY = userdata.get("FRED_API_KEY")

The key never appears in the notebook. If you share the notebook, the recipient sees userdata.get("FRED_API_KEY") and has to set up their own secret to run it.

On your own machine

Two common approaches:

Environment variables — set FRED_API_KEY=... in your shell, read with os.environ
A .env file — store keys in a file, load with python-dotenv

Either way, add .env and any key files to .gitignore so they never get committed.

Local example with python-dotenv

In a file called .env (next to your script):

FRED_API_KEY=abc123...

Then in Python:

import os
from dotenv import load_dotenv

load_dotenv()
API_KEY = os.environ["FRED_API_KEY"]

os.environ["..."] raises a clear KeyError if the variable isn’t set, so you find out immediately if your setup is wrong.
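If you prefer a more readable message than a bare KeyError, one option is a small wrapper like the sketch below. get_api_key and DEMO_API_KEY are hypothetical names used only for illustration.

```python
import os

def get_api_key(name):
    """Read an API key from the environment, failing with a readable message."""
    key = os.environ.get(name)
    if not key:
        raise RuntimeError(f"{name} is not set. Add it to your shell or .env file.")
    return key

os.environ["DEMO_API_KEY"] = "not-a-real-key"   # set here only so the sketch runs
print(get_api_key("DEMO_API_KEY"))
```

Calling the function with a name that is not set raises the RuntimeError immediately, before any request is sent.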

If you accidentally commit a key

It happens. Step-by-step fix:

Rotate the key first — revoke the old one in the API provider’s dashboard, generate a new one
Don’t just delete the line and re-commit — the key is still in git history
Either rewrite history (git filter-repo) or accept the key needs to stay revoked

Treat any leaked key as compromised forever. Depending on the API, an unrevoked key can mean unauthorized charges to your account, permanent API bans, or professional consequences if it was tied to work.

A second API: FRED

FRED (Federal Reserve Economic Data) is one of the most useful APIs for economic data. Register for a free key at https://fred.stlouisfed.org/docs/api/api_key.html.

Same pattern as Wikipedia: endpoint, params dict, parse JSON, convert to DataFrame.

from google.colab import userdata

API_KEY = userdata.get("FRED_API_KEY")

endpoint = "https://api.stlouisfed.org/fred/series/observations"

params = {
    "series_id": "UNRATE",   # US unemployment rate
    "api_key":   API_KEY,
    "file_type": "json",
}

r = requests.get(endpoint, params=params)
data = r.json()

FRED to DataFrame

obs = pd.DataFrame(data["observations"])
obs["date"]  = pd.to_datetime(obs["date"])
obs["value"] = pd.to_numeric(obs["value"], errors="coerce")  # FRED marks missing values with ".", which becomes NaN

obs.tail()

pd.to_datetime and pd.to_numeric convert the string columns FRED returns into proper datetime and numeric types — much easier to plot or aggregate. Whenever an API returns dates as strings, this is the first thing to do.

APIs vs. Web Scraping

APIs provide structured data (usually JSON)
- parameters allow you to specify and retrieve exactly what you need
Data specifically stored for automated retrieval/consumption
APIs are usually well maintained
If you can achieve your goal with an API, use the API!

When APIs fail

Many sources you may want to access data from do not have an API, but do have a website
Sometimes there is an API but…
- It has a rate limit
- it is prohibitively expensive to access
- it does not provide the data you need
If a webpage is served to your browser, you can just scrape it!

Scraping static webpages

Some web pages are more complex and require Selenium, which is beyond the scope of an intro class
- but you should have the tools to learn how to use it at this point!
For static websites, use the requests library
- send a request with a URL
- retrieve an HTML document
Use BeautifulSoup to pull data out of HTML

HTML Structure

<!DOCTYPE html>
<html>
<head>
    <title> Page Title </title>  <!-- title tag specifies the page title -->
</head>
<body>  <!-- body tag holds the main content -->
    <h1> This is a heading </h1>
    <p> This is a paragraph </p>
</body>
</html>

tags need to be closed with a forward slash (/)
- <tag> some content </tag>
tags can have attributes that provide additional information about the element’s behavior
- <element attribute="value"> element content </element>

Example

<abbr id="anID" style="color:blue" title="Hypertext Markup Language"> HTML </abbr>

Important attributes to know

id provides a document-wide unique identifier for an element
class provides a way of classifying similar elements together
href specifies the URL of the page a link goes to
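To see attributes as data, here is a small sketch that reads them from the abbr example using Python's built-in html.parser. AttributeCollector is a hypothetical helper class, and the standard-library parser is used only so the sketch runs without extra installs.

```python
from html.parser import HTMLParser

class AttributeCollector(HTMLParser):
    """Record the attributes of every opening tag (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        self.found[tag] = dict(attrs)

snippet = '<abbr id="anID" style="color:blue" title="Hypertext Markup Language"> HTML </abbr>'
collector = AttributeCollector()
collector.feed(snippet)

print(collector.found["abbr"]["id"])
print(collector.found["abbr"]["title"])
```

Each attribute becomes a key in a dictionary, which is the same idea behind looking elements up by id or class when scraping.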

Beautiful Soup

BeautifulSoup represents an HTML document as a navigable tree
We can identify and navigate to specific elements using tags and attributes
Excellent tutorials and documentation on the BeautifulSoup website

An Example

Let’s imagine we are working with the following HTML document, an excerpt from Alice in Wonderland:

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc, "html.parser")

print(soup.prettify())

<html>
 <head>
  <title>
   The Dormouse's story
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The Dormouse's story
   </b>
  </p>
  <p class="story">
   Once upon a time there were three little sisters; and their names were
   <a class="sister" href="http://example.com/elsie" id="link1">
    Elsie
   </a>
   ,
   <a class="sister" href="http://example.com/lacie" id="link2">
    Lacie
   </a>
   and
   <a class="sister" href="http://example.com/tillie" id="link3">
    Tillie
   </a>
   ;
and they lived at the bottom of a well.
  </p>
  <p class="story">
   ...
  </p>
 </body>
</html>

Navigating the structure

soup.title

<title>The Dormouse's story</title>

soup.title.string

"The Dormouse's story"

soup.title.parent.name

'head'

Sometimes, a tag appears multiple times

soup.p

<p class="title"><b>The Dormouse's story</b></p>

Beautiful Soup returns only the first instance unless we explicitly tell it to find all of them.

soup.find_all("p")

[<p class="title"><b>The Dormouse's story</b></p>,
 <p class="story">Once upon a time there were three little sisters; and their names were
 <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
 <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
 <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
 and they lived at the bottom of a well.</p>,
 <p class="story">...</p>]

Using Ids

Sometimes, we are lucky and the HTML file has id attributes for its elements. If that is the case, we can access by id

soup.find(id="link3")

<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>

Extracting Links

Often, it will be useful for downstream tasks to extract all of the links on a page.

for link in soup.find_all("a"):  # the <a> tag is used for links
    print(link.get("href"))

http://example.com/elsie
http://example.com/lacie
http://example.com/tillie

Extracting Text

Very frequently, we will want to extract just the text, without tags. Beautiful Soup makes this very easy!

print(soup.get_text())


The Dormouse's story

The Dormouse's story
Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.
...

Taking a Step Back - What is on a webpage?

Browsers have useful tools to guide scraping
navigate to a page you want to scrape
- go to the object you want to target
right click and select “inspect”
Should bring up the HTML that pertains to that part of the webpage

Robots.txt

a way for websites to communicate with bots, scrapers, crawlers etc
robots.txt typically includes information on
- sitemap
- disallowed sections of the website
- permissions for scrapers
Accessible through URL + /robots.txt

Robots.txt Syntax

user-agent: determines who is, or is not, allowed to do the scraping.
- the * symbol means all scrapers
Disallow denotes which parts of the website are disallowed
- no value means that everything is allowed
- / means everything is disallowed
- /folder/ means everything in that subfolder is disallowed
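These rules can also be checked in code. The sketch below uses the standard library's urllib.robotparser on a made-up rule set. The user-agent name my-scraper and the example.com URLs are placeholders.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real file lives at <site>/robots.txt
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

parser = RobotFileParser()
parser.parse(rules)   # parse the lines directly instead of downloading a file

allowed_public = parser.can_fetch("my-scraper", "https://example.com/public/page.html")
allowed_private = parser.can_fetch("my-scraper", "https://example.com/private/data.html")
print(allowed_public, allowed_private)
```

For a live site you would instead call set_url with the robots.txt address and then read, which downloads and parses the file in one step.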

Example Robots.txt

Let’s check out https://data.europa.eu/robots.txt

Why bother with robots.txt?

it’s good etiquette!
If not respected
- IP address might be blocked
- potentially legal consequences

Robots.txt not helpful?

Try looking for

/sitemap
/sitemap.xml
/sitemap_index.xml
etc, etc, etc…

Sitemaps

generally provided in the robots.txt, but may be separate
You might not be able to navigate to all pages from the home page
- provides a map of what can be accessed
Can be parsed by Beautiful Soup
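As an alternative to Beautiful Soup, a sitemap can be read with the standard library's XML parser. This is a sketch on a made-up sitemap in the sitemaps.org format. The URLs are placeholders, and real sitemaps may instead be index files that point to further sitemaps.

```python
import xml.etree.ElementTree as ET

# Made-up sitemap in the standard sitemaps.org format
sitemap_xml = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

# Sitemap tags live in a namespace, so searches must name it
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
root = ET.fromstring(sitemap_xml)
urls = [loc.text for loc in root.findall("sm:url/sm:loc", ns)]
print(urls)
```

The resulting list of URLs can feed directly into a scraping loop.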

How not to crash a website

If you are scraping a bunch of different pages from a website, use the sleep function in between requests
- i.e., place it in your for/while loop
- Python will take a rest before sending the next request
Generally a few seconds is enough, but depends on use

Example

import requests
import time

# A list of URLs to scrape
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

for url in urls:
    response = requests.get(url)

    if response.status_code < 300:
        # Process the response (e.g., parse HTML)
        print("  → Success! Got", len(response.content), "bytes.")  # .content is the raw bytes
    else:
        print(f"  → Failed with status {response.status_code}")

    # Wait 2 seconds before making the next request
    time.sleep(2)

A real scrape: 2024 Democratic Party Platform

We’re going to scrape an actual document: the 2024 Democratic Party Platform, hosted at the American Presidency Project (Peters & Woolley) at UCSB.

It’s a great target because:

Static HTML — no JavaScript rendering surprises
The document text lives in a single, identifiable container
Stable URL, won’t disappear next semester
Cited regularly in political science research

Fetching and parsing

Use the now-familiar pattern from earlier today — request the page, then hand the HTML to BeautifulSoup:

import requests
from bs4 import BeautifulSoup

URL = "https://www.presidency.ucsb.edu/documents/2024-democratic-party-platform"

r = requests.get(URL)
r.status_code  # should be 200
soup = BeautifulSoup(r.text, "html.parser")

We now have a navigable tree of the entire page. Most of it is navigation, headers, and footers — we need to extract just the platform text.

Extracting the document text

If you inspect the page in your browser, the document body sits inside <div class="field-docs-content">. A single selector pulls everything we need:

content = soup.select_one("div.field-docs-content")
text = content.get_text(separator=" ", strip=True)

print(text[:300])               # first 300 characters
print(len(text), "characters total")

That’s the entire platform as one string. Now we need to format it for downstream analysis.

Cleaning + formatting for analysis

The GroupAppeals classifier we’re about to use expects a DataFrame with one sentence per row, plus party / date / sentence_id metadata for traceability.

import pandas as pd
import re

# Split on sentence-ending punctuation (a quick-and-dirty approach)
sentences = re.split(r"(?<=[.!?])\s+", text)

# Drop empties and very short fragments
sentences = [s.strip() for s in sentences if len(s.strip()) > 20]

df = pd.DataFrame({
    "party": "Democratic",
    "date": "2024",
    "sentence_id": range(len(sentences)),
    "text": sentences,
})

df.head()
print(f"{len(df)} sentences")
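To see why this split is called quick-and-dirty, here is a small sketch that applies the same regular expression to a made-up string containing an abbreviation.

```python
import re

sample = "We will lower costs. We will protect the U.S. economy! Will it work? Yes."

# Same pattern as above: split on whitespace that follows ., ! or ?
pieces = re.split(r"(?<=[.!?])\s+", sample)

for piece in pieces:
    print(repr(piece))
```

The abbreviation "U.S." ends in a period followed by a space, so the second sentence is cut in two. For careful work, a dedicated sentence tokenizer is a safer choice.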

Introducing GroupAppeals

GroupAppeals is a Python package (full disclosure: I’m a co-author) for analyzing how political parties reference and appeal to social groups in their communications.

It runs four fine-tuned transformer models:

Token extraction — identifies group references in the text (e.g., “workers,” “immigrants,” “families”)
Stance detection — positive, negative, or neutral toward the identified group
Policy detection — does the text propose a policy directed at that group?
Group classification — maps the reference to a semantic category

Built on transformer models trained on real party manifestos. See Dolinsky, Huber, & Horne (BJPS, forthcoming) for the methodology.

Installing GroupAppeals

In Colab:

!pip install groupappeals

Then import the pipeline function:

from groupappeals.fullpipeline import run_full_pipeline

First run will download the model weights (a few hundred MB) — be patient.

Running the pipeline

For a live demo we slice down to about 20 sentences so the run completes in a minute or two:

# Take a sample for the demo
sample = df.head(20).copy()
sample.to_csv("dnc_sample.csv", index=False)

# Run the full pipeline
results = run_full_pipeline(
    input_file="dnc_sample.csv",
    output_file="dnc_results.csv",
    create_composite_id=["party", "date", "sentence_id"],
)

print(f"Analyzed {len(results)} sentences")

For the full platform, you’d ideally run this on a GPU or as an overnight job.

Exploring results

The output includes the extracted group reference, the stance toward that group, and whether policy is directed at the group:

results[[
    "text_id",
    "Exact.Group.Text",  # the group reference
    "Stance_Clean",      # positive / negative / neutral
    "Policy_Clean",      # policy / no policy
    "Group1",            # semantic category
]].head(10)

Take a few minutes to read through the output. Which groups does the Democratic platform mention most? Are stances mostly positive, or is there a mix? Which groups are tied to actual policy proposals?

Where to go next

This was a very fast tour of text-as-data. Real toolkits worth learning:

spaCy — industrial-strength NLP pipeline (tokenization, POS tags, named entities, dependency parsing)
scikit-learn — classic ML for text classification (TF-IDF, logistic regression, SVMs)
HuggingFace Transformers — state-of-the-art language models with a friendly Python API
NLTK — older but still useful for tokenization, stemming, classic NLP tasks
The Manifesto Project — coded political party manifestos across 50+ countries, great for cross-national text analysis

You also have the ICPSR_Token_Classification_Example.ipynb notebook (in Slides_Images/) if you want to walk through training your own classifier.

Wrapping up the week

You started Monday not knowing Python. You’re ending Friday having scraped a real document and run it through a transformer classifier. That’s a lot of ground.
The hardest part is over. Everything after this is just more of the same patterns: read documentation, try things, hit errors, work through them.
Reach out if you ever need a hand: rwhorne@clemson.edu
All slides are at will-horne.github.io/icpsr-2026 and on Canvas.
Good luck with the rest of your research.
