Using the below data, write a comprehension to create a dictionary using employee:performance_score as the key:value pair, and only include employees who have completed at least 9 projects.
df.iterrows is something we haven’t seen before! If we have a pandas DataFrame, .iterrows() is a method that iterates over its rows one by one, yielding an (index, row) pair each time. idx is the row’s index, and each row is a pandas Series.
On the first iteration, idx = 0 and row corresponds to the first employee:
Since the first employee has completed only 5 projects, the condition is False, so this row is skipped and the loop moves to the next employee.
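A minimal sketch of the comprehension, for reference. The DataFrame here is hypothetical stand-in data (the exercise's real table is not reproduced), the column names are assumptions, and the cutoff follows the "at least 9 projects" wording of the prompt; it is kept as a named constant so it is easy to change.

```python
import pandas as pd

# Hypothetical stand-in data; names and values are invented for illustration.
df = pd.DataFrame({
    "employee": ["Ana", "Ben", "Chen", "Dee"],
    "projects_completed": [5, 12, 9, 8],
    "performance_score": [3.1, 4.5, 4.0, 3.8],
})

MIN_PROJECTS = 9  # "at least 9 projects"

# iterrows() yields (index, row) pairs; each row is a Series we can index by column name.
scores = {
    row["employee"]: row["performance_score"]
    for idx, row in df.iterrows()
    if row["projects_completed"] >= MIN_PROJECTS
}

print(scores)
```

The first row (5 projects) fails the condition and is left out; only rows that pass the `if` become key:value pairs.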
Take a few minutes to reload the data from yesterday. Get the correlation between feeling thermometers for Joe Biden and Anthony Fauci. Plot the distribution of one of the feeling thermometers, using whatever options you choose to make the plot look nice.
APIs provide structured data (usually JSON)
Data specifically stored for automated retrieval/consumption
APIs are usually well maintained
If you can achieve your goal with an API, use the API!
Many sources you may want to access data from do not have an API, but do have a website
Sometimes there is an API but…
it has a rate limit
it is prohibitively expensive to access
it does not provide the data you need
If a webpage is served to your browser, you can just scrape it!
Some web pages are more complex (for example, content loaded by JavaScript) and require Selenium, which is beyond the scope of an intro class
For static websites, use the requests library
send a request with a URL
retrieve an HTML document
Use BeautifulSoup to pull data out of HTML
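The three steps above can be sketched as two small functions. This is one reasonable way to wire requests and BeautifulSoup together, not the only one; the function names `parse_html` and `fetch_soup` and the 10-second timeout are choices made here for illustration.

```python
import requests
from bs4 import BeautifulSoup


def parse_html(html):
    """Turn a string of HTML into a navigable BeautifulSoup tree."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url, timeout=10):
    """Send a request to a static page and return its parsed HTML."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return parse_html(response.text)


# Offline check of the parsing step, using a tiny hand-written page.
demo = parse_html(
    "<html><head><title>Page Title</title></head><body><p>Hi</p></body></html>"
)
print(demo.title.string)
```

With a network connection, `soup = fetch_soup("https://example.com")` would run the full request-then-parse pattern.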
<!DOCTYPE html>
<html>
<head>
<title> Page Title </title> <!-- title tag specifies the page title -->
</head>
<body> <!-- body tag holds the main content -->
<h1> This is a heading </h1>
<p> This is a paragraph </p>
</body>
</html>
tags need to be closed with a forward slash (/)
tags can have attributes that provide additional information about the element’s behavior
<element attribute="value"> element content </element>
id provides a document-wide unique identifier for an element
class provides a way of classifying similar elements together
href specifies the URL of the page a link goes to
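To see how these attributes look from Python, here is a small sketch that parses a single link and reads each attribute. The tag is hand-written for illustration, and the second class value ("featured") is added only to show that class can hold several values.

```python
from bs4 import BeautifulSoup

html = '<a id="link1" class="sister featured" href="http://example.com/elsie">Elsie</a>'
tag = BeautifulSoup(html, "html.parser").a

print(tag["id"])     # id is a single string
print(tag["class"])  # class can hold several values, so it comes back as a list
print(tag["href"])   # the URL the link points to
print(tag.attrs)     # all attributes together, as a dictionary
```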
BeautifulSoup represents an HTML document as a navigable tree
We can identify and navigate to specific elements using tags and attributes
Excellent tutorials and documentation on the BeautifulSoup website
Let’s imagine we are working with the following HTML document, an excerpt from Alice in Wonderland:
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""
<html>
<head>
<title>
The Dormouse's story
</title>
</head>
<body>
<p class="title">
<b>
The Dormouse's story
</b>
</p>
<p class="story">
Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">
Elsie
</a>
,
<a class="sister" href="http://example.com/lacie" id="link2">
Lacie
</a>
and
<a class="sister" href="http://example.com/tillie" id="link3">
Tillie
</a>
;
and they lived at the bottom of a well.
</p>
<p class="story">
...
</p>
</body>
</html>
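The indented listing above is the kind of output Beautiful Soup's prettify() method produces once a document has been parsed. A minimal sketch on a shortened copy of the document (shortened here only to keep the example small), followed by a few navigation steps down the tree:

```python
from bs4 import BeautifulSoup

short_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters.</p>
</body></html>
"""

soup = BeautifulSoup(short_doc, "html.parser")

print(soup.prettify())         # one tag per line, indented by nesting depth

# Dotted access walks down the tree by tag name.
print(soup.title)              # the whole <title> element
print(soup.title.string)       # just the text inside it
print(soup.title.parent.name)  # the tag that contains <title>
print(soup.p["class"])         # attributes of the first <p>
```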
Sometimes, a tag appears multiple times
Beautiful Soup returns only the first match by default, unless we explicitly tell it to find all of them.
[<p class="title"><b>The Dormouse's story</b></p>,
<p class="story">Once upon a time there were three little sisters; and their names were
<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a> and
<a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>,
<p class="story">...</p>]
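A list like the one above is what find_all returns for the p tag. The difference between the first-match and all-matches searches can be sketched on a small hand-written snippet (the paragraph text is invented for illustration):

```python
from bs4 import BeautifulSoup

html = """
<p class="title">First paragraph</p>
<p class="story">Second paragraph</p>
<p class="story">Third paragraph</p>
"""
soup = BeautifulSoup(html, "html.parser")

first_p = soup.find("p")                      # same as soup.p: only the first match
all_p = soup.find_all("p")                    # a list-like collection of every match
story_p = soup.find_all("p", class_="story")  # narrow the search by attribute

print(first_p.get_text())
print(len(all_p), len(story_p))
```

Note the trailing underscore in `class_`: class is a reserved word in Python, so Beautiful Soup uses `class_` as the keyword argument.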
Sometimes, we are lucky and the HTML file has id attributes for its elements. If that is the case, we can access by id
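A short sketch of lookup by id, using the three links from the example document:

```python
from bs4 import BeautifulSoup

html = """
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>
"""
soup = BeautifulSoup(html, "html.parser")

link = soup.find(id="link3")     # ids are unique, so find() is enough
print(link.get_text(), link["href"])

missing = soup.find(id="link9")  # no element has this id, so find() returns None
print(missing)
```

Because find() returns None when nothing matches, it is worth checking the result before calling methods on it.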
Often, it will be useful for downstream tasks to extract all of the links on a page.
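One way to collect the links is to find every a tag and read its href. The snippet below is a sketch on hand-written HTML; the last anchor, which has no href, is added only to show why the `href=True` filter is useful.

```python
from bs4 import BeautifulSoup

html = """
<a href="http://example.com/elsie" id="link1">Elsie</a>
<a href="http://example.com/lacie" id="link2">Lacie</a>
<a id="no-target">An anchor without a link</a>
"""
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only <a> tags that actually have an href attribute.
links = [a["href"] for a in soup.find_all("a", href=True)]
print(links)
```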
Very frequently, we will want to extract just the text, without tags. Beautiful Soup makes this very easy!
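A sketch of text extraction with get_text(), on a shortened, hand-written version of the example document. Passing a separator and strip=True is optional, but it tidies up the whitespace left behind by the removed tags.

```python
from bs4 import BeautifulSoup

html = (
    '<p class="title"><b>The Dormouse\'s story</b></p>'
    '<p class="story">Once upon a time <a href="http://example.com/elsie">Elsie</a>'
    " lived in a well.</p>"
)
soup = BeautifulSoup(html, "html.parser")

raw_text = soup.get_text()                    # all text, exactly as it appears
clean_text = soup.get_text(" ", strip=True)   # pieces stripped and joined by spaces

print(clean_text)
```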
Browsers have useful tools to guide scraping
navigate to a page you want to scrape
right click and select “inspect”
This should bring up the HTML that corresponds to that part of the webpage
robots.txt is a way for websites to communicate with bots, scrapers, crawlers, etc.
robots.txt files typically include information on
sitemap
disallowed sections of the website
permissions for scrapers
Accessible through URL + /robots.txt
User-agent names the bot (or * for all bots) that the rules below it apply to; this determines who is, or is not, allowed to do the scraping.
Disallow denotes which parts of the website are disallowed
no value means that everything is allowed
/ means everything is disallowed
/folder/ means everything in that subfolder is disallowed
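Python's standard library can read these rules for you. The sketch below feeds a made-up robots.txt (invented for illustration, not taken from any real site) to urllib.robotparser and asks whether a given scraper may fetch a given URL. The agent names "MyScraper" and "BadBot" are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Made-up rules for illustration; a real file lives at <site>/robots.txt.
rules = """
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Everyone may fetch pages outside /private/ ...
ok_public = parser.can_fetch("MyScraper", "https://example.com/data/page.html")
# ... but nothing inside it.
ok_private = parser.can_fetch("MyScraper", "https://example.com/private/page.html")
# BadBot is disallowed everywhere.
ok_badbot = parser.can_fetch("BadBot", "https://example.com/data/page.html")

print(ok_public, ok_private, ok_badbot)
```

For a live site, `parser.set_url(...)` followed by `parser.read()` downloads the file instead of parsing a string.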
Let’s check out https://data.europa.eu/robots.txt
Respecting robots.txt is good etiquette!
If it is not respected
IP address might be blocked
potentially legal consequences
Try looking for
/sitemap
/sitemap.xml
/sitemap_index.xml
etc, etc, etc…
The sitemap location is generally listed in robots.txt, but may be published separately
You might not be able to navigate to all pages from the home page
Can be parsed by Beautiful Soup
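A sketch of pulling page URLs out of a sitemap with Beautiful Soup. The sitemap below is made up, following the common urlset/url/loc layout. The built-in "html.parser" is used here so no extra install is needed; Beautiful Soup also has a dedicated "xml" parser, which requires the lxml package.

```python
from bs4 import BeautifulSoup

# A made-up sitemap in the standard <urlset>/<url>/<loc> layout.
sitemap_xml = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/about</loc></url>
  <url><loc>https://example.com/data/2024</loc></url>
</urlset>
"""

soup = BeautifulSoup(sitemap_xml, "html.parser")

# Each <loc> tag holds the address of one page on the site.
page_urls = [loc.get_text(strip=True) for loc in soup.find_all("loc")]
print(page_urls)
```

The resulting list can be fed straight into a scraping loop.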
If you are scraping a bunch of different pages from a website, use the sleep function in between requests
i.e., place it in your for/while loop
Python will take a rest before sending the next request
Generally a few seconds is enough, but depends on use
import requests
import time

# A list of URLs to scrape
urls = [
    "https://example.com/page1",
    "https://example.com/page2",
    "https://example.com/page3",
]

for url in urls:
    response = requests.get(url)
    if response.status_code < 300:
        # Process the response (e.g., parse HTML)
        print(" → Success! Got", len(response.text), "characters.")
    else:
        print(f" → Failed with status {response.status_code}")
    # Wait 2 seconds before making the next request
    time.sleep(2)

We’re going to scrape an actual document: the 2024 Democratic Party Platform, hosted at the American Presidency Project (Peters & Woolley) at UCSB.
It’s a great target because:
Static HTML — no JavaScript rendering surprises
The document text lives in a single, identifiable container
Stable URL, won’t disappear next semester
Cited regularly in political science research
Use the now-familiar pattern from earlier today — request the page, then hand the HTML to BeautifulSoup:
We now have a navigable tree of the entire page. Most of it is navigation, headers, and footers — we need to extract just the platform text.
If you inspect the page in your browser, the document body sits inside <div class="field-docs-content">. A single selector pulls everything we need:
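A sketch of that selector, run on a small stand-in page so it works offline. In the real workflow, soup would be the tree built from the downloaded platform page. The stand-in HTML is invented; only the class name field-docs-content comes from the page as described above.

```python
from bs4 import BeautifulSoup

# Stand-in for the downloaded page: navigation, the document container, a footer.
page_html = """
<html><body>
<nav>Home | Documents | About</nav>
<div class="field-docs-content">
  <p>First paragraph of the platform.</p>
  <p>Second paragraph of the platform.</p>
</div>
<footer>Site footer</footer>
</body></html>
"""
soup = BeautifulSoup(page_html, "html.parser")

# Grab the one container that holds the document text.
content = soup.find("div", class_="field-docs-content")
if content is None:
    raise ValueError("No div with class 'field-docs-content' found")

# Flatten it to a single string, dropping the tags and extra whitespace.
text = content.get_text(" ", strip=True)
print(text)
```

Checking for None first gives a clear error if the site's layout ever changes and the container is renamed.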
That’s the entire platform as one string. Now we need to format it for downstream analysis.
The GroupAppeals classifier we’re about to use expects a DataFrame with one sentence per row, plus party / date / sentence_id metadata for traceability.
import pandas as pd
import re

# Split on sentence-ending punctuation (a quick-and-dirty approach)
sentences = re.split(r"(?<=[.!?])\s+", text)

# Drop empties and very short fragments
sentences = [s.strip() for s in sentences if len(s.strip()) > 20]

df = pd.DataFrame({
    "party": "Democratic",
    "date": "2024",
    "sentence_id": range(len(sentences)),
    "text": sentences,
})
df.head()
print(f"{len(df)} sentences")

GroupAppeals is a Python package (full disclosure: I’m a co-author) for analyzing how political parties reference and appeal to social groups in their communications.
It runs four fine-tuned transformer models:
Token extraction — identifies group references in the text (e.g., “workers,” “immigrants,” “families”)
Stance detection — positive, negative, or neutral toward the identified group
Policy detection — does the text propose a policy directed at that group?
Group classification — maps the reference to a semantic category
Built on transformer models trained on real party manifestos. See Dolinsky, Huber, & Horne (BJPS, forthcoming) for the methodology.
In Colab:
Then import the pipeline and the helper function:
First run will download the model weights (a few hundred MB) — be patient.
For a live demo we slice down to about 20 sentences so the run completes in a minute or two:
# Take a sample for the demo
sample = df.head(20).copy()
sample.to_csv("dnc_sample.csv", index=False)

# Run the full pipeline
results = run_full_pipeline(
    input_file="dnc_sample.csv",
    output_file="dnc_results.csv",
    create_composite_id=["party", "date", "sentence_id"],
)
print(f"Analyzed {len(results)} sentences")

For the full platform, you’d ideally run this on a GPU or as an overnight job.
The output includes the extracted group reference, the stance toward that group, and whether policy is directed at the group:
Take a few minutes to read through the output. Which groups does the Democratic platform mention most? Are stances mostly positive, or is there a mix? Which groups are tied to actual policy proposals?
This was a very fast tour of text-as-data. Real toolkits worth learning:
spaCy — industrial-strength NLP pipeline (tokenization, POS tags, named entities, dependency parsing)
scikit-learn — classic ML for text classification (TF-IDF, logistic regression, SVMs)
HuggingFace Transformers — state-of-the-art language models with a friendly Python API
NLTK — older but still useful for tokenization, stemming, classic NLP tasks
The Manifesto Project — coded political party manifestos across 50+ countries, great for cross-national text analysis
You also have the ICPSR_Token_Classification_Example.ipynb notebook (in Slides_Images/) if you want to walk through training your own classifier.
You started Monday not knowing Python. You’re ending Friday having scraped a real document and run it through a transformer classifier. That’s a lot of ground.
The hardest part is over. Everything after this is just more of the same patterns: read documentation, try things, hit errors, work through them.
Reach out if you ever need a hand: roberho@umich.edu
All slides are at will-horne.github.io/icpsr-2026 and on Canvas.
Good luck with the rest of your research.