Go to your ae repo, commit any remaining changes, push, and then pull for today’s application exercise.
Reading The Arizona Daily Wildcat
How often do you read The Arizona Daily Wildcat?
Every day
3-5 times a week
Once a week
Rarely
Reading The Arizona Daily Wildcat
What do you think is the most common word in the titles of The Arizona Daily Wildcat opinion pieces?
Analyzing The Arizona Daily Wildcat
All of this analysis is done in Python!
(mostly) with tools you already know!
Common words in The Arizona Daily Wildcat titles
Code for the earlier plot:
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Assumes the wildcat DataFrame is already loaded and the NLTK stop-word
# and tokenizer resources have been downloaded.
stop_words = set(stopwords.words('english'))

# Tokenize each title, keeping lowercase alphabetic words that are not stop words
wildcat['tokens'] = wildcat['title'].apply(
    lambda x: [
        word.lower()
        for word in word_tokenize(x)
        if word.isalpha() and word.lower() not in stop_words
    ]
)

# Count words across all titles and keep the 20 most common
word_counts = Counter(word for title in wildcat['tokens'] for word in title)
common_words = pd.DataFrame(word_counts.most_common(20), columns=['word', 'count'])

# Horizontal bar plot of the most common words
plt.figure(figsize=(10, 5))
sns.barplot(x='count', y='word', data=common_words, palette='viridis')
plt.xlabel('Number of mentions')
plt.ylabel('Word')
plt.title('Arizona Daily Wildcat - Opinion pieces\nCommon words in the most recent opinion pieces')
plt.show()
                                                 title            author  \
0    BOOK REVIEW: ‘Fresh Fruit, Broken Bodies’ by D...    Andres F. Diaz
1    OPINION: The first presidential debate lacked ...       Luke Lawson
2    OPINION: College WBB favorites and sleeper pic...  Melisa Guzeloglu
3    OPINION: College MBB favorites and sleeper pic...   Nathaniel Levin
4    EDITORIAL: A desk altered but opinions thrive ...   Editor-in-Chief
..                                                 ...               ...
995                      Here’s how to best help Nepal    Hailey Dickson
996                        Adderall abuse not harmless    Maddie Pickens
997                     Court rule is legitimate judge   Jacob Winkelman
998                 Letters to the editor: May 4, 2015  Gabriel Schivone
999            Capability imperfectly captured by TCEs    Maddie Pickens

               date  abstract   column  \
0     July 22, 2024       NaN  Opinion
1      July 3, 2024       NaN  Opinion
2    March 15, 2024       NaN  Opinion
3    March 15, 2024       NaN  Opinion
4    March 15, 2024       NaN  Opinion
..              ...       ...      ...
995     May 5, 2015       NaN  Opinion
996     May 5, 2015       NaN  Opinion
997     May 4, 2015       NaN  Opinion
998     May 4, 2015       NaN  Opinion
999     May 4, 2015       NaN  Opinion

                                                   url  \
0    https://wildcat.arizona.edu/155604/opinions/bo...
1    https://wildcat.arizona.edu/155594/opinions/op...
2    https://wildcat.arizona.edu/154146/opinions/s-...
3    https://wildcat.arizona.edu/154116/opinions/s-...
4    https://wildcat.arizona.edu/154126/opinions/ed...
..                                                 ...
995  https://wildcat.arizona.edu/123054/opinions/he...
996  https://wildcat.arizona.edu/102511/opinions/ad...
997  https://wildcat.arizona.edu/127949/opinions/co...
998  https://wildcat.arizona.edu/142548/opinions/le...
999  https://wildcat.arizona.edu/100073/opinions/ca...

                                                tokens  sentiment
0    [book, review, fresh, fruit, broken, bodies, s...        0.0
1    [opinion, first, presidential, debate, lacked,...        0.0
2    [opinion, college, wbb, favorites, sleeper, pi...       -1.0
3    [opinion, college, mbb, favorites, sleeper, pi...       -1.0
4    [editorial, desk, altered, opinions, thrive, w...        0.0
..                                                 ...        ...
995                                [best, help, nepal]        5.0
996                        [adderall, abuse, harmless]       -3.0
997                   [court, rule, legitimate, judge]        0.0
998                             [letters, editor, may]        0.0
999          [capability, imperfectly, captured, tces]        1.0

[1000 rows x 8 columns]
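The tokenize-and-count step behind the tokens column can be tried on a toy example without the full dataset. The sketch below is a simplified stand-in, not the code behind the plot: the stop-word set and the two titles are made up for illustration, and a regular expression replaces NLTK's word_tokenize, so results may differ slightly on real titles.

```python
import re
from collections import Counter

# Hypothetical mini stop-word list; the slides use NLTK's English list instead.
STOP_WORDS = {"the", "a", "of", "to", "is", "not", "and", "in"}


def tokenize(title):
    """Lowercase a title, keep alphabetic words, and drop stop words."""
    words = re.findall(r"[a-z]+", title.lower())
    return [word for word in words if word not in STOP_WORDS]


# Made-up titles, for illustration only.
titles = [
    "OPINION: The cost of parking is not fair",
    "OPINION: Parking and the cost of housing",
]

# Count every remaining word across all titles.
word_counts = Counter(word for title in titles for word in tokenize(title))
print(word_counts.most_common(3))
```

Passing the result of most_common() to pd.DataFrame, as the plot code does, gives a two-column table that is ready for plotting.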
Web scraping
Scraping the web: what? why?
An increasing amount of data is available on the web
These data are provided in an unstructured format: you can always copy and paste, but that is time-consuming and error-prone
Web scraping is the process of extracting this information automatically and transforming it into a structured dataset
Two different scenarios:
Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
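To make the two scenarios concrete, here is a minimal sketch that recovers the same two names from a scrap of page source and from a JSON string. Both strings are hard-coded, hypothetical stand-ins for what a website or an API might return; no network request is made, and the NameParser class is an illustrative helper built on Python's standard-library HTML parser.

```python
import json
from html.parser import HTMLParser


class NameParser(HTMLParser):
    """Collect the text inside every <div class="name"> element."""

    def __init__(self):
        super().__init__()
        self.in_name = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("class") == "name":
            self.in_name = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.in_name = False

    def handle_data(self, data):
        if self.in_name:
            self.names.append(data.strip())


# Scenario 1: screen scraping. We dig the values out of page markup.
page_source = '<div class="name">John</div><div class="name">Doe</div>'
parser = NameParser()
parser.feed(page_source)
scraped = parser.names

# Scenario 2: web API. The response is already structured.
api_response = '{"first": "John", "last": "Doe"}'
record = json.loads(api_response)

print(scraped)
print(record)
```

The API route needs one call to json.loads, while the scraping route needs a parser that knows where the values sit in the markup.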
Hypertext Markup Language
Most of the data on the web is still largely available as HTML. While HTML is structured (hierarchical), it is often not in a form useful for analysis (flat / tidy).
<html>
  <head>
    <title>This is a title</title>
  </head>
  <body>
    <p align="center">Hello world!</p>
    <br/>
    <div class="name" id="first">John</div>
    <div class="name" id="last">Doe</div>
    <div class="contact">
      <div class="home">555-555-1234</div>
      <div class="home">555-555-2345</div>
      <div class="work">555-555-9999</div>
      <div class="fax">555-555-8888</div>
    </div>
  </body>
</html>
BeautifulSoup
The BeautifulSoup package makes basic processing and manipulation of HTML data straightforward
We will use a tool called SelectorGadget to help us identify the HTML elements of interest by constructing a CSS selector which can be used to subset the HTML document.
html = '''
<p>
  This is the first sentence in the paragraph.
  This is the second sentence that should be on the same line as the first sentence.<br>This third sentence should start on a new line.
</p>
'''
This is the first sentence in the paragraph.
This is the second sentence that should be on the same line as the first sentence.This third sentence should start on a new line.
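In the output above, the third sentence runs straight into the second. A likely reason is that get_text() only joins text nodes, and a br tag contains no text. The sketch below reproduces this with a shortened version of the paragraph and shows one possible workaround, which is to swap each br tag for a newline before extracting the text. The parser choice ("html.parser") is an assumption; the slides do not say which parser was used.

```python
from bs4 import BeautifulSoup

html = '''<p>
  This is the first sentence in the paragraph.
  This is the second sentence.<br>This third sentence should start on a new line.
</p>'''

soup = BeautifulSoup(html, "html.parser")

# get_text() joins the text nodes; the <br> tag adds nothing of its own.
raw_text = soup.get_text()

# One workaround: replace every <br> with a newline string, then extract again.
for br in soup.find_all("br"):
    br.replace_with("\n")
fixed_text = soup.get_text()

print(raw_text)
print(fixed_text)
```

Note that replace_with() modifies the parsed document in place, so raw_text has to be extracted before the loop runs.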
SelectorGadget (selectorgadget.com) is a JavaScript-based tool that helps you interactively build an appropriate CSS selector for the content you are interested in.
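Once a CSS selector is in hand, BeautifulSoup's select() and select_one() methods can use it to subset the document. Here is a minimal sketch using the sample HTML from the Hypertext Markup Language slide. The selectors were written by hand rather than with SelectorGadget, and the parser choice ("html.parser") and the flat record layout are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

html = '''
<html>
  <head><title>This is a title</title></head>
  <body>
    <p align="center">Hello world!</p>
    <br/>
    <div class="name" id="first">John</div>
    <div class="name" id="last">Doe</div>
    <div class="contact">
      <div class="home">555-555-1234</div>
      <div class="home">555-555-2345</div>
      <div class="work">555-555-9999</div>
      <div class="fax">555-555-8888</div>
    </div>
  </body>
</html>
'''

soup = BeautifulSoup(html, "html.parser")

# Class selector: every <div> with class "name".
names = [div.get_text() for div in soup.select("div.name")]

# ID selector: the single element with id "last".
last = soup.select_one("#last").get_text()

# Child selector: "home" numbers directly inside the contact block.
home_numbers = [div.get_text() for div in soup.select("div.contact > div.home")]

# Flatten the hierarchy into one record that is ready for analysis.
record = {"first": names[0], "last": last, "home": home_numbers}
print(record)
```

select() always returns a list, even for a single match, while select_one() returns the first matching element or None when nothing matches.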