AE 06: Opinion articles in The Arizona Daily Wildcat
Suggested answers
Important
These are suggested answers. Use this document as a reference only; it is not designed to be an exhaustive key.
Part 1 - Data scraping
See wildcat-scrape.py for suggested scraping code.
Part 2 - Data analysis
Let’s start by loading the packages we will need:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import re
- Load the data you saved into the data folder and name it wildcat.
wildcat = pd.read_csv("data/wildcat.csv")
- Who are the most prolific authors of the 100 most recent opinion articles in The Arizona Daily Wildcat?
author_counts = wildcat['author'].value_counts().reset_index()
author_counts.columns = ['author', 'count']
print(author_counts)
author count
0 Greg Castro 35
1 Toni Marcheva 31
2 Apoorva Bhaskara 28
3 Alec Scott 26
4 Sean Fagan 24
.. ... ...
183 Jack Cooper 1
184 Amit Syal 1
185 Quinn McVeigh 1
186 Eric Wise 1
187 Gabriel Schivone 1
[188 rows x 2 columns]
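The question asks about the 100 most recent articles, while the code above counts authors over every row in the data. A minimal sketch of restricting the count to the newest rows follows. It uses a small made-up DataFrame, and the helper name top_authors is hypothetical; it assumes only that the data has author and date columns, as the code in this document does.

```python
import pandas as pd

# Toy stand-in for the scraped data (made-up rows, not real results).
toy = pd.DataFrame({
    "author": ["A", "B", "A", "C", "B", "A"],
    "date": ["2024-01-05", "2024-01-04", "2024-01-03",
             "2023-12-01", "2023-11-20", "2023-11-01"],
})
toy["date"] = pd.to_datetime(toy["date"])


def top_authors(df, n_recent=100):
    """Count articles per author among the n_recent newest rows (hypothetical helper)."""
    # Sort newest first so head() really picks the most recent articles.
    recent = df.sort_values("date", ascending=False).head(n_recent)
    counts = recent["author"].value_counts().reset_index()
    counts.columns = ["author", "count"]
    return counts


# With n_recent=3 only the three January 2024 rows are counted.
recent_counts = top_authors(toy, n_recent=3)
print(recent_counts)
```

Sorting by date before calling head() avoids relying on the row order of the CSV file.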
- Draw a line plot of the number of opinion articles published per day in The Arizona Daily Wildcat.
wildcat['date'] = pd.to_datetime(wildcat['date'])

articles_per_day = wildcat['date'].value_counts().sort_index().reset_index()
articles_per_day.columns = ['date', 'count']

plt.figure(figsize=(8, 6))
sns.lineplot(data=articles_per_day, x='date', y='count', marker='o')
plt.title('Number of Opinion Articles Published Per Day')
plt.xlabel('Date')
plt.ylabel('Number of Articles')
plt.show()
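One thing to keep in mind with the approach above: value_counts() only returns days on which at least one article was published, so the line plot connects those days directly and skips the days with no articles. If zero-article days should appear as 0, one possible approach (a sketch on made-up dates, not the exercise's required answer) is to reindex the counts onto a complete daily range.

```python
import pandas as pd

# Made-up publication dates: two articles on Jan 1, one on Jan 4.
toy = pd.DataFrame(
    {"date": pd.to_datetime(["2024-01-01", "2024-01-01", "2024-01-04"])}
)

# Counts for days that have at least one article.
observed = toy["date"].value_counts().sort_index()

# Build every calendar day between the first and last publication date,
# then fill the days that had no articles with 0.
all_days = pd.date_range(observed.index.min(), observed.index.max(), freq="D")
daily = observed.reindex(all_days, fill_value=0)
print(daily)
```

The resulting Series can be passed through reset_index() and plotted the same way as articles_per_day.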
- What percent of the most recent 100 opinion articles in The Arizona Daily Wildcat mention “climate” in their title?
# .copy() avoids pandas' SettingWithCopyWarning when adding columns below
most_recent_100 = wildcat.head(100).copy()

most_recent_100['title_lower'] = most_recent_100['title'].str.lower()
most_recent_100['climate_mentioned'] = most_recent_100['title_lower'].apply(lambda x: 'mentioned' if 'climate' in x else 'not mentioned')

climate_mentions = most_recent_100['climate_mentioned'].value_counts(normalize=True).reset_index()
climate_mentions.columns = ['climate_mentioned', 'percentage']
print(climate_mentions)
climate_mentioned percentage
0 not mentioned 0.99
1 mentioned 0.01
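Note that value_counts(normalize=True) reports proportions between 0 and 1, so 0.01 corresponds to 1%. As an alternative sketch, the same share can be computed directly as a percentage with the vectorized str.contains method. The titles below are made up for illustration.

```python
import pandas as pd

# Made-up titles for illustration only.
titles = pd.Series([
    "Climate change is a campus issue",
    "Why parking is broken",
    "Vote in the student election",
    "The political climate on campus",
])

# case=False makes the match case-insensitive;
# na=False treats missing titles as "not mentioned".
has_climate = titles.str.contains("climate", case=False, na=False)

# The mean of a boolean Series is the share of True values.
pct_climate = has_climate.mean() * 100
print(f"{pct_climate:.1f}% of titles mention 'climate'")
```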
- What percent of the most recent 100 opinion articles in The Arizona Daily Wildcat mention “election” in their title or abstract?
# .copy() avoids pandas' SettingWithCopyWarning when adding columns below
most_recent_100 = wildcat.head(100).copy()

# Note: this checks titles only; abstracts are not searched here
most_recent_100['title_lower'] = most_recent_100['title'].str.lower()
most_recent_100['election_mentioned'] = most_recent_100['title_lower'].apply(lambda x: 'mentioned' if 'election' in x else 'not mentioned')

election_mentions = most_recent_100['election_mentioned'].value_counts(normalize=True).reset_index()
election_mentions.columns = ['election_mentioned', 'percentage']
print(election_mentions)
election_mentioned percentage
0 not mentioned 1.0
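The question asks about the title or the abstract, while the code above searches titles only. A sketch of checking both fields is shown below. It assumes the scraped data has a column named abstract, which is not confirmed by anything in this document, and it uses made-up rows.

```python
import pandas as pd

# Made-up rows; the "abstract" column name is an assumption.
toy = pd.DataFrame({
    "title": ["Election season is here", "On parking", "Campus food"],
    "abstract": ["Students should vote.", "The election changed fees.", None],
})

# Case-insensitive search in each field; missing text counts as no match.
in_title = toy["title"].str.contains("election", case=False, na=False)
in_abstract = toy["abstract"].str.contains("election", case=False, na=False)

# An article counts if the word appears in either field.
mentions_election = in_title | in_abstract
pct_election = mentions_election.mean() * 100
print(f"{pct_election:.1f}% mention 'election' in the title or abstract")
```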
- What are the most common words in the titles of the 100 most recent articles?
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from collections import Counter
import nltk

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

most_recent_100['tokens'] = most_recent_100['title_lower'].apply(lambda x: [word for word in word_tokenize(x) if word.isalpha() and word not in stop_words])

# Count the frequency of each word
word_freq = Counter([word for tokens in most_recent_100['tokens'] for word in tokens])

# Convert to DataFrame and plot
word_freq_df = pd.DataFrame(word_freq.most_common(20), columns=['word', 'count'])

plt.figure(figsize=(8, 6))
sns.barplot(data=word_freq_df, x='count', y='word', palette='viridis')
plt.title('Most Common Words in Titles of 100 Most Recent Articles')
plt.xlabel('Count')
plt.ylabel('Word')
plt.show()
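If downloading the NLTK resources is not an option, a rough version of the same word count can be sketched with the standard library alone. The stop word set below is a tiny hand-written stand-in (not NLTK's list), the helper name count_title_words is hypothetical, and the regular expression tokenizes differently from word_tokenize (for example, it splits contractions at the apostrophe), so results will not match exactly.

```python
import re
from collections import Counter

# Tiny hand-written stop word set; an assumption, far smaller than NLTK's list.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "to", "and", "is", "for"}


def count_title_words(titles):
    """Count non-stop-words across an iterable of title strings (hypothetical helper)."""
    counts = Counter()
    for title in titles:
        # Lowercase, then keep runs of ASCII letters as words.
        words = re.findall(r"[a-z]+", title.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts


# Made-up titles for illustration only.
titles = ["The cost of parking", "Parking on campus is broken", "A campus for students"]
word_counts = count_title_words(titles)
print(word_counts.most_common(5))
```

The output of most_common() can be passed to pd.DataFrame(..., columns=['word', 'count']) and plotted as in the code above.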
- Time permitting: