Assignment 5

In this assignment, you'll scrape text from The California Aggie and then analyze the text.

The Aggie is organized by category into article lists. For example, there's a Campus News list, Arts & Culture list, and Sports list. Notice that each list has multiple pages, with a maximum of 15 articles per page.

The goal of exercises 1.1 - 1.3 is to scrape articles from the Aggie for analysis in exercise 1.4.

Exercise 1.1. Write a function that extracts all of the links to articles in an Aggie article list. The function should:

  • Have a parameter url for the URL of the article list.

  • Have a parameter page for the number of pages to fetch links from. The default should be 1.

  • Return a list of article URLs (each URL should be a string).

Test your function on 2-3 different categories to make sure it works.

Hints:

  • Be polite to The Aggie and save time by setting up requests_cache before you write your function.

  • Start by getting your function to work for just 1 page. Once that works, have your function call itself to get additional pages.

  • You can use lxml.html or BeautifulSoup to scrape HTML. Choose one and use it throughout the entire assignment.
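
The "call itself" hint can be tried out without touching the network. The sketch below is only one way to structure the recursion. It assumes a hypothetical helper, `fetch_links`, that returns the links found on a single list page, and it assumes list pages are numbered with a `page/N/` suffix. Neither the helper nor the names `collect_links` and `fake_fetch` come from the assignment.

```python
def collect_links(url, pages, fetch_links, current=1):
    """Recursively gather links from list pages current..pages.

    fetch_links is a hypothetical callable: page URL -> list of link strings.
    """
    # Base case: every requested page has been visited.
    if current > pages:
        return []
    # Assumed URL pattern for page N of a list: <list url>page/N/
    page_url = "{}page/{}/".format(url, current)
    # Links from this page, followed by links from the remaining pages.
    return fetch_links(page_url) + collect_links(url, pages, fetch_links, current + 1)


def fake_fetch(page_url):
    """Stand-in for a real scraper: returns one made-up link per page."""
    return [page_url + "example-article/"]


links = collect_links("https://theaggie.org/city/", 2, fake_fetch)
print(links)
```

Passing the fetching step in as an argument keeps the recursion separate from the HTML parsing, so each part can be checked on its own before they are combined.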

In [86]:
import re
import requests
import requests_cache
from bs4 import BeautifulSoup
requests_cache.install_cache('davis_aggie_cache')

def url_scraper(url, numPages = 1):
    '''
    This function takes the URL of a California Aggie article list and returns a list of article URLs as strings.
    Input: 
        url: The URL of a California Aggie article list; the URL must end with a forward slash
        numPages: the number of pages to search
    Output:
        list of strings of article URLs
    
    '''
    #initialize final list outside of for loop
    url_list = []
    #loop through 1 to numPages
    for i in range(1, numPages + 1): 
        #get right URL
        url2 = url + r'page/%s' % i
        #BeautifulSoup content and make request to Aggie
        aggie = BeautifulSoup(requests.get(url2).content, "lxml")
        #https://www.crummy.com/software/BeautifulSoup/bs4/doc/
        #loop over every <a> tag in the page's <section>
        for link in aggie.section.find_all('a'):
            possible = link.get('href')
            #Aggie article URLs start with the year; the other links are for navigation
            #(href is None for <a> tags without one, so check it before testing)
            if possible and 'https://theaggie.org/2' in possible:
                url_list.append(possible)
    
    #make the list unique and sort it
    return sorted(set(url_list))

urllist = url_scraper('https://theaggie.org/city/', 3)
urllist
Out[86]:
['https://theaggie.org/2016/11/30/affordable-clean-green-energy-is-coming-to-yolo-county-davis/',
 'https://theaggie.org/2016/12/02/davis-turkey-trot-more-than-just-another-race/',
 'https://theaggie.org/2016/12/02/nodapl-protest-erupts-in-downtown-davis/',
 'https://theaggie.org/2016/12/04/police-logs-6/',
 'https://theaggie.org/2016/12/04/the-season-of-giving/',
 'https://theaggie.org/2016/12/05/childrens-candlelight-parade-lights-up-downtown-davis/',
 'https://theaggie.org/2016/12/07/twas-the-night-before-christmas-in-old-sacramento/',
 'https://theaggie.org/2016/12/08/bail-reform-advocates-gather-at-annual-fundraiser/',
 'https://theaggie.org/2016/12/09/bike-campaign-offers-bicycles-to-those-who-cannot-afford-them/',
 'https://theaggie.org/2016/12/09/sparks-fly-in-light-the-fire/',
 'https://theaggie.org/2017/01/12/neighbors-unite/',
 'https://theaggie.org/2017/01/12/police-logs-7/',
 'https://theaggie.org/2017/01/15/yolo-county-library-materials-to-be-more-widely-available/',
 'https://theaggie.org/2017/01/16/a-dog-named-disney-wins-grant-money-for-rotts-of-friends/',
 'https://theaggie.org/2017/01/17/no-such-thing-as-too-much-thai-food/',
 'https://theaggie.org/2017/01/20/helping-the-homeless/',
 'https://theaggie.org/2017/01/20/police-logs-8/',
 'https://theaggie.org/2017/01/22/chinook-salmon-spawning-in-record-numbers-in-putah-creek/',
 'https://theaggie.org/2017/01/23/yolo-county-farm-bureau-to-honor-local-winery/',
 'https://theaggie.org/2017/01/24/love-laundry-accommodates-student-schedules-with-extended-hours/',
 'https://theaggie.org/2017/01/25/islamic-center-of-davis-victim-of-hate-crime/',
 'https://theaggie.org/2017/01/26/healing-the-mind-body-and-soul/',
 'https://theaggie.org/2017/01/26/police-logs-9/',
 'https://theaggie.org/2017/01/26/thousands-gather-at-sacramento-capitol-building-for-womens-rights/',
 'https://theaggie.org/2017/01/29/a-symphony-to-childrens-ears/',
 'https://theaggie.org/2017/01/29/yolo-countys-poverty-rate-higher-than-before-recession/',
 'https://theaggie.org/2017/01/30/sacramentos-new-public-transportation/',
 'https://theaggie.org/2017/01/30/toastmasters-help-members-conquer-fear-of-public-speaking/',
 'https://theaggie.org/2017/01/31/city-leaders-address-panhandling-issue-in-davis/',
 'https://theaggie.org/2017/02/02/davis-celebrates-mlk-day/',
 'https://theaggie.org/2017/02/02/local-author-cyclist-rides-2300-miles/',
 'https://theaggie.org/2017/02/05/davis-owls-face-eviction-at-marriott-residence-inn/',
 'https://theaggie.org/2017/02/05/police-logs-10/',
 'https://theaggie.org/2017/02/06/the-musical-train-to-memory-lane/',
 'https://theaggie.org/2017/02/12/news-in-brief-a-valentines-day-for-everybody/',
 'https://theaggie.org/2017/02/13/police-logs-11/',
 'https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/',
 'https://theaggie.org/2017/02/15/suspect-in-davis-islamic-center-vandalism-arrested/',
 'https://theaggie.org/2017/02/16/city-of-davis-to-retain-sanctuary-city-status/',
 'https://theaggie.org/2017/02/19/police-logs-12/',
 'https://theaggie.org/2017/02/20/city-of-davis-awarded-funds-for-new-recycling-bins/',
 'https://theaggie.org/2017/02/21/davis-stands-with-muslim-residents/',
 'https://theaggie.org/2017/02/23/davis-whole-foods-market-shuts-down/',
 'https://theaggie.org/2017/02/23/daviss-historic-city-hall-building-to-be-put-up-for-sale/',
 'https://theaggie.org/2017/02/23/protest-against-planned-parenthood-in-woodland-is-met-with-counter-protests/']
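
The filter in url_scraper keeps any link that contains `https://theaggie.org/2`. A stricter check, sketched below, matches the full `/YYYY/MM/DD/slug/` shape seen in the output above. The name `is_article_url` and the exact pattern are assumptions inferred from that output, not something The Aggie documents.

```python
import re

# Assumed article URL shape, inferred from the output above:
# https://theaggie.org/YYYY/MM/DD/slug/
ARTICLE_URL = re.compile(r"^https://theaggie\.org/\d{4}/\d{2}/\d{2}/[^/]+/?$")


def is_article_url(href):
    """Return True if href looks like an Aggie article URL.

    Also accepts None, which BeautifulSoup's link.get('href') returns
    for <a> tags that have no href attribute.
    """
    return bool(href) and ARTICLE_URL.match(href) is not None


candidates = [
    "https://theaggie.org/2017/02/13/police-logs-11/",
    "https://theaggie.org/city/page/2/",
    None,
]
article_links = [href for href in candidates if is_article_url(href)]
print(article_links)
```

Anchoring the pattern at both ends rejects navigation links such as the `page/2/` URL, which a plain substring test would only exclude by luck.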

Exercise 1.2. Write a function that extracts the title, text, and author of an Aggie article. The function should:

  • Have a parameter url for the URL of the article.

  • For the author, extract the "Written By" line that appears at the end of most articles. You don't have to extract the author's name from this line.

  • Return a dictionary with keys "url", "title", "text", and "author". The values for these should be the article url, title, text, and author, respectively.

For example, for this article your function should return something similar to this:

{
    'author': u'Written By: Bianca Antunez \xa0\u2014\xa0city@theaggie.org',
    'text': u'Davis residents create financial model to make city\'s financial state more transparent To increase transparency between the city\'s financial situation and the community, three residents created a model called Project Toto which aims to improve how the city communicates its finances in an easily accessible design. Jeff Miller and Matt Williams, who are members of Davis\' Finance and Budget Commission, joined together with Davis entrepreneur Bob Fung to create the model plan to bring the project to the Finance and Budget Commission in February, according to Kelly Stachowicz, assistant city manager. "City staff appreciate the efforts that have gone into this, and the interest in trying to look at the city\'s potential financial position over the long term," Stachowicz said in an email interview. "We all have a shared goal to plan for a sound fiscal future with few surprises. We believe the Project Toto effort will mesh well with our other efforts as we build the budget for the next fiscal year and beyond." Project Toto complements the city\'s effort to amplify the transparency of city decisions to community members. The aim is to increase the understanding about the city\'s financial situation and make the information more accessible and easier to understand. The project is mostly a tool for public education, but can also make predictions about potential decisions regarding the city\'s financial future. Once completed, the program will allow residents to manipulate variables to see their eventual consequences, such as tax increases or extensions and proposed developments "This really isn\'t a budget, it is a forecast to see the intervention of these decisions," Williams said in an interview with The Davis Enterprise. "What happens if we extend the sales tax? What does it do given the other numbers that are in?" 
Project Toto enables users, whether it be a curious Davis resident, a concerned community member or a city leader, with the ability to project city finances with differing variables. The online program consists of the 400-page city budget for the 2016-2017 fiscal year, the previous budget, staff reports and consultant analyses. All of the documents are cited and accessible to the public within Project Toto. "It\'s a model that very easily lends itself to visual representation," Mayor Robb Davis said. "You can see the impacts of decisions the council makes on the fiscal health of the city." Complementary to this program, there is also a more advanced version of the model with more in-depth analyses of the city\'s finances. However, for an easy-to-understand, simplistic overview, Project Toto should be enough to help residents comprehend Davis finances. There is still more to do on the project, but its creators are hard at work trying to finalize it before the 2017-2018 fiscal year budget. "It\'s something I have been very much supportive of," Davis said. "Transparency is not just something that I have been supportive of but something we have stated as a city council objective [ ] this fits very well with our attempt to inform the public of our challenges with our fiscal situation." ',
    'title': 'Project Toto aims to address questions regarding city finances',
    'url': 'https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/'
}

Hints:

  • The author line is always the last line of the last paragraph.

  • Python 2 displays some Unicode characters as \uXXXX. For instance, \u201c is a left-facing quotation mark. You can convert most of these to ASCII characters with the method call (on a string)

    .translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 })

    If you're curious about these characters, you can look them up on this page, or read more about what Unicode is.
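
As a quick check of the translate hint, the sketch below applies the same mapping to a short made-up string. The names `PUNCTUATION`, `sample`, and `cleaned` are only for illustration. In Python 2 the dictionary form of translate works on Unicode strings (hence the `u` prefix); in Python 3 every string is Unicode.

```python
# Mapping from the hint: curly quotes -> straight quotes, ellipsis -> space.
PUNCTUATION = {0x2018: 0x27, 0x2019: 0x27, 0x201C: 0x22, 0x201D: 0x22, 0x2026: 0x20}

# Made-up sample containing a curly double-quote pair, a curly apostrophe,
# and an ellipsis character.
sample = u"\u201cIt\u2019s a forecast\u2026\u201d"
cleaned = sample.translate(PUNCTUATION)
print(cleaned)
```

Characters that are not in the mapping, such as the non-breaking space `\xa0` in the author line above, pass through unchanged.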

In [87]:
def extract_aggie(url):
    '''
    This function takes the URL of a California Aggie article and returns a dictionary with keys "url", "title", "text", 
    and "author" and their values
    Input: 
        url: The URL of a California Aggie article
    Output:
        a dictionary with keys "url", "title", "text", and "author" holding the article's URL, title, text, and author line
    '''
    #Request and BeautifulSoup
    aggie = BeautifulSoup(requests.get(url).content, 'lxml')
    #get title
    title = aggie.title.string.split(' |', 1)[0].translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 })
    #find articleBody
    aggie = aggie.find(itemprop = 'articleBody')
    #from http://stackoverflow.com/questions/40660273/in-beautifulsoup-ignore-children-elements-while-getting-parent-element-data
    #get rid of picture stuff
    for figure in aggie.find_all('figure'):
        figure.decompose()
    #author from last line (look the last block up once and reuse it)
    last_block = aggie.find_all()[-1].parent.parent.text
    if 'Written' in last_block:
        trash, part, author = last_block.strip().partition('Written')
        author = part.strip() + author
    else:
        author = ''

    #get rid of '\n' and translate
    aggie = aggie.get_text().translate({ 0x2018:0x27, 0x2019:0x27, 0x201C:0x22, 0x201D:0x22, 0x2026:0x20 }).strip('\n').replace('\n', ' ')
    aggie = aggie.replace(author, '')
    return {'author': author, 'text': aggie, 'title': title, 'url': url}
    
article = extract_aggie('https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/')
article
Out[87]:
{'author': u'Written By: Bianca Antunez \xa0\u2014\xa0city@theaggie.org',
 'text': u'Davis residents create financial model to make city\'s financial state more transparent To increase transparency between the city\'s financial situation and the community, three residents created a model called Project Toto which aims to improve how the city communicates its finances in an easily accessible design.  Jeff Miller and Matt Williams, who are members of Davis\' Finance and Budget Commission, joined together with Davis entrepreneur Bob Fung to create the model plan to bring the project to the Finance and Budget Commission in February, according to Kelly Stachowicz, assistant city manager.  "City staff appreciate the efforts that have gone into this, and the interest in trying to look at the city\'s potential financial position over the long term," Stachowicz said in an email interview. "We all have a shared goal to plan for a sound fiscal future with few surprises. We believe the Project Toto effort will mesh well with our other efforts as we build the budget for the next fiscal year and beyond." Project Toto complements the city\'s effort to amplify the transparency of city decisions to community members. The aim is to increase the understanding about the city\'s financial situation and make the information more accessible and easier to understand. The project is mostly a tool for public education, but can also make predictions about potential decisions regarding the city\'s financial future. Once completed, the program will allow residents to manipulate variables to see their eventual consequences, such as tax increases or extensions and proposed developments "This really isn\'t a budget, it is a forecast to see the intervention of these decisions," Williams said in an interview with The Davis Enterprise. "What happens if we extend the sales tax? What does it do given the other numbers that are in?" 
Project Toto enables users, whether it be a curious Davis resident, a concerned community member or a city leader, with the ability to project city finances with differing variables.  The online program consists of the 400-page city budget for the 2016-2017 fiscal year, the previous budget, staff reports and consultant analyses. All of the documents are cited and accessible to the public within Project Toto. "It\'s a model that very easily lends itself to visual representation," Mayor Robb Davis said. "You can see the impacts of decisions the council makes on the fiscal health of the city." Complementary to this program, there is also a more advanced version of the model with more in-depth analyses of the city\'s finances. However, for an easy-to-understand, simplistic overview, Project Toto should be enough to help residents comprehend Davis finances.  There is still more to do on the project, but its creators are hard at work trying to finalize it before the 2017-2018 fiscal year budget.  "It\'s something I have been very much supportive of," Davis said. "Transparency is not just something that I have been supportive of but something we have stated as a city council objective [ ] this fits very well with our attempt to inform the public of our challenges with our fiscal situation." ',
 'title': u'Project Toto aims to address questions regarding city finances',
 'url': 'https://theaggie.org/2017/02/14/project-toto-aims-to-address-questions-regarding-city-finances/'}
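
The author step can also be done on plain text, after the HTML has been stripped. The sketch below is an alternative to the approach in extract_aggie, not a replacement for it. It assumes the author line starts at the last "Written by" in the text, in any capitalization. The function name `split_author` and the sample author "Jane Doe" are made up for illustration.

```python
import re


def split_author(text):
    """Split article text into (body, author_line).

    Assumes the author line begins at the LAST 'Written by' in the text,
    in any capitalization. Returns an empty author if none is found.
    """
    matches = list(re.finditer(r"written\s+by\b", text, flags=re.IGNORECASE))
    if not matches:
        return text.strip(), ""
    start = matches[-1].start()
    return text[:start].strip(), text[start:].strip()


# "Jane Doe" is a made-up name; \u2014 is the dash used in Aggie bylines.
body, author = split_author(u"Some article text. Written By: Jane Doe \u2014 city@theaggie.org")
print(body)
print(author)
```

Taking the last match rather than the first guards against an article whose body happens to contain the phrase "written by".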

Exercise 1.3. Use your functions from exercises 1.1 and 1.2 to get a data frame of 60 Campus News articles and a data frame of 60 City News articles. Add a column to each that indicates the category, then combine them into one big data frame.

The "text" column of this data frame will be your corpus for natural language processing in exercise 1.4.

In [89]:
import pandas as pd
import numpy as np

#get urls
campus = url_scraper('https://theaggie.org/campus/', 4)
city = url_scraper('https://theaggie.org/city/', 4)
#make dataframes
campus = pd.DataFrame(campus, columns = ['url'])
campus['category'] = 'campus'
city = pd.DataFrame(city, columns = ['url'])
city['category'] = 'city'
#final df
davisNews = pd.concat([city, campus], ignore_index = True) #DataFrame.append was removed in pandas 2.0
In [90]:
#list to store all dict
article_list = []
#get all articles
for i in davisNews['url']:
    article_dict = extract_aggie(i)
    article_list.append(article_dict)
#final df   
article_df = pd.DataFrame(article_list)
article_df = article_df.merge(davisNews)
article_df
Out[90]:
author text title url category
0 Written by: Alana Joldersma –– city@theaggie.org Indoor facility will provide a cafeteria, clas... Construction of the All Student Center at Davi... https://theaggie.org/2016/11/14/construction-o... city
1 More turkeys, more tomfoolery, more accidental... Police Logs https://theaggie.org/2016/11/15/police-logs-3/ city
2 Written by: Raul Castellanos — city@theaggie.org Bernie Sanders visits Sacramento to rally for ... Return of the Bern https://theaggie.org/2016/11/15/return-of-the-... city
3 Written By: Bianca Antunez – city@theaggie.org Participants line up for Thanksgiving 5k befor... Yolo Food Bank's eighth Annual Running of the ... https://theaggie.org/2016/11/15/yolo-food-bank... city
4 Written By: Bianca Antunez — city@theaggie.org Election results are in; Davis community conce... Nov. 8 2016: An Election Day many may never fo... https://theaggie.org/2016/11/17/nov-8-2016-an-... city
5 Written By: Anya Rehon — city@theaggie.org The Yolo County community comes together for h... The Yolo Food Bank addresses food insecurity https://theaggie.org/2016/11/17/the-yolo-food-... city
6 Written By: Andie Joldersma — city@theaggie.org Protesters gather at State Capitol Building fo... Water is sacred, water is life https://theaggie.org/2016/11/20/water-is-sacre... city
7 Written by: Kaelyn Tuermer-Lee – city@theaggie... Helping the community through Davis Community ... Tune in to Watermelon Music's strings-for-food... https://theaggie.org/2016/11/21/tune-in-to-wat... city
8 Written By: Sam Solomon — city@theaggie.org Nov. 7 "Subject stated our pizza is ready and ... Police Logs https://theaggie.org/2016/11/22/police-logs-4/ city
9 Written By: Dianna Rivera — city@theaggie.org Don't be left in the dark now that Daylight Sa... Sun down, bike lights out https://theaggie.org/2016/11/22/sun-down-bike-... city
10 Written by: Samantha Solomon – city@theaggie.org Davis residents light candles, promote sanctua... Holding the Light https://theaggie.org/2016/11/27/holding-the-li... city
11 Written by: Sam Solomon – city@theaggie.org Looks like it's been an interesting week Nov. ... Police Logs https://theaggie.org/2016/11/27/police-logs-5/ city
12 Written by: Raul Castellanos Jr. — city@theagg... Ornamental piano outside Mishka's Café vandali... Public piano destroyed in act of vandalism https://theaggie.org/2016/11/27/public-piano-d... city
13 Written by: Anya Rehon — city@theaggie.org College alcohol use, high-risk drinking discus... Local residents attend Davis town hall meeting https://theaggie.org/2016/11/29/local-resident... city
14 Written by: Bianca Antunez – city@theaggie.org Davis community renews local parcel tax for K-... Measure H passes, voters support Davis schools https://theaggie.org/2016/11/29/measure-h-pass... city
15 Written By: Andie Joldersma — city@theaggie.org Citizens can expect more cost-competitive clea... Affordable, clean, green energy is coming to Y... https://theaggie.org/2016/11/30/affordable-cle... city
16 Written by: Andie Joldersma — city@theaggie.org Participants gathered for 5k, 10k, half marath... Davis Turkey Trot: more than just another race https://theaggie.org/2016/12/02/davis-turkey-t... city
17 Written by: Samantha Solomon — city@theaggie.org Students, activists call for solidarity with S... NoDAPL protest erupts in downtown Davis https://theaggie.org/2016/12/02/nodapl-protest... city
18 Written by: Sam Solomon – city@theaggie.org Another week of 'why did people call the polic... Police Logs https://theaggie.org/2016/12/04/police-logs-6/ city
19 Written By: Dianna Rivera – city@theaggie.org The Yolo County Children's Alliance celebrates... The season of giving https://theaggie.org/2016/12/04/the-season-of-... city
20 Written by: Sam Solomon — city@theaggie.org 35th annual tree lighting ceremony kicks off w... Children's Candlelight Parade lights up downto... https://theaggie.org/2016/12/05/childrens-cand... city
21 Written by: Samantha Solomon – city@theaggie.org Live-action retelling of Christmas poem promis... 'Twas the Night Before Christmas in Old Sacram... https://theaggie.org/2016/12/07/twas-the-night... city
22 Written By: Juno Bhardwaj-shah — city@theaggie... Speakers criticize "unconstitutional" system O... Bail reform advocates gather at annual fundraiser https://theaggie.org/2016/12/08/bail-reform-ad... city
23 Written by: Raul Castellanos Jr. — city@theagg... Volunteer program distributes bicycles, aims t... Bike Campaign offers bicycles to those who can... https://theaggie.org/2016/12/09/bike-campaign-... city
24 Written by: Kaelyn Tuermer-Lee — city@theaggie... Local author Matt Biers-Ariel's latest novel M... Sparks fly in Light the Fire https://theaggie.org/2016/12/09/sparks-fly-in-... city
25 Written By: Dianna Rivera — city@theaggie.org The Davis Manor Neighborhood hosts first Holid... Neighbors unite https://theaggie.org/2017/01/12/neighbors-unite/ city
26 Written by: Sam Solomon — city@theaggie.org Season's Greetings Edition Dec. 24 "Downstairs... Police Logs https://theaggie.org/2017/01/12/police-logs-7/ city
27 Written by: Andie Joldersma — city@theaggie.org Books by Mail program will deliver materials s... Yolo County Library materials to be more widel... https://theaggie.org/2017/01/15/yolo-county-li... city
28 Written by: Kaelyn Tuermer-Lee – city@theaggie... Petco Foundation awards $10,000 to local anima... A dog named Disney wins grant money for Rotts ... https://theaggie.org/2017/01/16/a-dog-named-di... city
29 Written by: Raul Castellanos Jr. — city@theagg... Downtown Davis offers a wide range of Thai cui... No such thing as too much Thai food https://theaggie.org/2017/01/17/no-such-thing-... city
... ... ... ... ... ...
90 Written by: Jayashri Padmanabhan — campus@thea... Kathleen Salvaty to oversee implementation of ... UC system hires Title IX coordinator https://theaggie.org/2017/02/02/uc-system-hire... campus
91 Written by: Kenton Goldsby — campus@theaggie.org Law to affect students selected to attend Nati... AB 1887 prevents use of state funds, including... https://theaggie.org/2017/02/05/ab-1887-preven... campus
92 Written by: Ivan Valenzuela — campus@theaggie.org Davis College Democrats host Dodd for question... Senator Bill Dodd visits UC Davis https://theaggie.org/2017/02/06/senator-bill-d... campus
93 Written by: Yvonne Leong — campus@theaggie.org Last week in Senate The ASUCD Senate meeting w... Last week in Senate https://theaggie.org/2017/02/09/last-week-in-s... campus
94 Written by: Jayashri Padmanabhan — campus@thea... Funding to expand innovation, entrepreneurship... UC Davis receives $2.2 million from Assembly B... https://theaggie.org/2017/02/09/uc-davis-recei... campus
95 Written by: Kenton Goldsby — campus@theaggie.org Regents approve tuition increase in 16-4 vote ... University of California Regents meet, approve... https://theaggie.org/2017/02/09/university-of-... campus
96 Written by: Alyssa Vandenberg  — campus@theagg... Chan replaces former senator Sam Park Michael ... Michael Chan sworn in as interim senator https://theaggie.org/2017/02/10/michael-chan-s... campus
97 Written by: Jeanna Totah — campus@theaggie.org Recipients each rewarded $25,000 for research ... 11 new Chancellor Fellows honored for 2016 https://theaggie.org/2017/02/12/11-new-chancel... campus
98 Written by: Aaron Liss — campus@theaggie.org Muslim Student Association curates five-part D... Muslim students respond to recent political ev... https://theaggie.org/2017/02/12/muslim-student... campus
99 Written by: Lindsay Floyd — campus@theaggie.or... Events to promote safe sex On Feb. 1, Student ... Sexcessful Campaign launched in time for Valen... https://theaggie.org/2017/02/12/sexcessful-cam... campus
100 Written by: Lindsay Floyd — campus@theaggie.org New fees to pay for equipment replacement To c... PE classes may charge additional fees https://theaggie.org/2017/02/13/pe-classes-may... campus
101 Written by: Ivan Valenzuela — campus@theaggie.org New showcase provides opportunity for students... Shields Library hosts new exhibit for Davis ce... https://theaggie.org/2017/02/14/shields-librar... campus
102 Written by: Demi Caceres — campus@theaggie.org Students promote fruit and vegetable meals via... Student Health and Counseling Services hosts "... https://theaggie.org/2017/02/14/student-health... campus
103 Written by: Alyssa Vandenberg and Emilie DeFaz... Executive: Josh Dalavai and Adilla Jamaludin I... 2017 ASUCD Winter Elections — Meet the Candidates https://theaggie.org/2017/02/16/2017-asucd-win... campus
104 Written by: Demi Caceres — campus@theaggie.org Last week in Senate The ASUCD Senate meeting w... Last week in Senate https://theaggie.org/2017/02/16/last-week-in-s... campus
105 Written by: Jayashri Padmanabhan — campus@thea... Conference entails full day of speakers, panel... UC Davis holds first mental health conference https://theaggie.org/2017/02/17/uc-davis-holds... campus
106 Written by: Kaitlyn Cheung — campus@theaggie.org Student protesters march from MU flagpole to M... UC Davis students participate in UC-wide #NoDA... https://theaggie.org/2017/02/17/uc-davis-stude... campus
107 Written by: Kimia Akbari — campus@theaggie.org Executive order has immediate consequences for... Trump's immigration ban affects UC Davis commu... https://theaggie.org/2017/02/19/trumps-immigra... campus
108 Written by: Kenton Goldsby — campus@theaggie.org Speakers, including Interim Chancellor Ralph J... UC Davis Global Affairs holds discussion on Pr... https://theaggie.org/2017/02/19/uc-davis-globa... campus
109 Written by: Ivan Valenzuela  — campus@theaggie... SR #7 asks university to increase capacity for... ASUCD Senate passes resolution submitting comm... https://theaggie.org/2017/02/20/asucd-senate-p... campus
110 Written by: Jeanna Totah — campus@theaggie.org Tighter policies require greater approval of o... Katehi controversy prompts decline of UC admin... https://theaggie.org/2017/02/20/katehi-controv... campus
111 Written by: Yvonne Leong — campus@theaggie.org UC Davis leads in sustainability with largest ... UC releases 2016 Annual Report on Sustainable ... https://theaggie.org/2017/02/20/uc-releases-20... campus
112 Written by: Aaron Liss  — campus@theaggie.org Students receive email warnings from UC Davis ... UC Davis experiences several recent hate-based... https://theaggie.org/2017/02/21/uc-davis-exper... campus
113 Written by: Alyssa Vandenberg  — campus@theagg... UC Board of Regents to vote on the appointment... UC President selects Gary May as new UC Davis ... https://theaggie.org/2017/02/21/uc-president-s... campus
114 Written by: Alyssa Vandenberg  — campus@theagg... Shaheen's name to remain on ballot, his votes ... Senate candidate Zaki Shaheen withdraws from race https://theaggie.org/2017/02/22/senate-candida... campus
115 Written by: Kimia Akbari — campus@theaggie.org Faculty, students recount personal tales of im... Academics unite in peaceful rally against immi... https://theaggie.org/2017/02/23/academics-unit... campus
116 Written by: Ivan Valenzuela — campus@theaggie.org Veto included revision abandoning creation of ... ASUCD President Alex Lee vetoes amendment for ... https://theaggie.org/2017/02/23/asucd-presiden... campus
117 Written by: Kenton Goldsby — campus@theaggie.org Opening date pushed back to May 1 Students hav... Memorial Union to reopen Spring Quarter https://theaggie.org/2017/02/23/memorial-union... campus
118 Written by: Aaron Liss and Raul Castellanos  —... Wells Fargo faces fraud, predatory lending cha... University of California, Davis City Council s... https://theaggie.org/2017/02/23/university-of-... campus
119 Written by: Alyssa Vandenberg  — campus@theagg... Six senators, new executive team elected Curre... 2017 Winter Quarter election results https://theaggie.org/2017/02/24/2017-winter-qu... campus

120 rows × 5 columns

Exercise 1.4. Use the Aggie corpus to answer the following questions. Use plots to support your analysis.

  • What topics does the Aggie cover the most? Do city articles typically cover different topics than campus articles?

  • What are the titles of the top 3 pairs of most similar articles? Examine each pair of articles. What words do they have in common?

  • Do you think this corpus is representative of the Aggie? Why or why not? What kinds of inference can this corpus support? Explain your reasoning.

Hints:

  • The nltk book and scikit-learn documentation may be helpful here.

  • You can determine whether city articles are "near" campus articles from the similarity matrix or with k-nearest neighbors.

  • If you want, you can use the wordcloud package to plot a word cloud. To install the package, run

    conda install -c https://conda.anaconda.org/amueller wordcloud

    in a terminal. Word clouds look nice and are easy to read, but are less precise than bar plots.
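
To make the similarity hint concrete, here is a minimal plain-Python sketch of cosine similarity between raw word-count vectors. It is a stand-in for illustration: scikit-learn's `TfidfVectorizer` and `cosine_similarity` are the more usual tools, and TF-IDF weighting would give different scores than raw counts. The function names and the three short sample texts are made up.

```python
import math
import re
from collections import Counter
from itertools import combinations


def word_counts(text):
    """Lowercase the text and count its words (letters and apostrophes only)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(n * n for n in a.values()))
    norm_b = math.sqrt(sum(n * n for n in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def most_similar_pairs(titles, texts, top=3):
    """Return the `top` highest-scoring (score, title, title) triples."""
    vectors = [word_counts(t) for t in texts]
    scored = [
        (cosine_similarity(vectors[i], vectors[j]), titles[i], titles[j])
        for i, j in combinations(range(len(texts)), 2)
    ]
    return sorted(scored, reverse=True)[:top]


# Made-up miniature corpus, for illustration only.
titles = ["A", "B", "C"]
texts = ["police logs this week", "police logs last week", "senate election results"]
pairs = most_similar_pairs(titles, texts, top=1)
print(pairs)
```

With the real corpus, `titles` and `texts` would be the "title" and "text" columns of the data frame. Removing stop words first keeps common words like "the" from dominating the scores.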

In [96]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from nltk.corpus import stopwords

text = [i for i in article_df['title']]
text = ' '.join(text)
wordcloud = WordCloud().generate(text)

plt.imshow(wordcloud)
plt.axis("off")
plt.show()

The California Aggie covers a range of topics in and around Davis and Yolo County. Judging from the word cloud above, its main topics over the past few months include the Police Logs, which appear to run every week, student protests, and ASUCD Senate campaigns. Other recurring subjects are food, the controversy surrounding the former Chancellor, and the 2016 presidential election and its outcome.
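
A word cloud only hints at relative frequency. As a rough cross-check, the words in the titles can be counted directly. The sketch below is not part of the analysis above: the function name `top_words` and the tiny stop word list are assumptions for illustration, and the sample titles are taken from the data frame output in exercise 1.3.

```python
import re
from collections import Counter

# A tiny, hand-picked stop word list for illustration only; a real analysis
# would more likely use the list from nltk or scikit-learn.
STOP_WORDS = {"a", "an", "and", "at", "for", "in", "of", "the", "to", "with"}


def top_words(titles, n=10):
    """Return the n most frequent non-stop words across the titles."""
    counts = Counter()
    for title in titles:
        words = re.findall(r"[a-z']+", title.lower())
        counts.update(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


sample_titles = ["Police Logs", "Police Logs", "Last week in Senate"]
print(top_words(sample_titles, n=2))
```

The resulting (word, count) pairs could be passed to a bar plot, which the hints note is more precise than a word cloud.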

In [95]:
city_df = article_df.loc[article_df['category'] == 'city']
city_text = [i for i in city_df['title']]
city_text = ' '.join(city_text)
wordcloud = WordCloud().generate(city_text)

plt.imshow(wordcloud)
plt.axis("off")
plt.show()
In [93]:
campus_df = article_df.loc[article_df['category'] == 'campus']
campus_text = [i for i in campus_df['title']]
campus_text = ' '.join(campus_text)
wordcloud = WordCloud().generate(campus_text)

plt.imshow(wordcloud)
plt.axis("off")
plt.show()

From the two word clouds above, we can see how the topics covered by the City and Campus sections of The California Aggie differ. The Campus section focuses on developments on campus, such as the ASUCD Senate race, our new Chancellor, and student protests. The City section is heavy with police logs and city-wide issues such as the vandalism at the Islamic Center, along with coverage of Yolo County and Sacramento. The two sections overlap mainly when an issue or event affects both the city and the campus. For example, 'protest' appears in both word clouds, since protests affect both.

This corpus is not very representative of The California Aggie. A city newspaper reports on what happens in its city and on how worldwide events affect it, and because the events that draw the public's attention keep changing, the paper has to publish new articles daily to cover them. As a result, past articles, especially ones drawn only from the past few months, are not very representative of what the Aggie might write in the future. The inferences this corpus can support concern recurring coverage: articles published every week, such as the Police Logs, and annual events such as the ASUCD Senate elections.