Assignment 6

In this assignment, you'll analyze a collection of data sets from the San Francisco Open Data Portal and Zillow. The data sets have been stored in the SQLite database sf_data.sqlite, which you can download here. The database contains the following tables:

Table Description
crime Crime reports dating back to 2010.
mobile_food_locations List of all locations where mobile food vendors sell.
mobile_food_permits List of all mobile food vendor permits. More details here.
mobile_food_schedule Schedules for mobile food vendors.
noise Noise complaints dating back to August 2015.
parking List of all parking lots.
parks List of all parks.
schools List of all schools.
zillow Zillow rent and housing statistics dating back to 1996. More details here.

The mobile_food_ tables are explicitly connected through the locationid and permit columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.

Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.

Shapefiles for San Francisco Neighborhoods are available here.

Exercise 1.1. Which mobile food vendor(s) sells at the most locations?

In [1]:
import sqlite3 as sql
import pandas as pd
import matplotlib.pylab as plt
plt.rcParams['figure.figsize'] = (12, 12)

db = sql.connect("sf_data.sqlite")
schedule = pd.read_sql("SELECT * FROM mobile_food_schedule", db)
locations = pd.read_sql("SELECT * FROM mobile_food_locations", db)
permits = pd.read_sql("SELECT * FROM mobile_food_permits", db)
schedule = schedule.drop(schedule.columns[[2,3,4]], axis = 1)
schedule = schedule.drop_duplicates()
merchants = schedule.merge(permits)
#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html
merchants['Applicant'].value_counts().plot(kind = 'bar')
plt.show()

May's Catering has requested permits to sell at 58 locations, but they have not been approved according to this dataset.

In [2]:
merchants_approved = merchants.loc[merchants['Status'] == 'APPROVED']
merchants_approved['Applicant'].value_counts().plot(kind="bar")
plt.show()

This means that Park's Catering sells at the most locations, because they have been approved to sell food at 23 locations. Maybe in the future May's Catering will be able to sell at all 58 locations, but until then Park's Catering sells food at the most locations.

Exercise 1.2. Ask and use the database to analyze 5 questions about San Francisco. For each question, write at least 150 words and support your answer with plots. Make a map for at least 2 of the 5 questions.

You should try to come up with some questions on your own, but these are examples of reasonable questions:

  • Which parts of the city are the most and least expensive?
  • Which parts of the city are the most dangerous (and at what times)?
  • Are noise complaints and mobile food vendors related?
  • What are the best times and places to find food trucks?
  • Is there a relationship between housing prices and any of the other tables?

Please make sure to clearly state each of your questions in your submission.

Question 1: Which parts of the city are the most and least expensive?

In [3]:
zillow = pd.read_sql("SELECT * FROM zillow", db, parse_dates = 'Date')
zillow = zillow.dropna()
fig, ax = plt.subplots()
zillow1 = zillow.pivot(index = 'Date', columns = 'RegionName', values = 'ZriPerSqft_AllHomes')
[ax.plot_date(zillow1.index, zillow1[i], '-') for i in list(zillow1)]
#http://matplotlib.org/users/legend_guide.html
ax.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=4,
           ncol=5, mode="expand", borderaxespad=0.)
plt.show()
latest = pd.DataFrame([[i, zillow1[i].dropna().iloc[-1]] for i in list(zillow1)], columns = ['GEOID10', 'ZriPerSqft_AllHomes'])
latest['GEOID10'] = latest['GEOID10'].astype(str)
latest
Out[3]:
GEOID10 ZriPerSqft_AllHomes
0 94080 2.708
1 94102 4.682
2 94103 4.398
3 94105 5.094
4 94107 4.362
5 94109 4.638
6 94110 3.960
7 94112 3.150
8 94114 4.168
9 94115 4.372
10 94116 3.122
11 94117 3.848
12 94121 3.108
13 94122 3.022
14 94123 4.704
15 94124 2.982
16 94127 3.144
17 94131 3.862
18 94132 3.064
19 94133 4.786
20 94134 3.064
21 94158 4.332
In [4]:
import shapely.geometry as geom
import geopandas as gpd

zips = gpd.read_file('zips/cb_2015_us_zcta510_500k.shp')
#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html
zips = zips[zips['GEOID10'].isin(list(map(str, zillow1)))]
#http://geopandas.org/mapping.html
zips1 = zips.merge(latest, on = 'GEOID10')
zips1.plot(column = 'ZriPerSqft_AllHomes', cmap='summer')
plt.show()

Using the median of the estimated monthly rent price of all homes per square foot, we find that the the zip code with the most expensive houses is 94105, which is most yellow portion of the map above. Likewise, the cheapest houses in San Francisco are located at zip code 94080, which is not part of the contiguous San Francisco. It is located close to the San Francisco airport and is the bottom most portion of map above. The map above is made using the latest (non-NA) median of the estimated monthly rent price of all homes per square foot value in our data. The time series plot above shows how rental prices in San Francisco have increased over the last 6 years.

In [5]:
plt.rcParams['figure.figsize'] = (12, 12)
fig, ax = plt.subplots()
zillow1 = zillow.pivot(index = 'Date', columns = 'RegionName', values = 'MedianSoldPricePerSqft_AllHomes')
[ax.plot_date(zillow1.index, zillow1[i], '-') for i in list(zillow1)]
ax.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=4,
           ncol=5, mode="expand", borderaxespad=0.)
plt.show()
latest = pd.DataFrame([[i, zillow1[i].dropna().iloc[-1]] for i in list(zillow1)], columns = ['GEOID10', 'MedianSoldPricePerSqft_AllHomes'])
latest['GEOID10'] = latest['GEOID10'].astype(str)
latest
Out[5]:
GEOID10 MedianSoldPricePerSqft_AllHomes
0 94080 628.616900
1 94102 1189.393939
2 94103 985.085000
3 94105 1188.261200
4 94107 1007.227200
5 94109 1006.101100
6 94110 1087.436500
7 94112 755.397000
8 94114 1109.307600
9 94115 1044.638300
10 94116 840.558200
11 94117 1017.780800
12 94121 826.116300
13 94122 798.023500
14 94123 1276.685200
15 94124 583.533173
16 94127 904.136200
17 94131 1123.055600
18 94132 639.163277
19 94133 1063.829787
20 94134 650.256100
21 94158 1120.358515
In [6]:
zips2 = zips.merge(latest, on = 'GEOID10')
zips2.plot(column = 'MedianSoldPricePerSqft_AllHomes', cmap='summer')
plt.show()

We can do the same analysis for Median Sold Home Price, and find that zip code 94123 is the most expensive and 94080 is still the least expensive. With that, we can conclude that the most expensive areas in San Francisco are the ones which are close to the water and the main portion of the city. The least expensive areas are those that inland and further away from Downtown San Francisco (around Market Street).

Question 2: Which parts of the city are the most dangerous (and at what times)?

In [7]:
crime = pd.read_sql("SELECT * FROM crime", db, parse_dates = 'Datetime')
crime['PdDistrict'].value_counts().plot(kind = 'bar')
plt.show()

The Southern and Mission Police Departments have the greatest number of incidents. From this, we can assume that those areas are the most dangerous in San Francisco. Likewise, the Richmond and Park Departments have the least number of incidents, so we can assume that those areas are the least dangerous in San Francisco.

In [8]:
crime['hour'] = crime.Datetime.apply(lambda x: x.hour)
crime['hour'].value_counts().plot(kind = 'bar')
plt.show()

Hour 18 and 17 or 6 PM and 5 PM are the hours with the most incidents. That is very interesting, since I would have expected it to be around 9 PM or later, since San Francisco is a urban city with a lot of nightlife. Likewise, 5 AM and 4 AM are the hours with the least incidents, which makes sense since people are still sleeping and the nightlife activity should have died down by then.

In [9]:
crime['both'] = crime['PdDistrict'] + ' ' + crime['hour'].map(str)
both = crime['both'].value_counts()
both
Out[9]:
SOUTHERN 18     13422
SOUTHERN 17     12901
SOUTHERN 19     12445
SOUTHERN 16     11576
SOUTHERN 12     11108
SOUTHERN 20     10752
SOUTHERN 15     10672
SOUTHERN 14     10074
SOUTHERN 13      9925
SOUTHERN 21      9862
SOUTHERN 22      9574
SOUTHERN 0       9487
SOUTHERN 23      8973
SOUTHERN 11      8952
MISSION 0        8417
SOUTHERN 10      8296
MISSION 18       8210
MISSION 17       7950
NORTHERN 18      7862
NORTHERN 19      7630
MISSION 12       7603
SOUTHERN 9       7584
MISSION 23       7572
MISSION 22       7556
MISSION 19       7477
MISSION 16       7313
MISSION 20       7197
CENTRAL 18       7088
NORTHERN 17      7065
MISSION 21       7013
                ...  
INGLESIDE 3      1517
BAYVIEW 3        1445
PARK 1           1444
TENDERLOIN 5     1432
NORTHERN 5       1429
CENTRAL 6        1340
TENDERLOIN 3     1296
RICHMOND 7       1258
INGLESIDE 6      1230
CENTRAL 4        1212
RICHMOND 2       1205
TARAVAL 3        1201
PARK 2           1175
BAYVIEW 4        1167
CENTRAL 5        1149
BAYVIEW 5        1066
INGLESIDE 4      1046
TARAVAL 6        1045
TENDERLOIN 4      950
PARK 3            921
PARK 6            918
INGLESIDE 5       905
TARAVAL 4         770
RICHMOND 3        753
RICHMOND 6        727
TARAVAL 5         655
PARK 4            632
PARK 5            586
RICHMOND 4        517
RICHMOND 5        479
Name: both, dtype: int64

Taking into account Police Department District and Hour, we learn more about how big the districts actually are and how the compare in the amount of incidents they respond to. The Southern District has the top 15 amount of incidents taken, because they are in the heart of San Francisco. To answer the question what parts of the city are the most dangerous and during what times, it turns out that the Southern and Mission districts, the heart of San Francisco, are always the most dangerous places to be during anytime.

Q3: What are the clearance rates for the different districts in San Francisco? What are the most common categories of incidents?

In [10]:
import numpy as np

crime['clearance'] = np.where(crime['Resolution'] == 'NONE', 'NO', 'YES')
no = pd.DataFrame(crime.groupby(["clearance", "PdDistrict"]).size()['NO'], columns = ['NO']).transpose()
yes = pd.DataFrame(crime.groupby(["clearance", "PdDistrict"]).size()['YES'], columns = ['YES']).transpose()
clearance = yes.append(no).transpose()
clearance['Total'] = clearance['YES'] + clearance['NO']
clearance['Percentage'] = clearance['YES']/clearance['Total']
clearance.plot.bar(y = 'Percentage')
plt.show()
clearance
Out[10]:
YES NO Total Percentage
PdDistrict
BAYVIEW 41735 59920 101655 0.410555
CENTRAL 28936 80306 109242 0.264880
INGLESIDE 30758 58286 89044 0.345425
MISSION 57866 78350 136216 0.424811
NORTHERN 36371 87994 124365 0.292454
PARK 20705 40269 60974 0.339571
RICHMOND 15061 41197 56258 0.267713
SOUTHERN 70923 125255 196178 0.361524
TARAVAL 22667 52792 75459 0.300388
TENDERLOIN 48907 35102 84009 0.582164

The Tenderloin, Mission, and Bayview districts clear the highest percentage of their cases. This is particularly impressive for the Mission, since they have the 2nd most cases behind the Southern district. The Central, Richmond, and Northern districts are really lagging behind on their case closure rates. In particular, Richmond has the lowest amount of incidents, yet they also have the 2nd worst clearance rate. The Tenderloin district, for example, has twice the clearance rate that the Richmond district has.

In [11]:
crime['Category'].value_counts().plot(kind = 'bar')
plt.show()

Larceny, assualt, and vandalism are the three major criminal offenses that happen in San Francisco. This is probably true for most major cities across the United States of America. What I am suprised by is the frequency of close frequency of prostitution, DUI, and kidnapping. In a large city like San Francisco, the frequency of DUI is diminished by the prevalance of taxis and other car services, but to see that those two crimes have about the same frequency is suprising to say the least.

Question 4: Do areas with more expensive housing have more or less frequency of different sorts of crime?

In [12]:
from shapely.geometry import Point

#https://geohackweek.github.io/vector/04-geopandas-intro/
crime_where = crime[crime['Category'] == 'PROSTITUTION']
crime_loc = [Point(xy) for xy in zip(crime_where['Lon'], crime_where['Lat'])]
crime_where = crime_where[['Category', 'Lon', 'Lat']]
crime_gdf = gpd.GeoDataFrame(crime_where, geometry = crime_loc)
fig, ax = plt.subplots()
ax.set_aspect('equal')
crime_gdf.plot(ax = ax, color = 'red', markersize = 2)
zips1.plot(ax = ax, column = 'ZriPerSqft_AllHomes', cmap='summer')
plt.show()

The incidents of prostitution seem to correlate with more expensive neighborhoods. It seems that prostitution in San Francisco is mainly located in two clusters close the heart of San Francisco. This makes sense because prostitution usually follows the nightlife in a city. In this case, the nightlife of San Francisco is everywhere, but mainly around the Mission District and above Market Street, which is also not coincidentally where the incidents of prostitution occur.

In [13]:
crime_where = crime[crime['Category'] == 'SUICIDE']
crime_loc = [Point(xy) for xy in zip(crime_where['Lon'], crime_where['Lat'])]
crime_where = crime_where[['Category', 'Lon', 'Lat']]
crime_gdf = gpd.GeoDataFrame(crime_where, geometry = crime_loc)
fig, ax = plt.subplots()
ax.set_aspect('equal')
crime_gdf.plot(ax = ax, color = 'red', markersize = 2)
zips1.plot(ax = ax, column = 'ZriPerSqft_AllHomes', cmap='summer')
plt.show()

Incidents of suicide in San Francisco seem to spread out around the city. There is a small cluster around the Financial District and Market Street, but the number of people around there every day compared to Southern San Francisco is much larger. Taking everything into account, it seems that people do not come to San Francisco to commit suicide. There does not seem to be a tendency to commit suicide in more or less expensive zip codes.

In [14]:
crime_where = crime[crime['Category'] == 'ARSON']
crime_loc = [Point(xy) for xy in zip(crime_where['Lon'], crime_where['Lat'])]
crime_where = crime_where[['Category', 'Lon', 'Lat']]
crime_gdf = gpd.GeoDataFrame(crime_where, geometry = crime_loc)
fig, ax = plt.subplots()
ax.set_aspect('equal')
crime_gdf.plot(ax = ax, color = 'red', markersize = 2)
zips1.plot(ax = ax, column = 'ZriPerSqft_AllHomes', cmap='summer')
plt.show()

In San Francisco, there are two main clusters of incidents of arson. One in the less expensive zip code, and another that spans multiple expensive zip codes. This is probably because of the confounding variables in each of the zip codes. In the heart of San Francisco, there are a lot businesses, so there are many targets for arson. The less expensive zip code contains warehouses and shipyards, so there may be a tendency for arson happening there as well.