Assignment 6¶

In this assignment, you'll analyze a collection of data sets from the San Francisco Open Data Portal and Zillow. The data sets have been stored in the SQLite database sf_data.sqlite, which you can download here. The database contains the following tables:

Table	Description
`crime`	Crime reports dating back to 2010.
`mobile_food_locations`	List of all locations where mobile food vendors sell.
`mobile_food_permits`	List of all mobile food vendor permits. More details here.
`mobile_food_schedule`	Schedules for mobile food vendors.
`noise`	Noise complaints dating back to August 2015.
`parking`	List of all parking lots.
`parks`	List of all parks.
`schools`	List of all schools.
`zillow`	Zillow rent and housing statistics dating back to 1996. More details here.

The mobile_food_ tables are explicitly connected through the locationid and permit columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.

Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.

Shapefiles for San Francisco Neighborhoods are available here.

Exercise 1.1. Which mobile food vendor(s) sells at the most locations?

import sqlite3 as sql
import pandas as pd
import matplotlib.pylab as plt
plt.rcParams['figure.figsize'] = (12, 12)

db = sql.connect("sf_data.sqlite")
schedule = pd.read_sql("SELECT * FROM mobile_food_schedule", db)
locations = pd.read_sql("SELECT * FROM mobile_food_locations", db)
permits = pd.read_sql("SELECT * FROM mobile_food_permits", db)
schedule = schedule.drop(schedule.columns[[2,3,4]], axis = 1)
schedule = schedule.drop_duplicates()
merchants = schedule.merge(permits)
#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html
merchants['Applicant'].value_counts().plot(kind = 'bar')
plt.show()

May's Catering has requested permits to sell at 58 locations, but they have not been approved according to this dataset.

merchants_approved = merchants.loc[merchants['Status'] == 'APPROVED']
merchants_approved['Applicant'].value_counts().plot(kind="bar")
plt.show()

This means that Park's Catering sells at the most locations, because they have been approved to sell food at 23 locations. Maybe in the future May's Catering will be able to sell at all 58 locations, but until then Park's Catering sells food at the most locations.

Exercise 1.2. Ask and use the database to analyze 5 questions about San Francisco. For each question, write at least 150 words and support your answer with plots. Make a map for at least 2 of the 5 questions.

You should try to come up with some questions on your own, but these are examples of reasonable questions:

Which parts of the city are the most and least expensive?
Which parts of the city are the most dangerous (and at what times)?
Are noise complaints and mobile food vendors related?
What are the best times and places to find food trucks?
Is there a relationship between housing prices and any of the other tables?

Please make sure to clearly state each of your questions in your submission.

Question 1: Which parts of the city are the most and least expensive?

zillow = pd.read_sql("SELECT * FROM zillow", db, parse_dates = 'Date')
zillow = zillow.dropna()
fig, ax = plt.subplots()
zillow1 = zillow.pivot(index = 'Date', columns = 'RegionName', values = 'ZriPerSqft_AllHomes')
[ax.plot_date(zillow1.index, zillow1[i], '-') for i in list(zillow1)]
#http://matplotlib.org/users/legend_guide.html
ax.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=4,
           ncol=5, mode="expand", borderaxespad=0.)
plt.show()
latest = pd.DataFrame([[i, zillow1[i].dropna().iloc[-1]] for i in list(zillow1)], columns = ['GEOID10', 'ZriPerSqft_AllHomes'])
latest['GEOID10'] = latest['GEOID10'].astype(str)
latest

import shapely.geometry as geom
import geopandas as gpd

zips = gpd.read_file('zips/cb_2015_us_zcta510_500k.shp')
#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html
zips = zips[zips['GEOID10'].isin(list(map(str, zillow1)))]
#http://geopandas.org/mapping.html
zips1 = zips.merge(latest, on = 'GEOID10')
zips1.plot(column = 'ZriPerSqft_AllHomes', cmap='summer')
plt.show()

Using the median of the estimated monthly rent price of all homes per square foot, we find that the the zip code with the most expensive houses is 94105, which is most yellow portion of the map above. Likewise, the cheapest houses in San Francisco are located at zip code 94080, which is not part of the contiguous San Francisco. It is located close to the San Francisco airport and is the bottom most portion of map above. The map above is made using the latest (non-NA) median of the estimated monthly rent price of all homes per square foot value in our data. The time series plot above shows how rental prices in San Francisco have increased over the last 6 years.

plt.rcParams['figure.figsize'] = (12, 12)
fig, ax = plt.subplots()
zillow1 = zillow.pivot(index = 'Date', columns = 'RegionName', values = 'MedianSoldPricePerSqft_AllHomes')
[ax.plot_date(zillow1.index, zillow1[i], '-') for i in list(zillow1)]
ax.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=4,
           ncol=5, mode="expand", borderaxespad=0.)
plt.show()
latest = pd.DataFrame([[i, zillow1[i].dropna().iloc[-1]] for i in list(zillow1)], columns = ['GEOID10', 'MedianSoldPricePerSqft_AllHomes'])
latest['GEOID10'] = latest['GEOID10'].astype(str)
latest

zips2 = zips.merge(latest, on = 'GEOID10')
zips2.plot(column = 'MedianSoldPricePerSqft_AllHomes', cmap='summer')
plt.show()

We can do the same analysis for Median Sold Home Price, and find that zip code 94123 is the most expensive and 94080 is still the least expensive. With that, we can conclude that the most expensive areas in San Francisco are the ones which are close to the water and the main portion of the city. The least expensive areas are those that inland and further away from Downtown San Francisco (around Market Street).

Question 2: Which parts of the city are the most dangerous (and at what times)?

crime = pd.read_sql("SELECT * FROM crime", db, parse_dates = 'Datetime')
crime['PdDistrict'].value_counts().plot(kind = 'bar')
plt.show()

The Southern and Mission Police Departments have the greatest number of incidents. From this, we can assume that those areas are the most dangerous in San Francisco. Likewise, the Richmond and Park Departments have the least number of incidents, so we can assume that those areas are the least dangerous in San Francisco.

crime['hour'] = crime.Datetime.apply(lambda x: x.hour)
crime['hour'].value_counts().plot(kind = 'bar')
plt.show()

Hour 18 and 17 or 6 PM and 5 PM are the hours with the most incidents. That is very interesting, since I would have expected it to be around 9 PM or later, since San Francisco is a urban city with a lot of nightlife. Likewise, 5 AM and 4 AM are the hours with the least incidents, which makes sense since people are still sleeping and the nightlife activity should have died down by then.

crime['both'] = crime['PdDistrict'] + ' ' + crime['hour'].map(str)
both = crime['both'].value_counts()
both

SOUTHERN 18     13422
SOUTHERN 17     12901
SOUTHERN 19     12445
SOUTHERN 16     11576
SOUTHERN 12     11108
SOUTHERN 20     10752
SOUTHERN 15     10672
SOUTHERN 14     10074
SOUTHERN 13      9925
SOUTHERN 21      9862
SOUTHERN 22      9574
SOUTHERN 0       9487
SOUTHERN 23      8973
SOUTHERN 11      8952
MISSION 0        8417
SOUTHERN 10      8296
MISSION 18       8210
MISSION 17       7950
NORTHERN 18      7862
NORTHERN 19      7630
MISSION 12       7603
SOUTHERN 9       7584
MISSION 23       7572
MISSION 22       7556
MISSION 19       7477
MISSION 16       7313
MISSION 20       7197
CENTRAL 18       7088
NORTHERN 17      7065
MISSION 21       7013
                ...  
INGLESIDE 3      1517
BAYVIEW 3        1445
PARK 1           1444
TENDERLOIN 5     1432
NORTHERN 5       1429
CENTRAL 6        1340
TENDERLOIN 3     1296
RICHMOND 7       1258
INGLESIDE 6      1230
CENTRAL 4        1212
RICHMOND 2       1205
TARAVAL 3        1201
PARK 2           1175
BAYVIEW 4        1167
CENTRAL 5        1149
BAYVIEW 5        1066
INGLESIDE 4      1046
TARAVAL 6        1045
TENDERLOIN 4      950
PARK 3            921
PARK 6            918
INGLESIDE 5       905
TARAVAL 4         770
RICHMOND 3        753
RICHMOND 6        727
TARAVAL 5         655
PARK 4            632
PARK 5            586
RICHMOND 4        517
RICHMOND 5        479
Name: both, dtype: int64

Taking into account Police Department District and Hour, we learn more about how big the districts actually are and how the compare in the amount of incidents they respond to. The Southern District has the top 15 amount of incidents taken, because they are in the heart of San Francisco. To answer the question what parts of the city are the most dangerous and during what times, it turns out that the Southern and Mission districts, the heart of San Francisco, are always the most dangerous places to be during anytime.

Q3: What are the clearance rates for the different districts in San Francisco? What are the most common categories of incidents?

import numpy as np

crime['clearance'] = np.where(crime['Resolution'] == 'NONE', 'NO', 'YES')
no = pd.DataFrame(crime.groupby(["clearance", "PdDistrict"]).size()['NO'], columns = ['NO']).transpose()
yes = pd.DataFrame(crime.groupby(["clearance", "PdDistrict"]).size()['YES'], columns = ['YES']).transpose()
clearance = yes.append(no).transpose()
clearance['Total'] = clearance['YES'] + clearance['NO']
clearance['Percentage'] = clearance['YES']/clearance['Total']
clearance.plot.bar(y = 'Percentage')
plt.show()
clearance

The Tenderloin, Mission, and Bayview districts clear the highest percentage of their cases. This is particularly impressive for the Mission, since they have the 2nd most cases behind the Southern district. The Central, Richmond, and Northern districts are really lagging behind on their case closure rates. In particular, Richmond has the lowest amount of incidents, yet they also have the 2nd worst clearance rate. The Tenderloin district, for example, has twice the clearance rate that the Richmond district has.

crime['Category'].value_counts().plot(kind = 'bar')
plt.show()

Larceny, assualt, and vandalism are the three major criminal offenses that happen in San Francisco. This is probably true for most major cities across the United States of America. What I am suprised by is the frequency of close frequency of prostitution, DUI, and kidnapping. In a large city like San Francisco, the frequency of DUI is diminished by the prevalance of taxis and other car services, but to see that those two crimes have about the same frequency is suprising to say the least.

Question 4: Do areas with more expensive housing have more or less frequency of different sorts of crime?

from shapely.geometry import Point

#https://geohackweek.github.io/vector/04-geopandas-intro/
crime_where = crime[crime['Category'] == 'PROSTITUTION']
crime_loc = [Point(xy) for xy in zip(crime_where['Lon'], crime_where['Lat'])]
crime_where = crime_where[['Category', 'Lon', 'Lat']]
crime_gdf = gpd.GeoDataFrame(crime_where, geometry = crime_loc)
fig, ax = plt.subplots()
ax.set_aspect('equal')
crime_gdf.plot(ax = ax, color = 'red', markersize = 2)
zips1.plot(ax = ax, column = 'ZriPerSqft_AllHomes', cmap='summer')
plt.show()

The incidents of prostitution seem to correlate with more expensive neighborhoods. It seems that prostitution in San Francisco is mainly located in two clusters close the heart of San Francisco. This makes sense because prostitution usually follows the nightlife in a city. In this case, the nightlife of San Francisco is everywhere, but mainly around the Mission District and above Market Street, which is also not coincidentally where the incidents of prostitution occur.

crime_where = crime[crime['Category'] == 'SUICIDE']
crime_loc = [Point(xy) for xy in zip(crime_where['Lon'], crime_where['Lat'])]
crime_where = crime_where[['Category', 'Lon', 'Lat']]
crime_gdf = gpd.GeoDataFrame(crime_where, geometry = crime_loc)
fig, ax = plt.subplots()
ax.set_aspect('equal')
crime_gdf.plot(ax = ax, color = 'red', markersize = 2)
zips1.plot(ax = ax, column = 'ZriPerSqft_AllHomes', cmap='summer')
plt.show()

Incidents of suicide in San Francisco seem to spread out around the city. There is a small cluster around the Financial District and Market Street, but the number of people around there every day compared to Southern San Francisco is much larger. Taking everything into account, it seems that people do not come to San Francisco to commit suicide. There does not seem to be a tendency to commit suicide in more or less expensive zip codes.

crime_where = crime[crime['Category'] == 'ARSON']
crime_loc = [Point(xy) for xy in zip(crime_where['Lon'], crime_where['Lat'])]
crime_where = crime_where[['Category', 'Lon', 'Lat']]
crime_gdf = gpd.GeoDataFrame(crime_where, geometry = crime_loc)
fig, ax = plt.subplots()
ax.set_aspect('equal')
crime_gdf.plot(ax = ax, color = 'red', markersize = 2)
zips1.plot(ax = ax, column = 'ZriPerSqft_AllHomes', cmap='summer')
plt.show()

In San Francisco, there are two main clusters of incidents of arson. One in the less expensive zip code, and another that spans multiple expensive zip codes. This is probably because of the confounding variables in each of the zip codes. In the heart of San Francisco, there are a lot businesses, so there are many targets for arson. The less expensive zip code contains warehouses and shipyards, so there may be a tendency for arson happening there as well.

	GEOID10	ZriPerSqft_AllHomes
0	94080	2.708
1	94102	4.682
2	94103	4.398
3	94105	5.094
4	94107	4.362
5	94109	4.638
6	94110	3.960
7	94112	3.150
8	94114	4.168
9	94115	4.372
10	94116	3.122
11	94117	3.848
12	94121	3.108
13	94122	3.022
14	94123	4.704
15	94124	2.982
16	94127	3.144
17	94131	3.862
18	94132	3.064
19	94133	4.786
20	94134	3.064
21	94158	4.332

	GEOID10	MedianSoldPricePerSqft_AllHomes
0	94080	628.616900
1	94102	1189.393939
2	94103	985.085000
3	94105	1188.261200
4	94107	1007.227200
5	94109	1006.101100
6	94110	1087.436500
7	94112	755.397000
8	94114	1109.307600
9	94115	1044.638300
10	94116	840.558200
11	94117	1017.780800
12	94121	826.116300
13	94122	798.023500
14	94123	1276.685200
15	94124	583.533173
16	94127	904.136200
17	94131	1123.055600
18	94132	639.163277
19	94133	1063.829787
20	94134	650.256100
21	94158	1120.358515

	YES	NO	Total	Percentage
PdDistrict
BAYVIEW	41735	59920	101655	0.410555
CENTRAL	28936	80306	109242	0.264880
INGLESIDE	30758	58286	89044	0.345425
MISSION	57866	78350	136216	0.424811
NORTHERN	36371	87994	124365	0.292454
PARK	20705	40269	60974	0.339571
RICHMOND	15061	41197	56258	0.267713
SOUTHERN	70923	125255	196178	0.361524
TARAVAL	22667	52792	75459	0.300388
TENDERLOIN	48907	35102	84009	0.582164