In this assignment, you'll analyze a collection of data sets from the San Francisco Open Data Portal and Zillow. The data sets have been stored in the SQLite database sf_data.sqlite
, which you can download here. The database contains the following tables:
Table | Description |
---|---|
crime |
Crime reports dating back to 2010. |
mobile_food_locations |
List of all locations where mobile food vendors sell. |
mobile_food_permits |
List of all mobile food vendor permits. More details here. |
mobile_food_schedule |
Schedules for mobile food vendors. |
noise |
Noise complaints dating back to August 2015. |
parking |
List of all parking lots. |
parks |
List of all parks. |
schools |
List of all schools. |
zillow |
Zillow rent and housing statistics dating back to 1996. More details here. |
The mobile_food_
tables are explicitly connected through the locationid
and permit
columns. The other tables are not connected, but you may be able to connect them using dates, latitude/longitude, or postal codes.
Shapefiles for US postal codes are available here. These may be useful for converting latitude/longitude to postal codes.
Shapefiles for San Francisco Neighborhoods are available here.
Exercise 1.1. Which mobile food vendor(s) sells at the most locations?
import sqlite3 as sql
import pandas as pd
import matplotlib.pylab as plt
plt.rcParams['figure.figsize'] = (12, 12)
db = sql.connect("sf_data.sqlite")
schedule = pd.read_sql("SELECT * FROM mobile_food_schedule", db)
locations = pd.read_sql("SELECT * FROM mobile_food_locations", db)
permits = pd.read_sql("SELECT * FROM mobile_food_permits", db)
schedule = schedule.drop(schedule.columns[[2,3,4]], axis = 1)
schedule = schedule.drop_duplicates()
merchants = schedule.merge(permits)
#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Series.value_counts.html
merchants['Applicant'].value_counts().plot(kind = 'bar')
plt.show()
May's Catering has requested permits to sell at 58 locations, but they have not been approved according to this dataset.
merchants_approved = merchants.loc[merchants['Status'] == 'APPROVED']
merchants_approved['Applicant'].value_counts().plot(kind="bar")
plt.show()
This means that Park's Catering sells at the most locations, because they have been approved to sell food at 23 locations. Maybe in the future May's Catering will be able to sell at all 58 locations, but until then Park's Catering sells food at the most locations.
Exercise 1.2. Ask and use the database to analyze 5 questions about San Francisco. For each question, write at least 150 words and support your answer with plots. Make a map for at least 2 of the 5 questions.
You should try to come up with some questions on your own, but these are examples of reasonable questions:
Please make sure to clearly state each of your questions in your submission.
zillow = pd.read_sql("SELECT * FROM zillow", db, parse_dates = 'Date')
zillow = zillow.dropna()
fig, ax = plt.subplots()
zillow1 = zillow.pivot(index = 'Date', columns = 'RegionName', values = 'ZriPerSqft_AllHomes')
[ax.plot_date(zillow1.index, zillow1[i], '-') for i in list(zillow1)]
#http://matplotlib.org/users/legend_guide.html
ax.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=4,
ncol=5, mode="expand", borderaxespad=0.)
plt.show()
latest = pd.DataFrame([[i, zillow1[i].dropna().iloc[-1]] for i in list(zillow1)], columns = ['GEOID10', 'ZriPerSqft_AllHomes'])
latest['GEOID10'] = latest['GEOID10'].astype(str)
latest
import shapely.geometry as geom
import geopandas as gpd
zips = gpd.read_file('zips/cb_2015_us_zcta510_500k.shp')
#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.isin.html
zips = zips[zips['GEOID10'].isin(list(map(str, zillow1)))]
#http://geopandas.org/mapping.html
zips1 = zips.merge(latest, on = 'GEOID10')
zips1.plot(column = 'ZriPerSqft_AllHomes', cmap='summer')
plt.show()
Using the median of the estimated monthly rent price of all homes per square foot, we find that the the zip code with the most expensive houses is 94105, which is most yellow portion of the map above. Likewise, the cheapest houses in San Francisco are located at zip code 94080, which is not part of the contiguous San Francisco. It is located close to the San Francisco airport and is the bottom most portion of map above. The map above is made using the latest (non-NA) median of the estimated monthly rent price of all homes per square foot value in our data. The time series plot above shows how rental prices in San Francisco have increased over the last 6 years.
plt.rcParams['figure.figsize'] = (12, 12)
fig, ax = plt.subplots()
zillow1 = zillow.pivot(index = 'Date', columns = 'RegionName', values = 'MedianSoldPricePerSqft_AllHomes')
[ax.plot_date(zillow1.index, zillow1[i], '-') for i in list(zillow1)]
ax.legend(bbox_to_anchor=(0., 1.02, 1., .102), loc=4,
ncol=5, mode="expand", borderaxespad=0.)
plt.show()
latest = pd.DataFrame([[i, zillow1[i].dropna().iloc[-1]] for i in list(zillow1)], columns = ['GEOID10', 'MedianSoldPricePerSqft_AllHomes'])
latest['GEOID10'] = latest['GEOID10'].astype(str)
latest
zips2 = zips.merge(latest, on = 'GEOID10')
zips2.plot(column = 'MedianSoldPricePerSqft_AllHomes', cmap='summer')
plt.show()
We can do the same analysis for Median Sold Home Price, and find that zip code 94123 is the most expensive and 94080 is still the least expensive. With that, we can conclude that the most expensive areas in San Francisco are the ones which are close to the water and the main portion of the city. The least expensive areas are those that inland and further away from Downtown San Francisco (around Market Street).
crime = pd.read_sql("SELECT * FROM crime", db, parse_dates = 'Datetime')
crime['PdDistrict'].value_counts().plot(kind = 'bar')
plt.show()
The Southern and Mission Police Departments have the greatest number of incidents. From this, we can assume that those areas are the most dangerous in San Francisco. Likewise, the Richmond and Park Departments have the least number of incidents, so we can assume that those areas are the least dangerous in San Francisco.
crime['hour'] = crime.Datetime.apply(lambda x: x.hour)
crime['hour'].value_counts().plot(kind = 'bar')
plt.show()
Hour 18 and 17 or 6 PM and 5 PM are the hours with the most incidents. That is very interesting, since I would have expected it to be around 9 PM or later, since San Francisco is a urban city with a lot of nightlife. Likewise, 5 AM and 4 AM are the hours with the least incidents, which makes sense since people are still sleeping and the nightlife activity should have died down by then.
crime['both'] = crime['PdDistrict'] + ' ' + crime['hour'].map(str)
both = crime['both'].value_counts()
both
Taking into account Police Department District and Hour, we learn more about how big the districts actually are and how the compare in the amount of incidents they respond to. The Southern District has the top 15 amount of incidents taken, because they are in the heart of San Francisco. To answer the question what parts of the city are the most dangerous and during what times, it turns out that the Southern and Mission districts, the heart of San Francisco, are always the most dangerous places to be during anytime.
import numpy as np
crime['clearance'] = np.where(crime['Resolution'] == 'NONE', 'NO', 'YES')
no = pd.DataFrame(crime.groupby(["clearance", "PdDistrict"]).size()['NO'], columns = ['NO']).transpose()
yes = pd.DataFrame(crime.groupby(["clearance", "PdDistrict"]).size()['YES'], columns = ['YES']).transpose()
clearance = yes.append(no).transpose()
clearance['Total'] = clearance['YES'] + clearance['NO']
clearance['Percentage'] = clearance['YES']/clearance['Total']
clearance.plot.bar(y = 'Percentage')
plt.show()
clearance
The Tenderloin, Mission, and Bayview districts clear the highest percentage of their cases. This is particularly impressive for the Mission, since they have the 2nd most cases behind the Southern district. The Central, Richmond, and Northern districts are really lagging behind on their case closure rates. In particular, Richmond has the lowest amount of incidents, yet they also have the 2nd worst clearance rate. The Tenderloin district, for example, has twice the clearance rate that the Richmond district has.
crime['Category'].value_counts().plot(kind = 'bar')
plt.show()
Larceny, assualt, and vandalism are the three major criminal offenses that happen in San Francisco. This is probably true for most major cities across the United States of America. What I am suprised by is the frequency of close frequency of prostitution, DUI, and kidnapping. In a large city like San Francisco, the frequency of DUI is diminished by the prevalance of taxis and other car services, but to see that those two crimes have about the same frequency is suprising to say the least.
from shapely.geometry import Point
#https://geohackweek.github.io/vector/04-geopandas-intro/
crime_where = crime[crime['Category'] == 'PROSTITUTION']
crime_loc = [Point(xy) for xy in zip(crime_where['Lon'], crime_where['Lat'])]
crime_where = crime_where[['Category', 'Lon', 'Lat']]
crime_gdf = gpd.GeoDataFrame(crime_where, geometry = crime_loc)
fig, ax = plt.subplots()
ax.set_aspect('equal')
crime_gdf.plot(ax = ax, color = 'red', markersize = 2)
zips1.plot(ax = ax, column = 'ZriPerSqft_AllHomes', cmap='summer')
plt.show()
The incidents of prostitution seem to correlate with more expensive neighborhoods. It seems that prostitution in San Francisco is mainly located in two clusters close the heart of San Francisco. This makes sense because prostitution usually follows the nightlife in a city. In this case, the nightlife of San Francisco is everywhere, but mainly around the Mission District and above Market Street, which is also not coincidentally where the incidents of prostitution occur.
crime_where = crime[crime['Category'] == 'SUICIDE']
crime_loc = [Point(xy) for xy in zip(crime_where['Lon'], crime_where['Lat'])]
crime_where = crime_where[['Category', 'Lon', 'Lat']]
crime_gdf = gpd.GeoDataFrame(crime_where, geometry = crime_loc)
fig, ax = plt.subplots()
ax.set_aspect('equal')
crime_gdf.plot(ax = ax, color = 'red', markersize = 2)
zips1.plot(ax = ax, column = 'ZriPerSqft_AllHomes', cmap='summer')
plt.show()
Incidents of suicide in San Francisco seem to spread out around the city. There is a small cluster around the Financial District and Market Street, but the number of people around there every day compared to Southern San Francisco is much larger. Taking everything into account, it seems that people do not come to San Francisco to commit suicide. There does not seem to be a tendency to commit suicide in more or less expensive zip codes.
crime_where = crime[crime['Category'] == 'ARSON']
crime_loc = [Point(xy) for xy in zip(crime_where['Lon'], crime_where['Lat'])]
crime_where = crime_where[['Category', 'Lon', 'Lat']]
crime_gdf = gpd.GeoDataFrame(crime_where, geometry = crime_loc)
fig, ax = plt.subplots()
ax.set_aspect('equal')
crime_gdf.plot(ax = ax, color = 'red', markersize = 2)
zips1.plot(ax = ax, column = 'ZriPerSqft_AllHomes', cmap='summer')
plt.show()
In San Francisco, there are two main clusters of incidents of arson. One in the less expensive zip code, and another that spans multiple expensive zip codes. This is probably because of the confounding variables in each of the zip codes. In the heart of San Francisco, there are a lot businesses, so there are many targets for arson. The less expensive zip code contains warehouses and shipyards, so there may be a tendency for arson happening there as well.