The US Department of Agriculture publishes price estimates for fruits and vegetables online. The most recent estimates are based on a 2013 survey of US retail stores.
The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The assignment3_data.zip
file contains the fruit and vegetable files in the directories fruit
and vegetables
, respectively.
Exercise 1.1. Use pandas to extract the "Fresh" row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:
type | food | form | price_per_lb | yield | lb_per_cup | price_per_cup |
---|---|---|---|---|---|---|
fruit | watermelon | Fresh1 | 0.333412 | 0.52 | 0.330693 | 0.212033 |
fruit | cantaloupe | Fresh1 | 0.535874 | 0.51 | 0.374786 | 0.3938 |
vegetables | onions | Fresh1 | 1.03811 | 0.9 | 0.35274 | 0.406868 |
... |
It's okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:
str
methodsos
os.path
read_excel()
, concat()
, .fillna()
, .str
, plotting methodsAsk questions and search the documentation/web to find the functions you need.
import pandas as pd
import os
def getFresh(owd, foodtype = 'fruit'):
'''
getFresh takes in Excel files and outputs the price estimates for Fresh category
inputs:
owd: current working directory
foodtype: name of folder (ie 'fruit', 'vegetables', etc)
outputs:
dataframe: with columns 'form', 'price_per_lb', 'type', 'food', 'lb_per_cup', 'price_per_cup', 'yield'
'''
#get correct path based on folder name
owd = os.getcwd()
os.chdir(owd)
path = r'/'
path = path + foodtype
path = os.getcwd() + path
#get file names and change path
files = os.listdir(path)
os.chdir(path)
#initialize dataframe
fruit = pd.DataFrame()
for f in files:
#get names to put in dataframe
name = f.replace(".xlsx", "")
#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
#read excel file, and skip 2 rows
data = pd.read_excel(f, skiprows = 2)
#only need Fresh row
del data["Unnamed: 2"]
del data["Unnamed: 5"]
data['Unnamed: 8'] = name
data['Unnamed: 7'] = foodtype
#append to Dataframe
fruit = fruit.append(data)
#change back to previous directory
os.chdir(owd)
#rename rows
fruit.columns = ['form', 'price_per_lb', 'type', 'food', 'lb_per_cup', 'price_per_cup', 'yield']
#http://stackoverflow.com/questions/15325182/how-to-filter-rows-in-pandas-by-regex
#select correct rows
fruit = fruit[fruit.form.str.match('(Fresh.*)').str.len() > 0]
return fruit
#get fruit
fruit = getFresh(os.getcwd(), 'fruit')
fruit
Exercise 1.2. Reuse your code from exercise 1.1 to extract the "Fresh" row(s) from the vegetable Excel files.
Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?
You don't need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don't have this problem.
#get vegetables
vegetables = getFresh(os.getcwd(), 'vegetables')
vegetables
No, my code does not produce the correct prices for tomatoes and some other vegetables.
That is because the Fresh category for tomatoes is split up into 3 different subcategories.
Carrots, cauliflowers, brocolli, celery, romaine lettuce, mushrooms, and spinach all have the same problem.
Exercise 1.3. Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings).
#get everything
everything = fruit.append(vegetables)
#http://stackoverflow.com/questions/15891038/pandas-change-data-type-of-columns
#change everything to int (that can be changed)
everything = everything.apply(lambda x: pd.to_numeric(x, errors='ignore'))
#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
#drop nas
everything = everything.dropna(how = "any")
everything
Exercise 1.4. Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
fruit = everything.ix[everything['type'] == 'fruit']
highest = fruit.ix[(fruit['price_per_lb'] > 4)]
lowest = fruit.ix[(fruit['price_per_lb'] < 0.6)]
fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True)
sns.barplot('food', 'price_per_lb', data = highest, ax = ax1)
sns.barplot('food', 'price_per_lb', data = lowest, ax = ax2)
ax1.set(title = 'Most Expensive Fruit per Pound', xlabel='Fruit', ylabel='Price per Pound')
ax2.set(title = 'Least Expensive Fruit per Pound', xlabel='Fruit', ylabel='Price per Pound')
The most expensive fruits per pound were raspberries, blueberries, and blackberries. The least expensive fruits per pound were watermelon, bananas, and cantaloupe. Berries are expensive per pound because they can be crushed easily, and picking them takes more effort than picking stuff like melons and bananas. Bananas are some of the most easily available fruit, and come in banana hearts of up to 20, which decreases the price of one single banana. Melons have a thick outer shell to protect the fruit, which can increase the weight of the fruit.
fig, ax1 = plt.subplots(ncols=1, sharey=True)
sns.boxplot('price_per_lb', 'type', data = everything, ax = ax1)
ax1.set(title = 'Price Distribution of Fruit and Vegetables', ylabel='Type', xlabel='Price per Pound')
Vegetables and fruit have about the same mean, but the range of price per pound values they have are different. Vegetables have a smaller range compared to fruit, since some fruit in our data set can be very expensive. Because of this the standard deviation for fruit is higher than the standard deviation for vegetables.
value_cup = everything.ix[everything['price_per_cup'] < 0.35]
value_pound = everything.ix[everything['price_per_lb'] < 0.6]
fig, ax1 = plt.subplots(ncols=1, sharey=True)
sns.barplot('food', 'price_per_lb', data = value_pound, ax = ax1)
ax1.set(title = 'Best Value per Pound', xlabel='Food', ylabel='Price per Pound')
fig, ax2 = plt.subplots(ncols=1, sharey=True)
sns.barplot('food', 'price_per_cup', data = value_cup, ax = ax2)
ax2.set(title = 'Best Value per Cup', xlabel='Food', ylabel='Price per Cup')
The best value vegetable per cup and pound was the potato. The best value fruit per cup and pound was watermelon. According to the plots, vegetables get more expensive per pound compared to fruit, which comparatively get less expensive per pound. This is probably because people buy vegetables by the pound, so producers are trying to make more money that way. On the contrary, people do buy fruit by the cup, so the prices can increase, because the market is there.
sns.regplot('yield', 'price_per_lb', data = everything, fit_reg = False)
plt.title('Yield vs. Price per Pound')
plt.xlabel('Yield')
plt.ylabel('Price per Pound')
The most suprising thing about this dataset is the relationship, or lack of relationship between yield and price per pound. One would think that the lower the yield, the more the price of a vegetable or fruit would be. But according to the data, the relationship does not exist or there is a positive correlation between price and yield. Maybe this is because the low yield fruits and vegetables like acorn squash are more seasonal and producers can not charge a high price, if they want to get rid of all their inventory. Some other low yield products like bananas sell everywhere and anywhere, so the price can be low.
best = everything.ix[everything['price_per_lb'] < 2]
best = best.ix[best['yield'] > 0.85]
sns.regplot('yield', 'price_per_lb', data = best, fit_reg = False)
plt.title('Yield vs. Price per Pound')
plt.xlabel('Yield')
plt.ylabel('Price per Pound')
best
Vegetables like onions, radish, iceberg lettuce, and cucumbers have high yield and low prices. These vegetables, especially the lettuce and onions, are used in everyday cooking and can be bought at any time of the year. Fruits like apples, nectarines, peaches, pears, and plums also have high yield and low prices. Although some of these fruit are seasonal, they are cheap and can provide nutrients as a snack. These fruit are also easily accessible and at any time of year at least 3 of them can be found at any grocery store.