Assignment 3¶

The US Department of Agriculture publishes price estimates for fruits and vegetables online. The most recent estimates are based on a 2013 survey of US retail stores.

The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The assignment3_data.zip file contains the fruit and vegetable files in the directories fruit and vegetables, respectively.

Exercise 1.1. Use pandas to extract the "Fresh" row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:

type	food	form	price_per_lb	yield	lb_per_cup	price_per_cup
fruit	watermelon	Fresh1	0.333412	0.52	0.330693	0.212033
fruit	cantaloupe	Fresh1	0.535874	0.51	0.374786	0.3938
vegetables	onions	Fresh1	1.03811	0.9	0.35274	0.406868
...

It's okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:

str methods
os
os.path
pandas: read_excel(), concat(), .fillna(), .str, plotting methods

Ask questions and search the documentation/web to find the functions you need.

import pandas as pd
import os

def getFresh(owd, foodtype = 'fruit'): 
    '''
    getFresh takes in Excel files and outputs the price estimates for Fresh category
    
    inputs: 
            owd: current working directory
            foodtype: name of folder (ie 'fruit', 'vegetables', etc)
    
    outputs: 
            dataframe: with columns 'form', 'price_per_lb',  'type', 'food', 'lb_per_cup', 'price_per_cup', 'yield'
    
    '''
    #get correct path based on folder name
    owd = os.getcwd()
    os.chdir(owd)
    path = r'/'
    path = path + foodtype
    path = os.getcwd() + path
    #get file names and change path
    files = os.listdir(path)
    os.chdir(path)

    #initialize dataframe
    fruit = pd.DataFrame()

    for f in files:
        #get names to put in dataframe
        name = f.replace(".xlsx", "")
        #http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
        #read excel file, and skip 2 rows
        data = pd.read_excel(f, skiprows = 2)
        #only need Fresh row
        del data["Unnamed: 2"]
        del data["Unnamed: 5"]
        data['Unnamed: 8'] = name 
        data['Unnamed: 7'] = foodtype
        #append to Dataframe
        fruit = fruit.append(data)

    #change back to previous directory
    os.chdir(owd)
    #rename rows
    fruit.columns = ['form', 'price_per_lb',  'type', 'food', 'lb_per_cup', 'price_per_cup', 'yield']
    #http://stackoverflow.com/questions/15325182/how-to-filter-rows-in-pandas-by-regex
    #select correct rows
    fruit = fruit[fruit.form.str.match('(Fresh.*)').str.len() > 0]
    return fruit

#get fruit
fruit = getFresh(os.getcwd(), 'fruit')
fruit

C:\Users\ckashyap\Anaconda2\lib\site-packages\ipykernel\__main__.py:49: FutureWarning: In future versions of pandas, match will change to always return a bool indexer.

Exercise 1.2. Reuse your code from exercise 1.1 to extract the "Fresh" row(s) from the vegetable Excel files.

Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?

You don't need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don't have this problem.

#get vegetables
vegetables = getFresh(os.getcwd(), 'vegetables')
vegetables

C:\Users\ckashyap\Anaconda2\lib\site-packages\ipykernel\__main__.py:49: FutureWarning: In future versions of pandas, match will change to always return a bool indexer.

No, my code does not produce the correct prices for tomatoes and some other vegetables.

That is because the Fresh category for tomatoes is split up into 3 different subcategories.

Carrots, cauliflowers, brocolli, celery, romaine lettuce, mushrooms, and spinach all have the same problem.

Exercise 1.3. Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings).

#get everything
everything = fruit.append(vegetables)
#http://stackoverflow.com/questions/15891038/pandas-change-data-type-of-columns
#change everything to int (that can be changed)
everything = everything.apply(lambda x: pd.to_numeric(x, errors='ignore'))
#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
#drop nas
everything = everything.dropna(how = "any")
everything

Exercise 1.4. Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.

What kinds of fruits are the most expensive (per pound)? What kinds are the least expensive?
How do the price distributions compare for fruit and vegetables?
Which foods are the best value for the price?
What's something surprising about this data set?
Which foods do you expect to provide the best combination of price, yield, and nutrition? A future assignment may combine this data set with another so you can check your hypothesis.

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

fruit = everything.ix[everything['type'] == 'fruit']
highest = fruit.ix[(fruit['price_per_lb'] > 4)]
lowest = fruit.ix[(fruit['price_per_lb'] < 0.6)]

fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True)
sns.barplot('food', 'price_per_lb', data = highest, ax = ax1)
sns.barplot('food', 'price_per_lb', data = lowest, ax = ax2)
ax1.set(title = 'Most Expensive Fruit per Pound', xlabel='Fruit', ylabel='Price per Pound')
ax2.set(title = 'Least Expensive Fruit per Pound', xlabel='Fruit', ylabel='Price per Pound')

[<matplotlib.text.Text at 0xb8c57b0>,
 <matplotlib.text.Text at 0xb8b8a70>,
 <matplotlib.text.Text at 0xb927ff0>]

The most expensive fruits per pound were raspberries, blueberries, and blackberries. The least expensive fruits per pound were watermelon, bananas, and cantaloupe. Berries are expensive per pound because they can be crushed easily, and picking them takes more effort than picking stuff like melons and bananas. Bananas are some of the most easily available fruit, and come in banana hearts of up to 20, which decreases the price of one single banana. Melons have a thick outer shell to protect the fruit, which can increase the weight of the fruit.

fig, ax1 = plt.subplots(ncols=1, sharey=True)
sns.boxplot('price_per_lb', 'type', data = everything, ax = ax1)
ax1.set(title = 'Price Distribution of Fruit and Vegetables', ylabel='Type', xlabel='Price per Pound')

[<matplotlib.text.Text at 0xdb9e5f0>,
 <matplotlib.text.Text at 0xd64bf90>,
 <matplotlib.text.Text at 0xd891a30>]

Vegetables and fruit have about the same mean, but the range of price per pound values they have are different. Vegetables have a smaller range compared to fruit, since some fruit in our data set can be very expensive. Because of this the standard deviation for fruit is higher than the standard deviation for vegetables.

value_cup = everything.ix[everything['price_per_cup'] < 0.35]
value_pound = everything.ix[everything['price_per_lb'] < 0.6]
fig, ax1 = plt.subplots(ncols=1, sharey=True)
sns.barplot('food', 'price_per_lb', data = value_pound, ax = ax1)
ax1.set(title = 'Best Value per Pound', xlabel='Food', ylabel='Price per Pound')

[<matplotlib.text.Text at 0x10871b70>,
 <matplotlib.text.Text at 0x10854910>,
 <matplotlib.text.Text at 0x10887e50>]

fig, ax2 = plt.subplots(ncols=1, sharey=True)
sns.barplot('food', 'price_per_cup', data = value_cup, ax = ax2)
ax2.set(title = 'Best Value per Cup', xlabel='Food', ylabel='Price per Cup')

[<matplotlib.text.Text at 0x10854950>,
 <matplotlib.text.Text at 0x1087ef10>,
 <matplotlib.text.Text at 0x109586f0>]

The best value vegetable per cup and pound was the potato. The best value fruit per cup and pound was watermelon. According to the plots, vegetables get more expensive per pound compared to fruit, which comparatively get less expensive per pound. This is probably because people buy vegetables by the pound, so producers are trying to make more money that way. On the contrary, people do buy fruit by the cup, so the prices can increase, because the market is there.

sns.regplot('yield', 'price_per_lb', data = everything, fit_reg = False)
plt.title('Yield vs. Price per Pound')
plt.xlabel('Yield')
plt.ylabel('Price per Pound')

<matplotlib.text.Text at 0x11ec3cf0>

The most suprising thing about this dataset is the relationship, or lack of relationship between yield and price per pound. One would think that the lower the yield, the more the price of a vegetable or fruit would be. But according to the data, the relationship does not exist or there is a positive correlation between price and yield. Maybe this is because the low yield fruits and vegetables like acorn squash are more seasonal and producers can not charge a high price, if they want to get rid of all their inventory. Some other low yield products like bananas sell everywhere and anywhere, so the price can be low.

best =  everything.ix[everything['price_per_lb'] < 2]
best = best.ix[best['yield'] > 0.85]
sns.regplot('yield', 'price_per_lb', data = best, fit_reg = False)
plt.title('Yield vs. Price per Pound')
plt.xlabel('Yield')
plt.ylabel('Price per Pound')
best

Vegetables like onions, radish, iceberg lettuce, and cucumbers have high yield and low prices. These vegetables, especially the lettuce and onions, are used in everyday cooking and can be bought at any time of the year. Fruits like apples, nectarines, peaches, pears, and plums also have high yield and low prices. Although some of these fruit are seasonal, they are cheap and can provide nutrients as a snack. These fruit are also easily accessible and at any time of year at least 3 of them can be found at any grocery store.

form	price_per_lb	type	food	lb_per_cup	price_per_cup	yield
Fresh1	1.56752	fruit	apples	0.242508	0.422373	0.9
Fresh1	3.04007	fruit	apricots	0.363763	1.1891	0.93
Fresh1	0.566983	fruit	bananas	0.330693	0.292965	0.64
Fresh1	5.77471	fruit	blackberries	0.31967	1.92292	0.96
Fresh1	4.73462	fruit	blueberries	0.31967	1.59318	0.95
Fresh1	0.535874	fruit	cantaloupe	0.374786	0.3938	0.51
Fresh1	3.59299	fruit	cherries	0.341717	1.33455	0.92
Fresh1	0.897802	fruit	grapefruit	0.462971	0.848278	0.49
Fresh1	2.09383	fruit	grapes	0.330693	0.721266	0.96
Fresh1	0.796656	fruit	honeydew	0.374786	0.649077	0.46
Fresh1	2.04468	fruit	kiwi	0.385809	1.03797	0.76
Fresh1	1.37756	fruit	mangoes	0.363763	0.705783	0.71
Fresh1	1.76115	fruit	nectarines	0.31967	0.618667	0.91
Fresh1	1.03517	fruit	oranges	0.407855	0.578357	0.73
Fresh1	1.29801	fruit	papaya	0.308647	0.646174	0.62
Fresh1	1.59119	fruit	peaches	0.341717	0.56639	0.96
Fresh1	1.46157	fruit	pears	0.363763	0.59074	0.9
Fresh1	0.627662	fruit	pineapple	0.363763	0.447686	0.51
Fresh1	1.82742	fruit	plums	0.363763	0.707176	0.94
Fresh1	2.17359	fruit	pomegranate	0.341717	1.32634	0.56
Fresh1	6.97581	fruit	raspberries	0.31967	2.32287	0.96
Fresh1	2.35881	fruit	strawberries	0.31967	0.802171	0.94
Fresh1	1.37796	fruit	tangerines	0.407855	0.759471	0.74
Fresh1	0.333412	fruit	watermelon	0.330693	0.212033	0.52

	form	price_per_lb	type	food	lb_per_cup	price_per_cup	yield
0	Fresh1	1.17225	vegetables	acorn_squash	0.451948	1.155360	0.458554
0	Fresh1	2.21305	vegetables	artichoke	0.385809	2.274967	0.375309
0	Fresh1	3.21349	vegetables	asparagus	0.396832	2.582272	0.493835
0	Fresh1	2.23587	vegetables	avocados	0.31967	0.964886	0.740753
0	Fresh		vegetables	broccoli		NaN
0	Fresh1	2.76355	vegetables	brussels_sprouts	0.341717	0.890898	1.06
0	Fresh1	1.24474	vegetables	butternut_squash	0.451948	0.787893	0.714
0	Fresh green cabbage1	0.579208	vegetables	cabbage	0.330693	0.245944	0.778797
1	Fresh red cabbage1	1.05645	vegetables	cabbage	0.330693	0.448412	0.779107
0	Fresh	NaN	vegetables	carrots	NaN	NaN	NaN
0	Fresh1		vegetables	cauliflower		NaN
0	Fresh1	NaN	vegetables	celery	NaN	NaN	NaN
0	Fresh1	2.63084	vegetables	collard_greens	0.286601	0.650001	1.16
0	Fresh1	2.69062	vegetables	corn_sweet	0.363763	1.812497	0.54
0	Fresh, consumed with peel1	1.29593	vegetables	cucumbers	0.264555	0.353448	0.97
1	Fresh, peeled1	1.29593	vegetables	cucumbers	0.264555	0.469650	0.73
0	Fresh1	2.13997	vegetables	green_beans	0.275578	0.696606	0.846575
0	Fresh1	1.41036	vegetables	green_peppers	0.264555	0.455022	0.82
0	Fresh1	2.8073	vegetables	kale	0.286601	0.766262	1.05
0	Fresh1	1.21304	vegetables	lettuce_iceberg	0.242508	0.309655	0.95
0	Fresh1		vegetables	lettuce_romaine		NaN	NaN
0	Fresh1	NaN	vegetables	mushrooms	NaN	NaN	NaN
0	Fresh1	2.56924	vegetables	mustard_greens	0.308647	0.944032	0.84
0	Fresh1	3.21355	vegetables	okra	0.35274	1.473146	0.769474
0	Fresh1	1.03811	vegetables	onions	0.35274	0.406868	0.9
0	Fresh1	0.56432	vegetables	potatoes	0.264555	0.184017	0.811301
0	Fresh1	1.31163	vegetables	radish	0.275578	0.401618	0.9
0	Fresh1	2.27794	vegetables	red_peppers	0.264555	0.734926	0.82
0	Fresh1	NaN	vegetables	spinach	NaN	NaN	NaN
0	Fresh1	1.63948	vegetables	summer_squash	0.396832	0.845480	0.7695
0	Fresh1	0.918897	vegetables	sweet_potatoes	0.440925	0.499400	0.811301
0	Fresh	NaN	vegetables	tomatoes	NaN	NaN	NaN
0	Fresh1	2.47175	vegetables	turnip_greens	0.31967	1.053526	0.75

	form	price_per_lb	type	food	lb_per_cup	price_per_cup	yield
0	Fresh1	1.56752	fruit	apples	0.242508	0.422373	0.9
0	Fresh1	3.04007	fruit	apricots	0.363763	1.189102	0.93
0	Fresh1	0.566983	fruit	bananas	0.330693	0.292965	0.64
0	Fresh1	5.77471	fruit	blackberries	0.31967	1.922919	0.96
0	Fresh1	4.73462	fruit	blueberries	0.31967	1.593177	0.95
0	Fresh1	0.535874	fruit	cantaloupe	0.374786	0.393800	0.51
0	Fresh1	3.59299	fruit	cherries	0.341717	1.334548	0.92
0	Fresh1	0.897802	fruit	grapefruit	0.462971	0.848278	0.49
0	Fresh1	2.09383	fruit	grapes	0.330693	0.721266	0.96
0	Fresh1	0.796656	fruit	honeydew	0.374786	0.649077	0.46
0	Fresh1	2.04468	fruit	kiwi	0.385809	1.037970	0.76
0	Fresh1	1.37756	fruit	mangoes	0.363763	0.705783	0.71
0	Fresh1	1.76115	fruit	nectarines	0.31967	0.618667	0.91
0	Fresh1	1.03517	fruit	oranges	0.407855	0.578357	0.73
0	Fresh1	1.29801	fruit	papaya	0.308647	0.646174	0.62
0	Fresh1	1.59119	fruit	peaches	0.341717	0.566390	0.96
0	Fresh1	1.46157	fruit	pears	0.363763	0.590740	0.9
0	Fresh1	0.627662	fruit	pineapple	0.363763	0.447686	0.51
0	Fresh1	1.82742	fruit	plums	0.363763	0.707176	0.94
0	Fresh1	2.17359	fruit	pomegranate	0.341717	1.326342	0.56
0	Fresh1	6.97581	fruit	raspberries	0.31967	2.322874	0.96
0	Fresh1	2.35881	fruit	strawberries	0.31967	0.802171	0.94
0	Fresh1	1.37796	fruit	tangerines	0.407855	0.759471	0.74
0	Fresh1	0.333412	fruit	watermelon	0.330693	0.212033	0.52
0	Fresh1	1.17225	vegetables	acorn_squash	0.451948	1.155360	0.458554
0	Fresh1	2.21305	vegetables	artichoke	0.385809	2.274967	0.375309
0	Fresh1	3.21349	vegetables	asparagus	0.396832	2.582272	0.493835
0	Fresh1	2.23587	vegetables	avocados	0.31967	0.964886	0.740753
0	Fresh1	2.76355	vegetables	brussels_sprouts	0.341717	0.890898	1.06
0	Fresh1	1.24474	vegetables	butternut_squash	0.451948	0.787893	0.714
0	Fresh green cabbage1	0.579208	vegetables	cabbage	0.330693	0.245944	0.778797
1	Fresh red cabbage1	1.05645	vegetables	cabbage	0.330693	0.448412	0.779107
0	Fresh1	2.63084	vegetables	collard_greens	0.286601	0.650001	1.16
0	Fresh1	2.69062	vegetables	corn_sweet	0.363763	1.812497	0.54
0	Fresh, consumed with peel1	1.29593	vegetables	cucumbers	0.264555	0.353448	0.97
1	Fresh, peeled1	1.29593	vegetables	cucumbers	0.264555	0.469650	0.73
0	Fresh1	2.13997	vegetables	green_beans	0.275578	0.696606	0.846575
0	Fresh1	1.41036	vegetables	green_peppers	0.264555	0.455022	0.82
0	Fresh1	2.8073	vegetables	kale	0.286601	0.766262	1.05
0	Fresh1	1.21304	vegetables	lettuce_iceberg	0.242508	0.309655	0.95
0	Fresh1	2.56924	vegetables	mustard_greens	0.308647	0.944032	0.84
0	Fresh1	3.21355	vegetables	okra	0.35274	1.473146	0.769474
0	Fresh1	1.03811	vegetables	onions	0.35274	0.406868	0.9
0	Fresh1	0.56432	vegetables	potatoes	0.264555	0.184017	0.811301
0	Fresh1	1.31163	vegetables	radish	0.275578	0.401618	0.9
0	Fresh1	2.27794	vegetables	red_peppers	0.264555	0.734926	0.82
0	Fresh1	1.63948	vegetables	summer_squash	0.396832	0.845480	0.7695
0	Fresh1	0.918897	vegetables	sweet_potatoes	0.440925	0.499400	0.811301
0	Fresh1	2.47175	vegetables	turnip_greens	0.31967	1.053526	0.75