Assignment 3

The US Department of Agriculture publishes price estimates for fruits and vegetables online. The most recent estimates are based on a 2013 survey of US retail stores.

The estimates are provided as a collection of MS Excel files, with one file per fruit or vegetable. The assignment3_data.zip file contains the fruit and vegetable files in the directories fruit and vegetables, respectively.

Exercise 1.1. Use pandas to extract the "Fresh" row(s) from the fruit Excel files. Combine the data into a single data frame. Your data frame should look something like this:

type food form price_per_lb yield lb_per_cup price_per_cup
fruit watermelon Fresh1 0.333412 0.52 0.330693 0.212033
fruit cantaloupe Fresh1 0.535874 0.51 0.374786 0.3938
vegetables onions Fresh1 1.03811 0.9 0.35274 0.406868
...

It's okay if the rows and columns of your data frame are in a different order. These modules are especially relevant:

Ask questions and search the documentation/web to find the functions you need.

In [15]:
import pandas as pd
import os

def getFresh(owd, foodtype = 'fruit'): 
    '''
    getFresh takes in Excel files and outputs the price estimates for Fresh category
    
    inputs: 
            owd: current working directory
            foodtype: name of folder (ie 'fruit', 'vegetables', etc)
    
    outputs: 
            dataframe: with columns 'form', 'price_per_lb',  'type', 'food', 'lb_per_cup', 'price_per_cup', 'yield'
    
    '''
    #get correct path based on folder name
    owd = os.getcwd()
    os.chdir(owd)
    path = r'/'
    path = path + foodtype
    path = os.getcwd() + path
    #get file names and change path
    files = os.listdir(path)
    os.chdir(path)

    #initialize dataframe
    fruit = pd.DataFrame()

    for f in files:
        #get names to put in dataframe
        name = f.replace(".xlsx", "")
        #http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_excel.html
        #read excel file, and skip 2 rows
        data = pd.read_excel(f, skiprows = 2)
        #only need Fresh row
        del data["Unnamed: 2"]
        del data["Unnamed: 5"]
        data['Unnamed: 8'] = name 
        data['Unnamed: 7'] = foodtype
        #append to Dataframe
        fruit = fruit.append(data)

    #change back to previous directory
    os.chdir(owd)
    #rename rows
    fruit.columns = ['form', 'price_per_lb',  'type', 'food', 'lb_per_cup', 'price_per_cup', 'yield']
    #http://stackoverflow.com/questions/15325182/how-to-filter-rows-in-pandas-by-regex
    #select correct rows
    fruit = fruit[fruit.form.str.match('(Fresh.*)').str.len() > 0]
    return fruit

#get fruit
fruit = getFresh(os.getcwd(), 'fruit')
fruit
C:\Users\ckashyap\Anaconda2\lib\site-packages\ipykernel\__main__.py:49: FutureWarning: In future versions of pandas, match will change to always return a bool indexer.
Out[15]:
form price_per_lb type food lb_per_cup price_per_cup yield
0 Fresh1 1.56752 fruit apples 0.242508 0.422373 0.9
0 Fresh1 3.04007 fruit apricots 0.363763 1.1891 0.93
0 Fresh1 0.566983 fruit bananas 0.330693 0.292965 0.64
0 Fresh1 5.77471 fruit blackberries 0.31967 1.92292 0.96
0 Fresh1 4.73462 fruit blueberries 0.31967 1.59318 0.95
0 Fresh1 0.535874 fruit cantaloupe 0.374786 0.3938 0.51
0 Fresh1 3.59299 fruit cherries 0.341717 1.33455 0.92
0 Fresh1 0.897802 fruit grapefruit 0.462971 0.848278 0.49
0 Fresh1 2.09383 fruit grapes 0.330693 0.721266 0.96
0 Fresh1 0.796656 fruit honeydew 0.374786 0.649077 0.46
0 Fresh1 2.04468 fruit kiwi 0.385809 1.03797 0.76
0 Fresh1 1.37756 fruit mangoes 0.363763 0.705783 0.71
0 Fresh1 1.76115 fruit nectarines 0.31967 0.618667 0.91
0 Fresh1 1.03517 fruit oranges 0.407855 0.578357 0.73
0 Fresh1 1.29801 fruit papaya 0.308647 0.646174 0.62
0 Fresh1 1.59119 fruit peaches 0.341717 0.56639 0.96
0 Fresh1 1.46157 fruit pears 0.363763 0.59074 0.9
0 Fresh1 0.627662 fruit pineapple 0.363763 0.447686 0.51
0 Fresh1 1.82742 fruit plums 0.363763 0.707176 0.94
0 Fresh1 2.17359 fruit pomegranate 0.341717 1.32634 0.56
0 Fresh1 6.97581 fruit raspberries 0.31967 2.32287 0.96
0 Fresh1 2.35881 fruit strawberries 0.31967 0.802171 0.94
0 Fresh1 1.37796 fruit tangerines 0.407855 0.759471 0.74
0 Fresh1 0.333412 fruit watermelon 0.330693 0.212033 0.52

Exercise 1.2. Reuse your code from exercise 1.1 to extract the "Fresh" row(s) from the vegetable Excel files.

Does your code produce the correct prices for tomatoes? If not, why not? Do any other files have the same problem as the tomatoes file?

You don't need to extract the prices for these problem files. However, make sure the prices are extracted for files like asparagus that don't have this problem.

In [7]:
#get vegetables
vegetables = getFresh(os.getcwd(), 'vegetables')
vegetables
C:\Users\ckashyap\Anaconda2\lib\site-packages\ipykernel\__main__.py:49: FutureWarning: In future versions of pandas, match will change to always return a bool indexer.
Out[7]:
form price_per_lb type food lb_per_cup price_per_cup yield
0 Fresh1 1.17225 vegetables acorn_squash 0.451948 1.155360 0.458554
0 Fresh1 2.21305 vegetables artichoke 0.385809 2.274967 0.375309
0 Fresh1 3.21349 vegetables asparagus 0.396832 2.582272 0.493835
0 Fresh1 2.23587 vegetables avocados 0.31967 0.964886 0.740753
0 Fresh vegetables broccoli NaN
0 Fresh1 2.76355 vegetables brussels_sprouts 0.341717 0.890898 1.06
0 Fresh1 1.24474 vegetables butternut_squash 0.451948 0.787893 0.714
0 Fresh green cabbage1 0.579208 vegetables cabbage 0.330693 0.245944 0.778797
1 Fresh red cabbage1 1.05645 vegetables cabbage 0.330693 0.448412 0.779107
0 Fresh NaN vegetables carrots NaN NaN NaN
0 Fresh1 vegetables cauliflower NaN
0 Fresh1 NaN vegetables celery NaN NaN NaN
0 Fresh1 2.63084 vegetables collard_greens 0.286601 0.650001 1.16
0 Fresh1 2.69062 vegetables corn_sweet 0.363763 1.812497 0.54
0 Fresh, consumed with peel1 1.29593 vegetables cucumbers 0.264555 0.353448 0.97
1 Fresh, peeled1 1.29593 vegetables cucumbers 0.264555 0.469650 0.73
0 Fresh1 2.13997 vegetables green_beans 0.275578 0.696606 0.846575
0 Fresh1 1.41036 vegetables green_peppers 0.264555 0.455022 0.82
0 Fresh1 2.8073 vegetables kale 0.286601 0.766262 1.05
0 Fresh1 1.21304 vegetables lettuce_iceberg 0.242508 0.309655 0.95
0 Fresh1 vegetables lettuce_romaine NaN NaN
0 Fresh1 NaN vegetables mushrooms NaN NaN NaN
0 Fresh1 2.56924 vegetables mustard_greens 0.308647 0.944032 0.84
0 Fresh1 3.21355 vegetables okra 0.35274 1.473146 0.769474
0 Fresh1 1.03811 vegetables onions 0.35274 0.406868 0.9
0 Fresh1 0.56432 vegetables potatoes 0.264555 0.184017 0.811301
0 Fresh1 1.31163 vegetables radish 0.275578 0.401618 0.9
0 Fresh1 2.27794 vegetables red_peppers 0.264555 0.734926 0.82
0 Fresh1 NaN vegetables spinach NaN NaN NaN
0 Fresh1 1.63948 vegetables summer_squash 0.396832 0.845480 0.7695
0 Fresh1 0.918897 vegetables sweet_potatoes 0.440925 0.499400 0.811301
0 Fresh NaN vegetables tomatoes NaN NaN NaN
0 Fresh1 2.47175 vegetables turnip_greens 0.31967 1.053526 0.75

No, my code does not produce the correct prices for tomatoes and some other vegetables.

That is because the Fresh category for tomatoes is split up into 3 different subcategories.

Carrots, cauliflowers, brocolli, celery, romaine lettuce, mushrooms, and spinach all have the same problem.

Exercise 1.3. Remove rows without a price from the vegetable data frame and then combine the fruit and vegetable data frames. Make sure all columns of numbers are numeric (not strings).

In [8]:
#get everything
everything = fruit.append(vegetables)
#http://stackoverflow.com/questions/15891038/pandas-change-data-type-of-columns
#change everything to int (that can be changed)
everything = everything.apply(lambda x: pd.to_numeric(x, errors='ignore'))
#http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.dropna.html
#drop nas
everything = everything.dropna(how = "any")
everything
Out[8]:
form price_per_lb type food lb_per_cup price_per_cup yield
0 Fresh1 1.56752 fruit apples 0.242508 0.422373 0.9
0 Fresh1 3.04007 fruit apricots 0.363763 1.189102 0.93
0 Fresh1 0.566983 fruit bananas 0.330693 0.292965 0.64
0 Fresh1 5.77471 fruit blackberries 0.31967 1.922919 0.96
0 Fresh1 4.73462 fruit blueberries 0.31967 1.593177 0.95
0 Fresh1 0.535874 fruit cantaloupe 0.374786 0.393800 0.51
0 Fresh1 3.59299 fruit cherries 0.341717 1.334548 0.92
0 Fresh1 0.897802 fruit grapefruit 0.462971 0.848278 0.49
0 Fresh1 2.09383 fruit grapes 0.330693 0.721266 0.96
0 Fresh1 0.796656 fruit honeydew 0.374786 0.649077 0.46
0 Fresh1 2.04468 fruit kiwi 0.385809 1.037970 0.76
0 Fresh1 1.37756 fruit mangoes 0.363763 0.705783 0.71
0 Fresh1 1.76115 fruit nectarines 0.31967 0.618667 0.91
0 Fresh1 1.03517 fruit oranges 0.407855 0.578357 0.73
0 Fresh1 1.29801 fruit papaya 0.308647 0.646174 0.62
0 Fresh1 1.59119 fruit peaches 0.341717 0.566390 0.96
0 Fresh1 1.46157 fruit pears 0.363763 0.590740 0.9
0 Fresh1 0.627662 fruit pineapple 0.363763 0.447686 0.51
0 Fresh1 1.82742 fruit plums 0.363763 0.707176 0.94
0 Fresh1 2.17359 fruit pomegranate 0.341717 1.326342 0.56
0 Fresh1 6.97581 fruit raspberries 0.31967 2.322874 0.96
0 Fresh1 2.35881 fruit strawberries 0.31967 0.802171 0.94
0 Fresh1 1.37796 fruit tangerines 0.407855 0.759471 0.74
0 Fresh1 0.333412 fruit watermelon 0.330693 0.212033 0.52
0 Fresh1 1.17225 vegetables acorn_squash 0.451948 1.155360 0.458554
0 Fresh1 2.21305 vegetables artichoke 0.385809 2.274967 0.375309
0 Fresh1 3.21349 vegetables asparagus 0.396832 2.582272 0.493835
0 Fresh1 2.23587 vegetables avocados 0.31967 0.964886 0.740753
0 Fresh1 2.76355 vegetables brussels_sprouts 0.341717 0.890898 1.06
0 Fresh1 1.24474 vegetables butternut_squash 0.451948 0.787893 0.714
0 Fresh green cabbage1 0.579208 vegetables cabbage 0.330693 0.245944 0.778797
1 Fresh red cabbage1 1.05645 vegetables cabbage 0.330693 0.448412 0.779107
0 Fresh1 2.63084 vegetables collard_greens 0.286601 0.650001 1.16
0 Fresh1 2.69062 vegetables corn_sweet 0.363763 1.812497 0.54
0 Fresh, consumed with peel1 1.29593 vegetables cucumbers 0.264555 0.353448 0.97
1 Fresh, peeled1 1.29593 vegetables cucumbers 0.264555 0.469650 0.73
0 Fresh1 2.13997 vegetables green_beans 0.275578 0.696606 0.846575
0 Fresh1 1.41036 vegetables green_peppers 0.264555 0.455022 0.82
0 Fresh1 2.8073 vegetables kale 0.286601 0.766262 1.05
0 Fresh1 1.21304 vegetables lettuce_iceberg 0.242508 0.309655 0.95
0 Fresh1 2.56924 vegetables mustard_greens 0.308647 0.944032 0.84
0 Fresh1 3.21355 vegetables okra 0.35274 1.473146 0.769474
0 Fresh1 1.03811 vegetables onions 0.35274 0.406868 0.9
0 Fresh1 0.56432 vegetables potatoes 0.264555 0.184017 0.811301
0 Fresh1 1.31163 vegetables radish 0.275578 0.401618 0.9
0 Fresh1 2.27794 vegetables red_peppers 0.264555 0.734926 0.82
0 Fresh1 1.63948 vegetables summer_squash 0.396832 0.845480 0.7695
0 Fresh1 0.918897 vegetables sweet_potatoes 0.440925 0.499400 0.811301
0 Fresh1 2.47175 vegetables turnip_greens 0.31967 1.053526 0.75

Exercise 1.4. Discuss the questions below (a paragraph each is sufficient). Use plots to support your ideas.

  • What kinds of fruits are the most expensive (per pound)? What kinds are the least expensive?
  • How do the price distributions compare for fruit and vegetables?
  • Which foods are the best value for the price?
  • What's something surprising about this data set?
  • Which foods do you expect to provide the best combination of price, yield, and nutrition? A future assignment may combine this data set with another so you can check your hypothesis.
In [9]:
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

fruit = everything.ix[everything['type'] == 'fruit']
highest = fruit.ix[(fruit['price_per_lb'] > 4)]
lowest = fruit.ix[(fruit['price_per_lb'] < 0.6)]

fig, (ax1, ax2) = plt.subplots(ncols=2, sharey=True)
sns.barplot('food', 'price_per_lb', data = highest, ax = ax1)
sns.barplot('food', 'price_per_lb', data = lowest, ax = ax2)
ax1.set(title = 'Most Expensive Fruit per Pound', xlabel='Fruit', ylabel='Price per Pound')
ax2.set(title = 'Least Expensive Fruit per Pound', xlabel='Fruit', ylabel='Price per Pound')
Out[9]:
[<matplotlib.text.Text at 0xb8c57b0>,
 <matplotlib.text.Text at 0xb8b8a70>,
 <matplotlib.text.Text at 0xb927ff0>]

The most expensive fruits per pound were raspberries, blueberries, and blackberries. The least expensive fruits per pound were watermelon, bananas, and cantaloupe. Berries are expensive per pound because they can be crushed easily, and picking them takes more effort than picking stuff like melons and bananas. Bananas are some of the most easily available fruit, and come in banana hearts of up to 20, which decreases the price of one single banana. Melons have a thick outer shell to protect the fruit, which can increase the weight of the fruit.

In [48]:
fig, ax1 = plt.subplots(ncols=1, sharey=True)
sns.boxplot('price_per_lb', 'type', data = everything, ax = ax1)
ax1.set(title = 'Price Distribution of Fruit and Vegetables', ylabel='Type', xlabel='Price per Pound')
Out[48]:
[<matplotlib.text.Text at 0xdb9e5f0>,
 <matplotlib.text.Text at 0xd64bf90>,
 <matplotlib.text.Text at 0xd891a30>]

Vegetables and fruit have about the same mean, but the range of price per pound values they have are different. Vegetables have a smaller range compared to fruit, since some fruit in our data set can be very expensive. Because of this the standard deviation for fruit is higher than the standard deviation for vegetables.

In [65]:
value_cup = everything.ix[everything['price_per_cup'] < 0.35]
value_pound = everything.ix[everything['price_per_lb'] < 0.6]
fig, ax1 = plt.subplots(ncols=1, sharey=True)
sns.barplot('food', 'price_per_lb', data = value_pound, ax = ax1)
ax1.set(title = 'Best Value per Pound', xlabel='Food', ylabel='Price per Pound')
Out[65]:
[<matplotlib.text.Text at 0x10871b70>,
 <matplotlib.text.Text at 0x10854910>,
 <matplotlib.text.Text at 0x10887e50>]
In [66]:
fig, ax2 = plt.subplots(ncols=1, sharey=True)
sns.barplot('food', 'price_per_cup', data = value_cup, ax = ax2)
ax2.set(title = 'Best Value per Cup', xlabel='Food', ylabel='Price per Cup')
Out[66]:
[<matplotlib.text.Text at 0x10854950>,
 <matplotlib.text.Text at 0x1087ef10>,
 <matplotlib.text.Text at 0x109586f0>]

The best value vegetable per cup and pound was the potato. The best value fruit per cup and pound was watermelon. According to the plots, vegetables get more expensive per pound compared to fruit, which comparatively get less expensive per pound. This is probably because people buy vegetables by the pound, so producers are trying to make more money that way. On the contrary, people do buy fruit by the cup, so the prices can increase, because the market is there.

In [85]:
sns.regplot('yield', 'price_per_lb', data = everything, fit_reg = False)
plt.title('Yield vs. Price per Pound')
plt.xlabel('Yield')
plt.ylabel('Price per Pound')
Out[85]:
<matplotlib.text.Text at 0x11ec3cf0>

The most suprising thing about this dataset is the relationship, or lack of relationship between yield and price per pound. One would think that the lower the yield, the more the price of a vegetable or fruit would be. But according to the data, the relationship does not exist or there is a positive correlation between price and yield. Maybe this is because the low yield fruits and vegetables like acorn squash are more seasonal and producers can not charge a high price, if they want to get rid of all their inventory. Some other low yield products like bananas sell everywhere and anywhere, so the price can be low.

In [88]:
best =  everything.ix[everything['price_per_lb'] < 2]
best = best.ix[best['yield'] > 0.85]
sns.regplot('yield', 'price_per_lb', data = best, fit_reg = False)
plt.title('Yield vs. Price per Pound')
plt.xlabel('Yield')
plt.ylabel('Price per Pound')
best
Out[88]:
form price_per_lb type food lb_per_cup price_per_cup yield
0 Fresh1 1.56752 fruit apples 0.242508 0.422373 0.9
0 Fresh1 1.76115 fruit nectarines 0.31967 0.618667 0.91
0 Fresh1 1.59119 fruit peaches 0.341717 0.566390 0.96
0 Fresh1 1.46157 fruit pears 0.363763 0.590740 0.9
0 Fresh1 1.82742 fruit plums 0.363763 0.707176 0.94
0 Fresh, consumed with peel1 1.29593 vegetables cucumbers 0.264555 0.353448 0.97
0 Fresh1 1.21304 vegetables lettuce_iceberg 0.242508 0.309655 0.95
0 Fresh1 1.03811 vegetables onions 0.35274 0.406868 0.9
0 Fresh1 1.31163 vegetables radish 0.275578 0.401618 0.9

Vegetables like onions, radish, iceberg lettuce, and cucumbers have high yield and low prices. These vegetables, especially the lettuce and onions, are used in everyday cooking and can be bought at any time of the year. Fruits like apples, nectarines, peaches, pears, and plums also have high yield and low prices. Although some of these fruit are seasonal, they are cheap and can provide nutrients as a snack. These fruit are also easily accessible and at any time of year at least 3 of them can be found at any grocery store.