In this assignment, you'll combine the assignment 3 data set with nutrition data from the USDA Food Composition Databases. The CSV file fresh.csv
contains the fresh fruits and vegetables data you extracted in assignment 3.
The USDA Food Composition Databases have a documented web API that returns data in JSON format . You need a key in order to use the API. Only 1000 requests are allowed per hour, so it would be a good idea to use caching.
Sign up for an API key here. The key will work with any Data.gov API. You may need the key again later in the quarter, so make sure you save it.
These modules may be useful:
Exercise 1.1. Read the search request documentation, then write a function called ndb_search()
that makes a search request. The function should accept the search term as an argument. The function should return the search result items as a list (for 0 items, return an empty list).
Note that the search url is: https://api.nal.usda.gov/ndb/search
As an example, a search for "quail eggs"
should return this list:
[{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'CHAOKOH, QUAIL EGG IN BRINE, UPC: 044738074186',
u'ndbno': u'45094707',
u'offset': 0},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'L&W, QUAIL EGGS, UPC: 024072000256',
u'ndbno': u'45094890',
u'offset': 1},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'BUDDHA, QUAIL EGGS IN BRINE, UPC: 761934535098',
u'ndbno': u'45099560',
u'offset': 2},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u'GRAN SABANA, QUAIL EGGS, UPC: 819140010103',
u'ndbno': u'45169279',
u'offset': 3},
{u'ds': u'BL',
u'group': u'Branded Food Products Database',
u'name': u"D'ARTAGNAN, QUAIL EGGS, UPC: 736622102630",
u'ndbno': u'45178254',
u'offset': 4},
{u'ds': u'SR',
u'group': u'Dairy and Egg Products',
u'name': u'Egg, quail, whole, fresh, raw',
u'ndbno': u'01140',
u'offset': 5}]
As usual, make sure you document and test your function.
from urllib2 import Request, urlopen
from urlparse import urlparse, urlunparse
import requests
import requests_cache
import pandas as pd
import json
import os
from matplotlib import pyplot as plt
plt.style.use('ggplot')
%matplotlib inline
requests_cache.install_cache('usda_cache')
key = 'O6dynzQFxqlFUFzc9XoqGxr0froELHWUtxhW3aVR'
def ndb_search(search_term):
"""
This queries the USDA Food composite database to make a search request
Input: search_term (string of terms to search for)
Output: json object (of search results)
"""
urlbase = "https://api.nal.usda.gov/ndb/search"
try:
usdareq = requests.get(urlbase + "?format=json" + "&q={}".format(search_term) + "&api_key={}".format(key))
usda_json = usdareq.json()
usda_json = usda_json['list']['item']
return usda_json
except (KeyError):
pass
ndb_search('potatoes raw')
Exercise 1.2. Use your search function to get NDB numbers for the foods in the fresh.csv
file. It's okay if you don't get an NDB number for every food, but try to come up with a strategy that gets most of them. Discuss your strategy in a short paragraph.
Hints:
ndb_search()
to a data frame with pd.DataFrame()
.pd.merge()
.path = os.path.join(os.getcwd(), 'fresh.csv')
fresh = pd.read_csv(path)
fresh['food'] = fresh['food'].str.replace("_", " ")
fresh_df = pd.DataFrame()
for i in fresh['food']:
search = i + ' raw'
j = i.replace(" ", "|")
try:
results = pd.DataFrame(ndb_search(search))
results = results.loc[results.ds.str.startswith("SR"), :]
results = results.loc[results.group.str.startswith(("Fruits", "Vegetables")), :]
if results.shape == (1,5):
results["offset"] = i
fresh_df = fresh_df.append(results)
else:
results = results[results['name'].str.contains(j, case = False)]
try:
results["offset"] = i
fresh_df = fresh_df.append(results.iloc[0])
except (IndexError):
pass
except (AttributeError):
pass
fresh_df = fresh_df.drop_duplicates()
fresh_df = fresh_df.rename(columns = {'offset':'food'})
fresh_df = fresh_df.merge(fresh)
fresh_df
ndbno = [i for i in fresh_df['ndbno']]
nd = ['09279']
My strategy for tackling this problem started with replacing the '_' in all food names with spaces then adding ' raw' in order to use the ndb_search function to search for that term. Then I further narrowed the results DataFrame down by eliminating all rows without 'SR', which meant that no branded foods would be included. Then, I narrowed the results to those that were in the Fruits or Vegetables categories, since I was getting some results in other categories. Finally, I further narrowed the results to keep all results that contained the key words of the food I was searching for. In order to do this well, I replaced all spaces in the food types with the '|' (or) symbol. Because, I still had 300+ results, I simply just took the first entry of each search and appended that into the fresh_df DataFrame. In the end, I ended up with most of the correct rows that I wanted. The cucumber rows were a sad consequence, since the keyword given here was "cucumbers" and the keyword in the results DataFrame was "cucumber". Besides that, the search for "kiwi raw" did not give any results, since in the USDA database it is under "Kiwifruit, raw". Finally, I ended up with the wrong cabbage, artichoke, and potato types. Besides those 5 mistakes, I ended up with the correct fruits and vegetables for each food type.
Exercise 1.3. Read the food reports V2 documentation, then write a function called ndb_report()
that requests a basic food report. The function should accept the NDB number as an argument and return the list of nutrients for the food.
Note that the report url is: https://api.nal.usda.gov/ndb/V2/reports
For example, for "09279"
(raw plums) the first element of the returned list should be:
{u'group': u'Proximates',
u'measures': [{u'eqv': 165.0,
u'eunit': u'g',
u'label': u'cup, sliced',
u'qty': 1.0,
u'value': u'143.93'},
{u'eqv': 66.0,
u'eunit': u'g',
u'label': u'fruit (2-1/8" dia)',
u'qty': 1.0,
u'value': u'57.57'},
{u'eqv': 151.0,
u'eunit': u'g',
u'label': u'NLEA serving',
u'qty': 1.0,
u'value': u'131.72'}],
u'name': u'Water',
u'nutrient_id': u'255',
u'unit': u'g',
u'value': u'87.23'}
Be sure to document and test your function.
def ndb_report(ndbno):
"""
This queries the USDA Food composite database to make a search request for reports based on an NDB number
Input: ndbno(takes in one number)
Output: json object(of nutrient values of the NDB numbers provided)
"""
urlbase = "https://api.nal.usda.gov/ndb/V2/reports?"
try:
usdareq = requests.get(urlbase + 'ndbno={}&'.format("".join(ndbno)) + "type=b&"+ "format=json&" + "api_key={}".format(key))
usda_json = usdareq.json()
usda_json = usda_json['foods'][0]['food']['nutrients']
return usda_json
except (KeyError):
pass
ndb_report(nd)
#list of all foods
usda_list = [ndb_report(i) for i in ndbno]
kcal_list = [i[1]['value'] for i in usda_list]
kcal_list = pd.Series(map(int, kcal_list))
fresh_df['nutrition'] = kcal_list.values
usda_list[37][1]
Exercise 1.4. Which foods provide the best combination of price, yield, and nutrition? You can use kilocalories as a measure of "nutrition" here, but more a detailed analysis is better. Use plots to support your analysis.
import seaborn as sns
sns.regplot('yield', 'price_per_lb', data = fresh_df, fit_reg = False)
plt.title('Yield vs. Price per Pound')
plt.xlabel('Yield')
plt.ylabel('Price per Pound')
sns.regplot('nutrition', 'price_per_lb', data = fresh_df, fit_reg = False)
plt.title('Nutrition vs. Price per Pound')
plt.xlabel('Nutrition in KCal')
plt.ylabel('Price per Pound')
It seems as though, the best combination of Nutrition, Price per Pound, and Yield can be calculated by this equation: (Nutrition * Yield)/(Price per Lb)
fresh_df['score'] = fresh_df['nutrition']*fresh_df['yield']/fresh_df['price_per_lb']
best = fresh_df.ix[fresh_df['score'] > 50]
best
For this analysis I used the amount of KCals in 100 grams of each food (nutrition), the Price per Pound, and the yield percentage. By multiplying the yield and nutrition, we get the amount of KCals that survived the growing and picking process of the fruit or vegetables per 100 grams. This eliminates many low yield and low energy foods. Next by dividing by the price per pound, instead of price per cup, we make sure that we still using weight instead of volume as our comparison tool. By diving by pricer per pound, we allow lower priced foods with high energy per yield to stand out and get higher scores.
In the end, excluding the wrong potato type, I ended up with bananas being the best combination of nutrition, price, and yield, with sweet potatoes and avocados coming in 2nd and 3rd place.