
Data exploration

Loading and inspecting the data

After visiting Michigan and learning that wine grapes can be grown (and that wine can be made!) in such a cold place, you decide that you would like to start a vineyard there. You've seen the vineyards and know that, although it is possible to grow wine grapes there, it is sometimes too cold. You wonder whether, because of climate change, Michigan might soon have a warmer, more suitable climate for growing grapes.

You know that Europe has a long history of growing grapes, and you wonder if growers there kept records that might indicate how grapes respond to changes in temperature. You find a study that has compiled numerous records of grape harvest dates spanning more than four centuries, along with a database of temperature anomalies in Europe dating back to 1655.

Using the provided dataset, grape_harvest.csv (download here), you're going to explore how the European grape harvest date changes with respect to temperature across centuries of data.

To get started, import pandas in the cell below:

# Import pandas here
### ANSWER ###

import pandas as pd

Then, read in grape_harvest.csv as a pandas DataFrame using the pd.read_csv() function.

# Read in the grape harvest data here
# Put grape_harvest.csv in the same directory you are running this .ipynb from
# If in a different directory, you will need to specify the path to the file

# Alternatively, you can read in the data from GitHub using the following url:
# https://raw.githubusercontent.com/DanChitwood/PlantsAndPython/master/grape_harvest.csv
### ANSWER ###

# Read in the grape harvest data here
# data = pd.read_csv("grape_harvest.csv")

# Or read in by url from GitHub

data = pd.read_csv("https://raw.githubusercontent.com/DanChitwood/PlantsAndPython/master/grape_harvest.csv")
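As a minimal sketch of what pd.read_csv() does, the example below parses a few rows from an in-memory string instead of a file, so it runs without a download. It assumes the file is comma-separated with a header row naming the four columns; `sample_csv` and `sample` are illustrative names, not part of the activity.

```python
import io

import pandas as pd

# A short excerpt in the assumed layout of grape_harvest.csv
# (comma-separated, header row first). Illustrative only.
sample_csv = """year,region,harvest,anomaly
1700,alsace,42.9,-0.91
1701,alsace,35.9,-0.76
1702,alsace,45.0,-1.40
"""

# pd.read_csv() accepts a local path, a URL, or a file-like object
sample = pd.read_csv(io.StringIO(sample_csv))

print(sample.shape)          # (rows, columns)
print(list(sample.columns))  # column names taken from the header row
```

The same call works unchanged when the argument is a path such as "grape_harvest.csv" or the GitHub URL given in the exercise.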

Now, write some code to inspect the properties of the data and then answer the following questions:

Use a pandas function to look at the first five lines of data:

# Put your code here
### ANSWER ###

data.head()

   year  region  harvest  anomaly
0  1700  alsace     42.9    -0.91
1  1701  alsace     35.9    -0.76
2  1702  alsace     45.0    -1.40
3  1703  alsace     49.4    -1.21
4  1704  alsace     30.4    -0.44

Use a pandas function to look at the last five lines of data:

# Put your code here
### ANSWER ###

data.tail()

      year                  region  harvest  anomaly
4727  1873  vendee_poitou_charente     32.0     0.06
4728  1874  vendee_poitou_charente      2.0    -0.22
4729  1875  vendee_poitou_charente     29.0    -1.02
4730  1876  vendee_poitou_charente     32.0    -0.55
4731  1877  vendee_poitou_charente     34.0    -0.56
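Both head() and tail() return five rows by default, and each takes an optional number of rows. The sketch below shows this on a small stand-in DataFrame built from the first rows of the dataset; `sample`, `first_two`, and `last_three` are illustrative names.

```python
import pandas as pd

# Stand-in for the full dataset: the first five rows shown by data.head()
sample = pd.DataFrame({
    "year": [1700, 1701, 1702, 1703, 1704],
    "region": ["alsace"] * 5,
    "harvest": [42.9, 35.9, 45.0, 49.4, 30.4],
    "anomaly": [-0.91, -0.76, -1.40, -1.21, -0.44],
})

first_two = sample.head(2)   # first 2 rows instead of the default 5
last_three = sample.tail(3)  # last 3 rows instead of the default 5

print(first_two)
print(last_three)
```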

Use a pandas function to look at summary statistics (like the count, min, max, and mean) for columns with continuous data:

# Put your code here
### ANSWER ###

data.describe()

              year      harvest      anomaly
count  4732.000000  4732.000000  4732.000000
mean   1832.835376    33.959510    -0.337811
std      91.713152    11.807714     0.675309
min    1655.000000   -13.000000    -2.470000
25%    1762.000000    25.900000    -0.750000
50%    1834.500000    34.000000    -0.280000
75%    1903.000000    42.600000     0.060000
max    2007.000000    75.000000     1.460000
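Note that describe() skips the region column, because by default it summarizes only numeric columns. A rough sketch of how you might also summarize a text column and check for missing values follows. It uses a four-row stand-in DataFrame with rows taken from the outputs above; the variable names are illustrative.

```python
import pandas as pd

# Small stand-in for the full dataset (rows taken from the head/tail output)
sample = pd.DataFrame({
    "year": [1700, 1701, 1876, 1877],
    "region": ["alsace", "alsace",
               "vendee_poitou_charente", "vendee_poitou_charente"],
    "harvest": [42.9, 35.9, 32.0, 34.0],
    "anomaly": [-0.91, -0.76, -0.55, -0.56],
})

# Default: numeric columns only (year, harvest, anomaly)
numeric_summary = sample.describe()

# Calling describe() on a text column reports count, unique, top, and freq
region_summary = sample["region"].describe()

# Number of missing values in each column
missing_per_column = sample.isna().sum()

print(numeric_summary)
print(region_summary)
print(missing_per_column)
```

In the full output above, count is 4732 for all three numeric columns, which matches the number of rows, so none of those columns has missing values.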

Use a pandas function to retrieve the names of the columns.

# Put your code here
### ANSWER ###

data.columns
Index(['year', 'region', 'harvest', 'anomaly'], dtype='object')
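Column names alone do not say which columns are numeric and which are categorical. One possible way to check, sketched on a small stand-in DataFrame, is to look at dtypes or to split the columns with select_dtypes(); `numeric_cols` and `other_cols` are illustrative names.

```python
import pandas as pd

# Small stand-in with the same four columns as the dataset
sample = pd.DataFrame({
    "year": [1700, 1701],
    "region": ["alsace", "alsace"],
    "harvest": [42.9, 35.9],
    "anomaly": [-0.91, -0.76],
})

# Data type of every column
print(sample.dtypes)

# Split the column names into numeric and non-numeric groups
numeric_cols = sample.select_dtypes(include="number").columns.tolist()
other_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(numeric_cols)
print(other_cols)
```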

For one of the columns that is a categorical variable, use a function to list all the levels for that variable.

# Put your code here
### ANSWER ###

data['region'].unique()
array(['alsace', 'auvergne', 'auxerre_avalon', 'beaujolais_maconnais',
       'bordeaux', 'burgundy', 'champagne_1', 'champagne_2',
       'gaillac_south_west', 'germany', 'high_loire_valley',
       'ile_de_france', 'jura', 'languedoc', 'low_loire_valley',
       'luxembourg', 'maritime_alps', 'northern_italy',
       'northern_lorraine', 'northern_rhone_valley', 'savoie',
       'southern_lorraine', 'southern_rhone_valley', 'spain',
       'switzerland_leman_lake', 'various_south_east',
       'vendee_poitou_charente'], dtype=object)
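If you only need the number of levels rather than the levels themselves, nunique() returns it directly. A minimal sketch on a short illustrative Series of region names (not the full column):

```python
import pandas as pd

# Short illustrative column of region names with repeats
regions = pd.Series(
    ["alsace", "alsace", "bordeaux", "burgundy", "burgundy", "jura"]
)

levels = regions.unique()     # distinct values, in order of first appearance
n_levels = regions.nunique()  # number of distinct values
sorted_levels = sorted(levels)

print(levels)
print(n_levels)
print(sorted_levels)
```

Applied to the full region column, the array above contains 27 distinct regions.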

For the categorical variable, also use a function to determine how many rows represent each level.

# Put your code here
### ANSWER ###

data['region'].value_counts()
switzerland_leman_lake    353
burgundy                  350
southern_rhone_valley     333
jura                      306
ile_de_france             302
bordeaux                  274
alsace                    262
languedoc                 233
spain                     231
low_loire_valley          203
champagne_2               183
germany                   165
northern_italy            156
maritime_alps             136
auxerre_avalon            128
northern_lorraine         127
northern_rhone_valley     126
savoie                    123
southern_lorraine         109
luxembourg                107
high_loire_valley          92
various_south_east         82
champagne_1                81
auvergne                   80
vendee_poitou_charente     75
beaujolais_maconnais       73
gaillac_south_west         42
Name: region, dtype: int64
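value_counts() sorts by count, largest first. The sketch below shows two variations that may be useful here: proportions instead of raw counts, and alphabetical ordering. It uses a short illustrative Series rather than the full column.

```python
import pandas as pd

# Short illustrative column of region names
regions = pd.Series(["jura", "alsace", "jura", "bordeaux", "jura", "alsace"])

counts = regions.value_counts()                # rows per level, largest first
shares = regions.value_counts(normalize=True)  # fraction of rows per level
alphabetical = counts.sort_index()             # same counts, ordered by name

print(counts)
print(shares)
print(alphabetical)
```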

How many rows are in this dataset?

# Put your code here
### ANSWER ###

print(len(data))

4732
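As a quick consistency check, the per-region counts reported by value_counts() should add up to the total number of rows. The sketch below re-enters those counts by hand (copied from the output above) and sums them. It also shows the shape attribute, which reports rows and columns together, on a small stand-in DataFrame. All variable names are illustrative.

```python
import pandas as pd

# Per-region row counts copied from the value_counts() output above
region_counts = pd.Series({
    "switzerland_leman_lake": 353, "burgundy": 350,
    "southern_rhone_valley": 333, "jura": 306, "ile_de_france": 302,
    "bordeaux": 274, "alsace": 262, "languedoc": 233, "spain": 231,
    "low_loire_valley": 203, "champagne_2": 183, "germany": 165,
    "northern_italy": 156, "maritime_alps": 136, "auxerre_avalon": 128,
    "northern_lorraine": 127, "northern_rhone_valley": 126, "savoie": 123,
    "southern_lorraine": 109, "luxembourg": 107, "high_loire_valley": 92,
    "various_south_east": 82, "champagne_1": 81, "auvergne": 80,
    "vendee_poitou_charente": 75, "beaujolais_maconnais": 73,
    "gaillac_south_west": 42,
})

total_rows = int(region_counts.sum())
print(total_rows)  # 4732, the same value as len(data)

# shape gives (number of rows, number of columns) in one call
sample = pd.DataFrame({
    "year": [1700, 1701],
    "region": ["alsace", "alsace"],
    "harvest": [42.9, 35.9],
    "anomaly": [-0.91, -0.76],
})
print(sample.shape)  # (2, 4)
```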

Congratulations on reading in the data and exploring its structure! In the next activity, we will be exploring the relationship between grape harvest dates and climate!