Data exploration

Loading and inspecting the data

After visiting Michigan and learning that wine grapes can be grown (and that wine can be made!) in such a cold place, you decide that you would like to start a vineyard there. You've seen the vineyards and know that, although it is possible to grow wine grapes there, it is sometimes too cold. You wonder whether, because of climate change, Michigan might soon have a warmer, more suitable climate for growing grapes.

You know that Europe has a long history of growing grapes, and you wonder whether growers there kept records that might indicate how grapes respond to changes in temperature. You find a study that has compiled numerous records of grape harvest dates spanning more than four centuries, along with a database of temperature anomalies in Europe dating back to 1655.

Using the provided dataset, grape_harvest.csv (download here), you're going to explore how the European grape harvest date changes with respect to temperature across centuries of data.

To get started, import pandas in the cell below:

# Import pandas here

Answer

### ANSWER ###

import pandas as pd

Then, read grape_harvest.csv into a pandas DataFrame using the pd.read_csv() function.

# Read in the grape harvest data here
# Put grape_harvest.csv in the same directory you are running this .ipynb from
# If in a different directory, you will need to specify the path to the file

# Alternatively, you can read in the data from GitHub using the following url:
# https://raw.githubusercontent.com/DanChitwood/PlantsAndPython/master/grape_harvest.csv

Answer

### ANSWER ###

# Read in the grape harvest data here
# data = pd.read_csv("grape_harvest.csv")

# Or read in by url from GitHub

data = pd.read_csv("https://raw.githubusercontent.com/DanChitwood/PlantsAndPython/master/grape_harvest.csv")
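If you want to see what pd.read_csv() returns without downloading anything, one option is to parse a few rows from a string. This is only a sketch: it assumes the file is comma-separated with a header row, and `sample_csv` and `sample` are hypothetical names holding three rows that appear at the top of the dataset.

```python
import io

import pandas as pd

# Three rows from the top of the dataset (illustrative sample only);
# assumes a comma-separated layout with a header row
sample_csv = """year,region,harvest,anomaly
1700,alsace,42.9,-0.91
1701,alsace,35.9,-0.76
1702,alsace,45.0,-1.40
"""

# pd.read_csv accepts a file path, a URL, or a file-like object
sample = pd.read_csv(io.StringIO(sample_csv))

print(type(sample))   # the result is a pandas DataFrame
print(sample.shape)   # (number of rows, number of columns)
```

The same call works unchanged when you pass a local file name or the GitHub URL instead of the `io.StringIO` object.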

Now, write some code to inspect the properties of the data and then answer the following questions:

Use a pandas function to look at the first five lines of data:

# Put your code here

Answer

### ANSWER ###

data.head()

	year	region	harvest	anomaly
0	1700	alsace	42.9	-0.91
1	1701	alsace	35.9	-0.76
2	1702	alsace	45.0	-1.40
3	1703	alsace	49.4	-1.21
4	1704	alsace	30.4	-0.44

Use a pandas function to look at the last five lines of data:

# Put your code here

Answer

### ANSWER ###

data.tail()

	year	region	harvest	anomaly
4727	1873	vendee_poitou_charente	32.0	0.06
4728	1874	vendee_poitou_charente	2.0	-0.22
4729	1875	vendee_poitou_charente	29.0	-1.02
4730	1876	vendee_poitou_charente	32.0	-0.55
4731	1877	vendee_poitou_charente	34.0	-0.56
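Both head() and tail() return five rows by default, and both accept an optional row count. The sketch below uses `sample`, a hypothetical five-row stand-in built from rows shown in the outputs above, so it runs without the full file.

```python
import pandas as pd

# Hypothetical five-row stand-in for the full dataset
sample = pd.DataFrame({
    "year": [1700, 1701, 1702, 1876, 1877],
    "region": ["alsace", "alsace", "alsace",
               "vendee_poitou_charente", "vendee_poitou_charente"],
    "harvest": [42.9, 35.9, 45.0, 32.0, 34.0],
    "anomaly": [-0.91, -0.76, -1.40, -0.55, -0.56],
})

first_two = sample.head(2)   # first two rows
last_one = sample.tail(1)    # last row only

print(first_two)
print(last_one)
```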

Use a pandas function to look at summary statistics (like the count, min, max, and mean) for columns with continuous data:

# Put your code here

Answer

### ANSWER ###

data.describe()

	year	harvest	anomaly
count	4732.000000	4732.000000	4732.000000
mean	1832.835376	33.959510	-0.337811
std	91.713152	11.807714	0.675309
min	1655.000000	-13.000000	-2.470000
25%	1762.000000	25.900000	-0.750000
50%	1834.500000	34.000000	-0.280000
75%	1903.000000	42.600000	0.060000
max	2007.000000	75.000000	1.460000
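Notice that the region column is missing from the summary: by default, describe() reports only numeric columns. As a small sketch, passing include="all" adds the non-numeric columns too. Here `sample` is a hypothetical five-row stand-in for the full dataset.

```python
import pandas as pd

# Hypothetical five-row stand-in for the full dataset
sample = pd.DataFrame({
    "year": [1700, 1701, 1702, 1876, 1877],
    "region": ["alsace", "alsace", "alsace",
               "vendee_poitou_charente", "vendee_poitou_charente"],
    "harvest": [42.9, 35.9, 45.0, 32.0, 34.0],
    "anomaly": [-0.91, -0.76, -1.40, -0.55, -0.56],
})

numeric_summary = sample.describe()            # numeric columns only
full_summary = sample.describe(include="all")  # adds non-numeric columns

print(numeric_summary.columns.tolist())
print(full_summary.columns.tolist())
```

For a text column such as region, the extra rows include the number of unique values and the most frequent value, while the numeric statistics are left blank.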

Use a pandas function to retrieve the names of the columns.

# Put your code here

Answer

### ANSWER ###

data.columns

Index(['year', 'region', 'harvest', 'anomaly'], dtype='object')
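The columns attribute returns a pandas Index object. If a plain Python list is easier to work with, or if you also want to see the data type of each column, something like the following sketch may help. `sample` is a hypothetical two-row stand-in for the full dataset.

```python
import pandas as pd

# Hypothetical two-row stand-in for the full dataset
sample = pd.DataFrame({
    "year": [1700, 1877],
    "region": ["alsace", "vendee_poitou_charente"],
    "harvest": [42.9, 34.0],
    "anomaly": [-0.91, -0.56],
})

column_names = list(sample.columns)  # convert the Index to a plain list
print(column_names)

# dtypes shows the data type pandas inferred for each column
print(sample.dtypes)
```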

For one of the columns that is a categorical variable, use a function to list all the levels for that variable.

# Put your code here

Answer

### ANSWER ###

data['region'].unique()

array(['alsace', 'auvergne', 'auxerre_avalon', 'beaujolais_maconnais',
       'bordeaux', 'burgundy', 'champagne_1', 'champagne_2',
       'gaillac_south_west', 'germany', 'high_loire_valley',
       'ile_de_france', 'jura', 'languedoc', 'low_loire_valley',
       'luxembourg', 'maritime_alps', 'northern_italy',
       'northern_lorraine', 'northern_rhone_valley', 'savoie',
       'southern_lorraine', 'southern_rhone_valley', 'spain',
       'switzerland_leman_lake', 'various_south_east',
       'vendee_poitou_charente'], dtype=object)
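If you only need to know how many levels there are rather than their names, nunique() returns the count directly. The sketch below uses `sample`, a hypothetical five-row stand-in for the full dataset.

```python
import pandas as pd

# Hypothetical five-row stand-in for the full dataset
sample = pd.DataFrame({
    "year": [1700, 1701, 1702, 1876, 1877],
    "region": ["alsace", "alsace", "alsace",
               "vendee_poitou_charente", "vendee_poitou_charente"],
})

levels = sample["region"].unique()     # the distinct values themselves
n_levels = sample["region"].nunique()  # how many distinct values there are

print(levels)
print(n_levels)
```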

For the categorical variable, also use a function to determine how many rows represent each level.

# Put your code here

Answer

### ANSWER ###

data['region'].value_counts()

switzerland_leman_lake    353
burgundy                  350
southern_rhone_valley     333
jura                      306
ile_de_france             302
bordeaux                  274
alsace                    262
languedoc                 233
spain                     231
low_loire_valley          203
champagne_2               183
germany                   165
northern_italy            156
maritime_alps             136
auxerre_avalon            128
northern_lorraine         127
northern_rhone_valley     126
savoie                    123
southern_lorraine         109
luxembourg                107
high_loire_valley          92
various_south_east         82
champagne_1                81
auvergne                   80
vendee_poitou_charente     75
beaujolais_maconnais       73
gaillac_south_west         42
Name: region, dtype: int64
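By default, value_counts() sorts the levels from most to least frequent. Two variations that can be handy are sorting alphabetically by level and reporting proportions instead of raw counts. This is a sketch on `sample`, a hypothetical five-row stand-in for the full dataset.

```python
import pandas as pd

# Hypothetical five-row stand-in for the full dataset
sample = pd.DataFrame({
    "year": [1700, 1701, 1702, 1876, 1877],
    "region": ["alsace", "alsace", "alsace",
               "vendee_poitou_charente", "vendee_poitou_charente"],
})

counts = sample["region"].value_counts()               # rows per level
alphabetical = counts.sort_index()                     # ordered by level name
shares = sample["region"].value_counts(normalize=True) # fraction of rows

print(alphabetical)
print(shares)
```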

How many rows are in this dataset?

# Put your code here

Answer

### ANSWER ###

print(len(data))

4732
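Another way to get the row count is the shape attribute, which gives the number of rows and columns together. A minimal sketch on `sample`, a hypothetical five-row stand-in for the full dataset:

```python
import pandas as pd

# Hypothetical five-row stand-in for the full dataset
sample = pd.DataFrame({
    "year": [1700, 1701, 1702, 1876, 1877],
    "region": ["alsace", "alsace", "alsace",
               "vendee_poitou_charente", "vendee_poitou_charente"],
    "harvest": [42.9, 35.9, 45.0, 32.0, 34.0],
    "anomaly": [-0.91, -0.76, -1.40, -0.55, -0.56],
})

n_rows, n_cols = sample.shape  # shape is a (rows, columns) tuple
print(n_rows, n_cols)

# len() on a DataFrame gives the same row count
print(len(sample) == n_rows)
```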

Congratulations on reading in the data and exploring its structure! In the next activity, we will be exploring the relationship between grape harvest dates and climate!