Predicting West Nile Virus - Kaggle Competition

In [1]:
import pandas as pd
from matplotlib import pyplot as plt
import numpy as np
# pip install pandas-profiling
import pandas_profiling as pdp
import seaborn as sns
%matplotlib inline
import time
import datetime

traps = pd.read_csv('assets/train.csv')
test = pd.read_csv('assets/test.csv')
spray = pd.read_csv('assets/spray.csv')
weather = pd.read_csv('assets/weather.csv')

EDA

Traps

In [2]:
pdp.ProfileReport(traps)
Out[2]:

Overview

Dataset info

Number of variables 12
Number of observations 10506
Total Missing (%) 0.0%
Total size in memory 985.0 KiB
Average record size in memory 96.0 B

Variables types

Numeric 6
Categorical 6
Date 0
Text (Unique) 0
Rejected 0

Warnings

  • Address has a high cardinality: 138 distinct values
  • AddressNumberAndStreet has a high cardinality: 138 distinct values
  • Date has a high cardinality: 95 distinct values
  • Street has a high cardinality: 128 distinct values
  • Trap has a high cardinality: 136 distinct values
  • WnvPresent has 9955 / 94.8% zeros
  • Dataset has 813 duplicate rows

Variables (condensed from the report's per-variable detail)

Variable  Type  Distinct  Notes
Address  Categorical  138  most frequent: ORD Terminal 5, O'Hare International Airport (750 rows, 7.1%)
AddressAccuracy  Numeric  4  values 3, 5, 8, 9; mean 7.82
AddressNumberAndStreet  Categorical  138  most frequent: 1000 W OHARE AIRPORT, Chicago, IL (750 rows)
Block  Numeric  64  range 10-98, mean 35.7
Date  Categorical  95  most frequent: 2007-08-01 (551 rows)
Latitude  Numeric  138  range 41.645-42.017, mean 41.84
Longitude  Numeric  138  range -87.931 to -87.532, mean -87.70
NumMosquitos  Numeric  50  range 1-50, mean 12.9; the cap value 50 appears 1019 times (9.7%)
Species  Categorical  7  CULEX PIPIENS/RESTUANS 45.2%, CULEX RESTUANS 26.1%, CULEX PIPIENS 25.7%, four other species 3.0%
Street  Categorical  128  most frequent: W OHARE AIRPORT (750 rows)
Trap  Categorical  136  most frequent: T900 (750 rows)
WnvPresent  Numeric  2  binary; 0: 9955 rows (94.8%), 1: 551 rows (5.2%)

Sample

Date Address Species Block Street Trap AddressNumberAndStreet Latitude Longitude AddressAccuracy NumMosquitos WnvPresent
0 2007-05-29 4100 North Oak Park Avenue, Chicago, IL 60634,... CULEX PIPIENS/RESTUANS 41 N OAK PARK AVE T002 4100 N OAK PARK AVE, Chicago, IL 41.954690 -87.800991 9 1 0
1 2007-05-29 4100 North Oak Park Avenue, Chicago, IL 60634,... CULEX RESTUANS 41 N OAK PARK AVE T002 4100 N OAK PARK AVE, Chicago, IL 41.954690 -87.800991 9 1 0
2 2007-05-29 6200 North Mandell Avenue, Chicago, IL 60646, USA CULEX RESTUANS 62 N MANDELL AVE T007 6200 N MANDELL AVE, Chicago, IL 41.994991 -87.769279 9 1 0
3 2007-05-29 7900 West Foster Avenue, Chicago, IL 60656, USA CULEX PIPIENS/RESTUANS 79 W FOSTER AVE T015 7900 W FOSTER AVE, Chicago, IL 41.974089 -87.824812 8 1 0
4 2007-05-29 7900 West Foster Avenue, Chicago, IL 60656, USA CULEX RESTUANS 79 W FOSTER AVE T015 7900 W FOSTER AVE, Chicago, IL 41.974089 -87.824812 8 4 0

From this profile report of traps, there are a few things we found important and will deal with:

  • The multiple address features are redundant given Latitude and Longitude, so we'll remove them.
  • WnvPresent is nearly 95% zeros, so we'll need to stratify our training data so that the negative class doesn't dominate the results.
  • Species has 7 categorical values: create dummies.
  • The Date feature should be converted to a datetime dtype.
  • 813 duplicate rows: we know from the data description that observations are capped at 50 mosquitoes per record, so a single collection can span multiple records. We'll keep the duplicates for this reason and sum NumMosquitos with a groupby.
In [2]:
## We can tackle the multiple addresses and the grouping in this one step
traps = traps[['Trap',
               'Latitude',
               'Longitude',
               'Date',
               'Species',
               'WnvPresent',
               'NumMosquitos']].groupby(['Trap',
                                         'Latitude',
                                         'Longitude',
                                         'Date', 
                                         'Species',
                                         'WnvPresent']).agg({'NumMosquitos':np.sum}).reset_index()

feats = traps.columns

## get species dummies and add to traps df
traps = pd.get_dummies(traps, columns=['Species'])

## convert to date
traps.Date = pd.to_datetime(traps.Date)

Let's look at the features that we'll have when we run our test data through the models:

In [3]:
test.columns
Out[3]:
Index([u'Id', u'Date', u'Address', u'Species', u'Block', u'Street', u'Trap',
       u'AddressNumberAndStreet', u'Latitude', u'Longitude',
       u'AddressAccuracy'],
      dtype='object')
In [4]:
# The test data doesn't include NumMosquitos... 
# this means we should drop this column from traps 
# and then drop the duplicates

traps.drop('NumMosquitos',axis=1,inplace=True)
traps.drop_duplicates(inplace=True)


## Might as well go ahead and perform the same modifications on the test data

# remove the redundant address features from test
test = test[feats.drop(['WnvPresent','NumMosquitos'])]

## get species dummies and add to test df
test = pd.get_dummies(test, columns=['Species'])

## convert to date
test.Date = pd.to_datetime(test.Date)

Spray

In [6]:
pdp.ProfileReport(spray)
Out[6]:

Overview

Dataset info

Number of variables 4
Number of observations 14835
Total Missing (%) 1.0%
Total size in memory 463.7 KiB
Average record size in memory 32.0 B

Variables types

Numeric 2
Categorical 2
Date 0
Text (Unique) 0
Rejected 0

Warnings

  • Time has 584 / 3.9% missing values
  • Time has a high cardinality: 8584 distinct values
  • Dataset has 541 duplicate rows

Variables (condensed from the report's per-variable detail)

Variable  Type  Distinct  Notes
Date  Categorical  10  all spray dates fall in 2011 and 2013; most frequent: 2013-08-15 (2668 rows, 18.0%)
Latitude  Numeric  12887  range 41.714-42.396, mean 41.91
Longitude  Numeric  13007  range -88.096 to -87.587, mean -87.74
Time  Categorical  8584  584 missing (3.9%); 7:44:32 PM repeats 541 times, matching the 541 duplicate rows

Sample

Date Time Latitude Longitude
0 2011-08-29 6:56:58 PM 42.391623 -88.089163
1 2011-08-29 6:57:08 PM 42.391348 -88.089163
2 2011-08-29 6:57:18 PM 42.391022 -88.089157
3 2011-08-29 6:57:28 PM 42.390637 -88.089158
4 2011-08-29 6:57:38 PM 42.390410 -88.088858

This is a simple dataset. Time has some missing values, but it's too granular and variable to be a useful feature, so we'll exclude it. The only other modifications are converting Date to a datetime dtype and dropping the 541 duplicate rows.

In [5]:
spray.Date = pd.to_datetime(spray.Date)
In [6]:
spray.duplicated().sum()
Out[6]:
541
In [7]:
# Drop the duplicate records
spray.drop_duplicates(inplace=True)

Weather

In [10]:
pdp.ProfileReport(weather)
Out[10]:

Overview

Dataset info

Number of variables 22
Number of observations 2944
Total Missing (%) 0.0%
Total size in memory 506.1 KiB
Average record size in memory 176.0 B

Variables types

Numeric 5
Categorical 15
Date 0
Text (Unique) 0
Rejected 2

Warnings

  • AvgSpeed has a high cardinality: 178 distinct values
  • CodeSum has a high cardinality: 98 distinct values
  • Date has a high cardinality: 1472 distinct values
  • DewPoint is highly correlated with Tmin (ρ = 0.90436) and was rejected
  • PrecipTotal has a high cardinality: 168 distinct values
  • SeaLevel has a high cardinality: 102 distinct values
  • StnPressure has a high cardinality: 104 distinct values
  • Sunrise has a high cardinality: 122 distinct values
  • Sunset has a high cardinality: 119 distinct values
  • Tavg has a high cardinality: 60 distinct values
  • Water1 has constant value M and was rejected

Variables (condensed from the report's per-variable detail)

Variable  Type  Distinct  Notes
AvgSpeed  Categorical  178  wind speed stored as strings
CodeSum  Categorical  98  blank for 54.7% of rows; otherwise weather-event codes (RA, BR, TSRA, HZ, ...)
Cool  Categorical  31  0 for 39.0% of rows
Date  Categorical  1472  two rows per date (one per station)
Depart  Categorical  42  'M' (missing) for 50.0% of rows
Depth  Categorical  2  'M' 50.0%, 0 50.0%
DewPoint  Rejected  -  highly correlated with Tmin (ρ = 0.90436)
Heat  Categorical  31  0 for 63.5% of rows
PrecipTotal  Categorical  168  0.00 for 53.6% of rows; 'T' (trace) for 10.8%
ResultDir  Numeric  36  range 1-36, mean 17.5
ResultSpeed  Numeric  190  range 0.1-24.1, mean 6.96
SeaLevel  Categorical  102  sea-level pressure stored as strings
SnowFall  Categorical  4  'M' 50.0%, 0.0 49.6%, trace otherwise
Station  Numeric  2  stations 1 and 2, 1472 rows each
StnPressure  Categorical  104  station pressure stored as strings
Sunrise  Categorical  122  '-' for 50.0% of rows
Sunset  Categorical  119  '-' for 50.0% of rows
Tavg  Categorical  60  average temperature stored as strings
Tmax  Numeric  63  range 41-104, mean 76.2
Tmin  Numeric  54  range 29-83, mean 57.8
Water1  Constant  1  constant value 'M'
WetBulb  Categorical  48  wet-bulb temperature stored as strings

Sample

Station Date Tmax Tmin Tavg Depart DewPoint WetBulb Heat Cool Sunrise Sunset CodeSum Depth Water1 SnowFall PrecipTotal StnPressure SeaLevel ResultSpeed ResultDir AvgSpeed
0 1 2007-05-01 83 50 67 14 51 56 0 2 0448 1849 0 M 0.0 0.00 29.10 29.82 1.7 27 9.2
1 2 2007-05-01 84 52 68 M 51 57 0 3 - - M M M 0.00 29.18 29.82 2.7 25 9.6
2 1 2007-05-02 59 42 51 -3 42 47 14 0 0447 1850 BR 0 M 0.0 0.00 29.38 30.09 13.0 4 13.4
3 2 2007-05-02 60 43 52 M 42 47 13 0 - - BR HZ M M M 0.00 29.44 30.08 13.3 2 13.4
4 1 2007-05-03 66 46 56 2 40 48 9 0 0446 1851 0 M 0.0 0.00 29.39 30.12 11.7 7 11.9

Important notes from the profile report:

  • Convert Date to a datetime dtype.
  • The weather data has two observations per date, one from station 1 and one from station 2, so we may only need data from one station. From the sample in the profile report, the two stations appear to differ in how many 'M' (missing) values they contain.
  • Several features should be numeric but contain flags like 'T' (trace) or 'M' (missing), so these features need to be cleaned up.
  • CodeSum: any code in this feature indicates a significant weather event, so we can turn it into a binary feature indicating whether or not such an event occurred.
  • Tavg has more missing values than Tmin and Tmax, so we'll fill it with the average of Tmin and Tmax.
In [8]:
weather.Date = pd.to_datetime(weather.Date)
In [9]:
# compare the two stations on the columns stored as strings
# (note: the '\D' regex matches any non-digit character, including 'M', 'T', '-' and '.',
#  so these are counts of non-numeric entries rather than strictly missing ones)
check_missing_values = ['Depart','Heat','Cool','Sunrise','Sunset','Depth','Water1','SnowFall']
for col in check_missing_values:
    station_1 = len(weather[(weather[col].str.contains('\D')) & (weather.Station==1)])
    print col + ' has ' + str(station_1) + ' missing values at station 1'
    station_2 = len(weather[(weather[col].str.contains('\D')) & (weather.Station==2)])
    print col + ' has ' + str(station_2) + ' missing values at station 2'
    print ''
Depart has 1271 missing values at station 1
Depart has 1472 missing values at station 2

Heat has 0 missing values at station 1
Heat has 11 missing values at station 2

Cool has 1096 missing values at station 1
Cool has 1021 missing values at station 2

Sunrise has 0 missing values at station 1
Sunrise has 1472 missing values at station 2

Sunset has 0 missing values at station 1
Sunset has 1472 missing values at station 2

Depth has 0 missing values at station 1
Depth has 1472 missing values at station 2

Water1 has 1472 missing values at station 1
Water1 has 1472 missing values at station 2

SnowFall has 1472 missing values at station 1
SnowFall has 1472 missing values at station 2

It looks like station 2 consistently has more missing values, so let's stick to using station 1 as our weather source.

In [10]:
weather = weather[weather.Station==1].drop('Station',axis=1)

A number of these features are constant or nearly all zeros at station 1 (Water1, Depth, SnowFall), and we don't plan to use the rest (Depart, Heat, Cool, Sunrise, Sunset), so we'll drop them all:

In [11]:
exclude = ['Depart','Heat','Cool','Sunrise','Sunset','Depth','Water1','SnowFall']
include = weather.columns.drop(exclude)
weather = weather[include]

weather.columns
Out[11]:
Index([u'Date', u'Tmax', u'Tmin', u'Tavg', u'DewPoint', u'WetBulb', u'CodeSum',
       u'PrecipTotal', u'StnPressure', u'SeaLevel', u'ResultSpeed',
       u'ResultDir', u'AvgSpeed'],
      dtype='object')
In [12]:
# clean up features that should be numeric

# the PrecipTotal column contains the letter 'T' in some rows
# this indicates a 'trace' amount of precipitation, which is
# defined as less than 0.005
# 'M' indicates missing data

def clean_col(column):
    weather[column] = weather[column].str.replace('T','0.005')
    weather[column] = weather[column].str.replace('M','0.0')
    weather[column] = weather[column].astype(float)



columns = ['Tavg','PrecipTotal','WetBulb','StnPressure','SeaLevel','AvgSpeed']
for col in columns:
    clean_col(col)

# Any letters in CodeSum signify some type of significant weather event.
# Flag those rows as 1, and rows that are just whitespace as 0.
weather['CodeSum'] = weather['CodeSum'].str.strip()
weather.loc[weather.CodeSum.str.contains('^\w'), 'CodeSum'] = '1'
weather.loc[weather.CodeSum != '1', 'CodeSum'] = '0'
weather['CodeSum'] = weather['CodeSum'].astype(float)

# fill in missing Tavg (set to 0 above) with the average of Tmin and Tmax;
# this is not exactly how Tavg is calculated, but it's close enough
weather.loc[weather.Tavg == 0, 'Tavg'] = (weather.Tmin + weather.Tmax) / 2.0

Modified Features -- Look at pairplots

In [13]:
no_dummies = [x for x in traps.columns if not 'Species_' in x]
sns.pairplot(traps[no_dummies])
Out[13]:
<seaborn.axisgrid.PairGrid at 0x115fa2dd0>
In [14]:
sns.pairplot(spray)
Out[14]:
<seaborn.axisgrid.PairGrid at 0x115fa2890>
In [15]:
no_dummies = weather.columns.drop(['CodeSum','SeaLevel','StnPressure'])
sns.pairplot(weather[no_dummies])
Out[15]:
<seaborn.axisgrid.PairGrid at 0x11763be90>

Create More Features

In [16]:
#Before adding anything, let's see what the feature correlation looks like
def find_correlation(data,title):
    correlation = data.corr()
    sns.heatmap(correlation, vmax=.8, square=True)
    plt.title(title)

find_correlation(traps, 'Traps')
In [17]:
## Parse Dates to see if they can be used in the modeling
traps['year']  = traps['Date'].dt.year
traps['month'] = traps['Date'].dt.month
traps['day']   = traps['Date'].dt.day

Add Weather data

This function adds weather features to our traps df. It takes the dataframe to augment plus three arguments:

  • the column name (as a string) from the weather dataframe that we want to extract
  • a numpy function to apply to that column
  • the number of previous days to aggregate over (default 7)
In [18]:
def weather_add(df, weather_col, func, days_range=7):
    new_list = []
    for i in df['Date']:
        mask = (weather['Date'] <= i) & (weather['Date'] >= i - pd.Timedelta(days=days_range))
        data_list = func(weather[weather_col][mask])
        new_list.append(data_list)
    return new_list
In [1]:
## running this cell adds rolling weather features to the traps dataframe:
## means over the previous 7 days (unless otherwise specified), plus the precipitation sum


for col in ['Tmax','Tmin','Tavg','WetBulb','CodeSum','StnPressure','SeaLevel', 'ResultSpeed', 'ResultDir','AvgSpeed']:
    traps[col] = weather_add(traps, weather_col=col, func=np.mean)
    
traps['PrecipTotal']= weather_add(traps, weather_col='PrecipTotal', func= np.sum)
traps['DewPoint']= weather_add(traps, weather_col ='DewPoint', func=np.mean, days_range = 10)

traps['Tmax_20'] = weather_add(traps, weather_col='Tmax',func=np.mean, days_range=20)
traps['Tmin_20'] = weather_add(traps, weather_col='Tmin', func=np.mean, days_range=20)

traps['Tmin_3']  = weather_add(traps, weather_col='Tmin', func=np.mean, days_range=3)
traps['Tmax_3']  = weather_add(traps, weather_col='Tmax', func=np.mean, days_range=3)

Add Spray data

This function will add spray features to our traps df, and takes 5 arguments:

  • traps: the dataframe that will be added to
  • start: integer number of days before the trap inspection (closer to the inspection date)
  • stop: integer number of days before the trap inspection (farther from the inspection date)
  • col: column to hold the number of days since the nearest spray
  • Spray_col: column to hold the distance to the nearest spray
In [5]:
def add_cols(traps,start,stop,col,Spray_col):
    # ex) I want the week leading up to the inspection date: start = 0, end = 7
    # ex) I want the week before that: start = 7, end = 14
    from geopy.distance import vincenty
    traps_sprayed = traps[traps[Spray_col].isnull()].index.values
    remaining = len(traps_sprayed)   
    
    for i in traps_sprayed:
        if start>0: # subtract # of days from the date the trap is inspected
            start_date = traps.Date.loc[i] - pd.Timedelta(days=start) 
        else: # if start=0, then we're starting on the trap inspection day and collecting data backwards
            start_date = traps.Date.loc[i]
        end_date = traps.Date.loc[i] - pd.Timedelta(days=stop)
        dates = pd.date_range(start=end_date, end=start_date).tolist() # create list of all dates in our start, end points
        trap_lat = traps.Latitude.loc[i]
        trap_long = traps.Longitude.loc[i]
        dist = []

        # for each dated trap inspection, select only the spray records within the date range
        spray_temp = spray[['Latitude','Longitude','Date']][spray.Date.isin(dates)]
        
        remaining-=1 # counts how many records are left to process
        print str(remaining),' remaining'

        # run through each coordinate in the spray data and record the distance from our trap
        for j in range(0,len(spray_temp)):
            spray_lat = spray_temp.Latitude.iloc[j] 
            spray_long = spray_temp.Longitude.iloc[j]
            a = (trap_lat, trap_long) # trap coordinates
            b = (spray_lat, spray_long) # spray coordinates
            dist.append(vincenty(a, b).miles) # calculate the distance between the points
        
        try:
            # set the spray distance to the shortest distance found
            traps.loc[i, Spray_col] = min(dist)
            dt = dist.index(min(dist))
            spray_dt = spray_temp.Date.iloc[dt]
            # record the number of days from inspection that the nearest spray occurred
            traps.loc[i, col] = pd.Timedelta(traps.Date.loc[i]-spray_dt).days
        except ValueError:
            # no spray records fell within this date range
            pass
In [ ]:
# subset the trap data to the years that actually have spray data (2011 and 2013)
traps_spray = traps[traps.year.isin([2011,2013])].copy()

# create the columns
traps_spray['Spray_Dist'] = np.NaN
traps_spray['Spray_Days_Ago'] = np.NaN

# week leading up to inspection
add_cols(traps_spray,0,7,'Spray_Days_Ago','Spray_Dist')
#one week out
add_cols(traps_spray,7,14,'Spray_Days_Ago','Spray_Dist')
# two weeks out
add_cols(traps_spray,14,21,'Spray_Days_Ago','Spray_Dist')
# three weeks out
add_cols(traps_spray,21,28,'Spray_Days_Ago','Spray_Dist')
# four weeks out
add_cols(traps_spray,28,35,'Spray_Days_Ago','Spray_Dist')
In [ ]:
# Check what fraction of the spray-era records are still missing spray info
traps_spray['Spray_Dist'].isnull().sum()/float(len(traps_spray))

Nope: too many records are still missing spray information, so we're not going to include the spray data in the models. We could potentially model the 2011/2013 spray data to impute values for 2007/2009, but we didn't have time to go down that path; a rough sketch of the idea follows.
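For reference, here's a minimal, hypothetical sketch of that imputation idea (not run; the feature list and choice of regressor are arbitrary):

In [ ]:
## Hypothetical sketch: learn Spray_Dist from location and month on the 2011/2013
## rows that do have spray data, then predict it for the 2007/2009 rows.
# from sklearn.ensemble import RandomForestRegressor
# known = traps_spray[traps_spray['Spray_Dist'].notnull()]
# imputer = RandomForestRegressor(n_estimators=100)
# imputer.fit(known[['Latitude','Longitude','month']], known['Spray_Dist'])
# est_spray_dist = imputer.predict(traps.loc[traps.year.isin([2007,2009]),
#                                            ['Latitude','Longitude','month']])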

In [ ]:
traps_spray.drop(['Spray_Dist','Spray_Days_Ago'],axis=1,inplace=True)
In [23]:
## Save the transformed tables so we don't have to run the earlier cells
# traps.to_csv('./assets/Train_transformed/traps_jd.csv', encoding='utf-8', index=False)
In [2]:
##re-import the altered traps file if re-running and want to skip steps above
# traps = pd.read_csv('./assets/Train_transformed/traps_jd.csv')

Feature Selection

In [20]:
## Check out how the correlation map changed with our new features
find_correlation(traps, 'Traps')
In [24]:
X.columns
Out[24]:
Index([u'Latitude', u'Longitude', u'Species_CULEX ERRATICUS',
       u'Species_CULEX PIPIENS', u'Species_CULEX PIPIENS/RESTUANS',
       u'Species_CULEX RESTUANS', u'Species_CULEX SALINARIUS',
       u'Species_CULEX TARSALIS', u'Species_CULEX TERRITANS', u'year',
       u'month', u'day', u'Tmax', u'Tmin', u'PrecipTotal', u'Tmax_3',
       u'Tmax_20', u'DewPoint', u'Tmin_3', u'Tmin_20', u'Tavg', u'WetBulb',
       u'CodeSum', u'StnPressure', u'SeaLevel', u'ResultSpeed', u'ResultDir',
       u'AvgSpeed'],
      dtype='object')
In [21]:
# Get rid of features that we cannot use
features = traps.columns.drop(['WnvPresent','Date','Trap'])

X = traps[features]
y = traps.WnvPresent
In [36]:
## slightly adapted from: http://blog.datadive.net/selecting-good-features-part-iv-stability-selection-rfe-and-everything-side-by-side/

# from sklearn.feature_selection import SelectPercentile
from sklearn.linear_model import RandomizedLogisticRegression
from sklearn.feature_selection import RFECV, f_classif
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
import numpy as np

# helper that standardizes each method's raw feature scores so they're
# comparable across methods, and returns them as a dict keyed by feature name
def rank_to_dict(ranks, names, order=1):
    scaler = StandardScaler()
    ranks = scaler.fit_transform(order*np.array([ranks]).T).T[0]
    ranks = map(lambda x: round(x, 2), ranks)
    return dict(zip(names, ranks))

ranks = {}

lr = RandomizedLogisticRegression()
lr.fit(X, y)
ranks["Logistic"] = rank_to_dict(np.abs(lr.scores_), features)
    
rf = RandomForestClassifier()
rf.fit(X,y)
ranks["RF"] = rank_to_dict(rf.feature_importances_, features)

f, pval  = f_classif(X, y)
ranks["F_Classif."] = rank_to_dict(f, features)

r = {}
for name in features:
    r[name] = round(np.mean([ranks[method][name] for method in ranks.keys()]), 2)

methods = sorted(ranks.keys())

# average out the scores
ranks["Mean"] = r
methods.append("Mean")


# Now that we have our data, convert to dataframe for better viewing
feats = []
names = []

# feature names
for i in ranks['RF']:
    names.append(i)

feats.append(names)

# feature importances
for i in ranks:
    row = []
    for j in ranks[i]:
        row.append(ranks[i][j])
    feats.append(row)

feats_t = []

for i in range(0,len(features)):
    row = []
    for f in feats:
        row.append(f[i])
    feats_t.append(row)

feature_importances = pd.DataFrame(feats_t,columns = ['Feature','F_Classif','Logistic','RF','Mean'])
In [95]:
feature_importances.sort_values(['Mean'],ascending=False)

# feature_importances.sort_values(['Mean','F_Classif','Logistic','RF'],ascending=False)
# feature_importances.Feature
Out[95]:
Feature F_Classif Logistic RF Mean
18 Longitude 3.59 0.14 1.52 1.75
15 DewPoint -0.23 2.47 1.99 1.41
5 Tmin_20 -0.28 2.44 1.94 1.37
2 month -0.27 0.92 2.12 0.92
1 Species_CULEX PIPIENS -0.07 0.89 1.76 0.86
9 Species_CULEX RESTUANS -0.14 0.75 1.81 0.81
4 Latitude 3.60 -0.79 -0.63 0.73
10 Tmax_20 -0.26 1.25 0.06 0.35
16 Tmin_3 -0.30 0.90 0.24 0.28
20 ResultSpeed -0.27 0.54 0.09 0.12
6 Tmin -0.31 0.72 -0.30 0.04
7 WetBulb -0.25 0.41 -0.58 -0.14
25 AvgSpeed -0.32 0.23 -0.48 -0.19
0 Tavg -0.29 0.16 -0.63 -0.25
26 Tmax_3 -0.31 -0.07 -0.63 -0.34
17 Tmax -0.34 -0.35 -0.63 -0.44
22 Species_CULEX PIPIENS/RESTUANS -0.02 -0.91 -0.63 -0.52
3 year -0.32 -0.62 -0.63 -0.52
8 Species_CULEX TERRITANS -0.35 -0.69 -0.63 -0.56
27 SeaLevel -0.24 -0.86 -0.63 -0.58
13 CodeSum -0.32 -0.87 -0.63 -0.61
12 StnPressure -0.22 -0.97 -0.63 -0.61
19 Species_CULEX SALINARIUS -0.36 -0.87 -0.63 -0.62
11 PrecipTotal -0.32 -0.94 -0.63 -0.63
21 day -0.32 -0.98 -0.63 -0.64
14 ResultDir -0.33 -0.96 -0.63 -0.64
23 Species_CULEX TARSALIS -0.38 -0.97 -0.63 -0.66
24 Species_CULEX ERRATICUS -0.38 -0.98 -0.63 -0.66

We're cutting off the features at Tmin_3 and using everything ranked above it. Note that this ranking has been re-run several times, and Species_CULEX RESTUANS has somehow climbed the ranks on re-runs; that feature wasn't used in any of our models, so it will not be included.

In [98]:
features_new = list(feature_importances.sort_values(['Mean','F_Classif','Logistic','RF'],ascending=False)['Feature'][0:9])
features_new.remove('Species_CULEX RESTUANS')
In [103]:
X = traps[features_new]
y = traps.WnvPresent

Scale the Data

We will use StandardScaler to scale our data for the Random Forest and SVM models, keeping the binary Species_CULEX PIPIENS dummy unscaled. We'll also keep the raw, unscaled data for comparison.

In [104]:
from sklearn.preprocessing import MinMaxScaler, StandardScaler, RobustScaler, normalize

X_s = pd.DataFrame(StandardScaler().fit_transform(X[X.columns.drop('Species_CULEX PIPIENS')]),columns=X.columns.drop('Species_CULEX PIPIENS'))
X_s = pd.DataFrame(X['Species_CULEX PIPIENS']).merge(X_s,left_on=X.index.values, right_on=X_s.index.values)
X_s.drop('key_0',inplace=True,axis=1)
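
A simpler sketch of the same idea (column order aside), scaling every column except the Species dummy in place:

In [ ]:
# numeric_cols = X.columns.drop('Species_CULEX PIPIENS')
# X_s_alt = X.copy()
# X_s_alt[numeric_cols] = StandardScaler().fit_transform(X_s_alt[numeric_cols])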

Train-Test Split

Creating a stratified train-test split to evaluate our models. Stratified because, as noted above, zeros dominate WnvPresent at roughly 95%.

In [55]:
from sklearn.model_selection import train_test_split

Split the raw, unscaled data

In [105]:
X_train, X_test, y_train, y_test = train_test_split(X,y,train_size=.33, stratify=y)

Split the scaled data

In [106]:
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(X_s,y,train_size=.33, stratify=y)

Check that both splits have matching shapes

In [107]:
print 'Raw splits:'
print X_train.shape, X_test.shape, y_test.shape, y_train.shape
print
print 'Scaled splits:'
print X_train_s.shape, X_test_s.shape, y_test_s.shape, y_train_s.shape
Raw splits:
(2841, 8) (5769, 8) (5769,) (2841,)

Scaled splits:
(2841, 8) (5769, 8) (5769,) (2841,)
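
We can also confirm that the stratified splits kept the positive rate (about 5.2%) the same in train and test; a quick sketch:

In [ ]:
# print 'Train positive rate:', y_train.mean()
# print 'Test positive rate: ', y_test.mean()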

Building the Models

Create a function that we will use to evaluate our models. It returns the ROC AUC score, which is the metric Kaggle uses to score submissions, so it gives us an estimate of how each model will do on the leaderboard.

In [48]:
from sklearn.metrics import roc_auc_score

def score_model(model,X_test,y_test):
    preds = model.predict_proba(X_test)
    pred_list =[]
    
    for x in preds:
        pred_list.append(x[1])
        
    roc_score = roc_auc_score(y_test, pred_list)
    return roc_score
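
For reference, score_model(model, X_test, y_test) boils down to a one-liner:

In [ ]:
# roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])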

(1/6) Random Forest

In [49]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

RF = RandomForestClassifier(n_estimators = 1000, 
                            bootstrap=True, 
                            max_depth=5,
                            max_features='auto',
                            min_samples_leaf= 1, 
                            min_samples_split= 2)

Using raw data

In [108]:
RF_model= RF.fit(X_train, y_train)
score_model(RF_model,X_test,y_test)
# 0.86780509496634572
Out[108]:
0.82547715528947552

Using StandardScaled data

In [109]:
RF_model_s = RF.fit(X_train_s, y_train_s)
score_model(RF_model_s,X_test_s, y_test_s)
# 0.81161294220537694
Out[109]:
0.82209462587890736

(2/6) Support Vector Machine

In [111]:
from sklearn.svm import SVC

SVM with kernel = 'rbf'

In [112]:
svmc= SVC(probability=True)
svm_model = svmc.fit(X_train,y_train)
score_model(svm_model,X_test,y_test)
# 0.30215737665155207
Out[112]:
0.29652750110966347
In [113]:
svm_model_s = svmc.fit(X_train_s,y_train_s)
score_model(svm_model_s,X_test_s,y_test_s)
Out[113]:
0.7412381451451775

Scaled data performed significantly better.

Try the linear kernel, using bagging and a one-vs-rest (OvR) wrapper to speed it up

In [114]:
# credit: first answer on http://stackoverflow.com/questions/31681373/making-svm-run-faster-in-python

import time
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC


start = time.time()
clf1 = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='balanced'))
clf1.fit(X_train, y_train)
end = time.time()
print "Single SVC", end - start, clf1.score(X_test,y_test), score_model(clf1,X_test,y_test)

n_estimators = 10
start = time.time()
clf2 = OneVsRestClassifier(BaggingClassifier(SVC(kernel='linear', probability=True, class_weight='balanced'), max_samples=1.0 / n_estimators, n_estimators=n_estimators))
clf2.fit(X_train, y_train)
end = time.time()
print "Bagging SVC", end - start, clf2.score(X_test,y_test), score_model(clf2,X_test,y_test)


# Single SVC 4.18354606628 0.671693534408 0.754635462093
# Bagging SVC 0.904535055161 0.787831513261 0.746318070825
Single SVC 3.76783299446 0.655746229849 0.764287440524
Bagging SVC 0.574933052063 0.768070722829 0.752589015349

Try with the scaled data

In [115]:
# credit: first answer on http://stackoverflow.com/questions/31681373/making-svm-run-faster-in-python

import time
import numpy as np
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn import datasets
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC


start = time.time()
clf1s = OneVsRestClassifier(SVC(kernel='linear', probability=True, class_weight='balanced'))
clf1s.fit(X_train_s, y_train_s)
end = time.time()
print "Single SVC", end - start, clf1s.score(X_test_s,y_test_s), score_model(clf1s,X_test_s,y_test_s)

n_estimators = 10
start = time.time()
clf2s = OneVsRestClassifier(BaggingClassifier(SVC(kernel='linear', probability=True, class_weight='balanced'), max_samples=1.0 / n_estimators, n_estimators=n_estimators))
clf2s.fit(X_train_s, y_train_s)
end = time.time()
print "Bagging SVC", end - start, clf2s.score(X_test_s,y_test_s), score_model(clf2s,X_test_s,y_test_s)


# Single SVC 1.21880817413 0.680013867221 0.751182644026
# Bagging SVC 0.137820005417 0.718495406483 0.744525261444
Single SVC 1.31022405624 0.648812619171 0.772368542267
Bagging SVC 0.167356967926 0.73565609291 0.762660332911

(3/6) Logistic Regression

In [116]:
#LOGISTIC REGRESSION
from sklearn import metrics
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
def log(X,y,X_test,y_test):
    # flatten y into a 1-D array
    y_log = np.ravel(y)
    model = LogisticRegression()
    model = model.fit(X, y_log)
    predicted = model.predict(X_test)
    probs = model.predict_proba(X_test)
    scores = cross_val_score(LogisticRegression(), X, y, scoring='accuracy', cv=10)
    print 'Mean CV score: ' + str(scores.mean())
    print 'Accuracy score: ' + str(metrics.accuracy_score(y_test, predicted))
    print 'Roc-auc score: ' + str(metrics.roc_auc_score(y_test, probs[:, 1]))
    print ''
    print 'Confusion matrix: \n' + str(metrics.confusion_matrix(y_test, predicted))
    print 'Classification report: \n' + str(metrics.classification_report(y_test, predicted))
    # check the accuracy on the training set
    print 'Model score: ' + str(model.score(X_test, y_test))
    print 'Kaggle: ' + str(score_model(model,X_test,y_test))
    print ''


log(X_train,y_train,X_test,y_test)
log(X_train_s,y_train_s,X_test_s,y_test_s)


# Mean CV score: 0.946850753645
# Accuracy score: 0.946957878315
# Roc-auc score: 0.734609476227
    
# Mean CV score: 0.946850753645
# Accuracy score: 0.946957878315
# Roc-auc score: 0.761849471011
Mean CV score: 0.946850753645
Accuracy score: 0.946957878315
Roc-auc score: 0.747154356282

Confusion matrix: 
[[5463    0]
 [ 306    0]]
Classification report: 
             precision    recall  f1-score   support

          0       0.95      1.00      0.97      5463
          1       0.00      0.00      0.00       306

avg / total       0.90      0.95      0.92      5769

Model score: 0.946957878315
Kaggle: 0.747154356282

Mean CV score: 0.946850753645
Accuracy score: 0.946957878315
Roc-auc score: 0.768921108012

Confusion matrix: 
[[5463    0]
 [ 306    0]]
Classification report: 
             precision    recall  f1-score   support

          0       0.95      1.00      0.97      5463
          1       0.00      0.00      0.00       306

avg / total       0.90      0.95      0.92      5769

Model score: 0.946957878315
Kaggle: 0.768921108012

The logistic regression model using the scaled data looks okay, so we'll fit that model and submit it to Kaggle.

In [117]:
# flatten y into a 1-D array
y_log = np.ravel(y_train_s)
log_model = LogisticRegression()
log_model = log_model.fit(X_train_s, y_log)

(4/6) AdaBoost

In [118]:
from sklearn.ensemble import AdaBoostClassifier
In [119]:
adaboost = AdaBoostClassifier()
# note: this fits on the full X/y, so scoring against X_test (a subset of X) is optimistic
ada_model = adaboost.fit(X,y)
score_model(ada_model,X_test,y_test)
#0.88491183862121936
Out[119]:
0.85638831162460716

(5/6) Gradient Boost

In [120]:
from sklearn.ensemble import GradientBoostingClassifier

def GradBoostClass(X_train,y_train,X_test,y_test):
    clf = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,max_depth=1, random_state=0).fit(X_train, y_train)
    clf.score(X_test, y_test)   
    print score_model(clf,X_test,y_test)
    print ''

GradBoostClass(X_train,y_train,X_test,y_test)
GradBoostClass(X_train_s,y_train_s,X_test_s,y_test_s)
# 0.807420448196

# 0.799875933045
0.820344288792

0.821106397285

In [121]:
gradboost = GradientBoostingClassifier(n_estimators=100, learning_rate=1.0,max_depth=1, random_state=0).fit(X_train, y_train)
gradboost.score(X_test, y_test) 
Out[121]:
0.94661119778124458

(6/6) XGBoost

In [122]:
import xgboost as xgb
from xgboost.sklearn import XGBClassifier  
import scipy.stats as st

model = XGBClassifier()
xgm = model.fit(X_train, y_train, eval_metric=roc_auc_score)
score_model(xgm,X_test,y_test)
# 0.8723832880791802
Out[122]:
0.82574036387390393

6a) XGBoost Hyperparameter Search (RandomizedSearchCV)

In [73]:
params = {
    "n_estimators": [3,4,5,7,8,10],
    "max_depth": st.randint(3, 40),
    "learning_rate": st.uniform(0.05, 0.4),
    "objective": ['binary:logistic']
}

xgbclass = XGBClassifier()
In [56]:
from sklearn.model_selection import RandomizedSearchCV

# params contains scipy distributions, so a randomized search (not a grid search) is needed
gs = RandomizedSearchCV(xgbclass, params, n_jobs=-1, scoring='roc_auc')
gs.fit(X_train, y_train)

gs.best_estimator_
In [123]:
xg_grid = XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1)

xg_final = xg_grid.fit(X_train, y_train)
score_model(xg_final,X_test,y_test)
# 0.81860621483323937
Out[123]:
0.82574036387390393

Import and Transform Test Data

In [76]:
## Add features
test['Tmax_20'] = weather_add(test, weather_col='Tmax',func=np.mean, days_range=20)
test['DewPoint']= weather_add(test, weather_col ='DewPoint', func=np.mean, days_range = 10)
test['Tmin_3'] = weather_add(test, weather_col='Tmin', func=np.mean, days_range=3)
test['Tmin_20'] = weather_add(test, weather_col='Tmin', func=np.mean, days_range=20)

test['month'] = test['Date'].dt.month
In [78]:
## Export the transformed test data to a folder in the repo 
## so we don't have to run the above cells every time

# test.to_csv('./assets/Test_transformed/test_transformed_427.csv',sep=',', encoding='utf-8')
In [61]:
# test = pd.read_csv('./assets/Test_transformed/test_transformed_427.csv')
In [81]:
test_X = test[features_new]
In [82]:
# note: for simplicity the scaler is refit on the test data here; strictly speaking,
# we should reuse the scaler that was fit on the training data
test_s = pd.DataFrame(StandardScaler().fit_transform(test_X[test_X.columns.drop(['Species_CULEX PIPIENS'])]),columns=test_X.columns.drop(['Species_CULEX PIPIENS']))
test_s = pd.DataFrame(test_X[['Species_CULEX PIPIENS']]).merge(test_s,left_on=test_X.index.values, right_on=test_s.index.values)
test_s.drop('key_0',inplace=True,axis=1)

Exporting Predictions for Submission

In [83]:
### This function takes a model and a model name (as a string), generates predictions,
### and saves them as a CSV labeled with the model name and date.
import time
def model_and_export(model, model_name, test_X=test_X):
    pred_list = []
    predictions =  model.predict_proba(test_X)
    for x in predictions:
        pred_list.append(x[1])
    indexes=np.arange(1, len(predictions)+1, 1)
    preds_df = pd.DataFrame(data=[indexes, pred_list]).T
    preds_df.columns =['Id','WnvPresent']
    preds_df['Id'] = preds_df.Id.astype(int)
    location = './submissions/{}_{}.csv'.format(model_name, time.strftime("%d_%m_%Y"))
    preds_df.to_csv(location, index=False)
    return
    
In [84]:
model_and_export(RF_model, 'RF_JD*')
# Your submission scored 0.72874
In [85]:
model_and_export(log_model, 'LOG_JD*',test_s)
# Your submission scored 0.72305
In [86]:
model_and_export(clf2s, 'SVM_S*', test_s)
# Your submission scored 0.69906.
In [87]:
model_and_export(ada_model, 'ADABoost_JD*')
# Your submission scored 0.74416.
In [88]:
model_and_export(gradboost, 'GRADBOOST_*')
# Your submission scored 0.64589
In [89]:
model_and_export(xgm, 'XG_JD*')
# Your submission scored 0.75110
In [90]:
model_and_export(xg_final, 'XG_FINAL*')
# Your submission scored 0.71567

Ensembles

In [ ]:
def ensemble_and_export(model1,model2,test_X,file_name):
    pred_m1 = []
    predictions_m1 =  model1.predict_proba(test_X)
    pred_m2 = []
    predictions_m2 =  model2.predict_proba(test_X)

    for x in predictions_m1:
        pred_m1.append(x[1])

    for x in predictions_m2:
        pred_m2.append(x[1])

    indexes=np.arange(1, len(predictions_m1)+1, 1)

    preds_m1 = pd.DataFrame(data=[indexes, pred_m1]).T
    preds_m1.columns =['Id','WnvPresent']
    preds_m1['Id'] = preds_m1.Id.astype(int)

    preds_m2 = pd.DataFrame(data=[indexes, pred_m2]).T
    preds_m2.columns =['Id','WnvPresent']
    preds_m2['Id'] = preds_m2.Id.astype(int)
    
    ensemble = preds_m1.merge(preds_m2,left_on='Id', right_on='Id')
    
    ensemble['avg'] = (ensemble['WnvPresent_x']+ ensemble['WnvPresent_y']) / 2
    ensemble.rename(columns={'avg':'WnvPresent'},inplace=True)

    ensemble = ensemble[['Id','WnvPresent']]

    location = './submissions/{}_{}.csv'.format(file_name, time.strftime("%d_%m_%Y"))
    ensemble.to_csv(location, index=False)
In [91]:
ensemble_and_export(xgm,ada_model,test_X,'Ensemble_XG_ADA*')
# Your submission scored 0.70831.
In [92]:
ensemble_and_export(xgm,RF_model_s,test_X,'Ensemble_XG_RF*')
# Your submission scored 0.70979.