The data is for 2001 SAT scores by state for Math and Verbal sections, along with the percentage of high school students who took the SAT.
import pandas as pd
sats = pd.read_csv('../../DC-DSI4/projects/01-project/data/sat_scores.csv')
sats.head()
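print('AVERAGE SCORES:')
print(sats[['Verbal','Math']].mean())
print('')
print('MEDIAN SCORES:')
print(sats[['Verbal','Math']].median())
print('')
## 'State' is left out of min/max here: on a text column they return the
## alphabetically first/last name, not the state that holds the score
print('MIN SCORES:')
print(sats[['Verbal','Math']].min())
print('')
print('MAX SCORES:')
print(sats[['Verbal','Math']].max())
print('')
print('AVERAGE PARTICIPATION:')
print(sats[['Rate']].mean())
print('')
print('MEDIAN PARTICIPATION:')
print(sats[['Rate']].median())
print('')
## idxmin/idxmax locate the row, so the state shown is the one with that rate
print('MIN PARTICIPATION:')
print(sats.loc[sats['Rate'].idxmin(), ['State','Rate']])
print('')
print('MAX PARTICIPATION:')
print(sats.loc[sats['Rate'].idxmax(), ['State','Rate']])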
print('RATE (min,max) = ' + str((min(sats.Rate),max(sats.Rate))))
print('VERBAL (min,max) = ' + str((min(sats.Verbal),max(sats.Verbal))))
print('MATH (min,max) = ' + str((min(sats.Math),max(sats.Math))))
import matplotlib.pyplot as plt
import matplotlib.figure as fig
import numpy as np
plt.hist(sats.Rate) ## plot a histogram for the rate values
plt.axvline(np.median(sats.Rate), color="b") ## blue line represents the median value
plt.axvline(np.mean(sats.Rate), color="r") ## red line represents the mean value
plt.show()
plt.hist(sats.Math) ## plot a histogram for the Math scores
plt.axvline(np.median(sats.Math), color="b") ## blue line represents the median value
plt.axvline(np.mean(sats.Math), color="r") ## red line represents the mean value
plt.show()
plt.hist(sats.Verbal) ## plot a histogram for the Verbal scores
plt.axvline(np.median(sats.Verbal), color="b") ## blue line represents the median value
plt.axvline(np.mean(sats.Verbal), color="r") ## red line represents the mean value
plt.show()
plt.figure(1) ## first figure
## rate vs verbal
plt.scatter(sats.Rate, sats.Verbal) ## one call plots every state
plt.suptitle('Rate, Verbal Relationship', fontsize=20)
plt.xlabel('Rate', fontsize=16)
plt.ylabel('Verbal', fontsize=16)
plt.figure(2) ## second figure (the argument is a figure number, not a subplot code)
## rate vs math
plt.scatter(sats.Rate, sats.Math)
plt.suptitle('Rate, Math Relationship', fontsize=20)
plt.xlabel('Rate', fontsize=16)
plt.ylabel('Math', fontsize=16)
plt.figure(3) ## third figure
## verbal vs math
plt.scatter(sats.Verbal, sats.Math)
plt.suptitle('Verbal, Math Relationship', fontsize=20)
plt.xlabel('Verbal', fontsize=16)
plt.ylabel('Math', fontsize=16)
plt.show()
Rate is negatively correlated with both Verbal and Math: lower participation rates are associated with higher scores, and higher participation rates are associated with lower scores. This greatly affects how we can interpret the data. I broke down scores by US region (below the heat maps) and found that Midwest scores were higher than Northeast scores, which was the opposite of my biased assumption. On closer inspection, however, the Midwest has a much lower participation rate, which suggests that the students who take the exam there are not a diverse cross-section of the student population. This is a very biased sample. Now that I see the great difference in participation rates, my focus is no longer on the scores but on why the rates are so much lower outside of the East Coast.
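The direction of these relationships can also be checked numerically rather than read off the scatter plots. The sketch below is a minimal illustration using pandas' `DataFrame.corr`; the `toy` values are hypothetical placeholders that only mimic the pattern described above and are not rows from `sat_scores.csv`. Running the same `.corr()` call on `sats[['Rate','Verbal','Math']]` should give the real coefficients.

```python
import pandas as pd

# Hypothetical toy values, NOT taken from sat_scores.csv. They only mimic
# the pattern described above: higher Rate goes with lower scores.
toy = pd.DataFrame({
    'Rate':   [5, 10, 30, 60, 80],
    'Verbal': [590, 570, 540, 510, 500],
    'Math':   [600, 575, 545, 515, 505],
})

# Pairwise Pearson correlation between every pair of numeric columns
corr = toy.corr()
print(corr.round(2))

# Pull out the two coefficients discussed in the text
rate_verbal = corr.loc['Rate', 'Verbal']
rate_math = corr.loc['Rate', 'Math']
```

A coefficient near -1 indicates a strong negative linear relationship, but it says nothing about why participation and scores move in opposite directions.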
import pandas as pd
f = '../data/sat_scores.csv'
sat_pandas = pd.read_csv(f, header=0, na_filter=False)
%matplotlib inline
sat_pandas.plot(kind='box')
%%HTML
<div class='tableauPlaceholder' id='viz1489527750572' style='position: relative'><noscript><a href='#'><img alt='2001 Average Math SAT Scores by State ' src='https://public.tableau.com/static/images/Av/AverageMathSATScoresbyState/Sheet2/1_rss.png' style='border: none' /></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='path' value='views/AverageMathSATScoresbyState/Sheet2?:embed=y&:display_count=y' /> <param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/Av/AverageMathSATScoresbyState/Sheet2/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div> <script type='text/javascript'> var divElement = document.getElementById('viz1489527750572'); var vizElement = divElement.getElementsByTagName('object')[0]; vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px'; var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script>
%%HTML
<div class='tableauPlaceholder' id='viz1489528066797' style='position: relative'><noscript><a href='#'><img alt='2001 SAT Participation Rate by State ' src='https://public.tableau.com/static/images/20/2001SATParticipationRatebyState_0/Sheet3/1_rss.png' style='border: none' /></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='path' value='views/2001SATParticipationRatebyState_0/Sheet3?:embed=y&:display_count=y' /> <param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/20/2001SATParticipationRatebyState_0/Sheet3/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div> <script type='text/javascript'> var divElement = document.getElementById('viz1489528066797'); var vizElement = divElement.getElementsByTagName('object')[0]; vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px'; var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script>
%%HTML
<div class='tableauPlaceholder' id='viz1489528167633' style='position: relative'><noscript><a href='#'><img alt='2001 Average Verbal SAT Scores by State ' src='https://public.tableau.com/static/images/20/2001AverageVerbalSATScoresbyState/Sheet4/1_rss.png' style='border: none' /></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='site_root' value='' /><param name='name' value='2001AverageVerbalSATScoresbyState/Sheet4' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/20/2001AverageVerbalSATScoresbyState/Sheet4/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div> <script type='text/javascript'> var divElement = document.getElementById('viz1489528167633'); var vizElement = divElement.getElementsByTagName('object')[0]; vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px'; var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script>
## to confirm my suspicions about performance by region,
## I found and utilized a list of all states, their region, and subregion
## source: http://researchertools.blogspot.com/2012/09/excel-file-with-us-states-abbreviations.html
import csv
f = open('../data/US_STATES_REGIONS_SUBREGIONS.csv', 'rU')
reader = csv.reader(f)
states_data = [row for row in reader]
f.close()
states_data_header = states_data[0]
states_regions = states_data[1:]
regions = set([row[2] for row in states_regions])
subregions = set([row[3] for row in states_regions])
regions_dictionary = {'West' : [row[1] for row in states_regions if row[2]=='West'],
'Northeast' : [row[1] for row in states_regions if row[2]=='Northeast'],
'Midwest' : [row[1] for row in states_regions if row[2]=='Midwest'],
'South' : [row[1] for row in states_regions if row[2]=='South']}
## rebuild the row lists [State, Rate, Verbal, Math] from the DataFrame loaded
## above so that this cell does not depend on an earlier sat_scores variable
sat_scores = sats[['State','Rate','Verbal','Math']].values.tolist()
rate_by_region = []
for region in regions:
    ## keep only the rows whose state abbreviation belongs to this region
    region_rate = [row[1] for row in sat_scores if row[0] in regions_dictionary[region]]
    region_verbal = [row[2] for row in sat_scores if row[0] in regions_dictionary[region]]
    region_math = [row[3] for row in sat_scores if row[0] in regions_dictionary[region]]
    region_dict = {}
    region_dict['region']=region
    region_dict['region_mean_rate'] = round(np.mean(region_rate),2)
    region_dict['region_median_rate'] = round(np.median(region_rate),2)
    region_dict['region_mean_verbal'] = round(np.mean(region_verbal),2)
    region_dict['region_median_verbal'] = round(np.median(region_verbal),2)
    region_dict['region_mean_math'] = round(np.mean(region_math),2)
    region_dict['region_median_math'] = round(np.median(region_math),2)
    rate_by_region.append(region_dict)
## print the mean and median of each score type for each region
for region in rate_by_region:
    print(region['region'])
    print ('RATE: mean = ' + str(region['region_mean_rate']) +
           '; median = ' + str(region['region_median_rate']) )
    print ('VERBAL: mean = ' + str(region['region_mean_verbal']) +
           '; median = ' + str(region['region_median_verbal']) )
    print ('MATH: mean = ' + str(region['region_mean_math']) +
           '; median = ' + str(region['region_median_math']) )
    print("")
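The per-region means and medians computed above can also be produced with a pandas merge followed by a `groupby`. This is only a sketch of that alternative, not the method used in this notebook: the two toy frames and the column names `State` and `Region` in the lookup table are assumptions, and none of the values come from `sat_scores.csv` or the regions file.

```python
import pandas as pd

# Hypothetical toy rows (NOT values from sat_scores.csv), using the same
# column names as the SAT file.
toy_sats = pd.DataFrame({
    'State':  ['AA', 'BB', 'CC', 'DD'],
    'Rate':   [80, 70, 10, 6],
    'Verbal': [500, 510, 580, 590],
    'Math':   [505, 515, 585, 600],
})

# Hypothetical state-to-region lookup; the column names are assumed.
toy_regions = pd.DataFrame({
    'State':  ['AA', 'BB', 'CC', 'DD'],
    'Region': ['Northeast', 'Northeast', 'Midwest', 'Midwest'],
})

# Attach a region to every state, then aggregate each column per region
merged = toy_sats.merge(toy_regions, on='State', how='inner')
summary = (merged.groupby('Region')[['Rate', 'Verbal', 'Math']]
                 .agg(['mean', 'median'])
                 .round(2))
print(summary)
```

The result has one row per region and a (column, statistic) pair for each column, so `summary.loc['Midwest', ('Rate', 'mean')]` reads a single cell.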
## Now I'm curious to see how Midwest ACT participation rates stack up.
## As long as participation rates for the Midwest are significantly
## higher than their SAT participation rates, I think it's fair to say
## not that fewer Midwestern students are taking college entrance exams,
## but fewer are taking the SAT. Let's actually look at some ACT data.
# data source = https://forms.act.org/newsroom/data/2001/states.html
# * Totals for graduating seniors were obtained from Projections of High School Graduates by State and Race/Ethnicity 1996-2012, Copyright © by Western Interstate Commission for Higher Education, February, 1998.
# ** Core Course = at least four years of English and three years each of mathematics (algebra and above), social sciences, and natural sciences
# I'm going to multiply the percentage of graduates who also completed the Core Course requirement as a comparison to the SAT sample to get their percentage of the whole
# Important to note that the SAT data isn't just graduates ...
import csv
f = open('../data/act_scores.csv', 'rU')
reader = csv.reader(f)
act_data = [row for row in reader]
f.close()
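## act_data[0] is the header row; every remaining row is data, so nothing
## more should be sliced off after the float conversion
header = act_data[0]
act_scores = [[row[0],float(row[1]),float(row[2]),float(row[3])] for row in act_data[1:]]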
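The adjustment described in the comments above is a product of two percentages. The sketch below works through one made-up example, assuming the two ACT columns are the percentage of graduates tested and the percentage of those test takers who completed the Core Course; the function name `core_share_of_graduates` and the input values are hypothetical.

```python
def core_share_of_graduates(pct_tested, pct_core_of_tested):
    """Return Core Course test takers as a percentage of ALL graduates.

    Both arguments are percentages on a 0-100 scale (assumed meaning:
    share of graduates tested, and share of those tested who completed
    the Core Course).
    """
    return pct_tested * pct_core_of_tested / 100.0

# Hypothetical state: 70% of graduates tested, 60% of them Core Course
example = core_share_of_graduates(70.0, 60.0)
print(example)  # 70 * 60 / 100 = 42.0
```

Dividing by 100 keeps the result on the same 0-100 scale as the SAT `Rate` column, which is what makes the two rates roughly comparable.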
%%HTML
<div class='tableauPlaceholder' id='viz1490006784570' style='position: relative'><noscript><a href='#'><img alt='2001 ACT Participation Rates for HS Graduates ' src='https://public.tableau.com/static/images/20/2001ACTPartipationRates/Sheet1/1_rss.png' style='border: none' /></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='site_root' value='' /><param name='name' value='2001ACTPartipationRates/Sheet1' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/20/2001ACTPartipationRates/Sheet1/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div> <script type='text/javascript'> var divElement = document.getElementById('viz1490006784570'); var vizElement = divElement.getElementsByTagName('object')[0]; vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px'; var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script>
%%HTML
<div class='tableauPlaceholder' id='viz1490006820143' style='position: relative'><noscript><a href='#'><img alt='2001 Average ACT Composite Scores for HS Graduates ' src='https://public.tableau.com/static/images/20/2001AverageACTCompositeScores/Sheet2/1_rss.png' style='border: none' /></a></noscript><object class='tableauViz' style='display:none;'><param name='host_url' value='https%3A%2F%2Fpublic.tableau.com%2F' /> <param name='site_root' value='' /><param name='name' value='2001AverageACTCompositeScores/Sheet2' /><param name='tabs' value='no' /><param name='toolbar' value='yes' /><param name='static_image' value='https://public.tableau.com/static/images/20/2001AverageACTCompositeScores/Sheet2/1.png' /> <param name='animate_transition' value='yes' /><param name='display_static_image' value='yes' /><param name='display_spinner' value='yes' /><param name='display_overlay' value='yes' /><param name='display_count' value='yes' /></object></div> <script type='text/javascript'> var divElement = document.getElementById('viz1490006820143'); var vizElement = divElement.getElementsByTagName('object')[0]; vizElement.style.width='100%';vizElement.style.height=(divElement.offsetWidth*0.75)+'px'; var scriptElement = document.createElement('script'); scriptElement.src = 'https://public.tableau.com/javascripts/api/viz_v1.js'; vizElement.parentNode.insertBefore(scriptElement, vizElement); </script>
## similarly to the regions_dictionary for the SAT (using state abbreviations)
## use the full state name to classify by region and match to the ACT dataset
regions_dictionary_full = {'West' : [row[0] for row in states_regions if row[2]=='West'],
'Northeast' : [row[0] for row in states_regions if row[2]=='Northeast'],
'Midwest' : [row[0] for row in states_regions if row[2]=='Midwest'],
'South' : [row[0] for row in states_regions if row[2]=='South']}
## region rate is where I multiply the % of the total that are core course
## in order to get the percentage of core course out of the whole
act_rate_by_region = []
for region in regions:
    region_rate = [row[1]*row[2]/100 for row in act_scores if row[0] in regions_dictionary_full[region]]
    region_composite = [row[3] for row in act_scores if row[0] in regions_dictionary_full[region]]
    region_dict = {}
    region_dict['region']=region
    region_dict['region_mean_rate'] = round(np.mean(region_rate),2)
    region_dict['region_median_rate'] = round(np.median(region_rate),2)
    region_dict['region_mean_composite'] = round(np.mean(region_composite),2)
    region_dict['region_median_composite'] = round(np.median(region_composite),2)
    act_rate_by_region.append(region_dict)
for region in act_rate_by_region:
    print(region['region'])
    print ('RATE: mean = ' + str(region['region_mean_rate']) +
           '; median = ' + str(region['region_median_rate']) )
    print ('COMPOSITE: mean = ' + str(region['region_mean_composite']) +
           '; median = ' + str(region['region_median_composite']) )
    print("")
While it's hard to draw firm conclusions given that the SAT and ACT datasets do not sample identical populations, it is clear that states with low SAT participation rates have much higher ACT participation rates. The College Board needs to increase its presence in these high-ACT-participation states. I recommend that it collect more data on which schools these students apply to with their ACT scores and explore whether it can promote the importance of SAT performance over ACT performance. Perhaps it could persuade these schools to weight SAT scores more heavily, which would increase the need for students to take the SAT.
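One way to make the state-level comparison explicit is to line up the two participation rates per state and look at the gap. The sketch below uses plain dictionaries in the same spirit as the list-based cells above; the state labels, the rates, and the `LOW_SAT_CUTOFF` threshold are all hypothetical and are not drawn from either dataset.

```python
# Hypothetical toy rates (percent), keyed by state; NOT real 2001 values.
sat_rate = {'State A': 80.0, 'State B': 65.0, 'State C': 12.0, 'State D': 5.0}
act_rate = {'State A': 8.0, 'State B': 20.0, 'State C': 68.0, 'State D': 75.0}

# Assumed threshold for calling a state "low SAT participation"
LOW_SAT_CUTOFF = 20.0

# Only compare states that appear in both datasets
shared = sorted(set(sat_rate) & set(act_rate))
low_sat = [s for s in shared if sat_rate[s] < LOW_SAT_CUTOFF]

# Gap between ACT and SAT participation for the low-SAT states
act_minus_sat = {s: act_rate[s] - sat_rate[s] for s in low_sat}

for state in low_sat:
    print(state + ': ACT - SAT = ' + str(act_minus_sat[state]))
```

With the real data, the ACT rate would need the same Core Course adjustment used earlier, and the caveat about the two samples not being identical still applies.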