In the first week of the General Assembly Data Science Immersive course, we are assigned a project with an aim to analyse data for SAT results. The data describes three pieces of SAT related information for all states in the U.S and the District of Columbia in 2001. The data set consists of the average mathematics score, average verbal score and participation rate respectively. This blog post mainly describes how the descriptive statistics of the dataset is like and visualise the relationships between variables.
The analysis is carried out using Python and many of its modules such as Numpy, Scipy, Matplotlib and Seaborn. After loading in the data, a brief summary was quickly drawn up. There are 4 columns in the data set:
– Participation Rate
– Average Verbal Score
– Average Math Score
once the data types of all but “State” column are numerical, the descriptive statistics can be calculated quickly.
From the figures above we can say that there seems to be no obvious issue. There are no missing values or values that looks very out of place.
Having understood the range, type and shape of the data, it is possible to proceed and find out more about the distribution of individual columns. The typical assumption for most data is that they would follow a normal distribution. This is because according to the Central Limit Theorem, the distribution of sameple means of large samples (more than 30) would closely follow a normal distribution. It is safe to assume that there were more than 30 students who took SAT in each province in 2001. Therefore it is reasonable to believe the distribution for the average scores to be normal. In order to investigate the real distribution, visulisation methods are used. The histogram for both math score and verbal score are plotted with markings indicating the mean, median and mode respectively. The resulting figures are displayed below:
In a perfect normal distribution, the metrics of central tendency should be identical. That is mean = median = mode. From the histgrams above, it is shownn that this is not the case for either of the SAT scores.
It can be said that the distribution for maths score is positively skewed. This is verified by the evidence of peak at the lower end and the long tail at the higher end in the plot. Statistics also suggest that mean > median > mode which is a typical characteristic in positively skewed distributions.
There is a hint suggesting that the verbal scores is positively skewed as well. Since the mean is greater than the median. However there is an interesting observation showing that apart from the global peak at the 490-500 mark, there is also a local peak between the 560 – 580 mark. This causes the distribution to look slightly bimodal. This means most students are either very good at verbal skills or they are not quite up to average. (Although whether the ‘average’ here is meaningful is up for debate.)
After plotting the histograms, we turned to explore the relationship between variables. Therefore scatter plots are used. There are quite a few interesting observations that worth mentioning. The most intuitive figure in the above plots is Fig. C where it displays the average verbal score against the average math score. As one might expect, a higher verbal score very oftenly correlates with a higher math score. A line of best fit is created using numpy’s polyfit operation. It is created by minimizing the squared error. After plotting the line it is obvious that there is an outlier with a verbal score of around 530 and a math score of around 440. We can find out who the outlier is by finding the state with the lowest math score.
It turns out the outlier is OH. According Wikipeida, the abbreviation “OH’ represents Ohio. An investigation was carried out as an attempt to find out the reason for this unexpectedly low score. After a brief search a SAT report written by the The College Board was discovered. Inside page 1 of the report, under the “SAT Program Test Takers” table, along the “Students with SAT I Scores” row, it can be seen that the average verbal score is 534 which matches the data here. However the average maths score is recorded as 539 instead of 439. It means that there is a data entry error in either our data set or the report. However given that our data point is an obvious outlier, it is very likely that our data is the wrong one.
Fig A and Fig B both demonstrate interesting relationships between SAT scores and states participation rate. The gradeints in both figures are negative which indicates that high participation rate results in lower average socres. Intuitively this is expected. It is reasonable to assume students academic ability are distributed normally across populations. Most student are average while small proportions are either very smart/hardworking or have difficulties catching up. SAT is not compulsory but students are advised to take it if they would like to be admitted to a ‘well-known’ or ‘higher ranked’ university. Therefore at placess where participation rate is low, it is very likely that those participants are the ones who are alreay doing well in school and would like to enter a higher ranked university. In other states, taking the SAT is either compulsory or highly encouragedsome average. Which means many average students would take the test and drive up participation rate. It is reasonable to assume that the average students would not score as high as the top students hence lowering the average.
Finally a heat map is produced to see the participation rate for different states in the U.S. It can be seen that the participation rates are relatively high near the coasts (both east and west) while some inland states has really low rates.