Chapter 8 Part 2: More Experimentation

After you start simple, everything suddenly becomes really complicated.

8.1 On the way to the Final Model

After some of our previous trials, we realized that not all factors play an essential role in the college ranking algorithm. To check our understanding, we built a quick linear regression model. From its summary and ANOVA table, we identified the following as the important factors: size of the undergraduate student body, average SAT score, admission rate, location, and of course the intercept.

## 
## Call:
## lm(formula = as.numeric(Y2018) ~ UGDS_1617 + LOCALE_collapse + 
##     SAT_AVG_1617 + ADM_RATE_1617 + COSTT4_A_1617 + UGDS_WHITE_1617, 
##     data = fullUniversity)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -91.881  -6.717  -1.486   9.020  40.251 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            2.014e+02  4.395e+01   4.584 1.26e-05 ***
## UGDS_1617             -6.681e-04  2.186e-04  -3.057  0.00283 ** 
## LOCALE_collapseSuburb -3.084e+00  4.007e+00  -0.770  0.44318    
## LOCALE_collapseTown    6.109e+00  8.877e+00   0.688  0.49285    
## SAT_AVG_1617          -1.260e-01  3.035e-02  -4.150 6.74e-05 ***
## ADM_RATE_1617          8.617e+01  1.633e+01   5.276 7.04e-07 ***
## COSTT4_A_1617         -1.126e-04  1.540e-04  -0.731  0.46639    
## UGDS_WHITE_1617        2.020e+00  1.107e+01   0.183  0.85552    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16.61 on 106 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.814,  Adjusted R-squared:  0.8017 
## F-statistic: 66.27 on 7 and 106 DF,  p-value: < 2.2e-16
## Analysis of Variance Table
## 
## Response: as.numeric(Y2018)
##                  Df Sum Sq Mean Sq  F value    Pr(>F)    
## UGDS_1617         1   8699    8699  31.5151 1.600e-07 ***
## LOCALE_collapse   2   3730    1865   6.7556  0.001732 ** 
## SAT_AVG_1617      1 104898  104898 380.0200 < 2.2e-16 ***
## ADM_RATE_1617     1  10545   10545  38.2011 1.208e-08 ***
## COSTT4_A_1617     1    161     161   0.5816  0.447366    
## UGDS_WHITE_1617   1      9       9   0.0333  0.855516    
## Residuals       106  29259     276                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

OMG, what happened to the cost variable? Is it because we put COSTT4_A_1617 in the second-to-last position?

In the spirit of scientific experiment (which Kavya and Zuofu happen to have), we changed the order of the terms in our linear model, moving cost ahead of SAT score and admission rate, and produced another ANOVA table:

## Analysis of Variance Table
## 
## Response: as.numeric(Y2018)
##                  Df Sum Sq Mean Sq  F value    Pr(>F)    
## UGDS_1617         1   8699    8699  31.5151 1.600e-07 ***
## LOCALE_collapse   2   3730    1865   6.7556  0.001732 ** 
## COSTT4_A_1617     1  39446   39446 142.9020 < 2.2e-16 ***
## SAT_AVG_1617      1  66224   66224 239.9151 < 2.2e-16 ***
## ADM_RATE_1617     1   9933    9933  35.9857 2.802e-08 ***
## UGDS_WHITE_1617   1      9       9   0.0333  0.855516    
## Residuals       106  29259     276                       
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

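Certainly, cost remains one of the most important factors for students in the process of college selection. We also acknowledge the significant role that scholarships and financial aid play. However, the order-switch experiment shows that cost adds almost nothing once the other variables are in the model. R's anova() reports sequential sums of squares, so each term is credited only with the variability left unexplained by the terms listed before it. Entered before SAT score and admission rate, cost accounts for a sum of squares of 39446; entered after them, it accounts for only 161. In other words, nearly all of the variability that cost explains is also explained by SAT score and admission rate, which agrees with the non-significant cost coefficient (p = 0.466) in the model summary.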
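To see why the position of a term matters so much, here is a minimal sketch on simulated data (not our university data). The variables `sat`, `cost`, and `ranking`, and every number used to generate them, are made up for illustration; the only assumption carried over from our analysis is that cost is strongly correlated with SAT score.

```r
# Simulated data only: names and effect sizes are illustrative assumptions.
set.seed(42)
n <- 120
sat <- rnorm(n, mean = 1200, sd = 120)
# cost is built to track SAT closely, mimicking two correlated predictors
cost <- 20000 + 25 * (sat - 1200) + rnorm(n, sd = 1500)
# the response depends on SAT only; cost has no direct effect
ranking <- 300 - 0.2 * sat + rnorm(n, sd = 10)
sim <- data.frame(ranking, sat, cost)

# Same two predictors, two different orders
cost_last  <- anova(lm(ranking ~ sat + cost, data = sim))
cost_first <- anova(lm(ranking ~ cost + sat, data = sim))

# Sequential sum of squares credited to cost in each ordering
ss_cost_last  <- cost_last["cost", "Sum Sq"]
ss_cost_first <- cost_first["cost", "Sum Sq"]
print(c(cost_last = ss_cost_last, cost_first = ss_cost_first))

# The residual sum of squares does not depend on the order of the terms
print(c(cost_last["Residuals", "Sum Sq"], cost_first["Residuals", "Sum Sq"]))
```

The fitted model is the same in both cases, so the residual row is identical; only the way the explained variability is divided between the two terms changes. When cost comes first it takes credit for everything it shares with SAT, and when it comes last it gets only what is left over.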

8.2 What are we thinking?

“Zuofu was on his way to create the best Bayesian model in the world when his dream got crushed.” –Zuofu

From the above, we made the final decision to incorporate the following variables into our Bayesian model of rankings: the size of the undergraduate student body, average SAT score, admission rate, and location.
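One way to double-check a decision like this is a nested-model F-test. In our first ANOVA table, cost and share of white students are the last two terms, so their combined contribution is 161 + 9 = 170 on 2 degrees of freedom, and with a residual mean square of about 276 the F statistic is roughly (170 / 2) / 276 ≈ 0.31, well below any conventional critical value. The sketch below shows the same kind of comparison with `anova(reduced, full)` on simulated data; the variable names and generating values are assumptions for illustration, and location is left out to keep the example short.

```r
# Simulated data only: names and coefficients are illustrative assumptions.
set.seed(7)
n <- 120
size <- rnorm(n, mean = 15000, sd = 5000)
sat  <- rnorm(n, mean = 1200, sd = 120)
adm  <- runif(n, min = 0.1, max = 0.9)
# cost tracks SAT but has no direct effect on the response
cost <- 20000 + 25 * (sat - 1200) + rnorm(n, sd = 1500)
ranking <- 250 - 0.0007 * size - 0.12 * sat + 80 * adm + rnorm(n, sd = 15)
sim <- data.frame(ranking, size, sat, adm, cost)

reduced <- lm(ranking ~ size + sat + adm, data = sim)
full    <- lm(ranking ~ size + sat + adm + cost, data = sim)

# F-test of whether adding cost reduces the residual sum of squares
# by more than chance alone would explain
cmp <- anova(reduced, full)
print(cmp)
```

A large p-value in the second row suggests the extra term can be dropped without losing much explanatory power. Unlike the sequential table, this test does not depend on the order of the terms within either model.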

To see how we do with universities and liberal arts colleges, check out our next chapter: Becoming a proud Bayesian.