Chapter 8 Part 2: More Experimentation
After you start simple, everything suddenly becomes really complicated.
8.1 On the way to the Final Model
After our earlier trials, we realized that not every factor plays an essential role in the college ranking algorithm. To check our understanding, we built a quick linear regression model. From its summary and ANOVA table, the factors that stood out were the size of the undergraduate student body, average SAT score, admission rate, location, and of course the intercept.
##
## Call:
## lm(formula = as.numeric(Y2018) ~ UGDS_1617 + LOCALE_collapse +
##     SAT_AVG_1617 + ADM_RATE_1617 + COSTT4_A_1617 + UGDS_WHITE_1617,
##     data = fullUniversity)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -91.881  -6.717  -1.486   9.020  40.251
##
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)
## (Intercept)            2.014e+02  4.395e+01   4.584 1.26e-05 ***
## UGDS_1617             -6.681e-04  2.186e-04  -3.057  0.00283 **
## LOCALE_collapseSuburb -3.084e+00  4.007e+00  -0.770  0.44318
## LOCALE_collapseTown    6.109e+00  8.877e+00   0.688  0.49285
## SAT_AVG_1617          -1.260e-01  3.035e-02  -4.150 6.74e-05 ***
## ADM_RATE_1617          8.617e+01  1.633e+01   5.276 7.04e-07 ***
## COSTT4_A_1617         -1.126e-04  1.540e-04  -0.731  0.46639
## UGDS_WHITE_1617        2.020e+00  1.107e+01   0.183  0.85552
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 16.61 on 106 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.814, Adjusted R-squared: 0.8017
## F-statistic: 66.27 on 7 and 106 DF, p-value: < 2.2e-16
## Analysis of Variance Table
##
## Response: as.numeric(Y2018)
##                  Df Sum Sq Mean Sq  F value    Pr(>F)
## UGDS_1617         1   8699    8699  31.5151 1.600e-07 ***
## LOCALE_collapse   2   3730    1865   6.7556  0.001732 **
## SAT_AVG_1617      1 104898  104898 380.0200 < 2.2e-16 ***
## ADM_RATE_1617     1  10545   10545  38.2011 1.208e-08 ***
## COSTT4_A_1617     1    161     161   0.5816  0.447366
## UGDS_WHITE_1617   1      9       9   0.0333  0.855516
## Residuals       106  29259     276
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
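OMG, what happened to the cost variable? Is it because we put COSTT4_A_1617 in the second-to-last position?
Order can matter here because R's anova() reports sequential sums of squares: each term is credited only with the variability not already explained by the terms listed before it. So, in the spirit of scientific experiment (which Kavya and Zuofu happen to have), we reordered the terms of our linear model, moving cost ahead of SAT score and admission rate, and produced another ANOVA table: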
## Analysis of Variance Table
##
## Response: as.numeric(Y2018)
##                  Df Sum Sq Mean Sq  F value    Pr(>F)
## UGDS_1617         1   8699    8699  31.5151 1.600e-07 ***
## LOCALE_collapse   2   3730    1865   6.7556  0.001732 **
## COSTT4_A_1617     1  39446   39446 142.9020 < 2.2e-16 ***
## SAT_AVG_1617      1  66224   66224 239.9151 < 2.2e-16 ***
## ADM_RATE_1617     1   9933    9933  35.9857 2.802e-08 ***
## UGDS_WHITE_1617   1      9       9   0.0333  0.855516
## Residuals       106  29259     276
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
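Certainly, cost remains one of the most important factors for students in the process of college selection, and we acknowledge the significant role that scholarships and financial aid play. However, the order-switch experiment shows that cost largely overlaps with the other variables. Entered before SAT score and admission rate, cost accounts for a sum of squares of 39446; entered after them, it adds only 161. In other words, SAT score and admission rate already explain nearly all of the variability that cost does, which is also why the cost coefficient in the full model is not significant (p = 0.466).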
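The order effect is easy to reproduce on toy data. The sketch below is only an illustration: the variables `sat`, `cost`, and `ranking` are simulated stand-ins we made up for this example, not the college data. It fits the same two-predictor model in both orders and compares the sequential sum of squares credited to `cost`.

```r
# Simulated stand-in data (NOT the college data): two correlated predictors.
set.seed(1)
n       <- 200
sat     <- rnorm(n)                        # hypothetical "SAT-like" predictor
cost    <- 0.8 * sat + rnorm(n, sd = 0.6)  # hypothetical "cost-like" predictor, correlated with sat
ranking <- -2 * sat + rnorm(n)             # response driven by sat only

fit_sat_first  <- lm(ranking ~ sat + cost)
fit_cost_first <- lm(ranking ~ cost + sat)

# anova() on a single lm reports sequential (Type I) sums of squares,
# so the row for `cost` depends on where it sits in the formula.
aov_sat_first  <- anova(fit_sat_first)
aov_cost_first <- anova(fit_cost_first)
print(aov_sat_first)
print(aov_cost_first)

ss_cost_last  <- aov_sat_first["cost", "Sum Sq"]   # cost entered after sat
ss_cost_first <- aov_cost_first["cost", "Sum Sq"]  # cost entered before sat

# Together the two predictors explain the same total in either order.
total_sat_first  <- sum(aov_sat_first[c("sat", "cost"), "Sum Sq"])
total_cost_first <- sum(aov_cost_first[c("cost", "sat"), "Sum Sq"])

# drop1() tests each term given all the others, so its result
# does not depend on the order of terms in the formula.
print(drop1(fit_sat_first, test = "F"))
```

The same pattern shows up in our two tables: the sums of squares for SAT score, admission rate, and cost add up to roughly the same total in both orderings (104898 + 10545 + 161 versus 39446 + 66224 + 9933), and only the split between them changes.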
8.2 What are we thinking?
“Zuofu was on his way to create the best Bayesian model in the world when his dream got crushed.” –Zuofu
From the above, we made the final decision to incorporate the following variables into our Bayesian model of rankings: size of the undergraduate student body, average SAT score, admission rate, and location.
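One way to sanity-check a decision like this is a partial F-test that compares the reduced model against the full one. The sketch below is a minimal illustration under stated assumptions: `fullUniversity_sim` is a fabricated data frame that only borrows the column names from the lm() call shown earlier, the baseline locale level "City" is our guess, and none of the simulated values come from the real data.

```r
# Fabricated stand-in data: column names match the lm() call above,
# but every value is simulated and is NOT the real college data.
set.seed(42)
n <- 120
fullUniversity_sim <- data.frame(
  UGDS_1617       = round(runif(n, 1000, 40000)),
  # "City" as the baseline level is an assumption
  LOCALE_collapse = factor(sample(c("City", "Suburb", "Town"), n, replace = TRUE)),
  SAT_AVG_1617    = round(rnorm(n, mean = 1250, sd = 120)),
  ADM_RATE_1617   = runif(n, 0.05, 0.95),
  UGDS_WHITE_1617 = runif(n, 0.2, 0.8)
)
# Simulated cost is tied to SAT so the two predictors overlap.
fullUniversity_sim$COSTT4_A_1617 <-
  20000 + 30 * (fullUniversity_sim$SAT_AVG_1617 - 1000) + rnorm(n, sd = 5000)
# Simulated ranking depends on size, SAT, and admission rate only.
fullUniversity_sim$Y2018 <-
  200 - 0.12 * fullUniversity_sim$SAT_AVG_1617 +
  80 * fullUniversity_sim$ADM_RATE_1617 -
  0.0006 * fullUniversity_sim$UGDS_1617 + rnorm(n, sd = 15)

full <- lm(as.numeric(Y2018) ~ UGDS_1617 + LOCALE_collapse + SAT_AVG_1617 +
             ADM_RATE_1617 + COSTT4_A_1617 + UGDS_WHITE_1617,
           data = fullUniversity_sim)

# Reduced model: drop the two terms left out of the final variable list.
reduced <- update(full, . ~ . - COSTT4_A_1617 - UGDS_WHITE_1617)

# Partial F-test: do the two dropped terms jointly improve the fit?
cmp <- anova(reduced, full)
print(cmp)
p_value <- cmp[2, "Pr(>F)"]
```

A large `p_value` would suggest that the two dropped terms add little once the others are in the model. Unlike the sequential table, this comparison does not depend on the order of terms in the formula.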
To see how we do with universities and liberal arts colleges, check out our next chapter: Becoming a proud Bayesian.