Prompted by some reader questions, we have continued our in-depth analysis of the errors we earlier identified in the USMA-supplied data for our 2020 post West Point Is Not Selective. Here are additional findings to supplement the notes in About the Data. The purpose of these findings is to inform the reader about the source material of our posts.
In other words, if you’re looking for something interesting about admissions itself, wait for the next one. But if you want to see some of the issues we encountered with the data files, please continue.
Recall that we obtained the class data files from USMA via FOIA [files posted here at link]. Because they came from USMA, we took them as correct and treated them as primary sources in our initial posts. As a result, we completed our initial analysis without modifying the underlying data.
One assumption in this approach is that any errors are evenly distributed through the data and do not affect the overall or directional outcome of our findings.
For example, if we found in the data that more women attended than men, and removing the error led us to find that more men attended than women, this would definitely affect our conclusions. But if we removed an error and found that the margin of women attending was somewhat less, or the ratio was largely unchanged, we could consider the finding still valid.
However, we noted that totals for applicants, attending cadets, and graduates (from counts of the G and S status codes) did not match the class profiles that West Point publishes. Here are some of the factors affecting this finding.
Some of the issues include:
- Incorrect coding. Example: most candidate records have both a race and a sex code, but some have only one or the other. These are concentrated in the classes of 2000 and 2002, with relatively few in later classes.
- Attendees with no applicant data: no test scores, sports, class size, etc. Example: n_00571. Some of these are prior service; others give no indication of why no additional records are available, perhaps being foreign exchange cadets.
- Field problems. The field “yrcpt_11”, presumably meant to give the number of years a candidate was a high school team captain, was always empty. The “applicant_admitted_flg” was always “N”, for “no”, even for candidates who clearly attended. The ay5_cdtco and ay6_cdtco fields were always empty, leaving no way to identify turnbacks or late grads. The National Merit Scholar flag only returns “Y” in candidate files starting in 2010. And there may be more errors, or conditions on certain values, that we’re not aware of.
- Incorrect numbers. The total numbers of candidate records, cadets, graduated cadets, etc., come close to what USMA published for a given class year, but not exactly. This may be due to cadets being turned back (increasing or decreasing class year totals, depending), December grads, and so on. Regardless, this makes precision difficult, and it persisted even after removing duplicates, which we get to next.
- Duplicate candidate records. We found a number of records that appear to be duplicates in all respects other than the class year and candidate ID. This is the largest identifiable confounding factor in the data.
As mentioned, we found duplicate records. These duplicates had different “cdt_id” field values but the same values for everything else: test scores, class sizes, high school sports (occasionally with an additional sport or year played), West Point GPAs, West Point academic-year companies, and graduated/separated statuses. Further, a record could be duplicated multiple times, across multiple years, and sometimes the years were non-consecutive.
This made them difficult to detect at the beginning because we assumed candidate records were unique from year to year.
To identify these records, we created a unique key for each record from other data fields, concatenating race, sex, test scores, high school class size, our “attended” flag, and USMA CQPA. We then sequentially counted each key across the full 20-year data set. Keys with counts higher than 1 were flagged as duplicates, while a count of 1 (the first appearance of the record) marked the unique record. There is a chance we mix up the class years and flag as duplicates the records that should be “the” unique ones, but having to make an assumption, we chose to keep the first appearance.
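The key-building step above can be sketched in pandas. The column names and rows here are hypothetical stand-ins for the FOIA fields, not the actual file layout:

```python
import pandas as pd

# Toy records: three rows share a profile across non-consecutive years.
df = pd.DataFrame({
    "class_year":    [2010, 2013, 2014, 2012],
    "cdt_id":        ["n_001", "n_002", "n_003", "n_004"],
    "race":          ["W", "W", "W", "B"],
    "sex":           ["M", "M", "M", "F"],
    "sat_math":      [650, 650, 650, 700],
    "hs_class_size": [300, 300, 300, 150],
    "attended":      [1, 1, 1, 1],
    "cqpa":          [3.125, 3.125, 3.125, 2.900],
})

# Concatenate the identifying fields into a composite key, then number
# each key's appearances in class-year order.
key_cols = ["race", "sex", "sat_math", "hs_class_size", "attended", "cqpa"]
df = df.sort_values("class_year")
df["cdk"] = df[key_cols].astype(str).agg("|".join, axis=1)
df["cdk_count"] = df.groupby("cdk").cumcount() + 1

# Count 1 = first appearance, kept as "the" unique record;
# counts above 1 are flagged as duplicates.
uniques = df[df["cdk_count"] == 1]
dupes = df[df["cdk_count"] > 1]
print(len(uniques), len(dupes))  # → 2 2
```

Keeping the first appearance is the arbitrary-but-consistent choice described above; nothing in the files tells us which year is the real one.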
In this example we see a profile duplicated across 6 IDs in three different, non-consecutive class years (2010, 2013, 2014). If you go through the rest of the columns in the class files, you’ll find the exact same CQPA, CAPS, CMPS (all to three decimal places), academic-year cadet companies, major, and resignation reason. While there is an SAT Writing score omission in 2010, it’s extremely unlikely that this is actually six different people, and much more likely that it’s one record duplicated through a query or other error. Because all the test scores, GPA, AY companies, and separation reason are the same, we cannot identify the “real” year for this cadet.
Magnitude of issue
Total duplicates numbered 2,004, with duplicates by class year shown below. The median number of duplicates per class is 88, the average 95, the max 430 (class of 2017, what happened?), and the min 11. We also see the duplicates separating at about a 54% rate, much higher than the general cadet population, which is usually around 20%.
These duplicates inflated the class totals. Including them (that is, keeping the cdk_count.1 field as “all”), class sizes looked like this:
Clearly something’s off. The average “attended” count per class is 1,608. But according to West Point, the class of 2017 started 1,190 cadets, not 1,616. A 426-cadet gap is not something bridged even by a lot of turnbacks (we hope).
When we remove duplicates across all class years, we arrive at this set of numbers:
The average class size drops to 1,512. Keeping our eye on 2017, it now looks much better, with 1,186 started and 965 graduated. But uh oh! The class of 2019 admitted only 1,262 cadets, not 1,514. So a discrepancy remains, and the files do not offer a way to filter out additional profiles to match the West Point-published numbers (at least so far as we have found; we welcome readers to share any additional findings!).
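The recount works by keeping only the first appearance of each composite key before summing per class year. A minimal sketch, with made-up rows and a simplified key column:

```python
import pandas as pd

# Toy data: two genuine 2017 cadets plus one duplicate of profile "a".
records = pd.DataFrame({
    "class_year": [2017, 2017, 2017, 2019],
    "cdk":        ["a", "b", "a", "c"],  # composite key; "a" repeats
    "attended":   [1, 1, 1, 1],
})

# Class totals with duplicates included...
with_dupes = records.groupby("class_year")["attended"].sum()

# ...versus keeping only each key's first appearance.
dedup = records.drop_duplicates(subset="cdk", keep="first")
without_dupes = dedup.groupby("class_year")["attended"].sum()
print(with_dupes[2017], without_dupes[2017])  # → 3 2
```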
In previous analysis we chose to deal with the duplicates by assuming that they are relatively few in number and distributed randomly, that is, that they follow the same distribution as the rest of the cadet profiles.
To confirm or deny this assumption, we check some basic demographic comparisons below to see how closely the profiles match and whether any large variances exist.
First, we compare the race/sex distribution, with the “0” and “1” columns representing those who did not attend West Point and those who did, respectively. “M” and “F” represent male and female, with the other letters representing demographic categories. (Reminder: we assume (A)sian, (B)lack, (H)ispanic, (N)ative American, (O)ther, (W)hite, as the USMA files did not include a key.)
The table on the left is the unique values, and the table on the right is the duplicate values.
To check the duplicate influence, we compare both of the “0” columns (the “not attended” group) and then the “1” columns (records showing attendance at West Point).
The “not-attended” groups show the largest differences, and the “attended” groups show some as well. The percentages may be slightly off, but in absolute terms the distributions are quite close.
We also consider test scores / cognitive aptitude distributions, spot-checking the SAT Math score averages of each demographic slice:
We see that the variances here are generally minimal, with average differences being within or around the 10-pt SAT score range, though some groups with smaller absolute numbers (for example NF category) see higher variance. We therefore assume that this pattern holds through the rest of the data.
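The spot check amounts to a groupby on each demographic slice, split by duplicate status. A sketch with invented scores, since we are only illustrating the comparison, not reproducing the actual tables:

```python
import pandas as pd

# Hypothetical rows: race/sex slice, duplicate flag, SAT Math score.
df = pd.DataFrame({
    "race_sex": ["WM", "WM", "BF", "BF", "WM", "BF"],
    "is_dup":   [False, False, False, False, True, True],
    "sat_math": [660, 640, 600, 620, 655, 605],
})

# Mean SAT Math per slice, uniques (False) vs. duplicates (True),
# and the absolute gap between the two groups.
means = df.groupby(["race_sex", "is_dup"])["sat_math"].mean().unstack()
means["diff"] = (means[True] - means[False]).abs()
print(means)
```

A small `diff` column across slices is what lets us treat the duplicates as following the same score distribution as the unique records.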
What about separations?
We find that the duplicates’ separations are about 1/8, or 12-13%, of the total number of separations (duplicates + uniques), and that their separation rates are always higher than those of the unique records.
The unique-plus-duplicates separation rate shows as 24%, while unique records alone show 22%. All groups’ rates move in the same direction (down) by 2-4 percentage points once duplicates are removed, but the differences between groups remain roughly constant.
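The rate arithmetic is simple; here it is with assumed counts chosen only to mirror the rough percentages above (the real totals are in the class files):

```python
# Assumed, illustrative counts: ~22% separation among uniques,
# ~54% among the 2,004 duplicate records.
unique_total, unique_sep = 28000, 6160
dup_total, dup_sep = 2004, 1082

# Combined rate (duplicates included) vs. unique-only rate.
combined_rate = (unique_sep + dup_sep) / (unique_total + dup_total)
unique_rate = unique_sep / unique_total
print(round(combined_rate, 3), round(unique_rate, 3))  # → 0.241 0.22
```

Because the duplicates separate at a much higher rate, leaving them in pushes every group's separation rate up by a couple of percentage points, which is exactly the 24%-vs-22% gap described above.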
We show the adjusted separation rates in the “unique records only” table above. We conclude that they are close enough to not impact directional findings in the analysis or conclusions.
That said, specific numbers related to separations may be impacted, especially because remaining discrepancies in the data prevent reconciling perfectly to USMA-published figures.
We have identified and investigated a number of discrepancies in the data supplied by USMA. While the main error, duplicate records, impacts absolute numbers and shifts rates by a few percentage points, we conclude it does not change the directional findings of the analyses.
Additionally, we know that there are more errors in the data that we cannot identify, but we assume they are similarly evenly distributed among the records. We will work to account for this and update numbers as we can.
As always, factual corrections, reader insights, and thoughtful criticism are welcome.