160k+ high school students will only graduate if a statistical model allows them to

Jun 17, 2020

tl;dr : Due to curriculum disruptions the International Baccalaureate (IB) is going to use a statistical model to assign grades for > 160,000 high school students. These grades have a very significant impact on students’ lives (for better or for worse). This is an inappropriate use of ‘big data’ and a terrible idea for a plethora of different reasons.

The IB’s model has glaring methodological issues and completely disregards the ethical considerations which should accompany its adoption. The model will inevitably discriminate against groups of students based on gender, race and socioeconomic status. It will unfairly and disproportionately impact vulnerable students. Though this is a mathematical certainty, it is difficult to identify how this will happen a-priori: a model can choose to assign female students systematically lower grades in STEM subjects and/or incorrectly fail Black students at higher rates than Asian students.

I believe that the IB should seriously re-evaluate the manner in which it has chosen to maneuver this very delicate situation. Its plan highlights a troubling trend of blind faith in ‘big data’ driven by a lack of consideration for any potential ramifications. In this article, I explain why the IB’s decision is horrible, using data from schools in New York to illustrate how a model can discriminate even when it is not given gender, race and socioeconomic data.

What is the IB? Should I grab my pitchfork yet?

The International Baccalaureate (IB) is an educational foundation that awards high school diplomas to students from around the world. It had 166,000 candidates across 144 countries in the previous graduating class.

The IB has a set of mandatory ‘leaving examinations’ at the end of high school. The marks from these exams are used to allot each student a final grade. This final grade is very important for students: it enables them to graduate, apply to universities and accept admissions offers from universities. It is also the most important measurement used in college admissions processes in Europe and Asia (~90% of the weight). A student’s final grade can significantly alter their future life outcomes.

The pandemic has led to considerable turmoil in the plans and operations of the IB. Due to curriculum disruptions, the IB has been forced to cancel final exams for its current student cohort. Instead, it is opting to assign final grades in a truly unprecedented manner:

staff are working with an education organisation that specializes in data analysis, standards, assessment and certification. Together we have developed a method that uses data, both historical and from the present session, to arrive at the subject grades for each student.

the IB has … been in dialogue with ministries of education, education regulators and other similar bodies across the world to ensure that they too are confident with our approach and that students will be appropriately recognised where required.

these assessment arrangements represent the fairest approach we can take for all our students

Per the IB, the final grades for each student will be assigned by a statistical model as a function of two or more metrics:

Coursework Grades: Grades for projects and assignments that the students submitted prior to the disruptions.

Predicted Grades (Forecast Grades): The grade that a teacher believes each student was likely to have obtained if the exams were held as planned. This is a teacher’s evaluation of their student’s preparedness.

Miscellaneous other data: The IB says that a model will use miscellaneous other data, wherever it is available.

The three step process, illustrated below, will be used to prescribe final grades to each student. I will refer to this entire process as the ‘model’.

The three step process to prescribe final grades

There are lots of names for extrapolation based on historical patterns: statistical modelling, machine learning, data analytics, big data, AI. All of these terms refer to the same narrow set of processes which use historical data to predict the outcome of a future event. Unfortunately, as three prominent researchers expressed in their textbook, ‘the fact that [this process] is “evidence-based” by no means ensures that it will lead to accurate, reliable, or fair decisions’. Over the course of this article, I will build on the previous statement and convince you that it is a horrible idea to assign final grades using a statistical model.

But definitely read the rest of the article before you grab your pitchfork.

Glaring issues in the IB’s methodology

Let’s start with the obvious issues in the process described by the IB. There are at least seven conspicuous issues (and some smaller yet equally important ones):

I. Double Jeopardy. If a student did badly on their coursework they will be penalized twice over: once by the model & once by the IB grading rubric which calculates the final grade. This is because the model will predict a final mark based on the coursework and predicted grades. Then the final mark will be combined with coursework grades to obtain a final grade.

II. Historical Bias: A study based on data from the National Center for Education Statistics concluded that secondary school teachers tend to express lower predictions for their ‘expectations from students of color and students from disadvantaged backgrounds’. This is problematic because predicted grades play a prominent role in the model.

III. Different schools, Different errors: Small schools (15% - 30% of all IB schools) will have bigger and more frequent errors in their model predictions when compared with large schools. This is an example of representation bias.

Comparative Error (Explained Below) - Based on 10,000 Simulations

Assume that the largest IB school has a class size of 300 (based on IB data from 2019). If you are a student in a school with a class size of 5, your final grade will have a ~25% larger average error than the students in the school with 300 students. The graph above depicts how this comparative error decays with an increasing class size. The principle is simple: the more data you have for a school, the more accurate your predictions are.

IV. Measurement Bias: If the measurement process varies across different schools, it will affect the way a model treats students from different schools. Schools which cater to socioeconomically disadvantaged communities are likely to have less frequent evaluation of students. This will lead to poorer students receiving predicted grades which are less accurate than richer peers in schools with more frequent testing. Additionally, poorer schools are likely to have larger class sizes. A teacher who has to assign predicted grades for 10 students will do a better job than a teacher who has to assign predicted grades to 30 students.

V. Additional data where data is available: The IB has said that they will supplement the coursework grades and predicted grades with additional data ‘where it is available’. This can be problematic as it may induce biases in predictions and will cause predictions for some schools to be more accurate than predictions for others.

VI. Skewed Distributions: Schools with a non-normal distribution of grades will have bad predictions. If a school has a left-skewed distribution (overachievers!) of grades or right-skewed distribution of grades, a model will perform worse for its students.

VII. Distribution Shifts: If the subject teacher in a school changed between last year’s cohort and this year’s cohort, the historical relationship between their predicted grades and final grades will not match the current relationship. This may lead to systematically worse predictions.

These are the low hanging fruit. There are many other nuanced problems which may arise depending on the sort of model that the IB decides to use. I personally feel that this smattering of amuse-bouche arguments should be enough to terminate this experiment. Read on for concepts which are only marginally more complex but much more unsettling.

A student’s future shouldn’t be at the mercy of random noise in a statistical model

There is a ubiquitous saying in the field of statistics that ‘all models are wrong and some models are useful.’ A model is an approximation of reality based on empirical historical patterns: all predictions are rough estimates and no model can forecast the future with complete certainty. Furthermore, every model will have some uncertainty in its predictions due to random noise. If the IB uses a model to assign final grades for students, the model will inevitably make mistakes.

Let’s assume that the IB builds a model which is ‘90% accurate’. This is an almost unrealistically ambitious target and is incredibly difficult to achieve in practice. This would mean that the IB will predict incorrect final grades for at least 1 out of every 10 students. Put another way: this is equal to issuing incorrect grades for all IB students in China, Germany, India, Singapore and the United Kingdom combined.

The IB may be comfortable with this 10% inaccuracy because they have assured students that they will ‘match the grade distribution from last year’. Will this cancel out the inaccuracies in the model predictions? Absolutely Not. This is a cosmetic improvement which has the additional benefit of providing the IB with some plausible deniability. While the IB can make the current distribution look like the historical distribution, they cannot possibly guarantee that students will be in the same neighborhood on the current distribution as they would have been if the exams proceeded per usual.

Consider this: I can build a model which assigns every student some marks at random from an arbitrary normal distribution. Then, I can use these marks and adjust my grade bins to match the final grade distribution from last year. Does this compensate for the fact that my ‘model’ assigns marks in a manner completely untethered from reality? No it does not. This previous example was pathological, in reality, a bad model will fail silently and disproportionately harm certain groups of students.

Yes, models are very useful tools which help us make decisions at scale. They will inevitably play a large part in our future. This does mean that institutions should be able to circumvent the ethical considerations of where a model is appropriate. Is it ethical to deprive a student of their hard-earned spot at the London School of Economics because a black-box decision making mechanism said that they were not worthy of the opportunity? Is it ethical to tell a 17 year old kid that they were unable to graduate with their peers because their inaccurate prediction was ‘the cost of doing business’? I believe there are key ethical considerations which the IB has possibly overlooked while making its operational choices.

Why do models fall short?

Lotta Puns

Before we jump into the last argument, we need to understand how models which predict correctly may still be fully incorrect. Models are very powerful yet incredibly stupid: the exact quality which makes them desirable also renders them completely ineffective. They have an uncanny ability to detect microscopic patterns in humongous quantities of data. The job of the model is to pick up any pattern which will allow it to predict an outcome efficiently. A researcher cannot control what patterns a model will choose to detect. Therefore, models will take any edge to make themselves more predictive - even if it means capitalizing on false relationships to predict outcomes.

Take these this graph below as an example:

Data from tylervigen.com

I trained a model which has an almost perfect fit to the data above. Given the blue line, it can predict the value of the red line very well (and vice-versa). Now, let me re-contextualize this data.

Data from tylervigen.com

The red quantity was actually the money spent on pets in the USA and the blue quantity was the number of lawyers in California. We know that there is no possible way that the two quantities above are related - this is just a coincidence. Yet, any model will learn to predict one as a function of the other (just as mine did). Do you think that a sudden downturn in the money spent on pets in 2010 would have led to a commensurate downturn in the number of lawyers in California? No. But if I trained a model on this data, it would be susceptible to this incorrect assumption. Even when I try to correct this inconsistency by giving my model an appropriate quantity which it should be able to use to predict the number of Lawyers, it chooses to use the incorrect signal (money spent on pets) to predict the outcome instead.

As a researcher, it is not possible to stop the model from learning these incorrect relationships. This is an important point: just because a model is predictive does not mean that it is correct. An accurate model may be a bad model and spurious correlations can be very problematic if not appropriately detected.

A model will discriminate against students based on Gender, Race, Socioeconomic status etc.

tl;dr: Models will learn the gender, race, and socioeconomic status of a student even if this information is withheld from them. It is a statistical inevitability that the IB model will discriminate against certain groups of students, yielding unfair outcomes.

You may think that a model which isn’t aware of gender/race/socioeconomic status cannot possibly discriminate based on these attributes. This line of thinking is called ‘fairness through unawareness’. Let’s see what the experts say about this:

some have hoped that removing or ignoring sensitive attributes … somehow ensure[s] impartiality … unfortunately, this practice is usually somewhere on the spectrum between ineffective and harmful.

Even if you don’t include a sensitive attribute in a model, the model will learn it. Fairness through unawareness reminds me of the Chappelle Show educated guess line skit. Dave first guesses the race of the caller, then uses that to make a guess about the circumstances of the caller. Dave is ‘a model’ which is making malicious assumptions about his clients.

Now, let’s see this in action. I will build a model to predict whether the graduation rate for a New York High School is above or below the national average, as a function of the high school’s county location and its score on 3 different tests. This is incredibly similar to what the IB is doing (they are using analogous metrics).

My model can predict whether the graduation rate is above/below the national average with almost 80% accuracy. Remember, this model does not have any data about the race, socioeconomic status or gender of the student body in any high school.

Let’s take a closer look at how the model learns to predict the graduation rate. I’m going to gloss over some technical details here, but I simply ask the same model as above: are a majority of the students in the High School Black/Hispanic? The big idea is that if our model can detect the majority Black/Hispanic high schools with high accuracy, it is probably learning to identify Black/Hispanic high schools and then using this fact to predict the graduation rate.

Yikes. Our model can predict the majority race of the high school with more accuracy than it can predict the graduation rate. This means that our model is definitely race-aware. We did not expect it to learn anything about the race of the students, yet, the model decided that it needed to know this information in order to predict graduation rates. We did not provide the model with any data on the race of the students, the model just went ahead and learned it.

If we look for the same pattern with socioeconomic status: we see that the model recognizes which schools have a majority of economically disadvantaged students. It can detect these schools with ~75% accuracy. This means that our model is economically-aware as well.

To make sure that this is not a fluke, we check if the model can detect schools which have a majority female population:

The model is less than 50% accurate. This is slightly worse than guessing at random. The model is clearly not taking the majority gender of the school into account while making its decision.

In fact, if we build an alternate model but give it the majority race of the high school (in addition to the same data-points as the original model), we would expect the graduation rate accuracy of the model to go up substantially as we are providing it with additional data. However, we see that the accuracy of the alternate model only increases by ~1%. This is another indicator that our model is somehow race-aware already.

To reiterate: when we built a model to predict high school graduation rates based on test scores and school location, we did not give the model any information about race, socioeconomic status, or gender. Our model simply realized that if it can identify the racial/economic makeup of the school, it can probably identify graduation rates. Therefore, even if the IB does not give the model sensitive data, a model will deduce it.

What happens when a model learns about the gender/socioeconomic status/race of a student and then incorrectly uses these things to predict their final marks? Much like the ‘amount of money spent on pets’ does not actually tell us anything about the ‘number of lawyers in California’, knowing the race/socioeconomic status/gender of a student does not tell us anything about their final marks - even though this data may be useful for prediction. Since we do not have a ‘principled way to tell at which point such a relationship is worrisome and in what cases it is acceptable’, using a model will lead to unfair predictions for groups of students.

A formative result in machine learning states that it is statistically impossible for the IB to ensure that any model will be completely fair. There are three primary criteria which guarantee fair predictions, however it is impossible for any model to satisfy all three of them simultaneously. This means that any model used by the IB will inevitably discriminate against students in two of three ways:

I. It may penalize students based on some sensitive attribute and assign systematically lower grades to certain groups. For example: The model may choose to assign grades based on the gender (or race/socioeconomic status) of the student. It may assign female mathematics students systematically lower grades than male mathematics students.

II. It may have a higher rate of incorrect predictions for certain students based on some sensitive attribute. For example: The model may incorrectly fail students (devastating but unavoidable!) at different rates based on the race (or gender/socioeconomic status) of the student. It may incorrectly fail Black students at a higher rate than Asian students.

III. It may have a lower rate of precision for certain groups based on some sensitive attribute. For Example: A model may have varying precision in identifying students who should fail based on their socioeconomic status (or gender/race). It may be more precise in failing rich students (removed with a scalpel) and less precise in failing poor students (removed with a butter knife).

As mentioned above, two of these three scenarios are unavoidable per sensitive attribute. These disparities will express themselves in any model chosen by the IB. There is an unfortunate trade-off to be made over here: the researchers will have to decide which of the two criteria they will sacrifice.

You must ask yourself: is it ethical to use a model which is systematically pessimistic about the performance of some students but unfairly optimistic about the performance of others? Is it fair to use a model if it is capitalizing on sensitive attributes to assign grades?

Well, what do we do now?

I’ve spent time discussing this problem with people who are a lot smarter than I am. We weren’t able to arrive at a good answer: just a long list of wrong answers. The fact that this is an outsourced black-box model with limited historical data, no oversight into the decision making mechanism and only 3 months for research and production further complicates the situation. Data analysis and machine learning are incredibly powerful tools, but they need to be used in appropriate situations and with a great degree of care. When you can significantly alter life outcomes for vulnerable parts of the population, you need to adopt a higher degree of nuance in your decision making. This situations elicits a process solution rather than a smart ‘modelling’ solution.

Unfortunately, I do not know enough about the domain of education to suggest a good alternative. I am, however, confident enough in my mathematical ability to identify that the current solution is unequivocally incorrect. It is better to have a delayed and suboptimal solution than a discriminatory and error-laden one. I am conscious of the fact that this may cause the article to appear slightly partisan but my goal is to simply raise awareness about a potentially massive issue in a very sensitive situation.

As a student/parent/teacher: the first step is to not succumb to outcome bias. It is easy to say ‘let me see the results of the model and then I will decide if I like it.’ If you feel that you were unfairly treated after the results are legitimized, there is very little you will be able to do about it. Share this article with your administrators and with others in the IB community. Demand a better, fairer and more transparent process (support@ibo.org). Raise your concerns with the University which you wish to enroll in and ask them about fairer alternate admissions processes (info@officeforstudents.org.uk, etc.).

On a closing note, I would like to call attention to the speed with which governments and universities declared that they were ‘on board’ with this bad idea. It is concerning to see how this was treated as a stereotypical business decision with no regard for the potential ramifications and a total lack of consideration from an interdisciplinary perspective. Conversations on fairness in machine learning clearly failed to reach the many many stakeholders involved in this operation. Perhaps there is some further scope for outreach within this corner of academia?

Thank you for reading. I always appreciate any constructive criticism or comments!

Acknowledgements: A big thank you to Ilian, Rohan, Mishti, Vasilis and Mom+Dad for reviewing drafts of my first article ever!

Previous May 23, 2020
« tl;dr mit 6.s085