math in autumn: PISA: snapshots with a fogged up camera

When I was an undergrad I had a friend who lived by the motto “It’s easier to be infamous than it is to be famous.” He was a notorious “sh!+ disturber”. He caused havoc all over the campus, and he enjoyed it.

A certain Edmonton Journal columnist seems to have inherited my friend’s mantle. He enjoys tweaking the noses of the education community. I suspect his “sh!+ disturbing” also gives him much enjoyment. His weapon of mass disturbation is PISA.

Every three years the OECD tests a large sample of 15-year-olds in Mathematics, Science, and Reading. This is PISA, the Programme for International Student Assessment. The OECD describes the programme as providing a snapshot of the state of education in the participating countries. Their objective is clearly stated: PISA is intended to guide the development of education policies around the world.

Following each round of PISA exams, the OECD releases a welter of results, presenting them as league tables, that is, as lists with the highest-scoring countries at the top and the lowest-scoring countries at the bottom. Newspaper columnists love the league tables.

Although I like to see how our education systems stack up internationally, I am not a supporter of PISA. It raises a lot of questions that I find unsettling. Here are two.

How reliable are the PISA results?

Imagine that you are a teacher, and that you have decided to use a two-hour final exam to assess your students. You concoct a list of questions covering all aspects of the course, but you realize that it would take a student up to four hours to complete all of these questions.

Here’s how you get around the problem: You have a lot of additional information about each student (assignments, quizzes, midterms, classroom observations). So you set a final exam by using a subset of the questions and you fill in the missing data by using all that extra information.

Now imagine that you are the PISA examiner. To fully evaluate the students you also need a four-hour exam. As before, each exam paper is only permitted to be two hours in length.

But now there is a real problem: you do not have access to all that wonderful additional information. If you set a two-hour exam, there will be an awful lot of missing data, and you will have no way to compensate for it. However, in order to get reliable results, you absolutely need to use all the questions that you have concocted.

Here’s how PISA gets around the problem. The examiners are really not interested in individual results. They only want the aggregate picture. So they distribute the questions among different exam papers. The exams are not the same: the questions on Mary’s paper will be different from the ones on Peter’s paper. In aggregate, the class answers all the test questions, but for each student there is a large chunk of missing data. PISA generates the missing data by using a psychometric model called the Rasch model.

The situation may be described like this. Let's suppose we are testing mathematics. Everybody named Mary gets a question about area, but nobody named Peter does. Nevertheless, everyone named Peter still gets a grade for the area question, and the grade is determined by how the Marys answered the question.
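For readers who like to see the idea in code, here is a minimal sketch of the basic Rasch formula. It says that the chance of a correct answer depends only on the gap between a student's ability and a question's difficulty. The ability and difficulty numbers below are made up for illustration, and PISA's actual procedure is considerably more involved than this.

```python
import math


def rasch_probability(ability: float, difficulty: float) -> float:
    """Basic Rasch model: probability that a student with the given
    ability answers a question of the given difficulty correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


# Hypothetical values, for illustration only (not PISA data).
# In the Mary/Peter picture, the difficulty of the area question would be
# estimated from the Marys' answers, and Peter's ability from the
# questions that Peter actually answered.
area_question_difficulty = 0.5
peter_ability = 1.0

# The model's estimate of how likely Peter is to get the area question
# right, even though he never saw it.
p = rasch_probability(peter_ability, area_question_difficulty)
print(round(p, 3))  # 0.622
```

When ability equals difficulty, the formula gives exactly a 50% chance. The dispute described in this post is about whether a single difficulty number per question is a fair assumption across very different countries.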

In some circumstances, the Rasch model works, but Svend Kreiner, a Danish biomedical statistician, says that PISA uses it incorrectly. The use of the Rasch model has also been criticized by David Spiegelhalter, professor of the public understanding of risk at Cambridge University.

Some people say that Kreiner is wrong and that it is acceptable to use the Rasch model. I am not a statistician, and it is not clear to me who is correct. (However, I find it relevant that Kreiner was one of Rasch’s students and that he has used the Rasch model for 40 years.)

How should we interpret the PISA league tables?

Let’s dismiss any misgivings about the reliability of the results and take the PISA league tables as valid. But keep in mind that some data is generated statistically, and the students taking the exams are only a sample of the 15-year-old cohort. Consequently, there is some uncertainty in the published figures.

The league tables show a single score for each country, but the uncertainty means that there is actually an interval of plausible scores. If the intervals for two countries overlap, then you cannot really conclude that the scores are different.

For example, in the 2012 PISA mathematics table, Finland finished 12th with an average score of 519. Canada placed 13th with an average score of 518. The intervals for Finland and Canada overlapped considerably. Do the results really allow us to say with any degree of certainty that Finland finished ahead of Canada?
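The overlap check can be sketched in a few lines. The two average scores are the ones quoted above, but the standard error of 2 points is an assumed, illustrative value and not a figure from the PISA report, and the 95% interval (mean plus or minus 1.96 standard errors) is one common convention rather than necessarily the one the OECD uses.

```python
def interval(mean: float, standard_error: float, z: float = 1.96) -> tuple[float, float]:
    """Return (low, high) for an interval of mean +/- z * standard_error."""
    margin = z * standard_error
    return (mean - margin, mean + margin)


def overlaps(a: tuple[float, float], b: tuple[float, float]) -> bool:
    """True if the two (low, high) intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# ASSUMPTION: an illustrative standard error, not taken from the PISA tables.
ASSUMED_STANDARD_ERROR = 2.0

finland = interval(519, ASSUMED_STANDARD_ERROR)
canada = interval(518, ASSUMED_STANDARD_ERROR)

print(overlaps(finland, canada))  # True
```

With scores only one point apart, the intervals overlap under any plausible standard error, which is why the ordering of the two countries says very little.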

The uncertainty allows some bending of the league tables to suit your argument.

If you want to argue that Canada is slipping badly in math education, you use the league tables based on the single scores: In 2006 Canada was sixth from the top, in 2009 Canada was tenth from the top, and in 2012 Canada was thirteenth from the top. From sixth to thirteenth is a significant drop.

If you want to argue that we are actually not dropping very much, you use the statistical intervals. Based on these, in 2006 Canada was tied for fifth, and in both 2009 and 2012 Canada was tied for tenth. So, no drop from 2009 to 2012.

This still looks like a drop in performance since 2006. But wait: in 2006 neither Shanghai nor Singapore participated in PISA. So you can argue that if we want to compare the 2009 and 2012 results to 2006, we should exclude those “countries”. When we exclude them, we have Canada finishing fifth in 2006 and eighth in 2009 and 2012. And now the drop somehow doesn’t seem as drastic.

- - -

I don't think I've answered the two questions.

Nevertheless, as the OECD intended, PISA provides people with data that they can use to press their governments about their education system. It’s unfortunate that the league tables may be flawed and that they are presented in a way that allows the data to be bent to fit one’s agenda.

Thanks for reading.

- - -

Sources

The arguments against the use of the Rasch model:

An academic paper by Svend Kreiner.

A paper by Svend Kreiner and Karl Bang Christensen.

(To be honest, I didn’t read the above papers. They are highly technical and beyond my expertise.)

Two articles by David Spiegelhalter here and here (these are readable).

Rebuttals of the above:

Ray Adams (member of PISA) has a longish article defending PISA’s methodology.

Jan-Eric Gustafsson (University of Oslo) discounts Kreiner and Christensen’s critique. (Gustafsson, however, has other reasons to doubt the validity of the league tables.)

The PISA data:

The OECD-PISA league tables can be found at the OECD PISA site. CMEC also has them here:

PISA 2006, PISA 2009, PISA 2012

The original arguments claiming that PISA is flawed were published some time ago. Here are some more recent ones:

Matthew Smith in Education Week in review (April 2014).

William Stewart, a longtime education reporter for TES. (Last Updated: 27 September, 2014).

Benjamin Reilly, Founder of Deans for Impact (Feb 2014).

Catherine Wolff, an education writer in New Zealand (Dec 2013).

Thursday, 18 June 2015
