Interpreting Conflicting Statistics from the Same Data
Reverse Regression: The Algebra of Discrimination (Greene 1984)
In this first journal club, I will be recapitulating and discussing an old paper which is a prime example of how regressing x on y is not the same as regressing y on x, and how you can find seemingly conflicting statistics from the same dataset.
Summary
Greene's paper is a response to an earlier paper that published two metrics to investigate whether women were paid commensurately with men. That earlier paper tried to make the determination by running two multivariate regressions (a toy simulation after the list shows how the two can conflict):
- For a given job title, education level, years of experience, etc. (\(x_1, x_2, x_3, \dots\)), what would be the expected salary (\(y\)) for a man and for a woman? The regression showed that the expected salaries for a man and a woman were not equivalent, and that men were paid more.
- For a given salary, what are the expected qualifications for a man and for a woman? If women were being discriminated against, one would expect a woman to be more qualified than a man at a given salary level. This was not the case.
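To make the conflict concrete, here is a minimal simulation with made-up numbers (my own sketch, not data from either paper). It assumes that salary is driven by an unobserved "true productivity" that the recorded qualifications measure only noisily, and that the simulated women are on average younger and less experienced; both assumptions are exactly the kinds of issues discussed in the next section. Under them, the two regressions reach opposite-looking conclusions even though a genuine pay penalty for women is baked in.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Gender indicator: 1 = woman, 0 = man.
female = rng.integers(0, 2, n)

# Unobserved "true productivity": the simulated women are, on average, younger /
# less experienced (as in the 1980s samples discussed below), so their mean is lower.
productivity = rng.normal(1.0 - 1.0 * female, 1.0)

# Recorded qualifications are an incomplete, noisy measure of productivity.
qualifications = productivity + rng.normal(0.0, 1.0, n)

# Salary depends on true productivity plus a genuine pay penalty for women (-0.2).
salary = productivity - 0.2 * female + rng.normal(0.0, 1.0, n)

def ols(y, *regressors):
    """OLS coefficients of y on an intercept plus the given regressors."""
    X = np.column_stack([np.ones(len(y)), *regressors])
    coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coefs

# Direct regression: salary on qualifications and gender.
_, _, female_coef_direct = ols(salary, qualifications, female)

# Reverse regression: qualifications on salary and gender.
_, _, female_coef_reverse = ols(qualifications, salary, female)

print(f"direct  (salary ~ qualifications + female): female coef = {female_coef_direct:+.2f}")
print(f"reverse (qualifications ~ salary + female): female coef = {female_coef_reverse:+.2f}")
```

With these made-up numbers, the direct regression estimates a salary penalty for women of roughly −0.7 at fixed measured qualifications, while the reverse regression estimates that women are roughly 0.4 *less* qualified at a fixed salary (i.e., they look "overpaid"), even though the true penalty built into the simulation is only −0.2.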
Important things to know
The second case is known as a reverse regression, and it is not equivalent to the first. There are also a few problems with this approach: the measurement errors for the two regressions are not the same, there is a problem of simultaneous-equation bias, and the entire second regression is kind of shady.
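A quick way to see that the reverse regression is not just the first regression turned around: in the simple one-regressor case (my notation), the two fitted slopes are

\[
\hat\beta_{y \text{ on } x} = \frac{\operatorname{Cov}(x,y)}{\operatorname{Var}(x)}, \qquad
\hat\beta_{x \text{ on } y} = \frac{\operatorname{Cov}(x,y)}{\operatorname{Var}(y)}, \qquad
\hat\beta_{y \text{ on } x}\,\hat\beta_{x \text{ on } y} = R^2,
\]

so the second slope is the reciprocal of the first only when \(R^2 = 1\). The weaker the fit, the further apart the two regressions can drift.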
One example of why the reverse regression would be off is the uneven distribution of the data. Especially in the '80s, there were fewer qualified women for jobs requiring advanced degrees (don't forget, schools like Williams didn't become co-ed until 1970). Thus, the women who did hold those jobs would tend to be younger (and less experienced) than the men in the field. This would skew the reverse regression and imply that women are overpaid, when that clearly isn't the conclusion of the original regression.
Another explanation offered for the reverse-regression contradiction is that the set of parameters onto which salary is regressed (job title, years of experience, education, etc.) is an incomplete predictor of salary, and that there are external factors the regression does not capture. When regressing the other way (i.e., using salary to predict "qualifications"), the "qualifications" variable therefore no longer represents exactly what it is intended to represent. The uncertainty, or the error not explained by the regression model \((1-R^2)\), isn't necessarily going to skew the reverse regression in the same way and lead to the same conclusion. Instead, it may even skew toward the opposite conclusion.
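One way to make that concrete, using a deliberately stripped-down model of my own (not the paper's notation): suppose observed qualifications \(q = p + v\) are a noisy measure of true productivity \(p\), and salary is set as \(s = \beta p + \delta g + \varepsilon\), where \(g = 1\) for women and \(\delta < 0\) is a genuine pay penalty. The reverse regression is essentially asking for \(E[q \mid s, g]\), and with normal errors its coefficient on \(g\) works out to

\[
(\mu_f - \mu_m)\,(1 - R^2) \;-\; \lambda\,\delta,
\qquad
\lambda = \frac{\beta\sigma_p^2}{\beta^2\sigma_p^2 + \sigma_\varepsilon^2},
\]

where \(\mu_f, \mu_m\) are the group means of productivity and \(R^2\) is here the within-group fraction of salary variance explained by productivity. The second term, driven by the real pay penalty, would make women look more qualified at a given salary; but the first term, driven by the 1980s imbalance in experience combined with the variance the model fails to explain, pushes the other way and can dominate. That is exactly the contradiction between the two published metrics.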