Randomised experiments are the preferred method for assessing the effects of treatments, for both theoretical and practical reasons. This is why they are accorded the highest status in the standards of evidence that underpin Investing in Children. But they are not always feasible or ethical, in which case non-randomised experiments are likely to be used instead. So, to what extent do the results of non-randomised designs match those of randomised ones?
William Shadish started to study this question during the 1990s. His early analyses used a meta-analytic approach. He gathered a large number of randomised and non-randomised experiments which had looked at the same question, worked out the average effect size, and compared them for similarity.
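The comparison step can be sketched with an inverse-variance weighted (fixed-effect) average, a standard way to pool effect sizes across studies; the numbers below are invented for illustration, not Shadish's data:

```python
import numpy as np

# Hypothetical standardised mean differences and their variances from
# two sets of studies of the same question (illustrative values only).
rand_d = np.array([0.30, 0.45, 0.25, 0.50])   # randomised experiments
rand_v = np.array([0.02, 0.03, 0.01, 0.04])
nonr_d = np.array([0.10, 0.60, 0.35, 0.05])   # non-randomised experiments
nonr_v = np.array([0.02, 0.05, 0.02, 0.03])

def fixed_effect_mean(d, v):
    """Inverse-variance weighted average effect size (fixed-effect model)."""
    w = 1.0 / v
    return (w * d).sum() / w.sum()

print(round(fixed_effect_mean(rand_d, rand_v), 3))  # → 0.324
print(round(fixed_effect_mean(nonr_d, nonr_v), 3))  # → 0.236
```

Comparing the two pooled averages (and how much they differ from study to study) is essentially what the meta-analytic approach described above does at scale.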
Previous work in this vein had led researchers to assume that the two methods essentially produce the same results. The studies by Shadish and colleagues, however, showed that “ignoring assignment method is a very bad idea”.
In studies of interventions such as drug use prevention, family therapy, Alcoholics Anonymous and coaching for school aptitude tests, the researchers found that the two methods yielded similar results sometimes but not always. Further, when the results were different they were not consistently different in the same direction: effect sizes were sometimes higher for randomised experiments and sometimes lower.
A problem with these findings is that random assignment may not be the only thing that distinguishes randomised from non-randomised experiments. For example, randomised experiments tend to use "passive" control groups in which participants receive little or no attention, whereas non-randomised studies are more likely to use "active" controls in which participants receive attention. Non-randomised studies are also disproportionately likely to match participants in the treatment and control groups.
Non-randomised experiments that allow participants to choose whether they receive the intervention in question yield results that differ from randomised experiments far more than do the results of non-randomised experiments that prevent such self-selection. Often the bias from self-selection can be seen at the start of the experiment, and it simply carries over to the post-test effect sizes. A complication, again, is that this does not always work in the same direction. For instance, participants who self-select into psychotherapy tend to be more distressed, whereas those who self-select into Alcoholics Anonymous are more likely to stay sober.
On a positive note, when randomised and non-randomised experiments are conducted identically to each other in all respects except for assignment mechanism, they can yield quite similar results. This said, ultimately meta-analysis cannot provide a firm answer to the question Shadish was studying because it cannot ensure that the experiments were conducted identically except for assignment method.
In an effort to address this problem, Shadish describes three studies concerning mathematics and language training in which university students were randomly assigned to be in either a randomised study or a non-randomised alternative. Participants were otherwise treated identically. This research yielded several important lessons.
The first is that in non-randomised studies, good measurement of the selection process is crucial. In other words, evaluators should find out what factors predict whether people will choose one condition over the other. Controlling for the relevant factors can eliminate bias.
Another lesson is that the particular statistical method used to adjust the results matters little. When different methods were used to adjust results from the non-randomised experiments, bias reduction was about the same for all of them – as long as, critically, the adjustment drew on good measures of selection.
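These two lessons can be illustrated with a toy simulation (invented data, not Shadish's): a single well-measured selection factor `x` drives both the choice of condition and the outcome. A naive comparison of the two groups is badly biased, while two quite different adjustments that both use the selection measure – regression control and stratification – each move the estimate much closer to the true effect.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
true_effect = 0.5

# One well-measured "selection" covariate drives both the choice of
# condition and the outcome (hypothetical data for illustration).
x = rng.normal(size=n)
t = (rng.random(n) < 1 / (1 + np.exp(-1.5 * x))).astype(float)  # self-selection
y = true_effect * t + x + rng.normal(size=n)

# Naive comparison is biased: those who chose treatment have higher x.
naive = y[t == 1].mean() - y[t == 0].mean()

# Adjustment 1: ordinary least squares controlling for x.
X = np.column_stack([np.ones(n), t, x])
ols_effect = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Adjustment 2: stratify on quintiles of x, then average the
# within-stratum treatment/control differences.
strata = np.digitize(x, np.quantile(x, [0.2, 0.4, 0.6, 0.8]))
strat_effect = np.mean([y[(strata == s) & (t == 1)].mean()
                        - y[(strata == s) & (t == 0)].mean()
                        for s in range(5)])
```

Both adjustments substantially reduce the bias precisely because `x` – the selection measure – is available; with a poorly measured selection process, neither method could do so.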
Next, non-randomised experiments produce more accurate estimates of effect when the control group comes from the same location as the treatment group, and when it shares many of the same characteristics as the treatment group.
Large sample sizes are also needed in non-randomised designs. Controlling for factors that predict whether people will choose one condition over another works best with large samples. In a computer simulation, for instance, the most accurate results came from studies with at least 1,500 participants, with at least 500 needed to keep the risk of deviating far from the right answer small.
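The sample-size point can be sketched by repeating an adjusted self-selection analysis many times at different sizes (a toy simulation, not the one Shadish cites): the adjusted estimate is far more stable across repeated studies when samples are large.

```python
import numpy as np

rng = np.random.default_rng(1)
true_effect = 0.5

def adjusted_estimate(n):
    """One simulated self-selection study, adjusted for the selection covariate."""
    x = rng.normal(size=n)
    t = (rng.random(n) < 1 / (1 + np.exp(-1.5 * x))).astype(float)
    y = true_effect * t + x + rng.normal(size=n)
    X = np.column_stack([np.ones(n), t, x])
    return np.linalg.lstsq(X, y, rcond=None)[0][1]

# Spread of the adjusted estimate across 200 repeated studies,
# at a small and a large sample size.
small_n_spread = np.std([adjusted_estimate(100) for _ in range(200)])
large_n_spread = np.std([adjusted_estimate(1500) for _ in range(200)])
```

With 1,500 participants per study the estimates cluster tightly around the true effect; with 100 they scatter widely, so any single small study risks landing far from the right answer.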
Shadish also stresses that the analysis of studies that allocate people to experimental conditions according to whether they fall above or below a cut-off on a given variable (the regression discontinuity design) is difficult. In particular, care is needed to model correctly the relationship between the assignment variable and the outcome.
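A minimal sketch of such a cut-off analysis, with invented data and a deliberately simple linear relationship between assignment variable and outcome: modelling the assignment variable recovers the true effect, while omitting it badly overstates it. (Real analyses must also check for non-linear relationships, which this sketch assumes away.)

```python
import numpy as np

rng = np.random.default_rng(7)
n = 2000
true_effect = 0.4
cutoff = 0.0

# Assignment is determined entirely by a cut-off on the running
# variable r (e.g. a pre-test score); hypothetical data.
r = rng.normal(size=n)
t = (r >= cutoff).astype(float)
y = true_effect * t + 0.8 * r + rng.normal(scale=0.5, size=n)

# Correct model: treatment indicator plus the (centred) running variable.
X = np.column_stack([np.ones(n), t, r - cutoff])
rd_effect = np.linalg.lstsq(X, y, rcond=None)[0][1]

# Mis-specified model: ignore the running variable entirely.
X_bad = np.column_stack([np.ones(n), t])
bad_effect = np.linalg.lstsq(X_bad, y, rcond=None)[0][1]
```

The mis-specified model confuses the effect of being above the cut-off with the effect of simply having a higher score, which is exactly the modelling pitfall Shadish warns about.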
Readers may conclude that the complications inherent in non-randomised experiments identified here make them less desirable than many seem to think. Viewed positively, however, Shadish's analyses give substantial cause for optimism.
He writes: “Conditions do exist under which non-randomized experiments can yield accurate answers. This is most obvious for the regression discontinuity design, where a number of studies have supported its accuracy when it is properly analyzed.”
Shadish, W. R. (2011) Randomized controlled studies and alternative designs in outcome studies: challenges and opportunities. Research on Social Work Practice 21 (6), 636-643.