Congratulations! You’ve completed an experiment! Now what?
Hint: Implementation of evidence is hard.
When working to promote and mainstream experimentation in a public service setting — Experimentation Works being one such initiative — a fair amount of focus is placed on capacity-building: ensuring that the departments running the experiments have the right mix of skills, and that this expertise is available to support the design and implementation of experiments. While this is laudable and a fundamental aspect of the experimentation process, it covers only some of the elements that allow experimental evidence to take hold and mature within an organization.
An experiment is not an end in itself, but rather a technical way of generating evidence, which is then used to inform the decision-making process in one way or another. Simply put, generating valid, robust, and relevant evidence is a prerequisite to evidence-based policy-making. Making sure that evidence is properly used to inform decisions is the logical next step.
I would argue that this last step is no less difficult than designing and implementing the experiment itself, and it requires dedicating significant time, from the outset, to how the evidence will be embedded into the process. In short, if you’re currently designing an experiment, you should think about this now. Let me explain what I mean.
1. Experiments are hard, implementation is harder
Implementing an experiment in a public service context presents a number of challenges. First, as is the case in scientific research, experiments outline hypothesized effects (i.e. what we think is going to happen) and test them in the field or in a lab setting. Often our theoretical expectations are wrong and experiments yield null results (i.e. the intervention was assumed to bring about an effect but did not generate an observable and statistically significant change to that end). This is not a bad thing. Experimenters, policy-makers and, for that matter, scientists should be prepared to receive null results and to let this evidence feed a learning process. “Letting go” and following what the evidence says is sometimes easier said than done; the results might be deflating and downright disappointing (i.e. nothing happened!), but that should not discourage us. True learning should never be limited to positive results.
2. Impartiality doesn’t exist
Secondly, we often like to think that we ourselves are being perfectly neutral in what we expect to happen. However, it is often the case that we are indeed hoping for a specific outcome, whether overtly or not. As human beings, we tend to have established preferences for specific outcomes.
Even if the experimenter does not have a specific interest or preference, this bias can manifest itself in the organization supporting the experiment, for instance through the level of resourcing or the amount of investment tied to the service/program/policy being tested.
This second challenge deals with the fact that experimental results can often be misaligned with existing beliefs and opinions. Experimental results can be counterintuitive (i.e. producing the exact opposite of the hypothesized effect), hard to explain, or simply difficult to reconcile with existing program/service/policy options. Bottom line: experimental evidence can be inconvenient, especially if it seems to discredit a preferred option.
3. Evidence is never simple to interpret
Thirdly, experimental results may not be straightforward to interpret, which compounds the previous two challenges. A typical experiment contrasts two observed outcomes (i.e. one measured in the treatment group and the other in the control group) using conventional statistics (e.g. difference-in-means tests or basic regressions). Of course, experiments can be designed to allow for more complex analyses, but even in the simple case, experimental evidence can prove challenging to interpret, especially when adding the extra step of fleshing out the implications of the results for decision-making purposes.
Experiments — and the same goes for all types of evidence that can potentially inform the decision-making process — will always run the risk of being misinterpreted. The technical, statistical task of interpretation is often straightforward; however, the data rarely speaks for itself when it comes to policy choices. For instance, an experiment can demonstrate a statistically significant difference (i.e. the difference between the two groups is not zero) and capture, say, a 5% difference in outcomes. This difference could be statistically significant, but would it be practically significant for your organization? How about a 10% difference? Or a 12.7% difference? What would the threshold be for you to flag the results as requiring a specific course of action with respect to your program/service/policy? The statistical significance of results is easy enough to compute; the substantive interpretation of the effect’s magnitude is not so simple (although there are guidelines that can help with this task).
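To make the distinction concrete, here is a minimal sketch (in Python, with invented data, group sizes and thresholds) of the kind of two-group comparison described above. The p-value tells you whether the difference is statistically distinguishable from zero; whether a difference of that size matters for your organization is a separate, pre-agreed judgment.

```python
# Hypothetical illustration only: the data and the "practical relevance"
# threshold are invented for the sake of the example.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=50, scale=10, size=200)    # e.g. baseline outcome scores
treatment = rng.normal(loc=52, scale=10, size=200)  # e.g. scores under the new service

# Statistical significance: is the difference in means distinguishable from zero?
t_stat, p_value = stats.ttest_ind(treatment, control)

# Effect size (Cohen's d): how large is the difference in standardized terms?
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
cohens_d = (treatment.mean() - control.mean()) / pooled_sd

diff = treatment.mean() - control.mean()
print(f"Difference in means: {diff:.2f}, p-value: {p_value:.3f}, Cohen's d: {cohens_d:.2f}")

# Practical significance is an organizational question: this threshold is
# purely illustrative and would be set with program and policy experts
# before the results come in.
PRACTICAL_THRESHOLD = 5.0
if p_value < 0.05 and abs(diff) >= PRACTICAL_THRESHOLD:
    print("Statistically significant AND large enough to act on (per the chosen threshold).")
elif p_value < 0.05:
    print("Statistically significant, but below the pre-agreed practical threshold.")
else:
    print("No statistically significant difference detected.")
```

The point of the sketch is simply that the first print statement is the easy part; everything after the threshold line is where the organizational conversation happens.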
In a context where experimental evidence can be contrasted with other experimental evidence on a similar intervention, interpretation can again be tricky. Say an experiment demonstrates that intervention x is effective in the specific context studied, but someone brings up a similar experiment, conducted in a different jurisdiction, that proved the exact opposite of what was believed to be true.
A good example of this is the name-blind recruitment experiment conducted in Canada set against the de-identified applications experiment conducted in Australia: the former found that name-blinding had no effect on the screening of applications from members of visible minority groups, whereas the Australian study found that de-identifying applications limited affirmative action by reviewers, which may indicate that public servants reviewing identifiable applications slightly favoured female applicants over male candidates.
How does one reconcile the two experiments above? One can look at differences in experimental designs, but such comparative exercises are challenging. How much weight should be placed on one’s own experiment and how much on the alternative one? These key questions deal with assessing the general trustworthiness of the implemented experimental protocol and the experiment’s probability of error.
Figuring out to what extent the results truly warrant the policy claim they appear to support is very challenging. Meta-science studies (i.e. studies of other studies) suggest that flexibility in designs, heterogeneity of results, and risks of bias are all important challenges affecting the trustworthiness of results. Better experimental evidence is one way to mitigate these, but being aware of these issues and caveats when implementing experiments and analyzing evidence is also necessary.
Spell out what you’d do depending on what the evidence might show
Of course, policy processes allowing for experimentation should be open and prepared to face these situations, but in practice this can be very challenging. One practical solution is to clearly outline, early on in the experimental process, what the results will mean under different scenarios.
In the context of scientific research, a researcher typically details her theory and hypotheses, and understands what the different tests she performs will mean for the veracity of her theory and the plausibility of her hypotheses. In a scientific experiment, a hypothesis test has clear implications for the theory at play (e.g. rejecting the null hypothesis means x is effective; an observed effect size of magnitude z means that intervention x1 performed better than x2, etc.).
A similar approach could be developed for experiments taking place in public service settings. This kind of transparency in the policy-making process is important to give proper consideration to evidence and to avoid cherry-picking. When one is designing and implementing an experiment, the following questions should be clarified early on: what will the results (positive, negative or null) mean for your organization? What about for your program? Would you be ready to course-correct based on these possible results? Would your organization be ready to increase funding and scale up your program (or conversely, to de-fund and scale back the program) based on the results? These are all important and challenging questions that should be discussed before the experimental results are in. Without a clear answer to these, the experiment might not be as relevant or as important as it could be.
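One lightweight way to make this pre-commitment explicit is to write the scenarios down before the data arrive. The sketch below is purely illustrative (the scenario labels, thresholds and actions are invented), but it captures the spirit: each plausible pattern of results is mapped, in advance, to a planned organizational response.

```python
# Hypothetical pre-specified decision map, drafted and agreed with management
# *before* the experiment reports. Scenario labels and actions are invented
# for illustration only.
PRE_SPECIFIED_DECISIONS = {
    "positive_and_practically_significant": "Scale up the pilot and seek additional funding.",
    "positive_but_below_practical_threshold": "Maintain the current program; revisit the intervention design.",
    "null_result": "Do not scale; document lessons learned and test an alternative approach.",
    "negative_effect": "Pause the intervention and review for unintended harms.",
}

def planned_action(scenario: str) -> str:
    """Return the action that was agreed on before the results were known."""
    return PRE_SPECIFIED_DECISIONS.get(
        scenario, "Scenario not anticipated: convene program and methods experts before acting."
    )

# Example: once the analysis classifies the observed result...
print(planned_action("null_result"))
```

Whether this lives in code, a table, or a paragraph in the experimental plan matters less than the fact that it exists before the results do.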
Below is a non-exhaustive list of key elements to pay particularly close attention to when designing and implementing an experiment:
1. Evidence use — Think about what your experimental evidence will be used for. This should be clear, otherwise there will be missed opportunities for learning and action;
2. Scenarios — Get early feedback from your management, not only on the design and the test, but on the practical implications of the various results scenarios you are likely to face (e.g. negative/positive, significant/non-significant results; small vs. large, intended vs. unintended effects). A pre-mortem debrief is always a great practice;
3. Audience — Identify your target audience(s) and tailor your knowledge dissemination strategy to their needs and specifics. This can mean having different formats, styles and levels of complexity in your presentation, always tailored to the audience;
4. Knowledge translation — Prepare a knowledge translation and dissemination plan ahead of time. You don’t want this part to be an afterthought, otherwise your experiment won’t get the spotlight it deserves and will be quickly forgotten or disregarded;
5. Results interpretation — While data never really speaks for itself, avoid extrapolation and inflated policy claims. Make sure your message is in line with the pattern described by the data. Keep it simple and don’t be afraid to BLUF (Bottom Line Up Front) when presenting your results;
6. Contradictory results — Be prepared to have to explain your results in light of potentially contradictory experiments conducted elsewhere. This can add some noise to the overall interpretation, but divergent results should not be ignored. Don’t hesitate to get expert advice at this point;
7. Failure and learning — Experiments that do not work out can still offer insights and lessons on experimentation for a public service audience. The challenges should be made clear and shared, so that they can be properly addressed in future rounds of experimentation. Even in the case of successful experiments, a post-mortem debrief will be tremendously relevant.
The challenges I’ve highlighted above show the need to engage with these questions early, in order to be prepared to face them when they do come (and they likely will). Experts are needed to clarify the technicalities and subtleties presented by the data, but these questions should always be answered jointly with program and policy experts, who have the substantive understanding of the program/service/policy area, as well as of its vital context. While the promotion of experiments in public services is an important step forward in supporting an effective organizational culture based on continuous (evidentiary) learning, the exercise should always be conducted by carefully weighing some of the important methodological and interpretive challenges specific to experimental evidence use in the public service.
Post written by one of Experimentation Works’ experts, Pierre-Olivier Bédard from the National Research Council Canada
This article is also available in French here: https://medium.com/@exp_oeuvre