Experimentation Works Failure Report (2018–2019)

Credit: Teddy Rawpixel (Rawpixel.com)

Why a failure report? The semantics of the word failure continue to elicit strong responses among many individuals, including within public services. We understand both sides. On the one hand, there are those who think failure is too strong a word: why not call it ‘what we learned’, or ‘lessons’, since the focus is positive rather than negative? On the other hand, there are those who think failure is a necessarily strong word, since it points to things that should not be tried again. Seen that way, admitting to failure is about breaking a cycle where one (or many!) continues to do something because it feels intuitively right, or because that’s the way it has always been done, rather than because it was tried and found to work through lived experience.

The idea of failure does catch the eye, for good and bad. And we start by noting that we don’t think EW as a project, or EW1 as a first cohort, were failures. Not at all. However, we do think that every project has things that are done well, and things that are not done well, and so this exercise is more about being introspective and creating space — yes in a very public way — to have that much-needed intentional time before re-embarking on this journey again with a new cohort of EW, one that will no doubt mean spending precious time and resources in this way rather than a different way.

There are many resources on the topic of why admitting to failure in a public policy context is important, and others have written about it much more eloquently; here we give a nod to Ashley Good’s work with Fail Forward, among many other pioneers like Engineers Without Borders (EWB), whose 2017 failure report is worth checking out. So with this preamble in mind, here is a list of the things the team has found wanting, and would think twice about doing again in the same way, if they could help it.

In a true nod to the EWB failure report model, we have provided name, role and contact info for each perspective, in case you want to dig deeper or find yourself in a similar situation and would appreciate advice. And finally, we have intended this piece to be read as a complement to our Impact Report.

Cohorts can create an expectation that all teams are starting from the same place, when they’re not (@terhas_, EW project team, TBS). As someone who played the “convener” role for the cohort (e.g. bringing the teams together for calls and meetings, organizing the learning events), it was clear to me that the participating teams had very different operational realities. I understand that this was somewhat intentional in the design. We wanted to have a variety of experiences, contexts and ultimately experiments from which the cohort and observers could learn. It just didn’t always play out that smoothly.

The capacity of the teams (both in terms of expertise and resources) created unevenness across the cohort. Some teams needed more support than others (either from TBS or from the experts) and we couldn’t always find the right balance. In an effort to keep the cohort on the same page and maintain a shared experience, I felt that we were over-burdening some teams while not paying enough attention to others. Bouncing between these extremes was difficult.

It’s true that cohorts learn well together, but the cohort model can be designed with variation in capacity in mind. This way we can respond to training, expertise and resource needs as appropriate.

Time commitments are difficult to estimate for experts (or for anyone involved for that matter…) (Pierre-Olivier Bédard, EW expert / experimental design support for TBS EW). When I was approached to join the initiative as a supporting expert, I was working full time in a job that didn’t touch upon experimentation. A discussion on my time commitment with EW partners, as well as my own manager, quickly took place to make sure I could realistically commit without over-promising.

Estimating time requirements for something that is partly being built in the process was particularly difficult. Much like estimating construction time for a house for which you don’t have the final plans yet.

All projects were substantially different, and at different development stages, meaning some projects required more hand-holding than others. Providing support to the project teams was my main time commitment, but given this took considerable time it made it particularly difficult to respond to additional requests for tools and products (e.g. training, guidance documents, etc.). In short, the time commitment from experts was underestimated. I really enjoyed brainstorming with teams, providing support on design issues, but sometimes felt a little frustrated that I couldn’t do more of what I enjoy and that would have benefited others.

A possible solution could be to have a clearer work commitment for EW partners, along with clear deliverables. The UK Trial Advice Panel approach might be interesting to follow, where time commitments are included in the Memorandum of Understanding signed between partners. For instance, if you are an expert, you are expected to commit to 4 days per month, participate in approximately x conference calls per month, develop a tool on y, and deliver a 2-hour training session on z during the next 12 months.

Relationships take time and power dynamics are always at play (@danutfm, EW project team, TBS). As one of the individuals who designed the initiative, I felt a lot of ownership about the project. I also felt empowered to make key decisions through the initiative. Some of the pivots we made were good (and so are not necessarily now remembered), but unfortunately some were not so good and are indeed lingering in my mind.

One of these had to do with re-assigning an expert to a project team (from a different project team) once that first team was able to secure a full-time expert in experimental design for their broader work. This individual was able to quickly get on board the project and started supporting it from the inside.

At first glance, we all felt that the expert we had assigned them was no longer needed (the expert themselves felt it a little bit too), and so we thought it was a no-brainer to have that expert re-assigned, given that they were a pretty sought-after commodity. The problem was that we didn’t communicate that decision very well to the project team. From the outside, and very likely to them, it felt like we were punishing them by taking away their only assigned EW expert, weakening their relationship to the EW project, and really undermining our own EW model.

Realizing that trust had been broken, I quickly understood that we had made a mistake. We should have stuck to our initial model, and only changed it if the desire for a re-assignment had come organically, with all parts of the group agreeing to it. It definitely should not have come from us, the organizing team, as the group with (quote unquote) power over the others.

My takeaway from that episode is that trust is built with difficulty, and it only takes a small mistake to break it. Even though I would say we recovered that relationship and it didn’t fully cost us the project, we should have been much more careful with our communication and ability to build and maintain relationships (it also didn’t help that we made that decision over the phone, rather than in person — these little things do matter).

Learning events cannot substitute for formal training (@terhas_, EW project team, TBS). The entire EW model was built on the concept of “learning by doing”. There was no formal training embedded into Experimentation Works because the cohort would learn on the go, have access to experts along the way, and could attend learning events to supplement their learning.

The learning events were intended as a way to bring the community together to gain insights about a topic related to experimenting in the public sector. Having attended all of the learning events (with the exception of the EW launch), I found that the most popular sessions were the ones that included a practical component like a case study or workshop. There was a strong desire from the cohort and the community to learn practical skills in these sessions. So we tried to respond to that need, but quickly found the learning events could not satisfy the desire for more formal training. We just didn’t have the time or the capacity to deliver workshops at every learning event, and sometimes the workshops took away from the organic networking that traditional learning events could provide.

I think that a more structured training or learning plan up-front would have better complemented these learning events. Learning by doing certainly does not preclude formal training.

Doing something for the first time can be a very bumpy ride (@schancase, EW project team, TBS). EW was an entirely new initiative in the federal government with the aim of demonstrating what was, for most of us policy wonks (myself included!), a relatively unknown practice: rigorous experimentation. As someone who was there from the inception of EW, I felt like I and we were trying to fly a plane while we were still building it and still trying to track down some good pilots and engineers … without any salary dollars to pay anyone! This resulted in a lot of challenges and mishaps, as you can imagine. Given budget and HR limitations, we had to pitch the idea across town to try and secure partnerships (including GOC experts on loan). In other words, we had to imagine and get the initiative rolling before we could benefit from somewhat regular access to expertise in experimentation. This led to some bumps along the way in terms of designing an application process (with flaws, for sure) and (mis)estimating how much time we would really need from experts — two of the many items of the original design where I took on the lead design guess work, consulting where I could with people more familiar with experimentation than me. I think the best thing that I and we did, however, was to be completely honest from the beginning about our limitations (i.e. most everything on the science/data/experimentation side of things) while doing everything we could to track down experts inside and outside of the GOC who could come alongside us to fill those gaps. In the end, we discovered some amazing experts who could help advise on the technical side of things so that we at TBS could be the flight attendants who got everyone on the plane and tried to make sure they were enjoying the ride.

Departmental baseline expertise should be assessed and leveled (Pierre-Olivier Bédard, EW expert / experimental design support for TBS EW). As mentioned above, departments came to the initiative with different levels of ambition, capacity and expertise. This is a basic reality of the federal landscape with regard to experimentation, but it would have been helpful to make sure participating departments were starting off with a comparable basic understanding of core experimental concepts before launching their projects. For instance, some projects spent considerable amounts of time figuring out what they wanted to test, what intervention channel they would use, what their hypothesis would look like, etc. well into the cohort timeline.

I did spend a considerable amount of time with teams reviewing and refining core elements such as problem statements and research questions. This was great in a way, as I think it really lays the foundation for the design to be developed. At the same time, I would have liked to do that brainstorming phase earlier on, before projects were launched and not when we felt rushed to get something off the ground. Doing so would have meant more time dedicated to the other parts of the design, which shouldn’t be rushed either.

This could have been avoided/mitigated by doing more support work with teams upfront. For instance, holding an initial project development workshop where experimentation knowledge is provided along with expert support in developing the core components of an experiment could be a solution.

Designs could even be pre-registered through an open platform so that all project teams are able to see them, but also to make sure project teams are operating within those bounds during project implementation (i.e. you publish the design, and when reporting on results, this is done against the initial plan, explaining discrepancies, if any). All core elements of the design should be figured out, at least tentatively, before moving in any direction. In short, good planning makes for good projects.

Retention is difficult when incentives are misaligned (@danutfm, EW project team, TBS). In the beginning phases of EW, when we were pitching the model internally within the Government of Canada, there was no developed network of contacts with experimental design skills that we could reach out to, given that we were practically starting (or restarting, depending on your outlook on the cyclical nature of institutions) to build this practice within the Government of Canada. (And again here, I don’t want to suggest we started from scratch. We knew that skillset existed and had a long and rich tradition within the Government of Canada; it was just not a typical one in policy or program circles.)

So we initially had to do quite a bit of digging in areas that were not familiar to us — speaking to scientists, regulators, evaluators, performance measurement experts — anyone who would hear us out and who understood what we meant when we said that we value experimental design expertise as an important skillset, and that we think it needs to be embedded into policy and program contexts. It was not easy finding folks that understood us, given that traditionally, these functions had ceased speaking to each other — policy minds stuck with policy people, and scientists stuck with scientists — but that’s another story.

On a number of occasions, after working hard to explain the concept of what we were trying to do, understand whether the individuals we spoke to had the expertise to help, and get to a place where we had verbal or written agreement to work together, partnerships went on to dissolve quite easily: when those individuals changed roles, transferred departments, got assigned work that was perceived to be of greater priority, or, as happened in one case, when being part of the project was no longer institutionally feasible.

The lesson I learned is that any sort of contract can be broken when you are not the one paying an employee; their ultimate allegiance is (rightly) with their employer, since that is the group paying the bills. I’m not sure there’s a great fix for this learning, since people are mobile and will try to do what’s best navigating many variables, not only about their careers, their employers’ trajectory and priorities, but also with their personal growth journeys.

The most important thing I had to continually remind myself of was to continue to treat everyone with respect and empathy, realizing that even if somebody was leaving the project mid-way despite signing an MOU/charter/contract saying otherwise, there might be circumstances where each one of us on the team might do something very similar at some point in our careers, so it was important to continue to see people as people first, colleagues second.

Growing a community around a discrete initiative is difficult when there are few entry points (@terhas_, EW project team, TBS). Experimentation Works was developed as a way to build capacity in experimentation but also to showcase these experiments (and the learnings) in the open. This meant that public servants from federal departments who were not involved in the initiative could follow along. And they did! I got emails from departments who were interested in learning more about EW, and getting involved.

Practically, this helped us build a community around experimentation. We would invite these interested parties to our learning events and point to our blogs, but departments wanted more. They also wanted access to experts and support on their experimental projects. They wanted to be active members of this community beyond just reading blogs and attending monthly learning events.

In the end, there was a certain level of excitement and demand that we built through this initiative, but we couldn’t follow through on all of it. As a government-wide initiative, there should be more entry points and better ways of engaging observers.

Data analysis and reporting need validation (Pierre-Olivier Bédard, EW expert / experimental design support for TBS EW). Significant effort went into assessing the projects when teams applied and joined the cohort. Teams went on with their projects and regular check-ins took place. However, there were no formal validation steps when teams analyzed their data and prepared their reports. Teams completed their data collection and analyses and then sent management-approved reports to TBS.

Given that data analysis is a crucial step of any experimental project, I would have liked to be more closely involved in the data analysis. Not as the sole analyst, nor as a substitute for the project teams, but rather as an observer/participant. I think data analysis is a phase that most people see as merely technical, where one applies lines of code turning raw data into evidence. It’s in fact a rather creative phase, where options, decisions and original solutions abound. The downside is that it is also a phase fraught with risks of errors, especially for non-experts. A single misplaced comma in a line of code can radically change the outputs and results, maybe without you even noticing (it’s an old case, but Reinhart and Rogoff’s 2010 coding mistakes had considerable policy implications; these were beginner mistakes made by world-renowned experts and, what is more, caught by a student!). I would have liked to spend more time with the teams to help them craft their analysis and more directly support that last phase. Given that the project reports had already been approved by management by the time we received them, it was harder to then go back and ask for changes and revisions. Some irregularities were noticed ex post in the reports (e.g. calculation mistakes, discrepancies with the initial design plan), but I was not really in a position to ask for major revisions and changes, or even to ask to sift through the raw data.
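To make the point about silent errors concrete, here is a minimal Python sketch of how a small slip in an analysis script can change a result without raising any error. Everything in it is an assumption made for illustration: the numbers are invented (they come neither from an EW project nor from the Reinhart and Rogoff data), and the function names `average_outcome` and `check_row_count` are hypothetical. It shows an average computed over a range that stops two rows short, next to the kind of simple row-count check an independent reviewer might run.

```python
from statistics import mean

# Hypothetical outcome values for seven units (invented for illustration).
outcomes = [2.0, 3.0, -1.0, 4.0, 2.0, -3.0, -7.0]


def average_outcome(values, start=0, stop=None):
    """Average the values in rows start..stop (stop excluded, like a slice)."""
    return mean(values[start:stop])


def check_row_count(values, start=0, stop=None, expected=None):
    """Return True when the selected range covers the expected number of rows."""
    return len(values[start:stop]) == expected


# Intended analysis: every row is included.
intended = average_outcome(outcomes)

# Mistaken analysis: the range ends at row 5, so the last two rows are
# dropped. Python raises no error; the result is simply different.
mistaken = average_outcome(outcomes, stop=5)

print(f"intended mean: {intended:.2f}")
print(f"mistaken mean: {mistaken:.2f}")
print(
    "mistaken range covers all rows:",
    check_row_count(outcomes, stop=5, expected=len(outcomes)),
)
```

With these invented numbers the two averages differ noticeably, and only the explicit row-count check flags the problem. A second pair of eyes running checks like this before a report goes to management is one modest way to catch such slips.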

A possible solution would be to engage the project teams in a data analysis session at the end of the cycle to go over the analysis and data collectively. Another option could be to convene the initial project assessment panel to review the submitted data and results (again, much like a peer review process). Project teams could share their raw data (when possible; it could be minimally shared within GoC) to encourage cohort validation and independent checks. I very much believe that experimentation should be done in the open, as it was argued and designed early on, but I also value science as a common effort, where everyone benefits from cross-validation and exchange.

In a way, experimentation, much like science, is self-correcting. Just as long as there are enough checkpoints and opportunities to critically appraise each other’s work at all stages of the process.

Experimentation is intimidating, and language / vocabulary matters (@danutfm, EW project team, TBS). One of the things that continually struck me while participating in EW is how difficult it is to work with people that don’t know you, your function, or even your way of thinking based on your education and skillset. This is something we typically call ‘working in silos’ in larger organizations, but it was one of those things that really came alive for me during this project in particular.

I remember being in meetings with some project teams that were part of EW where we were simply calling each other to repeat our understanding of the very same thing we had already discussed twice if not three times. Not necessarily because it wasn’t clear the first time. It was simply a coping mechanism for a process or vocabulary that was so new that our brains needed time to become familiar with it, so that saying particular phrases out loud did not make us feel like frauds.

This resulted in some frustration on my part when it came to others calling me to discuss things that were within my comfort zone. However, after a while I realized my own hypocrisy, as I was doing it too for elements I didn’t know as well. For instance, I happened to be fairly at ease with technology / changing visual elements on a website — I can get by with .html, .css, etc. However, when it came to script randomization and more sophisticated analytics, I was the one calling on the experts and reiterating — for the 3rd and 4th time — what to them were likely very simple concepts.

That meant that for some projects, even something as simple as creating a project timeline, something fairly simple to do when you’ve done that type of project before, became an arduous task that required more versions than I’m willing to admit… This is a broader reflection, one that has come up before, including in our EW Review, namely: how do you find a balance between teaching somebody something new that is almost within their grasp, so that they don’t spin and spin alone when it is clear they won’t magically know it all of a sudden? Do you hire an expert and say this is solely your job, or do you slowly build that expertise across the team, so that they all can start doing it themselves?

The answer is likely somewhere in the middle, and will depend on whether something like experimental design expertise ever becomes a core element of any team. In that case, it likely makes sense to have that function on staff. But if it’s just a one off, having access to a joint resource, and ensuring that staff have a basic understanding of what exactly their gaps are to be able to seek and speak about the help they need, might be a better choice.


If you’ve only skimmed the sentences in bold in the stories above, you would have seen that people are complex, relationships are fragile and ever-changing, projects take unexpected turns, and hindsight is 20–20. If you wanted to go a bit deeper, we hope we’ve provided enough detail to give you a sense of our context, why things happened the way they did, and what we learned from them. In the process, we trust that some of the elements presented above have given you, our readers, an opportunity to reflect on your own contexts; we further hope that we were able to allow you to take a moment and challenge some of your own ways of operating, or maybe reaffirm them as solid.

Finally, if you find yourself in similar situations, with a cohort model made up of multi-disciplinary teams learning new skills by doing in an open and non-hierarchical model, do let us know what worked and didn’t work for you. We’ll be keen to continue to share impressions and learn together.

Post by some of the TBS team involved in EW, past and present: Pierre-Olivier Bédard, Sarah Chan, Terhas Ghebretecle, Dan Monafu.

This article is also available in French here: https://medium.com/@exp_oeuvre


