‘You Can Get It If You Really Want’: Impact Evaluation Experience of the Office of Evaluation and Oversight of the Inter-American Development Bank
Inder Jit Ruprah
This paper’s assessment of the experience of the Inter-American Development Bank’s Office of Evaluation and Oversight with impact evaluation offers lessons for best-practice methodology, including for studies facing time and budget constraints. We point out the difficulty of mainstreaming this methodology in a multi- or bi-lateral lender. Given its didactic nature, however, this assessment can be instructive to the development community, as we present solutions for rigorously evaluating programs that were not designed with such an evaluation in mind.
The international development community has been put on notice. The Center for Global Development asserts, ‘For decades development agencies have disbursed billions of dollars … Yet the shocking fact is that we have relatively little knowledge about the net impact of most of these programs.’ (Savedoff and Levine 2006; CGD 2006) The criticism is accompanied by a proposed minimum standard of knowledge: ‘To determine what works … it is necessary to collect data to estimate what would have happened without the program … [only thus] … it is possible to measure the impact that can be attributed to the specific program.’ The criticism also contained a note of despair, and it called for an independent evaluation entity to ensure rigour in the evaluation of development programs.
This paper re-examines the veracity of the asserted ‘shocking fact’ for the Inter-American Development Bank, a multi-lateral bank that lends to Latin American and Caribbean countries, and asks whether the Bank’s independent evaluation office, the Office of Evaluation and Oversight (OVE), has made any difference. The paper also contributes to the discussion of these criticisms of the international development community’s lack of evaluative rigour. It mainly documents OVE’s experience in carrying out impact evaluations, the asserted minimum standard of knowledge.
The story’s relevance, however, is not limited to other evaluation offices of multi-lateral and bilateral organisations in the development community. The challenge faced by OVE, namely the ex post evaluation of projects that neither were designed for impact evaluation nor collected outcome data, is probably the most common challenge faced by evaluators. In addition, OVE’s experience adds to the growing evidence questioning the validity of the arguments against impact evaluations. The litany of arguments normally consists of: it is too difficult; it is too expensive; too few governments will agree; and there is no institutional mandate. Thus, the challenges faced by and the experience of OVE contribute to understanding real-world approaches to impact evaluation.
II. The Context
The Office of Evaluation and Oversight (OVE) was created in mid-1999 as part of the reform of the Bank’s evaluation system. At that time OVE became independent of Bank Management, reporting solely to the Board of Executive Directors. In this redesign, the Board mandated OVE to: conduct Country Program Evaluations (CPE); conduct policy, strategy, thematic, and instrument evaluations; oversee the Bank’s internal monitoring and evaluation system; oversee reviews of corporate strategy; provide normative guidance on evaluation issues; and contribute to evaluation capacity building in the region.
OVE did not have a mandate to evaluate individual operations. Only in 2003 did OVE receive a mandate to perform ex post project evaluations (IADB 2003). Thus, rather than being put on notice, the reason OVE took on this exercise was a change in the Bank’s policy. The new policy mandated ex post project evaluations two to four years after a project closed. It said little to nothing about what to evaluate, how to select projects, or the minimum methodological standard to adopt. However, it assumed stand-alone project evaluations using a reflexive before-versus-after-completion comparison: the Bank would document the before-completion state, and OVE would be relegated to the after-completion part.
However, the policy was based on false premises. First, the Bank does not routinely collect information for before-after or before-completion naïve reflexive evaluations. Generally, there is no full statement of development outcome intent at project approval. The Bank’s system does not typically collect outcome information on on-going projects, and its evaluations are almost void of statements on development outcomes at closure (see Chart 1). While determining what works requires data to estimate what would have happened without the program, the Bank’s evaluation system is not designed to collect such data.
Second, there is an assumption that outcomes can only be discerned years after a project has closed. However, other than lumpy investment loans, many, if not most, of the Bank’s loans finance programs whose development effects can be discerned a few years into the project. Third, the policy focused on IADB projects, yet these are often embedded in larger country programs. Thus, leaving aside the contribution to a program’s design, unless the benefit and the beneficiary selection process differ between the project and the program, the focus for development effectiveness should be on the program, not the project. Finally, the policy emphasised the ‘sustainability’ of the program in fiscal and institutional terms rather than the sustainability of its development effects.
Given this context, OVE decided to implement the ex post evaluation task according to three principles. First, despite the absence of an institutional mandate, it set impact methodology as a minimum standard (Blundell and Costa 2002). Second, it conducted the impact evaluations using a theory-based approach (Fear 2007). Third, it adopted a purposeful rather than a random selection criterion for the programs to be evaluated, i.e., it selected similar projects within a thematic or meta-evaluation. OVE accepted that determining ‘what works and what does not’ requires a quantitative approach and, within the quantitative approach, accepted the emerging consensus of a hierarchy of empirical methods.
The above principles were accompanied by decisions on how to implement the evaluations. The first issue was whether to carry out the evaluations in-house or to outsource them; the decision was to experiment with modalities covering all possibilities. The second issue was how to select consultants; the decision was to create a network of evaluators. The third issue was how to involve those evaluated, i.e. Bank staff and governments; the decision was to create one peer review group drawn from the Bank’s staff and another within the country.
III. The Experience
In this section, we narrate OVE’s experience in carrying out impact evaluations. Success is judged against several benchmarks: a rigorous methodological standard, full implementation of the theory-based approach, meta-evaluations, the cost of the evaluations, the organisation of the task, and advocacy of impact techniques as a minimum standard.
If the standard of success is the use of counterfactuals to determine the impact of programs, then OVE has been successful. The Office has so far used the following impact techniques. Of the twenty-seven processed evaluations (i.e. those publicly available), the techniques used have been, in order of frequency: double difference with propensity score matching (11), single difference with propensity score matching (8), instrumental variable regression (5), and regression discontinuity (1). Sometimes, for sensitivity or robustness reasons, more than one method was used in a given evaluation. Often, naïve (i.e. before-after comparison of beneficiaries) or pipeline (i.e. a comparison group composed of applicants to a program who have not yet received the program’s benefit) techniques are also included in OVE’s impact evaluations.
In fact, the signature feature of OVE’s ex post program evaluations is the routine comparison between naïve (before-after or pipeline) and impact calculations. The purpose of the comparison is essentially to advocate to the Bank that its task is not to fully implement its existing system, based on an ex post comparison with a baseline but no comparison group, but rather to move towards a system that routinely involves impact evaluations. Chart 2 shows the naïve and impact evaluations of a Social Investment Fund in Panama, using the change in poverty as the outcome. The naïve before-after calculation shows that poverty rose amongst the beneficiaries: the program was a failure. The impact calculation shows that the program reduced poverty: the program was successful. The example illustrates the ‘you do not necessarily get what you see’ rationale for impact evaluations, and shows that impact estimates are not always smaller than naïve ones.
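The arithmetic behind this comparison can be sketched in a few lines. The figures below are purely illustrative, not the actual Panama data: they show how a naïve before-after change for beneficiaries can carry the opposite sign to the double-difference impact once the comparison group’s trajectory is netted out.

```python
# Purely illustrative (hypothetical) poverty headcounts, in per cent.
def naive_before_after(t_before, t_after):
    """Naive 'impact': the change among beneficiaries only."""
    return t_after - t_before

def double_difference(t_before, t_after, c_before, c_after):
    """Impact: beneficiaries' change net of the matched comparison group's change."""
    return (t_after - t_before) - (c_after - c_before)

t_before, t_after = 40.0, 43.0   # beneficiaries: poverty rose by 3 points
c_before, c_after = 40.0, 48.0   # matched comparison group: poverty rose by 8 points

print(naive_before_after(t_before, t_after))                    # 3.0: 'the program failed'
print(double_difference(t_before, t_after, c_before, c_after))  # -5.0: poverty fell relative to the counterfactual
```

The naïve estimate misses that poverty would have risen even more without the program; the double difference recovers the counterfactual from the comparison group.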
A priori, OVE expected to make frequent use of the regression discontinuity technique (Imbens and Lemieux 2007). High expectations were based on the assumptions that many programs had budget limits relative to the targeted population and that beneficiary selection was based on a ranking of applicants. In practice, however, OVE has found it difficult to obtain the rankings and was therefore unable to use this technique. Perhaps the problem of non-availability is due to the continuing confusion between audits and evaluations. The only example is an evaluation of a Chilean Government Research Fund, where the outcomes used were the number and quality of publications. The impact calculations reveal that the program had no significant effect on these outcomes. Chart 3 shows that the method is feasible even when the accepted/non-accepted classification of applications does not strictly follow the program’s published ranking criteria; in this case the method is fuzzy regression discontinuity. However, the argument that even fuzzy data can be used does not reduce the fear that an evaluator is really an auditor.
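The mechanics of a sharp regression discontinuity estimate can be conveyed with a small sketch on invented data; the fuzzy case used in the Chilean evaluation additionally adjusts for imperfect compliance with the cutoff, which is omitted here.

```python
# Hypothetical sharp regression-discontinuity sketch: applicants ranked by a
# score receive the program at or above a cutoff; the impact is read off as
# the jump in mean outcomes between narrow bands either side of the cutoff.
def rd_estimate(records, cutoff, bandwidth):
    """records: list of (score, outcome) pairs. Returns the difference in
    mean outcomes between the bands just above and just below the cutoff."""
    below = [y for s, y in records if cutoff - bandwidth <= s < cutoff]
    above = [y for s, y in records if cutoff <= s < cutoff + bandwidth]
    return sum(above) / len(above) - sum(below) / len(below)

# Invented applicants: the outcome rises smoothly with the score, plus a
# jump of 2.0 for treated applicants (score >= 50).
data = [(s, 0.1 * s + (2.0 if s >= 50 else 0.0)) for s in range(40, 60)]

# Note: band means include a small slope bias (here 0.5) that a local linear
# regression on each side of the cutoff would remove.
print(rd_estimate(data, cutoff=50, bandwidth=5))
```

The point of the design is that applicants just below and just above the cutoff are comparable, so the discontinuity in outcomes identifies the program effect.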
In contrast, OVE did not expect to be able to estimate impact effects from experimental data, which, being generated a priori, provide the ideal setting for unbiased impact evaluations. However, in the labour training thematic review, two randomised evaluations were feasible. One was the result of a well-thought-out evaluation design (Dominican Republic) and the other arose from a natural experiment, in which a valid control group was de facto created by an administrative cluster (Panama). Chart 4 shows the impact evaluation of the labour training program in the Dominican Republic, which used random assignment. It shows that the program was successful in terms of employability, income, and access to health insurance.
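Under random assignment the analysis is simple: the treated-control difference in means is an unbiased impact estimate, and a two-sample t-statistic gauges significance. The sketch below uses simulated data, not the Dominican Republic sample.

```python
import math
import random

def diff_in_means(treated, control):
    """Difference in means and its two-sample t-statistic (unequal variances)."""
    mt = sum(treated) / len(treated)
    mc = sum(control) / len(control)
    vt = sum((y - mt) ** 2 for y in treated) / (len(treated) - 1)
    vc = sum((y - mc) ** 2 for y in control) / (len(control) - 1)
    se = math.sqrt(vt / len(treated) + vc / len(control))
    return mt - mc, (mt - mc) / se

random.seed(0)
control = [random.gauss(100, 10) for _ in range(500)]   # e.g. monthly income, simulated
treated = [random.gauss(105, 10) for _ in range(500)]   # simulated with a true impact of 5
impact, t_stat = diff_in_means(treated, control)
print(round(impact, 1), round(t_stat, 1))
```

With randomisation, no comparison-group construction (matching, instruments, discontinuities) is needed, which is why experimental data are the benchmark of the methods hierarchy.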
The above example also shows that impact evaluations are often limited to answering whether there was a significant impact on the outcomes of interest. This is also OVE’s most common approach. However, policy concerns also include whether a greater budgetary outlay per capita increases the benefit, i.e. the dosage dimension of a program, and whether multiple treatments have a greater impact than a single treatment. Chart 5 shows the impact calculations for Chile’s regional government fund, the National Fund for Regional Development (FNDR). The transfers are mostly specific-purpose, input-based, conditional, non-matching transfers. Chart 5 shows the different impacts of increased per capita transfers: there is no additional poverty reduction above twelve times the base expenditure. It also shows, with school attendance as the outcome, that the impact is larger for diversified transfers (no one type of transfer exceeds 20 per cent of the total) than for concentrated transfers (one type of transfer, in this case education, is 50 per cent or more).
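The dosage question can be made concrete with a toy calculation. The impact figures below are invented for illustration and merely mimic the flattening pattern reported for the FNDR: each entry is an already-estimated poverty impact at a given multiple of base expenditure, and the marginal gain from each step up in dosage is computed.

```python
# Hypothetical dose-response table: multiple of base expenditure per capita
# mapped to the estimated impact on the poverty rate (percentage points),
# each impact assumed already estimated by, e.g., double difference.
impact_by_dosage = {
    1: -1.0, 4: -2.5, 8: -3.4, 12: -3.8, 16: -3.8, 20: -3.8,
}

# Marginal gain from each step up in dosage.
doses = sorted(impact_by_dosage)
marginal = {d2: impact_by_dosage[d2] - impact_by_dosage[d1]
            for d1, d2 in zip(doses, doses[1:])}
print(marginal)   # gains shrink to zero beyond twelve times base expenditure
```

Reading the marginal gains rather than the levels is what answers the policy question: past the point where they reach zero, additional per capita outlay buys no further poverty reduction.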
If the benchmark for success is the systematic testing of all the links, i.e. the assumptions, in the causality chain of a given program, then OVE’s success has been partial. This partial success is due to budget restrictions and to the frequent impossibility of retrofitting the required information.
A theory-based approach was adopted because it often lends plausibility to the impact findings. Theory- or program-based approaches map out the channels through which activities, inputs, and outputs are expected to produce the expected outcomes; they also allow for the identification of unintended effects. Such mapping helps to identify key assumptions whose empirical validity can be tested, allows an integration of contextual analysis (including process evaluation) that can account for the same program design performing differently in different settings, and possibly allows implementation failure to be distinguished from design failure. Not all these potential advantages have been fully exploited by OVE.
However, a distinction is often made between process evaluations and outcome evaluations, where impact evaluation is assumed to be useful only for determining outcomes. On the contrary, the impact technique can also be used to evaluate process. For example, community participation is often asserted to yield high dividends in outcomes relative to program delivery systems without community participation, and satisfaction surveys are often taken as a sufficient method to determine a program’s success. Chart 6 shows the impacts of community participation on the efficacy of a Social Investment Fund, measured by school attendance and grade repetition as well as community satisfaction. The evaluation shows that if ‘dividend’ is taken to mean perceptions, i.e. community satisfaction, then the assertion is correct. If dividend is taken to mean an improvement in outcomes, then it is incorrect: the impacts are statistically zero.
Impact techniques can also be used to check the validity of key design features of a program. In Latin America many governments’ social housing programs are based on the ABC (Spanish acronym for savings-grant-mortgage) design. High delinquency rates on publicly provided mortgages are often interpreted as an example of the intrinsic moral hazard of public provision, an interpretation typically based on a probit regression with a dummy for the provider. The moral hazard interpretation leads to calls to change the provider from public to private. However, when propensity score matching is used to obtain a valid comparison group (i.e. borrowers with similar relevant characteristics) and the regression is re-estimated, the provider becomes irrelevant. The real problem is incapacity to pay; hence the redesign calls for eliminating the mortgage component and correspondingly increasing the grant component. Chart 7 shows the marginal impact of mortgages provided by the public entity versus a private one. The marginal effect of public provision is a statistically significant increase in the probability of delinquency. As the right-hand side of Chart 7 shows, however, the regression is based on very dissimilar households. Using the matched data, for the support group composed of similar households that received either a private or a public mortgage, the marginal effect of the provider becomes statistically zero.
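A stylised simulation conveys why matching overturns the probit result. In the invented data below, delinquency depends only on household income, never on the provider; because public borrowers are poorer, the raw public-private gap looks like moral hazard, but it shrinks towards zero once each public borrower on the common support is matched to a private borrower of similar income (a single-covariate stand-in for propensity score matching).

```python
import random

random.seed(1)

def simulate(n, lo, hi):
    """Simulated (income, delinquent) pairs: the delinquency probability
    falls with income and does NOT depend on the provider."""
    rows = []
    for _ in range(n):
        inc = random.uniform(lo, hi)
        rows.append((inc, 1 if random.random() < max(0.0, 0.6 - inc / 1000) else 0))
    return rows

public = simulate(2000, 100, 400)    # public borrowers are poorer on average
private = simulate(2000, 200, 600)

rate = lambda rows: sum(d for _, d in rows) / len(rows)
raw_gap = rate(public) - rate(private)          # spurious 'moral hazard' gap

# Restrict to the common support and match each public borrower to the
# private borrower with the nearest income.
support = [(inc, d) for inc, d in public if 200 <= inc <= 400]
matches = [min(private, key=lambda r: abs(r[0] - inc)) for inc, _ in support]
matched_gap = rate(support) - rate(matches)

print(round(raw_gap, 3), round(matched_gap, 3))
```

The raw gap reflects the income composition of the two groups, not the provider; comparing like with like makes the provider effect vanish, as in the Chart 7 result.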
If the standard of success is the systematic evaluation of similar programs across time and space, then OVE has been relatively successful. The thematic approach, i.e. simultaneously evaluating similar programs, was adopted on the assumption that using a similar methodology, similar control variables, and a common set of outcomes would lend greater credibility to the evaluative findings for a given type of program.
The first round of meta-evaluations included Youth Labour Training Programs, Science and Technology, and Rural Roads. The second round, in the advanced production stage, includes projects drawn from the following themes: Agricultural Technology Uptake, Social Investment Funds, and Early Childhood Development programs. A third round, in the early production stage, includes Citizen Security, Animal and Plant Health Systems, and Housing Programs.
An example of a thematic evaluation is given in Table 1. A literature review of the impacts of active labour market programs in general, and job training programs in particular, finds modest results in OECD countries, and there are few, if any, evaluations of these programs in Latin America. OVE analysed the experiences, applied the most robust methodology feasible for each country, and then repeated the analysis with the same estimation technique in all the countries. The analysis concluded that there are significant impacts for particular groups, such as women and, in some cases, the youngest participants. In general, the impacts are larger for the quality of employment (i.e. formality) than for the gross employment rate.
However, what theory rarely illuminates is the dynamic path of the benefits of a given intervention; the best that can be obtained is an unambiguous statement of steady-state effects. Thus the timing of an impact evaluation may matter. Chart 8 shows the impacts on income and consumption of the Rural Road Rehabilitation program in Peru. It shows not only different impacts for motorised compared to non-motorised rural road rehabilitation, but also differing changes in those effects over time.
For example, in terms of the sustainability of benefits, the evaluation of the job training program in the Dominican Republic illustrates the importance of continuous follow-up. Chart 9 shows the impact of labour training on a given cohort over time. The short-term results (ten months after training) suggested limited impacts; with more time, however, positive impacts were detected, though these declined after a certain point.
If the benchmark of success is obtaining impact evaluations ‘on the cheap’, then OVE has been very successful. The high costs of impact evaluations are often invoked to explain away their absence. The IBRD reports a cost of US$300,000 to US$500,000 per project, adding to the fear of adopting an impact standard for evaluation (White 2006). OVE’s evaluation costs (staff time, travel, and consultants) are much lower, averaging about US$43,000.
The lower financial costs follow from three practices. First, selection bias: choosing themes or projects with a high a priori probability of existing data; OVE keeps costs down by generally avoiding primary data collection. Second, economies of scale are obtained by evaluating a number of similar interventions simultaneously. Third, costs are reduced by exploiting local expertise through local consultants in a specially created network of evaluators, EVALNET. Local consultants have prior knowledge of the context, actors, and program, which bypasses upfront learning costs, and they usually charge less than comparable evaluators from developed countries, since travel and interview costs are lower. Most importantly, the network can be used to determine where the required data are available.
However, there are quality costs to this approach. The method adopted in each evaluation was dictated by the data available, not the other way round. Using existing secondary data has all the problems of the ‘tail wagging the dog’. First, it implies an extremely high drop-out rate of about 65 per cent. Second, not all desirable outcomes, intended or unintended, can be measured. Third, it is not always possible to determine the impact of a common set of outcomes using a common set of control variables and the same estimation technique across similar projects, which is the objective of a meta-evaluation.