ASSESSING THE VALIDITY OF RANDOMIZED FIELD EXPERIMENTS
An Example from Drug Abuse Treatment Research

MICHAEL L. DENNIS
Center for Social Research and Policy Analysis Research Triangle Institute
Evaluation Review Vol 14 no 4 August 1990

AUTHOR'S NOTE: The ideas suggested in this article are based largely on the methodology developed under NIDA contract number 171-88-8230 titled, "Increasing the Capacity of Methadone Maintenance Programs (AIDS)." The assistance of Dr. Robert L. Hubbard and J. Valley Rachal is also acknowledged for encouraging me to apply recent methodological developments to the problems they have seen repeatedly in implementing randomized field experiments. Thanks also go to Dr. John Fairbank for his extensive comments on early drafts, Betty Cavanaugh and Erin Newton for editing later drafts, Donna Albrecht for the typing, and Anne Theisen for the figures. Requests for reprints should be sent to the author at the Research Triangle Institute, PO Box 12194, Research Triangle Park, NC 27709.

Randomized field experiments are often logistical failures. They are particularly difficult to implement when the intervention involves a large number of components or service providers. This article examines six potential major problems and several methodological developments related to them, and how we are attempting to deal with these problems in an experiment at the Research Triangle Institute (RTI). The article also addresses how this approach can be generalized to studies in other areas.

Randomized field experiments have been and are typically thought of as the ideal design for eliminating threats to internal validity in controlled settings and have been increasingly advocated and used to evaluate social policies and/or programs (Berk et al. 1985; Boruch, McSweeny, and Soderstrom 1978; Campbell 1969; Campbell and Stanley 1963; Coyle, Boruch, and Turner 1989; Dennis and Boruch 1989; Dennis 1988; Fairweather and Tornatzky 1977; Leighton and McKinlay 1930; Riecken et al. 1974; Simon and Devine 1940). As more randomized experiments have been conducted, however, it has become evident that they can and often do fail. The failures are often in their implementation and the consequential validity of their statistical inferences.

Although surmountable, potential methodological problems with using randomized experiments to evaluate intervention programs under field (i.e., real world) conditions must be anticipated and resolved for the experiment to succeed (Dennis 1988). Here we will examine six of these (somewhat overlapping) potential problems:

1. treatment dilution,
2. treatment contamination or confounding,
3. inaccurate caseflow and power estimates,
4. violations of the random assignment process,
5. changes in the environmental context, and
6. changes in the treatment regimens.

The first potential problem is variation in the type and amount of treatment received. The greater the amount of uncontrolled variation, the lower the statistical power and the harder it will be to interpret the results. The second potential problem is treatment contamination (e.g., counselors in different treatment regimens sharing treatment protocols), compensatory rivalry (e.g., when control group clients are treated differently from other clients and come to think the experiment is "unfair"), or a "Hawthorne effect" (e.g., providers do a better job because someone is paying attention and they "believe" the treatment should work better). The third potential problem is being unable to estimate the expected effect size and, consequently, being unable to estimate the number of units necessary to achieve a reasonable level of statistical power because there are no direct data to estimate expected caseflow. The fourth potential problem is being unable to maintain the integrity of random assignment. The fifth potential problem, especially in multiyear experiments, is uncontrollable environmental changes that affect the context in which programs and their clients operate (e.g., staff turnover, changes in local funding, changes in federal regulations). Finally, few programs are static; most are continuously evolving, and researchers find it difficult to maintain rigid experimental regimens over a long period of time.

It seems sensible to examine how the core design of a randomized experiment can be expanded to address some of these problems. This article examines each problem and how it is related to some of the significant methodological developments from the last ten years (e.g., Berk 1989; Berk et al. 1985; Boruch and Gomez 1979; Cook and Campbell 1979; Cook and Poole 1982; Flick 1988; Fraker and Maynard 1985, 1987; Howard, Krause, and Orlinsky 1986; Lipsey 1988; Reichardt and Gollob 1989; Robins 1989; Scott and Sechrest 1989; Sechrest et al. 1979; Shadish, Cook, and Houts 1986; Skovlund and Walloe 1989; Yeaton and Sechrest 1987). The examination of each problem ends with a discussion of the approach for applying some of these methodological advances to address the problem in a study being implemented at RTI (Rachal et al. 1989).1 A final section looks at how this approach might be generalized to other areas.

DEFINITIONS, QUALIFICATIONS, AND AN ILLUSTRATIVE EXAMPLE

A BRIEF DEFINITION OF RANDOMIZED FIELD EXPERIMENTS

A "randomized field experiment" here means the random allocation of some kind of unit to one of two or more regimens. The purpose of random allocation is to create groups of approximately equal compositions so that the relative effectiveness of the regimens can be fairly assessed. Units are usually allocated evenly among the regimens, but they can also be blocked or unevenly allocated to accommodate logistical constraints or programmatic concerns (e.g., the need to avoid allocating consecutive admissions to one group that might overload intake procedures or to balance geographic representation or counselor caseloads). The units can be people, counselors, clinics, days, or something else, but it is preferable that they be the primary units of the analysis. The regimens can differ by type of treatment, level of treatment (i.e., dosage), or timing of treatment (i.e., delayed treatment entry). Although one group is often designated a "control" group, this does not imply that they receive no service. More often than not, the control group receives the services that were provided before the experiment began.
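To make the idea of blocked allocation concrete, the following is a minimal sketch in Python of permuted-block assignment to two regimens. The function name, the regimen labels, the block size of four, and the figure of 200 units are illustrative assumptions, not a description of any particular study's procedure.

```python
import random


def blocked_assignment(n_units, regimens=("standard", "enhanced"),
                       block_size=4, seed=0):
    """Assign units to regimens in randomly permuted blocks.

    Every complete block contains each regimen equally often, so the
    groups stay nearly balanced at every point during intake.
    """
    if block_size % len(regimens) != 0:
        raise ValueError("block_size must be a multiple of the number of regimens")
    rng = random.Random(seed)          # fixed seed so the schedule is reproducible
    per_block = block_size // len(regimens)
    assignments = []
    while len(assignments) < n_units:
        block = list(regimens) * per_block
        rng.shuffle(block)             # random order within the block
        assignments.extend(block)
    return assignments[:n_units]


# Hypothetical intake of 200 consecutive admissions.
schedule = blocked_assignment(200)
print(schedule[:8], schedule.count("standard"), schedule.count("enhanced"))
```

With two regimens and blocks of four, no regimen can receive more than four consecutive admissions, which is one way of avoiding the intake overload mentioned above.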

This article also distinguishes the design from the setting in which the design is implemented. Many laboratory studies involve randomized experiments. The experiments that are the focus of this article, however, are studies with regular people in "real world" settings: studies designed to determine how well a new intervention or program works in the real world (i.e., relative effectiveness) rather than how well it works under ideal circumstances (i.e., efficacy). This distinction is not trivial. Interventions that seem to work when clients are selected and the intervention is implemented by highly dedicated and trained research staff often fail to produce the same effect when they are replicated under field conditions (Boruch and Gomez 1979; Dennis 1988; Dennis and Boruch 1989; Riecken et al. 1974). This article also adopts Robins's (1989) convention of describing the planned intervention as a protocol and the actual intervention as a regimen.

APPROPRIATENESS OF USING A RANDOMIZED FIELD EXPERIMENT

Most evaluations address the questions of whether a program/intervention is needed or whether a program/intervention is being provided to those who need or deserve it. The purpose of randomized field experiments, however, is to assess the relative effectiveness of two or more regimens (Fisher 1960). Even when an evaluation does focus on the question of relative effectiveness, several other threshold conditions must be met before a randomized field experiment should be contemplated.

Expanding on earlier works concerning general policy toward social experiments (Berk et al. 1985; Campbell 1969; Riecken and Boruch 1978; Riecken et al. 1974), Dennis and Boruch (1989) identified five recurring threshold conditions that should be met:

•    The present practice must need improvement.

•    The efficacy of the proposed intervention must be uncertain under field conditions.

•    There should be no simpler alternatives for evaluating the intervention.

•    The results must be potentially important for policy.

•    The design must be able to meet the ethical standards of both the researchers and the service providers.

Underlying these conditions is the notion that randomized field experiments are difficult to implement, expensive, and time-consuming for both the subjects and the researchers. They should therefore be reserved for studies of significance to social policy and not be used solely to expand general knowledge. Campbell has argued that large-scale randomized field experiments are only worthwhile when a clearly defined program or intervention is being considered for widespread dissemination (Watson 1986).

Thus this article focuses on issues that affect only a minority of evaluations: those directed at influencing social policy at critical junctures. Such studies often carry disproportionate weight in policy debates and, therefore, must meet the highest standards for evaluation research.

AN ILLUSTRATIVE EXAMPLE OF AN EXPANDED DESIGN

The rest of this article illustrates six common problems with randomized field experiments and how those problems are being addressed in an experiment currently being conducted by Rachal and colleagues (1989) at RTI.

The RTI project is one of several being conducted under NIDA's National AIDS Demonstration Research (NADR) initiative and is designed to increase the capability of methadone maintenance treatment programs to reduce intravenous drug use and, consequently, the spread of the human immunodeficiency virus (HIV). The project involves studies of both the quality of support services and the entire treatment process itself. This article will be primarily concerned with a series of randomized trials to evaluate an enhanced counseling regimen being implemented in four community-based methadone maintenance programs in Buffalo, New York; Camden, New Jersey; New Orleans, Louisiana; and Pittsburgh, Pennsylvania.

The enhanced counseling regimen is based on a behavioral model that focuses on (1) developing specific and observable short-, medium-, and long-term goals, and (2) developing problem-solving and relapse-prevention skills. At RTI we developed comprehensive training manuals and a three-day counselor training curriculum to create the local capacity to deliver this enhanced counseling treatment. The counselors will use the treatment manuals, receive regular supervisory follow-up and review visits, and receive feedback on their treatment technique as assessed from audiotaped sessions.

For approximately two years, half of the 200 new clients in each program will be randomly assigned to the existing treatment regimen, and the other half will be assigned to an enhanced treatment regimen that includes more individualized treatment, more counseling, more urine monitoring, problem-solving training, and assistance in getting other local services and jobs. To control for individual differences, counselors are also randomly assigned to regimens. The data set will include client interviews at intake and after six months, and information abstracted from client records and program service logs. The main outcome variables are program retention and illicit drug use six months after treatment entry.

The core experimental design has been expanded to include a six-month prospective baseline period containing all the same data points and a two-year retrospective baseline period that includes only the data abstracted from records. Several substudies have also been added to evaluate the extent to which the counselors are actually teaching problem-solving skills and the improvement in the problem-solving skills of the enhanced clients. The final component is several rounds of open-ended interviews with clients, practitioners, and other local stakeholders.

Table 1 summarizes the design using the Cook and Campbell (1979) lexicon in which O is an observation, Xs is the standard treatment, and Xe is the enhanced treatment. A wavy line separates groups in different time cohorts within the same program, and a dashed line separates nonequivalent comparison groups. The absence of a line between two groups implies random assignment and theoretical equivalence. Note that the data can be analyzed both as a series of small randomized trials or as a single multisite trial. In this sense, replication is built into the design.

THE PROBLEMS AND HOW TO ADDRESS THEM

ASSESSING TREATMENT INTEGRITY

Findings on the relative effectiveness of treatments are easily refuted unless each treatment is clearly defined and discretely different from others in ways that are salient in the context in which they occurred. A researcher should be able to describe, at a minimum, specific treatment regimens and, preferably, the treatment actually received by each client. Even crude client-level treatment data can be used to further stratify the treatment regimens and, consequently, increase the statistical power (e.g., Cook and Poole 1982).

The main differences between the two treatment regimens in the RTI experiment are the amount, type, intensity, and individualization of counseling, and assistance in obtaining ancillary services. The regimens also involve different levels of urine monitoring, which may indirectly lead to higher average daily dosages of methadone.

Urine monitoring and methadone dosage are fairly objective, easily quantified, and routinely recorded. Counseling, however, is rarely described well, let alone measured at the client level. The RTI treatment process design (Dennis, Fairbank, Rachal, and Bonito 1990) addresses the problem by very specifically defining the enhanced counseling regimen, and measuring the amount and nature of the counseling actually received.

We will assess the integrity of the enhanced and standard interventions in the following ways. First, we will collect data on the type and amount of intervention received by clients from multiple sources. One source is the monthly client encounter checklist (CEC) 2 that the counselor uses to record standard and enhanced interventions after each client contact. Client interviews, record abstractions, and regular follow-up visits will provide additional information on the type and amount of the intervention components. Second, we will audiotape standard and enhanced counseling sessions and ask judges who are blind to the counselor/client condition to categorize the content of the intervention by listening to the tape. This will provide us with both a blind assessment of content and the opportunity to calculate the interrater reliability of the categories. Third, we will interview a sample of clients from each treatment group shortly after admission and again at three and six months after the initiation of treatment to determine whether the enhanced counseling sessions improved clients' ability to engage in problem-solving behavior, a key intermediary step that underlies a major hypothesis of the study. During these interviews, clients will be presented with two scenarios depicting significant life problems. One will be a drug-related scenario that would likely lead to the actor's relapse. The other will show a non-drug-related life event that would be likely to create considerable distress and might trigger a relapse episode. The client will be asked to identify the problem in the situation, how the actor is likely to respond, what the consequences of those actions would be, what alternative actions could be taken, and what the consequences of the alternative actions will be.

These audiotaped interviews will be rated independently by judges who are blind both to treatment condition and interview order. This will allow us to compare the problem-solving skills of the two groups, controlling for initial differences, and to calculate the interrater reliability of the rating scheme. We will also ask the client's counselor to rate the client's problem-solving skills in order to compare the ratings of the counselor and the independent raters.
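The text does not name a particular reliability statistic. As one possibility, the sketch below computes Cohen's kappa, a widely used chance-corrected agreement index for two judges assigning categories. The category labels and the eight ratings are hypothetical and serve only to show the calculation.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Chance-corrected agreement between two judges on the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both judges must rate the same non-empty set of items")
    n = len(ratings_a)
    # Proportion of items on which the judges agree.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Agreement expected by chance from each judge's category frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical category
    return (observed - expected) / (1 - expected)


# Hypothetical content categories assigned to eight taped sessions.
judge_a = ["goal", "goal", "skill", "skill", "other", "goal", "skill", "other"]
judge_b = ["goal", "skill", "skill", "skill", "other", "goal", "skill", "goal"]
kappa = cohens_kappa(judge_a, judge_b)
print(round(kappa, 2))
```

The same function could be applied to a counselor's ratings and an independent rater's ratings, provided both use the same categories.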

MEASURING TREATMENT CONTAMINATION, COMPENSATORY RIVALRY, AND HAWTHORNE EFFECTS

One of the most difficult aspects of experimenting with social interventions is controlling the treatment. One problem is intentionally or inadvertently giving some of the experimental treatment innovations to the control group (i.e., treatment contamination). Treatment contamination might occur if an enhanced counselor shared information about how to develop a more effective treatment plan. This would cause the standard group to get some of the experimental treatment and, possibly, reduce differences between the two groups.

Another problem is the occurrence of some other unplanned intervention or activity expressly meant to reduce any disparity (i.e., compensatory rivalry) in treatment or outcome between the treatment groups. Compensating the control group in some way (e.g., treatment scholarships) for their exclusion from the experimental group would confound the treatment effects.

A third problem is the inflation of the estimates of the treatment's impact (i.e., a Hawthorne effect) by overenthusiastic treatment or research staff. Several researchers (Boruch and Gomez 1979; Fairweather and Tornatzky 1977; Riecken et al. 1974) have commented that randomized experiments often overestimate the impact of implementing an intervention programwide. One reason is a lack of well-trained staff. Other reasons include lower quality control in the nonexperimental programs and/or Hawthorne effects in the experimental group.

The randomized experiment used to evaluate the North Carolina Community Penalties Act illustrates the significance of these problems. Wallace (1987) found that prosecutors felt that the experimental program unfairly helped the public defender and so refused to plea bargain any case that had been assigned to the program. This directly confounded the experimental intervention and produced a presumably competing effect on sentence length, the study's main outcome variable. This problem had not been anticipated and was discovered only because Wallace was concurrently conducting interviews with local officials to monitor the local and organizational environments in which the experiment was taking place.

The traditional method of ruling out threats to internal validity such as these has been to show that there is no statistical difference between the two randomly assigned groups on some third dimension (e.g., type of crime, pattern of drug use, client treatment costs). Reichardt and Gollob (1989) think this convention is potentially deceptive and inefficient. It is potentially deceptive because (1) several small nonsignificant differences may add up to an important difference, (2) statistical significance is not necessarily the same as practical significance, and (3) there is too much qualitative judgment involved in deciding how big an effect must be to be of practical significance. Statistical significance is inefficient because it does not take information about any observed biases into account in estimating effect sizes and does not give guidance on what to do when there is a significant bias. Reichardt and Gollob (1989) recommend using an "estimate and subtract" method even when an observable bias is not "statistically" significant.

In the RTI study, we are addressing the above problems by using pre- and postexperimental client cohorts as quasi-experimental control groups at the client level that will allow us to estimate the bias multivariately. At the regimen level, we can also analyze these data as a repeated-measures design or a time-series design.

For the outcome measures already in the existing client records (e.g., retention), we will compare the experimental baseline period with the retrospective baseline client record abstractions for each program. We can estimate any measurement error (i.e., changes due to increased attention to the point in treatment when many drop out) and/or any unplanned selectivity (i.e., selection that might affect the generalizability of the results) by comparing changes in these two regression lines. Figure 1 uses fictitious data to illustrate a small measurement effect between the retrospective and prospective baseline periods.

We can estimate any treatment contamination or compensatory rivalry between the treatment groups by comparing the baseline period, pilot period, and standard group performance. If there is no contamination in the randomized experiment, we expect the regression line of the control group to follow the regression line of the baseline period. This is illustrated in the first two years of Figure 1. The level of contamination can be estimated by comparing any discontinuity in the regression lines. Figure 2 uses the same fictitious data to illustrate the effects of some contamination and/or compensatory rivalry. Figure 3 illustrates a main effect confounded with a dilution of the standard treatment. If the experimental group were to get the best counselors and the control group were to get inferior counselors, for instance, the control group's performance might be worse than it would be in nonexperimental conditions, exaggerating the "true" treatment effect over the traditional approach.
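The comparison of regression lines can be sketched as follows: fit a straight line to the baseline period, extrapolate it into the experimental period, and average the gap between the standard group's observed outcomes and that projection. This is only one simple way of quantifying a discontinuity; the function names, the quarterly time unit, and the retention figures are fictitious assumptions chosen for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def discontinuity(baseline, later):
    """Mean gap between later observations and the projected baseline line.

    Both arguments are lists of (time, outcome) pairs.
    """
    base_x, base_y = zip(*baseline)
    slope, intercept = fit_line(base_x, base_y)
    gaps = [y - (intercept + slope * x) for x, y in later]
    return sum(gaps) / len(gaps)


# Fictitious six-month retention (%) by quarter.
baseline = [(q, 50 + 0.5 * q) for q in range(1, 9)]        # quarters 1-8
standard = [(q, 50 + 0.5 * q + 3) for q in range(9, 13)]   # quarters 9-12
estimated_gap = discontinuity(baseline, standard)
print(round(estimated_gap, 2))
```

Under this reading, a gap near zero is what we would expect with no contamination; a positive gap in the standard group would be consistent with contamination or compensatory rivalry, and a negative gap with dilution of the standard treatment. A fuller analysis would also attach a standard error to the gap.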

Finally, we can partition out the impact of any Hawthorne effect by comparing the outcomes during the experimental period with the outcomes during the postexperimental period, during which the intervention is implemented programwide. This is illustrated in Figure 4. Note that we must still measure the process and treatment during the postexperimental period to rule out the other problems.

ESTIMATING CASEFLOW, EFFECT SIZES, AND POWER

Few programs have the numbers of subjects required in a randomized field experiment. Even fewer can accurately estimate the caseflow of appropriate subjects. For researchers, one of the most difficult parts of planning a randomized field experiment, then, is finding programs with sufficient numbers of subjects over a reasonable amount of time to meet the desired confidence level, expected effect size, and desired level of statistical power.

Dennis (1988) analyzed the implementation of thirty randomized field experiments used in criminal or civil justice evaluation studies. Several principal investigators whom he interviewed had contacted as many as 100 programs to find one that was both willing to participate and able to provide the requisite number of clients. Caseflow estimates had to be adjusted in fifteen of the twenty-eight programs (54%) that took in their clients over time (as opposed to using batch randomization of an existing client pool). Of those fifteen programs, the estimates had to be adjusted down in thirteen and up in two. The average decrease was 37%, but the adjustments ranged from -5% to -67%. The two that had higher-than-expected caseflows increased by only .5% and 16%.

The caseflows of only two of the eight experiments that based their caseflow estimates on either pilot or caseflow studies had to be adjusted down (by -20% and -40%). Although these two studies had been nominally as extensive as the other six with caseflow studies, they had used retrospective data. Based on this and similar evidence, Dennis (1988) and others (e.g., Berk et al. 1985; Boruch and Wothke 1985; Riecken et al. 1974) have recommended that randomized experiments be preceded by baseline or pilot periods to reassess initial estimates.
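The practical consequence of an overestimated caseflow can be shown with simple arithmetic. The sketch below assumes a hypothetical plan of 200 clients over 24 months and applies the 37% average decrease reported above; the function names and the planning figures are assumptions for illustration only.

```python
def months_to_reach_target(target_clients, planned_monthly_intake, shortfall):
    """Months needed to enroll the target when intake falls short of plan.

    shortfall is a fraction of planned intake (0.37 means 37% fewer clients).
    """
    if not 0 <= shortfall < 1:
        raise ValueError("shortfall must be in [0, 1)")
    actual_monthly_intake = planned_monthly_intake * (1 - shortfall)
    return target_clients / actual_monthly_intake


def clients_enrolled(planned_monthly_intake, months, shortfall):
    """Clients actually enrolled if the intake window is held fixed."""
    return planned_monthly_intake * (1 - shortfall) * months


planned_rate = 200 / 24  # hypothetical plan: 200 clients in 24 months
months_needed = months_to_reach_target(200, planned_rate, 0.37)
enrolled_in_window = clients_enrolled(planned_rate, 24, 0.37)
print(round(months_needed, 1), round(enrolled_in_window))
```

Under these assumptions the researcher must either extend intake by more than a year or accept a sample roughly one-third smaller than planned, which is the trade-off a baseline or pilot period is meant to expose early.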

Accurate caseflow estimates are important because they allow optimal allocation of study resources and staffing and ensure that the study will have sufficient statistical power. The first point is common sense, so we will focus on the second here.

Lipsey's (1988) analysis of the rate of Type II error in research literature (i.e., failing to establish statistical significance when there was a real difference) found sample size to be a primary determinant. Fifty-five percent of over 1,400 studies for which a meta-analysis found an effect size of .25 or greater failed to detect the difference statistically because of their low statistical power. The average sample size of these studies with Type II errors was about forty per condition. Stated another way, that sample size would be able to reliably detect (power=.90) only effect sizes of .75 or larger. Note that in his meta-analysis of thirty-nine meta-analyses from social science and medical literature, Lipsey (1988) found a grand mean effect size of only .45.
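The relationship among sample size, effect size, and power can be illustrated with a standard normal approximation for a two-sided comparison of two group means. This is a sketch under assumptions of our own (equal group sizes, a significance level of .05, standardized effect sizes, and a normal rather than t reference distribution); it is not a reconstruction of the calculations in the studies cited.

```python
from math import ceil, sqrt
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, standard deviation 1


def power_two_groups(effect_size, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-group comparison of means."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    shift = effect_size * sqrt(n_per_group / 2)
    # Probability of landing in either rejection region.
    return STD_NORMAL.cdf(shift - z_crit) + STD_NORMAL.cdf(-shift - z_crit)


def n_per_group_needed(effect_size, power=0.90, alpha=0.05):
    """Approximate units per group needed to reach the requested power."""
    z_crit = STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_crit + z_power) / effect_size) ** 2)


for d in (0.25, 0.45, 0.75):
    print(d, round(power_two_groups(d, 40), 2), n_per_group_needed(d))
```

Under these assumptions, forty units per group yields power of roughly .2 for an effect size of .25 and a little over .9 for an effect size of .75, which is in line with the figures above. An effect size of .45 would call for somewhat more than 100 units per group to reach power of .90.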

The proposed baseline and pilot periods in the RTI experiment will give us real data that are appropriate for accurately recalculating the caseflow and statistical power estimates within each project. The scope and time frame of the experiment can then be refined to ensure full implementation and sufficient statistical power on the core comparisons. Accurate caseflow data, control of treatment variation, and multiple measures should improve the statistical power and the capacity to detect program effects.

One control group in the RTI design is a between-client, randomly assigned control group. The other is a within-treatment regime, program-level control group in which client treatment groups serve as their own controls. Because each client treatment group is its own control for programwide intervention, and because client-level data for this program-level analysis are being combined, there is considerably less variation along the dimensions with repeated measures. A tenfold decrease in both the within- and between-group variance is not uncommon (Winer 1971; e.g., program retention in the previous months is a good predictor of program retention in subsequent months). Although combining client data groups reduces the number of units, the number needed for a statistically powerful analysis is much smaller in a repeated-measures design at the program level.

MAINTAINING THE INTEGRITY OF RANDOM ASSIGNMENT

The process of random assignment is rarely as neat in field experiments as in laboratory experiments. Legitimate reasons unexpectedly arise that make it necessary to override random assignment. Imagine that a client becomes part of an experiment to test the effectiveness of clonidine for opiate withdrawal symptoms. It soon becomes apparent that the client cannot maintain a constant blood pressure using clonidine, originally a drug for hypertension. This is a legitimate reason to override random assignment (Boruch, Dennis, and Cecil forthcoming; Boruch 1987; Federal Judicial Center 1981). Other situations, however, are less clear and create analytical problems. One of the earliest randomized field experiments, the Lanarkshire Milk Experiment (Leighton and McKinlay 1930), was quickly thrown into disrepute (Student 1931; Fisher and Bartlett 1931) because teachers had frequently overridden random assignment and given the milk to the most "needy" school children. In another study, nurses subverted random assignment and the treatment process in an experiment to determine the effect of oxygen-enriched air on the development of retrolental fibroplasia (a form of blindness) and mortality in premature infants (Silverman 1977). The nurses covertly gave the higher concentration of oxygen-enriched air (the existing standard treatment) to the group of babies being treated with a lower concentration of oxygen. The experiment had to be repeated because both groups of babies received approximately the same treatment.

Embedding the experiment in a quasi-experimental design helps to maintain the integrity of random assignment. We can estimate the impact of the experimental intervention through the time series design even if there is some contamination (as in Figure 2) or even full contamination. If randomization fails but there are still discrete groups with differential treatments, we still have a very strong quasi-experiment (i.e., a time series with an embedded nonequivalent comparison group or workhorse design). The strongest possible time-series design is achieved when the randomized experiment is fully implemented because we can formulate a specific alternative hypothesis that one group should change and the other should not. This design is potentially stronger for detecting confounding variables.

CHANGES IN THE ENVIRONMENTAL CONTEXT

Having more subjects gives most research designs greater statistical power. More subjects usually means more time, however, and the longer an experiment takes, the more likely it is that the environment in which the experiment is occurring will change. Some changes may be easily surmountable (e.g., staff turnover); others may significantly alter the treatment regimens (e.g., changes in federal regulations) or even stop the experiment (e.g., changes in funding, changes in the chief administrator).

Given the amount of money currently being spent on drug abuse treatment, it is likely that programs will want to and probably should continue to grow during the course of the experiment. Programs wanting to implement a new treatment component have basically two desirable alternatives: (1) introduce the new component programwide (i.e., make it orthogonal to the experimental group), or (2) increase the level of treatment contrast. A new component changes the environmental context. Increasing the level of treatment changes the treatment.

Because our knowledge of AIDS and transmission-related behaviors is changing, the National Research Council (Coyle et al. 1989) has stated that "it is inappropriate to view [AIDS-related] program design, implementation, and evaluation as a short-term or one-time event." Because the reduction of high-risk behaviors (e.g., drug use, needle sharing) is a major goal of the RTI experiment, it is likely that the treatment regimens will be revised during the experiment either to address ethical concerns (i.e., raising the standard treatment) or to take advantage of recent advances (i.e., revising the enhanced treatment).

Changes in any part of the treatment protocol are generally considered a problem in traditional randomized experiments because they increase the treatment variation within groups. However, treatment programs can and often should respond to changes in the local environment. The RTI design minimizes the impact of these changes and takes advantage of them for estimating treatment interaction effects. By implementing changes programwide and at a fixed point in time, we can stratify the sample into a program-level, repeated-measures design to reduce the within-group variation. The effect of the programwide changes can be examined both in the repeated-measures design and separately in several time-series designs. If there is a main effect of the new programwide intervention, labeled B in the figures, there should be a parallel effect on both groups in the time series design. This is illustrated in Year 3 of Figures 1 through 4.

Note that the addition of repeated measures adds new information and, consequently, increases statistical power. The repeated measures do not compromise the underlying experimental design. Although less efficient in terms of the number of observations, a repeated-measures design is more effective in terms of the numbers of clients or client groups (because there are multiple observations on clients and client groups). Because we are getting repeated measures on programs, not clients, we expect little in the way of order effects. More important, this design gives us a way to incorporate changes in the treatment environment into the analysis. Many of the changes may have been necessary (e.g., changes in legal regulation) and may be beyond the control of the researcher.

In the RTI experiment we are concerned about two problems related to major types of programwide changes. The first problem is the increasing availability of funds to improve the basic services. The second problem is our own effort to increase caseflow and to work with federally funded outreach efforts to intravenous drug users. Although desirable, both increased funding and more clients may change the treatment delivered, the nature of the study sample, and client retention (see Flick 1988 and Howard et al. 1986 for extensive discussions of selectivity, preinclusion attrition, and postinclusion attrition). We can factor out the variance associated with these changes by carefully documenting them and ensuring that they are not confounded with the experimental conditions using a time series, multivariate, or repeated-measures analysis-of-variance model. These analyses will also allow us to examine the potential interaction effects between the enhanced treatment and the standard services.

CHANGES IN THE TREATMENT REGIMENS

In some cases, it may be desirable to more finely distinguish treatment regimens as more information and/or resources become available. In other cases, the ethics or feasibility of continuing a component of a treatment regimen may become questionable during the course of the experiment. In any of these instances, treatment regimens can be changed by refining the treatment protocol or adding or removing treatment components (for one or both groups). The result will be a planned increase in the variation of the actual treatment received.

Researchers often find ways of improving and/or further delineating the treatment regimens during the course of multiyear experiments. Figure 5 illustrates how this might look if some new component, labeled E, were given only to the experimental clients during the third year of the experiment. Dennis and Boruch (1989) recommend including an interim analysis to address several questions such as whether to refine the interventions. An interim analysis also allows early feedback to policymakers (cf. Population Council 1986) who may need to make decisions immediately that affect the utility of the final results (e.g., changes in the law or program funding that might otherwise eliminate the feasibility of disseminating a particular intervention).

The pilot stage of the RTI design tries to focus most of the enhanced treatment changes in the early months of the experiment. A planned interim analysis may suggest revisions. If no revisions are made, these treatment regimens can be included in the main data base. If revisions are made in only one treatment regimen during the subsequent years of the experiment, the treatment regimen can be partitioned into a standard treatment group, an early enhanced treatment group, and a late enhanced treatment group. With two degrees of freedom between groups, comparisons can be made (1) between the standard treatment group and the combined enhanced treatment groups, and (2) between the standard treatment group and the late enhanced treatment group. The first contrast will have more clients but is likely to produce a smaller effect. The second contrast will presumably have the largest difference under the alternative hypothesis.
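The two planned contrasts can be sketched as follows. This is a minimal illustration rather than the RTI analysis plan: the outcome scores are invented, the function name `planned_contrast` is hypothetical, and the combined-enhanced contrast is assumed to weight the early and late enhanced groups by their sample sizes.

```python
import numpy as np

def planned_contrast(groups, weights):
    """Return (estimate, t statistic, error df) for one planned contrast.

    groups  : list of 1-D sequences of outcome scores, one per group
    weights : contrast weights, one per group; they must sum to zero
    """
    weights = np.asarray(weights, dtype=float)
    if abs(weights.sum()) > 1e-12:
        raise ValueError("contrast weights must sum to zero")
    groups = [np.asarray(g, dtype=float) for g in groups]
    means = np.array([g.mean() for g in groups])
    sizes = np.array([g.size for g in groups])
    # Pooled within-group variance (the one-way ANOVA error mean square).
    ss_within = sum(((g - m) ** 2).sum() for g, m in zip(groups, means))
    df_within = int(sizes.sum() - len(groups))
    ms_within = ss_within / df_within
    estimate = float(weights @ means)
    std_error = float(np.sqrt(ms_within * np.sum(weights ** 2 / sizes)))
    return estimate, estimate / std_error, df_within

# Invented outcome scores, for illustration only.
standard = [1.0, 2.0, 3.0]
early_enhanced = [2.0, 3.0, 4.0]
late_enhanced = [4.0, 5.0, 6.0]
groups = [standard, early_enhanced, late_enhanced]

# Contrast 1: standard versus the combined enhanced groups
# (assumption: early and late groups are weighted by sample size).
n_early, n_late = len(early_enhanced), len(late_enhanced)
share_early = n_early / (n_early + n_late)
combined = planned_contrast(groups, [-1.0, share_early, 1.0 - share_early])

# Contrast 2: standard versus the late enhanced group only.
late_only = planned_contrast(groups, [-1.0, 0.0, 1.0])

print("combined enhanced vs. standard:", combined)
print("late enhanced vs. standard:    ", late_only)
```

In this invented data set the second contrast shows the larger difference, while the first draws on more enhanced clients, which mirrors the trade-off described above.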

In the RTI experiment, we have already experienced an unexpected event that we are having to examine in this way. For reasons unrelated to the experiment, all of the enhanced counselors in a single site were lost. However, because of individual differences in the counselors, the introduction of a new cohort of counselors in one regimen may not be a major change in that regimen. We have also considerably refined our notion of counseling sessions into a formal treatment model with additional training. We will control for the changes both in counselors and in the definition of a counseling session in the final analysis.

PUTTING IT ALL TOGETHER

REPRISE

The problems discussed here are common and may be detrimental to an experiment's validity. Their existence does not necessarily invalidate the study's findings but does necessitate changes in how the data are interpreted. Developing designs and implementation strategies to deal with these problems is crucial to extending the usefulness of randomized field experiments to evaluate social policies and experiments of the kind proposed by Berk and colleagues (1985), Campbell (1969), Coyle et al. (1989), Dennis and Boruch (1989), and Riecken and colleagues (1974).

To reiterate briefly, the potential problems include:

• treatment dilution,
• treatment contamination or confounding,
• inaccurate caseflow or power estimates,
• violations of the random assignment process,
• changes in the environmental context, and
• changes in the treatment regimens.

Several methods for addressing these problems were suggested in the preceding discussions. Measurement and strategies for analyzing the impact of a problem and adjusting for any bias are issues in these problem-solving methodologies. The goal of many of these methods is adequately measuring rather than disproving bias. Although we should continue to try to improve the quality of randomized field experiments, we should also (1) acknowledge that field research is unlikely to ever be ideally implemented and (2) develop techniques to address the problems and improve the quality of our estimates (cf. Reichardt and Gollob 1989). Below is a summary of the three types of methods that have been suggested. A final section addresses the reporting of implementation problems in the literature.

WHAT IS BEING MEASURED

An expanded paradigm of what needs to be measured is required to adequately assess the preceding problems. Traditional randomized field experiments have primarily described what was "supposed" to have been done and the observed differences in the outcome measures. Fairweather and Tornatzky (1977) set forth standards that few experiments have met: (1) rich descriptions of the actual treatment (e.g., narratives, pictures, case studies); (2) validation of the constructs underlying the intervention (e.g., the active ingredients implied by the treatment label); and (3) integration of more qualitative data in general. Researchers must use four types of measures (process, outcome, context, and construct) to meet these standards and ensure that the problems can be resolved.

Process measures such as frequency, timing, length, and intensity of the contact are necessary to determine whether the experimental interventions were implemented and delivered to the appropriate people or units. Ideally, they allow the calculation of something analogous to the intervention's dosage. Outcome measures are necessary to assess the impact of the regimens. Often called dependent variables, these measures have traditionally been those that are expected to change under the alternative hypothesis.
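As a rough sketch of how process measures might be rolled up into a dosage-like summary, the example below totals contact frequency and length per client. The record layout (`Contact`), the function name `dosage_summary`, and the choice of contacts per week and contact hours as the dosage analogue are all assumptions made for illustration; the article does not prescribe a formula.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Contact:
    client_id: str
    week: int       # study week in which the contact occurred
    minutes: float  # length of the contact

def dosage_summary(contacts, weeks_enrolled):
    """Summarize contact frequency and length for each enrolled client.

    contacts       : iterable of Contact records
    weeks_enrolled : dict mapping client_id to number of weeks in the study
    Clients with no logged contacts are still reported, with zero dosage.
    """
    totals = defaultdict(lambda: {"contacts": 0, "minutes": 0.0})
    for contact in contacts:
        totals[contact.client_id]["contacts"] += 1
        totals[contact.client_id]["minutes"] += contact.minutes
    summary = {}
    for client_id, weeks in weeks_enrolled.items():
        client_total = totals[client_id]
        hours = client_total["minutes"] / 60.0
        summary[client_id] = {
            "contacts_per_week": client_total["contacts"] / weeks,
            "hours_total": hours,
            "hours_per_week": hours / weeks,
        }
    return summary

# Invented contact log, for illustration only.
log = [Contact("c1", 1, 60), Contact("c1", 2, 30), Contact("c2", 1, 45)]
summary = dosage_summary(log, {"c1": 2, "c2": 3, "c3": 4})
for client_id, row in summary.items():
    print(client_id, row)
```

Reporting clients with zero logged contacts keeps the summary usable for both regimens, so the treatment actually received by each group can be compared on the same scale.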

Thanks to the works of Sechrest and Redner (1979) and others (Scott and Sechrest 1989; Sechrest et al. 1979), many of the more recent social policy experiments are attempting to determine treatment integrity and dosage (cf. Cook and Poole 1982; Dennis 1988). One of the key findings of these efforts has been that the control group often gets more treatment than the experimenter expected, necessitating the measurement of the "control" treatment as well as the experimental treatment.

In the natural sciences, the altitude, humidity, and temperature of the setting in which a study was conducted must be described. Are the locations of studies within human organizations any less relevant? What appear to be great differences in treatments on paper may not be salient in a particular context. It is therefore important to describe an experiment's context in such terms as the population with which the experiment was conducted, competing programs that are available, and the extent to which other agencies change their practices to compete with or support the experiment.

The demographics of an experiment's client sample are important, but often less obvious screening criteria make the sample unique. Prevalence estimates of psychiatric disorders among drug abuse treatment clients, for instance, vary considerably depending on the type of programs from which the clients are drawn and who is screened out (e.g., Jainchill, DeLeon, and Pinkham 1986 vs. Rounsaville et al. 1983). Although traditional research has focused on internal validity, selection and screening procedures must be explicitly stated because ill-defined and poorly executed procedures can drastically reduce the external validity (i.e., generalizability) of findings. This issue is more important in randomized field experiments than in traditional studies because randomized field experiments are meant to assess relative effectiveness in the real world rather than potential efficacy. Similarly, we need to identify other events that are confounded with the experimental contrast because these events may alter our interpretation of the results.

Construct measures are used to validate the concept underlying the experimental intervention (i.e., disaggregating the key ingredients). These construct measures can be early outcome measures, measures on a dimension related to the intervention, or process/outcome measures that should not change when the planned intervention is properly executed.

In a scientifically ideal situation, constructs are validated in a converging series of experiments. Policymakers often cannot wait for the results of a series of studies (cf. Saxe and Fine 1981), however, and must rely on other alternatives, such as several concurrent experiments, multiple measures that address competing hypotheses, and combining experiments with time-series or other quasi-experimental designs. Multiple experiments avoid the problem of convincing policymakers that a study should be replicated. Multiple measures answer the increasing call for critical multiplism in evaluation (Cook 1985; Cook and Reichardt 1979; Shadish et al. 1986). Quasi-experimental designs often are sufficient for secondary questions despite their potential for bias. Thus the ideal design for a field experiment aimed at changing social policy will probably involve multiple sites and multiple methods (Louis 1982).

Analysis of implementation problems will vary considerably according to the method that was used to measure the problems. At first glance, one might think causal modeling could integrate this new information. However, modeling is best reserved for well-developed theories or simple phenomena. It took Robins (1989) over forty pages, half of which involved Greek formulas, to present a structural model that integrated a patient's decision whether or not to take a single drug, AZT, into a clinical trial to determine its effectiveness. Our current ability to model the impact of several dimensions of treatment dosage is questionable. Most of the analyses of implementation proposed here, therefore, involve little more than a simple t test or traditional analysis of variance. Other analyses rely on the analysis of multivariate and graphical data. Here we will briefly review two of the less common analyses that hold considerable promise: interrupted time series (ITS) and repeated-measures analysis of variance.

An ITS design (Box and Jenkins 1976) involves plotting each group and each project in a fashion similar to Figures 1 through 5. The program-level effect would be estimated according to differences in the regression lines for the various groups over the various time periods. Additional checks would be made to rule out other potential threats to internal validity (for a further discussion see Cook and Campbell 1979). Minimally, these would include comparing (1) the retrospective baseline data with data for the periods covered by the research to check for changes in measurement or the impact of selectivity and (2) the prospective baseline data with data for the standard treatment to assess the levels of contamination and compensatory rivalry.

In the RTI experiment, the strongest comparison can be made with the enhanced treatment components that are given only to the randomly assigned, enhanced treatment group. Under the alternative hypothesis in the ITS, there should be a change in the enhanced group but not in the standard group.
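The logic of comparing regression lines across groups and periods can be sketched with a simple segmented regression. This is only an illustration under strong assumptions: the monthly series are invented and noise-free, both groups share one linear trend, the intervention produces a level shift only, and errors are treated as independent. A full ITS analysis of the Box and Jenkins kind would also model autocorrelation. The function name `fit_its` and the coefficient labels are hypothetical.

```python
import numpy as np

def fit_its(time, post, enhanced, outcome):
    """Ordinary least squares fit of a two-group interrupted time series.

    Model: outcome = intercept + trend*time + post_shift*post
                     + enhanced_gap*enhanced + enhanced_post_shift*(enhanced*post)
    post     : 1 for observations after the change, else 0
    enhanced : 1 for the enhanced group, 0 for the standard group
    """
    time = np.asarray(time, dtype=float)
    post = np.asarray(post, dtype=float)
    enhanced = np.asarray(enhanced, dtype=float)
    design = np.column_stack(
        [np.ones_like(time), time, post, enhanced, enhanced * post]
    )
    coef, *_ = np.linalg.lstsq(design, np.asarray(outcome, dtype=float), rcond=None)
    names = ["intercept", "trend", "post_shift", "enhanced_gap", "enhanced_post_shift"]
    return dict(zip(names, coef))

# Invented, noise-free monthly series: a new component is added to the
# enhanced regimen only, starting in month 6.
months = np.arange(12)
post = (months >= 6).astype(float)
standard_y = 10.0 + 0.5 * months
enhanced_y = 12.0 + 0.5 * months + 3.0 * post

fit = fit_its(
    time=np.concatenate([months, months]),
    post=np.concatenate([post, post]),
    enhanced=np.concatenate([np.zeros(12), np.ones(12)]),
    outcome=np.concatenate([standard_y, enhanced_y]),
)
for name, value in fit.items():
    print(f"{name}: {value:.3f}")
```

In this parameterization a programwide change would appear as a nonzero `post_shift` shared by both groups (the parallel effect), whereas a change confined to the enhanced regimen appears in `enhanced_post_shift` with `post_shift` near zero.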

A repeated-measures analysis of variance allows us to analyze the data as a single-factor analysis of variance, with multiple levels of the treatment and the control condition. If the components of the enhanced regimen remain the same, but some other intervention is implemented programwide (e.g., opening a day-care or job placement program), there are two levels of the standard regimen. (Recall that enhanced clients get the standard regimen components plus the enhanced regimen components.) If something is added to the enhanced regimen only, then there are two levels of the enhanced regimen. Table 2 illustrates this, assuming there is one uncontrollable historical change that occurs programwide (labeled B) and that the enhanced treatment is expanded in the later stages of the experiment (labeled C).

There will be substantial reductions in the within-group variance of the repeated measures used on clients, client cohorts, or programs. As a result, the within-group term in the denominator must incorporate the constant correlation between multiple observations ($\rho$) on the same unit (i.e., the same client, group, or program) and the number of different treatment periods under which measurements were made ($q$). For the treatment component that is randomly assigned, the error term is $\sigma_\varepsilon^2[1 + (q - 1)\rho]$, and the error term for treatment components on which repeated measures have been collected, including any interaction terms, is $\sigma_\varepsilon^2(1 - \rho)$.
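A small numerical sketch may help show how the two error terms behave. It assumes the standard constant-correlation (compound symmetry) results, in which the error variance for the randomly assigned component is the common variance times 1 + (q - 1) times the correlation, and the error variance for repeated-measures components is the common variance times 1 minus the correlation. The function name and the input values are hypothetical.

```python
def repeated_measures_error_terms(sigma2, rho, q):
    """Return (between_unit_error, within_unit_error) under constant correlation.

    sigma2 : common variance of a single observation
    rho    : constant correlation between observations on the same unit
    q      : number of treatment periods measured on each unit (q >= 2)
    """
    if q < 2:
        raise ValueError("q must be at least 2 for a repeated-measures design")
    if not (-1.0 / (q - 1) <= rho <= 1.0):
        raise ValueError("rho is outside the range allowed for q periods")
    between_unit_error = sigma2 * (1.0 + (q - 1) * rho)
    within_unit_error = sigma2 * (1.0 - rho)
    return between_unit_error, within_unit_error

# Invented values: unit variance, three periods, two correlation levels.
low_corr = repeated_measures_error_terms(sigma2=1.0, rho=0.2, q=3)
high_corr = repeated_measures_error_terms(sigma2=1.0, rho=0.5, q=3)
print("rho = 0.2:", low_corr)
print("rho = 0.5:", high_corr)
```

As the correlation between observations on the same unit rises, the error term for the repeated-measures components shrinks, which is the source of the power gain described above, while the error term for the randomly assigned component grows.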

The linear model upon which the analysis would be based dates back to the methods developed by Cornfield and Tukey (1956) and assumes homogeneity in the underlying covariance matrix and the implausibility of order effects (e.g., practice, fatigue, transfer, or training). To the extent that the covariance matrix is not homogeneous, such linear models will be too liberal. Greenhouse and Geisser (1959) called for a more conservative procedure that basically reduces the within-group degrees of freedom to the number of independent clients. Monte Carlo studies, however, indicate that the Greenhouse and Geisser procedure is too conservative and that the above model tends to give results closer to the nominal significance levels (Collier et al. 1967). We have based our analysis plan on the recommendation of Winer (1971) that the linear model used be provisional and its continuance be based on a statistical test of the homogeneity of the covariance matrix.
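One way to see what homogeneity of the covariance matrix buys is to compute the epsilon index commonly associated with the Greenhouse and Geisser correction. The sketch below is an illustration only and is not the statistical test of homogeneity that Winer recommends: it evaluates epsilon for a given covariance matrix, where 1 means the homogeneity assumption holds and 1/(q - 1) is the lower bound that corresponds to the fully conservative adjustment of the degrees of freedom. The function name and the example matrices are invented.

```python
import numpy as np

def sphericity_epsilon(cov):
    """Epsilon index for a q x q covariance matrix of repeated measures.

    Returns 1.0 when the homogeneity assumption holds and approaches
    1 / (q - 1) as the covariance structure departs from it.
    """
    cov = np.asarray(cov, dtype=float)
    q = cov.shape[0]
    # Remove row and column means (double centering).
    center = np.eye(q) - np.ones((q, q)) / q
    centered = center @ cov @ center
    return float(np.trace(centered) ** 2 / ((q - 1) * np.trace(centered @ centered)))

q = 4
# Homogeneous case: equal variances and one constant correlation of 0.4.
homogeneous = 2.0 * ((1 - 0.4) * np.eye(q) + 0.4 * np.ones((q, q)))
# Extreme case: all covariation lies along a single direction.
direction = np.array([1.0, 2.0, 3.0, 4.0])
extreme = np.outer(direction, direction)

eps_homogeneous = sphericity_epsilon(homogeneous)
eps_extreme = sphericity_epsilon(extreme)
print("homogeneous:", round(eps_homogeneous, 6))
print("extreme:    ", round(eps_extreme, 6))
```

Multiplying the usual within-group degrees of freedom by epsilon gives the adjusted values. Using the lower bound regardless of the data is the conservative end of that range, which is consistent with the Monte Carlo finding cited above that such a procedure can be too conservative.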

Replication is one of the fundamental principles of experimental investigation. Replication is built in by conducting concurrent experiments in multiple sites. Multiple sites also increase the flexibility of site selection. Recall that one of the major constraints on finding a program to host a randomized experiment is caseflow. Many of the agencies willing to collaborate barely have enough cases for the core analysis, let alone any subsequent analyses on subcategories. Finer analyses can be conducted and smaller and more diverse programs can be used by combining the data from multiple experiments.

ADJUSTING ESTIMATES

Failure to find a statistical difference between two or more regimens is not a failure from a research point of view and may lead to valuable conclusions (Yeaton and Sechrest 1987). A finding of no difference, however, should indicate some true state rather than insufficient statistical power (Lipsey 1988) or an implementation problem (Dennis 1988). Conversely, biases will often be detected in field research. Current paradigms that emphasize disproving biases should be expanded to deal with their frequent occurrence. More accurate projections can be made from the available information by calculating and adjusting estimates to reflect any observed bias.

There is, however, an important caveat. Reporting adjusted estimates does not negate the need to report unadjusted results. All adjustments should be stated explicitly, and the unadjusted and adjusted results should be reported together. This forces the researcher to clearly justify them and allows the reader to decide whether or not to accept the logic of the adjustments.

REPORTING IMPLEMENTATION PROBLEMS

Interest is frequently expressed in the evaluation literature in narratives of experiences of researchers conducting such experiments (Boruch, Dennis, and Greer 1988; Boruch and Wothke 1985; Conner 1974, 1977; Farrington 1983; Farrington, Ohlin, and Wilson 1986; Rezmovic, Cook, and Dobson 1981; Riecken et al. 1974). Dennis (1988) was surprised to find that the logistical experiences of field researchers are increasingly being reported. The amount of information available in the literature on how studies were implemented and what problems occurred correlates with the year in which the experiment was conducted (r = .38, p < .05). For the sixteen criminal and civil justice experiments that Dennis (1988) analyzed and that were completed between 1973 and 1979, 43% of the study information had to come from interviews. For the twenty-four experiments completed between 1980 and 1986, only 34% of the study information had to come from interviews.

Information about implementing experiments is becoming increasingly common in the literature. Dennis (1988) found that the correlation between the year of the most recent publication (on a given experiment) and the percentage of published information was even stronger (r = .42, p < .05) than the correlation between the year of the experiment itself and the percentage of published information. The percentage of unpublished information was 52% for the five experiments with articles from only 1973-1979 and 35% for experiments with articles after 1980.

Thus there has been progress on publishing more about the implementation of randomized trials. This article has addressed and assessed some of these problems and incorporated findings into the main analysis.

NOTES

1. This technical report is available through the author.

2. This form captures client-level information about (1) the timing of treatment; (2) the duration of contact; (3) the context of the contact (e.g., individual, group, incidental); (4) the content of the contact; (5) use or revision of the treatment plan; (6) referrals; and (7) other follow-up activities.

REFERENCES

Berk, R. A. 1989. What your mother never told you about randomized field experiments. No. 44 in the UCLA Statistic Series. Unpublished report available from the author, Department of Sociology and Program in Social Statistics, University of Southern California, Los Angeles, CA 90024.

Berk, R. A., R. F. Boruch, D. L. Chambers, P. H. Rossi, and A. D. Witte. 1985. Social policy experimentation: A position paper. Evaluation Review 9:387-429.

Boruch, R. F. 1987. The ethical propriety of testing programs in AIDS research: The special case of randomized field experiments. Paper presented at the American Psychological Association, New York, August.

Boruch, R. F., M. L. Dennis, and J. S. Cecil. Forthcoming. Fifty years of empirical research on privacy and confidentiality. In Empirical Research on Ethics, edited by J. Sieber. Lincoln: University of Nebraska Press.

Boruch, R. F., M. L. Dennis, and K. C. Greer. 1988. Lessons from the Rockefeller Foundation's Minority Female Single Parent Field Tests. Evaluation Review 12:396-426.

Boruch, R. F., and H. Gomez. 1979. Power theory in impact evaluations. In Improving Evaluations, edited by L. E. Datta and R. Perloff, 139-76. Beverly Hills, CA: Sage.

Boruch, R. F., A. J. McSweeny, and E. J. Soderstrom. 1978. Randomized field experiments for program planning, development, and evaluation. Evaluation Quarterly 2:655-95.

Boruch, R. F., and W. Wothke. (Eds.). 1985. Randomization and field experimentation. New Directions in Program Evaluation Series, no. 28. San Francisco: Jossey-Bass.

Box, G.E.P., and G. M. Jenkins. 1976. Time-series analysis: Forecasting and control. San Francisco: Holden-Day.

Campbell, D. T. 1969. Reforms as experiments. American Psychologist 24:409-29.

Campbell, D. T., and J. S. Stanley. 1963. Experimental and quasi-experimental designs for research. Boston: Houghton Mifflin.

Collier, R. O., Jr., F. B. Baker, G. K. Mandeville, and T. F. Hayes. 1967. Estimates of test size for several test procedures based on variance ratios in the repeated measures design. Psychometrika 32:339-53.

Conner, R. F. 1974. A methodological analysis of twelve true experimental program evaluations. Ph.D. diss., Northwestern University.

Conner, R. F. 1977. Selecting a control group: An analysis of the randomization process in twelve social reform programs. Evaluation Quarterly 1:195-244.

Cook, T. D. 1985. Postpositivist critical multiplism. In Social Science and Social Policy, edited by R. L. Shortland and M. M. Marks, 21-26. Beverly Hills, CA: Sage.

Cook, T. D., and D. Campbell. 1979. Quasi-experimentation: Design and analysis issues for field settings. Boston: Houghton Mifflin.

Cook, T. D., and C. S. Reichardt. (Eds.). 1979. Qualitative and quantitative methods in evaluation. Beverly Hills, CA: Sage.

Cook, T. J., and W. K. Poole. 1982. Treatment implementation and statistical power: A research note. Evaluation Review 6:425-30.

Cornfield, J., and J. W. Tukey. 1956. Average values of mean squares in factorials. The Annals of Mathematical Statistics 27:907-49.

Coyle, S. L., R. F. Boruch, and C. F. Turner. 1989. Evaluating AIDS prevention programs. Report of the Panel on the Evaluation of AIDS Interventions, Committee on AIDS Interventions, Committee on AIDS Research and the Behavioral, Social and Statistical Sciences, Commission on the Behavioral and Social Sciences and Education, National Research Council. Washington, DC: National Academy Press.

Dennis, M. L. 1988. Implementing randomized field experiments: An analysis of criminal and civil justice research. Ph.D. diss., Northwestern University.

Dennis, M. L., and R. F. Boruch. 1989. Randomized experiments for planning and testing projects in developing countries: Threshold conditions. Evaluation Review 13:292-309.

Dennis, M. L., J. A. Fairbank, J. V. Rachal, and A. J. Bonito. 1990. Treatment process study. Technical report, NIDA Contract no. 271-88-8230. Research Triangle Park, NC: Research Triangle Institute.

Fairweather, G. W., and L. G. Tornatzky. 1977. Experimental methods for social policy research. New York: Pergamon.

Farrington, D. P. 1983. Randomized experiments on crime and justice. In Crime and Justice: An Annual Review of Research. Vol. 4, edited by M. Tonry and N. Morris, 257-308. Chicago: University of Chicago Press.

Farrington, D. P., L. E. Ohlin, and J. Q. Wilson. 1986. Understanding and controlling crime: Toward a new research strategy. New York: Springer-Verlag.

Federal Judicial Center. 1981. Experimentation in the law: Report of the Federal Judicial Center Advisory Committee on Experimentation in the Law. Washington, DC: Federal Judicial Center.

Fisher, R. A. 1960. The design of experiments. 7th ed. New York: Hafner.

Fisher, R. A., and S. Bartlett. 1931. Pasteurized and raw milk. Nature, 18 April, 591.

Flick, S. N. 1988. Managing attrition in clinical research. Clinical Psychology Review 8:499-515.

Fraker, T., and R. Maynard. 1985. The use of comparison group designs in evaluations of employment related programs. Princeton, NJ: Mathematica Policy Research.

Fraker, T., and R. Maynard. 1987. Evaluating comparison group designs with employment-related programs. Journal of Human Resources 22 (2): 195-227.

Greenhouse, S. W., and S. Geisser. 1959. On methods in the analysis of profile data. Psychometrika 24:95-112.

Howard, K. I., M. S. Krause, and D. E. Orlinsky. 1986. The attrition dilemma: Toward a new strategy for psychotherapy research. Journal of Consulting and Clinical Psychology 54: 106-10.

Jainchill, N., G. DeLeon, and L. Pinkham. 1986. Psychiatric diagnoses among substance abusers in therapeutic community treatment. Journal of Psychoactive Drugs 18 (3): 209-13.

Leighton, G., and P. D. McKinlay. 1930. Milk consumption and the growth of school children. Department of Health for Scotland. Edinburgh and London: Her Majesty's Stationery Office.

Lipsey, M. W. 1988. Practice and malpractice in evaluation research. Evaluation Practice 9 (4): 5-24.

Louis, K. S. 1982. Multisite/multimethod studies. American Behavioral Scientist 26:6-22.

Population Council. 1986. An experimental study of the efficiency and effectiveness of an IUD insertion and back-up component. English summary of first 6 month report, PC-PE586. Lima, Peru: Population Council.

Rachal, J. V., A. J. Bonito, J. A. Fairbank, M. L. Dennis, and H. S. Zelon. 1989. Increasing the capability of methadone maintenance programs: Randomly controlled trials of enhanced methadone maintenance counseling. Technical report, NIDA Contract no. 271-88-8230. Research Triangle Park, NC: Research Triangle Institute.

Reichardt, C. S., and N. F. Gollob. 1989. Ruling out threats to validity. Evaluation Review 13:3-17.

Rezmovic, E. L., T. J. Cook, and L. D. Dobson. 1981. Beyond random assignment: Factors affecting evaluation integrity. Evaluation Review 5:51-67.

Riecken, H. W., and R. F. Boruch. 1978. Social experiments. Annual Review of Sociology 4:511-32.

Riecken, H. W., R. F. Boruch, D. T. Campbell, N. Caplan, T. K. Glennan, J. W. Pratt, A. Rees, and W. Williams. 1974. Social experimentation: A method for planning and evaluating social programs. New York: Academic Press.

Robins, J. M. 1989. The analysis of randomized and nonrandomized AIDS treatment trials using a new approach to causal inference in longitudinal studies. In Health Services Research Methodology: A Focus on AIDS. DHHS publications no. PHS 89-3439, edited by H. Freeman and A. Mulley, 113-59. Rockville, MD: National Center for Health Services Research and Health Care Technology Assessment.

Rounsaville, B. J., W. Glazer, C. H. Wilber, M. M. Weissman, and H. D. Kleber. 1983. Short-term interpersonal psychotherapy in methadone maintained opiate addicts. Archives of General Psychiatry 40:629-36.

Saxe, L., and M. Fine. 1981. Social experiments: Methods for design and evaluation. Sage Library of Social Research, vol. 131. Beverly Hills, CA: Sage.

Scott, A. G., and L. Sechrest. 1989. Strength of theory and theory of strength. Evaluation and Program Planning 12:329-36.

Sechrest, L., and R. Redner. 1979. Strength and integrity of treatments in evaluation studies. In How Well Does It Work? Washington, DC: National Institute of Law Enforcement and Criminal Justice.

Sechrest, L., S. G. West, M. A. Phillips, R. Redner, and W. Yeaton. 1979. Introduction. Some neglected problems in evaluation research: Strength and integrity of treatments. Evaluation Studies Review Annual 4:15-38.

Shadish, W. R., T. D. Cook, and A. C. Houts. 1986. Quasi-experimentation in a critical multiplist mode. In Advances in Quasi-Experimental Design and Analysis. New Directions in Program Evaluation, no. 31, edited by W.M.K. Trochim, 29-46. San Francisco: Jossey-Bass.

Silverman, W. A. 1977. The lessons of retrolental fibroplasia. Scientific American 236 (6): 100-107.

Simon, H. A., and W. R. Devine. [1940] 1985. Controlling human factors in an administrative experiment. Reprinted in Program Evaluation: Patterns and Directions, edited by Eleanor Chelimsky, 85-94. Washington, DC: American Society for Public Administration.

Skovlund, E., and L. Walløe. 1989. Estimation of treatment difference following a sequential clinical trial. Journal of the American Statistical Association 84:823-28.

Student. 1931. The Lanarkshire milk experiment. Biometrika 23:398-406.

Wallace, L. W. 1987. The Community Penalties Act of 1983: An evaluation of the law, its implementation, and its impact in North Carolina. Ph.D. diss., University of Nebraska.

Watson, K. F. 1986. Programs, experiments, and other evaluation: An interview with Donald Campbell. Canadian Journal of Program Evaluation 1 (1): 83-86.

Winer, B. J. 1971. Statistical principles in experimental design. 2d ed. New York: McGraw-Hill.

Yeaton, W. H., and L. Sechrest. 1987. No-difference research. In Evaluation Practice in Review. New Directions in Program Evaluation, no. 34, edited by D. S. Cordray, H. S. Bloom, and R. J. Light, 67-82. San Francisco: Jossey-Bass.

Michael L. Dennis, Ph.D., is a Research Psychologist at the Research Triangle Institute, Research Triangle Park, North Carolina. He is currently involved in implementing several randomized field experiments to evaluate improvements to drug abuse treatment programs.