Robust policy evaluation from large-scale observational studies

Authors: Md Saiful Islam ^aff001; Md Sarowar Morshed ^aff001; Gary J. Young ^aff002; Md. Noor-E-Alam ^aff001
Authors place of work: Mechanical and Industrial Engineering, Northeastern University, Boston, Massachusetts, United States of America ^aff001; Center for Health Policy and Healthcare Research, Northeastern University, Boston, Massachusetts, United States of America ^aff002; D’Amore-McKim School of Business, Northeastern University, Boston, Massachusetts, United States of America ^aff003; Bouvè College of Health Sciences, Northeastern University, Boston, Massachusetts, United States of America ^aff004
Published in the journal: PLoS ONE 14(10)
Category: Research Article
doi: https://doi.org/10.1371/journal.pone.0223360

Summary

Under the current policy decision making paradigm we make or evaluate a policy decision by intervening different socio-economic parameters and analyzing the impact of those interventions. This process involves identifying the causal relation between interventions and outcomes. Matching method is one of the popular techniques to identify such causal relations. However, in one-to-one matching, when a treatment or control unit has multiple pair assignment options with similar match quality, different matching algorithms often assign different pairs. Since all the matching algorithms assign pairs without considering the outcomes, it is possible that with the same data and same hypothesis, different experimenters can reach different conclusions creating an uncertainty in policy decision making. This problem becomes more prominent in the case of large-scale observational studies as there are more pair assignment options. Recently, a robust approach has been proposed to tackle the uncertainty that uses an integer programming model to explore all possible assignments. Though the proposed integer programming model is very efficient in making robust causal inference, it is not scalable to big data observational studies. With the current approach, an observational study with 50,000 samples will generate hundreds of thousands binary variables. Solving such integer programming problem is computationally expensive and becomes even worse with the increase of sample size. In this work, we consider causal inference testing with binary outcomes and propose computationally efficient algorithms that are adaptable for large-scale observational studies. By leveraging the structure of the optimization model, we propose a robustness condition that further reduces the computational burden. We validate the efficiency of the proposed algorithms by testing the causal relation between the Medicare Hospital Readmission Reduction Program (HRRP) and non-index readmissions (i.e., readmission to a hospital that is different from the hospital that discharged the patient) from the State of California Patient Discharge Database from 2010 to 2014. Our result shows that HRRP has a causal relation with the increase in non-index readmissions. The proposed algorithms proved to be highly scalable in testing causal relations from large-scale observational studies.

Keywords:

Algorithms – Health care policy – Optimization – California – Test statistics – Medicare – Observational studies – Experimental economics

Introduction

Effective and evidence-based public policy decisions aim to manipulate one or many socio-economic variables and analyze their impact on the desired outcomes [1]. The impact assessment is not associational but causal [1, 2] which requires an understanding of the counterfactual—the difference in outcomes with or without the presence of the policy [3]. This is also true for any post policy evaluation [1]. A policy maker may design multiple policies and calculate the causal quantities including the effect of the proposed policies on different recipient groups, effects over time, possible trade-offs between competing goals, and, finally, choose the optimal policy [4]. The gold standard approach for calculating those causal quantities is conducting a randomized experiment [5–8]. In a randomized experiment, the experimenter will assign observations to either treatment or control group randomly; this randomness can avoid bias and eliminate confounding effects of covariates and thus can achieve unbiased estimation of treatment effects. In this case, a possible association between treatment and outcome will imply causation. However, many studies in health care, social science, economics, and epidemiology cannot be designed as a randomized experiment due to legal or ethical reasons. Randomization can also be impractical, time consuming, or very expensive. Hence, in most such cases experiments are performed on data that are collected as a natural process. Such experiments are called observational studies (also referred to as natural experiments or quasi-experiments) [9] and can be implemented in a prospective (collecting sample data as natural observation over time) or retrospective (experimenting on already collected data) way.

Making causal inferences from an observational study lacks the experimental elements of randomization on all possible background covariates (the observed and unobserved characteristics of a sample unit) [10, 11] and are prone to bias and systematic confounding on covariates. However, with proper understanding of the underlying process and careful control of non-randomized data, it is possible to make a reasonable estimation of the causal effect [5]. Researchers have been utilizing matching methods for identifying causality since the 1940s [10] and it is one of the most popular methods. It was used or noted in as many as 486,000 academic articles involving causal inference (see S1 File). Matching methods examine the possibility of restoring or replicating properties of randomization based on the observed covariates [10]. In fact, matching attempts to retrieve the latent randomization within the observational data [12]. Being true to its name, matching methods aim to find a control group that is identical to the treatment group in terms of joint distribution of the observed covariates. As discussed by Stuart [10], and Zubizarreta [13], matching the empirical distribution of the covariates has several significant advantages. For example, matching forces the experimenter to closely examine the data, check the common support on the covariates, and assess the quality of inference. Even though the matching process can be complex, the outcome analysis is often done with simple methods [14]. For instance, the Rubin Causal Model (also known as Potential Outcome Framework) estimates the causal effect as the difference of expected outcomes between the control group and the treatment group [15]. Due to its simple architecture and other attractive properties (see [10, 13, 16]), matching has been used to make policy decisions in health care [17–20], education [21, 22], economics [23], law [24], and politics [25].

In this paper, we adopt a robust methodology recently proposed by Morucci et al. [26] and extend it to accommodate causal inference from big data observational studies. We show the efficiency of the proposed methods by evaluating the impact of the implementation of the Medicare Hospital Readmission Reduction Program (HRRP) [27] on non-index readmissions—readmission to a hospital that is different from the hospital that discharged the patient.

Motivation and contribution

Motivation

The objective of the current one-to-one matching paradigm under the potential outcome framework is to find pairs (t, c) between samples t from treatment group T and c from control group C. A pair (t, c) is assigned in such a way that t and c are the same or very similar on a specific, pre-determined set of covariates X: { ( t , c ) : t ≃ c | X ; t ∈ Tandc ∈ C }. Over the years, researchers developed a wide array of algorithms to find such pairs, for example, Propensity Score matching [14], Mahalanobis Distance matching [14], Nearest Neighbour Greedy matching [28], Coarsed Exact Matching [29], and Genetic matching [30] are among the most popular algorithms. All these algorithms (including those not listed here) disregard the outcomes ( Y t 1 , Y c 0 ) of corresponding pairs (t, c) in the assignment process. Though the matching process reduces bias in treatment effect estimation, disregarding the outcomes in the assignment process introduces a new source of uncertainty. If a sample t ∈ T has multiple possible pair assignments { c 1 , c 2 , ⋯ , c n } ∈ C and have similar covariate balance but different outcomes (i.e., Y t 1 -⁠ Y c 1 0 ≠ Y t 1 -⁠ Y c 2 0 ≠ ⋯ ≠ Y t 1 -⁠ Y c n 0), by assigning pairs without considering the outcomes, an experimenter can estimate multiple degrees of causal effect (one for each possible assignment). Similarly, a sample from control group c ∈ C can have multiple possible assignment options { t 1 , t 2 , ⋯ , t n } ∈ T. A possible scenario is presented in Fig 1 where within each circle we have multiple pair assignment options with almost similar match quality but different outcomes (outcomes are presented as the size of the data points). In such cases, different experimenters using different matching algorithms can get different pairs, hence, their causal effect estimates and conclusions on the experiment can be different. It is possible that two researchers having the exact same hypothesis and using the exact same data but with different matching algorithms reach completely opposite results due to this uncertainty. This problem is exacerbated for studies involving big data as we may have more pair assignment options. Therefore, making policy decisions in health care or any other field by using the matching method that disregards uncertainty due to pair assignments can lead to erroneous conclusions.

**Fig. 1. Uncertainty due to multiple pair assignment options.**

In 2012, Congress adopted HRRP as part of the Patient Protection and Affordable Care Act (PPACA) [27] to increase quality of care and reduce hospital readmission rates. HRRP penalizes hospitals when patients with certain clinical conditions (i.e., pneumonia, acute mayocardial infraction (AMI), congestive heart failure (CHF)) who have been discharged are readmitted within 30 days. The index hospital is always penalized even if the patient is readmitted to a different hospital (non-index hospital) [27]. Though readmissions to the index hospital are following a decreasing trend over the post HRRP periods, non-index readmissions are increasing [31, 32]. This increase in the non-index readmission rate—approximately one fifth of all readmissions for Medicare patients [31, 33]—creates suspicion that hospitals are possibly discouraging patients from readmission to avoid penalties introduced by HRRP. Moreover, a recent study identified that non-index readmissions are associated with higher odds of in-hospital mortality and longer length of stay [34]. Therefore, we aim to identify whether HRRP has a causal relation to the increase in non-index readmission. Finding such causal relation involves analyzing a large volume of health care data and matching method would be vulnerable to the uncertainty discussed above. The robust method proposed in [26] to handle such uncertainty requires solving multiple Integer Programming (IP) models (a minimization and a maximization problem) iteratively. Using state-of-the-art integer programming solvers to solve those IP models for big data observational studies will be computationally expensive.

Contribution

In this work, we extend the robust causal inference testing method proposed by Morucci et al. [26] to handle large-scale observational studies with binary outcomes. To handle big data, first, we propose a robustness condition that identifies when a robust solution is possible and combines the maximization and minimization problems into a single problem. Second, we propose an efficient algorithm to calculate the test statistics for the robust condition. In addition, we propose two algorithms—one to solve the minimization problem and one to solve the maximization problem—for any condition that will show the degree of uncertainty for a selected number of matched pair. Finally, we implement the algorithms by testing the causal effect of HRRP to non-index readmissions using the State of California Patient Discharge Data and compare the computational efficiency with canonical IP solvers.

Remark 1. Please note, by “Robust” we imply “Robust to the choice of matching method”: if A represents a set of all possible matching algorithms, a researcher choosing any algorithm A i ∈ A and testing a hypothesis of causal effect will get the same result if she has chosen algorithm A j ≠ i ∈ A. Also, we are considering matching as pre-processing and plan to achieve robust test result from a large-scale observational study for a given set of good matches M identified by any matching algorithm A i ∈ A.

Causal inference with matching method and robust test

In the Rubin Causal Model, a sample unit i from a set of observations { 1 , 2 , ⋯ , n } ∈ S can have two outcomes or responses. The response Y i T is called treatment response when the unit i receives certain treatment (T = 1) and control response when unit i does not receive treatment (T = 0). It is assumed that the treatment assignment of any unit does not interfere with the outcome of other units [35]. This assumption is commonly known as the Stable Unit Treatment Value Assumption (SUTVA). Under this assumption, the treatment effect on a sample unit i ∈ S is calculated as T E i = Y i 1 -⁠ Y i 0. However, it is impossible to observe the counterfactual scenario for the same sample [15]. Under a certain treatment regime T ∈ {0, 1} and identical conditions, we can only observe Y i T = 1 or Y i T = 0 for sample i: Y i = T i Y i 1 + ( 1 -⁠ T i ) Y i 0 [5, 15]. Therefore, we cannot directly measure the treatment effect TE at an individual level. On the other hand, the causal inference literature offers a statistical solution to this fundamental problem by taking expectation over the observation set S, formally called Average Treatment Effect (ATE).

The ATE as defined in Eq 1 provides the opportunity to divide S into the treatment group T when T = 1 and control group C when T = 0 such that ( T ∪ C ) = S and work with their expectations. So, we can construct the ATE as E[Y¹|T = 1] −⁠ E[Y⁰|T = 0] but, this form of ATE implicitly assumes that the potential responses are independent of treatment assignment: Y i 1 , Y i 0 ⊥ T , ∀ i ∈ S. Though this independence assumption holds in randomized experiments, in general, it does not hold for observational studies as the experimenter rarely has control over the treatment assignment process. This problem is solved by making an assumption known as Strong Ignorability [7]. Let X ∈ X and X ∈ R k be the set of pre-treatment background variables (covariates) which characterizes the observations. The strong ignorability assumption states that the potential responses are independent of treatment assignment when conditioned on the covariates: Y i 1 , Y i 0 ⊥ T | X and every unit i ∈ S has a positive probability to receiving treatment: 0 < Pr(T = 1|X = x) < 1. Another commonly used estimate of causal effect is Average Treatment Effect on Treated (ATT) which is defined under slightly relaxed assumption (Y i 0 ⊥ T | X).

Both of these estimates are prone to bias as the treatment assignment process is not random. In the matching method, an unbiased estimate of causal inference can be achieved if treatment unit t ∈ T is exactly matched with a control unit c ∈ C in terms of the covariate set X ∈ X [7]. However, in most of the applications, it is impossible to achieve exact matching [7, 13, 36, 37]. A wide variety of matching methods are employed to make (t, c) pairs as similar as possible [7, 13, 38] or to find a subset of control group samples C ⊆ C that is similar to the treatment group samples T ⊆ T in the joint distribution of the covariate set X [29, 36]. In this work, we consider one-to-one matching that aims to find a pair ( t , c ) ⊆ ( T , C ) that is matched (either exactly or by some user defined balance function) on a set of covariates X ⊂ X.

Before explaining the difference between the classical method of causal inference [5, 14, 15] and the robust causal inference testing approach [26], let us define the set of good match M and the pair assignment variables a_i,j.

Definition 1. (A set of Good Match) A set of good match M includes treatment group samples T ⊆ T and control group samples C ⊆ C that satisfies certain covariate balance criteria defined under matching algorithm A i ∈ A.

Definition 2. (Pair Assignment Operator) The Pair Assignment Operator is a binary assignment variable a_ij ∈ {0, 1} where a_ij = 1 if sample t i ∈ T is paired with a sample c j ∈ C and the pair ( t i , c j ) ∈ M; a_ij = 0 otherwise.

For a given set of possible matches M, we can perform hypothesis test in the following form with the null hypothesis being no causal effect and alternative being the opposite.

Under the classical approach of matching method, we can test these hypotheses first by defining a test statistic Λ, specifying an imbalance measure along with a tolerance limit on the imbalance. Then, we apply a matching algorithm A i ∈ A to find the set of good match M that satisfies the imbalance limit; otherwise we tune the allowable imbalance limit to generate M. Robust approach differs from the classical approach moving forward from here (see Fig 2). The classical approach picks one (out of many) possible combination of pairs from M and conducts the hypothesis test wherein, the robust approach calculate the maximum and minimum value of the test statistic (Λ_max, Λ_min) and corresponding p-values to explore all possible assignment combinations within M which does not increase imbalance under Definition 1. The test will be robust if both Λ_max and Λ_min produce same conclusion on the hypothesis. We formally define the Robust Test in Definition 3.

Definition 3. (Robust Test) Let α be the level of significance set for the hypothesis H 0 and (Λ_max, Λ_min) are the test statistics calculated from M, then, testing H₀ is called α-robust if max(p-value(Λ_max), p-value(Λ_min)) ≤ α or min(p-value(Λ_max), p-value(Λ_min)) > α. Testing H₀ is called absolute-robust when p-value(Λ_min) = p-value(Λ_max).

Calculating the test statistic Λ generates an integer programming model which is computationally expensive for large scale data (see Numerical experiment section). In the following section, we propose a robustness condition following the Robust Test definition which will allow us to calculate a Λ_robust = Λ_min = Λ_max for absolute-robust test and we can avoid solving two integer programming problems. From definition 3, it is clear that an absolute-robust test is always robust. In this work, we are interested in testing the hypothesis stated in Eqs (3 and 4) for binary outcomes: Y ∈ {0, 1} with the McNemar’s test [39] as proposed in [26].

Robust McNemar’s test

McNemar’s test is the ideal candidate for testing hypothesis in Eqs (3 and 4) as it deals with one-to-one matched pairs. It operates on a 2 × 2 contingency table (see Table 1) and the test statistics under the null hypothesis assume that the marginal proportions are homogeneous. Among the four types of matched pairs, we are mainly interested in the discordant pairs B = ∑ i ∈ T ∑ j ∈ C a i j Y j 0 ( 1 -⁠ Y i 1 ) and C = ∑ i ∈ T ∑ j ∈ C a i j Y i 1 ( 1 -⁠ Y j 0 ) where a_i,j is the pair assignment operator defined in Definition 2. Here, B counts the number of pairs where treatment units has outcomes 0: Y¹ = 0 and control units has outcomes 1: Y⁰ = 1 and C counts the discordant pairs where Y¹ = 1 and Y⁰ = 0. Under the assumption of having at least 1 discordant pair: B + C ≥ 1 we will use the test statistic Λ as defined in Eq (5) to test both hypotheses.

**Tab. 1. Contingency table of the outcomes of treatment and control observations.**

Morucci et al. [26] proposed the following integer programming model that explores all possible assignment options and calculate maximum and minimum possible test statistics Λ_max and Λ_min, respectively.

Subject to:

Additional user-defined covariate balance constraints to find M

The total number of discordant pair constraint in Eq (8) provides an opportunity to linearize the robust McNemar’s test model. We can calculate the maximum and minimum test statistic by solving the integer programming model iteratively for different values of m until either a robust solution is obtained under definition 3 or m cannot be increased further. For the latter case, we will not find a robust solution.

As it is shown in Table 1, B is the total number of untied responses when Y i 1 = 0 is matched with Y j 0 = 1. Similarly, C is total number of untied responses when Y i 1 = 1 is matched with Y j 0 = 0. Therefore, both B , C ∈ R +. Under this definition of B and C, we provide the following propositions on the objective function of the robust McNemar’s test and its optimal values.

Proposition 1. The objective function Λ(a), has the following properties:

For any C > 0, Λ(a) is strictly increasing in B for B ∈ R +
For any B ≥ 0, Λ(a) is monotonically decreasing in C for C ≥ 1 and strictly decreasing for C > 1

Proof. Let C > 0, then for any B ∈ R +, we have

which implies Λ(a) is strictly increasing in B for a fixed C. Similarly, let B ≥ 0, then for any C ≥ 1 we have,

this proves the claims of Proposition 1.

Before further discussion on Λ(a) and the optimality conditions, we introduce the following notations and definitions of maximum untied responses, for both B and C.

| Y i 1 = 1 | is the number of treatment units in the matched set with positive outcome
| Y i 1 = 0 | is the number of treatment units in the matched set with negative outcome
| Y j 0 = 1 | is the number of control units in the matched set with positive outcome
| Y j 0 = 0 | is the number of control units in the matched set with negative outcome

Definition 4. (Maximum type one discordant pair) B_max is the maximum number of possible pairs between Y i 1 ∈ T and Y j 0 ∈ C where the treated observation has negative (“No”) outcome but the untreated (control) observation has positive (“Yes”) outcome, i.e.,

Definition 5. (Maximum type two discordant pair) C_max is the maximum number of possible pairs between Y i 1 ∈ T and Y j 0 ∈ C where the treated observation has positive (“Yes”) outcome but the untreated (control) observation has negative (“No”) outcome, i.e.,

For a fix value of m, the McNemar’s test model becomes linear and the objective functions become,

Using the property of Λ(a) explained in Proposition 1, we can find the optimal solution.

Proposition 2. Let C ≥ 1 and denote m as the total number of discordant pairs, then the optimal pair (C*, B*) is given by:

Proof. From Proposition 1, we know that Λ(a) is monotonically decreasing in C when C ≥ 1. Therefore, in the minimization problem, assignment will be made to maximize C until we are about to violate constraint B + C = m. When the total number of discordant pairs is set to m > C_max, C will take the value of C_max and B will take the value of m −⁠ C_max just to satisfy the total number of discordant pair constraints and the solution will be optimal. If m < C_max, the new C = m and the minimum value will be achieved at C = m and B = 0.

Similarly, from Proposition 1, we know that Λ(a) is strictly increasing in B for any B ∈ R +. So, in the maximization problem, pair assignment will be made to maximize B within the feasible region. When the total number of discordant pair is set to m > B_max, at optimal solution, B will take the value of B_max and C will take the value of m −⁠ B_max just to stay in the feasible region. When m < B_max, the B will take the value m and the maximum value will be achieved at B = m and C = 0.

Proposition 3. For the linear model, an absolute-robust estimate will be achieved if and only if the total number of discordant pair m = B_max + C_max.

Proof. According to the proposed approach to the causal inference estimate, an absolute-robust estimate is achieved when Λ(a)_max and Λ(a)_min is equal. For the McNemar’s test model, the model becomes infeasible when m is set to m > B_max + C_max as we can only have B_max + C_max number of total untied responses. So feasible range of m is: 0 < m ≤ (B_max + C_max).

To prove the Proposition 3, we first set m to it’s maximum value B_max + C_max. Using Proposition 2, in this case, the optimal solution for the Λ(a)_max problem is: B = B_max, C = m −⁠ B_max = C_max and the optimal solution for Λ(a)_min problem is: C = C_max, B = m −⁠ C_max = B_max. So, for m = B_max + C_max case, we get Λ(a)_max = Λ(a)_min and the solution is absolute-robust.

Conversely, m can take any integer value in the range 0 < m < (B_max + C_max) which can lead to the following six cases. For each of the cases, we will find the optimal solution using Proposition 2.

0 ≤ B_max ≤ C_max ≤ m < (B_max + C_max): The optimal solution for the minimization problem is C = C_max, B = m −⁠ C_max and the maximization problem is C = m −⁠ B_max, B = B_max.
0 ≤ B_max < m ≤ C_max < (B_max + C_max): The optimal solution for the minimization problem is C = m, B = 0 and the maximization problem is C = m −⁠ B_max, B = B_max.
0 ≤ C_max ≤ B_max ≤ m < (B_max + C_max): The optimal solution for the minimization problem is C = C_max, B = m −⁠ C_max and the maximization problem is C = m −⁠ B_max, B = B_max.
0 ≤ C_max ≤ m ≤ B_max < (B_max + C_max): The optimal solution for the minimization problem is C = C_max, B = m −⁠ C_max and the maximization problem is C = 0, B = m.
0 < m ≤ C_max ≤ B_max < (B_max + C_max): The optimal solution for the minimization problem is C = m, B = 0 and the maximization problem is C = 0, B = m.
0 < m ≤ B_max ≤ C_max < (B_max + C_max): The optimal solution for the minimization problem is C = m, B = 0 and the maximization problem is C = 0, B = m.

For all of the above six cases, Λ(a)_max ≠ Λ(a)_min, hence, the solution is not absolute-robust.

Therefore, the total number of discordant pairs m have to be B_max + C_max to get an absolute-robust estimate.

As we can see from the Proposition 2, the optimization problem has become a counting problem and can be solved efficiently for big data. However, the optimal solution calculated with Proposition 2 disregards the assignment constraints Eqs (9 and 10) and additional user-defined constraints. To find the optimal solution using the result from Proposition 2 that is feasible, we take a two-step approach. At the first step, we handle the user-defined constraints to find a good set of match M as a pre-processing step. We can use any off-the-shelf matching algorithm for that purpose or define a separate pair assignment model with different covariate balance measure to find M. At the second step, we partition the set of good match M into P partitions such that within a partition p ∈ { 1 , 2 , ⋯ , P }, any treatment unit t can be matched with any control unit c. A formal definition of a partition is provided below.

Definition 6. (Partition of M) p ⊂ M is a partition if any treatment unit t ∈ { 1 , 2 , ⋯ , N t p } is a good match to any control unit c ∈ { 1 , 2 , ⋯ , N c p } and ( t , c ) ∈ M. The reverse has to hold as well.

Construction of partitions under Definition 6 ensures that only good matches are considered for assignment. In addition, Definition 4 calculates B_max by pairing negative outcomes of treatment units and positive outcomes of control units which inherently satisfies the pair assignment constraints Eqs (9 and 10). Similarly, we calculate C_max by assigning a pair between samples with positive treatment outcomes and negative control outcomes. Therefore, none of the treatment or control unit is used more than once in the pair assignment process which satisfies the pair assignment constraints Eqs (9 and 10).

Now, using the above mentioned results, we propose Algorithm 1 which identifies the robustness condition and corresponding absolute-robust test statistic Λ(a)_robust.

Algorithm 1: Absolute-robust test statistic Λ(a)_robust at robustness condition

Require: Vector of outcomes ( Y 1 , Y 0 ) 1 , ( Y 1 , Y 0 ) 2 , ⋯ , ( Y 1 , Y 0 ) P

B_max ← 0

C_max ← 0

for p = 1 : P do

B^p ← min(|Y¹ = 0|, |Y⁰ = 1|)^p

C^p ← min(|Y¹ = 1|, |Y⁰ = 0|)^p

B_max ← B_max + B^p

C_max ← C_max + C^p

end for

return

By the sketch of the Algorithm 1, it seems like we are only matching the discordant pairs and ignoring the other possible pair assignments in the data, which is not true. In Proposition 4, we show that we match maximum possible pairs.

Proposition 4. Algorithm 1 ensures that the maximum possible pairs (t, c) are matched in M.

Proof. To prove this Proposition, we only need to show that in any partition p, Algorithm 1 matches maximum possible pairs. Then, we can sum the maximum pair assignments across the partitions to achieve maximum possible pairs (t, c) assignment in M.

Lets consider a partition p where N t p denotes the number of treatment samples and N c p denotes the number of control samples. Hence, the maximum number of pairs we can assign in p is m i n ( N t p , N c p ). We will use N . p + to represent the number of samples with positive outcomes (Y· = 1) and N . p -⁠ to represent the number of samples with negative outcomes (Y· = 0). After assigning the discordant pairs (B m a x p) and (C m a x p) as we did in Algorithm 1, we are left with ( N t p + -⁠ C m a x p ) + ( N t p -⁠ -⁠ B m a x p ) treatment samples and ( N c p + -⁠ B m a x p ) + ( N c p -⁠ -⁠ C m a x p ) control samples. Now, we can assign the remaining treatment and control samples into the other two types of pairs A and D to their limit:

It is trivial to show that the for partition p,

An example of maximum pair assignment is provided in Fig 3 where treatment outcomes (t) are sorted in descending order and control outcomes (c) are sorted in ascending order. In the left panel, we can have m i n ( N t p , N c p ) = 5 pairs at maximum. After assigning B_max = 2 and C_max = 2 according to Algorithm 1, we can assign only one pair to D_max and A_max = 0. Therefore, we achieve the maximum number of pair assignments. We follow the similar procedure in the middle and right panels.

**Fig. 3. Example of maximum pair assignments between treatment and control group.**

In the Algorithm 1, we calculate the absolute-robust test statistics at robustness condition which an experimenter can use to find the corresponding p-value and compare with a pre-defined level of significance α to make conclusion on the hypothesis of no causal relation. Decisions made in this process will be free of uncertainty and robust to the choice of matching algorithms. If different experimenters perform matching on same data using different matching algorithms but follow the above mentioned procedure, all of their conclusions will be exactly the same.

Regarding the computational complexity arises due to big data, our proposed algorithm only involves counting elements in vectors and few algebraic operations. The counting processing can be done with the summation of vectors as we are dealing with only binary outcomes: summation implies the total number of positive outcomes and we can calculate the negative outcomes by subtracting it from the size of the vector. In addition, we only need to solve the problem once—at robustness condition. Therefore, the proposed algorithm will be highly efficient for big data.

While the Algorithm 1 directly calculates test statistics at robustness condition, a researcher might be interested in exploring the degree of uncertainty in the causal inference test. She may want to see how the uncertainty changes towards the robust estimate with respect to the number of discordant pairs matched. For this purpose, we propose the following two algorithms (2, 3) following the result of Proposition 2.

Algorithm 2: Maximizing the test statistics Λ(a)

Require: Vector of outcomes ( Y 1 , Y 0 ) 1 , ⋯ , ( Y 1 , Y 0 ) P and increment in m: IB, C ← 0

while m ≤ B_max + C_max do

for p = 1 : P do

if m < B_max then

C^p, B^p ← 0, m

else

C^p, B^p ← m −⁠ B_max, B_max

end if

B ← B + B^p

C ← C + C^p

if (B + C) ≥ m then

break

end if

end for

m ← m + I

end while

return

Algorithm 3: Minimizing the test statistics Λ(a)

Require: Vector of outcomes ( Y 1 , Y 0 ) 1 , ⋯ , ( Y 1 , Y 0 ) P and increment in m: IB, C ← 0

while m ≤ B_max + C_max do

for p = 1 : P do

if m < B_max then

C^p, B^p ← m, 0

else

C^p, B^p ← C_max, m −⁠ C_max

end if

B ← B + B^p

C ← C + C^p

if (B + C) ≥ m then

break

end if

end for

m ← m + I

end while

return

Now, to show the worst case time complexity for the proposed algorithms, we first define q = max_p∈P (|(Y¹)^p|, |(Y⁰)^p|), where (Y¹)^p and (Y⁰)^p denote the outcome vector for partition p for the treatment and control group, respectively. Then, by definition, the number of operations needed in Algorithm 1 for calculating the term min(|Y¹ = 0|, |Y⁰ = 1|) is 2q. The time complexity becomes T(p) = p(4q + 2). Therefore, Algorithm 1 has a time complexity of O ( p q ). Again, using the same definition of q, we can write B_max + C_max ≤ q + q = 2q. Since the time complexity for the loop (p = 1: P) is just 2 arithmetic operations, the overall time complexity of Algorithm 2 and 3 become T(p) = (B_max + C_max){p(2)} ≤ (2q).(2p) = 4pq. Therefore, Algorithm 2 and 3 have a time complexity of O ( p q ).

Numerical experiment

In this section, we present the efficiency of the proposed algorithms with data from the State of California Patient Discharge Database and address an interesting hypothesis on the effectiveness of the HRRP implemented in October 2012.

A hospital’s readmission rate is considered an important measure of its care quality. As noted, to increase the care quality and hold hospitals accountable, US Congress introduced the HRRP under the PPACA in 2012 [27]. The most important feature of this program is that the index hospital (the hospital that discharged the patient) is penalized if patients with pneumonia, congestive heart failure (CHF), and acute mayocardial infraction (AMI) are readmitted (to the index hospital or any other hospital) within 30 days of discharge. During the post HRRP period, the overall rate of readmission has been decreasing, which the proponents of HRRP are attributing to the success of the policy. However, in this period, readmission to different hospitals (non-index readmission) has been increasing [31, 33]. Non-index readmissions have been found to be associated with longer lengths of stay and higher in-hospital mortality rates [34]. Hospitals are possibly discouraging patients seeking readmission to avoid penalties introduced by the HRRP. To examine the increase in non-index readmission post HRRP, we advance the following hypothesis and test it with the proposed algorithms with the level of significance α = 0.05.

H₀: HRRP has no causal relation with the increase in non-index readmission
H₁: HRRP has a positive causal relation with the increase in non-index readmission

Data description and covariate balance

In this research, we primarily used patient discharge data between 2010 to 2014 from California. We obtained this nonpublic data set from the California Office of Statewide Health Planning and Development (OSHPD), which collects in-patient data from California licensed hospitals. Each patient in this data set has a unique identifier that can be used to determine if a patient is readmitted. In addition, the data set also contains patient level information such as ICD-9 codes for clinical diagnosis, comorbidities, age, gender, discharge destination, patients’ Zip code, and insurance information. When a readmission was identified, we ascertained the destination hospital of that readmission. Then, a binary variable was created with 0 if the patient was readmitted to the same hospital or 1 if different hospital. To test the hypothesis, we used this variable as our outcome: Y = 1 if readmitted to a different hospital or Y = 0 if readmitted to the same hospital.

Moreover, the OSHPD data set was merged with publicly available data from the Centers for Medicare and Medicaid Services, American Association Annual Hospital Survey and the Area Resource file. From these additional data sources, we obtained important hospital-level information including teaching status (membership in the Council of Teaching Hospitals), ownership type (public, non-profit, investor owned), hospital size based on number of beds (small: below 100 beds, medium: 101 to 399 beds, and large: 400 and above beds) and hospital location (rural, metro). We also included a proxy for patient household incomes based on the median income of a patient’s residence Zip code. We divide the data into two sets: before and after October 1, 2012, the implementation date of HRRP. The treatment here is the implementation of HRRP, readmissions between February 1, 2010 and September 30, 2012 is considered as control group C (treatment T = 0) and readmissions from October 1, 2012 to November 30, 2014 is considered the treatment group T (treatment T = 1). To capture any potential readmission within 30 days of an index discharge, admissions before February 1, 2010 and beyond November 30, 2014 were excluded. A descriptive view of readmitted patients’ characteristics is presented in Table 2.

**Tab. 2. Characteristics of readmitted patients in the State of California Patient Discharge Database from 2010 to 2014.**

We matched the patients based on the following covariates: age, gender, primary diagnosis, household income, Charlson Comorbidity Index, hospital location, hospital teaching status, hospital ownership status, and hospital size. We divided the covariates into two groups: 1) discrete and 2) continuous. The discrete covariates (i.e., gender, primary diagnosis, hospital location, hospitals’ teaching status, hospitals’ ownership status, and hospital size) are matched exactly. The continuous covariates (i.e., age, household income, and Charlson Comorbidity Index) were first divided into categories as shown in Table 2, then, the categories were matched exactly. This matching strategy resulted in 1822 partitions of data; within a partition any treatment sample can be matched with any control sample. The number of possible matched pairs in each partition can be calculated by taking the minimum number of treated or control samples in that partition. Among the 1822 partitions, we had 35,584 possible pairs. Though this matching approach seems ad hoc in nature, it is very similar to the well known method called Coarsed Exact matching (CEM) [29] with 1822 bins. Traditionally, CEM is implemented with a much lower number of bins due to the lack of common support between treatment and control groups but a higher number of bins makes for a finer covariate balance [40, 41], which is the objective of any matching method. However, to implement the proposed algorithms, an experimenter is not limited to CEM or the matching method we used. Given a good set of matches created under Definition 1, we can always create the partitions under Definition 6.

Experiment and result

To test the hypothesis H₀, first, we performed the matching operations in R [42] to obtain matched sets. Then, using the matched sets of data, we calculated the test statistic Λ(a)_max and Λ(a)_min using the 1) optimization model with an Integer Programming solver and 2) with the proposed algorithms. The integer programming model was implemented in AMPL [43] and solved with the commercial solver CPLEX [44]. We implemented the Algorithm 1, 2, and 3 in R [42]. All the experiments were performed in a Dell Precision workstation with 64 GB RAM, Intel(R) Xeon(R) CPU E5-2670 v3 processor running at 2.30 GHz.

Table 3 shows the comparison of solutions obtained using an optimization model with CPLEX iterating over different values of discordant pairs (m) and proposed algorithm at robustness condition. The range of p-value achievable corresponding to the test statistics Λ(a)_max and Λ(a)_min is presented in Fig 4. The proposed Algorithm 1 directly identifies the robustness condition which is B_max = 12082 and C_max = 9448 and calculates the absolute-robust test statistics Λ(a)_robust. Including all four types of discordant pairs (A, B, C, D), Algorithm 1 generates 35,584 matched pairs. Regarding the efficiency of the hypothesis test, for a 5% level of significance we would need at least 942 pairs to have 90% power (calculated using result from Connor [45]) wherein we have 35,584 matched pairs. The computation time required by the proposed algorithm is very insignificant compared to the time required by the optimization model. A major implication of the robust McNemar’s test is that if the same experiment is conducted with as many as 19,000 discordant pairs, we can achieve any p-value between 0 to 0.23 (see Fig 4); some experimenter might reject the hypothesis and some might fail to reject the null hypothesis. Both experimenters, in this case, are right but their conclusions differed due to the fact that they choose different pairs. Any policy decision made using the matching method without considering this uncertainty has a possibility to fail.

**Tab. 3. Test statistic Λ(a) calculated using optimization model and algorithm 1.**

The range of p-value achievable for different number of discordant pairs <i>m</i>. — **Fig. 4. The range of p-value achievable for different number of discordant pairs m.**

In regards to the hypothesis we made at the beginning of this section, we can reach a conclusion by using the p-value calculated at the robustness condition or the result from Table 3 and corresponding p-values from Fig 4. We can see that both the maximum and minimum p-value < α when we match more than 20,000 discordant pairs. Therefore, we can reject the null hypothesis of no causal effect and conclude that the HRRP is a cause for increase in the non-index readmissions. This result also suggests that not only the readmission rate but also the non-index readmission rate should be considered as a measure of health care quality.

Conclusion

Any policy decision or evaluation requires identifying the causal relation between policy alternatives and potential outcomes. Matching methods have become very popular in identifying such causal relations. However, in one-to-one matching, when we have multiple pair assignment options, matching method is vulnerable to uncertainty as the pair construction process does not consider outcomes. In this paper, we consider the integer programming model for robust causal inference testing approach with binary outcomes proposed by Morucci et al. [26] and develop scalable algorithms that can be used for large-scale observational studies. We identify a robustness condition that combines the maximization and minimization problem proposed in [26]. Instead of solving two computationally expensive integer programming models iteratively by increasing the number of discordant pairs until a robust estimate is achieved, we convert the problems into counting problems through a series of propositions. In addition, the proposed Algorithm 1 solves one problem instead of two separate problems and it is computationally efficient. Quadratic time complexity and the numerical experiment conducted on the State of California Patient Discharge Database show that the proposed algorithms are highly scalable. The numerical experiment shows an interesting result regarding a highly visible health care policy—HRRP—adopted as part of the PPACA in 2012 to improve health care quality. We identify that the HRRP is a cause of the increase of non-index readmissions, which has been shown to be associated with higher in-hospital mortality rate and a longer length of stay. Though the numerical experiment is performed with around 100,000 samples, the algorithms proposed in this paper can handle observational studies with millions of samples efficiently without further modification. In the future, we plan to develop similar robust causal inference testing algorithms with continuous outcomes for large-scale observational studies.

Supporting information

S1 File [docx]
Popularity of matching method.

Zdroje

1. Nssah BE. Propensity score matching and policy impact analysis: A demonstration in EViews. vol. 3877. World Bank Publications; 2006.

2. Pearl J. Causal inference in statistics: An overview. Statistics Surveys. 2009;3 : 96–146. doi: 10.1214/09-SS057

3. Kleinberg J, Ludwig J, Mullainathan S, Obermeyer Z. Prediction policy problems. American Economic Review. 2015;105(5):491–95. doi: 10.1257/aer.p20151023 27199498

4. Zajonc T. Essays on causal inference for public policy; 2012.

5. Rubin DB. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology. 1974;66(5):688. doi: 10.1037/h0037350

6. Austin PC. An Introduction to Propensity Score Methods for Reducing the Effects of Confounding in Observational Studies. Multivariate Behavioral Research. 2011;46(3):399–424. doi: 10.1080/00273171.2011.568786 21818162

7. Rosenbaum PR, Rubin DB. The Central Role of the Propensity Score in Observational Studies for Causal Effects. Biometrika. 1983;70(1):41–55. doi: 10.1093/biomet/70.1.41

8. Athey S, Imbens GW. The state of applied econometrics: Causality and policy evaluation. Journal of Economic Perspectives. 2017;31(2):3–32. doi: 10.1257/jep.31.2.3

9. Rosenbaum PR. Observational studies. In: Observational Studies. Springer; 2002. p. 1–17.

10. Stuart EA. Matching Methods for Causal Inference: A Review and a Look Forward. Statist Sci. 2010;25(1):1–21. doi: 10.1214/09-STS313

11. Stroup DF, Berlin JA, Morton SC, Olkin I, Williamson GD, Rennie D, et al. Meta-analysis of observational studies in epidemiology: a proposal for reporting. JAMA. 2000;283(15):2008–2012. 10789670

12. Hansen BB. Full Matching in an Observational Study of Coaching for the SAT. Journal of the American Statistical Association. 2004;99(467):609–618. doi: 10.1198/016214504000000647

13. Zubizarreta JR. Using Mixed Integer Programming for Matching in an Observational Study of Kidney Failure After Surgery. Journal of the American Statistical Association. 2012;107(500):1360–1371. doi: 10.1080/01621459.2012.703874

14. Rosenbaum PR, Rubin DB. Constructing a Control Group Using Multivariate Matched Sampling Methods That Incorporate the Propensity Score. The American Statistician. 1985;39(1):33–38. doi: 10.2307/2683903

15. Holland PW. Statistics and Causal Inference. Journal of the American Statistical Association. 1986;81(396):945–960. doi: 10.2307/2289069

16. Morgan SL, Harding DJ. Matching estimators of causal effects: Prospects and pitfalls in theory and practice. Sociological Methods & Research. 2006;35(1):3–60. doi: 10.1177/0049124106289164

17. Christakis NA, Iwashyna TJ. The health impact of health care on families: a matched cohort study of hospice use by decedents and mortality outcomes in surviving, widowed spouses. Social Science & Medicine. 2003;57(3):465–475. doi: 10.1016/S0277-9536(02)00370-2

18. Akematsu Y, Tsuji M. Measuring the effect of telecare on medical expenditures without bias using the propensity score matching method. Telemedicine and e-Health. 2012;18(10):743–747. doi: 10.1089/tmj.2012.0019 23072633

19. Kiil A. Does employment-based private health insurance increase the use of covered health care services? A matching estimator approach. International Journal of Health Care Finance and Economics. 2012;12(1):1–38. doi: 10.1007/s10754-012-9104-3 22367625

20. Sari N, Osman M. The effects of patient education programs on medication use among asthma and COPD patients: a propensity score matching with a difference-in-difference regression approach. BMC Health Services Research. 2015;15(1):332. doi: 10.1186/s12913-015-0998-6 26277920

21. Zubizarreta JR, Keele L. Optimal multilevel matching in clustered observational studies: A case study of the effectiveness of private schools under a large-scale voucher system. Journal of the American Statistical Association. 2017;112(518):547–560. doi: 10.1080/01621459.2016.1240683

22. Hong G, Raudenbush SW. Evaluating kindergarten retention policy: A case study of causal inference for multilevel observational data. Journal of the American Statistical Association. 2006;101(475):901–910. doi: 10.1198/016214506000000447

23. Dehejia RH, Wahba S. Causal effects in nonexperimental studies: Reevaluating the evaluation of training programs. Journal of the American statistical Association. 1999;94(448):1053–1062. doi: 10.1080/01621459.1999.10473858

24. Epstein L, Ho DE, King G, Segal JA. The Supreme Court during crisis: How war affects only non-war cases. NYUL rev. 2005;80 : 1.

25. Herron MC, Wand J. Assessing partisan bias in voting technology: The case of the 2004 New Hampshire recount. Electoral Studies. 2007;26(2):247–261. doi: 10.1016/j.electstud.2006.02.004

26. Morucci M, Noor-E-Alam M, Rudin C. Hypothesis Tests That Are Robust to Choice of Matching Method. arXiv preprint arXiv:181202227. 2018.

27. McIlvennan CK, Eapen ZJ, Allen LA. Hospital readmissions reduction program. Circulation. 2015;131(20):1796–1803. doi: 10.1161/CIRCULATIONAHA.114.010270 25986448

28. Austin PC. Some methods of propensity-score matching had superior performance to others: results of an empirical investigation and Monte Carlo simulations. Biometrical Journal: Journal of Mathematical Methods in Biosciences. 2009;51(1):171–184. doi: 10.1002/bimj.200810488

29. Iacus S, King G, Porro G, et al. CEM: software for coarsened exact matching. Journal of Statistical Software. 2009;30(13):1–27.

30. Diamond A, Sekhon JS. Genetic matching for estimating causal effects: A general multivariate matching method for achieving balance in observational studies. Review of Economics and Statistics. 2013;95(3):932–945. doi: 10.1162/REST_a_00318

31. Chen M. Reducing excess hospital readmissions: Does destination matter? International Journal of Health Economics and Management. 2018;18(1):67–82. doi: 10.1007/s10754-017-9224-x 28948445

32. Hasan MM, Noor-E-Alam M, Wang X, Zepeda ED, Young GJ, et al. Hospital Readmissions to Nonindex Hospitals: Patterns and Determinants Following the Medicare Readmission Reduction Penalty Program. Journal for Healthcare Quality. 2019. doi: 10.1097/JHQ.0000000000000199 31135609

33. Chen M, Grabowski DC. Hospital readmissions reduction program: intended and unintended effects. Medical Care Research and Review. 2017; p. 1077558717744611.

34. Burke RE, Jones CD, Hosokawa P, Glorioso TJ, Coleman EA, Ginde AA. Influence of nonindex hospital readmission on length of stay and mortality. Medical care. 2018;56(1):85–90. doi: 10.1097/MLR.0000000000000829 29087981

35. Rubin DB. Bayesian Inference for Causal Effects: The Role of Randomization. The Annals of Statistics. 1978;6(1):34–58. doi: 10.1214/aos/1176344064

36. Nikolaev AG, Jacobson SH, Cho WKT, Sauppe JJ, Sewell EC. Balance Optimization Subset Selection (BOSS): An Alternative Approach for Causal Inference with Observational Data. Operations Research. 2013;61(2):398–412. doi: 10.1287/opre.1120.1118

37. King G, Nielsen R. Why propensity scores should not be used for matching. Political Analysis. 2019. doi: 10.1017/pan.2019.11

38. Zubizarreta JR. Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association. 2015;110(511):910–922. doi: 10.1080/01621459.2015.1023805

39. McNemar Q. Note on the sampling error of the difference between correlated proportions or percentages. Psychometrika. 1947;12(2):153–157. doi: 10.1007/BF02295996 20254758

40. Iacus SM, King G, Porro G. Causal inference without balance checking: Coarsened exact matching. Political analysis. 2012;20(1):1–24. doi: 10.1093/pan/mpr013

41. Iacus SM, King G, Porro G. Multivariate Matching Methods That Are Monotonic Imbalance Bounding. Journal of the American Statistical Association. 2011;106(493):345–361. doi: 10.1198/jasa.2011.tm09599

42. Ihaka R, Gentleman R. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics. 1996;5(3):299–314. doi: 10.2307/1390807

43. Fourer R, Gay DM, Kernighan BW. AMPL: A mathematical programming language. AT & T Bell Laboratories Murray Hill, NJ 07974; 1987.

44. CPLEX II. V12. 1: User’s Manual for CPLEX. International Business Machines Corporation. 2009;46(53):157.

45. Connor RJ. Sample size for testing differences in proportions for the paired-sample design. Biometrics. 1987; p. 207–211. doi: 10.2307/2531961 3567305

Robust policy evaluation from large-scale observational studies

Summary

Keywords:

Introduction

Motivation and contribution

Motivation

Contribution

Causal inference with matching method and robust test

Robust McNemar’s test

Numerical experiment

Data description and covariate balance

Experiment and result

Conclusion

Supporting information

Zdroje

PLOS One