Model based estimation of population total in presence of non-ignorable non-response

Authors: Shakeel Ahmed ^aff001; Javid Shabbir ^aff001
Authors place of work: Department of Statistics Quaid-i-Azam University, Islamabad, Pakistan ^aff001
Published in the journal: PLoS ONE 14(10)
Category: Research Article
doi: https://doi.org/10.1371/journal.pone.0222701

Summary

The problem of handling non-ignorable non-response has been typically addressed under the design-based approach using the well-known sub-sampling technique introduced by Hansen and Hurwitz [1946, Journal of the American Statistical Association, Vol 41(236), Page 517 -⁠ 529]. Alternatively, the model-based paradigm emphasizes on utilizing the underlying model relationship between the outcome variable and one or more covariate(s) whose population values are known prior to the survey. This article utilizes the model relationship between the study variable and covariate(s) for handling non-ignorable non-response and obtaining an unbiased estimator for the population total under the sub-sampling technique. The main idea is to combine the estimates obtained from the sample on first call and the sub-sample from second call using separate model relationships. The contribution of this paper helps us in providing unbiased estimates with an improved efficiency under model-based paradigm in presence of non-ignorable non-response. The provided method is more economical than the available estimators under callback methods as we are working sub-sampling and also increase response rate as a stronger mode of interview is employed for data collection. A numerical study using Monte Carlo is presented to illustrate the behavior of the proposed and the efficiency comparison.

Keywords:

Blood – Employment – Surveys – Census – Data processing – Statistical inference – Blood transfusion

1 Introduction

In statistical investigations, once data collection is completed, one has to bear some, perhaps a considerable amount of non–response. Although a significant resource can be employed to improve data collection process to avoid the non-response about 20% non–response rate is commonly accepted. Item non-response occurs when one or more questions in the questionnaire are left unanswered during the survey. While a unit non-response occurs when one or more unit(s) do not response at all or are missing. Non–response in sample surveys leads to non-sampling error in estimation of the population parameters and yields biased estimates which ultimately spoils inference about the population of interest. When non-response occurs completely at random then the best way to deal with is to impute the projected values of the outcome variable corresponding to non-respondents. On contrary, when non-response factor (e.g, age, sex or/and income status etc.) is correlated with the outcome variable then the usual imputation methods fail to cope with the situation. In such situations, the population parameters and the behavior of the population may differ among the responding population (respondents) and the responding populations (non-respondent).

There are several approaches for checking whether there is a difference between the populations of respondents and non-respondents and evaluating potential bias due to non–response: (i) specific follow-up of non-respondents and (ii) analysis of the characteristics of respondents and non-respondents which are known prior to survey. [1] used demographic information (education, age, employment status, state of residence, field of employment etc.) to compare the respondents and the non-respondents. Information regarding non-respondents may come from previous surveys of same population (in the case of longitudinal surveys or with rotation groups) or by using some external data sources (e.g. administrative data etc.). [2] suggested a method for adjustment of non-ignorable non-response in studies involving one or more additional attempts to contact initial non-responders. [3] worked on changing in survey estimates as a function of additional calls under the similar protocol as well as under a different protocol. [4] considered the use of level-of-effort paradata to model the mechanism of non–response in surveys and for adjusting non–response bias, specially bias that is not missing at random (NMAR) or non-ignorable. The approach was based on unconditional maximum likelihood estimation model that adapted and extended the prior work to cope with the complexities encountered in large-scale surveys.

For similar situation, [5] examined whether non-participation in a census-based health study was related with poorer health status, using the Hordaland Health Study conducted in western Norway in 1997-1999. They aimed to determine whether health problems were over–represented in nonparticipants and to explore the consequences of participation bias on relation between outcomes and exposures. Statistical techniques for dealing with non–ignorable non–response based on a propensity–to–respond score has been developed by [6] assuming both item as well as unit non–response. Moreover, [7] proposed an approach of increasing blood supply by collecting blood more frequently from the selected donors for studying the relationship between ageing the population and blood transfusion. The primary aim of their proposed INTERVAL trial was to observe whether donation intervals can be acceptably and safely decreased to optimize blood supply while maintaining the health status of donors. The health status of a cohort of 1991 Gulf War veterans was periodically assessed by [8]. They compared various health outcomes of veterans with those of their peers in military who were not posted to the Gulf. Another example in which one can make utilization of sub-sampling method can be found in [9], where missing data and incomplete randomized interventions were common. These problems complicate the analysis as well as interpretation of controlled randomized trials (CRT), and are rarely handled well in practice. [10] modeled the non-response probabilities as logistic functions of the survey variable and related covariates in the survey with callback. They proposed maximum likelihood semi-parametric estimators of the parameters in the response probabilities. They further proposed, an efficient estimator for the mean of the study variable using the estimated response probabilities. The method was employed to data taken from the Singapore Life Panel Survey, a survey of health spending utilizing a census-based sample of individuals 50-70 years old, assuming that non-response was related to the health status.

In real surveys, as discussed in above cited works, non-response occurrence is not missing at random (NMAR) or, in other word, it is non-ignorable. When the occurrence of non-response in sample survey is related to the outcome of the survey, a valid statistical inference about the target population is quiet difficult. One can make efficient utilization of the sub-sampling method instead of call back. To fill this gap, [11] introduced a well known procedure for sub-sampling (follow-up) the non-respondents. The method includes sub-sampling a portion of non-respondents from the first sample with the assumption that some stronger mode of interview is applied for the purpose of sub-sampling non-respondents, consequently, all persons give full response on second call. On the basis of sub-sampling procedure introduced by [11], many authors including [12, 13], [14] and [15] worked on mean estimation under designed-based approach ignoring model relationship between the study variable and the known covariates. [16] suggested Hansen and Hurwitz [11] type estimator under Bayesian paradigm using squared error loss function (SELF). Later on [17] considered Bayesian approach of estimation under a general model using [11] technique. In survey sampling, usually one assumes the population as a finite collection of distinct and countable units. The measurements on the variable under investigation in the population are considered to be non-stochastic. The focus lies in estimation of population parameters i.e. functions of the population measurements on the study variable in the population (such as mean, total, proportion etc), which are also non-stochastic consequently. A sample is considered just as a smaller collection of population units and inference is carried out typically under the probability distribution formed by the random mechanism employed to draw the sample, which is termed as sampling design (S.D). Desirable properties of the estimators such as unbiasedness and efficiency are established by averaging out the values of the estimators over all possible samples.

While in model-based inference, a population is considered as a collection of realizations of a set of stochastic variables with a specified but unknown mean and a specified variance (usually assumed to be known). While a sample is a collection of identically distributed and independent variables for some fixed S.D. The parameters to be estimated are characteristics of the distribution of the original stochastic variables such as mean, and lower order moments, which are assumed to be constant quantities under the frequentist point of view.

Under model-based statistical inference [18] worked on estimation of a finite population mean. [19] attempted to obtain optimal model-unbiased estimators of the population mean and total using least square (LS) estimation method and the well known Gauss-Markov Theorem (GMT) assuming linear population model. [20] introduced the linear least-square prediction approach for estimation of finite population parameters under two-stage sampling. Other related works on estimation of mean and total under model-based approach can be found in [21], [22], [23], [24], [25], [26], [27] and [28]. [29] adapted mixed model prediction in small areas. Furthermore, [30] compared the model-based approach with model-assisted approach. For an updated comparison of the model-based and the designed based frameworks see [31]. A detailed review of the model-based estimation can also be found in [32]. As we already mentioned that the presence of non-response in sample surveys not only creates problem of small sample size but also spoils the inference when the behavior (underlying model relationship) of the population of respondents and non-respondents are different.

In current article, a model unbiased linear predictor for the population total in presence of non-ignorable non-response is proposed assuming unit non-response. The sub-sampling technique introduced by [11] is used to obtain samples under a fixed sampling design. We provide a revision of model-based approach for estimation of superpopulation total in Section 2. Our proposed estimator and its properties under assumed model are given in Section 3. Some shortcomings of the proposed estimation technique and their possible solutions are discussed in Section 4. A numerical study with real data set and a Monte Carlo simulation are respectively provided in Sections 5 and 6. A discussion with concluding remarks is given in Section 7.

2 Model-based estimation of population total

Consider a finite population of N distinct units U = {1, 2, ‥i‥, N}. Let y = (y_i, i ∈ U) be the vector of the realized values of a stochastic vector Y = (Y_i, i ∈ U) of order N × 1 and x = (x_ij, i ∈ U, j = 0, 1, 2, …, p) be a matrix of (p+1) auxiliary variables whose values are assumed to be known for every unit in U. We start with multiple linear regression model Y = xβ + ϵ, where β = (β₀, β₁, …., β_p)^T and ϵ = (ϵ₁, ϵ₂, …., ϵ_N)^T be the vectors of regression coefficients and the random error terms respectively. Let s = {1, 2, 3, …, n} be a member of S of all possible samples of size n that can be drawn from U using some S.D. Further, the random vector of the study variable Y, the known auxiliary matrix x and the random error vector ϵ are splited into sampled (s) and non-sampled (s¯) as: Y=(Ys;Ys¯)T, x=(xs;xs¯)T and ϵ=(ϵs;ϵs¯)T, where s¯=U−s. The population total T_y (which is assumed to be random under model-based approach) is expressed as Ty=WsTYs+Ws¯TYs¯, where W = (w_i, i ∈ U) is the vector containing 1’s for every units in population. For obtaining population mean W are taken as vector of 1/N for all units. Optimal values of wi′s are found by minimizing the prediction variance which is considered as good practice in model based approach [32]. For further statistical inference about the estimated parameter assumption of normally distributed error term is also necessary specially in case of small sample sizes. After observing y_s as the realized values of Y_s the problem is to predict sub-vector Ys¯ using the information contained in the sample and the auxiliary information through model relationship between the study variable and the auxiliary variable(s). Under linear population model, a predictor for Ys¯ is xs¯b, where the vector b = (b₀, b₂, …., b_p)^T is the solution of the normal equations xsTxsb=xsTys which is obtained by minimizing the sum of squared residuals. The model-based estimator given in [33] is

Note that the total estimator given in (1) works only when error terms are iid with zero mean and constant variance [27]. T^y posses all the properties with respect to the model as the predictor of ys¯ does [18, 19]. When all OLS assumptions fulfill the estimator T^y is model unbiased with the model-variance after averaging over all possible sample of same S.D.

where the subscript D is used to show that the expectation is applied with respect to S.D and I_N−n is the identity matrix of order (N −⁠ n) × (N −⁠ n). Setting p = 0 the linear regression model reduces to homogeneous population model i.e. Y = x₀β₀ + ϵ, where x₀ is vectors of 1’s. Care should be taken while selecting a suitable set of predictors which comes under the domain of variable selection (inclusion and exclusion) [34]. Moreover, when variance of the error term depends on some function of the auxiliary variable(s), weighted least square (WLS) estimator is preferred for estimating β as alternative to OLS. Moreover, if the number of regressors exceeds number of observations in the sample then ridge regression is preferred [27, 35]. We discuss these problems for our proposal later in Section 4.

3 Model-based estimation of population total in presence of non-response

In voluntary surveys, a common threat to the validity of the survey estimates is the problem of non–response. Different surveys possess different response rates, the surveys that ask questions which seem interesting and relevant to the respondents are tend to achieve the highest response rates. In recent years, response rates have been declined even in popular surveys, and, as a consequence, worries about non-response bias have been increased. As we discussed in introduction section that non-response is considered as problematic only if the population of non-respondents is an informative sample of the total sample. Unfortunately, this appears almost in majority of practical applications. In household surveys, for instance, there is a lot of evidence that non-respondents are often younger than respondents, and that women are more likely to persuade to take part than men. Similarly, response rates are also tend to be lower in deprived areas than the areas with abundance of facilities. All of these examples show that the pattern of achieved samples for surveys mostly do not reflect the population that is meant to represent very well. These surveys typically may over-represent women, and the persons elder than certain age. And often under-represent those living in less developed cities and deprived areas. When values of such demographic variable(s) are known for whole target population, we can stratify the population as the respondents and the non-respondents. The problem is then to choose a variable which more accurately stratifies the population as respondents and non-respondents. Suppose that R is a stratification vector defined as R = (R_i, i ∈ U), where R_i = 1(0) according to the ith unit belongs to the population of respondents (non-respondents). In case of missing completely at random (MCAR) non-response factor R and the study variable Y are uncorrelated and one can ignore the non-response or just apply different imputation techniques [36]. When the stratification variable R is related to the study variable Y, the model for the respondents differs from that of the non-respondents such as in above example the population models may differ among men and women, youngers and elders and deprived and settle areas. To capture this difference, we specify the model of respondents and non-respondents in the population separately according to the values of R such that

where β_r and β_nr are the vectors of regression coefficients corresponding to the respondents the non-respondents respectively. Consequently, we get sub-populations U₁ and U₂ such that U = U₁ ∪ U₂, where U₁ and U₂ are the subsets of U denoting populations of respondents and non-respondents with sizes N₁ and N₂ respectively. It is assumed that the error terms are independently and identically distributed (IID) with means E_m(ϵ₁) = E_m(ϵ₂) = 0 with model variances Vm(ϵ1)=σ12IN1, and Vm(ϵ2)=σ22IN2, where IN1 and IN2 are the identity matrix of order N₁ and N₂ respectively. Separation of model is straight forward when we have exact knowledge about the occurrence of non-response and a related stratification variable which is almost impossible in real world problem. As it is not possible to have such information that separates the underlying model exactly into the respondents and the non-respondents. One way to overcome this problem may be to use two phase sampling for obtaining information on stratification variable. In which we select a larger sample on first phase and observe the stratification variable (i.e. respondents are marked as respondents according to their behavior to respond the first phase survey are observe such factor which cause non-response) and estimate the proportions of units fall in sub-populations i.e. λ₁ = N₁/N and λ₂ = N₂/N. These information then can be used at second phase for estimating population parameters of the study variable. Before going toward our proposal, we discuss the estimation of population total without sub-sampling non-respondents which help us in knowing how the non-response creates biasedness in estimation of total.

3.1 Estimation of total without sub-sampling

For a sample s of size n assume that only n₁ units respond while remaining n₂ units don’t respond. The prediction problem given in Section 2 becomes Ty=Ws1TYs1+Ws¯1TYs¯1+W2TY2, where Ws1T, Ws¯1T, and W2T, are vectors of weights associated with n₁ respondents, N₁ −⁠ n₁ non-sampled units from responding population, and N₂ units from non-responding population respectively. Further, Ws¯1TYs¯1+W2TY2 is unknown and can be predicted using sample at hand and the auxiliary information for the non-responded and non-sampled values. A predictive estimator for population total based on respondents only, can be found as follow:

where b₁ is the vector of OLS estimates of β₁ based on n₁ respondents. The model bias of T^y1 is

See Appendix A1 for proof. T^y1 is unbiased estimate of T_y if the vectors of coefficients for the responding and non-responding sub-populations are same i.e. β₁ = β₂, this is equivalent to regression imputation. This situation occurs when Behavior of the responding and the non-responding populations are same allowing us to ignore the non-response just as reduced sample size. We obtain model mean squared error (M-MSE) of the total estimator T^y1 as

The subscript m shows that expectation is applied over model. The model-mean squared error (M-MSE) given in (7) depends on random sample under designed-based point of view. Consequently, it varies with sampling fluctuations. To obtain a fix value, we apply expectation with respect to S.D.

3.2 Estimation of total with sub-sampling

As we already discussed, there are several approaches for handling the problem of non-response in sample literature. A suitable approach may be chosen according to the type of non-response (full or partial), the accessibility of the auxiliary variable(s) and the validity of the underlying response model for handling the problem. In general, re weighting is used to deal with full (non-availability of units) non-response. Imputation is preferably applied for dealing with partial non-response although it can be applied for full non-response if appropriate auxiliary information is available. Re-weighting eliminates or at least reduces total non-response bias [36, 37]. While the sub-sampling method introduced by [11] provides a good adjustment for non-response bias and yield unbiased estimator for the population mean when the non-response variable R is significantly correlated with the survey outcome.

In this study, we develop a model-based estimator for population total by adjusting non-response using sub-sampling procedure. As the models described in (3) and (4) have different parameters it is inevitable to obtain information about both sub-populations. The sample information obtained from respondents alone leads to biased estimate for the population total of the whole population. For estimating the relationship between the study and the auxiliary variables for the population of non-respondents and estimating total, we need some information from non-respondents as well. The sampling mechanism in Section 2 is based on the respondents from first sample which don’t provide any information about the population model of non-respondents. The sub-sampling introduced by [11] is the best alternative to handle such situation of non-response which assumes the mode of data collection on first round was inexpensive and then a more stronger mode of interview is employed for sub-sampling non-respondents. The rationale behind taking a sub-sample instead of following all non-respondent is the fact that taking information from all non-respondents by using stronger mode of interview increases survey cost. Sometime randomized response techniques (more expensive and complex) are applied to gather information on second call [38]. The method assumes sub-sampling n´2=n2k (k > 1) units from n₂ units selected and not respond on first round, using some stronger mode of interview (face to face survey, telephonic survey etc). The estimation process covers two prediction problems (i) predicting N₁ −⁠ n₁ non-sampled units from the sample taken from the first round using model given in (3) and (ii) predicting N₂ −⁠ n₂ (non-sampled)+ n₂ −⁠ n´2 (non-responded) units on the basis of sample obtained on second round using the model relationship given in (4). Let s´2 be the sub-sample of size n´2 selected from s₂ and s¯´2=U2−s´2 be the set representing non-sampled values from the population of non-respondents. Now the outcome vector for respondents is further partitioned as Y1=(Ys1:Ys¯1)T and for non-respondents Y2=(Ys´2:Ys¯´2)T. The matrix x, the vector W and the random error vector ϵ are also partitioned into sampled and non-sampled parts in same way. The population total of the study character is now expressed as Ty=Ws1TYs1+Ws¯1TYs¯1+Ws´2TYs´2+Ws¯´2TYs¯´2 after replacing known values of the response units, we have Ty=Ws1Tys1+Ws¯1TYs¯1+Ws´2Tys´2+Ws¯´2TYs¯´2. The problem is to predict Ws¯1TYs¯1+Ws¯´2TYs¯´2. The first part is predicted on the basis of sample obtained on first round along with model given in (3) and the second part is predicted on the basis of sample obtained on second round and the model given in (4). Under the sub-sampling technique a linear unbiased predictor for T_y is

where Ws1T, Ws¯1T, Ws´2T and Ws¯´2T are the vectors of known weights for the values corresponding to the groups mentioned in subscripts. The estimates of model parameters β₁ and β₂ are obtained by solving the normal equations (Hs1)b1=xs1Tys1 and (Hs´2)b2=xs´2Tys´2=Hs´2 respectively, where Hs1=xs1Txs1 and Hs´2=xs´2Txs´2 are the hessian matrix for the first round sample and sub-sample respectively. The well-known GMT provides the evidence that the OLS estimators are the best linear unbiased estimators (BLUE) of the parameters β₁ and β₂ when the observations obtained on first round sample s₁ and the second round sample s¯´2 follows two different population models with independently and identically distributed error terms. In designed based point of view, the selection of sub-sample s´2 depends on the selection of s₁, hence the assumption of independence is no more valid. To proceed we need the assumption of model independency only. The separation of population as the respondents and the non-respondent is based on the values of R which is already discussed in previous section. The role of the variable R is same as the role of stratification variable in stratified sampling which is merely used to separate populations into respondents and non-respondents. Hence more correlation between the non-response factor (R) and the study variable is a requirement for using the sub-sampling approach. The case of low correlation between the study variable and the non-response variable can be handled through weighting adjustment and imputation techniques discussed in literature review. However the literature of sub-sampling technique reveals that the efficiency of the sub-sampling estimator is not affected by this correlation. But in case of presence of significant correlation proceeding with just respondents on first call may produce invalid and inconsistent statistical inference.

Note that respondents on first sample always represent the responding population U₁. While the non-respondents on first sample may or may not represent the population of the non-respondent U₂ as it depends on the degree of relationship between R and Y and the nature of occurrence of non-response (whether it is ignorable or not). The model bias of T^y* is derived in Appendix A 2, and given by

T^y* is model unbiased if all of the OLS assumptions are satisfied for the populations of the respondents and non-respondents. Assuming unbiasedness model variance of the total estimator under non-response is obtained as

Taking expectation with respect to S.D we get

where ED1 and ED2 are expectations with respect to S.D used for selecting first sample and sub-sample respectively. The first component of the expected model-variance depends on the error variances while the second component depends on the inverse of the matrix H = x^Tx for the first sample and the sub-sample. Hence, for smaller variance the population units with larger sampled values of all included covariates should be prefer. [39] provided a detail discussion on optimum selection of units under different population models.

4 Estimation of total with sub-sampling under super-collinearity and heteroscedasticity

While applying linear regression model for predicting the non-sampled values from the population of non-respondents the number of input variables (regressors) may greatly exceeds the number of observations i.e. n´2<(p+1) as we are sub-sampling a relatively small portion of non-respondents. In such situations, fitting the full model to the non-respondents without penalization will result in wider prediction intervals, and the normal equations may not have trivial solution as the matrix Hs´2 does not possess the full rank property. It is not possible to estimate the parameters of the model when Hs´2 is singular i.e. not of full rank. This situation is called super-collinearity or ill-conditioning. The problem of super-collinearity can be solved using ridge regression. To get an estimate for β₂, when there is super-collinearity in x₂, we use ad-hoc fix method proposed by [40] for resolving singularity of Hs´2. We simply replace H=Hs´2 by H(v)=Hs´2+vIp+1 with v ∈ [0, ∞]. The scalar v is called tuning parameter or penalty parameter. A clearly defined estimator for β₂ obtained even for high-dimensional data matrix (n´2≤p) for a strictly positive v is b2(v)=H(v)−1xs´2Tys´2. Using b₂(v) in (8), we obtain a partially ridge regression (PRR) estimator (as the concept of ridge regression is used for non-responding part only) for population total which is given by

The expressions for model-bias and expected model-MSE of the PRR estimator of the total in presence of non-response are obtained by replacing H(v) by H in (10) and (11). Following [41] a range for v in which the model-MSE of xs¯´2b2(v) is smaller than the model-variance of xs¯´2b2 is

where ψ₂ is the minimum eigen-value of the matrix (Hs´2)−1−β2β2Tσ22. PRR is also applicable for predicting non-sampled respondents when n₁ < (p + 1) leading to super-collinearity in the respondents.

Another major problem that arises in estimation of so called superpopulation parameters is the violation of assumption of homoscedasticity is violated. In presence of heteroscedasticity one has

where V₁ = diag(V_1ii,i ∈ U₁) and V₂ = diag(V_2ii, i ∈ U₂) units specific variances for respondents and non-respondents respectively. Here V_1ii = V_m(Y_1i|x_1i) = υ(x_1i) and V_2ii = V_m(Y_2i|x_2i) = υ(x_2i), where x_1i and x_2i are the vectors of the auxiliary variables corresponding to the ith unit in respondents and non-respondents respectively. In such situations, OLS estimators for the regression coefficients may have higher variances. If we have information about the variance structure for the populations of respondents and non-respondents (assuming zero correlation between the units), we can adopt weighted least square (WLS) method of estimation. The WLS estimators of β₁ and β₂ are b1wls=(xs1TVs1−1xs1)−1xs1TVs1−1ys1 and b2wls=(xs´2TVs´2−1xs´2)xs´2TVs´2−1ys´2 respectively, where V1=(Vs1,Vs¯1)

The sub-matrices are also diagonal assuming zero correlation between the error terms corresponding to the respobndents and the non-respondents. A WLS estimator for T_y in presence of non-response is obtained by replacing b_1wls and b_2wls by b₁ and b₂ respectively in (8). It is assumed that the variance structures of the responding and non-responding population are known and depend on covariates whose values are known for each population unit. In practice, for many types of data set, the structure of weights (inverse of variance) is usually unknown, so one has to perform an ordinary least squares (OLS) regression first to estimate the variance structure and obtain estimates for the population regression coefficients after performing an iterative process which is commonly known as generalized least square (GLS).

5 Application

A real data set taken from [42] is applied to investigate the behavior of our proposed model-based estimator. The data set is given as supporting information S1 Data 7 to this paper. The data consist of 748 blood donors on following variables:

y = Monetary total blood donated in c.c., x₁ = Time (months since first donation), x₂ = Recency (months since last donation) and x₃ = Frequency (total number of donation). Considering the above 748 blood donors as our population of interest, we select a sample of size 100 using simple random sampling without replacement. The scatter plot matrix between the variables in the sample selected on first call and the sub-sample collected on the second call represents the relationship between the variables in the population of respondents and non-respondents for response rates λ₂ = 0.4 (Figs 1 and 2) and λ₂ = 0.4 (Figs 3 and 4). Fig 1 shows the relation between the study variable y and the predictors x₁, x₂, and x₃ for the sampled respondents which shows that the study variable y, is highly related to x₃ and moderately related to x₁ but weakly related to x₂. Fig 2 portrays the relationship between variables for the sub-sampled non-respondents which is different from the relationship in Fig 1 which shows the relevancy of the data to our proposed sampling mechanism. One can observe the similar relationship between the variables for λ = 0.2 from upper triangle of Figs 3 and 4. Hence our proposal works here as the relationship between the total monetary blood donated and its three determinants have different relationship for the population of the respondents and the sub-population of the non-respondents which is the main assumption of our data collection mechanism. We select half (k = 2) of the non-respondent selected on first call for sub-sampling on second call.

Behavior of non-respondents with λ<sub>2</sub> = 0.4. — **Fig. 1. Behavior of non-respondents with λ₂ = 0.4.**

Behavior of respondents with λ<sub>2</sub> = 0.4. — **Fig. 2. Behavior of respondents with λ₂ = 0.4.**

Behavior of non-respondents with λ<sub>2</sub> = 0.2. — **Fig. 3. Behavior of non-respondents with λ₂ = 0.2.**

Behavior of respondents with λ<sub>2</sub> = 0.2. — **Fig. 4. Behavior of respondents with λ₂ = 0.2.**

Further to see the magnitude of the prediction error, we provide a bootstrap sampling procedure taking different non-response rate (say λ₂) in the population. We generate a new variable R associated with each 748 cases which posses value 1 if the ith unit has an outcome greater than the λ₂ th percentile of all the y values in the data set otherwise zero.

A sample of size n (for n = 100, 200) is taken from the data using simple random sampling without replacement and divide it them into the respondents and the non-respondents according to the value of R and observe n₁ and n₂.
Select a sub-sample of size n´2=n2k (taking k = 2,4) from n₂ non-respondents again using simple random sampling without replacement and compute the estimator using information obtain from first and second samples. We take p = 2 to avoid the problem of super-collinearity in our situation.
Repeat Step 2, 2000 times to get expected value from the sub-sampling. The sub-sampling does not alter results of T^1 as it is based on sample from respondents only.
Repeat Steps 1-3, 5000 times to obtain a stable value of prediction variance and bias for both estimators.

The prediction bias and variances are computed as follows:

The RB and RMSE for the [11]-type estimator for the population total are obtained by replacing T^y1 by T^y* in Eqs (16) and (17) respectively. Table 1 provides relative bias (RB) and relative mean squared error (RMSE) of the total estimator based on the sample on first call for different combinations of n, λ₂ and k.

The results in Table 1 are reported assuming non-response rate λ₂ at 50%, 25%, and 10%. RB of both estimators go to zero as non-response rate falls toward zero which assures that for full response it vanishes while the sub-sampling method produce ignorable bias as compared to direct method which is the attractive feature of this method. Further from Table 1, one can observe that RMSE is smaller in case of sub-sampling non-respondents, i.e. taking interview of additional non-respondents through some stronger mode of interview, for every choices of λ₂. RB and RMSE of T^y* tend to increase with decrease in non-response rate in the population which shows that our proposed technique works well for higher non-response rates as compared to lower smaller ones. RB and RMSE of the model based total estimator go down while increasing sub-sample size n´2 (decreasing k) as expected. Further, this error decreases when population has smaller non-response rate λ₂. In upcoming section, we provide a simulation study to provide a detailed picture of the performance of estimators in terms of design bias and mean squared error.

6 Simulation study

To see the long run behavior of the proposed estimators in terms of bias and efficiency, a simulation study, generating a hypothetical population, is conducted. Following [43], a matrix z = (z_ij, i = 1, 2, 3, …, N, j = 1, 2, …, p) with p variate each generated from N(100,1), has been constructed with N = 10,000 observations. The ijth element of the auxiliary matrix x is computed as x_ij = (1 −⁠ ρ)^0.5 × z_ij + ρ × z_ij, where ρ is the degree of linear relationship between x and z to be fixed in advance. The vector of the study variable (y) is then obtained by using the relationship y = xγ + ϵ, where γ is the vector of coefficients which are computed as the averaged eigen vectors corresponding to the eigen values of H = x^Tx that are greater than unity and ϵ ∼ N(0, σ²I_N) is randomly generated error term. It is assumed that the variance is of homoscedastic nature with constant diagonal σ². We fix σ² at 0.01, 0.1 and 1. The data consist of (y, 1_N, x, R_i), where 1_N is the vector of 1’s. R_i takes value 1 if the ith value of variable y falls in a threshold lower than (1 −⁠ λ₂)th quantile in the population, where λ₂ is non-response rate in the population. In real life, we suggest to choose R in form of some observable covariates or latent variables. The simulation study is conducted in following three steps.

Take a random sample of size n from the population generated through the mechanism described above and split it into n₁ respondents and n₂ non-respondents according to the values of R_i.
Select a sub-sample of size n´2 from n₂ non-respondents for fix k.
Estimate the population total (T_y) using estimated models from samples obtained on Steps 1 and 2.
Simulate Steps 2–3 500 times and average the values of estimates.
Repeat Steps 1–4 2000 times to obtain prediction errors to obtain 2000 estimated values.

The bias (B) and mean squared error (MSE) of the proposed total estimators are computed using the formula given in Eqs (16) and (17) respectively after removing the denominators as the generated values are already standardized. The subscript v is used for the results where prediction is performed using PRR.

Tables 2–4 provide the bias of the PPR estimator and mean squared error of both estimators for different combinations of σ², λ₂, ρ, n and k in nested order. We obtain results for p = 5 and p = 8 but the result for p = 5 is not reported here for the sake of space. Tables 2, 3 and 4 provide the prediction error measures (B and MSE) for σ² = 0.01, σ² = 0.1 and σ² = 1 respectively. From Tables 2–4, one can see that the bias of the PRR total estimator tends to increase with increase in k. This implies selecting a smaller sub-sample increases the bias in estimation due to sampling error although this bias depends on the magnitude of the tuning parameter v. MSE of the total estimator under multiple regression and PRR both increase with increase in k which shows that MSE of the estimators grows with smaller sub-samples from non-respondents. The PRR total estimator is more sensitive to the change in k, in terms of MSE, as the optimum value of the tuning parameter v is estimated from sub–sample. In practice v might be computed using data available from previous surveys of the same population or through expert judgment. The estimation methods of v by minimization of prediction error are available in [43]. Moreover, whatever model we use for prediction, the MSE values of the total estimators depend on the sample size of respondents and sub-sample of non-respondents. The simulated results are provided for sample size 100, 150 and 250 with sub-sample size inversely proportional to k = 1.5, k = 2 and k = 3. It can be noticed that MSE values are increasing with increase in k. Comparing two portions of Tables 2–4, we observe that the MSE of proposed estimators fall when non-response rate increases which conflicts the efficiency property of the [11] estimator. The reason is the use of separate models and increasing λ₂ from 0.2 to 0.4 implies(i.e. we are using Model (2) for 40% of the data) which is the main contribution of our proposal in terms of increased precision. Apart from the design parameters, the data generating process also effects the efficiency of the total estimator which can be seen from three different column-panels (for three different choices of the parameter ρ) assuming that the correlation between the variables X and Z are same for all choices of j of Tables 2–4.

Bias and MSEs with <i>σ</i><sup>2</sup> = 0.01 and <i>p</i> = 8. — **Tab. 2. Bias and MSEs with σ² = 0.01 and p = 8.**

Bias and MSEs with <i>σ</i><sup>2</sup> = 0.1 and <i>p</i> = 8. — **Tab. 3. Bias and MSEs with σ² = 0.1 and p = 8.**

Bias and MSEs with <i>σ</i><sup>2</sup> = 1 and <i>p</i> = 8. — **Tab. 4. Bias and MSEs with σ² = 1 and p = 8.**

7 Conclusion

This article is concerned with utilization of model relationship between the outcome variable and one or more covariate(s) for efficient estimation of population total of the outcome variable in surveys with non-ignorable non-response. A model based version of [11] sub-sampling technique is suggested which assumes that the responding and non-responding population have different models. This assumption may hold for majority of real world situations where the occurrence of non-response is observable like a stratification variable. In public health surveys the non-response occurrence is based on the gender, ethical affiliation, age and other demographic factors of the respondents. In such situations, respondents and non-respondents may have different models. The method assumes that a stratification variable is available to divide the population into respondents and non-respondents which is difficult to obtain in most of real surveys although a two phase sampling method can provide a better stratification variable to divide the population into respondents and non-respondents. It is shown that under linear population model (linear in parameter as well as in variables) the total estimator with sub-sampling is model-unbiased and has smaller model-variance as compared to predictive estimator based on sampled respondents only. The linearity assumption emphasizes on linear in parameters but not restricted to the linearity in variable. Polynomial regression models are also useful for handling non-response in demographic surveys using age as the predictor. The problem of non-response can be well handled using polynomial regression models which is an open area to work in future. While sub-sampling non-respondents the number of observations may become smaller than the number of regressors included in the model leading to problem of super-collinearity. To cope with super-collinearity problem, we suggest a version of ridge regression named, called PRR, for predicting the non-sampled non-respondents. WLS and GLS are suggested for obtaining estimates of the regression coefficients for respondents and non-respondents when error terms for at least one model is of heteroscedasticity nature. To confirm mathematical expressions a numerical study with blood transfusion data has been carried out. The suggested method is applicable to telephonic or web household surveys where households are first contacted with email or telephone call and then non-respondents are followed via face to face surveys where it seems logical to select a sub-sample of non-respondents through more expensive mode (face to face).

APPENDIX A1. Derivation of bias and MSE T^y1 without sub-sampling

where A=Ws¯1Txs¯1+W2Tx2. The model variance of T^y1 is derived as

Under OLS assumptions, we have Vm(b1)=σ12(Hs1)−1. Inserting this result, we get

The MSE of T^y1, is given by

A2. Derivation of bias and MSE of T^y1 with sub-sampling

The variance of the estimator, is given by

Rearranging terms, we get

Supporting information

S1 Data [csv]
Blood transfusion data set.

Zdroje

1. Barton J, Bain C, Hennekens CH, Rosner B, Belanger C, Roth A, Speizer FE. Characteristics of respondents and non-respondents to a mailed questionnaire. American Journal of Public Health. 1980 Aug;70(8):823–5. doi: 10.2105/ajph.70.8.823 7416342

2. Wood AM, White IR, Hotopf M. Using number of failed contact attempts to adjust for non–ignorable non–response. Journal of the Royal Statistical Society–Series A Statistics in Society. 2006 Jul 1;169(3):525–542. doi: 10.1111/j.1467-985X.2006.00405.x

3. Peytchev A, Baxter RK, Carley-Baxter LR. Not all survey effort is equal: Reduction of non–response bias and non–response error. Public Opinion Quarterly. 2009 Jan 1;73(4):785–806. doi: 10.1093/poq/nfp037

4. Biemer PP, Chen P, Wang K. Using level–of–effort paradata in non–response adjustments with application to field surveys. Journal of the Royal Statistical Society: Series A (Statistics in Society). 2013 Jan 1;176(1):147–68. doi: 10.1111/j.1467-985X.2012.01058.x

5. Knudsen AK, Hotopf M, Skogen JC, Overland S, Mykletun A. The health status of nonparticipants in a population-based health study: the Hordaland Health Study. American journal of epidemiology. 2010 Sep 15;172(11):1306–14. doi: 10.1093/aje/kwq257 20843863

6. Copas AJ, Farewell VT. Dealing with non–ignorable non–response by using an enthusiasm-to-respond variable. Journal of the Royal Statistical Society: Series A (Statistics in Society). 1998;161(3):385–96. doi: 10.1111/1467-985X.00115

7. Moore RE, editor. Reliability in computing: the role of interval methods in scientific computing. Elsevier; 2014 May 10.

8. Huang GB. Tai-kang (2009). Taking the Opportunities of Health Care Reform to Develop Clinical Pharmacy. Asian Journal of Social Pharmacy 4.;2 : 65–9.

9. White IR, Kalaitzaki E, Thompson SG. Allowing for missing outcome data and incomplete uptake of randomised interventions, with application to an Internet–based alcohol trial. Statistics in medicine. 2011 Nov 30;30(27):3192–207. doi: 10.1002/sim.4360 21948462

10. Guan Z, Leung DH, Qin J. Semi–parametric maximum likelihood inference for non–ignorable non–response with callbacks. Scandinavian Journal of Statistics. 2018 Dec;45(4):962–84. doi: 10.1111/sjos.12330

11. Hansen MH, Hurwitz WN. The problem of non-response in sample surveys. Journal of the American Statistical Association. 1946 Dec 1;41(236):517–29. doi: 10.1080/01621459.1946.10501894 20279350

12. Khare BB, Srivastava S. Estimation of population mean using auxiliary character in presence of non-response. National Academy Science Letters. 1993 Mar 1;16 : 111-.

13. Khare BB, Srivastava S. Study of conventional and alternative two phase sampling ratio, product and regression estimators in presence of non–response. Proceedings-National Academy Of Sciences India Section A. 1995;65 : 195–204.

14. Khare BB, Sinha RR. On class of estimators for population mean using multi-auxiliary characters in the presence of non-response. Statistics in Transition. 2009;10(1):3–14.

15. Singh HP, Kumar S. A REGRESSION APPROACH TO THE ESTIMATION OF THE FINITE POPULATION MEAN IN THE PRESENCE OF NON–RESPONSE. Australian & New Zealand Journal of Statistics. 2008 Dec;50(4):395–408. doi: 10.1111/j.1467-842X.2008.00525.x

16. Ericson WA. Optimal sample design with non–response. Journal of the American Statistical Association, 1967, 62(317):63–78. doi: 10.1080/01621459.1967.10482888

17. Smouse EP. Bayesian estimation of a finite population total using auxiliary information in the presence of non–response. Journal of the American Statistical Association. 1982 Mar 1;77(377):97–102. doi: 10.1080/01621459.1982.10477771

18. Fuller WA. Simple estimators for the mean of skewed populations. Statistica Sinica. 1991 Jan 1 : 137–58.

19. Royall RM, Cumberland WG. The finite-population linear regression estimator and estimators of its variance—An empirical study. Journal of the American Statistical Association. 1981 Dec 1;76(376):924–930. doi: 10.1080/01621459.1981.10477742

20. Royall RM. The linear least-squares prediction approach to two-stage sampling. Journal of the American Statistical Association. 1976 Sep 1;71(355):657–664. doi: 10.1080/01621459.1976.10481542

21. Godambe VP. A unified theory of sampling from finite populations. Journal of the Royal Statistical Society: Series B (Methodological). 1955 Jul;17(2):269–78.

22. Godambe VP, Joshi VM. Admissibility and Bayes estimation in sampling finite populations. I. The Annals of Mathematical Statistics. 1965 Dec 1;36(6):1707–22. doi: 10.1214/aoms/1177699799

23. Basu D. An essay on the logical foundations of survey sampling, Part I. Foundations of Statistical Inferences, VP Godambe and DA Sprott. 1971, 203–233.

24. Sarndal CE, Thomsen I, Hoem JM, Lindley DV, Barndorff-Nielsen O, Dalenius T. Design-based and model-based inference in survey sampling [with discussion and reply]. Scandinavian Journal of Statistics. 1978 Jan 1 : 27–52.

25. Godambe VP. Estimation in survey sampling: robustness and optimality. Journal of the American Statistical Association. 1982 Jun 1;77(378):393–403.

26. Little RJ. Survey non–response adjustments for estimates of means. International Statistical Review/Revue Internationale de Statistique. 1986 Aug 1 : 139–157.

27. Bellhouse DR. Model-based estimation in finite population sampling. The American Statistician. 1987 Nov 1;41(4):260–2. doi: 10.1080/00031305.1987.10475496

28. Valliant R. Finite population sampling and inference: a prediction approach. 2000.

29. Jiang J, Lahiri P. Mixed model prediction and small area estimation. Test. 2006 Jun 1;15(1):1. doi: 10.1007/BF02595419

30. Sarndal CE, Swensson B, Wretman J. Model assisted survey sampling. Springer Science & Business Media; 2003 Oct 31.

31. Brewer K. and Gregoire TG. Introduction to survey sampling. In Handbook of Statistics. 2009, 29 : 9–37. Elsevier

32. Valliant R. Model-based prediction of finite population totals. Sample Surveys: Inference and Analysis, 2009, 29B:23–31.

33. Mukhopadhyay P. Estimation of a finite population total under regression models: a review. Sankhyā: The Indian Journal of Statistics, Series B. 1993 Aug 1 : 141–155.

34. Rawlings JO, Pantula SG, Dickey DA. Applied regression analysis: a research tool. Springer Science & Business Media; 2001 Apr 6.

35. Draper NR and Smith H. Applied regression analysis, 2014, 326. John Wiley & Sons

36. Holt D. and Elliot D. (1991). Methods of weighting for unit non-response. The Statistician, pages 333–342. doi: 10.2307/2348286

37. Särndal CE. The calibration approach in survey theory and practice. Survey Methodology. 2007 Dec 1;33(2):99–119.

38. Ahmed S. & Shabbir J. Extreme-cum-median ranked set sampling, Brazilian Journal of Probability and Statistics, 2019, 33 (1): 24–38. doi: 10.1214/17-BJPS373

39. Chambers R, Clark R. An introduction to model-based survey sampling with applications. OUP Oxford; 2012 Jan 12.

40. Hoerl AE, Kennard RW. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics. 1970 Feb 1;12(1):55–67. doi: 10.1080/00401706.1970.10488634

41. Vinod HD, Ullah A. Recent advances in regression methods. Marcel Dekker Incorporated; 1981, 41.

42. Yeh IC, Yang KJ, Ting TM. Knowledge discovery on RFM model using Bernoulli sequence. Expert Systems with Applications. 2009 Apr 1;36(3):5866–5871. doi: 10.1016/j.eswa.2008.07.018

43. Najarian S, Arashi M, Kibria BG. A simulation study on some restricted ridge regression estimators. Communications in Statistics-Simulation and Computation. 2013 Apr 1;42(4):871–890. doi: 10.1080/03610918.2012.659953