This paper presents a generalization of the scrambled response models of Hussain and Shabbir [Hussain, Z. and Shabbir, J. “On estimation of mean of a sensitive quantitative variable”, InterStat, (#006), (2007)] and Gjestvang and Singh [Gjestvang, C.R. and Singh, S. “An improved randomized response model: estimation of mean”, Journal of Applied Statistics, 36(12), pp. 1361–1367 (2009)]. The suggested generalization is helpful in procuring honest data on socially undesirable characteristics. The suggested estimator is found to be unconditionally more efficient in terms of variability. From a privacy point of view, the proposed class of models is compared using the privacy protection measure of Zaizai et al. [Zaizai, Y., Jingu, W. and Junfeng, L. “An efficiency and protection based comparison among the quantitative randomized response strategies”, Communications in Statistics-Theory and Methods, 38, pp. 400–408 (2009)]. Unlike many scrambled response models, the proposed class of models does not require known parameters of the scrambling variables. The relative efficiency of the proposed model is evaluated numerically by simulation for some fixed values of the parameters. The practical application of the proposed model is also studied through a small scale survey.
Randomized response technique; Sensitive character; Estimation of mean; Anonymity; Social desirability bias; Scrambled response
One of the leading tools for obtaining data on human populations is the social survey. The social survey has proved highly practical for measuring opinions, attitudes, and behaviors that cover a wide band of interests. Surveys are conducted for many reasons, the most apparent being the non-availability of certain facts or information in the archives. For instance, if one is interested in knowing crime rates, information about unseen crimes or unreported victimization experience is not available in formal records on crime. Sometimes the facts about the individuals (in a population) are inaccessible to the investigators for legal reasons. For example, in many countries, certain information about criminals is kept confidential, due to security and privacy concerns. In most studies, the study population may be so geographically dispersed that studying the whole population is simply infeasible.
Questionnaires, in particular social surveys, generally consist of many items. Some of the items may concern sensitive or high-risk behavior that carries a social stigma. One problem with research on high-risk behavior is that respondents may consciously or unconsciously provide incorrect information. In psychological surveys, social desirability bias has been observed as a major cause of distortion in standardized personality measures. Survey researchers have similar concerns about the truth of survey findings on such topics as drunk driving, use of marijuana, tax evasion, illicit drug use, induced abortion, shoplifting, cheating in exams, and sexual behavior.
The most serious problem in studying certain social problems that are sensitive in nature (e.g. induced abortion, drug usage, tax evasion, etc.) is lack of a reliable measure of their incidence or prevalence. Social stigma and fear of reprisal usually result in lying by the respondents when approached with the conventional or direct-response survey method. An obvious consequence of false reporting is unavoidable estimation bias. Warner  showed this evasive answer bias to prevail in the estimate obtained by direct questioning, and proposed a Randomized Response Model (RRM) to estimate the proportion of prevalence of sensitive characteristics in a population. Greenberg et al.  extended the RRM to the estimation of mean of a sensitive quantitative variable. The recent articles on the estimation of mean of a sensitive variable include: Eichhorn and Hayre  , Singh et al.  , Gupta et al.  , Bar-Lev et al.  , Ryu et al.  , Hussain et al.  , Hussain and Shabbir  and  , Huang  and  , Gjestvang and Singh  and  , Gupta et al.  and the references cited therein. For a detailed understanding of RRM, interested readers may be referred to Chaudhuri and Mukerjee  .
In the literature on estimation of scrambled randomized response models, we can find two types of scrambling, namely, the additive and multiplicative. The additive scrambled response model is due to Himmelfarb and Edgell  , and has been advocated by many authors due to its simplicity of application (cf.  ,  ,  and  ). Keeping in mind this advocacy, it is obvious that we need to search for an improvement in additive randomized response models. In this study, we present an unbiased estimator of the mean, assuming Simple Random Sampling with Replacement (SRSWR) and Stratified Random Sampling (STRS) protocols. The paper is organized as follows. In Section 2 , we briefly present Hussain and Shabbir  and Gjestvang and Singh  models. The proposed generalization is showcased in Section 3 under SRSWR and STRS schemes. Section 4 of the paper consists of efficiency comparisons and Section 5 is about a practical study, followed by conclusions of this study in Section 6 .
Following Gjestvang and Singh  , Gjestvang and Singh  proposed an additive Randomized Response Model. For the th individual, let and be the values of the sensitive and scrambling variables, respectively. The distribution of is completely known, with mean and variance, . Assuming and to be positive real constants, the Gjestvang and Singh  model provides two options for the respondents; (i) “Report the value ”, and (ii) “Report the value ”, with pre-assigned probabilities and , respectively, where is a scrambling variable with mean, , and variance, . Let be the reported response of the th respondent, then, it can be written as:
where is a Bernoulli random variable having value ‘1’, if statement (i) is randomly chosen by the respondent and ‘0’, otherwise. They proposed an unbiased estimator of the population mean, , of the sensitive variable, , as:
It is to be noted that borrowing the idea from Gjestvang and Singh  , Hussain and Shabbir  proposed an improved version of Gjestvang and Singh  RRM. The RRM of Hussain and Shabbir  is actually a two stage RRM. In their proposed RRM, a sample of size is drawn from the population with a SRSWR sampling scheme. Each individual in the sample is requested to use a randomization device, , which consists of the two statements:
where , if statement (i) is randomly chosen in , and ‘0’, otherwise. Similarly, , if statement (i) is randomly chosen in and ‘0’, otherwise. An unbiased estimator of is given by:
The variance of the is given by:
Hussain and Shabbir  used the idea of distributing the probability of reporting on the true value of into stages using the multiplicative randomized response models and reported the following advantages: (i) the inability of a clever respondent to correctly guess the total probability of reporting ‘ ’; (ii) greater protection of the respondents' privacy, since the interviewer cannot know at which stage a respondent actually reported his/her response; and (iii) the increased degrees of freedom to set the values for design probabilities, in order to keep the total probability of reporting on at some desired level. As the use of additive randomized response models has been advocated by many authors, like Gjestvang and Singh  , Gupta et al.  , and Huang  and  , we study the additive RRM of Gjestvang and Singh  with an increased number of randomization stages. In the next section, we present the proposed RRM.
In the proposed RRM, a sample of size is drawn from the population with the SRSWR sampling scheme. Each individual in the sample is provided randomization devices, , and requested to use these randomization devices in the following order:
Use the randomization device, , which consists of the two statements:
The randomization device, , consists of the two statements:
The randomization device, , consists of the two statements:
where , if the statement (i) is chosen randomly by the respondents using the randomization device, .
Let be the expectation operator over all possible samples and be the expectation operator over the randomization device, then:
We propose an unbiased estimator of population mean, , as:
The variance of the proposed estimator is given by:
It is to be noted that for , the proposed model is essentially the Gjestvang and Singh  RRM, and, for , it reduces to Hussain and Shabbir  RRM. For all , the responses of the respondents can be expressed as:
The -stage randomization device can be viewed as a two stage randomization procedure with and . Thus, the Hussain and Shabbir  procedure is a special case of the proposed procedure. In addition, the proposed procedure has the advantage of distributing the total probability of reporting on into an increased number of stages. In the lines to follow, we illustrate the working of the proposed RRM for . Following these lines, we can easily derive the generalized results given by Eqs. (3) , (6) and (10) .
For the purpose of illustrating the idea, suppose we have three different urns containing black and white cards, with and being the proportions of white cards, respectively. A selected respondent is asked to pick a card randomly from the urn ‘ ’. If a white card is picked, he/she is asked to report the true value of ‘ ’, otherwise, he/she is directed to go to the second urn, ‘ ’. At this stage, again, he/she is requested to randomly draw a card from the urn, ‘ ’, and report the value of ‘ ’, if the white card is drawn, otherwise, directed to go to third urn, ‘ ’, and randomly draw a card from the third urn. Then, report ‘ ’ if a white card is drawn, otherwise, report ‘ ’.
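The three-urn procedure can be sketched as a small simulation. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the exact statements on the cards are not recoverable from the text here, so we assume that a white card in the first two urns means "report the true value", and that the last urn chooses between an additive report `y + alpha * s` (white) and `y - beta * s` (black), with the white proportion set to `beta / (alpha + beta)` so that the scrambling term has zero mean. The function name, the parameter values and the distributions are hypothetical.

```python
import random
import statistics


def scrambled_response(y, stage_probs, alpha, beta, draw_s, rng):
    """Return one reported value under the ASSUMED multi-stage additive scheme."""
    # Early urns: a white card (probability t) means "report the true value".
    for t in stage_probs:
        if rng.random() < t:
            return y
    # Last urn (assumption): white -> y + alpha*s, black -> y - beta*s.
    # p is chosen so that p*alpha - (1 - p)*beta = 0, i.e. zero-mean scrambling.
    p = beta / (alpha + beta)
    s = draw_s(rng)
    return y + alpha * s if rng.random() < p else y - beta * s


rng = random.Random(2024)

# Hypothetical sensitive values; the distribution is an assumption for illustration.
true_values = [rng.gauss(3.0, 0.5) for _ in range(20000)]

# Hypothetical design: white-card proportions 0.20 and 0.12 in the first two urns,
# alpha = 0.1 and beta = 0.3 (so p = 0.75), scrambling variable ~ Normal(5, 0.5).
reports = [
    scrambled_response(
        y, [0.20, 0.12], alpha=0.1, beta=0.3,
        draw_s=lambda r: r.gauss(5.0, 0.5), rng=rng,
    )
    for y in true_values
]

mean_true = statistics.fmean(true_values)
mean_reported = statistics.fmean(reports)
print(f"mean of true values:      {mean_true:.3f}")
print(f"mean of scrambled values: {mean_reported:.3f}")
```

Under these assumptions the sample mean of the scrambled reports estimates the population mean without using the mean or variance of the scrambling variable, because the scrambling term averages out to zero by design.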
The th respondent selected in the sample of size , drawn by using simple random sampling with replacement (SRSWR), is requested to report the value:
where is defined as earlier. The expected value of the observed response is:
The unbiased estimator of is then given, as in Eq. (9) , with variance:
As pointed out by one of the referees, one of the two key issues in the scrambling model is the degree of privacy protection provided and competing models should also be compared at equal levels of privacy protection. For this purpose, we take the privacy measure proposed by Zaizai et al.  . The privacy measure proposed by Zaizai et al.  is defined as , where is the scrambled response obtained through a given scrambling model. The model with the larger value of is taken as a more protective model. The privacy measure by Zaizai et al.  is not a normalized privacy measure. We normalize it as , where is defined as earlier. The normalized privacy measures for the proposed, Hussain and Shabbir  and Gjestvang and Singh  models are given, respectively, as below:
From Eqs. , and , it is observed that the model of Gjestvang and Singh  is more protective than the other two models, and the model of Hussain and Shabbir  is more protective only than the proposed model. It is also observed that privacy protection (efficiency) is a decreasing (increasing) function of . Thus, there is a tradeoff between privacy and efficiency. The value of should be fixed, depending upon the required privacy protection and efficiency.
Acceptance of the unrelated variable, , by the respondents, as pointed out by one of the referees, is another key issue of concern. Explaining the working of the whole procedure to respondents may be needed in some situations, but not always. It depends upon the nature of the study variable and the sampled population. If the study variable is sensitive enough, the procedure should be explained to the respondents, assuring them that their individual answers cannot be traced back to their true values on the study variable and that only the population mean is estimable. Explaining the procedure would help decrease suspicion among the respondents. Though any unrelated variable with a known population mean and variance may fairly be used, we recommend generating random numbers from a known distribution by computer, writing them on cards and putting them into a box. Otherwise, the number of siblings, family size, last digit of the social security number, etc. may be used as an unrelated variable.
Suppose the population is partitioned into strata, and a sample is selected by simple random sampling with replacement from each stratum. Using the results in Section 3.1 , we can show that for the th stratum, the estimator of is given by:
Its variance is given by:
The mean estimators for individual strata can be added together to obtain a mean estimator for the whole population. The mean estimator of is:
where is the number of units in the whole population, is total number of units in stratum , and for , so that .
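The combination of stratum estimators into an overall estimator can be sketched as follows. This is a minimal sketch under assumptions, since the displayed formulas are not recoverable from the text here: we assume the estimator in each stratum is the sample mean of the scrambled responses, that the stratum weights are the population shares of the strata, and that sampling within each stratum is SRSWR, so the variance of a stratum mean is estimated by the sample variance divided by the stratum sample size. The function name and the example data are hypothetical.

```python
from statistics import fmean, variance


def stratified_mean(responses_by_stratum, stratum_sizes):
    """Weighted mean of stratum means and its estimated variance (assumed form)."""
    total = sum(stratum_sizes)
    estimate = 0.0
    var_estimate = 0.0
    for responses, size in zip(responses_by_stratum, stratum_sizes):
        weight = size / total          # population share of the stratum
        n_h = len(responses)           # stratum sample size
        estimate += weight * fmean(responses)
        # Strata are sampled independently, so the variances add.
        var_estimate += weight ** 2 * variance(responses) / n_h
    return estimate, var_estimate


# Hypothetical scrambled responses from two strata of sizes 600 and 400.
est, var_est = stratified_mean(
    [[2.0, 3.0, 4.0], [1.0, 1.0, 3.0, 3.0]],
    [600, 400],
)
print(est, var_est)
```

Because the weights sum to one, the combined estimator is unbiased whenever each stratum estimator is unbiased.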
It is obvious that the proposed mean estimator, , is an unbiased estimate for the population mean, . Since the selections in different strata are made independently, each unbiased mean estimator, , has its own variance. The variance of is given by:
The optimal allocation of to derive the minimum variance of , subject to , is approximately given by:
The minimal variance of the estimator, , is given by:
Application of the Gjestvang and Singh  model in the stratified sampling with fixed total sample size and optimum allocation of sample sizes in different strata yields the following mean estimator:
with minimal variance:
The proposed estimator, based on SRSWR, will be more efficient than that of Gjestvang and Singh  , if:
which is always true.
Similarly the efficiency condition, with respect to Hussain and Shabbir  , is given by:
which is always true.
Our proposed stratified mean estimator is more efficient than the Gjestvang and Singh  stratified mean estimator, iff:
If, for each stratum , then the above inequality is always true. Using Eq. (26) for each stratum, we can see that . Thus, Eq. (27) is always true. Hence, the proposed stratified mean estimator is more efficient than that of Gjestvang and Singh  . To assess the extent of the Relative Efficiency (RE), we carried out a simulation study, assuming and different values of and . The REs of the proposed model relative to Hussain and Shabbir  and Gjestvang and Singh  models are defined, respectively, as and . The RE results are shown in Table A.2 in the Appendix .
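A relative efficiency comparison of this kind can be sketched in closed form. This is a hedged illustration and does not reproduce the paper's equations, which are not recoverable from the text here. We assume an additive scheme in which the true value is reported at each early stage with the given probability, and otherwise the response is `y + alpha * s` with probability `beta / (alpha + beta)` or `y - beta * s` with the remaining probability. Under that assumption the scrambling term has zero mean and the variance of a single response is `sigma2_y + q * alpha * beta * (theta**2 + gamma2)`, where `q` is the probability of reaching the last stage, `theta` is the mean and `gamma2` the variance of the scrambling variable. All names and parameter values are hypothetical.

```python
def response_variance(sigma2_y, stage_probs, alpha, beta, theta, gamma2):
    """Variance of one scrambled response under the ASSUMED additive scheme."""
    q = 1.0  # probability that no early stage asked for the true value
    for t in stage_probs:
        q *= 1.0 - t
    return sigma2_y + q * alpha * beta * (theta ** 2 + gamma2)


def relative_efficiency(competitor_stages, proposed_stages, **params):
    """Ratio of competitor variance to proposed variance (same sample size)."""
    return (response_variance(stage_probs=competitor_stages, **params)
            / response_variance(stage_probs=proposed_stages, **params))


# Hypothetical parameter values, chosen only for illustration.
params = dict(sigma2_y=0.25, alpha=0.1, beta=0.3, theta=5.0, gamma2=0.25)

# No early stage: single-device additive model (assumed Gjestvang-Singh form).
re_vs_single = relative_efficiency([], [0.20, 0.12], **params)
# One early stage: assumed two-stage form.
re_vs_two_stage = relative_efficiency([0.20], [0.20, 0.12], **params)
print(re_vs_single, re_vs_two_stage)
```

Under these assumptions each added stage with a positive probability of reporting the true value shrinks `q`, so both ratios exceed one, which is consistent with the efficiency conditions stated above.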
| Model | Estimated mean | Estimated variance | Large sample 95% confidence interval |
|---|---|---|---|
| Hussain and Shabbir | 3.048 | 1.940 | (2.344, 3.105) |
| Gjestvang and Singh | 2.711 | 2.023 | (2.314, 3.107) |
| Direct questioning | 3.048 | 0.269 | (2.936, 3.159) |
We considered an application of the proposed model with , in estimating the average GPA of the students at Quaid-i-Azam University, Islamabad. To estimate the average GPA of the students, we took a sample of 100 second-semester students. Each student was requested to report responses using the proposed, Hussain and Shabbir  , and Gjestvang and Singh  RR models. The responses obtained from the students are reported in Table A.3 , Table A.4 and Table A.5 (see Appendix ). We generated 100 random numbers from a normal distribution, with mean 5 and standard deviation 0.5, wrote them on cards (white and black) and placed the cards in a transparent box. Thus, . We decided to choose , and , that is, .
In the first deck of 100 cards, on 20 cards we wrote the statement “Report GPA”, and on the remaining 80 we wrote the statement “go to second box”. In the second deck of 100 cards, on 12 cards we wrote the statement “Report GPA”, and on the remaining 88 cards we wrote the statement “go to third box”. Similarly, in the third deck of cards, on 75 cards we wrote the statement “Report: (Random number)”, and on the remaining 25 cards we wrote the statement “Report: (Random number)”. Assuming that , the data were obtained from the same respondents, which were, essentially, the data obtained through Hussain and Shabbir  RRM. Similarly, assuming the data were obtained, again, from the same respondents. Obviously, those were the data obtained by Gjestvang and Singh  RRM. At the end, we requested the students to write their true GPA on a paper chit and drop it into a box without disclosing their identity. The true data are given in Table A.6 (see Appendix ).
The summary of the survey results is given in Table A.1 (see Appendix ). From Table A.1 , it is observed that the estimates based on the responses through the proposed model are closer to the estimates based on true responses than those of the other two methods.
Using the idea of distributing the probability of reporting the true value of sensitive variables into an increased number of stages, we proposed a general class of scrambling models. The models by Hussain and Shabbir  and Gjestvang and Singh  have been shown to be special cases of the proposed class of models. The efficiency and privacy protection of the models in the proposed class are functions of the number of randomization stages. If increases, the efficiency (privacy protection) of the proposed class of models increases (decreases). Thus, a suitable value of is one that balances the objectives (greater efficiency and privacy protection) of the study. It is also established that the proposed class of models is actually a class of two stage models, having the additional advantage of distributing the probability of reporting on sensitive variables into an increased number of stages. Although the estimator given in Eq. (9) is unbiased and has smaller variance, its application in field surveys may be problematic because the individuals in the sample may become annoyed at having to report again and again. Thus must be chosen to be, at most, 3 or 4, in order to keep the proposed model practically feasible.
A small scale practical application of the proposed class of models for is also given. In this application, we compared two types of estimates, one based on direct responses and the others based on scrambled responses. These estimates may not represent the true average GPA of the whole campus, as we have taken only the second semester students. This study could have been extended to a large scale by including all the students in the university and getting their actual average GPA from the controller of the examination office. Then, comparing the true average GPA with the estimates would shed more light on the performance of the proposed estimators. Nevertheless, in this study the proposed estimator performs well compared to the other estimators considered in this paper. In conclusion, the proposed method of obtaining scrambled responses can be used safely and securely in field surveys on sensitive variables.
The authors are most grateful to the two learned referees for their guidance in improving the earlier draft of this article. The first author greatly appreciates the research facilities provided by King Abdulaziz University.