Estimation of the mean of a socially undesirable characteristic

Abstract

This paper presents a generalization of the scrambled response models of Hussain and Shabbir [Hussain, Z. and Shabbir, J. “On estimation of mean of a sensitive quantitative variable”, InterStat, (#006), (2007)] and Gjestvang and Singh [Gjestvang, C.R. and Singh, S. “An improved randomized response model: estimation of mean”, Journal of Applied Statistics, 36(12), pp. 1361–1367 (2009)]. The suggested generalization is helpful in procuring honest data on socially undesirable characteristics. The suggested estimator is found to be unconditionally more efficient in terms of variability. From a privacy point of view, comparison of the proposed class of models is made using the privacy protection measure by Zaizai et al. [Zaizai, Y., Jingu, W. and Junfeng, L. “An efficiency and protection based comparison among the quantitative randomized response strategies”, Communications in Statistics-Theory and Methods, 38, pp. 400–408 (2009)]. Unlike many scrambled response models, the proposed class of models is free from the need of known parameters of scrambling variables. The relative numerical efficiency of the proposed model is simulated for some fixed values of the parameters. The practical application of the proposed model is also studied through a small scale survey.

Keywords

Randomized response technique; Sensitive character; Estimation of mean; Anonymity; Social desirability bias; Scrambled response

1. Introduction

One of the leading tools for obtaining data pertaining to human populations is the social survey. The social survey has proved tremendously practical for measuring opinions, attitudes, and behaviors that cover a wide band of interests. Surveys are conducted for many reasons, the most apparent being the non-availability of certain facts or information in the archives. For instance, if one is interested in knowing crime rates, information about unseen crimes or unreported victimization experience is not available in formal records on crime. Sometimes the facts about the individuals in a population are inaccessible to the investigators for legal reasons. For example, in many countries, certain information about criminals is kept confidential due to security and privacy concerns. In most studies, the study population may be so geographically dispersed that studying the whole population is simply infeasible.

Questionnaires, in particular social surveys, generally consist of many items. Some of the items may concern sensitive or high-risk behavior that carries a social stigma. One problem with research on high-risk behavior is that respondents may consciously or unconsciously provide incorrect information. In psychological surveys, social desirability bias has been observed as a major cause of distortion in standardized personality measures. Survey researchers have similar concerns about the truth of survey results on such topics as drunk driving, use of marijuana, tax evasion, illicit drug use, induced abortion, shoplifting, cheating in exams, and sexual behavior.

The most serious problem in studying certain social problems that are sensitive in nature (e.g. induced abortion, drug usage, tax evasion, etc.) is lack of a reliable measure of their incidence or prevalence. Social stigma and fear of reprisal usually result in lying by the respondents when approached with the conventional or direct-response survey method. An obvious consequence of false reporting is unavoidable estimation bias. Warner [1] showed this evasive answer bias to prevail in the estimate obtained by direct questioning, and proposed a Randomized Response Model (RRM) to estimate the proportion of prevalence of sensitive characteristics in a population. Greenberg et al. [2] extended the RRM to the estimation of mean of a sensitive quantitative variable. The recent articles on the estimation of mean of a sensitive variable include: Eichhorn and Hayre [3] , Singh et al. [4] , Gupta et al. [5] , Bar-Lev et al. [6] , Ryu et al. [7] , Hussain et al. [8] , Hussain and Shabbir [9] and [10] , Huang [11] and [12] , Gjestvang and Singh [13] and [14] , Gupta et al. [15] and the references cited therein. For a detailed understanding of RRM, interested readers may be referred to Chaudhuri and Mukerjee [16] .

In the literature on estimation of scrambled randomized response models, we can find two types of scrambling, namely, the additive and multiplicative. The additive scrambled response model is due to Himmelfarb and Edgell [17] , and has been advocated by many authors due to its simplicity of application (cf. [11] , [12] , [13] and [15] ). Keeping in mind this advocacy, it is obvious that we need to search for an improvement in additive randomized response models. In this study, we present an unbiased estimator of the mean, assuming Simple Random Sampling with Replacement (SRSWR) and Stratified Random Sampling (STRS) protocols. The paper is organized as follows. In Section 2 , we briefly present Hussain and Shabbir [10] and Gjestvang and Singh [14] models. The proposed generalization is showcased in Section 3 under SRSWR and STRS schemes. Section 4 of the paper consists of efficiency comparisons and Section 5 is about a practical study, followed by conclusions of this study in Section 6 .

2. Gjestvang and Singh [14] , Hussain and Shabbir [10] RRMs and motivation

Following Gjestvang and Singh [13], Gjestvang and Singh [14] proposed an additive Randomized Response Model. For the ${\textstyle i}$ th individual, let ${\textstyle X_{i}}$ and ${\textstyle S_{i}}$ be the values of the sensitive and scrambling variables, respectively. The distribution of ${\textstyle S}$ is completely known, with mean ${\textstyle \mu _{S}(-\infty <\mu _{S}<\infty )}$ and variance ${\textstyle \sigma _{S}^{2}}$. Assuming ${\textstyle a}$ and ${\textstyle b}$ to be positive real constants, the Gjestvang and Singh [14] model provides two options for the respondents: (i) “Report the value ${\textstyle X_{i}+aS_{i}}$”, and (ii) “Report the value ${\textstyle X_{i}-bS_{i}}$”, with pre-assigned probabilities ${\textstyle P_{1}={\frac {b}{a+b}}}$ and ${\textstyle (1-P_{1})={\frac {a}{a+b}}}$, respectively. Let ${\textstyle Y_{i}}$ be the reported response of the ${\textstyle i}$ th respondent; it can then be written as:

Y_{i}=\beta _{i}(X_{i}+aS_{i})+(1-\beta _{i})(X_{i}-bS_{i}){\mbox{,}}

( 1)

where ${\textstyle \beta _{i}}$ is a Bernoulli random variable having value ‘1’, if statement (i) is randomly chosen by the respondent and ‘0’, otherwise. They proposed an unbiased estimator of the population mean, ${\textstyle \mu _{X}}$ , of the sensitive variable, ${\textstyle X}$ , as:

{\overset {\mbox{ˆ}}{\mu }}_{X(GS)}={\frac {1}{n}}\sum _{i=1}^{n}Y_{i}{\mbox{,}}

( 2)

with variance:

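Var({\overset {\mbox{ˆ}}{\mu }}_{X(GS)})={\frac {1}{n}}[\sigma _{X}^{2}+ab(\mu _{S}^{2}+\sigma _{S}^{2})]{\mbox{.}}

( 3)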
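The behaviour of this model can be checked numerically. The following is a minimal Monte Carlo sketch, not part of the original study: the normal distributions chosen for ${\textstyle X}$ and ${\textstyle S}$, the parameter values, and the function names `gs_response` and `gs_variance` are all illustrative assumptions. It compares the simulated mean and variance of the sample mean of the scrambled responses with ${\textstyle \mu _{X}}$ and with the variance ${\textstyle {\frac {1}{n}}[\sigma _{X}^{2}+ab(\mu _{S}^{2}+\sigma _{S}^{2})]}$ of Eq. (3).

```python
import random
import statistics


def gs_response(x, s, a, b, rng):
    """One scrambled response: X + a*S with probability b/(a+b), else X - b*S."""
    p1 = b / (a + b)
    return x + a * s if rng.random() < p1 else x - b * s


def gs_variance(sigma2_x, mu_s, sigma2_s, a, b, n):
    """Closed-form variance of the sample mean of the scrambled responses, Eq. (3)."""
    return (sigma2_x + a * b * (mu_s ** 2 + sigma2_s)) / n


rng = random.Random(1)          # fixed seed so the run is reproducible
a, b, n = 0.2, 0.6, 100
mu_x, sd_x = 3.0, 0.5           # hypothetical distribution of the sensitive variable
mu_s, sd_s = 5.0, 0.5           # hypothetical distribution of the scrambling variable

estimates = []
for _ in range(4000):           # 4000 replicated surveys of size n
    ys = [gs_response(rng.gauss(mu_x, sd_x), rng.gauss(mu_s, sd_s), a, b, rng)
          for _ in range(n)]
    estimates.append(statistics.fmean(ys))

mc_mean = statistics.fmean(estimates)        # should be close to mu_x
mc_var = statistics.pvariance(estimates)     # should be close to theory_var
theory_var = gs_variance(sd_x ** 2, mu_s, sd_s ** 2, a, b, n)
print(mc_mean, mc_var, theory_var)
```

Under these assumed values the closed-form variance is 0.0328, and the simulated mean and variance should agree with ${\textstyle \mu _{X}=3}$ and with this value up to Monte Carlo error.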

It is to be noted that borrowing the idea from Gjestvang and Singh [13] , Hussain and Shabbir [10] proposed an improved version of Gjestvang and Singh [14] RRM. The RRM of Hussain and Shabbir [10] is actually a two stage RRM. In their proposed RRM, a sample of size ${\textstyle n}$ is drawn from the population with a SRSWR sampling scheme. Each individual in the sample is requested to use a randomization device, ${\textstyle R_{1}}$ , which consists of the two statements:

(i) “report your true response, ${\textstyle X_{i}}$, of the sensitive question” and
(ii) “go to the randomization device, ${\textstyle R_{2}}$”,

represented with the probabilities ${\textstyle P_{1}}$ and ${\textstyle (1-P_{1})}$, respectively. The randomization device, ${\textstyle R_{2}}$, consists of the two statements:

(i) “report the scrambled response, ${\textstyle X_{i}+aS_{i}}$”, and
(ii) “report the scrambled response, ${\textstyle X_{i}-bS_{i}}$”,

represented with probabilities ${\textstyle P_{2}={\frac {b}{a+b}}}$ and ${\textstyle 1-P_{2}={\frac {a}{a+b}}}$, respectively. Let ${\textstyle Z_{i}}$ be the response of the ${\textstyle i}$ th respondent; it can then be written as:

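Z_{i}=\alpha _{i}X_{i}+(1-\alpha _{i})\lbrace \beta _{i}(X_{i}+aS_{i})+(1-\beta _{i})(X_{i}-bS_{i})\rbrace {\mbox{,}}

( 4)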

where ${\textstyle \alpha _{i}=1}$ , if statement (i) is randomly chosen in ${\textstyle R_{1}}$ , and ‘0’, otherwise. Similarly, ${\textstyle \beta _{i}=1}$ , if statement (i) is randomly chosen in ${\textstyle R_{2}}$ and ‘0’, otherwise. An unbiased estimator of ${\textstyle \mu _{X}}$ is given by:

{\overset {\mbox{ˆ}}{\mu }}_{A(HS)}={\frac {1}{n}}\sum _{i=1}^{n}Z_{i}{\mbox{.}}

( 5)

The variance of the ${\textstyle {\overset {\mbox{ˆ}}{\mu }}_{A(HS)}}$ is given by:

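Var({\overset {\mbox{ˆ}}{\mu }}_{A(HS)})={\frac {1}{n}}[\sigma _{X}^{2}+ab(1-P_{1})(\mu _{S}^{2}+\sigma _{S}^{2})]{\mbox{.}}

( 6)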

Hussain and Shabbir [9] used the idea of distributing the probability of reporting the true value of ${\textstyle X}$ over ${\textstyle k(>2)}$ stages using multiplicative randomized response models, and reported the following advantages: (i) a clever respondent is unable to correctly guess the total probability of reporting ‘${\textstyle X}$’; (ii) more protection is provided for the privacy of the respondents, since the interviewer cannot know at which stage a respondent actually reported his/her response; and (iii) there are more degrees of freedom in setting the design probabilities so as to keep the total probability of reporting ${\textstyle X}$ at some desired level. As the use of additive randomized response models has been advocated by many authors, such as Gjestvang and Singh [14], Gupta et al. [15], and Huang [11] and [12], we study the additive RRM of Gjestvang and Singh [14] with an increased number of randomization stages. In the next section, we present the proposed RRM.

3. Proposed class of RRMs

We present the proposed model under two sampling schemes, namely, SRSWR and STRS, in Sections 3.1 and 3.2, respectively.

3.1. Case of SRSWR

In the proposed RRM, a sample of size ${\textstyle n}$ is drawn from the population with the SRSWR sampling scheme. Each individual in the sample is provided ${\textstyle k(>2)}$ randomization devices, ${\textstyle R_{1},R_{2},\ldots ,R_{k}}$ , and requested to use these randomization devices in the following order:

Use the randomization device, ${\textstyle R_{1}}$ , which consists of the two statements:

(i) “report your true response, ${\textstyle X_{i}}$, of the sensitive question” and
(ii) “go to the randomization device, ${\textstyle R_{2}}$”,

represented with the probabilities ${\textstyle P_{1}}$ and ${\textstyle (1-P_{1})}$, respectively.

The randomization device, ${\textstyle R_{2}}$ , consists of the two statements:

(i) “report your true response, ${\textstyle X_{i}}$, of the sensitive question” and
(ii) “go to the randomization device, ${\textstyle R_{3}}$”,

represented with the probabilities ${\textstyle P_{2}}$ and ${\textstyle (1-P_{2})}$, respectively. Continuing in this way, the ${\textstyle (k-1)}$ th randomization device, ${\textstyle R_{k-1}}$, consists of the two statements:

(i) “report your true response, ${\textstyle X_{i}}$, of the sensitive question” and
(ii) “go to the randomization device, ${\textstyle R_{k}}$”,

represented with probabilities ${\textstyle P_{k-1}}$ and ${\textstyle (1-P_{k-1})}$, respectively.

The randomization device, ${\textstyle R_{k}}$ , consists of the two statements:

(i) “report the scrambled response, ${\textstyle X_{i}+aS_{i}}$”, and
(ii) “report the scrambled response, ${\textstyle X_{i}-bS_{i}}$”,

represented with probabilities ${\textstyle P_{k}={\frac {b}{a+b}}}$ and ${\textstyle (1-P_{k})={\frac {a}{a+b}}}$, respectively. Let ${\textstyle V_{i}}$ be the response of the ${\textstyle i}$ th respondent; it can then be written as:

V_{i}=\alpha _{1i}X_{i}+(1-\alpha _{1i})\lbrace \alpha _{2i}X_{i}+(1-\alpha _{2i})\lbrace \ldots \lbrace \alpha _{ki}(X_{i}+aS_{i})+(1-\alpha _{ki})(X_{i}-bS_{i})\rbrace \rbrace \rbrace {\mbox{,}}

( 7)

where ${\textstyle \alpha _{ji}=1}$ if statement (i) is randomly chosen by the ${\textstyle i}$ th respondent ${\textstyle (i=1,2,\ldots ,n)}$ using the randomization device ${\textstyle R_{j}(j=1,2,\ldots ,k)}$, and ‘0’ otherwise.

Let ${\textstyle E_{1}}$ be the expectation operator over all possible samples and ${\textstyle E_{2}}$ be the expectation operator over the randomization device, then:

E(V_{i})=E_{1}[E_{2}(V_{i})]{\mbox{,}}

where:

E_{2}(V_{i})=P_{1}X_{i}+(1-P_{1})\lbrace P_{2}X_{i}+(1-P_{2})\lbrace \ldots \lbrace P_{k}(X_{i}+a\mu _{S})+(1-P_{k})(X_{i}-b\mu _{S})\rbrace \rbrace \rbrace {\mbox{.}}

Since ${\textstyle P_{k}a-(1-P_{k})b=0}$ for ${\textstyle P_{k}={\frac {b}{a+b}}}$, the terms in ${\textstyle \mu _{S}}$ cancel. Thus:

E(V_{i})=E_{1}[P_{1}X_{i}+(1-P_{1})\lbrace P_{2}X_{i}+(1-P_{2})\lbrace \ldots \lbrace P_{k}(X_{i}+a\mu _{S})+(1-P_{k})(X_{i}-b\mu _{S})\rbrace \rbrace \rbrace ]=E_{1}(X_{i})=\mu _{X}{\mbox{.}}

( 8)

We propose an unbiased estimator of population mean, ${\textstyle \mu _{X}}$ , as:

{\overset {\mbox{ˆ}}{\mu }}_{X(p)}={\frac {1}{n}}\sum _{i=1}^{n}V_{i}{\mbox{.}}

( 9)

The variance of the proposed estimator is given by:

Var({\overset {\mbox{ˆ}}{\mu }}_{X(p)})={\frac {1}{n}}[\sigma _{X}^{2}+ab(1-P_{1})(1-P_{2})\ldots (1-P_{k-1})(\mu _{S}^{2}+\sigma _{S}^{2})]{\mbox{.}}

( 10)
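As a small illustration of Eq. (10), the sketch below evaluates the variance for an arbitrary number of stages. The function name `proposed_variance`, the argument `true_probs` (holding ${\textstyle P_{1},\ldots ,P_{k-1}}$) and the numerical values are illustrative assumptions, not quantities taken from the paper. An empty `true_probs` corresponds to ${\textstyle k=1}$ and one entry corresponds to ${\textstyle k=2}$.

```python
from math import prod


def proposed_variance(sigma2_x, mu_s, sigma2_s, a, b, n, true_probs=()):
    """Variance of Eq. (10); true_probs = (P_1, ..., P_{k-1})."""
    # Probability of reaching the last (scrambling) device; empty product is 1.
    reach_last = prod(1.0 - p for p in true_probs)
    return (sigma2_x + a * b * reach_last * (mu_s ** 2 + sigma2_s)) / n


# Hypothetical population and design values, chosen only for illustration.
args = dict(sigma2_x=0.25, mu_s=5.0, sigma2_s=0.25, a=0.2, b=0.6, n=100)

v_k1 = proposed_variance(**args)                          # k = 1
v_k2 = proposed_variance(**args, true_probs=(0.2,))       # k = 2
v_k3 = proposed_variance(**args, true_probs=(0.2, 0.12))  # k = 3
print(v_k1, v_k2, v_k3)
```

With these assumed inputs the variance falls as stages are added, because each extra factor ${\textstyle (1-P_{h})}$ shrinks the scrambling contribution.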

It is to be noted that for ${\textstyle k=1}$ , the proposed model is essentially the Gjestvang and Singh [14] RRM, and, for ${\textstyle k=2}$ , it reduces to Hussain and Shabbir [10] RRM. For all ${\textstyle k\geq 3}$ , the responses of the respondents can be expressed as:

V_{i}=\left\lbrace {\begin{array}{ll}X_{i},&{\mbox{with probability }}1-\prod _{h=1}^{k-1}(1-P_{h}){\mbox{,}}\\X_{i}+aS_{i},&{\mbox{with probability }}P_{k}\prod _{h=1}^{k-1}(1-P_{h}){\mbox{,}}\\X_{i}-bS_{i},&{\mbox{with probability }}\prod _{h=1}^{k}(1-P_{h}){\mbox{.}}\end{array}}\right.

The ${\textstyle k}$ -stage randomization device can be viewed as a two stage randomization procedure with ${\textstyle \lbrace 1-\prod _{h=1}^{k-1}(1-P_{h})\rbrace =P_{1}}$ and ${\textstyle P_{k}=P_{2}}$ . Thus, the Hussain and Shabbir [10] procedure is a special case of the proposed procedure. In addition, the proposed procedure has the advantage of distributing the total probability of reporting on ${\textstyle X_{i}}$ into an increased number of stages. In the lines to follow, we illustrate the working of the proposed RRM for ${\textstyle k=3}$ . Following these lines, we can easily derive the generalized results given by Eqs. (3) , (6) and (10) .

For the purpose of illustrating the idea, suppose we have three different urns ${\textstyle (U_{1},U_{2},U_{3})}$ containing black and white cards, with ${\textstyle P_{1},P_{2}}$ and ${\textstyle P_{3}}$ being the respective proportions of white cards. A selected respondent is asked to pick a card randomly from the urn ‘${\textstyle U_{1}}$’. If a white card is picked, he/she is asked to report the true value of ‘${\textstyle X}$’; otherwise, he/she is directed to the second urn, ‘${\textstyle U_{2}}$’. At this stage, he/she is again requested to randomly draw a card and to report the value of ‘${\textstyle X}$’ if a white card is drawn; otherwise, he/she is directed to the third urn, ‘${\textstyle U_{3}}$’, to randomly draw a card from it. He/she then reports ‘${\textstyle X+aS}$’ if a white card is drawn, and ‘${\textstyle X-bS}$’ otherwise.
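The urn procedure can be mimicked in a few lines of code. This is only a sketch under stated assumptions: the function name `k_stage_response`, the uniform distribution used for the true values, and the normal scrambling distribution are hypothetical choices for illustration. The design values ${\textstyle P_{1}=0.2}$, ${\textstyle P_{2}=0.12}$, ${\textstyle a=0.2}$ and ${\textstyle b=0.6}$ are those used later in Section 5.

```python
import random
import statistics


def k_stage_response(x, s, a, b, true_probs, rng):
    """Walk through urns U_1..U_{k-1}; a white card (probability P_j) means report X.

    If every urn sends the respondent onward, the last urn selects X + a*S with
    probability b/(a+b) and X - b*S otherwise.
    """
    for p in true_probs:
        if rng.random() < p:
            return x
    p_k = b / (a + b)
    return x + a * s if rng.random() < p_k else x - b * s


rng = random.Random(7)            # fixed seed so the run is reproducible
a, b = 0.2, 0.6
true_probs = (0.2, 0.12)          # P_1 and P_2 for k = 3
n = 200_000

xs = [rng.uniform(2.0, 4.0) for _ in range(n)]   # hypothetical true values
vs = [k_stage_response(x, rng.gauss(5.0, 0.5), a, b, true_probs, rng) for x in xs]

# Share of respondents who ended up reporting the true value unchanged.
true_share = sum(v == x for v, x in zip(vs, xs)) / n
print(statistics.fmean(xs), statistics.fmean(vs), true_share)
```

The mean of the reported values should be close to the mean of the true values, and the share of unscrambled reports should be close to ${\textstyle 1-(1-P_{1})(1-P_{2})=0.296}$.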

The ${\textstyle i}$ th respondent selected in the sample of size ${\textstyle n}$ , drawn by using simple random sampling with replacement (SRSWR), is requested to report the value:

V_{i}=\alpha _{1i}X_{i}+(1-\alpha _{1i})[\alpha _{2i}X_{i}+(1-\alpha _{2i})\lbrace \alpha _{3i}(X_{i}+aS_{i})+(1-\alpha _{3i})(X_{i}-bS_{i})\rbrace ]{\mbox{,}}

( 11)

where ${\textstyle \alpha _{ji}(j=1,2,3,i=1,2,\ldots ,n)}$ is defined as earlier. The expected value of the observed response is:

E(V_{i})=P_{1}\mu _{X}+(1-P_{1})[P_{2}\mu _{X}+(1-P_{2})\lbrace P_{3}(\mu _{X}+a\mu _{S})+(1-P_{3})(\mu _{X}-b\mu _{S})\rbrace ]=\mu _{X}{\mbox{.}}

( 12)

The unbiased estimator of ${\textstyle \mu _{X}}$ is then given, as in Eq. (9) , with variance:

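Var({\overset {\mbox{ˆ}}{\mu }}_{X(p)})={\frac {1}{n}}[\sigma _{X}^{2}+ab(1-P_{1})(1-P_{2})(\mu _{S}^{2}+\sigma _{S}^{2})]{\mbox{.}}

( 14)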

As pointed out by one of the referees, one of the two key issues in the scrambling model is the degree of privacy protection provided and competing models should also be compared at equal levels of privacy protection. For this purpose, we take the privacy measure proposed by Zaizai et al. [18] . The privacy measure proposed by Zaizai et al. [18] is defined as ${\textstyle E(X_{i}-T_{i})^{2}}$ , where ${\textstyle T_{i}}$ is the scrambled response obtained through a given scrambling model. The model with the larger value of ${\textstyle E(X_{i}-T_{i})^{2}}$ is taken as a more protective model. The privacy measure by Zaizai et al. [18] is not a normalized privacy measure. We normalize it as ${\textstyle E({\frac {X_{i}-T_{i}}{\mu _{S}^{2}}})^{2}}$ , where ${\textstyle \mu _{S}}$ is defined as earlier. The normalized privacy measures for the proposed, Hussain and Shabbir [10] and Gjestvang and Singh [14] models are given, respectively, as below:

( 15)

From Eqs. (15), (16) and (17), it is observed that the model of Gjestvang and Singh [14] is more protective than the other two models, and that the model of Hussain and Shabbir [10] is more protective only than the proposed model. It is also observed that privacy protection (efficiency) is a decreasing (increasing) function of ${\textstyle k}$. Thus, there is a tradeoff between privacy and efficiency, and the value of ${\textstyle k}$ should be fixed depending upon the required privacy protection and efficiency.
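As a worked sketch of why protection decreases with ${\textstyle k}$, the unnormalized measure ${\textstyle E(X_{i}-T_{i})^{2}}$ can be derived for the proposed model from the three possible outcomes of ${\textstyle V_{i}}$. The derivation assumes that ${\textstyle S_{i}}$ is independent of the randomization devices, and it does not apply the normalization described above.

```latex
\begin{aligned}
E(X_i - V_i)^2
 &= a^2\,E(S_i^2)\,P_k\prod_{h=1}^{k-1}(1-P_h)
    + b^2\,E(S_i^2)\,(1-P_k)\prod_{h=1}^{k-1}(1-P_h) \\
 &= (\mu_S^2+\sigma_S^2)\,\frac{a^2 b + a b^2}{a+b}\,\prod_{h=1}^{k-1}(1-P_h) \\
 &= ab\,(\mu_S^2+\sigma_S^2)\prod_{h=1}^{k-1}(1-P_h).
\end{aligned}
```

Every additional stage multiplies this quantity by a factor ${\textstyle (1-P_{h})\leq 1}$, and the same quantity is the scrambling term inside the variance of Eq. (10), which makes the tradeoff between privacy and efficiency explicit. With an empty product (${\textstyle k=1}$) the expression becomes ${\textstyle ab(\mu _{S}^{2}+\sigma _{S}^{2})}$, and with ${\textstyle k=2}$ it becomes ${\textstyle ab(1-P_{1})(\mu _{S}^{2}+\sigma _{S}^{2})}$.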

Acceptance of the unrelated variable, ${\textstyle S}$, by the respondents, as pointed out by one of the referees, is another key issue of concern. Explaining the working of the whole procedure to respondents may be needed in some situations, but not always. It depends upon the nature of the study variable and the sampled population. If the study variable is sensitive enough, the procedure should be explained to the respondents, assuring them that their individual answers cannot be traced back to their true values on the study variable and that only the population mean is estimable. The explanation of the procedure would help decrease suspicion among the respondents. Though any unrelated variable with a known population mean and variance may be fairly used, we recommend generating random numbers from a known distribution by computer, writing them on cards and putting them into a box. Otherwise, the number of siblings, family size, last digit of the social security number, etc. may be used as an unrelated variable.

3.2. Case of STRS

Suppose the population is partitioned into ${\textstyle H}$ strata, and a sample is selected by simple random sampling with replacement from each stratum. Using the results in Section 3.1 , we can show that for the ${\textstyle h}$ th stratum, the estimator of ${\textstyle \mu _{Xh}}$ is given by:

{\overset {\mbox{ˆ}}{\mu }}_{X_{h}(p)}={\frac {1}{n_{h}}}\sum _{i=1}^{n_{h}}V_{hi}{\mbox{.}}

( 18)

Its variance is given by:

Var({\overset {\mbox{ˆ}}{\mu }}_{X_{h}(p)})={\frac {1}{n_{h}}}[\sigma _{X_{h}}^{2}+(\mu _{S_{h}}^{2}+\sigma _{S_{h}}^{2})(1-P_{1h})(1-P_{2h})\cdots (1-P_{(k-1)h})a_{h}b_{h}]{\mbox{.}}

( 19)

The mean estimators for the individual strata can be combined, weighted by the stratum weights, to obtain a mean estimator for the whole population. The mean estimator of ${\textstyle \mu _{X}}$ is:

{\tilde {\mu }}_{X(p)}=\sum _{h=1}^{H}W_{h}{\overset {\mbox{ˆ}}{\mu }}_{X_{h}(p)}=\sum _{h=1}^{H}W_{h}[{\frac {1}{n_{h}}}\sum _{i=1}^{n_{h}}V_{hi}]=\sum _{h=1}^{H}{\frac {W_{h}}{n_{h}}}[\sum _{i=1}^{n_{h}}V_{hi}]{\mbox{,}}

( 20)

where ${\textstyle N}$ is the number of units in the whole population, ${\textstyle N_{h}}$ is the total number of units in stratum ${\textstyle h}$, and ${\textstyle W_{h}={\frac {N_{h}}{N}}}$ for ${\textstyle h=1,2,\ldots ,H}$, so that ${\textstyle \sum _{h=1}^{H}W_{h}=1}$.

It is obvious that the proposed mean estimator, ${\textstyle {\tilde {\mu }}_{X(p)}}$ , is an unbiased estimate for the population mean, ${\textstyle \mu _{X}}$ . Since the selections in different strata are made independently, each unbiased mean estimator, ${\textstyle {\overset {\mbox{ˆ}}{\mu }}_{X_{h}(p)}}$ , has its own variance. The variance of ${\textstyle {\tilde {\mu }}_{X(p)}}$ is given by:

Var({\tilde {\mu }}_{X(p)})=Var(\sum _{h=1}^{H}W_{h}{\overset {\mbox{ˆ}}{\mu }}_{X_{h}(p)})=\sum _{h=1}^{H}W_{h}^{2}Var({\overset {\mbox{ˆ}}{\mu }}_{X_{h}(p)}){\mbox{.}}

( 21)

3.2.1. Optimum sample sizes

The optimal allocation of ${\textstyle n{\mbox{ to }}n_{1},n_{2},\ldots ,n_{H-1}{\mbox{ and }}n_{H}}$ to derive the minimum variance of ${\textstyle {\tilde {\mu }}_{X(p)}}$, subject to ${\textstyle n=\sum _{h=1}^{H}n_{h}}$, is approximately given by:

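{\frac {n_{h}}{n}}={\frac {W_{h}\lbrace \sigma _{X_{h}}^{2}+(\mu _{S_{h}}^{2}+\sigma _{S_{h}}^{2})(1-P_{1h})(1-P_{2h})\cdots (1-P_{(k-1)h})a_{h}b_{h}\rbrace ^{1/2}}{\sum _{h=1}^{H}W_{h}\lbrace \sigma _{X_{h}}^{2}+(\mu _{S_{h}}^{2}+\sigma _{S_{h}}^{2})(1-P_{1h})(1-P_{2h})\cdots (1-P_{(k-1)h})a_{h}b_{h}\rbrace ^{1/2}}}{\mbox{.}}

( 22)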

The minimal variance of the estimator, ${\textstyle {\tilde {\mu }}_{X(p)}}$ , is given by:

Var({\tilde {\mu }}_{X(p)})={\frac {1}{n}}[\sum _{h=1}^{H}W_{h}\lbrace \sigma _{X_{h}}^{2}+(\mu _{S_{h}}^{2}+\sigma _{S_{h}}^{2})(1-P_{1h})(1-P_{2h})\cdots (1-P_{(k-1)h})a_{h}b_{h}\rbrace ^{1/2}]^{2}{\mbox{.}}

( 23)
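The allocation and the resulting minimal variance can be computed as in the sketch below. It assumes a Neyman-type allocation in which ${\textstyle n_{h}}$ is proportional to ${\textstyle W_{h}}$ times the square root of the bracketed per-stratum term of Eq. (19). The stratum weights, the stratum variances and the function names are hypothetical, chosen only to illustrate the calculation.

```python
from math import prod, sqrt


def stratum_term(sigma2_x, mu_s, sigma2_s, a, b, true_probs):
    """Bracketed per-unit variance of one stratum, as in Eq. (19)."""
    reach_last = prod(1.0 - p for p in true_probs)
    return sigma2_x + (mu_s ** 2 + sigma2_s) * reach_last * a * b


def optimal_allocation(n, weights, terms):
    """Allocate n so that n_h is proportional to W_h * sqrt(term_h)."""
    scores = [w * sqrt(t) for w, t in zip(weights, terms)]
    total = sum(scores)
    return [n * s / total for s in scores]


def minimal_variance(n, weights, terms):
    """Eq. (23): (sum_h W_h * sqrt(term_h))^2 / n."""
    return sum(w * sqrt(t) for w, t in zip(weights, terms)) ** 2 / n


# Hypothetical three-stratum population with k = 3 in every stratum.
weights = [0.5, 0.3, 0.2]
terms = [stratum_term(s2, 5.0, 0.25, 0.2, 0.6, (0.2, 0.12)) for s2 in (0.25, 0.50, 1.00)]
n = 300

n_h = optimal_allocation(n, weights, terms)     # real-valued; round in practice
v_opt = minimal_variance(n, weights, terms)
# Cross-check: plug the allocation into the general variance of Eq. (21).
v_check = sum(w ** 2 * t / m for w, t, m in zip(weights, terms, n_h))
# For comparison: proportional allocation n_h = n * W_h.
v_prop = sum(w * t for w, t in zip(weights, terms)) / n
print(n_h, v_opt, v_check, v_prop)
```

Substituting the allocation into Eq. (21) reproduces Eq. (23), and the result is never larger than the variance under proportional allocation. The sizes returned here are real-valued and would be rounded to integers in an actual survey.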

Application of the Gjestvang and Singh [14] model in the stratified sampling with fixed total sample size and optimum allocation of sample sizes in different strata yields the following mean estimator:

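{\tilde {\mu }}_{X(GS)}=\sum _{h=1}^{H}W_{h}{\overset {\mbox{ˆ}}{\mu }}_{X_{h}(GS)}=\sum _{h=1}^{H}{\frac {W_{h}}{n_{h}}}\sum _{i=1}^{n_{h}}Y_{hi}{\mbox{,}}

( 24)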

with minimal variance:

Var({\tilde {\mu }}_{X(GS)})={\frac {1}{n}}[\sum _{h=1}^{H}W_{h}\lbrace \sigma _{X_{h}}^{2}+(\mu _{S_{h}}^{2}+\sigma _{S_{h}}^{2})(1-P_{1h})a_{h}b_{h}\rbrace ^{1/2}]^{2}{\mbox{.}}

( 25)

4. Efficiency comparisons

The proposed estimator, based on SRSWR, will be more efficient than that of Gjestvang and Singh [14] if:

Var({\overset {\mbox{ˆ}}{\mu }}_{X(GS)})-Var({\overset {\mbox{ˆ}}{\mu }}_{X(p)})\geq 0{\mbox{,}}

or if:

{\frac {1}{n}}[\sigma _{X}^{2}+ab(\mu _{S}^{2}+\sigma _{S}^{2})]-{\frac {1}{n}}[\sigma _{X}^{2}+ab(1-P_{1})(1-P_{2})\cdots (1-P_{k-1})(\mu _{S}^{2}+\sigma _{S}^{2})]\geq 0{\mbox{,}}

or if:

(1-P_{1})(1-P_{2})\cdots (1-P_{k-1})\leq 1{\mbox{,}}

( 26)

which is always true.

Similarly, the efficiency condition with respect to Hussain and Shabbir [10] is given by:

(1-P_{2})(1-P_{3})\cdots (1-P_{k-1})\leq 1{\mbox{,}}

which is always true.

Our proposed stratified mean estimator is more efficient than the Gjestvang and Singh [14] stratified mean estimator iff:

Var({\tilde {\mu }}_{X(p)})\leq Var({\tilde {\mu }}_{X(GS)}){\mbox{.}}

That is:

{\begin{array}{l}\displaystyle {\frac {1}{n}}[\sum _{h=1}^{H}W_{h}\lbrace \sigma _{X_{h}}^{2}+(\mu _{S_{h}}^{2}+\sigma _{S_{h}}^{2})(1-P_{1h})a_{h}b_{h}\rbrace ^{1/2}]^{2}\\\displaystyle -{\frac {1}{n}}[\sum _{h=1}^{H}W_{h}\lbrace \sigma _{X_{h}}^{2}+(\mu _{S_{h}}^{2}+\sigma _{S_{h}}^{2})(1-P_{1h})(1-P_{2h})\cdots (1-P_{(k-1)h})a_{h}b_{h}\rbrace ^{1/2}]^{2}\geq 0{\mbox{.}}\end{array}}

Or:

{\begin{array}{l}\displaystyle [\sum _{h=1}^{H}W_{h}\lbrace \sigma _{X_{h}}^{2}+(\mu _{S_{h}}^{2}+\sigma _{S_{h}}^{2})(1-P_{1h})a_{h}b_{h}\rbrace ^{1/2}]^{2}\\\displaystyle -[\sum _{h=1}^{H}W_{h}\lbrace \sigma _{X_{h}}^{2}+(\mu _{S_{h}}^{2}+\sigma _{S_{h}}^{2})(1-P_{1h})(1-P_{2h})\cdots (1-P_{(k-1)h})a_{h}b_{h}\rbrace ^{1/2}]^{2}\geq 0{\mbox{.}}\end{array}}

( 27)

If ${\textstyle Var({\overset {\mbox{ˆ}}{\mu }}_{X_{h}(p)})\leq Var({\overset {\mbox{ˆ}}{\mu }}_{X_{h}(GS)})}$ holds for each stratum, then the above inequality is always true. Using Eq. (26) for each stratum, we can see that ${\textstyle Var({\overset {\mbox{ˆ}}{\mu }}_{X_{h}(p)})\leq Var({\overset {\mbox{ˆ}}{\mu }}_{X_{h}(GS)})}$. Thus, Eq. (27) is always true. Hence, the proposed stratified mean estimator is more efficient than that of Gjestvang and Singh [14]. To assess the extent of the Relative Efficiency (RE), we carried out a simulation study, assuming ${\textstyle P_{1}=0.5,P_{2}=0.8}$ and different values of ${\textstyle a}$ and ${\textstyle b}$. The REs of the proposed model relative to the Hussain and Shabbir [10] and Gjestvang and Singh [14] models are defined, respectively, as ${\textstyle RE_{1}={\frac {Var({\overset {\mbox{ˆ}}{\mu }}_{X(HS)})}{Var({\overset {\mbox{ˆ}}{\mu }}_{X(p)})}}}$ and ${\textstyle RE_{2}={\frac {Var({\overset {\mbox{ˆ}}{\mu }}_{X(GS)})}{Var({\overset {\mbox{ˆ}}{\mu }}_{X(p)})}}}$. The RE results are shown in Table A.2 in the Appendix.
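The two ratios can also be evaluated directly from the closed-form variances, as in the following sketch for ${\textstyle k=3}$. It is not the simulation behind Table A.2: the values of ${\textstyle \sigma _{X}^{2}}$ and ${\textstyle \sigma _{S}^{2}}$ are hypothetical, the Hussain and Shabbir variance is taken with the same ${\textstyle P_{1}}$ as the proposed model, and the function names are illustrative, so the printed numbers are not expected to reproduce the tabulated ones.

```python
from math import prod


def variance(sigma2_x, mu_s, sigma2_s, a, b, n, true_probs=()):
    """Closed-form variance of the sample mean for stage probabilities true_probs."""
    reach_last = prod(1.0 - p for p in true_probs)
    return (sigma2_x + a * b * reach_last * (mu_s ** 2 + sigma2_s)) / n


def relative_efficiencies(sigma2_x, mu_s, sigma2_s, a, b, p1, p2, n=100):
    """Return (RE_1, RE_2) of the three-stage model relative to the k=2 and k=1 models."""
    v_p = variance(sigma2_x, mu_s, sigma2_s, a, b, n, (p1, p2))   # proposed, k = 3
    v_hs = variance(sigma2_x, mu_s, sigma2_s, a, b, n, (p1,))     # two-stage model
    v_gs = variance(sigma2_x, mu_s, sigma2_s, a, b, n)            # one-stage model
    return v_hs / v_p, v_gs / v_p


# Hypothetical sigma2_x = sigma2_s = 1; mu_s = 5, P_1 = 0.5, P_2 = 0.8 as in the text.
for b_value in (0.1, 0.5, 1.0, 2.0, 4.0):
    r1, r2 = relative_efficiencies(1.0, 5.0, 1.0, a=0.05, b=b_value, p1=0.5, p2=0.8)
    print(f"b={b_value}: RE1={r1:.3f}, RE2={r2:.3f}")

re1, re2 = relative_efficiencies(1.0, 5.0, 1.0, a=0.05, b=1.0, p1=0.5, p2=0.8)
```

Both ratios are at least one for any admissible design, and they grow with ${\textstyle ab}$, which matches the qualitative pattern of the tabulated results.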

Table A.1. Summary of the survey results.
Model	Estimated mean	Estimated variance	Large sample 95% confidence interval
Proposed	2.718	1.830	(2.359, 3.077)
Hussain and Shabbir [10]	3.048	1.940	(2.344, 3.105)
Gjestvang and Singh [14]	2.711	2.023	(2.314, 3.107)
Direct questioning	3.048	0.269	(2.936, 3.159)

Table A.2. RE of the proposed estimator ${\textstyle {\overset {\mbox{ˆ}}{\mu }}_{X(p)}}$ relative to ${\textstyle {\overset {\mbox{ˆ}}{\mu }}_{X(HS)}}$ and ${\textstyle {\overset {\mbox{ˆ}}{\mu }}_{X(GS)}}$ for ${\textstyle P_{1}=0.5,P_{2}=0.8}$ .
${\textstyle \mu _{S}}$	${\textstyle a}$	RE	${\textstyle b=0.1}$	${\textstyle b=0.5}$	${\textstyle b=1.0}$	${\textstyle b=1.5}$	${\textstyle b=2.0}$	${\textstyle b=3.0}$	${\textstyle b=4.0}$
5	0.01	RE1	1.006	1.033	1.062	1.111	1.122	1.226	1.230
5	0.01	RE2	1.063	1.325	1.598	1.924	2.193	2.644	3.180
5	0.05	RE1	1.034	1.157	1.281	1.366	1.453	1.479	1.482
5	0.05	RE2	1.332	2.421	3.452	4.151	4.872	5.664	6.296
5	0.1	RE1	1.071	1.273	1.459	1.517	1.612	1.720	1.730
5	0.1	RE2	1.628	3.497	4.957	5.762	6.102	7.168	7.673
5	0.15	RE1	1.101	1.341	1.537	1.658	1.727	1.809	1.824
5	0.15	RE2	1.918	4.237	5.866	6.765	7.296	8.322	7.925

5. Practical application

We considered an application of the proposed model with ${\textstyle k=3}$ in estimating the average GPA of the students at Quaid-i-Azam University, Islamabad. To estimate the average GPA, we took a sample of 100 second semester students. Each student was requested to report responses using the proposed, Hussain and Shabbir [10], and Gjestvang and Singh [14] RR models. The responses obtained from the students are reported in Table A.3, Table A.4 and Table A.5 (see Appendix). We generated 100 random numbers from a normal distribution with mean 5 and standard deviation 0.5, wrote them on cards (white and black) and placed the cards in a transparent box. Thus, ${\textstyle S\sim N(5,0.5)}$. We chose ${\textstyle P_{1}=0.2}$, ${\textstyle P_{2}=0.12}$, ${\textstyle a=0.2}$ and ${\textstyle b=0.6}$, so that ${\textstyle P_{3}={\frac {0.6}{0.2+0.6}}=0.75}$.

Table A.3. The data obtained through proposed model with ${\textstyle a=0.2,b=0.6}$ , ${\textstyle P_{1}=0.2}$ and ${\textstyle P_{2}=0.12}$ .
2.23316076	0.11682901	3.69261291	2.79361833	3.31754616	4.48523422
0.07177086	2.54567059	2.00454378	−0.18521662	4.02623078	4.78814901
4.34181631	−1.20497925	3.03632159	0.92956271	3.18827191	−0.39164659
5.05778842	−0.04244069	4.82164560	4.70977953	3.16458601	3.44016293
3.33889340	3.98265516	4.47154482	5.05657143	−1.03031096	3.75428966
4.71087638	3.04748875	−0.12171074	3.93705588	3.19369396	2.93455250
3.31228968	4.81156619	3.99121545	3.47980705	3.41929935	4.30847776
0.61209063	4.45097525	−0.59982820	−0.38546359	4.87526643	3.34191693
2.16227070	−0.61183118	4.46274876	3.83178736	2.93766344	4.65216277
4.44753724	3.75060349	3.68410234	4.66868069	0.07913056	−0.12381204
0.25315506	2.44209882	4.31549906	3.31681455	2.96894543	3.35360459
0.07256166	1.02781537	−0.45570842	−0.87029593	4.11836979	2.64111698
−0.27625023	3.16656813	4.63397434	3.29360005	4.48240265	4.23193867
0.01389507	3.57583500	4.81195764	2.38847305	3.26900358	3.35566998
−0.23058781	3.52512426	3.02007434	−0.87307810	4.56112139	3.93763373
0.80805500	3.26219485	3.74346405	3.16233729	3.34006163	3.15538427
3.47715985	3.88315901	0.62564945	4.55753575

Table A.4. Data obtained through Hussain and Shabbir (2007b) [10] model with ${\textstyle a=0.2,b=0.6}$ and ${\textstyle P_{1}=0.12}$ .
3.11109635	0.11682901	4.78965708	2.79361833	3.31754616	4.48523422
0.07177086	3.77153499	2.00454378	−0.18521662	4.02623078	4.78814901
4.34181631	−1.20497925	3.03632159	0.92956271	3.18827191	−0.39164659
5.05778842	−0.04244069	4.82164560	4.70977953	3.16458601	4.34426764
3.33889340	3.98265516	4.47154482	5.05657143	−1.03031096	4.74066931
4.71087638	3.04748875	−0.12171074	3.93705588	0.48483138	2.93455250
3.31228968	4.81156619	3.99121545	0.62013548	3.41929935	4.30847776
0.61209063	4.45097525	−0.59982820	−0.38546359	4.87526643	3.34191693
3.11559258	−0.61183118	4.46274876	3.83178736	4.05210038	4.65216277
4.44753724	3.75060349	3.68410234	4.66868069	0.07913056	−0.12381204
0.25315506	2.44209882	4.31549906	4.44771291	3.97064382	3.35360459
0.07256166	1.02781537	−0.45570842	−0.87029593	4.11836979	−0.17281558
−0.27625023	3.16656813	4.63397434	4.25109065	4.48240265	4.23193867
0.01389507	3.57583500	4.81195764	2.38847305	3.26900358	3.35566998
−0.23058781	3.52512426	3.02007434	−0.87307810	4.56112139	3.93763373
0.80805500	3.26219485	4.66407697	3.16233729	4.28249550	3.15538427
3.47715985	0.84556881	0.62564945	4.55753575

Table A.5. Data obtained through Gjestvang and Singh (2009) [14] with ${\textstyle a=0.2,b=0.6}$ .
3.11109635	0.11682901	4.78965708	0.07148237	3.31754616	4.48523422
0.07177086	3.77153499	2.98748964	−0.18521662	4.02623078	4.78814901
4.34181631	−1.20497925	0.20104257	0.92956271	4.16780198	−0.39164659
5.05778842	−0.04244069	4.82164560	4.70977953	4.31361666	4.34426764
3.33889340	3.98265516	4.47154482	5.05657143	−1.03031096	4.74066931
4.71087638	3.04748875	−0.12171074	3.93705588	0.48483138	2.93455250
3.31228968	4.81156619	3.99121545	0.62013548	3.41929935	4.30847776
0.61209063	4.45097525	−0.59982820	−0.38546359	4.87526643	4.41765019
3.11559258	−0.61183118	4.46274876	3.83178736	4.05210038	4.65216277
4.44753724	3.75060349	3.68410234	4.66868069	0.07913056	−0.12381204
0.25315506	−0.42428630	4.31549906	4.44771291	3.97064382	3.35360459
0.07256166	1.02781537	−0.45570842	−0.87029593	4.11836979	−0.17281558
−0.27625023	3.16656813	4.63397434	4.25109065	4.48240265	4.23193867
0.01389507	3.57583500	4.81195764	3.26489306	3.26900358	3.35566998
−0.23058781	3.52512426	4.02475847	−0.87307810	4.56112139	3.93763373
0.80805500	3.26219485	4.66407697	4.10066525	4.28249550	3.15538427
3.47715985	0.84556881	0.62564945	4.55753575

In the first deck of 100 cards, we wrote the statement “Report GPA” on 20 cards and the statement “go to second box” on the remaining 80. In the second deck of 100 cards, we wrote “Report GPA” on 12 cards and “go to third box” on the remaining 88. Similarly, in the third deck of cards, we wrote “Report: ${\textstyle GPA+0.2}$ (Random number)” on 75 cards and “Report: ${\textstyle GPA-0.6}$ (Random number)” on the remaining 25. Assuming that ${\textstyle P_{1}=0}$, data were obtained from the same respondents; these were, essentially, the data obtained through the Hussain and Shabbir [10] RRM. Similarly, assuming ${\textstyle P_{1}=P_{2}=0}$, data were obtained again from the same respondents; these were the data obtained through the Gjestvang and Singh [14] RRM. At the end, we requested the respondents to write their true GPA on a paper chit and drop it into a box without disclosing their identity. The true data are given in Table A.6 (see Appendix).
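The overall chances implied by the three decks can be worked out exactly. The sketch below uses exact fractions; the function name `outcome_probabilities` is an illustrative choice, and the card counts (20 of 100, 12 of 100, and ${\textstyle a=0.2}$, ${\textstyle b=0.6}$) are those described above.

```python
from fractions import Fraction


def outcome_probabilities(true_probs, a, b):
    """Return (P[report X], P[report X + a*S], P[report X - b*S]) for a k-stage design."""
    reach_last = Fraction(1)
    for p in true_probs:
        reach_last *= 1 - p            # respondent is sent on to the next deck
    p_k = b / (a + b)                  # share of "X + a*S" cards in the last deck
    return 1 - reach_last, p_k * reach_last, (1 - p_k) * reach_last


a, b = Fraction(1, 5), Fraction(3, 5)
probs = outcome_probabilities((Fraction(20, 100), Fraction(12, 100)), a, b)
print([float(p) for p in probs])       # → [0.296, 0.528, 0.176]

# Expected scrambling shift per unit of S: a * P[X + aS] - b * P[X - bS].
drift = a * probs[1] - b * probs[2]
print(drift)                           # → 0
```

So about 30% of respondents report their GPA unchanged, and the additive and subtractive scrambling cancel on average, which is why the plain sample mean of the responses remains unbiased.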

Table A.6. True data.
2.233161	3.136780	3.692613	2.793618	2.205516	3.431046	2.707550	2.545671
2.004544	3.460874	3.110159	3.619516	3.217301	2.058742	3.036322	3.914756
3.188272	3.005791	3.982075	3.275306	3.930744	3.717378	3.164586	3.440163
2.099116	2.988803	3.171050	3.958999	2.080994	3.754290	3.707987	2.036558
2.655664	2.910881	3.193694	2.046447	2.261989	3.656463	2.845359	3.479807
2.492049	3.277419	3.719893	3.494468	2.591614	2.792680	3.810854	3.341917
2.162271	2.996835	3.462074	2.922333	2.937663	3.800106	3.252035	2.894699
2.741574	3.590808	2.611402	3.730712	3.321683	2.442099	3.223477	3.316815
2.968945	2.312281	3.336032	3.967372	2.366828	2.500851	3.303393	2.641117
3.461387	2.230421	3.716620	3.293600	3.488766	3.251924	2.901757	2.566012
3.843604	2.388473	2.085841	2.310979	2.968765	2.624486	3.020074	2.626089
3.553566	2.784691	3.598225	2.165920	3.743464	3.162337	3.340062	2.142230
2.391306	3.883159	3.739664	3.490303

The summary of the survey results is given in Table A.1 (see Appendix ). From Table A.1 , it is observed that the estimates based on the responses through the proposed model are closer to the estimates based on true responses than those of the other two methods.

6. Conclusion

Using the idea of distributing the probability of reporting the true value of the sensitive variable over an increased number of stages, we proposed a general class of scrambling models. The models of Hussain and Shabbir [10] and Gjestvang and Singh [14] have been shown to be special cases of the proposed class. The efficiency and privacy protection of the models in the proposed class are functions of the number ${\textstyle (k)}$ of randomization stages: as ${\textstyle k}$ increases, efficiency increases while privacy protection decreases. Thus, a suitable value of ${\textstyle k}$ is one that balances the two objectives of the study (greater efficiency and greater privacy protection). It is also established that the proposed class of models is actually a class of two-stage models, having the additional advantage of distributing the probability of reporting the sensitive variable over an increased number of stages. Although the estimator given in Eq. (9) is unbiased and has smaller variance, its application in field surveys may be problematic because the sampled individuals may become irritated at having to respond again and again. Thus, ${\textstyle k}$ should be at most 3 or 4 for the proposed model to be practically feasible.

A small-scale practical application of the proposed class of models for ${\textstyle k=3}$ is also given. In this application, we compared two types of estimates: one based on direct responses and the others based on scrambled responses. These estimates may not represent the true average GPA of the whole campus, as we sampled only second-semester students. The study could be extended to a larger scale by including all the students in the university and obtaining their actual average GPA from the office of the controller of examinations. Comparing the true average GPA with the estimates would then shed more light on the performance of the proposed estimators. Nevertheless, the proposed estimator performs well compared with the other estimators considered in this paper. In conclusion, the proposed method of obtaining scrambled responses can be used safely and securely in field surveys on sensitive variables.

Acknowledgments

The authors are most grateful to the two learned referees for their guidance in improving the earlier draft of this article. The first author greatly appreciates the research facilities provided by King Abdulaziz University.

Appendix.

See Tables A.1, A.2, A.3, A.4, A.5 and A.6.

References

[1] S.L. Warner; Randomized response: a survey technique for eliminating evasive answer bias; Journal of the American Statistical Association, 60 (1965), pp. 63–69
[2] B.G. Greenberg, R.R. Kuebler Jr., J.R. Abernathy, D.G. Horvitz; Application of the randomized response techniques in obtaining quantitative data; Journal of the American Statistical Association, 66 (1971), pp. 243–250
[3] B.H. Eichhorn, L.S. Hayre; Scrambled randomized response methods for obtaining sensitive quantitative data; Journal of Statistical Planning and Inference, 7 (1983), pp. 307–316
[4] S. Singh, M. Mahmood, D.S. Tracy; Estimation of mean and variance of stigmatized quantitative variable using distinct units in randomized response sampling; Statistical Papers, 42 (2001), pp. 403–411
[5] S. Gupta, B. Gupta, S. Singh; Estimation of sensitivity level of personal interview survey questions; Journal of Statistical Planning and Inference, 100 (2002), pp. 239–247
[6] S.K. Bar-Lev, E. Bobovitch, B. Boukai; A note on randomized response models; Metrika, 6 (2004), pp. 255–260
[7] J.-B. Ryu, J.-M. Kim, T.-Y. Heo, C.G. Park; On stratified randomized response sampling; Model Assisted Statistics and Applications, 1 (1) (2005–2006), pp. 31–36
[8] Z. Hussain, J. Shabbir, S. Gupta; An alternative to Ryu et al. randomized response model; Journal of Statistics and Management Sciences, 10 (4) (2007), pp. 511–517
[9] Z. Hussain, J. Shabbir; Generalized quantitative randomized response model; InterStat (#004) (2007)
[10] Z. Hussain, J. Shabbir; On estimation of mean of a sensitive quantitative variable; InterStat (#006) (2007)
[11] K.C. Huang; Estimation of sensitive characteristic using optional randomized response technique; Quality and Quantity, 42 (2008), pp. 679–686
[12] K.C. Huang; Unbiased estimators of mean, variance and sensitivity level for quantitative characteristic in finite population sampling; Metrika, 71 (2010), pp. 341–352
[13] C.R. Gjestvang, S. Singh; A new randomized response model; Journal of the Royal Statistical Society, Series B, 68 (2006), pp. 523–530
[14] C.R. Gjestvang, S. Singh; An improved randomized response model: estimation of mean; Journal of Applied Statistics, 36 (12) (2009), pp. 1361–1367
[15] S. Gupta, J. Shabbir, S. Sehra; Mean and sensitivity estimation in optional randomized response models; Journal of Statistical Planning and Inference, 100 (2010), pp. 239–247
[16] A. Chaudhuri, R. Mukerjee; Randomized Response: Theory and Techniques; Marcel Dekker, New York (1988)
[17] S. Himmelfarb, S.E. Edgell; Additive constant model: a randomized response technique for eliminating evasiveness to quantitative response questions; Psychological Bulletin, 87 (1980), pp. 525–530
[18] Y. Zaizai, W. Jingu, L. Junfeng; An efficiency and protection based comparison among the quantitative randomized response strategies; Communications in Statistics-Theory and Methods, 38 (2009), pp. 400–408
