The aim of this vignette is to describe the various methods for estimating the Gini index, for both infinite and finite populations, as well as the methods for estimating its variance, as implemented in the giniVarCI package. Different confidence intervals for the Gini index are also explained.
To exemplify the use of the different functions, we assume that inequality is measured for a nonnegative and continuous random variable Y. A popular formulation of the Gini index (G) is defined by (see David, 1968; Kendall and Stuart, 1977; Qin et al., 2010): G = \displaystyle \frac{1}{2\mu_{Y}}\int_{0}^{+\infty}\int_{0}^{+\infty}|x-y|\,dF_{Y}(x)\,dF_{Y}(y), where \mu_{Y}=E[Y]=\int_{0}^{+\infty}yf(y)\,dy=\int_{0}^{+\infty}y\,dF_{Y}(y) is the mean of Y, and F_{Y}(y)=P(Y\leq y) and f(y) are the cumulative distribution function and the probability density function of Y, respectively.
In practice, the value of G is estimated by means of a sample S with size n, which can be selected from either infinite or finite populations (Berger and Gedik Balay, 2020; Muñoz et al., 2023).
For infinite populations, \{Y_{i}: i \in S\} denotes a sequence, with size n, of nonnegative random variables with the same distribution as the variable of interest Y. The Gini index (G) is estimated using the observations of the individuals selected in the sample, which are denoted as \{y_{i}: i \in S\}. A popular estimator of the Gini index is (see Langel and Tillé, 2013; Giorgi and Gigliarano, 2017; Muñoz et al., 2023): \widehat{G} = \displaystyle \frac{2}{\overline{y}n^{2}}\sum_{i \in S}iy_{(i)} - \frac{n+1}{n}, where \overline{y}=n^{-1}\sum_{i=1}^{n}y_{i}, and y_{(i)} are the sample observations y_{i} sorted in non-decreasing order. This is the expression computed by the functions iginindex() (method = 5) and igini() when bias.correction = FALSE.
The estimator \widehat{G} can be biased for small sample sizes (Deltas, 2003). The bias-corrected (bc) version of \widehat{G} is: \widehat{G}^{bc} = \displaystyle \frac{2}{\overline{y}n(n-1)}\sum_{i \in S}iy_{(i)} - \frac{n+1}{n-1}, which is the bias-corrected version of the Gini index computed by iginindex() (method = 5) and igini() when bias.correction = TRUE.
In the first example, a sample with size n=100 is generated using the gsample() function from the standard lognormal distribution (distribution = "lognormal") with true Gini index G=0.5 (gini = 0.5), and the Gini index is estimated using bias correction.
library(giniVarCI)
set.seed(123)
y <- gsample(n = 100, gini = 0.5, distribution = "lognormal")
igini(y)
#> [1] 0.4671929
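The two estimators above can be sketched in a few lines of base R. This is only an illustration of the formulas, not the package's implementation; the helper name gini_hat() is hypothetical, and only its bias.correction argument mirrors the igini() argument mentioned in the text.

```r
# Hypothetical helper (not part of giniVarCI): sorted-rank form of the estimator.
gini_hat <- function(y, bias.correction = TRUE) {
  stopifnot(is.numeric(y), length(y) > 1, all(y >= 0))
  n <- length(y)
  ys <- sort(y)  # y_(1) <= ... <= y_(n)
  g <- 2 * sum(seq_len(n) * ys) / (mean(ys) * n^2) - (n + 1) / n
  # The bias-corrected version is n / (n - 1) times the uncorrected one.
  if (bias.correction) g * n / (n - 1) else g
}

y_toy <- c(1, 2, 3, 4)                    # toy data, for illustration only
gini_hat(y_toy, bias.correction = FALSE)  # 0.25
gini_hat(y_toy)                           # 0.25 * 4 / 3
```

For the toy vector, the rank-weighted sum is 30 and the mean is 2.5, so the uncorrected value is 60/40 - 5/4 = 0.25.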
iginindex() can be used to estimate the Gini index using various estimation methods, with implementations in both R and C++. See help(iginindex) for a detailed description of the various estimation methods. Efficiency comparisons between both implementations and with functions available in other packages, such as laeken, DescTools, ineq or REAT, can be made using, for instance, the function microbenchmark():
# Comparing the computation time for the various estimation methods using R
microbenchmark::microbenchmark(
iginindex(y, method = 1, useRcpp = FALSE),
iginindex(y, method = 2, useRcpp = FALSE),
iginindex(y, method = 3, useRcpp = FALSE),
iginindex(y, method = 4, useRcpp = FALSE),
iginindex(y, method = 5, useRcpp = FALSE),
iginindex(y, method = 6, useRcpp = FALSE),
iginindex(y, method = 7, useRcpp = FALSE),
iginindex(y, method = 8, useRcpp = FALSE),
iginindex(y, method = 9, useRcpp = FALSE),
iginindex(y, method = 10, useRcpp = FALSE)
)
#> Unit: microseconds
#> expr min lq mean median
#> iginindex(y, method = 1, useRcpp = FALSE) 144.1 164.10 256.703 185.15
#> iginindex(y, method = 2, useRcpp = FALSE) 13.9 20.30 33.951 25.50
#> iginindex(y, method = 3, useRcpp = FALSE) 11.5 17.10 29.406 21.55
#> iginindex(y, method = 4, useRcpp = FALSE) 15.5 22.85 36.819 28.70
#> iginindex(y, method = 5, useRcpp = FALSE) 16.2 21.80 41.324 27.45
#> iginindex(y, method = 6, useRcpp = FALSE) 31.3 63.60 95.618 80.15
#> iginindex(y, method = 7, useRcpp = FALSE) 948.3 1024.75 1356.977 1149.25
#> iginindex(y, method = 8, useRcpp = FALSE) 779.4 897.05 1148.337 975.20
#> iginindex(y, method = 9, useRcpp = FALSE) 751.7 832.75 1073.753 916.95
#> iginindex(y, method = 10, useRcpp = FALSE) 9899.3 12344.85 14473.612 14318.45
#> uq max neval
#> 239.50 3925.8 100
#> 39.05 154.9 100
#> 31.75 215.0 100
#> 43.95 170.7 100
#> 41.70 544.4 100
#> 102.00 602.7 100
#> 1499.85 4320.1 100
#> 1254.35 4180.3 100
#> 1275.00 3225.7 100
#> 15689.05 24814.1 100
# Comparing the computation time for the various estimation methods using Rcpp
microbenchmark::microbenchmark(
iginindex(y, method = 1),
iginindex(y, method = 2),
iginindex(y, method = 3),
iginindex(y, method = 4),
iginindex(y, method = 5),
iginindex(y, method = 6),
iginindex(y, method = 7),
iginindex(y, method = 8),
iginindex(y, method = 9),
iginindex(y, method = 10) )
#> Unit: microseconds
#> expr min lq mean median uq max
#> iginindex(y, method = 1) 21.2 27.00 58.562 36.20 49.85 1068.6
#> iginindex(y, method = 2) 10.0 15.50 53.593 20.90 26.50 2465.1
#> iginindex(y, method = 3) 9.9 17.35 33.202 23.95 32.75 287.1
#> iginindex(y, method = 4) 10.2 18.15 41.034 24.20 30.70 1275.7
#> iginindex(y, method = 5) 9.0 12.95 22.263 20.50 25.35 146.6
#> iginindex(y, method = 6) 8.4 13.85 23.954 19.65 24.90 126.1
#> iginindex(y, method = 7) 31.0 46.85 70.911 58.30 67.65 442.9
#> iginindex(y, method = 8) 20.5 27.75 47.885 36.80 44.65 480.6
#> iginindex(y, method = 9) 19.0 29.75 66.249 40.65 55.00 940.7
#> iginindex(y, method = 10) 10681.2 16687.45 25121.255 22552.65 30578.30 87119.2
#> neval
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
#> 100
# Comparing the computation time for estimates of the Gini index in various R packages.
microbenchmark::microbenchmark(
igini(y),
laeken::gini(y),
DescTools::Gini(y),
ineq::Gini(y),
REAT::gini(y))
#> Registered S3 methods overwritten by 'DescTools':
#> method from
#> lines.Lc ineq
#> plot.Lc ineq
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> igini(y) 9.8 12.00 34.814 14.35 18.05 1864.2 100
#> laeken::gini(y) 35.3 38.00 392.260 41.20 50.95 32288.5 100
#> DescTools::Gini(y) 56.6 61.80 11304.144 67.20 78.40 1122424.1 100
#> ineq::Gini(y) 45.7 49.45 133.872 52.60 65.90 6572.8 100
#> REAT::gini(y) 100.7 108.30 234.544 113.10 137.15 8623.5 100
Variance estimators and confidence intervals are described for the non-bias-corrected estimator \widehat{G}. Since \widehat{G}^{bc}=\displaystyle \frac{n}{n-1}\widehat{G}, the variance estimators and confidence intervals based on \widehat{G}^{bc} can be straightforwardly derived. In particular, \widehat{V}(\widehat{G}^{bc})=\displaystyle \frac{n^{2}}{(n-1)^{2}}\widehat{V}(\widehat{G}). Let [L,U] denote the lower and upper limits of a confidence interval for G based on \widehat{G}. The confidence interval based on \widehat{G}^{bc} can be computed as: \left[\displaystyle \frac{n}{n-1}L, \frac{n}{n-1}U\right].
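As a minimal sketch of this rescaling, the following base R helper converts results obtained for the uncorrected estimator into their bias-corrected counterparts. The function name rescale_to_bc() and the input values are made up for illustration; the package applies the correction internally.

```r
# Hypothetical helper (not part of giniVarCI): rescale results obtained for the
# uncorrected estimator so that they refer to the bias-corrected estimator.
rescale_to_bc <- function(gini, lower, upper, variance, n) {
  stopifnot(n > 1)
  k <- n / (n - 1)
  list(Gini = k * gini,
       Interval = c(lower = k * lower, upper = k * upper),
       Variance = k^2 * variance)  # variances scale with the squared factor
}

# Made-up inputs, for illustration only.
res <- rescale_to_bc(gini = 0.40, lower = 0.30, upper = 0.50,
                     variance = 0.0025, n = 101)
res$Gini      # 0.404
res$Interval  # 0.303 and 0.505
```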
The argument interval = "pbootstrap" in the function igini() returns the confidence interval for the Gini index using the percentile bootstrap method. Let \{y_{1}^{*}(b),\ldots,y_{n}^{*}(b)\} be the bth bootstrap sample taken from the original sample \{y_{1},\ldots,y_{n}\} by simple random sampling with replacement, and let \widehat{G}^{*}(b) denote the estimator \widehat{G} computed from the bth bootstrap sample, with b=\{1,\ldots,B\}, B being the total number of bootstrap samples. For a confidence level 1-\alpha, the percentile bootstrap confidence interval is defined as (see Qin et al., 2010): \left[\widehat{G}^{*}_{(\alpha/2)}, \widehat{G}^{*}_{(1-\alpha/2)}\right], where \widehat{G}^{*}_{(a)} is the ath quantile of the bootstrapped coefficients \widehat{G}^{*}(b). A variance estimator of the Gini index based on bootstrap is defined as \widehat{V}_{B}(\widehat{G})= \displaystyle \frac{1}{B-1}\sum_{b=1}^{B}\left(\widehat{G}^{*}(b) - \overline{G}^{*} \right)^2, where \overline{G}^{*}=\frac{1}{B}\sum_{b=1}^{B}\widehat{G}^{*}(b).
# Gini index estimation and confidence interval using 'pbootstrap'.
igini(y, interval = "pbootstrap")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4004204 0.5135315
#>
#> $Variance
#> [1] 0.0008333577
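The percentile bootstrap can be sketched in base R as follows. This is an illustration under simplifying assumptions, not the igini() implementation: the helpers gini_hat() and boot_gini() are hypothetical, the data are simulated rather than the vignette's sample, no bias correction is applied, and quantile() is used with its default type.

```r
# Uncorrected estimator in sorted-rank form (hypothetical helper).
gini_hat <- function(y) {
  n <- length(y)
  2 * sum(seq_len(n) * sort(y)) / (mean(y) * n^2) - (n + 1) / n
}

# Percentile bootstrap interval and bootstrap variance (illustrative sketch).
boot_gini <- function(y, B = 1000, alpha = 0.05) {
  n <- length(y)
  # Each replicate resamples n values with replacement and re-estimates G.
  g_star <- replicate(B, gini_hat(sample(y, n, replace = TRUE)))
  list(Gini = gini_hat(y),
       Interval = unname(quantile(g_star, c(alpha / 2, 1 - alpha / 2))),
       Variance = var(g_star))  # var() divides by B - 1
}

set.seed(1)
y_sim <- rlnorm(100)  # simulated data, not the vignette's sample
boot_gini(y_sim)
```

Because the interval is read directly from the bootstrap distribution, it need not be symmetric around the point estimate.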
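interval = "BCa" computes the bias-corrected and accelerated bootstrap interval (Davison and Hinkley, 1997). The idea of this confidence interval is to correct for bias due to the skewness in the distribution of bootstrap estimates. The "BCa" confidence interval is defined as: \left[\widehat{G}^{*}_{(\alpha_{1})}, \widehat{G}^{*}_{(\alpha_{2})}\right], where \alpha_{1}=\phi\left(\widehat{Z}_{0}+\displaystyle \frac{\widehat{Z}_{0}+Z_{\alpha}}{1-\widehat{a}(\widehat{Z}_{0}+Z_{\alpha})}\right), \alpha_{2}=\phi\left(\widehat{Z}_{0}+\displaystyle \frac{\widehat{Z}_{0}+Z_{1-\alpha}}{1-\widehat{a}(\widehat{Z}_{0}+Z_{1-\alpha})}\right), \phi(\cdot) is the cumulative distribution function of the standard Normal distribution, and Z_{a} is the ath quantile of the standard Normal distribution. The bias-correction factor is defined as \widehat{Z}_{0}=\phi^{-1}\left(\displaystyle \frac{\#\{\widehat{G}^{*}(b) < \widehat{G}\}}{B}\right), i.e., it is based on the proportion of bootstrap estimates that fall below \widehat{G}, and the acceleration factor is given by \widehat{a}=\displaystyle \frac{\sum_{i \in S}(\overline{G}-\widehat{G}_{-i})^{3}}{6\left\{\sum_{i \in S}(\overline{G}-\widehat{G}_{-i})^{2}\right\}^{3/2}}, where \widehat{G}_{-i} are the jackknife estimates defined in the following section, and \overline{G}=\frac{1}{n}\sum_{i \in S}\widehat{G}_{-i}.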
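The "zjackknife" and "tjackknife" methods compute the variance of the Gini index using the Ogwang jackknife procedure (Ogwang, 2000; Langel and Tillé, 2013). This variance is given by \widehat{V}_{J}(\widehat{G})= \displaystyle \frac{n-1}{n}\sum_{i \in S}\left(\widehat{G}_{-i}-\overline{G}\right)^{2}, where \widehat{G}_{-i}=\widehat{G}+\displaystyle \frac{2}{\sum_{j \in S}y_{j}-y_{(i)}}\left[\frac{y_{(i)}\sum_{j \in S}jy_{(j)}}{n\sum_{j \in S}y_{j}}+\frac{\sum_{j \in S}jy_{(j)}}{n(n-1)}-\frac{\sum_{j \in S}y_{j}-\sum_{j=1}^{i}y_{(j)}+iy_{(i)}}{n-1}\right]-\frac{1}{n(n-1)}, with i=\{1,\ldots,n\}, are the jackknife estimates, i.e., \widehat{G}_{-i} is the estimate of the Gini index when the unit i is removed from the sample. For a confidence level 1-\alpha, the "zjackknife" confidence interval is defined as \left[\widehat{G} - Z_{1-\alpha/2}\sqrt{\widehat{V}_{J}(\widehat{G})}, \widehat{G} + Z_{1-\alpha/2}\sqrt{\widehat{V}_{J}(\widehat{G})} \right], where Z_{1-\alpha/2} is the (1-\alpha/2)th quantile of the standard Normal distribution.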
# Gini index estimation and confidence interval using 'zjackknife'.
igini(y, interval = "zjackknife")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4103563 0.5240296
#>
#> $Variance
#> [1] 0.0008409313
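The closed-form jackknife estimates can be checked against a brute-force leave-one-out computation. The sketch below is illustrative only: all function names are hypothetical, the data are simulated, and the uncorrected estimator is used throughout.

```r
# Uncorrected estimator in sorted-rank form (hypothetical helper).
gini_hat <- function(y) {
  n <- length(y)
  2 * sum(seq_len(n) * sort(y)) / (mean(y) * n^2) - (n + 1) / n
}

# Leave-one-out estimates by brute force: recompute the estimator n times.
loo_brute <- function(y) {
  ys <- sort(y)
  vapply(seq_along(ys), function(i) gini_hat(ys[-i]), numeric(1))
}

# Same estimates from the closed-form expression, using cumulative sums.
loo_closed <- function(y) {
  ys <- sort(y); n <- length(ys); i <- seq_len(n)
  tot <- sum(ys); rank_sum <- sum(i * ys)
  gini_hat(ys) +
    2 / (tot - ys) * (ys * rank_sum / (n * tot) + rank_sum / (n * (n - 1)) -
                        (tot - cumsum(ys) + i * ys) / (n - 1)) -
    1 / (n * (n - 1))
}

# Jackknife variance built from the leave-one-out estimates.
jackknife_var <- function(y) {
  g_loo <- loo_closed(y)
  n <- length(y)
  (n - 1) / n * sum((g_loo - mean(g_loo))^2)
}

set.seed(1)
y_sim <- rlnorm(50)  # simulated data, not the vignette's sample
all.equal(loo_brute(y_sim), loo_closed(y_sim))
# Normal-approximation ("zjackknife"-style) 95% interval.
gini_hat(y_sim) + c(-1, 1) * qnorm(0.975) * sqrt(jackknife_var(y_sim))
```

The closed form avoids re-sorting and re-summing the sample n times, which is the practical appeal of the Ogwang procedure.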
"tjackknife"
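substitutes the critical value Z_{1-\alpha/2} by critical values computed from the Studentized bootstrap. This confidence interval is given by \left[\widehat{G} - t^{*}_{J;1-\alpha/2}\sqrt{\widehat{V}_{J}(\widehat{G})}, \widehat{G} - t^{*}_{J;\alpha/2}\sqrt{\widehat{V}_{J}(\widehat{G})} \right], where t^{*}_{J;a} is the ath quantile of the values t^{*}_{J}(b)=\displaystyle \frac{\widehat{G}^{*}(b)-\widehat{G}}{\sqrt{\widehat{V}_{J}[\widehat{G}^{*}(b)]}} computed using the bootstrap technique, where \widehat{V}_{J}[\widehat{G}^{*}(b)] is the estimated Ogwang jackknife variance of \widehat{G}^{*}(b) for the bth bootstrap sample.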
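The linearization technique for variance estimation (Deville, 1999) has been applied to the following estimators of the Gini index: \widehat{G}^{a}= \displaystyle \frac{1}{2\overline{y}n^{2}}\sum_{i \in S}\sum_{j \in S}|y_{i}-y_{j}| and \widehat{G}^{b} = \displaystyle \frac{2}{\overline{y}n}\sum_{i \in S}y_{i}\widehat{F}_{n}(y_{i})-1, where \widehat{F}_{n}(y_{i})=\frac{1}{n}\sum_{j \in S}\delta(y_{j} \leq y_{i}) and \delta(\cdot) is the indicator variable that takes the value 1 when its argument is true and 0 otherwise. For a given estimator \widehat{G} and a linearized variable z, the confidence interval, with confidence level 1-\alpha, is defined as: \left[\widehat{G} - Z_{1-\alpha/2}\sqrt{\widehat{V}_{L}(\widehat{G})}, \widehat{G} + Z_{1-\alpha/2}\sqrt{\widehat{V}_{L}(\widehat{G})} \right], where the variance estimator of the Gini index is given by: \widehat{V}_{L}(\widehat{G})=\displaystyle \frac{1}{n(n-1)}\sum_{i \in S}(z_{i}-\overline{z})^{2}, and \overline{z}=\frac{1}{n}\sum_{i \in S}z_{i}.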
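On the one hand, interval = "zalinearization" linearizes the estimator \widehat{G}^{a}, and the corresponding pseudo-values are (see Langel and Tillé, 2013): z_{(i)}^{a}=\displaystyle \frac{1}{\overline{y}}\left[ \frac{2i}{n}\left( y_{(i)} - \widehat{\overline{Y}}_{(i)}\right) + \overline{y} - y_{(i)} - \widehat{G}^{a}\left(\overline{y} + y_{(i)} \right) \right], where \widehat{\overline{Y}}_{(i)} = \displaystyle \frac{1}{i}\sum_{j=1}^{i}y_{(j)}.
On the other hand, interval = "zblinearization" linearizes the estimator \widehat{G}^{b}, and the corresponding pseudo-values are (see Berger, 2008): z_{i}^{b}=\displaystyle \frac{1}{\overline{y}}\left[ 2y_{i}\widehat{F}_{n}(y_{i}) - (\widehat{G}^{b}+1)(y_{i}+\overline{y})+\frac{2}{n}\sum_{j \in S}y_{j}\delta(y_{j} \geq y_{i}) \right].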
# Gini index estimation and confidence interval using 'zalinearization'.
igini(y, interval = "zalinearization")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4125876 0.5217982
#>
#> $Variance
#> [1] 0.0007762
# Gini index estimation and confidence interval using 'zblinearization'.
igini(y, interval = "zblinearization")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4107537 0.5236321
#>
#> $Variance
#> [1] 0.0008292117
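The linearization of the first estimator (the mean-difference form) can be sketched in base R as below. The helper lin_gini_a() is hypothetical and the data are simulated; the code only illustrates how the pseudo-values feed the variance formula and the Normal-based interval, without bias correction.

```r
# Hypothetical helper: linearization variance for the mean-difference estimator.
lin_gini_a <- function(y, alpha = 0.05) {
  ys <- sort(y); n <- length(ys); i <- seq_len(n)
  ybar <- mean(ys)
  # Mean-difference form of the estimator.
  g_a <- sum(abs(outer(ys, ys, "-"))) / (2 * ybar * n^2)
  ybar_i <- cumsum(ys) / i  # mean of the i smallest observations
  # Pseudo-values for the sorted observations.
  z <- (2 * i / n * (ys - ybar_i) + ybar - ys - g_a * (ybar + ys)) / ybar
  v <- sum((z - mean(z))^2) / (n * (n - 1))
  list(Gini = g_a,
       Interval = g_a + c(-1, 1) * qnorm(1 - alpha / 2) * sqrt(v),
       Variance = v)
}

set.seed(1)
y_sim <- rlnorm(100)  # simulated data, not the vignette's sample
lin_gini_a(y_sim)
```

The outer() call needs memory of order n squared, which is acceptable for a sketch but is one reason rank-based formulas are preferred for large samples.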
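Intervals "talinearization" and "tblinearization" substitute the critical value Z_{1-\alpha/2} by critical values computed from the Studentized bootstrap. This confidence interval is given by \left[\widehat{G} - t^{*}_{L;1-\alpha/2}\sqrt{\widehat{V}_{L}(\widehat{G})}, \widehat{G} - t^{*}_{L;\alpha/2}\sqrt{\widehat{V}_{L}(\widehat{G})} \right], where t^{*}_{L;a} is the ath quantile of the values t^{*}_{L}(b)=\displaystyle \frac{\widehat{G}^{*}(b)-\widehat{G}}{\sqrt{\widehat{V}_{L}[\widehat{G}^{*}(b)]}}. \widehat{V}_{L}(\cdot) is computed using the pseudo-values z_{(i)}^{a} when interval = "talinearization", and using the pseudo-values z_{i}^{b} when interval = "tblinearization".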
# Gini index estimation and confidence interval using 'talinearization'.
igini(y, interval = "talinearization")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4195142 0.5253662
#>
#> $Variance
#> [1] 0.0007762
# Gini index estimation and confidence interval using 'tblinearization'.
igini(y, interval = "tblinearization")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4224278 0.5329734
#>
#> $Variance
#> [1] 0.0008292117
Intervals "ELchisq" and "ELboot" are based on the empirical likelihood (EL) method, a nonparametric technique that provides desirable inferences under skewed distributions. The shape of the EL confidence intervals is determined by the data-driven likelihood ratio function (Owen, 2001). interval = "ELchisq" obtains the EL confidence interval, with confidence level 1-\alpha, for the Gini index as defined by Qin et al. (2010): \left\{\theta \mid -2R(\theta) \leq \chi^{2}_{1;1-\alpha}k\right\}, where R(\theta)=-\sum_{i \in S}\log\{1+\lambda Z(y_{i},\theta)\} is the log-EL ratio statistic for \theta=G, Z(y_{i},\theta)=\{2\widehat{F}_{n}(y_{i})-1\}y_{i}-\theta y_{i}, \lambda is the solution to \displaystyle \frac{1}{n}\sum_{i \in S}\frac{Z(y_{i},\theta)}{1+\lambda Z(y_{i},\theta)}=0, k=\widehat{\sigma}_{2}^{2}/\widehat{\sigma}_{1}^{2} is the scaling factor, \widehat{\sigma}_{j}^{2}=\displaystyle \frac{1}{n-1}\sum_{i \in S}(u_{ji}-\overline{u}_{j})^{2}, with j=\{1,2\}, \overline{u}_{j}=\frac{1}{n}\sum_{i \in S}u_{ji}, and \chi^{2}_{1;1-\alpha} is the (1-\alpha)th quantile of the Chi-Squared distribution with one degree of freedom.
# Gini index estimation and confidence interval using 'ELchisq'.
igini(y, interval = "ELchisq")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4216374 0.5319404
#>
#> $Variance
#> [1] 0.0008292117
interval = "ELboot" substitutes the critical value based on the Chi-Squared distribution by an empirical critical value based on bootstrap. "ELboot" computes the EL confidence interval (Qin et al., 2010): \left\{\theta \mid -2R(\theta) \leq C_{1-\alpha}\right\}, where C_{1-\alpha} is the (1-\alpha)th quantile of the values \{-R^{*}_{1}(\widehat{G}),\ldots,-R^{*}_{B}(\widehat{G})\}, and where R^{*}_{b}(\widehat{G}) denotes the value of R(\theta) computed from the bth bootstrap sample.
# Gini index estimation and confidence interval using 'ELboot'.
igini(y, interval = "ELboot")
#> $Gini
#> [1] 0.4671929
#>
#> $Interval
#> lower upper
#> [1,] 0.4118394 0.5413343
#>
#> $Variance
#> [1] 0.0008343201
The function icompareCI() compares the various confidence intervals for the scenario of a sample derived from an infinite population. The argument plotCI = TRUE plots the results derived from the various available methods for constructing confidence intervals.
# Comparisons of variance estimators and confidence intervals.
icompareCI(y, plotCI = FALSE)
#> interval bc gini lowerlimit upperlimit var.gini
#> 1 zjackknife FALSE 0.46 0.41 0.52 8e-04
#> 2 zjackknife TRUE 0.47 0.41 0.52 8e-04
#> 3 tjackknife FALSE 0.46 0.41 0.53 8e-04
#> 4 tjackknife TRUE 0.47 0.42 0.54 8e-04
#> 5 zalinearization FALSE 0.46 0.41 0.52 8e-04
#> 6 zalinearization TRUE 0.47 0.41 0.52 8e-04
#> 7 talinearization FALSE 0.46 0.41 0.52 8e-04
#> 8 talinearization TRUE 0.47 0.42 0.52 8e-04
#> 9 zblinearization FALSE 0.46 0.41 0.52 8e-04
#> 10 zblinearization TRUE 0.47 0.41 0.52 8e-04
#> 11 tblinearization FALSE 0.46 0.42 0.53 8e-04
#> 12 tblinearization TRUE 0.47 0.42 0.53 8e-04
#> 13 pbootstrap FALSE 0.46 0.40 0.51 8e-04
#> 14 pbootstrap TRUE 0.47 0.40 0.51 8e-04
#> 15 BCa FALSE 0.46 0.42 0.53 7e-04
#> 16 BCa TRUE 0.47 0.42 0.53 8e-04
#> 17 ELchisq FALSE 0.46 0.42 0.53 8e-04
#> 18 ELchisq TRUE 0.47 0.42 0.53 8e-04
#> 19 ELboot FALSE 0.46 0.41 0.53 7e-04
#> 20 ELboot TRUE 0.47 0.42 0.54 7e-04
For a finite population U, \{Y_{i}: i \in U\} denotes a sequence, with size N, of nonnegative random variables with the same distribution as the variable of interest Y, and \{y_{i}: i \in U\} are the population values of the variable of interest. A sample S is selected from U by using a sampling design with survey weights w_{i}, with i \in S. For example, the survey weights can be w_{i}=\pi_{i}^{-1}, where \pi_{i}=P(i \in S) are the inclusion probabilities (Muñoz et al., 2023). The Gini index (G) is estimated using the observations of the individuals selected in the sample, \{y_{i}: i \in S\}, and the corresponding survey weights \{w_{i}: i \in S\}. The different methods for estimating the Gini index are (see also Muñoz et al., 2023):
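method = 1 (Langel and Tillé, 2013). \widehat{G}_{w1} = \displaystyle \frac{1}{2\widehat{N}^{2}\overline{y}_{w}}\sum_{i \in S}\sum_{j \in S}w_{i}w_{j}|y_{i}-y_{j}|, where \widehat{N}=\sum_{i \in S}w_{i} and \overline{y}_{w}=\displaystyle \frac{1}{\widehat{N}}\sum_{i \in S}w_{i}y_{i}.
method = 2 (Alfons and Templ, 2012; Langel and Tillé, 2013). \widehat{G}_{w2} = \displaystyle \frac{2\sum_{i \in S}w_{(i)}^{+}\widehat{N}_{(i)}y_{(i)} - \sum_{i \in S}w_{i}^{2}y_{i}}{\widehat{N}^{2}\overline{y}_{w}} - 1, where y_{(i)} are the values y_{i} sorted in increasing order, w_{(i)}^{+} are the values w_{i} sorted according to the increasing order of the values y_{i}, and \widehat{N}_{(i)}=\sum_{j=1}^{i}w_{(j)}^{+}. Note that Langel and Tillé (2013) show that \widehat{G}_{w1}=\widehat{G}_{w2}.
method = 3 (Berger, 2008). \widehat{G}_{w3} = \displaystyle \frac{2}{\widehat{N}\overline{y}_{w}}\sum_{i \in S}w_{i}y_{i}\widehat{F}_{w}^{*}(y_{i}) - 1, where \widehat{F}_{w}^{*}(t)=\displaystyle \frac{1}{\widehat{N}}\sum_{i \in S}w_{i}[\delta(y_{i} < t) + 0.5\delta(y_{i} = t)] is the smooth (mid-point) distribution function.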
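method = 4 (Berger and Gedik-Balay, 2020). \widehat{G}_{w4} = 1 - \displaystyle \frac{\overline{v}_{w}}{\overline{y}_{w}}, where \overline{v}_{w}=\widehat{N}^{-1}\sum_{i \in S}w_{i}v_{i} and v_{i}=\displaystyle \frac{1}{\widehat{N}-w_{i}}\sum_{\substack{j \in S\\ j \neq i}}\min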
method = 5 (Lerman and Yitzhaki, 1989). \widehat{G}_{w5} = \displaystyle \frac{2}{\widehat{N}\overline{y}_{w}} \sum_{i \in S} w_{(i)}^{+}[y_{(i)} - \overline{y}_{w}]\left[ \widehat{F}_{w}^{LY}(y_{(i)}) - \overline{F}_{w}^{LY} \right], where \widehat{F}_{w}^{LY}(y_{(i)}) = \displaystyle \frac{1}{\widehat{N}}\left(\widehat{N}_{(i-1)} + \frac{w_{(i)}^{+}}{2} \right) and \overline{F}_{w}^{LY}=\frac{1}{\widehat{N}}\sum_{i \in S}w_{(i)}^{+}\widehat{F}_{w}^{LY}(y_{(i)}).
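To make the weighted formulas concrete, the following base R sketch implements the estimators for method = 1 and method = 2 and checks on a toy example that they coincide, as stated by Langel and Tillé (2013). The function names wgini_m1() and wgini_m2() are hypothetical; this is not the code used by fginindex().

```r
# Method 1: weighted mean absolute difference (hypothetical helper).
wgini_m1 <- function(y, w) {
  N_hat <- sum(w)
  ybar_w <- sum(w * y) / N_hat
  sum(outer(w, w) * abs(outer(y, y, "-"))) / (2 * N_hat^2 * ybar_w)
}

# Method 2: sorted form based on cumulative weights (hypothetical helper).
wgini_m2 <- function(y, w) {
  o <- order(y)
  ys <- y[o]; ws <- w[o]  # weights follow the ordering of y
  N_hat <- sum(ws)
  ybar_w <- sum(ws * ys) / N_hat
  (2 * sum(ws * cumsum(ws) * ys) - sum(ws^2 * ys)) / (N_hat^2 * ybar_w) - 1
}

# Toy data: two units with weights 2 and 1.
wgini_m1(c(1, 3), c(2, 1))  # 4/15
wgini_m2(c(1, 3), c(2, 1))  # 4/15
```

The sorted form needs only one sort and one cumulative sum, whereas the double sum grows quadratically with the sample size, which is consistent with the timing differences reported in the benchmarks.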
In the finite population example, income and weights from the 2006 Austrian EU-SILC data set (laeken package) are used to estimate the Gini index in the Austrian region of Burgenland. The Gini index is estimated using fgini() and method = 2 (the default method).
data(eusilc, package="laeken")
y <- eusilc$eqIncome[eusilc$db040 == "Burgenland"]
w <- eusilc$rb050[eusilc$db040 == "Burgenland"]
fgini(y, w)
#> [1] 0.3205489
fginindex() can be used to estimate the Gini index using various estimation methods, with implementations in both R and C++. Efficiency comparisons between both implementations and with functions available in other packages, such as laeken, DescTools, ineq or REAT, can be made using, for example, the function microbenchmark():
# Comparing the computation time for the various estimation methods and using R
microbenchmark::microbenchmark(
fginindex(y, w, method = 1, useRcpp = FALSE),
fginindex(y, w, method = 2, useRcpp = FALSE),
fginindex(y, w, method = 3, useRcpp = FALSE),
fginindex(y, w, method = 4, useRcpp = FALSE),
fginindex(y, w, method = 5, useRcpp = FALSE)
)
#> Unit: microseconds
#> expr min lq mean median
#> fginindex(y, w, method = 1, useRcpp = FALSE) 1315.7 1422.95 2091.698 1564.65
#> fginindex(y, w, method = 2, useRcpp = FALSE) 47.7 68.25 112.532 105.25
#> fginindex(y, w, method = 3, useRcpp = FALSE) 3665.1 4164.85 6859.617 5122.20
#> fginindex(y, w, method = 4, useRcpp = FALSE) 7737.1 9392.00 12269.216 12195.65
#> fginindex(y, w, method = 5, useRcpp = FALSE) 60.9 109.25 190.398 143.90
#> uq max neval
#> 2003.60 7574.2 100
#> 133.25 282.4 100
#> 7488.70 31897.5 100
#> 14806.60 21488.4 100
#> 164.65 4365.5 100
# Comparing the computation time for the various estimation methods and using Rcpp
microbenchmark::microbenchmark(
fginindex(y, w, method = 1),
fginindex(y, w, method = 2),
fginindex(y, w, method = 3),
fginindex(y, w, method = 4),
fginindex(y, w, method = 5)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> fginindex(y, w, method = 1) 367.6 369.35 399.065 379.60 395.1 758.8 100
#> fginindex(y, w, method = 2) 42.2 53.60 77.953 58.60 83.9 278.8 100
#> fginindex(y, w, method = 3) 413.5 415.45 488.974 428.60 481.0 959.5 100
#> fginindex(y, w, method = 4) 349.7 352.15 377.131 362.05 368.6 549.9 100
#> fginindex(y, w, method = 5) 44.6 56.60 81.559 62.80 91.7 258.4 100
# Comparing the computation time for estimates of the Gini index in various R packages.
# Comparing 'method = 2', used also by the laeken package.
microbenchmark::microbenchmark(
fgini(y,w),
laeken::gini(y,w)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> fgini(y, w) 28.6 32.25 40.654 35.45 41.55 132.4 100
#> laeken::gini(y, w) 46.4 54.70 64.262 59.00 65.75 241.4 100
# Comparing 'method = 5', used also by the DescTools and REAT packages.
microbenchmark::microbenchmark(
fgini(y,w, method = 5),
DescTools::Gini(y,w),
REAT::gini(y, weighting = w)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> fgini(y, w, method = 5) 31.3 37.95 53.580 43.45 57.15 204.1 100
#> DescTools::Gini(y, w) 79.4 96.00 120.604 103.85 121.75 669.8 100
#> REAT::gini(y, weighting = w) 184.5 242.55 296.965 261.90 308.05 817.2 100
Jackknife and linearization techniques compute pseudo-values (denoted as z_{i}, with i \in S) that are then plugged into a variance estimation formula. The function fgini() can compute the following types of variance estimators using the argument varformula:
Horvitz-Thompson ("HT") type variance estimator (Horvitz and Thompson, 1952). \widehat{V}_{HT}(\widehat{G}_{w}) = \displaystyle \sum_{i\in S}\sum_{j\in S}\breve{\Delta}_{ij}w_{i}w_{j}z_{i}z_{j}, which is computed when varformula = "HT", where \breve{\Delta}_{ij}=\displaystyle \frac{\pi_{ij}-\pi_{i}\pi_{j}}{\pi_{ij}}.
Sen-Yates-Grundy ("SYG") type variance estimator (Sen, 1953; Yates and Grundy, 1953). \widehat{V}_{SYG}(\widehat{G}_{w}) = - \displaystyle \frac{1}{2}\sum_{i\in S}\sum_{j\in S}\breve{\Delta}_{ij}(w_{i}z_i-w_{j}z_{j})^{2}, which is computed when varformula = "SYG".
Hartley-Rao ("HR") type variance estimator (Hartley and Rao, 1962). \widehat{V}_{HR}(\widehat{G}_{w}) = \displaystyle \frac{1}{n-1}\sum_{i\in S}\sum_{\substack{j \in S\\ j < i}}\left(1-\pi_i-\pi_j + \frac{1}{n}\sum_{k\in U}\pi_{k}^{2} \right)(w_{i}z_i-w_{j}z_{j})^{2}, which is computed when varformula = "HR".
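The Horvitz-Thompson and Sen-Yates-Grundy formulas above can be sketched with matrix operations in base R. The helper names are hypothetical and the pseudo-values are made up; the example assumes simple random sampling without replacement, a fixed-size design under which both estimators reduce to the classical expression N^2 (1 - n/N) s_z^2 / n.

```r
# Hypothetical helpers: w are survey weights, z pseudo-values, pi first-order
# and pij second-order inclusion probabilities (pij has pi on its diagonal).
delta_breve <- function(pi, pij) (pij - outer(pi, pi)) / pij

var_ht <- function(z, w, pi, pij) {
  sum(delta_breve(pi, pij) * outer(w * z, w * z))
}

var_syg <- function(z, w, pi, pij) {
  -0.5 * sum(delta_breve(pi, pij) * outer(w * z, w * z, "-")^2)
}

# Simple random sampling without replacement: n = 5 units out of N = 20.
n <- 5; N <- 20
pi <- rep(n / N, n)
pij <- matrix(n * (n - 1) / (N * (N - 1)), n, n)
diag(pij) <- pi
z <- c(0.2, -0.1, 0.4, 0.0, -0.3)  # made-up pseudo-values
w <- 1 / pi

c(HT = var_ht(z, w, pi, pij),
  SYG = var_syg(z, w, pi, pij),
  classical = N^2 * (1 - n / N) * var(z) / n)
```

Under unequal-probability designs the two estimators generally differ for a given sample, which is why the choice of varformula matters.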
Note that the "HT" variance estimator may give negative values, and the "SYG" variance estimator is suitable only for fixed-size sampling designs. This implies that "SYG" should not be used under Poisson sampling. Fortunately, "HT" always gives positive values under this sampling design. Both the Horvitz-Thompson and Sen-Yates-Grundy variance estimators depend on the second-order (joint) inclusion probabilities (argument Pij). The Hájek (1964) approximation \pi_{ij}\cong \pi_{i}\pi_{j}\left[1- \displaystyle \frac{(1-\pi_{i})(1-\pi_{j})}{\sum_{k \in S}(1-\pi_{k})} \right] is used when the second-order (joint) inclusion probabilities are not available (Pij = NULL). Note that the Hájek approximation is suggested for large-entropy sampling designs, large samples, and large populations (see Tillé, 2006; Berger and Tillé, 2009; Haziza et al., 2008; Berger, 2011). For instance, this approximation is not recommended for highly stratified samples (Berger, 2005). The Hartley-Rao variance estimator requires the first-order inclusion probabilities at the population level (argument PiU).
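The Hájek approximation can be written as a short base R function that builds the matrix of approximate joint inclusion probabilities from the first-order probabilities of the sampled units. The name hajek_pij() and the toy probabilities are hypothetical; placing the first-order probabilities on the diagonal is a convention assumed here so that the matrix can be used directly in double-sum variance formulas.

```r
# Hypothetical helper: Hajek approximation of joint inclusion probabilities.
hajek_pij <- function(pi) {
  stopifnot(all(pi > 0), all(pi <= 1), sum(1 - pi) > 0)
  pij <- outer(pi, pi) * (1 - outer(1 - pi, 1 - pi) / sum(1 - pi))
  diag(pij) <- pi  # assumed convention: pi_ii = pi_i
  pij
}

pi_toy <- c(0.2, 0.5, 0.3, 0.4)  # made-up first-order probabilities
round(hajek_pij(pi_toy), 4)
```

Every off-diagonal entry is smaller than the product of the two first-order probabilities, which is the behaviour expected under fixed-size designs.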
For complex sampling designs, the rescaled bootstrap (Rao et al., 1992; Rust and Rao, 1996) can be used for variance estimation and construction of confidence intervals. interval = "pbootstrap" returns the confidence interval for the Gini index using the rescaled bootstrap with confidence limits obtained by the percentile method. For a given estimator \widehat{G}_{w} and a confidence level 1-\alpha, this confidence interval is given by \left[ \widehat{G}^{*}_{w;\alpha/2}, \widehat{G}^{*}_{w;1-\alpha/2} \right], where \widehat{G}^{*}_{w;a} is the ath quantile of the bootstrapped coefficients \widehat{G}^{*}_{w}(b), with b=\{1,\ldots,B\}, which are obtained by using the expression \widehat{G}_{w} after substituting the original survey weights w_{i} by the bootstrap weights w_{i}^{*}(b)=w_{i}\frac{r_{i}n}{n-1}, where r_{i} is the number of times that the ith unit is selected by the bootstrap procedure. A variance estimator of the Gini index based on the rescaled bootstrap is defined as: \widehat{V}_{B}(\widehat{G}_{w})= \displaystyle \frac{1}{B-1}\sum_{b=1}^{B}\left(\widehat{G}^{*}_{w}(b) - \overline{G}^{*}_{w} \right)^2, where \overline{G}^{*}_{w}=\frac{1}{B}\sum_{b=1}^{B}\widehat{G}^{*}_{w}(b).
The "zjackknife" method computes the variance of the Gini index using the jackknife technique. For a given estimator \widehat{G}_{w}, the pseudo-values for variance estimation are defined as (see Berger, 2008): z_{i}=\displaystyle \frac{1}{w_{i}}\left(1-\frac{w_{i}}{\widehat{N}}\right)\left(\widehat{G}_{w} - \widehat{G}_{w;-i}\right), where \widehat{G}_{w;-i} denotes the estimator \widehat{G}_{w} computed from S\setminus\{i\}, i.e., from the sample S after removing the ith unit. For a confidence level 1-\alpha, the "zjackknife" confidence interval is defined as \left[\widehat{G}_{w} - Z_{1-\alpha/2}\sqrt{\widehat{V}(\widehat{G}_{w})}, \widehat{G}_{w} + Z_{1-\alpha/2}\sqrt{\widehat{V}(\widehat{G}_{w})} \right], where the variance \widehat{V}(\widehat{G}_{w}) is computed using the pseudo-values z_i and any of the aforementioned types of variance estimators (Horvitz-Thompson, Sen-Yates-Grundy, or Hartley-Rao).
The linearization technique for variance estimation (Deville, 1999) has been applied to the following estimators of the Gini index: \widehat{G}_{w}^{a}= \displaystyle \frac{1}{2\widehat{N}^{2}\overline{y}_{w}}\sum_{i \in S}\sum_{j \in S}w_{i}w_{j}|y_{i}-y_{j}|, and \widehat{G}_{w}^{b} = \displaystyle \frac{2}{\widehat{N}\overline{y}_{w}}\sum_{i \in S}w_{i}y_{i}\widehat{F}_{w}(y_{i})-1, where \widehat{F}_{w}(t)=\frac{1}{\widehat{N}}\sum_{i \in S}w_i\delta(y_i \leq t). For a given estimator \widehat{G}_w and a linearized variable z, the confidence interval, with confidence level 1-\alpha, is defined as \left[\widehat{G}_w - Z_{1-\alpha/2}\sqrt{\widehat{V}(\widehat{G}_w)}, \widehat{G}_w + Z_{1-\alpha/2}\sqrt{\widehat{V}(\widehat{G}_w)} \right], where the variance \widehat{V}(\widehat{G}_w) is computed using the corresponding pseudo-values and any of the aforementioned types of variance estimators (Horvitz-Thompson, Sen-Yates-Grundy, or Hartley-Rao).
On the one hand, interval = "zalinearization" linearizes the estimator \widehat{G}_{w}^{a}, and the corresponding pseudo-values are defined as (Langel and Tillé, 2013): z_{(i)}^{a}=\frac{1}{\widehat{N}^{2}\overline{y}_w}\left[ 2\widehat{N}_{(i)}\left( y_{(i)} - \widehat{\overline{Y}}_{(i)}\right) + \widehat{N}\left\{ \overline{y}_{w} - y_{(i)} - \widehat{G}_{w}^{a}\left(\overline{y}_{w} + y_{(i)} \right) \right\} \right], where \widehat{\overline{Y}}_{(i)} = \displaystyle \frac{1}{\widehat{N}_{(i)}}\sum_{j=1}^{i}w_{(j)}^{+}y_{(j)}.
On the other hand, interval = "zblinearization" linearizes the estimator \widehat{G}_{w}^{b}, and the corresponding pseudo-values are (see Berger, 2008): z_i^{b}=\frac{1}{\widehat{N}\overline{y}_{w}}\left[ 2y_i\widehat{F}_{w}(y_i) - (\widehat{G}_{w}^{b}+1)(y_i+\overline{y}_{w})+\frac{2}{\widehat{N}}\sum_{j \in S}w_jy_j\delta(y_j \geq y_i) \right], where \widehat{F}_{w}(t) = \displaystyle \frac{1}{\widehat{N}}\sum_{i \in S}w_{i}\delta(y_i \leq t).
# Gini index estimation and confidence interval using:
## a: The method 2 for point estimation.
## b: The method 'zalinearization' for variance estimation.
## c: The Sen-Yates-Grundy type variance estimator.
## d: The Hájek approximation for the joint inclusion probabilities.
fgini(y, w, interval = "zalinearization")
#> $Gini
#> [1] 0.3205489
#>
#> $Interval
#> lower upper
#> [1,] 0.2946057 0.346492
#>
#> $Variance
#> [1] 0.0001752056
# Gini index estimation and confidence interval using:
## a: The method 3 for point estimation.
## b: The method 'zblinearization' for variance estimation.
## c: The Sen-Yates-Grundy type variance estimator.
## d: The Hájek approximation for the joint inclusion probabilities.
fgini(y, w, method = 3, interval = "zblinearization")
#> $Gini
#> [1] 0.3205489
#>
#> $Interval
#> lower upper
#> [1,] 0.2944802 0.3466175
#>
#> $Variance
#> [1] 0.0001769051
Alfons, A., and Templ, M. (2012). Estimation of social exclusion indicators from complex surveys: The R package laeken. KU Leuven, Faculty of Business and Economics Working Paper.
Berger, Y. G. (2005). Variance estimation with highly stratified sampling designs with unequal probabilities. Australian & New Zealand Journal of Statistics, 47, 365–373.
Berger, Y. G. (2008). A note on the asymptotic equivalence of jackknife and linearization variance estimation for the Gini Coefficient. Journal of Official Statistics, 24(4), 541-555.
Berger, Y. G. (2011). Asymptotic consistency under large entropy sampling designs with unequal probabilities. Pakistan Journal of Statistics, 27, 407–426.
Berger, Y. G. and Tillé, Y. (2009). Sampling with unequal probabilities. In Sample Surveys: Design, Methods and Applications (eds. D. Pfeffermann and C. R. Rao), 39–54. Elsevier, Amsterdam.
Berger, Y., and Gedik Balay, İ. (2020). Confidence intervals of Gini coefficient under unequal probability sampling. Journal of Official Statistics, 36(2), 237-249.
David, H.A. (1968). Gini’s mean difference rediscovered. Biometrika, 55, 573–575.
Davison, A. C., and Hinkley, D. V. (1997). Bootstrap Methods and Their Application (Cambridge Series in Statistical and Probabilistic Mathematics, No. 1). Cambridge University Press.
Deville, J.C. (1999). Variance Estimation for Complex Statistics and Estimators: Linearization and Residual Techniques. Survey Methodology, 25, 193–203.
Deltas, G. (2003). The small-sample bias of the Gini coefficient: results and implications for empirical research. Review of Economics and Statistics, 85(1), 226-234.
Giorgi, G. M., and Gigliarano, C. (2017). The Gini concentration index: a review of the inference literature. Journal of Economic Surveys, 31(4), 1130-1148.
Hájek, J. (1964). Asymptotic theory of rejective sampling with varying probabilities from a finite population. The Annals of Mathematical Statistics, 35(4), 1491–1523.
Hartley, H. O., and Rao, J. N. K. (1962). Sampling with unequal probabilities and without replacement. The Annals of Mathematical Statistics, 350-374.
Haziza, D., Mecatti, F. and Rao, J. N. K. (2008). Evaluation of some approximate variance estimators under the Rao-Sampford unequal probability sampling design. Metron, LXVI, 91–108.
Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47, 663–685.
Kendall, M., and Stuart, A. (1977). The advanced theory of statistics. Vol. 1: Distribution Theory. London: Griffin.
Langel, M., and Tillé, Y. (2013). Variance estimation of the Gini index: revisiting a result several times published. Journal of the Royal Statistical Society: Series A (Statistics in Society), 176(2), 521-540.
Lerman, R. I., and Yitzhaki, S. (1989). Improving the accuracy of estimates of Gini coefficients. Journal of econometrics, 42(1), 43-47.
Muñoz, J. F., Moya-Fernández, P. J., and Álvarez-Verdejo, E. (2023). Exploring and Correcting the Bias in the Estimation of the Gini Measure of Inequality. Sociological Methods & Research. https://doi.org/10.1177/00491241231176847
Ogwang, T. (2000). A convenient method of computing the Gini index and its standard error. Oxford Bulletin of Economics and Statistics, 62(1), 123-129.
Owen, A. B. (2001). Empirical likelihood. CRC press.
Qin, Y., Rao, J. N. K., and Wu, C. (2010). Empirical likelihood confidence intervals for the Gini measure of income inequality. Economic Modelling, 27(6), 1429-1435.
Rao, J. N. K., Wu, C. F. J., and Yue, K. (1992). Some recent work on resampling methods for complex surveys. Survey methodology, 18(2), 209-217.
Rust, K. F., and Rao, J. N. K. (1996). Variance estimation for complex surveys using replication techniques. Statistical methods in medical research, 5(3), 283-310.
Sen, A. R. (1953). On the estimate of the variance in sampling with varying probabilities. Journal of the Indian Society of Agricultural Statistics, 5, 119–127.
Tillé, Y. (2006). Sampling Algorithms. Springer, New York.
Yates, F., and Grundy, P. M. (1953). Selection without replacement from within strata with probability proportional to size. Journal of the Royal Statistical Society B, 15, 253–261.