This paper describes a range of ‘minimum distance’ methods used to compute new weights for large cross-sectional surveys used in microsimulation modelling. Extraneous information about a range of population variables is used for calibration purposes. An iterative solution procedure is described and numerical examples are given, involving comparisons among alternative distance functions. An application to the New Zealand Household Economic Survey (HES) is reported.
This paper arose from Treasury modelling of hypothetical reforms to the New Zealand tax and transfer system. I should like to thank Ivan Tuckwell for helpful discussions and for providing HES data in the required form. I have also benefited from discussions with Guyonne Kalb and Nathan McClellan, and comments by Mike Doherty and Dean Hyslop on an earlier version.
The views expressed in this Working Paper are those of the author(s) and do not necessarily reflect the views of the New Zealand Treasury. The paper is presented not as policy, but with a view to inform and stimulate wider debate.
1 Introduction
Tax microsimulation models are based on large-scale cross-sectional survey data. Each individual or household has a sample weight provided by the statistical agency responsible for collecting the data. The typical starting point is to use weights that are inversely related to the probability of selecting the individual in a random sample, with some adjustment for non-response. It has become common for agencies, using ‘minimal’ adjustments, to produce revised weights to ensure that, for example, the estimated population age/gender distributions match population totals obtained from other sources, in particular census data. Such calibration methods appear to be well known among survey statisticians, a highly influential paper being that by Deville and Särndal (1992).
Users of official data usually take the weights as given when ‘grossing up’ from the sample in order to obtain estimates of population values. This applies not only to simple aggregates, such as income taxation, the number of recipients of a particular social transfer, or the number of people in a particular age group, but also to the estimation of measures of population inequality or poverty. However, there is no guarantee that weights calibrated on demographic variables produce appropriate revenue, expenditure and income distribution results.
One aim of this paper is therefore to describe the basic calibration approach to economic modellers who are not familiar with the survey literature but need to reweight their samples. This may arise, for example, if population aggregates, not used for official calibrations, are not sufficiently close to population values obtained from other data sources, such as tax and benefit administration data. A further important reason for wanting to reweight the data arises when a survey from one year is used to examine the likely implications of, say, a tax and transfer policy in a later year. This need can arise if cross-sectional surveys are not carried out every year or if there are long delays in releasing data. Nevertheless, other administrative data may be available at more frequent intervals. It is also useful to be able to allow for changes in, say, the age distribution of the population or in aggregate unemployment rates over time.
The basic problem of obtaining ‘minimum distance’ weights is described more formally in section 2. The chi-squared distance function has an explicit solution and this is derived in section 3. A more general class of distance measures is discussed in section 4, where iterative solutions are needed. These sections provide a simplified exposition, with derivations, of some of the results stated by Deville and Särndal (1992), whose more sophisticated and comprehensive treatment concentrated on statistical inference issues. The use of Newton’s method for the solution of the nonlinear equations is explored. Numerical examples are used to compare alternative distance functions, based on a small hypothetical sample. Finally, in section 5 the methods are applied to New Zealand Household Economic Survey (HES) data. Brief conclusions are in section 6.
- A detailed description of calibration and Generalised Regression (GREG) methods used in Belgium is given in Vanderhoeft (2001), which also describes the SPSS based program g-CALIB-S. Bell (2000) describes methods used in the Australian Bureau of Statistics household surveys, involving the SAS software GREGWT. Statistics Sweden uses the SAS software CLAN, described by Andersson and Nordberg (1998) and also used by the Finnish Labour Force Survey. All results in the present paper were obtained using Fortran programs written by the author.
- The link between this method and Generalised Regression estimators of population totals is discussed briefly at the end of the section. See especially Särndal et al. (1992).
- Deville and Särndal (1992) used fewer than two pages to state the results discussed here.
2 The problem
For each of $K$ individuals in a sample survey, information is available about $J$ variables; these are placed in the vector:
$$x_k = [x_{k,1}, \ldots, x_{k,J}]' \tag{1}$$
For present purposes these vectors contain only the variables of interest for the calibration exercise (rather than all measured variables). Many of the elements of $x_k$ are likely to be 0/1 variables. For example, $x_{k,j} = 1$ if the $k$th individual is in a particular age group (or receives a particular type of social transfer), and zero otherwise. The sum $\sum_{k=1}^{K} x_{k,j}$ therefore gives the number of individuals in the sample who are in the age group (or who receive the transfer payment).
Let the sample design weights (provided by the statistical agency responsible for data collection) be denoted $s_k$, for $k = 1, \ldots, K$. These weights can be used to produce estimated population totals, $\hat{t}_{x|s}$, based on the sample, given by the $J$-element vector:
$$\hat{t}_{x|s} = \sum_{k=1}^{K} s_k x_k \tag{2}$$
The problem examined in this paper can be stated as follows. Suppose that other data sources, for example census or social security administrative data, provide information about ‘true’ population totals, $t_x$. The problem is to compute new weights, $w_k$, for $k = 1, \ldots, K$, which are as close as possible to the design weights, $s_k$, while satisfying the set of $J$ calibration equations:
$$t_x = \sum_{k=1}^{K} w_k x_k \tag{3}$$
It is thus necessary to specify a criterion by which to judge the closeness of the two sets of weights.
In general, denote the distance between $w_k$ and $s_k$ as $G(w_k, s_k)$. The aggregate distance between the design and calibrated weights is thus:
$$D = \sum_{k=1}^{K} G(w_k, s_k) \tag{4}$$
The problem is therefore to minimise (4) subject to (3). The Lagrangean for this problem is:
$$L = \sum_{k=1}^{K} G(w_k, s_k) + \sum_{j=1}^{J} \lambda_j \left( t_{x,j} - \sum_{k=1}^{K} w_k x_{k,j} \right) \tag{5}$$
where $\lambda_j$, for $j = 1, \ldots, J$, are the Lagrange multipliers. The following two sections consider methods of obtaining values of $w_k$ that minimise (5).
- Some authors, such as Folsom and Singh (2000), write the distance to be minimised as $G(s_k, w_k)$, but the present paper follows Deville and Särndal (1992).
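To make the notation concrete, the following minimal Python sketch computes the estimated totals in (2) and the gap that the calibration equations in (3) require the new weights to close. The four-person sample, the design weights and the ‘true’ totals are invented purely for illustration, and the function name `weighted_totals` is hypothetical.

```python
from typing import List, Sequence


def weighted_totals(weights: Sequence[float],
                    x: Sequence[Sequence[float]]) -> List[float]:
    """Return the J-element vector sum_k weights[k] * x[k]."""
    n_vars = len(x[0])
    return [sum(w * row[j] for w, row in zip(weights, x))
            for j in range(n_vars)]


# Hypothetical sample: K = 4 individuals, J = 2 variables
# (column 0: 1 if 'young'; column 1: 1 if unemployed).
x = [[1, 0], [0, 1], [1, 1], [0, 0]]
s = [2.0, 3.0, 1.0, 4.0]      # design weights (assumed values)
t_x = [4.0, 3.0]              # 'true' population totals (assumed values)

t_hat = weighted_totals(s, x)                   # estimated totals, as in (2)
gap = [t - th for t, th in zip(t_x, t_hat)]     # what calibration must close
print(t_hat, gap)
```

Any set of new weights satisfying (3) must reproduce `t_x` exactly when passed to the same function; the choice of distance function determines which of the many such sets is selected.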
3 An explicit solution
The constrained minimisation problem stated above has an explicit solution for a distance function based on the chi-squared measure. This is discussed in subsection 1. A numerical example is examined in subsection 2.
3.1 The Chi-squared distance measure
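Consider the chi-squared type of distance measure, where the aggregate distance is given by:
$$D = \frac{1}{2} \sum_{k=1}^{K} \frac{(w_k - s_k)^2}{s_k} \tag{6}$$
The Lagrangean in (5) can be written as:
$$L = \frac{1}{2} \sum_{k=1}^{K} \frac{(w_k - s_k)^2}{s_k} + \sum_{j=1}^{J} \lambda_j \left( t_{x,j} - \sum_{k=1}^{K} w_k x_{k,j} \right) \tag{7}$$
where $\lambda_j$, for $j = 1, \ldots, J$, are the Lagrange multipliers, and $t_{x,j}$ is the $j$th element of the vector of known population aggregates, $t_x$.
Differentiation of (7) gives the set of $K$ equations:
$$\frac{\partial L}{\partial w_k} = \frac{w_k - s_k}{s_k} - \sum_{j=1}^{J} \lambda_j x_{k,j} = 0 \tag{8}$$
for $k = 1, \ldots, K$, along with the $J$ conditions in (3). Rewriting $\sum_{j=1}^{J} \lambda_j x_{k,j} = x_k'\lambda$, where the prime indicates transposition, and multiplication of each equation in (8) by $s_k$ gives, after rearrangement:
$$w_k = s_k (1 + x_k'\lambda) \tag{9}$$
To solve for the Lagrange multipliers, pre-multiply (9) by $x_k$ and rearrange, so that:
$$w_k x_k - s_k x_k = (s_k x_k x_k') \lambda \tag{10}$$
Summing (10) over all $k$ and making use of the calibration equations, gives:
$$t_x - \hat{t}_{x|s} = \left[ \sum_{k=1}^{K} s_k x_k x_k' \right] \lambda \tag{11}$$
where the term in brackets on the right hand side of (11) is a $J \times J$ square matrix. Hence, if this matrix can be inverted, the vector of Lagrange multipliers is given by:
$$\lambda = \left[ \sum_{k=1}^{K} s_k x_k x_k' \right]^{-1} (t_x - \hat{t}_{x|s}) \tag{12}$$
The resulting values of $\lambda$ are substituted into (9) to obtain the new weights.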
- Write (9) as $w_k = s_k(1 + x_k'\lambda)$ and (12) as $\lambda = T^{-1}(t_x - \hat{t}_{x|s})$, with $T$ as the symmetric matrix $\sum_k s_k x_k x_k'$. Given sample observations on the variable $y_k$, an estimate of the population total, $\hat{t}_{y|w}$, can be obtained as $\sum_k w_k y_k$. Substituting for $w_k$ gives the result in Deville and Särndal (1992, p.377) that $\hat{t}_{y|w} = \hat{t}_{y|s} + (t_x - \hat{t}_{x|s})'\hat{B}$, where $\hat{t}_{y|s} = \sum_k s_k y_k$ and $\hat{B} = T^{-1} \sum_k s_k x_k y_k$. This provides the link between reweighting and the Generalised Regression (GREG) estimator. The production of asymptotic standard errors is often based on this estimator, in view of the result that other distance functions are asymptotically equivalent; see Deville and Särndal (1992, p.378). The present discussion concentrates only on reweighting.
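The explicit chi-squared solution can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's Fortran program: the data are invented, the function names `solve` and `chi_squared_weights` are hypothetical, and a small Gaussian elimination routine stands in for a library matrix inversion so that the block is self-contained.

```python
from typing import List, Sequence, Tuple


def solve(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def chi_squared_weights(s: Sequence[float],
                        x: Sequence[Sequence[float]],
                        t_x: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Return (w, lam) with w_k = s_k (1 + x_k' lam), following (9) and (12)."""
    n_vars = len(t_x)
    t_hat = [sum(sk * xk[j] for sk, xk in zip(s, x)) for j in range(n_vars)]
    # J x J matrix sum_k s_k x_k x_k' from (11)
    big_t = [[sum(sk * xk[i] * xk[j] for sk, xk in zip(s, x))
              for j in range(n_vars)] for i in range(n_vars)]
    lam = solve(big_t, [t - th for t, th in zip(t_x, t_hat)])
    w = [sk * (1.0 + sum(xj * lj for xj, lj in zip(xk, lam)))
         for sk, xk in zip(s, x)]
    return w, lam


# Hypothetical data: 4 individuals, 2 binary calibration variables.
x = [[1, 0], [0, 1], [1, 1], [0, 0]]
s = [2.0, 3.0, 1.0, 4.0]
t_x = [4.0, 3.0]
w, lam = chi_squared_weights(s, x, t_x)
print(w, lam)
```

The individual with zeros for both variables keeps the design weight, since $x_k'\lambda = 0$ in (9). Nothing in this calculation prevents a weight from becoming negative when the required adjustment is large.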
3.2 A small example
The above procedure may be illustrated using a simple example. Suppose there are four variables ($J = 4$) of concern, for which population values, $t_x$, are available. The hypothetical data, for a sample of 20 individuals, are shown in Table 1. Suppose variable $x_1$ refers to age, so that $x_{k,1} = 1$ for those who are ‘young’ and is zero otherwise, while $x_{k,2} = 1$ for those who are unemployed, and zero otherwise. Variable $x_3$ measures earnings from employment, while variable $x_4$ is another categorical variable referring to location ($x_{k,4} = 1$
if the individual lives in a city, and is zero otherwise). Given the sample design weights shown in the penultimate column of Table 1, the estimated population totals are equal to
The symmetric matrix $\sum_k s_k x_k x_k'$ and its inverse are given in Table 2. The zero elements reflect the property of the basic data, that only individuals who work (for whom $x_{k,2} = 0$) are assumed to receive positive earnings, $x_{k,3}$
. Suppose that the known population totals are
reflecting a younger population than in the sample weights and a lower unemployment rate. The resulting calibrated weights are shown in the final column of Table 1.
The required adjustments to the weights can clearly be seen to be consistent with expectations, given the calibration requirements and the characteristics of the individuals. For example, the weights for individuals 2, 9 and 20 fall by a relatively large amount (from 3 to 2.109), since these individuals are all unemployed, old and living in rural locations, for all of which the aggregates are required to fall. The weights for individuals 1 and 6 do not drop so far because, although these are unemployed and in a rural location, they are young. The weight for person 12 falls by a small amount because, although unemployed, this person is young and in a city. The weights for individuals 13, 17 and 19 increase by relatively large amounts as they are young, employed and living in a city.
- The number of variables needed is of course one less than the number of categories of each type, otherwise singularity problems arise.
4 Alternative Distance Functions
The chi-squared distance function is convenient because it enables an explicit solution for the calibrated weights to be obtained, requiring only matrix inversion. However, a modified form of the same approach can be applied to a range of alternative distance functions, as shown in this section. These functions belong to a class of functions having two features: the first derivative with respect to $w_k$ can be expressed as a function of $w_k/s_k$, and its inverse can be obtained explicitly. An iterative solution procedure is required for the calculation of the Lagrange multipliers. The general case of this class is presented in subsection 1. An iterative approach based on Newton’s method is described in subsection 2. Several weighting functions are described in subsection 3 and illustrated in subsection 4.
4.1 The general case
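The Lagrangean for the general case, stated in section 2, was written as:
$$L = \sum_{k=1}^{K} G(w_k, s_k) + \sum_{j=1}^{J} \lambda_j \left( t_{x,j} - \sum_{k=1}^{K} w_k x_{k,j} \right) \tag{13}$$
Suppose that $G(w_k, s_k)$ has the property, shared with the chi-square distance function, that the differential with respect to $w_k$ can be expressed as a function of the ratio $w_k/s_k$, so that:
$$\frac{\partial G(w_k, s_k)}{\partial w_k} = g\left( \frac{w_k}{s_k} \right) \tag{14}$$
The $K$ first-order conditions for minimisation can therefore be written as:
$$g\left( \frac{w_k}{s_k} \right) = x_k'\lambda \tag{15}$$
Write the inverse function of $g$ as $g^{-1}$, so that if $g(w_k/s_k) = u$, say, then $w_k/s_k = g^{-1}(u)$. In the case of the chi-square distance function used above, $g(w_k/s_k) = w_k/s_k - 1$, and the inverse takes a simple linear form. In general, from (15) the $K$ values of $w_k$ are expressed as:
$$w_k = s_k \, g^{-1}(x_k'\lambda) \tag{16}$$
If the inverse function, $g^{-1}$, can be obtained explicitly, equation (16) can be used to compute the calibrated weights, given a solution for the vector, $\lambda$.
As before, the Lagrange multipliers can be obtained by post-multiplying (16) by the vector $x_k$, summing over all $k = 1, \ldots, K$ and using the calibration equations, so that:
$$t_x = \sum_{k=1}^{K} w_k x_k = \sum_{k=1}^{K} s_k \, g^{-1}(x_k'\lambda) \, x_k \tag{17}$$
Finally, subtracting $\hat{t}_{x|s} = \sum_{k=1}^{K} s_k x_k$ from both sides of (17) gives:
$$t_x - \hat{t}_{x|s} = \sum_{k=1}^{K} s_k \left\{ g^{-1}(x_k'\lambda) - 1 \right\} x_k \tag{18}$$
The term $g^{-1}(x_k'\lambda) - 1$ is of course a scalar, and the left hand side is a known vector. In general, (18) is nonlinear in the vector $\lambda$ and so must be solved using an iterative procedure, as described in the following subsection.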
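As a check on (18), the following short substitution (a worked step added here, not part of the original derivation) confirms that the chi-squared case reproduces the explicit solution of section 3. With $g(w_k/s_k) = w_k/s_k - 1$, the inverse is $g^{-1}(u) = 1 + u$, so that:

$$g^{-1}(x_k'\lambda) - 1 = x_k'\lambda \quad\Longrightarrow\quad t_x - \hat{t}_{x|s} = \sum_{k=1}^{K} s_k (x_k'\lambda) x_k = \left[ \sum_{k=1}^{K} s_k x_k x_k' \right] \lambda$$

This is equation (11), which is linear in $\lambda$; iteration is needed only when $g^{-1}$ is nonlinear.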
4.2 An iterative procedure
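Writing $a = t_x - \hat{t}_{x|s}$ for the known vector on the left hand side, the equations in (18) can be written as:
$$f_i(\lambda) = a_i - \sum_{k=1}^{K} s_k x_{k,i} \left\{ g^{-1}(x_k'\lambda) - 1 \right\} = 0 \tag{19}$$
for $i = 1, \ldots, J$. The roots can be obtained using Newton’s method, described in the Appendix. This involves the following iterative sequence, where $\lambda^{[I]}$ denotes the value of $\lambda$ in the $I$th iteration:
$$\lambda^{[I+1]} = \lambda^{[I]} - \left[ \frac{\partial f(\lambda)}{\partial \lambda} \right]^{-1}_{\lambda^{[I]}} f(\lambda^{[I]}) \tag{20}$$
The Hessian matrix $[\partial f(\lambda)/\partial \lambda]$ and the vector $f(\lambda)$ on the right hand side of (20) are evaluated using $\lambda^{[I]}$. The elements $\partial f_i(\lambda)/\partial \lambda_\ell$ are given by:
$$\frac{\partial f_i(\lambda)}{\partial \lambda_\ell} = -\sum_{k=1}^{K} s_k x_{k,i} \frac{\partial g^{-1}(x_k'\lambda)}{\partial \lambda_\ell} \tag{21}$$
which can be written as:
$$\frac{\partial f_i(\lambda)}{\partial \lambda_\ell} = -\sum_{k=1}^{K} s_k x_{k,i} x_{k,\ell} \frac{d g^{-1}(x_k'\lambda)}{d(x_k'\lambda)} \tag{22}$$
Starting from arbitrary initial values, the matrix equation in (20) is used repeatedly to adjust the values until convergence is reached, where possible.
As mentioned earlier, the application of the approach requires that it is limited to distance functions for which the form of the inverse function, $g^{-1}(u)$, can be obtained explicitly, given the specification for $G(w_k, s_k)$. Hence, the Hessian can easily be evaluated at each step using an explicit expression for $dg^{-1}(u)/du$. As these expressions avoid the need for the numerical evaluation of $g^{-1}(u)$ and $dg^{-1}(u)/du$ for each individual at each step, the calculation of the new weights can be expected to be relatively quick, even for large samples. However, it must be borne in mind that a solution does not necessarily exist, depending on the distance function used and the adjustment required to the vector $\hat{t}_{x|s}$.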
- The approach described here differs somewhat from other routines described in the literature, for example in Singh and Mohl (1996) and Vanderhoeft (2001). However, it provides extremely rapid convergence.
- Using numerical methods to solve for each $g^{-1}(u)$ and for $dg^{-1}(u)/du$, for every individual in each iteration, would increase the computational burden substantially.
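The iterative procedure of this subsection can be sketched as follows. This is a minimal illustration rather than the author's program: the function names are hypothetical, the data are invented, the starting value $\lambda = 0$ (so that $w_k = s_k$) is an assumption, and the exponential inverse function $g^{-1}(u) = \exp(u)$ is used only because it is a standard multiplicative choice with a simple derivative, not because it is one of the cases tabulated in the paper.

```python
import math
from typing import Callable, List, Sequence, Tuple


def solve(a: Sequence[Sequence[float]], b: Sequence[float]) -> List[float]:
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][c] * z[c] for c in range(i + 1, n))
        z[i] = (m[i][n] - tail) / m[i][i]
    return z


def calibrate(s: Sequence[float], x: Sequence[Sequence[float]],
              t_x: Sequence[float],
              g_inv: Callable[[float], float],
              g_inv_prime: Callable[[float], float],
              tol: float = 1e-10, max_iter: int = 50
              ) -> Tuple[List[float], List[float]]:
    """Newton iterations on f(lam) = 0; returns (w, lam) with w_k = s_k g_inv(x_k' lam)."""
    n_vars = len(t_x)
    t_hat = [sum(sk * xk[j] for sk, xk in zip(s, x)) for j in range(n_vars)]
    a = [t - th for t, th in zip(t_x, t_hat)]
    lam = [0.0] * n_vars                      # assumed starting point
    for _ in range(max_iter):
        u = [sum(xj * lj for xj, lj in zip(xk, lam)) for xk in x]
        # f_i = a_i - sum_k s_k x_ki {g_inv(u_k) - 1}
        f = [a[i] - sum(sk * xk[i] * (g_inv(uk) - 1.0)
                        for sk, xk, uk in zip(s, x, u))
             for i in range(n_vars)]
        if max(abs(v) for v in f) < tol:
            return [sk * g_inv(uk) for sk, uk in zip(s, u)], lam
        # df_i/dlam_l = - sum_k s_k x_ki x_kl g_inv'(u_k)
        jac = [[-sum(sk * xk[i] * xk[l] * g_inv_prime(uk)
                     for sk, xk, uk in zip(s, x, u))
                for l in range(n_vars)] for i in range(n_vars)]
        step = solve(jac, f)
        lam = [lj - dj for lj, dj in zip(lam, step)]
    raise RuntimeError("Newton iterations did not converge")


# Hypothetical data: 4 individuals, 2 binary calibration variables.
x = [[1, 0], [0, 1], [1, 1], [0, 0]]
s = [2.0, 3.0, 1.0, 4.0]
t_x = [4.0, 3.0]
w, lam = calibrate(s, x, t_x, math.exp, math.exp)
print(w, lam)
```

Passing `g_inv = lambda u: 1.0 + u` with `g_inv_prime = lambda u: 1.0` reproduces the chi-squared solution after a single step, which is a convenient check on any implementation. The `RuntimeError` branch reflects the point made above that a solution need not exist.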
4.3 Some distance functions
One reason why the chi-squared distance function produces a solution is that no constraints are placed on the size of the adjustment to each of the survey weights. It is therefore also possible for the calibrated weights to become negative. However, Deville and Särndal (1992) suggested the following simple modification to the chi-squared function, although the explicit solution for the chi-squared case is no longer available and the iterative method must be used.
Suppose it is required to constrain the proportionate changes to certain limits, different for increases compared with decreases in the weights. Define $r_L$ and $r_U$ such that $r_L < 1 < r_U$. The objective is to ensure that, for increases, the proportionate change, $(w_k - s_k)/s_k$, is less than $r_U - 1$. For decreases, the aim is to ensure that $(s_k - w_k)/s_k$ (or the negative of the proportional change) is less than $1 - r_L$. For the chi-squared distance function, it has been seen that $w_k/s_k = g^{-1}(u) = 1 + u$, where $u = x_k'\lambda$. Hence if $g^{-1}(u)$ is outside the specified range, it is necessary to set it to the relevant limit, either $r_L$ or $r_U$, rather than allow it to take the value generated. Since $g^{-1}(u) = 1 + u$, it is clear that the limits are exceeded if $u < r_L - 1$ or $u > r_U - 1$. In each case where the value of $g^{-1}(u)$ has to be set to the relevant limit, the corresponding value of $dg^{-1}(u)/du$ is zero. This approach ensures that weights are kept within the range, $r_L s_k \leq w_k \leq r_U s_k$. Hence, negative values of $w_k$ are avoided simply by setting $r_L$ to be positive.
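The truncation rule can be written as a small function returning both the ratio of new to old weight and the derivative needed for the Hessian. This is a sketch only: the name `truncated_linear` and the argument names `r_low` and `r_up` (standing for the lower and upper limits on the ratio) are hypothetical.

```python
from typing import Tuple


def truncated_linear(u: float, r_low: float, r_up: float) -> Tuple[float, float]:
    """Return (g_inv, dg_inv_du) for the chi-squared case with limits.

    The unrestricted ratio is 1 + u; where it falls outside
    [r_low, r_up] it is set to the relevant limit and the
    derivative is set to zero.
    """
    ratio = 1.0 + u
    if ratio < r_low:
        return r_low, 0.0
    if ratio > r_up:
        return r_up, 0.0
    return ratio, 1.0


# Illustrative limits: weights may fall by at most half or rise by at most half.
for u in (-0.8, 0.2, 0.9):
    print(u, truncated_linear(u, r_low=0.5, r_up=1.5))
```

Because the derivative is zero for every individual held at a limit, those individuals contribute nothing to the Hessian. If too many are at the limits the matrix can become singular, which is one way in which the absence of a solution shows up in practice.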
It has been seen above that the solution procedure requires only an explicit form for the inverse function $g^{-1}(u)$, from which its derivative can be obtained. It is not necessary to start from a specification of $G(w_k, s_k)$. Deville and Särndal (1992) suggest the simple form:
The gradient function, $g(w_k/s_k)$, is given by solving (23) for $u$
and the form of the distance function can be obtained by integrating (24). This is referred to as Case A, and its properties are given in the first row of Table 3. The second row of the table provides details of Case B, where
, and the final row gives the corresponding properties of the basic chi-squared function. A feature of these functions is that they do not require any parameters to be set.
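Deville and Särndal (1992) also suggest the use of an inverse function $g^{-1}(u)$ of the form:
$$g^{-1}(u) = \frac{r_L (r_U - 1) + r_U (1 - r_L) \exp(\alpha u)}{(r_U - 1) + (1 - r_L) \exp(\alpha u)} \tag{25}$$
where $r_L$ and $r_U$ are as defined above and:
$$\alpha = \frac{r_U - r_L}{(1 - r_L)(r_U - 1)} \tag{26}$$
Hence $g^{-1}(-\infty) = r_L$ and $g^{-1}(\infty) = r_U$, so that the limits of $w_k/s_k$ are $r_L$ and $r_U$.
This function therefore has the property that adjustments to the weights are kept within the range, $r_L s_k < w_k < r_U s_k$, although, unlike the chi-squared modification, no checks have to be made during computation.
The derivative required in the computation of the Hessian is therefore:
$$\frac{d g^{-1}(u)}{du} = \frac{(r_U - r_L)^2 \exp(\alpha u)}{\left\{ (r_U - 1) + (1 - r_L) \exp(\alpha u) \right\}^2} \tag{27}$$
Equation (25) can be rearranged, by collecting terms in $\exp(\alpha u)$, to give:
$$\exp(\alpha u) = \frac{(r_U - 1)(w_k/s_k - r_L)}{(1 - r_L)(r_U - w_k/s_k)} \tag{28}$$
so that the gradient of the distance function is:
$$g\left( \frac{w_k}{s_k} \right) = u = \frac{1}{\alpha} \left\{ \log\left( \frac{w_k/s_k - r_L}{1 - r_L} \right) - \log\left( \frac{r_U - w_k/s_k}{r_U - 1} \right) \right\} \tag{29}$$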
The special nature of this gradient function is illustrated by the line D-S in Figure 1, which shows the profile of (29) for the wide range where
The first characteristic of the D-S function that is evident is the restriction of $w_k/s_k$ to the range specified. Figure 1 also shows the function $g(w_k/s_k)$ for the other cases discussed above. In all cases, the slope is zero (corresponding to a turning point of the distance function) when $w_k/s_k = 1$. Given the quadratic U-shaped nature of the chi-squared distance function, the gradient increases at a constant rate, being negative in the range $w_k/s_k < 1$. Cases A and B also imply U-shaped distance functions, but with the gradient increasing more sharply for $w_k/s_k < 1$ and more slowly than the chi-square function in the range $w_k/s_k > 1$.
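The bounded inverse function suggested by Deville and Särndal, its derivative and the corresponding gradient function can be coded directly. The sketch below assumes the logit-type form with lower and upper limits on the ratio of new to old weight; the function names and the argument names `r_low` and `r_up` are hypothetical, and the limits used in the example are arbitrary.

```python
import math


def ds_alpha(r_low: float, r_up: float) -> float:
    """Scale constant (r_up - r_low) / ((1 - r_low)(r_up - 1))."""
    return (r_up - r_low) / ((1.0 - r_low) * (r_up - 1.0))


def ds_g_inv(u: float, r_low: float, r_up: float) -> float:
    """Ratio w/s implied by u; always strictly between r_low and r_up."""
    e = math.exp(ds_alpha(r_low, r_up) * u)   # may overflow for very large u
    return ((r_low * (r_up - 1.0) + r_up * (1.0 - r_low) * e)
            / ((r_up - 1.0) + (1.0 - r_low) * e))


def ds_g_inv_prime(u: float, r_low: float, r_up: float) -> float:
    """Derivative of ds_g_inv with respect to u, needed for the Hessian."""
    e = math.exp(ds_alpha(r_low, r_up) * u)
    return (r_up - r_low) ** 2 * e / ((r_up - 1.0) + (1.0 - r_low) * e) ** 2


def ds_gradient(ratio: float, r_low: float, r_up: float) -> float:
    """Gradient of the distance function, the inverse of ds_g_inv."""
    return (math.log((ratio - r_low) / (1.0 - r_low))
            - math.log((r_up - ratio) / (r_up - 1.0))) / ds_alpha(r_low, r_up)


r_low, r_up = 0.5, 2.0                        # arbitrary illustrative limits
for u in (-5.0, 0.0, 0.3, 5.0):
    print(u, ds_g_inv(u, r_low, r_up), ds_g_inv_prime(u, r_low, r_up))
```

At $u = 0$ the ratio is one and the derivative is one, so near the design weights this function behaves like the chi-squared case. Away from that point the ratio flattens towards the limits, which is why no explicit checks are needed during computation.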
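The distance function is given by integrating (29) with respect to $w_k$. It is most convenient to apply the variate transformation $y = w_k/s_k$, so that $dw_k = s_k \, dy$, and, with $r_L$, $r_U$ and $\alpha$ as defined above, it is required to obtain:
$$G(w_k, s_k) = \frac{s_k}{\alpha} \int \left\{ \log\left( \frac{y - r_L}{1 - r_L} \right) - \log\left( \frac{r_U - y}{r_U - 1} \right) \right\} dy \tag{30}$$
Using the result that:
$$\int \log x \, dx = x \log x - x$$
substitution and rearrangement gives
$$G(w_k, s_k) = \frac{s_k}{\alpha} \left\{ (y - r_L) \log\left( \frac{y - r_L}{1 - r_L} \right) + (r_U - y) \log\left( \frac{r_U - y}{r_U - 1} \right) \right\} \tag{33}$$
plus a term that does not depend on $w_k$ and can therefore be dropped.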
- This is much more convenient than imposing inequality constraints and applying the more complex Kuhn-Tucker conditions. Also, it is desirable to restrict the extent of proportional changes even where they produce positive weights.
- Hence it is required to obtain which can be written as , and dropping the last term, which is a constant, this is equal to .
- Deville and Särndal (1992) discuss the use of a normalisation whereby is set to some specified value, but this is not necessary for the approach.
- Singh and Mohl (1996), in reviewing alternative calibration estimators, refer to this ‘inverse logit-type transformation’ as a Generalised Modified Discrimination Information method.
- Equation (33) is the result stated without proof by Deville and Särndal (1992, p. 378).
- Folsom and Singh (2000) propose a variation on this, which they call a ‘generalised exponential model’, in which the limits are allowed to be unit-specific. In practice they suggest the use of three sets of bounds for low, medium and high initial weights.
4.4 Further Numerical Examples
The application of the distance functions presented in the previous subsection to the hypothetical sample used earlier gives the results shown in Table 4, where the simple (unrestricted) chi-squared results are added for comparison. In cases where limits are imposed on the degree of adjustment of the weights, it cannot be expected that a solution will always be available. For this reason, care is needed in the choice of $r_L$ and $r_U$, as discussed below.
The values of
were initially selected as being well outside the range of ratios obtained using the other distance functions. When the range was reduced to the potentially restrictive values of
, none of the ratios obtained was actually at the limits specified. Nevertheless, the change to the weighting function produces a different set of weights, as shown by comparisons in Table 4: some actually move further away from their initial, or survey, weights.
The choice of
shown in the penultimate column of the table, actually places some adjustments to the weights at the lower limit of the range: for individuals 2, 9 and 20, the value of
is equal to
. However, no adjustments are at the upper range specified. If
is raised to
unchanged), unreported results show that individual 16 is placed at the lower limit, along with 2, 9 and 20 as before; in addition individuals 5, 13, 17, 19 are pushed to the upper limit of
. The attempt to raise
means that no solution is possible. However, if
is set to the higher value of
is found to be the highest value (where the range of variation is limited to the second decimal point) of
for which a solution is possible. The two highest ratios needed in this case are for persons 13 and 17, who have
is kept at this value of
the lowest value of
for which a solution exists is
. In this case individuals 1, 2, 6, 9, 16 and 20 are placed at the lower limit and individuals 5, 13, 17 and 19 are placed at the upper limit. Clearly, some care needs to be exercised in the choice of upper and lower limits.
While these examples help to explore the characteristics of the different approaches, it is necessary to examine the practical implementation of the method. This is carried out in the following section.
5 The NZ Household Economic Survey
This section applies the above approaches to the New Zealand Household Economic Survey 2000/01, which is the latest survey available. The aim is to illustrate the application of the approach in a practical context, and to compare the performance of the alternative distance functions. At this point it may be useful to stress that reweighting may cause non-calibrated variables to change in undesirable ways, so that various other checks need to be made.
The variation in the survey weights provided by Statistics NZ for the period 2000/01 is illustrated in Figure 3, where the weights are arranged in ascending order for a sample of 2808 households. It can be seen that the majority of these weights are within a fairly narrow range, although some are substantially higher, suggesting a considerable degree of under-representation of these household types in the sample.
For present purposes, new weights were obtained using calibration values for 2003/4, therefore allowing for population changes. A total of 36 calibration equations were used, covering the total numbers in the following categories for: 11 family composition types; 16 age/sex types; 2 unemployment benefits; 2 Domestic Purpose Benefits; 2 invalidity benefits; 2 sickness benefits; and 1 widow’s benefits.
- Figure 3 – Survey weights
- Figure 4 – Calibrated weights: Chi-Square
- Figure 5 – Ratio of calibrated to survey weights: Chi-Square
The calibrated weights obtained using the basic chi-square distance function are shown in Figure 4, where households are arranged in the same order as in Figure 3 (although the vertical axis has been truncated at 2,500). The corresponding ratios of calibrated to survey weights, $w_k/s_k$, are displayed in increasing order in Figure 5. This clearly shows considerable variability in the weights, with some negative weights resulting. Using the modified chi-square distance function, allowing the ratio of weights, $w_k/s_k$, to be restricted with the limits
resulted in values displayed in Figure 6. The effects of the adjustment can clearly be seen in the extent to which the new weights are restricted to a range of variation around the initial profile. The top and bottom of the profile in Figure 6 are substantially ‘smoothed’; clearly a significant number are placed at the limits, particularly at the lower limit.
The calibrated weights obtained using the distance function in (33), allowing for upper and lower limits to $w_k/s_k$,
are shown in Figure 7, and the ratios are shown in ascending order in Figure 8. In both cases where limits were imposed, the range shown is the narrowest for which a solution was obtained (that is, for which the iterative method used to obtain the Lagrange multipliers converged). The main difference between the modified chi-square case and the distance function in (33) is that the former appears to push more values to the lower edge of the profile.
- Figure 6 – Calibrated weights: Modified Chi-Squared function
- Figure 7 – Calibrated weights: Deville-Särndal function
- Figure 8 – Ratio of calibrated to survey weights: Deville-Särndal function
The use of the other two distance functions failed to produce solutions. Comparing the results for the distance functions producing solutions, it seems that the only serious contenders are the two cases imposing constraints on the proportionate changes in weights. The numerical values of the limits which can be imposed, while still obtaining a solution, appear to be the same for the adjusted chi-squared function and the function in equation (33). Where solutions are available, there seems little to choose between those two cases. However, in further experiments using a larger number of calibration equations, it was found that no solution was available using the distance function in (33), however wide the range of variation allowed. Nevertheless a solution could be obtained using the modified chi-squared distance function. The standard chi-squared function also gave a solution, as expected, but this produced a number of negative weights.
- The precision of some survey estimates may also be lowered, particularly where many calibration constraints are used. Examples are given in Skinner (1999); see also Kalton and Flores-Cervantes (2003).
- These are integrated weights, not the original weights. For a discussion of the use of integrated weighting, as described by Lemaître and Dufour (1987), by Statistics New Zealand, see StatsNZ (2001).
- For each of these types, there is of course one additional category not used. The motivation for selecting these variables involved the use of the data for projecting taxes and benefit expenditures. For a general discussion of variable selection, see Nascimento Silva and Skinner (1997).
- Where a solution was not available, the procedure ‘exploded’ relatively quickly, after just a few iterations. Otherwise convergence was achieved rapidly.
6 Conclusions
This paper has examined a range of minimum distance methods used to compute new weights for large cross-sectional surveys used in microsimulation modelling. The methods involve the use of extraneous information about a range of population variables, for calibration purposes. The distance functions were restricted to those for which the first derivative can be expressed as a function of the ratio of the new weights to the survey weights, and for which that function can be inverted explicitly. In general, an iterative solution procedure is required. An approach based on Newton’s method was described and numerical examples were given for several distance functions. Finally, the performance of the method was examined using the New Zealand Household Economic Survey. Rapid convergence of the iterations was obtained, although care needs to be taken when imposing limits on the proportional adjustments to sample weights. Since the same basic approach (and computer program) can easily examine a range of distance functions, and Newton’s method converges extremely quickly, it is relatively costless to consider the full range of distance measures. However, in practice, convergence cannot be expected using all measures.
Finally it is worth remembering that reweighting may cause the distribution of important variables, in particular alternative sources of income, to change. Checks on changes in a range of distributions are therefore recommended.
- This point is also made by Klevmarken (1998).
Appendix: Newton’s method
Consider finding the root of the single equation in one variable, $f(x) = 0$, where $f(x)$ takes the form shown in Figure 9. Newton’s method involves taking an arbitrary starting point, $x_0$, and drawing the tangent, with slope $f'(x_0)$. By approximating the function by the tangent, the new value is given by the point of intersection of this tangent with the $x$ axis, $x_1$. Using $x_1$ as the next starting point leads quickly to the required root.
- Figure 9 – Newton’s method
From the triangle in Figure 9:
$$f'(x_0) = \frac{f(x_0)}{x_0 - x_1}$$
Hence, starting from $x_0$, the sequence of iterations is:
$$x_{I+1} = x_I - \frac{f(x_I)}{f'(x_I)}$$
Convergence is reached when $|x_{I+1} - x_I| < \epsilon$, where $\epsilon$ depends on the accuracy required. Newton’s method is easily adapted to deal with a set of equations, $f_i(x) = 0$, for $i = 1, \ldots, n$, where $x$ is a vector. The method involves repeatedly solving the following matrix equation, where $x^{[I]}$ now denotes the vector in the $I$th iteration and $f(x^{[I]})$ is a vector containing the $f_i$ values:
$$x^{[I+1]} = x^{[I]} - \left[ \frac{\partial f(x)}{\partial x} \right]^{-1}_{x^{[I]}} f(x^{[I]})$$
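The single-equation case can be illustrated with a short Python sketch. The function name `newton`, the tolerance and the test equation $x^2 - 2 = 0$ are choices made here for illustration only.

```python
from typing import Callable


def newton(f: Callable[[float], float],
           f_prime: Callable[[float], float],
           x0: float, eps: float = 1e-10, max_iter: int = 100) -> float:
    """Iterate x_new = x - f(x) / f'(x) until successive values differ by < eps."""
    x = x0
    for _ in range(max_iter):
        x_new = x - f(x) / f_prime(x)
        if abs(x_new - x) < eps:
            return x_new
        x = x_new
    raise RuntimeError("Newton iterations did not converge")


# Illustrative equation: the positive root of x^2 - 2 = 0.
root = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, x0=1.0)
print(root)
```

The multivariate version replaces the division by $f'(x)$ with the solution of a linear system involving the matrix of partial derivatives, as in equation (20).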
References
Andersson, C. and Nordberg, L. (1998) A User’s Guide to CLAN 97. Statistics Sweden.
Bell, P. (2000) Weighting and standard error estimation for ABS household surveys. Paper prepared for ABS Methodology Advisory Committee: Australian Bureau of Statistics.
Deville, J.-C. and Särndal, C.-E. (1992) Calibration estimators in survey sampling. Journal of the American Statistical Association, 87, pp. 376-382.
Folsom, R.E. Jnr. and Singh, A.C. (2000) The generalized exponential model for sampling weight calibration for extreme values, non-response and post-stratification. Proceedings of the Survey Research Methods Section: American Statistical Association. http://www.amstat.org/sections/srms/proceedings/papers/2000_099.pdf.
Kalton, G. and Flores-Cervantes, I. (2003) Weighting methods. Journal of Official Statistics, 19, pp. 81-98.
Klevmarken, N.A. (1998) Statistical inference in micro-simulation models: incorporating external information. Uppsala University Department of Economics Working Paper. http://www.nek.uu.se/Pdf/1998wp20.pdf.
Lemaître, G. and Dufour, J. (1987) An integrated method for weighting persons and families. Survey Methodology, 13, pp. 199-207.
Nascimento Silva, P.L.D. and Skinner, C. (1997) Variable selection for regression estimation in finite populations. Survey Methodology, 23, pp. 23-32.
Särndal, C.-E., Swensson, B. and Wretman, J. (1992) Model Assisted Survey Sampling. New York: Springer-Verlag.
Singh, A.C. and Mohl, C.A. (1996) Understanding calibration estimators in survey sampling. Survey Methodology, 22, pp. 107-115.
Skinner, C. (1999) Calibration weighting and non-sampling errors. Research in Official Statistics, 2, pp. 33-43.
Statistics New Zealand (2001) Information Paper: The Introduction of Integrated Weighting to the 2000/2001 Household Economic Survey. Statistics New Zealand.
Vanderhoeft, C. (2001) Generalised calibration at Statistics Belgium. Statistics Belgium Working Paper, no. 3.