Official Statistics Research funded 6 projects in the 2007/08 year.
Confidentialising microdata | Generating synthetic data | Unbiased estimation of linked data analysis | Sampling for subpopulations in household surveys with application to Māori and Pacific sampling | Specifications for a geospatial land use classification for New Zealand | Who are we: the conceptualisation and expression of ethnicity
OS Research will invite the public to scheduled Official Statistics System seminars throughout 2008 and 2009, and will coordinate expert workshops. In the workshops, researchers will work with government departments to help them apply the research findings to their data. All research and reports from the projects are expected to be available in the Official Statistics Research Series after successful review.
For any seminar or expert workshop inquiries, please contact OS Research at: osresearch@stats.govt.nz.
1. Confidentialising microdata
Project summary
Releasing multiply-imputed synthetic data has been proposed as a way for official statistics agencies to make microdata available to users, while minimising disclosure risks. Several federal agencies in the United States of America have synthetic data projects underway. Synthetic datasets are generated from models of observed data and can be viewed as predictions (or imputations) of responses for a new sample drawn from the same population as the original sample.
In previous research on categorical data, we demonstrated that hierarchical Bayesian models provide an attractive framework for developing imputation models for synthetic data because they are robust to model misspecification. This new project will develop and test methods for generating synthetic versions of datasets comprised of a mix of categorical and numerical variables, again using hierarchical Bayesian imputation models.
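As a rough illustration of the idea that synthetic data are predictions for a new sample drawn from a model of the observed data, the sketch below generates multiply-imputed synthetic copies of a single categorical variable. It assumes a simple, non-hierarchical Dirichlet-multinomial model, which is much simpler than the hierarchical Bayesian models the project describes. The function name, the prior value, and the toy data are all hypothetical.

```python
import random
from collections import Counter

def synthesise_categorical(observed, m=5, prior=1.0, seed=0):
    """Return m synthetic copies of a single categorical variable.

    Each copy is a posterior predictive draw: category probabilities are
    sampled from a Dirichlet posterior, then a new sample is drawn from them.
    This is an assumed toy model, not the project's imputation model.
    """
    rng = random.Random(seed)
    counts = Counter(observed)
    categories = sorted(counts)
    datasets = []
    for _ in range(m):
        # Dirichlet draw via normalised Gamma variates (symmetric prior).
        gammas = [rng.gammavariate(counts[c] + prior, 1.0) for c in categories]
        total = sum(gammas)
        probs = [g / total for g in gammas]
        # Predict responses for a "new sample" of the same size.
        datasets.append(rng.choices(categories, weights=probs, k=len(observed)))
    return datasets

# Invented toy data, not taken from any survey.
observed = ["employed"] * 60 + ["unemployed"] * 10 + ["not in labour force"] * 30
synthetic = synthesise_categorical(observed, m=3)
for i, data in enumerate(synthetic, start=1):
    print(i, dict(Counter(data)))
```

Because the category probabilities are redrawn for every copy, the variation between the copies reflects uncertainty about the model as well as sampling variation.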
For numerical data, results from our previous research suggest that hierarchical normal models are insufficiently flexible to act as a general imputation modelling framework for synthetic data. In this new project, we will draw on non-parametric Bayesian methods to extend the data distributions which can be accommodated by hierarchical models. The synthetic data methods developed will be tested on the 2004 Income Survey dataset and compared with the confidentialised unit record file (CURF) for this survey, and with a recently proposed perturbation method.
The comparison of confidentialising methods will consider four dimensions:
- The accuracy of inferences, treating inferences from the original data as the gold standard.
- Disclosure risk.
- The ease with which users can obtain valid inferences.
- Ease of implementation by official statistics agencies.
In addition, we will scope the extensions necessary to adapt our synthetic data methodology to deal with longitudinal data such as the Linked Employer-Employee Data (LEED).
Aims
- To develop and evaluate synthetic data methodology for microdata comprised of a mix of categorical and continuous variables.
- To extend hierarchical Bayesian imputation models to imputation of data comprised of categorical and numerical variables.
- To compare the synthetic data method with other methods for confidentialising microdata, particularly current Statistics NZ CURF methodology and a new perturbation method.
Research questions
- Can synthetic microdata, constructed using hierarchical Bayes imputation models, closely reproduce inferences obtained from original data for a range of analysis models?
- Are inferences obtained from synthetic microdata closer to inferences obtained from original data than inferences obtained from other microdata confidentialising methods?
- Is it possible to quantify disclosure risks for synthetic data, and if so, how do these risks compare with disclosure risks for other methods of confidentialising microdata?
- How can hierarchical Bayes imputation models be successfully applied to the imputation of longitudinal data?
Project sponsor: Statistics NZ
Project team: Independent researcher – Patrick Graham and Statistics NZ
2. Generating synthetic data
Project summary
Synthetic datasets are useful for project development, software development, training, and in some cases substantive social research. This project will investigate methods for creating large synthetic datasets of both published tables and confidentialised microdata. It will build on the lead researcher's 2006 OS Research project by allowing the creation of synthetic datasets containing more variables than is possible when using marginal tables alone.
This will be done by using the confidentialised microdata to impute the values of further variables. The resulting datasets will be created to carry no confidentiality implications and will be made freely available to researchers for the purposes of software development and training. We also propose to investigate the extent to which such datasets can be used for substantive social research.
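The summary does not say how the marginal tables and the microdata will be combined. One common technique for this kind of problem is iterative proportional fitting (IPF), shown below as a minimal sketch and assumed here purely for illustration. It rescales a starting cross-tabulation (for example, one taken from confidentialised microdata) until its row and column totals match published margins. The function name `ipf` and all numbers are hypothetical.

```python
def ipf(seed_table, row_targets, col_targets, iterations=100, tol=1e-10):
    """Rescale seed_table until its margins match the target margins."""
    table = [row[:] for row in seed_table]
    for _ in range(iterations):
        # Scale each row to its target total.
        for i, target in enumerate(row_targets):
            row_sum = sum(table[i])
            table[i] = [cell * target / row_sum for cell in table[i]]
        # Scale each column to its target total.
        for j, target in enumerate(col_targets):
            col_sum = sum(row[j] for row in table)
            for row in table:
                row[j] *= target / col_sum
        # Stop once the row totals still agree after the column step.
        if all(abs(sum(table[i]) - t) < tol for i, t in enumerate(row_targets)):
            break
    return table

# Invented numbers: a small microdata cross-tabulation and published margins.
seed_table = [[20.0, 10.0], [5.0, 15.0]]
row_targets = [600.0, 400.0]
col_targets = [450.0, 550.0]
fitted = ipf(seed_table, row_targets, col_targets)
for row in fitted:
    print([round(cell, 1) for cell in row])
```

The fitted table keeps the association pattern of the starting table while reproducing the published totals. Synthetic records could then be sampled from the fitted cell proportions.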
Research aims
- To produce statistically robust methods for constructing synthetic datasets using both published marginal tables and confidentialised microdata that will accurately mimic the statistical characteristics of the relevant population.
- To investigate to what extent such synthetic datasets can be used for substantive social research, and under what circumstances conclusions reached from the analysis of such synthetic datasets are statistically valid.
- To create statistical software that will implement these methods, and allow the routine generation of synthetic data.
Research questions
- How can synthetic datasets be generated from published marginal tables and confidentialised microdata?
- Can such datasets be used for social research?
Project sponsor: Statistics NZ
Project team: University of Auckland and Statistics NZ.
3. Unbiased estimation of linked data analysis
With the increasing use of administrative data and greater inter-agency collaboration, linked, or integrated, data is increasingly important to official statistics agencies. However, when these data are linked via ‘probabilistic matching’, uncertainty in the matching introduces bias and additional variance into standard methods of estimation. This project will develop techniques for correcting this bias and correctly estimating the variance for standard statistical outputs, such as means and linear regression models.
Statistics NZ’s Student Loans database will be used to test the performance of the techniques on policy relevant questions supplied by the Ministry of Education (MoE). The principal outcome of this research will be improved methods for analysing linked data with potential applications across the Official Statistics System. An important additional outcome will be a substantive analysis of the policy questions raised by the MoE.
The project will result in a technical report focussed on the statistical technique; a more substantively-oriented paper focussed on the policy questions; presentation(s) for Statistics NZ, the MoE and the OSS; and conference presentations and external publication(s). The project will also produce software programs (likely to be in R or SAS) to implement the methodology.
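The bias that matching uncertainty introduces into a regression can be seen in a small simulation. The sketch below assumes a deliberately simple error model that is not taken from the project: each record is linked correctly with a known probability, and otherwise receives the outcome of a randomly chosen record. Under that assumption the naive slope shrinks towards zero by roughly the correct-link rate, so dividing by that rate removes the bias. All names and values are invented.

```python
import random

def ols_slope(x, y):
    """Slope of the least-squares line of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx

rng = random.Random(1)
n, true_slope, link_rate = 20000, 2.0, 0.8  # all values invented

x = [rng.gauss(0, 1) for _ in range(n)]
y = [true_slope * a + rng.gauss(0, 1) for a in x]

# With probability 1 - link_rate, attach the y value of a random record.
linked_y = [y[i] if rng.random() < link_rate else y[rng.randrange(n)]
            for i in range(n)]

naive = ols_slope(x, linked_y)
corrected = naive / link_rate  # valid only under this simple error model
print(round(naive, 2), round(corrected, 2))
```

Real linkage errors are unlikely to be this uniform, and the correct-link rate usually has to be estimated. The sketch also ignores the additional variance from matching, which the project likewise aims to estimate.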
Research aims
- Develop techniques for unbiased estimation of means and regression models for probabilistically matched data.
- Apply those techniques to policy-relevant questions identified by the MoE and using Statistics NZ’s Student Loans database.
- Provide software programs (most likely to be written in R or SAS) for these techniques.
Project sponsor: Ministry of Education
Project team: University of Wollongong, Statistics NZ and Ministry of Education.
4. Sampling for subpopulations in household surveys with application to Māori and Pacific sampling
Project summary
Many New Zealand national household surveys have a requirement to produce statistics with adequate precision both for the whole of NZ, and for important subpopulations, particularly the Māori and Pacific populations. Two strategies for achieving this are: geographically-based unequal probability sampling (usually based on census data); and screening (where part of the sample is initially screened, and only members of the subpopulation of interest are eligible for the full survey).
Methods and theory are available for determining how to combine screening and unequal probability sampling. However, these methods do not allow for multi-stage sampling, which is used in most household surveys, or for the inaccuracies resulting from using census data to apply to periods in between censuses.
This project will develop new theory and methods to address these shortfalls, and apply them to the NZ context. The use of the Māori electoral roll will also be evaluated, including a clerical coverage assessment. The outcomes of the project will include: improved cost efficiency for NZ surveys where subpopulation estimates are a priority; more precise Māori and Pacific statistics; and a better understanding of sample design for subpopulations in NZ and in the international survey and statistical community.
Aims
- Extend existing theory on sampling of rare populations to multistage sampling using the NZ statistical system and the Māori and Pacific populations as examples.
- Analyse meshblock data from the 2001 and 2006 censuses to model the reliability of census data for designing surveys in between censuses.
- Use these models to develop improved designs. A range of designs will be produced, corresponding to differing priorities on Māori, Pacific and national estimates. Designs using meshblocks and SNZ Primary Sampling Units (PSUs) will be developed.
- Clerically match addresses from the Māori electoral roll to addresses from the NZ Health Survey sample, to estimate the coverage of the roll, and the extent to which the coverage is geographically clustered.
A final aim will be pursued if possible in 06/07; otherwise, this may be submitted for funding in future.
- Develop sample designs which use the electoral roll to improve the efficiency of Māori sampling. A range of designs will be developed, corresponding to different priorities for different surveys.
Research questions
Many NZ national surveys have a requirement to produce statistics with adequate precision both for the whole of NZ and for important subpopulations, particularly the Māori and Pacific populations. Two main strategies exist:
- (a) Unequal probability sampling, where households in areas containing higher concentrations of the subpopulation are given a higher chance of selection.
- (b) Screening, where a portion of the sample is a screening sample. In the screening sample, only members of subpopulations of interest are eligible for selection.
Both of these methods increase the proportion of the sample belonging to the subpopulations of interest and both have limitations. Method (a) can only increase the effective sample size of the Māori and Pacific populations by around 20 percent, relative to equal probability sampling. This is generally well short of the increase needed. Method (b) can generally provide sufficient Māori or Pacific sample size, but is expensive. Another difficulty is that census data must be used for method (a); however, this data may be up to six years out of date, depending on when the survey is to be run. An option to further improve on (a) may be to use the Māori electoral roll to target specific addresses, rather than geographic areas. However, it is not clear whether the coverage of the electoral roll is sufficient to give useful gains.
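The trade-off in method (a) can be sketched numerically. The example below compares equal probability sampling with a hypothetical oversampling rule in which the sampling fraction in each area stratum is proportional to the square root of the subpopulation share. Precision for the subpopulation is summarised by Kish's effective sample size, (sum of weights)^2 / (sum of squared weights). The strata, shares and sample size are invented, and the allocation rule is an assumption chosen for illustration, not the design the project will develop.

```python
from math import sqrt

def allocate(strata, n, oversample):
    """Expected sample size per stratum for a total sample of n households."""
    if oversample:
        # Hypothetical rule: sampling fraction proportional to sqrt(share).
        size = [N * sqrt(p) for N, p in strata]
    else:
        size = [N for N, _ in strata]  # equal probability sampling
    total = sum(size)
    return [n * s / total for s in size]

def subpop_effective_n(strata, alloc):
    """Kish effective sample size for the subpopulation members sampled."""
    m = [n_h * p for (_, p), n_h in zip(strata, alloc)]   # expected members
    w = [N / n_h for (N, _), n_h in zip(strata, alloc)]   # design weights
    sum_w = sum(a * b for a, b in zip(m, w))
    sum_w2 = sum(a * b * b for a, b in zip(m, w))
    return sum_w ** 2 / sum_w2

# Invented strata: (households, subpopulation share of households).
strata = [(80000, 0.05), (15000, 0.30), (5000, 0.60)]
equal = allocate(strata, 1000, oversample=False)
unequal = allocate(strata, 1000, oversample=True)
eff_equal = subpop_effective_n(strata, equal)
eff_unequal = subpop_effective_n(strata, unequal)
print(round(eff_equal), round(eff_unequal))
```

Oversampling raises the number of subpopulation members in the sample, but the unequal weights it creates offset part of that gain. This is one reason the effective sample size improves by less than the raw count suggests.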
Project sponsor: Ministry of Health
Project team: University of Wollongong, Ministry of Health and Statistics NZ.
5. Specifications for a geospatial land use classification for New Zealand
Project summary
The allocation and use of land affects all aspects of New Zealand’s overall well-being (cultural, economic, environmental, and social) and quality of life. Currently New Zealand lacks any nationally consistent and comprehensive land use information covering the full range of natural, production, and urban landscapes. The provision of such information would help meet a critical gap in land use information and foster better planning, policy and management at national (e.g., carbon monitoring, biodiversity protection), regional (e.g., Resource Management Act), and district/city (e.g., land use planning) scales. It would also provide key information leading to more effective national (e.g. State of Environment, Economic and Social Statistics) and international (e.g. OECD, System of Environmental Accounts) reporting on New Zealand’s progress towards more sustainable development.
This project aims to fill that crucial information gap by developing a multi-scale, hierarchical geospatial land use classification that meets the range of information needs. Such a classification would be used in land use mapping and land use change analysis to underpin policy, planning, and resource management and contribute to reporting on progress towards sustainable development at a number of scales within New Zealand. A land use classification would complement and not replace the existing land cover classification and corresponding land cover database. In fact robust, detailed, and accurate information on both land use and land cover are needed to inform a range of existing and emerging environmental, economic, and social issues within New Zealand.
Within government, the use of different land use classifications results in an uncoordinated approach and the collection of incompatible data. A standard and consistent approach to land use classification at the national level will improve the quality of data collected and promote a framework for a harmonised approach leading to the development of a nationally complete and consistent land use information base.
Aims
The project will develop specifications for a multi-scale, hierarchical geospatial land use classification for New Zealand. Such a classification would provide land use and land use change statistics to underpin policy, planning, and resource management and contribute to reporting on progress towards sustainable development at a number of geospatial scales within New Zealand.
Research questions
Land use can be defined as the activity(ities) or socio-economic function(s) for which land is used, and the same land can support multiple uses. It differs from land cover, which describes the physical state of the land. Land use statistics provide information on the function and purpose for which land is currently used and, if tracked over time, how land use changes. Appropriate land use information requires consideration of three interrelated aspects or dimensions: information (i.e. classification), space, and time. Different data sources provide different types of land use information at different spatial and temporal scales including, but not limited to, satellite imagery, aerial photography, ground surveys including direct sampling, and surveys of land use managers. This research will focus on developing a geospatial land use classification for New Zealand that satisfies a range of policy, regulatory, and reporting needs across local, regional, national, and international levels. It will address several key questions related to those needs including:
Land use information:
- What agencies require land use information and for what purpose?
- What types of land use classes and level of classification detail are needed to meet those various purposes?
- What is the finest level of classification detail attainable without sacrificing privacy and confidentiality?
- What are the appropriate and obtainable accuracies at the various levels of classification detail?
Geospatial Scale:
- What geospatial information sources are available and at what spatial scales?
- How can geospatial information sources be combined to deliver land use information?
Temporal Scale:
- How frequently can land use information be collected and updated?
- What is a significant time period for land use to be collected and reported on?
- How accurately can we interpolate periods of change between sampling periods?
Project sponsor: Statistics New Zealand and the Ministry for the Environment
Project team: Landcare Research NZ, Statistics NZ.
6. Who are we: the conceptualisation and expression of ethnicity
Project summary
The availability of ethnic-specific data underpins many research and policy development processes and some areas of service funding and delivery. Ethnicity is used to label individuals and groups.
Research and informal discussions suggest that ethnicity is a concept that is not well understood by some respondents when filling in official surveys or responding to administrative questions. Equally, it seems that ethnicity is not well understood by some of those analysing the data or formulating policy based on these data. One reason seems to be that ethnicity is often conflated with concepts such as race, national identity and cultural identity.
Both nationally and internationally, there has been an explosion of research on ethnicity. This has been driven by a wide range of influences including high levels of international migration and, with this, anxieties about integrating migrants into host countries. There are also concerns about outcomes for minority populations including those considered to be indigenous to the country of interest. In New Zealand, a range of small scale projects have been completed on how ethnicity is constructed, conceptualised and expressed. Some of these have been funded by grants from the OSS. However, there has been no formal investigation to bring together in a coherent way knowledge of the national and international concepts and expressions and what these really mean in a New Zealand context. This project will involve a literature review, analysis of secondary data, and discussion with key informants on ethnicity. From this we will gain a deeper knowledge of the context around ethnicity and, within this context, how individuals construct their own and others' identity.
Based on a mixture of methods such as theoretical consideration, empirical investigation and consultation, this project will provide a picture of how the concept of ethnicity is perceived and expressed in New Zealand.
The project will exclude:
- A review of the ethnicity measure
- Recommendations with reference to the 2011 Census ethnicity question
Research questions
Based on a mixture of theoretical consideration, empirical investigation and consultation, this project will provide a picture of how ethnicity is conceptualised and expressed in a New Zealand environment.
Research questions will be centred on a range of topics, including:
- contextualisation of ethnicity
- ethnic mobility
- ethnogenesis
- individual versus group identity
- indigeneity
Project sponsor: Statistics NZ
Project team: Victoria University of Wellington, Statistics NZ and Stanford University.