statistics:
missing data in a social science study
Overview
I wrote a practical guide to analyzing non-response and attrition (participation drop-out) in longitudinal research studies. In brief, selective non-participation and attrition pose a ubiquitous threat to the validity of inferences drawn from observational longitudinal studies. In this work, we investigate potential predictors of non-response and attrition among parents and young persons at different stages of a multi-informant study. Several phases of renewed consent from parents and young persons allowed for a unique comparison of the factors that drive participation. The target sample consisted of 1675 children who entered primary school at age seven in 2004. Seven waves of interviews over the course of 10 years measured levels of problem behavior as rated by children, parents, and teachers. The study showed that attrition was higher for some immigrant background groups.
Statistical procedure overview
For descriptive purposes, simple logistic regressions with listwise deletion were used to evaluate the relation between each predictor and non-participation or drop-out, without considering the effects of other predictors. Participation/retention was coded as 0 and non-participation/drop-out as 1, so that odds ratios (ORs) > 1 reflect an increased likelihood of non-participation or drop-out. Specifically, each OR is the ratio of the odds of non-participation/drop-out at two levels of the predictor separated by one unit. For example, an OR of 2 would indicate that the odds of dropping out double for each one-unit increase in the predictor. Associated (unadjusted) p values are reported for descriptive purposes. These analyses were conducted in R (R Core Team, 2016) using the glm function with a logit link.
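To make the odds ratio interpretation concrete, here is a minimal sketch in plain Python (not the R code used in the study). The intercept and slope are hypothetical values chosen so that the OR is exactly 2; they are not estimates from the data.

```python
import math


def odds(p):
    """Convert a probability into odds."""
    return p / (1.0 - p)


def dropout_probability(intercept, slope, x):
    """Logistic model: P(drop-out = 1 | x) from the linear predictor."""
    log_odds = intercept + slope * x
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients (assumptions for illustration only).
intercept = -2.0
slope = math.log(2.0)  # exp(slope) is the odds ratio, here 2

odds_ratio = math.exp(slope)

# The odds of drop-out at each predictor level; each step multiplies
# the odds by the same factor, while the probability changes unevenly.
for x in range(4):
    p = dropout_probability(intercept, slope, x)
    print(f"x={x}  P(drop-out)={p:.3f}  odds={odds(p):.3f}")
```

Note that a constant OR does not imply a constant change in probability: the odds double at every step, but the probability rises by different amounts depending on where on the curve the predictor sits.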
We then conducted a series of multiple regressions to evaluate the unique relation between each predictor and non-participation/drop-out, controlling for the other predictors. These analyses were implemented in the lavaan package (Rosseel, 2012), again in R, this time using probit regression. Probit and logistic regression can both be used to model dichotomous outcomes and generally lead to the same conclusions. Whereas logistic regression uses the logit (log-odds) function to link the probability that the outcome equals 1 to the predictors, probit regression uses the inverse of the standard normal cumulative distribution function. A probit regression coefficient can thus be interpreted as the change in the z-score of the outcome probability (its standard normal quantile) for a one-unit increase in the predictor. Here, probit regression was used for practical reasons. To correct for multiple comparisons, we used the generalized Holm (1979) procedure, which controls the k-familywise error rate (k-FWER; Lehmann & Romano, 2012).
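The step-down logic behind the multiple-comparison correction can be sketched as follows, again in plain Python rather than R. The function name, the example p values, and the choices of alpha and k are illustrative assumptions, since the study's actual settings are not stated here. The thresholds follow the step-down constants of Lehmann and Romano as I understand them: k*alpha/s for the k smallest p values and k*alpha/(s + k - i) for the i-th smallest thereafter, where s is the number of tests. With k = 1 this reduces to the ordinary Holm procedure.

```python
def generalized_holm(p_values, alpha=0.05, k=1):
    """Step-down procedure aimed at controlling the k-FWER.

    Returns a list of booleans (True = reject), in the same order
    as the input p values. Hypothetical helper, not a library API.
    """
    s = len(p_values)
    if k < 1:
        raise ValueError("k must be at least 1")
    if s == 0:
        return []

    # Indices of the p values sorted from smallest to largest.
    order = sorted(range(s), key=lambda i: p_values[i])
    reject = [False] * s

    for rank, idx in enumerate(order, start=1):
        if rank <= k:
            threshold = k * alpha / s
        else:
            threshold = k * alpha / (s + k - rank)
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject


# Made-up p values for four predictors (not results from the study).
example_p = [0.001, 0.02, 0.03, 0.2]
print(generalized_holm(example_p, alpha=0.05, k=1))  # [True, False, False, False]
print(generalized_holm(example_p, alpha=0.05, k=2))  # [True, True, True, False]
```

In this made-up example, setting k = 2 rejects more hypotheses than the ordinary Holm procedure (k = 1); the price is that one false rejection is tolerated, because only the probability of two or more false rejections is held at alpha.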