Abstract
The posterior predictive distribution (the distribution of data simulated from a model) has been used to flag model-data discrepancies in the Bayesian literature, and several approaches have been developed. The approach taken here differs from the others both conceptually and in implementation. It works by comparing the "distance" between the data and the model (as represented by pseudo-data simulated from the model) with the "distance" within the model. The distance within the model is calculated by generating pseudo-data from the model, using each set of these pseudo-data to reestimate the model, and then generating pseudo-data from each reestimated model, matching the way the original data are used to generate pseudo-data. "Distances" are calculated as the log of sums-of-squares, following ranking, and the test consists of comparing a mean distance to a distribution of mean distances. The power of this method compares favorably with that of standard methods, e.g. t-tests, but it is more general since it can be used for most models in the GLMM framework, whether estimated using traditional or Bayesian methods. A new kind of plot, where the distribution of the ranked pseudo-data is compared to the original data at each ranked datum, is useful for determining the region of the data where the model fails.
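The procedure described above can be illustrated with a minimal Python sketch. It is not the author's implementation: for brevity it assumes a simple normal model fitted by its sample mean and standard deviation rather than a mixed model, and the function names (fit_normal, simulate, log_ss_distance, mean_distance, ppd_check), the numbers of simulations, and the one-sided p-value are all illustrative assumptions.

```python
import numpy as np

def fit_normal(y):
    # Stand-in for model estimation (assumption: a plain normal model).
    return y.mean(), y.std(ddof=1)

def simulate(params, n, rng):
    # Generate one set of pseudo-data from the fitted model.
    mu, sigma = params
    return rng.normal(mu, sigma, size=n)

def log_ss_distance(a, b):
    # Rank (sort) both samples, then take the log of the sum of squared
    # differences between corresponding ranked values.
    d = np.sort(a) - np.sort(b)
    return np.log(np.sum(d ** 2))

def mean_distance(y, n_sim, rng):
    # Fit the model to y, simulate n_sim pseudo-data sets from the fit,
    # and average the distances between y and each pseudo-data set.
    params = fit_normal(y)
    dists = [log_ss_distance(y, simulate(params, len(y), rng))
             for _ in range(n_sim)]
    return float(np.mean(dists))

def ppd_check(y, n_ref=200, n_sim=50, seed=0):
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    # Distance between the data and the model.
    observed = mean_distance(y, n_sim, rng)
    # Distance within the model: treat each pseudo-data set as if it were
    # the data, reestimate the model, and repeat the same calculation.
    params = fit_normal(y)
    reference = np.array([
        mean_distance(simulate(params, len(y), rng), n_sim, rng)
        for _ in range(n_ref)
    ])
    # Illustrative one-sided p-value: share of within-model mean distances
    # that are at least as large as the observed mean distance.
    p_value = float(np.mean(reference >= observed))
    return observed, reference, p_value
```

Calling ppd_check(y) on a one-dimensional array returns the observed mean distance, the reference distribution of within-model mean distances, and the p-value. For a mixed model, fit_normal and simulate would be replaced by the corresponding estimation and simulation steps.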
Creative Commons License
This work is licensed under a Creative Commons Attribution-Noncommercial-No Derivative Works 4.0 License.
Recommended Citation
Kramer, Matthew (2014). "USE OF THE POSTERIOR PREDICTIVE DISTRIBUTION AS A DIAGNOSTIC TOOL FOR MIXED MODELS," Conference on Applied Statistics in Agriculture.
https://doi.org/10.4148/2475-7772.1005