Deidentification is a common data anonymization technique for protecting sensitive data, e.g., personally identifiable information (PII). However, studies have shown that deidentification is often insufficient to prevent re-identification of individuals from their data; in some cases, zip code, gender, and age are sufficient to uniquely identify an individual. An alternative approach that guarantees anonymization is synthetic data generation (SDG), which algorithmically generates artificial data that mimics a desired real dataset. SDG irreversibly anonymizes the output data because synthetic data has no direct link to individual samples in the private source data. Industries such as finance, healthcare, and national security often employ SDG when conducting analysis and simulation. For example, SDG could be used to anonymize service member training data containing PII in a way that protects the PII whilst still allowing evaluation and analysis of the training outcomes.
A traditional approach to SDG is to create a model by fitting a probability distribution to the data and then sampling from that distribution. A plethora of distributions are available for smooth, continuous data (e.g., normal, Student’s t, Cauchy, logistic, or beta); however, in many cases it is unclear which distribution would best capture the characteristics of the data (shape, skewness, tail behavior, etc.). In addition, the fitting process for these distributions is often complex, subjective, and prone to non-convergence. The metalog distributions (Keelin, 2016) address these issues by providing a family of distributions that is more flexible and easier to fit to data. We demonstrate a workflow that uses the metalog distribution for SDG and analysis of simulated health data, and we compare the model fidelity against other approaches.
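As an illustration of the fit-then-sample idea described above, the following is a minimal sketch, not the workflow used in this work. It assumes the four-term metalog quantile function from Keelin (2016), M(y) = a1 + a2 ln(y/(1-y)) + a3 (y - 0.5) ln(y/(1-y)) + a4 (y - 0.5), fitted by ordinary least squares against empirical plotting positions. The function names, the plotting-position formula (i - 0.5)/n, the grid-based feasibility check, and the gamma-distributed stand-in data are all assumptions made for this example.

```python
import numpy as np

def metalog_basis(y):
    """Basis columns of the four-term metalog quantile function."""
    y = np.asarray(y, dtype=float)
    logit = np.log(y / (1.0 - y))
    return np.column_stack([np.ones_like(y), logit, (y - 0.5) * logit, y - 0.5])

def fit_metalog(data):
    """Least-squares fit of the four metalog coefficients to 1-D data."""
    x = np.sort(np.asarray(data, dtype=float))
    n = x.size
    # Assumed plotting positions: midpoints of n equal probability bins.
    y = (np.arange(1, n + 1) - 0.5) / n
    coeffs, *_ = np.linalg.lstsq(metalog_basis(y), x, rcond=None)
    return coeffs

def metalog_quantile(coeffs, y):
    """Evaluate the fitted quantile function at probabilities y in (0, 1)."""
    return metalog_basis(y) @ coeffs

def is_feasible(coeffs, grid_size=999):
    """Rough check that the quantile function increases on a probability grid."""
    grid = np.linspace(0.001, 0.999, grid_size)
    return bool(np.all(np.diff(metalog_quantile(coeffs, grid)) > 0))

def sample_synthetic(coeffs, size, rng):
    """Draw synthetic values by inverse-transform sampling."""
    u = rng.uniform(0.0, 1.0, size)
    u = np.clip(u, 1e-12, 1.0 - 1e-12)  # keep the logit finite
    return metalog_quantile(coeffs, u)

# Hypothetical stand-in for a private, skewed health measurement.
rng = np.random.default_rng(0)
private_values = rng.gamma(shape=2.0, scale=3.0, size=500)

coeffs = fit_metalog(private_values)
synthetic_values = sample_synthetic(coeffs, size=500, rng=rng)
```

Because the fit is a linear least-squares problem, it has no iterative optimizer that could fail to converge. A fitted coefficient set should still be checked with something like `is_feasible`, since not every coefficient vector yields a valid (monotone) quantile function.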
Keywords
DATA; MEDICAL MODELING AND SIMULATION; MODELING; PROBABILITY; SYNTHETIC