GAN for Likert Scale Data Imputation
As part of a course, I entered a Kaggle competition focused on predicting a child's likelihood of experiencing internet addiction from indicators of physical activity and internet usage. The primary challenge was the substantial amount of missing data in the provided private dataset. While many participants simply dropped rows with missing values, I took a different approach.

Drawing on my psychology background and research into the questionnaire's structure, I set out to build a Generative Adversarial Network (GAN) to impute missing Likert-scale values in a way that respected the between-question dependencies typical of psychological assessments. Although I submitted a working model without the GAN imputer by the competition deadline, I continued developing the GAN afterward to explore its viability.

One of the main challenges was producing discrete, ordinal outputs from the generator while retaining differentiability for backpropagation. I resolved this by switching from TensorFlow to PyTorch and using a Gumbel-Softmax layer with output shaped [num_questions, num_likert_classes]. I then ensured that imputed values replaced only the missing entries and that the generator's binary cross-entropy (BCE) loss was computed solely on those generated values. At the same time, the discriminator's BCE loss was computed on both real and imputed entries to prevent it from defaulting to predicting “fake” for all inputs.
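To make this concrete, here is a minimal PyTorch sketch of the setup, not my actual competition code: a Gumbel-Softmax generator head and the masked BCE losses. The dimensions, architecture, and per-question discriminator output (a GAIN-style choice) are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_QUESTIONS, NUM_CLASSES = 20, 5   # hypothetical: 20 Likert items, 5 levels

class Generator(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        # Input: one-hot responses (zeroed where missing) plus the missingness mask
        in_dim = NUM_QUESTIONS * NUM_CLASSES + NUM_QUESTIONS
        self.body = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, NUM_QUESTIONS * NUM_CLASSES),
        )

    def forward(self, x_onehot, mask, tau=1.0, hard=True):
        logits = self.body(torch.cat([x_onehot.flatten(1), mask], dim=1))
        logits = logits.view(-1, NUM_QUESTIONS, NUM_CLASSES)
        # Gumbel-Softmax: one-hot-like samples that remain differentiable
        # (hard=True uses the straight-through estimator)
        samples = F.gumbel_softmax(logits, tau=tau, hard=hard, dim=-1)
        # Keep observed entries; only missing ones receive generated values
        m = mask.unsqueeze(-1)       # mask: 1 = observed, 0 = missing
        return m * x_onehot + (1 - m) * samples

def discriminator_loss(d_probs, mask):
    # Per-question real/fake probabilities, scored on BOTH observed and imputed
    # entries so the discriminator cannot win by calling everything "fake"
    return F.binary_cross_entropy(d_probs, mask)

def generator_loss(d_probs, mask):
    # The generator is penalized only on entries it actually imputed
    # (mask == 0), pushing the discriminator's output toward "real" (1) there
    missing = 1 - mask
    bce = F.binary_cross_entropy(d_probs, torch.ones_like(d_probs),
                                 reduction="none")
    return (bce * missing).sum() / missing.sum().clamp(min=1)
```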

Another challenge was the small dataset: only 1,289 of 3,960 rows contained missing data, which is suboptimal for GAN training. I adjusted both models' architectures accordingly and implemented pretraining to improve stability. The generator was pretrained on fully complete rows with randomly masked columns, comparing its generated values against the ground truth. The discriminator was pretrained on randomly imputed rows to learn distinguishing patterns.
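The pretraining steps could be sketched as follows, reusing the interfaces from the block above; the masking rate, reconstruction loss, and uniform random-imputation scheme are assumptions for illustration, not my exact procedure.

```python
import torch
import torch.nn.functional as F

NUM_CLASSES = 5  # matches the sketch above

def pretrain_generator_step(gen, opt, x_complete, mask_rate=0.3):
    # Randomly hide columns of fully observed rows (keep: 1 = visible)
    keep = (torch.rand(x_complete.shape[:2]) > mask_rate).float()
    x_visible = x_complete * keep.unsqueeze(-1)
    # Soft Gumbel samples (hard=False) give smooth reconstruction gradients
    imputed = gen(x_visible, keep, hard=False)
    hidden = (1 - keep).unsqueeze(-1)
    # Compare generated distributions to ground truth on hidden entries only
    loss = (((imputed - x_complete) ** 2) * hidden).sum() / hidden.sum().clamp(min=1)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

def pretrain_discriminator_step(disc, opt, x_complete, mask_rate=0.3):
    # Build "randomly imputed" rows: hidden entries get uniform random classes
    keep = (torch.rand(x_complete.shape[:2]) > mask_rate).float()
    rand = F.one_hot(torch.randint(0, NUM_CLASSES, x_complete.shape[:2]),
                     NUM_CLASSES).float()
    x_mixed = keep.unsqueeze(-1) * x_complete + (1 - keep).unsqueeze(-1) * rand
    # disc is assumed to output per-question P(entry is real); it learns
    # real-vs-imputed patterns before adversarial training begins
    loss = F.binary_cross_entropy(disc(x_mixed), keep)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```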

Despite the dataset's size constraints, GAN training remained stable after pretraining. The final model, combining my original competition predictor with the GAN imputer, achieved a Quadratic Weighted Kappa (QWK) that consistently outperformed the competition's winning entry by 7-12%. For context, the margin between first and second place in the competition was less than 1%.