This project explored the potential for using machine learning (ML) models to analyze and score videos of simulation-based surgical training. The experiment broke the problem into three phases: (1) identifying objects in the video, (2) classifying the dynamic activity being performed, and (3) assigning a score to the quality of performance demonstrated. Our 2019 I/ITSEC paper reported the results of Phase 1. This paper addresses Phases 2 and 3.
A set of 1,735 videos containing five unique activities to be classified was processed with Google Cloud Platform tools: 1,235 videos were assigned to the training set, 250 to the test set, and 250 to the validation set. Google AutoML was then trained to classify the activity in each video, and the same steps were applied to predicting performance scores. Models can be created from a data set this small only because AutoML leverages transfer learning and neural architecture search techniques.
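As a rough illustration of the data preparation step, the sketch below reproduces the paper's 1,235/250/250 split and writes a CSV import manifest. This is a minimal sketch under stated assumptions: the bucket path, file names, and activity label names are placeholders, and the ML_USE/URI/label/start/end column layout is an assumption based on typical AutoML video-classification imports; the exact format should be verified against current Google Cloud documentation.

```python
import csv
import random

# Placeholder manifest of 1,735 labeled clips; the gs:// paths and the
# five activity label names are illustrative, not the project's data.
videos = [(f"gs://example-bucket/clip_{i:04d}.mp4", f"activity_{i % 5}")
          for i in range(1735)]

random.seed(42)  # fixed seed so the split is reproducible
random.shuffle(videos)

# The split reported in the paper: 1,235 train / 250 test / 250 validation.
splits = {
    "TRAIN": videos[:1235],
    "TEST": videos[1235:1485],
    "VALIDATION": videos[1485:],
}

# Assumed CSV layout for an AutoML video-classification import:
# ML_USE,VIDEO_URI,LABEL,START_SEC,END_SEC ("inf" = end of video).
# Verify the exact column order against the current AutoML documentation.
with open("automl_import.csv", "w", newline="") as f:
    writer = csv.writer(f)
    for ml_use, clips in splits.items():
        for uri, label in clips:
            writer.writerow([ml_use, uri, label, 0, "inf"])
```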
An AutoML-generated model classified the activities in the videos with an accuracy of 88%. The same process produced a model that was only 65% accurate when predicting performance scores. The activity classification accuracy approaches the level required for a satisfactory automated classifier, but the performance-scoring model's accuracy was too low to replace human evaluators. Low accuracy on performance scoring was expected because the data set was divided by both activity and quality level, leaving fewer instances in each group (a rough calculation below illustrates the effect). Given the same data set, ML techniques that can classify an activity will necessarily be less accurate at distinguishing good from poor performance of that activity. Achieving higher accuracy in scoring will require significantly larger data sets. Additionally, the subtlety of Phase 3 may call for entirely different ML techniques than those that succeeded in Phases 1 and 2. The ML literature indicates that automated performance scoring is a topic of interest to multiple research teams.
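To make the class-fragmentation argument concrete, a back-of-the-envelope calculation follows. The number of quality levels per activity is an assumption chosen for illustration; the paper does not state it.

```python
total_videos = 1735
activities = 5
quality_levels = 3  # assumed for illustration only; not stated in the paper

per_activity = total_videos / activities              # ~347 examples per activity class
per_activity_quality = per_activity / quality_levels  # ~116 per activity-quality class

print(f"~{per_activity:.0f} videos per activity class")
print(f"~{per_activity_quality:.0f} videos per activity-quality class")
```

Even under this optimistic assumption of an evenly balanced split, each scoring class would have roughly a third of the examples available to each activity class, which illustrates why scoring accuracy lagged classification accuracy on the same data.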