Qualitative analysis of our results

We have selected 6 takes from the validation set in the ‘seen singer’ condition which illustrate some of the features of the classification performance. Commentaries complement the prediction accuracy statistics with qualitative observations which may indicate ways in which the systems are classifying the ragas in different modalities. Predictions are overlaid on the video, green when correct and red when incorrect, with probabilities output by the models in parentheses.

The selected videos are listed below:

  1. AG_4b_Nand
  2. AG_5b_MM
  3. CC_3b_MM
  4. CC_5b_Shree
  5. SCh_5b_Shree
  6. SCh_9b_Bahar

AG_4b_Nand

Both modalities score very well in the first half of this clip, where the distinctively fast-moving, complex melodic movement is matched by a gesture style in which faster, winding movements of the right hand dominate. From 1'30" the singer switches to a more even bimanual gesture style similar to that used for MM and Bahar, and the video prediction shifts to these ragas and Bageshri, before returning to Nand at 2'40" when she reverts to a more right-hand-dominant style.

AG_5b_MM

The audio prediction is mostly accurate for the first portion of this alap, while the video prediction is poor. After about 1' the video prediction improves: this is the point at which the singer begins to use bimanual gestures with a strong periodic component; although separated, the hands appear to be moving in concert. Apart from some confusion with Bilaskhani Todi in the video, the prediction remains stable until about 2'. As the melody moves higher and the gestures become more varied and faster, the prediction becomes less stable, and Bahar appears more frequently in the predictions. The section at 2'30"-2'40" is an example in which the A+V prediction is accurate when both audio and video are predicting Bahar.

CC_3b_MM

This is an example of a clip for which video prediction works particularly well, scoring more highly than audio. Some of the audio predictions are hard to explain in terms of similarity of pitch materials, but the melodic movement is slow, meaning the information is sparse: it is possible that some of the pitch slopes match those the system has learned for other ragas, although the pitches themselves may differ. Around 2'30" we see an example of low pitch that may be causing problems with our pitch tracking parameters, which are set with a floor of the low Pa (5th scale degree). The passage around 2'15"-2'30" where the video prediction is poor marks a transition from an unpulsed alap to a section with a clear pulse: it may be that here the singer’s movements reflect the rhythm more than the melodic movement.
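The effect of a pitch floor on low passages can be illustrated with a minimal sketch. It is not the pitch tracker used in this study: the tonic value, the equal-tempered tuning, and the function names low_pa_hz and apply_floor are all assumptions made for illustration. The point is only that any frame whose fundamental falls below the low Pa is discarded, so the classifier receives no melodic information for that stretch.

```python
def low_pa_hz(tonic_hz):
    """Frequency of the Pa below the tonic (Sa).

    Assumes equal temperament: Pa lies 7 semitones above the lower Sa,
    i.e. 5 semitones below the tonic.
    """
    return tonic_hz * 2 ** (-5 / 12)


def apply_floor(f0_track, floor_hz):
    """Mark frames below the floor as untracked (None)."""
    return [f if f is not None and f >= floor_hz else None for f in f0_track]


tonic = 220.0                 # hypothetical tonic, not taken from the recordings
floor = low_pa_hz(tonic)      # roughly 164.8 Hz for this tonic

# Hypothetical f0 values in Hz; 150.0 dips below the low Pa.
track = [220.0, 196.0, 165.0, 150.0, 174.6]
masked = apply_floor(track, floor)
print(masked)
```

In this toy track only the 150.0 Hz frame is lost; in a sustained low passage, a run of such frames would leave the audio model with little or no pitch contour to classify.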

CC_5b_Shree

This extract begins with good accuracy in the audio modality but not the video: it seems that the slow, deliberate movements are not associated uniquely with Shree. From 0'38"-0'53" the audio prediction breaks down: this is the portion when the vocal range dips below the boundary set by our pitch tracking; at the same time the video prediction improves. The direct upward movement of the right hand, associated with an ascent from Re to Pa (2>5) and identified as distinctive of Shree, is not correctly predicted (contrast this with SCh). In general the audio prediction is very good, the video patchy but much better than chance (c. 40%), especially on the complex descending melodies.

SCh_5b_Shree

This is one of the extracts with the highest prediction rates, ranging from 94% to 98%, with the highest score for the video (classification by gesture). The singer’s gestures are clearly highly consistent between takes, and distinguishable from her gestures in other ragas.

SCh_9b_Bahar

This take contrasts sharply with SCh_5b_Shree in that the predictions are much less accurate in all modalities. Comparing the three results shows that the multimodal classification has an advantage: audio scores 21% and video 0%, but audio plus video achieves 37% accuracy. Some of the misclassification of the video as Shree and the audio as MM, in particular, is corrected in the multimodal condition.
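One way a combined prediction can be correct when both single-modality predictions are wrong is shown in the sketch below. It uses simple probability averaging (late fusion), which is an assumption for illustration and not necessarily how the A+V model in this study combines the modalities; the probabilities are invented, and the function names predict and late_fusion are hypothetical.

```python
def predict(probs):
    """Return the raga label with the highest probability."""
    return max(probs, key=probs.get)


def late_fusion(audio, video):
    """Average two per-raga probability dicts (simple late fusion)."""
    return {raga: (audio[raga] + video[raga]) / 2 for raga in audio}


# Invented probabilities for one window of a Bahar take.
audio = {"MM": 0.40, "Bahar": 0.35, "Shree": 0.25}   # audio alone picks MM
video = {"MM": 0.15, "Bahar": 0.40, "Shree": 0.45}   # video alone picks Shree

fused = late_fusion(audio, video)
print(predict(audio), predict(video), predict(fused))
```

Here Bahar is the second choice in both modalities, while each modality's first choice is rated low by the other, so the average favours Bahar. This is consistent with the pattern described above, where errors toward Shree in the video and MM in the audio are partly corrected in the multimodal condition.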