
Experiments & Results

Hyperparameter search ranges

The tables below list the ranges over which each hyperparameter was tuned for the model types listed in Table 5 of our paper.

| Modality (Model type) | # filters (conv layers) | Kernel size (conv layers) | Common kernel size k (inception block) | # filters (inception block) | Pool size (inception block) | Pooling type (inception block) | Dropout rate | Pooling type (pooling layer) |
|---|---|---|---|---|---|---|---|---|
| audio (B) | {4, 8, 16, 32, 64, 128, 256} | {3, 5, 7, 9, 10} | {3, 5, 7, 9, 10} | [4, 32] | {3, 5} | {Max, Average} | {0.3, 0.4, 0.5, 0.6, 0.7} | {Max, Average} |
| video (A) | {16, 32, 64, 128} | {3, 5, 7} | {3, 5, 7} | [4, 32] | {3, 5} | {Max, Average} | {0.3, 0.4, 0.5, 0.6, 0.7} | {Max, Average} |
| source fusion (C) | {16, 32, 64, 128} | {3, 5, 7} | {3, 5, 7} | [4, 32] | {3, 5} | {Max, Average} | {0.3, 0.4, 0.5, 0.6, 0.7} | {Max, Average} |

Table S4a: Hyperparameter search ranges for the unimodal audio, video and source fusion methods, viz. model types A, B and C in Table 5.
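For readers who want to reuse these ranges, the sketch below encodes the audio (model type B) row of Table S4a as a Python dictionary and draws one candidate configuration from it. The key names, the `sample_config` helper and the use of random sampling are illustrative assumptions; this page does not state which search strategy was used for the convolutional models. As a simplification, a single filter count is drawn for the inception block rather than one per branch.

```python
import random

# Search ranges for the audio model (type B), copied from Table S4a.
# Key names are hypothetical; they are not taken from the paper's code.
AUDIO_SEARCH_SPACE = {
    "conv_filters": [4, 8, 16, 32, 64, 128, 256],
    "conv_kernel_size": [3, 5, 7, 9, 10],
    "inception_kernel_size": [3, 5, 7, 9, 10],
    "inception_filters": list(range(4, 33)),  # integer interval [4, 32]
    "inception_pool_size": [3, 5],
    "inception_pool_type": ["max", "average"],
    "dropout_rate": [0.3, 0.4, 0.5, 0.6, 0.7],
    "final_pool_type": ["max", "average"],
}


def sample_config(space, rng):
    """Draw one value per hyperparameter, uniformly at random."""
    return {name: rng.choice(values) for name, values in space.items()}


rng = random.Random(0)  # fixed seed so the draw is reproducible
config = sample_config(AUDIO_SEARCH_SPACE, rng)
print(config)
```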

| Modality (Model type) | # filters (conv layers) | Kernel size (conv layers) | Common kernel size k (inception block) | # filters (inception block) | Pool size (inception block) | Pooling type (inception block) | Dropout rate | Pooling type (pooling layer) |
|---|---|---|---|---|---|---|---|---|
| audio (B) | {4, 8, 16, 32, 64, 128, 256} | {3, 5, 7, 9, 10} | {3, 5, 7, 9, 10} | [4, 32] | {3, 5} | {Max, Average} | {0.3, 0.4, 0.5, 0.6, 0.7} | {Max, Average} |
| video (A) | {16, 32, 64, 128} | {3, 5, 7} | {3, 5, 7} | [4, 32] | {3, 5} | {Max, Average} | {0.3, 0.4, 0.5, 0.6, 0.7} | {Max, Average} |
| source fusion (C) | {16, 32, 64, 128} | {3, 5, 7} | {3, 5, 7} | [4, 32] | {3, 5} | {Max, Average} | {0.3, 0.4, 0.5, 0.6, 0.7} | {Max, Average} |

Table S4b: Hyperparameter search ranges for the latent fusion method, viz. model type D in Table 5.

| Modality | Model | Parameter name | Parameter values |
|---|---|---|---|
| Late fusion | Logistic regression | Penalty | 'l2', 'l1', 'elasticnet' |
| | | Regularization constant | 0.001 to 100 in geometric progression (ratio 10) |
| | Random Forest (RF) | Num estimators | 10, 25, 50, 75, 100 |
| | | Max depth | 3, 5, 7 |
| | | Max features | 'auto', 'sqrt', 'log2' |
| | Support Vector Machine (SVM) | Regularization constant | 0.001 to 100 in geometric progression (ratio 10) |
| | | Kernel | 'rbf', 'linear', 'poly' |
| | | Gamma | 0.001 to 100 in geometric progression (ratio 10) |
| | | Polynomial degree | 2, 3 |
| | XGBoost | Learning rate | 0.01, 0.05 |
| | | max_depth | 2, 4, 6 |
| | | Min_child_weight | 9, 11 |
| | | subsample | 0.7, 0.8 |
| | | colsample_bytree | 0.7, 0.5, 0.6 |
| | | n_estimators | 10 to 100 in steps of 10 |

Table S4c: Hyperparameter search ranges for the late fusion method, viz. model type E2 in Table 5. Due to space constraints, only the Random Forest results are reported in Table 5 of the paper. All models except XGBoost were trained using sklearn, with hyperparameters tuned by grid search CV. XGBoost was trained with the xgboost package.
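To make the size of this search concrete, the sketch below expands the Random Forest ranges from Table S4c into the full list of candidate settings that a grid search would evaluate, and generates the regularization constants that run from 0.001 to 100 with ratio 10. It uses only the standard library. The mapping from table labels to scikit-learn argument names (`n_estimators`, `max_depth`, `max_features`) and the `expand_grid` helper are assumptions made for illustration, not the authors' code.

```python
from itertools import product

# Regularization constants: 0.001 to 100, geometric progression with ratio 10.
C_VALUES = [10.0 ** e for e in range(-3, 3)]  # six values from 1e-3 to 1e2

# Random Forest ranges from Table S4c. A dict of lists like this is the shape
# scikit-learn's GridSearchCV accepts as param_grid. Note that recent
# scikit-learn releases no longer accept max_features="auto".
RF_GRID = {
    "n_estimators": [10, 25, 50, 75, 100],
    "max_depth": [3, 5, 7],
    "max_features": ["auto", "sqrt", "log2"],
}


def expand_grid(grid):
    """Return every combination of values in the grid as a list of dicts."""
    names = list(grid)
    return [dict(zip(names, combo)) for combo in product(*grid.values())]


rf_candidates = expand_grid(RF_GRID)
print(len(rf_candidates))  # 5 * 3 * 3 = 45 candidate settings
```

With cross-validation, each of these candidates is fitted once per fold, so the cost of the search grows as the number of candidates times the number of folds.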

Tuned hyperparameters

Legend:

The variable names listed below are used in the tables in this section.

  1. Conv block
    a. n – number of filters
    b. k – kernel size
  2. Inception block
    Refer to Figure 3 in the paper for the meaning of each variable name
  3. Pooling layer
    a. P – type of pooling

Tuned hyperparameters for video classification task on seen singer split

Information of the model and hyperparameters for the third column (Seen singer, Video) in Table 4 and model type A in Table 5 of our paper.

Model size
| split | # total params | # trainable params | Mean model size (MB) |
|---|---|---|---|
| AG | 21869 | 21323 | 0.43 |
| CC | 10333 | 9993 | 0.29 |
| SCh | 19010 | 18490 | 0.39 |

Table S5A1a: Size of models used for unimodal video classification on the seen singer split

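| split | n (conv 1) | k (conv 1) | n11 | n21 | n31 | n32 | n41 | n42 | n43 | k (inception) | p | P (inception) | dr | P (pooling layer) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AG | 128 | 7 | 12 | 32 | 21 | 14 | 25 | 22 | 31 | 7 | 3 | Max | 0.3 | Max |
| CC | 128 | 7 | 10 | 29 | 28 | 24 | 18 | 28 | 31 | 7 | 3 | Avg | 0.6 | Avg |
| SCh | 128 | 7 | 9 | 25 | 10 | 17 | 23 | 27 | 30 | 7 | 3 | Max | 0.3 | Max |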

Table S5A1b: Hyperparameters of models used for unimodal video classification on the seen singer split

Tuned hyperparameters for video classification task on unseen singer split

Information of the model and hyperparameters for the fifth column (Unseen singer, Video) in Table 4 of our paper.

Model size
| split | # total params | # trainable params | Mean model size (MB) |
|---|---|---|---|
| AG | 9479 | 9169 | 0.28 |
| CC | 16331 | 15871 | 0.36 |
| SCh | 5957 | 5721 | 0.24 |

Table S5A2a: Size of models used for unimodal video classification on the unseen singer split

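| split | n (conv 1) | k (conv 1) | n11 | n21 | n31 | n32 | n41 | n42 | n43 | k (inception) | p | P (inception) | dr | P (pooling layer) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AG | 16 | 7 | 21 | 9 | 32 | 20 | 16 | 30 | 32 | 5 | 5 | Max | 0.7 | Avg |
| CC | 128 | 5 | 22 | 11 | 9 | 28 | 31 | 17 | 6 | 3 | 3 | Max | 0.6 | Avg |
| SCh | 16 | 7 | 18 | 5 | 26 | 15 | 22 | 19 | 15 | 5 | 3 | Avg | 0.5 | Avg |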

Table S5A2b: Hyperparameters of models used for unimodal video classification on the unseen singer split

Tuned hyperparameters for audio classification task on seen singer split

Information of the model and hyperparameters for the second column (Seen singer, Audio) in Table 4 and model type B in Table 5 of our paper.

Model size
| split | # total params | # trainable params | Mean model size (MB) |
|---|---|---|---|
| AG | 69903 | 69067 | 1 |
| CC | 43323 | 42761 | 0.7 |
| SCh | 189136 | 188300 | 2.5 |

Table S5B1a: Size of models used for unimodal audio classification on the seen singer split

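| split | n (conv 1) | k (conv 1) | n (conv 2) | k (conv 2) | n11 | n21 | n31 | n32 | n41 | n42 | n43 | k (inception) | p | P (inception) | dr | P (pooling layer) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AG | 16 | 5 | 256 | 9 | 24 | 28 | 27 | 22 | 17 | 32 | 20 | 10 | 5 | Avg | 0.5 | Avg |
| CC | 32 | 7 | 128 | 7 | 19 | 14 | 10 | 28 | 24 | 16 | 29 | 3 | 5 | Avg | 0.4 | Avg |
| SCh | 128 | 9 | 128 | 10 | 24 | 28 | 27 | 26 | 28 | 24 | 29 | 3 | 3 | Max | 0.5 | Avg |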

Table S5B1b: Hyperparameters of models used for unimodal audio classification on the seen singer split

Tuned hyperparameters for audio classification task on unseen singer split

Information of the model and hyperparameters for the fourth column (Unseen singer, Audio) in Table 4 of our paper.

Model size
| split | # total params | # trainable params | Mean model size (MB) |
|---|---|---|---|
| AG | 18698 | 18244 | 0.41 |
| CC | 9773 | 9431 | 0.3 |
| SCh | 20298 | 19790 | 0.43 |

Table S5B2a: Size of models used for unimodal audio classification on the unseen singer split

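| split | n (conv 1) | k (conv 1) | n (conv 2) | k (conv 2) | n11 | n21 | n31 | n32 | n41 | n42 | n43 | k (inception) | p | P (inception) | dr | P (pooling layer) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AG | 8 | 3 | 64 | 7 | 27 | 32 | 31 | 32 | 32 | 17 | 11 | 10 | 5 | Avg | 0.6 | Avg |
| CC | 32 | 3 | 8 | 9 | 11 | 12 | 32 | 15 | 20 | 32 | 20 | 5 | 5 | Avg | 0.6 | Avg |
| SCh | 8 | 9 | 128 | 5 | 24 | 7 | 30 | 32 | 5 | 30 | 14 | 10 | 5 | Max | 0.3 | Avg |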

Table S5B2b: Hyperparameters of models used for unimodal audio classification on the unseen singer split

Tuned hyperparameters for source fusion classification

Information of the model and hyperparameters for model type C in Table 5 of our paper.

Model size
| split | # total params | # trainable params | Mean model size (MB) |
|---|---|---|---|
| AG | 22341 | 21815 | 0.43 |
| CC | 21570 | 21044 | 0.42 |
| SCh | 23397 | 22879 | 0.45 |

Table S5Ca: Size of models used for source fusion multimodal classification on the seen singer split

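| split | n (conv 1) | k (conv 1) | n11 | n21 | n31 | n32 | n41 | n42 | n43 | k (inception) | p | P (inception) | dr | P (pooling layer) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AG | 128 | 7 | 14 | 22 | 10 | 27 | 32 | 30 | 14 | 3 | 3 | Avg | 0.5 | Avg |
| CC | 128 | 3 | 21 | 14 | 25 | 6 | 28 | 32 | 30 | 7 | 3 | Max | 0.5 | Avg |
| SCh | 128 | 7 | 24 | 21 | 27 | 30 | 17 | 24 | 12 | 7 | 5 | Max | 0.5 | Avg |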

Table S5Cb: Hyperparameters of models used for source fusion multimodal classification on the seen singer split

Tuned hyperparameters for latent fusion classification

Information of the model and hyperparameters for model type D in Table 5 of our paper.

Model size
| split | # total params | # trainable params | Mean model size (MB) |
|---|---|---|---|
| AG | 108876 | 65738 | 1.2 |
| CC | 88934 | 54230 | 0.99 |
| SCh | 220897 | 48761 | 1.5 |

Table S5Da: Size of models used for latent fusion multimodal classification on the seen singer split

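| split | Audio pool type* | Video pool type* | n (1D conv)* | n11 | n21 | n31 | n32 | n41 | n42 | n43 | k (inception) | p | P (inception) | dr | P (pooling layer) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AG | Max | Max | 128 | 26 | 22 | 12 | 31 | 21 | 22 | 29 | 3 | 5 | Avg | 0.4 | Avg |
| CC | Max | Max | 128 | 24 | 18 | 31 | 13 | 29 | 32 | 29 | 5 | 3 | Max | 0.6 | Avg |
| SCh | Avg | Avg | 128 | 23 | 8 | 26 | 19 | 23 | 29 | 11 | 7 | 5 | Avg | 0.4 | Avg |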

Table S5Db: Hyperparameters of models used for latent fusion multimodal classification on the seen singer split

* Pooling is done on the audio and video conv layer outputs separately to make them the same length to allow for their depth-wise concatenation. 1D convolution is then used on the output of the audio and video conv blocks to reduce the size of the data going into the inception block.
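The footnote above can be illustrated with a small NumPy sketch: each modality's conv output is pooled along time to a common length, the two are concatenated along the channel (depth) axis, and a 1D convolution reduces the channel count before the inception block. The array shapes, the common length of 100, the segment-based pooling, and the choice of kernel size 1 for the convolution are all assumptions made for illustration; only the 128 output filters come from the table above.

```python
import numpy as np


def pool_to_length(x, out_len, kind="max"):
    """Pool a (time, channels) array along time into out_len segments."""
    segments = np.array_split(x, out_len, axis=0)
    reduce = np.max if kind == "max" else np.mean
    return np.stack([reduce(seg, axis=0) for seg in segments])


def pointwise_conv1d(x, weights):
    """1D convolution with kernel size 1: mixes channels at each time step."""
    return x @ weights


rng = np.random.default_rng(0)
audio = rng.normal(size=(1200, 128))  # hypothetical audio conv block output
video = rng.normal(size=(100, 128))   # hypothetical video conv block output

common_len = 100  # hypothetical shared sequence length
audio_p = pool_to_length(audio, common_len, kind="max")
video_p = pool_to_length(video, common_len, kind="max")

# Depth-wise concatenation: same time axis, channels stacked side by side.
fused = np.concatenate([audio_p, video_p], axis=1)  # shape (100, 256)

# Reduce 256 channels to n = 128 before the inception block.
reduced = pointwise_conv1d(fused, rng.normal(size=(256, 128)))
print(fused.shape, reduced.shape)
```

Pooling first is what makes the concatenation possible: the two streams generally have different numbers of time steps, and arrays can only be joined along the channel axis when their time axes match.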

Tuned hyperparameters for Late fusion

| Split | Model | Num_Estimators | Max_depth | Max_features |
|---|---|---|---|---|
| AG | Random forest | 50 | 7 | 'sqrt' |
| CC | Random forest | 75 | 5 | 'sqrt' |
| SCh | Random forest | 50 | 7 | 'sqrt' |

Table S5E: Hyperparameters of models used for late fusion multimodal classification on the seen singer split

Confusion matrices of the results

Figure S3 (a): Confusion matrices of predictions made from audio, video and audio-video modalities. Numbers are represented as percentages of the total number of samples shown in the matrix. The rows indicate the train-val split whose validation data was used to generate the confusion matrix, i.e. the AG, CC and SCh data splits in the 'seen split' case. The columns indicate the unimodal audio, unimodal video and multimodal (corresponding to latent fusion, model D in Table 5 of the paper) models. Figure 5 in the paper (same as Figure S3 (b) below) is derived from this figure by combining the predictions across all 3 data splits.

Figure S3 (b): Confusion matrices of predictions made from audio, video and audio-video modalities. Numbers are represented in percentages of the total number of test examples across the three singers combined. Same as Figure 5 in the paper.
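The relationship between the per-split matrices in Figure S3 (a) and the combined matrix in Figure S3 (b) can be sketched as follows: count matrices from the three splits are summed, and each cell is then expressed as a percentage of the total number of samples. The labels below are made-up toy data with two classes, and pooling raw counts before normalizing is an assumption about how the predictions were combined.

```python
import numpy as np


def confusion_counts(y_true, y_pred, n_classes):
    """Count matrix with true classes on rows and predictions on columns."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(counts, (np.asarray(y_true), np.asarray(y_pred)), 1)
    return counts


def as_percent_of_total(counts):
    """Express every cell as a percentage of all samples in the matrix."""
    return 100.0 * counts / counts.sum()


# Toy (true, predicted) labels per split; not results from the paper.
splits = {
    "AG": ([0, 0, 1, 1], [0, 1, 1, 1]),
    "CC": ([0, 1, 1], [0, 1, 0]),
    "SCh": ([0, 0, 1], [0, 0, 1]),
}

per_split = {
    name: confusion_counts(y_true, y_pred, n_classes=2)
    for name, (y_true, y_pred) in splits.items()
}

combined = sum(per_split.values())            # pool counts across splits
combined_pct = as_percent_of_total(combined)  # cells sum to 100
print(combined_pct)
```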