Experiments & Results

Hyperparameter search ranges

The tables below list the search range of each hyperparameter for the model types listed in Table 5 of our paper.

| Modality (Model Type) | Convolution layers: # filters | Convolution layers: kernel size | Inception block: common kernel size (k) | Inception block: # filters | Inception block: pool size | Inception block: pooling type | Inception block: dropout rate | Pooling layer: pooling type |
|---|---|---|---|---|---|---|---|---|
| audio (B) | {4, 8, 16, 32, 64, 128, 256} | {3, 5, 7, 9, 10} | {3, 5, 7, 9, 10} | [4, 32] | {3, 5} | {Max, Average} | {0.3, 0.4, 0.5, 0.6, 0.7} | {Max, Average} |
| video (A) | {16, 32, 64, 128} | {3, 5, 7} | {3, 5, 7} | [4, 32] | {3, 5} | {Max, Average} | {0.3, 0.4, 0.5, 0.6, 0.7} | {Max, Average} |
| source fusion (C) | {16, 32, 64, 128} | {3, 5, 7} | {3, 5, 7} | [4, 32] | {3, 5} | {Max, Average} | {0.3, 0.4, 0.5, 0.6, 0.7} | {Max, Average} |

Table S4a: Hyperparameter search ranges for the unimodal audio, unimodal video and source fusion methods, viz. model types B, A and C respectively in Table 5.
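As an illustration of how the ranges in Table S4a could be written down in code, the sketch below encodes the audio (model type B) row as a dictionary and draws one configuration from it. The key names, the `sample_config` helper and the use of random sampling are assumptions made for this example; the search procedure used for these models is not described here. The range [4, 32] is read as an inclusive integer interval, which is also an assumption.

```python
import random

# Audio (model type B) ranges copied from Table S4a.
# Key names are hypothetical labels chosen for this sketch.
AUDIO_SEARCH_SPACE = {
    "conv_filters": [4, 8, 16, 32, 64, 128, 256],
    "conv_kernel_size": [3, 5, 7, 9, 10],
    "inception_kernel_size": [3, 5, 7, 9, 10],
    "inception_filters": (4, 32),  # assumed: inclusive integer interval
    "inception_pool_size": [3, 5],
    "inception_pool_type": ["max", "average"],
    "inception_dropout_rate": [0.3, 0.4, 0.5, 0.6, 0.7],
    "final_pool_type": ["max", "average"],
}


def sample_config(space, rng):
    """Draw one configuration: tuples are integer intervals, lists are discrete sets."""
    config = {}
    for name, choices in space.items():
        if isinstance(choices, tuple):
            low, high = choices
            config[name] = rng.randint(low, high)  # both ends included
        else:
            config[name] = rng.choice(choices)
    return config


rng = random.Random(0)
config = sample_config(AUDIO_SEARCH_SPACE, rng)
print(config)
```

The same dictionary layout would cover the video and source fusion rows by swapping in their narrower filter and kernel-size sets.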

| Modality (Model Type) | Convolution layers: # filters | Convolution layers: kernel size | Inception block: common kernel size (k) | Inception block: # filters | Inception block: pool size | Inception block: pooling type | Inception block: dropout rate | Pooling layer: pooling type |
|---|---|---|---|---|---|---|---|---|
| audio (B) | {4, 8, 16, 32, 64, 128, 256} | {3, 5, 7, 9, 10} | {3, 5, 7, 9, 10} | [4, 32] | {3, 5} | {Max, Average} | {0.3, 0.4, 0.5, 0.6, 0.7} | {Max, Average} |
| video (A) | {16, 32, 64, 128} | {3, 5, 7} | {3, 5, 7} | [4, 32] | {3, 5} | {Max, Average} | {0.3, 0.4, 0.5, 0.6, 0.7} | {Max, Average} |
| source fusion (C) | {16, 32, 64, 128} | {3, 5, 7} | {3, 5, 7} | [4, 32] | {3, 5} | {Max, Average} | {0.3, 0.4, 0.5, 0.6, 0.7} | {Max, Average} |

Table S4b: Hyperparameter search ranges for the latent fusion method, viz. model type D in Table 5.

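| Method | Classifier | Parameter name | Parameter values |
|---|---|---|---|
| Late fusion | Logistic regression | Penalty | 'l2', 'l1', 'elasticnet' |
| Late fusion | Logistic regression | Regularization constant | 0.001 to 100 in a geometric progression with ratio 10 |
| Late fusion | Random Forest (RF) | Num estimators | 10, 25, 50, 75, 100 |
| Late fusion | Random Forest (RF) | Max depth | 3, 5, 7 |
| Late fusion | Random Forest (RF) | Max features | 'auto', 'sqrt', 'log2' |
| Late fusion | Support Vector Machine (SVM) | Regularization constant | 0.001 to 100 in a geometric progression with ratio 10 |
| Late fusion | Support Vector Machine (SVM) | Kernel | 'rbf', 'linear', 'poly' |
| Late fusion | Support Vector Machine (SVM) | Gamma | 0.001 to 100 in a geometric progression with ratio 10 |
| Late fusion | Support Vector Machine (SVM) | Polynomial degree | 2, 3 |
| Late fusion | XGBoost | learning_rate | 0.01, 0.05 |
| Late fusion | XGBoost | max_depth | 2, 4, 6 |
| Late fusion | XGBoost | min_child_weight | 9, 11 |
| Late fusion | XGBoost | subsample | 0.7, 0.8 |
| Late fusion | XGBoost | colsample_bytree | 0.7, 0.5, 0.6 |
| Late fusion | XGBoost | n_estimators | 10 to 100 in steps of 10 |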

Table S4c: Hyperparameter search ranges for the late fusion method, viz. model type E2 in Table 5. Due to space constraints, only the Random Forest results are reported in Table 5 of the paper. All models except XGBoost were trained with sklearn, and their hyperparameters were tuned by grid search with cross-validation. XGBoost was trained with the xgboost package.
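The grid search described for Table S4c can be sketched with scikit-learn as follows. This is a minimal illustration, not the training script used for the paper: the synthetic data from `make_classification`, the 3-fold setting and the random seeds are assumptions, and 'auto' is left out of the `max_features` grid because recent scikit-learn releases no longer accept it for random forests. The regularization range "0.001-100" is read here as powers of ten.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Regularization constants for logistic regression and SVM:
# 0.001 ... 100, read as a geometric progression with ratio 10.
reg_constants = [10.0 ** e for e in range(-3, 3)]

# Random Forest grid from Table S4c ('auto' omitted, see the note above).
rf_grid = {
    "n_estimators": [10, 25, 50, 75, 100],
    "max_depth": [3, 5, 7],
    "max_features": ["sqrt", "log2"],
}

# Toy stand-in for the late-fusion input features (assumption, not the paper's data).
X, y = make_classification(
    n_samples=150, n_features=12, n_informative=6, n_classes=3, random_state=0
)

# Exhaustive search over the grid with 3-fold cross-validation (fold count assumed).
search = GridSearchCV(RandomForestClassifier(random_state=0), rf_grid, cv=3)
search.fit(X, y)

print(search.best_params_)
```

After fitting, `search.best_estimator_` holds the forest refit on all of `X` with the selected settings, which is the kind of per-split configuration that a table of tuned late fusion hyperparameters would report.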

Tuned hyperparameters


The following variable names are used in the tables in this section:

  1. Conv block
    a. n – number of filters
    b. k – kernel size
  2. Inception block
    Refer to Figure 3 in the paper for the meaning of each variable name
  3. Pooling layer
    a. P – type of pooling

Tuned hyperparameters for video classification task on seen singer split

Information of the model and hyperparameters for the third column (Seen singer, Video) in Table 4 and model type A in Table 5 of our paper.

Model size
| split | # total hparams | # trainable hparams | Mean model size (MB) |

Table S5A1a: Size of models used for unimodal video classification on the seen singer split

| Conv block - 1 | Inception block | Pooling layer |

Table S5A1b: Hyperparameters of models used for unimodal video classification on the seen singer split

Tuned hyperparameters for video classification task on unseen singer split

Information of the model and hyperparameters for the fifth column (Unseen singer, Video) in Table 4 of our paper.

Model size
| split | # total hparams | # trainable hparams | Mean model size (MB) |

Table S5A2a: Size of models used for unimodal video classification on the unseen singer split

| Conv block - 1 | Inception block | Pooling layer |

Table S5A2b: Hyperparameters of models used for unimodal video classification on the unseen singer split

Tuned hyperparameters for audio classification task on seen singer split

Information of the model and hyperparameters for the second column (Seen singer, Audio) in Table 4 and model type B in Table 5 of our paper.

Model size
| split | # total hparams | # trainable hparams | Mean model size (MB) |

Table S5B1a: Size of models used for unimodal audio classification on the seen singer split

| Conv block - 1 | Conv block - 2 | Inception block | Pooling layer |

Table S5B1b: Hyperparameters of models used for unimodal audio classification on the seen singer split

Tuned hyperparameters for audio classification task on unseen singer split

Information of the model and hyperparameters for the fourth column (Unseen singer, Audio) in Table 4 of our paper.

Model size
| split | # total hparams | # trainable hparams | Mean model size (MB) |

Table S5B2a: Size of models used for unimodal audio classification on the unseen singer split

| Conv block - 1 | Conv block - 2 | Inception block | Pooling layer |

Table S5B2b: Hyperparameters of models used for unimodal audio classification on the unseen singer split

Tuned hyperparameters for source fusion classification

Information of the model and hyperparameters for model type C in Table 5 of our paper.

Model size
| split | # total hparams | # trainable hparams | Mean model size (MB) |

Table S5Ca: Size of models used for source fusion multimodal classification on the seen singer split

| Conv block - 1 | Inception block | Pooling layer |

Table S5Cb: Hyperparameters of models used for source fusion multimodal classification on the seen singer split

Tuned hyperparameters for latent fusion classification

Information of the model and hyperparameters for model type D in Table 5 of our paper.

Model size
| split | # total hparams | # trainable hparams | Mean model size (MB) |

Table S5Da: Size of models used for latent fusion multimodal classification on the seen singer split

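Column groups: Pooling + 1D convolution* (Audio Pool type, Video Pool type, n); Inception block (n11, n21, n31, n32, n41, n42, n43, k, p, P, dr); Pooling layer (P)
| split | Audio Pool type | Video Pool type | n | n11 | n21 | n31 | n32 | n41 | n42 | n43 | k | p | P | dr | P |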

Table S5Db: Hyperparameters of models used for latent fusion multimodal classification on the seen singer split

* Pooling is done on the audio and video conv layer outputs separately to make them the same length to allow for their depth-wise concatenation. 1D convolution is then used on the output of the audio and video conv blocks to reduce the size of the data going into the inception block.
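The footnote above can be sketched in NumPy as follows. This is an illustration under several assumptions, not the model code: the feature-map shapes, the target length of 50, the non-overlapping pooling windows, and the choice of a kernel-size-1 convolution applied after concatenation are all picked for the example, and `pool_to_length` and `pointwise_conv1d` are hypothetical helper names.

```python
import numpy as np


def pool_to_length(x, target_len, kind="max"):
    """Pool a (time, channels) array to target_len steps using non-overlapping windows."""
    window = x.shape[0] // target_len
    trimmed = x[: window * target_len]  # drop trailing frames that do not fill a window
    windows = trimmed.reshape(target_len, window, x.shape[1])
    return windows.max(axis=1) if kind == "max" else windows.mean(axis=1)


def pointwise_conv1d(x, weights, bias):
    """1D convolution with kernel size 1: mixes channels independently at each time step."""
    return x @ weights + bias


rng = np.random.default_rng(0)
audio = rng.normal(size=(600, 64))  # assumed audio conv output: 600 steps, 64 channels
video = rng.normal(size=(150, 32))  # assumed video conv output: 150 steps, 32 channels

target_len = 50
audio_pooled = pool_to_length(audio, target_len, kind="max")
video_pooled = pool_to_length(video, target_len, kind="average")

# Equal lengths allow depth-wise (channel-axis) concatenation.
fused = np.concatenate([audio_pooled, video_pooled], axis=1)  # shape (50, 96)

# n filters reduce the channel count before the inception block (n = 16 assumed).
n = 16
weights = rng.normal(size=(fused.shape[1], n)) * 0.1
bias = np.zeros(n)
reduced = pointwise_conv1d(fused, weights, bias)  # shape (50, 16)

print(fused.shape, reduced.shape)
```

Pooling each modality separately is what lets the audio and video pool types be tuned independently, as the "Audio Pool type" and "Video Pool type" columns suggest.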

Tuned hyperparameters for late fusion

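| Split | Model | Num estimators | Max depth | Max features |
|---|---|---|---|---|
| AG | Random forest | 50 | 7 | 'sqrt' |
| CC | Random forest | 75 | 5 | 'sqrt' |
| SCh | Random forest | 50 | 7 | 'sqrt' |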

Table S5E: Hyperparameters of models used for late fusion multimodal classification on the seen singer split

Confusion matrices of the results

Figure S3 (a): Confusion matrices of predictions made from audio, video and audio-video modalities. Numbers are shown as percentages of the total number of samples in each matrix. The rows indicate the train-val split whose validation data was used to generate the confusion matrix, i.e. the AG, CC and SCh data splits in the ‘seen split’ case. The columns indicate the unimodal audio, unimodal video and multimodal (latent fusion, model D in Table 5 of the paper) models. Fig 5 in the paper (same as Figure S3 (b) below) is derived from this figure by combining the predictions across all 3 data splits.

Figure S3 (b): Confusion matrices of predictions made from audio, video and audio-video modalities. Numbers are shown as percentages of the total number of test examples across the three singers combined. Same as Figure 5 in the paper.
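The normalisation used in these figures, and the way per-split matrices are combined, can be sketched as below. The labels and predictions are toy values invented for the example (three classes, a handful of samples per split), and `confusion_counts` and `as_percent_of_total` are hypothetical helper names; the sketch assumes that combining splits means summing raw counts before converting to percentages.

```python
import numpy as np


def confusion_counts(y_true, y_pred, n_classes):
    """Count matrix with true classes on rows and predicted classes on columns."""
    counts = np.zeros((n_classes, n_classes), dtype=int)
    np.add.at(counts, (y_true, y_pred), 1)
    return counts


def as_percent_of_total(counts):
    """Express each cell as a percentage of all samples in the matrix."""
    return 100.0 * counts / counts.sum()


# Toy (true, predicted) labels per split; not the paper's data.
splits = {
    "AG": (np.array([0, 0, 1, 1, 2, 2]), np.array([0, 1, 1, 1, 2, 0])),
    "CC": (np.array([0, 1, 1, 2]), np.array([0, 1, 2, 2])),
    "SCh": (np.array([0, 1, 2, 2, 2]), np.array([0, 0, 2, 2, 1])),
}

# One matrix per split, as in Figure S3 (a).
per_split = {name: confusion_counts(t, p, 3) for name, (t, p) in splits.items()}
per_split_percent = {name: as_percent_of_total(c) for name, c in per_split.items()}

# Combined matrix, as in Figure S3 (b): sum the counts, then normalise once.
combined = sum(per_split.values())
combined_percent = as_percent_of_total(combined)

print(np.round(combined_percent, 1))
```

Summing counts before normalising weights each split by its number of samples; averaging the per-split percentage matrices instead would weight all splits equally, which gives a different result when split sizes differ.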