Audio Classification using CNN and Vision Transformer
This study addresses music genre classification on the GTZAN dataset. The audio clips were converted to mel spectrograms, whose frequency scale approximates human auditory perception, and traditional machine learning models (Random Forest, K-Nearest Neighbors, Naive Bayes) were compared with deep learning approaches. Among the traditional methods, Random Forest was the most accurate, which is attributed to its ensemble learning technique. A custom Convolutional Neural Network (CNN) tailored to mel spectrogram analysis surpassed the established VGG16 and ResNet152 models applied through transfer learning. A Vision Transformer adapted for audio data in turn achieved significantly higher accuracy than both the custom CNN and the traditional models. These results highlight the importance of model selection in music genre classification and demonstrate the effectiveness of mel spectrograms as an input representation for audio analysis.
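
To illustrate why mel spectrograms are described as aligned with human auditory perception, the following minimal sketch computes the edges of mel-spaced frequency bands. It is not the pipeline used in the study: the conversion formula (the common 2595 * log10(1 + f / 700) variant), the number of bands, the frequency range, and all function names are assumptions chosen for illustration.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert Hz to mels (assumed variant: 2595 * log10(1 + f / 700))."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels: int, fmin_hz: float, fmax_hz: float) -> List[float]:
    """Return n_mels + 2 band edges in Hz, equally spaced on the mel scale.

    Consecutive triples of edges would define the triangular filters that
    turn a linear-frequency spectrogram into a mel spectrogram.
    """
    mel_min = hz_to_mel(fmin_hz)
    mel_max = hz_to_mel(fmax_hz)
    step = (mel_max - mel_min) / (n_mels + 1)
    return [mel_to_hz(mel_min + i * step) for i in range(n_mels + 2)]


# Hypothetical settings: 8 bands up to 11025 Hz (an assumed upper limit,
# not a parameter reported in the study).
edges = mel_band_edges(n_mels=8, fmin_hz=0.0, fmax_hz=11025.0)

# Width of each interval between neighboring edges, in Hz.
widths = [hi - lo for lo, hi in zip(edges, edges[1:])]

for lo, hi in zip(edges, edges[1:]):
    print(f"{lo:9.1f} Hz to {hi:9.1f} Hz")
```

Because the edges are equally spaced in mels, the intervals are narrow at low frequencies and widen toward high frequencies. This mirrors the finer pitch resolution of human hearing at low frequencies, which is the property the mel spectrogram representation relies on.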