Audio Classification using CNN and Vision Transformer
The research shows that a Vision Transformer adapted to audio input outperforms both a custom CNN and traditional machine learning models at music genre classification, using mel spectrograms computed from the GTZAN dataset.
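
A Vision Transformer usually treats its input as a sequence of flattened image patches, so a mel spectrogram can be handled like a single-channel image. The sketch below illustrates that step only; it is not the research's actual pipeline. The function name `patchify`, the 16x16 patch size, the 128x128 spectrogram shape, and the random stand-in data are all assumptions chosen for illustration.

```python
import numpy as np


def patchify(spec: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split a (n_mels, n_frames) spectrogram into flattened, non-overlapping patches.

    Hypothetical helper: the patch size is an assumption, not a value from the research.
    """
    n_mels, n_frames = spec.shape
    if n_mels % patch or n_frames % patch:
        raise ValueError("spectrogram dimensions must be divisible by the patch size")
    rows, cols = n_mels // patch, n_frames // patch
    # (rows, patch, cols, patch) -> (rows, cols, patch, patch) -> (rows * cols, patch * patch)
    tiles = spec.reshape(rows, patch, cols, patch).transpose(0, 2, 1, 3)
    return tiles.reshape(rows * cols, patch * patch)


# Random values stand in for a log-mel spectrogram (assumed 128 mel bands x 128 frames).
rng = np.random.default_rng(0)
spec = rng.standard_normal((128, 128))

tokens = patchify(spec)
print(tokens.shape)  # (64, 256): 8 x 8 patches, each flattened to 16 * 16 values
```

In a full model, each row of `tokens` would typically pass through a learned linear projection and receive a positional embedding before entering the transformer encoder; those parts are omitted here.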