FoundationaLLM can power solutions that classify media, such as audio files, into one of a set of user-provided categories without any domain-specific training or fine-tuning of the LLM. To do so, the FoundationaLLM agent recognizes that it is being asked an audio classification question and forwards the candidate labels and the audio file to external, pre-trained machine learning models (a text encoder API and a sound encoder API, respectively) to produce vector embeddings. The similarity between the audio embedding and each text embedding is then calculated, typically as the normalized vector dot product (cosine similarity). The largest dot product identifies the closest pair, and that text description is returned to the agent, which incorporates it into its answer.
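
The following is a minimal sketch of the similarity step described above. The `embed_text` and `embed_audio` functions are hypothetical stand-ins for calls to the external text encoder and sound encoder APIs (here they just return deterministic placeholder vectors so the example runs); the classification logic itself normalizes each embedding and picks the label with the largest dot product against the audio embedding.

```python
import numpy as np

# Hypothetical stand-ins for the external encoder APIs. In a real deployment,
# these would send the label text / audio file to the text encoder and sound
# encoder services and return the embedding vectors those services produce.
def embed_text(label: str, dim: int = 512) -> np.ndarray:
    rng = np.random.default_rng(len(label))
    return rng.standard_normal(dim)

def embed_audio(audio_path: str, dim: int = 512) -> np.ndarray:
    rng = np.random.default_rng(len(audio_path))
    return rng.standard_normal(dim)

def classify_audio(audio_path: str, candidate_labels: list[str]) -> tuple[str, float]:
    """Return the candidate label whose text embedding is closest to the audio embedding."""
    audio_vec = embed_audio(audio_path)
    audio_vec = audio_vec / np.linalg.norm(audio_vec)

    best_label, best_score = None, float("-inf")
    for label in candidate_labels:
        text_vec = embed_text(label)
        text_vec = text_vec / np.linalg.norm(text_vec)
        # Normalized dot product (cosine similarity) between audio and label embeddings.
        score = float(audio_vec @ text_vec)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score

# Example usage with hypothetical file name and labels.
print(classify_audio("barking.wav", ["dog barking", "car horn", "rainfall"]))
```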