
Zero-shot audio retrieval

FoundationaLLM enables solutions that search for media, such as audio files, based on a text description. In this scenario, the FoundationaLLM agent recognizes that it is being asked to search for an audio file. It forwards the text description to a pre-trained machine learning model (a text encoder), which produces a vector embedding of the description. The catalog of available media is provided as a pre-indexed vector store containing the embeddings for all available audio files. The similarity between each audio embedding and the text embedding is then computed as the normalized vector dot product (cosine similarity). The largest dot product identifies the audio/text pair that is closest, so that audio embedding is selected. Finally, the retriever looks up the audio file represented by that embedding and returns it.
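
The sketch below illustrates the similarity step only: it assumes a hypothetical `encode_text` function standing in for the pre-trained text encoder the agent calls, and a simple in-memory dictionary standing in for the pre-indexed vector store of audio embeddings. It is not FoundationaLLM's actual retriever or vector store API, just a minimal illustration of ranking by normalized dot product.

```python
# Minimal sketch of zero-shot audio retrieval via embedding similarity.
# `encode_text` and `audio_index` are hypothetical stand-ins for the
# text encoder and the pre-indexed vector store described above.

import numpy as np


def normalize(v: np.ndarray) -> np.ndarray:
    """Scale a vector (or batch of row vectors) to unit length."""
    return v / np.linalg.norm(v, axis=-1, keepdims=True)


def retrieve_audio(query: str,
                   encode_text,                      # text encoder: str -> np.ndarray
                   audio_index: dict[str, np.ndarray]) -> str:
    """Return the audio file whose embedding best matches the text query."""
    # Embed the text description and normalize it.
    text_emb = normalize(encode_text(query))

    # Stack the pre-computed audio embeddings into a matrix and normalize each row.
    paths = list(audio_index.keys())
    audio_embs = normalize(np.stack([audio_index[p] for p in paths]))

    # Normalized dot product == cosine similarity; the largest score wins.
    scores = audio_embs @ text_emb
    return paths[int(np.argmax(scores))]
```

In a real deployment the dot-product ranking would be performed by the vector store's similarity search rather than in application code, but the selection rule is the same: the audio embedding with the highest normalized dot product against the text embedding is returned.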