8/27/2024

Using Ollama for Speech Synthesis

In today’s tech-obsessed world, the ability to communicate with machines via natural language is becoming increasingly important. With solutions like Ollama, developers can build powerful voice assistants right on their devices. This blog post explores the usage of Ollama for speech synthesis, combining innovative tech stacks to create a seamless offline voice experience. So buckle up & let’s dive into the fascinating world of voice synthesis using Ollama!

What is Ollama?

Ollama is a widely recognized tool designed to run and serve large language models (LLMs) offline. It allows developers to build LLM-powered applications without needing constant internet connectivity. This makes it a GO-TO for those willing to push the envelope of AI technology — I mean, who doesn't love a hardy assistant that doesn't need Wi-Fi, right?
In combination with tools like Whisper for speech recognition and Bark for text-to-speech conversion, the magic of speech synthesis can unfold right before your very ears!

Setting Up Your Environment

Before we embark on our journey exploring Ollama, let’s set up an environment to craft our voice assistant. You’ll need to establish a virtual Python environment using tools like virtualenv, pyenv, or Poetry, which is my personal favorite. The goal is to have a clean slate when you're diving into beautiful code.

Required Libraries

Here’s a handy list of libraries you’ll need to install:
  • rich: This library helps in creating visually appealing console output.
  • openai-whisper: This robust tool performs speech-to-text conversion.
  • suno-bark: A cutting-edge library for text-to-speech synthesis, ensuring high-quality audio outputs.
  • langchain: A straightforward library for interacting with LLMs.
  • sounddevice, pyaudio, speechrecognition: Essential libraries for audio recording & playback.
Make sure to check the detailed list of dependencies in the respective GitHub repositories.

The Architecture

At the heart of using Ollama for speech synthesis lie three critical components:
  1. Speech Recognition: OpenAI's Whisper (mentioned earlier) converts spoken language to text; a quick transcription sketch follows this list.
  2. Conversational Chain: Here, we implement conversational capabilities using the Langchain interface with the Llama-2 model served through Ollama. This setup promises a seamless & engaging flow.
  3. Speech Synthesizer: Finally, the transformation of text into speech is achieved by using Bark, which is famous for its lifelike speech production.
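To make the first component concrete, here is a minimal transcription sketch using openai-whisper. The model size and file name are placeholders rather than values from this project:

```python
import whisper

# Load a small Whisper checkpoint; "base" is a reasonable speed/accuracy trade-off.
stt_model = whisper.load_model("base")

# Transcribe a recorded clip (the file name is a placeholder for your own audio).
result = stt_model.transcribe("recording.wav", fp16=False)
print(result["text"])
```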

The Workflow

The workflow is beautifully straightforward:
  1. Record Speech: Use the microphone to capture audio.
  2. Transcribe to Text: Convert the recorded speech into text using Whisper.
  3. Generate a Response: Use the LLM via Langchain to produce a response.
  4. Synthesize Speech: Vocalize the generated text using Bark.
Isn’t that just poetic? Get it? 😄

Implementing Text-To-Speech Service

It all begins with coding a `TextToSpeechService` class based on Bark. This class will humanize the machine with an array of methods for handling speech synthesis.

Code Snippet

Here’s a simplified view of this superhero service:

```python
import nltk
import torch
import warnings
import numpy as np
from transformers import AutoProcessor, BarkModel

warnings.filterwarnings(
    "ignore",
    message="torch.nn.utils.weight_norm is deprecated in favor of torch.nn.utils.parametrizations.weight_norm.",
)


class TextToSpeechService:
    def __init__(self, device: str = "cuda" if torch.cuda.is_available() else "cpu"):
        # Load the small Bark checkpoint and move it to the chosen device.
        self.device = device
        self.processor = AutoProcessor.from_pretrained("suno/bark-small")
        self.model = BarkModel.from_pretrained("suno/bark-small")
        self.model.to(self.device)

    def synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
        # Generate audio for a single piece of text with the given voice preset.
        inputs = self.processor(text, voice_preset=voice_preset, return_tensors="pt")
        inputs = {k: v.to(self.device) for k, v in inputs.items()}

        with torch.no_grad():
            audio_array = self.model.generate(**inputs, pad_token_id=10000)

        audio_array = audio_array.cpu().numpy().squeeze()
        sample_rate = self.model.generation_config.sample_rate
        return sample_rate, audio_array

    def long_form_synthesize(self, text: str, voice_preset: str = "v2/en_speaker_1"):
        # Split longer text into sentences and stitch the clips together with
        # a short silence between them.
        pieces = []
        sentences = nltk.sent_tokenize(text)
        silence = np.zeros(int(0.25 * self.model.generation_config.sample_rate))

        for sent in sentences:
            sample_rate, audio_array = self.synthesize(sent, voice_preset)
            pieces += [audio_array, silence.copy()]

        return self.model.generation_config.sample_rate, np.concatenate(pieces)
```
Standing ovation for that brilliance there! 💯
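If you want to hear it in action before wiring up the rest, here's a rough usage sketch that plays the output through sounddevice (the module name `tts` is just an assumption about where you saved the class):

```python
import sounddevice as sd

from tts import TextToSpeechService  # hypothetical module holding the class above

tts_service = TextToSpeechService()
sample_rate, audio_array = tts_service.long_form_synthesize(
    "Hello! I am your offline voice assistant. How can I help you today?"
)

# Play the generated waveform and block until playback finishes.
sd.play(audio_array, sample_rate)
sd.wait()
```
Note that `nltk.sent_tokenize` needs the punkt tokenizer data, so you may have to run `nltk.download("punkt")` once before `long_form_synthesize` will work.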

Preparing the Ollama Server

After creating our service, it’s critical to prepare the Ollama server for LLM serving. Just follow these tasks:
  1. Pull the Latest Llama-2 Model: Execute `ollama pull llama2` to grab the latest & greatest model.
  2. Start the Ollama Server: Fire it up with `ollama serve`. Once this step is complete, your application will leverage the Llama-2 model to generate responses based on user input (a quick Python sanity check follows below).
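Before wiring Langchain in, you can sanity-check the server straight from Python. This is just a quick sketch against Ollama's local REST endpoint (which listens on port 11434 by default):

```python
import requests

# Ask the locally served Llama-2 model for a one-off, non-streaming completion.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Say hello in one short sentence.", "stream": False},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])
```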

Crafting the Main Application Logic

Next on our checklist is to define the necessary components for our application:
  • Rich Console for Interaction: We use the Rich library for an engaging terminal interface.
  • Whisper for Transcription: Load the Whisper speech recognition model to decode speech into text.
  • Bark for Synthesis: Initialize the Bark synthesizer instance we built earlier.
  • Conversational Chain: Use the built-in `ConversationChain` from Langchain to manage conversational flow (see the sketch after this list).
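Here's a minimal sketch of that chain, assuming a recent "classic" Langchain install; import paths shift between Langchain versions, so treat this as illustrative rather than definitive:

```python
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain_community.llms import Ollama

# Point Langchain at the Llama-2 model served by the local Ollama server.
llm = Ollama(model="llama2")

# ConversationChain keeps earlier turns in memory so replies stay in context.
chain = ConversationChain(llm=llm, memory=ConversationBufferMemory())

print(chain.predict(input="Give me a fun fact about parrots."))
```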

Main Loop Logic

The main application loop will ensure a seamless interaction with users:
  1. Prompt User for Input: Ask the user to press Enter to start recording.
  2. Start Recording: Once the input is given, use the `record_audio` function to capture audio from the user’s microphone.
  3. Stop Recording: On another Enter key press, stop recording, & transcribe the audio.
  4. Generate Response: Pass the transcribed text for a response generation through the LLM.
  5. Playback the Response: Lastly, vocalize the generated response using the Bark synthesizer.
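Putting those five steps together, the loop might look roughly like the sketch below. The fixed-length `record_audio` helper, the console prompts, and the module name `tts` are illustrative assumptions (the real app stops recording on a second Enter press rather than after a fixed duration):

```python
import numpy as np
import sounddevice as sd
import whisper
from langchain.chains import ConversationChain
from langchain.memory import ConversationBufferMemory
from langchain_community.llms import Ollama
from rich.console import Console

from tts import TextToSpeechService  # hypothetical module holding the Bark class above

console = Console()
stt = whisper.load_model("base")
tts_service = TextToSpeechService()
chain = ConversationChain(llm=Ollama(model="llama2"), memory=ConversationBufferMemory())


def record_audio(duration: float = 5.0, sample_rate: int = 16000) -> np.ndarray:
    # Simplified stand-in: record a fixed-length clip from the default microphone.
    audio = sd.rec(int(duration * sample_rate), samplerate=sample_rate,
                   channels=1, dtype="float32")
    sd.wait()
    return audio.flatten()


while True:
    console.input("Press Enter to start recording... ")
    audio_np = record_audio()

    # Whisper accepts a float32 NumPy array sampled at 16 kHz.
    text = stt.transcribe(audio_np, fp16=False)["text"].strip()
    console.print(f"[cyan]You:[/cyan] {text}")

    reply = chain.predict(input=text)
    console.print(f"[green]Assistant:[/green] {reply}")

    sample_rate, audio_array = tts_service.long_form_synthesize(reply)
    sd.play(audio_array, sample_rate)
    sd.wait()
```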

The Result

Once everything is neatly sewn together, running the application is like being in a movie moment! Though it may run a bit slowly on devices like a MacBook compared to faster, CUDA-enabled computers due to the model size, the experience is rewarding.
Here are some KEY takeaways from our application:
  • Voice-Based Interaction: Users engage through recorded voice input, and the assistant responds with vocal playback.
  • Conversational Context: Maintained throughout the interaction, enabling coherent, relevant responses thanks to the incredible Llama-2 language model.

Why Choose Ollama for Your Voice Synthesis Needs?

  • Performance: With tailored models designed for various hardware setups, Ollama ensures you get the performance you require.
  • Flexibility: Customize models effectively for different needs — whether you need help with FAQs, event details, or even fan engagement.
  • User-Friendly: Ollama makes it easy to build an assistant without deep technical experience. Jump in & start creating!
  • Comprehensive Analytics: As you interact with your audience, you gain insights into their interests, allowing your strategies to evolve.

Join the Future of Conversational AI with Arsturn

If you’re fascinated by all the possibilities of integrating AI into your workflow, Arsturn is here to help! With Arsturn's platform, you can create custom ChatGPT chatbots to boost engagement & conversions. No credit card is needed to get started — just jump right in & explore how easy it is to unlock the full potential of conversational AI.
In conclusion, Ollama alongside Whisper & Bark is establishing a whole new world where machines & humans can synergize in communication, creativity, and problem-solving. So why not start YOUR journey today with Arsturn to revolutionize your digital presence?

Copyright © Arsturn 2024