If you’ve been playing around with the maintenance screen and the speech integration that we completed last week, you may have noticed that there can be a noticeable lag between the time you press a play button and when you hear the speech.

The lag is caused by the necessary round trip to Azure Cognitive Services (henceforth ACS) to convert the text to speech. In my testing (using a service located relatively close to me, in Australia), rendering could take as much as 3.7 seconds.

This isn’t fast enough for interactive use.

It’s worth pointing out that I’m not being critical of ACS here. On top of the actual time taken by ACS to create the speech fragment, we’re also dealing with the round trip time between my laptop and the service itself. Given that New Zealand internet use seems to set a new record every few days, due to everyone working and playing from home, the performance I’m seeing is pretty good.

The cliché in software development is that almost every problem can be solved by introducing another layer of indirection, unless your problem is too many layers of indirection.

Let’s introduce some caching - not only will this give us faster access to any particular phrase the second time we need it, it will also reduce our calls to ACS by not rendering the same phrase multiple times.

Our first step is to move the call to ACS into a private method that simply returns a stream of binary data containing the required speech:

private async Task<Stream> RenderSpeech(string content)
{
    var stopwatch = Stopwatch.StartNew();
    try
    {
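        // Even though we create a pull stream for the output, we never read
        // from audioStream directly - the rendered audio comes back to us
        // in result.AudioData below.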
        var audioStream = AudioOutputStream.CreatePullStream();
        var audioConfig = AudioConfig.FromStreamOutput(audioStream);

        using var synthesizer =
            new SpeechSynthesizer(_configuration, audioConfig);
        using var result = await synthesizer.SpeakTextAsync(content);

        if (result.Reason == ResultReason.SynthesizingAudioCompleted)
        {
            var stream = new MemoryStream();
            stream.Write(result.AudioData, 0, result.AudioData.Length);
            return stream;
        }

        _logger.Info($"Failed to say '{content}'.");
        return null;
    }
    finally
    {
        _logger.Info($"Elapsed {stopwatch.Elapsed}");
    }
}

This differs from the approach taken previously in a few significant ways.

We use a different AudioConfig to ensure the audio isn’t sent to the speaker but is instead returned to us for later reuse. Oddly, we don’t need to actually use audioStream, as the data we want is returned directly to us in result; we capture that audio data and write it into a MemoryStream for caching.
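
For contrast, a speaker-based configuration looks something like the sketch below. This is illustrative rather than the exact code from the earlier post; the _configuration and _logger fields are the same ones used above, and SayDirectlyAsync() is just a name for the sketch.

private async Task SayDirectlyAsync(string content)
{
    // Illustrative only: synthesize straight to the default speaker,
    // so the audio plays immediately and nothing is captured for reuse.
    using var audioConfig = AudioConfig.FromDefaultSpeakerOutput();
    using var synthesizer = new SpeechSynthesizer(_configuration, audioConfig);
    using var result = await synthesizer.SpeakTextAsync(content);

    if (result.Reason != ResultReason.SynthesizingAudioCompleted)
    {
        _logger.Info($"Failed to say '{content}'.");
    }
}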

Around this new method, we add a simple asynchronous cache:

private async Task<Stream> GetSpeechStream(string content)
{
    if (_cache.TryGetValue(content, out var stream))
    {
        return stream;
    }

    stream = await RenderSpeech(content);
    if (stream is Stream)
    {
        // Successfully rendered, so store the result
        // (Don't want to cache failures)
        _cache[content] = stream;
    }

    return stream;
}

private readonly Dictionary<string, Stream> _cache 
    = new Dictionary<string, Stream>();

This is pretty straightforward caching code - if we have it in the cache, we return it immediately. If not, we render the speech and add it to the cache if that was successful.
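
To make the behaviour concrete, here’s a hypothetical usage sketch (the phrase is invented); the second request for the same content comes straight back from the dictionary without another round trip to ACS:

// First call renders via ACS and caches the stream; the second call
// returns the same cached MemoryStream almost immediately.
var first = await GetSpeechStream("Processing complete");
var second = await GetSpeechStream("Processing complete");
Debug.Assert(ReferenceEquals(first, second));

Both callers share the same stream instance, which is why the playback code below rewinds it with Seek() before playing.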

Now we can rewrite the core SayAsync() method to use the cache:

public async Task SayAsync(string content)
{
    var speech = await GetSpeechStream(content);

    _player.Stop();
    if (speech is Stream)
    {
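        // The cached stream's position may be at the end (it was just written,
        // or has been played before), so rewind before handing it to the player.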
        speech.Seek(0, SeekOrigin.Begin);
        _player.Stream = speech;
        _player.Play();
    }
}

To make this work, we needed to upgrade the project to .NET Core 3.1, as the SoundPlayer class we use to play the audio only became available to us in that release.
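
For completeness, here’s a sketch of how the player and the speech configuration might be declared. Neither appears in this post, so the CreateConfiguration helper and its parameters are assumptions, not the actual code; SoundPlayer expects WAV data, and as far as I can tell the synthesizer’s default RIFF output fits that, though you can pin the format explicitly if you’d rather not rely on the default.

// Assumed declaration of the player used by SayAsync() above.
private readonly SoundPlayer _player = new SoundPlayer();

// Hypothetical helper showing where the speech configuration comes from.
private static SpeechConfig CreateConfiguration(string subscriptionKey, string region)
{
    var config = SpeechConfig.FromSubscription(subscriptionKey, region);

    // Optional: pin the output to a RIFF/WAV format that SoundPlayer
    // understands, rather than relying on the default.
    config.SetSpeechSynthesisOutputFormat(
        SpeechSynthesisOutputFormat.Riff16Khz16BitMonoPcm);

    return config;
}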

Prior post in this series: Maintenance & Speech
Next post in this series: Improved Caching
