
Voice & appearance

Custom TTS

Replace the default text-to-speech engine with a custom implementation. The TTS engine should stream audio in real time.

Extend uni.TtsEngine and register via @uni.tts:

Custom TTS engine
import logging
from typing import Iterator

import uni_plugin_sdk as uni

logger = logging.getLogger(__name__)

@uni.tts
class MyCustomTts(uni.TtsEngine):

    def __init__(self):
        self._sample_rate = 24000       # 24 kHz output
        self._bytes_per_sample = 2      # 16-bit PCM

    @property
    def sample_rate(self) -> int:
        """Sample rate in Hz."""
        return self._sample_rate

    @property
    def bytes_per_sample(self) -> int:
        """Bytes per sample (2 for 16-bit PCM)."""
        return self._bytes_per_sample

    def load(self) -> None:
        """Load resources needed by the TTS engine."""
        logger.info("Custom TTS loaded")

    def synthesize(self, text: str) -> Iterator[bytes]:
        """Generate and stream audio for the given text."""
        # generate_audio_chunks is a placeholder for your model's streaming
        # inference; yield PCM chunks as they become available.
        for chunk in generate_audio_chunks(text):
            yield chunk

    def shutdown(self) -> None:
        """Release resources held by the TTS engine."""
        pass
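To exercise the `synthesize` contract without a real model, a stand-in generator can be used. The sketch below is illustrative (not part of the SDK): it renders a sine tone as 16-bit little-endian PCM at the 24 kHz rate declared above and yields it in roughly 100 ms chunks, the same shape of data a real engine would stream.

```python
from typing import Iterator

import numpy as np

SAMPLE_RATE = 24_000                 # matches the engine's sample_rate property
CHUNK_SAMPLES = SAMPLE_RATE // 10    # ~100 ms of audio per yielded chunk

def generate_audio_chunks(text: str) -> Iterator[bytes]:
    """Yield 16-bit PCM chunks for a placeholder sine tone."""
    # Tie the tone's duration to the input length (~100 ms per character),
    # purely so the placeholder behaves like a text-dependent synthesizer.
    total_samples = max(len(text), 1) * SAMPLE_RATE // 10
    t = np.arange(total_samples) / SAMPLE_RATE
    tone = 0.3 * np.iinfo(np.int16).max * np.sin(2 * np.pi * 220.0 * t)
    pcm = tone.astype(np.int16)
    for start in range(0, total_samples, CHUNK_SAMPLES):
        yield pcm[start : start + CHUNK_SAMPLES].tobytes()
```

Because each chunk is a complete run of 16-bit samples, the consumer can begin playback as soon as the first chunk arrives.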

Custom avatars

Avatars control UNI's visual representation. They can react to TTS audio levels (pulse/lip sync) and support expressions ("happy", "listening", "surprised").

Since avatars render client-side, they're built in JavaScript. The JS module exposes a mount(context) function. On the Python side, use @uni.avatar to register it.

PNG-based avatar example

plugins/my_png_avatar/__init__.py
from pathlib import Path
from typing import Any

import uni_plugin_sdk as uni

images_dir = Path(__file__).parent / "static" / "images"

@uni.avatar(module="~/avatar.js")
def create_avatar() -> dict[str, Any]:
    return {
        "expressions": [f.stem for f in images_dir.glob("*.png")]
    }

The object returned from create_avatar is passed to your JavaScript module.

Automatic expressions

If expressions is present in the returned object, UNI triggers them automatically during interactions based on sentiment analysis. Use descriptive names.

plugins/my_png_avatar/static/avatar.js
/**
 * @typedef {Object} AvatarContext
 * @property {HTMLElement} container - The container element.
 * @property {Object} config - Config object from the server.
 * @property {string} path - Static directory path of the plugin.
 * @property {function(src:string):Promise<void>} loadScript - Load a static script.
 * @property {function(eventName: string, callback: Function): void} on - Register event listener.
 */

/**
 * @param {AvatarContext} context
 * @returns {Promise<Function|undefined>}
 */
export const mount = async (context) => {
  const getImagePath = (fileName) => {
    return `${context.path}/images/${fileName}.png`;
  };

  const img = document.createElement("img");
  img.src = getImagePath("default");
  img.style.transition = "transform 0.2s ease-out";
  context.container.appendChild(img);

  context.on("expression", (name) => (img.src = getImagePath(name)));
  context.on("audio", (lv) => (img.style.transform = `scale(${1 + lv * 0.1})`));

  // Preload all expression images so swaps are instant.
  await Promise.all(
    context.config.expressions.map((name) => {
      return new Promise((resolve) => {
        const preload = new Image();
        preload.onload = preload.onerror = resolve;
        preload.src = getImagePath(name);
      });
    })
  );

  return () => img.remove();
};

Advanced example

Check out the included uni_avatar_live2d plugin for an example with user-provided assets and external libraries.

Expressions

Avatars can optionally support expressions (emotes). They are triggered automatically:

Expression     Trigger
"default"      Active by default (if absent, the first expression is used)
"sleeping"     Active during the sleep cycle
"listening"    Active while the user is speaking
Other names    Fired automatically while UNI speaks, based on sentiment analysis

Naming matters

Use descriptive expression names so that sentiment analysis can map them to the right moments.

Audio effects

Post-process TTS output in real time with effects such as pitch shift or reverb. Multiple effects can be active at once.

Extend uni.AudioEffect and register via @uni.audio_effect:

Echo effect
import numpy as np

import uni_plugin_sdk as uni

@uni.audio_effect
class EchoEffect(uni.AudioEffect):

    def __init__(self):
        self.delay_samples = 0
        self.decay = 0.5
        self.buffer: np.ndarray | None = None
        self._pos = 0  # ring-buffer position, persists across chunks

    def configure(self, sample_rate: int, bytes_per_sample: int) -> None:
        """Configure according to the TTS audio format."""
        self.delay_samples = int(0.2 * sample_rate)  # 200 ms echo delay
        self.buffer = np.zeros(self.delay_samples, dtype=np.int16)
        self._pos = 0

    def apply(self, chunk: bytes) -> bytes:
        """Apply effect to a chunk of PCM audio."""
        samples = np.frombuffer(chunk, dtype=np.int16)

        output = samples.astype(np.int32)  # widen to avoid overflow before clipping
        for i in range(len(samples)):
            echo = self.buffer[self._pos]
            output[i] = np.clip(output[i] + int(echo * self.decay), -32768, 32767)
            self.buffer[self._pos] = samples[i]
            # Advance the ring buffer so the delay stays aligned across chunks
            self._pos = (self._pos + 1) % self.delay_samples

        return output.astype(np.int16).tobytes()

    def reset(self) -> None:
        """Reset internal state between audio streams."""
        self.buffer.fill(0)
        self._pos = 0
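The DSP can be tested in isolation, stripped of the SDK registration. The standalone sketch below (the class name and parameters are illustrative) feeds a single impulse through the same ring-buffer echo algorithm; the output should contain the original click followed by a delayed, half-amplitude copy.

```python
import numpy as np

class EchoDemo:
    """Ring-buffer echo, identical in structure to the plugin's apply()."""

    def __init__(self, delay_seconds: float = 0.2, decay: float = 0.5):
        self.delay_seconds = delay_seconds
        self.decay = decay
        self.buffer = np.zeros(0, dtype=np.int16)
        self.pos = 0

    def configure(self, sample_rate: int) -> None:
        self.delay_samples = int(self.delay_seconds * sample_rate)
        self.buffer = np.zeros(self.delay_samples, dtype=np.int16)
        self.pos = 0  # read/write position persists across chunks

    def apply(self, chunk: bytes) -> bytes:
        samples = np.frombuffer(chunk, dtype=np.int16)
        output = samples.astype(np.int32)
        for i in range(len(samples)):
            echo = self.buffer[self.pos]
            output[i] = np.clip(output[i] + int(echo * self.decay), -32768, 32767)
            self.buffer[self.pos] = samples[i]
            self.pos = (self.pos + 1) % self.delay_samples
        return output.astype(np.int16).tobytes()

effect = EchoDemo()
effect.configure(sample_rate=100)   # delay of 20 samples at 100 Hz
impulse = np.zeros(40, dtype=np.int16)
impulse[0] = 10_000                 # single click at t=0
out = np.frombuffer(effect.apply(impulse.tobytes()), dtype=np.int16)
# out[0] is the original click; out[20] is its echo at half amplitude
```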

Wake word detection

Provide custom wake word detection by extending uni.WakeWordEngine and registering via @uni.wake_word:

Custom wake word engine
import uni_plugin_sdk as uni

@uni.wake_word
class MyWakeWordEngine(uni.WakeWordEngine):

    def load(self) -> None:
        config = uni.get_config()
        # Read comma-separated wake words and sensitivity from the plugin config
        words = [w.strip() for w in config.plugin.get("words", str, "").split(",") if w.strip()]
        sensitivity = config.plugin.get("sensitivity", float, 0.5)
        # load_model is a placeholder for your detector's initialization
        self._detector = load_model(words, sensitivity)

    def detect(self, audio_chunk: bytes, sample_rate: int) -> bool:
        return self._detector.detected(audio_chunk, sample_rate)

    def shutdown(self) -> None:
        self._detector.close()
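As a trivial stand-in for `load_model`, the sketch below (all names are illustrative, not SDK API) implements an energy gate that "detects" whenever a chunk's RMS level exceeds a sensitivity-derived threshold. A real engine would run an actual keyword-spotting model; this only demonstrates the `detect()` contract: raw PCM bytes in, boolean out.

```python
import numpy as np

class EnergyGate:
    """Toy detector: fires when a chunk's RMS exceeds a threshold."""

    def __init__(self, sensitivity: float):
        # Higher sensitivity -> lower RMS threshold -> easier to trigger.
        self.threshold = (1.0 - sensitivity) * 10_000

    def detected(self, audio_chunk: bytes, sample_rate: int) -> bool:
        samples = np.frombuffer(audio_chunk, dtype=np.int16).astype(np.float64)
        if samples.size == 0:
            return False
        rms = float(np.sqrt(np.mean(samples ** 2)))
        return rms > self.threshold

gate = EnergyGate(sensitivity=0.5)                     # threshold = 5000
quiet = np.zeros(1600, dtype=np.int16).tobytes()       # silence
loud = np.full(1600, 8000, dtype=np.int16).tobytes()   # constant level, RMS = 8000
gate.detected(quiet, 16_000)   # False
gate.detected(loud, 16_000)    # True
```

Because `detect` is called once per incoming chunk, it should stay cheap; heavyweight model inference belongs behind a small buffer or a dedicated thread.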