API Reference

SciTeX Audio - Text-to-Speech with Multiple Backends

Backends (fallback order):

elevenlabs: ElevenLabs (paid, high quality, speed=1.2)
luxtts: LuxTTS (open-source, offline, voice-cloning, speed=2.0)
gtts: Google TTS (free, requires internet, speed=1.5)
pyttsx3: System TTS (offline, free, uses espeak/SAPI5)
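The fallback order above can be read as a first-match scan over the backend names. The following is a minimal sketch of that idea, not the library's implementation: `pick_backend` and its `is_available` predicate are hypothetical names introduced here for illustration.

```python
from typing import Callable, Iterable, Optional

# Fallback order as listed in this reference.
FALLBACK_ORDER = ("elevenlabs", "luxtts", "gtts", "pyttsx3")


def pick_backend(
    is_available: Callable[[str], bool],
    order: Iterable[str] = FALLBACK_ORDER,
) -> Optional[str]:
    """Return the first backend in `order` that reports itself usable.

    `is_available` is a hypothetical predicate, for example "the package
    is importable and any required API key is set".
    """
    for name in order:
        if is_available(name):
            return name
    return None


# Example: no ElevenLabs key and no LuxTTS install, so the scan reaches gtts.
usable = {"gtts", "pyttsx3"}
chosen = pick_backend(lambda name: name in usable)
print(chosen)
```

In real code, `scitex_audio.available_backends()` is the documented way to ask which backends can be used.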

Usage:

    from scitex_audio import speak
    speak("Hello, world!")

    from scitex_audio import get_tts, LuxTTS
    tts = get_tts("luxtts")
    tts.speak("Hello!")

class scitex_audio.ElevenLabsTTS(api_key: str | None = None, voice: str = 'adam', model_id: str = 'eleven_multilingual_v2', stability: float = 0.5, similarity_boost: float = 0.75, speed: float = 1.0, client=None, **kwargs)

Bases: BaseTTS

ElevenLabs TTS backend.

High-quality voices, but requires an API key and incurs usage costs.

Environment:

ELEVENLABS_API_KEY – Your ElevenLabs API key.

MAX_SPEED = 1.2

MIN_SPEED = 0.7

VOICES = {'adam': 'pNInz6obpgDQGcFmaJgB', 'alice': 'Xb7hH8MSUJpSbSDYk0k2', 'antoni': 'ErXwobaYiN019PkySvjV', 'bella': 'hpp4J3VqNfWAUOO0d1Us', 'brian': 'nPczCjzI2devNBz1zQrb', 'callum': 'N2lVS1w4EtoT3dr4eOWO', 'charlie': 'IKne3meq5aSn9XLyUdCD', 'chris': 'iP95p4xoKVk53GoZ742B', 'daniel': 'onwK4e9ZLuTAKqWW03F9', 'domi': 'AZnzlk1XvdvUeBnXmlld', 'elli': 'MF3mGyEYCl7XYWbV9V6O', 'eric': 'cjVigY5qzO86Huf0OWal', 'george': 'JBFqnCBsd6RMkjVDRZzb', 'harry': 'SOYHLrjzK2X1ezoPC6cr', 'jessica': 'cgSgspJ2msm6clMCkdW9', 'josh': 'TxGEqnHWrfWFTfGW9XjX', 'laura': 'FGY2WhTYpPnrIDTdsKH5', 'liam': 'TX3LPaxmHKxFdv7VOQHJ', 'lily': 'pFZP5JQG7iQjIQuC4Bku', 'matilda': 'XrExE9yKIg1WjnnlVkGX', 'rachel': '21m00Tcm4TlvDq8ikWAM', 'river': 'SAz9YHcvj6GT2YYXdXww', 'roger': 'CwhRBWXzGAHq8TQ4Fs17', 'sam': 'yoZ06aMxZJJ28mfd3POQ', 'sarah': 'EXAVITQu4vr4xnSDxMaL', 'will': 'bIHbv24MWmeRgasZH58o'}

property client: Lazy-load ElevenLabs client.

get_voices() → List[dict][source]: Get available voices.

property name: str: Return the backend name.

property requires_api_key: bool: Whether this backend requires an API key.

property requires_internet: bool: Whether this backend requires internet connection.

synthesize(text: str, output_path: str) → Path[source]: Synthesize text using ElevenLabs API.

class scitex_audio.GoogleTTS(lang: str = 'en', slow: bool = False, speed: float = 1.5, gtts_factory=None, **kwargs)

Bases: BaseTTS

Google Text-to-Speech backend using gTTS.

Free to use, requires internet connection. Good quality voices with multi-language support. Supports speed control via pydub (requires ffmpeg).

Install: pip install gTTS pydub

LANGUAGES = {'ar': 'Arabic', 'de': 'German', 'en': 'English', 'es': 'Spanish', 'fr': 'French', 'hi': 'Hindi', 'it': 'Italian', 'ja': 'Japanese', 'ko': 'Korean', 'nl': 'Dutch', 'pl': 'Polish', 'pt': 'Portuguese', 'ru': 'Russian', 'sv': 'Swedish', 'tr': 'Turkish', 'vi': 'Vietnamese', 'zh-CN': 'Chinese (Simplified)', 'zh-TW': 'Chinese (Traditional)'}

get_voices() → List[dict][source]: Get available languages as ‘voices’.

property name: str: Return the backend name.

property requires_internet: bool: Whether this backend requires internet connection.

synthesize(text: str, output_path: str) → Path[source]: Synthesize text using Google TTS with optional speed control.

class scitex_audio.LuxTTS(device: str | None = None, model_id: str = 'YatharthS/LuxTTS', reference_audio: str | None = None, num_steps: int = 4, speed: float = 2.0, rms: float = 0.01, t_shift: float = 0.9, return_smooth: bool = False, ref_duration: float = 5.0, trim_start: float | None = None, **kwargs)

Bases: BaseTTS

LuxTTS backend - open-source voice-cloning TTS.

High-quality 48 kHz output. Runs at near-realtime speed on CPU and 150x+ realtime on GPU. Requires a reference audio file for voice cloning.

Install: pip install git+https://github.com/ysharma3501/LuxTTS.git

get_voices() → List[dict][source]: Get available voices (reference audio files).

property name: str: Return the backend name.

property requires_internet: bool: Whether this backend requires internet connection.

speak(text: str, output_path: str | None = None, play: bool = True, voice: str | None = None) → dict: Synthesize and optionally play. Uses .wav temp files (not .mp3).

synthesize(text: str, output_path: str) → Path[source]: Synthesize text using LuxTTS.

class scitex_audio.SystemTTS(rate: int = 150, volume: float = 1.0, voice: str | None = None, engine=None, **kwargs)

Bases: BaseTTS

System TTS backend using pyttsx3.

Works offline using the system’s built-in TTS engine. Quality varies by platform and available voices.

Platforms:

Linux: espeak/espeak-ng
Windows: SAPI5
macOS: NSSpeechSynthesizer

property engine: Lazy-load pyttsx3 engine.

get_voices() → List[dict][source]: Get available system voices.

property name: str: Return the backend name.

speak_direct(text: str): Speak directly without saving to file (faster).

synthesize(text: str, output_path: str) → Path[source]: Synthesize text using system TTS.

class scitex_audio.TTS(api_key: str | None = None, voice_name: str | None = None, voice_id: str | None = None, client=None, client_factory=None, **kwargs)

Bases: object

Text-to-Speech using ElevenLabs API.

Examples

    from scitex_audio import TTS

    # Basic usage
    tts = TTS()
    tts.speak("Hello, world!")

    # With custom voice
    tts = TTS(voice_name="Adam")
    tts.speak("Processing complete")

    # Save to file without playing
    tts.speak("Test", output_path="/tmp/test.mp3", play=False)

VOICES = {'adam': 'pNInz6obpgDQGcFmaJgB', 'alice': 'Xb7hH8MSUJpSbSDYk0k2', 'antoni': 'ErXwobaYiN019PkySvjV', 'bella': 'hpp4J3VqNfWAUOO0d1Us', 'brian': 'nPczCjzI2devNBz1zQrb', 'callum': 'N2lVS1w4EtoT3dr4eOWO', 'charlie': 'IKne3meq5aSn9XLyUdCD', 'chris': 'iP95p4xoKVk53GoZ742B', 'daniel': 'onwK4e9ZLuTAKqWW03F9', 'domi': 'AZnzlk1XvdvUeBnXmlld', 'elli': 'MF3mGyEYCl7XYWbV9V6O', 'eric': 'cjVigY5qzO86Huf0OWal', 'george': 'JBFqnCBsd6RMkjVDRZzb', 'harry': 'SOYHLrjzK2X1ezoPC6cr', 'jessica': 'cgSgspJ2msm6clMCkdW9', 'josh': 'TxGEqnHWrfWFTfGW9XjX', 'laura': 'FGY2WhTYpPnrIDTdsKH5', 'liam': 'TX3LPaxmHKxFdv7VOQHJ', 'lily': 'pFZP5JQG7iQjIQuC4Bku', 'matilda': 'XrExE9yKIg1WjnnlVkGX', 'rachel': '21m00Tcm4TlvDq8ikWAM', 'river': 'SAz9YHcvj6GT2YYXdXww', 'roger': 'CwhRBWXzGAHq8TQ4Fs17', 'sam': 'yoZ06aMxZJJ28mfd3POQ', 'sarah': 'EXAVITQu4vr4xnSDxMaL', 'will': 'bIHbv24MWmeRgasZH58o'}

__init__(api_key: str | None = None, voice_name: str | None = None, voice_id: str | None = None, client=None, client_factory=None, **kwargs)

Initialize TTS.

Parameters:

api_key – ElevenLabs API key. Defaults to ELEVENLABS_API_KEY env var.
voice_name – Voice name (e.g., “Adam”, “Sarah”, “George”; these are free-tier voices).
voice_id – Direct voice ID (overrides voice_name).
client – Optional pre-built client (testing). When given, the lazy-load is skipped.
client_factory – Optional callable (api_key) -> client used by the lazy client property instead of the real ElevenLabs SDK (testing). Lets a test exercise the import-error path without uninstalling the dependency.
**kwargs – Additional config options (stability, speed, etc.)

property client: Lazy-load ElevenLabs client.

list_voices() → list: List available voices from ElevenLabs.

speak(text: str, output_path: str | None = None, play: bool = True, voice_name: str | None = None, voice_id: str | None = None) → Path | None

Convert text to speech and optionally play it.

Parameters:

text – Text to convert to speech.
output_path – Path to save audio file. Auto-generated if None.
play – Whether to play the audio after generation.
voice_name – Override voice name for this call.
voice_id – Override voice ID for this call.

Returns:

Path to the generated audio file, or None if only played.

Return type:

Path | None

scitex_audio.announce_context(include_full_path: bool = False, speak_aloud: bool = True, branch_resolver=None, speak_fn=None) → dict

Announce the current working directory and git branch.

Builds an orientation sentence (e.g. "Working in scitex-audio, on branch develop") and, by default, speaks it aloud. Useful when starting work in a new session.

Parameters:

include_full_path (bool) – Include the absolute path instead of just the directory name.
speak_aloud (bool) – Speak the announcement (default True). When False, only the context dict is returned.
branch_resolver (callable, optional) – Injectable callable (cwd: str) -> str | None returning the git branch name (testing seam). Defaults to a real git rev-parse subprocess.
speak_fn (callable, optional) – Injectable speak function (testing seam). Defaults to speak().

Returns:

{"directory": str, "directory_name": str, "git_branch": str | None, "announced_text": str, "spoke": bool}.

Return type:

dict
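The two pieces that `announce_context` combines, a branch lookup and a sentence builder, can be sketched as follows. This is an illustration under stated assumptions, not the library's code: the exact `git rev-parse` arguments and the wording used when no branch is found are guesses, and `resolve_branch` and `build_announcement` are hypothetical names. `resolve_branch` has the `(cwd: str) -> str | None` shape documented for `branch_resolver`.

```python
import subprocess
from typing import Optional


def resolve_branch(cwd: str) -> Optional[str]:
    """Return the current git branch for `cwd`, or None outside a repository."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--abbrev-ref", "HEAD"],  # assumed arguments
            cwd=cwd,
            capture_output=True,
            text=True,
            check=False,
        )
    except OSError:  # git is not installed, or cwd does not exist
        return None
    branch = result.stdout.strip()
    return branch if result.returncode == 0 and branch else None


def build_announcement(directory_name: str, branch: Optional[str]) -> str:
    """Compose the orientation sentence; the no-branch wording is assumed."""
    if branch:
        return f"Working in {directory_name}, on branch {branch}"
    return f"Working in {directory_name}"


print(build_announcement("scitex-audio", "develop"))
```

In a test, passing a stub such as `lambda cwd: "develop"` as `branch_resolver` and a recording function as `speak_fn` avoids both the subprocess call and real audio output.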

scitex_audio.available_backends() → list[str]: Return list of available TTS backends.

scitex_audio.available_models() → list[str]

List available whisper models.

Returns:

List of model names (e.g., ["tiny", "base", "medium"]).

scitex_audio.check_local_audio_available() → dict

Check if local audio playback is available.

Checks PulseAudio sink state to determine if audio can actually be heard. On NAS or headless servers, the sink is typically SUSPENDED.

In WSL environments, also checks for Windows playback fallback via PowerShell.

Returns:

dict with keys:
- available (bool) – True if local audio output is likely to work.
- state (str) – ‘RUNNING’, ‘IDLE’, ‘SUSPENDED’, ‘NO_SINK’, etc.
- reason (str) – Human-readable explanation.
- fallback (str, optional) – Fallback method if the primary is unavailable.
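A caller will usually branch on `available` and surface `reason` when playback is not possible. The sketch below works on a hand-written dict with the documented keys so that it runs without audio hardware; `describe_audio_status` is a hypothetical helper, and the sample values are illustrative rather than real output.

```python
def describe_audio_status(status: dict) -> str:
    """Summarize a status dict shaped like the one documented above."""
    state = status.get("state")
    if status.get("available"):
        return f"local audio OK ({state})"
    message = f"local audio unavailable ({state}): {status.get('reason')}"
    fallback = status.get("fallback")  # optional key, may be absent
    if fallback:
        message += f"; fallback: {fallback}"
    return message


# Hand-written sample; real values come from check_local_audio_available().
sample = {
    "available": False,
    "state": "SUSPENDED",
    "reason": "PulseAudio sink is suspended",
}
print(describe_audio_status(sample))
```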

scitex_audio.check_wsl_audio() → dict: Check WSL audio status and connectivity.

scitex_audio.find_whisper_cli() → str | None

Find whisper-cli binary.

Returns:

Path to whisper-cli, or None if not found.

scitex_audio.find_whisper_model(model: str = 'tiny') → str | None

Find a whisper model file.

Parameters:

model – Model name (tiny, base, small, medium, large-v3-turbo, etc.)

Returns:

Path to model file, or None if not found.

scitex_audio.generate_bytes(text: str, backend: str | None = None, voice: str | None = None, **kwargs) → bytes: Generate TTS audio as raw bytes without playing.
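Because `generate_bytes` returns raw bytes, the result can be written to disk or sent over a network without going through a player. The sketch below takes the generator as an argument so that it runs without the package installed; `save_speech` and `fake_generate` are hypothetical, and in real use `scitex_audio.generate_bytes` would be passed in their place.

```python
import tempfile
from pathlib import Path
from typing import Callable


def save_speech(text: str, path: Path, generate: Callable[..., bytes]) -> int:
    """Render `text` with `generate`, write the bytes to `path`.

    Returns the number of bytes written.
    """
    data = generate(text)
    return path.write_bytes(data)


def fake_generate(text: str, **kwargs) -> bytes:
    """Stand-in for scitex_audio.generate_bytes so the sketch runs offline."""
    return text.encode("utf-8")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "hello.bin"
    written = save_speech("Hello", target, fake_generate)
    print(written)
```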

scitex_audio.generate_env_template(include_sensitive: bool = True, include_defaults: bool = True) → str

Generate a template .src file with all environment variables.

Parameters:

include_sensitive (bool) – Include sensitive variables (API keys) as commented placeholders.
include_defaults (bool) – Include default values for variables that have them.

Returns:

Bash-compatible .src file content.

Return type:

str

scitex_audio.get_tts(backend: str | None = None, **kwargs) → BaseTTS: Get a TTS instance for the specified backend.

scitex_audio.speak(text, ...)

Convert text to speech with smart local/remote switching.

Modes:

local: Always use local TTS backends (fails if audio unavailable)
remote: Always forward to relay server
auto: Smart routing - prefers relay if local audio unavailable

Smart Routing (auto mode):

Checks if local audio sink is available (not SUSPENDED)
If local unavailable and relay configured, uses relay
If both unavailable, returns error with clear message

Fallback order (local, only when backend is None): elevenlabs -> luxtts -> gtts -> pyttsx3
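The mode rules and the auto-mode checks above amount to a small decision function. The sketch below is one reading of them, not the library's code. It assumes that auto mode uses local audio whenever it is available, and it raises exceptions where the documentation says an error is returned. `choose_route` is a hypothetical name, and the relay URL is a placeholder.

```python
from typing import Optional


def choose_route(mode: str, local_available: bool, relay_url: Optional[str]) -> str:
    """Return "local" or "remote", or raise if no route is usable."""
    if mode == "local":
        if not local_available:
            raise RuntimeError("local mode requested but local audio is unavailable")
        return "local"
    if mode == "remote":
        return "remote"
    if mode != "auto":
        raise ValueError(f"unknown mode: {mode!r}")
    # auto: use local audio when it works, otherwise the relay if configured
    if local_available:
        return "local"
    if relay_url:
        return "remote"
    raise RuntimeError("no local audio and no relay configured")


# A headless server with a relay configured routes to the relay.
print(choose_route("auto", False, "http://relay.example"))
```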

Parameters:

text – Text to speak.
backend – TTS backend (‘elevenlabs’, ‘luxtts’, ‘gtts’, ‘pyttsx3’). Auto-selects with fallback if None.
voice – Voice name, ID, or language code.
play – Whether to play the audio.
output_path – Path to save audio file.
fallback – If None (default), fallback is enabled when backend is None and disabled when a backend is explicitly specified, so an explicit backend request fails loudly rather than silently falling back. Pass True/False to override.
rate – Speech rate in words per minute (pyttsx3 only, default 150).
speed – Speed multiplier for gtts (1.0=normal, >1.0=faster, <1.0=slower).
mode – Override mode (‘local’, ‘remote’, ‘auto’). Uses env if None.
**kwargs – Additional backend options.

Returns:

Dict with success, played, play_requested, backend, path (if saved), mode.

Return type:

dict

Environment Variables:

SCITEX_AUDIO_MODE – Default mode (‘local’, ‘remote’, ‘auto’).
SCITEX_AUDIO_RELAY_URL – Relay server URL for remote mode.
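The default for the `fallback` parameter depends on whether a backend was named. A minimal sketch of that rule follows; `resolve_fallback` is a hypothetical helper written for illustration, not a function exported by the package.

```python
from typing import Optional


def resolve_fallback(backend: Optional[str], fallback: Optional[bool]) -> bool:
    """Apply the documented default: fall back only when no backend was named."""
    if fallback is None:
        return backend is None
    return fallback


print(resolve_fallback(None, None))    # no backend named
print(resolve_fallback("gtts", None))  # explicit backend, default behaviour
print(resolve_fallback("gtts", True))  # explicit backend, caller opts in
```

The practical consequence is that a call naming a backend reports a failure when that backend is unavailable, unless the caller passes `fallback=True`.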

scitex_audio.stop_speech() → None: Stop any currently playing speech by killing espeak processes.

scitex_audio.transcribe(audio_path: str, language: str | None = 'ja', model: str = 'tiny', whisper_cli: str | None = None, model_path: str | None = None) → dict

Transcribe audio file to text using whisper.cpp.

Parameters:

audio_path – Path to audio file (any format ffmpeg supports).
language – Language code (e.g., “ja”, “en”). None for auto-detect.
model – Whisper model name (tiny, base, small, medium, large-v3-turbo).
whisper_cli – Override path to whisper-cli binary.
model_path – Override path to model file.

Returns:

Dict with keys success, text, segments, language, model, audio_path.

Return type:

dict
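Callers will typically check `success` before reading `text`. The sketch below does that on a hand-written dict that uses the documented keys, so it runs without whisper.cpp installed. `transcript_text` is a hypothetical helper, the sample values are illustrative, and the structure of `segments` is not specified here, so it is left untouched.

```python
def transcript_text(result: dict) -> str:
    """Return the stripped transcript, or raise if the call did not succeed."""
    if not result.get("success"):
        raise RuntimeError(f"transcription failed for {result.get('audio_path')}")
    return str(result.get("text", "")).strip()


# Hand-written sample; real values come from scitex_audio.transcribe().
sample = {
    "success": True,
    "text": " hello world ",
    "segments": [],
    "language": "en",
    "model": "tiny",
    "audio_path": "/tmp/note.wav",
}
print(transcript_text(sample))
```

Note that `language` defaults to 'ja' in the signature above, so pass `language=None` for auto-detection or an explicit code such as "en" when transcribing other languages.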