
Building Dynamic Audio with Emotion & Pace: Gemini 3.1 Flash TTS, Angular & Firebase Cloud Functions [GDE]

Connie Leung (DEV Community)

Google released the Gemini 3.1 Flash TTS Preview model for AI audio generation in the Gemini API, Gemini in Vertex AI, and Google AI Studio. The model introduces a new Audio tags feature to exhibit expressive human emotion, pace, and style. This application uses Firebase AI Logic to analyze an uploaded image and generate recommendations, a description, alternative tags, and an obscure fact. The obscure fact is sent to a Firebase Cloud Function that generates audio with a Gemini TTS model. The Cloud Function returns the stream to an Angular application, which converts it into a Blob URL. An audio player sets the URL as its source so that users can click the Play button to play the stream.

In this blog post, I migrate my application to the Gemini 3.1 Flash TTS Preview model and create a signal form in Angular to input a scene, emotion, and pace. The Angular application then sends the form values and the obscure fact to the Firebase Cloud Function, which generates an expressive voice using the GenAI TypeScript SDK.

The technical stack of the project:

- Angular 21: The latest version as of May 2026.
- Node.js LTS: The LTS version as of May 2026.
- Firebase Remote Config: To manage dynamic parameters.
- Firebase Cloud Functions: To generate an expressive human voice when called by the frontend.
- Firebase Local Emulator Suite: To test the functions locally at http://localhost:5001.
- Gemini in Vertex AI: To generate the expressive audio with the Gemini TTS model.

The public Google AI Studio API is restricted in my region (Hong Kong). However, Vertex AI (Google Cloud) offers enterprise access that works reliably here, so I chose Vertex AI for this demo.

```bash
npm i -g firebase-tools
```

Install firebase-tools globally using npm.

```bash
firebase logout
firebase login
```

Log out of Firebase and log in again to perform proper Firebase authentication.

```bash
firebase init
```

Execute firebase init and follow the prompts to set up Firebase Cloud Functions, the Firebase Local Emulator Suite, Firebase Cloud Storage, and Firebase Remote Config. If you have an existing project or multiple projects, you can specify the project ID on the command line:

```bash
firebase init --project <project-id>
```

In both cases, the Firebase CLI automatically installs the firebase-admin and firebase-functions dependencies. After completing the setup steps, the Firebase tools generate the functions emulator, the functions directory, a storage rules file, remote config templates, and configuration files such as .firebaserc and firebase.json.

**Angular dependency**

```bash
npm i firebase
```

The Angular application requires the firebase dependency to initialize a Firebase app, load remote config, and invoke the Firebase Cloud Functions that generate the audio.

**Firebase dependencies**

```bash
npm i @cfworker/json-schema @google/genai @modelcontextprotocol/sdk
```

Install the above dependencies to access Gemini in Vertex AI. @google/genai depends on @cfworker/json-schema and @modelcontextprotocol/sdk. Without these, the Cloud Functions cannot start.
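For reference, the firebase.json generated for a setup like this looks roughly as follows. The exact content depends on the options chosen during firebase init; the ports, keys, and file names below are the usual defaults and may differ in your project.

```json
{
  "functions": [
    {
      "source": "functions",
      "codebase": "default",
      "ignore": ["node_modules", ".git", "firebase-debug.log", "firebase-debug.*.log"]
    }
  ],
  "storage": {
    "rules": "storage.rules"
  },
  "remoteconfig": {
    "template": "remoteconfig.template.json"
  },
  "emulators": {
    "functions": {
      "port": 5001
    },
    "ui": {
      "enabled": true
    }
  }
}
```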
With our project configured, let's look at how the frontend and backend communicate. A user uploads an image in the Angular application and prompts the Gemini 3.1 Flash Lite Preview model to generate a few recommendations for improving the image, a description, and alternative tags. The user also uses the same model and the Google Search tool to find an obscure fact related to the image. The user then inputs a scene, an emotion, and a pace in an experimental signal form.

When the user clicks the generate audio button, the Angular application sends the form values and the obscure fact to the Firebase Cloud Function, which generates an expressive voice using the GenAI TypeScript SDK and the Gemini 3.1 Flash TTS Preview model. The model only accepts text inputs and generates audio outputs. The context window is 32K tokens, and TTS does not support streaming. The supported languages can be found at https://ai.google.dev/gemini-api/docs/speech-generation#languages. My mother tongue, Cantonese, is currently unsupported.

Defining environment variables in the Firebase project ensures the functions know the region of the Google Cloud project, the Firebase Cloud Function location, and the required TTS model.

`.env.example`

```
GOOGLE_CLOUD_LOCATION="global"
GOOGLE_FUNCTION_LOCATION="asia-east2"
GEMINI_TTS_MODEL_NAME="gemini-3.1-flash-tts-preview"
WHITELIST="http://localhost:4200"
REFERER="http://localhost:4200/"
```

| Variable | Description |
| --- | --- |
| GOOGLE_CLOUD_LOCATION | The region of the Google Cloud project. I chose global so that the Firebase project has access to the newest Gemini 3.1 Flash TTS preview model. |
| GOOGLE_FUNCTION_LOCATION | The region of the Firebase Cloud Functions. I chose asia-east2 because this is the region where I live. |
| WHITELIST | Requests must come from http://localhost:4200. |
| REFERER | Requests originate from http://localhost:4200/. |

http://localhost:4200 is the host and port of my local Angular application.

Before the Cloud Function proceeds with any AI calls, it is critical to ensure that all necessary environment variables are present. I implemented an AUDIO_CONFIG IIFE (Immediately Invoked Function Expression) to validate environment variables such as the TTS model name, Google Cloud project ID, and location.

```typescript
import logger from "firebase-functions/logger";
import { HttpsError } from "firebase-functions/v2/https";

export function validate(value: string | undefined, fieldName: string, missingKeys: string[]) {
  const err = `${fieldName} is missing.`;
  if (!value) {
    logger.error(err);
    missingKeys.push(fieldName);
    return "";
  }
  return value;
}

export const AUDIO_CONFIG = (() => {
  logger.info("AUDIO_CONFIG initialization: Loading environment variables and validating configuration...");
  const env = process.env;
  const missingKeys: string[] = [];

  const location = validate(env.GOOGLE_CLOUD_LOCATION, "Vertex Location", missingKeys);
  const model = validate(env.GEMINI_TTS_MODEL_NAME, "Gemini TTS Model Name", missingKeys);
  const project = validate(env.GCLOUD_PROJECT, "Google Cloud Project", missingKeys);

  if (missingKeys.length > 0) {
    throw new HttpsError("failed-precondition", `Missing environment variables: ${missingKeys.join(", ")}`);
  }

  return {
    genAIOptions: {
      project,
      location,
      vertexai: true,
    },
    model,
  };
})();
```

I am using Node 24 as of May 2026. Since Node 20, we can use the built-in process.loadEnvFile function that loads environment variables from the .env file. In env.ts, the try-catch block attempts to load the environment variables from the .env file.

```typescript
try {
  process.loadEnvFile();
} catch {
  // Ignore error if .env file is not found (e.g., in production where env vars are set by the platform)
}
```

In src/index.ts, the first statement imports env.ts before importing other files and libraries.

```typescript
import "./env";
// ... other import statements ...
```

If you are using a Node version that does not support process.loadEnvFile, the alternative is to install dotenv to load the environment variables.

```bash
npm i dotenv
```

```typescript
import dotenv from "dotenv";

dotenv.config();
```

Firebase provides the GCLOUD_PROJECT variable, so it is not defined in the .env file.
When the missingKeys array is not empty, AUDIO_CONFIG throws an error that lists all the missing variable names. If the validation succeeds, genAIOptions and model are returned. genAIOptions is used to initialize GoogleGenAI, and model is the selected TTS model name.

The Cloud Function sanitizes the scene and transcript before composing the audio prompt. The sanitizeScene function escapes the newline character ('\n') as '\\n'. A newline creates a blank line and often signals the end of a block, so the sanitization effectively flattens the scene into one continuous line and the LLM's Markdown parser treats it as a single, safe paragraph. The sanitization also removes any Markdown headers injected into the scene.

```typescript
function sanitizeScene(text: string): string {
  return (text || "").trim().replace(/\r?\n/g, "\\n").replace(/^[#\s]+/gm, "");
}
```

The sanitizeTranscript function removes any Markdown headers and triple quotes injected into the transcript.

```typescript
function sanitizeTranscript(text: string): string {
  return (text || "").trim().replace(/^#+/gm, "").replace(/"""/g, '"');
}
```
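To make the effect of the sanitizers concrete, here is an illustrative run (not from the original post) against a hostile input, using the two functions defined above:

```typescript
// Illustrative only: what the sanitizers do to inputs that try to break out of the prompt template.
const scene = "# Fake header\nSecond line of the scene";
console.log(sanitizeScene(scene));
// logs: Fake header\nSecond line of the scene
// (the leading "# " is stripped and the real newline becomes the two literal characters "\n")

const transcript = '## Injected heading\n"""Injected block"""';
console.log(sanitizeTranscript(transcript));
// logs:
//  Injected heading
// "Injected block"
// (the "##" prefix is removed and each """ collapses to a single ")
```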
The AudioPrompt type encapsulates the scene, emotion, pace, transcript, and voice option that set the location, audio tags, text, and persona of the audio.

```typescript
export type AudioPrompt = {
  scene: string;
  emotion: string;
  pace: string;
  transcript: string;
  voiceOption: string;
}
```

The SCENE_DICTIONARY is an array of scenes. When the user does not provide a scene, one is randomly selected from the array.

```typescript
export const SCENE_DICTIONARY = [
  "A dimly lit, dusty library filled with ancient leather-bound books.\n" +
    "The air is thick with history. A scholarly archivist is leaning closely into a warm, vintage ribbon microphone.\n" +
    "They speak with an infectious, hushed intensity, eager to share a forgotten secret they just uncovered in a decaying manuscript.",
  "It is 10:00 PM in a glass-walled studio overlooking the moonlit London skyline, but inside, it is blindingly bright.\n" +
    "The red 'ON AIR' tally light is blazing. The speaker is standing up, bouncing on the balls of their heels to the rhythm of a thumping backing track.\n" +
    "It is a chaotic, caffeine-fueled cockpit designed to wake up an entire nation.",
  "A meticulously sound-treated bedroom in a suburban home.\n" +
    "The space is deadened by plush velvet curtains and a heavy rug, creating an intimate, close-up acoustic environment.\n" +
    "The speaker delivers the information like a trusted friend sharing an inside joke.",
  "A high-tech, minimalist laboratory humming with servers.\n" +
    "Crisp, clean acoustics reflect off glass and steel.\n" +
    "A brilliant but eccentric scientist is pacing back and forth, speaking rapidly and enthusiastically into a headset microphone, excited to explain a complex phenomenon.",
];
```

I define a buildAudioPrompt function to construct the advanced audio prompt. When an emotion is defined, its tag is [emotion]. When a pace is defined, its tag is [pace]. The combined audio tag is [emotion] [pace], followed by a space to create a proper token boundary. The insertAudioTagsToTranscript function uses a regular expression to split the transcript into sentences, inserts the combined audio tag before each sentence, and then joins them back with an empty string. The buildAudioPrompt function concatenates the scene and the expressive transcript into a string before returning it.

```typescript
import { SCENE_DICTIONARY } from './constants/scenes.const';
import { AudioPrompt } from './types/audio-prompt.type';

function makeTag(value: string) {
  const trimmedValue = value.trim();
  return trimmedValue ? `[${trimmedValue}] ` : "";
}

function insertAudioTagsToTranscript({ transcript, pace, emotion }: AudioPrompt): string {
  const audioTags = `${makeTag(emotion)}${makeTag(pace)}`;
  const cleanedTranscript = sanitizeTranscript(transcript);

  // Split after sentence-ending punctuation, capturing the whitespace delimiter
  // so that it can be re-appended to the preceding sentence.
  const parts = cleanedTranscript.split(/(?<=[.!?])(\s+)/);

  return parts
    .map((text, i, arr) => {
      if (i % 2 !== 0) {
        return ""; // Skip delimiters, they are appended to the text blocks
      }
      const delimiter = arr[i + 1] || "";
      return text.trim() ? `${audioTags}${text.trim()}${delimiter}` : delimiter;
    })
    .join("");
}

export function buildAudioPrompt(data: AudioPrompt): string {
  const randomIndex = Math.floor(Math.random() * SCENE_DICTIONARY.length);
  const selectedScene = SCENE_DICTIONARY[randomIndex];
  const trimmedScene = (data.scene || "").trim() || selectedScene;
  const escapedScene = sanitizeScene(trimmedScene);
  const transcript = insertAudioTagsToTranscript(data);

  return `## Scene:
${escapedScene}

## Transcript:
"""
${transcript}
"""
`;
}
```

The output of the prompt looks like:

```
## Scene:
<the sanitized scene>

## Transcript:
"""
[emotion] [pace] <sentence 1> [emotion] [pace] <sentence 2> ... [emotion] [pace] <sentence n>
"""
```

The createVoiceConfig function constructs an instance of GenerateContentConfig that outputs speech narrated by the given voice name.

```typescript
import { GenerateContentConfig } from "@google/genai";

export function createVoiceConfig(voiceName = "Kore"): GenerateContentConfig {
  return {
    responseModalities: ["audio"],
    speechConfig: {
      voiceConfig: {
        prebuiltVoiceConfig: {
          voiceName,
        },
      },
    },
  };
}
```

```typescript
const splitList = (whitelist?: string) =>
  (whitelist || "")
    .split(",")
    .map((origin) => origin.trim())
    .filter((origin) => origin.length > 0); // drop empty entries so an unset WHITELIST yields an empty list

export const whitelist = splitList(process.env.WHITELIST);
export const cors = whitelist.length > 0 ? whitelist : true;
export const refererList = splitList(process.env.REFERER);
```

All Cloud Functions enforce App Check, CORS, and a timeout of 600 seconds. If WHITELIST is unspecified, CORS defaults to true. While acceptable in a demo environment, configure CORS with a specific domain or false in production to prevent unauthorized access.

The readFact cloud function delegates to readFactStreamFunction when isStreaming is true. Otherwise, it delegates to readFactFunction. The readFactFunction function returns a Promise that resolves to the base64-encoded string. The readFactStreamFunction function returns a Promise that resolves to the bytes of the WAV header.

```typescript
import { onCall } from "firebase-functions/v2/https";
import { cors } from "../auth";
import { buildAudioPrompt } from './audio-prompt';
import { readFactFunction, readFactStreamFunction } from "./read-fact";
import { createVoiceConfig } from './voice-config';

const options = {
  cors,
  enforceAppCheck: true,
  timeoutSeconds: 600,
};

export const readFact = onCall(options, (request, response) => {
  const { data, acceptsStreaming } = request;
  const isStreaming = acceptsStreaming && !!response;
  const prompt = buildAudioPrompt(data);
  const voiceOption = createVoiceConfig(data.voiceOption);

  return isStreaming
    ? readFactStreamFunction(prompt, voiceOption, response)
    : readFactFunction(prompt, voiceOption);
});
```
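The Angular client later invokes this callable by the name textToAudio-readFact, which implies that the function is exported under a textToAudio group in the functions entry point. A sketch of what src/index.ts could look like (the file path of the callable is an assumption):

```typescript
// src/index.ts (sketch): load environment variables before anything else, then export
// the callable under the "textToAudio" group so that it is deployed and invoked as
// "textToAudio-readFact".
import "./env";

import { readFact } from "./text-to-audio/read-fact.function"; // path is an assumption
export const textToAudio = { readFact };
```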
The withAIAudio function is a higher-order function that calls the callback to generate an audio stream.

```typescript
async function withAIAudio(callback: (ai: GoogleGenAI, model: string) => Promise<string | number[] | undefined>) {
  try {
    const variables = AUDIO_CONFIG;
    if (!variables) {
      return "";
    }
    const { genAIOptions, model } = variables;
    const ai = new GoogleGenAI(genAIOptions);
    return await callback(ai, model);
  } catch (e) {
    if (e instanceof HttpsError) {
      throw e;
    }
    throw new HttpsError("internal", "An internal error occurred while setting up the AI client.", {
      originalError: (e as Error).message,
    });
  }
}
```

generateAudio is a callback function that uses the Gemini 3.1 Flash TTS Preview model to generate a response. getBase64DataUrl invokes extractInlineAudioData to extract the raw data and the mime type from the response. The encodeBase64String function first converts the raw data to WAV format, then encodes it to base64, and finally returns the base64 string. The createAudioParams function constructs the request parameters with the Gemini TTS model, the audio prompt, and the speech configuration.

```typescript
async function generateAudio(aiTTS: AIAudio, prompt: string, voiceOption: GenerateContentConfig) {
  try {
    const { ai, model } = aiTTS;
    const response = await ai.models.generateContent(createAudioParams(model, prompt, voiceOption));
    return getBase64DataUrl(response);
  } catch (error) {
    console.error(error);
    throw error;
  }
}

function createAudioParams(model: string, prompt: string, config?: GenerateContentConfig) {
  return {
    model,
    contents: [
      {
        role: "user",
        parts: [
          {
            text: prompt,
          },
        ],
      },
    ],
    config,
  };
}

function extractInlineAudioData(response: GenerateContentResponse): {
  rawData: string | undefined;
  mimeType: string | undefined;
} {
  const { data: rawData, mimeType } = response.candidates?.[0]?.content?.parts?.[0]?.inlineData ?? {};
  return { rawData, mimeType };
}

function getBase64DataUrl(response: GenerateContentResponse) {
  const { rawData, mimeType } = extractInlineAudioData(response);
  if (!rawData || !mimeType) {
    throw new Error("Audio generation failed: No audio data received.");
  }
  return encodeBase64String({ rawData, mimeType });
}

export function encodeBase64String({ rawData, mimeType }: RawAudioData) {
  const wavBuffer = convertToWav(rawData, mimeType);
  const base64Data = wavBuffer.toString("base64");
  return `data:audio/wav;base64,${base64Data}`;
}
```
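The snippets here call parseMimeType, createWavHeader, and convertToWav (and use the WavConversionOptions type), but the post does not show them. Here is a minimal sketch of what they could look like, assuming the model returns raw 16-bit PCM with a mime type such as audio/L16;codec=pcm;rate=24000; the actual implementations in the repository may differ.

```typescript
import { Buffer } from "node:buffer";

export type WavConversionOptions = {
  numChannels: number;
  sampleRate: number;
  bitsPerSample: number;
};

// Parse a mime type such as "audio/L16;codec=pcm;rate=24000" into WAV options.
export function parseMimeType(mimeType: string): WavConversionOptions {
  const [fileType, ...params] = mimeType.split(";").map((s) => s.trim());
  const [, format] = fileType.split("/");

  const options: WavConversionOptions = {
    numChannels: 1,
    sampleRate: 24000,
    bitsPerSample: format?.startsWith("L") ? parseInt(format.slice(1), 10) || 16 : 16,
  };

  for (const param of params) {
    const [key, value] = param.split("=").map((s) => s.trim());
    if (key === "rate") {
      options.sampleRate = parseInt(value, 10) || options.sampleRate;
    }
  }
  return options;
}

// Build a 44-byte RIFF/WAV header for raw PCM data of the given length.
export function createWavHeader(dataLength: number, options: WavConversionOptions): Buffer {
  const { numChannels, sampleRate, bitsPerSample } = options;
  const byteRate = (sampleRate * numChannels * bitsPerSample) / 8;
  const blockAlign = (numChannels * bitsPerSample) / 8;
  const header = Buffer.alloc(44);

  header.write("RIFF", 0);
  header.writeUInt32LE(36 + dataLength, 4); // chunk size
  header.write("WAVE", 8);
  header.write("fmt ", 12);
  header.writeUInt32LE(16, 16);             // fmt chunk size
  header.writeUInt16LE(1, 20);              // PCM format
  header.writeUInt16LE(numChannels, 22);
  header.writeUInt32LE(sampleRate, 24);
  header.writeUInt32LE(byteRate, 28);
  header.writeUInt16LE(blockAlign, 32);
  header.writeUInt16LE(bitsPerSample, 34);
  header.write("data", 36);
  header.writeUInt32LE(dataLength, 40);
  return header;
}

// Prepend a WAV header to the base64-encoded PCM payload returned by the model.
export function convertToWav(rawData: string, mimeType: string): Buffer {
  const options = parseMimeType(mimeType);
  const pcm = Buffer.from(rawData, "base64");
  return Buffer.concat([createWavHeader(pcm.length, options), pcm]);
}
```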
generateAudioStream is a callback function that uses the Gemini 3.1 Flash TTS Preview model to stream a list of audio chunks. The chunks are iterated so that each chunk is passed to the extractInlineAudioData function to extract the raw data and the mime type. The function converts each chunk's raw data into a buffer and sends it to the client; the byte length accumulates to determine the total size of all chunks. After all the chunks are sent to the client, the createWavHeader function uses the total byte length and the audio options to construct a WAV header, which is returned.

```typescript
async function generateAudioStream(
  aiTTS: AIAudio,
  prompt: string,
  voiceOption: GenerateContentConfig,
  response: CallableResponse,
): Promise<number[] | undefined> {
  try {
    const { ai, model } = aiTTS;
    const chunks = await ai.models.generateContentStream(createAudioParams(model, prompt, voiceOption));

    let byteLength = 0;
    let options: WavConversionOptions | undefined = undefined;

    for await (const chunk of chunks) {
      const { rawData, mimeType } = extractInlineAudioData(chunk);

      if (!options && mimeType) {
        options = parseMimeType(mimeType);
        response.sendChunk({
          type: "metadata",
          payload: {
            sampleRate: options.sampleRate,
          },
        });
      }

      if (rawData && mimeType) {
        const buffer = Buffer.from(rawData, "base64");
        byteLength = byteLength + buffer.length;
        response.sendChunk({
          type: "data",
          payload: {
            buffer,
          },
        });
      }
    }

    if (options && byteLength > 0) {
      const header = createWavHeader(byteLength, options);
      return [...header];
    }
    return undefined;
  } catch (error) {
    console.error(error);
    throw error;
  }
}
```

The readFactFunction invokes the withAIAudio higher-order function to generate a base64-encoded string. The readFactStreamFunction function calls the withAIAudio higher-order function to write chunks to the response body and send them to the client. Then, the generateAudioStream function returns the bytes of the WAV header.

```typescript
export async function readFactFunction(prompt: string, voiceOption: GenerateContentConfig) {
  return withAIAudio((ai, model) => generateAudio({ ai, model }, prompt, voiceOption));
}

export async function readFactStreamFunction(prompt: string, voiceOption: GenerateContentConfig, response: CallableResponse) {
  return withAIAudio((ai, model) => generateAudioStream({ ai, model }, prompt, voiceOption, response));
}
```

I implemented a FIREBASE_APP_CONFIG IIFE (Immediately Invoked Function Expression) that runs once to validate the environment variables of the Firebase app.

```typescript
export const FIREBASE_APP_CONFIG = (() => {
  const env = process.env;
  const missingKeys: string[] = [];

  const apiKey = validate(env.APP_API_KEY, "API Key", missingKeys);
  const appId = validate(env.APP_ID, "App Id", missingKeys);
  const messagingSenderId = validate(env.APP_MESSAGING_SENDER_ID, "Messaging Sender ID", missingKeys);
  const recaptchaSiteKey = validate(env.RECAPTCHA_ENTERPRISE_SITE_KEY, "Recaptcha site key", missingKeys);
  const projectId = validate(env.GCLOUD_PROJECT, "Project ID", missingKeys);

  if (missingKeys.length > 0) {
    throw new Error(`Missing environment variables: ${missingKeys.join(", ")}`);
  }

  return {
    app: {
      apiKey,
      appId,
      projectId,
      messagingSenderId,
      authDomain: `${projectId}.firebaseapp.com`,
      storageBucket: `${projectId}.firebasestorage.app`,
    },
    recaptchaSiteKey,
  };
})();
```

The getFirebaseConfig function caches the FIREBASE_APP_CONFIG for an hour before returning it to the Angular application. The Angular application receives the Firebase app configuration and reCAPTCHA site key from the Cloud Function to initialize Firebase AI Logic and protect resources from unauthorized access and abuse.

```typescript
export const getFirebaseConfig = onRequest({ cors }, (request, response) => {
  if (!validateRequest(request, response)) {
    return;
  }

  try {
    response.set("Cache-Control", "public, max-age=3600, s-maxage=3600");
    response.json(FIREBASE_APP_CONFIG);
  } catch (err) {
    console.error(err);
    response.status(500).send("Internal Server Error");
  }
});
```

For local development, I used the Firebase Local Emulator Suite to save cost and time.
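The emulators are started from the project root before serving the Angular application. A typical command sequence, assuming the default TypeScript functions template with its standard build script, is:

```bash
# Build the functions, then start the Functions emulator on the default port 5001
npm --prefix functions run build
firebase emulators:start --only functions
```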
In the bootstrapFirebase process, the application calls connectFunctionsEmulator to link to the Cloud Functions running at http://localhost:5001. The port number defaulted to 5001 when firebase init was executed.

```typescript
function connectEmulators(functions: Functions, remoteConfig: RemoteConfig) {
  if (location.hostname === 'localhost') {
    const host = getValue(remoteConfig, 'functionEmulatorHost').asString();
    const port = getValue(remoteConfig, 'functionEmulatorPort').asNumber();
    connectFunctionsEmulator(functions, host, port);
  }
}
```

loadFirebaseConfig is a helper function that makes a request to the Cloud Function to obtain the Firebase app configuration and the reCAPTCHA site key.

`public/config.json`

```json
{
  "getFirebaseConfigUrl": "http://127.0.0.1:5001/vertexai-firebase-6a64f/us-central1/getFirebaseConfig"
}
```

```typescript
export type FirebaseConfigResponse = {
  app: FirebaseOptions;
  recaptchaSiteKey: string;
}
```

```typescript
import { HttpClient } from '@angular/common/http';
import { inject } from '@angular/core';
import { catchError, lastValueFrom, throwError } from 'rxjs';
import config from '../../public/config.json';
import { FirebaseConfigResponse } from './ai/types/firebase-config.type';

async function loadFirebaseConfig() {
  const httpService = inject(HttpClient);
  const firebaseConfig$ = httpService.get<FirebaseConfigResponse>(config.getFirebaseConfigUrl)
    .pipe(catchError((e) => throwError(() => e)));
  return lastValueFrom(firebaseConfig$);
}
```

The bootstrapFirebase function initializes the FirebaseApp and App Check, loads the Firebase remote configuration and cloud functions, and stores them in the config service for later use.

```typescript
export async function bootstrapFirebase() {
  try {
    const configService = inject(ConfigService);
    const firebaseConfig = await loadFirebaseConfig();
    const { app, recaptchaSiteKey } = firebaseConfig;
    const firebaseApp = initializeApp(app);
    const remoteConfig = await fetchRemoteConfig(firebaseApp);

    initializeAppCheck(firebaseApp, {
      provider: new ReCaptchaEnterpriseProvider(recaptchaSiteKey),
      isTokenAutoRefreshEnabled: true,
    });

    const functionRegion = getValue(remoteConfig, 'functionRegion').asString();
    const functions = getFunctions(firebaseApp, functionRegion);
    connectEmulators(functions, remoteConfig);
    configService.loadConfig(firebaseApp, remoteConfig, functions);
  } catch (err) {
    console.error(err);
  }
}
```

The AppConfig remains unchanged.

```typescript
import { ApplicationConfig, provideAppInitializer } from '@angular/core';
import { bootstrapFirebase } from './app.bootstrap';

export const appConfig: ApplicationConfig = {
  providers: [
    provideAppInitializer(async () => bootstrapFirebase()),
  ]
};
```
I create an AudioTagsComponent and a new signal form to input the scene, emotion, pace, and voice name in the Angular frontend. The template renders a "🎙️ Customize Audio Generation" panel with inputs for the Scene Description, Vocal Emotion, and Speaking Pace, plus an AI Voice Model select ("Select a voice...") whose options are rendered with an @for block:

```html
@for (option of sortedVoiceOptions(); track option.name) {
  {{ option.label }}
}
```

```typescript
import { ChangeDetectionStrategy, Component, computed, signal } from '@angular/core';
import { form, FormField } from '@angular/forms/signals';
import { VOICE_OPTIONS } from './constants/voice-options.const';
import { AudioPromptData } from './types/audio-prompt-data.type';

@Component({
  selector: 'app-audio-tags',
  imports: [FormField],
  templateUrl: './audio-tags.component.html',
  changeDetection: ChangeDetectionStrategy.OnPush,
})
export class AudioTagsComponent {
  #audioPromptModel = signal<AudioPromptData>({
    scene: 'A news anchor reading the news in a busy newsroom',
    emotion: 'professional, slightly serious',
    pace: 'moderate, clear enunciation',
    voiceOption: 'Kore'
  });

  audioPromptForm = form(this.#audioPromptModel);

  sortedVoiceOptions = computed(() => {
    const sortedList = VOICE_OPTIONS.sort((a, b) => a.name.localeCompare(b.name));
    return sortedList.map(option => ({
      name: option.name,
      label: `${option.name} - ${option.description}`
    }));
  });

  audioPromptModel = this.#audioPromptModel.asReadonly();
}
```

The AudioTagsComponent is imported into ObscureFactComponent so that users can input values into the experimental signal form. In the HTML template of ObscureFactComponent, the app-audio-tags element has a template variable audioTags, and audioTags.audioPromptModel() resolves to an instance of AudioPromptData. The data is assigned to the audioTags property passed to the generateSpeech method. The template also displays the obscure fact:

```html
A surprising or obscure fact about the tags

@if (interestingFact()) {
  {{ interestingFact() }}
} @else {
  The tag(s) does not have any interesting or obscure fact.
}
```

```typescript
import { AudioPromptData } from './audio-prompt-data.type';
import { GenerateSpeechMode } from '../../generate-audio.util';

export type ModeWithAudioTags = {
  mode: GenerateSpeechMode;
  audioTags: AudioPromptData;
};

export type AudioPrompt = {
  scene: string;
  emotion: string;
  pace: string;
  transcript: string;
  voiceOption: string;
};
```

The generateSpeech method uses the fact and audioTags to construct an instance of AudioPrompt. When the mode is stream, the SpeechService calls generateAudioBlobURL to turn the audioPrompt into a blob URL. When the mode is sync, the SpeechService calls generateAudio to generate a base64-encoded string from the audioPrompt. When the mode is web_audio_api, the AudioPlayerService calls playStream to stream the audio.
```typescript
import { SpeechService } from '@/ai/services/speech.service';
import { AudioPrompt } from '@/ai/types/audio-prompt.type';
import { ChangeDetectionStrategy, Component, inject, input, OnDestroy, signal } from '@angular/core';
import { revokeBlobURL } from '../blob.util';
import { AudioTagsComponent } from './audio-tags/audio-tags.component';
import { ModeWithAudioTags } from './audio-tags/types/mode-audio-tags.type';
import { generateSpeechHelper, streamSpeechWithWebAudio, ttsError } from './generate-audio.util';
import { AudioPlayerService } from './services/audio-player.service';

@Component({
  selector: 'app-obscure-fact',
  templateUrl: './obscure-fact.component.html',
  imports: [
    AudioTagsComponent,
    TextToSpeechComponent,
  ],
  changeDetection: ChangeDetectionStrategy.OnPush,
})
export class ObscureFactComponent implements OnDestroy {
  interestingFact = input<string | undefined>(undefined);
  speechService = inject(SpeechService);
  audioPlayerService = inject(AudioPlayerService);

  isLoadingSync = signal(false);
  isLoadingStream = signal(false);
  isLoadingWebAudio = signal(false);
  audioUrl = signal<string | undefined>(undefined);
  ttsError = ttsError;

  async generateSpeech({ mode, audioTags }: ModeWithAudioTags) {
    const fact = this.interestingFact();
    if (fact) {
      revokeBlobURL(this.audioUrl);
      this.audioUrl.set(undefined);

      const audioPrompt = {
        ...audioTags,
        transcript: fact,
      };

      if (mode === 'sync' || mode === 'stream') {
        const loadingSignal = mode === 'stream' ? this.isLoadingStream : this.isLoadingSync;
        const speechFn = (audioPrompt: AudioPrompt) =>
          mode === 'stream'
            ? this.speechService.generateAudioBlobURL(audioPrompt)
            : this.speechService.generateAudio(audioPrompt);
        await generateSpeechHelper(audioPrompt, loadingSignal, this.audioUrl, speechFn);
      } else if (mode === 'web_audio_api') {
        await streamSpeechWithWebAudio(
          audioPrompt,
          this.isLoadingWebAudio,
          (audioPrompt: AudioPrompt) => this.audioPlayerService.playStream(audioPrompt));
      }
    }
  }

  ngOnDestroy(): void {
    revokeBlobURL(this.audioUrl);
  }
}
```
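The generateSpeechHelper and streamSpeechWithWebAudio helpers from generate-audio.util.ts are not shown in the post. Based on how generateSpeechHelper is called above, it presumably toggles the loading signal around the speech call and stores the resulting URL; a hypothetical sketch:

```typescript
import { WritableSignal } from '@angular/core';
import { AudioPrompt } from '@/ai/types/audio-prompt.type';

// Hypothetical sketch of generateSpeechHelper: flip the loading signal on, run the
// speech function, store the returned URL (or data URI), and always reset the flag.
export async function generateSpeechHelper(
  audioPrompt: AudioPrompt,
  isLoading: WritableSignal<boolean>,
  audioUrl: WritableSignal<string | undefined>,
  speechFn: (audioPrompt: AudioPrompt) => Promise<unknown>,
) {
  isLoading.set(true);
  try {
    const result = await speechFn(audioPrompt);
    if (typeof result === 'string') {
      audioUrl.set(result);
    }
  } catch (e) {
    console.error(e);
  } finally {
    isLoading.set(false);
  }
}
```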
The SpeechService has a generateAudio method that calls the readFact cloud function to obtain the base64-encoded string. Similarly, the service has a generateAudioBlobURL method that collects the streamed chunks into a buffer and prepends the WAV header. The constructBlobURL function creates a blob URL from the BlobPart array.

```typescript
export function constructBlobURL(parts: BlobPart[]) {
  return URL.createObjectURL(new Blob(parts, { type: 'audio/wav' }));
}
```

```typescript
import { AudioPrompt } from '@/ai/types/audio-prompt.type';
import { constructBlobURL } from '@/photo-panel/blob.util';
import { inject, Injectable } from '@angular/core';
import { Functions, httpsCallable } from 'firebase/functions';
import { StreamMessage } from '../types/stream-message.type';
import { ConfigService } from './config.service';

@Injectable({
  providedIn: 'root'
})
export class SpeechService {
  private configService = inject(ConfigService);

  private get functions(): Functions {
    if (!this.configService.functions) {
      throw new Error('Firebase Functions has not been initialized.');
    }
    return this.configService.functions;
  }

  async generateAudio(audioPrompt: AudioPrompt) {
    const readFactFunction = httpsCallable(
      this.functions,
      'textToAudio-readFact'
    );

    const { data: audioUri } = await readFactFunction(audioPrompt);
    return audioUri;
  }

  async generateAudioStream(audioPrompt: AudioPrompt) {
    const readFactStreamFunction = httpsCallable<AudioPrompt, number[] | undefined, StreamMessage>(
      this.functions,
      'textToAudio-readFact'
    );

    return readFactStreamFunction.stream(audioPrompt);
  }

  async generateAudioBlobURL(audioPrompt: AudioPrompt) {
    const { stream, data } = await this.generateAudioStream(audioPrompt);

    const audioParts: BlobPart[] = [];
    for await (const audioChunk of stream) {
      if (audioChunk && audioChunk.type === 'data') {
        audioParts.push(new Uint8Array(audioChunk.payload.buffer.data));
      }
    }

    const wavHeader = await data;
    if (wavHeader && wavHeader.length) {
      audioParts.unshift(new Uint8Array(wavHeader));
    }

    return constructBlobURL(audioParts);
  }
}
```

Similar to SpeechService.generateAudioBlobURL, the playStream method of AudioPlayerService also calls generateAudioStream to get a stream of chunks and plays each of them immediately.

```typescript
import { SpeechService } from '@/ai/services/speech.service';
import { AudioPrompt } from '@/ai/types/audio-prompt.type';
import { inject, Injectable, OnDestroy, signal } from '@angular/core';

@Injectable({
  providedIn: 'root'
})
export class AudioPlayerService implements OnDestroy {
  private speechService = inject(SpeechService);

  async playStream(audioPrompt: AudioPrompt) {
    const { stream } = await this.speechService.generateAudioStream(audioPrompt);
    for await (const audioChunk of stream) {
      // ... process each chunk ...
    }
  }

  ngOnDestroy(): void {
    // ... release resources to prevent memory leak ...
  }
}
```
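The post elides how each chunk is processed and played. One way to implement it, sketched below under my own assumptions (16-bit little-endian mono PCM chunks and a sample rate taken from the stream's "metadata" message sent by the Cloud Function), is to convert each chunk into an AudioBuffer and schedule it right after the previous one:

```typescript
// Hypothetical sketch of streaming PCM playback with the Web Audio API.
export class PcmStreamPlayer {
  private context = new AudioContext();
  private nextStartTime = 0;

  playChunk(bytes: Uint8Array, sampleRate: number) {
    // Convert 16-bit PCM samples to the float range [-1, 1].
    const samples = new Int16Array(bytes.buffer, bytes.byteOffset, bytes.byteLength / 2);
    const floats = new Float32Array(samples.length);
    for (let i = 0; i < samples.length; i++) {
      floats[i] = samples[i] / 32768;
    }

    // Copy the samples into an AudioBuffer and schedule it after the previous chunk.
    const buffer = this.context.createBuffer(1, floats.length, sampleRate);
    buffer.copyToChannel(floats, 0);

    const source = this.context.createBufferSource();
    source.buffer = buffer;
    source.connect(this.context.destination);

    this.nextStartTime = Math.max(this.nextStartTime, this.context.currentTime);
    source.start(this.nextStartTime);
    this.nextStartTime += buffer.duration;
  }

  async close() {
    await this.context.close();
  }
}
```

Scheduling each buffer at nextStartTime keeps playback gapless even when chunks arrive faster than real time.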
This is the end of the walkthrough for the demo. You should now be able to input different combinations of scene, emotion, and pace to create a unique personality that speaks the given text in an audio clip.

The examples in Google AI Studio and Vertex AI Studio use static audio tags and transcripts, and they work correctly for me. When I applied dynamic audio tags and transcripts in the demo, the Gemini 3.1 Flash TTS Preview model ignored the audio tags. The issue was resolved after hours of debugging with the Gemini CLI. Here are the caveats and lessons learned:

- **The Token Boundary Trap.** The code originally concatenated the tags and the transcript without a space (for example, "[giggle][slow]Before"). The LLM tokenizer failed to recognize the instruction to change the behavior and pace of the audio. My fix was to insert a space between the tags and the transcript, such as "[giggle] [slow] Before".
- **Sanitize inputs before injecting them into the prompt template.** The sanitize functions remove Markdown headers (#) and triple quotes from the scene and transcript. The cleansed scene and transcript are then injected into the prompt template to construct the final audio prompt.
- **The LLM does not understand idioms.** I typed "at a snail's pace" in the signal form, and "[at a snail's pace]" was inserted before the line. However, the model vocalized the tag literally, and no pace change occurred.
- **"Repetitive Weighting" is a real strategy.** If standard tags like [slow] and [fast] are not dramatic enough, prepend the pace with "very" to increase the dramatic effect. It was evident when [very, very, very slow] generated longer audio than [slow].
- **Replace the newline character (\n) with \\n** to flatten the lines into a single paragraph. Once the scene and transcript are cleansed and escaped, they are injected into the prompt template while the structure is preserved for the LLM parser.

## Conclusion

The integration of text-to-speech with Firebase's serverless scalability empowers Angular applications with real-time audio generation. The Angular application neither requires the genai dependency nor stores the Vertex AI environment variables in a .env file. The client application calls the Cloud Functions to perform the text-to-speech tasks and generate an audio stream. The Cloud Functions receive arguments from the client and execute a TTS operation to either return the entire audio as a base64-encoded string or stream the audio bytes in chunks. During local development, the Firebase Emulator serves the functions at http://localhost:5001 instead of the ones deployed to Cloud Run, which saves cost.

Try cloning the GitHub repository, uploading an image to generate an obscure fact, and using the Gemini 3.1 Flash TTS preview model to speak it with the specified scene, emotion, and pace.

- Demo
- GitHub Repo
- Firebase Cloud Functions
- Connect to the Cloud Functions Emulator
- Audio Tags
- Advanced Audio Prompting
- Prompting Strategies
- Previous Post about Gemini 2.5 Flash TTS, Angular and Firebase