Directing a Synthetic Voice: Latin Phonetics, Voice Design, and Theatrical Text-to-Speech in ElevenLabs

Francesca Fini
May 26

For the third episode of THE GALLERY / SIMUL ACTIONS, I needed a voice that was not simply “dramatic,” “ancient,” or “solemn.” I needed a voice that could inhabit the architecture of a classical tragedy: a coryphaeus speaking from inside a ritual space, somewhere between Roman theatre, contemporary performance, and generative hallucination.

The text was in Latin. Or rather, in a deliberately theatrical Latin: not used as a philological ornament, but as vocal material, ritual code, and sonic architecture.

The challenge was clear from the beginning: how do you make an AI voice pronounce Latin with authority, emotional weight, and theatrical credibility?

This is where my work with ElevenLabs became particularly interesting.

Voice Design as casting

The first step was not text-to-speech, but Voice Design in ElevenLabs.

Before working on pronunciation, I needed to create the right vocal body for the coryphaeus: deep, resonant, aged but powerful, capable of tragedy without becoming melodrama. The voice had to feel like a performer standing before an ancient chorus — not a narrator, not a movie trailer voice, not a generic “epic” preset.

In this sense, the Voice Design tool became a form of casting.

I was not just selecting a voice. I was designing a performer.

I needed a vocal presence that could carry contradiction: authority and despair, prophecy and ridicule, solemnity and absurdity. Because in THE GALLERY, tragedy is never safe. It is always interrupted by error, slapstick, the body falling into the wrong mythology, and the generative image stumbling into unexpected forms.

Text-to-speech as vocal direction

Once the voice existed, the real work began: directing it.

This is where ElevenLabs becomes much more than a text-to-speech engine. With the right use of tags, rhythm, punctuation, line breaks, and phonetic writing, the text becomes a kind of vocal score.

For example, before the Latin text, I used performance tags such as:

[deep resonant voice] [solemn theatrical declamation] [tragic] [prophetic tone] [slow ritual cadence] [voice broken but powerful] [like a coryphaeus in a Roman tragedy]

These tags are not merely decorative. They function like a miniature direction note to an actor. They define posture, emotional temperature, rhythm, projection, and dramatic intention.

In a traditional recording session, I would give these instructions to a performer. With ElevenLabs, I write them into the prompt.

This is, for me, one of the most fascinating aspects of AI voice work: the prompt becomes vocal direction.
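As a rough sketch, the tag prefix can be assembled programmatically so that the same direction note is reused across several lines of text. The function name `direct` and the blank line between the tags and the spoken text are my own assumptions for illustration. They are not part of any ElevenLabs API, and the snippet only builds the prompt string.

```python
def direct(tags: list[str], text: str) -> str:
    """Prefix the spoken text with bracketed performance tags."""
    # Wrap each non-empty tag in square brackets, as in the prompt above.
    tag_line = " ".join(f"[{tag.strip()}]" for tag in tags if tag.strip())
    # Assumption: tags go on their own line, followed by a blank line.
    return f"{tag_line}\n\n{text}" if tag_line else text


prompt = direct(
    ["deep resonant voice", "solemn theatrical declamation", "slow ritual cadence"],
    "Ecce signum!\nSol niger super velarium ascendit.",
)
print(prompt)
```

Keeping the tags in a list makes it easy to swap one direction note for another without touching the Latin text itself.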

The problem of Latin pronunciation

The next challenge was pronunciation.

If I simply wrote:

Ecce signum!
Sol niger super velarium ascendit.

the result could shift unpredictably depending on the selected voice. A voice might pronounce the Latin with an Italian ecclesiastical flavor, an English deformation, or a generic pseudo-ancient tone.

But I wanted something closer to classical reconstructed Latin pronunciation:

C is always hard.
G is always hard.
V is closer to U/W.
QU becomes KU/KW.
SCI becomes SKI, not the Italian “sci.”
GN is separated, not the Italian soft “gn.”

So Luci could not be pronounced like modern Italian “luci.” It had to become closer to:

Luki

Renasci could not become “renasci” with the soft Italian “sci.” It had to become:

renaski

Ignoto could not be read as Italian “ignoto.” The G and N had to remain separated:

ig noto

This changed everything.

The voice suddenly became harder, more archaic, more ritualistic. The Latin stopped sounding ecclesiastical and became almost stone-like: a language carved into the scene.
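The respelling rules above can be sketched as a small substitution pass. This is only an illustration under stated assumptions: the names `RULES` and `respell` are hypothetical, the pass covers just the rules that change the spelling (C, QU, V, GN), and it does not add stress accents or adapt the result to a specific voice, which I did by hand and by ear.

```python
import re

# Ordered (pattern, replacement) pairs for the spelling rules listed above.
# QU and GN run before the single-letter rules. SCI needs no rule of its
# own because the C rule already turns "sci" into "ski".
RULES = [
    (r"qu", "ku"),
    (r"gn", "g n"),
    (r"c", "k"),
    (r"v", "u"),
]


def _match_case(source: str, replacement: str) -> str:
    """Capitalise the replacement when the matched text began with a capital."""
    return replacement.capitalize() if source[0].isupper() else replacement


def respell(text: str) -> str:
    """Apply each rule in order, ignoring case when matching."""
    for pattern, replacement in RULES:
        text = re.sub(
            pattern,
            lambda m, r=replacement: _match_case(m.group(0), r),
            text,
            flags=re.IGNORECASE,
        )
    return text


print(respell("Luci"), respell("renasci"), respell("ignoto"))
# → Luki renaski ig noto
```

A pass like this only produces a first draft. The output still has to be checked against what the chosen voice actually says.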

Writing phonetics for the specific voice

One important discovery is that phonetic prompting is not universal.

The same Latin line must be rewritten differently depending on whether the selected voice is Italian, English, or another linguistic identity.

For an Italian voice, I used a phonetic adaptation like:

Implèta est profètia.
Dìi tàndem nòbis fàuent.

For an English-speaking voice, the same line would need a completely different strategy:

im PLEH tah est pro FEH tee ah.
DEE ee TAHN dem NOH bees FAH went.

This means that text-to-speech is not just about writing what should be said. It is about understanding the “mouth” of the synthetic performer.

Each voice carries its own linguistic instincts. A good prompt must anticipate them.
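One way to keep these voice-specific versions organized is to store a separate phonetic script per voice language and pick the right one at generation time. The sketch below is a minimal illustration. The names `PHONETIC_SCRIPTS` and `script_for` and the language keys `"it"` and `"en"` are my own assumptions, not ElevenLabs identifiers. The two scripts are the ones quoted above.

```python
# One hand-written phonetic script per voice language (keys are illustrative).
PHONETIC_SCRIPTS = {
    "it": "Implèta est profètia.\nDìi tàndem nòbis fàuent.",
    "en": "im PLEH tah est pro FEH tee ah.\nDEE ee TAHN dem NOH bees FAH went.",
}


def script_for(voice_language: str) -> str:
    """Return the phonetic script written for this voice's language."""
    try:
        return PHONETIC_SCRIPTS[voice_language]
    except KeyError:
        # No silent fallback: a script written for another language
        # would be read with the wrong linguistic instincts.
        raise ValueError(
            f"No phonetic script written for voice language {voice_language!r}"
        ) from None


print(script_for("en"))
```

Raising an error for an unknown language is deliberate: each new voice needs its own rewrite rather than a borrowed one.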

From text to vocal score

At a certain point, the Latin text became a vocal score.

The original line was:

O miser ego.
Arcana est prophetia.

Loquitur de muliere inepta,
falso flava,
quae rosam sacram renasci faciet
ex calce crepidae suae,
colore ignoto deis.

For an Italian voice steered toward classical pronunciation, it became:

[voce profonda e risonante] [malinconico] [disperato] [tragico]

O mìser ègo.

Arkàna est profètia.

Lòkuitur de mulière inèpta,

fàlso flàua,

kuàe ròsam sàkram renàski fàkiet

eks kàlke krèpidae sùae,

kolòre ig nòto dèis.

This is not “correct Latin” in a written sense. It is a functional phonetic script.

It is designed to make the synthetic performer say the line correctly.

In this process, writing becomes a hybrid object: part poetry, part phonetics, part theatre direction, part sound design.

Directing the chorus

The same method had to change when working with the chorus.

The coryphaeus needed tragedy, rage, despair, and prophecy. But the chorus needed something else: a solemn, collective, almost emotionless chant.

So the tags changed:

[ancient tragedy chorus] [loud solemn declamation] [monotone chant] [emotionless voices in unison] [slow measured cadence] [formal ritual recitation]

This was crucial. If the tags were too emotional, the chorus became expressive in a modern way. But a tragic chorus should not sound like individual psychology. It should sound like a law being pronounced.
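A simple check can help keep expressive wording out of the chorus direction. The sketch below is only an illustration: `EXPRESSIVE_WORDS` is a hypothetical word list I drew from the coryphaeus tags, and `expressive_tags` is a hypothetical helper. Neither is an ElevenLabs feature, and the list would need tuning by ear.

```python
# Hypothetical word list: expressive markers taken from the coryphaeus tags.
EXPRESSIVE_WORDS = {"tragic", "broken", "desperate", "melancholic"}

CHORUS_TAGS = [
    "ancient tragedy chorus",
    "loud solemn declamation",
    "monotone chant",
    "emotionless voices in unison",
    "slow measured cadence",
    "formal ritual recitation",
]


def expressive_tags(tags: list[str]) -> list[str]:
    """Return the tags that contain at least one expressive word."""
    return [tag for tag in tags if EXPRESSIVE_WORDS & set(tag.lower().split())]


print(expressive_tags(CHORUS_TAGS))
# → []
print(expressive_tags(["tragic", "voice broken but powerful", "slow ritual cadence"]))
# → ['tragic', 'voice broken but powerful']
```

The check works on whole words, so "tragedy" in the chorus tags passes while "tragic" in the coryphaeus tags is flagged.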

Again, ElevenLabs allowed me to work not only on the sound of the words, but on the dramaturgical function of the voice.

For this episode of THE GALLERY, ElevenLabs was not just a tool for text-to-speech. It became part of the performance system itself.

Through Voice Design, I created the vocal body of the coryphaeus. Through text-to-speech tags, I directed its emotional and theatrical behavior. Through phonetic rewriting, I taught the voice how to approach Latin pronunciation in a way that felt archaic, solemn, and dramatically alive.

The final result is not simply a synthetic voice speaking Latin.

It is a constructed vocal presence: tragic, artificial, ritualistic, and deeply performative.

And perhaps this is what interests me most about working with AI voices today. The question is no longer only: Can the machine speak?

The real question is:

Can we teach the machine how to perform?

Try ElevenLabs: https://try.elevenlabs.io/acm9zmq8oxbl