ext-whisper

Local speech-to-text for PHP, in-process. ext-whisper is a PHP 8.3+ extension that loads a whisper.cpp model and transcribes audio inside the PHP process — no Python sidecar, no remote API, no audio leaving the box.

use Displace\Whisper\Model;

$model  = Model::load('models/ggml-tiny.en.bin');
$result = $model->transcribe('meeting.wav');

echo $result->text();

foreach ($result->segments() as $segment) {
    printf("[%6.2fs → %6.2fs] %s\n", $segment['start'], $segment['end'], $segment['text']);
}

Written in Rust on top of ext-php-rs and the whisper-rs bindings.

Part of a stack

ext-whisper is the ingest stage of the Displace local-first AI stack: transcribe with ext-whisper, embed with ext-infer, index and search with ext-turbovec — a complete audio-archive semantic-search pipeline with zero services.

Deliberately out of scope (v0.1)

Audio decoding — input is 16kHz mono 16-bit PCM WAV, full stop. See Preparing audio for the one-line ffmpeg conversion. Decoding (mp3/m4a/ogg) is a candidate for v0.2.
Streaming / realtime transcription — file in, transcription out.
Speaker diarization, word-level timestamps, GPU-default builds — later, if the scope test passes.
Windows — out of scope platform-wide until someone funds it.

Installation

Via PIE (recommended)

Once tagged releases are published, prebuilt binaries install with PIE on macOS arm64, Linux x86_64, and Linux arm64, for PHP 8.3 / 8.4 / 8.5:

php pie.phar install displace/ext-whisper

From source

Requirements: Rust (the version pinned in rust-toolchain.toml installs automatically), cmake, and a C/C++ toolchain.

git clone https://github.com/DisplaceTech/ext-whisper
cd ext-whisper
make build                 # target/debug/libwhisper.{so,dylib}
# On macOS, point extension= at libwhisper.dylib instead.
php -d extension=$PWD/target/debug/libwhisper.so \
    -r 'var_dump(extension_loaded("whisper"));'

Get a model

Models come from the whisper.cpp zoo on Hugging Face:

mkdir -p models
curl -L -o models/ggml-tiny.en.bin \
  https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en.bin

Model	Size	Notes
`tiny.en`	75MB	English-only; fast; fine for clean speech
`base.en`	142MB	Noticeably better punctuation/names
`small`	466MB	Multilingual; the quality sweet spot on CPU

Verify

php -d extension=$PWD/target/debug/libwhisper.so examples/transcribe.php \
    models/ggml-tiny.en.bin tests/fixtures/jfk.wav

Expected output starts with “And so my fellow Americans…”.

Transcription

The shape of a result

Model::transcribe() returns a Transcription:

$result = $model->transcribe('episode.wav');

$result->text();       // string — the full transcript
$result->segments();   // list<array{start: float, end: float, text: string}>
$result->count();      // int — number of segments
$result->duration();   // float — end of the last segment, in seconds

Segment offsets are seconds (floats). The concatenated segment texts equal text() modulo whitespace, so you can store segments and derive the transcript, or vice versa.
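Because every segment carries its own offsets, the rows convert to subtitle formats with very little code. The sketch below is illustrative only: segmentsToText(), srtTimestamp(), and segmentsToSrt() are hypothetical userland helpers, not part of ext-whisper, and they assume nothing beyond the segment row shape documented above.

```php
<?php
declare(strict_types=1);

/** Hypothetical helper: rebuild the transcript from segment rows. */
function segmentsToText(array $segments): string
{
    $joined = implode(' ', array_column($segments, 'text'));
    // Collapse runs of whitespace so the result is comparable to text().
    return trim(preg_replace('/\s+/', ' ', $joined) ?? $joined);
}

/** Hypothetical helper: format seconds as an SRT timestamp (HH:MM:SS,mmm). */
function srtTimestamp(float $seconds): string
{
    $ms = (int) round($seconds * 1000);
    return sprintf(
        '%02d:%02d:%02d,%03d',
        intdiv($ms, 3_600_000),
        intdiv($ms, 60_000) % 60,
        intdiv($ms, 1000) % 60,
        $ms % 1000,
    );
}

/** Hypothetical helper: render segment rows as SubRip (.srt) cues. */
function segmentsToSrt(array $segments): string
{
    $cues = [];
    foreach (array_values($segments) as $i => $s) {
        $cues[] = sprintf(
            "%d\n%s --> %s\n%s\n",
            $i + 1,                       // SRT cue numbers start at 1
            srtTimestamp($s['start']),
            srtTimestamp($s['end']),
            trim($s['text']),
        );
    }
    // A blank line separates consecutive cues.
    return implode("\n", $cues);
}
```

With a real result, file_put_contents('episode.srt', segmentsToSrt($result->segments())) would write the subtitle file.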

The segment row shape is deliberately identical to Displace\AI\Contracts\Transcriber’s documented return — a contracts adapter is two lines:

public function transcribe(string $audioPath, array $options = []): array
{
    $t = $this->model->transcribe($audioPath, $options);
    return ['text' => $t->text(), 'segments' => $t->segments()];
}

Options

$model->transcribe('interview.wav', [
    'language'  => 'de',    // ISO 639-1 hint; omit for auto-detect
    'translate' => true,    // translate to English (multilingual models)
    'threads'   => 4,       // pin the decoder thread count
]);

Unknown keys are ignored (forward compatibility); present-but-wrong types throw with the key named.
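Options often arrive from JSON or environment variables, where 'threads' => '4' (a string) is an easy mistake. A userland pre-check can surface that before the call. This is a minimal sketch, not the extension's own validation: checkTranscribeOptions() is a hypothetical helper, the expected types are taken from the API surface listing (language string, translate bool, threads int), and the InvalidArgumentException it throws is this sketch's choice, since the exception class the extension uses for this case is not stated here.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical userland pre-check mirroring the documented rule:
 * unknown keys pass through untouched, known keys must have the right type.
 */
function checkTranscribeOptions(array $options): array
{
    // Expected types as listed for Model::transcribe().
    $expected = ['language' => 'string', 'translate' => 'bool', 'threads' => 'int'];

    foreach ($expected as $key => $type) {
        if (!array_key_exists($key, $options)) {
            continue; // absent keys are left to the extension's defaults
        }
        $actual = get_debug_type($options[$key]);
        if ($actual !== $type) {
            // Name the offending key, as the extension's own errors do.
            throw new InvalidArgumentException(
                sprintf('option "%s" must be %s, got %s', $key, $type, $actual)
            );
        }
    }
    return $options;
}
```

Usage would look like $model->transcribe($wav, checkTranscribeOptions($config)).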

Deployment notes

A loaded model is resident memory (75MB–500MB depending on size). The FPM guidance from ext-infer applies verbatim: load in CLI tools, queue workers, and daemons — not per-FPM-worker.
transcribe() is synchronous and CPU-bound; budget roughly 0.1–0.5× the audio's duration on modern cores with tiny/base models (a 60-minute recording takes about 6 to 30 minutes).
One Model handle is safe to share across threads: each call builds its own whisper state and no mutable state is shared.
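The notes above boil down to "pay the load cost once, then reuse the handle". A minimal sketch of a worker body that follows that pattern is shown here; transcribeBatch() is a hypothetical helper, and $model is typed as object only so the sketch runs without the extension installed.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical worker body: one already-loaded model, many files.
 * In real use $model would be a Displace\Whisper\Model.
 *
 * @param iterable<string> $wavPaths
 * @return array<string, string> transcript text keyed by path
 */
function transcribeBatch(object $model, iterable $wavPaths, array $options = []): array
{
    $texts = [];
    foreach ($wavPaths as $path) {
        // transcribe() is synchronous: each call blocks until the file is done.
        $texts[$path] = $model->transcribe($path, $options)->text();
    }
    return $texts;
}

// Real use (requires ext-whisper), kept as a comment so the sketch stays runnable:
// $model = \Displace\Whisper\Model::load('models/ggml-tiny.en.bin');
// $texts = transcribeBatch($model, glob('inbox/*.wav'));
// $model->close();
```

The model is loaded outside the loop on purpose; loading inside it would repeat the resident-memory cost for every file.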

The pipeline this feeds

Transcribe → chunk (ai-toolkit’s SentenceChunker fits transcripts well) → embed (ext-infer) → index (ext-turbovec): searchable audio archives, entirely on your hardware.
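To make the chunk step concrete, here is a minimal stand-in that groups consecutive segments under a size budget while keeping the start and end offsets, so a search hit can be mapped back to a position in the audio. It is not ai-toolkit's SentenceChunker (whose API is not shown here): chunkSegments() and the 500-byte default are assumptions made for illustration.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: merge consecutive segments into chunks of at most
 * $maxBytes bytes of text. A single oversized segment becomes its own chunk.
 *
 * @param list<array{start: float, end: float, text: string}> $segments
 * @return list<array{start: float, end: float, text: string}>
 */
function chunkSegments(array $segments, int $maxBytes = 500): array
{
    $chunks  = [];
    $current = null;

    foreach ($segments as $seg) {
        $text = trim($seg['text']);
        if ($text === '') {
            continue; // skip empty segments
        }
        if ($current !== null && strlen($current['text']) + 1 + strlen($text) <= $maxBytes) {
            // Still within budget: extend the open chunk and move its end offset.
            $current['text'] .= ' ' . $text;
            $current['end']   = $seg['end'];
            continue;
        }
        if ($current !== null) {
            $chunks[] = $current; // budget exceeded: close the open chunk
        }
        $current = ['start' => $seg['start'], 'end' => $seg['end'], 'text' => $text];
    }
    if ($current !== null) {
        $chunks[] = $current;
    }
    return $chunks;
}
```

Each returned row has the same shape as a segment, so whatever consumes segments can consume chunks unchanged.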

Preparing audio

ext-whisper accepts exactly one input shape: 16kHz, mono, 16-bit PCM WAV. That matches the 16kHz mono sample stream whisper.cpp works on, so the extension never has to resample or downmix.

Everything else converts in one line:

ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le out.wav

That line handles mp3, m4a, ogg, flac, mp4/mkv audio tracks, other WAV flavors — anything ffmpeg reads. From PHP:

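$command = sprintf(
    // -y overwrites an existing output file; 2>&1 folds ffmpeg's log into $output
    'ffmpeg -y -i %s -ar 16000 -ac 1 -c:a pcm_s16le %s 2>&1',
    escapeshellarg($input),
    escapeshellarg($wav),
);
exec($command, $output, $exit);

if ($exit !== 0) {
    throw new RuntimeException("ffmpeg failed:\n" . implode("\n", $output));
}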

Why so strict?

Two deliberate reasons (this is a v0.1 scope decision, not a limitation we forgot to fix):

Resampling is a quality decision the caller should own. A silent internal resample hides the filter choice, dithering, and channel-mix decisions that affect transcription quality. ffmpeg does this better than we would, with flags you control.
Decoding is a dependency cliff. mp3/m4a/ogg support drags in a codec stack several times larger than whisper itself. It’s a candidate for v0.2 (via symphonia), gated on the same scope test as everything else.

Failure messages carry the fix

A non-conforming input never fails with a generic error — the message names what’s wrong and the command that fixes it:

AudioException: invalid audio input: podcast.wav: expected a 16000Hz
sample rate, got 44100Hz — convert with: ffmpeg -i input.ext -ar 16000
-ac 1 -c:a pcm_s16le out.wav
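When a batch job would rather decide up front which files need conversion, the WAV header can be inspected in plain PHP before calling transcribe(). This is a sketch under stated assumptions: wavFormat() and isWhisperReady() are hypothetical helpers, they read only the standard RIFF "fmt " chunk, and the extension's own validation may differ in details (for example in how it treats WAVE_FORMAT_EXTENSIBLE files, which this check rejects).

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: read the "fmt " chunk of a RIFF/WAVE file.
 *
 * @return array{format: int, channels: int, rate: int, bits: int}|null
 *         null when the file is missing or not a parseable WAV
 */
function wavFormat(string $path): ?array
{
    $fh = @fopen($path, 'rb');
    if ($fh === false) {
        return null;
    }
    try {
        $riff = (string) fread($fh, 12);
        if (strlen($riff) < 12 || substr($riff, 0, 4) !== 'RIFF' || substr($riff, 8, 4) !== 'WAVE') {
            return null;
        }
        // Walk the chunk list; other chunks (e.g. LIST) may precede "fmt ".
        while (strlen($head = (string) fread($fh, 8)) === 8) {
            $size = unpack('V', substr($head, 4, 4))[1];
            if (substr($head, 0, 4) === 'fmt ') {
                $body = (string) fread($fh, 16);
                if (strlen($body) < 16) {
                    return null;
                }
                $f = unpack('vformat/vchannels/Vrate/Vbyterate/valign/vbits', $body);
                return [
                    'format'   => $f['format'],
                    'channels' => $f['channels'],
                    'rate'     => $f['rate'],
                    'bits'     => $f['bits'],
                ];
            }
            // Chunks are word-aligned: an odd size is followed by one pad byte.
            fseek($fh, $size + ($size & 1), SEEK_CUR);
        }
        return null;
    } finally {
        fclose($fh);
    }
}

/** True for 16kHz, mono, 16-bit integer PCM (WAVE format tag 1). */
function isWhisperReady(string $path): bool
{
    $f = wavFormat($path);
    return $f !== null
        && $f['format'] === 1 && $f['channels'] === 1
        && $f['rate'] === 16000 && $f['bits'] === 16;
}
```

Files that fail isWhisperReady() can be routed through the ffmpeg one-liner first; the exception message remains the authoritative diagnosis.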

API surface

The complete public PHP API. For an authoritative copy in PHP-stub form (consumed by IDEs and static analyzers), see stubs/whisper.stubs.php.

`Displace\Whisper\Model`

final class Model
{
    public static function load(
        string $path,
        array  $options = [],   // use_gpu (bool, default false)
    ): self;

    public function transcribe(
        string $wavPath,
        array  $options = [],   // language (string), translate (bool), threads (int)
    ): \Displace\Whisper\Transcription;

    public function close(): void;   // idempotent
}

`Displace\Whisper\Transcription`

final class Transcription
{
    public function text(): string;

    /** @return list<array{start: float, end: float, text: string}> */
    public function segments(): array;

    public function count(): int;
    public function duration(): float;   // seconds
}

Read-only; constructed only by Model::transcribe(). Offsets in seconds.

Exception hierarchy

\RuntimeException
└── Displace\Whisper\WhisperException
    ├── Displace\Whisper\ModelLoadException      // load() failures
    ├── Displace\Whisper\AudioException          // bad/unsupported WAV input
    └── Displace\Whisper\TranscriptionException  // whisper.cpp failures, use-after-close
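Because the classes form a single tree, catch blocks can go from specific to general. The sketch below shows one way to do that; tryTranscribe() is a hypothetical wrapper, and $model is typed as object only so the example runs without the extension loaded.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical wrapper: transcribe one file and turn the documented
 * exceptions into a result array instead of letting them escape.
 *
 * @return array{ok: bool, text: ?string, error: ?string}
 */
function tryTranscribe(object $model, string $wavPath): array
{
    try {
        $text = $model->transcribe($wavPath)->text();
        return ['ok' => true, 'text' => $text, 'error' => null];
    } catch (\Displace\Whisper\AudioException $e) {
        // Most specific first: the message already names the ffmpeg fix.
        return ['ok' => false, 'text' => null, 'error' => 'bad input: ' . $e->getMessage()];
    } catch (\Displace\Whisper\WhisperException $e) {
        // Base class: TranscriptionException and anything else from the extension.
        return ['ok' => false, 'text' => null, 'error' => $e->getMessage()];
    }
    // Exceptions outside the hierarchy are deliberately left to propagate.
}
```

Catching WhisperException last keeps the handler forward compatible with any subclass added under the same base.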

Environment variables

Variable	Effect
`EXT_WHISPER_LOG=1`	Restore whisper.cpp’s verbose stderr logging (silenced by default).

Conventions

Direct construction is refused on Model and Transcription — each throws WhisperException pointing at the right factory.
Unknown option keys are ignored; present-but-wrong-typed keys throw with the key named.
Audio-format errors always embed the ffmpeg one-liner that produces a conforming file.
