ext-whisper
Local speech-to-text for PHP, in-process. ext-whisper is a PHP
8.3+ extension that loads a whisper.cpp
model and transcribes audio inside the PHP process — no Python
sidecar, no remote API, no audio leaving the box.
use Displace\Whisper\Model;
$model = Model::load('models/ggml-tiny.en.bin');
$result = $model->transcribe('meeting.wav');
echo $result->text();
foreach ($result->segments() as $segment) {
    printf("[%6.2fs → %6.2fs] %s\n", $segment['start'], $segment['end'], $segment['text']);
}
Written in Rust on top of ext-php-rs
and the whisper-rs bindings.
Part of a stack
ext-whisper is the ingest stage of the Displace local-first AI stack: transcribe with ext-whisper, embed with ext-infer, index and search with ext-turbovec — a complete audio-archive semantic-search pipeline with zero services.
Deliberately out of scope (v0.1)
- Audio decoding — input is 16kHz mono 16-bit PCM WAV, full stop. See Preparing audio for the one-line ffmpeg conversion. Decoding (mp3/m4a/ogg) is a candidate for v0.2.
- Streaming / realtime transcription — file in, transcription out.
- Speaker diarization, word-level timestamps, GPU-default builds — later, if the scope test passes.
- Windows — out of scope platform-wide until someone funds it.
Installation
Via PIE (recommended)
Once tagged releases are published, prebuilt binaries install with PIE on macOS arm64, Linux x86_64, and Linux arm64, for PHP 8.3 / 8.4 / 8.5:
php pie.phar install displace/ext-whisper
From source
Requirements: Rust (the version pinned in rust-toolchain.toml
installs automatically), cmake, and a C/C++ toolchain.
git clone https://github.com/DisplaceTech/ext-whisper
cd ext-whisper
make build # target/debug/libwhisper.{so,dylib}
php -d extension=$PWD/target/debug/libwhisper.so \
-r 'var_dump(extension_loaded("whisper"));'
Get a model
Models come from the whisper.cpp zoo on Hugging Face:
mkdir -p models
curl -L -o models/ggml-tiny.en.bin \
https://huggingface.co/ggerganov/whisper.cpp/resolve/main/ggml-tiny.en.bin
| Model | Size | Notes |
|---|---|---|
| tiny.en | 75MB | English-only; fast; fine for clean speech |
| base.en | 142MB | Noticeably better punctuation/names |
| small | 466MB | Multilingual; the quality sweet spot on CPU |
Verify
php -d extension=$PWD/target/debug/libwhisper.so examples/transcribe.php \
models/ggml-tiny.en.bin tests/fixtures/jfk.wav
Expected output starts with “And so my fellow Americans…”.
Transcription
The shape of a result
Model::transcribe() returns a Transcription:
$result = $model->transcribe('episode.wav');
$result->text(); // string — the full transcript
$result->segments(); // list<array{start: float, end: float, text: string}>
$result->count(); // int — number of segments
$result->duration(); // float — end of the last segment, in seconds
Segment offsets are seconds (floats). The concatenated segment
texts equal text() modulo whitespace, so you can store segments and
derive the transcript, or vice versa.
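Because each segment row carries its own start and end offsets in seconds, stored segments can be turned into subtitles without re-running the model. The sketch below renders rows of that shape as SubRip (SRT) cues. The helper names `srtTimestamp()` and `segmentsToSrt()` are hypothetical and not part of ext-whisper; only the row shape comes from the description above.

```php
<?php
/** Format seconds as an SRT timestamp, e.g. 3.5 becomes "00:00:03,500". */
function srtTimestamp(float $seconds): string
{
    $ms = (int) round($seconds * 1000);

    return sprintf(
        '%02d:%02d:%02d,%03d',
        intdiv($ms, 3_600_000),
        intdiv($ms, 60_000) % 60,
        intdiv($ms, 1000) % 60,
        $ms % 1000,
    );
}

/**
 * Render segment rows as SRT cues (index, time range, text).
 *
 * @param list<array{start: float, end: float, text: string}> $segments
 */
function segmentsToSrt(array $segments): string
{
    $cues = [];
    foreach ($segments as $i => $segment) {
        $cues[] = sprintf(
            "%d\n%s --> %s\n%s\n",
            $i + 1, // SRT cue numbers start at 1
            srtTimestamp($segment['start']),
            srtTimestamp($segment['end']),
            trim($segment['text']),
        );
    }

    // Cues are separated by a blank line.
    return implode("\n", $cues);
}
```

With the extension loaded, `file_put_contents('episode.srt', segmentsToSrt($result->segments()));` would write a subtitle file next to the audio.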
The segment row shape is deliberately identical to
Displace\AI\Contracts\Transcriber’s
documented return — a contracts adapter is two lines:
public function transcribe(string $audioPath, array $options = []): array
{
    $t = $this->model->transcribe($audioPath, $options);

    return ['text' => $t->text(), 'segments' => $t->segments()];
}
Options
$model->transcribe('interview.wav', [
'language' => 'de', // ISO 639-1 hint; omit for auto-detect
'translate' => true, // translate to English (multilingual models)
'threads' => 4, // pin the decoder thread count
]);
Unknown keys are ignored (forward compatibility); present-but-wrong types throw with the key named.
Deployment notes
- A loaded model is resident memory (75MB–500MB depending on size). The FPM guidance from ext-infer applies verbatim: load in CLI tools, queue workers, and daemons — not per-FPM-worker.
- transcribe() is synchronous and CPU-bound; budget roughly real-time × 0.1–0.5 on modern cores with tiny/base models.
- One Model handle is safe to share across threads: each call builds its own whisper state and no mutable state is shared.
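As a worked example of the real-time factor quoted in the notes above, the sketch below turns an audio duration into a rough wall-clock range, which can be handy when choosing a queue-job timeout. `transcriptionBudget()` is a hypothetical helper, and the 0.1 and 0.5 defaults are simply the range from the note, not measured values.

```php
<?php
/**
 * Rough wall-clock budget for one transcribe() call, using the
 * 0.1-0.5 x real-time range quoted for tiny/base models.
 * Hypothetical helper; measure on your own hardware before relying on it.
 *
 * @return array{min: float, max: float} seconds
 */
function transcriptionBudget(float $audioSeconds, float $low = 0.1, float $high = 0.5): array
{
    return [
        'min' => $audioSeconds * $low,
        'max' => $audioSeconds * $high,
    ];
}

$budget = transcriptionBudget(3600.0); // a one-hour recording
printf("expect roughly %.0f to %.0f minutes\n", $budget['min'] / 60, $budget['max'] / 60);
```

For a one-hour recording this works out to roughly 6 to 30 minutes of CPU-bound work, which is why the call belongs in a worker rather than a web request.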
The pipeline this feeds
Transcribe → chunk (ai-toolkit’s
SentenceChunker fits transcripts well) → embed
(ext-infer) → index
(ext-turbovec):
searchable audio archives, entirely on your hardware.
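The chunk step can be sketched in plain PHP. This is not ai-toolkit's SentenceChunker, whose API is not shown here; it is a simpler stand-in that greedily groups consecutive segments up to a size budget and keeps the start and end offsets, so a search hit can link back to a position in the audio. `chunkSegments()` and its `$maxBytes` parameter are hypothetical.

```php
<?php
/**
 * Greedily merge consecutive segments into chunks of at most $maxBytes
 * bytes of text, preserving the time range each chunk covers.
 * A single segment longer than the budget becomes its own chunk.
 *
 * @param list<array{start: float, end: float, text: string}> $segments
 * @return list<array{start: float, end: float, text: string}>
 */
function chunkSegments(array $segments, int $maxBytes = 800): array
{
    $chunks = [];
    $current = null;

    foreach ($segments as $segment) {
        $text = trim($segment['text']);
        if ($text === '') {
            continue; // skip empty segments
        }

        // Flush the open chunk if adding this segment would exceed the budget.
        if ($current !== null && strlen($current['text']) + 1 + strlen($text) > $maxBytes) {
            $chunks[] = $current;
            $current = null;
        }

        if ($current === null) {
            $current = ['start' => $segment['start'], 'end' => $segment['end'], 'text' => $text];
        } else {
            $current['end'] = $segment['end'];
            $current['text'] .= ' ' . $text;
        }
    }

    if ($current !== null) {
        $chunks[] = $current;
    }

    return $chunks;
}
```

The budget is counted in bytes with `strlen()` to avoid assuming the mbstring extension; for multilingual transcripts a character-based count may be a better fit.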
Preparing audio
ext-whisper accepts exactly one input shape: 16kHz, mono, 16-bit PCM WAV — the format whisper.cpp’s encoder consumes natively.
Everything else converts in one line:
ffmpeg -i input.mp3 -ar 16000 -ac 1 -c:a pcm_s16le out.wav
That line handles mp3, m4a, ogg, flac, mp4/mkv audio tracks, other WAV flavors — anything ffmpeg reads. From PHP:
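$command = sprintf(
    'ffmpeg -y -i %s -ar 16000 -ac 1 -c:a pcm_s16le %s 2>&1',
    escapeshellarg($input),
    escapeshellarg($wav),
);
exec($command, $output, $exit);
if ($exit !== 0) {
    // ffmpeg's diagnostics land in $output because of the 2>&1 redirect.
    throw new \RuntimeException("ffmpeg failed:\n" . implode("\n", $output));
}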
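To decide whether a file needs that conversion at all, the WAV header can be inspected from PHP before calling transcribe(). The following is a minimal sketch that reads the RIFF `fmt ` chunk and checks it against the 16kHz mono 16-bit PCM shape. `wavMismatch()` is a hypothetical helper, not part of ext-whisper, and the extension's own validation may be stricter or more lenient (for instance, this sketch rejects WAVE_FORMAT_EXTENSIBLE files outright).

```php
<?php
/**
 * Pre-flight check: returns null when $path looks like 16 kHz mono
 * 16-bit PCM WAV, otherwise a short description of the first mismatch.
 * Hypothetical helper, not part of ext-whisper.
 */
function wavMismatch(string $path): ?string
{
    $fh = @fopen($path, 'rb');
    if ($fh === false) {
        return 'cannot open file';
    }

    try {
        $riff = (string) fread($fh, 12);
        if (strlen($riff) < 12 || substr($riff, 0, 4) !== 'RIFF' || substr($riff, 8, 4) !== 'WAVE') {
            return 'not a RIFF/WAVE file';
        }

        // Walk the chunk list until the "fmt " chunk turns up.
        while (strlen($head = (string) fread($fh, 8)) === 8) {
            $size = unpack('V', substr($head, 4))[1];
            if (substr($head, 0, 4) !== 'fmt ') {
                fseek($fh, $size + ($size & 1), SEEK_CUR); // chunks are word-aligned
                continue;
            }

            $raw = (string) fread($fh, 16);
            if (strlen($raw) < 16) {
                return 'truncated fmt chunk';
            }
            // Little-endian fields of the fmt chunk, in file order.
            $fmt = unpack('vformat/vchannels/Vrate/Vbyterate/valign/vbits', $raw);

            if ($fmt['format'] !== 1) {
                return "expected PCM (format tag 1), got format tag {$fmt['format']}";
            }
            if ($fmt['rate'] !== 16000) {
                return "expected a 16000Hz sample rate, got {$fmt['rate']}Hz";
            }
            if ($fmt['channels'] !== 1) {
                return "expected 1 channel, got {$fmt['channels']}";
            }
            if ($fmt['bits'] !== 16) {
                return "expected 16-bit samples, got {$fmt['bits']}-bit";
            }

            return null;
        }

        return 'no fmt chunk found';
    } finally {
        fclose($fh);
    }
}
```

A caller could run the ffmpeg conversion only when `wavMismatch($input)` returns a non-null reason, and pass conforming files straight through.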
Why so strict?
Two deliberate reasons (this is a v0.1 scope decision, not a limitation we forgot to fix):
- Resampling is a quality decision the caller should own. A silent internal resample hides the filter choice, dithering, and channel-mix decisions that affect transcription quality. ffmpeg does this better than we would, with flags you control.
- Decoding is a dependency cliff. mp3/m4a/ogg support drags in a
codec stack several times larger than whisper itself. It’s a
candidate for v0.2 (via
symphonia), gated on the same scope test as everything else.
Failure messages carry the fix
A non-conforming input never fails with a generic error — the message names what’s wrong and the command that fixes it:
AudioException: invalid audio input: podcast.wav: expected a 16000Hz
sample rate, got 44100Hz — convert with: ffmpeg -i input.ext -ar 16000
-ac 1 -c:a pcm_s16le out.wav
API surface
The complete public PHP API. For an authoritative copy in PHP-stub
form (consumed by IDEs and static analyzers), see
stubs/whisper.stubs.php.
Displace\Whisper\Model
final class Model
{
    public static function load(
        string $path,
        array $options = [], // use_gpu (bool, default false)
    ): self;

    public function transcribe(
        string $wavPath,
        array $options = [], // language (string), translate (bool), threads (int)
    ): \Displace\Whisper\Transcription;

    public function close(): void; // idempotent
}
Displace\Whisper\Transcription
final class Transcription
{
    public function text(): string;

    /** @return list<array{start: float, end: float, text: string}> */
    public function segments(): array;

    public function count(): int;

    public function duration(): float; // seconds
}
Read-only; constructed only by Model::transcribe(). Offsets in
seconds.
Exception hierarchy
\RuntimeException
└── Displace\Whisper\WhisperException
├── Displace\Whisper\ModelLoadException // load() failures
├── Displace\Whisper\AudioException // bad/unsupported WAV input
└── Displace\Whisper\TranscriptionException // whisper.cpp failures, use-after-close
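Since every extension exception descends from WhisperException, callers can catch the specific subclasses first and fall back to the base class last. The sketch below shows one way to arrange that, using only the classes and methods listed in this section. The wrapper `transcribeOrExplain()` and its return strings are hypothetical, and the early return keeps the snippet loadable on a PHP build without the extension.

```php
<?php
use Displace\Whisper\{AudioException, Model, ModelLoadException, WhisperException};

/** Transcribe a file, or return a short explanation of what went wrong. */
function transcribeOrExplain(string $modelPath, string $wavPath): string
{
    if (!extension_loaded('whisper')) {
        return 'ext-whisper is not loaded';
    }

    try {
        $model = Model::load($modelPath);
        try {
            return $model->transcribe($wavPath)->text();
        } finally {
            $model->close(); // documented as idempotent, so always safe to call
        }
    } catch (ModelLoadException $e) {
        return 'model problem: ' . $e->getMessage();
    } catch (AudioException $e) {
        // The message is documented to include the ffmpeg command that fixes the input.
        return 'audio problem: ' . $e->getMessage();
    } catch (WhisperException $e) {
        // Base class last: covers TranscriptionException and anything added later.
        return 'transcription problem: ' . $e->getMessage();
    }
}
```

Ordering matters here: placing the WhisperException branch first would swallow the more specific cases.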
Environment variables
| Variable | Effect |
|---|---|
| EXT_WHISPER_LOG=1 | Restore whisper.cpp’s verbose stderr logging (silenced by default). |
Conventions
- Direct construction is refused on Model and Transcription: each throws WhisperException pointing at the right factory.
- Unknown option keys are ignored; present-but-wrong-typed keys throw with the key named.
- Audio-format errors always embed the ffmpeg one-liner that produces a conforming file.