Text embedding inference on a cheap server

Embedding models have made decent progress in the last few years. I was curious to see the kind of inference load now doable on a cheap CPU-only machine.

I ran some tests on my Netcup RS-2000-G9.5 VPS, a 10.75 €/month (≈ 12.6 USD) machine with 6 cores of an AMD EPYC 7702P CPU; six-year-old hardware, but capable enough. For comparison, at around that price you get a t4g.small on AWS (2 Graviton2 vCPUs) or an e2-small on GCP (2 vCPUs).

I picked two sizes of IBM's Granite embedding models and the recently released EmbeddingGemma from Google: three models that performed impressively well for their size on an English & Japanese retrieval test I ran recently.

Benchmarking was done with llama.cpp for inference and this fork of wrk for load testing:

# Download (if needed) and serve the embedding model
llama-server --embeddings -c 512 --mlock --log-disable -hf bartowski/granite-embedding-107m-multilingual-GGUF:Q8_0

# Test for 60s at 300 rps, single thread and connection.
wrk -t1 -c1 -s short.lua -R300 -d60s --latency 'http://localhost:8080/embedding'
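
Before pointing wrk at the server, it helps to confirm with a one-off request that the endpoint answers on the same path and body the load test uses. The exact response shape varies across llama.cpp versions, so this just prints the first few hundred bytes of the JSON:

# One-off request against the same endpoint and body used in the load test
curl -s 'http://localhost:8080/embedding' \
  -H 'Content-Type: application/json' \
  -d '{"content": "a very short query"}' | head -c 300; echo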

Here is the content of short.lua, used for the short test query. paragraph.lua is similar, with the three stanzas of Longfellow's The Arrow and the Song as content; a sketch of it follows the short.lua listing:

wrk.method = "POST"
wrk.body = "{\"content\": \"a very short query\"}"
wrk.headers["Content-Type"] = "application/json"
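
For reference, paragraph.lua looks roughly like this; the poem text below is the standard Longfellow text, so the exact punctuation and escaping may differ slightly from what I actually used:

-- paragraph.lua: same request shape as short.lua, with a longer input.
local poem = [[I shot an arrow into the air,
It fell to earth, I knew not where;
For, so swiftly it flew, the sight
Could not follow it in its flight.

I breathed a song into the air,
It fell to earth, I knew not where;
For who has sight so keen and strong,
That it can follow the flight of song?

Long, long afterward, in an oak
I found the arrow, still unbroke;
And the song, from beginning to end,
I found again in the heart of a friend.]]

-- Minimal JSON escaping: the poem contains no quotes or backslashes,
-- so only the newlines need to be turned into \n sequences.
local escaped = poem:gsub("\n", "\\n")

wrk.method = "POST"
wrk.body = '{"content": "' .. escaped .. '"}'
wrk.headers["Content-Type"] = "application/json"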

The maximum requests per second (RPS) was found by iteratively increasing the rate until latency exploded, then backing off to slightly below that point, along the lines of the sweep sketched below. Latency percentiles were measured at that max RPS.
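
A minimal sketch of that sweep; the rate values are illustrative, and the grep pattern assumes the percentile labels printed by wrk's --latency output:

# Sweep the target rate and watch where p99 latency blows up
for rate in 50 100 150 200 250 300 350; do
  echo "=== ${rate} rps ==="
  wrk -t1 -c1 -s short.lua -R"${rate}" -d60s --latency 'http://localhost:8080/embedding' \
    | grep -E ' 50\.000%| 99\.000%|Requests/sec'
done
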
Here are the results:

Model "short" max RPS p50 (ms) p99 (ms) "paragraph" max RPS p50 (ms) p99 (ms)
granite-embedding-107m-multilingual-Q8_0 300 2.9 4.2 35 24.4 32.7
granite-embedding-278m-multilingual-Q5_K_M 70 12.9 17.3 4 198.9 259.0
embeddinggemma-300M-Q8_0 65 15.7 162.3 4 193.9 210.7

The larger models can only handle a small load on this machine, but the 107M-parameter one can already sustain about 50 rps/core on short queries and about 6 rps/core on more substantial chunks of text. Not bad!

Running the benchmark for 30min produced the same latencies.

To improve performance further, I'd next try recent static embedding models: Potion-multilingual-128M had low retrieval accuracy in my test, but impressive sub-millisecond latency.
Though its model support is more limited, Hugging Face's TEI (Text Embeddings Inference) would also be interesting to compare. On an M3 MacBook Air it performed identically to llama.cpp, but it might be better tuned for server CPUs; a rough sketch of such a setup follows.
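
Something along these lines, though it is untested: the Docker image tag, the Hugging Face model id, and whether TEI supports this particular architecture are all assumptions to verify against the TEI documentation:

# Hypothetical TEI setup for the unquantized 107M Granite model.
# Check the TEI releases for a current CPU image tag before running.
docker run --rm -p 8080:80 -v "$PWD/tei-data:/data" \
  ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id ibm-granite/granite-embedding-107m-multilingual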

#devops #machine learning #programming #search