LLMs

Notes on different LLMs, their prices, and related tooling.

LLMs

| Name | Training Tokens | Context Length | Org | Price |
|------|-----------------|----------------|-----|-------|
| OLMo 1B/7B | 2.5T | 2K | AllenAI | |
| Phi-2 2B | 1.4T | 2K | Microsoft | |
| Llama2 7B/13B/70B | 2T | 4K | Meta | |
| Mistral 7B, 8x7B | 8T | 8K | Mistral | |
| Yi 6B/34B | 3T | 4K | 01.AI | |
| Gemini | | 8K, 32K | Google | |
| Gemini 1.5 Pro | | 1M | Google | |
| GPT-3.5 | | 4K, 16K | OpenAI | In: $0.0015/KT, Out: $0.0020/KT |
| GPT-4 | 13T | 8K, 32K, 128K | OpenAI | In: $0.01/KT, Out: $0.03/KT |

Price

  • Inference cost of GPT-4 128K
    • In: $0.01/KT
    • Out: $0.03/KT
    • At median usage, one user’s cost for a year is around $500.
  • Training cost of GPT-4 128K
    • $150M for 13T training tokens with 1.8T parameters

Each time I use GPT-3.5, the average input is around 0.5 KT and the average output around 1 KT, so each call costs about $0.00275, call it $0.003. A thousand calls then cost around $3.

Google claims each of its searches costs around $0.00002.

Google Search Ads CPM is around $38.40.

Google Display Ads CPM is around $3.12.

So even if GPT-3.5 introduced ads to cover its cost, it could still easily be profitable.
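To make the comparison explicit, here is a small back-of-envelope sketch. The per-request token counts and the CPM figures are just the rough numbers quoted above, not measured values.

```python
# Back-of-envelope: GPT-3.5 inference cost per request vs. ad revenue per 1,000 impressions.
# Prices are the GPT-3.5 rates from the table; token counts and CPMs are the rough figures above.

IN_PRICE_PER_KT = 0.0015    # $ per 1K input tokens
OUT_PRICE_PER_KT = 0.0020   # $ per 1K output tokens

def request_cost(input_kt: float, output_kt: float) -> float:
    """Dollar cost of one request, with token counts given in thousands (KT)."""
    return input_kt * IN_PRICE_PER_KT + output_kt * OUT_PRICE_PER_KT

per_request = request_cost(input_kt=0.5, output_kt=1.0)   # ~$0.00275
per_thousand_requests = per_request * 1000                # ~$2.75, roughly $3

SEARCH_ADS_CPM = 38.40    # $ revenue per 1,000 search ad impressions
DISPLAY_ADS_CPM = 3.12    # $ revenue per 1,000 display ad impressions

print(f"per request:          ${per_request:.5f}")
print(f"per 1,000 requests:   ${per_thousand_requests:.2f}")
print(f"search-ads margin:    ${SEARCH_ADS_CPM - per_thousand_requests:.2f} per 1,000 requests")
print(f"display-ads margin:   ${DISPLAY_ADS_CPM - per_thousand_requests:.2f} per 1,000 requests")
```

With one ad impression per request, search-ads CPM leaves a wide margin over the roughly $3 per 1,000 requests, while display-ads CPM only about breaks even.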

Terms

  • llama.cpp, a C/C++-optimized inference library for LLaMA-family (and similar) models
  • mlx, similar to llama.cpp, but targeting Apple Silicon
  • bitsandbytes, a quantization library that converts LLM weights from float to 8-bit or 4-bit so models can run faster under limited resources (see the sketch after this list)
  • safetensors, a simple format for storing tensors safely (unlike pickle) and still fast (zero-copy), written in Rust
  • GGUF, a single-file binary format used by llama.cpp; GGML is its previous version
  • TGI (text-generation-inference), a production-ready gRPC server for LLM inference; loads safetensors, supports watermarking, written in Rust
  • Candle, a tiny ML framework for fast inference, written in Rust by Hugging Face
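As an illustration of the bitsandbytes entry, here is a minimal sketch of loading a model in 4-bit through the Hugging Face transformers integration. The model id and generation settings are placeholders, and it assumes transformers, accelerate, and bitsandbytes are installed with a CUDA GPU available.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Quantize weights to 4-bit on load; compute still happens in fp16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "mistralai/Mistral-7B-v0.1"  # placeholder; any causal LM on the Hub
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # place the quantized weights on the available GPU(s)
)

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

A 7B model that needs roughly 14 GB in fp16 fits in about 4–5 GB of GPU memory at 4-bit, which is the “limited resources” point above.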
