Different LLMs
LLMs
Name | Training Tokens | Context Length | Org | Price |
---|---|---|---|---|
OLMo 1B/7B | 2.5T | 2K | AllenAI | |
Phi-2 2B | 1.4T | 2K | Microsoft | |
Llama2 7B/13B/70B | 2T | 4K | Meta | |
Mistral 7B, 8x7B | 8T | 8K | Mistral | |
Yi 6B/34B | 3T | 4K | 01.AI | |
Gemini | | 8K, 32K | Google | |
Gemini 1.5 Pro | | 1M | Google | |
GPT-3.5 | | 4K, 16K | OpenAI | In: $0.0015/KT, Out: $0.0020/KT |
GPT-4 | 13T | 8K, 32K, 128K | OpenAI | In: $0.01/KT, Out: $0.03/KT |

(KT = 1K tokens; prices are per 1K tokens.)
Price
- Inference cost of GPT-4 128K
  - In: $0.01/KT
  - Out: $0.03/KT
  - For a median user, one year of usage costs around $500 (see the sketch after this list).
- Training cost of GPT-4 128K
  - $150M for 13T training tokens with 1.8T parameters
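A quick sanity check on that $500/year figure: the prices come from the table above, but the usage profile below (calls per day, tokens per call) is a hypothetical assumption, not a measurement. A minimal Python sketch lands in the same ballpark.

```python
# Back-of-the-envelope annual cost for GPT-4 128K inference.
# Prices are from the table above; the usage profile is assumed.
PRICE_IN = 0.01   # $ per 1K input tokens (KT)
PRICE_OUT = 0.03  # $ per 1K output tokens (KT)

calls_per_day = 50  # assumed
in_kt = 1.0         # assumed: ~1KT of input per call
out_kt = 0.5        # assumed: ~0.5KT of output per call

cost_per_call = in_kt * PRICE_IN + out_kt * PRICE_OUT
annual = cost_per_call * calls_per_day * 365
print(f"${cost_per_call:.3f}/call, ~${annual:.0f}/year")  # $0.025/call, ~$456/year
```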
Each time I use GPT-3.5, the average input is around 0.5KT and the output around 1KT, so each call costs about $0.00275; call it $0.003. A thousand calls therefore cost around $3.
Google's cost per search is claimed to be around $0.00002.
Google Search Ads CPM (cost per thousand impressions) is around $38.40.
Google Display Ads CPM is around $3.12.
So even if GPT-3.5 introduced ads to cover its cost, it could still easily be profitable, as the sketch below shows.
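As a worked check, here is the same arithmetic in a few lines of Python; every number is taken from the paragraphs above.

```python
# Verify the GPT-3.5 per-call arithmetic and compare it against
# ad revenue per impression.
IN_PRICE = 0.0015   # $ per 1K input tokens
OUT_PRICE = 0.0020  # $ per 1K output tokens

cost_per_call = 0.5 * IN_PRICE + 1.0 * OUT_PRICE  # 0.5KT in, 1KT out
print(f"per call:   ${cost_per_call:.5f}")         # $0.00275
print(f"1000 calls: ${1000 * cost_per_call:.2f}")  # $2.75

# CPM is the price per 1,000 impressions, so divide by 1,000:
search_ad = 38.40 / 1000   # $0.0384 per search-ad impression
display_ad = 3.12 / 1000   # $0.00312 per display-ad impression
print(search_ad > cost_per_call)   # True: one search ad covers ~14 calls
print(display_ad > cost_per_call)  # True, but only barely
```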
Terms
- llama.cpp, a C++-optimized inference library for LLaMA and similar models
- mlx, similar to llama.cpp, but targeting Apple Silicon
- bitsandbytes, a quantization library that converts LLM weights from float to 8-bit or 4-bit so models can run faster on limited hardware (see the sketches after this list)
- safetensors, a simple format for storing tensors safely (pickle is not safe) and still fast (zero-copy), written in Rust (see the sketches after this list)
- GGUF, a single-file binary format used by llama.cpp; GGML is its previous version
- TGI (text-generation-inference), a production-ready gRPC server for LLM inference; it loads safetensors, supports watermarking, and is written in Rust
- Candle, a tiny ML framework written in Rust by Hugging Face for fast inference
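To make the bitsandbytes entry concrete, here is a minimal sketch of loading a model in 4-bit through its transformers integration. The model name is only an example, and this assumes a CUDA-capable GPU with transformers, accelerate, and bitsandbytes installed.

```python
# Minimal 4-bit quantized load via transformers + bitsandbytes.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # store weights in 4 bits
    bnb_4bit_quant_type="nf4",              # NormalFloat4 quantization
    bnb_4bit_compute_dtype=torch.bfloat16,  # do the math in bf16
)

model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-v0.1",            # example model from the table
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")

inputs = tokenizer("The capital of France is", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=8)[0]))
```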
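And a minimal safetensors round trip using its torch helpers, showing the "no pickle, zero-copy" point in practice:

```python
# Save and load tensors with safetensors: no arbitrary code runs on load,
# and reads are memory-mapped (zero-copy).
import torch
from safetensors.torch import save_file, load_file

tensors = {"weight": torch.randn(4, 4), "bias": torch.zeros(4)}
save_file(tensors, "model.safetensors")  # one flat file
loaded = load_file("model.safetensors")
print(loaded["weight"].shape)            # torch.Size([4, 4])
```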