Different LLMs.


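| Name | Size | Training Tokens | Context Length | Org | Price (USD per 1K tokens, KT) |
|---|---|---|---|---|---|
| OLMo | 1B/7B | 2.5T | 2K | AllenAI | |
| Phi-2 | 2.7B | 1.4T | 2K | Microsoft | |
| Llama2 | 7B/13B/70B | 2T | 4K | Meta | |
| Yi | 6B/34B | 3T | 4K | 01.AI | |
| Mistral | 7B, 8x7B | 8T | 8K | Mistral | |
| Gemini | | | 8K, 32K | Google | |
| Gemini 1.5 Pro | | | 1M | Google | |
| GPT-3.5 | | | 4K, 16K | OpenAI | In: $0.0015/KT, Out: $0.0020/KT |
| GPT-4 | | 13T | 8K, 32K, 128K | OpenAI | In: $0.01/KT, Out: $0.03/KT |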


  • Inference cost of GPT-4 128K
    • In: $0.01/KT
    • Out: $0.03/KT
    • For a median user, one year of usage costs around $500.
  • Training cost of GPT-4 128K
    • Reportedly around $150M, for 13T training tokens and 1.8T parameters
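To get a feel for these numbers, here is a rough sketch that takes the figures above at face value. The request shape (0.5 KT in, 1 KT out) is an assumption made for illustration, not a measured GPT-4 usage pattern, and all variable names are made up for this example.

```python
# Figures from the list above (USD).
GPT4_IN_PER_KT = 0.01        # input price per 1K tokens
GPT4_OUT_PER_KT = 0.03       # output price per 1K tokens
YEARLY_BUDGET = 500.0        # median user's yearly cost
TRAINING_COST = 150e6        # training cost
TRAINING_TOKENS = 13e12      # training tokens

# Assumption: a typical request has 0.5 KT of input and 1 KT of output.
INPUT_KT = 0.5
OUTPUT_KT = 1.0

cost_per_request = INPUT_KT * GPT4_IN_PER_KT + OUTPUT_KT * GPT4_OUT_PER_KT
requests_per_year = YEARLY_BUDGET / cost_per_request
requests_per_day = requests_per_year / 365

# Training cost spread over every training token, expressed per 1K tokens.
training_cost_per_kt = TRAINING_COST / (TRAINING_TOKENS / 1000)

print(f"cost per request:      ${cost_per_request:.3f}")
print(f"requests per year:     {requests_per_year:.0f}")
print(f"requests per day:      {requests_per_day:.1f}")
print(f"training cost per KT:  ${training_cost_per_kt:.4f}")
```

Under that assumed request shape, $500 a year corresponds to roughly 39 requests a day, and the training cost per 1K training tokens comes out close to the input price per 1K tokens.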

Each time I use GPT-3.5, the average input is around 0.5KT and the output is around 1KT. So each request costs me 0.5 × $0.0015 + 1 × $0.0020 = $0.00275, let's say $0.003. A thousand requests therefore cost around $3.
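The per-request estimate above can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function and constant names are made up for this example.

```python
# GPT-3.5 prices in USD per 1K tokens (KT), from the table above.
GPT35_IN_PER_KT = 0.0015
GPT35_OUT_PER_KT = 0.0020


def request_cost(input_kt, output_kt, in_price, out_price):
    """Cost in USD of one request, with token counts given in thousands (KT)."""
    return input_kt * in_price + output_kt * out_price


# Average request from the text: 0.5 KT in, 1 KT out.
per_request = request_cost(0.5, 1.0, GPT35_IN_PER_KT, GPT35_OUT_PER_KT)
per_thousand = 1000 * per_request

print(f"per request:        ${per_request:.5f}")
print(f"per 1,000 requests: ${per_thousand:.2f}")
```

The exact figure is $2.75 per thousand requests; the text rounds it up to $3.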

Google claims that each search costs around 0.00002.

Google Search Ads CPM (cost per thousand impressions) is around $38.40.

Google Display Ads CPM is around $3.12.

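A thousand GPT-3.5 requests cost around $3, while a thousand ad impressions bring in between $3.12 (display) and $38.40 (search). So even if GPT-3.5 had to introduce ads to cover its cost, it could still easily be profitable.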
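The comparison can be written out as a small sketch. It assumes one ad impression per request and treats the full CPM as revenue, both of which are simplifications; the names are made up for this example.

```python
# All values in USD per 1,000 requests or impressions, from the text above.
COST_PER_1000_REQUESTS = 3.00   # rounded GPT-3.5 estimate
SEARCH_ADS_CPM = 38.40
DISPLAY_ADS_CPM = 3.12


def margin_per_1000(cpm, cost_per_1000, ads_per_request=1):
    """Ad revenue minus inference cost for 1,000 requests.

    Assumption: each request shows `ads_per_request` ads and the whole
    CPM counts as revenue.
    """
    return cpm * ads_per_request - cost_per_1000


search_margin = margin_per_1000(SEARCH_ADS_CPM, COST_PER_1000_REQUESTS)
display_margin = margin_per_1000(DISPLAY_ADS_CPM, COST_PER_1000_REQUESTS)

print(f"search ads margin per 1,000 requests:  ${search_margin:.2f}")
print(f"display ads margin per 1,000 requests: ${display_margin:.2f}")
```

With search-style ads the margin is large (about $35 per thousand requests); with display-style ads it is only about $0.12, so the conclusion depends heavily on which kind of ad applies.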


  • llama.cpp, a C/C++ inference library optimized for LLaMA and similar models
  • mlx, similar to llama.cpp, for Apple Silicon
  • bitsandbytes, a quantization library that converts LLM weights from floats to 8 or 4 bits, so that models can run on limited resources
  • safetensors, a simple format for storing tensors that is safe (pickle is not) and still fast (zero-copy), written in Rust
  • GGUF, a single-file binary format used by llama.cpp. GGML is its predecessor.
  • TGI (text-generation-inference), a production-ready gRPC server for LLM inference. Loads safetensors, supports watermarking, written in Rust.
  • Candle, a tiny Rust-based ML framework for fast inference, written by Hugging Face.
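To illustrate what converting weights from floats to 8 bits means, here is a minimal sketch of absmax quantization in plain Python. It is a simplified illustration of the general idea, not how bitsandbytes is actually implemented, and the function names are made up for this example.

```python
def quantize_absmax_int8(weights):
    """Map floats to integers in [-127, 127] using a single scale factor."""
    max_abs = max(abs(w) for w in weights)
    # Guard against an all-zero input, where any scale would do.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [round(w / scale) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]


weights = [0.5, -1.27, 0.02, 1.0]
q, scale = quantize_absmax_int8(weights)
restored = dequantize(q, scale)

# Largest error introduced by the round trip.
max_error = max(abs(w - r) for w, r in zip(weights, restored))

print("quantized:", q)
print("restored: ", [round(r, 4) for r in restored])
print("max error:", max_error)
```

Each value now fits in one byte instead of the four bytes of a 32-bit float, at the price of a rounding error of at most half the scale factor per weight.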
