Hi HN,
I built OpenGraviton, an open-source AI inference engine designed to push the limits of running extremely large models on consumer hardware.
The system combines several techniques to drastically reduce memory and compute requirements:
• 1.58-bit ternary quantization ({-1, 0, +1}) for roughly 10x compression vs. FP16 (see the packing sketch just after this list)
• dynamic sparsity with Top-K pruning and MoE routing (toy example further down)
• mmap-based layer streaming to load weights directly from NVMe SSDs (sketched below)
• speculative decoding to improve generation throughput
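To make the quantization bullet concrete, here is a deliberately simplified sketch (plain C++, no Metal, no scaling factors, illustrative names) of how ternary weights can be packed and unpacked. The production path is more involved, but the gist is four weights in {-1, 0, +1} per byte at 2 bits each:

    #include <array>
    #include <cstdint>

    // Simplified sketch: pack 4 ternary weights {-1, 0, +1} into one byte,
    // 2 bits each, stored as the unsigned values {0, 1, 2}.
    uint8_t pack4(const std::array<int8_t, 4>& w) {
        uint8_t packed = 0;
        for (int i = 0; i < 4; ++i)
            packed |= static_cast<uint8_t>((w[i] + 1) & 0x3) << (2 * i);
        return packed;
    }

    std::array<int8_t, 4> unpack4(uint8_t packed) {
        std::array<int8_t, 4> w{};
        for (int i = 0; i < 4; ++i)
            w[i] = static_cast<int8_t>((packed >> (2 * i)) & 0x3) - 1;
        return w;
    }

Two bits per weight is an 8x cut versus FP16, which is roughly what the measured numbers below show; the 1.58-bit figure is log2(3), the theoretical floor, and tighter packings (for example five weights per byte in base 3, 1.6 bits each) get closer to the ~10x mark.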
Together, these techniques let models far larger than system RAM run locally.
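The layer streaming is what makes that practical, and conceptually it is small: the weight file is mmap'd read-only and each layer is just a pointer into that mapping, so the OS pages weights in from the SSD on first touch and can evict them under memory pressure. A rough sketch (POSIX calls, placeholder offsets, error handling trimmed; not the real on-disk format):

    #include <cstddef>
    #include <cstdint>
    #include <fcntl.h>
    #include <stdexcept>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    // Rough sketch of mmap-based weight streaming. The file is mapped once,
    // read-only; the OS pages each layer in from NVMe on first access and can
    // drop it again under memory pressure, so total weights can exceed RAM.
    class WeightFile {
    public:
        explicit WeightFile(const char* path) {
            int fd = open(path, O_RDONLY);
            if (fd < 0) throw std::runtime_error("open failed");
            struct stat st{};
            if (fstat(fd, &st) != 0) { close(fd); throw std::runtime_error("fstat failed"); }
            size_ = static_cast<size_t>(st.st_size);
            void* p = mmap(nullptr, size_, PROT_READ, MAP_PRIVATE, fd, 0);
            close(fd);  // the mapping stays valid after the descriptor is closed
            if (p == MAP_FAILED) throw std::runtime_error("mmap failed");
            base_ = static_cast<const uint8_t*>(p);
        }

        ~WeightFile() { if (base_) munmap(const_cast<uint8_t*>(base_), size_); }

        // Zero-copy view of one layer's packed weights; touching it triggers page-in.
        const uint8_t* layer(size_t offset, size_t bytes) const {
            (void)bytes;  // bounds checking omitted in this sketch
            return base_ + offset;
        }

    private:
        const uint8_t* base_ = nullptr;
        size_t size_ = 0;
    };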
In early benchmarks, OpenGraviton reduced TinyLlama-1.1B from ~2.05 GB (FP16) to ~0.24 GB using ternary quantization, about an 8.5x reduction. Synthetic stress tests at the 140B scale show the same pattern: 140B parameters at 2 bytes each is ~280 GB in FP16, while the packed ternary format works out to roughly 2 bits per parameter, so the weights fit within ~35 GB.
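On the sparsity side, Top-K pruning is also simple at its core: keep the K largest-magnitude activations and zero the rest, and the MoE router applies the same idea to expert scores so only a few experts run per token. A toy CPU version just to show the idea (not how the engine actually implements it):

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <numeric>
    #include <vector>

    // Toy Top-K sparsification: keep the k largest-magnitude entries of a
    // vector (activations, or expert scores for MoE routing) and zero the rest.
    std::vector<float> topk_sparsify(const std::vector<float>& v, size_t k) {
        k = std::min(k, v.size());
        std::vector<size_t> idx(v.size());
        std::iota(idx.begin(), idx.end(), 0);
        // Partition the indices so the k largest-magnitude entries come first.
        std::nth_element(idx.begin(), idx.begin() + k, idx.end(),
                         [&](size_t a, size_t b) { return std::fabs(v[a]) > std::fabs(v[b]); });
        std::vector<float> out(v.size(), 0.0f);
        for (size_t i = 0; i < k; ++i) out[idx[i]] = v[idx[i]];  // survivors keep their values
        return out;
    }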
The project is optimized for Apple Silicon and currently uses custom Metal + C++ code for tensor unpacking.
Benchmarks, architecture, and details:
https://opengraviton.github.io
GitHub:
https://github.com/opengraviton
I'm currently working on further speed improvements: some paths are already around 8x faster, but there's still plenty of room for optimization.
Since this is an open-source project, community support is very important. I believe AI shouldn’t be controlled or driven by only a few companies, so contributions, feedback, and ideas are always very welcome. Feel free to open an issue or PR if you'd like to help.