An implementation of an inference engine for Llama3-family architectures. The goal is a well-optimized engine suitable for personal use.
# Download weights
# Note: This will download the weights to your current directory
python3 scripts/load_weights.py --model meta-llama/Llama-3.2-1B --weights model.safetensors --out-bin llama.bin --out-index configs/model_index.json --dequantize-fp32
# Generate vocabulary
python3 scripts/tokenizer_script.py --out vocab.bin --byte-level
# Script usage
python3 scripts/load_weights.py --help
# Download json.hpp
wget https://github.com/nlohmann/json/releases/download/v3.11.2/json.hpp
# Compile code and execute binary
g++ -std=c++23 -o inference bpe.cpp model.cpp weights.cpp main.cpp && MAX_NEW_TOKENS=32 ./inference
The code uses modern C++ features, so compile with C++20 or later.
References