BLING Models
Collection
Small CPU-based RAG-optimized, instruct-following 1B-3B parameter models • 27 items • Updated • 28
How to use llmware/bling-tiny-llama-ov with Transformers:
# Use a pipeline as a high-level helper
from transformers import pipeline
pipe = pipeline("text-generation", model="llmware/bling-tiny-llama-ov") # Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM
tokenizer = AutoTokenizer.from_pretrained("llmware/bling-tiny-llama-ov")
model = AutoModelForCausalLM.from_pretrained("llmware/bling-tiny-llama-ov")How to use llmware/bling-tiny-llama-ov with vLLM:
# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "llmware/bling-tiny-llama-ov"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "llmware/bling-tiny-llama-ov",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'docker model run hf.co/llmware/bling-tiny-llama-ov
How to use llmware/bling-tiny-llama-ov with SGLang:
# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
--model-path "llmware/bling-tiny-llama-ov" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "llmware/bling-tiny-llama-ov",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'docker run --gpus all \
--shm-size 32g \
-p 30000:30000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=<secret>" \
--ipc=host \
lmsysorg/sglang:latest \
python3 -m sglang.launch_server \
--model-path "llmware/bling-tiny-llama-ov" \
--host 0.0.0.0 \
--port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
-H "Content-Type: application/json" \
--data '{
"model": "llmware/bling-tiny-llama-ov",
"prompt": "Once upon a time,",
"max_tokens": 512,
"temperature": 0.5
}'How to use llmware/bling-tiny-llama-ov with Docker Model Runner:
docker model run hf.co/llmware/bling-tiny-llama-ov
bling-tiny-llama-ov is a very small, very fast fact-based question-answering model, designed for retrieval augmented generation (RAG) with complex business documents, quantized and packaged in OpenVino int4 for AI PCs using Intel GPU, CPU and NPU.
This model is one of the smallest and fastest in the series. For higher accuracy, look at larger models in the BLING/DRAGON series.
Base model
llmware/bling-tiny-llama-v0