# Ollama Integration

## `@trace_llm` wrapping `ollama.chat()`

The primary integration point with Ollama is tracing LLM calls using the `@trace_llm` decorator:
```python
from openjck import trace_llm
import ollama

@trace_llm
def call_llm(messages):
    return ollama.chat(model="qwen2.5:7b", messages=messages)
```

## Available Models

OpenJCK works with all Ollama models. Popular choices include:
- `qwen2.5-coder:7b` - Excellent for code generation and debugging
- `llama3.1:8b` - Strong general-purpose model
- `gemma3:4b` - Efficient and capable smaller model
- `mistral:7b` - Good balance of performance and quality
- `phi3:medium` - Microsoft's capable medium-sized model
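Since switching models is just a matter of changing the `model` string, one common pattern is routing tasks to different models. This is an illustrative sketch only — the `MODEL_BY_TASK` table and `pick_model` helper are made up for this example, not part of OpenJCK or Ollama:

```python
# Hypothetical routing table mapping a task type to one of the models above
MODEL_BY_TASK = {
    "code": "qwen2.5-coder:7b",
    "general": "llama3.1:8b",
    "lightweight": "gemma3:4b",
}

def pick_model(task_type: str) -> str:
    # Fall back to a solid default when the task type is unknown
    return MODEL_BY_TASK.get(task_type, "qwen2.5:7b")

print(pick_model("code"))     # qwen2.5-coder:7b
print(pick_model("unknown"))  # qwen2.5:7b
```

The returned string can then be passed straight to `ollama.chat(model=..., messages=...)`.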
## Full Example

```python
from openjck import trace, trace_llm, trace_tool
import ollama

@trace(name="ollama_agent")
def run_agent(task: str):
    messages = [{"role": "user", "content": task}]
    response = call_llm(messages)
    result = process_response(response.message.content)
    return result

@trace_llm
def call_llm(messages):
    # You can specify any Ollama model here
    return ollama.chat(
        model="qwen2.5:7b",  # Change this to use different models
        messages=messages,
        options={"temperature": 0.7},  # Optional: adjust generation parameters
    )

@trace_tool
def process_response(text: str) -> str:
    # Example tool that processes the LLM response
    return text.strip()

if __name__ == "__main__":
    result = run_agent("Explain quantum computing in simple terms")
    print(result)
```

## Note on `ollama.AsyncClient` for async usage

For asynchronous applications, use `ollama.AsyncClient` with async tracing:
```python
import asyncio

from openjck import trace, trace_llm
import ollama

# Initialize async client
async_client = ollama.AsyncClient()

@trace_llm
async def call_llm_async(messages):
    return await async_client.chat(
        model="qwen2.5:7b",
        messages=messages,
    )

@trace(name="async_ollama_agent")
async def run_async_agent(task: str):
    messages = [{"role": "user", "content": task}]
    response = await call_llm_async(messages)
    return response.message.content

# Run the async agent
async def main():
    result = await run_async_agent("What is machine learning?")
    print(result)

if __name__ == "__main__":
    asyncio.run(main())
```

## How Token Extraction Works

When tracing Ollama calls, OpenJCK automatically extracts token usage from the response:
- Input tokens: `prompt_eval_count` from the Ollama response
- Output tokens: `eval_count` from the Ollama response
- Model name: Extracted from the `model` parameter or auto-detected
- Latency: Measured from the duration of the `ollama.chat()` call
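As a rough sketch of the mapping above — not OpenJCK's actual internals, and `extract_usage` is a hypothetical helper — the extraction boils down to reading a few fields from the raw Ollama response:

```python
def extract_usage(response: dict) -> dict:
    """Pull token usage out of a raw Ollama chat response dict."""
    return {
        "tokens_in": response.get("prompt_eval_count", 0),  # input tokens
        "tokens_out": response.get("eval_count", 0),        # output tokens
        "model": response.get("model", "unknown"),
        "cost_usd": 0.0,  # Ollama is local, so cost is always zero
    }

# Example with the fields Ollama returns:
sample = {"model": "qwen2.5:7b", "prompt_eval_count": 145, "eval_count": 320}
print(extract_usage(sample))
# {'tokens_in': 145, 'tokens_out': 320, 'model': 'qwen2.5:7b', 'cost_usd': 0.0}
```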
Example of captured data in trace JSON:
```json
{
  "step_id": 1,
  "type": "llm_call",
  "name": "call_llm",
  "duration_ms": 1250,
  "input": {
    "messages": [{"role": "user", "content": "Explain quantum computing"}]
  },
  "output": {
    "content": "Quantum computing uses quantum bits..."
  },
  "error": null,
  "tokens_in": 145,
  "tokens_out": 320,
  "model": "qwen2.5:7b",
  "cost_usd": 0.0
}
```

`cost_usd` is always `0.0` because Ollama runs locally and is free.

## Ollama Server Configuration

OpenJCK works with both local and remote Ollama instances:
### Local (default)

```python
ollama.chat(model="qwen2.5:7b", messages=messages)  # Uses localhost:11434
```

### Remote Server

```python
import os

os.environ["OLLAMA_HOST"] = "http://your-ollama-server:11434"
# or
client = ollama.Client(host="http://your-ollama-server:11434")
```

## Performance Tips
- Model warming: Keep frequently used models loaded in Ollama memory
- Batch processing: When possible, batch multiple prompts into fewer calls
- Temperature settings: Lower temperatures (0.1-0.3) for faster, more deterministic responses
- Context management: Be mindful of context window limits for your chosen model
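The temperature and context tips can be applied through the `options` dict passed to `ollama.chat()`. The `build_options` helper below is illustrative, not part of OpenJCK; `temperature` and `num_ctx` are standard Ollama generation options, and `keep_alive` (a separate `ollama.chat()` parameter) addresses the model-warming tip:

```python
def build_options(deterministic: bool = True, num_ctx: int = 8192) -> dict:
    """Illustrative helper: build an Ollama options dict from the tips above."""
    return {
        # Tip: lower temperature (0.1-0.3) for faster, more deterministic output
        "temperature": 0.2 if deterministic else 0.7,
        # Tip: keep the context window within your model's limit
        "num_ctx": num_ctx,
    }

# Usage sketch:
#   ollama.chat(model="qwen2.5:7b", messages=messages,
#               options=build_options(), keep_alive="10m")
# keep_alive keeps the model loaded between calls (model warming).
print(build_options())
```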