Documentation
Installation, model choices, developer commentary, and operating notes for Porrima.
Installation
Your existing coding agent will help you with the installation process. Start with the prompt below; it provides the agent with instructions to probe your machine, ask for missing choices, make a plan, and only then install or configure services.
One-liners
These commands fetch the prompt shown below and pass it to your agent. Read the prompt first if you want to review the exact instructions.
codex "$(curl -fsSL https://porrima.cc/install/agent-prompt.txt)" claude "$(curl -fsSL https://porrima.cc/install/agent-prompt.txt)" opencode --prompt "$(curl -fsSL https://porrima.cc/install/agent-prompt.txt)" pi "$(curl -fsSL https://porrima.cc/install/agent-prompt.txt)" Install Prompt
Good morning. These instructions have been provided to guide you through the installation of Porrima, by its developer. Let's get started with installing the agent software on my computer.
First, probe the machine and summarize a plan before installing anything. Take a look at the project's readme and/or website at https://porrima.cc if you want to get a feel for what it is.
Do not overwrite existing services or model directories without making backups and explaining the change.
Please note: do not assume anything about my hardware or system without actually confirming it first.
Project:
- Repository: https://github.com/asa-degroff/porrima.git
- Branch/ref: main
- Install profile: core
Before installing:
- Probe the machine, then summarize what you found and the exact install plan.
- Ask me for any missing choices before making changes:
- install directory
- public Porrima hostname, if remote access should be configured
- Cloudflare account/zone and whether Cloudflare Access is available
- model choices or model download preferences
- whether optional TTS, image generation, and automations should be installed now
- Do not overwrite existing services, data directories, model directories, or Cloudflare tunnel config without making backups and explaining the change.
Requirements:
- This is a Linux-only application, and it requires systemd.
- This is a desktop-GPU application. If no capable NVIDIA CUDA or AMD ROCm GPU is available, stop and report that the machine does not meet the recommended requirements.
- Prefer reversible user-local installation.
- Use systemd user services, not root services, unless OS package installation is required.
- Install the core llama.cpp-based agent first. TTS, image generation, and automation tuning are optional packs; only install the packs listed in the install profile.
- Do not expose first-run Porrima on a public URL unless the first-run setup token exists and `ORIGIN`/`RP_ID` are configured for that public origin.
Probe:
Run and summarize:
```bash
uname -a
cat /etc/os-release
node --version || true
npm --version || true
python3 --version || true
pip3 --version || true
systemctl --user show-environment || true
lspci | grep -Ei 'vga|3d|display' || true
nvidia-smi || true
rocm-smi || true
hipcc --version || true
nvcc --version || true
cmake --version || true
ninja --version || true
ffmpeg -version || true
free -h
df -h
```
Core install tasks:
- Install or verify Node.js and npm.
- Install native build prerequisites for Node modules such as better-sqlite3, sharp, and sqlite-vec.
- Build or install llama.cpp with the best supported backend for this machine.
- Create `~/bin/llama-current/llama-server`.
- Create `~/.local/share/llama-models` for GGUF models.
- Download, import, or symlink GGUF models into the curated layout:
- `~/.local/share/llama-models/<chat-model-id>/<model>.gguf`
- `~/.local/share/llama-models/<extraction-model-id>/<model>.gguf`
- `~/.local/share/llama-models/<reranker-model-id>/<model>.gguf`
- `~/.local/share/llama-models/<embedding-model-id>/<model>.gguf`
- optional projector files such as `mmproj*.gguf` should live beside the primary model.
- Do not leave models only in the raw Hugging Face cache; Porrima scans the curated top-level model directories above.
- Choose stable model aliases that match the directory names and the aliases used in service files/settings.
- Create user systemd services for:
- `porrima.service`
- `llama-server.service` on port 32100
- `extraction-model.service` on port 32101
- `reranker.service` on port 32102
- `embedding-model.service` on port 32103
- `title-generation.service` on port 32104
- Configure each llama.cpp service with the selected model alias:
- chat inference: router mode with `--models-dir ~/.local/share/llama-models`
- extraction: chat-completion model, CPU-only by default
- reranker: `--embedding --reranking --pooling rank`
- embedding: `--embedding --pooling mean`
- title generation: chat-completion model, CPU-only by default
- Use CPU-only defaults for extraction, reranker, embedding, and title generation unless spare VRAM is explicitly available.
- In production, build both workspaces and run the Node server with `NODE_ENV=production` on the configured app port. The production server serves `client/dist` itself.
- Configure `porrima.service` itself, or a systemd drop-in under `~/.config/systemd/user/porrima.service.d/`, with explicit WebAuthn environment values before the first production start:
- local-only setup: `Environment=ORIGIN=http://localhost:<port>` and `Environment=RP_ID=localhost`
- remote setup: `Environment=ORIGIN=https://<public-hostname>` and `Environment=RP_ID=<public-hostname without port>`
- the `ORIGIN` value must be the exact browser origin used for passkey registration; the `RP_ID` must match that origin's hostname.
- Do not set `PORRIMA_DEV_TOKEN` in production; Porrima refuses to start with the development bearer-token bypass variable set under `NODE_ENV=production`.
- For dual equal GPUs, use tensor split for chat inference.
- Preserve performance flags such as flash attention, tensor split, batch/ubatch sizing, visible GPU environment variables, and llama.cpp library paths.
Passkey and Cloudflare setup:
- If I want remote access, ask for the final HTTPS hostname before passkey registration. WebAuthn passkeys are origin/RP-bound, so a passkey registered only on `localhost` will not be usable on the Cloudflare hostname.
- Production startup requires explicit WebAuthn settings in the `porrima.service` environment, not just shell-local exports:
- `ORIGIN=https://<public-hostname>`
- `RP_ID=<public-hostname without port>`
- If using a Cloudflare Tunnel such as `https://porrima.example.com -> http://localhost:3001`, keep `PORT=3001`, set `ORIGIN=https://porrima.example.com`, and set `RP_ID=porrima.example.com`. Do not set `ORIGIN` to the localhost target for a public passkey registration flow.
- Preferred safe sequence:
1. Build and start Porrima locally, bound only to localhost.
2. Confirm the first-run setup token exists at `~/.porrima/auth/setup-token.txt` after the server starts.
3. Create the Cloudflare Tunnel and DNS route. Use Cloudflare Access or an equivalent temporary access policy when available as an extra guard.
4. Visit the final HTTPS hostname and register the first owner passkey in Porrima using the setup token.
5. Confirm `GET /api/auth/status` reports `setupComplete: true`.
6. Only then remove the temporary Cloudflare Access policy if I want the Porrima passkey gate to be the only public gate.
- Do not bypass the setup-token gate or expose a first-run Porrima instance whose setup token is missing or unreadable.
- If I choose local-only setup, register the first passkey from `localhost` and leave Cloudflare Tunnel disabled.
- Install or verify `cloudflared` only if remote access is requested.
- For a Cloudflare-managed service, create a user-local tunnel config and systemd user service, keep the tunnel target pointed at the production Porrima server on localhost, and include commands to inspect tunnel health and logs.
Optional TTS pack:
Do not install TTS dependencies, voice models, or TTS services during initial setup.
Optional image pack:
Do not install ComfyUI, stable-diffusion.cpp, image models, or image services during initial setup.
Optional automation pack:
Do not create custom automation tasks during initial setup; leave built-in Daily Synthesis and Wake Cycle defaults to Porrima's first-run setup.
Validation:
- Run `systemctl --user daemon-reload`.
- Enable and start selected services.
- Check `systemctl --user status` for every service.
- Confirm `systemctl --user show porrima.service -p Environment -p ExecStart -p WorkingDirectory` shows `NODE_ENV=production`, the intended `PORT`, and matching `ORIGIN`/`RP_ID`; missing WebAuthn values will make Porrima exit during startup and can surface as proxy 502 errors.
- Check every llama.cpp service `/health` and `/v1/models`:
- `http://localhost:32100` chat inference
- `http://localhost:32101` extraction
- `http://localhost:32102` reranker
- `http://localhost:32103` embedding
- `http://localhost:32104` title generation
- For the chat router, load the selected chat model if needed and run a short `/v1/chat/completions` request.
- For extraction and title generation, run short `/v1/chat/completions` requests using their configured aliases.
- For the reranker, run a small `/v1/rerank` request.
- For the embedding server, run a small `/v1/embeddings` request.
- Confirm Porrima sees the same models through `/api/llama-servers`, `/api/llama-servers/available-models?slot=inference`, `/api/llama-servers/available-models?slot=embedding`, `/api/llama-servers/available-models?slot=reranker`, `/api/llama-servers/available-models?slot=extraction`, and `/api/llama-servers/available-models?slot=title-generation`.
- Check Porrima `/api/auth/status` locally and, if remote access was configured, through the final HTTPS hostname.
- Load a small test model if needed and run a short prompt benchmark.
- Report selected model aliases, tokens/sec, selected backend, GPU visibility, VRAM use, passkey setup state, tunnel/access state, and any warnings. What It Will Ask
- Install directory and whether existing services should be reused or backed up.
- Final public hostname if Cloudflare Tunnel remote access is wanted.
- Cloudflare account, zone, and whether Cloudflare Access is available for temporary first-run protection.
- Model download preferences for chat, extraction, reranking, embeddings, and title generation.
- Whether optional voice, image generation, and automation defaults should be installed now.
Please note: except where otherwise specified, this page contains human-authored prose, and may contain flaws or inaccuracies
Model Selection
Porrima runs five Llama.cpp instances. You'll have to make a model decision for each depending on your preferences and hardware constraints.
Main chat model
This should generally be the largest and most capable model that fits within your system's VRAM, with a few options depending on your preferences.
Top picks:
- Qwen3.6 27B — very strong overall, exceptionally well-tuned for agent use, the best choice for most users running 32-48GB VRAM
- Gemma 4 31B — also good, requires a little more VRAM or smaller context than the aforementioned Qwen
Recommended with caution:
- Qwen3.6 35B A3B: the MoE Qwen is significantly faster and less capable than the dense Qwen 27B. Recommended only if you cannot run the dense model at tolerable speeds.
- Gemma 4 26B A4B: same story here, use if the dense model is not runnable on your hardware
- Gemma 4 12B: even weaker than the above in agent capabilities, it may be sufficient for light use
Extraction
The extraction model should be the largest and most capable model that you can run simultaneously on the same system as the main model. For most users, this means running on CPU while the main model runs on GPU. If you have VRAM to spare, you may be able to run them both on GPU.
Note that inference speed is not crucial for the extraction model in Porrima, since it is non-interactive for the user and non-blocking for the main model, so if it takes several minutes to process a request, that's usually fine. You'll only run into issues if you run into a case where multiple extraction requests are stacking up for an extended period, or where requests are timing out (this threshold is configurable). In general, if you can achieve >10tok/s on the extraction model and take breaks from your computer every now and then, you're good.
It is recommended to pick an extraction model from the same model family as your selected main model. In Porrima, the main and extraction models inhabit a shared first-person perspective as part of one entity. You can think of the extraction model as a limb or augmentation. With this in mind, it is crucial that the main model recognize its memories as its own, rather than as messages from an external entity, which can be very distracting and defeat the purpose of the memory system. The smaller versions of models within a family are typically 'distilled' from the larger versions — so e.g. the 4B, 9B, and 27B Qwen models share a training dataset, and their pre-and post-training takes place in a similar environment, making these models highly compatible. The story is similar for Gemma 4 31B and E4B variants, which make a natural pair.
With that in mind, the current options (June 2026) are:
- Qwen3.5 9B: recommended for Qwen agents if your CPU can run it at ~10tok/s or greater
- Qwen3.5 4B: use if the 9B version at a 4-bit quantization is too slow
- Gemma 4 12B: use for Gemma agents if you have a capable CPU and sufficient RAM
- Gemma 4 E4B: the smaller choice for Gemma agents
Not recommended:
- Any model mismatched from the main model family: poorly formed memories that the main model will perceive as confusing and conflicting external messages, confining it to an 'assistant' mindset and RAG chatbot role
Reranker
The reranker model serves as the final step in the memory retrieval pipeline, taking the results of a hybrid search over the memory database, compares them to a snippet of conversation context, and narrows down the results with much better accuracy than simple semantic + keyword matching, before injecting the memories into the main model's context. It is recommended to run the reranker on CPU, unless you have VRAM to spare.
There's really only one player in this arena that's mainline Llama.cpp-compatible: Qwen3-reranker. The 0.6B variant is recommended unless you have the RAM and processing power to spare for the 4B, which provides marginal benefit. Recommendation:
- Qwen3-reranker 0.6B
Title generation
The titles are a cosmetic feature, used to populate the chat list and chat view header. The title generation model also writes brief summaries of long agent messages that will be displayed in small italic text under the message bubble, and used as the body text of the push notifications if you have those enabled.
It should be a small model that runs on CPU. Don't bother taking resources from more important parts of the application to fit a larger title generation model. Gemma works better than Qwen here — Qwen3.5 2B consistently produces lines starting with preamble like "The user asked for..." or "The developer...", while Gemma 4 E2B is pretty good at concise summaries. The top recommendation is:
- Gemma 4 E2B
Embedding
The embedding model is used to generate vector embeddings for the memory database. The top choice is Qwen3-embedding, which comes in a few different sizes. It is recommended that you run it on CPU, and pick either 0.6B or 4B size depending on your RAM capacity. The larger model with its higher dimension count will produce slightly more accurate search results. Barring the one-time migration of an already-populated database when switching embedding models, embeddings for new memories are very fast, so the primary constraint is RAM capacity. Top choice:
- Qwen3-embedding 4B
Notes on Memory Bandwidth
Porrima was developed on and for a system with two pools of memory: GPU VRAM and system RAM. Separating these means that the main model (the user-interactive part) can run at the maximum speed allowed by the GPU, while the other models can run simultaneously as needed on the CPU, dipping into a separate, typically larger and slower pool of memory without affecting the main model.
If you use partial CPU offloading or have a system with unified memory (e.g. AMD Strix Halo), expect degraded speed as models contend for access to memory. It will probably still work, but I have not tested it since I don't have access to such a system.
Notes from the author
Here are the models I have used the most throughout my testing and development. Treat these as a 'known good' setup if your hardware is in the same ballpark as mine (32GB VRAM, AMD RDNA 3 (Radeon RX 7700 x2), 48GB system RAM, Ryzen 9 7950x) and adjust as needed:
- Chat: Qwen 3.6 27B, Q6_K, 115K context (F16 KV cache because it's faster)
- Memory extraction: Qwen 3.5 9B, IQ4_NL, 32K context / Qwen 3.5 4B, IQ4_NL, 32K context
- Title generation: Gemma 4 E2B QAT, Q4_0
- Reranker: Qwen3-reranker 0.6B, Q4_K_M
- Embedding: Qwen3-embedding 4B, Q4_K_M
Design Principles
Polished user interface
Services like Gemini and ChatGPT offer web and mobile apps that are polished and accessible, while having powerful features like web search, rendered artifacts, searchable chat history, image generation, and voice mode built in, reducing all the complexity to a simple conversational chat interface.
Self-hosted AI typically lacks this type of user experience unless you build it yourself by piecing together many different tools. Porrima aims to do everything that these cloud-based services do, in a single app that provides a cohesive user experience. It is not perfect, but that's the direction to strive for.
It provides a vertically integrated client, server, and agent harness. General-purpose messaging apps made for human-human communication are quite bad at providing an AI interface, so a purpose-built, cross platform app is used.
Memory-native
The memory system is tightly integrated. It does not aim to do anything approaching lossless recall from a database of facts. Rather, it intends to supplement an LLM with the capacity to ambiently & spontaneously remember things from its past, emulating human long-term memory, and we do this during runtime without requiring tool use, prompt cache busting, or pausing inference.
Additionally, it has a full set of memory tools and self-managed memory blocks for structured information, with periodic chances to review its recent experiences and update its memory blocks.
Flexible
The original mistake of agent harnesses such as Claude Code to pigeonhole themselves as narrowly-scoped coding agents. Porrima is not intended for any specific purpose; it can be used for coding or anything else that is accessible via text on a computer.
If comes with a diverse toolset for computer use, retrieving information from the web, and showing things to the user visually.
Unlimited agent capabilities
Agents like Openclaw are popular because of how few restrictions they have. Porrima works in a similar way. It allows current generation LLMs post-trained for agent use to really flex their capabilities.
'Infinite' context
Porrima is designed to run indefinitely without ever losing its place. When reaching a context limit, it runs an extraction pass on the context, moves memory augmented content to the head (on top of the system prompt and memory blocks), cuts redundant or no-longer-relevant memory content, and greedily packs the post-compaction context targeting 30% capacity, keeping the tail of the previous conversation history intact. This means that it resumes with everything it was recently doing still in-context, as opposed to summoning a 'fresh' instance of itself. The cost is that the post-compaction prefill is quite slow compared to a more agressive 'wipe and resume' strategy, and the churn is more frequent, which are acceptable tradeoffs.
Single-box embodiment
Porrima runs inference, the agent harness, memory system, web server, and all databases and storage on a single box of consumer-grade (gaming/workstation) hardware. It relies on cloud APIs only for web search.
Many existing systems designed primarily for cloud models offer the option to drop in an API for llama.cpp or ollama and run a model locally. In this case they ignore the benefits and drawbacks of inference on consumer hardware, where hardware is directly observable and memory bandwidth is limited. Porrima is instead designed without the assumption of datacenter scale, built to be highly usable on the hardware that people own, with non-blocking background processing, cache observability, intelligent deferral of non-interactive automations, and slot-aware cache rewarming.
Modularity
Llama.cpp is the ideal choice of inference engine, with a broad ecosystem, model compatibility, and a highly trusted team of maintainers. Porrima provides and interface for configurung flags and and swapping models and llama.cpp binaries for each of the five managed servers, allowing use of any of the wide variety of available forks or customized builds. Additionally, the agent can hot-swap its own llama.cpp process without user intervention.
Do the right thing
Porrima spent a glacial four months in development prior to release, with every UI feature going through system integration and usability testing. This application is primarily for providing a human-AI interface, so although it is made in a collaborative human-AI development process, it is necessary for a human to take control of all the human-facing components at a human pace.
It is very difficult to evaluate a piece of software that is so open-ended. However, a good-faith effort should be made to evaluate the user experience for a variety of tasks, and to make it better with every change (while probbly not adhering to the full extent of the MIT approach, this is in contrast to New Jersey-style development or to vibe coding.)
Avoid cleverness
ChatGPT became the fastest-growing app of all time because people love the UX of entering text in a box, hitting enter, and seeing an answer appear. That is arguably the best software UX ever devised. We will rip off ChatGPT, Claude, and Gemini (the consumer-facing apps) wherever it makes sense to do so. Reducing complexity is the goal, to the extent that it does not interfere with the prior principle.
Every optional (non-chat) feature should be able to be ignored by users who don't want to use it.
Technology stack
- Runtime: Node.js
- Build tool: Vite
- Backend framework: Express
- Primary language: TypeScript
- Databases: SQLite
- Frontend framework: React
- Agent core: Pi agent core
Background
About the memory system
An analogue can be drawn between AI agent memory systems and human memory. Short-term, working memory is the context window, and long term memory is a system external to, but used to construct, the contents of the KV cache. Additionally, long-term memory operates primarily in two modes: implicit and explicit. Here is my high-level understanding of the mechanisms:
Implicit memory is formed and recalled 'subconsciously', by processes such as repetition of behaviors, or ambient association. This includes things like remembering how to operate a machine that you frequently use, navigate the familiar parts of your environment, or recalling a detail about something you have encountered before when exposed to something that reminds you of something else for some reason.
Explicit memory is 'consciously' remembered, usually consisting of what you might typically think of as deliberately saved 'knowledge', things that are learned or memorized, including most of what you get from school, reading, and writing. These are memories that you intentionally remember and would recognize if you forgot.
The line between 'conscious' and 'subconscious' is not well-enough defined for me to go into much detail, but it seems that human cognitive processes can be in operation in both of these modes simultaneously, effecting each other in subtle and mysterious ways. Mapping these modes to an AI agent system will not be a perfect match, but they are a useful abstraction that can help guide the design.
Most AI memory systems operate explicitly: they decide what to remember, either periodically, or at the end of a session, or however a memory 'checkpoint' is defined (where some sort of context refresh takes place), and then recall memory by either using a search tool, or loading the contents of a document into context. This is a very intentional process.
One obvious parallel for implicit memory is the model weights themselves. LLMs know enormous amounts of information and can figure out how to operate anything that can be operated by text. They don't necessarily know how they know, it just arises from the weights. However, this is set in stone when training completes (not how human memory works), and it is currently unknown whether parametric memory can be made to work well. Token-space (KV cache) learning works well and is well-documented at this point.
For the purpose of this system, we are augmenting the model with instance-specific, experiential first-person memories constructed from and inserted into KV cache at runtime.
Implicit memory has to be saved from and loaded into the model's context window without the model knowing that it has happened. "The model" in this case is defined as the 'main' model, the one taking the actions and interacting with its environment. The system as a whole can be thought of as one individual entity, containing more than one LLM.
To implement implicit memory, we have to run another model behind the scenes, an extraction model, which does not interact with the environment or take messages from the user, but just processes the context that the main model processes and decides what to save. This runs simultaneously to the main model, with a one-conversation-turn delay: the most recent turn from the user and main model is passed to the extraction model as soon as the main model finishes. Since turn-by-turn messages can miss broader-scope context, it also takes a second pass over the conversation history after reaching a context limit or after activity has cooled down for a while.
The extraction model itself operates in 'blank slate' mode without a memory system of its own, just a static system prompt and context copied from the main model's context. Creating a memory system for the extraction model is left as an exercise for the reader, though be advised that recursion needs a base case and that the resource requirements would be astronomical.
Going in the other direction, we have to decide what to recall and inject from implicit memory into the main model's working memory (context window) as it runs, not just as a lookup, tool call, or initial (zero-shot) context construction, but as a process invisible to the main model, that runs repeatedly while it works. We use a multi-layer search process: vector + full-text RRF hybrid search, MMR diversity scoring, and an instruction-tuned cross-encoder reranker. The tricky part is deciding what to search for.
The naive approach would be to use the user prompt as a search query, which works pretty well if the goal is just basic trivia-style back-and-forth conversation with information lookup, but we need to go beyond that for truly implicit, ambient associative memory. The agent needs to be able to think of something, or execute a task, and be reminded of something else that it has experienced before, and have that recalled mid-operation, without user input. So we use both user prompts initially, and then agent output, agent thinking traces, and tool calls as search queries for periodic memory lookups.
To perform the live injection, we have to switch from decode to prefill to insert the newly-recalled memories in to the main model's KV cache. This means interrupting the output. Since AI agents frequently do this anyway as they make tool calls, injecting memory content on top of the context window at tool call time is the best way to perform this injection.
Injected memories are not passed as a prompt. In practice, the main model will occasionally say something like 'looking at my memories,...' or 'the memories say...', but more often than not, the memories are just silently folded into the context, and the agent will know what they contain if it happens to be relevant.
In the absence of tool calls, passively recalled memories will just be recalled and injected at the end of the turn, and will be in-context after the agent resumes following the next user prompt. This is the back-and-forth conversation flow, where neither participant interrupts the other, and each participant has the chance to gather their thoughts before sending their next response.
This is operationally similar to RAG and shares a lot of the technical components with a RAG pipeline, but serves a very different purpose from traditional RAG, which treats an LLM as a search engine and augments it with some existing database of information that can't be found in the weights, usually with disappointing results, as the source material turns out to be incomplete and/or hallucination comes into play.
With Porrima, the goal is not to use 'reference' material, there is no 'source of truth', in much the same way that human memories are not a source of truth. Rather, the database consists of first-person experiences, they are all lossy, and some of them might be wrong. Some of them might also be right, as it accurately remembers answers to questions, technical details, names, places, facts, etc., which the extraction model tends to be biased toward writing down. This is a lossy system by design.
Next comes the explicit memory. Porrima can write down information in self-managed memory blocks, which are then included as part of the constructed system prompt for newly opened in-scope chats or on the next cache-bust. It comes with a synthesis cycle: every 24 hours (by default), if there has been new activity, the atomic memories (that's what the ones created by the extraction model are called), up to a certain limit and biased toward importance if the set needs to be truncated, are passed as part of the synthesis prompt to the main agent in its system chat. It then gets the chance to think through what has taken place and update its memory blocks accordingly. The memory blocks will then provide a sense of continuity from one day to the next, across all chats.
By default it comes with a 'zeitgeist' block, which is intended to provide a high-level summary of whatever the agent has been up to. Also the agent can create topical memory blocks for whatever it feels like writing down. As a user you can also create blocks, but it is recommended to keep the agent in the loop as you do so.
Taking a look at a typical high-end consumer desktop PC, it has two main pools of computing resources: a CPU and system RAM (~80-128GB/s), and a GPU with VRAM (~600-1800GB/s). The common approach to LLM inference is to run the model on the GPU, that's what it's made for, after all. At consumer hardware scale, this works well enough for a single request at a time, but concurrency introduces severe performance penalties on a memory bandwidth-limited system. Meanwhile, the CPU is left mostly idle, handling normal operating system tasks. Modern-day CPUs with many cores, DDR5 RAM, and AVX are actually perfectly capable of running AI inference, just not nearly as fast as a GPU. The CPU is typically a compromise that you settle for in the absence of an adequate GPU. However, it is a resource left on the table that is perfectly capable of running an LLM, so we may as well dip into it.
Running concurrent requests to the same model on the GPU is possible, but that's a user-interactive part of the application, and slowing it down is less than ideal. The memory extraction and retrieval pipelines, on the other hand, are not something that the user interacts with, and will not block the application if they run a little slow. The extraction model also doesn't need to be as smart as the model doing the difficult intellectual work; it basically just needs to look at the context and decide what's worth saving, with a slight bias toward saving everything, and write it down in coherent prose that fits well with the main model's style and perspective. With all that in mind, this part of the system is a perfect candidate for a smaller model, offloaded to the CPU + system RAM. Transient contention from the retrieval pipeline has low impact.
The Qwen and Gemma model series both offer small and medium model sizes that are perfectly suited for this division of work. Using two models from the same model family means the main model will likely see the outputs from the extraction model, with shared writing style, training data, and vocabulary, as its own memories, rather than messages from an external entity.
User guide
Sidebar
The sidebar contains the entry points for every other part of the app. At the top, you will find the search bar, which also shows the name of your agent and the activity indicator for the extraction model. Look here to see when extraction is happening. Also use it to search your chat history with full-text search and navigate to the returned chat(s).
Below the search bar you will find the status indicator and button row.
- Release: this button immediately enables sleep mode and lets automations take over if they are due
- Automations: this button allows you to run a specific automation immediately
- Memory: opens the memory debug panel
Within the memory debug panel, you will find a list of all atomic memories, and a search bar which runs a hybrid search over the memory database. From here you can sort and filter memories, and delete any if needed.
There is also a memory blocks viewer, from which you can view your agent's memory blocks, edit and delete them if needed.
The graph view shows a 2D graph of the memories in the database, grouped by semantic similarity. Adjust the tyhreshold and neighbor count to vary the graph. Click on a node to view that memory.
The extraction view shows extraction history since the last server restart, including running extractions. From here you can view the system prompt and messages sent to the extraction model, as well as the outputs for any given run.
- Model stats: opens the model stats panel
From here you can see the decode, prefill, and cache hitrate stats for the main chat and extraction models, grouped by model so you can compare after trying different ones. Expand to see run history and per-run stats.
The reranker stats shows the average latency, success rate (percentage not failed), and score quality (this is derived from the result similarity, low quality does not necessarily indicate a defect, but suggests that new semantic ground has been covered recently). Reranker runs can be expanded to see the query sent to the reranker, and memories selected for injection for each turn-start and passive (mid-turn) injection run.
The cache view shows observed cache slots for the main chat model, and recent hitrate for that slot.
- Settings: opens settings
This is where you can adjust all settings for the applicaiton
System chat
Below the buttons and system stats, you will find a single row, which is the system chat. This is where all automated activity takes place. Check in to see what your agent has been up to in the latest synthesis and wake cycles. You can also chat with your agent here, and use it as the 'main' chat if you want one continuous thread.
Activity indicators
The acitvity indicators shown in various parts of the UI are customizable 3D shapes (see the settings section below). In the chat view, note that the activity indicator takes on two different states, depending on whether the model is doing prefill or decode. The prefill indicator rotates slowly and steadily, while the decode indicator goes through a fast sequential flipping animation. This tells you at a glance whether your agent is reading something or writing.
Cache residency indicators
Chats that are currently residing in a cache slot (observed by hitting the llama.cpp slots endpoint) will be shown with an amber halo in the sidebar, except the most recently used chat which will be shown with a purple halo. This indicates that those chat(s) are ready to resume wihtout requring re-processing in most cases.
Chat Types
In the sidebar you will find three sections:
- Projects are chats that start with a working directory and the AGENTS.md from that directory in-context. You can start a project for a codebase or anything else. Project working directories can be located on a remote machine.
- Global chats are the default agent chats, with the working directory as your computer's home directory, and full agent capabilities.
- Quick chats do not use memory or the agent persona and do not have agent capabilities.
Settings
Note: this section was mostly written by my personal agent (Qwen3.6 27B), and may be incomplete.
Settings is where you will find all the configuration options for Porrima. The menu is organized into sections, accessible from the sidebar.
Models
- Default model — the main chat model for the agent. Picked from all models discovered on disk.
- Preserve thinking — when enabled, passes
preserve_thinking: truein chat template kwargs so models that support it can retain historical reasoning traces from prior turns.
Inference Servers
Manage the llama.cpp systemd services that power Porrima — inference, extraction, reranker, embedding, and title generation. For each server you can view health status, systemd state, start/stop/restart, and view recent logs.
- Service config — per-server drop-in overrides: extra args, model assignments, slot routing, binary selection. Preview generates the full systemd drop-in file before applying.
- Binary management — discover llama.cpp builds from a scan directory and assign specific binaries to individual servers.
- Model scan paths — directories scanned for GGUF model files. Preview shows discovered models before adding a path.
- Slot binding mode — auto lets llama.cpp choose slots and use its RAM prompt cache; enforced sends
id_slotbased on app-managed leases.
Appearance
- Theme — color scheme for the entire interface. Options: lapis, ocean, forest, crimson, asphalt, strawberry, coffee, emerald, copper, verdigris, iron, rust.
- Activity shape — the geometric shape shown as the agent activity indicator: octahedron, cube, or tetrahedron.
- Activity hue / saturation — fine-tune the color of the activity polyhedron.
- Corner shape — superellipse or circular shape for rounded corners throughout the UI.
- Corner radius — the degree of rounding.
- Flat background — removes the diagonal background gradient for a flatter look.
- Chromatic aberration — toggle the RGB-split effect for the animated backgrounds.
- Mouse warp — the background polyhedron subtly follows cursor position when enabled.
Background
- Effect — the background pattern behind the main content: static, ripple-grid, scan-lines, or ripple-dots.
Haptics
- Enabled — toggle haptic feedback on supported devices for button presses and gestures.
Notifications
- Push notifications — enable/disable notifications and test button for push notifications, saved per-device. When enabled, notifications are sent on this device when your agent has a message for you.
Remote Hosts
Configure SSH connections for remote project access. Each connection defines:
- Hostname, port, username, and identity file.
- Known hosts mode — strict (verify against known_hosts), accept-new (auto-accept on first connect), or off.
- Permission scopes — independently toggle bash execution, file writes, and absolute path access per connection.
- Test connection button validates the setup before saving.
Persona
- Agent name — the display name shown in the sidebar search bar and chat header.
- Header image — upload a custom image to display in the chat header. When absent, just the model name will be shown.
- Persona document — the agent's identity document, editable by the user.
About You
- User document — a document about you that is included in the agent's context. This is the file the agent refers to as user information.
Quick chats
- Quick chat system prompt presets — save and manage reusable system prompts, selectable from the dropdown in the header for quick chats.
API Keys
- Brave Search — API key for the Brave Search provider. Toggle on/off independently.
- Exa — API key for the Exa search provider. Toggle on/off independently.
- Tavily — API key for the Tavily search provider. Toggle on/off independently.
- Default provider — which search provider the agent uses when it calls web_search without specifying the provider.
Images
- Backend — choose between ComfyUI and stable-diffusion.cpp (sd-server) as the image generation backend.
- ComfyUI URL — the address of your ComfyUI server. Connection status is checked and displayed.
- sd.cpp URL — the address of your stable-diffusion.cpp server.
- Image sandbox enabled — toggle the Image Sandbox entry point in the sidebar on/off.
Skills
- Skills browser — browse, install, and manage Agent Skills. Skills can be installed from URLs (GitHub directories or direct SKILL.md links) and activated per-chat.
Extraction
- Extraction model — the model used for memory extraction. Should be from the same family as the main model for coherent memory formation. Selectable from models available on the extraction server.
- Extraction server URL — direct URL to the dedicated extraction llama.cpp instance (default: port 32101).
- Context size — context window for the extraction server (2048–131072, default 16384).
- Max output tokens — maximum tokens the extraction model can generate per request (100–32768, default 4000).
- Timeout — abort extraction requests after this many minutes (1–1440, default 10).
- Delayed extraction — configure the delayed extraction window: threshold (minutes of inactivity before batch extraction triggers, default 30) and message cap (max messages per batch, default 50).
- Extraction prompt — the system prompt sent to the extraction model. Editable inline; changes take effect on the next extraction run.
- Enrichment batch size — how many image corpus entries to process per enrichment check (default 5).
Backups
- Embedding backups — create labeled backups of the current embedding vectors before running a migration. Restore from any backup. Lists all backups with timestamps and sizes.
- Embedding migration — migrate the memory database to a new embedding model/dimension. Progress shown in real-time. Requires a backup before proceeding.
- Agent snapshots — create full-point-in-time snapshots of the agent's state (memories, blocks, settings). Optionally include the image corpus. Restore from any snapshot.
Memory Retrieval
Configure how memories are retrieved and injected into context. Two retrieval paths: turn-start (at the beginning of each chat turn) and passive recall (mid-turn, after the agent's thinking output).
- Depth profile — preset: Fast (lower CPU, fewer memories), Balanced (safe baseline), Thorough (more candidates, deeper reranking), or Custom (unlock advanced controls).
- Reranker timeout — abort rerank requests after this many milliseconds (default 25000).
- Turn-start: query chars — character budget for the search and rerank queries (search and rerank can differ).
- Turn-start: search limit — max candidates from hybrid vector + FTS5 search.
- Turn-start: candidate pool — candidates passed to the reranker after dedup.
- Turn-start: rerank document limit/chars — how many documents and how many chars per document sent to the reranker.
- Turn-start: top N — final memories injected from turn-start retrieval.
- Passive recall: query chars — character budget for passive recall queries (derived from thinking output and trajectory).
- Passive recall: search limit / candidate pool / diverse limit — same as turn-start but for mid-turn retrieval.
- Passive recall: rerank document limit/chars / top N — reranking parameters for passive injection.
- Memories per injection / per turn — cap on how many memories are injected per passive recall event and per full turn.
- Cross-project score multiplier — dampens memories from other projects (0.3 = 30% score, default). Prevents cross-project noise while allowing highly relevant content to surface.
- Global project score multiplier — controls how strongly project-scoped memories compete in chats without a current project (default 1.0).
System Stats
- Enabled — show/hide the system resource usage bar in the sidebar (CPU, RAM, GPU).
- Buffer duration — how many seconds of historical stats to retain (default 60).
- Hidden GPUs — list of PCI addresses to exclude from stats display. Useful when you have GPUs you don't want tracked.
Memory Blocks
- Max block characters — maximum size of a single agent-created memory block (1000–10000, step 500, default 4000).
Automations
- Sleep cycle threshold — minutes of user inactivity before the agent enters sleep mode, allowing automations to take over (default 60).
- Wake cycle — toggle autonomous periodic exploration during sleep. Configure the interval in hours (default 6).
- Post-synthesis cache warm — number of recent agent chats to pre-warm in the KV cache after each synthesis cycle (default 3, set to 0 to disable).
- Task management — create, edit, and delete automation tasks. Each task defines a schedule (interval or daily), activation policy (idle, sleep-only, manual-only), prompt steps with dispatch mode (sequence, random, cycle), iteration/timeout limits, and optional push notifications. View run history with per-run details.
Tool Options
- read_file default lines — default line limit when the agent reads a file without specifying a limit (default 1000).
- read_file max bytes — hard cap on returned file content as a safety net for pathological lines or minified bundles (default 256 KB).
TTS
- Backend — choose between Supertonic 3 and Kokoro as the text-to-speech engine.
- Voice — select the voice for the chosen backend. Voices are grouped by backend.
- Speed — playback speed multiplier (0.5× to 2×), logarithmic slider.
- Language — Supertonic supports 30+ languages; select the primary language for generation.
- Boundary tier — sentence boundary detection tier for streaming: controls how the TTS engine splits text into playback chunks.
- Text mode — how markdown is handled before sending to TTS (raw, stripped, or processed).
- Pitch — adjust pitch and select resample or rubberband for varying vocal effects.
Security
- Passkeys — manage WebAuthn passkeys for authentication. Add new passkeys or remove existing ones. This is the only login method — no passwords.
Chat view
Notebooks
The notebooks section provides notes for both you and your agent to write in. After writing a note, your agent will read that note during its next synthesis cycle. It will also write notes of its own during synthesis and wake cycles, or whenever it uses the create notebook entry tool.
Image sandbox
The image sandbox provides three main views: analyze, generate, and corpus.
The analyze section provides an interface for image-to-text tasks. This uses your selected main chat model to prodide text descriptions of images. There are built-in description styles that provide a system prompt for your vision model. 'Simple' is recommended for generating one-line captions or accessibility alt text for photos. 'Detailed' is recommended for generating rich and verbose accessibility alt text for photos. 'Z-Image' is recommended for img-txt-img prompts for modern text-to-image models such as Z-Image and Qwen Image, that take natural language markdown-formatted prompts.
The generate view provides an interface for text-to-image models, and shows your generated images in a masonry grid gallery with detailed view.
To get generation working, you'll have to first set up ComfyUI server or stable-diffusion.cpp server, and point Porrima at the URL from the images section of the settings menu. You can run the image generation server on the same hardware as the agent, but just note that unless you have enough memory to keep both loaded simultaneously, this will require unloading the LLM before loading the image model. Porrima handles this resource balancing automatically. For a better experience, you can run the image server on a separate machine on your local network or Tailnet and avoid resource contention.
(Note: images are stored as JPEG XL, and as of June 2026 some browsers may still require enabling a feature flag to support this format.) From the gallery you can favorite or delete images, see the prompt that was used to generate an image, and use it as a starting point for another generation.
The corpus section shows a graph of nodes, each representing an image in your gallery, clustered by similarity. Images from the same or similar prompts will be clustered together. You can hover a node to see a preview of that image, and select it to show the details in the right sidebar.
Future Work
Multi-agent
A plan is in place to implement multi-agent switching. Each will have its own set of memories, chat history, and model selection. The switcher will be activated from the chat header image.
Switching agents and resuming a chat will likely require expensive prompt reprocessing, so this option should be available, but is not expected to be done with high frequency or to enable concurrent usage.
Improved server management UX
The Llama.cpp server management requires existing knowledge or a lot of research about how things work. The defaults and configuration UI here could be improved.
Improved installation guidance
The installation process should be made more beignner-freindly.
Lowering hardware requirements
While outside of my hands or the scope of this project, I am hopeful that future models will bring highly capable agents to more accessible hardware. This is where this application will really start to shine.
Fast voice mode
Gemma 4 models with native hearing support and streaming TTS models can be combined for a low-latency conversational voice mode.
Improved UI animations
Many places in the UI still need meaningful animations.