Stellar

Why a Semantic Memory Layer Matters

Jarvis is a Retrieval-Augmented Generation (RAG) service that builds a semantic memory layer over security and AI data, using hybrid search for red and blue team workflows.
Will Burns

Offensive Security Engineer

“Where did I see this before?” If you’ve ever worked on a security investigation or red team engagement, you know the pain of scattered notes, log files, and tool outputs piling up. I personally had hundreds of pages of findings and scripts from past ops – and yet I kept reinventing the wheel. Jarvis was born from this frustration. I set out to build a semantic memory layer: a long-term, AI-queryable knowledge base that remembers what I’ve already done and what I’ve already learned, so I don’t have to. In this first post of the series, we’ll explore why such a memory layer matters and how Jarvis implements it with advanced retrieval techniques.

The Overload of Unused Knowledge

In security R&D, we generate a flood of information – vulnerability reports, code snippets, configs, incident notes – but much of it goes unused after initial creation. Traditional keyword search (think grep or basic SIEM queries) often isn’t enough to surface the right info when you need it. Keywords fail if you don’t remember the exact terms, and they miss contextual meaning (e.g. “SAML issue in Azure AD” might not match a note titled “cloud auth bug”). I felt this pain acutely: I’d encounter a problem and dimly recall solving something similar last year, but couldn’t find it quickly by grepping through Markdown files. Clearly, we needed more semantic recall – searching by meaning, not just exact words.

This is where Retrieval-Augmented Generation (RAG) enters the picture. RAG pairs an AI language model with an external knowledge base: when asked a question, the AI first retrieves relevant documents from the knowledge base, then uses those to formulate an answer. Instead of hoping the AI “knows” about NTLM pitfalls or a specific CVE from training, we supply it with our own knowledge on demand. RAG has become standard for building intelligent Q&A systems on private data, and for good reason. It bridges the gap between what the AI knows and what you know. With Jarvis, the vision was to create a personal RAG system for security ops, so I could ask, “Have we seen this attack before?” or “What’s the usual fix for this finding?” and get an answer grounded in my actual past work.

However, building a good semantic memory isn’t as simple as tossing all your docs into a vector database. The quality of retrieval matters. If the system pulls irrelevant or incomplete snippets, the AI might generate incorrect answers. The first major challenge I faced was how to break up and index my data in a way that preserves context and meaning. Naively splitting files by size (e.g. every 500 characters) often produces garbled chunks that are meaningless out of context. To maximize retrieval quality, Jarvis employs contextual semantic chunking – an approach inspired by Anthropic’s research that adds document context to each chunk to preserve its meaning.

Preserving Context with Semantic Chunking

Rather than splitting documents blindly, Jarvis’s SemanticChunker uses a Max–Min algorithm to find semantic breakpoints. In simple terms, it looks at the text sentence by sentence and computes similarity between adjacent sentences. If two sentences are very dissimilar (below a certain cosine similarity threshold), that’s a clue that a new topic might be starting – a good place to break. By doing this, we ensure each chunk is a coherent thought. For example, if a report has sections on Initial Access and Persistence, the semantic chunker will likely split at the section boundary, instead of cutting a section in the middle. This avoids a common problem where a single chunk contains half of one topic and half of another, confusing the retriever.

In Jarvis, semantic chunking is configured with a minimum chunk size (e.g. 200 characters) and a similarity threshold (~0.5 by default) to balance chunk cohesion with chunk length. If needed, there’s also a FixedChunker fallback for speed, but semantic mode is the default.
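
To make the breakpoint idea concrete, here is a minimal sketch of adjacent-sentence splitting, assuming an embed_fn that returns one vector per sentence (the real SemanticChunker layers more heuristics, such as a maximum chunk size, on top of this):

import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def split_on_breakpoints(sentences, embed_fn, similarity_threshold=0.5, min_chunk_size=200):
    """Group sentences into chunks, breaking where adjacent similarity drops."""
    vectors = embed_fn(sentences)                      # one embedding per sentence
    chunks, current = [], [sentences[0]]
    for prev_vec, vec, sentence in zip(vectors, vectors[1:], sentences[1:]):
        topic_shift = cosine(prev_vec, vec) < similarity_threshold
        big_enough = sum(len(s) for s in current) >= min_chunk_size
        if topic_shift and big_enough:                 # likely new topic: start a new chunk
            chunks.append(" ".join(current))
            current = [sentence]
        else:
            current.append(sentence)
    chunks.append(" ".join(current))
    return chunks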

Jarvis then goes one step further with Anthropic-style contextual chunking. This was a game-changer. The idea is to have an LLM (Claude) generate a short context summary for each chunk and prepend it to that chunk. The summary “situates” the chunk within the overall document. For instance, if a chunk is a code snippet setting a registry key, the prepended context might say “(Context: This is part of the persistence mechanism using a registry run key.)”. This way, even if the chunk is retrieved in isolation, it carries a hint of where it came from.

Jarvis prompts Claude with a template like:

# Simplified prompt template from Jarvis
<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Please give a short succinct context to situate this chunk within the overall document...

The model’s answer (a couple of sentences of context) gets cached and prepended to the chunk text. These “contextualized chunks” dramatically improved retrieval in testing, with a large reduction in retrieval errors compared to raw chunks. In practice, this means the system is far less likely to return an out-of-context snippet. Even if the chunk text itself is a bit ambiguous, the AI will see the prefixed context and understand how it fits in the bigger picture.
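
For a rough idea of the mechanics, here is a sketch of contextualizing a single chunk with the Anthropic SDK and a Redis cache; the model name, cache key scheme, and max_tokens are illustrative choices, not Jarvis’s exact settings:

import hashlib
import anthropic
import redis

CONTEXT_PROMPT = """<document>
{document}
</document>

Here is the chunk we want to situate within the whole document:
<chunk>
{chunk}
</chunk>

Please give a short succinct context to situate this chunk within the overall document..."""

def contextualize_chunk(document, chunk, client: anthropic.Anthropic,
                        cache: redis.Redis, model="claude-3-5-haiku-latest"):
    # Cache on the (document, chunk) pair so re-ingesting the same file is cheap
    key = "ctx:" + hashlib.sha256((document + chunk).encode()).hexdigest()
    cached = cache.get(key)
    if cached:
        return cached.decode() + "\n\n" + chunk
    message = client.messages.create(
        model=model,
        max_tokens=200,
        messages=[{
            "role": "user",
            "content": CONTEXT_PROMPT.format(document=document, chunk=chunk),
        }],
    )
    context = message.content[0].text.strip()
    cache.set(key, context)
    return context + "\n\n" + chunk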

To illustrate, let’s say I have a lengthy internal wiki page about AWS penetration testing. If I ask Jarvis, “How do I enumerate S3 buckets?”, it will embed my question into a vector (more on embeddings soon) and search for relevant chunks. With contextual chunking, instead of retrieving a random paragraph that just says “use aws s3 ls with some flags”, Jarvis might retrieve a chunk that looks like:

Context: This section covers enumeration of S3 buckets using AWS CLI and Boto3.
Content: To enumerate S3 buckets, you can use the AWS CLI: aws s3 ls … (and so on)

The prefixed context (“This section covers enumeration of S3 buckets…”) ensures the AI understands where this advice is coming from and can trust it’s about S3 enumeration specifically.

Under the hood, Jarvis’s chunking service is modular and loosely coupled. It doesn’t hard-code calls to any one embedding model or LLM. Instead, it defines an EmbeddingFunction protocol and takes in an embedding function at runtime. Similarly, the contextual chunker takes in an Anthropic client instance and a Redis cache handle. This decoupling made it easy to plug in different models (e.g., try GPT-4 for context instead of Claude, if needed) and to test things in isolation.
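
For illustration, the protocol amounts to little more than “a callable that maps texts to vectors”; a minimal sketch of what it might look like:

from typing import Protocol, Sequence

class EmbeddingFunction(Protocol):
    def __call__(self, texts: Sequence[str]) -> list[list[float]]:
        """Return one embedding vector per input text."""
        ...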

To give a small code example, here’s how one might use the Jarvis chunking service with an embedding service:

from jarvis.services import ChunkingService, create_embedding_service

# Create the embedding service (using the Alibaba GTE model)
embedding_service = create_embedding_service(
    model_name="Alibaba-NLP/gte-large-en-v1.5",
    device="cuda",
)
embed_fn = lambda texts: embedding_service.embed(texts)

# Create the chunker (semantic + contextual)
chunker = ChunkingService(
    embedding_function=embed_fn,
    strategy="semantic",  # use semantic base
    semantic_config={
        "min_chunk_size": 200,
        "max_chunk_size": 1000,
        "similarity_threshold": 0.5,
    },
    contextual=True,  # enable contextual chunking via Claude
)

chunks = chunker.chunk_text(document_text)

This yields a list of Chunk objects, each with chunk.text containing a nice context-prefixed snippet ready for indexing. Now, let’s talk about that embedding and indexing step.

Dense + Sparse: A Hybrid Search Engine

Once documents are chunked, Jarvis embeds them into a high-dimensional vector space for semantic search. I chose the Alibaba NLP GTE-large model for embeddings (1024 dimensions) – at the time of writing it’s a state-of-the-art model for general text, and it performs exceptionally well on semantic similarity. Using a strong embedding model is crucial: it means that when I convert a piece of text into a vector, semantically related texts end up nearby in that vector space. “Kerberoasting detection in Windows logs” and “catching SPN ticket attacks” will ideally be close vectors, even if they don’t share exact wording.

Jarvis’s vector database of choice is Qdrant, a high-performance open-source vector store written in Rust. All chunk vectors go into Qdrant, which supports fast cosine similarity search. But semantic vectors alone aren’t the full story. Through hard experience and research, I knew that lexical search still adds value – especially in technical domains where specific keywords (function names, error codes) matter. So Jarvis implements a hybrid search that combines dense vector similarity with good old-fashioned BM25 keyword scoring.

Here’s how it works: when a query comes in, Jarvis does both a vector search in Qdrant and a BM25 search over the corpus text (Jarvis maintains an in-memory BM25 index for all documents). Each method yields its own list of top hits. The system then merges them using a fusion algorithm. Initially, I used a simple weighted score fusion (70% vector score, 30% BM25). This was straightforward and ensured that even if a query contained a literal keyword match (say a function name) that the embedding might miss, the BM25 part would boost those relevant docs.

Over time, I moved to Reciprocal Rank Fusion (RRF), a popular algorithm in information retrieval. RRF doesn’t use raw scores; instead, it uses the rank positions of results from each method. Essentially, each document gets a score of 1/(k + rank) from each list it appears in, and those are summed – so being the #1 result in either list gives a big boost, being #10 gives a smaller boost, and so on. The intuition is to favor documents that rank highly in both lists, in a way that’s robust to score scaling issues. After switching to RRF, I noticed a modest improvement in hybrid search results.
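
In code, RRF over the two ranked lists is only a few lines; a sketch, with k=60 as the conventional constant:

def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of document IDs; high ranks in either list score well."""
    scores = {}
    for results in result_lists:                       # e.g. [vector_ids, bm25_ids]
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# fused = reciprocal_rank_fusion([vector_ids, bm25_ids])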

To make this concrete, imagine I search Jarvis for “Kerberoasting detection.” The vector search might not know the term “Kerberoasting” if that exact word wasn’t in the embedding training data, but BM25 will catch documents containing that keyword. Conversely, the vector search might surface a log analysis guide that never says “Kerberoast” explicitly but talks about suspicious ticket requests. RRF merges these so that a guide which has both the term in one section and conceptual discussion elsewhere will float to the top. This hybrid approach is considered best practice in modern RAG systems, and Jarvis’s implementation aligns with industry patterns (vector DB + BM25, merging results, etc.).

There’s one more reranking step: after hybrid search, Jarvis applies a cross-encoder reranker for fine-grained scoring. The cross-encoder is a separate transformer model (Microsoft’s ms-marco-MiniLM-L-12-v2 in this case) that takes a query and a candidate passage, and outputs a relevance score. It literally encodes the query and passage together (hence cross-encoder) rather than via separate embeddings, which allows it to catch subtle nuance and context overlap. Using a cross-encoder can significantly improve the final relevance of top results, at the cost of extra computation.

In Jarvis, I configured it to rerank the top 50 or so results from the hybrid search, and then trim to the top 20. In tests on my knowledge base, this boosted precision noticeably, especially for complex questions. For example, if I asked “How was the SharePoint CVE exploited during that 2022 engagement?”, the initial hybrid search might return many chunks from the 2022 report. The cross-encoder then helps pick the best, like the chunk that actually describes the exploit technique, not just mentions SharePoint in passing. The reranker model (MiniLM) is small enough to run quickly on GPU, so the latency hit was minimal (tens of milliseconds per query since it’s scoring only around 50 passages).
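
A minimal version of that reranking step, using the sentence-transformers CrossEncoder wrapper (the wiring inside Jarvis differs, but the model and the idea are the same):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2", device="cuda")

def rerank(query, passages, top_k=20):
    """Score each (query, passage) pair jointly and keep the best top_k."""
    scores = reranker.predict([(query, p) for p in passages])
    ranked = sorted(zip(passages, scores), key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in ranked[:top_k]]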

To summarize the retrieval pipeline in Jarvis, here’s the high-level flow:

  • Ingestion: Document → chunk into meaningful pieces (with semantic + contextual chunking) → embed each chunk into a vector → store vectors in Qdrant (and text in BM25 index).

  • Query: User question → embed question vector (GPU) → parallel search: semantic similarity in Qdrant + BM25 keyword search → fuse results (RRF) → take top N → cross-encoder rerank those → return final hits.

And of course, the final step is the generation: the system feeds those retrieved chunks into an LLM (like GPT-4 or Claude) to produce an answer or summary for the user. The focus of this first post is on the retrieval part, but it’s worth noting Jarvis can use either OpenAI or Anthropic APIs to then generate answers with the retrieved info as context.

Example: Retrieval in Action

Let’s walk through a simplified example to see this in action. Suppose I have two documents ingested:

  1. “Azure AD Monitoring Guide” – which contains a section about detecting golden ticket and Kerberoasting attacks.

  2. “2022 Red Team Report – CorpX” – which describes how we performed a Kerberoasting attack on CorpX’s domain and what tools we used.

Now I ask Jarvis: “What’s the method to find service accounts vulnerable to Kerberoasting?”

  1. Embedding: Jarvis converts my question to a 1024-d vector via the GTE model (on GPU). This vector captures that I’m looking for service accounts, Kerberoasting vulnerability, etc.

  2. Hybrid Search:

    • Vector search finds chunks that are semantically similar. Likely it finds a chunk from the CorpX report that describes “we enumerated SPNs for accounts with Kerberoastable tickets using PowerView...”, even if that chunk doesn’t explicitly use the exact phrasing of my query.

    • BM25 search kicks in because “Kerberoasting” is a rare keyword – it will likely pull the Azure AD guide section that literally lists “Kerberoasting detection tips” or similar.

  3. Fusion: RRF sees that the CorpX report chunk was rank #1 in the vector results (good semantic match) and maybe #3 in BM25 (it had the keyword once). The Azure AD guide chunk was rank #1 in BM25 but maybe rank #5 in vectors (less semantic match). Depending on the ranks, these two will bubble up.

  4. Rerank: The cross-encoder model takes the question and each candidate chunk. It might give a slightly higher score to the CorpX report chunk if that text specifically answers “method to find service accounts” (maybe it says “used Get-SPNTicket”). The Azure guide chunk might be more generic: “to detect Kerberoasting, monitor event XYZ”, which is slightly different (detection vs method to find).

  5. Result: Jarvis returns the top chunk(s). In this scenario, likely it returns the step from the red team report about using PowerView’s GetUserSPNs to list accounts with SPN (service principal names) – which is exactly the method to find Kerberoastable accounts.

The magic here is that Jarvis understood my question semantically and pulled an answer from my past operation, even though I phrased it differently. Without semantic embedding, I might not have thought to search for “SPN enumeration PowerView” manually. Without BM25, I might have missed some explicit references. The combination ensured I got what I needed.

Building a Lasting Memory

By implementing this multi-layer retrieval pipeline – contextual chunking, hybrid search, cross-encoder reranking – Jarvis transforms my jumble of notes and logs into a queryable memory. The payoff has been huge. I’ve seen the success rate of Jarvis retrieving relevant info for a given query jump dramatically from when I started (back when I naïvely did plain embeddings without context or BM25). In practice, this means Jarvis more often “remembers” things that I had forgotten I even wrote down.

In the next post, we’ll shift from retrieval into how Jarvis orchestrates different AI models and tools – essentially how it decides what to do with a query once it has the relevant knowledge. This is where Jarvis moves from a passive library into an active assistant, automatically choosing the best approach (e.g. code generation vs answering a question vs running a task). We’ll dive into the orchestration engine, task classification, and how Jarvis manages to use multiple AI services efficiently. The journey continues from just remembering knowledge to applying it in context – the real “AI-augmented workflow” part of the story.

Post 2: Building the Orchestration Engine

In the first post, we covered how Jarvis retains and retrieves knowledge. But Jarvis isn’t just a static search engine – it’s an active assistant that can figure out what you’re asking for and how to respond. Welcome to the orchestration engine: the “brain” that routes queries to the right AI models or tools, manages context, and balances factors like cost and performance.

In this post, I’ll share how I approached designing Jarvis’s orchestration system, making it a smart router rather than a one-trick pony. This part of the journey was about reducing manual effort – I didn’t want to decide every time “Should I use GPT-4 for this or a local model?” or “Do I need to run a reconnaissance script?”. I wanted Jarvis to handle those choices.

From Queries to Tasks: Classifying Intent

The first step in orchestration is understanding what the user (often me) is asking for. Is this a straightforward question answerable from documentation, or a coding task, or maybe a request to run a security scan? Jarvis addresses this with a Task Classifier that labels each request with a task type and complexity. The task types Jarvis recognizes include things like: coding, reasoning, research, analysis, summarization, conversation, multimodal, rag_query (for retrieval questions), and a general catch-all. Complexity is rated simple, moderate, complex, or critical.

For example:

  • “Write a Python script to parse PCAP files” might be classified as coding with moderate complexity.

  • “What were the findings in our Q3 phishing assessment?” might become rag_query (retrieval-augmented question) with simple complexity.

  • “Investigate this suspicious domain across logs” could be research, and possibly complex if it implies multi-step analysis.

How does Jarvis classify tasks? It uses a hybrid of semantic similarity and an LLM. I provided a set of “reference examples” for each task type in a config file (for instance, under coding I might have examples: “Write a Python function …”, “Implement an algorithm …”). On initialization, Jarvis’s classifier embeds all these examples into vectors. When a new request comes, it embeds the request and finds which task’s examples it’s closest to. If one task is clearly above a confidence threshold (~0.85 similarity), it assigns that task purely via this semantic match. This is very fast (a few milliseconds).

However, if the request is ambiguous – say the top match is only 0.6 and second-best is 0.55 – Jarvis then calls an LLM to refine the classification. Essentially it asks a model (like a lightweight GPT-4 or Claude model dedicated to classification) something like: “The user said X. I think it’s either a coding or analysis query. What do you think and why?”. The LLM then helps decide and also provides a short reasoning if needed. This two-stage approach (semantic first, LLM second) gives me the best of both: speed for obvious cases and accuracy for edge cases. It achieves high accuracy in internal tests, with typical classification taking under half a second even when the LLM is invoked.
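
The semantic stage boils down to a nearest-example comparison against the reference embeddings; a simplified sketch, with the ambiguous case handed back to the caller for the LLM fallback:

import numpy as np

def classify_semantic(query, embed_fn, examples_by_task, threshold=0.85):
    """Return (task_type, similarity); task_type is None if nothing clears the threshold."""
    q = np.asarray(embed_fn([query])[0])
    q = q / np.linalg.norm(q)
    best_task, best_sim = None, -1.0
    for task, example_vectors in examples_by_task.items():   # precomputed at startup
        for vec in example_vectors:
            v = np.asarray(vec)
            sim = float(np.dot(q, v / np.linalg.norm(v)))
            if sim > best_sim:
                best_task, best_sim = task, sim
    if best_sim >= threshold:
        return best_task, best_sim        # confident: pure semantic match, no LLM call
    return None, best_sim                 # ambiguous: escalate to the LLM classifier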

Jarvis’s task classification doesn’t stop at labeling the type; it also assesses complexity. Complexity is determined by simple heuristics – e.g., length of the request, certain keywords (like “detailed” or “comprehensive” might signal higher complexity), and whether the user asks for code or just an explanation. The classifier sets a flag if it thinks the task will need heavy lifting. For instance, “Generate a full pentest report from these notes” would be complex, whereas “Summarize this one log line” is simple. Complexity will factor into model selection next.

Let’s peek into how the classifier represents its decision. It outputs a structured TaskClassification dataclass:

TaskClassification(
    task_type="coding",
    complexity="moderate",
    requires_gpu=False,
    requires_web_access=False,
    requires_context=False,
    estimated_tokens=1000,
    confidence=0.94,
    reasoning="Classified as 'coding' (confidence 0.94, method: semantic)",
    classification_method="hybrid",
)

This is roughly what Jarvis produced for a prompt like “Write a Python function for binary search” in testing. You can see it decided on coding, moderate complexity, and it even set some resource flags: requires_gpu, requires_web_access, etc. Those flags are automatically inferred; e.g., if task_type is “multimodal” (like handling an image) it might set requires_gpu=True (assuming we need a vision model), or if task_type is “research” it sets requires_web_access=True because research might involve internet queries. In the binary search example, none of those were required. The classifier also estimated token usage and attached a reasoning string mostly for logging/debugging.

Intelligent Routing: Choosing the Right Model

Once Jarvis knows what the task is, it needs to choose how to execute it. I integrated multiple AI model providers into Jarvis – OpenAI, Anthropic, Google (Gemini), and even a local experimental model – each with different strengths. The orchestration engine includes an Intelligent Router that picks the optimal model given the task type, complexity, and my configured preferences.

Jarvis has a registry of models with their capabilities and costs (maintained in an orchestration.yaml config). For example:

  • Claude 4.5 (Anthropic) – great at coding, large 200k context, moderate cost.

  • Claude Opus (Anthropic) – excellent reasoning, very high cost, 200k context.

  • GPT-4 (OpenAI) – general purpose, good quality, 128k context, expensive but less than Claude Opus.

  • GPT-4 mini (OpenAI) – a cheaper variant for quick/light tasks.

  • Gemini 2 (Google) – free or low cost, extremely large context (1M+ tokens), good for broad research or multimodal tasks but maybe lower quality in some areas.

  • Grok, DeepSeek, etc. – other models with specialized strengths.

Each model entry includes metadata like cost per million tokens, max context length, and what it’s best at. The router uses a decision tree (configured rules) to map tasks to models. For instance, the routing policy might say:

  • For a coding task:

    • If complexity is simple or moderate → use Claude 4.5 (Sonnet).

    • If complexity is complex → escalate to Claude Opus 4.

    • Fallbacks: if Anthropic is unavailable, fall back to GPT-4.

  • For reasoning/analysis:

    • Simple → GPT-4 mini.

    • Moderate → Claude 4.5.

    • Complex → Claude Opus.

    • Fallback → GPT-4.

  • For research tasks (open-ended web research or multi-step):

    • Prefer Gemini 2.0.

    • If context required > 100k tokens, definitely use Gemini.

    • Fallbacks to other providers as needed.

  • For multimodal:

    • Use Gemini, fallback to GPT-4 if possible.

  • For rag_query:

    • Route to HybridSearchService, i.e., Jarvis’s own retrieval system, instead of a language model. The router recognizes “this is a question for Jarvis’s memory” and will perform the steps from Post 1 to fetch an answer, then possibly summarize it.

The Router doesn’t just pick the absolute cheapest or most expensive – it balances cost against quality using a cost_sensitivity setting, which can be “optimize” (budget-leaning), “balanced”, or “performance”. If I set cost_sensitivity to “optimize”, Jarvis will favor cheaper models more often, even if they’re slightly lower quality. If “performance”, it will default to the best quality (likely Claude Opus or GPT-4) regardless of cost. In balanced mode (the default), it tries to save money when it can do so without a big quality hit. This system cut my API usage costs dramatically by using cheaper models for simple stuff and only pulling out the big guns when truly necessary.
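
In spirit, the routing policy is just a lookup keyed by task type and complexity, with cost sensitivity nudging the final choice; a simplified sketch (the rule structure and the cheaper-substitute table are assumptions, while the model names mirror the registry above):

ROUTES = {  # illustrative subset of the rules in orchestration.yaml
    "coding":    {"simple": "claude-sonnet-4-5", "moderate": "claude-sonnet-4-5",
                  "complex": "claude-opus-4", "fallbacks": ["gpt-4"]},
    "reasoning": {"simple": "gpt-4-mini", "moderate": "claude-sonnet-4-5",
                  "complex": "claude-opus-4", "fallbacks": ["gpt-4"]},
}

CHEAPER = {"claude-opus-4": "claude-sonnet-4-5", "gpt-4": "gpt-4-mini"}

def route(task_type, complexity, cost_sensitivity="balanced"):
    """Map a classified task to a primary model plus an ordered fallback list."""
    rule = ROUTES.get(task_type, ROUTES["reasoning"])
    model = rule.get(complexity, rule["moderate"])
    if cost_sensitivity == "optimize":
        model = CHEAPER.get(model, model)     # trade a little quality for cost
    return model, rule["fallbacks"]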

A few example scenarios:

  • Example 1: “Write a Python function for binary search.”

    • Classification: coding (moderate).

    • Router: coding/simple–moderate → chooses Claude Sonnet 4.5 as primary. Claude is known to excel at code and it’s cheaper than GPT-4 for similar coding quality.

  • Example 2: “What is 15% of 240?”

    • Classification: reasoning (simple).

    • If cost sensitivity is “optimize”, the router picks GPT-4 mini which costs almost nothing, instead of using Claude or full GPT-4.

  • Example 3: “Analyze this complex incident log and summarize attacker activity.”

    • Classification: analysis or reasoning (complex).

    • Router: sees high complexity, likely chooses Claude Opus 4 (pricey but strong reasoning and big context). In balanced mode, it might try Claude 4.5 first and only escalate if needed.

One important component is fallbacks. If the chosen model fails (say the API returns an error or times out), Jarvis will automatically try an alternate provider. It has an ordered list of backups for each task. For example, if Claude is down for coding tasks, try GPT-4; if that fails, maybe try another model. When I experienced an Anthropic outage once, Jarvis seamlessly switched to OpenAI – I just saw a slight slowdown but no interruption.

Unifying APIs: The Model Delegator

Talking to all these different providers’ APIs can be a nightmare – each has its own format, auth, quirks. To simplify, I built a ModelDelegator service. This is basically a unified client interface that wraps the SDKs or HTTP calls for Anthropic, OpenAI, Google, etc., under one roof. When the router picks a model, it calls:

ModelDelegator.generate(request, model_name="claude-sonnet-4-5", **kwargs)

and the delegator figures out which API to call. It translates the prompt into the right format (OpenAI expects a JSON body with a messages array, Anthropic has its own message format, and so on), then sends the request off. It also:

  • Tracks tokens: the delegator knows the token lengths and costs for each provider, so it calculates how many tokens were sent/received and the approximate dollar cost.

  • Handles errors and retries: if a call fails due to a transient error (network issue or rate limit), the delegator has a retry policy with exponential backoff. It may also downgrade models – e.g., if I asked for GPT-4 but my API key quota is exhausted, it can automatically try GPT-3.5 as a fallback.

  • Enforces context limits: if the request plus retrieved docs exceed a model’s max context, the delegator can catch that and either trim or route to a bigger model as needed.

Implementing the delegator was tedious but crucial. It shields the rest of Jarvis from the nitty-gritty of each API. If a provider changes their format tomorrow, I just update the delegator. If I want to add a new provider (say a self-hosted LLaMA2), I implement a small adapter in the delegator and add it to the registry, without touching the high-level orchestration logic.
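
A bare-bones sketch of the delegator pattern, with retries and exponential backoff; the adapter interface and exception type here are placeholders, not the real Jarvis classes:

import time

class TransientAPIError(Exception):
    """Raised by provider adapters on rate limits or network hiccups (illustrative)."""

class ModelDelegator:
    """Thin facade over per-provider adapters; a sketch, not Jarvis's real class."""

    def __init__(self, adapters):
        # adapters map provider name -> object exposing .complete(model, prompt, **kwargs)
        self.adapters = adapters

    def generate(self, request, model_name, max_retries=3, **kwargs):
        provider = "anthropic" if model_name.startswith("claude") else "openai"
        for attempt in range(max_retries):
            try:
                return self.adapters[provider].complete(model_name, request, **kwargs)
            except TransientAPIError:
                time.sleep(2 ** attempt)      # exponential backoff before retrying
        raise RuntimeError(f"{model_name} failed after {max_retries} retries")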

Keeping Context: Session Management

Another key piece of the orchestration system is Session Management. When I’m chatting with Jarvis or asking a series of questions, I want it to remember context from previous interactions (especially in “assistant” mode). For example:

  • Me: “I have an error on server X, here are the logs…”

  • (Jarvis analyzes.)

  • Me: “Now deploy the patch to fix it.”

In the second request, “the patch” refers to context from earlier. Jarvis uses a session ID to tie these together. Under the hood, Jarvis maintains a Redis store for session data. Each session has:

  • The last N conversation turns (N=20 by default).

  • Metadata like user preferences or any files/code I’ve “attached” to the session.

  • A running token count and cost total.

  • A TTL (time-to-live) – e.g., 1 hour of inactivity.

Jarvis’s API exposes this: when you make a request, you can pass a session_id. If you don’t, it creates a new session and returns the ID. On subsequent calls, provide that ID to continue the session. Jarvis will automatically prepend the conversation history from that session to your query so that the model sees the context.

The session memory is currently stored as raw text for the conversation turns, but I’m considering vectorizing and compressing older turns in the future. By limiting to 20 turns and often summarizing mid-session, it manages context window limitations. Because the session is in Redis, multiple Jarvis API instances (workers) can share it. There’s also a session stats endpoint to introspect a session – it shows when it was created, how many messages so far, tokens used, and so on.
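
A sketch of how such a Redis-backed session store can look; the key prefix and JSON layout are assumptions:

import json
import uuid
import redis

r = redis.Redis()
SESSION_TTL = 3600     # one hour of inactivity
MAX_TURNS = 20         # keep only the last N conversation turns

def append_turn(session_id, role, text):
    """Add a turn to the session (creating it if needed) and refresh the TTL."""
    session_id = session_id or str(uuid.uuid4())
    key = f"session:{session_id}"
    raw = r.get(key)
    turns = json.loads(raw) if raw else []
    turns.append({"role": role, "text": text})
    r.set(key, json.dumps(turns[-MAX_TURNS:]), ex=SESSION_TTL)
    return session_id

def history(session_id):
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else []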

Orchestration Flow Example

Let’s combine all these pieces – classification, routing, model delegation, and session – to trace what happens internally on a sample request to Jarvis’s orchestrate API.

  1. User Request: “Explain microservices architecture” (with no session, so a new conversation).

  2. Session Initialization: Jarvis creates a new session ID and stores an empty history.

  3. Task Classification: The classifier sees this is likely a general explanatory query – probably falls under “analysis” or “conversation”. Suppose it classifies as analysis with simple complexity and high confidence.

  4. Routing: For a simple analysis task, the router might pick a smaller model like GPT-4 mini or GPT-4 itself if quality is a priority. Suppose Jarvis picks GPT-4 (OpenAI) in balanced mode.

  5. Model Delegation: The delegator formats the prompt: system message with instructions, user message with “Explain microservices architecture”. It sends this to OpenAI’s API.

  6. LLM Generation: GPT-4 produces a nice explanation of microservices architecture.

  7. Return and Store: Jarvis returns the answer to the user. It also saves the Q&A into the session history in Redis.

Now, a follow-up:

  • User: “What are the downsides?” (with the same session_id.)

  • Session: Jarvis loads the session history, which has the earlier Q&A.

  • Classification: Again labeled analysis, simple; this time requires_context=True.

  • Routing: Same route as before (GPT-4).

  • Delegation: The delegator composes a prompt that includes the prior Q&A as context, then appends “What are the downsides?”.

  • Generation: GPT-4 answers with downsides of microservices (complexity, debugging challenges, etc.).

  • Return: Jarvis returns the answer and updates the history.

From the user’s perspective, Jarvis behaves like a persistent chat AI with knowledge of context. Internally, this is the orchestration engine juggling pieces to make that happen.

Under the Hood: Implementation Notes

Jarvis’s orchestration lives largely in a core orchestration service module. On app startup, it:

  • Spins up the TaskClassifier (loading reference embeddings).

  • Sets up the model registry.

  • Warms up any necessary model clients (OpenAI, Anthropic, etc.).

Each incoming request to the /api/v1/orchestrate endpoint goes roughly through this flow:

  1. Load/create session (via SessionManager).

  2. Call TaskClassifier.classify(request_text, context) – context may include file attachments or prior turns.

  3. Pass the TaskClassification result to the Router to get a model choice.

  4. Invoke ModelDelegator.generate with the chosen model and prompt.

  5. Receive the AI result, update session with the new conversation turn (plus any tokens used).

  6. Return the result (and metadata) to the API response.

The whole pipeline is asynchronous and runs within FastAPI’s event loop (model HTTP calls are I/O bound, so that’s fine). For certain long operations, I integrated Celery so some tasks (e.g. heavier research flows) can be handled in the background, but most user queries are real-time.

Throughout building this, a key focus was not re-inventing wheels where not needed. The classifier uses embeddings from the same vector model as the knowledge base (consistency and fewer models to maintain). The router logic is basically config-driven – I can edit a YAML to tweak routes rather than hardcode things. And by leveraging existing models’ strengths (Claude’s long context for code, Gemini’s free tier for broad research, etc.), Jarvis orchestrates efficiently.

Results: A Unified Assistant

After implementing the orchestration engine, using Jarvis started to feel qualitatively different. I could ask it to do things and it would just figure it out. Some anecdotes:

  • I can prompt: “Convert this Python script to PowerShell” and Jarvis will classify it as coding and use Claude (which has shown strength in code conversion tasks) to produce the PowerShell. No need for me to specify a model.

  • I can say: “Run a quick Nmap scan on this IP range and summarize open ports.” Jarvis recognizes that as a potential tool invocation. Even if it doesn’t execute the scan itself, it can outline exactly how and what to look for.

  • If a request is very heavy, e.g. “Read these 5 PDFs and answer questions”, Jarvis will split it into sub-tasks under the hood (ingesting the PDFs via the RAG pipeline, then querying them). The orchestration engine may identify a need for long context and decide to use Gemini 2, which can handle huge input.

One particularly satisfying moment: I once accidentally gave Jarvis a malformed query that confused the classifier (something like “$$$ system failure blah blah” – some test input). The classifier wasn’t confident, so it fell back to the LLM which produced a reasonable guess that I was asking about system failures. The router then did its thing. It was nice to see the hybrid classifier catch an edge case – it’s like Jarvis asked itself “did I get that right?” and double-checked with a smarter friend. The result was a correct routing despite the odd input.

The orchestration engine makes Jarvis dynamic. It reduces duplicated effort by automating model choice and allows Jarvis to interface with multiple systems through one coherent interface. Essentially, Jarvis became an AI concierge – you ask for something, and it decides which specialist to consult (which model or tool), gets the answer, and gives it to you with context. All while keeping track of the conversation state and not burning a hole in my API budget.

In the next and final post, we’ll explore Jarvis as a living system in the real world: how it ingests various data sources continuously, supports red team and blue team workflows, and how it’s deployed for daily use. We’ll also reflect on the journey and where this project is headed next.

Post 3: Jarvis as a Living System

In the final part of this series, we step back and look at Jarvis not just as a collection of features, but as an evolving living system integrated into security workflows. Jarvis has grown from a neat idea (an AI memory + assistant) into something I actively use to augment my red team and blue team activities. I’ll discuss how Jarvis ingests knowledge continuously, how it interfaces with real-world tools and data, and what it’s like deploying and running this system day-to-day. This post will be a bit more personal and exploratory, reflecting on how having a “persistent AI teammate” changes the way I work – and where I aim to take Jarvis next.

Feeding Jarvis: Ingesting Everything (Notes, Logs, and Beyond)

A memory is only as good as what you put into it. Early on, I had only ingested static documents (reports, guides, wikis) into Jarvis. But to truly reduce duplicated work, I needed Jarvis to absorb the ongoing stream of data I generate: every new finding, every how-to snippet I stumble upon, chat discussions about incidents, tool outputs from engagements, etc. This turned Jarvis into a living knowledge base that grows over time instead of a one-time import.

File Ingestion: I set up ingestion pipelines for various file formats. There’s a script that watches certain folders (my “Knowledge” folder, project folders, etc.) and whenever I drop a new Markdown, PDF, or text file, it sends it to Jarvis via an ingestion endpoint. Under the hood, Jarvis will chunk the document (using the semantic/contextual chunker) and index it into Qdrant and the BM25 index. I configured it to tag each piece with metadata like the source filename, project, date, and so on, so I can filter or trace results later.

PDFs are handled via a pipeline that first extracts text; I use a combination of fast and robust parsers, sometimes including OCR for scans. Ingestion is asynchronous – I can throw 100 documents at Jarvis and it will queue them and process in parallel without choking. This means after an engagement, I can feed in all our notes and output data to Jarvis and trust that knowledge will be there for the future.
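
The folder watcher is a thin layer on top of the ingestion API; here is a sketch using the watchdog library, where the endpoint path and form fields are assumptions about Jarvis’s API rather than its documented contract:

import requests
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

JARVIS_INGEST = "http://localhost:8000/api/v1/ingest"   # assumed endpoint path

class KnowledgeFolderHandler(FileSystemEventHandler):
    def on_created(self, event):
        if event.is_directory or not event.src_path.endswith((".md", ".txt", ".pdf")):
            return
        with open(event.src_path, "rb") as f:            # ship the new file to Jarvis
            requests.post(JARVIS_INGEST,
                          files={"file": f},
                          data={"project": "global", "source": event.src_path})

observer = Observer()
observer.schedule(KnowledgeFolderHandler(), path="./Knowledge", recursive=True)
observer.start()
observer.join()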

Conversations and Chat Logs: A big breakthrough was capturing my chats with AI. I often use an AI coding assistant (like Claude in VSCode) for help. Jarvis has an auto-capture feature that can hook into those chats and save them. Originally, I had a VSCode extension hook that would automatically call Jarvis to ingest each conversation turn; later I shifted to a manual batch ingestion using a small PowerShell script that finds the latest Claude conversation transcripts on my PC and sends them to Jarvis in one go. After I finish a coding session with the AI, I run this script and it pushes the whole conversation JSON to Jarvis. Jarvis then indexes it so I can later query “What was that regex trick Claude suggested yesterday?” and it pops up.

Similarly, for important Slack discussions or email threads, I can forward them into Jarvis. I have an “ingest email” rule where I BCC a special address that pipes the email to Jarvis’s API. The content then lives in my semantic memory. Having a single place to search across personal notes, AI chats, and team communications is incredibly powerful. It’s like having Google for my stuff, with the AI context built in.

Shell Session Logs: As a red teamer, I often end up with lengthy terminal logs (e.g., an SSH session where I tried various privilege escalation strategies). Those were typically tossed aside or trimmed for reports. Now I log them and let Jarvis ingest them. It breaks them into steps, and because the chunks include timestamps and command outputs as context, Jarvis can answer questions like “Did we run mimikatz on DC01?” or “What was the output of ifconfig on the web server?” if I’ve fed it those session logs. I do apply filters to avoid indexing huge binary dumps or repetitive logs.

Tool Output and Structured Data: Jarvis’s ingestion isn’t limited to plain text. I integrated simple parsers for formats like Nmap XML and Nessus CSV exports. The pipeline can take those, parse out host-vuln information, and store it in a structured way. I tag the chunks with metadata like {"type": "nmap_scan", "target": "10.0.0.0/24", "date": "2025-11-01"}. This way, Jarvis can answer “Which hosts had port 445 open in the Oct engagement?” by searching the metadata and content of those scan results.
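
As an example of what those parsers do, here is a sketch of turning an Nmap XML report into ingestable records carrying the metadata tags described above (the record shape is an assumption about Jarvis’s ingestion format):

import xml.etree.ElementTree as ET

def parse_nmap_xml(path, target, date):
    """Yield one ingestable record per open port, with searchable metadata."""
    root = ET.parse(path).getroot()
    for host in root.findall("host"):
        addr = host.find("address").get("addr")
        for port in host.findall(".//port"):
            if port.find("state").get("state") != "open":
                continue
            service = port.find("service")
            name = service.get("name", "unknown") if service is not None else "unknown"
            yield {
                "text": f"Host {addr} has port {port.get('portid')}/{port.get('protocol')} open ({name}).",
                "metadata": {"type": "nmap_scan", "target": target, "date": date, "host": addr},
            }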

Ingesting continuously raises the question of data validation and duplicates. Jarvis handles this by:

  • Computing a hash of content on ingest; if the same chunk comes in twice, it can recognize it and skip or update (see the sketch after this list).

  • Flagging very similar chunks (high cosine similarity) to avoid storing redundant info that wastes space.

  • Using metadata so I can scope searches to specific sources or date ranges, which also mitigates issues with duplication.
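
The first of those checks is just a content hash; a minimal sketch, with the in-memory set standing in for whatever persistent store (Redis, Qdrant payloads) actually backs it:

import hashlib

seen_hashes = set()     # in practice this would live in Redis or in Qdrant payloads

def is_duplicate(chunk_text):
    """Exact-duplicate check via a content hash; near-duplicates need cosine similarity."""
    digest = hashlib.sha256(chunk_text.strip().lower().encode()).hexdigest()
    if digest in seen_hashes:
        return True
    seen_hashes.add(digest)
    return False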

Multi-Tenancy for Projects: Since I use Jarvis for different projects and clients, I enabled a multi-tenant design. By default, everything goes into a shared global index, but there are also per-project indexes. If I label content with a project ID, Jarvis will store it in a separate collection in Qdrant (like jarvis_project_redteam1). Queries can specify a project to scope results, which helps avoid leaking info between contexts. You can also search both the shared collection and a project collection in one go, combining general knowledge with project-specific data.
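
Scoped retrieval then becomes a search across the shared collection plus the project’s own collection; a sketch with the qdrant-client library, where the shared collection name is an assumption:

from qdrant_client import QdrantClient

client = QdrantClient(host="localhost", port=6333)

def scoped_search(query_vector, project=None, limit=10):
    """Search the shared collection plus an optional per-project collection."""
    collections = ["jarvis_global"]                       # assumed name of the shared index
    if project:
        collections.append(f"jarvis_project_{project}")   # e.g. jarvis_project_redteam1
    hits = []
    for name in collections:
        hits.extend(client.search(collection_name=name,
                                  query_vector=query_vector,
                                  limit=limit))
    return sorted(hits, key=lambda h: h.score, reverse=True)[:limit]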

Red Team Automation with Memory and AI

One of my original goals was to have Jarvis automate red team tasks. We’re not fully there yet, but we’ve made progress. Jarvis’s memory layer enables what I call context-aware automation. Instead of blindly running tools, Jarvis can use its knowledge to make decisions.

Imagine I’m doing an internal penetration test. I’ve gathered a bunch of data – user accounts, network shares, etc. I can ask Jarvis something like: “Enumerate likely lateral movement paths.” This is a high-level request; Jarvis interprets it as a research or analysis task. In its knowledge base, it has documents on lateral movement, maybe BloodHound results from previous tests, etc. It retrieves that relevant knowledge (for example, it might pull a chunk from a previous report: “User X is a local admin on Machine Y, which could be used for lateral movement”). Then Jarvis might choose a strategy: outline a series of steps or queries tailored to the environment, such as:

  • Focus on accounts with SPN set – those are kerberoastable.

  • Check if any of these accounts appear in local admin groups from prior recon.

  • Use RDP or SMB to attempt lateral movement to hosts where you have credentials.

If I feed it live data (like the latest BloodHound export), Jarvis can ingest that on the fly and include it in its reasoning. I’ve done this with simpler inputs: I ran a custom PowerShell script to enumerate local admins on a set of PCs, ingested the results, then asked Jarvis “Which machines can I access with user JohnDoe’s credentials?” and it cross-referenced the data to answer.

Another area is report generation and reuse. Red team ops often produce similar findings (missing patches, misconfigs) that we have to explain and remediate in reports. Jarvis’s memory of past findings allows it to auto-suggest report content. For example, if I type a finding title “LLMNR/NBT-NS Poisoning”, Jarvis can retrieve the detailed description and fix from the last time it was reported, saving me from digging through old reports. I can then adapt it. This has easily saved hours in reporting. It’s like having a tailored CWE/knowledge base that’s queryable in natural language.

While Jarvis hasn’t fully automated exploit execution (I still drive the actual tools), it certainly automates the reasoning and recall. It feels like pair-programming or pair-hacking with an AI that never forgets anything. During an engagement, I might ask Jarvis in our private Slack (via a bot integration) “Do we have any foothold on the SQL servers?” – Jarvis will search through the engagement notes it ingested and answer with whatever it finds, such as “Yes, we got a credential on SQL01 as user svc_report.” It’s pulling that from notes I ingested a day or two earlier, which beats scrolling back through OneNote or Slack history.

Blue Team Augmentation and Reasoning

On the blue team side (incident response, threat hunting), Jarvis serves as a historical memory and an analysis aide. A SOC analyst might use Jarvis to quickly recall “Have we seen this alert or malware before?” If all past incident reports and IOCs are in Jarvis, a query like “Triaged alerts similar to Trickbot on finance PC last year” will surface that previous incident, including what was done. This can reduce duplicated investigative work.

Jarvis’s orchestration also means the blue team can leverage different models for different tasks seamlessly. If an analyst asks, “Summarize this 10MB log file for anomalies,” Jarvis might split the file, index it, maybe use a local script to extract statistics, and then feed a summary prompt to an LLM. You could imagine the Intelligent Router recognizing a log_analysis task and delegating to a specialized pipeline or a fine-tuned model.

One concrete example: I had a Zeek network log of an attack scenario. I ingested the log (as text chunks) into Jarvis. Then I asked: “Find signs of data exfiltration in the network log.” Jarvis used hybrid search to pull any chunks with keywords like “POST” or “large” or “file” and also semantic clues about unusual traffic. It found an HTTP POST with a big payload and returned that line. Then, using the orchestration, it chose to summarize that finding via GPT-4 (since it classified it as analysis). The answer I got was something like: “There is an HTTP POST to an upload endpoint from 192.168.1.50 transferring several megabytes of data, which is a potential exfiltration event.” Jarvis basically did the tedious scanning and explanation for me.

Long-Running Assistant Mode

Jarvis isn’t used via just APIs and scripts – I often keep it running in an interactive mode for hours. For instance, during an investigation I might have a terminal or chat with Jarvis where I continuously feed it information and questions. This is where the session context tracking shines. Jarvis becomes a persistent analytical partner that “remembers” everything I’ve told it in the last hour.

In long sessions, I sometimes ask Jarvis, “Show me what you remember about X from earlier in this session.” Because the session history is stored (and even vector-indexed for quick lookup), Jarvis can fetch the relevant part of the conversation or data and present it. Internally it might just search its own conversation logs (vector search on past turns – Jarvis can RAG on its own dialogue). This self-referential ability prevents me from having to scroll up manually or recall details from an hours-long chat.

Another aspect of Jarvis as a living assistant is tool integration. While Jarvis doesn’t execute OS commands directly by default, I have integrated a set of “pseudo-tools” through its API. For example, if I prefix a query with #bash, a small middleware intercepts that and actually runs the command on a sandboxed VM, then returns the output which Jarvis ingests. So I can do:

Me: #bash nmap -p 22-25 10.0.0.0/28
Jarvis: (runs Nmap, ingests output, then responds) “Ports 22 and 25 are open on 10.0.0.5. Port 23 is filtered. The other hosts have no ports 22–25 open.”

This kind of tight loop where Jarvis can incorporate fresh results and immediately analyze them is like having a junior security engineer who can both run tools and write up findings on the fly. It’s only enabled in controlled environments and with my supervision, but it shows what’s possible.

Deployment and Scaling in Practice

Jarvis started as a local experiment on my beefy desktop, but it’s now running on a dedicated server with a GPU and is accessible via a web interface and API for my team (with auth). Some deployment notes:

  • Container Orchestration: Jarvis runs in Docker Compose with multiple services: a FastAPI service (the API), Celery workers for CPU-bound tasks (document parsing, etc.), and another worker for GPU tasks (embedding, cross-encoder inference). We have Redis for Celery + session storage, and Qdrant for vectors. Using Docker made it easy to ship the whole thing to a cloud VM and get running quickly.

  • GPU Awareness: The embedding service can saturate a GPU if it tries to embed too many texts at once. Jarvis’s embedding code uses an adaptive batching mechanism – it monitors GPU utilization and dynamically adjusts batch size to hit a target utilization without OOMs. On an RTX 3090 test box, Jarvis went from embedding ~100 texts/second initially to ~300 texts/second after tuning adaptive batching and enabling FP16 mode. A simplified sketch of the idea follows this list.

  • Scaling Workers: The API itself is an async FastAPI running multiple Uvicorn workers. Heavy lifting like creating embeddings or handling large file ingestion is offloaded to Celery workers. We run one Celery queue dedicated to GPU tasks and another queue for CPU tasks, scaled to multiple processes to utilize multi-core CPUs.

  • API and Security: All API calls require an API key header. I also put the service behind a VPN for my team, since this is sensitive data. Jarvis has a Swagger UI and some basic health endpoints for monitoring.

  • Monitoring: We use Grafana and Prometheus to monitor Jarvis. There are custom metrics like number of queries, ingestion rate, and default system metrics like CPU, memory, and GPU usage. During load tests, Jarvis sustained several queries per second and many ingestion operations per second without errors. The bottleneck is typically the external LLM APIs.
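
As promised above, here is a much simplified version of the GPU batching idea (backing off on out-of-memory errors rather than targeting a utilization level), assuming a sentence-transformers-style model with an encode() method:

import torch

def embed_adaptive(texts, model, start_batch=64, min_batch=4):
    """Embed texts in batches, halving the batch size on CUDA OOM instead of crashing."""
    batch, out, i = start_batch, [], 0
    while i < len(texts):
        try:
            out.extend(model.encode(texts[i:i + batch]))
            i += batch
        except torch.cuda.OutOfMemoryError:
            if batch == min_batch:
                raise                            # give up rather than loop forever
            torch.cuda.empty_cache()
            batch = max(min_batch, batch // 2)   # back off and retry the same slice
    return out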

Data management-wise, the vector DB (Qdrant) holds on the order of a million vectors from everything ingested and uses a few gigabytes of memory. I set up nightly backups of Qdrant and Redis (session data is less critical, but the knowledge base is). Since Jarvis can wipe and re-ingest, I keep raw sources in a separate archive so I can rebuild the index if needed.

Running Jarvis in a “production-ish” environment taught me about maintenance. I had to implement health checks and auto-recovery for workers – e.g., Celery processes would occasionally hang due to large file parsing, so I added timeouts and a watchdog to restart workers if they become unresponsive. Upgrading dependencies (like a new version of Qdrant or the embedding model) required re-indexing, which Jarvis can do using a shadow index method (spin up a new collection, swap when ready). Health endpoints report model availability, making it easier to debug outages.

Reflections and Future Directions

Having used Jarvis extensively, I can say it has already improved my research and reuse of knowledge dramatically. It’s like a second brain that I can query. There have been numerous times I asked Jarvis something and it reminded me of an article or note I had completely forgotten writing. Or it retrieved an exact error message from a log that saved me re-running a lengthy scan. It reduces the “Swiss cheese memory” problem – the information is there, you just need to ask the right question.

It’s also changed how I document things. Knowing that Jarvis will ingest my notes, I write notes in a way that’s chunk-friendly. I give my sections clear headings and keep related info together, so that the semantic chunker will keep those in one piece. I also sometimes add a one-line summary above key findings in my notes – effectively creating that contextual prefix myself.

For the future, there are lots of exciting enhancements on the roadmap:

  • Streaming responses: letting Jarvis stream partial answers so the user sees content as it’s generated.

  • Model ensembles: running multiple models and having Jarvis compare answers, improving reliability.

  • Self-refinement loops: enabling Jarvis to reflect on an answer and refine it if it’s not good enough.

  • Cost awareness dashboard: surfacing usage and cost per user, project, or time window.

  • Deeper integrations: tying Jarvis into Slack, ticketing systems, CI pipelines, and more.

  • Assistant personas: modes like “Linux guru” vs “Windows DFIR expert” that use different prompts or model preferences.

  • Continuous learning: using user feedback (thumbs up/down, extra research needed) to refine retrieval, ranking, and routing over time.

I also have an eye on integrating custom local models. There are some open-source LLMs fine-tuned for cyber security tasks. If I plug those in via the model delegator, Jarvis could choose them when appropriate (e.g., a CVE-focused Q&A model for vulnerability questions). This would reduce dependence on external APIs and improve privacy for sensitive data.

In closing, building Jarvis has been a journey of turning disparate ideas into a cohesive system: from semantic search to intelligent routing to continuous ingestion. It reflects a philosophy that I think will be common in the future – having a long-term memory and adaptable AI as part of one’s workflow. Whether you call it an AI-augmented workflow or a smart knowledge base, the benefit is clear: less repeated work, faster access to information, and a partner that learns and grows with you.

For me as a security professional, it means I can focus more on creative and high-level problems, while Jarvis handles the grunt work of remembering, retrieving, and even initiating routine tasks. Jarvis is still “in-progress” and probably always will be. But it’s already an empowering presence in my daily work. If you’re building something similar or thinking about it, I hope this series offered useful insights – from technical bits to practical use cases. The code is available (open-source, MIT licensed) in the GitHub repo, and I’m excited to see how others might use or extend these ideas. As for me, I’ll keep expanding Jarvis’s capabilities, one commit at a time, with the vision of a truly smart, self-improving assistant always in mind.