Refactor RetrieverAgent to use parameterised AI Search queries and GPT-4.1-nano for low-latency RAG #15

@thegovind

Description

Background

retriever.py currently:

  • Hard-codes the semantic configuration name, top-10 limit, and index fields
  • Falls back to mock documents instead of retrying or surfacing detailed errors
  • Does not pass retrieved passages to an LLM for answer synthesis; the kernel only echoes a count
  • Ties the embedding model to server-side reranking, even though GPT-4.1-nano is now available for fast, low-cost generation

Moving to a configurable retrieval → generation flow will let us:

  • Tune query parameters per environment (dev, staging, prod) without code changes
  • Drop the mock-doc path in favour of proper retries and telemetry
  • Use GPT-4.1-nano to draft quick RAG answers while keeping GPT-4o or 4-turbo for high-quality fallbacks
  • Prepare the ground for multi-vector hybrid search in a future milestone

Scope of Work

  1. Configuration

    • Expose the following via settings + .env:

      • SEARCH_INDEX
      • SEMANTIC_CONFIGURATION
      • SEARCH_TOP_K (default 10)
      • GPT_FAST_MODEL (default gpt-4.1-nano)
      • GPT_FALLBACK_MODEL (default gpt-4o)
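    Rough sketch of the settings surface, assuming pydantic-settings - adapt to the project's existing settings module if it uses something else; defaults mirror the list above:

      # Hypothetical settings class - field names map 1:1 to the env vars above
      from pydantic_settings import BaseSettings, SettingsConfigDict

      class RetrieverSettings(BaseSettings):
          model_config = SettingsConfigDict(env_file=".env")

          search_index: str                       # SEARCH_INDEX (required)
          semantic_configuration: str             # SEMANTIC_CONFIGURATION (required)
          search_top_k: int = 10                  # SEARCH_TOP_K
          gpt_fast_model: str = "gpt-4.1-nano"    # GPT_FAST_MODEL
          gpt_fallback_model: str = "gpt-4o"      # GPT_FALLBACK_MODEL
          rag_passages: int = 4                   # RAG_PASSAGES (used in step 3)
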
  2. Retrieval logic

    • Replace self.search_client.search(...) with a helper that maps env values → SDK params.
    • Add exponential-backoff retry (3 attempts, 1-4-8 s) on ServiceRequestError, HttpResponseError, and ClientAuthenticationError.
    • Return a typed RetrievalResult dataclass containing: id, content, title, source, score, reranker_score.
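    Sketch of the helper plus dataclass, using the documented azure-search-documents / azure-core surface; the index field names ("id", "content", "title", "source") come from this issue and may differ in the real schema:

      import time
      from dataclasses import dataclass
      from typing import Optional

      from azure.core.exceptions import (
          ClientAuthenticationError,
          HttpResponseError,
          ServiceRequestError,
      )

      RETRYABLE = (ServiceRequestError, HttpResponseError, ClientAuthenticationError)
      BACKOFF_SECONDS = (1, 4, 8)  # per-attempt wait; tune to the agreed 3-attempt schedule

      @dataclass
      class RetrievalResult:
          id: str
          content: str
          title: str
          source: str
          score: float
          reranker_score: Optional[float]

      def run_semantic_search(search_client, query, settings):
          """Map env-driven settings onto SDK params, retrying on transient errors."""
          for attempt, delay in enumerate(BACKOFF_SECONDS, start=1):
              try:
                  results = search_client.search(
                      search_text=query,
                      query_type="semantic",
                      semantic_configuration_name=settings.semantic_configuration,
                      top=settings.search_top_k,
                  )
                  return [
                      RetrievalResult(
                          id=doc["id"],
                          content=doc["content"],
                          title=doc.get("title", ""),
                          source=doc.get("source", ""),
                          score=doc["@search.score"],
                          reranker_score=doc.get("@search.reranker_score"),
                      )
                      for doc in results
                  ]
              except RETRYABLE:
                  if attempt == len(BACKOFF_SECONDS):
                      raise
                  time.sleep(delay)
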
  3. Generation logic

    • After fetching docs, send the top N passages (env var RAG_PASSAGES, default 4) to GPT-4.1-nano with a terse system prompt:

      You are a retrieval assistant. Provide a concise answer to the user
      based only on the passages provided. Use bullet points when listing items.
      
    • If the nano model times out or returns an empty response, retry with GPT_FALLBACK_MODEL.

    • Attach source-attribution footnotes ([1], [2], …), each mapping to the corresponding passage's source field.
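
    Sketch of the nano-first / fallback flow, written against the openai client for illustration - the real agent will presumably route through the kernel's chat service, and the helper names here are placeholders:

      import openai

      SYSTEM_PROMPT = (
          "You are a retrieval assistant. Provide a concise answer to the user "
          "based only on the passages provided. Use bullet points when listing items."
      )

      def build_context(passages):
          # Number passages so the model can cite them as [1], [2], ...
          return "\n\n".join(
              f"[{i}] (source: {p.source}) {p.content}"
              for i, p in enumerate(passages, start=1)
          )

      def generate_answer(client, question, passages, settings):
          context = build_context(passages[: settings.rag_passages])
          messages = [
              {"role": "system", "content": SYSTEM_PROMPT},
              {"role": "user", "content": f"Passages:\n{context}\n\nQuestion: {question}"},
          ]
          # Try the fast model first; fall back only on error or empty content
          for model in (settings.gpt_fast_model, settings.gpt_fallback_model):
              try:
                  response = client.chat.completions.create(
                      model=model, messages=messages, timeout=10
                  )
              except openai.OpenAIError:
                  continue
              answer = (response.choices[0].message.content or "").strip()
              if answer:
                  return answer
          raise RuntimeError("Fast and fallback models both failed to return content")
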

  4. Streaming updates

    • In invoke_stream, yield interim tokens from the LLM call so the UI can display gradual output.
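    Sketch of the streaming path, assuming an AsyncOpenAI-style client and a hypothetical _build_messages helper - adapt to whatever streaming abstraction the kernel already exposes:

      async def invoke_stream(self, question: str):
          passages = run_semantic_search(self.search_client, question, self.settings)
          stream = await self.openai_client.chat.completions.create(
              model=self.settings.gpt_fast_model,
              messages=self._build_messages(question, passages),  # hypothetical helper
              stream=True,
          )
          async for chunk in stream:
              if chunk.choices and chunk.choices[0].delta.content:
                  yield chunk.choices[0].delta.content  # interim token for the UI
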
  5. Tests

    • Unit test for config overrides via pytest-env.

    • Integration test that ensures:

      • At least one passage is returned for a known query ("Microsoft revenue").
      • GPT response contains a footnote matching one of the passage sources.
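
    Sketch of both tests; the retriever_agent fixture and the result shape (passages, answer) are placeholders for whatever the refactor ends up exposing:

      import re

      import pytest

      def test_settings_pick_up_env_overrides():
          # pytest-env is expected to set SEARCH_TOP_K=5 for this module
          settings = RetrieverSettings()
          assert settings.search_top_k == 5

      @pytest.mark.integration
      def test_known_query_returns_passage_and_footnote(retriever_agent):
          result = retriever_agent.invoke("Microsoft revenue")
          assert result.passages, "expected at least one passage for the known query"
          assert re.search(r"\[\d+\]", result.answer), "expected a source footnote"
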

Acceptance Criteria

  • All hard-coded values removed from RetrieverAgent.
  • Queries, semantic config, and top-K are driven by env vars.
  • GPT-4.1-nano is invoked first; fallback engages only on error or empty content.
  • End-to-end latency (retrieval + generation) averages <1.2 s for a 30-token query against the dev index.
  • Tests pass in CI and the README section “Fast RAG path” is updated.

Additional Notes

  • When use_agentic_retrieval is true, prefer KnowledgeAgentRetrievalClient but still respect env overrides.
  • Consider moving mock-document logic to a separate debug utility instead of deleting outright.
  • Follow the project style guide: hyphens, not em dashes, in log or user-visible messages.

Effort: Medium
