Refactor RetrieverAgent to use parameterised AI Search queries and GPT-4.1-nano for low-latency RAG #15

@thegovind

Description

Background

retriever.py currently:

  • Hard-codes the semantic configuration name, top-10 limit, and index fields
  • Falls back to mock documents instead of retrying or surfacing detailed errors
  • Does not pass retrieved passages to an LLM for answer synthesis; the kernel only echoes a count
  • Ties the embedding model to server-side reranking, even though GPT-4.1-nano is now available for fast, low-cost generation

Moving to a configurable retrieval → generation flow will let us:

  • Tune query parameters per environment (dev, staging, prod) without code changes
  • Drop the mock-doc path in favour of proper retries and telemetry
  • Use GPT-4.1-nano to draft quick RAG answers while keeping GPT-4o or 4-turbo for high-quality fallbacks
  • Prepare the ground for multi-vector hybrid search in a future milestone

Scope of Work

  1. Configuration

    • Expose the following via settings + .env:

      • SEARCH_INDEX
      • SEMANTIC_CONFIGURATION
      • SEARCH_TOP_K (default 10)
      • GPT_FAST_MODEL (default gpt-4.1-nano)
      • GPT_FALLBACK_MODEL (default gpt-4o)
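    Rough sketch of the settings surface, assuming pydantic-settings - adapt to the project's existing settings module if it uses something else; defaults mirror the list above:

      # Hypothetical settings class - field names map 1:1 to the env vars above
      from pydantic_settings import BaseSettings, SettingsConfigDict

      class RetrieverSettings(BaseSettings):
          model_config = SettingsConfigDict(env_file=".env")

          search_index: str                       # SEARCH_INDEX (required)
          semantic_configuration: str             # SEMANTIC_CONFIGURATION (required)
          search_top_k: int = 10                  # SEARCH_TOP_K
          gpt_fast_model: str = "gpt-4.1-nano"    # GPT_FAST_MODEL
          gpt_fallback_model: str = "gpt-4o"      # GPT_FALLBACK_MODEL
          rag_passages: int = 4                   # RAG_PASSAGES (used in step 3)
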
  2. Retrieval logic

    • Replace self.search_client.search(...) with a helper that maps env values → SDK params.
    • Add exponential-backoff retry (3 attempts, 1-4-8 s) on ServiceRequestError, HttpResponseError, and ClientAuthenticationError.
    • Return a typed RetrievalResult dataclass containing: id, content, title, source, score, reranker_score.
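    Sketch of the helper plus dataclass, using the documented azure-search-documents / azure-core surface; the index field names ("id", "content", "title", "source") come from this issue and may differ in the real schema:

      import time
      from dataclasses import dataclass
      from typing import Optional

      from azure.core.exceptions import (
          ClientAuthenticationError,
          HttpResponseError,
          ServiceRequestError,
      )

      RETRYABLE = (ServiceRequestError, HttpResponseError, ClientAuthenticationError)
      BACKOFF_SECONDS = (1, 4, 8)  # per-attempt wait; tune to the agreed 3-attempt schedule

      @dataclass
      class RetrievalResult:
          id: str
          content: str
          title: str
          source: str
          score: float
          reranker_score: Optional[float]

      def run_semantic_search(search_client, query, settings):
          """Map env-driven settings onto SDK params, retrying on transient errors."""
          for attempt, delay in enumerate(BACKOFF_SECONDS, start=1):
              try:
                  results = search_client.search(
                      search_text=query,
                      query_type="semantic",
                      semantic_configuration_name=settings.semantic_configuration,
                      top=settings.search_top_k,
                  )
                  return [
                      RetrievalResult(
                          id=doc["id"],
                          content=doc["content"],
                          title=doc.get("title", ""),
                          source=doc.get("source", ""),
                          score=doc["@search.score"],
                          reranker_score=doc.get("@search.reranker_score"),
                      )
                      for doc in results
                  ]
              except RETRYABLE:
                  if attempt == len(BACKOFF_SECONDS):
                      raise
                  time.sleep(delay)
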
  3. Generation logic

    • After fetching docs, send the top N passages (env var RAG_PASSAGES, default 4) to GPT-4.1-nano with a terse system prompt:

      You are a retrieval assistant. Provide a concise answer to the user
      based only on the passages provided. Use bullet points when listing items.
      
    • If the nano model times out or returns an empty response, retry with GPT_FALLBACK_MODEL.

    • Attach source-attribution footnotes ([1], [2], …), each mapping to the corresponding passage's source field.
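
    Sketch of the nano-first / fallback flow, written against the openai client for illustration - the real agent will presumably route through the kernel's chat service, and the helper names here are placeholders:

      import openai

      SYSTEM_PROMPT = (
          "You are a retrieval assistant. Provide a concise answer to the user "
          "based only on the passages provided. Use bullet points when listing items."
      )

      def build_context(passages):
          # Number passages so the model can cite them as [1], [2], ...
          return "\n\n".join(
              f"[{i}] (source: {p.source}) {p.content}"
              for i, p in enumerate(passages, start=1)
          )

      def generate_answer(client, question, passages, settings):
          context = build_context(passages[: settings.rag_passages])
          messages = [
              {"role": "system", "content": SYSTEM_PROMPT},
              {"role": "user", "content": f"Passages:\n{context}\n\nQuestion: {question}"},
          ]
          # Try the fast model first; fall back only on error or empty content
          for model in (settings.gpt_fast_model, settings.gpt_fallback_model):
              try:
                  response = client.chat.completions.create(
                      model=model, messages=messages, timeout=10
                  )
              except openai.OpenAIError:
                  continue
              answer = (response.choices[0].message.content or "").strip()
              if answer:
                  return answer
          raise RuntimeError("Fast and fallback models both failed to return content")
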

  4. Streaming updates

    • In invoke_stream, yield interim tokens from the LLM call so the UI can display gradual output.
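    Sketch of the streaming path, assuming an AsyncOpenAI-style client and a hypothetical _build_messages helper - adapt to whatever streaming abstraction the kernel already exposes:

      async def invoke_stream(self, question: str):
          passages = run_semantic_search(self.search_client, question, self.settings)
          stream = await self.openai_client.chat.completions.create(
              model=self.settings.gpt_fast_model,
              messages=self._build_messages(question, passages),  # hypothetical helper
              stream=True,
          )
          async for chunk in stream:
              if chunk.choices and chunk.choices[0].delta.content:
                  yield chunk.choices[0].delta.content  # interim token for the UI
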
  5. Tests

    • Unit test for config overrides via pytest-env.

    • Integration test that ensures:

      • At least one passage is returned for a known query ("Microsoft revenue").
      • GPT response contains a footnote matching one of the passage sources.
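
    Sketch of both tests; the retriever_agent fixture and the result shape (passages, answer) are placeholders for whatever the refactor ends up exposing:

      import re

      import pytest

      def test_settings_pick_up_env_overrides():
          # pytest-env is expected to set SEARCH_TOP_K=5 for this module
          settings = RetrieverSettings()
          assert settings.search_top_k == 5

      @pytest.mark.integration
      def test_known_query_returns_passage_and_footnote(retriever_agent):
          result = retriever_agent.invoke("Microsoft revenue")
          assert result.passages, "expected at least one passage for the known query"
          assert re.search(r"\[\d+\]", result.answer), "expected a source footnote"
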

Acceptance Criteria

  • All hard-coded values removed from RetrieverAgent.
  • Queries, semantic config, and top-K are driven by env vars.
  • GPT-4.1-nano is invoked first; fallback engages only on error or empty content.
  • End-to-end latency (retrieval + generation) averages <1.2 s for a 30-token query against the dev index.
  • Tests pass in CI and the README section “Fast RAG path” is updated.

Additional Notes

  • When use_agentic_retrieval is true, prefer KnowledgeAgentRetrievalClient but still respect env overrides.
  • Consider moving mock-document logic to a separate debug utility instead of deleting outright.
  • Follow the project style guide: hyphens, not em dashes, in log or user-visible messages.

Effort: Medium
