I have built a document-level retrieval benchmark based on a corpus of over 1,000 scientific papers. The goal is to assess if local_search can retrieve the correct source document for a given factual question.
My evaluation shows a very low document-level Hit Rate: 3.44% at k=1, rising to only 21.63% at k=10. This suggests that for tasks requiring precise document provenance, the default local_search configuration may not be optimal.
My Methodology
Corpus: A collection of 1,000+ scientific papers in a specific domain [e.g., pharmacology].
Benchmark Construction: I created a set of question-document pairs. Each question is designed to have its answer contained within one specific "target document" in the corpus.
GraphRAG Indexing: I indexed the entire corpus using the graphrag --init command with the --init-method fast setting.
Retrieval Step: For each question in my benchmark, I run the local_search function.
Evaluation: I inspect the result.context_data["sources"] field from the local_search output. A "hit" is counted if the ID of the "target document" is present in this list of sources. The Hit Rate @k is the percentage of questions for which the target document is found within the top k sources.
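For anyone who wants to reproduce the metric, the Hit Rate @k described above can be sketched roughly as follows. This is a minimal illustration, not the exact script behind the numbers in this post. It assumes the sources returned for each question have already been reduced to an ordered list of document IDs; the names `hit_rate_at_k`, `ranked_sources`, and `targets` are hypothetical, and the toy data is made up.

```python
from typing import Dict, List, Sequence


def hit_rate_at_k(
    ranked_sources: Dict[str, List[str]],
    targets: Dict[str, str],
    ks: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """Return the Hit Rate @k (in percent) for each k in `ks`.

    ranked_sources: question ID -> ordered list of retrieved document IDs
                    (assumed to be extracted from the search output beforehand).
    targets:        question ID -> ID of the single target document.
    """
    if not targets:
        raise ValueError("targets must contain at least one question")

    rates: Dict[int, float] = {}
    for k in ks:
        # A question is a hit if its target document is among the top k sources.
        # Questions with no retrieved sources count as misses.
        hits = sum(
            1
            for qid, target in targets.items()
            if target in ranked_sources.get(qid, [])[:k]
        )
        rates[k] = 100.0 * hits / len(targets)
    return rates


# Toy data (made up for illustration only).
targets = {"q1": "doc_a", "q2": "doc_b", "q3": "doc_c", "q4": "doc_d"}
ranked_sources = {
    "q1": ["doc_a", "doc_x"],           # target at rank 1
    "q2": ["doc_x", "doc_b"],           # target at rank 2
    "q3": ["doc_x", "doc_y", "doc_z"],  # target not retrieved
    # q4 has no retrieved sources at all
}

rates = hit_rate_at_k(ranked_sources, targets, ks=(1, 2))
print(rates)
```

One thing to watch when adapting this: if the retrieved sources are text chunks rather than whole documents, several of the top k entries may map to the same document, so whether k counts chunks or distinct documents changes the result.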
Results
Here is the Hit Rate table from my benchmark: