DocuMind: A Two-Stage Retrieval-Augmented Generation System for Academic Research Paper Question Answering
Abstract
The volume of unstructured academic data has grown massively in recent years, making information extraction increasingly challenging. While current question answering applications for PDFs achieve high accuracy, they rely on closed-source cloud services, which makes them unsuitable for research papers. This work introduces DocuMind, an open-source, privately deployable retrieval-augmented generation (RAG) framework for question answering on research papers. It features a novel two-stage retrieval scheme that combines deterministic page-one pinning with maximal marginal relevance (MMR) to address false answers drawn from the reference sections of academic documents. An experimental evaluation on 200 question-answer pairs from 20 research papers shows an accuracy of 81.5 percent with no hallucinated answers. The method improves accuracy on identity questions from 44.4 percent to 82.7 percent. All components of DocuMind are built with open-source software and require no cloud services.
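The two-stage retrieval idea can be illustrated with a minimal sketch. It assumes that stage one always returns every chunk from page one, and that stage two fills the remaining slots from the other pages using MMR. The abstract does not specify how the stages are combined, so the function names, the chunk layout (`page`, `vec`, `text`), and the parameters `k` and `lam` are hypothetical and not taken from the DocuMind implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mmr(query_vec, cand_vecs, k, lam=0.5):
    """Return indices of up to k candidates chosen by maximal marginal relevance.

    Each step picks the candidate maximising
    lam * sim(query, cand) - (1 - lam) * max sim(cand, already selected).
    """
    selected = []
    remaining = list(range(len(cand_vecs)))
    while remaining and len(selected) < k:
        best, best_score = None, -math.inf
        for i in remaining:
            relevance = cosine(query_vec, cand_vecs[i])
            redundancy = max(
                (cosine(cand_vecs[i], cand_vecs[j]) for j in selected),
                default=0.0,
            )
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected


def two_stage_retrieve(query_vec, chunks, k=4, lam=0.5):
    """Hypothetical two-stage retrieval: pin page one, then MMR over the rest."""
    # Stage 1 (assumed): page-one chunks are always included, regardless of score.
    pinned = [c for c in chunks if c["page"] == 1]
    # Stage 2 (assumed): MMR selects k diverse, relevant chunks from other pages.
    rest = [c for c in chunks if c["page"] != 1]
    picked = mmr(query_vec, [c["vec"] for c in rest], k, lam)
    return pinned + [rest[i] for i in picked]
```

Under these assumptions, pinning guarantees that the title, authors, and affiliations on page one reach the generator for identity questions, even when similarly worded entries in the reference list would otherwise score higher, while MMR discourages near-duplicate chunks among the remaining slots.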
Article Details

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.