Privacy-Preserving Document Intelligence System using OCR, LexRank, and Local LLMs

Gourav Mungutwar; Prabhakar Sharma; Ishita Verma; Surbhi Verma; Khushi Ganguli; Anjali Chandra


Published: Apr 8, 2026

Keywords:

Tesseract OCR, LexRank, TF-IDF vectorization

Gourav Mungutwar

CSE (AI) SSIPMT, Raipur

Prabhakar Sharma

CSE (AI) SSIPMT, Raipur

Ishita Verma

CSE (AI) SSIPMT, Raipur

Surbhi Verma

CSE (AI) SSIPMT, Raipur

Khushi Ganguli

CSE (AI) SSIPMT, Raipur

Anjali Chandra

CSE (AI) SSIPMT, Raipur

Abstract

The rapid proliferation of digital documents has led to a growing need for systems capable of handling scanned and image-based Portable Document Format (PDF) files, which often lack machine-readable text and are difficult to search, analyze, and interact with. Existing solutions are typically based on cloud computing or computationally intensive transformer architectures, raising concerns about data privacy and resource consumption. This paper proposes a fully local, privacy-conscious document intelligence system that integrates Optical Character Recognition (OCR), extractive summarization, and question answering. Text extraction is performed using Tesseract OCR, followed by TF-IDF vectorization, cosine similarity computation, and graph-based processing. The LexRank algorithm is employed to generate concise summaries, while a locally deployed Large Language Model (LLM) enables document-based question answering. Additionally, the system provides document analytics, such as word count and reading time, and has been implemented using Streamlit. The proposed system ensures efficiency, security, and offline processing, making it suitable for private and sensitive applications.
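
The summarization stage named in the abstract (TF-IDF vectorization, cosine similarity, a sentence graph, LexRank) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the similarity threshold of 0.1, the damping factor of 0.85, and the smoothed IDF formula are all assumptions chosen for illustration, and the sketch uses only the Python standard library. It also leaves out self-links in the graph, a simplification relative to the original LexRank formulation.

```python
import math
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tfidf_vectors(sentences):
    """Build one sparse TF-IDF vector (a dict) per sentence.

    The smoothed IDF used here is an assumption, not taken from the paper.
    """
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in sentences]
    n = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        term_freq = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(sentences, threshold=0.1, damping=0.85, iterations=50):
    """Score sentences by power iteration over a thresholded similarity graph.

    threshold and damping are illustrative defaults (assumptions).
    Sentences with no neighbours keep only the baseline (1 - damping) / n.
    """
    n = len(sentences)
    if n == 0:
        return []
    vectors = tfidf_vectors(sentences)
    # Unweighted graph: link two distinct sentences if they are similar enough.
    adjacency = [
        [1.0 if i != j and cosine(vectors[i], vectors[j]) >= threshold else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    degree = [sum(row) for row in adjacency]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        updated = []
        for i in range(n):
            incoming = sum(
                adjacency[j][i] * scores[j] / degree[j]
                for j in range(n) if degree[j] > 0
            )
            updated.append((1 - damping) / n + damping * incoming)
        scores = updated
    return scores


def summarize(text, num_sentences=2):
    """Return the top-scoring sentences, kept in their original order."""
    sentences = split_sentences(text)
    scores = lexrank_scores(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]
```

In a pipeline like the one described, the input to `summarize` would be the text produced by the OCR step. Sentences that share vocabulary with many others accumulate score, while an isolated sentence stays near the baseline and is unlikely to be selected.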


How to Cite

Mungutwar, G., Sharma, P., Verma, I., Verma, S., Ganguli, K., & Chandra, A. (2026). Privacy-Preserving Document Intelligence System using OCR, LexRank, and Local LLMs. International Journal of Recent Advances in Engineering and Technology, 15(1), 104–112. Retrieved from https://journals.mriindia.com/index.php/ijraet/article/view/2063

Issue

Vol. 15 No. 1 (2026)

Section

Articles

This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License.


