https://github.com/gsolersanz/npl_becas_boe

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (12.3%) to scientific vocabulary

Last synced: 8 months ago

Repository

Basic Info

Host: GitHub
Owner: gsolersanz
Language: Python
Default Branch: main
Size: 7.88 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed about 1 year ago

Metadata Files

Readme

Intelligent Document Processing and Summarization Pipeline

🌟 Project Overview

This project implements an advanced document processing pipeline designed to extract, analyze, and summarize complex textual documents, with a specific focus on scholarship and official document analysis. The system leverages state-of-the-art natural language processing techniques to transform unstructured PDF documents into structured, insightful summaries.

🎯 Project Objectives

Automatically extract structured information from PDF documents
Perform topic modeling and clustering of document sections
Generate summaries using multiple state-of-the-art language models
Evaluate and compare summarization performance

🛠 System Architecture

The pipeline consists of several key components:

PDF to Text Conversion (pdf_totext.py)
- Converts PDF documents to clean, structured text
- Removes unnecessary metadata and formatting
Topic Modeling (transformer_topic_modeling.py)
- Utilizes transformer-based embeddings
- Clusters document sections into predefined categories
- Extracts most relevant sections for each topic
Summarization (summarization_models.py)
- Supports multiple summarization models:
  - BART
  - T5
  - Longformer
- Generates summaries at both article and document levels
Evaluation (llm_evaluator.py)
- Compares summaries generated by different models
- Provides quantitative assessment of summary quality
Pipeline Orchestration (pipeline.py)
- Coordinates the entire document processing workflow
- Manages input, processing, and output of documents
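The topic modeling step, which assigns document sections to predefined categories by embedding similarity, can be sketched as follows. This is a minimal illustration and not the repository's code: the `embed` function is a toy bag-of-words stand-in for the transformer embeddings, and the function names, topic descriptions, and sample sections are all hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a transformer embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a if word in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_topics(sections, topics):
    """Map each section to the predefined topic whose description is most similar."""
    topic_vectors = {name: embed(desc) for name, desc in topics.items()}
    assignments = {}
    for section in sections:
        vector = embed(section)
        assignments[section] = max(
            topic_vectors, key=lambda name: cosine(vector, topic_vectors[name])
        )
    return assignments


# Hypothetical topics and sections, for illustration only.
topics = {
    "requisitos_academicos": "academic requirements grades degree enrolment",
    "cuantia": "amount payment euros funding",
}
sections = [
    "Applicants must hold a degree and meet minimum grades",
    "The funding amount is 1500 euros per year",
]
for section, topic in assign_topics(sections, topics).items():
    print(f"{topic}: {section}")
```

With real sentence embeddings the same nearest-topic logic applies; only `embed` would change.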

📦 Prerequisites

Python 3.8+
PyTorch
Transformers
Sentence-Transformers
scikit-learn
tqdm
huggingface_hub
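A quick way to confirm that the environment satisfies the list above is to check which modules can be located before running the pipeline. The sketch below is not part of the repository, and its helper names are hypothetical. It relies on the standard import names of these packages (scikit-learn is imported as `sklearn`, Sentence-Transformers as `sentence_transformers`).

```python
import importlib.util
import sys

# Import names for the packages listed in the prerequisites.
REQUIRED_MODULES = [
    "torch",
    "transformers",
    "sentence_transformers",
    "sklearn",
    "tqdm",
    "huggingface_hub",
]


def missing_modules(names=REQUIRED_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def python_is_supported(minimum=(3, 8)):
    """Check the running interpreter against the stated minimum version."""
    return sys.version_info >= minimum


if __name__ == "__main__":
    if not python_is_supported():
        print("Python 3.8 or newer is required")
    missing = missing_modules()
    print("Missing packages:", ", ".join(missing) if missing else "none")
```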

🚀 Installation

```bash
# Clone the repository
git clone https://github.com/gsolersanz/NPLbecasBOE.git
# git names the checkout directory after the last component of the clone URL
cd NPLbecasBOE
```

🔧 Usage

Basic Pipeline Execution

```bash
python pipeline.py --input document1.pdf document2.txt --output results --summarization_models bart t5 --evaluate
```

Command Line Arguments

--input: One or more input files to process (PDF or TXT)
--output: Base directory for storing results
--num_topics: Number of topics for clustering (default: 3)
--custom_topics: Specify custom topics (e.g., "requisitos_academicos")
--summarization_models: Models to use for summarization
- Choices: bart, t5, longformer
--evaluate: Flag to evaluate and compare the generated summaries with an LLM. The current implementation lets you choose between a Spanish GPT-2 model and the standard GPT-2.
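The options above map naturally onto an `argparse` parser. The sketch below is a reconstruction from this list, not the actual contents of `pipeline.py`. Marking `--input` and `--output` as required and letting `--custom_topics` accept several values are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented options (a reconstruction)."""
    parser = argparse.ArgumentParser(description="Document processing pipeline")
    parser.add_argument("--input", nargs="+", required=True,
                        help="One or more input files (PDF or TXT)")
    parser.add_argument("--output", required=True,
                        help="Base directory for storing results")
    parser.add_argument("--num_topics", type=int, default=3,
                        help="Number of topics for clustering")
    parser.add_argument("--custom_topics", nargs="*", default=None,
                        help="Custom topic names")
    parser.add_argument("--summarization_models", nargs="+",
                        choices=["bart", "t5", "longformer"],
                        help="Models to use for summarization")
    parser.add_argument("--evaluate", action="store_true",
                        help="Evaluate and compare the generated summaries")
    return parser


# Parse the same arguments as the basic execution example.
args = build_parser().parse_args([
    "--input", "document1.pdf", "document2.txt",
    "--output", "results",
    "--summarization_models", "bart", "t5",
    "--evaluate",
])
print(args.input, args.num_topics, args.summarization_models, args.evaluate)
```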

📂 Project Structure

pipeline.py: Main pipeline orchestration script
pdf_totext.py: PDF to text conversion
transformer_topic_modeling.py: Advanced topic modeling
summarization_models.py: Text summarization
llm_evaluator.py: Summary evaluation
cluster_becas.py: Additional clustering utilities

📊 Example Workflow

Input PDF/TXT documents
Convert to clean text
Extract and cluster document sections
Generate summaries using multiple models
Evaluate and compare summaries
Output structured results
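The steps above can be sketched as a chain of functions, one per stage. Every function here is a hypothetical placeholder that returns dummy data. The real stages live in the scripts listed under Project Structure, and their interfaces may differ.

```python
from pathlib import Path


def convert_to_text(path):
    """Placeholder for the PDF/TXT conversion step; returns dummy text."""
    return f"Text extracted from {Path(path).name}"


def cluster_sections(text):
    """Placeholder for topic modeling; puts the whole text under one topic."""
    return {"general": [text]}


def summarize(sections, model_name):
    """Placeholder summarizer; truncates the text instead of calling a model."""
    return {
        topic: f"[{model_name}] {' '.join(parts)[:40]}"
        for topic, parts in sections.items()
    }


def evaluate(summaries):
    """Placeholder evaluation; reports total summary length per model."""
    return {
        model: sum(len(text) for text in per_topic.values())
        for model, per_topic in summaries.items()
    }


def run_pipeline(inputs, models, do_evaluate=False):
    """Run every stage in order for each input and collect structured results."""
    results = {}
    for path in inputs:
        text = convert_to_text(path)
        sections = cluster_sections(text)
        summaries = {model: summarize(sections, model) for model in models}
        entry = {"sections": sections, "summaries": summaries}
        if do_evaluate:
            entry["evaluation"] = evaluate(summaries)
        results[str(path)] = entry
    return results


results = run_pipeline(["document1.pdf"], ["bart", "t5"], do_evaluate=True)
print(sorted(results["document1.pdf"]["summaries"]))
```

Keeping each stage behind a plain function makes it straightforward to swap a placeholder for the real implementation without touching the loop.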

📈 Performance Evaluation

The system provides:
- Topic distribution analysis
- Summary quality metrics
- Comparative model performance
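To illustrate what a comparative quality metric can look like, the sketch below ranks model outputs by a ROUGE-1-style unigram F1 against a reference text. This is a swapped-in technique chosen for simplicity, not the GPT-2 based evaluation the project describes. The function names and sample texts are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """ROUGE-1-style F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rank_models(summaries, reference):
    """Sort (model, score) pairs from best to worst against the reference."""
    scores = {
        model: unigram_f1(text, reference) for model, text in summaries.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical reference and model outputs, for illustration only.
reference = "applicants must hold a degree"
summaries = {
    "t5": "the grant is paid yearly",
    "bart": "applicants must hold a degree",
}
for model, score in rank_models(summaries, reference):
    print(f"{model}: {score:.2f}")
```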

Owner

Name: Guillem Soler
Login: gsolersanz
Kind: user

Repositories: 1
Profile: https://github.com/gsolersanz

GitHub Events

Total

Member event: 1
Push event: 3
Create event: 3

Last Year

Member event: 1
Push event: 3
Create event: 3
