https://github.com/gsolersanz/npl_becas_boe
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (12.3%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: gsolersanz
- Language: Python
- Default Branch: main
- Size: 7.88 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Intelligent Document Processing and Summarization Pipeline
🌟 Project Overview
This project implements a document processing pipeline that extracts, analyzes, and summarizes complex text documents, with a specific focus on scholarship announcements and official documents. The system uses transformer-based natural language processing to turn unstructured PDF documents into structured summaries.

🎯 Project Objectives
- Automatically extract structured information from PDF documents
- Perform topic modeling and clustering of document sections
- Generate summaries using multiple state-of-the-art language models
- Evaluate and compare summarization performance
🛠 System Architecture
The pipeline consists of several key components:
PDF to Text Conversion (pdf_totext.py)
- Converts PDF documents to clean, structured text
- Removes unnecessary metadata and formatting
Topic Modeling (transformer_topic_modeling.py)
- Utilizes transformer-based embeddings
- Clusters document sections into predefined categories
- Extracts most relevant sections for each topic
Summarization (summarization_models.py)
- Supports multiple summarization models:
  - BART
  - T5
  - Longformer
- Generates summaries at both article and document levels
Evaluation (llm_evaluator.py)
- Compares summaries generated by different models
- Provides quantitative assessment of summary quality
Pipeline Orchestration (pipeline.py)
- Coordinates the entire document processing workflow
- Manages input, processing, and output of documents
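As a rough illustration of the topic modeling component, the sketch below assigns each section to the closest predefined topic by cosine similarity. It is not the code in transformer_topic_modeling.py: `embed` is a toy bag-of-words stand-in so the example runs without downloading a model (the repository describes transformer-based embeddings), and the function names, the `plazos` topic, and the topic descriptions are assumptions made for illustration.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a transformer embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign_topics(sections, topics):
    """Map each section to the predefined topic whose description it most resembles."""
    topic_vecs = {name: embed(desc) for name, desc in topics.items()}
    assignments = {}
    for section in sections:
        vec = embed(section)
        assignments[section] = max(
            topic_vecs, key=lambda name: cosine(vec, topic_vecs[name])
        )
    return assignments


# Hypothetical topic descriptions and sections, for illustration only.
topics = {
    "requisitos_academicos": "academic requirements grades credits enrolled",
    "plazos": "deadline submission dates application period",
}
sections = [
    "Applicants must be enrolled and meet the minimum grades.",
    "The application period ends on the submission deadline.",
]
result = assign_topics(sections, topics)
```

Swapping `embed` for a real sentence-embedding model would leave `assign_topics` unchanged, provided `cosine` is adapted to dense vectors.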
📦 Prerequisites
- Python 3.8+
- PyTorch
- Transformers
- Sentence-Transformers
- scikit-learn
- tqdm
- huggingface_hub
🚀 Installation
```bash
# Clone the repository
git clone https://github.com/gsolersanz/NPLbecasBOE.git
cd NPLbecasBOE
```
🔧 Usage
Basic Pipeline Execution
```bash
python pipeline.py --input document1.pdf document2.txt \
  --output results \
  --summarization_models bart t5 \
  --evaluate
```
Command Line Arguments
- --input: One or more input files to process (PDF or TXT)
- --output: Base directory for storing results
- --num_topics: Number of topics for clustering (default: 3)
- --custom_topics: Specify custom topics (e.g., "requisitos_academicos")
- --summarization_models: Models to use for summarization
  - Choices: bart, t5, longformer
- --evaluate: Flag to evaluate and compare summaries with an LLM. The current implementation lets you choose between a Spanish GPT-2 model and the standard GPT-2.
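For readers who want to see how these flags could fit together, here is a minimal `argparse` sketch. The flag names, the model choices, and the `--num_topics` default come from the list above; the `nargs` settings, the other defaults, and the `build_parser` name are assumptions and may not match what pipeline.py actually does.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented flags; details are assumptions."""
    parser = argparse.ArgumentParser(description="Document processing pipeline")
    parser.add_argument("--input", nargs="+", required=True,
                        help="One or more input files (PDF or TXT)")
    # The default output directory is assumed, not documented.
    parser.add_argument("--output", default="results",
                        help="Base directory for storing results")
    parser.add_argument("--num_topics", type=int, default=3,
                        help="Number of topics for clustering")
    parser.add_argument("--custom_topics", nargs="*", default=None,
                        help="Custom topic names, e.g. requisitos_academicos")
    parser.add_argument("--summarization_models", nargs="+",
                        choices=["bart", "t5", "longformer"], default=["bart"],
                        help="Models to use for summarization")
    parser.add_argument("--evaluate", action="store_true",
                        help="Evaluate and compare the generated summaries")
    return parser


# Parse the same arguments as the basic usage example.
args = build_parser().parse_args(
    ["--input", "document1.pdf", "document2.txt",
     "--output", "results",
     "--summarization_models", "bart", "t5",
     "--evaluate"]
)
```

Using `choices` makes `argparse` reject an unsupported model name before any processing starts.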
📂 Project Structure
- pipeline.py: Main pipeline orchestration script
- pdf_totext.py: PDF to text conversion
- transformer_topic_modeling.py: Advanced topic modeling
- summarization_models.py: Text summarization
- llm_evaluator.py: Summary evaluation
- cluster_becas.py: Additional clustering utilities
📊 Example Workflow
- Input PDF/TXT documents
- Convert to clean text
- Extract and cluster document sections
- Generate summaries using multiple models
- Evaluate and compare summaries
- Output structured results
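The workflow above can be sketched as a small orchestration function. This is an illustrative outline, not the contents of pipeline.py: every stage is passed in as a callable, and the `run_pipeline` name, its parameters, the result layout, and the stub stages are all hypothetical.

```python
def run_pipeline(paths, model_names, to_text, split_sections, summarizers):
    """Run each document through text conversion, sectioning, and summarization.

    All stage callables are injected, so this sketch makes no claim about
    how the real modules implement them.
    """
    results = {}
    for path in paths:
        text = to_text(path)
        sections = split_sections(text)
        results[path] = {
            "sections": sections,
            "summaries": {
                name: [summarizers[name](s) for s in sections]
                for name in model_names
            },
        }
    return results


# Stub stages standing in for the real modules (hypothetical behavior).
fake_files = {"document1.pdf": "Article 1. Purpose.\n\nArticle 2. Requirements."}
stubs = {
    "bart": lambda s: s.split(".")[0],  # keep the text before the first period
    "t5": lambda s: s[:10],             # keep the first ten characters
}
out = run_pipeline(
    ["document1.pdf"],
    ["bart", "t5"],
    to_text=fake_files.__getitem__,
    split_sections=lambda text: text.split("\n\n"),
    summarizers=stubs,
)
```

Keeping the stages as injected callables means each one can be replaced or tested on its own.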
📈 Performance Evaluation
The system provides:
- Topic distribution analysis
- Summary quality metrics
- Comparative model performance
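To make the idea of comparing models on a quantitative score concrete, the sketch below ranks summaries by a ROUGE-1-style unigram F1 against a reference text. This is a deliberately simple stand-in, not the LLM-based evaluation in llm_evaluator.py; the metric choice, the function names, and the sample texts are assumptions.

```python
import re
from collections import Counter


def unigram_f1(candidate, reference):
    """ROUGE-1-style F1: harmonic mean of unigram precision and recall."""
    cand = Counter(re.findall(r"\w+", candidate.lower()))
    ref = Counter(re.findall(r"\w+", reference.lower()))
    overlap = sum((cand & ref).values())  # clipped count of shared words
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def rank_models(summaries, reference):
    """Return (model, score) pairs sorted from best to worst."""
    scores = {name: unigram_f1(text, reference) for name, text in summaries.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical summaries and reference, for illustration only.
reference = "applicants must submit the form before the deadline"
summaries = {
    "bart": "applicants must submit the form",
    "t5": "the grant is large",
}
ranking = rank_models(summaries, reference)
```

Word overlap rewards copying and ignores fluency, which is one reason a pipeline might prefer a model-based evaluator.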
Owner
- Name: Guillem Soler
- Login: gsolersanz
- Kind: user
- Repositories: 1
- Profile: https://github.com/gsolersanz
GitHub Events
Total
- Member event: 1
- Push event: 3
- Create event: 3
Last Year
- Member event: 1
- Push event: 3
- Create event: 3