graph-extraction-from-brazilian-senate-and-chamber-of-deputies

https://github.com/eduardo-zampirolli/graph-extraction-from-brazilian-senate-and-chamber-of-deputies

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
✓ .zenodo.json file: found
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (8.6%) to scientific vocabulary


Repository

Basic Info

Host: GitHub
Owner: Eduardo-zampirolli
Language: Jupyter Notebook
Default Branch: main
Size: 9.75 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 1 year ago · Last pushed 11 months ago


Graph Creation from Brazilian Senate and Chamber of Deputies

This repository contains a comprehensive pipeline to create network graphs from Brazilian parliamentary session transcripts. The project extracts text data from government websites, applies Named Entity Recognition (NER) to identify parliamentarians, and constructs interaction graphs based on speech patterns and mentions.

About the Project

This project analyzes the dynamics of the Brazilian National Congress by creating network graphs from parliamentary session transcripts. Unlike traditional approaches that focus on legislative co-sponsorship, this method analyzes actual speech interactions and mentions between parliamentarians during sessions.

The pipeline consists of four main stages:

1. Data Extraction: Web scraping of session transcripts from official government websites
2. Named Entity Recognition: Identification of parliamentarian names using transformer-based models
3. Graph Construction: Creation of interaction networks based on speech patterns and mentions
4. Analysis and Visualization: Statistical analysis and visualization of the resulting networks

The project processes data from multiple sources:

- Chamber of Deputies (Câmara): Session transcripts from 2016-2025
- Federal Senate (Senado): Regular sessions from 2020-2024
- Federal Senate (Senado R): Special sessions and committees from 2021-2025

🔧 Key Features

Web Scraping: Automated extraction of session transcripts from official government websites
Named Entity Recognition: Advanced NER using transformer models (BERT-based) to identify parliamentarian names
Entity Normalization: Fuzzy matching and name standardization to handle variations in how names appear
Graph Construction: Creation of weighted directed graphs where nodes are parliamentarians and edges represent interaction patterns
Statistical Analysis: Comprehensive graph metrics including degree distribution, clustering coefficients, and network centrality measures
Visualization: Static plots and network displays drawn with the Fruchterman-Reingold layout, with community detection

📂 Project Structure

The repository is organized as follows:

.
├── data_extraction/           # Scripts for web scraping and data collection
│   ├── achar_cod_*.py         # Code discovery scripts for finding session IDs
│   ├── *_txt.py               # Text extraction scripts from government websites
│   └── table.py               # Data aggregation and statistics
├── ner/                       # Named Entity Recognition pipeline
│   ├── ner.py                 # Main NER processing script
│   ├── create_annotations.py  # Annotation file creation
│   ├── compare_outputs.py     # Evaluation and comparison tools
│   └── README.md              # Detailed NER documentation
├── graph_creation/            # Graph construction and analysis
│   └── graph_joining.py       # Main graph creation script
├── resultados/                # Analysis and visualization scripts
│   ├── tabela.py              # Graph metrics calculation
│   ├── distrib*.py            # Degree distribution analysis
│   ├── analise_grafos.py      # Graph analysis tools
│   └── *.ipynb                # Jupyter notebooks for visualization
├── Camara/                    # Raw data from Chamber of Deputies
├── Senado/                    # Raw data from Federal Senate
├── testes/                    # Test files and validation data
└── requirements.txt           # Python dependencies

🚀 Getting Started

To get a local copy up and running, follow these simple steps.

Prerequisites

Python 3.8+ with pip installed
At least 4GB of RAM (transformer models are memory-intensive)
Internet connection for downloading pre-trained models

Installation

Clone the repository:

    git clone https://github.com/eduardo-zampirolli/graph-extraction-from-brazilian-senate-and-chamber-of-deputies.git
    cd graph-extraction-from-brazilian-senate-and-chamber-of-deputies

Install dependencies. It is recommended to use a virtual environment:

    python -m venv venv
    source venv/bin/activate  # On Windows use `venv\Scripts\activate`

Then install the required packages:

    pip install -r requirements.txt

⚙️ Usage

The pipeline consists of multiple stages that can be run independently or in sequence:

1. Data Extraction

First, discover session codes and extract text data:

```bash
# Find session codes for Chamber of Deputies
python data_extraction/achar_cod_cam.py

# Extract text from Chamber sessions
python data_extraction/cam_txt.py

# Find session codes for Senate
python data_extraction/achar_cod_sens.py
python data_extraction/achar_cod_senr.py

# Extract text from Senate sessions
python data_extraction/sen_txt.py
python data_extraction/senr.py
```

2. Named Entity Recognition

Process the extracted texts to identify parliamentarians:

```bash
# Run NER on extracted texts
python ner/ner.py

# Create annotated files
python ner/create_annotations.py --original-dir <textdir> --json-dir <jsondir> --output-dir <outputdir>
```

3. Graph Construction

Build interaction graphs from the annotated texts:

```bash
# Create graphs from annotated texts
python graph_creation/graph_joining.py
```

4. Analysis and Visualization

Analyze the generated graphs:

```bash
# Calculate graph metrics
python resultados/tabela.py

# Generate degree distribution plots
python resultados/distrib.py

# Open Jupyter notebooks for interactive analysis
jupyter notebook resultados/Plots.ipynb       # for data plots
jupyter notebook resultados/graph_plot.ipynb  # for graph visualization
```

📊 Results and Output

The pipeline generates several types of outputs:

Data Files

Raw Text Files: Extracted session transcripts organized by year and institution
JSON Files: NER results with identified entities and their positions
Annotated Files: Text files with embedded entity tags for validation
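
To illustrate how the JSON results relate to the annotated files, here is a minimal sketch that embeds entity tags into a transcript from a list of character spans. The JSON layout (`start`, `end`, `label` keys) and the `[PER]...[/PER]` tag style are assumptions made for this example; the formats actually produced by `ner/ner.py` and `ner/create_annotations.py` may differ.

```python
import json


def annotate(text, entities):
    """Wrap each entity span in [LABEL]...[/LABEL] tags (hypothetical tag style)."""
    out = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] < cursor:
            continue  # skip spans that overlap one already emitted
        out.append(text[cursor:ent["start"]])
        span = text[ent["start"]:ent["end"]]
        out.append(f'[{ent["label"]}]{span}[/{ent["label"]}]')
        cursor = ent["end"]
    out.append(text[cursor:])
    return "".join(out)


# Toy transcript line and a hand-built NER result (assumed schema).
text = "O SR. PRESIDENTE concede a palavra ao Senador Paulo Paim."
start = text.index("Paulo Paim")
ner_json = json.dumps([{"start": start, "end": start + len("Paulo Paim"), "label": "PER"}])

annotated = annotate(text, json.loads(ner_json))
print(annotated)
```

Keeping character offsets in the JSON, rather than only the matched strings, is what makes it possible to regenerate annotated text for manual validation.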

Graph Files

GEXF Format: Network graphs compatible with Gephi and other network analysis tools
Graph Metrics: CSV files containing network statistics (degree, clustering, centrality measures)
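
For readers unfamiliar with GEXF, the sketch below writes a minimal GEXF 1.2 document for a weighted directed graph using only the standard library. It is an illustration of the file format, not the repository's export code: graph libraries such as NetworkX ship their own GEXF writer (`networkx.write_gexf`), and the function name `to_gexf` and the toy edge dictionary here are hypothetical.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"


def to_gexf(edges):
    """Serialize {(source, target): weight} as a minimal GEXF 1.2 string."""
    names = sorted({n for pair in edges for n in pair})
    ids = {name: str(i) for i, name in enumerate(names)}

    root = ET.Element("gexf", xmlns=GEXF_NS, version="1.2")
    graph = ET.SubElement(root, "graph", defaultedgetype="directed")

    nodes_el = ET.SubElement(graph, "nodes")
    for name in names:
        ET.SubElement(nodes_el, "node", id=ids[name], label=name)

    edges_el = ET.SubElement(graph, "edges")
    for i, ((src, dst), weight) in enumerate(sorted(edges.items())):
        ET.SubElement(
            edges_el, "edge",
            id=str(i), source=ids[src], target=ids[dst], weight=str(weight),
        )
    return ET.tostring(root, encoding="unicode")


# Toy graph: speaker -> mentioned parliamentarian, weighted by mention count.
toy_edges = {("Speaker A", "Speaker B"): 3, ("Speaker B", "Speaker A"): 1}
gexf_xml = to_gexf(toy_edges)
print(gexf_xml)
```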

Visualizations

Degree Distribution Plots: Analysis of network connectivity patterns
Interactive Notebooks: Jupyter notebooks for exploring the data
Network Visualizations: Static and interactive graph plots

Key Metrics Calculated

Network Structure: Number of nodes, edges, density, diameter
Centrality Measures: Degree, betweenness, closeness centrality
Clustering: Local and global clustering coefficients
Degree Distribution: Power-law fitting and statistical analysis
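
Several of the metrics listed above can be computed from an edge list alone. The following plain-Python sketch shows density, in-degree centrality, and a degree distribution for a small directed graph. It is not the code in `resultados/tabela.py`; the function names and the toy edges are hypothetical, and a real analysis would more likely rely on a graph library.

```python
from collections import Counter


def density(n_nodes, n_edges):
    """Density of a directed graph without self-loops: m / (n * (n - 1))."""
    if n_nodes < 2:
        return 0.0
    return n_edges / (n_nodes * (n_nodes - 1))


def in_degree_centrality(edges):
    """In-degree of each node divided by (n - 1)."""
    nodes = {n for pair in edges for n in pair}
    in_deg = Counter(dst for _, dst in edges)
    scale = len(nodes) - 1
    return {n: (in_deg[n] / scale if scale else 0.0) for n in nodes}


def degree_distribution(edges):
    """Map total degree (in + out) to the number of nodes having it."""
    deg = Counter()
    for src, dst in edges:
        deg[src] += 1
        deg[dst] += 1
    return Counter(deg.values())


# Toy mention graph: (speaker, mentioned) -> mention count.
edges = {("A", "B"): 3, ("A", "C"): 1, ("B", "C"): 2, ("C", "A"): 5}
nodes = {n for pair in edges for n in pair}

graph_density = density(len(nodes), len(edges))
centrality = in_degree_centrality(edges)
distribution = degree_distribution(edges)
print(graph_density, centrality, dict(distribution))
```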

🔬 Technical Details

Named Entity Recognition

Uses transformer-based models (BERT) fine-tuned for Portuguese
Implements fuzzy matching for name normalization
Handles parliamentary titles and formal address patterns
Evaluation framework with precision, recall, and F1-score metrics
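
The name normalization step can be pictured with the sketch below, which strips parliamentary titles and accents and then fuzzy-matches the result against a roster of known names using the standard library's `difflib`. This is an assumption-laden illustration: the title list, the 0.8 cutoff, the roster, and the function names are all hypothetical, and the repository may use a different matching library or rules.

```python
import difflib
import re
import unicodedata

# Hypothetical list of leading forms of address to strip (already lowercased).
TITLES = re.compile(
    r"^(?:o sr\.?|a sra\.?|senadora|senador|deputada|deputado|sra\.?|sr\.?)\s+"
)


def normalize(name):
    """Lowercase, drop accents, strip leading titles, collapse whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    name = "".join(c for c in decomposed if not unicodedata.combining(c))
    name = name.lower().strip()
    previous = None
    while previous != name:  # titles can be stacked: "o sr. senador ..."
        previous = name
        name = TITLES.sub("", name)
    return re.sub(r"\s+", " ", name)


def canonical(mention, roster, cutoff=0.8):
    """Return the roster name closest to the mention, or None if none is close."""
    keys = {normalize(n): n for n in roster}
    hits = difflib.get_close_matches(normalize(mention), list(keys), n=1, cutoff=cutoff)
    return keys[hits[0]] if hits else None


roster = ["Paulo Paim", "Randolfe Rodrigues"]
print(canonical("O SR. PAULO PAIM", roster))
print(canonical("Senador Paulo Pain", roster))
print(canonical("Deputada Fulana", roster))
```

A cutoff that is too low merges distinct parliamentarians with similar names, while one that is too high splits a single person across several nodes, so the threshold is worth validating against the manually annotated files.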

Graph Construction

Weighted directed graphs based on mention frequency
Speaker-mention relationships from session transcripts
Name disambiguation using fuzzy string matching
Temporal analysis capabilities (graphs by year)
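
A minimal sketch of how speaker-mention pairs could be turned into weighted directed edges, grouped by year, is shown below. The record layout `(year, speaker, mentioned)` and the choice to ignore self-mentions are assumptions for illustration; `graph_creation/graph_joining.py` may organize its input and rules differently.

```python
from collections import Counter, defaultdict


def build_edges(records):
    """Count mentions per (speaker, mentioned) pair, separately for each year.

    records: iterable of (year, speaker, mentioned) tuples (assumed layout).
    Returns {year: Counter({(speaker, mentioned): weight})}.
    """
    by_year = defaultdict(Counter)
    for year, speaker, mentioned in records:
        if speaker == mentioned:
            continue  # assumption: self-mentions do not create an edge
        by_year[year][(speaker, mentioned)] += 1
    return by_year


# Toy records with placeholder names.
records = [
    (2023, "A", "B"),
    (2023, "A", "B"),
    (2023, "B", "A"),
    (2023, "A", "A"),
    (2024, "A", "C"),
]
graphs = build_edges(records)
for year, edge_weights in sorted(graphs.items()):
    print(year, dict(edge_weights))
```

Because the edge `("A", "B")` is stored separately from `("B", "A")`, the direction of each mention is preserved, and the count serves as the edge weight.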

Data Sources

Chamber of Deputies: https://escriba.camara.leg.br/escriba-servicosweb/html/{code}
Federal Senate: https://www25.senado.leg.br/web/atividade/notas-taquigraficas/-/notas/s/{code}
Senate Committees: https://www25.senado.leg.br/web/atividade/notas-taquigraficas/-/notas/r/{code}
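
The URL templates above can be filled in with a session code, and the downloaded HTML reduced to plain text, roughly as follows. This sketch makes no network request: it only formats URLs and strips tags from an inline HTML sample. The dictionary keys, function names, sample code `12345`, and sample HTML are hypothetical, and the repository's `*_txt.py` scripts may parse pages differently.

```python
from html.parser import HTMLParser

# URL templates taken from the data sources listed above; the keys are made up.
URL_TEMPLATES = {
    "camara": "https://escriba.camara.leg.br/escriba-servicosweb/html/{code}",
    "senado": "https://www25.senado.leg.br/web/atividade/notas-taquigraficas/-/notas/s/{code}",
    "senado_r": "https://www25.senado.leg.br/web/atividade/notas-taquigraficas/-/notas/r/{code}",
}


def session_url(source, code):
    """Build the transcript URL for a session code."""
    return URL_TEMPLATES[source].format(code=code)


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


sample_html = "<html><style>p{}</style><p>O SR. PRESIDENTE</p><p>Bom dia.</p></html>"
print(session_url("camara", 12345))
print(html_to_text(sample_html))
```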

🧪 Testing and Validation

The repository includes comprehensive testing infrastructure:

Manual Annotation: Ground truth files for NER evaluation
Comparison Tools: Automated evaluation of NER performance
Test Cases: Sample files for validating the pipeline
Error Analysis: Tools for identifying and analyzing processing errors
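
As a sketch of what the automated NER evaluation involves, the function below computes precision, recall, and F1 by exact match between gold and predicted entity spans. The span representation `(start, end, label)` and the exact-match criterion are assumptions for this example; `ner/compare_outputs.py` may score matches differently (for instance, by partial overlap).

```python
def precision_recall_f1(gold, predicted):
    """Exact-match scores over sets of (start, end, label) spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy spans: three manually annotated entities, four predicted, two in common.
gold_spans = [(0, 10, "PER"), (20, 30, "PER"), (40, 50, "PER")]
predicted_spans = [(0, 10, "PER"), (20, 30, "PER"), (60, 70, "PER"), (80, 90, "PER")]

p, r, f1 = precision_recall_f1(gold_spans, predicted_spans)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```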

📈 Performance Considerations

Memory Usage: Transformer models require significant RAM (4GB+ recommended)
Processing Time: NER processing can be time-intensive for large datasets
Storage: Raw data and results can occupy several GB of disk space
GPU Support: CUDA-enabled GPUs will significantly speed up NER processing

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Brazilian government for providing open access to parliamentary data
Hugging Face for transformer models and tools
NetworkX and igraph communities for graph analysis libraries
The open-source community for the various tools and libraries used

📬 Contact

For questions about this project, please open an issue in the repository or contact the maintainers.

This project is part of research into Brazilian parliamentary dynamics and network analysis. The data used is publicly available from official government sources.

Owner

Login: Eduardo-zampirolli
Kind: user

Repositories: 1
Profile: https://github.com/Eduardo-zampirolli

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Moura Zampirolli"
  given-names: "Eduardo"
  orcid: "https://orcid.org/0000-0000-0000-0000"
title: "Graph-extraction-from-Brazilian-Senate-and-Chamber-of-Deputies"
version: 1.0.0
doi: 10.5281/zenodo.1234
date-released: 2025-05-23
url: "https://github.com/Eduardo-zampirolli/Graph-extraction-from-Brazilian-Senate-and-Chamber-of-Deputies"
