ai-open-science
Repository for completing and submitting the assignments of the course Artificial Intelligence and Open Science in Research Software Engineering
Science Score: 67.0%
This score indicates how likely this project is to be science-related, based on the following indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ✓ DOI references: found 2 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (19.0%) to scientific vocabulary
Repository
Repository for completing and submitting the assignments of the course Artificial Intelligence and Open Science in Research Software Engineering
Basic Info
- Host: GitHub
- Owner: fran2410
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 11.9 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
AI-Open-Science
Description
This repository provides tools for extracting and visualizing information from scientific papers in XML format. Using GROBID for document processing, the scripts generate keyword clouds and per-document figure-count charts, and extract links from XML files.
Features
Given an XML file (or a directory containing several), the tool extracts the data and produces:
- Keyword Cloud: a keyword cloud based on the abstract.
- Charts: a chart showing the number of figures per article.
- Links: a list of the links found in each paper, ignoring references.
Project Structure
├── papers/ # Example research papers
├── data/ # Example XML files
├── results/ # Example directory for generated files
├── scripts/ # Python scripts for data extraction and visualization
│ ├── keywordCloud.py # Generates a keyword cloud from abstracts
│ ├── charts.py # Creates charts showing the number of figures per document
│ ├── list.py # Extracts links from XML files (excluding references)
├── docs/ # Additional documentation
├── tests/ # Tests to check functionality
Installing from GitHub
Clone the repository:
bash
git clone https://github.com/fran2410/AI-Open-Science.git
cd AI-Open-Science
1. Conda
For installing Conda on your system, please visit the official Conda documentation here.
Create and activate the Conda environment
bash
conda create -n ai-open-science python=3.13
conda activate ai-open-science
2. Poetry
For installing Poetry on your system, please visit the official Poetry documentation here.
Install project dependencies
Run the following command in the root of the repository to install dependencies:
bash
poetry install
Installing through Docker
We provide a Docker image with the scripts already installed. To run through Docker, you may build the Dockerfile provided in the repository by running:
bash
docker build -t ai-open-science .
Then, to run your image just type:
bash
docker run --rm -it ai-open-science
And you will be ready to use the scripts (see section below). If you want to have access to the results we recommend mounting a volume. For example, the following command will mount the current directory as the out folder in the Docker image:
bash
docker run -it --rm -v $PWD/out:/AI-Open-Science/out ai-open-science
If the scripts write their output to /AI-Open-Science/out (or you move the files they produce there), those files will appear in the out folder of your current directory on the host.
USAGE
You can use any folder to store the PDFs to be processed and any other to hold the results. They do not have to be named papers, data, or results; you just need to specify them when running the commands.
Using GROBID for XML Extraction
To extract structured XML data from PDFs using GROBID, follow these steps:
bash
docker run --rm -p 8070:8070 lfoppiano/grobid:latest-full
This will start the GROBID service on port 8070.
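Before sending documents, it can help to verify that the server is ready. A minimal Python sketch, assuming GROBID's default port and its /api/isalive endpoint (which returns the plain text "true" when the service is up); this helper is not part of the repository's scripts:

```python
# Minimal health check for a local GROBID server.
from urllib.request import urlopen

def grobid_is_alive(base_url: str = "http://localhost:8070") -> bool:
    """Return True if the GROBID service answers /api/isalive with 'true'."""
    try:
        with urlopen(f"{base_url}/api/isalive", timeout=5) as resp:
            return resp.status == 200 and resp.read().strip() == b"true"
    except OSError:
        # Connection refused, DNS failure, timeout, etc.
        return False
```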
- Process PDFs with GROBID: once the GROBID server is running, you can extract XML from a single PDF using the following command:
bash
curl -F input=@<pdf_file> "http://localhost:8070/api/processFulltextDocument" -o <output_xml>
Alternatively, for batch processing of all PDFs in a directory:
bash
for file in <pdf_folder>/*.pdf; do
curl -F input=@"$file" "http://localhost:8070/api/processFulltextDocument" -o "<output_folder>/$(basename "$file" .pdf).xml"
done
Generate Keyword Cloud
Extracts keywords from abstracts in XML files and creates a word cloud.
Command:
bash
python scripts/keywordCloud.py <folder_with_xmls> <output_folder>
Output: <output_folder>/keywordCloud.jpg
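The abstract-extraction step behind this kind of script can be sketched as follows. This is a hypothetical illustration, assuming GROBID's TEI output format where the abstract sits in a tei:abstract element; extract_abstract is not a function from the repository:

```python
# Pull the abstract text out of a GROBID TEI document.
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def extract_abstract(xml_text: str) -> str:
    """Return the concatenated text of the <abstract> element, or ''."""
    root = ET.fromstring(xml_text)
    abstract = root.find(".//tei:abstract", TEI_NS)
    if abstract is None:
        return ""
    # itertext() walks all nested text nodes (e.g. inside <p> children).
    return " ".join(abstract.itertext()).strip()
```

The resulting text would then be tokenized and fed to a word-cloud library.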
Chart Figures Count
Counts the number of figures in each XML file and generates a bar chart.
Command:
bash
python scripts/charts.py <folder_with_xmls> <output_folder>
Output: <output_folder>/charts.jpg
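The counting step can be sketched similarly; a hypothetical illustration assuming TEI output where each figure appears as a tei:figure element (count_figures is not part of the repository):

```python
# Count <figure> elements in a GROBID TEI document.
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def count_figures(xml_text: str) -> int:
    """Return the number of figure elements anywhere in the document."""
    root = ET.fromstring(xml_text)
    return len(root.findall(".//tei:figure", TEI_NS))
```

One such count per input file would then be plotted as a bar chart.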
Extract Links
Extracts links from XML files while ignoring references.
Command:
bash
python scripts/list.py <folder_with_xmls> <output_folder>
Output: <output_folder>/links.txt
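Ignoring references can be done by first collecting the link targets that appear inside the bibliography and excluding them from the full list. A hypothetical sketch, assuming TEI output where links are tei:ptr elements with a target attribute and the bibliography lives in tei:listBibl (extract_links is not a function from the repository):

```python
# Extract link targets from a GROBID TEI document, skipping references.
import xml.etree.ElementTree as ET

TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def extract_links(xml_text: str) -> list[str]:
    """Return link targets outside the bibliography, deduplicated, in order."""
    root = ET.fromstring(xml_text)
    # Targets that appear inside <listBibl> are bibliographic references.
    ref_targets = set()
    for bibl in root.findall(".//tei:listBibl", TEI_NS):
        for ptr in bibl.findall(".//tei:ptr[@target]", TEI_NS):
            ref_targets.add(ptr.get("target"))
    links = []
    for ptr in root.findall(".//tei:ptr[@target]", TEI_NS):
        target = ptr.get("target")
        if target not in ref_targets and target not in links:
            links.append(target)
    return links
```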
Examples
For a sample execution with provided XML data, see the results/ directory or run the scripts with sample files in data/.
Where to Get Help
For any issues or questions, please open an issue in the project's issue tracker.
Acknowledgements
Special thanks to the developers of GROBID for their scientific-document processing tool.
License
This project is distributed under the Apache 2.0 License. Contributions to the project must follow the same licensing terms.
Owner
- Login: fran2410
- Kind: user
- Repositories: 1
- Profile: https://github.com/fran2410
Citation (CITATION.cff)
title: "AI-Open-Science"
license: Apache-2.0
authors:
- family-names: Chicote
given-names: Francisco
cff-version: 1.0.0
date-released: 2025-02-04
message: "If you use this software, please cite the software itself."
preferred-citation:
authors:
- family-names: Chicote
given-names: Francisco
title: "AI-Open-Science"
type: software
year: 2025
doi: 10.5281/zenodo.14882667
CodeMeta (codemeta.json)
{
"@context": "https://w3id.org/codemeta/3.0",
"@type": "SoftwareSourceCode",
"license": {
"name": "Apache License 2.0",
"url": "https://raw.githubusercontent.com/fran2410/AI-Open-Science/main/LICENSE",
"identifier": "https://spdx.org/licenses/Apache-2.0"
},
"codeRepository": "https://github.com/fran2410/AI-Open-Science",
"issueTracker": "https://github.com/fran2410/AI-Open-Science/issues",
"dateCreated": "2025-02-04",
"dateModified": "2025-02-26",
"downloadUrl": "https://github.com/fran2410/AI-Open-Science/releases",
"name": "AI-Open-Science",
"keywords": "Docker, Python, GROBID, XML, data extraction, data visualization, keywordCloud, papers, links, figures",
"programmingLanguage": [
"Python",
"Dockerfile"
],
"softwareRequirements": [
"Run the following command in the root of the repository to install dependencies:\n```bash\npoetry install\n```\n"
],
"releaseNotes": "First release with full functionality, including all installation methods (GitHub and Docker), complete documentation, and citation. Ready for the Individual Project.",
"softwareVersion": "1.0.0",
"buildInstructions": [
"https://ai-open-science.readthedocs.io/",
"# Installing fron Github\n\n## Clone the repository:\n ```bash\n git clone https://github.com/fran2410/AI-Open-Science.git\n cd AI-Open-Science\n ```\n## 1. Conda\n\nFor installing Conda on your system, please visit the official Conda documentation [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html).\n\n#### Create and activate the Conda environment\n```bash\nconda create -n ai-open-science python=3.13 \nconda activate ai-open-science\n```\n\n## 2. Poetry\n\nFor installing Poetry on your system, please visit the official Poetry documentation [here](https://python-poetry.org/docs/#installation).\n\n#### Install project dependencies\nRun the following command in the root of the repository to install dependencies:\n```bash\npoetry install\n```\n\n# Installing through Docker\n\nWe provide a Docker image with the scripts already installed. To run through Docker, you may build the Dockerfile provided in the repository by running:\n\n```bash\ndocker build -t ai-open-science .\n```\n\nThen, to run your image just type:\n\n```bash\ndocker run --rm -it ai-open-science\n```\n\nAnd you will be ready to use the scripts (see section below). If you want to have access to the results we recommend [mounting a volume](https://docs.docker.com/storage/volumes/). For example, the following command will mount the current directory as the `out` folder in the Docker image:\n\n```bash\ndocker run -it --rm -v $PWD/out:/AI-Open-Science/out ai-open-science \n```\nIf you move any files produced by the scripts or set the output folder to `/out`, you will be able to see them in your current directory in the `/out` folder.\n",
"https://github.com/fran2410/AI-Open-Science/tree/main/docs",
"https://raw.githubusercontent.com/fran2410/AI-Open-Science/main/README.md"
],
"author": [
{
"@type": "Person",
"@id": "https://github.com/fran2410"
}
],
"referencePublication": [
{
"@type": "ScholarlyArticle",
"name": "ai-open-science",
"identifier": "10.5281/zenodo.14882667",
"url": "https://doi.org/10.5281/zenodo.14882667"
}
],
"identifier": "https://doi.org/10.5281/zenodo.14882666",
"readme": "https://raw.githubusercontent.com/fran2410/AI-Open-Science/main/README.md",
"description": [
"Repositorio para la realizacion y entrega de los trabajos de la asignatura Artificial Intelligence And Open Science In Research Software Engineering",
"This repository provides tools for extracting and visualizing information from scientific papers in XML format. Using [GROBID](https://github.com/kermitt2/grobid). for document processing, the scripts generate keyword clouds, charts displaying the number of figures per document, and extract links from XML files.\n",
"```\n├── papers/ # Example research papers\n├── data/ # Example XML files \n├── results/ # Example directory for generated files\n├── scripts/ # Python scripts for data extraction and visualization\n│ ├── keywordCloud.py # Generates a keyword cloud from abstracts\n│ ├── charts.py # Creates charts showing the number of figures per document\n│ ├── list.py # Extracts links from XML files (excluding references)\n├── docs/ # Additional documentation \n├── tests/ # Tests to check functionality \n```\n \n"
]
}
GitHub Events
Total
- Release event: 2
- Push event: 37
- Public event: 1
- Create event: 2
Last Year
- Release event: 2
- Push event: 37
- Public event: 1
- Create event: 2