ai-open-science

Repository for carrying out and submitting the assignments of the course Artificial Intelligence and Open Science in Research Software Engineering

https://github.com/fran2410/ai-open-science

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 2 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (19.0%) to scientific vocabulary
Last synced: 6 months ago

Repository

Repository for carrying out and submitting the assignments of the course Artificial Intelligence and Open Science in Research Software Engineering

Basic Info
  • Host: GitHub
  • Owner: fran2410
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 11.9 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 2
Created about 1 year ago · Last pushed about 1 year ago
Metadata Files
Readme · License · Citation · Codemeta

README.md

AI-Open-Science


Description

This repository provides tools for extracting and visualizing information from scientific papers in XML format. Using GROBID for document processing, the scripts generate keyword clouds, charts displaying the number of figures per document, and lists of links extracted from the XML files.

Features

Given an XML file (or a directory containing several), the tool extracts the data and produces:
  • Keyword Cloud: a keyword cloud based on the abstract text.
  • Charts: chart visualizations showing the number of figures per article.
  • Links: a list of the links found in each paper, ignoring references.

Project Structure

```
├── papers/              # Example research papers
├── data/                # Example XML files
├── results/             # Example directory for generated files
├── scripts/             # Python scripts for data extraction and visualization
│   ├── keywordCloud.py  # Generates a keyword cloud from abstracts
│   ├── charts.py        # Creates charts showing the number of figures per document
│   ├── list.py          # Extracts links from XML files (excluding references)
├── docs/                # Additional documentation
├── tests/               # Tests to check functionality
```

Installing from GitHub

Clone the repository:

```bash
git clone https://github.com/fran2410/AI-Open-Science.git
cd AI-Open-Science
```

1. Conda

For installing Conda on your system, please see the official Conda documentation: https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html

Create and activate the Conda environment

```bash
conda create -n ai-open-science python=3.13
conda activate ai-open-science
```

2. Poetry

For installing Poetry on your system, please see the official Poetry documentation: https://python-poetry.org/docs/#installation

Install project dependencies

Run the following command in the root of the repository to install dependencies:

```bash
poetry install
```

Installing through Docker

We provide a Docker image with the scripts already installed. To run through Docker, you may build the Dockerfile provided in the repository by running:

```bash
docker build -t ai-open-science .
```

Then, to run your image just type:

```bash
docker run --rm -it ai-open-science
```

You will then be ready to use the scripts (see the section below). If you want access to the results, we recommend mounting a volume. For example, the following command mounts the current directory as the out folder in the Docker image:

```bash
docker run -it --rm -v $PWD/out:/AI-Open-Science/out ai-open-science
```

If you move any files produced by the scripts to /out (or set the output folder to /out), they will appear in the out folder of your current directory.

Usage

You can use any folder to store the PDFs to be processed and any other for the results; they don't have to be named papers, data, or results. You just need to specify them when running the commands.

Using GROBID for XML Extraction

To extract structured XML data from PDFs using GROBID, follow these steps:

  1. Start the GROBID container. Run the following command to launch a GROBID server using Docker:

```bash
docker run --rm -p 8070:8070 lfoppiano/grobid:latest-full
```

This starts the GROBID service on port 8070.

  2. Process PDFs with GROBID. Once the GROBID server is running, you can extract XML from a PDF using the following command:

```bash
curl -F input=@<pdf_file> "http://localhost:8070/api/processFulltextDocument" -o <output_xml>
```

Note that the endpoint accepts one PDF per request. Alternatively, for batch processing of all PDFs in a directory:

```bash
for file in <pdf_folder>/*.pdf; do
  curl -F input=@"$file" "http://localhost:8070/api/processFulltextDocument" \
    -o "<output_folder>/$(basename "$file" .pdf).xml"
done
```
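The same batch step can be written in Python. The sketch below assumes a GROBID server is already running on localhost:8070; the helper names and the third-party `requests` dependency are illustrative and not part of the repository's scripts.

```python
from pathlib import Path

# Assumed GROBID endpoint (matches the curl examples above).
GROBID_URL = "http://localhost:8070/api/processFulltextDocument"

def xml_output_path(pdf_path, output_folder):
    """Map an input PDF path to the XML file the loop writes, e.g. a.pdf -> a.xml."""
    return Path(output_folder) / (Path(pdf_path).stem + ".xml")

def process_folder(pdf_folder, output_folder):
    """POST every PDF in pdf_folder to GROBID and save each TEI XML reply."""
    import requests  # third-party; only needed when a GROBID server is reachable
    Path(output_folder).mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(pdf_folder).glob("*.pdf")):
        with open(pdf, "rb") as fh:
            resp = requests.post(GROBID_URL, files={"input": fh}, timeout=120)
        resp.raise_for_status()
        xml_output_path(pdf, output_folder).write_text(resp.text, encoding="utf-8")
```

This mirrors the shell loop: one request per PDF, output named after the input file's stem.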

Generate Keyword Cloud

Extracts keywords from abstracts in XML files and creates a word cloud.

Command:

```bash
python scripts/keywordCloud.py <folder_with_xmls> <output_folder>
```

Output: <output_folder>/keywordCloud.jpg
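The script's internals aren't shown on this page, but the core step it relies on, pulling abstract text out of GROBID's TEI XML and counting tokens, can be sketched as follows. The stopword list and sample document are illustrative; rendering the actual word cloud (e.g. with the wordcloud package) is omitted.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

TEI = {"tei": "http://www.tei-c.org/ns/1.0"}  # GROBID emits TEI-namespaced XML

STOPWORDS = {"the", "a", "an", "of", "and", "in", "to", "for", "is", "we"}

def abstract_keywords(tei_xml):
    """Count non-stopword tokens in the <abstract> of one TEI document."""
    root = ET.fromstring(tei_xml)
    abstract = root.find(".//tei:abstract", TEI)
    text = " ".join(abstract.itertext()) if abstract is not None else ""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

# Minimal illustrative TEI fragment, not real GROBID output.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><profileDesc>
    <abstract><p>Keyword clouds summarise keyword frequency in papers.</p></abstract>
  </profileDesc></teiHeader>
</TEI>"""
print(abstract_keywords(sample).most_common(1))  # → [('keyword', 2)]
```

The resulting frequency table is what a word-cloud renderer would consume.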

Chart Figures Count

Counts the number of figures in each XML file and generates a bar chart.

Command:

```bash
python scripts/charts.py <folder_with_xmls> <output_folder>
```

Output: <output_folder>/charts.jpg
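The counting step behind this script can be sketched as below, assuming the standard TEI layout GROBID produces (figures appear as <figure> elements in the body). The sample fragment is illustrative, and the bar-chart rendering (e.g. with matplotlib) is omitted.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"  # GROBID output is TEI-namespaced

def count_figures(tei_xml):
    """Return the number of <figure> elements in one TEI document."""
    root = ET.fromstring(tei_xml)
    return len(root.findall(f".//{TEI_NS}figure"))

# Minimal illustrative TEI fragment, not real GROBID output.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body>
<figure><head>Figure 1</head></figure>
<figure><head>Figure 2</head></figure>
</body></text></TEI>"""
```

Running `count_figures` over each XML file in a folder yields the per-document counts the bar chart plots.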

Extract Links

Extracts links from XML files while ignoring references.

Command:

```bash
python scripts/list.py <folder_with_xmls> <output_folder>
```

Output: <output_folder>/links.txt
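"Ignoring references" can be implemented by skipping everything inside TEI's <listBibl> (the bibliography) while collecting target URLs elsewhere. The sketch below assumes GROBID's TEI layout; the element choices and sample fragment are illustrative, not the repository's actual code.

```python
import xml.etree.ElementTree as ET

TEI_NS = "{http://www.tei-c.org/ns/1.0}"  # GROBID output is TEI-namespaced

def extract_links(tei_xml):
    """Collect target URLs from the document, skipping the <listBibl> references."""
    root = ET.fromstring(tei_xml)
    # Record every element that lives inside a bibliography subtree.
    skip = set()
    for bibl in root.iter(f"{TEI_NS}listBibl"):
        skip.update(id(el) for el in bibl.iter())
    return [el.get("target") for el in root.iter()
            if id(el) not in skip and (el.get("target") or "").startswith("http")]

# Minimal illustrative TEI fragment: one link in the body, one in the references.
sample = """<TEI xmlns="http://www.tei-c.org/ns/1.0"><text>
<body><p>See <ptr target="https://example.org/tool"/>.</p></body>
<back><listBibl><bibl><ptr target="https://example.org/cited-paper"/></bibl></listBibl></back>
</text></TEI>"""
```

Here `extract_links(sample)` keeps the body link and drops the one inside the bibliography.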

Examples

For a sample execution with provided XML data, see the results/ directory or run the scripts with sample files in data/.

Where to Get Help

For any issues or questions, please open an issue in the project's issue tracker.

Acknowledgements

Special thanks to the developers of GROBID for their tool for processing scientific documents.

License

This project is distributed under the Apache 2.0 License. Contributions to the project must follow the same licensing terms.

Owner

  • Login: fran2410
  • Kind: user

Citation (CITATION.cff)

title: "AI-Open-Science"
license: Apache-2.0
authors:
  - family-names: Chicote
    given-names: Francisco
cff-version: 1.0.0
date-released: 2025-02-04
message: "If you use this software, please cite the software itself."
preferred-citation:
  authors:
  - family-names: Chicote
    given-names: Francisco
  title: "AI-Open-Science"
  type: software
  year: 2025
  doi: 10.5281/zenodo.14882667

CodeMeta (codemeta.json)

{
  "@context": "https://w3id.org/codemeta/3.0",
  "@type": "SoftwareSourceCode",
  "license": {
    "name": "Apache License 2.0",
    "url": "https://raw.githubusercontent.com/fran2410/AI-Open-Science/main/LICENSE",
    "identifier": "https://spdx.org/licenses/Apache-2.0"
  },
  "codeRepository": "https://github.com/fran2410/AI-Open-Science",
  "issueTracker": "https://github.com/fran2410/AI-Open-Science/issues",
  "dateCreated": "2025-02-04",
  "dateModified": "2025-02-26",
  "downloadUrl": "https://github.com/fran2410/AI-Open-Science/releases",
  "name": "AI-Open-Science",
  "keywords": "Docker, Python, GROBID, XML, data extraction, data visualization, keywordCloud, papers, links, figures",
  "programmingLanguage": [
    "Python",
    "Dockerfile"
  ],
  "softwareRequirements": [
    "Run the following command in the root of the repository to install dependencies:\n```bash\npoetry install\n```\n"
  ],
  "releaseNotes": "First release with full functionality, including all installation methods (GitHub and Docker), complete documentation, and citation. Ready for the Individual Project.",
  "softwareVersion": "1.0.0",
  "buildInstructions": [
    "https://ai-open-science.readthedocs.io/",
    "# Installing fron Github\n\n##  Clone the repository:\n   ```bash\n   git clone https://github.com/fran2410/AI-Open-Science.git\n   cd AI-Open-Science\n   ```\n## 1. Conda\n\nFor installing Conda on your system, please visit the official Conda documentation [here](https://docs.conda.io/projects/conda/en/latest/user-guide/install/index.html).\n\n#### Create and activate the Conda environment\n```bash\nconda create -n ai-open-science python=3.13 \nconda activate ai-open-science\n```\n\n## 2. Poetry\n\nFor installing Poetry on your system, please visit the official Poetry documentation [here](https://python-poetry.org/docs/#installation).\n\n#### Install project dependencies\nRun the following command in the root of the repository to install dependencies:\n```bash\npoetry install\n```\n\n# Installing through Docker\n\nWe provide a Docker image with the scripts already installed. To run through Docker, you may build the Dockerfile provided in the repository by running:\n\n```bash\ndocker build -t ai-open-science .\n```\n\nThen, to run your image just type:\n\n```bash\ndocker run --rm -it  ai-open-science\n```\n\nAnd you will be ready to use the scripts (see section below). If you want to have access to the results we recommend [mounting a volume](https://docs.docker.com/storage/volumes/). For example, the following command will mount the current directory as the `out` folder in the Docker image:\n\n```bash\ndocker run -it --rm -v $PWD/out:/AI-Open-Science/out ai-open-science \n```\nIf you move any files produced by the scripts or set the output folder to `/out`, you will be able to see them in your current directory in the `/out` folder.\n",
    "https://github.com/fran2410/AI-Open-Science/tree/main/docs",
    "https://raw.githubusercontent.com/fran2410/AI-Open-Science/main/README.md"
  ],
  "author": [
    {
      "@type": "Person",
      "@id": "https://github.com/fran2410"
    }
  ],
  "referencePublication": [
    {
      "@type": "ScholarlyArticle",
      "name": "ai-open-science",
      "identifier": "10.5281/zenodo.14882667",
      "url": "https://doi.org/10.5281/zenodo.14882667"
    }
  ],
  "identifier": "https://doi.org/10.5281/zenodo.14882666",
  "readme": "https://raw.githubusercontent.com/fran2410/AI-Open-Science/main/README.md",
  "description": [
    "Repositorio para la realizacion y entrega de los trabajos de la asignatura Artificial Intelligence And Open Science In Research Software Engineering",
    "This repository provides tools for extracting and visualizing information from scientific papers in XML format. Using [GROBID](https://github.com/kermitt2/grobid). for document processing, the scripts generate keyword clouds, charts displaying the number of figures per document, and extract links from XML files.\n",
    "```\n├── papers/              # Example research papers\n├── data/                # Example XML files \n├── results/             # Example directory for generated files\n├── scripts/             # Python scripts for data extraction and visualization\n│   ├── keywordCloud.py  # Generates a keyword cloud from abstracts\n│   ├── charts.py        # Creates charts showing the number of figures per document\n│   ├── list.py          # Extracts links from XML files (excluding references)\n├── docs/                # Additional documentation \n├── tests/               # Tests to check functionality \n```\n \n"
  ]
}

GitHub Events

Total
  • Release event: 2
  • Push event: 37
  • Public event: 1
  • Create event: 2
Last Year
  • Release event: 2
  • Push event: 37
  • Public event: 1
  • Create event: 2