Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ✓ DOI references: Found 15 DOI reference(s) in README
- ✓ Academic publication links: Links to ncbi.nlm.nih.gov, ieee.org, zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (15.4%) to scientific vocabulary
Keywords
Repository
Article Analysis Assistant
Basic Info
Statistics
- Stars: 18
- Watchers: 1
- Forks: 2
- Open Issues: 14
- Releases: 3
Topics
Metadata Files
README.md
Triple-a - Article Analysis Assistant
Triple-A is a tool for creating a repository of scientific articles and performing citation graph analysis, bibliometric analysis, and automatic data extraction on that repository.
The program builds a network of article references and connects authors and keywords; such a network is usually called a "Citation Graph".
There are various software packages and online systems for this, a brief review of which can be found here.
This tool gives you the power to create a graph of articles and analyze it. It is designed as a CLI (command-line interface) and can also be used as a Python library.
Table of contents
- Main Features
- How to Install
- How to Work with the Program
- Testing
- Dependencies
- Use Case
- Public Dataset
- Graph Visualization
- Graph Analysis
- Knowledge Extraction
- Related Article
- Code Quality
- Citation
- License
🎮 Main Features
- Repository Creation: Collect and store articles based on a search strategy.
- Citation Graph Analysis: Generate and analyze citation networks between articles.
- Bibliometric Analysis: Perform advanced bibliometric analysis.
- Retrieval-Augmented Generation (RAG): Automatically retrieve and analyze content for a domain of articles.
- Single Article Analysis: Analyze individual articles.
- Network Analysis: Conduct detailed network analysis at both node and overall graph levels.
- Bibliography Import: Easily import bibliography files in various formats (e.g., .bib, .ris).
- LLM Research Querying: Ask an LLM research questions from the repository of articles and review its results.
- Topic Extraction: Perform topic extraction using an external service.
- Affiliation Parsing: Perform affiliation parsing using an external service.
How to Install
Installation From Source Code
Step 1: Clone the Repository
First, clone the TripleA repository from GitHub using one of the following commands:
For HTTPS:
shell
git clone https://github.com/EhsanBitaraf/triple-a.git
For SSH:
shell
git clone git@github.com:EhsanBitaraf/triple-a.git
Step 2: Create a Python Virtual Environment
Navigate to the repository directory and create a Python virtual environment to isolate your project dependencies:
shell
python -m venv venv
Step 3: Activate the Virtual Environment
For Windows:
shell
$ .\venv\Scripts\activate
For Linux/macOS:
shell
$ source venv/bin/activate
Step 4: Install Poetry
Poetry is used for managing dependencies in this project. If you don't already have Poetry installed, install it using pip:
shell
pip install poetry
Step 5: Install Dependencies
Once Poetry is installed, use it to install all the required dependencies for the project:
shell
poetry install
Step 6: Run the CLI
After the dependencies are installed, you can run the CLI by executing the following command:
shell
poetry run python triplea/cli/aaa.py
This will launch the TripleA CLI, where you can interact with the various commands available.
Step 7: (Optional) Configure Environment Variables
To customize your environment, you can create a .env file in the root directory of the project. Refer to the installation from package instructions for the full list of environment variables you can set.
If the .env file is not created, default values will be used as specified in the package.
Installation from package
It is recommended to create a Python virtual environment before installing the package to keep your project dependencies isolated. You can do so by running the following commands:
Step 1: Create a Python Virtual Environment
sh
$ python -m venv venv
Step 2: Activate the Virtual Environment
For Windows:
sh
$ .\venv\Scripts\activate
For macOS/Linux:
sh
$ source venv/bin/activate
Step 3: Install the Package
You can install the TripleA package from PyPI using pip:
sh
$ pip install triplea
Alternatively, you can install the package directly from the GitHub repository:
sh
$ pip install git+https://github.com/EhsanBitaraf/triple-a
Step 4: Configure Environment Variables
Create a .env file in the root of your project to set environment variables for the package. This file should contain the following key-value pairs:
TRIPLEA_DB_TYPE = TinyDB
AAA_TINYDB_FILENAME = articledata.json
AAA_MONGODB_CONNECTION_URL = mongodb://localhost:27017/
AAA_MONGODB_DB_NAME = articledata
AAA_TPS_LIMIT = 1
AAA_PROXY_HTTP =
AAA_PROXY_HTTPS =
AAA_REFF_CRAWLER_DEEP = 1
AAA_CITED_CRAWLER_DEEP = 1
AAA_TOPIC_EXTRACT_ENDPOINT = http://localhost:8001/api/v1/topic/
AAA_CLIENT_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/109.0"
If the .env file is not created, the default values will be used:
TRIPLEA_DB_TYPE = TinyDB
AAA_TINYDB_FILENAME = default-tiny-db.json
AAA_TPS_LIMIT = 1
AAA_REFF_CRAWLER_DEEP = 1
AAA_CITED_CRAWLER_DEEP = 1
AAA_TOPIC_EXTRACT_ENDPOINT = http://localhost:8001/api/v1/topic/
AAA_CLIENT_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:109.0) Gecko/20100101 Firefox/109.0"
For reference, the latest version of a sample .env file can be found here.
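As a quick check that these variables are being picked up, here is a minimal sketch using python-dotenv (one of the project's dependencies). This is an illustration only, not the package's internal loading code:

```python
# Sketch: confirm .env values are readable with python-dotenv.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file in the current working directory

# Falls back to the documented defaults when the variable is not set.
print(os.getenv("TRIPLEA_DB_TYPE", "TinyDB"))
print(os.getenv("AAA_TINYDB_FILENAME", "default-tiny-db.json"))
```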
Step 5: Running the CLI
You can access the TripleA CLI by running the following command:
sh
$ aaa --help
The output will be:
```sh
Usage: aaa [OPTIONS] COMMAND [ARGS]...

Options:
  -v, --version
  --help          Show this message and exit.

Commands:
  analysis       Analysis Graph.
  config         Configuration additional setting.
  export         Export article repository in specific format.
  exportarticle  Export Article by identifier.
  exportgraph    Export Graph.
  export_llm     Export preTrain LLM.
  go             Moves the articles state in the Arepo until end state.
  import         Import article from specific file format to article...
  importbib      Import article from .bib, .enw, .ris file format.
  ner            Single NER with custom model.
  next           Moves the articles state in the Arepo from the current...
  pipeline       Run Custom Pipeline in arepo.
  search         Search query from PubMed and store to Arepo.
```
Note:
- The visualization feature is only available in the source version of the package.
Tutorial:
For additional guides and programming examples beyond using the package, please refer to the cookbook section.
How to Work with the Program
Once the program is installed, you can utilize it both as a CLI tool and by calling its functions directly within your Python code. Below is a step-by-step guide for retrieving articles, processing them through various stages, and performing additional tasks like topic extraction and affiliation mining.
Functional Use
Step 1 - Get Articles from Arxiv
You can retrieve articles from Arxiv using a specific search query and store them into the article repository (Arepo). Here's an example using a search string for large language models:
python
arxiv_search_string = '(ti:"Large language model" OR ti:"Large language models" OR (ti:large AND ti:"language model") OR (ti:large AND ti:"language models") OR (ti:"large language" AND ti:model) OR (ti:"large language" AND ti:models) OR ti:"language model" OR ti:"language models" OR ti:LLM OR ti:LLMs OR ti:"GPT models" OR ti:"GPT model" OR ti:Gpt OR ti:gpts OR ti:Chatgpt OR ti:"generative pre-trained transformer" OR ti:"bidirectional encoder representations from transformers" OR ti:BERT OR ti:"transformer-based model" OR (ti:transformer AND ti:model) OR (ti:transformers AND ti:model) OR (ti:transformer AND ti:models) OR (ti:transformers AND ti:models)) AND (ti:Evaluation OR ti:Evaluat* OR ti:Assessment OR ti:Assess* OR ti:Validation OR ti:Validat* OR ti:Benchmarking OR ti:Benchmark*)'
get_article_list_from_arxiv_all_store_to_arepo(arxiv_search_string, 0, 5000)
This fetches articles based on the query and stores up to 5000 articles into the repository.
Step 2 - Get Articles from PubMed
Similarly, you can retrieve articles from PubMed using a custom search string and store them in the repository:
python
pubmed_search_string = '("Large language model"[ti] OR "Large language models"[ti] OR (large[ti] AND "language model"[ti]) OR (large[ti] AND "language models"[ti]) OR ("large language"[ti] AND model[ti]) OR ("large language"[ti] AND models[ti]) OR "language model"[ti] OR "language models"[ti] OR LLM[ti] OR LLMs[ti] OR "GPT models"[ti] OR "GPT model"[ti] OR Gpt[ti] OR gpts[ti] OR Chatgpt[ti] OR "generative pre-trained transformer"[ti] OR "bidirectional encoder representations from transformers"[ti] OR BERT[ti] OR "transformer-based model"[ti] OR (transformer[ti] AND model[ti]) OR (transformers[ti] AND model[ti]) OR (transformer[ti] AND models[ti]) OR (transformers[ti] AND models[ti])) AND (Evaluation[ti] OR Evaluat*[ti] OR Assessment[ti] OR Assess*[ti] OR Validation[ti] OR Validat*[ti] OR Benchmarking[ti] OR Benchmark*[ti])'
get_article_list_from_pubmed_all_store_to_arepo(pubmed_search_string)
This stores all relevant articles from PubMed into the repository based on the specified query.
Step 3 - Get Information from Repository
To print article information that has been stored in the repository, you can use the following command:
python
PERSIST.print_article_info_from_repo()
This will print out details about the articles that have been saved so far.
Step 4 - Move Articles from State 0 to State 1 (Save Article Details)
The articles are initially stored in state 0. Use this command to move them to state 1, where their original details (in JSON format) will be saved:
python
move_state_forward(0)
Step 5 - Move Articles from State 1 to State 2 (Parse Article Information)
To parse the article's detailed information and move it from state 1 to state 2:
python
move_state_forward(1)
Step 6 - Move Articles from State 2 to State 3 (Get Citations)
This step retrieves citation data for the articles and moves them to state 3:
python
move_state_forward(2)
Step 7 - Move Articles from State 3 to State 4 (Get Full Text)
Fetch the full text of the articles and move them from state 3 to state 4:
python
move_state_forward(3)
Step 8 - Move Articles from State 4 to State 5 (Convert Full Text to String)
In this step, the full text of the articles is converted to a string format for further analysis:
python
move_state_forward(4)
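The individual steps above can also be chained. A minimal sketch, using only the move_state_forward() function shown above, that advances every article from state 0 through state 5:

```python
# Illustration only: advance the core pipeline one state at a time.
# Each call moves all articles currently in `state` to `state + 1`.
for state in range(5):  # states 0..4
    move_state_forward(state)
```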
Custom Pipeline Operations
Once the articles have been processed through the various states, you can perform more advanced operations in the custom pipeline.
1. Extract Topics from Articles
This function will run the topic extraction process on the articles:
python
cPIPELINE.go_extract_topic()
2. Perform Affiliation Mining
You can extract affiliation information from articles using the method specified ("Titipata" in this case):
python
cPIPELINE.go_affiliation_mining(method="Titipata")
3. Extract Triples (Subject-Predicate-Object Relations)
To extract triples (semantic relations) from the articles:
python
cPIPELINE.go_extract_triple()
4. Generate Short Review Article with LLM
This function allows you to create a brief review of the articles using a large language model (LLM):
python
cPIPELINE.go_article_review_by_llm()
5. Export Data
Finally, to export the processed data (e.g., triples) in a CSV format:
python
export_triplea_csvs_in_relational_mode_save_file("export.csv")
This saves the exported data into a CSV file named export.csv.
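To take a quick look at the exported file, you could load it with pandas (already a project dependency). This is only an illustrative sketch; the column layout depends on your repository contents:

```python
# Sketch: peek at the exported CSV; column names are not assumed here.
import pandas as pd

df = pd.read_csv("export.csv")
print(df.shape)   # (rows, columns)
print(df.head())  # first few exported records
```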
Some practices
get list of PMID in state 0
python
term = '("Electronic Health Records"[Mesh]) AND ("National"[Title/Abstract]) AND Iran'
get_article_list_all_store_to_kg_rep(term)
move from state 1
python
move_state_forward(1)
get list of PMID in state 0 and save to file for debugging use
python
import json

data = get_article_list_from_pubmed(1, 10, '("Electronic Health Records"[Mesh]) AND ("National"[Title/Abstract])')
data = get_article_list_from_pubmed(1, 10, '"Electronic Health Records"')
data1 = json.dumps(data, indent=4)
with open("sample1.json", "w") as outfile:
    outfile.write(data1)
open the previously saved file for debugging use
python
f = open('sample1.json')
data = json.load(f)
f.close()
get one article from kg and save to file
python
data = get_article_by_pmid('32434767')
data= json.dumps(data, indent=4)
with open("one-article.json", "w") as outfile:
outfile.write(data)
Save Title for Annotation
python
file = open("article-title.txt", "w", encoding="utf-8")
la = get_article_by_state(2)
for a in la:
    try:
        article = Article(**a.copy())
    except Exception:
        continue  # skip articles that cannot be parsed
    file.write(article.Title + "\n")
file.close()
Training NER for Article Title
You can use NLP (Natural Language Processing) methods to extract information from the structure of the article and add it to your graph. For example, you can extract NER (Named-entity recognition) terms from the title of the article and add them to the graph. Here's how to create a custom NER.
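For a rough idea of what title-level NER looks like, here is a minimal sketch using a generic pretrained spaCy pipeline (spaCy is a project dependency). Note that this uses a stock model, not the project's custom NER model:

```python
# Sketch with a stock spaCy model, not the custom NER model described above.
# Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The Iranian Integrated Care Electronic Health Record.")
for ent in doc.ents:
    print(ent.text, ent.label_)
```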
Command Line (CLI) Use
By using the following command, you can see the command completion help. Each command has a separate help.
shell
python .\triplea\cli\aaa.py --help
output:

Get and Save a list of article identifiers based on a search term
Get a list of article identifiers (such as PMIDs) based on a search term and save them into the knowledge repository in the first state (0):
use this command:
shell
python .\triplea\cli\aaa.py search --searchterm [searchterm]
Even the PMID itself can be used in the search term.
shell
python .\triplea\cli\aaa.py search --searchterm 36467335
output:

Move core pipeline state
Preparing an article for graph extraction involves several steps arranged in a pipeline. Each step is identified by a number in the state value. The following table describes the state numbers:
List of state number
|State|Short Description|Description|
|-----|-----------------|-----------|
|0 |article identifier saved|At this stage, the article object stored in the data bank has only one identifier, such as a PMID or DOI|
|1 |article details saved (JSON form)|Metadata related to the article is stored in the OriginalArticle field from the SourceBank, but it has not been parsed yet|
|2 |parse details info|The contents of the OriginalArticle field are parsed and placed in the fields of the Article object.|
|3 |Get Citation ||
|4 |Get Full Text |At this stage, articles that are open access and whose full text can be retrieved are fetched and added to the bank|
|5 |Convert full text to string ||
|-1 |Error |if an error happened while moving from state 1 to 2|
|-2 |Error |if an error happened while moving from state 2 to 3|
There are two ways to run a pipeline. In the first method, we give the number of the existing state and all the articles in this state move forward one state.
In another method, we give the final state number and each article under that state starts to move until it reaches the final state number that we specified.
The first can be executed with the next command and the second with the go command.
With this command, articles move from the current state to the next state:
shell
python .\triplea\cli\aaa.py next --state [current state]
For example, to move all articles from state 0 to state 1:
shell
python .\triplea\cli\aaa.py next --state 0
output:

go command:
shell
python .\triplea\cli\aaa.py go --end [last state]
shell
python .\triplea\cli\aaa.py go --end 3
output:

Run custom pipeline
Apart from the core pipelines that should be used to prepare articles, customized pipelines can also be used. Custom pipelines may be implemented to extract knowledge from texts and NLP processing. These pipelines themselves can form a new graph other than the citation graph or in combination with it.
List of Custom Pipeline
|Action|Tag Name|Description|Prerequisite|
|------|--------|-----------|------------|
|Triple extraction from article abstract|FlagExtractKG||At least core state 2|
|Topic extraction from article abstract|FlagExtractTopic||At least core state 2|
|Convert Affiliation text to structural data|FlagAffiliationMining|This is a simple way to parse affiliation text|At least core state 2|
|Convert Affiliation text to structural data|FlagAffiliationMining_Titipata|Uses Titipat Achakulvisut's repo for parsing affiliation text|At least core state 2|
|Text embedding abstract and send to SciGenius|FlagEmbedding||At least core state 2|
|Title and Abstract Review by LLM|FlagShortReviewByLLM||At least core state 2|
NER Article Title
You can try the NER method to extract the major topic of an article's title using the following command. This command is standalone, used for testing, and its output is not stored in the Arepo.
shell
python .\triplea\cli\ner.py --title "The Iranian Integrated Care Electronic Health Record."
Country-based Co-authorship
A country-based co-authorship network refers to a network of collaborative relationships between researchers from different countries who have co-authored academic papers together. It represents the connections and collaborations that exist among researchers across national boundaries.
By studying a country-based co-authorship network, researchers can gain insights into international collaborations, identify emerging research trends, foster interdisciplinary cooperation, and facilitate policy decisions related to research funding, academic mobility, and scientific development at a global scale.
There are several software tools available that can help you produce country-based co-authorship networks. Here are a few popular options:
VOSviewer: VOSviewer is a widely used software tool for constructing and visualizing co-authorship networks. It offers various clustering and visualization techniques and allows you to analyze and explore the network based on different attributes, including country affiliation.
Sci2 Tool: The Science of Science (Sci2) Tool is a Java-based software package (in GitHub) that supports the analysis and visualization of scientific networks. It offers a variety of functionalities for constructing and analyzing co-authorship networks, including country-based analysis. It allows users to perform data preprocessing, network analysis, and visualization within a single integrated environment.
To convert affiliations into a hierarchical structure of country, city, and center, you can use the following command:
shell
python .\triplea\cli\aaa.py pipeline -n FlagAffiliationMining
Extract Triple from Abstract
shell
python .\triplea\cli\aaa.py pipeline --name FlagExtractKG
Extract Topic from Abstract
shell
python .\triplea\cli\aaa.py pipeline --name FlagExtractTopic
An example of working with the functions of this part using Jupyter is given here; the result is finally drawn with the VOSviewer program as shown below:

Import Data
Import Single Reference File
Supported import file types are .bib, .enw, and .ris.
shell
python .\triplea\cli\importbib.py "C:\...\bc.ris"
output:

Import Triplea Format
sh
python .\triplea\cli\aaa.py import --help
sh
python .\triplea\cli\aaa.py import --type triplea --format json --bar True "C:\BibliometricAnalysis.json"
Export Data
Various data export can be created from the article repository. These outputs are used to create raw datasets.
|Type|Format|
|----|------|
|triplea|json, csv, csvs|
|rayyan|csv|
|RefMan*|ris|
\* Not yet implemented.
For help with the export command:
sh
python .\triplea\cli\aaa.py export --help
For example, the following command limits the export to 100 samples and saves the exported articles in the TripleA JSON format to a file named "testexport.json":
```sh
python .\triplea\cli\aaa.py export --type triplea --format json --limit 100 --output "testexport.json"
```
sh
python .\triplea\cli\aaa.py export --type triplea --format json --output "test_export.json"
Export Triplea CSV format:
sh
python .\triplea\cli\aaa.py export --type triplea --format csv --output "test_export.csv"
sh
python .\triplea\cli\aaa.py export --type triplea --format csvs --output "export.csv"
Export for Rayyan CSV format:
sh
python .\triplea\cli\aaa.py export --type rayyan --format csv --output "test_export.csv"
Export Graph
For detailed information:
sh
python .\triplea\cli\aaa.py export_graph --help
Making a graph in graphml format and saving it to the file test.graphml:
shell
python .\triplea\cli\aaa.py export_graph -g gen-all -f graphml -o .\triplea\test
Making a graph in gexf format and saving it to the file C:\Users\Dr bitaraf\Documents\graph\article.gexf. This graph contains articles, authors, affiliations, and the relations between them:
shell
python .\triplea\cli\aaa.py export_graph -g article-author-affiliation -f gexf -o "C:\Users\Dr bitaraf\Documents\graph\article"
Making a graph in graphdict format and saving it to the file C:\Users\Dr bitaraf\Documents\graph\article.json. This graph contains articles, references, article citations, and the relations between them:
shell
python .\triplea\cli\aaa.py export_graph -g article-reference -g article-cited -f graphdict -o "C:\Users\Dr bitaraf\Documents\graph\article.json"
Making a graph in graphml format and saving it to the file C:\graph-repo\country-authorship.graphml. This graph contains articles, countries, and the relations between them:
shell
python .\triplea\cli\aaa.py export_graph -g country-authorship -f graphml -o "C:\graph-repo\country-authorship"
Types of graph generators that can be used in the -g parameter:
|Name|Description|
|----|-----------|
|store|It considers all the nodes and edges that are stored in the database|
|gen-all|It considers all possible nodes and edges|
|article-topic|It considers article and topic as nodes and edges between them|
|article-author-affiliation|It considers article, author and affiliation as nodes and edges between them|
|article-keyword|It considers article and keyword as nodes and edges between them|
|article-reference|It considers article and reference as nodes and edges between them|
|article-cited|It considers article and cited as nodes and edges between them|
|country-authorship||
Types of graph file format that can be used in the -f parameter:
|Name|Description|
|----|-----------|
|graphdict|This format is a customized format for citation graphs in the form of a Python dictionary.|
|graphjson||
|gson||
|gpickle|Write graph in Python pickle format. Pickles are a serialized byte stream of a Python object|
|graphml|The GraphML file format uses .graphml extension and is XML structured. It supports attributes for nodes and edges, hierarchical graphs and benefits from a flexible architecture.|
|gexf|GEXF (Graph Exchange XML Format) is an XML-based file format for storing a single undirected or directed graph.|
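Files exported with -f graphml can be read back with NetworkX, the graph library used by this project. A minimal sketch, assuming a file named test.graphml produced by the export_graph command above:

```python
# Sketch: load an exported GraphML file and report its size.
import networkx as nx

g = nx.read_graphml("test.graphml")
print(g.number_of_nodes(), "nodes,", g.number_of_edges(), "edges")
```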
Visualizing Graph
Several visualizers are used to display graphs in this program. These include:
Alchemy.js: Alchemy.js is a graph drawing application built almost entirely in d3.
InteractiveGraph: InteractiveGraph provides a web-based interactive visualization and analysis framework for large graph data, which may come from a GSON file.
netwulf: Interactive visualization of networks based on Ulf Aslak's d3 web app.
shell
python .\triplea\cli\aaa.py visualize -g article-reference -g article-cited -p 8001
shell
python .\triplea\cli\aaa.py visualize -g gen-all -p 8001
output:

shell
python .\triplea\cli\aaa.py visualize -g article-topic -g article-keyword -p 8001
output:

Visualize File
A file related to the extracted graph can be visualized in different formats with the following command:
sh
python .\triplea\cli\aaa.py visualize_file --format graphdict "graph.json"
Analysis Graph
The analysis info command calculates specific metrics for the entire graph; a NetworkX sketch reproducing several of them follows at the end of this section. These metrics include the following:
- Graph Type:
- SCC:
- WCC:
- Reciprocity :
- Graph Nodes:
- Graph Edges:
- Graph Average Degree :
- Graph Density :
- Graph Transitivity :
- Graph max path length :
- Graph Average Clustering Coefficient :
- Graph Degree Assortativity Coefficient :
python .\triplea\cli\aaa.py analysis -g gen-all -c info
output:

Creates a graph with all possible nodes and edges and calculates and lists the sorted degree centrality for each node.
python .\triplea\cli\aaa.py analysis -g gen-all -c sdc
output:

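Several of the graph-level metrics listed above can be recomputed directly with NetworkX on an exported graph. A minimal sketch, assuming a GraphML export named test.graphml; this is an illustration, not the tool's internal implementation:

```python
# Sketch: recompute some of the metrics reported by `analysis -c info`.
import networkx as nx

g = nx.read_graphml("test.graphml")
ug = g.to_undirected()  # clustering/transitivity computed on the undirected view

print("Graph Nodes:", g.number_of_nodes())
print("Graph Edges:", g.number_of_edges())
print("Graph Average Degree:", sum(d for _, d in g.degree()) / g.number_of_nodes())
print("Graph Density:", nx.density(g))
print("Graph Transitivity:", nx.transitivity(ug))
print("Graph Average Clustering Coefficient:", nx.average_clustering(ug))
print("Graph Degree Assortativity Coefficient:", nx.degree_assortativity_coefficient(g))
if g.is_directed():
    print("SCC:", nx.number_strongly_connected_components(g))
    print("WCC:", nx.number_weakly_connected_components(g))
    print("Reciprocity:", nx.reciprocity(g))
```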
Work with Article Repository
Article Repository (Arepo) is a database that stores information about articles and graphs. Different databases can be used; we have used the following databases here:
TinyDB - TinyDB is a lightweight document oriented database
MongoDB - MongoDB is a source-available cross-platform document-oriented database program
To get general information about the articles, nodes and edges in the database, use the following command.
shell
python .\triplea\cli\aaa.py arepo -c info
output:
shell
Number of article in article repository is 122
0 Node(s) in article repository.
0 Edge(s) in article repository.
122 article(s) in state 3.
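If you are using the TinyDB backend, the underlying store is a plain JSON file that you can also open directly. A minimal sketch (the internal document layout is not guaranteed and may change between versions):

```python
# Sketch: open the TinyDB article store directly; the file name comes from
# AAA_TINYDB_FILENAME (articledata.json in the sample .env).
from tinydb import TinyDB

db = TinyDB("articledata.json")
print(len(db), "documents in the default table")
```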
Get article data by PMID
sh
python .\triplea\cli\aaa.py arepo -pmid 31398071
output:
Title : Association between MRI background parenchymal enhancement and lymphovascular invasion and estrogen receptor status in invasive breast cancer.
Journal : The British journal of radiology
DOI : 10.1259/bjr.20190417
PMID : 31398071
PMC : PMC6849688
State : 3
Authors : Jun Li, Yin Mo, Bo He, Qian Gao, Chunyan Luo, Chao Peng, Wei Zhao, Yun Ma, Ying Yang,
Keywords: Adult, Aged, Breast Neoplasms, Female, Humans, Lymphatic Metastasis, Magnetic Resonance Imaging, Menopause, Middle Aged, Neoplasm Invasiveness, Receptors, Estrogen, Retrospective Studies, Young Adult,
Get article data by PMID and save to article.json file.
sh
python .\triplea\cli\aaa.py arepo -pmid 31398071 -o article.json
Another command for this:
sh
python .\triplea\cli\aaa.py export_article --idtype pmid --id 31398071 --format json --output "article.json"
Configuration
For detailed information:
shell
python .\triplea\cli\aaa.py config --help
Get environment variable:
shell
python .\triplea\cli\aaa.py config -c info
Set new environment variable:
shell
python .\triplea\cli\aaa.py config -c update
Below is a summary of important environment variables in this project:
|Environment Variable |Description|Default Value|
|--------------------------|-----------|-------------|
|TRIPLEA_DB_TYPE |The type of database to be used in the project. The database layer is separate and you can use different databases; currently MongoDB and TinyDB are supported. TinyDB can be used for small scopes and MongoDB for large scopes|TinyDB|
|AAA_TINYDB_FILENAME |File name of TinyDB|articledata.json|
|AAA_MONGODB_CONNECTION_URL|Standard Connection String Format for MongoDB|mongodb://user:pass@127.0.0.1:27017/|
|AAA_MONGODB_DB_NAME |Name of MongoDB Collection|articledata|
|AAA_TPS_LIMIT |Transactions Per Second Limitation|1|
|AAA_PROXY_HTTP |An HTTP proxy is a server that acts as an intermediary between a client and the PubMed server. When a client sends a request to a server through an HTTP proxy, the proxy intercepts the request and forwards it to the server on behalf of the client. Similarly, when the server responds, the proxy intercepts the response and forwards it back to the client.||
|AAA_PROXY_HTTPS |HTTPS Proxy||
|AAA_CLIENT_AGENT |||
|AAA_REFF_CRAWLER_DEEP ||1|
|AAA_CITED_CRAWLER_DEEP ||1|
|AAA_CLI_ALERT_POINT||500|
|AAA_TOPIC_EXTRACT_ENDPOINT|||
|AAA_SCIGENIUS_ENDPOINT|||
|AAA_LLM_TEMPLATE_FILE|||
|AAA_FULLTEXT_REPO_TYPE|||
|AAA_FULLTEXT_DIRECTORY|||
|AAA_FULLTEXT_STRING_REPO_TYPE|||
|AAA_FULLTEXT_STRING_DIRECTORY|||
Testing
To ensure the functionality and reliability of the application, you can run tests using pytest. Follow the steps below to execute the tests:
Running All Tests
To run all tests in the project, use the following command:
sh
poetry run pytest
This command will discover and execute all test files and functions within your project directory.
Running Tests in a Specific Directory
If you want to run tests that are specifically located in a designated directory (e.g., the tests/ directory), you can specify that directory as follows:
sh
poetry run pytest tests/
This command will only execute the tests found within the specified tests/ directory.
Running Tests with Coverage
To measure test coverage, you can use the --cov option. This will report which parts of your code are covered by tests:
sh
poetry run pytest --cov
This command provides a summary of code coverage in the terminal, allowing you to identify untested areas of your code.
Additional Coverage Reports
If you would like to generate a more detailed coverage report in HTML format, you can add the following command after running the tests:
sh
poetry run pytest --cov --cov-report html
This will create a directory named htmlcov containing an HTML report, which you can open in your web browser to visually inspect coverage details.
Dependencies
The project relies on various libraries for different functionalities. Below is a categorized list of dependencies required for the project:
Graph Analysis
- networkx: A library for creating, manipulating, and studying the structure and dynamics of complex networks.
Natural Language Processing (NLP)
- PyTextRank: A library for keyword extraction and summarization using graph-based ranking algorithms.
- transformers: A state-of-the-art library for natural language processing tasks, providing pre-trained models for various NLP applications.
- spaCy: An advanced NLP library designed for production use, offering efficient and easy-to-use tools for text processing.
Data Storage
- TinyDB: A lightweight document-oriented database that stores data in JSON format, suitable for small projects.
- py2neo: A client library for working with Neo4j graph databases, allowing for easy manipulation of graph data.
- pymongo: The official Python driver for MongoDB, providing a way to interact with MongoDB databases.
Visualization of Networks
- netwulf: A library for visualizing networks directly in the browser, designed for interactive exploration of network data.
- Alchemy.js: A JavaScript library for visualizing networks with an emphasis on aesthetics and interaction.
- InteractiveGraph: A framework for creating interactive graph visualizations, enabling users to explore graph data dynamically.
Command-Line Interface (CLI)
- click: A Python package for creating command-line interfaces with a focus on ease of use and flexibility.
Packaging and Dependency Management
- Poetry: A dependency management and packaging tool that simplifies the management of Python projects and their dependencies.
Use Case
This tool allows you to create datasets in various formats. Below are examples of how to use the tool for creating a dataset related to breast cancer research.
Breast Cancer Dataset
PubMed Query
To gather relevant articles, use the following PubMed query:
"breast neoplasms"[MeSH Terms] OR ("breast"[All Fields] AND "neoplasms"[All Fields]) OR "breast neoplasms"[All Fields] OR ("breast"[All Fields] AND "cancer"[All Fields]) OR "breast cancer"[All Fields]
This query returns 495,012 results.
Configuration
Before running the tool, ensure your configuration settings are properly defined in your environment variables:
plaintext
AAA_MONGODB_DB_NAME = bcarticledata
AAA_REFF_CRAWLER_DEEP = 0
AAA_CITED_CRAWLER_DEEP = 0
Note: The EDirect tool is used for fetching articles from PubMed.
Search Command
You can initiate the search using the following command:
bash
python .\triplea\cli\aaa.py search --searchterm r'"breast neoplasms"[MeSH Terms] OR ("breast"[All Fields] AND "neoplasms"[All Fields]) OR "breast neoplasms"[All Fields] OR ("breast"[All Fields] AND "cancer"[All Fields]) OR "breast cancer"[All Fields]'
If the --searchterm argument is too complex, you can run the search without it:
bash
python .\triplea\cli\aaa.py search
Filtering Results
You can filter the search results based on publication date using the following filter criteria:
json
{
"mindate": "2022/01/01",
"maxdate": "2022/12/30"
}
Retrieving Downloaded Article Information
To get an overview of all downloaded articles, run:
bash
python .\triplea\cli\aaa.py arepo -c info
The output will provide details like this:
plaintext
Number of articles in article repository: 30,914
0 Node(s) in article repository.
0 Edge(s) in article repository.
30,914 article(s) in state 0.
Advancing Article States
To move the articles through different processing states, execute the following commands:
Run the core pipeline to advance from state 0 to state 1:
bash
python .\triplea\cli\aaa.py next --state 0
Parse articles from state 1 to state 2:
bash
python .\triplea\cli\aaa.py next --state 1
Custom Pipeline for Extracting Triples
To extract triples from the articles using a custom pipeline, run:
bash
python .\triplea\cli\aaa.py pipeline --name FlagExtractKG
Bio-Bank Dataset
PubMed Query
To gather articles related to biological specimen banks, use the following PubMed query:
"Biological Specimen Banks"[Mesh] OR BioBanking OR biobank OR dataBank OR "Bio Banking" OR "bio bank"
This query returns a total of 39,023 results.
Search Command
You can initiate the search using the following command:
bash
python .\triplea\cli\aaa.py search --searchterm "\"Biological Specimen Banks\"[Mesh] OR BioBanking OR biobank OR dataBank OR \"Bio Banking\" OR \"bio bank\""
Handling Query Limitations
When querying PubMed, if the number of results exceeds 10,000, you may encounter an error similar to this:
"ERROR":"Search Backend failed: Exception:\n\'retstart\' cannot be larger than 9998. For PubMed, ESearch can only retrieve the first 9,999 records matching the query. To obtain more than 9,999 PubMed records, consider using EDirect, which contains additional logic to batch PubMed search results automatically."
PubMed's ESearch can only retrieve the first 10,000 records. To gather more than 10,000 UIDs, consider submitting multiple ESearch requests while incrementing the retstart value. For detailed instructions, refer to the EDirect documentation.
This limitation is hardcoded in the get_article_list_from_pubmed method in PARAMS.
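For reference, here is a minimal sketch (independent of TripleA's own implementation) of batching ESearch requests by incrementing retstart, staying under the ceiling mentioned above:

```python
# Sketch: page through PubMed ESearch results by incrementing retstart.
import json
import time
from urllib.parse import urlencode
from urllib.request import urlopen

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
term = '"Biological Specimen Banks"[Mesh] OR BioBanking OR biobank'
retmax = 500
pmids = []

for retstart in range(0, 9999, retmax):
    params = urlencode({"db": "pubmed", "term": term, "retmode": "json",
                        "retstart": retstart, "retmax": retmax})
    with urlopen(f"{BASE}?{params}") as resp:
        result = json.load(resp)["esearchresult"]
    batch = result.get("idlist", [])
    pmids.extend(batch)
    if not batch or retstart + retmax >= int(result["count"]):
        break
    time.sleep(0.4)  # stay well under NCBI's request-rate limits

print(len(pmids), "PMIDs collected")
```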
Additional Query
A more recent query was added to refine the search:
"bio-banking"[Title/Abstract] OR "bio-bank"[Title/Abstract] OR "data-bank"[Title/Abstract]
This query returns an additional 9,012 results.
You can run this query using the following command:
bash
python .\triplea\cli\aaa.py search --searchterm "\"bio-banking\"[Title/Abstract] OR \"bio-bank\"[Title/Abstract] OR \"data-bank\"[Title/Abstract]"
Retrieve Article Information
After running the above search, you can check the number of articles in the repository with:
Number of articles in article repository: 47,735
Exporting Data
To export the dataset in graphml format, execute the following command:
bash
python .\triplea\cli\aaa.py export_graph -g article-reference -g article-keyword -f graphml -o .\triplea\datasets\biobank.graphml
Registry of Breast Cancer Dataset
Keyword Checking
To ensure comprehensive coverage of breast cancer research, the following keywords were verified:
"Breast Neoplasms"[Mesh]
"Breast Cancer"[Title]
"Breast Neoplasms"[Title]
"Breast Neoplasms"[Other Term]
"Breast Cancer"[Other Term]
"Registries"[Mesh]
"Database Management Systems"[Mesh]
"Information Systems"[MeSH Major Topic]
"Registries"[Other Term]
"Information Storage and Retrieval"[MeSH Major Topic]
"Registry"[Title]
"National Program of Cancer Registries"[Mesh]
"Registries"[MeSH Major Topic]
"Information Science"[Mesh]
"Data Management"[Mesh]
Final PubMed Query
Based on the above keywords, the final PubMed query is constructed as follows:
plaintext
("Breast Neoplasms"[Mesh] OR "Breast Cancer"[Title] OR "Breast Neoplasms"[Title] OR "Breast Neoplasms"[Other Term] OR "Breast Cancer"[Other Term]) AND ("Registries"[MeSH Major Topic] OR "Database Management Systems"[MeSH Major Topic] OR "Information Systems"[MeSH Major Topic] OR "Registry"[Other Term] OR "Registry"[Title] OR "Information Storage and Retrieval"[MeSH Major Topic])
Query URL
You can execute this query directly using the following URL:
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=("Breast+Neoplasms"[Mesh]+OR+"Breast+Cancer"[Title]+OR+"Breast+Neoplasms"[Title]+OR+"Breast+Neoplasms"[Other+Term]+OR+"Breast+Cancer"[Other+Term])+AND+("Registries"[MeSH+Major+Topic]+OR+"Database+Management+Systems"[MeSH+Major+Topic]+OR+"Information+Systems"[MeSH+Major+Topic]+OR+"Registry"[Other+Term]+OR+"Registry"[Title]+OR+"Information+Storage+and+Retrieval"[MeSH+Major+Topic])&retmode=json&retstart=1&retmax=10
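If you only want to confirm the result count for this query before downloading anything, a minimal sketch using the same ESearch endpoint (retmax=0 returns just the count) could look like this:

```python
# Sketch: fetch only the result count for the final registry query.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

term = ('("Breast Neoplasms"[Mesh] OR "Breast Cancer"[Title] OR "Breast Neoplasms"[Title] '
        'OR "Breast Neoplasms"[Other Term] OR "Breast Cancer"[Other Term]) AND '
        '("Registries"[MeSH Major Topic] OR "Database Management Systems"[MeSH Major Topic] '
        'OR "Information Systems"[MeSH Major Topic] OR "Registry"[Other Term] '
        'OR "Registry"[Title] OR "Information Storage and Retrieval"[MeSH Major Topic])')
url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?"
       + urlencode({"db": "pubmed", "term": term, "retmode": "json", "retmax": 0}))
with urlopen(url) as resp:
    print("Result count:", json.load(resp)["esearchresult"]["count"])
```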
Downloading Dataset
You can download the results of this network, which include the relationships between articles and keywords, in graphdict format from the following link:
- Download graphdict format
If you prefer to work with the graph in graphml format, you can download it here:
- Download graphml format
Public Dataset
This section provides access to several datasets produced using this program. These datasets have been structured in a simpler format compared to the program's internal database, enhancing usability for researchers and practitioners. You can utilize the export_engine function to obtain outputs tailored to your preferred structure. For a simple example of how to use this function, please refer to the sample export engine script.
Topic Extraction Dataset - Related to Breast Cancer Therapy
This dataset comprises a total of 9,691 articles from the medical domain, specifically focused on breast cancer therapy. Topic extraction was performed using two distinct methodologies: TextRank and LLM (Large Language Models). These approaches leveraged the keywords found within the articles to generate the dataset for analysis. The dataset includes various fields, such as:
- Article title
- Publication year
- PMID (PubMed Identifier)
- Keyword listings
- Topics derived through the TextRank algorithm
- Topics identified through LLM analysis
License:
MIT
DOI: 10.6084/m9.figshare.25533532.v1
Coronary Artery Disease Clinical Trial Articles
This collection consists of articles related to clinical trials on coronary artery disease, featuring the following information for each article:
- Year of publication
- Title
- Abstract
- PMID (PubMed Identifier)
These articles were extracted from the PubMed database using a specific search strategy designed to capture relevant clinical trial information.
License:
CC BY 4.0
DOI: 10.6084/m9.figshare.26116768.v2
MIE Articles Dataset
The MIE Articles Dataset contains 4,606 articles presented at the Medical Informatics Europe Conference (MIE) from 1996 to 2024. This data was extracted from PubMed, and topic extraction as well as affiliation parsing were conducted on the dataset.
License:
CC BY 4.0
DOI: 10.6084/m9.figshare.27174759.v1
Graph Visualization
Various tools have been developed to visualize graphs. We have done a brief review and selected a few tools to use in this program.
Graph Analysis
In this project, we used one of the most powerful libraries for graph analysis. Using NetworkX, we generated many indicators to check a citation graph. Some materials in this regard are given here. You can use other libraries as well.
Knowledge Extraction
In the architecture of this software, the structure of the article is stored in the database and this structure also contains the summary of the article. For this reason, it is possible to perform NLP processes such as keywords extraction, topic extraction etc., which can be completed in the future.
Related Article
This topic is very interesting from a research point of view, so I have included the articles that were interesting here.
Code Quality
We used flake8 and black libraries to increase code quality. More information can be found here.
Citation
If you use Triple A for your scientific work, consider citing us! We're published in IEEE.
bibtex
@INPROCEEDINGS{10139229,
author={Jafarpour, Maryam and Bitaraf, Ehsan and Moeini, Ali and Nahvijou, Azin},
booktitle={2023 9th International Conference on Web Research (ICWR)},
title={Triple A (AAA): a Tool to Analyze Scientific Literature Metadata with Complex Network Parameters},
year={2023},
volume={},
number={},
pages={342-345},
doi={10.1109/ICWR57742.2023.10139229}}
License
TripleA is available under the Apache License.
Owner
- Name: Ehsan Bitaraf
- Login: EhsanBitaraf
- Kind: user
- Website: linkedin.com/in/ehsan-bitaraf-34aa28247
- Repositories: 2
- Profile: https://github.com/EhsanBitaraf
Any fool can write code that a computer can understand. Good programmers write code that humans can understand.
Citation (CITATION.cff)
cff-version: 1.2.0
preferred-citation:
title: >-
Triple A (AAA): a Tool to Analyze Scientific Literature
Metadata with Complex Network Parameters
message: >-
If you use this software, please cite it using the
metadata from this file.
type: conference-paper
authors:
- given-names: Maryam
family-names: Jafarpour
affiliation: >-
Department of Algorithms and Computation, School of
Engineering Science, College of Engineering,
University of Tehran, Tehran, Iran
- given-names: Ehsan
family-names: Bitaraf
affiliation: >-
Center for Statistics and Information Technology, Iran
University of Medical Sciences, Tehran, Iran
orcid: 'https://orcid.org/0000-0002-6588-7349'
- given-names: Ali
family-names: Moeini
orcid: 'https://orcid.org/0000-0002-6408-3525'
affiliation: >-
Department of Algorithms and Computation, School of
Engineering Science, College of Engineering,
University of Tehran, Tehran, Iran
- given-names: Azin
family-names: Nahvijou
affiliation: >-
Cancer Research Centre, Cancer Institute, Tehran
University of Medical Sciences, Tehran, Iran
year: 2023
collection-title: "2023 9th International Conference on Web Research (ICWR)"
collection-doi: 10.1109/ICWR57742.2023
start: 342
end: 345
identifiers:
- type: doi
value: 10.1109/ICWR57742.2023.10139229
- type: url
value: 'https://ieeexplore.ieee.org/document/10139229'
repository-code: 'https://github.com/EhsanBitaraf/triple-a'
abstract: >-
It is essential to analyze scientific literature when
conducting review studies (systematic, narrative, etc.).
Review articles can improve in quality by choosing or
incorporating papers with high research impact. The
quality of research has been measured using a variety of
indicators. These metrics primarily address certain
characteristics like the citation index. It is impossible
to study the caliber of research in any field on an
individual basis. It has to do with connections.
Therefore, it would be advantageous to create a network of
research items. In this study, we introduce a novel tool
for the analysis of metadata in scientific literature. We
tested our technique on the literature of breast cancer.
The tool extracted 49,604 papers resulting in 575,894
nodes and 1,532,328 edges. We looked at the topological and
structural characteristics of the constructed network,
briefly. However, this tool can be utilized in any other
domain of interest.
keywords:
- Bibliographies
- Complex Networks
- Citation Network
- Graph Analysis
license: Apache-2.0
GitHub Events
Total
- Watch event: 1
- Delete event: 5
- Issue comment event: 2
- Push event: 13
- Pull request event: 13
- Create event: 7
Last Year
- Watch event: 1
- Delete event: 5
- Issue comment event: 2
- Push event: 13
- Pull request event: 13
- Create event: 7
Issues and Pull Requests
Last synced: 7 months ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: 24 minutes
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 1
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: 24 minutes
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 1
Top Authors
Issue Authors
- shoolsina (1)
Pull Request Authors
- dependabot[bot] (7)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: pypi 22 last-month
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 3
- Total maintainers: 1
pypi.org: triplea
Article Analysis Assistant
- Homepage: https://github.com/EhsanBitaraf/triple-a
- Documentation: https://triplea.readthedocs.io/
- License: Apache-2.0
- Latest release: 0.0.5 (published about 2 years ago)
Rankings
Maintainers (1)
Dependencies
- actions/checkout v3 composite
- docker/build-push-action v4 composite
- docker/login-action v2 composite
- docker/setup-buildx-action v2 composite
- actions/checkout v3 composite
- actions/setup-python v3 composite
- actions/checkout v3 composite
- actions/setup-python v3 composite
- actions/cache v3 composite
- actions/checkout v3 composite
- actions/setup-python v4 composite
- snok/install-poetry v1 composite
- python 3-slim build
- python latest build
- 127 dependencies
- click ^8.1.3
- flake8 ^6.0.0
- ipykernel ^6.21.2
- ipywidgets ^8.0.4
- networkx ^3.0
- netwulf ^0.1.5
- nxviz ^0.7.4
- pandas ^1.5.3
- py2neo ^2021.2.3
- pydantic ^1.10.4
- pymongo ^4.3.3
- pytest ^7.2.1
- pytest-mock ^3.11.1
- pytextrank ^3.2.4
- python ^3.10
- python-dotenv ^0.21.1
- spacy ^3.5.0
- tinydb ^4.7.1
- transformers ^4.30.0
- xmltodict ^0.13.0
- actions/checkout v2 composite
- actions/setup-python v2 composite
- actions/checkout v2 composite
- actions/setup-python v2 composite
- actions/upload-artifact v2 composite
- snok/install-poetry v1 composite