id_extractor
ID_Extractor (ID_Ex) for extracting IDs and references
Repository
Basic Info
- Host: GitHub
- Owner: pBxr
- License: apache-2.0
- Language: Python
- Default Branch: main
- Size: 41 KB
Statistics
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 0
- Releases: 2
Metadata Files
README.md
ID_Extractor (ID_Ex) for extracting IDs and references from JATS article files
Introductory remarks
Several scientific journals edited by the German Archaeological Institute publish their articles as JATS XML, which is displayed in an instance of eLife Lens 2.0.0 (for example Archäologischer Anzeiger, see: https://publications.dainst.org/journals/aa).
The articles are enhanced with bibliographic and geographic authority data as well as other references to specific information resources of the institute's information infrastructure.
Approach
ID_Ex browses the .xml files stored in the article repository folder and extracts the pre-defined references. The results are stored in separate sqlite3 tables that relate each extracted record to the DOI of the article, for example from
- bibliographic records (zenon-IDs, see https://zenon.dainst.org/),
- geographic authority data (gazetteer-IDs, see https://gazetteer.dainst.org/),
- or records of other entities like objects (iDAI.objects-IDs, see https://arachne.dainst.org/) or records from archaeological fieldwork documentation systems (iDAI.field-IDs, see https://field.idai.world/).
ID_Ex is written in Python 3.12.0 and uses bs4 (the BeautifulSoup library), so it can easily be adapted to your own purposes.
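The extraction step described above can be sketched roughly as follows. This is only an illustration, not the code used by ID_Ex: it uses the standard library's xml.etree.ElementTree instead of bs4 so that it runs without extra dependencies, and the host-to-type mapping, the assumption that the record ID is the last path segment of the link, the function name extract_references and the sample article are all hypothetical.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Qualified attribute name of xlink:href as ElementTree reports it.
XLINK_HREF = "{http://www.w3.org/1999/xlink}href"

# Assumed mapping from host name to reference type; the real URL patterns may differ.
HOSTS = {
    "zenon.dainst.org": "zenon",
    "gazetteer.dainst.org": "gazetteer",
    "arachne.dainst.org": "objects",
    "field.idai.world": "field",
}


def extract_references(xml_text):
    """Return (doi, {reference type: [record IDs]}) for one JATS article."""
    root = ET.fromstring(xml_text)

    # The DOI is expected in <article-id pub-id-type="doi">.
    doi = None
    for node in root.iter("article-id"):
        if node.get("pub-id-type") == "doi" and node.text:
            doi = node.text.strip()
            break

    refs = {kind: [] for kind in HOSTS.values()}
    for link in root.iter("ext-link"):
        parsed = urlparse(link.get(XLINK_HREF, ""))
        kind = HOSTS.get(parsed.hostname or "")
        if kind is None:
            continue  # link to a resource that is not tracked
        # Assumption: the record ID is the last segment of the URL path.
        record_id = parsed.path.rstrip("/").rsplit("/", 1)[-1]
        if record_id and record_id not in refs[kind]:
            refs[kind].append(record_id)
    return doi, refs


# Made-up sample article with placeholder DOI and record IDs.
SAMPLE = """<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front><article-meta>
    <article-id pub-id-type="doi">10.0000/example.1</article-id>
  </article-meta></front>
  <body><p>
    <ext-link ext-link-type="uri" xlink:href="https://gazetteer.dainst.org/place/1234567">A place</ext-link>
    <ext-link ext-link-type="uri" xlink:href="https://zenon.dainst.org/Record/000000001">A title</ext-link>
    <ext-link ext-link-type="uri" xlink:href="https://example.org/other">Other</ext-link>
  </p></body>
</article>"""

if __name__ == "__main__":
    print(extract_references(SAMPLE))
```

Classifying links by host name keeps the list of supported resources in one place, so a further resource could be added with a single dictionary entry.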
Mode of operation - and things to be done
When the tool is started for the first time, ID_Ex generates the required sqlite3 tables in a subfolder ("dbfolder") if they do not exist yet. In the initial version of ID_Ex you have to enter manually the path to the repository folder in which the .jats files are stored. ID_Ex extracts the data and saves them in the sqlite3 tables mentioned above.
To avoid duplicates, ID_Ex uses the DOI to check whether an article is already recorded and, if so, skips any further action for that article.
Additionally, ID_Ex generates a detailed .txt log file in a subfolder ("IDExLOG"), containing the file names and the IDs extracted from them.
With minor modifications ID_Ex can be run at regular intervals (using a cron job, for example) to keep the corpus up to date automatically.
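A minimal sketch of the table setup and the DOI-based duplicate check might look like this. The table and column names and the functions init_db and store_article are hypothetical, and the example connects to an in-memory database, whereas a real run would open a file in the database subfolder.

```python
import sqlite3

# Hypothetical table names, one per reference type.
TABLES = ("zenon", "gazetteer", "objects", "field")


def init_db(conn):
    """Create the article table and one reference table per type if missing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (doi TEXT PRIMARY KEY, file_name TEXT)"
    )
    for table in TABLES:
        # Table names come from the fixed tuple above, never from user input.
        conn.execute(
            f"CREATE TABLE IF NOT EXISTS {table} ("
            "doi TEXT NOT NULL REFERENCES articles(doi), record_id TEXT NOT NULL)"
        )
    conn.commit()


def store_article(conn, doi, file_name, refs):
    """Store one article and its references; return False if the DOI is known."""
    if conn.execute("SELECT 1 FROM articles WHERE doi = ?", (doi,)).fetchone():
        return False  # already recorded, skip all further actions
    with conn:  # commit everything for this article in one transaction
        conn.execute("INSERT INTO articles VALUES (?, ?)", (doi, file_name))
        for table, ids in refs.items():
            if table not in TABLES:
                continue
            conn.executemany(
                f"INSERT INTO {table} (doi, record_id) VALUES (?, ?)",
                [(doi, record_id) for record_id in ids],
            )
    return True


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    init_db(conn)
    refs = {"gazetteer": ["1234567"], "zenon": ["000000001"]}
    print(store_article(conn, "10.0000/example.1", "a.xml", refs))  # first run
    print(store_article(conn, "10.0000/example.1", "a.xml", refs))  # same DOI again
```

Because the check happens before any insert, running the tool repeatedly over the same folder leaves existing records untouched.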
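As an illustration of such a log, the following sketch writes one timestamped .txt file per run. The file naming scheme, the line layout and the function write_log are assumptions and not the actual format used by the tool.

```python
from datetime import datetime
from pathlib import Path


def write_log(log_dir, results):
    """Write one timestamped .txt log.

    results maps each file name to a dict {reference type: [record IDs]}.
    """
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)  # create the log subfolder if needed
    stamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_path = log_dir / f"log_{stamp}.txt"  # hypothetical naming scheme

    lines = []
    for file_name, refs in sorted(results.items()):
        lines.append(file_name)
        for kind, ids in sorted(refs.items()):
            # A dash marks reference types without any extracted ID.
            lines.append(f"  {kind}: {', '.join(ids) if ids else '-'}")
    log_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return log_path


if __name__ == "__main__":
    path = write_log("example_log", {"a.xml": {"gazetteer": ["1234567"], "zenon": []}})
    print(path.read_text(encoding="utf-8"))
```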
New in v1.2.0:
- Added a GUI to make the application more comfortable to use
- Improved log handling for multiple runs
- Tables merged into one single database ("IDExdatabase.db").
New in v1.1.0:
- A menu allows you to export the records of a selected table into a .txt file in the log subfolder, not only after the extraction process but also as a request to a previously generated database
- Improved handling of the parameters needed for sqlite3 operations, using a dict that contains all necessary information to minimize repetition
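The two v1.1.0 changes can be pictured together in a short sketch: a single dict holds the per-table parameters, and an export function reads from it to write a table to a .txt file. The dict layout, the column names and the function export_table are hypothetical and only illustrate the idea.

```python
import sqlite3
from pathlib import Path

# Hypothetical parameter dict: one entry per table, reused by every query.
TABLE_PARAMS = {
    "gazetteer": {"columns": ("doi", "record_id"), "order_by": "doi"},
    "zenon": {"columns": ("doi", "record_id"), "order_by": "doi"},
}


def export_table(conn, table, out_dir):
    """Write all records of one table to <out_dir>/<table>.txt, tab-separated."""
    # Unknown table names raise KeyError, so only known names reach the SQL string.
    params = TABLE_PARAMS[table]
    columns = ", ".join(params["columns"])
    rows = conn.execute(
        f"SELECT {columns} FROM {table} ORDER BY {params['order_by']}"
    ).fetchall()

    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{table}.txt"
    out_path.write_text(
        "".join("\t".join(str(value) for value in row) + "\n" for row in rows),
        encoding="utf-8",
    )
    return out_path


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE gazetteer (doi TEXT, record_id TEXT)")
    conn.execute("INSERT INTO gazetteer VALUES ('10.0000/example.1', '1234567')")
    print(export_table(conn, "gazetteer", "example_log").read_text(encoding="utf-8"))
```

Since the function only needs an open connection, it works the same way right after an extraction run or against a database generated earlier.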
To be done:
- Enable automatic scraping of scattered repositories containing .jats article files
- Add, step by step, features to export the records as .json files or in other formats
- Enable ID_Ex to handle more complex queries and requests
- Implement an autonomous mode to make ID_Ex usable within a cron job
- Improve the GUI, especially its exception handling
Technical remarks
- Python 3.12.0
- bs4 from BeautifulSoup
- sqlite3
- Tested on Windows (not yet on Linux)
See also
In this context, see the following repositories for preparing the .jats files of the journals mentioned above:
- TagToolWiZArD application (ttw), see https://github.com/pBxr/TagToolWiZArd
- Web Extension for TagToolWiZArD application (ttwwebx), see https://github.com/pBxr/ttw_WebExtension
Owner
- Name: Peter Baumeister
- Login: pBxr
- Kind: user
- Location: Berlin
- Repositories: 2
- Profile: https://github.com/pBxr
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Baumeister"
    given-names: "Peter"
    orcid: "https://orcid.org/0000-0001-5430-1456"
title: "ID_Extractor (ID_Ex) for extracting IDs and references from .jats article files"
version: 1.2.0
doi: 10.5281/zenodo.10138490
date-released: 2023-12-17
url: "https://github.com/pBxr/ID_Extractor"