publicationclassification

Java package for creating a multi-level classification of scientific publications based on citation links between publications.

https://github.com/cwtsleiden/publicationclassification

Science Score: 77.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 15 DOI reference(s) in README
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
    1 of 2 committers (50.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.5%) to scientific vocabulary

Keywords

citation-links clustering clustering-algorithm community-detection java leiden-algorithm multi-level-classification publication-classification scientific-publications
Last synced: 4 months ago

Repository

Java package for creating a multi-level classification of scientific publications based on citation links between publications.

Basic Info
  • Host: GitHub
  • Owner: CWTSLeiden
  • License: mit
  • Language: Java
  • Default Branch: main
  • Homepage:
  • Size: 277 KB
Statistics
  • Stars: 7
  • Watchers: 3
  • Forks: 1
  • Open Issues: 1
  • Releases: 2
Topics
citation-links clustering clustering-algorithm community-detection java leiden-algorithm multi-level-classification publication-classification scientific-publications
Created over 2 years ago · Last pushed almost 2 years ago
Metadata Files
Readme License Citation

README.md

publicationclassification


Introduction

This Java package can be used to create a multi-level classification of scientific publications based on citation links between publications.

The package uses the direct citation approach introduced by Waltman and Van Eck (2012) combined with the Leiden algorithm introduced by Traag et al. (2019). The package also supports the extended direct citation approach introduced by Waltman et al. (2020).

The publicationclassification package was developed by Nees Jan van Eck at the Centre for Science and Technology Studies (CWTS) at Leiden University. It relies on the networkanalysis package that was developed by Nees Jan van Eck, Vincent Traag, and Ludo Waltman.

Documentation

Documentation of the source code of publicationclassification is provided in the code in javadoc format. The documentation is also available in a compiled format.

Installation

Maven

<dependency>
    <groupId>nl.cwts</groupId>
    <artifactId>publicationclassification</artifactId>
    <version>1.1.0</version>
</dependency>

Gradle

implementation group: 'nl.cwts', name: 'publicationclassification', version: '1.1.0'

Usage

The publicationclassification package requires Java 8 or higher. The latest version of the package is available as a pre-compiled jar file on Maven Central and GitHub Packages. Instructions for compiling the source code of the package are provided below.

Use the command-line tool PublicationClassificationCreator to create a publication classification. The tool can be run as follows:

java -cp publicationclassification-1.1.0.jar nl.cwts.publicationclassification.PublicationClassificationCreator

If no further arguments are provided, the following usage notice will be displayed:

```
PublicationClassificationCreator version 1.1.0
By Nees Jan van Eck
Centre for Science and Technology Studies (CWTS), Leiden University

Usage: PublicationClassificationCreator
    (to create a publication classification based on data in text files)

or PublicationClassificationCreator
    (to create a publication classification based on data in an SQL Server database)

Arguments:

Name of the publications input file. This text file must contain two tab-separated columns (without a header line), first a column of publication numbers and then a column of core publication indicators (1 for core publications and 0 for non-core publications). Publication numbers must be integers starting at zero. Non-core publications are auxiliary publications that can be included to improve the clustering of core publications. The lines in the file must be sorted by the publication numbers in the first column.

Name of the citation links input file. This text file must contain three tab-separated columns (without a header line), first two columns of publication numbers and then a column of weights. Each citation link must be included only once in the file. The lines in the file must be sorted first by the publication numbers in the first column and then by the publication numbers in the second column.

Name of the classification output file. This text file will contain four tab-separated columns (without a header line), first a column of publication numbers and then three columns of cluster numbers at the micro, meso, and macro level. Cluster numbers are integers starting at zero.

SQL Server server name. A connection will be made using integrated authentication.

Database name.

Name of the publications input table. This table must have two columns: pubno and corepub. Publication numbers must be integers starting at zero. Non-core publications (corepub = 0) are auxiliary publications that can be included to improve the clustering of core publications (corepub = 1).

Name of the citation links input table. This table must have three columns: pubno1, pubno2, and citweight. Each citation link must be included only once in the table.

<classificationtable> Name of the classification output table. This table will have four columns: pubno, microclusterno, mesoclusterno, and macroclusterno. Cluster numbers are integers starting at zero.

<largestcomponent> Boolean indicating whether the publication classification should include only publications belonging to the largest connected component of the citation network ('true') or all publications ('false').

Number of iterations of the Leiden algorithm (e.g., 50).

Value of the resolution parameter at the micro level.

Minimum number of publications per cluster at the micro level (excluding non-core publications).

Value of the resolution parameter at the meso level.

Minimum number of publications per cluster at the meso level (excluding non-core publications).

Value of the resolution parameter at the macro level.

Minimum number of publications per cluster at the macro level (excluding non-core publications).
```
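The format rules for the two text input files can be checked before running the tool. The following is a minimal sketch of such a check, not code from the publicationclassification package: the class `InputFileCheck` and its method names are hypothetical. It checks only the column count, the value types, and the sort order described in the usage notice. It does not check whether a citation link also appears in the reverse direction.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the publicationclassification package.
class InputFileCheck {

    private static final String DIGITS = "\\d{1,18}";

    // Returns null if the publications lines match the documented format,
    // otherwise a description of the first problem found.
    static String firstPublicationsProblem(List<String> lines) {
        long previous = -1;
        for (int i = 0; i < lines.size(); i++) {
            String[] columns = lines.get(i).split("\t", -1);
            if (columns.length != 2) {
                return "line " + (i + 1) + ": expected 2 tab-separated columns";
            }
            if (!columns[0].matches(DIGITS)) {
                return "line " + (i + 1) + ": publication number must be a non-negative integer";
            }
            long pubNo = Long.parseLong(columns[0]);
            if (pubNo <= previous) {
                return "line " + (i + 1) + ": publication numbers must be sorted in ascending order";
            }
            if (!columns[1].equals("0") && !columns[1].equals("1")) {
                return "line " + (i + 1) + ": core publication indicator must be 0 or 1";
            }
            previous = pubNo;
        }
        return null;
    }

    // Same idea for citation links: three columns, sorted by the first and then
    // the second publication number, with no link listed twice.
    static String firstCitationLinksProblem(List<String> lines) {
        long previousFirst = -1;
        long previousSecond = -1;
        for (int i = 0; i < lines.size(); i++) {
            String[] columns = lines.get(i).split("\t", -1);
            if (columns.length != 3) {
                return "line " + (i + 1) + ": expected 3 tab-separated columns";
            }
            if (!columns[0].matches(DIGITS) || !columns[1].matches(DIGITS)) {
                return "line " + (i + 1) + ": publication numbers must be non-negative integers";
            }
            try {
                Double.parseDouble(columns[2]);
            } catch (NumberFormatException e) {
                return "line " + (i + 1) + ": weight must be a number";
            }
            long first = Long.parseLong(columns[0]);
            long second = Long.parseLong(columns[1]);
            boolean inOrder = first > previousFirst
                    || (first == previousFirst && second > previousSecond);
            if (!inOrder) {
                return "line " + (i + 1) + ": links must be sorted and must not repeat";
            }
            previousFirst = first;
            previousSecond = second;
        }
        return null;
    }

    public static void main(String[] args) {
        String pubs = firstPublicationsProblem(Arrays.asList("0\t0", "1\t1", "2\t0"));
        String links = firstCitationLinksProblem(Arrays.asList("1\t1516\t0.5", "1\t1516\t0.5"));
        System.out.println("publications: " + (pubs == null ? "OK" : pubs));
        System.out.println("citation links: " + (links == null ? "OK" : links));
    }
}
```

For real files, the lines could be loaded with `java.nio.file.Files.readAllLines` and passed to these methods.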

Example

The following example illustrates the use of the PublicationClassificationCreator tool. Suppose you have a text file pubs.txt:

0	0
1	0
2	0
3	0
4	0
5	0
6	0
7	0
8	0
9	0
10	0
…

You also have a text file cit_links.txt:

1	1516	0.5
1	1988	1
1	25388	1
2	821	0.142857142857143
2	2504	0.0714285714285714
2	24459	0.5
2	24656	0.5
3	1841	0.2
3	2009	0.166666666666667
3	5337	0.0833333333333333
…

The PublicationClassificationCreator tool can then be run as follows:

java -cp publicationclassification-1.1.0.jar nl.cwts.publicationclassification.PublicationClassificationCreator pubs.txt cit_links.txt classification.txt true 100 4e-4 25 2e-4 250 7e-5 1000
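When the tool is launched from a script or another program, it can help to assemble the eleven positional arguments from named values so that their order is explicit. The sketch below only builds the command shown above. The class `CreatorCommand` and its parameter names are hypothetical, and the jar name and main class are copied from that command. Resolutions are passed as strings so that they reach the tool exactly as typed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of the publicationclassification package.
class CreatorCommand {

    // Builds the command line for the text-file mode. The arrays hold the
    // values for the micro, meso, and macro level, in that order.
    static List<String> build(String jar, String pubsFile, String citLinksFile,
            String classificationFile, boolean largestComponent, int nIterations,
            String[] resolutions, int[] thresholds) {
        if (resolutions.length != 3 || thresholds.length != 3) {
            throw new IllegalArgumentException("Expected one value per level (micro, meso, macro)");
        }
        List<String> command = new ArrayList<>(Arrays.asList(
                "java", "-cp", jar,
                "nl.cwts.publicationclassification.PublicationClassificationCreator",
                pubsFile, citLinksFile, classificationFile,
                String.valueOf(largestComponent), String.valueOf(nIterations)));
        for (int level = 0; level < 3; level++) {
            command.add(resolutions[level]);
            command.add(String.valueOf(thresholds[level]));
        }
        return command;
    }

    public static void main(String[] args) {
        List<String> command = build("publicationclassification-1.1.0.jar",
                "pubs.txt", "cit_links.txt", "classification.txt", true, 100,
                new String[] { "4e-4", "2e-4", "7e-5" }, new int[] { 25, 250, 1000 });
        // Prints the same command line as in the example above.
        System.out.println(String.join(" ", command));
    }
}
```

The resulting list can be handed to `java.lang.ProcessBuilder` to start the tool as a separate process.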

The publication classification created by the tool can be found in the text file classification.txt:

1	83	9	3
2	1	1	2
3	43	14	2
4	1	1	2
5	7	7	1
18	49	2	0
19	4	5	0
20	24	0	1
21	33	20	0
22	2	3	0
…
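The four-column output is easy to post-process. As a minimal sketch, assuming the tab-separated layout described in the usage notice, the following hypothetical class `ClassificationSummary` (not part of the package) counts the distinct clusters at each level in a set of output lines.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration; not part of the publicationclassification package.
class ClassificationSummary {

    // Returns the number of distinct clusters at the micro, meso, and macro
    // level (in that order) among the given lines of a classification file.
    static int[] countClusters(List<String> lines) {
        List<Set<Integer>> seen = new ArrayList<>();
        for (int level = 0; level < 3; level++) {
            seen.add(new HashSet<Integer>());
        }
        for (String line : lines) {
            String[] columns = line.split("\t", -1);
            if (columns.length != 4) {
                throw new IllegalArgumentException("Expected 4 tab-separated columns: " + line);
            }
            // Column 0 is the publication number; columns 1 to 3 are cluster numbers.
            for (int level = 0; level < 3; level++) {
                seen.get(level).add(Integer.parseInt(columns[level + 1]));
            }
        }
        return new int[] { seen.get(0).size(), seen.get(1).size(), seen.get(2).size() };
    }

    public static void main(String[] args) {
        // First five rows of the classification.txt excerpt shown above.
        List<String> lines = Arrays.asList(
                "1\t83\t9\t3", "2\t1\t1\t2", "3\t43\t14\t2", "4\t1\t1\t2", "5\t7\t7\t1");
        int[] counts = countClusters(lines);
        // prints "micro=4 meso=4 macro=3"
        System.out.println("micro=" + counts[0] + " meso=" + counts[1] + " macro=" + counts[2]);
    }
}
```

For a complete file, the lines could be read with `java.nio.file.Files.readAllLines`. The counts should then agree with the numbers of clusters reported by the tool.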

The tool displays the following output:

```
PublicationClassificationCreator version 1.1.0
By Nees Jan van Eck
Centre for Science and Technology Studies (CWTS), Leiden University

Reading citation network from file... Finished!
Reading citation network from file took 0h 0m 0s.
Citation network:
    Number of publications: 26800
    Number of citation links: 150613
    Total publication weight: 18643
    Total citation link weight: 18321

Identifying largest connected component in citation network... Finished!
Identifying largest connected component in citation network took 0h 0m 0s.
Largest connected component:
    Number of publications: 20988
    Number of citation links: 150387
    Total publication weight: 17206
    Total citation link weight: 18131

Creating publication classification...
Clustering algorithm: Leiden algorithm
Number of iterations: 100
Random seed: 0

Adding micro-level classification...
Creating clustering... Finished! 335 clusters created.
Reassigning small clusters... Finished! 98 clusters remaining.
Adding micro-level classification took 0h 0m 2s.
Micro-level classification:
    Resolution: 4.0E-4
    Threshold: 25
    Number of clusters: 98

Adding meso-level classification...
Creating clustering... Finished! 63 clusters created.
Reassigning small clusters... Finished! 25 clusters remaining.
Adding meso-level classification took 0h 0m 0s.
Meso-level classification:
    Resolution: 2.0E-4
    Threshold: 250
    Number of clusters: 25

Adding macro-level classification...
Creating clustering... Finished! 9 clusters created.
Reassigning small clusters... Finished! 4 clusters remaining.
Adding macro-level classification took 0h 0m 0s.
Macro-level classification:
    Resolution: 7.0E-5
    Threshold: 1000
    Number of clusters: 4

Writing publication classification to file... Finished!
Writing publication classification to file took 0h 0m 0s.
```

License

The publicationclassification package is distributed under the MIT license.

Issues

If you encounter any issues, please report them using the issue tracker on GitHub.

Contribution

You are welcome to contribute to the development of the publicationclassification package. Please follow the typical GitHub workflow: Fork from this repository and make a pull request to submit your changes. Make sure that your pull request has a clear description and that the code has been properly tested.

Development and deployment

The latest stable version of the source code is available in the main branch on GitHub. The most recent version of the source code, which may be under development, is available in the develop branch.

Compilation

To compile the source code of the publicationclassification package, a Java Development Kit needs to be installed on your system (version 8 or higher). Having Gradle installed is optional as the Gradle Wrapper is also included in this repository.

On Windows systems, the source code can be compiled as follows:

gradlew build

On Linux and macOS systems, use the following command:

./gradlew build

The compiled class files can be found in the directory build/classes. The compiled jar file can be found in the directory build/libs. The compiled javadoc files can be found in the directory build/docs.

The class nl.cwts.publicationclassification.run.PublicationClassificationCreator has a main method. After compiling the source code, the PublicationClassificationCreator tool can be run as follows:

java -cp build/libs/publicationclassification-<version>.jar nl.cwts.publicationclassification.run.PublicationClassificationCreator

References

Traag, V.A., Waltman, L., & Van Eck, N.J. (2019). From Louvain to Leiden: Guaranteeing well-connected communities. Scientific Reports, 9, 5233. https://doi.org/10.1038/s41598-019-41695-z

Waltman, L., Boyack, K.W., Colavizza, G., & Van Eck, N.J. (2020). A principled methodology for comparing relatedness measures for clustering publications. Quantitative Science Studies, 1(2), 691-713. https://doi.org/10.1162/qss_a_00035

Waltman, L., & Van Eck, N.J. (2012). A new methodology for constructing a publication-level classification system of science. Journal of the American Society for Information Science and Technology, 63(12), 2378-2392. https://doi.org/10.1002/asi.22748

Owner

  • Name: Centre for Science and Technology Studies
  • Login: CWTSLeiden
  • Kind: organization
  • Email: info@cwts.leidenuniv.nl
  • Location: Leiden, the Netherlands

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it using the metadata from this file."
type: "software"
authors:
  - family-names: "Van Eck"
    given-names: "Nees Jan"
    orcid: "https://orcid.org/0000-0001-8448-4521"
    email: "ecknjpvan@cwts.leidenuniv.nl"
    affiliation: "Centre for Science and Technology Studies (CWTS), Leiden University"
title: "publicationclassification"
abstract: "This Java package can be used to create a multi-level classification of scientific publications based on citation links between publications."
keywords:
  - publication classification
  - scientific publications
  - citation links
  - multi-level classification
  - clustering
  - community detection
  - clustering algorithm
  - Leiden algorithm
  - Java
url: "https://github.com/CWTSLeiden/publicationclassification#readme"
repository-code: "https://github.com/CWTSLeiden/publicationclassification"
repository-artifact: "https://central.sonatype.com/artifact/nl.cwts/publicationclassification/"
license: MIT
doi: 10.5281/zenodo.8263452
version: 1.1.0
date-released: 2023-08-21

GitHub Events

Total
  • Watch event: 2
  • Fork event: 1
Last Year
  • Watch event: 2
  • Fork event: 1

Committers

Last synced: almost 2 years ago

All Time
  • Total Commits: 16
  • Total Committers: 2
  • Avg Commits per committer: 8.0
  • Development Distribution Score (DDS): 0.5
Past Year
  • Commits: 16
  • Committers: 2
  • Avg Commits per committer: 8.0
  • Development Distribution Score (DDS): 0.5
Top Committers
Name Email Commits
Nees Jan van Eck 3****k 8
Nees Jan van Eck e****n@c****l 8
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: almost 2 years ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
  • ravwojdyla (1)
Top Labels
Issue Labels
Pull Request Labels

Dependencies

.github/workflows/build-main.yml actions
  • actions/checkout v3 composite
  • actions/setup-java v3 composite
  • actions/upload-artifact v3 composite
.github/workflows/publish-release.yml actions
  • actions/checkout v3 composite
  • actions/setup-java v3 composite
build.gradle maven