gnfinder

GNfinder finds scientific names in UTF8 texts, PDF files, MS Word/Excel documents, URLs etc.

https://github.com/gnames/gnfinder

Science Score: 54.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
    Links to: zenodo.org
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (15.9%) to scientific vocabulary

Keywords

biodiversity-heritage-library biodiversity-informatics bioinformatics
Last synced: 7 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: gnames
  • License: mit
  • Language: Go
  • Default Branch: master
  • Homepage:
  • Size: 205 MB
Statistics
  • Stars: 48
  • Watchers: 7
  • Forks: 5
  • Open Issues: 21
  • Releases: 44
Created almost 8 years ago · Last pushed 12 months ago

README.md

Global Names Finder (GNfinder)

Try GNfinder online or learn about its API.

Very fast finder of scientific names. It uses dictionary and NLP approaches. On a modern multiprocessor laptop it can process 15 million pages per hour. It works with many file formats and includes name verification against many biological databases. Full functionality requires an Internet connection.

GNfinder is also available via web or as a RESTful API.

Citing

Zenodo DOI can be used to cite GNfinder.

Features

  • Multiplatform app (supports Linux, Windows, Mac OS X).
  • Self-contained, with no external dependencies: only the gnfinder or gnfinder.exe binary (~15 MB) is needed. However, an Internet connection is required for name verification.
  • Includes a REST API and a web-based user interface.
  • Takes UTF8-encoded text and returns CSV, TSV or JSON-formatted output that contains the detected scientific names.
  • Extracts text from PDF, MS Word, MS Excel, HTML, XML, RTF, JPG, TIFF, GIF and other files for name detection.
  • Downloads a web page from a given URL for name detection.
  • Optionally detects the language of the text automatically and adjusts the Bayes algorithm for that language. English and German are currently supported.
  • Uses complementary heuristic and natural language processing algorithms.
  • Optionally verifies found names against multiple biodiversity databases using the gnindex service.
  • Detects nomenclatural annotations like sp. nov., comb. nov., ssp. nov., nom. nov. and their variants.
  • Can show the words that surround detected name-strings.
  • The library can be used concurrently to significantly improve speed. On a server with 40 threads it is able to detect names on 50 million pages in approximately 3 hours using both heuristic and Bayes algorithms. Check the bhlindex project for an example.
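
The concurrent use mentioned in the last feature could look roughly like the worker pool below. This is only a sketch of the general pattern: `findNames` is a hypothetical stand-in that does a plain substring check, and `processPages` is a name invented for this example, so nothing here is GNfinder's own API (see the bhlindex project for a real use).

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// findNames is a hypothetical stand-in for a real name finder. It only
// reports whether a page mentions "Parus major", so the sketch has no
// dependency on GNfinder itself.
func findNames(page string) []string {
	if strings.Contains(page, "Parus major") {
		return []string{"Parus major"}
	}
	return nil
}

// processPages fans pages out to a fixed number of workers and returns the
// results in the same order as the input.
func processPages(pages []string, workers int) [][]string {
	if workers < 1 {
		workers = 1
	}
	results := make([][]string, len(pages))
	jobs := make(chan int)
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range jobs {
				// Each index is written by exactly one worker, so no lock is needed.
				results[i] = findNames(pages[i])
			}
		}()
	}
	for i := range pages {
		jobs <- i
	}
	close(jobs)
	wg.Wait()
	return results
}

func main() {
	pages := []string{"Parus major sings.", "No names here.", "A Parus major nest."}
	for i, names := range processPages(pages, 2) {
		fmt.Printf("page %d: %d name(s)\n", i, len(names))
	}
}
```

Sending page indexes rather than page contents over the channel keeps the output order stable without any extra sorting step.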

Installation

Homebrew on Mac OS X, Linux, and Linux on Windows (WSL2)

Homebrew is a popular package manager for Open Source software originally developed for Mac OS X. Now it is also available on Linux, and can easily be used on MS Windows 10 or 11, if Windows Subsystem for Linux (WSL) is installed.

Note that Homebrew requires some other programs to be installed, like Curl, Git, a compiler (GCC compiler on Linux, Xcode on Mac). If it is too much, go to the Linux and Mac without Homebrew section.

  1. Install Homebrew according to their instructions.

  2. Install GNfinder with:

    ```bash
    brew tap gnames/gn
    brew install gnfinder

    # to upgrade
    brew upgrade gnfinder
    ```

Arch Linux AUR package

AUR package is located at https://aur.archlinux.org/packages/gnfinder. Install it by hand, or with AUR helpers like yay or pacaur.

```bash
yay -S gnfinder

# or
pacaur -S gnfinder
```

Manual Install

GNfinder consists of just one executable file, so it is pretty easy to install it by hand. To do that download the binary executable for your operating system from the latest release.

Linux and Mac without Homebrew

Move gnfinder executable somewhere in your PATH (for example /usr/local/bin)

```bash
sudo mv path_to/gnfinder /usr/local/bin
```

Windows without Homebrew and WSL

It is possible to use GNfinder natively on Windows, without Homebrew or Linux installed.

One possible way would be to create a default folder for executables and place gnfinder there.

Press the Windows+R key combination and type "cmd". In the terminal window that appears, type:

```cmd
mkdir C:\bin
copy path_to\gnfinder.exe C:\bin
```

Add C:\bin directory to your PATH environment variable.

Go

Install Go v1.19 or higher.

```bash
git clone git@github.com:/gnames/gnfinder
cd gnfinder
make tools
make install
```

Configuration

When you run gnfinder command for the first time, it will create a gnfinder.yml configuration file.

This file should be located in the following places:

MS Windows: C:\Users\AppData\Roaming\gnfinder.yml

Mac OS: $HOME/.config/gnfinder.yml

Linux: $HOME/.config/gnfinder.yml

This file lets you set options that modify the behaviour of GNfinder according to your needs. It spares you from entering the same command line flags again and again.

Command line flags will override the settings in the configuration file.

It is also possible to set up environment variables. They override the settings from both the configuration file and the command line flags.

| Settings              | Environment variables       |
|-----------------------|-----------------------------|
| BayesOddsThreshold    | GNF_BAYES_ODDS_THRESHOLD    |
| DataSources           | GNF_DATA_SOURCES            |
| Format                | GNF_FORMAT                  |
| InputTextOnly         | GNF_INPUT_TEXT_ONLY         |
| IncludeInputText      | GNF_INCLUDE_INPUT_TEXT      |
| Language              | GNF_LANGUAGE                |
| TikaURL               | GNF_TIKA_URL                |
| TokensAround          | GNF_TOKENS_AROUND           |
| VerifierURL           | GNF_VERIFIER_URL            |
| WithAllMatches        | GNF_WITH_ALL_MATCHES        |
| WithAmbiguousNames    | GNF_WITH_AMBIGUOUS_NAMES    |
| WithBayesOddsDetails  | GNF_WITH_BAYES_ODDS_DETAILS |
| WithOddsAdjustment    | GNF_WITH_ODDS_ADJUSTMENT    |
| WithPlainInput        | GNF_WITH_PLAIN_INPUT        |
| WithPositionInBytes   | GNF_WITH_POSITION_IN_BYTES  |
| WithUniqueNames       | GNF_WITH_UNIQUE_NAMES       |
| WithVerification      | GNF_WITH_VERIFICATION       |
| WithoutBayes          | GNF_WITHOUT_BAYES           |
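
The override order described in this section can be sketched as a small function. This is an illustration of the stated precedence only, not GNfinder's actual configuration code: `resolve` is a hypothetical helper, an empty string stands for "not set", and the format values are used purely as sample data.

```go
package main

import "fmt"

// resolve picks the effective value of one setting. Empty strings mean
// "not set". The order follows the text above: the configuration file is
// the base, a command line flag overrides it, and an environment variable
// overrides both.
func resolve(fromFile, fromFlag, fromEnv string) string {
	value := fromFile
	if fromFlag != "" {
		value = fromFlag
	}
	if fromEnv != "" {
		value = fromEnv
	}
	return value
}

func main() {
	fmt.Println(resolve("csv", "", ""))           // only the file sets a value
	fmt.Println(resolve("csv", "tsv", ""))        // the flag wins over the file
	fmt.Println(resolve("csv", "tsv", "compact")) // the environment wins over both
}
```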

Usage

Usage of a web-based application.

GNfinder can be found at https://finder.globalnames.org.

Usage of RESTful API

API is located at https://finder.globalnames.org/api/v1.

The best source for API usage is its documentation.

If you want to start your own API endpoint (for example on localhost, port 8080) use:

```bash
gnfinder -p 8080
curl localhost:8080/api/v1/ping
```

To upload a file and detect names from its content:

```bash
curl -v -F verification=true -F file=@/path/to/test.txt https://gnfinder.globalnames.org/api/v1/find
```
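
For readers who prefer Go over curl, the same multipart upload could be built roughly as follows. This is a sketch under assumptions: `newFindRequest` is a name made up for this example, the URL points at the local server started on port 8080 above, and the request is only constructed, not sent, so the example runs without a server.

```go
package main

import (
	"bytes"
	"fmt"
	"mime/multipart"
	"net/http"
)

// newFindRequest builds a multipart/form-data POST like the curl call above:
// a "verification" form field plus the file content in a "file" part.
func newFindRequest(url, fileName string, content []byte) (*http.Request, error) {
	var body bytes.Buffer
	w := multipart.NewWriter(&body)
	if err := w.WriteField("verification", "true"); err != nil {
		return nil, err
	}
	part, err := w.CreateFormFile("file", fileName)
	if err != nil {
		return nil, err
	}
	if _, err := part.Write(content); err != nil {
		return nil, err
	}
	// Close writes the terminating boundary; the body is incomplete without it.
	if err := w.Close(); err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, url, &body)
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", w.FormDataContentType())
	return req, nil
}

func main() {
	req, err := newFindRequest(
		"http://localhost:8080/api/v1/find",
		"test.txt",
		[]byte("Pomatomus saltator and Parus major"),
	)
	if err != nil {
		fmt.Println("cannot build request:", err)
		return
	}
	// The request is only built here; pass it to http.DefaultClient.Do to send it.
	fmt.Println(req.Method, req.URL.Path)
}
```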

Usage as a command line app

To see flags and usage:

```bash
gnfinder --help

# or just
gnfinder
```

To see the version of its binary:

```bash
gnfinder -V
```

Examples:

Starting as a web-application and an API server on port 8080

```bash
gnfinder -p 8080
```

Getting names from a UTF8-encoded file without remote Tika service.

```bash
# -U flag prevents use of remote Apache Tika service for file conversion to
# UTF8-encoded plain text
# -U flag is optional, but it removes unnecessary remote call to Tika.
gnfinder file_with_names.txt -U
```

Getting names from a UTF8-encoded file in tab-separated values (TSV) format

```bash
gnfinder file_with_names.txt -U -f tsv
```

Getting names from a file that is not a plain UTF8-encoded text

```bash
gnfinder file.pdf
```

Getting names from a URL

```bash
gnfinder https://en.wikipedia.org/wiki/Raccoon
```

Getting unique names from a file in JSON format. Disables -w flag.

```bash
gnfinder file_with_names.txt -u -f pretty
```

Getting names from a file in JSON format, and using jq to process JSON

```bash
gnfinder file_with_names.txt -f compact | jq
```

Getting data from a pipe forcing English language and verification

```bash
echo "Pomatomus saltator and Parus major" | gnfinder -v -l eng
echo "Pomatomus saltator and Parus major" | gnfinder --verify --lang eng
```

Limit matches to NCBI and Encyclopedia of Life. For the list of data source ids go to gnverifier's data sources page.

```bash
echo "And Parus major" | gnfinder -v -l eng -s "4,12"
echo "And Parus major" | gnfinder --verify --lang eng --sources "4,12"
```

Preserve uninomial names that are also common words.

```bash
echo "Cancer is a genus" | gnfinder -A
echo "America is also a genus" | gnfinder --ambiguous-uninomials
```

Show all matches, not only the best result.

```bash
echo "Pomatomus saltator and Parus major" | gnfinder -M
echo "Pomatomus saltator and Parus major" | gnfinder --all-matches
```

Show all matches, but only for selected data-sources.

```bash
echo "Pomatomus saltator and Parus major" | gnfinder -M -s 1,12
```

Adjusting Prior Odds using information about found names. They are calculated as "found names number / (capitalized words number - found names number)". Such adjustment will decrease Odds for texts with very few names, and increase odds for texts with a lot of found names.

```bash
gnfinder -a -d -f pretty file_with_names.txt
```
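
The prior odds formula quoted above can be worked through in a few lines of Go. This is a minimal sketch of the arithmetic only, assuming the formula exactly as stated; `adjustedPriorOdds` is a hypothetical helper and not a function of the GNfinder library, and how GNfinder combines these odds with the rest of the Bayes calculation is not shown.

```go
package main

import "fmt"

// adjustedPriorOdds applies the formula quoted above:
// found names / (capitalized words - found names).
// It returns false when the denominator would not be positive.
func adjustedPriorOdds(foundNames, capitalizedWords int) (float64, bool) {
	rest := capitalizedWords - foundNames
	if foundNames < 0 || rest <= 0 {
		return 0, false
	}
	return float64(foundNames) / float64(rest), true
}

func main() {
	// A text with few names among its capitalized words gets low prior odds.
	few, _ := adjustedPriorOdds(5, 105)
	// A name-dense text gets high prior odds.
	many, _ := adjustedPriorOdds(80, 100)
	fmt.Printf("few names: %.2f, many names: %.2f\n", few, many)
	// Output: few names: 0.05, many names: 4.00
}
```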

Returning 5 words before and after found name-candidate. This flag is ignored if unique names are returned.

```bash
gnfinder -w 5 file_with_names.txt
gnfinder --words-around 5 file_with_names.txt
```

Getting data from a file and redirecting result to another file

```bash
gnfinder file1.txt > file2.json
```

Detection of nomenclatural annotations

```bash
echo "Parus major sp. n." | gnfinder
```
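
To give a feel for what such an annotation looks like to a program, here is a small pattern-matching sketch. It is an illustration only: the regular expression and the `findAnnotation` helper are assumptions made for this example, they cover just the abbreviations listed in the Features section, and they are not how GNfinder itself detects annotations.

```go
package main

import (
	"fmt"
	"regexp"
)

// annotationRe is an illustrative pattern, not GNfinder's own: a rank or
// action abbreviation (sp., ssp., comb., nom.) followed by "nov." or "n.".
var annotationRe = regexp.MustCompile(`\b(?:sp|ssp|comb|nom)\.\s*(?:nov|n)\.`)

// findAnnotation returns the first annotation in text, or "" if none is found.
func findAnnotation(text string) string {
	return annotationRe.FindString(text)
}

func main() {
	fmt.Printf("%q\n", findAnnotation("Parus major sp. n."))
	fmt.Printf("%q\n", findAnnotation("Parus major is a bird."))
	// Output:
	// "sp. n."
	// ""
}
```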

Returning found names positions in the number of bytes from the beginning of the text instead of the number of UTF-8 characters

```bash
echo "Это Parus major" | gnfinder -b
```
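
The difference between the two kinds of position matters as soon as the text contains non-ASCII characters, as in the example above. The sketch below uses only the Go standard library to compute both offsets for the same string; `offsets` is a hypothetical helper written for this illustration and does not call GNfinder.

```go
package main

import (
	"fmt"
	"strings"
	"unicode/utf8"
)

// offsets returns the start of sub within text, first in bytes and then in
// UTF-8 characters (runes). Both values are -1 if sub is not present.
func offsets(text, sub string) (byteOffset, runeOffset int) {
	byteOffset = strings.Index(text, sub)
	if byteOffset < 0 {
		return -1, -1
	}
	runeOffset = utf8.RuneCountInString(text[:byteOffset])
	return byteOffset, runeOffset
}

func main() {
	b, r := offsets("Это Parus major", "Parus major")
	fmt.Printf("bytes: %d, characters: %d\n", b, r)
	// Output: bytes: 7, characters: 4
}
```

Each of the three Cyrillic letters takes two bytes in UTF-8, so the name starts at byte 7 but at character 4.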

There is also a tutorial about processing many PDF files in parallel.

Usage as a library

```go
import (
	"fmt"

	"github.com/gnames/gnfinder"
	"github.com/gnames/gnfinder/ent/nlp"
	"github.com/gnames/gnfinder/io/dict"
)

func Example() {
	txt := `Blue Adussel (Mytilus edulis) grows to about two inches the first year,Pardosa moesta Banks, 1892`
	// Create a default configuration, then load the bundled dictionary
	// and the Bayes weights used for name detection.
	cfg := gnfinder.NewConfig()
	dictionary := dict.LoadDictionary()
	weights := nlp.BayesWeights()
	gnf := gnfinder.New(cfg, dictionary, weights)
	res := gnf.Find(txt)
	// Print the first detected name with its position in the text.
	name := res.Names[0]
	fmt.Printf(
		"Name: %s, start: %d, end: %d",
		name.Name,
		name.OffsetStart,
		name.OffsetEnd,
	)
	// Output:
	// Name: Mytilus edulis, start: 13, end: 29
}
```

Usage as a docker container

```bash
docker pull gnames/gnfinder

# run GNfinder server, and map it to port 8888 on the host machine
docker run -d -p 8888:8778 --name gnfinder gnames/gnfinder
```

Projects based on GNfinder

gnfinder-plus makes it possible to work with MS Docs and PDF files without remote services (requires a local install of the poppler package).

bhlindex creates an index of scientific names for Biodiversity Heritage Library (BHL).

bhlnames adds synonymy and currently accepted names to searches in BHL, connects publications to pages in BHL.

Development

To install the latest GNfinder

```bash
git clone git@github.com:/gnames/gnfinder
cd gnfinder
make tools
make install
```

Testing

From the root of the project:

```bash
make tools

# run make install for CLI testing
make install
```

To run tests go to the root directory of the project and run

```bash
go test ./...

# or
make test
```

Owner

  • Name: gnames
  • Login: gnames
  • Kind: organization

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "GNfinder -- a finder of scientific names in a variety of media."
version: v1.1.5
authors:
  - family-names: "Mozzherin"
    given-names: "Dmitry"
    orcid: "https://orcid.org/0000-0003-1593-1417"
repository-code: "https://github.com/gnames/gnfinder"
date-released: 2024-05-24
doi: 10.5281/zenodo.10070488
license: MIT

GitHub Events

Total
  • Issues event: 8
  • Watch event: 5
  • Issue comment event: 6
  • Push event: 8
  • Pull request event: 1
  • Create event: 4
Last Year
  • Issues event: 8
  • Watch event: 5
  • Issue comment event: 6
  • Push event: 8
  • Pull request event: 1
  • Create event: 4

Committers

Last synced: almost 2 years ago

All Time
  • Total Commits: 224
  • Total Committers: 3
  • Avg Commits per committer: 74.667
  • Development Distribution Score (DDS): 0.009
Past Year
  • Commits: 14
  • Committers: 1
  • Avg Commits per committer: 14.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
Dmitry Mozzherin d****n@g****m 222
Alexander Myltsev a****r@m****m 1
Harsh Zalavadiya h****h@f****n 1

Issues and Pull Requests

Last synced: 7 months ago

All Time
  • Total issues: 115
  • Total pull requests: 4
  • Average time to close issues: about 1 month
  • Average time to close pull requests: 4 days
  • Total issue authors: 25
  • Total pull request authors: 2
  • Average comments per issue: 1.5
  • Average comments per pull request: 0.75
  • Merged pull requests: 1
  • Bot issues: 0
  • Bot pull requests: 3
Past Year
  • Issues: 6
  • Pull requests: 0
  • Average time to close issues: 2 days
  • Average time to close pull requests: N/A
  • Issue authors: 4
  • Pull request authors: 0
  • Average comments per issue: 0.33
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • dimus (76)
  • Adafede (5)
  • abubelinha (4)
  • Archilegt (4)
  • mjy (3)
  • sav-che (2)
  • mlichtenberg (2)
  • rdmpage (2)
  • git-arbitrarysystems (2)
  • nickynicolson (1)
  • AdamUlicny (1)
  • LocoDelAssembly (1)
  • harshzalavadiya (1)
  • mo-nathan (1)
  • ka7eh (1)
Pull Request Authors
  • dependabot[bot] (5)
  • harshzalavadiya (1)
Top Labels
Issue Labels
bug (13) question (3) wontfix (2) duplicate (1) help wanted (1)
Pull Request Labels
dependencies (5)

Packages

  • Total packages: 1
  • Total downloads: unknown
  • Total dependent packages: 10
  • Total dependent repositories: 7
  • Total versions: 66
proxy.golang.org: github.com/gnames/gnfinder
  • Versions: 66
  • Dependent Packages: 10
  • Dependent Repositories: 7
Rankings
Dependent packages count: 1.7%
Dependent repos count: 1.9%
Average: 4.8%
Stargazers count: 6.8%
Forks count: 9.0%
Last synced: 7 months ago

Dependencies

go.mod go
  • github.com/abadojack/whatlanggo v1.0.1
  • github.com/aclements/perflock v0.0.0-20220309210112-c3e96ed36b4a
  • github.com/davecgh/go-spew v1.1.1
  • github.com/fsnotify/fsnotify v1.5.4
  • github.com/gnames/bayes v0.4.0
  • github.com/gnames/gndoc v0.3.1
  • github.com/gnames/gner v0.1.4
  • github.com/gnames/gnfmt v0.2.0
  • github.com/gnames/gnlib v0.14.0
  • github.com/gnames/gnquery v0.3.3
  • github.com/gnames/gnstats v0.1.0
  • github.com/gnames/gnsys v0.2.2
  • github.com/gnames/gnuuid v0.1.1
  • github.com/gnames/gnverifier v1.0.0
  • github.com/golang-jwt/jwt v3.2.2+incompatible
  • github.com/google/go-tika v0.2.0
  • github.com/google/uuid v1.3.0
  • github.com/hashicorp/hcl v1.0.0
  • github.com/inconshreveable/mousetrap v1.0.0
  • github.com/json-iterator/go v1.1.12
  • github.com/labstack/echo/v4 v4.7.2
  • github.com/labstack/gommon v0.3.1
  • github.com/magiconair/properties v1.8.6
  • github.com/mattn/go-colorable v0.1.12
  • github.com/mattn/go-isatty v0.0.14
  • github.com/maxbrunsfeld/counterfeiter/v6 v6.5.0
  • github.com/mitchellh/mapstructure v1.5.0
  • github.com/modern-go/concurrent v0.0.0-20180306012644-bacd9c7ef1dd
  • github.com/modern-go/reflect2 v1.0.2
  • github.com/pelletier/go-toml v1.9.5
  • github.com/pelletier/go-toml/v2 v2.0.1
  • github.com/pmezard/go-difflib v1.0.0
  • github.com/rendon/testcli v1.0.0
  • github.com/rogpeppe/go-internal v1.8.1
  • github.com/rs/zerolog v1.26.1
  • github.com/spf13/afero v1.8.2
  • github.com/spf13/cast v1.4.1
  • github.com/spf13/cobra v1.4.0
  • github.com/spf13/jwalterweatherman v1.1.0
  • github.com/spf13/pflag v1.0.5
  • github.com/spf13/viper v1.11.0
  • github.com/stretchr/testify v1.7.1
  • github.com/subosito/gotenv v1.2.0
  • github.com/valyala/bytebufferpool v1.0.0
  • github.com/valyala/fasttemplate v1.2.1
  • golang.org/x/crypto v0.0.0-20220507011949-2cf3adece122
  • golang.org/x/mod v0.6.0-dev.0.20220106191415-9b9b3d81d5e3
  • golang.org/x/net v0.0.0-20220425223048-2871e0cb64e4
  • golang.org/x/perf v0.0.0-20220411212318-84e58bfe0a7e
  • golang.org/x/sys v0.0.0-20220503163025-988cb79eb6c6
  • golang.org/x/text v0.3.7
  • golang.org/x/time v0.0.0-20220411224347-583f2d630306
  • golang.org/x/tools v0.1.10
  • golang.org/x/xerrors v0.0.0-20220411194840-2f41105eb62f
  • gopkg.in/ini.v1 v1.66.4
  • gopkg.in/yaml.v2 v2.4.0
  • gopkg.in/yaml.v3 v3.0.0-20210107192922-496545a6307b
go.sum go
  • 590 dependencies
.github/workflows/lint.yml actions
  • actions/checkout v2 composite
  • golangci/golangci-lint-action v2 composite
.github/workflows/test.yml actions
  • actions/checkout v2 composite
  • actions/setup-go v2 composite
Dockerfile docker
  • alpine 3.17 build