gleaner
Gleaner: JSON-LD and structured data on the web harvesting
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ✓ Committers with academic emails: 1 of 6 committers (16.7%) from academic institutions
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (15.2%) to scientific vocabulary
Repository
Gleaner: JSON-LD and structured data on the web harvesting
Basic Info
- Host: GitHub
- Owner: gleanerio
- License: apache-2.0
- Language: Go
- Default Branch: master
- Homepage: https://gleaner.io
- Size: 378 MB
Statistics
- Stars: 18
- Watchers: 6
- Forks: 9
- Open Issues: 72
- Releases: 20
Metadata Files
README.md
Gleaner (https://gleaner.io)
About
Gleaner is a tool for extracting JSON-LD from web pages. You provide Gleaner with a list of sites to index, and it accesses and retrieves pages based on the sitemap.xml of the domain(s). Gleaner can then check the documents for well-formed and valid structure. The products of Gleaner runs can then be used to build Knowledge Graphs, Full-Text Indexes, Semantic Indexes, Spatial Indexes, or other products that drive discovery and use.
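To make the sitemap-driven step concrete, here is a minimal sketch of reading page URLs out of a sitemap.xml document using only the Go standard library. It is an illustration under stated assumptions, not Gleaner's own code: the names `urlSet` and `parseSitemap` and the sample URLs are hypothetical, and a real run would fetch the sitemap over HTTP and handle sitemap index files as well.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"strings"
)

// urlSet mirrors the <urlset> root element of a sitemap.xml file.
// (Hypothetical type for this sketch.)
type urlSet struct {
	URLs []struct {
		Loc string `xml:"loc"`
	} `xml:"url"`
}

// parseSitemap returns the page URLs listed in a sitemap document.
func parseSitemap(data []byte) ([]string, error) {
	var set urlSet
	if err := xml.Unmarshal(data, &set); err != nil {
		return nil, err
	}
	locs := make([]string, 0, len(set.URLs))
	for _, u := range set.URLs {
		// Trim whitespace that some sitemaps place around the URL.
		locs = append(locs, strings.TrimSpace(u.Loc))
	}
	return locs, nil
}

func main() {
	// Hypothetical sitemap content; a real run would download this file.
	sample := []byte(`<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.org/dataset/1</loc></url>
  <url><loc>https://example.org/dataset/2</loc></url>
</urlset>`)

	locs, err := parseSitemap(sample)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, loc := range locs {
		fmt.Println(loc)
	}
}
```

Each URL returned this way is a candidate page to retrieve and inspect for JSON-LD.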

The image above gives an overview of the basic workflow of Gleaner.
This image shows that the product of Gleaner is really a populated data warehouse (document warehouse), where the documents are either the harvested JSON-LD structured data documents or the provenance graphs generated by Gleaner during harvesting.
Gleaner talks to an S3-compliant object store as part of its configuration. This can be AWS S3, Google Cloud Storage (GCS), or another S3-compliant object store. A typical setup might use the open source Minio package in this role.
Note also the use of headless Chrome in this diagram. A headless Chrome instance is used in cases where the resources to be harvested place the JSON-LD documents into the document object model (DOM) dynamically. In those cases, headless Chrome renders the page and runs the JavaScript to produce the rendered HTML document, which can then be parsed for the JSON-LD.
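Once HTML is available, whether fetched directly or rendered by headless Chrome, the JSON-LD lives in `<script type="application/ld+json">` elements. The following is a minimal sketch of pulling those blocks out and keeping only the ones that are well-formed JSON. It is not Gleaner's implementation: `ldScript` and `extractJSONLD` are hypothetical names, and the regular expression is a simplification that stands in for a real HTML parser. It also checks only JSON well-formedness, not JSON-LD validity.

```go
package main

import (
	"encoding/json"
	"fmt"
	"regexp"
	"strings"
)

// ldScript matches <script type="application/ld+json"> ... </script> blocks.
// A regular expression is a simplification for this sketch; a production
// harvester would normally use a real HTML parser.
var ldScript = regexp.MustCompile(`(?is)<script[^>]*type=["']application/ld\+json["'][^>]*>(.*?)</script>`)

// extractJSONLD returns the well-formed JSON blocks found in an HTML page.
func extractJSONLD(html string) []string {
	var docs []string
	for _, m := range ldScript.FindAllStringSubmatch(html, -1) {
		body := strings.TrimSpace(m[1])
		// Skip blocks that are not syntactically valid JSON.
		if json.Valid([]byte(body)) {
			docs = append(docs, body)
		}
	}
	return docs
}

func main() {
	// Hypothetical page: one well-formed JSON-LD block and one malformed block.
	page := `<html><head>
<script type="application/ld+json">{"@context": "https://schema.org/", "@type": "Dataset", "name": "Example"}</script>
<script type="application/ld+json">{"@type": "Dataset", </script>
</head><body></body></html>`

	for _, doc := range extractJSONLD(page) {
		fmt.Println(doc)
	}
}
```

In this sample only the first script block passes the check; the truncated second block is skipped.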

The previous image gives a view of a typical completed installation and use of Gleaner. Here we see the Nabu package (also in the Gleaner.io organization) used to synchronize the data warehouse with a triple store.
Nabu is described in its own repository, but it basically reads the JSON-LD documents and performs ELT or ETL workflows on them. In this case it is a simple ETL of the JSON-LD: extracted from S3, translated from JSON-LD into N-Triples, and then loaded into the triplestore. If your triplestore natively handles the JSON-LD serialization of RDF, this could be a simple extract and load.
Code and Git Branch Patterns
Go versions
Gleaner is written in Go, and we ask that developers stay in sync with the latest stable release. Go has a very stable language API, so there are generally few issues with being off by a version or two.
Note that conflicts in go.mod and go.sum are not unexpected. As noted here, please resolve conflicts in these files by doing a union followed by a
go mod tidy
after the merge. Once you have resolved the conflict and run tidy, you can add the go.mod and go.sum files, if needed, and commit.
Branches
If you are interested in working on Gleaner, we ask that you use the following git pattern. Branch names should start with your initials, followed by -- and then a name. This can be a descriptive name or an issue name.
Please branch off of dev and merge back into dev. Given the small number of developers, we hope this won't result in many conflicts. As we agree on a version of dev that we like, we will merge it to master, from which builds for releases and containers will be done.
```
$ git checkout -b [initialsorteamname]--[yourbranchtitlesnakecase]
$ git checkout -b df--devdoc_updates
< make some code changes >
$ git add .
$ git commit -m '[initials]
```
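The branch naming convention described above can be checked mechanically, for example in a pre-push hook or CI step. The sketch below is only an illustration: `branchPattern` and `validBranchName` are hypothetical names, and the allowed character set (letters and digits for the initials, letters, digits, and underscores for the title) is an assumption, since the README does not define one.

```go
package main

import (
	"fmt"
	"regexp"
)

// branchPattern encodes "initials--branch_title".
// The exact character classes are an assumption made for this sketch.
var branchPattern = regexp.MustCompile(`^[A-Za-z0-9]+--[A-Za-z0-9_]+$`)

// validBranchName reports whether name follows the initials--title pattern.
func validBranchName(name string) bool {
	return branchPattern.MatchString(name)
}

func main() {
	for _, name := range []string{"df--devdoc_updates", "devdoc_updates"} {
		fmt.Printf("%s: %v\n", name, validBranchName(name))
	}
}
```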
Gleaner Indexing
While we work on bringing this repository's documentation in line, the best documentation on using Gleaner at this time is available at:
- https://book.oceaninfohub.org/indexing/qstart.html
- https://book.oceaninfohub.org/indexing/cliDocker/README.html
Unit tests
There are some unit tests here; to run them, you can do go test -v ./...
Owner
- Name: GleanerIO
- Login: gleanerio
- Kind: organization
- Website: https://gleaner.io
- Repositories: 7
- Profile: https://github.com/gleanerio
A set of projects implementing principles around indexing structured data on the web / schema.org (Developed as part of NSF's EarthCube)
Citation (citation.cff)
cff-version: 1.1.0
message: If you use this software, please cite it as below.
authors:
  - family-names: Fils
    given-names: Douglas
    orcid: https://orcid.org/0000-0002-2257-9127
  - family-names: Minch
    given-names: Melinda
    orcid: https://orcid.org/0000-0003-3878-7147
  - family-names: Valentine
    given-names: David
    orcid: https://orcid.org/0000-0002-5018-048X
  - family-names: Shepherd
    given-names: Adam
    orcid: https://orcid.org/0000-0003-4486-9448
title: Gleaner, a tool for indexing JSON-LD based structured data on the web
version: 2.0.24
doi:
date-released: 2020-11-17
GitHub Events
Total
- Issues event: 3
- Watch event: 1
- Delete event: 1
- Issue comment event: 3
- Push event: 12
- Pull request review event: 1
- Pull request event: 6
- Create event: 2
Last Year
- Issues event: 3
- Watch event: 1
- Delete event: 1
- Issue comment event: 3
- Push event: 12
- Pull request review event: 1
- Pull request event: 6
- Create event: 2
Committers
Last synced: over 2 years ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| Douglas Fils | d****s@g****m | 181 |
| David Valentine | d****e@g****m | 143 |
| melinda | m****a@m****m | 80 |
| melinda | m****h@o****a | 21 |
| dependabot[bot] | 4****] | 2 |
| Adam Shepherd | a****d@w****u | 1 |
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 1
- Total pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Total issue authors: 1
- Total pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Issue authors: 1
- Pull request authors: 2
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- valentinedwv (11)
Pull Request Authors
- dependabot[bot] (5)
- valentinedwv (4)
- C-Loftus (2)
- ylyangtw (2)
- fils (1)
Packages
- Total packages: 1
- Total downloads: unknown
- Total dependent packages: 1
- Total dependent repositories: 2
- Total versions: 2
proxy.golang.org: github.com/gleanerio/gleaner
- Homepage: https://github.com/gleanerio/gleaner
- Documentation: https://pkg.go.dev/github.com/gleanerio/gleaner#section-documentation
- License: Apache-2.0
- Latest release: v0.0.0-20230131024133-c180764b588a (published about 3 years ago)
Dependencies
- github.com/PuerkitoBio/goquery v1.8.0
- github.com/apache/thrift v0.14.1
- github.com/araddon/dateparse v0.0.0-20210429162001-6b43995a97de
- github.com/aws/aws-sdk-go v1.41.12
- github.com/boltdb/bolt v1.3.1
- github.com/chromedp/chromedp v0.6.5
- github.com/gleanerio/nabu v0.0.0-20211107193830-958398c3aaef
- github.com/gocarina/gocsv v0.0.0-20211020200912-82fc2684cc48
- github.com/gorilla/mux v1.8.0
- github.com/gosuri/uilive v0.0.4
- github.com/gosuri/uiprogress v0.0.1
- github.com/knakk/rdf v0.0.0-20190304171630-8521bf4c5042
- github.com/mafredri/cdp v0.32.0
- github.com/minio/minio-go/v7 v7.0.15
- github.com/oxffaa/gopher-parse-sitemap v0.0.0-20191021113419-005d2eb1def4
- github.com/piprate/json-gold v0.4.1-0.20210813112359-33b90c4ca86c
- github.com/rs/xid v1.2.1
- github.com/schollz/progressbar/v3 v3.8.3
- github.com/spf13/cobra v1.2.1
- github.com/spf13/viper v1.9.0
- github.com/stretchr/testify v1.7.0
- github.com/utahta/go-openuri v0.1.0
- github.com/xitongsys/parquet-go v1.6.0
- github.com/xitongsys/parquet-go-source v0.0.0-20211010230925-397910c5e371
- go.etcd.io/bbolt v1.3.6
- golang.org/x/crypto v0.0.0-20210817164053-32db794688a5
- golang.org/x/oauth2 v0.0.0-20211005180243-6b3c2da341f1
- google.golang.org/api v0.60.0
- 962 dependencies
- github.com/gorilla/mux v1.7.4
- github.com/gorilla/mux v1.7.4
- github.com/vugu/vjson v0.0.0-20200505061711-f9cbed27d3d9
- github.com/vugu/vugu v0.3.3
- github.com/chromedp/cdproto v0.0.0-20191009033829-c22f49c9ff0a
- github.com/chromedp/chromedp v0.5.1
- github.com/davecgh/go-spew v1.1.0
- github.com/davecgh/go-spew v1.1.1
- github.com/gobwas/httphead v0.0.0-20180130184737-2c6c146eadee
- github.com/gobwas/pool v0.2.0
- github.com/gobwas/ws v1.0.2
- github.com/knq/sysutil v0.0.0-20191005231841-15668db23d08
- github.com/mailru/easyjson v0.7.0
- github.com/pmezard/go-difflib v1.0.0
- github.com/stretchr/objx v0.1.0
- github.com/stretchr/testify v1.4.0
- github.com/vugu/html v0.0.0-20190914200101-c62dc20b8289
- github.com/vugu/vjson v0.0.0-20200505061711-f9cbed27d3d9
- github.com/vugu/vugu v0.3.3
- github.com/vugu/xxhash v0.0.0-20191111030615-ed24d0179019
- golang.org/x/crypto v0.0.0-20190308221718-c2843e01d9a2
- golang.org/x/net v0.0.0-20190912160710-24e19bdeb0f2
- golang.org/x/sys v0.0.0-20190215142949-d0b11bdaac8a
- golang.org/x/sys v0.0.0-20191008105621-543471e840be
- golang.org/x/text v0.3.0
- gopkg.in/check.v1 v0.0.0-20161208181325-20d25e280405
- gopkg.in/yaml.v2 v2.2.2
- actions/checkout v2 composite
- actions/setup-go v2 composite
- docker/build-push-action v2 composite
- docker/login-action v1 composite
- docker/metadata-action v3 composite
- docker/setup-buildx-action v1 composite
- docker/setup-qemu-action v1 composite
- actions/checkout v2 composite
- actions/setup-go v2 composite
- docker/build-push-action v2 composite
- docker/login-action v1 composite
- docker/metadata-action v3 composite
- docker/setup-buildx-action v1 composite
- docker/setup-qemu-action v1 composite
- actions/checkout v2 composite
- wangyoucao577/go-release-action v1.22 composite
- actions/checkout v2 composite
- wangyoucao577/go-release-action v1.22 composite
- alpine latest build