https://github.com/52north/ecmwf-dataset-crawl
Science Score: 26.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ✓ .zenodo.json file (found .zenodo.json file)
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (13.8%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: 52North
- License: apache-2.0
- Language: Java
- Default Branch: master
- Size: 22.1 MB
Statistics
- Stars: 7
- Watchers: 18
- Forks: 4
- Open Issues: 2
- Releases: 0
Metadata Files
README.md
ARCHIVED
This project is no longer maintained and will not receive any further updates. If you plan to continue using it, please be aware that future security issues will not be addressed.
ecmwf-dataset-crawl
A web crawler for (hydrological) datasets. Developed as part of the ECMWF Summer of Weather Code 2018.
Within the project "Web Crawler for hydrological Data" we have developed a web crawling solution for multilingual discovery of environmental data sets. The discovered pages can help to add new data sources to global predictive weather forecasting models.
The application offers a specialised web search engine, which can be tasked to discover websites containing data sets based on keywords and countries. Keywords for each task are automatically translated into the languages of the desired countries to support multilingual discovery. The content of each discovered web page is classified by a custom-trained machine learning model, which estimates the probability that the page links to data. Relevant content such as contact information, data license and direct links is extracted and indexed for faster accessibility.
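The keyword translation step can be pictured with a minimal Python sketch. This is not the project's implementation: the country-to-language table, the `expand_task` function and the `fake_translate` stand-in are all hypothetical names chosen for illustration, and a real setup would call a translation service instead of the stub.

```python
from typing import Callable, Dict, List

# Hypothetical country -> language table (assumption, not taken from the project).
COUNTRY_LANGUAGES: Dict[str, List[str]] = {
    "DE": ["de"],
    "CH": ["de", "fr", "it"],
}


def expand_task(
    keywords: List[str],
    countries: List[str],
    translate: Callable[[str, str], str],
) -> Dict[str, List[str]]:
    """Return the translated keyword list for every language of the given countries."""
    languages: List[str] = []
    for country in countries:
        for lang in COUNTRY_LANGUAGES.get(country, []):
            if lang not in languages:  # keep each language once, in first-seen order
                languages.append(lang)
    return {lang: [translate(kw, lang) for kw in keywords] for lang in languages}


def fake_translate(text: str, lang: str) -> str:
    # Stand-in for a real translation service call (assumption).
    return f"{text} [{lang}]"


queries = expand_task(["river discharge data"], ["DE", "CH"], fake_translate)
for lang, translated in queries.items():
    print(lang, translated)
```

Passing the translator in as a function keeps the sketch runnable offline and makes it easy to swap in a real client later.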
A web-based user interface offers a list of pages with their extracted content, sorted by relevance. These results can be filtered with a full-text search or on metadata such as content language or classification label. Each result can be manually classified into categories, to help in training new models for the machine learning classifier. The interface furthermore offers usability features such as direct links to translated pages and search queries. The different keywords can be compared in a visualization of the crawler's performance metrics.
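Since the results are indexed in Elasticsearch, the filtering described above could be expressed as a query body along the following lines. This is a sketch under assumptions: the field names `content`, `language`, `label` and `classification_score` are hypothetical and may not match the project's actual index mapping; only the query DSL structure (`bool`, `match`, `term`, `sort`) is standard Elasticsearch.

```python
from typing import Any, Dict, List, Optional


def build_result_query(
    text: Optional[str] = None,
    language: Optional[str] = None,
    label: Optional[str] = None,
    size: int = 20,
) -> Dict[str, Any]:
    """Build an Elasticsearch query body for filtered, relevance-sorted results."""
    # Full-text search on the page content, or match everything if no text is given.
    must: List[Dict[str, Any]] = (
        [{"match": {"content": text}}] if text else [{"match_all": {}}]
    )
    # Exact-match metadata filters (hypothetical field names).
    filters: List[Dict[str, Any]] = []
    if language:
        filters.append({"term": {"language": language}})
    if label:
        filters.append({"term": {"label": label}})
    return {
        "size": size,
        "query": {"bool": {"must": must, "filter": filters}},
        # Highest classifier score first (hypothetical field name).
        "sort": [{"classification_score": {"order": "desc"}}],
    }


query = build_result_query(text="discharge", language="de")
print(query)
```

The resulting dictionary can be passed as the request body of a search call by whichever Elasticsearch client is in use.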
Design notes & more information can be found in the wiki.
run (docker-compose)
You can also have a look in the wiki for more hints.
```sh
# get API keys for google custom search, Azure Text Translator
# and insert them into configuration via environment vars.
# each required VAR is documented in the file.
vi .env

# start all the services
docker-compose up --build --force-recreate -d

# stop the services
docker-compose stop

# stop the services DELETING ALL DATA
docker-compose down --volumes
```
To configure Kibana visualizations:
- set `action.auto_create_index` to `true` in `elasticsearch/config/elasticsearch.yml` and restart elasticsearch with `docker-compose restart elasticsearch`
- visit http://localhost/kibana/app/kibana#/management/objects and click "Import".
- select `kibana/saved_objects.json` from this project's directory.
- mark any of the index patterns as "favorite" (star button)
- reset the elasticsearch configuration and restart it again.
dev
For information about the development environment, look at the readme of each component.
Licensed under the Apache License 2.0
Owner
- Name: 52°North Spatial Information Research GmbH
- Login: 52North
- Kind: organization
- Email: info@52north.org
- Location: Münster
- Website: https://52north.org/
- Twitter: fivetwon
- Repositories: 261
- Profile: https://github.com/52North
Advancing spatial information infrastructures to foster open science
Issues and Pull Requests
Last synced: over 1 year ago
All Time
- Total issues: 32
- Total pull requests: 0
- Average time to close issues: about 1 month
- Average time to close pull requests: N/A
- Total issue authors: 3
- Total pull request authors: 0
- Average comments per issue: 0.84
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- noerw (24)
- rcoughlan-1980 (4)
- carletes (4)
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels
Dependencies
- node 8 build
- maven 3.5-jdk-8 build
- abiosoft/caddy 0.11.0-no-stats
- docker.elastic.co/elasticsearch/elasticsearch-oss 6.4.0 build
- busybox latest build
- node 10-alpine build
- docker.elastic.co/kibana/kibana-oss 6.4.0 build
- org.apache.storm:storm-core 1.2.1 provided
- com.digitalpebble.stormcrawler:storm-crawler-core 1.10
- com.digitalpebble.stormcrawler:storm-crawler-elasticsearch 1.10
- com.optimaize.languagedetector:language-detector 0.6
- com.sun.xml.bind:jaxb-core 2.2.10
- com.sun.xml.bind:jaxb-impl 2.2.10
- javax.xml.bind:jaxb-api 2.2.8
- net.sf.saxon:Saxon-HE 9.8.0-12
- org.apache.storm:flux-core 1.0.2
- org.apache.storm:multilang-python 1.2.1
- com.digitalpebble.stormcrawler:storm-crawler-core 1.9 test
- org.mockito:mockito-all 1.10.8 test
- @types/bunyan ^1.8.4 development
- @types/elasticsearch ^5.0.23 development
- @types/js-yaml ^3.11.1 development
- @types/node ^10.1.2 development
- @types/swagger-tools ^0.10.6 development
- nodemon ^1.17.5 development
- rimraf ^2.6.2 development
- tslint ^5.10.0 development
- tslint-config-standard ^7.0.0 development
- typescript ^2.8.3 development
- typescript-json-schema ^0.24.1 development
- axios ^0.18.0
- bunyan ^1.8.12
- connect ~3.6.6
- countries-list ^2.3.2
- dataobject-parser ^1.2.1
- elasticsearch ^15.0.0
- es-mapping-ts ^0.0.8
- js-yaml ~3.11.0
- json2csv ^4.1.3
- swagger-tools 0.10.3
- 439 dependencies
- autoprefixer ^7.1.2 development
- babel-core ^6.22.1 development
- babel-eslint ^7.1.1 development
- babel-helper-vue-jsx-merge-props ^2.0.3 development
- babel-jest ^21.0.2 development
- babel-loader ^7.1.1 development
- babel-plugin-dynamic-import-node ^1.2.0 development
- babel-plugin-syntax-jsx ^6.18.0 development
- babel-plugin-transform-es2015-modules-commonjs ^6.26.0 development
- babel-plugin-transform-runtime ^6.22.0 development
- babel-plugin-transform-vue-jsx ^3.5.0 development
- babel-preset-env ^1.3.2 development
- babel-preset-stage-2 ^6.22.0 development
- chalk ^2.0.1 development
- compression-webpack-plugin ^1.1.11 development
- copy-webpack-plugin ^4.0.1 development
- css-loader ^0.28.0 development
- eslint ^3.19.0 development
- eslint-config-standard ^10.2.1 development
- eslint-friendly-formatter ^3.0.0 development
- eslint-loader ^1.7.1 development
- eslint-plugin-html ^3.0.0 development
- eslint-plugin-import ^2.7.0 development
- eslint-plugin-node ^5.2.0 development
- eslint-plugin-promise ^3.4.0 development
- eslint-plugin-standard ^3.0.1 development
- extract-text-webpack-plugin ^3.0.0 development
- file-loader ^1.1.4 development
- friendly-errors-webpack-plugin ^1.6.1 development
- html-webpack-plugin ^2.30.1 development
- jest ^21.2.0 development
- jest-serializer-vue ^0.3.0 development
- node-notifier ^5.1.2 development
- optimize-css-assets-webpack-plugin ^3.2.0 development
- ora ^1.2.0 development
- portfinder ^1.0.13 development
- postcss-import ^11.0.0 development
- postcss-loader ^2.0.8 development
- postcss-url ^7.2.1 development
- rimraf ^2.6.0 development
- semver ^5.5.1 development
- shelljs ^0.7.6 development
- uglifyjs-webpack-plugin ^1.1.1 development
- url-loader ^0.5.8 development
- vue-jest ^1.0.2 development
- vue-loader ^13.3.0 development
- vue-style-loader ^3.0.1 development
- vue-template-compiler ^2.5.2 development
- webpack ^3.6.0 development
- webpack-bundle-analyzer ^2.9.0 development
- webpack-dev-server ^2.9.1 development
- webpack-merge ^4.1.0 development
- axios ^0.18.0
- vue ^2.5.2
- vue-async-properties ^0.5.0
- vue-router ^3.0.1
- vuetify ^1.0.0
- 1118 dependencies
- joblib ==0.12.0
- matplotlib ==2.2.2
- nltk ==3.3
- numpy ==1.14.5
- scikit_learn ==0.19.1
- scipy ==1.1.0