https://github.com/centrefordigitalhumanities/gretel-upload
Upload treebanks for use in GrETEL
Repository
Basic Info
- Host: GitHub
- Owner: CentreForDigitalHumanities
- License: mit
- Language: PHP
- Default Branch: develop
- Homepage: http://gretel.hum.uu.nl
- Size: 2.16 MB
Statistics
- Stars: 1
- Watchers: 2
- Forks: 0
- Open Issues: 10
- Releases: 1
Metadata Files
README.md
GrETEL-upload
GrETEL-upload is an extension package for GrETEL that lets you upload your own corpus or dataset. The application then automatically transforms your corpus into an Alpino XML treebank. After processing, the treebanks are searchable in GrETEL, and if you supply metadata, you can use it for filtering and analysis.
Local installation
Requirements
On top of a default LAMP installation (with PHP 7.*; PHP 8 currently does not work), the following packages are required:
- basex: Stores processed treebanks in an XML database.
- php-zip: Required to process .zip-files.
- php-ldap: Authentication via LDAP.
- php-sqlite3: SQLite3 module for PHP, allows tests with in-memory database.
- php-libxml
GrETEL-upload also requires the following external programs to be installed:
- Alpino. Download and unpack it, preferably into /opt/Alpino/. You can change the installation directory in application/config/database_default.php; it also needs to be changed in alpino.sh.
- Corpus2alpino. This can be installed globally using sudo -H pip3 install corpus2alpino. This requires Python 3.6+.
It is also possible to install it with pip from the requirements file:
pip install -r requirements.txt
Make sure to modify application/config/common.php to point to the install location of corpus2alpino.
Configuration
You will have to provide configuration details in four files:
- application/config/common.php: Paths and other common settings.
- application/config/config.php: CodeIgniter settings.
- application/config/database.php: Settings for your database connection to both the relational database (e.g. MySQL) and the XML-database (basex).
- application/config/ldap.php: Settings for LDAP authentication.
An example configuration for each file can be found in application/config/{NAME}_default.php.
Update the Apache configuration to allow read-write access to gretel-upload (and gretel).
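One way to create the four configuration files is to copy each {NAME}_default.php example to its live name and then edit the copy. The sketch below assumes it is run from the source directory. The function name copy_default_configs is hypothetical and not part of the repository. Existing files are left untouched so that an edited configuration is never overwritten.

```shell
# Copy each *_default.php example to its live name, unless the live file exists.
# $1: configuration directory (defaults to application/config)
copy_default_configs() {
    config_dir="${1:-application/config}"
    for name in common config database ldap; do
        src="$config_dir/${name}_default.php"
        dst="$config_dir/${name}.php"
        if [ ! -f "$src" ]; then
            # Report the missing example and move on to the next file.
            echo "missing: $src" >&2
            continue
        fi
        if [ -e "$dst" ]; then
            echo "kept:    $dst"
        else
            cp "$src" "$dst" && echo "created: $dst"
        fi
    done
}

# Example, from the source directory:
#   copy_default_configs application/config
```

After copying, each file still has to be edited by hand; the copies only give you a starting point.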
Database schema
First create the MySQL database gretel_upload.
Then run the command php index.php migrate in the source directory to create or migrate the database schema.
See docs/schema.png for the current database schema (exported from phpMyAdmin).
Permissions
Make sure the uploads directory is writable for the user running the Apache daemon (usually www-data). If you use the default files session driver, also create a writable sessions directory and set its absolute path in application/config/config.php (the sess_save_path setting).
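The directory setup can be sketched as a small helper. This is an illustration under assumptions: the helper name prepare_writable_dir and the sessions path in the example are hypothetical, and the uploads directory is assumed to sit in the source directory. Changing ownership to the Apache user needs root, so it is shown as a separate step.

```shell
# Create a directory (and any missing parents) and make it readable,
# writable and searchable for its owner and group only.
prepare_writable_dir() {
    mkdir -p "$1" && chmod 770 "$1"
}

# Example, from the source directory (the sessions path is hypothetical):
#   prepare_writable_dir uploads
#   prepare_writable_dir /var/lib/gretel-upload/sessions
#   sudo chown www-data:www-data uploads /var/lib/gretel-upload/sessions
```

Use the user that actually runs Apache on your system if it is not www-data.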
Start-up
Start both Alpino and BaseX as server instances by running the following two commands:
basexserver -S
./alpino.sh
Then navigate to the installation directory in your web browser (e.g. localhost/gretel-upload/) to start using GrETEL-upload.
Production: Cron Task
For production servers, a cron job is required for processing uploaded treebanks. Schedule the following command to run regularly, for example every 5 minutes:
/usr/bin/php {root}/index.php cron process
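As an illustration, the helper below prints a crontab entry for the command above with {root} filled in. The helper name cron_entry and the example path are assumptions; substitute the directory where GrETEL-upload is installed.

```shell
# Print a crontab entry that runs the processing command every N minutes.
# $1: absolute path of the GrETEL-upload source directory (the {root} above)
# $2: interval in minutes (defaults to 5)
cron_entry() {
    root="${1%/}"          # drop a trailing slash, if any
    minutes="${2:-5}"
    printf '*/%s * * * * /usr/bin/php %s/index.php cron process\n' "$minutes" "$root"
}

# Example path only; adjust to your installation.
cron_entry /var/www/html/gretel-upload
```

The printed line can be pasted into the crontab (crontab -e) of a user that is allowed to write to the uploads directory, which is probably the same user that runs Apache.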
Uploading corpora
Formats
Currently, three formats are supported: LASSY-XML, CHAT and plain text (UTF-8 encoded). When you upload a set of texts (always in a zipped folder, possibly consisting of multiple directories), you can specify whether the text is already sentence- and/or word-tokenized. If not, the application will do this for you.
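Because plain text has to be UTF-8 encoded, it can help to check a corpus before zipping it. The sketch below is one possible pre-upload check and is not part of GrETEL-upload. It assumes that the plain-text files end in .txt and that iconv is installed; the function name find_non_utf8 is hypothetical.

```shell
# Print every .txt file under a directory that is not valid UTF-8.
# Returns 0 when all files are valid and 1 when at least one is not.
find_non_utf8() {
    dir="$1"
    # iconv exits non-zero when the input is not valid in the source encoding.
    bad=$(find "$dir" -type f -name '*.txt' -exec sh -c '
        for f in "$@"; do
            iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1 || printf "%s\n" "$f"
        done' sh {} +)
    if [ -n "$bad" ]; then
        printf '%s\n' "$bad"
        return 1
    fi
    return 0
}

# Example (my_corpus is a placeholder directory name):
#   find_non_utf8 my_corpus && zip -r my_corpus.zip my_corpus
```

The example zips the folder only when the check passes, so a file in another encoding is caught before upload.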
Metadata
GrETEL-upload allows metadata annotation using the PaQu metadata format. This metadata will be converted to LASSY-XML during import.
The GrETEL-upload interface then lets you select which facet to use for each metadata field when filtering the data in GrETEL. For example, you can display a metadata column called 'year' as a slider, a dropdown list or a set of checkboxes. You can also hide certain columns.
Libraries
PHP
GrETEL-upload is written in PHP and created with CodeIgniter 3.1.11. The application uses the following libraries:
- application/libraries/Alpino.php: Wrapper around Alpino's dependency parser and tokenisation scripts.
- application/libraries/BaseX.php: BaseX PHP connector. Slightly modified to work in CodeIgniter.
- application/libraries/Format.php: Helper to convert between various formats such as XML, JSON, CSV, etc. Part of CodeIgniter Rest Server (see below).
- application/libraries/Ldap.php: Authentication via LDAP. Inspired by the LDAP Authentication library.
- application/libraries/REST_Controller.php: CodeIgniter Rest Server, turns controllers into REST APIs.
Javascript
GrETEL-upload uses the following JavaScript libraries:
CSS
GrETEL-upload is created with Pure CSS.
Images
GrETEL-upload uses the FamFamFam silk icon set.
API
GrETEL-upload has an API for retrieving data from the database:
- treebank/: Returns all publicly available treebanks.
- treebank/show/[title]: Returns the components of the treebank given by title.
- treebank/metadata/[title]: Returns the metadata of the treebank given by title.
- treebank/user/[user_id]: Returns all treebanks available to the currently logged in user. This might include private treebanks.
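The endpoint paths above can be turned into full URLs with a small helper, sketched here under assumptions: the base URL is that of a local installation, the helper name treebank_url is hypothetical, and your deployment may need an extra path segment such as index.php depending on its URL rewriting. Titles are inserted as given, so a title with spaces or other special characters would need URL encoding first.

```shell
# Build the URL for one of the endpoints listed above.
# $1: base URL of the installation (an assumption; adjust to your deployment)
# $2: endpoint: list, show, metadata or user
# $3: treebank title or user id, for the endpoints that take one
treebank_url() {
    base="${1%/}"          # drop a trailing slash, if any
    case "$2" in
        list)
            printf '%s/treebank/\n' "$base"
            ;;
        show|metadata|user)
            if [ -z "${3:-}" ]; then
                echo "endpoint '$2' needs a title or user id" >&2
                return 1
            fi
            printf '%s/treebank/%s/%s\n' "$base" "$2" "$3"
            ;;
        *)
            echo "unknown endpoint: $2" >&2
            return 1
            ;;
    esac
}

# my_treebank is a placeholder title.
treebank_url http://localhost/gretel-upload show my_treebank
```

The resulting URL can be passed to any HTTP client, for example curl -s "$(treebank_url http://localhost/gretel-upload list)".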
Tests
The test suite is created using ci-phpunit-test, which is built on PHPUnit.
You can run the tests by navigating to the application/tests directory and running phpunit.
Demo
A working version is available on http://gretel.hum.uu.nl.
Owner
- Name: Centre for Digital Humanities
- Login: CentreForDigitalHumanities
- Kind: organization
- Email: cdh@uu.nl
- Location: Netherlands
- Website: https://cdh.uu.nl/
- Repositories: 39
- Profile: https://github.com/CentreForDigitalHumanities
Interdisciplinary centre for research and education in computational and data-driven methods in the humanities.
GitHub Events
Total
- Push event: 2
Last Year
- Push event: 2
Dependencies
- phpunit/phpunit ~4.0 development
- ext-tokenizer *
- php >=5.4
- phpunit/phpunit ~4.0|~5.0 development
- ext-tokenizer *
- php >=5.5
- corpus2alpino >=0.3.10
- argparse ==1.4.0
- beautifulsoup4 ==4.11.1
- blis ==0.7.8
- catalogue ==2.0.8
- certifi ==2022.6.15
- chamd ==0.5.8
- charset-normalizer ==2.1.0
- click ==8.1.3
- corpus2alpino ==0.3.10
- cymem ==2.0.6
- folia ==2.5.8
- idna ==3.3
- importlib-metadata ==4.12.0
- isodate ==0.6.1
- jinja2 ==3.1.2
- langcodes ==3.3.0
- lxml ==4.9.1
- markupsafe ==2.1.1
- murmurhash ==1.0.7
- numpy ==1.21.6
- packaging ==21.3
- pathy ==0.6.2
- preshed ==3.0.6
- pydantic ==1.9.1
- pyparsing ==3.0.9
- rdflib ==6.2.0
- requests ==2.28.1
- six ==1.16.0
- smart-open ==5.2.1
- soupsieve ==2.3.2.post1
- spacy ==3.4.1
- spacy-legacy ==3.0.9
- spacy-loggers ==1.0.3
- srsly ==2.4.4
- tei-reader ==0.0.17
- thinc ==8.1.0
- tqdm ==4.64.0
- typer ==0.4.2
- typing-extensions ==4.1.1
- urllib3 ==1.26.11
- wasabi ==0.10.1
- zipp ==3.8.1