https://github.com/centrefordigitalhumanities/gretel-upload

Upload treebanks for use in GrETEL

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.6%) to scientific vocabulary
Last synced: 10 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: CentreForDigitalHumanities
  • License: mit
  • Language: PHP
  • Default Branch: develop
  • Homepage: http://gretel.hum.uu.nl
  • Size: 2.16 MB
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 0
  • Open Issues: 10
  • Releases: 1
Created over 10 years ago · Last pushed 12 months ago
Metadata Files
  • Readme
  • License

README.md

GrETEL-upload

GrETEL-upload is an extension package for GrETEL that lets you upload your own corpus or dataset. The application then automatically transforms your corpus into an Alpino XML treebank. After processing, the treebanks are searchable in GrETEL, and if you supply metadata, you can use it for filtering and analysis.

Local installation

Requirements

On top of a default LAMP installation (with PHP 7.*; PHP 8 is not currently supported), the following packages are required:

  • basex: Stores processed treebanks in an XML database.
  • php-zip: Required to process .zip files.
  • php-ldap: Authentication via LDAP.
  • php-sqlite3: SQLite3 module for PHP; allows tests to run against an in-memory database.
  • php-libxml

GrETEL-upload also requires the following external programs to be installed:

  • Alpino. Download and then unpack (preferably) into /opt/Alpino/. You can change the installation directory in application/config/database_default.php. It also needs to be changed in alpino.sh.
  • Corpus2alpino. This can be installed globally using sudo -H pip3 install corpus2alpino. This requires Python 3.6+.

Alternatively, install the pinned Python dependencies from requirements.txt:

pip install -r requirements.txt

Make sure to modify application/config/common.php to point to the install location of corpus2alpino.
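
Before configuring the application, it can help to confirm that the PHP extensions listed above are actually loaded. The following is a minimal sketch, not part of GrETEL-upload: the helper name missing_php_modules is hypothetical, and it assumes the extensions appear in the output of php -m under the names zip, ldap, sqlite3 and libxml.

```shell
# Hypothetical helper (not part of GrETEL-upload): reads a module list on
# stdin, one name per line, and prints each requested module that is absent.
missing_php_modules() {
    loaded=$(cat)
    for mod in "$@"; do
        # -i: ignore case, -x: match the whole line, -F: fixed string
        if ! printf '%s\n' "$loaded" | grep -qixF -- "$mod"; then
            printf '%s\n' "$mod"
        fi
    done
}

# Typical use when PHP is installed: feed it the output of `php -m`.
# An empty result means all four modules were found.
if command -v php >/dev/null 2>&1; then
    php -m | missing_php_modules zip ldap sqlite3 libxml
fi
```

Reading the module list from stdin keeps the check independent of how PHP is installed, so the same function can be pointed at the CLI binary or at a saved module list from the web server.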

Configuration

You will have to provide configuration details in four files:

  • application/config/common.php : Paths and other common settings.
  • application/config/config.php : CodeIgniter settings.
  • application/config/database.php : Settings for the connections to both the relational database (e.g. MySQL) and the XML database (BaseX).
  • application/config/ldap.php : Settings for LDAP authentication.

An example configuration for each can be found in application/config/{NAME}_default.php .
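
One way to get started is to copy each example file to its final name and then edit the copies. The sketch below assumes the four {NAME}_default.php files exist in the given directory; the helper name init_config is hypothetical, and existing configuration files are left untouched.

```shell
# Hypothetical helper: for each of the four config names, copy
# {NAME}_default.php to {NAME}.php unless the target already exists.
init_config() {
    config_dir=$1
    for name in common config database ldap; do
        src="$config_dir/${name}_default.php"
        dst="$config_dir/${name}.php"
        if [ ! -f "$src" ]; then
            echo "missing template: $src" >&2
            return 1
        fi
        if [ -e "$dst" ]; then
            echo "keeping existing $dst"
        else
            cp "$src" "$dst" && echo "created $dst"
        fi
    done
}
```

From the root of the source tree this would be called as init_config application/config, after which each copied file still needs to be edited by hand.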

Update the Apache configuration to allow read-write access to gretel-upload (and gretel).

Database schema

Create the MySQL database gretel_upload. You can then run the command php index.php migrate in the source directory to create or migrate the database schema. See docs/schema.png for the current database schema (exported from phpMyAdmin).

Permissions

Make sure the uploads directory is writable by the user running the Apache daemon (usually www-data). Also create a writable sessions directory and refer to its absolute path in application/config/config.php if you use the default files session driver.
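
The directory setup can be sketched as follows. This is an illustration under assumptions: the helper name prepare_writable_dirs is hypothetical, and placing the sessions directory inside the installation root is a choice made here, not something the project prescribes.

```shell
# Hypothetical helper: create the uploads and sessions directories under the
# given root and make them writable for their owner and group.
prepare_writable_dirs() {
    root=$1
    for dir in uploads sessions; do
        mkdir -p "$root/$dir" || return 1
        chmod 770 "$root/$dir" || return 1
    done
    # Print the absolute sessions path for application/config/config.php.
    (cd "$root/sessions" && pwd)
}
```

The printed path is the value to enter as the session save path (sess_save_path in CodeIgniter 3). The directories must also be owned by, or share a group with, the Apache user; on a system where that user is www-data, something like sudo chown -R www-data uploads sessions would do that.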

Start-up

Start both Alpino and BaseX as server instances by running the following two commands:

basexserver -S
./alpino.sh

Then, navigate to the installation directory in your web browser (e.g. localhost/gretel-upload/ ) to start using GrETEL-upload.

Production: Cron Task

For production servers, a cron job is required to process uploaded treebanks. Schedule the following command to run, for example, every 5 minutes:

/usr/bin/php {root}/index.php cron process
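
As an illustration, a crontab entry for a 5-minute schedule could be generated like this. The helper name cron_entry and the installation path /var/www/html/gretel-upload are assumptions; substitute the actual value of {root}.

```shell
# Hypothetical helper: print a crontab entry that runs the processing
# command every 5 minutes for the installation at the given root.
cron_entry() {
    printf '*/5 * * * * /usr/bin/php %s/index.php cron process\n' "$1"
}

# Prints: */5 * * * * /usr/bin/php /var/www/html/gretel-upload/index.php cron process
cron_entry /var/www/html/gretel-upload
```

The printed line can be added with crontab -e under the account that should run the job. That account needs the same write access to the uploads directory as the web server.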

Uploading corpora

Formats

Currently, three formats are supported: LASSY-XML, CHAT and plain text (UTF-8 encoded). When you upload a set of texts (always in a zipped folder, possibly consisting of multiple directories), you can specify whether the text is already sentence- and/or word-tokenized. If not, the application will do this for you.
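
Because plain text must be UTF-8 encoded, it can be worth checking a folder before zipping it. The following is a minimal sketch that relies on iconv exiting with a non-zero status on invalid input; the helper names and the .txt extension are assumptions, not requirements of GrETEL-upload.

```shell
# Hypothetical helper: succeed only if the file is valid UTF-8.
# iconv exits with a non-zero status on an invalid input sequence.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Print every .txt file under a directory that is not valid UTF-8.
list_non_utf8() {
    find "$1" -type f -name '*.txt' | while IFS= read -r file; do
        is_utf8 "$file" || printf '%s\n' "$file"
    done
}
```

If list_non_utf8 prints nothing for the corpus folder, the folder can be packed for upload, for example with zip -r corpus.zip corpus/.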

Metadata

GrETEL-upload allows metadata annotation using the PaQu metadata format. This metadata will be converted to LASSY-XML during import.

The GrETEL-upload interface then lets you select which facets to use for filtering the data in GrETEL. For example, you can display a metadata column called 'year' as a slider, dropdown list or set of checkboxes. You can also hide certain columns.

Libraries

PHP

GrETEL-upload is written in PHP and created with CodeIgniter 3.1.11. The application uses the following libraries:

  • application/libraries/Alpino.php : Wrapper around Alpino's dependency parser and tokenisation scripts.
  • application/libraries/BaseX.php : BaseX PHP connector. Slightly modified to work in CodeIgniter.
  • application/libraries/Format.php : Helper to convert between various formats such as XML, JSON, CSV, etc. Part of CodeIgniter Rest Server (see below).
  • application/libraries/Ldap.php : Authentication via LDAP. Inspired by the LDAP Authentication library.
  • application/libraries/REST_Controller.php : CodeIgniter Rest Server, turns controllers into REST APIs.

JavaScript

GrETEL-upload uses the following JavaScript libraries:

CSS

GrETEL-upload is created with Pure CSS.

Images

GrETEL-upload uses the FamFamFam silk icon set.

API

GrETEL-upload has an API for retrieving data from the database:

  • treebank/: Returns all publicly available treebanks.
  • treebank/show/[title]: Returns the components of the treebank given by title.
  • treebank/metadata/[title]: Returns the metadata of the treebank given by title.
  • treebank/user/[user_id]: Returns all treebanks available to the currently logged in user. This might include private treebanks.
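
The endpoint paths above can be turned into request URLs along these lines. This is a sketch under assumptions: the helper name api_url is hypothetical, the base URL is that of a local installation, and depending on the server's rewrite rules an extra path prefix (such as index.php) may be needed before treebank/.

```shell
# Hypothetical helper: build the URL for one of the endpoints listed above.
# Usage: api_url BASE [ACTION ARGUMENT]
api_url() {
    base=${1%/}          # drop one trailing slash from the base URL
    action=${2:-}
    arg=${3:-}
    if [ -z "$action" ]; then
        # No action given: the listing of publicly available treebanks.
        printf '%s/treebank/\n' "$base"
    else
        printf '%s/treebank/%s/%s\n' "$base" "$action" "$arg"
    fi
}

# Example request (requires a running installation):
#   curl -s "$(api_url http://localhost/gretel-upload metadata my_treebank)"
```

The title my_treebank is a placeholder. Titles containing spaces or other special characters would need to be URL-encoded before being passed in.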

Tests

The test suite is built with ci-phpunit-test, which uses PHPUnit. You can run the tests by navigating to the application/tests directory and running phpunit.

Demo

A working version is available at http://gretel.hum.uu.nl.

Owner

  • Name: Centre for Digital Humanities
  • Login: CentreForDigitalHumanities
  • Kind: organization
  • Email: cdh@uu.nl
  • Location: Netherlands

Interdisciplinary centre for research and education in computational and data-driven methods in the humanities.

GitHub Events

Total
  • Push event: 2
Last Year
  • Push event: 2

Dependencies

application/tests/_ci_phpunit_test/patcher/third_party/PHP-Parser-2.1.1/composer.json packagist
  • phpunit/phpunit ~4.0 development
  • ext-tokenizer *
  • php >=5.4
application/tests/_ci_phpunit_test/patcher/third_party/PHP-Parser-3.0.3/composer.json packagist
  • phpunit/phpunit ~4.0|~5.0 development
  • ext-tokenizer *
  • php >=5.5
requirements.in pypi
  • corpus2alpino >=0.3.10
requirements.txt pypi
  • argparse ==1.4.0
  • beautifulsoup4 ==4.11.1
  • blis ==0.7.8
  • catalogue ==2.0.8
  • certifi ==2022.6.15
  • chamd ==0.5.8
  • charset-normalizer ==2.1.0
  • click ==8.1.3
  • corpus2alpino ==0.3.10
  • cymem ==2.0.6
  • folia ==2.5.8
  • idna ==3.3
  • importlib-metadata ==4.12.0
  • isodate ==0.6.1
  • jinja2 ==3.1.2
  • langcodes ==3.3.0
  • lxml ==4.9.1
  • markupsafe ==2.1.1
  • murmurhash ==1.0.7
  • numpy ==1.21.6
  • packaging ==21.3
  • pathy ==0.6.2
  • preshed ==3.0.6
  • pydantic ==1.9.1
  • pyparsing ==3.0.9
  • rdflib ==6.2.0
  • requests ==2.28.1
  • six ==1.16.0
  • smart-open ==5.2.1
  • soupsieve ==2.3.2.post1
  • spacy ==3.4.1
  • spacy-legacy ==3.0.9
  • spacy-loggers ==1.0.3
  • srsly ==2.4.4
  • tei-reader ==0.0.17
  • thinc ==8.1.0
  • tqdm ==4.64.0
  • typer ==0.4.2
  • typing-extensions ==4.1.1
  • urllib3 ==1.26.11
  • wasabi ==0.10.1
  • zipp ==3.8.1
package.json npm