sgex
A Python package for the Sketch Engine API
Science Score: 67.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ✓ DOI references: found 3 DOI reference(s) in README
- ✓ Academic publication links: links to zenodo.org
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (16.5%) to scientific vocabulary
Keywords
Repository
A Python package for the Sketch Engine API
Basic Info
Statistics
- Stars: 8
- Watchers: 2
- Forks: 0
- Open Issues: 4
- Releases: 20
Topics
Metadata Files
README.md
Sketch Grammar Explorer
Introduction
Sketch Grammar Explorer is an API wrapper for Sketch Engine, a corpus management software useful for linguistic research. The goal is to build a flexible scaffold for any kind of programmatic work with Sketch Engine and NoSketch Engine.
Installation
Clone SGEX or install it with `pip install sgex` (main dependencies: pandas, pyyaml, aiohttp, aiofiles).
Get a Sketch Engine API key. Be sure to reference SkE's documentation and schema:

```sh
wget https://www.sketchengine.eu/apidoc/openapi.yaml -O .openapi.yaml
```
Getting started
A quick intro to the API (examples use a local NoSketch Engine server).
Most things are identical for SkE's main server, apart from using credentials and having more call types available. SGEX currently uses the Bonito API, with URLs ending in `/bonito/run.cgi`, not newer paths like `/search/corp_info`.
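For example, here's a hedged sketch of pointing SGEX at another NoSketch Engine instance via `default_servers` (the server name and URL are placeholders; built-in defaults already cover `local` and `ske`):

```py
from sgex.job import Job

# register a custom Bonito endpoint under a new name and select it by name
j = Job(
    server="my_noske",
    default_servers={"my_noske": "http://localhost:10070/bonito/run.cgi"},
    params={"call_type": "CorpInfo", "corpname": "susanne"})
j.run()
```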
Package modules
- `job`: the primary module; makes requests and manipulates data
- `call`: classes and methods for API call types
- `query`: functions to generate/manipulate CQL queries
- `util`: utility functions
The Job class
Calls are made with the job module, which can also be run as a script. The Job class has a few options:
```py
from sgex.job import Job

j = Job(
    # define API calls
    infile: str | list | None = None,
    params: str | dict | list | None = None,
    # set server info
    server: str = "local",
    default_servers: dict = default_servers,
    # supply credentials
    api_key: str | None = None,
    username: str | None = None,
    # manage caching
    cache_dir: str = "data",
    clear_cache: bool = False,
    # run asynchronous requests
    thread: bool = False,
    # control request throttling
    wait_dict: dict = wait_dict,
    # make a dry run
    dry_run: bool = False,
    # change verbosity
    verbose: bool = False,
)

j.run()
```
Making a call and accessing the response
Here's how to make a request:
```py
>>> from sgex.job import Job

# instantiate the job with options
>>> j = Job(
...     params={"call_type": "View", "corpname": "preloaded/susanne", "q": 'alemma,"bird"'},
...     api_key="",  # add key
...     username="",  # add name
...     server="ske")  # use SkE main server

# this example uses a local server (the default)
>>> j = Job(
...     params={"call_type": "View", "corpname": "susanne", "q": 'alemma,"bird"'})

# run the job
>>> j.run()

# get a summary
>>> dt = j.summary()
>>> for k, v in dt.items():
...     print(k, ("<time>" if k == "seconds" else v))
seconds <time>
calls 1
errors Counter()

# results are stored in Job.data
>>> j.data.view
[View 8cdfca2 {asyn: '0', corpname: susanne, format: json, q: 'alemma,"bird"'}]

# the response gets cached in data/<hash>.json: repeating the same request pulls from the cache
# data is accessible via .text or .json()
>>> j.data.view[0].response.json()["concsize"]  # the number of concordances for "bird"
12
```
Making multiple calls
Just provide a list of call parameters (list of dict) to make more than one call.
```py
# supplying a list of calls
>>> from sgex.job import Job
>>> j = Job(
...     params=[
...         {"call_type": "CorpInfo", "corpname": "susanne"},
...         {"call_type": "View", "corpname": "susanne", "q": 'alemma,"bird"'},
...         {"call_type": "Collx", "corpname": "susanne", "q": 'alemma,"bird"'}])
>>> j.run()
>>> j.data
collx (1) [Collx 26d29b1 {corpname: susanne, format: json, q: 'alemma,"bird"'}]
corpinfo (1) [CorpInfo 9c08055 {corpname: susanne, format: json}]
view (1) [View 8cdfca2 {asyn: '0', corpname: susanne, format: json, q: 'alemma,"bird"'}]
```
Or supply a JSON, JSONL or YAML file with calls:
```jsonl
// test/example.jsonl
{"call_type": "Collx", "corpname": "susanne", "q": "alemma,\"apple\""}
{"call_type": "Collx", "corpname": "susanne", "q": "alemma,\"carrot\""}
{"call_type": "Collx", "corpname": "susanne", "q": "alemma,\"lettuce\""}
```
```py
# supplying a file of calls
>>> from sgex.job import Job
>>> j = Job(infile="test/example.jsonl")
>>> j.run()
>>> j.data
collx (3) [Collx bc5d89b {corpname: susanne, format: json, q: 'alemma,"apple"'},
 Collx 19495d0 {corpname: susanne, format: json, q: 'alemma,"carrot"'},
 Collx 7edee07 {corpname: susanne, format: json, q: 'alemma,"lettuce"'}]
```
Making multiple calls for concordances
The View call retrieves concordances by page, defaulting to page 1. Its `fromp` and `pagesize` parameters adjust the current page and the maximum number of concordances per page. Using a large `pagesize` is often fine for getting data in one request, but several smaller requests may be better. For this, try `Job.run_repeat()`, which gets the first page, calculates how many pages remain, and then gets the remaining pages (or up to `max_pages`, if defined).
This example gets all the hits for "work" in the Susanne corpus in sets of 10 concordances per page. There are 93 in total, meaning that 10 requests are made (`fromp=1` through `fromp=10`).
```py
>>> from sgex.job import Job

# run job
>>> j = Job(params={"call_type": "View", "corpname": "susanne", "q": 'aword,"work"', "fromp": 1, "pagesize": 10})
>>> j.run_repeat(max_pages=0)  # optionally set max_pages to stop after n pages

# the 93 concordances were retrieved in 10 calls
>>> j.data.view[0].response.json()["concsize"] == 93
True
>>> len(j.data.view) == 10
True
```
Manipulating data
Response data can be manipulated by accessing the lists of calls stored in `Job.data`. A few methods are included so far, such as `Freqs.df_from_json()`, which transforms a JSON frequency query to a DataFrame.
```py
# convert frequency JSON to a Pandas DataFrame
>>> from sgex.job import Job
>>> j = Job(
...     params={
...         "call_type": "Freqs",
...         "corpname": "susanne",
...         "fcrit": "doc.file 0",
...         "q": 'alemma,"bird"'})
>>> j.run()
>>> df = j.data.freqs[0].df_from_json()
>>> df.head(3)
   frq         rel        reltt        fpm value attribute     arg nicearg corpname total_fpm total_frq fmaxitems
0    7  3625.97107  2892.561983  46.534509   A11  doc.file  "bird"    bird  susanne     79.77        12      None
1    2  1093.37113   872.219799  13.295574   N08  doc.file  "bird"    bird  susanne     79.77        12      None
2    1   525.59748   419.287212   6.647787   G04  doc.file  "bird"    bird  susanne     79.77        12      None
```
Next steps
A few more considerations for doing corpus linguistics with SGEX.
The Data and Call classes
Data is a dataclass used in Job to store API data and make associated methods easily available. Whenever requests are made from a list of dictionary parameters, the responses are automatically sorted by call_type. Each call type has a list, which gets appended each time a request is made. These lists of responses can be processed using methods shared by a given call type.
Every call type is a subclass of the `call.Call` base class. All calls share some universal methods, including simple parameter verification to reduce API errors. Every subclass (`Freqs`, `View`, `CorpInfo`, etc.) can also have its own methods for data processing tasks. These methods tend to focus on manipulating JSON data, since JSON is the only format every call type returns; manipulating other response formats like CSV is also possible.
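As a quick sketch of this structure, here's the mixed-call job from earlier again; the attribute names and `.response` interface are those shown in the examples above:

```py
from sgex.job import Job

# run a job with two call types, as in "Making multiple calls"
j = Job(params=[
    {"call_type": "CorpInfo", "corpname": "susanne"},
    {"call_type": "Collx", "corpname": "susanne", "q": 'alemma,"bird"'}])
j.run()

# responses are sorted by call type: each type gets its own list on Job.data
for responses in (j.data.corpinfo, j.data.collx):
    for call in responses:
        # each call in a list holds its parsed JSON response
        print(type(call).__name__, sorted(call.response.json())[:5])
```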
At least while SGEX is in beta, existing methods aren't stable for production purposes: using your own custom method, like the following example, is a safer bet.
Custom corpus data manipulation techniques
Adding custom methods to a call type is easy:
```py
>>> from sgex.job import Job
>>> from sgex.call import CorpInfo

# write a new method
>>> def new_method_from_json(self) -> str:
...     """Returns a string of corpus structures."""
...     self.check_format()
...     _json = self.response.json()
...     return " ".join([k.replace("count", "") for k in _json.get("sizes").keys()])
>>> CorpInfo.new_method_from_json = new_method_from_json

# run the job
>>> j = Job(
...     clear_cache=True,
...     params={"call_type": "CorpInfo", "corpname": "susanne", "struct_attr_stats": 1})
>>> j.run()

# use the method
>>> j.data.corpinfo[0].new_method_from_json()
'token word doc par sent'
```
Feel free to suggest more methods for call types if you think they're useful. Be sure to explain the purpose and required parameters in the docstring (e.g., "Requires a corp_info call with these parameters: {'x': 'y'}").
Request throttling
Wait periods are added between calls with a wait_dict that defines the required increments for a number of calls. This is the standard dictionary, following SkE's fair use policy:
```py
wait_dict = {"0": 9, "0.5": 99, "4": 899, "45": None}
```
In other words:
- no wait applied for 9 calls or fewer
- 0.5 seconds added for up to 99 calls
- 4 seconds added for up to 899 calls
- 45 seconds added for any number above that
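To make the increments concrete, here's a minimal sketch of the lookup logic (`resolve_wait` is a hypothetical helper, not SGEX's internal code):

```py
def resolve_wait(n_calls: int, wait_dict: dict) -> float:
    """Return the smallest wait whose call threshold covers n_calls (None = no cap)."""
    for wait, threshold in sorted(wait_dict.items(), key=lambda kv: float(kv[0])):
        if threshold is None or n_calls <= threshold:
            return float(wait)

wait_dict = {"0": 9, "0.5": 99, "4": 899, "45": None}
assert resolve_wait(9, wait_dict) == 0      # no wait for 9 calls or fewer
assert resolve_wait(100, wait_dict) == 4    # 4 seconds for up to 899 calls
assert resolve_wait(1000, wait_dict) == 45  # 45 seconds above that
```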
Asynchronous calling
The aiohttp package is used to implement async requests.
This is activated with `Job(thread=True)`; it's usable with local servers only.
The number of connections for async calling is adjustable by adding a kwarg when running a job. The default of 20 should increase rates while reducing errors, although this depends on how many calls are made, their complexity, and the hardware.
```py
Job.run(connector=aiohttp.TCPConnector(limit_per_host=int))
```
If a large asynchronous job raises a few exceptions caused by the server struggling to handle requests, it's often simpler to just run the job again: this retries failed calls and loads successful ones from the cache. Adjusting the `connector` to eliminate one or two exceptions out of 1,000 calls isn't necessary. If calls are complex and the corpus is large, sequential calling might be the best option.
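A runnable version of the schematic call above, with illustrative values (assumes a local server with the susanne corpus):

```py
import aiohttp
from sgex.job import Job

# several simple collocation calls, run asynchronously
j = Job(
    thread=True,
    params=[{"call_type": "Collx", "corpname": "susanne", "q": f'alemma,"{w}"'}
            for w in ("bird", "dog", "fish")])
# lower the per-host connection limit from the default of 20
j.run(connector=aiohttp.TCPConnector(limit_per_host=10))
```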
Getting different data formats
Data can be retrieved in JSON, XML, CSV or TXT formats with `Job(params={"format": "csv"})`, etc. Only JSON is universal: most API call types can only return some of these formats.
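For instance, here's a sketch of a frequency call returning CSV, read via the `.text` attribute shown earlier (assumes a local server with the susanne corpus):

```py
from sgex.job import Job

# request CSV instead of the default JSON
j = Job(params={
    "call_type": "Freqs",
    "corpname": "susanne",
    "fcrit": "doc.file 0",
    "q": 'alemma,"bird"',
    "format": "csv"})
j.run()
print(j.data.freqs[0].response.text[:200])  # raw CSV string
```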
How caching works
A simple filesystem cache stores response data. Files are named with a hashing function and accompanied by response metadata. Once a call is cached, identical requests are loaded from the cache. Calls with `format="json"` and no exceptions or SkE errors get cached. Data in other formats (CSV, XML) is always cached, since error handling isn't implemented for them.
Response data can include credentials in several locations. SGEX strips credentials from URLs and JSON data before caching, although inspecting data before sharing it is still prudent.
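As a sketch of where a given call lands on disk, this combines the default cache directory with the hashes described below (the glob pattern is an assumption based on the `data/<hash>.json` naming above):

```py
from pathlib import Path
from sgex.job import Job

j = Job(params={"call_type": "CorpInfo", "corpname": "susanne"})
j.run()

# cache files are named by the call's hash; look them up in the default dir
h = j.data.corpinfo[0].hash()
print(list(Path("data").glob(f"{h}*")))
```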
Simple queries
`simple_query` approximates SkE's simple query type: enter a phrase and a CQL rule is returned. The search below uses double hyphens to include tokens with or without hyphens or spaces; wildcard tokens are also possible.
```py
>>> from sgex.query import simple_query
>>> simple_query("home--made * recipe")
'( [lc="homemade" | lemma_lc="homemade"] | [lc="home" | lemma_lc="home"] [lc="made" | lemma_lc="made"] | [lc="home-made" | lemma_lc="home-made"] | [lc="home" | lemma_lc="home"] [lc="-" | lemma_lc="-"] [lc="made" | lemma_lc="made"] ) [lc=".*" | lemma_lc=".*"] [lc="recipe" | lemma_lc="recipe"]'
```
Fuzzy queries
`fuzzy_query` takes a sentence or longer phrase and converts it into a more forgiving CQL rule. This can be helpful to relocate an extracted concordance or find similar results elsewhere. The returned string is formatted to work with `word` or `word_lowercase` as a default attribute.
```py
>>> from sgex.query import fuzzy_query
>>> fuzzy_query("Before yesterday, it was fine, don't you think?")
'"Before" "yesterday" []{,1} "it" "was" "fine" []{,3} "you" "think"'
>>> fuzzy_query("We saw 1,000.99% more visitors at www.example.com yesterday")
'"We" "saw" []{,6} "more" "visitors" "at" []{,2} "yesterday"'
```
Numbers, URLs and other challenging tokens are parsed to some extent, but these can prevent `fuzzy_query` from finding concordances.
Checking hashes
To cache data, each unique call is identified by hashing an ordered JSON representation of its parameters. Hashes can be derived from input data (the parameters you write) and from response data (the parameters as stored in a JSON API response). Hashes can be accessed like so:
```py
>>> from sgex.job import Job
>>> from sgex.call import CorpInfo

# get shortened hash from input parameters
>>> c = CorpInfo({"corpname": "susanne", "struct_attr_stats": 1})
>>> c.hash()[:7]
'9c28c7a'

# send request
>>> j = Job(
...     params={"call_type": "CorpInfo", "corpname": "susanne", "struct_attr_stats": 1})
>>> j.run()

# get shortened hash from response
>>> j.data.corpinfo[0].hash()[:7]
'9c28c7a'
```
Adding a timeout / changing aiohttp behavior
Timeouts are disabled for the local server, which lets expensive queries run as needed. Other servers use the aiohttp default of 5 minutes. Enforce a custom timeout by adding it to Job kwargs. (Use this technique to pass other args to the aiohttp session as well.)
```py
>>> from sgex.job import Job
>>> import aiohttp

# add a very short timeout for testing
>>> timeout = aiohttp.ClientTimeout(sock_read=0.01)

# design a call with a demanding CQL query
>>> j = Job(
...     params={
...         "call_type": "Collx",
...         "corpname": "susanne",
...         "q": "alemma,[]{,10}"})

# run with additional session args
>>> j.run(timeout=timeout)

# check for timeout exceptions: j.errors is a list of (error, call, index) tuples
>>> isinstance(j.errors[0][0], aiohttp.client_exceptions.ServerTimeoutError)
True
```
Even if a request is timed out by the client, the server may still try to compute results (and continue taking up resources on a local machine, causing unexpected exceptions).
Example: make a stratified random sample with a series of API calls
Data from different call types can be combined to construct more complex queries and custom operations. For example, the random sample feature in Sketch Engine's interface uses simple randomization, yet some analyses might require a stratified sampling technique (taking a separate random sample for each category of a text type). This can be done with the code below.
This includes three API call types:
- a `CorpInfo` call
- an `AttrVals` call
- a series of `View` calls (64, requested concurrently)
It retrieves 5 random concordances for each doc.file text type in the susanne corpus for cases of "the" and a following token.
```py
>>> from sgex.job import Job

# 1. check the corpus's attributes
>>> j = Job(params={"call_type": "CorpInfo", "corpname": "susanne", "struct_attr_stats": 1})
>>> j.run()
>>> j.data.corpinfo[0].structures_from_json()
  structure attribute  size
0      font      type     2
0      head      type     2
0       doc      file    64
1       doc         n    12
2       doc wordcount     1

# 2. get values for one text type
# (make sure avmaxitems is >= the size of the text type)
>>> j0 = Job(params={"call_type": "AttrVals", "corpname": "susanne", "avattr": "doc.file", "avmaxitems": 1000000})
>>> j0.run()
>>> values = j0.data.attrvals[0].response.json()["suggestions"]

# the query ['a<attr>,"<value>"', "r<sample size>"]
>>> q = ['alemma,"the" []', "r5"]
>>> call_template = {
...     "call_type": "View",
...     "corpname": "susanne",
...     "viewmode": "sen",
...     "pagesize": 1000,  # make greater than "r<n>" to get everything in one request
...     "attrs": "word,tag,lemma",
...     "attr_allpos": "all"}

# generate list of calls
>>> calls = []
>>> for value in values:
...     within = f' within <doc file="{value}" />'
...     calls.append(call_template | {"q": [q[0] + within, q[1]]})

# 3. execute job
>>> j1 = Job(params=calls, thread=True)
>>> j1.run()

# process data as needed
# (print the query for the first sample)
>>> print(j1.data.view[0].response.json()["request"]["q"])
['alemma,"the" [] within <doc file="A01" />', 'r5']

# (print the KWICs for the first sample)
>>> for x in range(5):
...     kwic = j1.data.view[0].response.json()["Lines"][x]["Kwic"]
...     tokens = []
...     for dt in kwic:
...         if dt["class"] != "attr":
...             tokens.append(dt["str"].strip())
...     print(" ".join(tokens))
the jury
the Fulton
the state
the recommendations
the jury
```
Running as a script
If the repo is cloned and is the current working directory, SGEX can be run as a script:
```sh
# gets collocation data from the Susanne corpus for the lemma "bird"
python sgex/job.py -p '{"call_type": "Collx", "corpname": "susanne", "q": "alemma,\"bird\""}'
```
Basic commands are available for downloading data when running SGEX as a script. For example, one could read a list of API calls from a file (`-i "<myfile.json>"`) and send requests to the SkE server (`-s "ske"`), as in the sketch below. More complex tasks still require importing modules in Python.
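A sketch of that combination with placeholder values (the file name is hypothetical):

```sh
# read calls from a file and send them to the main SkE server
python sgex/job.py -i "my_calls.json" -s "ske" -k "<api_key>" -u "<username>"
```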
Run SGEX with --help for up-to-date options.
```sh
python sgex/job.py --help

usage: SGEX [-h] [-k API_KEY] [--cache-dir CACHE_DIR] [--clear-cache]
            [--data DATA] [--default-servers DEFAULT_SERVERS] [--dry-run]
            [-i [INFILE ...]] [-p [PARAMS ...]] [-s SERVER] [-x]
            [-u USERNAME] [-w WAIT_DICT]
```
|arg|example|description|
|---|---|---|
| -k --api-key | "1234" | API key, if required by server |
| --cache-dir | "data" (default) | cache directory location |
| --clear-cache | (disabled by default) | clear the cache directory (ignored if --dry-run) |
| --data | (reserved) | placeholder for API call data |
| --default-servers | '{"server_name": "URL"}' | settings for default servers |
| --dry-run | (disabled by default) | print job settings |
| -i --infile | "api_calls.json" | file(s) to read calls from |
| -p --params | '{"call_type": "Collx", "corpname": "susanne","q": "alemma,\"bird\""}' | JSON/YAML string(s) with a dict of params |
| -s --server | "local" (default) | local, ske or a URL to another server |
| -x, --thread | (disabled by default) | run asynchronously, if allowed by server |
| -u --username | "J. Doe" | API username, if required by server |
| -v --verbose | (disabled by default) | print details while running |
| -w --wait-dict | '{"0": 10, "1": null}' | custom wait periods: no wait for 10 calls or fewer, 1 second above that |
Environment variables
Environment variables can be set by exporting them or using an .env file. As env variables, argument names are converted to uppercase with a `SGEX_` prefix (e.g., `--api-key` becomes `SGEX_API_KEY`).
Example file
```bash
# .env
SGEX_API_KEY="<key>"
```
Example usage
```bash
# export variables in .env
set -a && source .env && set +a

# run SGEX
python sgex/job.py # add args here

# unset variables
unset ${!SGEX_*}
```
About
SGEX has been developed to meet research needs at the University of Granada Translation and Interpreting Department. See the LexiCon research group for related projects.
The name refers to sketch grammars, which are series of generalized corpus queries in Sketch Engine (see their bibliography).
Questions, suggestions and support are welcome.
Citation
If you use SGEX, please cite it. This paper introduces the package in the context of doing collocation analysis:
```bibtex
@inproceedings{isaacsAggregatingVisualizingCollocation2023,
  address = {Lisbon, Portugal},
  title = {Aggregating and {Visualizing} {Collocation} {Data} for {Humanitarian} {Concepts}},
  url = {https://ceur-ws.org/Vol-3427/short11.pdf},
  booktitle = {Proceedings of the 2nd {International} {Conference} on {Multilingual} {Digital} {Terminology} {Today} ({MDTT} 2023)},
  publisher = {CEUR-WS},
  author = {Isaacs, Loryn and León-Araúz, Pilar},
  editor = {Di Nunzio, Giorgio Maria and Costa, Rute and Vezzani, Federica},
  year = {2023},
}
```
See Zenodo for citing specific versions of the software.
Owner
- Login: engisalor
- Kind: user
- Repositories: 2
- Profile: https://github.com/engisalor
Citation (CITATION.cff)
```yaml
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: Isaacs
  given-names: Loryn
  orcid: "https://orcid.org/0000-0003-0267-4853"
title: "Sketch Grammar Explorer"
version: 0.7.5 # x-release-please-version
date-released: 2022-07-08
repository-code: "https://github.com/engisalor/sketch-grammar-explorer"
license: bsd-3-clause
doi: 10.5281/zenodo.6812334
```
GitHub Events
Total
- Watch event: 3
Last Year
- Watch event: 3
Packages
- Total packages: 1
- Total downloads: 56 last-month (pypi)
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 15
- Total maintainers: 1
pypi.org: sgex
Sketch Grammar Explorer (Sketch Engine API wrapper)
- Documentation: https://sgex.readthedocs.io/
- License: BSD 3-Clause, Copyright (c) 2022, Loryn Isaacs
- Latest release: 0.7.5 (published over 1 year ago)
Rankings
Maintainers (1)
Dependencies
- PyYAML ==6.0
- SecretStorage ==3.3.2
- certifi ==2022.5.18.1
- cffi ==1.15.0
- charset-normalizer ==2.0.12
- cryptography ==37.0.2
- idna ==3.3
- importlib-metadata ==4.11.4
- jeepney ==0.8.0
- keyring ==23.5.0
- numpy ==1.22.4
- pandas ==1.4.2
- pycparser ==2.21
- python-dateutil ==2.8.2
- pytz ==2022.1
- requests ==2.27.1
- six ==1.16.0
- urllib3 ==1.26.9
- zipp ==3.8.0
- google-github-actions/release-please-action v3 composite
- pandas *
- pyyaml *
- requests *
- requests-cache *