https://github.com/converged-computing/jobspec-database
Database of jobs, starting with Slurm (under development)
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file (found codemeta.json file)
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.0%) to scientific vocabulary
Repository
Database of jobs, starting with Slurm (under development)
Basic Info
- Host: GitHub
- Owner: converged-computing
- License: mit
- Language: Shell
- Default Branch: main
- Size: 1.01 GB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 1
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Job Specification Database
This database is under development!
It will eventually be added to Dinosaur Datasets.
Usage
The data files are organized by repository in data. These instructions are for generation. Create a python environment and install dependencies:
bash
pip install -r requirements.txt
Make a "drivers" directory inside of scripts and download the chromedriver (matching your browser) into it. Then run the parsing script, customizing the matrix of search terms. Before you start, close all browsers and be prepared to log in to GitHub.
bash
cd scripts/
python search.py
Then download files, from the root, targeting the output file of interest.
bash
python scripts/get_jobspecs.py ./scripts/data/raw-links-may-23.json --outdir ./data
Note that the data so far is just a trial run! The first run produced 11k+ unique results.
The second run brought that up to 19544, and after I added more applications it reached 25k about halfway through the run.
The current total is 31932 scripts. I didn't add the last run of flux because I saw what I thought were false positives.
Also try to get associated GitHub files.
bash
python scripts/get_jobspec_configs.py
Analysis
1. Word2Vec
Word2Vec is a little old, and I think a flaw is that it is combining jobspecs. But if we have the window the correct size, we can make associations between close terms. The space I'm worried about is the beginning of one script and the end of another, and maybe a different approach or strategy could help with that. To generate the word2vec embeddings you can run:
bash
python scripts/word2vec.py --input ./data
Updates to the above on June 9th:
- Better parsing to tokenize
- we combine by space instead of empty space so words at end are not combined (this was a bug)
- punctuation that should be replaced by space instead of empty space honored (dashes, underscore, etc)
- hash bangs for shell parsed out
- better tokenization and recreation of content
- each script is on one line (akin to how done for word2vec)
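The parsing updates above can be sketched roughly as follows. This is a minimal illustration and not the code in scripts/word2vec.py: the function name `tokenize_script` and the exact punctuation set are assumptions.

```python
import re

# Characters treated as separators (an assumption; the real script may differ)
PUNCTUATION = re.compile(r"[-_=/.,:;()\[\]{}$|&<>]")


def tokenize_script(text):
    """Return one space-joined line of tokens for a single job script."""
    tokens = []
    for line in text.splitlines():
        # Parse out the shell hash bang entirely
        if line.startswith("#!"):
            continue
        # Replace punctuation with a space (not an empty string) so that
        # neighboring words are not glued together
        line = PUNCTUATION.sub(" ", line)
        tokens.extend(line.lower().split())
    # Join with a single space so the last word of one line and the first
    # word of the next stay separate; the whole script ends up on one line
    return " ".join(tokens)


script = "#!/bin/bash\n#SBATCH --job-name=my_job\nsrun python train.py\n"
print(tokenize_script(script))
```

Joining on a space rather than an empty string is the point of the bug fix noted in the list: without it, the final word of one line fuses with the first word of the next.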
I think it would be reasonable to create a similarity matrix, specifically cosine distance between the vectors. This will read in the metadata.tsv and vectors.tsv we just generated.
bash
python scripts/vector_matrix.py --vectors ./scripts/data/combined/vectors.tsv --metadata ./scripts/data/combined/metadata.tsv
The above does the following:
- We start with our jobspecs that are tokenized according to the above.
- We further remove anything that is purely numerical
- We use TF-IDF to reduce the feature space to 300 terms
- We do a clustering of these terms to generate the resulting plot.
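As a rough sketch of the filtering and feature-reduction steps above, using only the standard library. The function names are hypothetical, and ranking terms by summed TF-IDF is an assumption about how the 300 terms might be chosen; scripts/vector_matrix.py may do this differently (for example with a library vectorizer).

```python
import math
from collections import Counter


def drop_numeric(tokens):
    """Remove tokens that are purely numerical."""
    return [t for t in tokens if not t.isdigit()]


def top_tfidf_terms(documents, max_terms=300):
    """Rank terms by summed TF-IDF across documents and keep the top ones."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    scores = Counter()
    for tokens in documents:
        if not tokens:
            continue
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            scores[term] += tf * idf
    return [term for term, _ in scores.most_common(max_terms)]


def cosine_distance(a, b):
    """Cosine distance (1 - cosine similarity) between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        return 1.0
    return 1.0 - dot / norm
```

Computing `cosine_distance` for every pair of term vectors gives the similarity matrix that the clustering step would consume.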
The hardest thing is just seeing all the terms. I messed with JavaScript for a while but gave up for the time being: the data is too big for the browser, and we likely need to use canvas.
2. Directive Counts
I thought it would be interesting to explicitly parse the directives. That's a bit hard, but I took a first shot:
bash
python scripts/parse_directives.py --input ./data
console
Assessing 33851 conteder jobscripts...
Found (and skipped) 535 duplicates.
You can find tokenized lines (with one jobspec per line), the directive counts, and the dictionary and skips in scripts/data/combined/
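To give a feel for what directive parsing involves, here is a small sketch. It is not scripts/parse_directives.py: the directive prefixes are an illustrative subset, and skipping duplicates by content hash is an assumption about how the duplicates above were detected.

```python
import hashlib
import re
from collections import Counter

# Directive prefixes for a few job managers (an illustrative subset)
DIRECTIVE = re.compile(r"^#\s*(SBATCH|PBS|BSUB|COBALT|OAR)\s+(.*)$")


def count_directives(scripts):
    """Count (manager, option) pairs across scripts, skipping exact duplicates."""
    seen = set()
    counts = Counter()
    duplicates = 0
    for text in scripts:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:
            duplicates += 1
            continue
        seen.add(digest)
        for line in text.splitlines():
            match = DIRECTIVE.match(line.strip())
            if not match:
                continue
            manager, rest = match.groups()
            parts = rest.split()
            if not parts:
                continue
            # Keep only the option name, e.g. "--nodes" from "--nodes=2"
            counts[(manager, parts[0].split("=")[0])] += 1
    return counts, duplicates
```

The hard part mentioned above shows up quickly in practice: options can be written as `--nodes=2`, `--nodes 2`, or `-N 2`, and this sketch treats those as different keys.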
3. Adding Topics or More Structure
I was thinking about adding doc2vec, because word2vec is likely making associations between terms in different documents. But I don't think anyone is using doc2vec anymore, because the examples I'm finding use a deprecated version of tensorflow that has functions long removed. We could use the old gensim version, but I think it might be better to think of a more modern approach. I decided to try top2vec.
```bash
# Using pretrained model (not great because not jobscript terms)
python scripts/run_top2vec.py

# Build with doc2vec - be careful we set workers and learn mode (slower) here
# started at 7pm
python3 scripts/run_top2vec_with_doc2vec.py --speed learn
python3 scripts/run_top2vec_with_doc2vec.py --speed deep-learn
```
And then to explore (finding matches for a subset of words):
```bash
python3 scripts/explore_top2vec.py
python3 scripts/explore_top2vec.py --outname top2vec-jobspec-database-learn.md --model ./scripts/data/combined/wordclouds/top2vec-with-doc2vec-learn.model

# Deep learn (highest quality vectors), takes about 6-7 hours to run on a 128 GB RAM CPU instance
python3 scripts/explore_top2vec.py --outname top2vec-jobspec-database-deep-learn.md --model ./scripts/data/combined/wordclouds/top2vec-with-doc2vec-deep-learn.model
```
For word2vec:
- continuous bag of words: we create a window around the word and predict the word from the context
- skip gram: we create the same window but predict the context from the word (supposedly slower but better results)
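The difference between the two modes is easiest to see in the training examples each one produces. The sketch below only builds those examples (it does not train anything), and the function names are made up for illustration.

```python
def context_window(tokens, index, window):
    """Tokens within `window` positions of tokens[index], excluding itself."""
    start = max(0, index - window)
    left = tokens[start:index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window=2):
    """(context, target) pairs: predict the word from its context."""
    return [(context_window(tokens, i, window), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=2):
    """(target, context word) pairs: predict each context word from the word."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


tokens = ["srun", "python", "train"]
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```

Building the examples one script at a time, as here, also keeps a window from spanning the end of one script and the beginning of another.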
I had to run this on a large VM for it to work. See the topics in scripts/data/combined/wordclouds. We can likely tweak everything but I like how this tool is approaching it (see docs in ddangelov/Top2Vec).
4. Gemini
We can run Gemini across our 33K jobspecs to generate a templatized output for each one:
bash
python scripts/classify-gemini.py
That takes a little over a day to run, and it will cost about $25-$30 per run. I did two runs for about $55. Then we can check the model: normalize and visualize the resources that we parsed, and compare them to what Gemini says.
bash
python scripts/process-gemini.py
You can then see the data output in scripts/data/gemini-with-template-processed or use this script to visualize results that are filtered to those with all, missing, or some wrong values:
```bash
pip install rich
python scripts/inspect-gemini.py

# How to customize
python scripts/inspect-gemini.py --type missing
python scripts/inspect-gemini.py --type wrong

# Print more than 1
python scripts/inspect-gemini.py --type all --number 3
```
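One way to think about the "missing" and "wrong" filters is as a field-by-field comparison between the resources we parsed and the ones Gemini reported. The sketch below is a guess at that logic, not the code in scripts/inspect-gemini.py; the function name, the field names (`nodes`, `gpus`), and the string comparison are all assumptions.

```python
def compare_resources(parsed, predicted):
    """Label a prediction as 'ok', 'missing', or 'wrong' against parsed values."""
    # A field is missing when we parsed it but the model did not report it
    missing = [key for key in parsed if predicted.get(key) is None]
    # A field is wrong when the model reported a different value
    wrong = [
        key for key in parsed
        if predicted.get(key) is not None and str(predicted[key]) != str(parsed[key])
    ]
    if missing:
        status = "missing"
    elif wrong:
        status = "wrong"
    else:
        status = "ok"
    return status, missing, wrong


# Hypothetical values for illustration only
print(compare_resources({"nodes": 2, "gpus": 4}, {"nodes": "2", "gpus": 8}))
```

Comparing as strings is a deliberate simplification here, since a model may return `"2"` where the parser produced `2`.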
4. Cyclomatic Complexity
Next, we want to calculate the cyclomatic complexity. Since these are akin to bash scripts, we can use shellmetrics. It's not perfect, but I did a few spot checks and the result was what I'd want or expect: the more complex scripts (with arrays, etc.) got a higher score. Since we know our database on LC is now in S3, let's instead write this to a SQLite file with a table that can be queried based on path, sha1, or sha256. First, make sure the binary is on your path:
bash
mkdir -p ./bin
curl -fsSL https://git.io/shellmetrics > ./bin/shellmetrics
chmod +x ./bin/shellmetrics
export PATH=$PWD/bin:$PATH
Here is example output, when run manually. Note that I think we want the first section, which has the CCN ("cyclomatic complexity number") for main, which is the main chunk. In the csv, that is the middle block, in the ccn column ("1").
```console
$ shellmetrics data/abdullahrkw/FAU-FAPS/ViT/run-job.sh
LLOC  CCN  Location
   5    1  <main> data/abdullahrkw/FAU-FAPS/ViT/run-job.sh
1 file(s), 1 function(s) analyzed. [bash 5.1.16(1)-release]
==============================================================================
 NLOC    NLOC  LLOC    LLOC    CCN  Func  File (lines:comment:blank)
total     avg total     avg    avg   cnt
    5    5.00     5    5.00   1.00     1  data/abdullahrkw/FAU-FAPS/ViT/run-job.sh (20:14:1)
==============================================================================
 NLOC    NLOC  LLOC    LLOC    CCN  Func  File  lines  comment  blank
total     avg total     avg    avg   cnt   cnt  total    total  total
    5    5.00     5    5.00   1.00     1     1     20       14      1
```
```console
$ shellmetrics --csv data/abdullahrkw/FAU-FAPS/ViT/run-job.sh
file,func,lineno,lloc,ccn,lines,comment,blank
"data/abdullahrkw/FAU-FAPS/ViT/run-job.sh","
```
Next, generate a database for files in data.
bash
python scripts/cyclomatic-complexity.py --input ./data --db ./scripts/data/cyclomatic-complexity-github.db
IMPORTANT For the above and complexity calculation below, duplicates are not removed. We store the sha256 and sha1 so you can do this!
5. LC Jobspec Database
This database is kind of messy - not sure I like it as much as the one I generated. Someone else can deal with it :)
- Total unique jobspec jsons: 210351
- Total with BatchScript: 116117
bash
cd ./lc
python scripts/cyclomatic-complexity.py --input ./raw/jobdata_json --db ./data/cyclomatic-complexity-lc.db
IMPORTANT Since this is a combination of json and .tar files (for which we extract members) the database has an extra column for the jobid, and the original filename path corresponds to the file here. The file that we actually read is parsed from the BatchScript directive of the json file, which is only the batch portion of the data to match what we use in GitHub.
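A minimal sketch of pulling the batch portion out of both kinds of input, assuming each job JSON is an object with a top-level `BatchScript` key and that tar members of interest end in `.json`. Both are assumptions about the LC data layout, and the function names are hypothetical.

```python
import io
import json
import tarfile


def batch_script_from_json(raw):
    """Return the BatchScript text from one job JSON document, or None."""
    record = json.loads(raw)
    return record.get("BatchScript")


def iter_batch_scripts(tar_bytes):
    """Yield (member name, BatchScript) for each JSON member of a tar archive."""
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as archive:
        for member in archive.getmembers():
            if not member.isfile() or not member.name.endswith(".json"):
                continue
            raw = archive.extractfile(member).read()
            script = batch_script_from_json(raw)
            if script:
                yield member.name, script
```

Members without a `BatchScript` are skipped, which is consistent with the count of jobspecs with a BatchScript being smaller than the total above.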
Reading Sqlite Databases
Examples to read in the two databases:
```python
import sqlite3

conn = sqlite3.connect("scripts/data/cyclomatic-complexity-github.db")
cursor = conn.cursor()

# This gets the field names and metadata
cursor.execute('PRAGMA table_info(jobspecs);').fetchall()
```
```console
[(0, 'id', 'INTEGER', 0, None, 1),
 (1, 'name', 'TEXT', 0, None, 0),
 (2, 'sha256', 'TEXT', 0, None, 0),
 (3, 'sha1', 'TEXT', 0, None, 0),
 (4, 'ccn', 'NUMBER', 0, None, 0)]
```
And this gets the jobspecs (one for example)
```python
query = cursor.execute("SELECT * from jobspecs;")
query.fetchone()

# fetchall() returns the rows not yet consumed by fetchone()
rows = query.fetchall()
```
```console
(1,
 './data/ZIYU-DEEP/reprover-test/2gpu.sh',
 '48ef130f0700b606c3b5d4b2a784cd78f97439b935bd7d7df4673d9683d420e1',
 'c88fdc41ae091734565c63ed67c51a45de66a07f',
 1)
```
Don't forget to close.
python
conn.close()
And don't forget LC will have an extra field, for the name of the file plus the jobid since some are members in a tar.
python
conn = sqlite3.connect("lc/data/cyclomatic-complexity-lc.db")
cursor = conn.cursor()
cursor.execute('PRAGMA table_info(jobspecs);').fetchall()
console
[(0, 'id', 'INTEGER', 0, None, 1),
(1, 'name', 'TEXT', 0, None, 0),
(2, 'jobid', 'TEXT', 0, None, 0),
(3, 'sha256', 'TEXT', 0, None, 0),
(4, 'sha1', 'TEXT', 0, None, 0),
(5, 'ccn', 'NUMBER', 0, None, 0)]
python
conn.close()
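Since duplicates are kept in these tables, one way to drop them at query time is to group on sha256. The sketch below runs against an in-memory table that mimics the GitHub schema shown above; the rows are made up for illustration.

```python
import sqlite3

# In-memory stand-in for cyclomatic-complexity-github.db (rows are made up)
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE jobspecs (id INTEGER PRIMARY KEY, name TEXT, "
    "sha256 TEXT, sha1 TEXT, ccn NUMBER)"
)
rows = [
    ("./data/a/run.sh", "aaa", "111", 1),
    ("./data/b/run.sh", "aaa", "111", 1),  # same content as the first row
    ("./data/c/job.sh", "bbb", "222", 5),
]
conn.executemany(
    "INSERT INTO jobspecs (name, sha256, sha1, ccn) VALUES (?, ?, ?, ?)", rows
)

# One row per unique script content
unique = conn.execute(
    "SELECT MIN(name), sha256, MAX(ccn) FROM jobspecs "
    "GROUP BY sha256 ORDER BY sha256"
).fetchall()
mean_ccn = sum(ccn for _, _, ccn in unique) / len(unique)
print(unique, mean_ccn)
conn.close()
```

To run the same query on the real data, connect to the database path instead of `":memory:"` and skip the table creation and inserts.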
6. Summarizing
We want to look at, for each database, what we have for:
- applications
- job managers
- length of jobs
The first two will be tags based on presence of directives. The last will be a calculation.
bash
python scripts/summarize-jobspecs.py --input ./data --db ./scripts/data/jobspec-summary-github.db
console
{'slurm': 23958,
'pbs': 6381,
'lsf': 2614,
'oar': 145,
'flux': 74,
'cobalt': 630}
This saves to scripts/data/jobspec-summary
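Tagging a job manager by the presence of its directives can be sketched like this. The patterns are illustrative guesses rather than the ones in scripts/summarize-jobspecs.py, and taking `length` to be the number of characters is also an assumption.

```python
import re

# Directive prefix per job manager; the patterns are illustrative assumptions
MANAGER_PATTERNS = {
    "slurm": re.compile(r"^#[ \t]*SBATCH\b", re.MULTILINE),
    "pbs": re.compile(r"^#[ \t]*PBS\b", re.MULTILINE),
    "lsf": re.compile(r"^#[ \t]*BSUB\b", re.MULTILINE),
    "oar": re.compile(r"^#[ \t]*OAR\b", re.MULTILINE),
    "flux": re.compile(r"^#[ \t]*(FLUX|flux):", re.MULTILINE),
    "cobalt": re.compile(r"^#[ \t]*COBALT\b", re.MULTILINE),
}


def summarize(text):
    """Return job manager tags and a length for one job script."""
    tags = [
        name for name, pattern in MANAGER_PATTERNS.items()
        if pattern.search(text)
    ]
    # Length as number of characters (an assumption; it could also be lines)
    return {"manager_tags": tags, "length": len(text)}


print(summarize("#!/bin/bash\n#SBATCH -N 1\nsrun hostname\n"))
```

A script can carry more than one tag, for example when it includes directives for two managers so that it can be submitted to either.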
Reading Sqlite Database
Example to read in the database:
```python
import sqlite3

conn = sqlite3.connect("./scripts/data/jobspec-summary-github.db")
cursor = conn.cursor()

# This gets the field names and metadata
cursor.execute('PRAGMA table_info(jobspecs);').fetchall()
```
```console
[(0, 'id', 'INTEGER', 0, None, 1),
 (1, 'name', 'TEXT', 0, None, 0),
 # job manager tags
 (2, 'manager_tags', 'TEXT', 0, None, 0),
 # these are tags from gemini without a template
 (3, 'software_tags', 'TEXT', 0, None, 0),
 # These were tags with a template (nicer / cleaner set)
 (4, 'software_tags_with_template', 'TEXT', 0, None, 0),
 (5, 'length', 'NUMBER', 0, None, 0)]
```
Each of manager_tags and software_tags is a JSON-dumped list.
```python
query = cursor.execute("SELECT * from jobspecs;")
result = query.fetchone()

# fetchall() returns the rows not yet consumed by fetchone()
rows = query.fetchall()
```
```console
(1,
 './data/ZIYU-DEEP/reprover-test/2gpu.sh',
 # job manager tags
 '["slurm"]',
 # these are tags from gemini without a template
 '["rs", "head", "scontrol", "yt", "distributed", "node", "les", "main", "an", "run", "generator", "ame", "at", "srun", "nam", "python", "bash", "gp", "gt", "aria", "aml", "scrip", "ct", "re", "tact", "ed", "su", "vi", "tac", "ic", "li", "hostname", "ip", "go", "train", "env", "bin", "os", "export", "cc", "docker", "random", "training", "ml", "tr", "sh", "yaml", "hon", "gene", "od", "gpu", "host"]',
 # These were tags with a template (nicer / cleaner set)
 '["python", "bash", "srun", "docker", "r"]',
 733)
```
To read in the tags:
console
import json
print(json.loads(result[2]))
print(json.loads(result[3]))
print(json.loads(result[4]))
['slurm']
['rs', 'head', 'scontrol', 'yt', 'distributed', 'node', 'les', 'main', 'an', 'run', 'generator', 'ame', 'at', 'srun', 'nam', 'python', 'bash', 'gp', 'gt', 'aria', 'aml', 'scrip', 'ct', 're', 'tact', 'ed', 'su', 'vi', 'tac', 'ic', 'li', 'hostname', 'ip', 'go', 'train', 'env', 'bin', 'os', 'export', 'cc', 'docker', 'random', 'training', 'ml', 'tr', 'sh', 'yaml', 'hon', 'gene', 'od', 'gpu', 'host']
['python', 'bash', 'srun', 'docker', 'r']
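Once the tag columns are decoded, counting tags across rows is a short step. This sketch uses made-up rows shaped like the tuples above; `tag_counts` is a hypothetical helper, and the column indices follow the table_info output (2 for manager tags, 4 for the templated software tags).

```python
import json
from collections import Counter


def tag_counts(rows, column):
    """Count how often each tag appears in a JSON-encoded list column."""
    counts = Counter()
    for row in rows:
        counts.update(json.loads(row[column]))
    return counts


# Made-up rows in the same shape as the jobspecs table
rows = [
    (1, "./data/a/run.sh", '["slurm"]', '["python", "srun"]', '["python"]', 733),
    (2, "./data/b/job.sh", '["slurm", "pbs"]', '["bash"]', '["bash", "python"]', 120),
]
print(tag_counts(rows, 2).most_common())
print(tag_counts(rows, 4).most_common())
```

With the real database, `rows` would come from `query.fetchall()` on the jobspecs table.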
And that's it! Go hither and jobspec, away!!!!
License
HPCIC DevTools is distributed under the terms of the MIT license. All new contributions must be made under this license.
See LICENSE, COPYRIGHT, and NOTICE for details.
SPDX-License-Identifier: (MIT)
LLNL-CODE- 842614
Owner
- Name: Converged Computing
- Login: converged-computing
- Kind: organization
- Website: https://converged-computing.org
- Repositories: 84
- Profile: https://github.com/converged-computing
The best of cloud and high performance computing: technology and community combined.
GitHub Events
Total
- Watch event: 1
- Delete event: 3
- Push event: 7
- Pull request event: 7
- Fork event: 1
- Create event: 3
Last Year
- Watch event: 1
- Delete event: 3
- Push event: 7
- Pull request event: 7
- Fork event: 1
- Create event: 3
Issues and Pull Requests
Last synced: 10 months ago
All Time
- Total issues: 0
- Total pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Total issue authors: 0
- Total pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 1
- Average time to close issues: N/A
- Average time to close pull requests: less than a minute
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 1
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
- vsoch (6)