https://github.com/converged-computing/jobspec-database

Database of jobs, starting with Slurm (under development)

Last synced: 9 months ago

Repository

Database of jobs, starting with Slurm (under development)

Basic Info

Host: GitHub
Owner: converged-computing
License: mit
Language: Shell
Default Branch: main
Size: 1.01 GB

Statistics

Stars: 1
Watchers: 1
Forks: 1
Open Issues: 0
Releases: 0

Created about 2 years ago · Last pushed over 1 year ago

Metadata Files

Readme License

Job Specification Database

This database is under development!

It will eventually be added to Dinosaur Datasets.

Usage

The data files are organized by repository in data. These instructions are for generation. Create a python environment and install dependencies:

```bash
pip install -r requirements.txt
```

You'll need to make a "drivers" directory inside of scripts and download the chromedriver (matching your browser version) into it. Then run the search script, customizing the matrix of search terms. Before running, close all browsers and be prepared to log in to GitHub.

```bash
cd scripts/
python search.py
```

Then download files, from the root, targeting the output file of interest.

```bash
python scripts/get_jobspecs.py ./scripts/data/raw-links-may-23.json --outdir ./data
```

Note that the data now is just a trial run! For the first run, we had 11k+ unique results from just a trial run. For the second run, that went up to 19544. When I added more applications, for half of the run it was 25k. The current total is 31932 scripts. I didn't add the last run of flux because I saw what I thought were false positives.

Also try to get associated GitHub files.

```bash
python scripts/get_jobspec_configs.py
```

Analysis

1. Word2Vec

Word2Vec is a little old, and I think a flaw is that it is combining jobspecs. But if we have the window the correct size, we can make associations between close terms. The space I'm worried about is the beginning of one script and the end of another, and maybe a different approach or strategy could help with that. To generate the word2vec embeddings you can run:

```bash
python scripts/word2vec.py --input ./data
```

Updates to the above on June 9th:

Better parsing to tokenize
- we combine by space instead of empty space so words at end are not combined (this was a bug)
- punctuation that should be replaced by space instead of empty space honored (dashes, underscore, etc)
- hash bangs for shell parsed out
- better tokenization and recreation of content
- each script is on one line (akin to how done for word2vec)
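The parsing rules above could be sketched roughly as follows. This is a minimal illustration and not the code in scripts/word2vec.py: the function name `tokenize_script` and the exact set of punctuation characters are assumptions.

```python
import re

# Assumption: these characters separate words and are replaced by a space.
PUNCTUATION = re.compile(r"[-_=/.,:;(){}\[\]\"'$|<>&*]")


def tokenize_script(text):
    """Flatten one job script into a single line of lowercase tokens."""
    lines = text.splitlines()
    # Drop the shell hash bang (e.g. #!/bin/bash) if it leads the script.
    if lines and lines[0].startswith("#!"):
        lines = lines[1:]
    # Join lines with a space (not an empty string) so the last word of one
    # line is never fused with the first word of the next.
    content = " ".join(lines)
    content = PUNCTUATION.sub(" ", content)
    return " ".join(content.lower().split())


script = "#!/bin/bash\n#SBATCH --job-name=my_job\nsrun python train.py\n"
print(tokenize_script(script))
```

Each call returns one line per script, which matches the "one script per line" layout described above.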

I think it would be reasonable to create a similarity matrix, specifically cosine distance between the vectors. This will read in the metadata.tsv and vectors.tsv we just generated.

```bash
python scripts/vector_matrix.py --vectors ./scripts/data/combined/vectors.tsv --metadata ./scripts/data/combined/metadata.tsv
```

The above does the following:

- We start with our jobspecs that are tokenized according to the above.
- We further remove anything that is purely numerical.
- We use TF-IDF to reduce the feature space to 300 terms.
- We do a clustering of these terms to generate the resulting plot.
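To make the filtering and TF-IDF steps concrete, here is a minimal standard-library sketch. It is not the implementation in scripts/vector_matrix.py: the function name `top_tfidf_terms`, the smoothed IDF formula, and ranking terms by summed score are assumptions, and the clustering step is left out.

```python
import math
from collections import Counter


def top_tfidf_terms(documents, limit=300):
    """Rank terms by summed TF-IDF across documents and keep the top `limit`."""
    # Drop purely numerical tokens before counting.
    tokenized = [[t for t in doc.split() if not t.isdigit()] for doc in documents]
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    total_docs = len(tokenized)
    scores = Counter()
    for tokens in tokenized:
        for term, count in Counter(tokens).items():
            tf = count / len(tokens)
            # Smoothed inverse document frequency (an assumed variant).
            idf = math.log((1 + total_docs) / (1 + doc_freq[term])) + 1
            scores[term] += tf * idf
    return [term for term, _ in scores.most_common(limit)]


docs = [
    "sbatch nodes 2 srun python train",
    "sbatch nodes 4 mpirun lammps",
    "sbatch time 60 srun python eval",
]
terms = top_tfidf_terms(docs, limit=5)
print(terms)
```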

The hardest thing is just seeing all the terms. I messed with JavaScript for a while but gave up for the time being, the data is too big for the browser and likely we need to use canvas.
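For the similarity matrix itself, a minimal sketch of pairwise cosine similarity is below. The toy `terms` and `vectors` stand in for rows of metadata.tsv and vectors.tsv, and the assumption is that both files list entries in the same order, one per line. The function names are illustrative only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors):
    """Pairwise cosine similarity; cosine distance is 1 minus each entry."""
    return [[cosine_similarity(a, b) for b in vectors] for a in vectors]


# Toy stand-ins for rows of metadata.tsv (terms) and vectors.tsv (embeddings).
terms = ["srun", "mpirun", "python"]
vectors = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]

matrix = similarity_matrix(vectors)
for term, row in zip(terms, matrix):
    print(term, [round(value, 2) for value in row])
```

With tens of thousands of terms a pure-Python double loop would be slow, so a real run would likely want a vectorized library instead.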

2. Directive Counts

I thought it would be interesting to explicitly parse the directives. That's a bit hard, but I took a first shot:

```bash
python scripts/parse_directives.py --input ./data
```

```console
Assessing 33851 conteder jobscripts...
Found (and skipped) 535 duplicates.
```

You can find tokenized lines (with one jobspec per line), the directive counts, and the dictionary and skips in scripts/data/combined/
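As a rough idea of what counting directives involves, here is a minimal sketch. It is not the logic of scripts/parse_directives.py: the regular expression, the three prefixes it recognizes, and the function name `count_directives` are assumptions for illustration.

```python
import re
from collections import Counter

# Assumption: a directive line looks like "#SBATCH --flag=value" or "#PBS -l value".
DIRECTIVE = re.compile(r"^#(SBATCH|PBS|BSUB)\s+(-{1,2}[A-Za-z][\w-]*)")


def count_directives(scripts):
    """Count (manager, flag) pairs across scripts, skipping exact duplicates."""
    counts = Counter()
    seen = set()
    duplicates = 0
    for text in scripts:
        if text in seen:
            duplicates += 1
            continue
        seen.add(text)
        for line in text.splitlines():
            match = DIRECTIVE.match(line.strip())
            if match:
                counts[(match.group(1), match.group(2))] += 1
    return counts, duplicates


scripts = [
    "#!/bin/bash\n#SBATCH --nodes=2\n#SBATCH --time=01:00:00\nsrun hostname\n",
    "#!/bin/bash\n#PBS -l nodes=1\necho hello\n",
    "#!/bin/bash\n#PBS -l nodes=1\necho hello\n",  # exact duplicate, skipped
]
counts, duplicates = count_directives(scripts)
print(counts, duplicates)
```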

3. Adding Topics or More Structure

I was thinking about adding doc2vec, because word2vec is likely making associations between terms in different documents. But I don't think anyone is using doc2vec anymore: the examples I'm finding use a deprecated version of tensorflow with functions that were removed long ago. We could use the old gensim version, but I think it might be better to think of a more modern approach. I decided to try top2vec.

```bash
# Using pretrained model (not great because not jobscript terms)
python scripts/run_top2vec.py

# Build with doc2vec - be careful we set workers and learn mode (slower) here
# started at 7pm
python3 scripts/run_top2vec_with_doc2vec.py --speed learn
python3 scripts/run_top2vec_with_doc2vec.py --speed deep-learn
```

And then to explore (finding matches for a subset of words):

```bash
python3 scripts/explore_top2vec.py
python3 scripts/explore_top2vec.py --outname top2vec-jobspec-database-learn.md --model ./scripts/data/combined/wordclouds/top2vec-with-doc2vec-learn.model

# Deep learn (highest quality vectors), takes about 6-7 hours to run 128 GB ram CPU instance
python3 scripts/explore_top2vec.py --outname top2vec-jobspec-database-deep-learn.md --model ./scripts/data/combined/wordclouds/top2vec-with-doc2vec-deep-learn.model
```

For word2vec:

- continuous bag of words: we create a window around the word and predict the word from the context
- skip gram: we create the same window but predict the context from the word (supposedly slower but better results)
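The difference between the two modes is easiest to see in the training examples each one produces. The sketch below only builds those examples and trains nothing. The function names are illustrative and the window size is an arbitrary choice.

```python
def training_pairs(tokens, window=2):
    """Yield (context, target) pairs for every position in one script."""
    for i, target in enumerate(tokens):
        start = max(0, i - window)
        context = tokens[start:i] + tokens[i + 1 : i + 1 + window]
        yield context, target


def cbow_examples(tokens, window=2):
    # CBOW: input is the context, output is the word in the middle.
    return [(context, target) for context, target in training_pairs(tokens, window)]


def skipgram_examples(tokens, window=2):
    # Skip-gram: input is the word, output is each context word in turn.
    return [
        (target, neighbor)
        for context, target in training_pairs(tokens, window)
        for neighbor in context
    ]


tokens = ["srun", "python", "train"]
print(cbow_examples(tokens, window=1))
print(skipgram_examples(tokens, window=1))
```

Because the window is built per token list, passing each script in separately keeps a window from spanning the end of one script and the start of the next.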

I had to run this on a large VM for it to work. See the topics in scripts/data/combined/wordclouds. We can likely tweak everything but I like how this tool is approaching it (see docs in ddangelov/Top2Vec).

4. Gemini

We can run Gemini across our 33K jobspecs to generate a templatized output for each one:

```bash
python scripts/classify-gemini.py
```

That takes a little over a day to run, and it costs about $25-$30 per run. I did two runs for about $55. Then we can check the model, normalizing and visualizing our resources (that we parsed) and comparing them to what Gemini says.

```bash
python scripts/process-gemini.py
```

You can then see the data output in scripts/data/gemini-with-template-processed or use this script to visualize results that are filtered to those with all, missing, or some wrong values:

```bash
# pip install rich
python scripts/inspect-gemini.py

# How to customize
python scripts/inspect-gemini.py --type missing
python scripts/inspect-gemini.py --type wrong

# Print more than 1
python scripts/inspect-gemini.py --type all --number 3
```
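One way to think about the all / missing / wrong filters is as a comparison between the resources we parsed and the ones Gemini returned. The sketch below is an assumption about those semantics, not the logic in scripts/inspect-gemini.py. The function name `compare_resources` and the field names `nodes` and `tasks` are hypothetical.

```python
def compare_resources(parsed, predicted):
    """Classify a Gemini prediction against resources parsed from the script.

    Assumed semantics: "wrong" if any shared field differs, "missing" if
    Gemini omitted a parsed field, and "all" if every parsed field matches.
    """
    missing = [key for key in parsed if key not in predicted]
    wrong = [
        key for key in parsed if key in predicted and predicted[key] != parsed[key]
    ]
    if wrong:
        return "wrong", wrong
    if missing:
        return "missing", missing
    return "all", []


parsed = {"nodes": 2, "tasks": 8}  # hypothetical parsed directives
print(compare_resources(parsed, {"nodes": 2, "tasks": 8}))
print(compare_resources(parsed, {"nodes": 2}))
print(compare_resources(parsed, {"nodes": 4, "tasks": 8}))
```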

4. Cyclomatic Complexity

Next, we want to calculate the cyclomatic complexity. Since these are akin to bash scripts, we can use shellmetrics. It's not perfect, but I did a few spot checks and the result was what I'd want or expect - the more complex scripts (with arrays, etc) got a higher score. Since we know our database on LC is now in S3, let's instead write this to an SQL file with a table that can be queried based on path, sha1, or sha256. First, make sure the binary is on your path:

```bash
mkdir -p ./bin
curl -fsSL https://git.io/shellmetrics > ./bin/shellmetrics
chmod +x ./bin/shellmetrics
export PATH=$PWD/bin:$PATH
```

Here is example output from a manual run. I think we want the first section, which reports the CCN (cyclomatic complexity number) for main, the main chunk of the script. In the CSV output, that is the middle row, where the ccn column has the value 1.

```console
$ shellmetrics data/abdullahrkw/FAU-FAPS/ViT/run-job.sh
LLOC  CCN  Location
   5    1  <main> data/abdullahrkw/FAU-FAPS/ViT/run-job.sh

1 file(s), 1 function(s) analyzed. [bash 5.1.16(1)-release]

==============================================================================
 NLOC    NLOC  LLOC    LLOC    CCN Func File (lines:comment:blank)
total     avg total     avg    avg  cnt
    5    5.00     5    5.00   1.00    1 data/abdullahrkw/FAU-FAPS/ViT/run-job.sh (20:14:1)

==============================================================================
 NLOC    NLOC  LLOC    LLOC    CCN Func File   lines comment   blank
total     avg total     avg    avg  cnt  cnt   total   total   total
    5    5.00     5    5.00   1.00    1    1      20      14       1
```

```console
$ shellmetrics --csv data/abdullahrkw/FAU-FAPS/ViT/run-job.sh
file,func,lineno,lloc,ccn,lines,comment,blank
"data/abdullahrkw/FAU-FAPS/ViT/run-job.sh","",0,0,0,20,14,1
"data/abdullahrkw/FAU-FAPS/ViT/run-job.sh","<main>",0,5,1,0,0,0
"data/abdullahrkw/FAU-FAPS/ViT/run-job.sh","",0,0,0,20,14,1
```

Next, generate a database for files in data.

```bash
python scripts/cyclomatic-complexity.py --input ./data --db ./scripts/data/cyclomatic-complexity-github.db
```

IMPORTANT For the above and complexity calculation below, duplicates are not removed. We store the sha256 and sha1 so you can do this!
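Since duplicates are kept, here is a minimal sketch of how the stored digests could be used to drop them. It also shows one way the CCN for the main block might be pulled from `shellmetrics --csv` output. The helpers `main_ccn` and `add_script` are hypothetical, the `<main>` label in the sample CSV is assumed from the table output, and an in-memory database stands in for the real file.

```python
import csv
import hashlib
import io
import sqlite3

# Sample CSV in the shape printed by `shellmetrics --csv` (paths shortened).
CSV_OUTPUT = '''file,func,lineno,lloc,ccn,lines,comment,blank
"run-job.sh","",0,0,0,20,14,1
"run-job.sh","<main>",0,5,1,0,0,0
"run-job.sh","",0,0,0,20,14,1
'''


def main_ccn(csv_text):
    """Return the CCN reported for the <main> block, or None if absent."""
    for row in csv.DictReader(io.StringIO(csv_text)):
        if row["func"] == "<main>":
            return int(row["ccn"])
    return None


def add_script(conn, name, content, ccn):
    """Insert one script along with its sha256 and sha1 digests."""
    conn.execute(
        "INSERT INTO jobspecs (name, sha256, sha1, ccn) VALUES (?, ?, ?, ?)",
        (name, hashlib.sha256(content).hexdigest(),
         hashlib.sha1(content).hexdigest(), ccn),
    )


conn = sqlite3.connect(":memory:")  # stand-in for the generated .db file
conn.execute(
    "CREATE TABLE jobspecs "
    "(id INTEGER PRIMARY KEY, name TEXT, sha256 TEXT, sha1 TEXT, ccn NUMBER)"
)
content = b"#!/bin/bash\necho hello\n"
ccn = main_ccn(CSV_OUTPUT)
add_script(conn, "./data/a/run-job.sh", content, ccn)
add_script(conn, "./data/b/run-job.sh", content, ccn)  # same content, new path

# Keep one row per sha256: the row with the smallest id in each group.
unique = conn.execute(
    "SELECT MIN(id), name, sha256, ccn FROM jobspecs GROUP BY sha256"
).fetchall()
print(unique)
conn.close()
```

Selecting `name` next to `MIN(id)` relies on SQLite's behavior of taking bare columns from the row that holds the minimum, so this query is not portable to other databases as written.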

5. LC Jobspec Database

This database is kind of messy - not sure I like it as much as the one I generated. Someone else can deal with it :)

Total unique jobspec jsons: 210351
Total with BatchScript: 116117

```bash
cd ./lc
python scripts/cyclomatic-complexity.py --input ./raw/jobdata_json --db ./data/cyclomatic-complexity-lc.db
```

IMPORTANT Since this is a combination of json and .tar files (for which we extract members) the database has an extra column for the jobid, and the original filename path corresponds to the file here. The file that we actually read is parsed from the BatchScript directive of the json file, which is only the batch portion of the data to match what we use in GitHub.

Reading Sqlite Databases

Examples to read in the two databases:

```python
import sqlite3

conn = sqlite3.connect("scripts/data/cyclomatic-complexity-github.db")
cursor = conn.cursor()

# This gets the field names and metadata
cursor.execute('PRAGMA table_info(jobspecs);').fetchall()
```

```console
[(0, 'id', 'INTEGER', 0, None, 1),
 (1, 'name', 'TEXT', 0, None, 0),
 (2, 'sha256', 'TEXT', 0, None, 0),
 (3, 'sha1', 'TEXT', 0, None, 0),
 (4, 'ccn', 'NUMBER', 0, None, 0)]
```

And this gets the jobspecs (one for example)

```python
query = cursor.execute("SELECT * from jobspecs;")
query.fetchone()

# rows = query.fetchall()
```

```console
(1, './data/ZIYU-DEEP/reprover-test/2gpu.sh', '48ef130f0700b606c3b5d4b2a784cd78f97439b935bd7d7df4673d9683d420e1', 'c88fdc41ae091734565c63ed67c51a45de66a07f', 1)
```

Don't forget to close.

```python
conn.close()
```

And don't forget LC will have an extra field, for the name of the file plus the jobid since some are members in a tar.

```python
conn = sqlite3.connect("lc/data/cyclomatic-complexity-lc.db")
cursor = conn.cursor()
cursor.execute('PRAGMA table_info(jobspecs);').fetchall()
```

```console
[(0, 'id', 'INTEGER', 0, None, 1),
 (1, 'name', 'TEXT', 0, None, 0),
 (2, 'jobid', 'TEXT', 0, None, 0),
 (3, 'sha256', 'TEXT', 0, None, 0),
 (4, 'sha1', 'TEXT', 0, None, 0),
 (5, 'ccn', 'NUMBER', 0, None, 0)]
```

```python
conn.close()
```
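Because both databases store a sha256, one possible use (not something the repository scripts do, as far as this README says) is to look for scripts that appear in both. The sketch below uses two in-memory databases with made-up rows. With the real files, the `ATTACH` statement would point at the LC database path instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for cyclomatic-complexity-github.db
# With real files: ATTACH DATABASE 'lc/data/cyclomatic-complexity-lc.db' AS lc
conn.execute("ATTACH DATABASE ':memory:' AS lc")
conn.execute(
    "CREATE TABLE jobspecs "
    "(id INTEGER PRIMARY KEY, name TEXT, sha256 TEXT, sha1 TEXT, ccn NUMBER)"
)
conn.execute(
    "CREATE TABLE lc.jobspecs "
    "(id INTEGER PRIMARY KEY, name TEXT, jobid TEXT, sha256 TEXT, sha1 TEXT, ccn NUMBER)"
)
# Made-up rows for illustration only.
conn.execute(
    "INSERT INTO jobspecs (name, sha256, sha1, ccn) "
    "VALUES ('./data/a/run.sh', 'aaa', '111', 1)"
)
conn.executemany(
    "INSERT INTO lc.jobspecs (name, jobid, sha256, sha1, ccn) VALUES (?, ?, ?, ?, ?)",
    [("jobs.tar", "42", "aaa", "111", 1), ("other.json", "43", "zzz", "999", 3)],
)

# Scripts whose content digest appears in both databases.
shared = conn.execute(
    "SELECT g.name, l.name, l.jobid FROM main.jobspecs AS g "
    "JOIN lc.jobspecs AS l ON g.sha256 = l.sha256"
).fetchall()
print(shared)
conn.close()
```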

6. Summarizing

We want to look at, for each database, what we have for:

- applications
- job managers
- length of jobs

The first two will be tags based on presence of directives. The last will be a calculation.
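Tagging a job manager by the presence of directives might look like the sketch below. The prefix table is an assumption based on the common directive syntax of each manager and is not taken from scripts/summarize-jobspecs.py. The function name `manager_tags` is illustrative.

```python
# Assumption: each job manager is recognized by its directive prefix.
MANAGER_PREFIXES = {
    "slurm": "#SBATCH",
    "pbs": "#PBS",
    "lsf": "#BSUB",
    "oar": "#OAR",
    "flux": "#FLUX",
    "cobalt": "#COBALT",
}


def manager_tags(text):
    """Return the sorted job manager tags whose directives appear in a script."""
    found = set()
    for line in text.splitlines():
        stripped = line.strip().upper()
        for manager, prefix in MANAGER_PREFIXES.items():
            if stripped.startswith(prefix):
                found.add(manager)
    return sorted(found)


script = "#!/bin/bash\n#SBATCH -N 2\n#SBATCH -t 10\nsrun hostname\n"
print(manager_tags(script))
```

A script can carry directives for more than one manager, which is why the result is a list and not a single value.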

```bash
python scripts/summarize-jobspecs.py --input ./data --db ./scripts/data/jobspec-summary-github.db
```

```console
{'slurm': 23958, 'pbs': 6381, 'lsf': 2614, 'oar': 145, 'flux': 74, 'cobalt': 630}
```

This saves to scripts/data/jobspec-summary

Reading Sqlite Database

Example of reading in the database:

```python
import sqlite3

conn = sqlite3.connect("./scripts/data/jobspec-summary-github.db")
cursor = conn.cursor()

# This gets the field names and metadata
cursor.execute('PRAGMA table_info(jobspecs);').fetchall()
```

```console
[(0, 'id', 'INTEGER', 0, None, 1),
 (1, 'name', 'TEXT', 0, None, 0),

 # job manager tags
 (2, 'manager_tags', 'TEXT', 0, None, 0),

 # these are tags from gemini without a template
 (3, 'software_tags', 'TEXT', 0, None, 0),

 # These were tags with a template (nicer / cleaner set)
 (4, 'software_tags_with_template', 'TEXT', 0, None, 0),
 (5, 'length', 'NUMBER', 0, None, 0)]
```

Each of manager_tags and software_tags is a JSON-dumped list.

```python
query = cursor.execute("SELECT * from jobspecs;")
result = query.fetchone()

# rows = query.fetchall()
```

```console
(1, './data/ZIYU-DEEP/reprover-test/2gpu.sh',

 # job manager tags
 '["slurm"]',

 # these are tags from gemini without a template
 '["rs", "head", "scontrol", "yt", "distributed", "node", "les", "main", "an", "run", "generator", "ame", "at", "srun", "nam", "python", "bash", "gp", "gt", "aria", "aml", "scrip", "ct", "re", "tact", "ed", "su", "vi", "tac", "ic", "li", "hostname", "ip", "go", "train", "env", "bin", "os", "export", "cc", "docker", "random", "training", "ml", "tr", "sh", "yaml", "hon", "gene", "od", "gpu", "host"]',

 # These were tags with a template (nicer / cleaner set)
 '["python", "bash", "srun", "docker", "r"]',
 733)
```

To read in the tags:

```console
import json
print(json.loads(result[2]))
print(json.loads(result[3]))
print(json.loads(result[4]))
['slurm']
['rs', 'head', 'scontrol', 'yt', 'distributed', 'node', 'les', 'main', 'an', 'run', 'generator', 'ame', 'at', 'srun', 'nam', 'python', 'bash', 'gp', 'gt', 'aria', 'aml', 'scrip', 'ct', 're', 'tact', 'ed', 'su', 'vi', 'tac', 'ic', 'li', 'hostname', 'ip', 'go', 'train', 'env', 'bin', 'os', 'export', 'cc', 'docker', 'random', 'training', 'ml', 'tr', 'sh', 'yaml', 'hon', 'gene', 'od', 'gpu', 'host']
['python', 'bash', 'srun', 'docker', 'r']
```
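Since the tag columns are JSON-dumped lists, totals like the job manager counts can be rebuilt by decoding each row and counting. This is a minimal sketch against a toy in-memory table with made-up rows that holds only the columns it needs, not the full schema of the summary database.

```python
import json
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")  # stand-in for jobspec-summary-github.db
conn.execute(
    "CREATE TABLE jobspecs (id INTEGER PRIMARY KEY, name TEXT, manager_tags TEXT)"
)
# Made-up rows for illustration only.
conn.executemany(
    "INSERT INTO jobspecs (name, manager_tags) VALUES (?, ?)",
    [
        ("./data/a/run.sh", json.dumps(["slurm"])),
        ("./data/b/run.sh", json.dumps(["slurm", "pbs"])),
        ("./data/c/job.sh", json.dumps([])),
    ],
)

counts = Counter()
for (tags,) in conn.execute("SELECT manager_tags FROM jobspecs"):
    # Each value is a JSON string such as '["slurm"]'.
    counts.update(json.loads(tags))
print(dict(counts))
conn.close()
```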

And that's it! Go hither and jobspec, away!!!!

License

HPCIC DevTools is distributed under the terms of the MIT license. All new contributions must be made under this license.

See LICENSE, COPYRIGHT, and NOTICE for details.

SPDX-License-Identifier: (MIT)

LLNL-CODE- 842614

Owner

Name: Converged Computing
Login: converged-computing
Kind: organization

Website: https://converged-computing.org
Repositories: 84
Profile: https://github.com/converged-computing

The best of cloud and high performance computing: technology and community combined.

GitHub Events

Total

Watch event: 1
Delete event: 3
Push event: 7
Pull request event: 7
Fork event: 1
Create event: 3

Last Year

Watch event: 1
Delete event: 3
Push event: 7
Pull request event: 7
Fork event: 1
Create event: 3

Issues and Pull Requests

Last synced: 10 months ago

All Time

Total issues: 0
Total pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Total issue authors: 0
Total pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: less than a minute
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 0.0
Merged pull requests: 1
Bot issues: 0
Bot pull requests: 0

Science Score: 13.0%
