https://github.com/breandan/cstk

🔍 Code Search Tools & Experiments

Science Score: 46.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • ○ CITATION.cff file
  • ✓ codemeta.json file
    Found codemeta.json file
  • ✓ .zenodo.json file
    Found .zenodo.json file
  • ○ DOI references
  • ✓ Academic publication links
    Links to: arxiv.org, ieee.org, zenodo.org
  • ✓ Committers with academic emails
    2 of 4 committers (50.0%) from academic institutions
  • ○ Institutional organization owner
  • ○ JOSS paper metadata
  • ○ Scientific vocabulary similarity
    Low similarity (11.9%) to scientific vocabulary
Last synced: 7 months ago

Repository

🔍 Code Search Tools & Experiments

Basic Info
  • Host: GitHub
  • Owner: breandan
  • Language: Kotlin
  • Default Branch: master
  • Homepage:
  • Size: 38.7 MB
Statistics
  • Stars: 12
  • Watchers: 5
  • Forks: 4
  • Open Issues: 0
  • Releases: 0
Created about 5 years ago · Last pushed 8 months ago
Metadata Files
Readme

README.md

CSTK: Code Search Toolkit

Tools and experiments for code search. Broadly, we consider code synthesis as a search problem: programming is a kind of biased random walk through edit space. Program synthesis, then, can be viewed as a goal-directed Markov decision process which takes a specification and applies a sequence of source code transformations to evolve the code towards that specification (e.g., test- or document-driven development). This repository provides tools for evaluating state-of-the-art neural code synthesizers on a variety of tasks, from natural language and code search and completion to optimal transport and various couplings between code and documentation.
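To make the search framing concrete, here is a minimal sketch of synthesis as a biased walk through edit space; the `Program`, `Edit`, `spec`, and `score` names are illustrative and do not appear in this repository:

```kotlin
// Hypothetical sketch: program synthesis as a biased random walk through edit
// space. A state is a program, an action is a source code transformation, and
// the goal is a specification (e.g. a test suite) the program must satisfy.
typealias Program = String
typealias Edit = (Program) -> Program

fun synthesize(
  seed: Program,
  spec: (Program) -> Boolean,     // e.g. "all tests pass"
  edits: List<Edit>,              // candidate source code transformations
  score: (Program) -> Double,     // bias: higher means closer to the spec
  maxSteps: Int = 10_000
): Program? {
  var current = seed
  repeat(maxSteps) {
    if (spec(current)) return current
    // Greedily bias the walk towards the highest-scoring neighboring edit
    val next = edits.map { it(current) }.maxByOrNull(score) ?: return@repeat
    if (score(next) >= score(current)) current = next
  }
  return if (spec(current)) current else null
}
```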

  • Probing tools for pretrained neural language models
  • Autoregressive code and document completion with masked LMs
  • Full-factorial experiments on source code
  • Indices for keyword and vector embedding
  • Learning to search & grammar induction
    • Passive DFA learning from membership
    • Keyword/BoW-based query synthesis
  • Semantic graph construction
    • Keyword-matching edge construction
    • Proximity-based graph embedding
  • Vector embeddings for code
    • Parsing and whole-AST GNN embeddings
    • Transformer embeddings of source code snippets
    • t-SNE visualization of code embeddings
  • Persistent homology of source code embeddings
  • Metrics for string, vector and distribution matching
    • Kantorovich metric on code embeddings
    • Various string distance metrics
    • Code-snippet normal form distance
    • Ranking metrics: NDCG, MAP@K, MRR (see the sketch after this list)
    • Comparison of nearest-neighbors
  • Tools for mining software repositories
    • Supports Google Code, Gitlab, and self-hosted Git instances
    • Deduplication with GitHub to avoid dataset biases
  • Probabilistic code synthesis with Markov tensors
  • Synthetic source code transformations
    • Synonym variable renaming
    • Dead code introduction
    • Loop bounds alteration
    • Argument order swapping
    • Line order swapping
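
For the ranking metrics in the list above, a simplified sketch of NDCG and MRR over binary relevance labels (the repository's own implementations may differ):

```kotlin
import kotlin.math.log2

// Sketch of two ranking metrics, where rels[i] == 1.0 iff the
// i-th ranked search result is relevant to the query.
fun dcg(rels: List<Double>) =
  rels.mapIndexed { i, rel -> rel / log2((i + 2).toDouble()) }.sum()

// Normalized discounted cumulative gain: DCG relative to an ideal ranking.
fun ndcg(rels: List<Double>) =
  if (rels.all { it == 0.0 }) 0.0 else dcg(rels) / dcg(rels.sortedDescending())

// Mean reciprocal rank of the first relevant result in each ranking.
fun mrr(rankings: List<List<Double>>) = rankings.map { rels ->
  val firstHit = rels.indexOfFirst { it > 0.0 }
  if (firstHit == -1) 0.0 else 1.0 / (firstHit + 1)
}.average()
```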

Code and documentation are complementary and synergistic datatypes. A good programmer should be able to read and write both. We expect a neural programmer to attain fluency in both human and programming languages, and we evaluate the extent to which SOTA neural language models have mastered this ability, since translating between the two is evidence of some understanding of intent.

We try our best to take an empirical approach. All experiments are conducted on a relatively diverse sampling of repositories from GitHub containing a mixture of source code and documentation. In those experiments, we use code completion, code search and other downstream tasks to compare the accuracy of pretrained models in constructed scenarios.

Usage

Setup

First clone this repo and initialize the submodule:

```bash
git clone git@github.com:breandan/cstk.git && \
cd cstk && \
git submodule update --init --recursive --remote
```

The following instructions assume you are running experiments on a Compute Canada cluster such as Narval, or a similar system. Create a virtual environment and install the following dependencies:

```bash
module load python/3.8 && \
python3 -m venv . && \
source bin/activate && \
pip install torch==1.5.1 -f https://download.pytorch.org/whl/torch_stable.html && \
pip install transformers
```

Prefetch the models you wish to evaluate from the login node -- this will require internet access. Each model must provide a fill-mask pipeline (see here for a list of compatible models).

```bash
python scripts/embedding_server.py --model microsoft/codebert-base microsoft/graphcodebert-base dbernsohn/roberta-java huggingface/CodeBERTa-small-v1
```

Once all models have been downloaded, kill the server with Ctrl+C (this step only needs to be run once). Confirm that ~/.cache/huggingface/transformers is not empty.

Then, make sure the project builds correctly on a login node and download the dataset. This may take a few minutes the first time it is run:

```bash
# Load Java through CCEnv when running on Niagara:
# module load CCEnv StdEnv java/17.0.2
module load java/17.0.2 && \
./gradlew build && \
./gradlew cloneRepos
```

To run an experiment interactively, request a GPU instance like so:

```bash
salloc --time 3:0:0 --account=[YOUR_ADVISOR] --gres=gpu:a100:1 --mem=40G
```

Compute nodes have no internet access, so all subsequent commands must run offline.

```sh
# Default CUDA version may not work, use older version
export LD_LIBRARY_PATH=/cvmfs/soft.computecanada.ca/easybuild/software/2017/Core/cudacore/10.2.89/targets/x86_64-linux/lib/

# Disable 🤗 from phoning home on a Compute node
export TRANSFORMERS_OFFLINE=1
module load python/3.8

# Load Java through CCEnv when running on Niagara:
# module load CCEnv StdEnv java/17.0.2
module load java/17.0.2
source bin/activate

# Use --offline for all Gradle commands on Compute nodes
./gradlew --offline [completeCode] [completeDoc] [varMisuse]
```

Once you have confirmed the experiment runs smoothly and are ready to submit a longer job, edit niagara_submit.sh and run one of the following commands to submit it to Slurm, depending on the Compute Canada cluster you want to run it on (Narval has 64 cores, Niagara has 40):

```bash
sbatch niagara_submit.sh
sbatch narval_submit.sh
```

Experiments

Experiments are mostly self-contained. Each Gradle task corresponds to a single experiment. They have been tested on JDK 21.

Syntax repair

The primary artifact and setup instructions for the syntax repair experiments are available on Zenodo.

Other instructions for fetching the raw datasets and running other experiments can be found below.

First, fetch the Break-It-Fix-It and StackOverflow datasets as follows:

```bash
export DATASET_DIR=src/main/resources/datasets/python && \
wget -O $DATASET_DIR/stack_overflow.zip https://figshare.com/ndownloader/articles/8244686/versions/1 && \
mkdir -p $DATASET_DIR/stack_overflow && \
unzip $DATASET_DIR/stack_overflow.zip -d $DATASET_DIR/stack_overflow && \
wget -O $DATASET_DIR/bifi.zip https://nlp.stanford.edu/projects/myasu/BIFI/data_minimal.zip && \
mkdir -p $DATASET_DIR/bifi && \
unzip $DATASET_DIR/bifi.zip -d $DATASET_DIR/bifi
```

Each of the following commands runs one of the syntax repair experiments:

```bash
./gradlew contextualRepair
./gradlew pythonSnippetRepair
./gradlew pythonStatementRepair
./gradlew kotlinSemanticRepair
./gradlew kotlinStatementRepair
```

Most of these experiments leverage parallelism, so the more CPUs are available, the better. We use AWS EC2 c7a.24xlarge instances with 96 vCPUs and 192 GB of RAM.

Mining software repositories

Tokens for accessing the GitHub and GitLab developer APIs should be placed in the .ghtoken and .gltoken files, respectively.

The following command is optional and will sample some repositories from GitHub, GitLab, and Google Code. To change the repository selection criteria, edit SampleRepos.kt:

```bash
./gradlew sampleRepos
```

Those repositories may be cloned for evaluation. The following command will download Git repos into the data directory by default. To change the defaults, edit CloneRepos.kt:

```bash
./gradlew cloneRepos
```

Masked code completion

The following will run the CodeCompletion.kt demo:

```bash
./gradlew completeCode
```

We use this task to evaluate the impact of source code transformations (SCTs). If the relative completion accuracy drops after an SCT has been applied, this indicates the model is sensitive to that form of noise.
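
A minimal sketch of this protocol, with a hypothetical `completionAccuracy` oracle standing in for the repository's actual masked completion task:

```kotlin
// Hypothetical sketch of the SCT-sensitivity protocol: compare a model's
// masked completion accuracy on original vs. transformed snippets.
fun relativeAccuracyDrop(
  snippets: List<String>,
  transform: (String) -> String,          // an SCT, e.g. synonym renaming
  completionAccuracy: (String) -> Double  // fraction of masked tokens recovered
): Double {
  val before = snippets.map(completionAccuracy).average()
  val after = snippets.map { completionAccuracy(transform(it)) }.average()
  return (before - after) / before        // positive values indicate noise sensitivity
}
```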

Document synthesis

The following will run the DocCompletion.kt demo:

```bash
./gradlew completeDoc
```

For example, here are some synthetic documents produced by GraphCodeBERT using greedy autoregressive decoding with a natural language filter.

Persistent homology

It is possible to visualize the persistent homology of code embeddings. To construct a Čech complex on a set of source code snippets, run:

```bash
./gradlew nearestNeighbors
```

This will embed the snippets and construct edges between the nearest neighbors. It's a nice way to visualize code:

(Matrix and graph visualizations of the nearest-neighbor complex omitted.)
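
A minimal sketch of the underlying edge construction, assuming the snippets have already been embedded as vectors: the 1-skeleton of a Čech complex at scale ε connects any two embeddings whose ε-balls intersect.

```kotlin
import kotlin.math.sqrt

// Illustrative sketch: connect code embeddings whose ε-balls intersect,
// i.e. whose pairwise distance is at most 2ε, yielding the 1-skeleton
// of a Čech complex over the embedding space.
fun euclid(u: DoubleArray, v: DoubleArray) =
  sqrt(u.zip(v).sumOf { (a, b) -> (a - b) * (a - b) })

fun cechEdges(embeddings: List<DoubleArray>, eps: Double): List<Pair<Int, Int>> =
  embeddings.indices.flatMap { i ->
    (i + 1 until embeddings.size).mapNotNull { j ->
      if (euclid(embeddings[i], embeddings[j]) <= 2 * eps) i to j else null
    }
  }
```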

Source code transformations

CSTK supports a number of source code transformations for studying the effect on neural language models. Some examples are given below.

Synonym renaming

Synonym renaming is provided by extJWNL. Run the following command:

```bash
./gradlew synonymize
```

Left column is the original, right column is synonymized:

```
fun VecIndex.knn(v: DoubleArray, i: Int, exact: Boolean = false) =    | fun VecIndex.knn(v: DoubleArray, i: Int, involve: Boolean = false) =
  if(exact) exactKNNSearch(v, i + 10)                                 |   if(involve) involveKNNSearch(v, i + 10)
  else findNearest(v, i + 10)                                         |   else findNearest(v, i + 10)
    .filter { !it.item().embedding.contentEquals(v) }                 |     .filter { !it.item().embedding.contentEquals(v) }
    .distinctBy { it.item().toString() }.take(i)                      |     .distinctBy { it.item().toString() }.take(i)
======================================================================================================================================
fun VecIndex.exactKNNSearch(vq: DoubleArray, nearestNeighbors: Int) = | fun VecIndex.exactKNNSearch(vq: DoubleArray, nearestEdge: Int) =
  asExactIndex().findNearest(vq, nearestNeighbors)                    |   asExactIndex().findNearest(vq, nearestEdge)
======================================================================================================================================
override fun vector(): DoubleArray = embedding                        | override fun variable(): DoubleArray = embedding
======================================================================================================================================
override fun dimensions(): Int = embedding.size                       | override fun mark(): Int = embedding.size
======================================================================================================================================
fun String.shuffleLines() = lines().shuffled().joinToString("\n")     | fun String.walkDepression() = lines().shuffled().joinToString("\n")
======================================================================================================================================
fun String.swapPlusMinus() =                                          | fun String.goQualityMinus() =
  map { if (it == '+') '-' else it }.joinToString("")                 |   map { if (it == '+') '-' else it }.joinToString("")
======================================================================================================================================
fun String.addDeadCode(): String =                                    | fun String.reckonDeadLabel(): String =
  lines().joinToString("\n") {                                        |   lines().joinToString("\n") {
    if (Math.random() < 0.3) "$it; int deadCode = 2;" else it         |     if (Math.random() < 0.3) "$it; int deadCode = 2;" else it
  }                                                                   |   }
======================================================================================================================================
(Further examples elided.)
```
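
A minimal sketch of the extJWNL WordNet lookup behind this task (the actual renaming additionally operates on camel-cased identifier parts, as the examples above suggest):

```kotlin
import net.sf.extjwnl.data.POS
import net.sf.extjwnl.dictionary.Dictionary

// Illustrative sketch of WordNet synonym lookup with extJWNL.
fun synonyms(word: String, pos: POS = POS.NOUN): Set<String> {
  val dict = Dictionary.getDefaultResourceInstance()
  val indexWord = dict.lookupIndexWord(pos, word) ?: return emptySet()
  return indexWord.senses
    .flatMap { synset -> synset.words.map { it.lemma } }
    .filter { !it.equals(word, ignoreCase = true) }
    .toSet()
}

fun main() = println(synonyms("search"))
```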

Whole-AST embeddings

The Code2Vec.kt demo extracts an AST from a set of source code snippets:

```bash
./gradlew code2Vec
```

Then it runs a few dozen iterations of GNN message passing and plots the whole-AST embedding in latent space. After dimensionality reduction using t-SNE, we obtain the following picture:

![](/latex/figs/embeddings.png)

Colors represent the graph size. Additional rounds of message passing will result in further separation.
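
For intuition, one round of message passing updates each AST node's embedding with an aggregate of its neighbors'; a self-contained sketch follows (illustrative only, not the repository's kaliningraph-based implementation):

```kotlin
// Illustrative sketch of GNN message passing over an AST adjacency matrix,
// where adj[i][j] == 1.0 iff nodes i and j are connected.
fun messagePass(adj: Array<DoubleArray>, h: Array<DoubleArray>): Array<DoubleArray> =
  Array(h.size) { i ->
    DoubleArray(h[0].size) { d ->
      var sum = h[i][d]                        // include the node's own state
      for (j in h.indices) sum += adj[i][j] * h[j][d]
      sum / (adj[i].sum() + 1)                 // mean aggregation over neighbors
    }
  }

// Whole-AST embedding: run several rounds, then average the node embeddings.
fun wholeAstEmbedding(adj: Array<DoubleArray>, h0: Array<DoubleArray>, rounds: Int = 20) =
  (1..rounds).fold(h0) { h, _ -> messagePass(adj, h) }
    .reduce { a, b -> DoubleArray(a.size) { a[it] + b[it] } }
    .map { it / h0.size }
```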

Probabilistic code synthesis

The following command will run the code synthesis demo:

```bash
./gradlew codeSynth -P train=[PATH_TO_TRAINING_DATA]
```

This should produce something like the following text.

3 symbols / symbol:

```kotlin
fun test = projection be val private fun checkbox(): String { } fun box(): String { as String { return "test org.rust fun String { s } va val box(): String { }
```

3 symbols / symbol:

```python
class Detection_instring else class_componse_source) first_list_end] PVOID), exception must in not instarted the commension. tokens = 0 error: def __name: line, untile_path) no blockThreader sys.get_paracter) @rtype: breated line_filenance', if isinstack if not sequeue_size = node):
```

3 symbols / symbol, memory = 2:

```kotlin
val ritingConfig.indefaultResponseExtractory.persDsl { */ * @see [hasNextContentType) } fun true): UsertionInjectionInterFunctionse { fun result() { action that matcher example.order = ReactiveEntityList() {} * @see Request inline fun values() = Selections.assure() * This bean defining ther the ream()`. * @see [list) } } fun val set @Generate lastImperateBridge * @see String, get method instance fun
```
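
The generator is based on Markov tensors over token sequences; a minimal sketch of the idea, with hypothetical names, is:

```kotlin
// Minimal sketch of probabilistic synthesis with a Markov model: count n-gram
// successors in a training corpus, then sample continuations token by token.
fun train(corpus: List<String>, memory: Int = 2): Map<List<String>, List<String>> =
  corpus.flatMap { it.split(" ") }
    .windowed(memory + 1)
    .groupBy({ it.dropLast(1) }, { it.last() })

fun sample(
  model: Map<List<String>, List<String>>,
  seed: List<String>,
  length: Int,
  memory: Int = 2
): String {
  val out = seed.toMutableList()
  repeat(length) {
    // Duplicates in the successor list make this frequency-proportional sampling
    val successors = model[out.takeLast(memory)] ?: return@repeat
    out += successors.random()
  }
  return out.joinToString(" ")
}
```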

Keyword search

How quickly can we search for substrings? This is useful for learning to search.

```bash
./gradlew -q trieSearch --args='--query=<QUERY> [--path=<PATH_TO_INDEX>] [--index=<INDEX_FILE>]'
```

```
$ ./gradlew -q trieSearch
Indexing /home/breandan/IdeaProjects/gym-fs
Indexed in 524ms to: cstk.idx
Searching index of size 1227 for [?]=[match]…

0.) [?=match] ….default("[?]")… (…Environment.kt:L21)
Keyword scores: [(toAbsolutePath, 2.0), (Query, 2.0), (find, 2.0)]
Next locations:
  0.) [?=toAbsolutePath] …ath = src.[?]().toStrin… (…DiskUtils.kt:L21)
  1.) [?=toAbsolutePath] …s.get("").[?]().toStrin… (…Environment.kt:L19)
  2.) [?=Query] …// [?] in contex… (…StringUtils.kt:L7)
  3.) [?=find] …ex(query).[?]All(this).… (…StringUtils.kt:L36)

1.) [?=match] …val ([?]Start, mat…tchStart, [?]End) =… (…StringUtils.kt:L38)
Keyword scores: [(Regex, 2.0), (matchStart, 2.0), (matchEnd, 2.0)]
Next locations:
  0.) [?=Regex] …(3).split([?]("[^\\w']+… (…Environment.kt:L66)
  1.) [?=Regex] …[?](query).fi… (…StringUtils.kt:L36)
  2.) [?=matchStart] …substring([?], matchEnd…chEnd) to [?]… (…StringUtils.kt:L40)
  3.) [?=matchEnd] …tchStart, [?]) to match… (…StringUtils.kt:L40)

2.) [?=match] …substring([?]Start, mat…tchStart, [?]End) to ma…chEnd) to [?]Start… (…StringUtils.kt:L40)
Keyword scores: [(matchStart, 2.0), (matchEnd, 2.0), (first, 3.0)]
Next locations:
  0.) [?=matchStart] …val ([?], matchEnd… (…StringUtils.kt:L38)
  1.) [?=matchEnd] …tchStart, [?]) =… (…StringUtils.kt:L38)
  2.) [?=first] ….offer(it.[?]()) }… (…Environment.kt:L120)
  3.) [?=first] …st common [?]. Common k… (…Environment.kt:L77)
  4.) [?=first] …it.range.[?].coerceIn(… (…StringUtils.kt:L39)

3.) [?=match] …pairs of [?]ing prefix… (…Environment.kt:L25)
Keyword scores: [(offset, 2.0), (pairs, 2.0), (help, 3.0)]
Next locations:
  0.) [?=offset] …val [?]: Int… (…StringUtils.kt:L12)
  1.) [?=pairs] …sentence [?] containin… (…BertTrainer.kt:L112)
  2.) [?=help] …--index", [?] = "Prebui… (…Environment.kt:L23)
  3.) [?=help] …--query", [?] = "Query… (…Environment.kt:L21)
  4.) [?=help] …"--path", [?] = "Root d… (…Environment.kt:L18)

Found 4 results in 2.82ms
```
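
This kind of substring index can be built with the Concurrent Trees library (see Libraries below); a minimal sketch, assuming the concurrent-trees suffix tree API:

```kotlin
import com.googlecode.concurrenttrees.radix.node.concrete.DefaultCharArrayNodeFactory
import com.googlecode.concurrenttrees.suffix.ConcurrentSuffixTree

// Minimal sketch: index each line of code in a concurrent suffix tree,
// then retrieve the line numbers of every line containing a query substring.
fun buildIndex(lines: List<String>): ConcurrentSuffixTree<Int> =
  ConcurrentSuffixTree<Int>(DefaultCharArrayNodeFactory()).apply {
    lines.forEachIndexed { lineNo, line -> if (line.isNotBlank()) put(line, lineNo) }
  }

fun search(index: ConcurrentSuffixTree<Int>, query: String): List<Int> =
  index.getValuesForKeysContaining(query).toList()
```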

Nearest neighbor search

What do k-nearest neighbors look like?

```bash
./gradlew -q knnSearch --args='--query=<QUERY> [--path=<PATH_TO_INDEX>] [--index=<INDEX_FILE>] [--graphs=10]'
```

```
$ ./gradlew -q knnSearch --args='--query="const val MAX_GPUS = 1"'
Searching KNN index of size 981 for [?]=[const val MAX_GPUS = 1]…

0.) const val MAX_GPUS = 1
1.) const val MAX_BATCH = 50
2.) const val MAX_VOCAB = 35000
3.) const val EPOCHS = 100000
4.) const val BATCH_SIZE = 24
5.) const val MAX_SEQUENCE_LENGTH = 128
6.) const val CLS = ""
7.) dataSize.toLong()
8.) const val BERT_EMBEDDING_SIZE = 768
9.) const val UNK = ""

Fetched nearest neighbors in 1.48674ms

|------> Original index before reranking by MetricLCS
|     |-> Current index after reranking by MetricLCS
|     |
  0->0.) const val MAX_GPUS = 1
  1->1.) const val MAX_BATCH = 50
 14->2.) const val MSK = ""
  3->3.) const val EPOCHS = 100000
  4->4.) const val BATCH_SIZE = 24
  2->5.) const val MAX_VOCAB = 35000
363->6.) ).default("const val MAX_GPUS = 1")
  6->7.) const val CLS = ""
  9->8.) const val UNK = ""
 16->9.) const val SEP = ""

Reranked nearest neighbors in 1.412775ms
```
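
The vector index is built on HNSW (see Libraries below), and the reranking step above applies MetricLCS from java-string-similarity to the retrieved candidates. A sketch of constructing and querying such an index, with an illustrative Snippet item type:

```kotlin
import com.github.jelmerk.knn.DistanceFunctions
import com.github.jelmerk.knn.Item
import com.github.jelmerk.knn.hnsw.HnswIndex

// Illustrative sketch: approximate nearest-neighbor search over code
// embeddings using Hierarchical Navigable Small World graphs.
class Snippet(val code: String, val embedding: DoubleArray) : Item<String, DoubleArray> {
  override fun id() = code
  override fun vector() = embedding
  override fun dimensions() = embedding.size
}

fun buildIndex(snippets: List<Snippet>): HnswIndex<String, DoubleArray, Snippet, Double> =
  HnswIndex
    .newBuilder(snippets.first().dimensions(), DistanceFunctions.DOUBLE_COSINE_DISTANCE, snippets.size)
    .withM(16)
    .build<String, Snippet>()
    .apply { addAll(snippets) }

fun knn(index: HnswIndex<String, DoubleArray, Snippet, Double>, query: DoubleArray, k: Int) =
  index.findNearest(query, k).map { it.item().code }
```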

Semantic vs. syntactic similarity

What do nearest neighbors share in common?

```bash
./gradlew nearestNeighbors
```

```
$ ./gradlew nearestNeighbors
Angle brackets enclose longest common substring up to current result

0.]  dataSize.toLong()
  0.0] executorService.shutdownNow()
  0.1] PolynomialDecayTracker.builder《()》
  0.2] .toLabeledGraph《()》
  0.3] WarmUpTracker.builder《()》
  0.4] .allCodeFragments《()》
  …
1.]  .toLabeledGraph()
  1.0] .toTypedArray()
  1.1] 《.to》List()
  1.2] .asSequence《()》
  1.3] .allCodeFragments《()》
  …
8.]  const val MAX_BATCH = 50
  8.0] const val MAX_VOCAB = 35000
  8.1] 《const val MAX_》GPUS = 1
  8.2] 《const val 》EPOCHS = 100000
  8.3] 《const val 》MAX_SEQUENCE_LENGTH = 128
  8.4] 《const val 》BATCH_SIZE = 24
  …
19.] package edu.mcgill.cstk.djl
  19.0] package edu.mcgill.cstk.experiments
  19.1] 《package edu.mcgill.cstk.》inference
  19.2] 《package edu.mcgill.cstk.》disk
  19.3] 《package edu.mcgill.cstk》
  …
(Remaining results elided.)
```

Indexing

Provides a mock API for filesystem interactions.

Stores BPE-compressed files in memory.

Gives an agent the ability to selectively read and query files.

Interface:

  • Path.read(start, end) - Returns file chunk at offset.
  • Path.grep(query) - Returns offsets matching query.
  • Path.knn(code) - Fetches similar code snippets to the query.
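
A sketch of that interface with naive stand-in implementations (signatures are illustrative, not the repository's exact declarations):

```kotlin
import java.nio.file.Path
import kotlin.io.path.readText

// Illustrative stand-ins for the indexing API described above; the real
// versions are backed by a BPE-compressed in-memory store.
fun Path.read(start: Int = 0, end: Int = Int.MAX_VALUE): String = readText().let {
  val s = start.coerceIn(0, it.length)
  it.substring(s, end.coerceIn(s, it.length))
}

fun Path.grep(query: String): List<Int> =
  Regex(Regex.escape(query)).findAll(readText()).map { it.range.first }.toList()

// Path.knn(code) would additionally require the vector index described above.
```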

Deployment

If, for some reason, Gradle does not work on Compute Canada, you can build a fat JAR locally and then deploy it.

To run the CPU experiments:

```bash
./niagara_deploy.exp && ./niagara_submit.exp
```

To run the GPU experiments:

```bash
./gradlew shadowJar && \
scp build/libs/gym-fs-fat-1.0-SNAPSHOT.jar breandan@niagara.computecanada.ca:/home/b/bengioy/breandan/cstk

salloc -t 3:0:0 --account=def-jinguo --gres=gpu:v100:1 --mem=32G --cpus-per-task=24
```

To start, you must have Java and Python with PyTorch and HuggingFace installed:

```bash
export TRANSFORMERS_OFFLINE=1
module load python/3.8

# Load Java through CCEnv when running on Niagara:
# module load CCEnv StdEnv java/17.0.2
module load java/17.0.2
source venv/bin/activate

# Start the embedding server in the background, then run the experiment
python scripts/embedding_server.py --model microsoft/graphcodebert-base --offline &
java -jar gym-fs-fat-1.0-SNAPSHOT.jar 2>&1 | tee /scratch/b/bengioy/breandan/log.txt
```

Research Questions

Some research questions which this work attempts to explore:

  • Can we learn to synthesize a search query which is likely to retrieve results containing information relevant to the local context?
  • Do good queries contain keywords from the surrounding context? What information sources are the most salient?
  • Can we learn a highly compressed index of all artifacts on GitHub for fast offline lookups with just-in-time retrieval?
  • What if we allowed the user to configure the search settings?
    • Clone type (I/II/III/IV)
    • File extension filter
    • Source code context
    • Edge construction
    • Ranking metric
  • How do we clearly communicate search result alignment? Concordance++
  • What information can be extracted from the search results? Can we adapt information from results into the local context, to suggest e.g. naming, code fixes?

Libraries

  • Concurrent Trees - Fast indexing and retrieval
  • HNSW - Java library for approximate nearest neighbor search using Hierarchical Navigable Small World graphs
  • java-string-similarity - Implementations of various string similarity and distance algorithms
  • Commons VFS - Virtual file system for compressed files
  • LearnLib - Java library for automata learning algorithms
  • OR-Tools - Software suite for combinatorial optimization
  • KeyBERT - Minimal keyword extraction using BERT

Mining repositories

Papers

Example-centric programming

Symbolic automata

Grammar Induction

Automata-based

RE-based

Natural Language Processing

Learning to Rank

Resources

Benchmarking

Owner

  • Name: breandan
  • Login: breandan
  • Kind: user

GitHub Events

Total
  • Watch event: 2
  • Push event: 168
  • Fork event: 1
Last Year
  • Watch event: 2
  • Push event: 168
  • Fork event: 1

Committers

Last synced: over 1 year ago

All Time
  • Total Commits: 1,009
  • Total Committers: 4
  • Avg Commits per committer: 252.25
  • Development Distribution Score (DDS): 0.142
Past Year
  • Commits: 337
  • Committers: 1
  • Avg Commits per committer: 337.0
  • Development Distribution Score (DDS): 0.0
Top Committers
Name Email Commits
breandan b****e@n****o 866
breandan b****e@m****a 128
breandan.considine b****e@g****m 13
xiaojie xu x****u@m****a 2
Committer Domains (Top 20 + Academic)

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
Pull Request Authors
Top Labels
Issue Labels
Pull Request Labels

Dependencies

build.gradle.kts maven
  • ai.djl.fasttext:fasttext-engine $djlVersion implementation
  • ai.djl.huggingface:tokenizers $djlVersion implementation
  • ai.djl.mxnet:mxnet-engine $djlVersion implementation
  • ai.djl.mxnet:mxnet-model-zoo $djlVersion implementation
  • ai.djl.mxnet:mxnet-native-cu102mkl 1.9.0 implementation
  • ai.djl.sentencepiece:sentencepiece $djlVersion implementation
  • ai.djl.tensorflow:tensorflow-engine 0.12.0 implementation
  • ai.djl.tensorflow:tensorflow-native-cu101 2.3.1 implementation
  • ai.djl:api $djlVersion implementation
  • ai.djl:examples 0.6.0 implementation
  • ai.djl:model-zoo $djlVersion implementation
  • ai.hypergraph:kaliningraph * implementation
  • com.beust:klaxon 5.6 implementation
  • com.github.ajalt.clikt:clikt 3.5.0 implementation
  • com.github.ben-manes.caffeine:caffeine 3.1.1 implementation
  • com.github.citiususc:jflap-lib 1.3 implementation
  • com.github.h0tk3y.betterParse:better-parse 0.4.2 implementation
  • com.github.haifengl:smile-core $smileVersion implementation
  • com.github.haifengl:smile-kotlin $smileVersion implementation
  • com.github.jelmerk:hnswlib-core $hnswlibVersion implementation
  • com.github.jelmerk:hnswlib-utils $hnswlibVersion implementation
  • com.github.lorisdanto.symbolicautomata:0da3f79677 * implementation
  • com.github.qcri:EvaluationMetrics 62ff519478 implementation
  • com.github.rodhilton:jasome 0.6.8-alpha implementation
  • com.github.tech-srl:prime 5fae8f309f implementation
  • com.google.ortools:ortools-java $ortoolsVersion implementation
  • com.google.ortools:ortools-linux-x86-64 $ortoolsVersion implementation
  • com.googlecode.concurrent-trees:concurrent-trees 2.6.1 implementation
  • de.learnlib.distribution:learnlib-distribution 0.16.0 implementation
  • dk.brics:automaton 1.12-4 implementation
  • fr.inria.gforge.spoon:spoon-core 10.2.0-beta-14 implementation
  • guru.nidi:graphviz-java 0.18.1 implementation
  • guru.nidi:graphviz-kotlin 0.18.1 implementation
  • info.debatty:java-string-similarity 2.0.0 implementation
  • io.github.java-diff-utils:java-diff-utils 4.12 implementation
  • io.github.vovak:astminer 0.9.0 implementation
  • io.joern:javasrc2cpg_2.13 0.0.5 implementation
  • net.automatalib.distribution:automata-distribution 0.10.0 implementation
  • net.sf.extjwnl:extjwnl 2.0.5 implementation
  • net.sf.extjwnl:extjwnl-data-wn31 1.2 implementation
  • net.sourceforge.htmlunit:htmlunit 2.63.0 implementation
  • org.apache.commons:commons-compress 1.21 implementation
  • org.apache.commons:commons-vfs2 2.9.0 implementation
  • org.gitlab4j:gitlab4j-api 5.0.1 implementation
  • org.hipparchus:hipparchus-stat 2.1 implementation
  • org.jetbrains.kotlinx:multik-core $multikVersion implementation
  • org.jetbrains.kotlinx:multik-default $multikVersion implementation
  • org.jetbrains.lets-plot:lets-plot-kotlin-jvm 3.3.0 implementation
  • org.kohsuke:github-api 1.307 implementation
  • org.openrewrite:rewrite-java $openrwVersion implementation
  • org.slf4j:slf4j-simple 2.0.0-beta1 implementation
  • org.tribuo:tribuo-clustering-kmeans 4.2.1 implementation
  • org.openrewrite:rewrite-java-11 $openrwVersion runtimeOnly
.github/workflows/main.yml actions
  • actions/checkout v2 composite
  • actions/setup-java v1 composite