paresy

Artifacts for the PLDI 2023 paper "Search-Based Regular Expression Inference on a GPU"

https://github.com/mojtabavalizadeh/paresy

Science Score: 67.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
    Found 6 DOI reference(s) in README
  • Academic publication links
    Links to: arxiv.org, acm.org
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (9.8%) to scientific vocabulary

Repository


Basic Info
Statistics
  • Stars: 14
  • Watchers: 2
  • Forks: 1
  • Open Issues: 1
  • Releases: 0
Created almost 3 years ago · Last pushed about 1 year ago
Metadata Files
  • Readme
  • Citation

README.md

PaRESy

This repo contains the code and other artifacts for the paper.

Search-Based Regular Expression Inference on a GPU

by Mojtaba Valizadeh and Martin Berger. The paper is available at:

  • https://dl.acm.org/doi/10.1145/3591274 (publisher)
  • https://arxiv.org/abs/2305.18575 (draft)

The original code, submitted to PLDI 2023, is available under the tag v0.1 or at commit 6aa87ae. See also https://github.com/MojtabaValizadeh/ltl-learning-on-gpus for follow-up work that trades off scalability against minimality.

Introduction

In this work, the goal is to find a precise and minimal regular expression (RE): one that accepts a given set of positive strings and rejects a given set of negative ones. To this end, we developed an algorithm called PaRESy (Parallel Regular Expression Synthesiser), implemented twice: a CPU version in C++ and a GPU version in CUDA C++. This lets us measure the GPU speed-up on the most challenging examples.
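The search-based idea can be sketched as a bottom-up enumeration of REs in order of increasing cost, returning the first precise one found. The sketch below is in Python rather than the project's C++/CUDA, ignores GPU parallelism and the ε/Φ base cases, uses a uniform cost of 1 per constructor, and is exponential; the function names and the `max_cost` cutoff are our own illustrative choices, not the repository's implementation:

```python
import re
from itertools import product

def precise(pattern, pos, neg):
    """A regex is precise iff it accepts every positive string
    and rejects every negative string."""
    return (all(re.fullmatch(pattern, s) for s in pos)
            and not any(re.fullmatch(pattern, s) for s in neg))

def infer(pos, neg, alphabet="01", max_cost=6):
    """Enumerate regexes by total cost (1 per constructor) and
    return the first precise one, which is therefore cost-minimal."""
    by_cost = {1: set(alphabet)}               # cost-1 REs: single literals
    for c in range(1, max_cost + 1):
        by_cost.setdefault(c, set())
        if c >= 2:
            for r in by_cost[c - 1]:           # unary: r? and r*, each adds 1
                by_cost[c].add(f"(?:{r})?")
                by_cost[c].add(f"(?:{r})*")
            for c1 in range(1, c - 1):         # binary: concat and union add 1
                c2 = c - 1 - c1
                for r1, r2 in product(by_cost[c1], by_cost[c2]):
                    by_cost[c].add(f"(?:{r1})(?:{r2})")
                    by_cost[c].add(f"(?:{r1})|(?:{r2})")
        for r in sorted(by_cost[c]):
            if precise(r, pos, neg):
                return r
    return None

print(infer(["", "0", "00", "000"], ["1", "01"]))  # a cost-2 RE such as (?:0)*
```

The real algorithm avoids this blow-up by, among other things, aggressive deduplication and massive GPU parallelism; the sketch only conveys the cost-ordered search.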

In this version of the work, we use a simple grammar for the REs:

R ::= Φ | ε | a | R? | R* | R.R | R+R

Regarding minimality, we use a cost function that maps every constructor in an RE to a positive integer; the overall cost of an RE is the sum of the costs of all its constructors. This cost function helps us avoid overfitting, i.e., returning the trivial RE that is just the union of the positive strings.

Example:

  • Positive: ["", "010", "1000", "0101", "1010", "0010"]
  • Negative: ["001", "1011", "111"]

Note: In this work, we use "" to represent the epsilon (ε) as the empty string.

We aim to find a regular expression (RE) that accepts all strings in the positive set and rejects all strings in the negative set. There are infinitely many such REs, however, and we want the minimal one with respect to a given cost function, which assigns a positive integer to each constructor: alphabet literals, question mark, star, concatenation, and union. Suppose the costs of these constructors are [c1, c2, c3, c4, c5]. Different cost assignments then yield different minimal REs, for example:

  • [1, 1, 1, 1, 1] ---> (0+101?)*
  • [1, 5, 1, 1, 5] ---> (01)*(1*0)*

Note: In this work, we use the + symbol to represent the union constructor, which is commonly denoted by | in standard libraries.
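The cost of an RE under a vector [c1, c2, c3, c4, c5] is just a sum over its syntax tree. The recursion below illustrates this; the tuple-based AST encoding is our own, not the repository's:

```python
# Cost of an RE under costs [c1, c2, c3, c4, c5] for
# alphabet literal, '?', '*', concatenation ('.'), and union ('+').
def cost(ast, c):
    op = ast[0]
    if op == "lit":                       # a single alphabet symbol
        return c[0]
    if op == "?":
        return c[1] + cost(ast[1], c)
    if op == "*":
        return c[2] + cost(ast[1], c)
    if op == ".":
        return c[3] + cost(ast[1], c) + cost(ast[2], c)
    if op == "+":
        return c[4] + cost(ast[1], c) + cost(ast[2], c)
    raise ValueError(f"unknown constructor {op!r}")

def lit(a): return ("lit", a)

# (0+101?)*  -- four literals, one '?', two concatenations, one '+', one '*'
r1 = ("*", ("+", lit("0"),
            (".", lit("1"), (".", lit("0"), ("?", lit("1"))))))

# (01)*(1*0)* -- four literals, three '*', three concatenations
r2 = (".", ("*", (".", lit("0"), lit("1"))),
           ("*", (".", ("*", lit("1")), lit("0"))))

print(cost(r1, [1, 1, 1, 1, 1]), cost(r2, [1, 1, 1, 1, 1]))  # 9 10
print(cost(r1, [1, 5, 1, 1, 5]), cost(r2, [1, 5, 1, 1, 5]))  # 17 10
```

Under the uniform costs the first RE is cheaper (9 vs 10), while under [1, 5, 1, 1, 5] the second wins (10 vs 17), which is consistent with the two results listed above.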

In this simple example, both REs are precise (i.e., each accepts all positive and rejects all negative examples) and minimal w.r.t. its own cost function. By increasing the costs of question mark and union in the second cost function, the resulting RE contains fewer instances of these constructors; to compensate, it tends to use more stars, which are cheap under that particular cost function.
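Precision of both REs can be checked with an off-the-shelf regex engine. Here is a small sanity check in Python, translating the '+' union into Python's '|':

```python
import re

pos = ["", "010", "1000", "0101", "1010", "0010"]
neg = ["001", "1011", "111"]

# The two inferred REs, in Python syntax ('+' union becomes '|'):
candidates = {"(0+101?)*": "(0|101?)*",
              "(01)*(1*0)*": "(01)*(1*0)*"}

for name, pat in candidates.items():
    ok = (all(re.fullmatch(pat, s) for s in pos)
          and not any(re.fullmatch(pat, s) for s in neg))
    print(name, "precise:", ok)  # both print True
```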

Colab Notebook

This work is also presented as a Google Colab notebook, which automatically clones this GitHub repository and comes fully equipped with all necessary dependencies and libraries, so no additional installation is needed. You can run the scripts by clicking the designated buttons and adjusting the inputs as needed.

To access the notebook, please use the link below and follow the instructions provided.

Citations

@article{ValizadehM:seabasreioag,
  address      = {New York, NY, USA},
  articleno    = {160},
  author       = {Valizadeh, Mojtaba and Berger, Martin},
  doi          = {10.1145/3591274},
  issue_date   = {June 2023},
  journal      = {Proc. ACM Program. Lang.},
  keywords     = {machine learning, program synthesis, GPU, Grammar inference, regular expression inference},
  month        = {jun},
  note         = {Draft available at \url{https://arxiv.org/abs/2305.18575}, implementation: \url{https://github.com/MojtabaValizadeh/paresy}},
  number       = {PLDI},
  numpages     = {23},
  publisher    = {Association for Computing Machinery},
  title        = {{Search-Based Regular Expression Inference on a GPU}},
  url          = {https://doi.org/10.1145/3591274},
  volume       = {7},
  year         = {2023},
  bdsk-url-1   = {https://doi.org/10.1145/3591274}
}

@misc{Valizadeh_Code_and_data,
  author       = {Valizadeh, Mojtaba and Berger, Martin},
  title        = {{Code and data for the PLDI 2023 paper: Search-Based Regular Expression Inference on a GPU}},
  year         = {2023},
  howpublished = {https://github.com/MojtabaValizadeh/paresy}
}

Owner

  • Name: Mojtaba Valizadeh
  • Login: MojtabaValizadeh
  • Kind: user

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Valizadeh"
  given-names: "Mojtaba"
  orcid: "https://orcid.org/0000-0003-1582-3213"
- family-names: "Berger"
  given-names: "Martin"
  orcid: "https://orcid.org/0000-0003-3239-5812"
title: "Code and data for the PLDI 2023 paper: Search-Based Regular Expression Inference on a GPU"
version: TBC
doi: TBC
date-released: TBC
url: "https://github.com/MojtabaValizadeh/paresy"

GitHub Events

Total
  • Watch event: 3
  • Push event: 2
Last Year
  • Watch event: 3
  • Push event: 2