p2smi: A toolkit enabling SMILES generation and property analysis for noncanonical and cyclized peptides
p2smi: A toolkit enabling SMILES generation and property analysis for noncanonical and cyclized peptides - Published in JOSS (2025)
Science Score: 87.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ○ codemeta.json file
- ○ .zenodo.json file
- ✓ DOI references: Found 2 DOI reference(s) in README and JOSS metadata
- ✓ Academic publication links: Links to arxiv.org, acs.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ✓ JOSS paper metadata: Published in Journal of Open Source Software
Repository
Basic Info
- Host: GitHub
- Owner: AaronFeller
- License: mit-0
- Language: Python
- Default Branch: master
- Size: 11.4 MB
Statistics
- Stars: 12
- Watchers: 1
- Forks: 2
- Open Issues: 2
- Releases: 2
Metadata Files
README.md

p2smi: Generation and Analysis of Drug-like Peptide SMILES Strings
p2smi is a Python toolkit for peptide design and analysis.
It enables generation of peptide sequences, conversion to SMILES representations—including support for cyclic and noncanonical amino acids—and evaluation of molecular properties. The package also provides utilities for structural modification (e.g., N-methylation, PEGylation), synthesis feasibility assessment, and output in a dedicated .p2smi format that links peptide sequences to their corresponding SMILES.
Developed in support of PeptideCLM, a SMILES-based language model for modified peptides, p2smi provides an extensible foundation for computational peptide chemistry and machine-learning-driven molecular design.
Features
- Generate random peptide sequences (with NCAAs, D-stereochemistry, and cyclization)
- Convert peptide FASTA files into valid SMILES strings
- Support five cyclization types: disulfide, head-to-tail, sidechain-to-sidechain, sidechain-to-N-term, sidechain-to-C-term
- Modify SMILES with user-defined N-methylation and PEGylation rates
- Evaluate synthetic feasibility based on common failure motifs
- Compute molecular properties (MW, logP, TPSA, Lipinski, etc.)
Updates
- Version 1.1.1 - Added functionality to allow for user-defined cyclizing residue constraints
- Version 1.1.0 - Updated codebase and documentation and fixed bugs for JOSS review
- Version 1.0.0 - First release for JOSS submission
Citation
If you use this tool, please cite:
p2smi: A Python Toolkit for Peptide FASTA-to-SMILES Conversion and Molecular Property Analysis.
Feller, A. L. and Wilke, C. O. (2025).
arXiv
A JOSS publication for this package is in review.
Manuscript
Directory
Installation
Install from PyPI:
```bash
pip install p2smi
```
For local development:
```bash
git clone https://github.com/AaronFeller/p2smi.git
cd p2smi
pip install -e .[dev]
```
Command-Line Tools
| Command | Description |
|---|---|
| generate-peptides | Summary: Generates random peptide sequences with user-defined constraints, including number of sequences, length range, NCAA percentage, D-stereochemistry rate, and cyclization types. Supports over 100 noncanonical amino acids (SwissSidechain).<br>Input: CLI arguments for generation settings and output filename.<br>Output: FASTA file with single-letter codes, including noncanonical residues. |
| fasta2smi | Summary: Converts peptide sequences from FASTA format into SMILES, parsing cyclization tags from the FASTA header.<br>Note: Supports five cyclization types: disulfide (SS), head-to-tail (HT), sidechain-to-sidechain (SCSC), sidechain-to-head (SCNT), and sidechain-to-tail (SCCT). To define specific cyclizations, include the notation in the FASTA header as described in the next section.<br>Input: Peptide FASTA file, optional cyclization tags.<br>Output: .p2smi file containing amino acid sequence, cyclization type, and SMILES string. |
| modify-smiles | Summary: Applies random N-methylation and PEGylation to SMILES strings. Modifications are probabilistic and tracked when input is in .p2smi format.<br>Input: Plaintext SMILES file or .p2smi file.<br>Output: Modified SMILES in the same format as the input, with changes recorded. |
| smiles-props | Summary: Computes a wide range of molecular properties from SMILES, including MW, TPSA, logP, H-bond donors/acceptors, rotatable bonds, ring count, fraction Csp3, heavy atoms, formal charge, molecular formula, and Lipinski rule evaluation.<br>Input: SMILES text file or .p2smi file.<br>Output: JSON-formatted text file with calculated properties for each SMILES. |
| synthesis-check | Summary: Evaluates peptide sequences for synthetic feasibility using hard-coded filters (e.g., N/Q at N-terminus, Gly/Pro motifs, Cys count, hydrophobicity, charge distribution). Currently supports natural amino acids only.<br>Input: FASTA file.<br>Output: FASTA file with headers annotated as PASS/FAIL. |
Use --help on any command for options:
```bash
fasta2smi --help
```
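To make the idea of hard-coded feasibility filters concrete, here is a minimal sketch of two of the checks named for synthesis-check in the table above (N/Q at the N-terminus and Cys count). This is not p2smi's implementation: the function name `check_synthesis` and the `max_cys` cutoff are hypothetical, and the real thresholds and remaining filters are not stated in this README.

```python
from typing import List, Tuple


def check_synthesis(sequence: str, max_cys: int = 2) -> Tuple[str, List[str]]:
    """Return ("PASS" or "FAIL", reasons) for a natural-amino-acid sequence.

    `max_cys` is a hypothetical cutoff chosen only for illustration.
    """
    seq = sequence.strip().upper()
    if not seq:
        raise ValueError("empty sequence")

    reasons: List[str] = []

    # Filter named in the README: Asn (N) or Gln (Q) at the N-terminus.
    if seq[0] in "NQ":
        reasons.append("N/Q at N-terminus")

    # Cys count filter; the cutoff actually used by synthesis-check is an assumption here.
    if seq.count("C") > max_cys:
        reasons.append(f"more than {max_cys} Cys")

    return ("FAIL" if reasons else "PASS", reasons)
```

A sequence passes only when no filter fires, which mirrors the PASS/FAIL annotation described for the tool's output.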
Manually encoding cyclizations
Cyclizations can be specified directly in the FASTA header to control how fasta2smi interprets bond formation between residues.
Each cyclization tag begins with a two-letter code identifying the bond type (SS, SC, or HT). SS and SC are followed by a constraint mask with the same length as the peptide sequence; HT appears without a mask. In the mask:
- X marks positions left unconstrained
- C marks residues participating in a disulfide bond
- N marks residues with sidechain cyclization to the N-terminus
- Z marks residues with sidechain cyclization to the C-terminus
- If both N and Z are present, the marked residues form a sidechain-to-sidechain cyclization
Supported Formats:
| Tag | Type | Description | Example header|
|------|------|-------------|----------|
| SS | Disulfide | Connects two cysteine residues | >peptide\|SSXXXCXXXCX |
| HT | Head-to-tail | Amide bond between N- and C-termini | >peptide\|HT |
| SCSC | Sidechain–Sidechain | Covalent link between two sidechains (e.g., Lys–Asp lactam) | >peptide\|SCXXNXXXXXZ |
| SCNT | Sidechain–N-Terminus | Link between N-terminus and a sidechain residue | >peptide\|SCXXNXXXXXX |
| SCCT | Sidechain–C-Terminus | Link between a sidechain residue and C-terminus | >peptide\|SCXXXXXZXXX |
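The header notation in the table can be read mechanically. The sketch below parses a header into a cyclization type and mask, following only the rules described in this section. `parse_cyclization` and `Cyclization` are hypothetical names, not part of the p2smi API, and the sketch assumes the backslash before `|` in the table is markdown escaping, not part of the header.

```python
from typing import NamedTuple, Optional


class Cyclization(NamedTuple):
    kind: str            # "SS", "HT", "SCSC", "SCNT", or "SCCT"
    mask: Optional[str]  # constraint mask, or None for head-to-tail


def parse_cyclization(header: str, sequence: str) -> Optional[Cyclization]:
    """Parse '>name|TAG' and return the cyclization, or None if no tag is present."""
    _name, sep, tag = header.strip().lstrip(">").partition("|")
    if not sep or not tag:
        return None

    code, mask = tag[:2], tag[2:]
    if code == "HT":
        return Cyclization("HT", None)

    # SS and SC tags carry a mask with one character per residue.
    if len(mask) != len(sequence):
        raise ValueError("constraint mask must match the sequence length")

    if code == "SS":
        if not set(mask) <= {"X", "C"} or "C" not in mask:
            raise ValueError("SS mask may contain only X and C, with C present")
        return Cyclization("SS", mask)

    if code == "SC":
        has_n, has_z = "N" in mask, "Z" in mask
        if has_n and has_z:
            return Cyclization("SCSC", mask)
        if has_n:
            return Cyclization("SCNT", mask)
        if has_z:
            return Cyclization("SCCT", mask)
        raise ValueError("SC mask needs at least one N or Z")

    raise ValueError(f"unknown cyclization code: {code!r}")
```

For example, `parse_cyclization(">peptide|SCXXNXXXXXZ", "ACKEFGHID")` classifies the header as sidechain-to-sidechain because the mask contains both N and Z.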
Example Usage
Generate random peptides with constraints:
```bash
generate-peptides \
--num 10 \
--min_length 10 \
--max_length 20 \
--noncanonical 0.1 \
--dextro 0.1 \
--cyclization_constraints all \
--outfile peptides.fasta
```
Convert FASTA to SMILES:
```bash
fasta2smi -i peptides.fasta -o peptides.p2smi
```
Modify SMILES strings:
```bash
modify-smiles -i peptides.p2smi -o modified.p2smi --peg_rate 0.2 --nmeth_rate 0.2 --nmeth_residues 0.2
```
Compute molecular properties:
```bash
smiles-props -i modified.p2smi
```
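As an illustration of what a Lipinski rule evaluation involves, the sketch below applies the standard rule-of-five limits (MW 500, logP 5, 5 H-bond donors, 10 H-bond acceptors) to properties that have already been computed. It does not reproduce how smiles-props computes or reports the rule: the dictionary keys and function names are hypothetical, and allowing one violation is a common convention assumed here.

```python
from typing import Dict, List

# Standard rule-of-five upper limits.
LIPINSKI_LIMITS = {"mw": 500.0, "logp": 5.0, "hbd": 5, "hba": 10}


def lipinski_violations(props: Dict[str, float]) -> List[str]:
    """List the property names (hypothetical keys) that exceed their limit."""
    return [key for key, limit in LIPINSKI_LIMITS.items() if props[key] > limit]


def passes_lipinski(props: Dict[str, float], max_violations: int = 1) -> bool:
    """Pass when the number of violations is at most `max_violations` (assumed convention)."""
    return len(lipinski_violations(props)) <= max_violations


# Made-up values for a large peptide-like molecule.
example = {"mw": 1202.6, "logp": 3.0, "hbd": 5, "hba": 12}
print(lipinski_violations(example), passes_lipinski(example))
```

Peptides often exceed the molecular weight and acceptor limits, so a per-property list of violations is usually more informative than a single pass/fail flag.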
Check synthesis feasibility (natural AAs only):
```bash
generate-peptides -o nat_peptides.fasta
synthesis-check -i nat_peptides.fasta
```
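The generate, convert, modify, and analyze steps above can also be chained from Python. The sketch below only assembles the command lines already shown in this section and runs them with `subprocess`. The helper names `build_pipeline` and `run_pipeline` are hypothetical, and the sketch assumes the p2smi commands are installed and on `PATH`.

```python
import subprocess
from typing import List


def build_pipeline(fasta: str, p2smi: str, modified: str) -> List[List[str]]:
    """Return the argv lists for the four steps, using the flags from the examples above."""
    return [
        ["generate-peptides", "--num", "10", "--min_length", "10",
         "--max_length", "20", "--noncanonical", "0.1", "--dextro", "0.1",
         "--cyclization_constraints", "all", "--outfile", fasta],
        ["fasta2smi", "-i", fasta, "-o", p2smi],
        ["modify-smiles", "-i", p2smi, "-o", modified,
         "--peg_rate", "0.2", "--nmeth_rate", "0.2", "--nmeth_residues", "0.2"],
        ["smiles-props", "-i", modified],
    ]


def run_pipeline(commands: List[List[str]]) -> None:
    """Run each step in order, stopping at the first non-zero exit code."""
    for argv in commands:
        subprocess.run(argv, check=True)
```

Calling `run_pipeline(build_pipeline("peptides.fasta", "peptides.p2smi", "modified.p2smi"))` executes the steps in sequence. Keeping command construction separate from execution makes the arguments easy to inspect or log before anything runs.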
Future Work
- Extend synthesis rules to NCAAs and modified peptides
- Support alternative encodings (HELM, SELFIES)
- Batch processing and multiprocessing support
- Integration with predictive models
- Post-translational modification import pipelines
For Contributors
You’re welcome to contribute! Suggestions, bug reports, and pull requests are appreciated.
- 📂 Open an Issue
- 🛠 Submit a pull request
- 📝 Improve the docs
License
Owner
- Name: Aaron
- Login: AaronFeller
- Kind: user
- Repositories: 2
- Profile: https://github.com/AaronFeller
JOSS Publication
p2smi: A toolkit enabling SMILES generation and property analysis for noncanonical and cyclized peptides
Authors
Tags
cheminformatics bioinformatics peptides SMILES molecular properties
GitHub Events
Total
- Create event: 3
- Release event: 2
- Issues event: 4
- Watch event: 12
- Delete event: 2
- Issue comment event: 4
- Push event: 35
- Pull request event: 2
- Fork event: 1
Last Year
- Create event: 3
- Release event: 2
- Issues event: 4
- Watch event: 12
- Delete event: 2
- Issue comment event: 4
- Push event: 35
- Pull request event: 2
- Fork event: 1
Committers
Last synced: 5 months ago
Top Committers
| Name | Email | Commits |
|---|---|---|
| AaronFeller | a****r@g****m | 57 |
| Fergal Duffy | f****d@g****m | 14 |
Issues and Pull Requests
Last synced: 5 months ago
All Time
- Total issues: 4
- Total pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: 19 minutes
- Total issue authors: 3
- Total pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 4
- Pull requests: 3
- Average time to close issues: N/A
- Average time to close pull requests: 19 minutes
- Issue authors: 3
- Pull request authors: 1
- Average comments per issue: 0.0
- Average comments per pull request: 0.0
- Merged pull requests: 2
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- andregiuseppecavalli (2)
- dr-aspirinas (1)
- jfFEM (1)
Pull Request Authors
- AaronFeller (3)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads:
  - PyPI: 289 last month
- Total dependent packages: 0
- Total dependent repositories: 0
- Total versions: 4
- Total maintainers: 1
pypi.org: p2smi
A toolkit enabling SMILES generation and property analysis for noncanonical and cyclized peptides.
- Homepage: https://github.com/AaronFeller/p2smi
- Documentation: https://p2smi.readthedocs.io/
- License: MIT
- Latest release: 1.1.1 (published 4 months ago)
