Recent Releases of pytoda
pytoda - v1.1.6
What's Changed
- Crawlers by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/165
- Fix a bug in the PubChem crawler that occurs when the database has multiple SMILES for a molecule
- Remove the ZINC crawler because it is no longer functional
- Restructure the installation to support recent versions of numpy (2), torch (2), pip (20-24) and python (3.9-3.12)
Full Changelog: https://github.com/PaccMann/paccmann_datasets/compare/v1.1.5...v1.1.6
- Python
Published by jannisborn over 1 year ago
pytoda - Improved PubChem error handling
What's Changed
- Update PubChem crawler by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/163
Full Changelog: https://github.com/PaccMann/paccmann_datasets/compare/v1.1.2...v1.1.3
Published by jannisborn almost 3 years ago
pytoda - v1.1.1
Kinase sequence alignment data now available as pytoda.proteins.kinase_as_alignment: (https://github.com/PaccMann/paccmann_datasets/commit/245c9b56bce2557b23004fa4101fcb5afea9d69a)
Full Changelog: https://github.com/PaccMann/paccmann_datasets/compare/v1.1.0...v1.1.1
Published by jannisborn over 3 years ago
pytoda - v1.1.0
What's Changed
- Deprecate device handling by @YoelShoshan in https://github.com/PaccMann/paccmann_datasets/pull/156. NOTE: This DOES impact backwards compatibility: whenever a device is passed, an exception is now raised stating that GPU support is no longer maintained. The reason is that having pytoda send data to the GPU is significantly slower than sending the full batch (for details see: https://github.com/PaccMann/paccmann_datasets/issues/155)
- Novel protein augmentation by @jannisborn and @YoelShoshan in https://github.com/PaccMann/paccmann_datasets/pull/160
Full Changelog: https://github.com/PaccMann/paccmann_datasets/compare/v1.0.2...v1.1.0
Published by jannisborn over 3 years ago
pytoda - Release of pypi distribution (version 1.0.0)
Pytoda version 1.0.0
- Smiles transforms by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/137
- PubChem crawler can parse IDs by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/140
- Handling chores by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/143
- Codecov by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/145
- Fix proteinlanguage handling by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/146
- Multi protein languages by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/147
- Handling of SMILES transforms when language is passed to SMILESDataset by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/148
- PyPI Release by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/149
Full Changelog: https://github.com/PaccMann/paccmann_datasets/compare/0.2.5...1.0.0
Published by jannisborn about 4 years ago
pytoda - Query SMILES in PubChem, handle/impute NaN in GeneExpressionDataset
- Change to the black formatter with configuration files (#93)
- conda.yml: refers to requirements.txt (#92)
- GeneExpressionDataset: delay optional imputation of NaN until after statistics collection and transformation (#88)
- AugmentByReversing: can now take a probability of performing the reversal (#85)
- read_smi: raise an error when the wrong delimiter is used (#85)
- Added remove_pubchem_smiles to filter out PubChem SMILES (#85)
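A probability-gated reversal like the one AugmentByReversing now supports might look like the following minimal sketch (a hypothetical stand-in class, not pytoda's actual implementation):

```python
import random

class ReverseWithProbability:
    """Toy stand-in for an AugmentByReversing-style transform:
    reverse a sequence string with probability p, else return it unchanged."""

    def __init__(self, p=0.5, seed=None):
        self.p = p
        self.rng = random.Random(seed)  # private RNG for reproducibility

    def __call__(self, sequence: str) -> str:
        return sequence[::-1] if self.rng.random() < self.p else sequence

# p=1.0 always reverses, p=0.0 never does
reverse_always = ReverseWithProbability(p=1.0)
reverse_never = ReverseWithProbability(p=0.0)
```

With 0 < p < 1, each call independently decides whether to reverse, which is the usual way to mix augmented and unchanged samples during training.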
Published by C-nit over 5 years ago
pytoda - Refactor introducing base datasets supporting key lookup
Many top-level datasets in pytoda are "just" torch Datasets supporting __len__ and __getitem__. Typically the dataset itself pairs (and possibly labels) keys of samples/entities, while the respective item data comes from specific sources.
So, in their implementations, the datasets rely on other datasets specific to a datatype/entity, and on getting items via a hashable key rather than an integer index.
This PR introduces some base classes in base_dataset.py that provide an interface one can expect from such datasets.
New Base Classes
KeyDataset
Every KeyDataset can implement its own mechanism to store keys and match them with an index/item, but minimally implements get_key and get_index. That's it.
The following methods are available:
get_key(index: int) -> Hashable
get_index(key: Hashable) -> int
get_item_from_key(key: Hashable) -> Any
keys() -> Iterator
has_duplicate_keys -> bool
__add__(other) -> _ConcatenatedDataset
(with default implementations that can be overloaded with more efficient methods)
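The interface above can be illustrated with a minimal sketch (a hypothetical toy class, not pytoda's actual base class): only get_key and get_index touch the storage, and the remaining methods are derived from them.

```python
from typing import Any, Hashable, Iterator

class ToyKeyDataset:
    """Minimal sketch of a KeyDataset-style interface: items stored
    under hashable keys, integer indexing as in a torch Dataset."""

    def __init__(self, items: dict):
        self._keys = list(items)
        self._items = list(items.values())

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, index: int) -> Any:
        return self._items[index]

    # the two required lookups
    def get_key(self, index: int) -> Hashable:
        return self._keys[index]

    def get_index(self, key: Hashable) -> int:
        return self._keys.index(key)  # returns the first match

    # default implementations built on top of the two lookups
    def get_item_from_key(self, key: Hashable) -> Any:
        return self[self.get_index(key)]

    def keys(self) -> Iterator:
        return iter(self._keys)

    @property
    def has_duplicate_keys(self) -> bool:
        return len(set(self._keys)) < len(self._keys)
```

A subclass with a smarter key store (e.g. a dict-based reverse index) would only need to overload get_index; the derived methods keep working unchanged.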
ConcatKeyDataset
Based on torch's ConcatDataset, it supports concatenation of multiple KeyDatasets. The key lookup through the source datasets was previously implemented in each top-level dataset itself; it is now built in and delegates to each dataset's own implementation of the lookup.
It also features the methods get_index_pair and get_key_pair to additionally get the dataset index.
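The core of the "pair" lookup is mapping a global index into (dataset_index, local_index) over the cumulative sizes of the concatenated datasets, as torch's ConcatDataset does. A small standalone sketch (hypothetical helper, not pytoda code):

```python
from bisect import bisect_right
from itertools import accumulate

def locate(global_index: int, sizes: list) -> tuple:
    """Map a global index into (dataset_index, local_index), given the
    lengths of the concatenated datasets."""
    cumulative = list(accumulate(sizes))          # e.g. [3, 2] -> [3, 5]
    dataset_index = bisect_right(cumulative, global_index)
    offset = cumulative[dataset_index - 1] if dataset_index > 0 else 0
    return dataset_index, global_index - offset
```

With sizes [3, 2], global index 4 lands in the second dataset at local index 1; a get_key_pair-style method would then call get_key on that dataset.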
DatasetDelegator
Often there are base classes implementing functionality for a datatype, with the setup of the datasource (e.g. eager vs lazy, filetype) left to child classes.
A DatasetDelegator with an assigned self.dataset behaves as if it were that dataset, delegating all method/attribute calls it does not implement to self.dataset. This provides "base" methods without reimplementation while still allowing "overloading".
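In Python, this delegation pattern is typically built on __getattr__, which is only invoked when normal attribute lookup fails. A minimal sketch (hypothetical class, simplified relative to pytoda's DatasetDelegator):

```python
class ToyDelegator:
    """Any attribute or method not defined here is looked up on
    self.dataset, so the delegator behaves like the wrapped dataset
    while still allowing selective overloading."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __getattr__(self, name):
        # only called when normal lookup fails -> delegate to the dataset
        return getattr(self.dataset, name)

    # dunder calls bypass __getattr__, so forward them explicitly
    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        return self.dataset[index]
```

A child class can then override just the pieces it needs (e.g. a custom __getitem__) while every other call falls through to the wrapped dataset.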
keyed and indexed
Once a dataset is fed to a dataloader that shuffles the data, it is hard to go back and investigate loaded samples without the item's index/key.
The keyed and indexed functions, called on a dataset, return a shallow copy of the dataset with changed indexing behaviour, returning the index/key in addition to the item.
While AnnotatedDataset iterates through the samples in the annotation file, using keyed and indexed instead allows iterating through the samples in the dataset while still looking up labels manually from some source.
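The effect of indexed can be sketched with a simple wrapper (hypothetical and simplified; pytoda returns a shallow copy of the dataset rather than wrapping it):

```python
class Indexed:
    """Wrap a dataset so __getitem__ returns (index, item), letting
    samples be traced back even after a dataloader shuffles them."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        return index, self.dataset[index]
```

A keyed variant would return (get_key(index), item) instead, so labels can be fetched from an annotation source by key after the fact.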
Notes
In the case of duplicate keys, the behaviour is implementation-specific (e.g. it could raise or return the first/last match).
Many tests were refactored to test lazy/eager backends in the same file without code duplication, and tests were added for base methods where appropriate.
Datasets that filter items for availability from sources now usually define masks_df, which contains column-wise masks for the original df to allow inspecting missing entities per item.
Refactor
SMILESLanguage and SMILESTokenizer
SMILESLanguage can translate SMILES to token indexes. Transforms of SMILES and transforms of the encoded token indexes are separated and default to identity functions. Defining the transforms is the job of child implementations like SMILESTokenizer, or can be done at runtime on instances. There is a named choice of tokenization functions in TOKENIZER_FUNCTIONS, but the function can also be passed directly.
The instances can be used to load or build up vocabularies and remember the longest sequence.
The vocabulary can be stored to/loaded from .json. Additionally, an instance can be stored in/loaded from a directory of text files (with defined names). This is achieved similarly to the Hugging Face transformers library (attribution header and license added). A pretrained tokenizer is shipped in pytoda.smiles.metadata.
A new method add_dataset allows building up the vocabulary from an iterable (list, SMILESDataset, ...); it checks for invalid source SMILES, applies transform_smiles, and passes the result to the tokenizer function to add new tokens.
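The vocabulary-building loop can be sketched as follows. This is a toy class with a deliberately coarse regex tokenizer, not pytoda's SMILESLanguage or its actual tokenization functions:

```python
import re

# coarse SMILES tokenizer sketch: bracket atoms, two-letter halogens, single chars
SMILES_TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")

class ToyVocabulary:
    """Build up a token vocabulary from an iterable of SMILES, in the
    spirit of add_dataset: tokenize each string, assign new tokens
    increasing indexes, and remember the longest sequence seen."""

    def __init__(self):
        self.token_to_index = {"<pad>": 0, "<unk>": 1}
        self.max_length = 0

    def add_dataset(self, smiles_iterable):
        for smiles in smiles_iterable:
            tokens = SMILES_TOKEN_RE.findall(smiles)
            self.max_length = max(self.max_length, len(tokens))
            for token in tokens:
                # only assign an index the first time a token is seen
                self.token_to_index.setdefault(token, len(self.token_to_index))

    def encode(self, smiles):
        return [
            self.token_to_index.get(token, self.token_to_index["<unk>"])
            for token in SMILES_TOKEN_RE.findall(smiles)
        ]
```

Because the iterable only needs to yield strings, the same method works for a plain list or a SMILESDataset-style object.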
SMILESDataset and SMILESTokenizerDataset
SMILESDataset is now merely returning SMILES strings as one might expect from the name. This is a breaking change, where users of the old SMILESDataset can use SMILESTokenizerDataset now.
SMILESTokenizerDataset uses SMILESTokenizer as the default smiles language to transform items from a SMILESDataset.
SMILESTokenizerDataset can now optionally load a vocab and (not) iterate a dataset.
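The split between the two datasets can be sketched as follows (hypothetical toy classes; the real SMILESTokenizerDataset takes a smiles language object rather than a bare encode function):

```python
class ToySMILESDataset:
    """The plain dataset merely returns raw SMILES strings."""

    def __init__(self, smiles_list):
        self.smiles = smiles_list

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, index):
        return self.smiles[index]

class ToyTokenizerDataset:
    """The tokenizer dataset wraps a SMILES dataset and returns
    encoded token indexes instead of strings."""

    def __init__(self, dataset, encode):
        self.dataset = dataset
        self.encode = encode

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        return self.encode(self.dataset[index])
```

Users of the old SMILESDataset who relied on tokenized output would thus wrap their dataset in the tokenizer variant.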
Published by C-nit over 5 years ago
pytoda - Protein Language Modelling
Added various functionalities for Protein language modelling.
This includes a submodule pytoda.proteins with a ProteinLanguage class and two new types of datasets, available through pytoda.datasets, called ProteinSequenceDataset and ProteinProteinInteractionDataset.
Published by jannisborn almost 6 years ago
pytoda - Webservice migration
Several improvements were made, partially in response to the migration of our webservice (https://ibm.biz/paccmann-aas) to a pytorch-deployed model.
Published by jannisborn about 6 years ago