Recent Releases of pytoda
pytoda - v1.1.6
What's Changed
- Crawlers by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/165
- Fix a bug in the PubChem crawler that occurs when the database has multiple SMILES for a molecule
- Remove the ZINC crawler because it is no longer functional
- Restructure the installation to support recent versions of numpy (2), torch (2), pip (20-24) and python (3.9-3.12)
Full Changelog: https://github.com/PaccMann/paccmann_datasets/compare/v1.1.5...v1.1.6
- Python
Published by jannisborn over 1 year ago
pytoda - Improved PubChem error handling
What's Changed
- Update PubChem crawler by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/163
Full Changelog: https://github.com/PaccMann/paccmann_datasets/compare/v1.1.2...v1.1.3
Published by jannisborn almost 3 years ago
pytoda - v1.1.1
Kinase sequence alignment data now available as pytoda.proteins.kinase_as_alignment: (https://github.com/PaccMann/paccmann_datasets/commit/245c9b56bce2557b23004fa4101fcb5afea9d69a)
Full Changelog: https://github.com/PaccMann/paccmann_datasets/compare/v1.1.0...v1.1.1
Published by jannisborn over 3 years ago
pytoda - v1.1.0
What's Changed
- Deprecate device handling by @YoelShoshan in https://github.com/PaccMann/paccmann_datasets/pull/156. NOTE: This DOES impact backwards compatibility: whenever a device is passed, an exception is now raised stating that GPU support is no longer maintained. The reason is that having pytoda send data to the GPU is significantly slower than sending the full batch (for details see: https://github.com/PaccMann/paccmann_datasets/issues/155)
- Novel protein augmentation by @jannisborn and @YoelShoshan in https://github.com/PaccMann/paccmann_datasets/pull/160
Full Changelog: https://github.com/PaccMann/paccmann_datasets/compare/v1.0.2...v1.1.0
Published by jannisborn over 3 years ago
pytoda - Release of pypi distribution (version 1.0.0)
Pytoda version 1.0.0
- Smiles transforms by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/137
- PubChem crawler can parse IDs by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/140
- Handling chores by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/143
- Codecov by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/145
- Fix proteinlanguage handling by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/146
- Multi protein languages by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/147
- Handling of SMILES transforms when language is passed to SMILESDataset by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/148
- PyPI Release by @jannisborn in https://github.com/PaccMann/paccmann_datasets/pull/149
Full Changelog: https://github.com/PaccMann/paccmann_datasets/compare/0.2.5...1.0.0
Published by jannisborn about 4 years ago
pytoda - Query SMILES in PubChem, handle/impute NaN in GeneExpressionDataset
- Change to the black formatter with configuration files (#93)
- conda.yml: refers to requirements.txt (#92)
- GeneExpressionDataset: delay optional imputation of NaN until after statistics collection and transformation (#88)
- AugmentByReversing: can now take a probability of performing the reversal (#85)
- read_smi: raise an error when the wrong delimiter is used (#85)
- Added remove_pubchem_smiles to filter out PubChem SMILES (#85)
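A probability-gated reversal like the one AugmentByReversing now supports might look like the following minimal sketch (a hypothetical stand-in class, not pytoda's actual implementation):

```python
import random

class ReverseWithProbability:
    """Toy stand-in for an AugmentByReversing-style transform:
    reverse a sequence string with probability p, else return it unchanged."""

    def __init__(self, p=0.5, seed=None):
        self.p = p
        self.rng = random.Random(seed)  # private RNG for reproducibility

    def __call__(self, sequence: str) -> str:
        return sequence[::-1] if self.rng.random() < self.p else sequence

# p=1.0 always reverses, p=0.0 never does
reverse_always = ReverseWithProbability(p=1.0)
reverse_never = ReverseWithProbability(p=0.0)
```

With 0 < p < 1, each call independently decides whether to reverse, which is the usual way to mix augmented and unchanged samples during training.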
Published by C-nit over 5 years ago
pytoda - Refactor introducing base datasets supporting key lookup
Many top-level datasets in pytoda are "just" torch Datasets supporting __len__ and __getitem__. Typically the dataset itself pairs (and possibly labels) keys of samples/entities, while the respective item data comes from specific sources.
So, in their implementations, the datasets rely on other datasets specific to a datatype/entity, and on getting items via a hashable key rather than an integer index.
This PR introduces some base classes in base_dataset.py that provide an interface one can expect from such datasets.
New Base Classes
KeyDataset
Every KeyDataset can implement its own mechanism to store keys and match them with an index/item, but minimally implements get_key and get_index. That's it.
The following methods are available:
get_key(index: int) -> Hashable
get_index(key: Hashable) -> int
get_item_from_key(key: Hashable) -> Any
keys() -> Iterator
has_duplicate_keys -> bool
__add__(other) -> _ConcatenatedDataset
(with default implementations that can be overloaded with more efficient methods)
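The interface above can be illustrated with a minimal sketch (a hypothetical toy class, not pytoda's actual base class): only get_key and get_index touch the storage, and the remaining methods are derived from them.

```python
from typing import Any, Hashable, Iterator

class ToyKeyDataset:
    """Minimal sketch of a KeyDataset-style interface: items stored
    under hashable keys, integer indexing as in a torch Dataset."""

    def __init__(self, items: dict):
        self._keys = list(items)
        self._items = list(items.values())

    def __len__(self) -> int:
        return len(self._items)

    def __getitem__(self, index: int) -> Any:
        return self._items[index]

    # the two required lookups
    def get_key(self, index: int) -> Hashable:
        return self._keys[index]

    def get_index(self, key: Hashable) -> int:
        return self._keys.index(key)  # returns the first match

    # default implementations built on top of the two lookups
    def get_item_from_key(self, key: Hashable) -> Any:
        return self[self.get_index(key)]

    def keys(self) -> Iterator:
        return iter(self._keys)

    @property
    def has_duplicate_keys(self) -> bool:
        return len(set(self._keys)) < len(self._keys)
```

A subclass with a smarter key store (e.g. a dict-based reverse index) would only need to overload get_index; the derived methods keep working unchanged.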
ConcatKeyDataset
Based on torch's ConcatDataset, it supports concatenation of multiple KeyDatasets. The key lookup through the source datasets was previously implemented in each top-level dataset itself; it is now built in and delegates to each dataset's own implementation of the lookup.
It also features the methods get_index_pair and get_key_pair to additionally get the dataset index.
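The core of the "pair" lookup is mapping a global index into (dataset_index, local_index) over the cumulative sizes of the concatenated datasets, as torch's ConcatDataset does. A small standalone sketch (hypothetical helper, not pytoda code):

```python
from bisect import bisect_right
from itertools import accumulate

def locate(global_index: int, sizes: list) -> tuple:
    """Map a global index into (dataset_index, local_index), given the
    lengths of the concatenated datasets."""
    cumulative = list(accumulate(sizes))          # e.g. [3, 2] -> [3, 5]
    dataset_index = bisect_right(cumulative, global_index)
    offset = cumulative[dataset_index - 1] if dataset_index > 0 else 0
    return dataset_index, global_index - offset
```

With sizes [3, 2], global index 4 lands in the second dataset at local index 1; a get_key_pair-style method would then call get_key on that dataset.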
DatasetDelegator
Often there are base classes implementing functionality for a datatype, with the setup of the datasource (e.g. eager vs lazy, filetype) left to child classes.
A DatasetDelegator with an assigned self.dataset behaves as if it were that dataset, delegating all method/attribute calls it does not implement to self.dataset. This provides "base" methods without reimplementation while still allowing "overloading".
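In Python, this delegation pattern is typically built on __getattr__, which is only invoked when normal attribute lookup fails. A minimal sketch (hypothetical class, simplified relative to pytoda's DatasetDelegator):

```python
class ToyDelegator:
    """Any attribute or method not defined here is looked up on
    self.dataset, so the delegator behaves like the wrapped dataset
    while still allowing selective overloading."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __getattr__(self, name):
        # only called when normal lookup fails -> delegate to the dataset
        return getattr(self.dataset, name)

    # dunder calls bypass __getattr__, so forward them explicitly
    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        return self.dataset[index]
```

A child class can then override just the pieces it needs (e.g. a custom __getitem__) while every other call falls through to the wrapped dataset.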
keyed and indexed
Once a dataset is fed to a dataloader that shuffles the data, it is hard to go back and investigate loaded samples without the item's index/key.
The keyed and indexed functions, called on a dataset, return a shallow copy of the dataset with changed indexing behaviour, returning the index/key in addition to the item.
While AnnotatedDataset iterates through the samples in the annotation file, using keyed and indexed instead allows iterating through the samples in the dataset while still looking up labels manually from some source.
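The effect of indexed can be sketched with a simple wrapper (hypothetical and simplified; pytoda returns a shallow copy of the dataset rather than wrapping it):

```python
class Indexed:
    """Wrap a dataset so __getitem__ returns (index, item), letting
    samples be traced back even after a dataloader shuffles them."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        return index, self.dataset[index]
```

A keyed variant would return (get_key(index), item) instead, so labels can be fetched from an annotation source by key after the fact.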
Notes
In the case of duplicate keys, the behaviour is implementation-specific (e.g. it could raise or return the first/last match).
Many tests were refactored to test lazy/eager backends in the same file without code duplication, and tests were added for base methods where appropriate.
Datasets that filter items for availability from sources now usually define masks_df, which contains column-wise masks for the original df to allow inspecting missing entities per item.
Refactor
SMILESLanguage and SMILESTokenizer
SMILESLanguage can translate SMILES to token indexes. Transforms of SMILES and transforms of the encoded token indexes are separated and default to identity functions. Defining the transforms is the job of child implementations like SMILESTokenizer, or can be done at runtime on instances. There is a named choice of tokenization functions in TOKENIZER_FUNCTIONS, but the function can also be passed directly.
The instances can be used to load or build up vocabularies and remember the longest sequence.
The vocabulary can be stored to/loaded from .json. Additionally, an instance can be stored in/loaded from a directory of text files (with defined names). This is achieved similarly to the Hugging Face transformers library (attribution header and license added). A pretrained tokenizer is shipped in pytoda.smiles.metadata.
A new method add_dataset allows building up the vocabulary from an iterable (list, SMILESDataset, ...); it checks for invalid source SMILES, applies transform_smiles, and passes the result to the tokenizer function to add new tokens.
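The vocabulary-building loop can be sketched as follows. This is a toy class with a deliberately coarse regex tokenizer, not pytoda's SMILESLanguage or its actual tokenization functions:

```python
import re

# coarse SMILES tokenizer sketch: bracket atoms, two-letter halogens, single chars
SMILES_TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|.")

class ToyVocabulary:
    """Build up a token vocabulary from an iterable of SMILES, in the
    spirit of add_dataset: tokenize each string, assign new tokens
    increasing indexes, and remember the longest sequence seen."""

    def __init__(self):
        self.token_to_index = {"<pad>": 0, "<unk>": 1}
        self.max_length = 0

    def add_dataset(self, smiles_iterable):
        for smiles in smiles_iterable:
            tokens = SMILES_TOKEN_RE.findall(smiles)
            self.max_length = max(self.max_length, len(tokens))
            for token in tokens:
                # only assign an index the first time a token is seen
                self.token_to_index.setdefault(token, len(self.token_to_index))

    def encode(self, smiles):
        return [
            self.token_to_index.get(token, self.token_to_index["<unk>"])
            for token in SMILES_TOKEN_RE.findall(smiles)
        ]
```

Because the iterable only needs to yield strings, the same method works for a plain list or a SMILESDataset-style object.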
SMILESDataset and SMILESTokenizerDataset
SMILESDataset is now merely returning SMILES strings as one might expect from the name. This is a breaking change, where users of the old SMILESDataset can use SMILESTokenizerDataset now.
SMILESTokenizerDataset uses SMILESTokenizer as the default smiles language to transform items from a SMILESDataset.
SMILESTokenizerDataset can now optionally load a vocab and (not) iterate a dataset.
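The split between the two datasets can be sketched as follows (hypothetical toy classes; the real SMILESTokenizerDataset takes a smiles language object rather than a bare encode function):

```python
class ToySMILESDataset:
    """The plain dataset merely returns raw SMILES strings."""

    def __init__(self, smiles_list):
        self.smiles = smiles_list

    def __len__(self):
        return len(self.smiles)

    def __getitem__(self, index):
        return self.smiles[index]

class ToyTokenizerDataset:
    """The tokenizer dataset wraps a SMILES dataset and returns
    encoded token indexes instead of strings."""

    def __init__(self, dataset, encode):
        self.dataset = dataset
        self.encode = encode

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, index):
        return self.encode(self.dataset[index])
```

Users of the old SMILESDataset who relied on tokenized output would thus wrap their dataset in the tokenizer variant.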
Published by C-nit over 5 years ago
pytoda - Protein Language Modelling
Added various functionalities for Protein language modelling.
This includes a submodule pytoda.proteins with a ProteinLanguage class and two new types of datasets, available through pytoda.datasets, called ProteinSequenceDataset and ProteinProteinInteractionDataset.
Published by jannisborn almost 6 years ago
pytoda - Webservice migration
Several improvements were made, partially in response to the migration of our webservice (https://ibm.biz/paccmann-aas) to a pytorch-deployed model.
Published by jannisborn about 6 years ago