Recent Releases of glycowork

glycowork - v1.6.3

Changelog

[1.6.3]

  • glycowork is now compatible with specifying narrow modification ambiguities (e.g., Gal(b1-3)GalNAc4/6S) (ec290e8)
  • made the bokeh dependency runtime-optional by importing it just-in-time for plot_network (ea9929e)
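
The just-in-time import mentioned above can be sketched as follows. This is a minimal illustration, not glycowork's code: the helper require_optional and the function render_demo are hypothetical names, and the standard-library json module stands in for a heavy optional dependency such as bokeh.

```python
import importlib


def require_optional(module_name: str, feature: str):
    """Import an optional dependency only when a feature actually needs it."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{feature} needs the optional dependency '{module_name}'. "
            "Install it to use this feature."
        ) from exc


def render_demo(data):
    # The import runs at call time, so importing the surrounding package
    # never fails just because the optional backend is missing.
    backend = require_optional("json", "render_demo")  # stand-in for a heavy library
    return backend.dumps(sorted(data))
```

Users who never call the plotting function never pay the import cost, and users without the dependency only see an error when they request the feature.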

glycan_data

stats

Added ✨
  • Alpha biodiversity calculation in alpha_biodiversity_stats now performs Welch's ANOVA instead of ANOVA if scipy>=1.16 (ab73368)
  • ALR transformation functions now also expose the random_state keyword argument for reproducible seeding (23cafe7)
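
One way to gate the choice of test on the installed SciPy version is sketched below. The helper names supports_welch_anova and anova_kwargs are hypothetical, and the sketch assumes that scipy.stats.f_oneway accepts equal_var=False from SciPy 1.16 onward; it only builds the keyword arguments and does not import SciPy itself.

```python
def supports_welch_anova(scipy_version: str) -> bool:
    """Return True if a version string such as '1.16.0' is at least 1.16."""
    major, minor = (int(part) for part in scipy_version.split(".")[:2])
    return (major, minor) >= (1, 16)


def anova_kwargs(scipy_version: str) -> dict:
    # Assumption: on SciPy >= 1.16, f_oneway takes equal_var=False to run
    # Welch's ANOVA; older versions get no extra keyword (classic ANOVA).
    return {"equal_var": False} if supports_welch_anova(scipy_version) else {}
```

Under that assumption, a caller would pass the result along, for example as f_oneway(*groups, **anova_kwargs(scipy.__version__)).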

motif

processing

Added ✨
  • COMMON_ENANTIOMER dict to track the implicit enantiomer state (e.g., we write Gal instead of D-Gal but we do note the deviation L-Gal) (bb7575c)
  • GLYCONNECT_TO_GLYTOUCAN dict to support GlyConnect IDs as input to Universal Input / canonicalize_iupac (ea9929e)
Changed 🔄
  • canonicalize_iupac and its parsers will now leave the D-/L- prefixes in monosaccharides, which will then be centrally homogenized with COMMON_ENANTIOMER, for a more refined and detailed output (bb7575c)
  • canonicalize_iupac now considers more IUPAC variations, such as Neu5,9Ac instead of Neu5,9Ac2 (a764897)
  • canonicalize_iupac no longer strips trailing -Cer (d8c948b)
  • canonicalize_iupac now handles alpha and beta (d8c948b)
  • glycoworkbench_to_iupac is now triggered by the presence of either End-- or u-- (d8c948b)
  • wurcs_to_iupac now supports more tokens (d9d6e57)
  • canonicalize_iupac now supports Gal4,6Pyr modifications (487c68a)
  • wurcs_to_iupac can now process sulfur linkages (e.g., Glc(b1-S-4)Glc) (88b2d54)
  • wurcs_to_iupac is now more robust to prefixes (e.g., L-, 6-deoxy-, etc) (ac171c5)
  • wurcs_to_iupac can now deal with ultra-long glycans (i.e., a-z, A-Z, aa-az, and aA-aZ) (487c68a)

tokenization

Changed 🔄
  • glycan_to_composition is now compatible with the new narrow modification ambiguities (e.g., Gal(b1-3)GalNAc4/6S) (ec290e8)

graph

Changed 🔄
  • compare_glycans is now compatible with the new narrow modification ambiguities (e.g., Gal(b1-3)GalNAc4/6S) (ec290e8)
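
To make the notation concrete, the sketch below expands a narrow modification ambiguity such as GalNAc4/6S into the fully specified alternatives it stands for. The function name expand_modification_ambiguity is hypothetical and this is not how glycowork handles the notation internally; it only illustrates what the shorthand means.

```python
import re

# A run like "4/6" counts as a modification ambiguity only when a capital
# letter follows (e.g., "4/6S"); linkage ambiguities such as "(b1-3/4)"
# are followed by ")" and are therefore left untouched.
_AMBIGUITY = re.compile(r"(\d(?:/\d)+)(?=[A-Z])")


def expand_modification_ambiguity(glycan: str) -> list:
    """Return every fully specified glycan covered by the ambiguous input."""
    match = _AMBIGUITY.search(glycan)
    if match is None:
        return [glycan]
    expanded = []
    for position in match.group(1).split("/"):
        candidate = glycan[:match.start()] + position + glycan[match.end():]
        expanded.extend(expand_modification_ambiguity(candidate))
    return expanded
```

For the example from the changelog, Gal(b1-3)GalNAc4/6S expands to the 4S and 6S variants, while a sequence without such an ambiguity is returned unchanged.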

draw

Fixed 🐛
  • fixed overlap in floating substituents in GlycoDraw if glycan had fewer branching levels than unique floating substituents (daade78)

analysis

Added ✨
  • ANOVA-based time series analysis in get_time_series now performs Welch's ANOVA instead of ANOVA if scipy>=1.16 (ab73368)
  • All analysis endpoint functions can now be directly seeded, without having to pre-transform data, with the newly exposed random_state keyword argument (23cafe7)
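
Why an exposed random_state matters for stochastic steps can be shown with a small sketch. The function add_jitter is a hypothetical stand-in for any step that draws random numbers; it is not part of glycowork.

```python
import random


def add_jitter(values, scale=0.01, random_state=None):
    """Add small Gaussian noise; a fixed random_state makes the result repeatable."""
    rng = random.Random(random_state)  # local generator, global RNG state untouched
    return [value + rng.gauss(0.0, scale) for value in values]
```

Two calls with the same random_state return identical values, whereas leaving it at None draws a fresh seed each time.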

Published by Bribak 11 months ago

glycowork - v1.6.2

Changelog

[1.6.2]

glycan_data

loader

Changed 🔄
  • huggingface_hub will now only be imported upon running download_model, making it technically run-time optional and improving package start-up time (d87e8af)

draw

Changed 🔄
  • openpyxl will now only be imported upon running plot_glycans_excel, making it technically run-time optional and improving package start-up time (d87e8af)

processing

Added ✨
  • canonicalize_iupac now removes extraneous quote marks around input glycans (fbe454c)
  • Added more milk oligosaccharide common names to the Universal Input pipeline as recognized by canonicalize_iupac (39e8a19)
Changed 🔄
  • canonicalize_iupac will now recognize GLYCAM sequences terminating in -OME (6430ebb)
Fixed 🐛
  • Fixed capitalisation in mapping of IGG N-glycan codes to account for .lower() call in canonicalize_iupac (48fb211)
  • Fixed variant LDManHep handling in canonicalize_iupac (6430ebb)

Published by Bribak about 1 year ago

glycowork - v1.6.1

[1.6.1]

  • Moved xgboost dependency into the optional [ml] install (0c62acf)
  • glycowork no longer has a svglib dependency, thanks to improvements in glycorender; it now requires glycorender[png]==0.2.0 (4cad68f)

motif

graph

Changed 🔄
  • glycan_to_nxGraph_int will now automatically convert provided lib dicts into HashableDict objects, if they aren't already (fe5cd74)
  • compare_glycans used with two strings now has another early-return condition if the two glycans have different numbers of branches, enhancing efficiency (fe5cd74)

processing

Changed 🔄
  • canonicalize_iupac can now handle some more variations, such as double-anomeric linkages ((a2-1b)), and will leave modification-containing-seeming monosaccharides (e.g., Psif, Sorf) intact (5b25c1f)

Published by Bribak about 1 year ago

glycowork - v1.6.0

Changelog

[1.6.0]

  • All glycan graphs are now directed graphs (nx.Graph --> nx.DiGraph), flowing from the root (reducing end) to the tips (non-reducing ends), which has led to code changes in quite a few functions. Some functions run faster now, yet outputs are unaffected (03dfad6)
  • Added huggingface_hub>=0.16.0 as a new dependency to facilitate more robust model distribution (22f6b8f)
  • Moved drawSvg~=2.0 and openpyxl from the optional [draw] install to the dependencies of base glycowork. That allows for the usage of GlycoDraw in, e.g., Jupyter environments, even if glycowork[draw] has not been installed. Since these dependencies are unproblematic, no special install needs to be followed for the base glycowork install (60e51da)
  • Deprecated the optional [draw] install completely, by replacing the problematic cairosvg dependency with our new & custom renderer glycorender, which is now a new base dependency of glycowork (7c4fbe1)
  • Moved Pillow dependency into glycorender (793e71f)
  • Deprecated mpld3 and matplotlib-inline dependencies; added new bokeh and IPython base dependencies for better interactive plotting in a Jupyter environment (972c34b, 13b0699)
  • Formally added numpy and matplotlib to base dependencies (ba40c73)
  • Exposed canonicalize_iupac to the glycoworkGUI (ba40c73)
  • Implemented submodule lazy loading to speed up package imports & start-up (9bf18f7)
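
Lazy submodule loading is commonly done with a module-level __getattr__ hook (PEP 562). The sketch below shows the mechanism on a throwaway module object; the name demo_pkg and the mapping are hypothetical, standard-library modules stand in for heavy submodules, and glycowork's actual implementation may differ. In a real package the hook would be a function named __getattr__ defined in __init__.py.

```python
import importlib
import types

# Hypothetical mapping from attribute name to the module that backs it.
_LAZY_SUBMODULES = {"stats": "statistics", "maths": "math"}

pkg = types.ModuleType("demo_pkg")


def _lazy_getattr(name: str):
    # Called only when normal attribute lookup on the module fails.
    if name in _LAZY_SUBMODULES:
        module = importlib.import_module(_LAZY_SUBMODULES[name])
        setattr(pkg, name, module)  # cache so later lookups skip this hook
        return module
    raise AttributeError(f"module 'demo_pkg' has no attribute {name!r}")


pkg.__getattr__ = _lazy_getattr
```

Nothing is imported until an attribute such as pkg.stats is first accessed, which is what keeps start-up fast.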

glycan_data

loader

Added ✨
  • Added HashableDict class to allow for caching of functions with dicts as inputs (03dfad6)
  • Added GlycoDataFrame class to extend pd.DataFrame by adding the .glyco_filter method, to easily filter glycan dataframes by the occurrence/count of sequence motifs (9764b3e)
  • Added new curated glycoproteomics dataset: sorghum_N_PMID39137587 (13b0699)
  • Updated glycan_binding, df_glycan, df_species to be bigger, better, and cleaner (e302075)
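
The idea behind a hashable dict for caching can be sketched like this. SketchHashableDict and count_known_tokens are illustrative names, and the hashing strategy shown is an assumption rather than glycowork's actual HashableDict; it requires hashable values and a dict that is not mutated after first use.

```python
from functools import lru_cache


class SketchHashableDict(dict):
    """A dict usable as an lru_cache key (treat it as read-only)."""

    def __hash__(self):
        return hash(frozenset(self.items()))


@lru_cache(maxsize=None)
def count_known_tokens(tokens: tuple, lib: SketchHashableDict) -> int:
    # With a plain dict as lib, lru_cache would raise TypeError: unhashable type.
    return sum(token in lib for token in tokens)
```

Two equal dictionaries hash identically, so a second call with the same content is served from the cache.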
Changed 🔄
  • Refined motif definition of Internal_LewisX/Internal_Lewis_A/i_antigen in motif_list, to exclude LewisY/LewisB/I_antigen from matching/overlapping (07c9c12)
  • Renamed Hyluronan in motif_list into Hyaluronan (07c9c12)
  • Removed Nglycolyl_GM2 from motif_list; it's captured by GM2 (07c9c12)
  • Further curated glycomics datasets stored in glycomics_data_loader by introducing the b1-? --> b1-3/4 narrow linkage ambiguities (9eeaa3a, 436bf09)
  • download_model will now download model weights and representations from the HuggingFace Hub (22f6b8f)
  • df_species and df_glycan are now of type GlycoDataFrame; build_custom_df now returns a dataframe of type GlycoDataFrame (9764b3e)
  • DataFrameSerializer will now also correctly serialize cells in which (i) lists of strings or (ii) dictionaries have been converted into one string (Excel/pandas interplay of complex cells), where we use ast to try to literally evaluate them back into lists of strings (i) / dictionaries (ii) (806a47c, e302075)
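
The literal-evaluation step described for DataFrameSerializer can be sketched with ast.literal_eval. The function restore_cell is a hypothetical helper, not the serializer's real code; it only shows how a stringified list or dict can be recovered safely while ordinary strings are left alone.

```python
import ast


def restore_cell(cell):
    """Turn a stringified list/dict back into the object, else return the input."""
    if not isinstance(cell, str):
        return cell
    text = cell.strip()
    if not (text.startswith(("[", "{")) and text.endswith(("]", "}"))):
        return cell  # e.g., a glycan sequence such as "Gal(b1-4)Glc"
    try:
        value = ast.literal_eval(text)
    except (ValueError, SyntaxError):
        return cell  # looks like a literal but is not one; keep the string
    return value if isinstance(value, (list, dict)) else cell
```

Unlike eval, ast.literal_eval only accepts Python literals, so cell content cannot execute code.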

stats

Fixed 🐛
  • Fixed a DeprecationWarning about implicit indexing in alr_transformation when a dict is used for custom_scale (9bf18f7)

motif

processing

Added ✨
  • GlyTouCanIDs are now another supported nomenclature in the context of Universal Input and can be used as inputs for functions etc, supported via improvements in canonicalize_iupac (eafb218)
  • Added sanitize_iupac to detect and fix chemical impossibilities (like two monosaccharides connected via the same hydroxyl group) (407cd6f, 74d35a0)
  • Added GLYCAN_MAPPINGS dictionary to map commonly used glycan names to their IUPAC-condensed sequence (36d33b8)
  • Added linearcode1d_to_iupac to support sequences of type 01Y41Y41M(31M21M21M)61M(31M21M)61M21M in the Universal Input platform (d0eee40)
  • CSDB linear code is now another supported nomenclature in the context of Universal Input and can be used as inputs for functions etc, supported via improvements in canonicalize_iupac (8dd34b7, 36d2a61, 69c00e1, 2d8fdfd, cb97593, 0e07c56)
  • Added transform_repeat_glycan to support bringing repeat structures of type 1)Fruf(b2-3)Fruf(b2- into the glycowork format of Fruf(b2-3)Fruf(b2-1)Fruf (36d2a61, 2d8fdfd)
  • Added nglycan_stub_to_iupac to support sequences of type (Hex)3 (HexNAc)1 (NeuAc)1 + (Man)3(GlcNAc)2 in the Universal Input platform (69c00e1)
  • Added iupac_to_smiles alias for IUPAC_to_SMILES (cb97593)
  • Added GAG_disaccharide_to_iupac to support disaccharide structural code (DSC) for GAGs (e.g., D2A6) in the context of Universal Input (0770bcd)
  • Added more WURCS tokens for better support in the context of Universal Input, now stored in wurcs_tokens.json (436bf09, 84c5bcc, b30553f, b94cf6d, d1fd4c7, 14bbd4d, a109176)
  • Support monosaccharides without anomeric indicator and phospho-linkages in WURCS (14bbd4d)
Changed 🔄
  • Moved .motif.query.glytoucan_to_glycan into .motif.processing (eafb218)
  • canonicalize_iupac will now use sanitize_iupac to auto-fix chemical impossibilities in input glycans (407cd6f)
  • More GlycoWorkBench sequence variants can now be handled via glycoworkbench_to_iupac/canonicalize_iupac (9eeaa3a, 436bf09, 74d35a0, 87fd540)
  • canonicalize_iupac and most glycowork functions now also support common names, like "LacNAc" or "2'-FL", in the Universal Input framework, thanks to GLYCAN_MAPPINGS (36d33b8, ab42dbb)
  • get_class can now identify repeating unit glycans and returns "repeat" in this case (74d35a0)
  • canonicalize_iupac can now handle even more IUPAC-dialects, like aMan13(aMan16)Man, where the anomeric state is declared before the monosaccharide (24c8e81, ab42dbb)
  • canonicalize_iupac will now use glycan_to_nxGraph and graph_to_string for branch canonicalization, instead of choose_correct_isoform. On average, this works much better and is more reliable (7c52a0e)
  • canonicalize_iupac is now more robust to (5-6) type linkages and to the associated sugar alcohols, like Rib5P-ol (7a260ac)
  • canonicalize_iupac will now raise a ValueError instead of a warning if a glycan string has mismatching brackets (b69fced)
  • canonicalize_iupac can now handle even more IUPAC-dialects such as Neu5Ac-α-2,6-Gal-β-1,3-GlcNAc-β-Sp (cb2c898)
  • canonicalize_iupac can now handle α,β before linkage parentheses (70b2f61)
  • get_class will now correctly annotate plant N-glycans with core a1-3 Fuc (8dd34b7)
  • Rare GLYCAM variants without "-OH" at the end can now also be handled by glycam_to_iupac (207a050)
  • Support single-monosaccharide glycans in GlycoCT within glycoct_to_iupac (87fd540)
  • Support variant sulfate notations in oxford_to_iupac (b35fc0e)
  • Improved parsing of Sialic acid linkage specification in oxford_to_iupac (06ea51f)
  • Added Oxford preferred antenna parsing in oxford_to_iupac (013456f)
  • Added Sialic acid Acetyl modification parsing in oxford_to_iupac (c402bf2)
  • enabled usage of single strings, next to lists, in iupac_to_smiles (8c5aa64)
  • glycam_to_iupac can now handle KDN tokens and more exotic modifications (8c5aa64)
  • iupac_to_smiles can now auto-use Universal Input, if used with a single-string input
Deprecated ⚠️
  • Deprecated find_isomorphs and choose_correct_isoform; this will be done (and better) by the new canonicalize_glycan_graph instead (7c52a0e)
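
The common-name support listed above amounts to a lookup from trivial names to IUPAC-condensed sequences. The sketch below uses a tiny illustrative table; COMMON_NAMES and resolve_common_name are hypothetical, the lowercase keys are an assumption, and the real GLYCAN_MAPPINGS dictionary is far larger.

```python
# Tiny illustrative subset of name-to-sequence mappings.
COMMON_NAMES = {
    "lacnac": "Gal(b1-4)GlcNAc",
    "lactose": "Gal(b1-4)Glc",
    "2'-fl": "Fuc(a1-2)Gal(b1-4)Glc",
}


def resolve_common_name(glycan: str) -> str:
    """Return the sequence for a known common name, else the input unchanged."""
    key = glycan.strip().lower()  # assumption: lookup is case-insensitive
    return COMMON_NAMES.get(key, glycan)
```

Inputs that are already sequences fall through untouched, so the lookup can run before any other parsing step.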

annotate

Changed 🔄
  • Renamed clean_up_heatmap to deduplicate_motifs (407cd6f)
  • Allow sets of glycans as inputs in get_k_saccharides, in addition to lists of glycans (74d35a0)
  • Made get_k_saccharides faster by re-using graphs and using the directed graphs in an optimized way (7c52a0e)
  • get_terminal_structures will now return an actual ValueError when setting size to be higher than 2 (fa451ba)
Fixed 🐛
  • Fixed an edge case in get_k_saccharides, in which choosing a size larger than the size of the largest glycan in the input caused an error (db7847d)
  • Fixed get_k_saccharides with higher values of size, which occasionally produced invalid strings, by refactoring count_unique_subgraphs_of_size_k and switching it to use the changed graph_to_string_int, to ensure motif validity (db7847d)
  • Fixed preprocess_data, which was attempting to transform 0-containing dataframes when no transform argument was provided (878701a)
  • Fixed an issue in get_molecular_properties in which failed requests with placeholder set to False could lead to a size mismatch in preparing the output dataframe (106d0b0)
Deprecated ⚠️
  • Deprecated link_find; will be done by an optimized get_k_saccharides instead (since link_find relied on find_isomorphs) (7c52a0e)

draw

Added ✨
  • Added get_branches_from_graph to process directed glycan graphs into components for GlycoDraw (e56d015)
  • Added the reverse_highlight keyword argument to GlycoDraw, if you want to highlight everything except a certain motif (which means you can highlight discontiguous sequence stretches) (f5e3b2f)
  • GlycoDraw will now inject ALT text / metadata into all its outputs (displayed or saved as .pdf/.svg/.png) for improved accessibility and to aid curation efforts. The ALT text will be automatically generated and includes appropriate tags, the glycan sequence, and used drawing options. But it can also be overridden, if desired, via the new alt_text keyword argument in GlycoDraw (793e71f)
Changed 🔄
  • Quantitative highlighting in GlycoDraw via the per_residue keyword argument will now use individual SNFG-colors instead of a uniform highlight color (07c9c12)
  • Refactored get_coordinates_and_labels to be more efficient and generalizable; with this and the new get_branches_from_graph, GlycoDraw is now capable of drawing even more complex structures accurately (e56d015, 36fbba9)
  • Next to .svg and .pdf, it is now also possible to save .png files with GlycoDraw (36fbba9)
  • display_svg_with_matplotlib now has the optional chem keyword argument to alert our renderer that the .svg comes from RDKit (7c4fbe1)
  • Exposed libr in GlycoDraw to allow users to override the namespace when drawing overly exotic monosaccharides (8c5aa64)
Deprecated ⚠️
  • Deprecated split_node, unique, get_indices, split_monosaccharide_linkage, and glycan_to_skeleton, since this will now be handled by the changed get_coordinates_and_labels (e56d015)

graph

Added ✨
  • Added canonicalize_glycan_graph to reorder graph nodes in either a length-first or linkage-first manner (7c52a0e)
  • graph_to_string and its sub-functions now have new keyword arguments: canonicalize, to determine whether transcribed graphs should be re-ordered into a canonicalized IUPAC-condensed string, and order_by, to decide whether canonicalization happens in a length-first or linkage-first manner (7c52a0e)
  • Added glycan_graph_memoize decorator to cache results from graph_to_string_int (db7847d)
Changed 🔄
  • Switched lru_cache from glycan_to_graph to glycan_to_nxGraph_int for better performance and fewer opportunities to mess with the cache (03dfad6)
  • Made graph_to_string faster, to accommodate its more central role in the Universal Input framework (7c52a0e)
  • Added a fast-return for disaccharide graphs in graph_to_string_int, since no canonicalization/branch sorting is needed (db7847d)
  • subgraph_isomorphism is now also fine with people providing a separate termini_list even when providing graphs as input (though it's still recommended to just input termini_list when creating the graphs in the first place) (7a260ac)
  • GlycoDraw will now properly space text within monosaccharide symbols, if there are multiple indicators (like D and f) (03e502c)
  • get_possible_topologies will now raise a ValueError instead of a warning when you attempt to use it with an already defined glycan (b69fced)
  • get_possible_topologies will now by default return the strings of glycans, except if return_graphs=True, in which case the old behavior of returning glycan graphs as NetworkX objects is restored (b69fced)
  • If get_possible_topologies is used with dangling modifications (e.g., {OS}Gal(b1-3)GalNAc), the new default is now to also try adding the modification at non-terminal residues, even if exhaustive=False. The behavior for monosaccharide additions is unchanged. (b69fced)
  • If a glycan string is input into graph_to_string (even though you shouldn't do this) it will simply be returned as the string value (b69fced)
Fixed 🐛
  • Fixed an edge case in compare_glycans in which two identical string glycans returned (True, True) if return_matches == False (03dfad6)

tokenization

Changed 🔄
  • stemify_glycan can now deal with even more strongly modified glycans and should be faster too (03e502c)
  • map_to_basic can now deal with any linkage, even those never before seen in glycans (069faf7)

network

biosynthesis

Changed 🔄
  • plot_network now uses bokeh for interactive plotting instead of mpld3; changed the default layout algorithm from kamada_kawai to spring (972c34b)
  • find_diamonds and highlight_network will now raise actual ValueErrors instead of printed warnings, if their settings are wrong (36d2a61)
Fixed 🐛
  • Fixed a DeprecationWarning about resources.open_text in construct_network (ba40c73)
  • Fixed an edge case in find_diff when related but dissimilar glycans are used as input (069faf7)

evolution

Changed 🔄
  • distance_from_metric will now raise a ValueError if the chosen metric is not yet supported (ba40c73)
Fixed 🐛
  • Fixed a ClusterWarning about distance matrix formats in dendrogram_from_distance (ba40c73)

ml

model_training

Changed 🔄
  • training_setup will now raise a ValueError if the chosen mode is not supported (2d8fdfd)

inference

Changed 🔄
  • get_lectin_preds will now raise a ValueError if no protein:ESM-1b dictionary is provided in non-flex mode (9bf18f7)
  • get_esm1b_representations is now get_esmc_representations, with a slightly changed function signature (e.g., no alphabet needed anymore) (e302075)

models

Changed 🔄
  • init_weights will now raise a ValueError if the chosen initialization mode is not supported (9bf18f7)
  • LectinOracle will now use ESMC-300M representations, rather than ESM-1b (e302075)

traintestsplit

Fixed 🐛
  • Fixed class_list order in prepare_multilabel to ensure reproducibility (1e7999d)

Published by Bribak about 1 year ago

glycowork - v1.5.0

Changelog

[1.5.0]

Added ✨

  • Added type hints to all functions (e6721a1)
  • Added CodeCov shield to track PyTest test code coverage (23d6456)
  • Added more PyTest unit tests (e.g., 0c94995, 23d6456, 5a99d6b, f76535e, 94646ad, d5f5d4e, 918d18f, d1a8c6d, 194f31c)
  • Added setuptools to required_installs to support pip installation beyond pip 25.0 (94646ad)
  • Added pyproject.toml to support pip installation beyond pip 25.0 (94646ad)
  • Added CITATION.bib to allow for even easier citation of glycowork (a64f694)
  • Reworked user interface of the glycoworkGUI (77bbfa3)

Changed 🔄

  • Bumped minimum supported Python version to 3.9 (3.8 is no longer supported, see https://devguide.python.org/versions/) (4960c5c)
  • Switched docstring style to docments (https://nbdev.fast.ai/tutorials/best_practices.html#document-parameters-with-docments) (e6721a1)
  • Removed gdown dependency; retrieval of externally stored models/files is now handled by the standard library module urllib (319981e, 35ed71a)
  • Switched pathing from os to pathlib (319981e)

glycan_data

Added ✨
  • Added new named motifs to motif_list: DisialylLewisC, Sia(a2-3)Gal(b1-3)[Sia(a2-6)]GlcNAc; RM2, Sia(a2-3)[GalNAc(b1-4)]Gal(b1-3)[Sia(a2-6)]GlcNAc; DisialylLewisA, Sia(a2-3)Gal(b1-3)[Fuc(a1-4)][Sia(a2-6)]GlcNAc (a64f694)
  • Added new curated glycomics dataset, mouse_brain_GSL_PMID39375371 (b94744e)
Changed 🔄
  • Changed glycoproteomics_human_keratinocytes_PMID37956981 to glycoproteomics_human_keratinocytes_N_PMID37956981 (d5f5d4e)
  • Improved the description of blood group motifs in motif_list (including type 3 blood group antigens, ExtB, and parent motifs) (b94744e)
Fixed 🐛
  • Fixed the "Oglycan_core6" motif definition in motif_list to no longer overlap with core 2 structures (f394bda)

loader

Added ✨
  • Added count_nested_brackets helper function to monitor level of nesting in glycans (41bb1a1, d57b836)
  • Added dictionaries with lists of strings as values as a new supported data type for DataFrameSerializer (034b6ad)
  • Added share_neighbor helper function to check whether two nodes in a glycan graph share a neighbor (f394bda)
Changed 🔄
  • Changed resources.open_text to resources.files to prevent DeprecationWarning from importlib (0c94995)
  • lectin_specificity now uses our custom DataFrameSerializer and is stored as a .json file rather than a .pkl file, to improve long-term stability across versions (034b6ad)
Fixed 🐛
  • Fixed DeprecationWarning in all data-loading functions that used importlib.resources.open_text or .content (87ea2fc)

stats

Added ✨
  • Added the "random_state" keyword argument to clr_transformation to allow users to provide a reproducible RNG seed (b94744e)
  • Added the JTKTest class object (87ea2fc)
Changed 🔄
  • For replace_outliers_winsorization, in small datasets, the 5% limit is dynamically changed to include at least one datapoint (23d6456)
  • Handled the edge case of strong differences in cohen_d with zero standard deviation; now outputting positive/negative infinity (23d6456)
  • Renamed test_inter_vs_intra_group to compare_inter_vs_intra_group, to avoid testing issues (23d6456)
  • partial_corr will now return a normal Spearman's correlation if no control features can be identified (241141b)
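
The zero-standard-deviation edge case for Cohen's d can be sketched as below, using the pooled-variance form for two independent samples. This is an illustration rather than the library's cohen_d; in particular, returning 0.0 when both the difference and the variance are zero is an assumption.

```python
import math
from statistics import mean, variance


def cohen_d_sketch(x, y):
    """Cohen's d for two independent samples, with a zero-variance guard."""
    nx, ny = len(x), len(y)
    diff = mean(x) - mean(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    if pooled_var == 0:
        if diff == 0:
            return 0.0  # assumption: identical constant groups give no effect
        return math.copysign(math.inf, diff)  # infinite effect, sign of the difference
    return diff / math.sqrt(pooled_var)
```

Without the guard, two constant groups with different means would raise a ZeroDivisionError instead of reporting an infinitely large effect.
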
Deprecated ⚠️
  • Deprecated hlm, fast_two_sum, two_sum, expansion_sum, and update_cf_for_m_n, which will all be done in-line instead (e1afe33)
  • Deprecated jtkdist, jtkinit, jtkstat, jtkx, which will all be done by the new JTKTest (87ea2fc)
Fixed 🐛
  • Fixed DeprecationWarning in calculate_permanova_stat for calling nonzero on 0d arrays (23d6456)
  • Prevent possible division by zero in pseudo-F calculation in calculate_permanova_stat (23d6456)
  • Fixed DeprecationWarning in jtkdist for calling np.sum(generator) (23d6456)
  • Ensured that the input to impute_and_normalize are columns with floats, to avoid TypeWarnings during imputation (23d6456)
  • Fixed DeprecationWarning in process_glm_results to prevent DataFrameGroupBy.apply from operating on the grouping columns (23d6456)
  • Fixed RuntimeWarnings for JTK-related functions in case of imperfect input data (d5f5d4e)
  • Ensured that correct_multiple_testing will return empty lists if the provided p-value list is also empty (ef3da9c)
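
The empty-input behavior described for correct_multiple_testing is illustrated below with a plain Benjamini-Hochberg correction. This is a stand-in under stated assumptions: the function name is hypothetical, and the library's actual correction procedure and return format may differ.

```python
def benjamini_hochberg_sketch(p_values, alpha=0.05):
    """Return (adjusted p-values, significance flags) via Benjamini-Hochberg."""
    if not p_values:
        return [], []  # nothing to correct, so return empty lists
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted, [p <= alpha for p in adjusted]
```

Returning two empty lists keeps downstream unpacking code working when an upstream filter removes every feature.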

motif

tokenization

Added ✨
  • Added get_random_glycan to retrieve random glycan sequences (optionally of specific glycan type) (d1a8c6d)
  • Supported intramolecular modifications like lactonization in glycan_to_composition (8c69c2c)
Changed 🔄
  • Changed resources.open_text to resources.files to prevent DeprecationWarning from importlib (0c94995)
  • The monosaccharide keys of the output dictionaries of glycan_to_composition are now alphabetically sorted (8c69c2c)
  • Modified calculate_adduct_mass to deal with a greater variety of adduct handling, such as "C2H4O2", "-H2O", "+Na" to add or subtract masses (8c69c2c)
  • Expanded glycan_to_mass and composition_to_mass to deal with compositional building blocks that represent losses/gains in the molecule (like "-H2O") (8c69c2c)
  • Composition and mass functions now can correctly work with azide-modified monosaccharides such as Neu5Az (ef3da9c)
  • In addition to chemical formulae, users can now also provide direct additional masses as floats with the same "adduct" keyword argument in composition_to_mass and glycan_to_mass (d57b836)
  • get_modification will no longer return the 5Ac / 5Gc of Neu5Ac / Neu5Gc as part of the modification (0387d37)
Fixed 🐛
  • Fixed an edge case in get_unique_topologies, in which the absence of a universal replacer sometimes created an empty list that was attempted to be indexed (0c94995)
  • Made sure that compositions_to_structures always returns a DataFrame, even if no matches are found (0c94995)
  • Provided correct exact methyl masses in mass_dict (e3eeb32)
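
The adduct handling described above (formulae such as "C2H4O2", signed terms such as "-H2O" or "+Na", or a direct float) can be sketched as follows. The names formula_mass and adduct_mass_sketch are hypothetical, the element table is a small subset, and the sketch ignores charge and electron mass; it is not the library's calculate_adduct_mass.

```python
import re

# Monoisotopic masses (Da) of a few elements, enough for simple adducts.
MONOISOTOPIC = {"H": 1.00782503, "C": 12.0, "N": 14.00307401,
                "O": 15.99491462, "Na": 22.98976928, "S": 31.97207117}


def formula_mass(formula: str) -> float:
    """Sum element masses for a simple formula such as 'C2H4O2' (no brackets)."""
    mass = 0.0
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        mass += MONOISOTOPIC[element] * (int(count) if count else 1)
    return mass


def adduct_mass_sketch(adduct) -> float:
    """Accept a number (used as-is) or a formula with an optional +/- prefix."""
    if isinstance(adduct, (int, float)):
        return float(adduct)
    sign = -1.0 if adduct.startswith("-") else 1.0
    return sign * formula_mass(adduct.lstrip("+-"))
```

A leading minus sign turns the formula into a loss, so "-H2O" yields a negative mass shift that can simply be added to the glycan mass.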

processing

Added ✨
  • Added "antennary_Fuc" as another inferred feature to infer_features_from_composition (a64f694)
  • Added "IdoA", "GalA", "Araf", "D-Fuc", "AllNAc", "Par", "Kdo", "GlcN", "Ido", "Col", "Tyv", "GalN", "QuiNAc", "Gul", and "Gal6S" to recognized WURCS2 tokens (52fc16e, f3cd8f0, 7551805, 35ed71a)
  • Added the new "order_by" keyword argument to choose_correct_isoform to enforce strictly sorting branches by branch endings / linkages, if desired (918d18f)
  • Added "Col", "Ido", "Kdo", and "Gul" to supported GlycoCT monosaccharides (7551805, 35ed71a)
  • GLYCAM is now another supported nomenclature in the Universal Input framework, enabled by the added glycam_to_iupac function, which is also integrated into canonicalize_iupac (2fb5dc6)
  • GlycoWorkBench (GlycanBuilder) is now another supported nomenclature in the Universal Input framework, enabled by the added glycoworkbench_to_iupac function, which is also integrated into canonicalize_iupac (ea1fdfc)
Changed 🔄
  • check_nomenclature will now actually raise appropriate Exceptions, in case nomenclature is incompatible with glycowork, instead of print warnings (23d6456)
  • Supported triple-branch reordering in find_isomorphs and choose_correct_isoform (918d18f)
  • Improved find_isomorphs to swap neighboring branches with different levels of nesting (41bb1a1, 034b6ad)
  • choose_correct_isoform can now also be used with a single glycan sequence, in which case it internally calls find_isomorphs to generate material for choosing (918d18f)
  • choose_correct_isoform can now correctly handle more complex sequences than before (41bb1a1, 034b6ad, d1ff321)
  • canonicalize_iupac now can handle modifications such as Neu5,9Ac2 / Neu4,5Ac2 or multiple ones like in (6S)(4S)Gal, even if in the wrong order (034b6ad)
  • canonicalize_iupac now can handle even more typos (e.g., 'aa1-3' in specifying a linkage) (a64f694, 241141b)
  • canonicalize_iupac now can handle even more inconsistencies (e.g., mix of short-hand and expanded linkages)
  • Expanded get_mono to deal with some special WURCS2 tokens at the reducing end, of type u2122h_2*NCC/3=O (d57b836)
  • canonicalize_iupac will no longer convert things like "b1-3/4" into "b1-?", because narrow linkage ambiguities can now be properly handled (52fc16e)
  • get_possible_linkages and de_wildcard_glycoletter now also support narrow linkage ambiguities like "b1-3/4" (52fc16e)
  • canonicalize_iupac will no longer mess up branch formatting of the repeating unit in glycans of type "repeat" (9a94537)
  • Ensured that canonicalize_iupac works with lactonized glycans (i.e., containing something like "1,7lactone") (8c69c2c)
  • find_matching_brackets_indices has been renamed to get_matching_indices and now takes multiple delimiter choices and returns a generator, including the level of nesting (basically what .draw.matches used to do) (e1afe33)
  • get_class will now return "lipid/free" if glycans of type Neu5Ac(a2-3)Gal(b1-4)Glc are supplied (i.e., lacking 1Cer and -ol but still lactose-core based) (b99699c)
  • expand_lib now no longer modifies the input dictionary (65bd12c)
  • get_possible_linkages now returns a set instead of a list (a98461f)
  • wurcs_to_iupac now can also properly deal with ultra-narrow linkage wildcards (e.g., a2-3/6) (f3cd8f0)
Fixed 🐛
  • Fixed component inference in parse_glycoform in case of unexpected composition formats (0c94995)
  • Fixed an issue in equal_repeats, in which identical repeats sometimes were not returning True (0c94995)
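
A generator that reports matched delimiter pairs together with their nesting level, in the spirit of the reworked get_matching_indices, could look like this. The name matching_indices_sketch, its signature, the yielded tuple layout, and the error on unmatched delimiters are all assumptions made for illustration.

```python
def matching_indices_sketch(text, opening="(", closing=")"):
    """Yield (start, end, depth) for each matched delimiter pair in text."""
    stack = []
    for i, char in enumerate(text):
        if char == opening:
            stack.append(i)
        elif char == closing:
            if not stack:
                raise ValueError(f"unmatched {closing!r} at index {i}")
            start = stack.pop()
            yield start, i, len(stack)  # depth 0 means an outermost pair
    if stack:
        raise ValueError(f"unmatched {opening!r} at index {stack[-1]}")
```

Pairs are yielded in the order their closing delimiter is encountered, so inner branches appear before the branch that contains them.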

graph

Added ✨
  • Natively support narrow linkage ambiguity in categorical_node_match_wildcard; that means you can use things like "Gal(b1-3/4)GlcNAc" with subgraph_isomorphism or compare_glycans (as well as all functions using these core functions) and it will only return True for "Gal(b1-3)GlcNAc", "Gal(b1-4)GlcNAc", and "Gal(b1-?)GlcNAc" (b94744e)
  • Added build_wildcard_cache for a central handling of wildcard mapping that can also be cached (a98461f)
  • compare_glycans now also has the return_matches keyword argument that allows for a retrieval of the node mapping if the glycans are isomorphic (7c510c9)
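
The matching rule for narrow linkage ambiguities can be made concrete with a small sketch: two linkages are compatible if anomer and donor position agree (or are wildcards) and their sets of acceptor positions overlap. The names parse_linkage and linkages_compatible are hypothetical, and this is not the implementation of categorical_node_match_wildcard.

```python
import re

_LINKAGE = re.compile(r"^([ab?])([0-9?])-([0-9?](?:/[0-9])*)$")


def parse_linkage(linkage: str):
    """Split e.g. 'b1-3/4' into ('b', '1', {'3', '4'})."""
    match = _LINKAGE.match(linkage)
    if match is None:
        raise ValueError(f"unrecognized linkage: {linkage!r}")
    anomer, donor, acceptors = match.groups()
    return anomer, donor, set(acceptors.split("/"))


def linkages_compatible(query: str, target: str) -> bool:
    """True if the two linkages could describe the same bond."""
    q_anomer, q_donor, q_pos = parse_linkage(query)
    t_anomer, t_donor, t_pos = parse_linkage(target)

    def same(a, b):
        return a == "?" or b == "?" or a == b

    if not (same(q_anomer, t_anomer) and same(q_donor, t_donor)):
        return False
    return "?" in q_pos or "?" in t_pos or bool(q_pos & t_pos)
```

With this rule, b1-3/4 is compatible with b1-3, b1-4, and b1-?, but not with b1-6 or a1-3.
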
Changed 🔄
  • Ensured that compare_glycans is 100% order-specific, never matching something like ("Gal(b1-4)GlcNAc", "GlcNAc(b1-4)Gal") (5a99d6b)
  • glycan_to_nxGraph will now return an empty graph if the input is an empty string (4f1ccfa)
  • get_possible_topologies will now also produce a warning (and return the input) if an already defined topology is provided as a pre-calculated graph (3f22f14)
  • Negation in subgraph_isomorphism can now also be added for internal monosaccharides (e.g., "Neu5Ac(a2-3)!Gal(b1-4)GlcNAc") (7558d9b)
  • Functions with the handle_negation decorator can now be accessed without the decorator via .__wrapped__ (7558d9b)
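
Access to the undecorated function via .__wrapped__ is standard behavior of functools.wraps, as the toy example below shows. The decorator handle_negation_sketch merely inverts the result for a leading "!", which is a deliberate simplification and not how glycowork's handle_negation treats negated monosaccharides.

```python
import functools


def handle_negation_sketch(func):
    """Toy decorator: strip a leading '!' and invert the wrapped result."""
    @functools.wraps(func)  # also stores the original as wrapper.__wrapped__
    def wrapper(motif, glycan):
        if motif.startswith("!"):
            return not func(motif[1:], glycan)
        return func(motif, glycan)
    return wrapper


@handle_negation_sketch
def contains(motif, glycan):
    return motif in glycan
```

Calling contains.__wrapped__ bypasses the negation logic entirely, which is handy for testing the core function in isolation.
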
Fixed 🐛
  • Fixed an edge case in which subgraph_isomorphism could erroneously return False if any of the matchings were in the wrong order, if "count = False" (f394bda)
  • Fixed an edge case in which negated motifs in subgraph_isomorphism sometimes wrongly returned False because the negated motif was present somewhere else in the glycan (but the intended motif was still there) (7558d9b)

draw

Added ✨
  • Added the "drawing" argument to draw_hex, hex_circumference, add_bond, add_sugar, and draw_bracket to avoid having to operate on global variables (918d18f)
  • Added the option to provide your own existing glycan .pdb structures to GlycoDraw when using draw_method='chem3d' with the new keyword argument pdb_file (9d082a6)
Changed 🔄
  • matches can now also use [] as delimiters (f76535e)
  • Support easy import of GlycoDraw, via from glycowork import GlycoDraw (d5f5d4e)
  • Renamed hex to draw_hex, to avoid overwriting the built-in hex (918d18f)
  • Changed keyword argument "hex" to "hex_codes" in add_colours_to_map (838c708)
  • get_highlight_attribute now internally uses motif.graph.subgraph_isomorphism for pattern retrieval, ensuring up-to-date functionality (4f1ccfa)
  • get_coordinates_and_labels now internally uses motif.processing.choose_correct_isoform to reorder the glycan for drawing (41bb1a1)
  • Improved console drawing quality controlled by display_svg_with_matplotlib and image quality in Excel cells using plot_glycans_excel (a64f694)
  • draw_chem2d and draw_chem3d will now detect whether the user is in a Jupyter environment and, if not, plot to the Matplotlib console (c3a7f64)
  • process_per_residue now will re-order the per_residue list in the same way as the glycan is re-ordered for drawing with GlycoDraw (7c510c9)
Deprecated ⚠️
  • Deprecated hex_circumference, the functionality is now available within draw_hex with the new keyword argument "outline_only" (4f1ccfa)
  • Deprecated multiple_branches, multiple_branch_branches, branch_order, and reorder_for_drawing accordingly (41bb1a1)
  • Deprecated matches, which will now be done by .processing.get_matching_indices that has been reworked
Fixed 🐛
  • Made sure scale_in_range never divides by zero, if value range is zero (f76535e)
  • Made sure that monosaccharides that were never observed but are still SNFG-defined (like TalNAc vs 6dTalNAc) can still be drawn with GlycoDraw (ef24af4)

analysis

Changed 🔄
  • get_glycanova will now raise a ValueError if fewer than three groups are provided in the input data (f76535e)
  • Improved console drawing quality controlled by display_svg_with_matplotlib and image quality in Excel cells using plot_glycans_excel (a64f694)
  • The "periods" argument in get_jtk is now a keyword argument and has a default value of 12, 24
  • specify_linkages can now also handle super-narrow linkage wildcards like Galb3/4 (f394bda)
  • get_SparCC will now limit the number of eligible controls for "partial_correlations=True" to sample_size//5, capped at 5 (241141b)
Fixed 🐛
  • Fixed a FutureWarning in get_lectin_array by avoiding DataFrame.groupby with axis=1 (f76535e)
  • Fixed a RuntimeWarning in get_biodiversity by handling statistical tests of identical alpha diversity values between groups (f76535e)
  • Made sure that the TSNE perplexity fits the sample size in plot_embeddings (d5f5d4e)
  • Fixed an edge case in which user-provided embeddings as DataFrames were misformatted in plot_embeddings (d5f5d4e)
  • Supported the case where no labels are provided to plot_embeddings (d5f5d4e)
  • Fixed a potential format mismatch in get_meta_analysis if random-effects meta-analyses were performed (d5f5d4e)
  • Fixed an issue where variance-filtered rows could cause problems in get_differential_expression if "monte_carlo = True" (ef3da9c)
  • Fixed an issue in get_differential_expression if "sets = True" that caused indexing issues under certain conditions (ef3da9c)
  • Ensured that "effect_size_variance = True" in get_differential_expression always formats variances correctly (ef3da9c)
  • Ensured that the combination of "grouped_BH = True", "paired = False", and CLR/ALR in get_differential_expression works even when negative values are present (87ea2fc)
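
The control limit described for get_SparCC above (one fifth of the sample size by integer division, capped at 5) amounts to the following arithmetic. The helper name max_eligible_controls is hypothetical and only mirrors the rule as worded in the changelog.

```python
def max_eligible_controls(sample_size, hard_cap=5):
    """Upper bound on control variables for partial correlations.

    Hypothetical helper: sample_size // 5, but never more than hard_cap.
    """
    return min(sample_size // 5, hard_cap)


for n in (4, 12, 25, 100):
    print(n, max_eligible_controls(n))
```

Under this rule, small studies get few or no controls, which keeps the partial correlations from being over-parameterized.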

regex

Changed 🔄
  • Improved tracing in try_matching for complicated branching cases (f394bda)
  • Ensured that format_retrieved_matches outputs the identified motifs in the canonical IUPAC representation (7558d9b)
Deprecated ⚠️
  • Deprecated process_pattern; will be done in-line instead (f394bda)
  • Deprecated expand_pattern; will be handled by specify_linkages and improvements in subgraph_isomorphism instead (f394bda)
  • Deprecated filter_dealbreakers; will be handled by improvements in subgraph_isomorphism instead (65bd12c)
Fixed 🐛
  • Fixed an issue in get_match_batch, in which precompiled patterns caused issues in get_match (194f31c)

annotate

Added ✨
  • Added get_size_branching_features to create glycan size and branching level features for downstream analysis (d57b836)
  • Added the "size_branch" option in the "feature_set" keyword argument of annotate_dataset and quantify_motifs, to analyze glycans by size or level of branching (d57b836)
Fixed 🐛
  • Fixed an issue in clean_up_heatmap in which, occasionally, duplicate strings were introduced in the output (e3eeb32)

ml

model_training

Added ✨
  • Added classification-AUROC, multilabel-accuracy, multilabel-MCC, regression-MAE, and regression-R2 as metrics to train_model (#66)
  • Added the "return_metrics" keyword argument to train_model that can additionally return all training and validation metrics (#66)
Changed 🔄
  • Weigh metric calculation by batch-size (correctly handling the last batch) in train_model (#66)
  • Best performances in train_model are now taken from the overall best model (lowest loss), not from best-model-per-metric (#66)
Fixed 🐛
  • Fixed an indexing issue in train_ml_model if "additional_features_train" / "additional_features_test" were used (b94744e)
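
Why weighting by batch size matters for the train_model metrics above can be seen in a small, self-contained sketch. The function name and the numbers are illustrative assumptions, not taken from glycowork.

```python
def weighted_epoch_metric(batch_metrics, batch_sizes):
    """Average per-batch metrics, weighting each batch by its sample count."""
    total = sum(batch_sizes)
    return sum(m * n for m, n in zip(batch_metrics, batch_sizes)) / total


# Two full batches of 32 samples and a final, partial batch of 4 samples
accuracies = [0.75, 0.75, 0.0]
sizes = [32, 32, 4]

naive = sum(accuracies) / len(accuracies)           # treats the tiny batch as equal
weighted = weighted_epoch_metric(accuracies, sizes)  # per-sample average

print(f"naive: {naive:.3f}, weighted: {weighted:.3f}")
```

The unweighted mean lets the four samples of the last batch count as much as 32 samples elsewhere; the weighted version equals the metric computed over all samples at once.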

inference

Changed 🔄
  • Changed resources.open_text to resources.files to prevent DeprecationWarning from importlib (d1a8c6d)

models

Changed 🔄
  • In prep_model, the hidden_dim argument can now also be used to modify the protein embedding size of a newly defined LectinOracle model (d1ff321)

network

evolution

Fixed 🐛
  • Fixed DeprecationWarning in distance_from_embeddings to prevent DataFrameGroupBy.apply from operating on the grouping columns (94646ad)
  • Fixed an issue in distance_from_metric where networks were indexed incorrectly based on presented DataFrame order (d2f5d55)

biosynthesis

Changed 🔄
  • Made sure in network_alignment that only nodes that are virtual in all aligned networks stay virtual (918d18f)
  • choose_leaves_to_extend will now correctly return no leaf node glycan if the target composition cannot be reached from any of the leaf nodes in a network (918d18f)
Fixed 🐛
  • Fixed an issue in find_shared_virtuals in which no shared nodes were found because of graph comparisons (d2f5d55)
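
The network_alignment rule above (a node stays virtual only if it is virtual in every aligned network) can be sketched as follows. This uses plain dictionaries mapping node names to a virtual flag rather than the network objects glycowork works with, and the function name merge_virtual_flags is hypothetical.

```python
def merge_virtual_flags(networks):
    """Combine per-network virtual flags into one mapping.

    networks: list of dicts mapping node -> bool (True if virtual).
    A node observed (non-virtual) in any network becomes non-virtual overall.
    """
    merged = {}
    for net in networks:
        for node, is_virtual in net.items():
            merged[node] = merged.get(node, True) and bool(is_virtual)
    return merged


net_a = {"A": False, "B": True, "C": True}
net_b = {"B": True, "C": False, "D": True}
print(merge_virtual_flags([net_a, net_b]))
```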

Published by Bribak over 1 year ago

glycowork - v1.4.0

Change Log

For Version 1.4.0

  • Added an example workflow/tutorial for differential glycomics analysis to the Examples tab in the documentation
  • Added additional tests via pytest
  • Cleaned up repo with more stringent .gitignore, removing unnecessary files
  • Added hover-over tooltips to the glycoworkGUI, describing how the input files should be formatted
  • Exposed more keyword arguments of get_heatmap in GUI (CLR transformation + tick label control)

glycan_data

  • Broadened the motif definition of “Mucin_elongated_core2” in motif_list
  • Refined the motif definitions of the O-glycan core motifs in motif_list to prevent overlaps
  • Larger (and cleaner) datasets for: df_glycan, df_species, df_tissue, df_disease, and glycan_binding
  • Updated lib from 2,366 to 2,565 glycoletters

loader

  • Added the glycoproteomics_data_loader, to request stored glycoproteomics datasets
  • Added human_milk_N_PMID34087070 and human_keratinocytes_PMID37956981 as example datasets for glycoproteomics_data_loader (data are ID’ed in the “Glycosite” column in the format protein_site_composition)
  • Added HexOS and HexNAcOS monosaccharide lists to be used in downstream functions
  • Added modification_map to map which monosaccharides can be modified with which post-biosynthetic modification
  • Added DataFrameSerializer to have a version-independent serializer for handling df_glycan

stats

  • Added get_glycoform_diff to aggregate glycoforms differential expression across glycopeptides or glycoproteins via Fisher’s Combined Probability Test
  • Fixed a pandas deprecation warning in replace_outliers_winsorization (for pandas >= 2.2.2)
  • Added get_glm and process_glm_results to fit and analyze generalized linear models, with interaction terms, to grouped glycoproteomics data
  • Added partial_corr to calculate regularized partial correlations
  • Added estimate_technical_variance and perform_tests_monte_carlo to account for technical variation in glycomics data
  • Added the “cap_side” keyword argument to replace_outliers_with_IQR_bounds and replace_outliers_winsorization to allow users to cap outliers on “both”, “upper”, “lower” sides; default: “both”
  • Fixed the global NumPy RNG for clr_transformation and alr_transformation to ensure reproducibility
  • Added the “correction_method” keyword argument to correct_multiple_testing, to allow users to switch between regular Benjamini-Hochberg and two-stage Benjamini-Hochberg

motif

processing

  • Added support for sulfated monosaccharides to get_possible_monosaccharides
  • Added parse_glycoform, infer_features_from_composition, and process_for_glycoshift as helper functions in glycoproteomics data analysis
  • Expanded canonicalize_composition to deal with compositions of type “9 2 0 0”
  • Fine-tuned canonicalize_iupac to not mess up formatting of sequences ending in “GlcOP-ol”
  • Added de_wildcard_glycoletter to retrieve a random specified monosaccharide/linkage of the general type present as a wildcard (e.g., Hex->Gal)
  • Added get_class to return the glycan class as a string, given a glycan sequence
  • If choose_correct_isoform is provided with isomers that have different amounts of ambiguities, it will now prioritize the isomers with the fewest ambiguities

graph

  • Added support for mixing monosaccharide and modification wildcards in compare_glycans and subgraph_isomorphism (e.g., “HexNAcOS”)
  • Added the handle_negation decorator and subgraph_isomorphism_with_negation to process motif annotation with restrictions (e.g., “Gal(b1-3)[!GlcNAc(b1-6)]GalNAc” to prevent annotating core2 O-glycans as core1)
  • subgraph_isomorphism is now decorated with handle_negation, such that if the “motif” argument contains a negating operator (“!”), the function will actually execute subgraph_isomorphism_with_negation
  • Added the “allowed_disaccharides” keyword argument to get_possible_topologies to support filtering possible extensions by physiological glycan extensions
  • Added a filter to get_possible_topologies to maintain chemically feasible structures by checking that the same carbon does not get two linkages
  • Support handling of post-biosynthetic modifications in get_possible_topologies, e.g., allowing things like “{6S}Gal(b1-3)[GlcNAc(b1-6)]GalNAc” as input, with uncertainty about where the sulfate is attached
  • Refactored graph_to_string_int to recursively construct a depth-first search tree to construct the IUPAC-condensed string
  • Supported monosaccharide-only graphs in generate_graph_features
  • Added deduplicate_glycans to remove duplicate glycans (with different IUPAC strings) from a list of glycans

analysis

  • Added the “glycoproteomics” and “level” keyword arguments to get_differential_expression to support the analysis of glycoproteomics data if “glycoproteomics=True”. “level” indicates whether different glycoforms should be analyzed at the level of glycopeptides or glycoproteins
  • Added get_glycoshift_per_site to analyze whether, and in which way, glycosylation changes between conditions for each glycosylation site (controlling for protein expression etc.) via generalized linear models (GLM) adapted for compositional data (i.e., CLR-transformation)
  • Added preprocess_data as a centralization of data preprocessing for easier maintenance
  • Moved preprocessing code from get_differential_expression, get_glycanova, get_biodiversity, and get_roc into preprocess_data
  • Fixed an issue in clean_up_heatmap in which sometimes the longer string instead of the longer sequence was picked for deduplication (e.g., “Internal_LewisX” vs “SialylLewisX”)
  • Moved clean_up_heatmap into motif.annotate
  • Added Omega-squared as an effect size output to get_glycanova
  • Fixed an issue in get_heatmap in which sometimes the function did not correctly rescue an input by transposing it, if the index contained special characters
  • Fixed an issue in get_pca in which the input of a dataframe for group specification resulted in an error
  • Disabled Levene’s test in get_differential_expression if either group has fewer than three samples, for numerical stability
  • Added the “partial_correlations” keyword argument to get_SparCC. If set to True, it will instead use regularized partial correlations to reduce multicollinearity and enrich associations that represent direct effects (i.e., getting rid of bystander effects)
  • Added the “monte_carlo” keyword argument (default False) to preprocess_data and get_differential_expression. If True, this will simulate technical variation by sampling 128 Monte Carlo instances from a Dirichlet distribution for each sample. Only works for sequences & CLR for now. This will substantially increase runtime and be considerably more conservative in yielding significant differences between conditions. Use with caution.
  • In get_differential_expression glycans that had been filtered out by variance filtering now still have their mean abundance and log2FC recorded in the output table
  • Added the “show_all” keyword argument to get_heatmap to force all tick labels to display, even if they visually overlap

annotate

  • Added annotate_glycan_topology_uncertainty to probe whether motifs can be annotated in the case of structural ambiguity (e.g., {Fuc(a1-3)} in N-glycans, to still annotate Lewis X)
  • Expanded annotate_dataset to let it automatically switch between annotate_glycan and annotate_glycan_topology_uncertainty, depending on whether structural ambiguity is present in a glycan (the latter is much more costly in terms of computation)
  • Added the “remove_redundant” keyword argument (default: True) to quantify_motifs, which will call clean_up_heatmap on the output to remove redundant motifs
  • Dynamically generated terminal motifs now have the prefix “Terminal_” in all outputs
  • Resolved a recent deprecation warning from pandas in get_k_saccharides
  • Added a warning to annotate_dataset that will print all features in “feature_set” that are not being recognized
  • Support the use of “terminal1” as a synonym to the original “terminal” in “feature_set”

draw

  • Support the new “Terminal_” prefix in GlycoDraw and annotate_figure

tokenization

  • Added support for sulfated HexA and HexN in map_to_basic
  • Added calculate_adduct_mass to calculate the mass for generic molecular formulae (e.g., C2H4O2)
  • Added support for chemical tags or adducts in composition_to_mass, glycan_to_mass, and mz_to_composition via the new “adduct” keyword argument
  • Added “Pen” to get_core
  • The default “glycan_class” in mz_to_composition is now “all” (but it can of course still be user-specified)
  • Added the new keyword argument “extras” to mz_to_composition, to allow users to switch off the consideration of adducts or doubly-charged input masses (the default now is to opt out of adducts but users can add that to “extras”)
  • Copy the input dictionary in composition_to_mass to prevent any in-place modification of the keys

network

biosynthesis

  • Made network construction faster via code optimizations
  • Added the “mode” keyword argument to choose_path, find_diamonds, trace_diamonds, and evoprune_network to allow for biosynthetic motif analysis to use information from relative abundances
  • We now support the use of longitudinal data in get_differential_biosynthesis to analyze whether biosynthetic flows change over time
  • Fixed an issue in get_differential_biosynthesis in which N-glycans with high-mannose sequences caused errors (due to the backward direction of synthesis)
  • Fixed an issue in get_differential_biosynthesis in which N-glycans, containing many unobserved intermediate sequences, had capacity bottleneck issues
  • Added the “min_default” keyword argument to estimate_weights, to allow class-dependent fine-tuning of the minimum capacity
  • Modified construct_network to disallow the transfer of modified monosaccharides (e.g., GlcNAc6S), only retaining the sequential assembly in accordance with known biosynthesis (e.g., GlcNAc, then 6S)
  • Added extend_glycans, edges_for_extension, and extend_network to extend the biosynthetic network based on observed reactions and permitted disaccharide extensions
  • Deprecated safe_max and find_ptm; will be done in-line instead

ml

  • Updated trained models for new lib

processing

  • Made dataset_to_graphs faster if there were any duplicates in the input glycans
  • Added augment_glycan and AugmentedGlycanDataset to support glycan data augmentation during training of deep learning models. Currently, the only supported data augmentation is wildcarding of monosaccharides/linkages (e.g., Gal->Hex, b1-4->?1-?) and the inverse (de-wildcarding)
  • Added the keyword arguments “augment_prob” and “generalization_prob” to split_data_to_train to control the likelihood of augmenting a glycan and the proportion of the glycan to be (de-)wildcarded if it is augmented

inference

  • Added an unwrap call to get_lectin_preds to fix the output format

models

  • Set “weights_only = True” for torch.load to prevent FutureWarning

model_training

  • Support already one-hot encoded multilabel labels in Poly1CrossEntropyLoss
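
Fisher’s Combined Probability Test, which the stats notes above name as the aggregation method behind get_glycoform_diff, can be sketched in pure Python. This is a generic textbook version under the assumption of independent p-values strictly greater than zero; it is not glycowork’s code, and the function name is hypothetical.

```python
import math


def fisher_combined_pvalue(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic -2 * sum(ln p) follows a chi-square distribution with
    2k degrees of freedom; for even degrees of freedom the survival
    function has the closed form exp(-x/2) * sum_{i<k} (x/2)^i / i!.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # equals statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half_stat / i
        total += term
    return math.exp(-half_stat) * total


# Two glycoforms of the same protein, each borderline on its own
print(fisher_combined_pvalue([0.05, 0.05]))
```

With a single p-value the function returns that value unchanged, which is a quick sanity check on the closed form.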

Published by Bribak over 1 year ago

glycowork - v1.3.0

Change Log

For Version 1.3.0

  • Added get_heatmap to the glycoworkGUI
  • Added an “About” tab to the glycoworkGUI, describing the glycowork version that it is running and pointers to the reference and documentation
  • Added get_lectin_array to the glycoworkGUI
  • Added a progress bar to lengthier operations in the glycoworkGUI
  • Reduced filesize of glycoworkGUI by ~20% and filesize of glycowork by >80%
  • Removed inplace operations from pandas functions, because of PDEP-8
  • PyTorch (torch) is now no longer a mandatory requirement for base glycowork. It has been shifted to the setup requirements for the optional glycowork[ml] install. Trying to do machine learning without that install will result in an appropriate ImportError
  • gdown is now a mandatory requirement for glycowork, to support hosting larger files outside the package itself

glycan_data

  • Updated glycan_binding by averaging results from duplicate sequences with different formatting
  • Added processed example glycomics datasets that are available via loader.glycomics_data_loader
  • Added processed example lectin array datasets that are available via loader.lectin_array_data_loader
  • Added a bit of fuzziness to the motifs in motif_list to allow for broader capture (e.g., “GalOS” instead of “Gal6S” when appropriate, or “Sia” instead of “Neu5Ac”)
  • Fixed the definition of Internal_LacNAc_type1 in motif_list

loader

  • Added glycomics_data_loader as an object for requesting glycomics data. Use dir(glycomics_data_loader) for displaying available glycomics datasets, and then request them via glycomics_data_loader.XXX (same goes for lectin array data, which is requestable via lectin_array_data_loader)
  • Added human_skin_O_PMC5871710, human_skin_O_PMC5871710_BCC, human_skin_O_PMC5871710_SCC, human_colorectal_O_PMC9254241, human_colorectal_N_PMID26085185, human_colorectal_O_PMID19152289, human_gastric_O_PMC4816881, human_gastric_O_PMID28461410, human_gastric_O_PMC5762837, human_gastric_O_PMC7226152, human_liver_O_PMC9254241, human_liver_O_PMC5383776, human_ovarian_O_PMC4468167, human_prostate_O_PMC8010466, human_prostate_N_PMC8010466, human_retina_GSL_PMC5173345, human_leukemia_O_PMID34646384, human_leukemia_N_PMID34646384, HIV_gagtransfection_N_PMID35112714, HIV_gagtransfection_O_PMID35112714, time_series_N_PMID32149347, human_brain_GSL_PMID38343116, human_brain_N_PMID38343116, human_brain_O_PMID38343116, human_platelets_O_PMID36952551, human_platelets_N_PMID36952551, human_serum_bacteremia_N_PMID33535571, time_series_HMO_PMID22649065, and time_series_O_PMID32149347 as datasets for glycomics_data_loader
  • Added A549_influenza_PMID33046650 and HEK_XBP1_PMID30305426 as datasets for lectin_array_data_loader
  • Added lectin_specificity as a resource for documented lectin specificities for lectin array analysis
  • Switched glycan_binding, df_species, and df_glycan to lazy loading for faster package import
  • Added strip_suffixes to strip a column of string values of suffixes such as “.1”, “.2” that pandas may assign to duplicate columns
  • Added download_model to download hosted large files, such as model weights, when needed

stats

  • Fixed an issue in test_inter_vs_intra_group in which mean values were not correctly broadcast if “paired = False” and “grouped_BH = True”
  • Added get_equivalence_test to test for significant equivalence of group means via two one-sided t-tests
  • Added clr_transformation for the center log ratio transformation of a glycomics dataframe with the addition of scale uncertainty via a gamma parameter (see for instance https://arxiv.org/abs/2201.03616 for the theory behind this)
  • For impute_and_normalize, the default value for “min_samples” has been changed to 0.1, which now means that at least 10% of the samples (rounded down) need to be non-zero for a glycan to be retained. Further, features for which one group only has zero values will now be imputed with 1e-5 to avoid erroneous homogenization of effects by MissForest
  • Changed the “min_feature_variance” default from 0.01 to 0.02 in variance_based_filtering and now it also outputs the discarded rows as a second output
  • Added replace_outliers_winsorization to cap outliers via Winsorization
  • Fixed numpy random seed to 0
  • Added anosim for ANOSIM (Analysis of similarities) for the beta-diversity calculation in get_biodiversity
  • Added alpha_biodiversity_stats for performing an ANOVA on alpha diversity metrics, if groups > 2 in get_biodiversity
  • Fixed a warning if the standard deviation of a paired sample in cohen_d was exactly zero
  • Added calculate_permanova_stat and permanova_with_permutation for PERMANOVA (Permutational multivariate analysis of variance) for the beta-diversity calculation in get_biodiversity
  • Added alr_transformation, get_procrustes_scores, and get_additive_logratio_transformation to find ALR reference component to perform the ALR transformation for compositional data analysis
  • Added correct_multiple_testing to centralize multiple testing correction and also add a warning if >90% of features are significant (in which case, Bonferroni correction will be applied to make results more conservative)
  • Raised tolerance of MissForest from 1e-6 to 1e-5 (as it’s applied to the sum of differences, it’s still very conservative)
  • Added omega_squared to calculate Omega squared, as an effect size for ANOVA-type analyses

motif

analysis

  • Changed get_differential_expression to only call TST_grouped_benjamini_hochberg if “grouped_BH = True”, otherwise default to scipy two-stage Benjamini-Hochberg
  • get_differential_expression now also outputs equivalence tests for all cases in which the uncorrected p-value is above 0.05
  • get_differential_expression, get_glycanova, get_time_series, and get_jtk now will internally CLR- or ALR-transform input glycomics data to appropriately handle compositional data. These functions also newly accept a “gamma” keyword argument to tune the scale uncertainty for lowering the potential for false-positives
  • get_heatmap will now automatically transpose the input dataframe if it has been provided in the wrong orientation
  • Added the “transform” keyword argument to get_heatmap, to optionally CLR/ALR-transform the input data by setting ‘transform = “CLR”’ or ‘transform = “ALR”’
  • The “transform” keyword argument also exists in most other analysis functions and accepts “ALR” and “CLR”, if users wish to override the automatically inferred type of transformation (“Nothing” is accepted for not transforming data at all but this is not recommended in most circumstances)
  • Changed multiple testing correction to two-stage Benjamini-Hochberg, even if no grouped Benjamini-Hochberg test is being done
  • Also changed the “min_samples” default to 0.1 in get_differential_expression and other functions
  • Changed all analysis functions to use Winsorization (glycan_data.stats.replace_outliers_winsorization) instead of IQR capping (glycan_data.stats.replace_outliers_with_IQR_bounds) for outlier treatment
  • Added get_SparCC to perform SparCC (Sparse Correlations for Compositional Data) to find pairwise associations between glycans sequences, or motifs, between two glycomics datasets, with the typical interface of .analysis functions (note that you can also use a glycomics dataset together with an, e.g., metagenomics dataset, even if “motifs=True” is set)
  • Removed outlier treatment in get_pvals_motifs to avoid removing actual effects of effect-sparse glycan array data
  • Added beta-diversity measures (via Euclidean distance on CLR/ALR-transformed data) to get_biodiversity. This function now operates on a shopping cart principle, similar to “feature_set” in the annotation functions. The “metrics” shopping cart currently has “alpha” and “beta” as options. Beta-diversity is tested via ANOSIM (e.g., differences in central tendencies) and PERMANOVA (e.g., variations in dispersions between groups)
  • In get_heatmap a correct color mapping (ascending or contrastive) is now automatically chosen and applied depending on whether negative values are absent or present in the input data, respectively (transform = “CLR” will introduce negative values in the data and trigger contrastive coloring)
  • Added the “custom_scale” keyword argument to get_differential_expression, get_glycanova, get_biodiversity, and get_time_series. Only use it if you know what you’re doing. Basically, if you know that the total amount of glycans goes up/down in your condition of interest (in the condition, not in the measurement), then provide the ratio of glycan signal as group2/group1 and that will be used for an informed scale model, as described in https://www.biorxiv.org/content/10.1101/2024.04.01.587602v1 . Alternatively, if you have more than two groups, “custom_scale” can be provided as a dictionary of type: group idx : mean(group)/min(mean(groups)). [In all these cases, “gamma” becomes a parameter describing experimental error in measuring this glycan signal]
  • In get_volcano the default for “x_thresh” has been changed to 0 (post-hoc filtering of results by fold-change invalidates the FDR guarantee) and a new “n” keyword argument exists to provide the sample size for applying an alpha threshold calculated by get_alphaN
  • Added get_roc to calculate ROC AUC scores for all features and, optionally, plot the ROC curve of the best feature. Also works in multi-group mode (i.e., best feature to distinguish class A from all other classes) and can use “custom_scale”
  • Added get_lectin_array to analyze lectin array data to find out what kind of glycan motifs are increasing/decreasing between conditions
  • Added an optional number of keyword arguments to get_volcano that get directly passed onto the seaborn scatterplot function (**kwargs)
  • Added the “rarity_filter” keyword argument to get_pca, to support excluding extremely rare sequences/motifs from PCA calculation
  • The glycan_representation file as a static embedding look-up for plot_embeddings has been removed from the package and is now downloaded at runtime from a hosted file
  • Changed get_differential_expression and get_glycanova to re-append variance-based filtering discarded rows at the end, with a default p-value of 1.0

graph

  • Deprecated “wildcards_ptm” keyword argument in compare_glycans and subgraph_isomorphism. Instead, this will be inferred internally and, if a monosaccharide with PTM uncertainty (e.g., “GalOS”) is present, then it will kick in and allow for matching to specified monosaccharides (e.g., “Gal6S”)
  • Fixed an issue where graph_to_string sometimes returned incorrect brackets for multiple nested branches

processing

  • Improved canonicalize_iupac by handling “*”, “Ga(“, and improperly formatted ambiguities (e.g., “Gal-GlcNAc”) in an otherwise properly formatted string. Also improved floating bit handling
  • Fixed an issue in the rescue_glycans wrapper in which keyword arguments with empty list defaults could cause an indexing issue for wrapped functions

draw

  • Added the “per_residue” keyword argument to GlycoDraw, which allows users to basically overlay a heatmap over the SNFG representation, where the “per_residue” values control the opacity (e.g., to visualize attention or any other kind of per-monosaccharide attribution)
  • Added process_per_residue to match per-residue values to different levels of branching
  • Added the “draw_method” keyword argument to GlycoDraw, which allows users to draw glycans on the atomic level (chemical depiction of monosaccharides, including steric information, outlined with the respective SNFG color) in 2D (“draw_method = chem2d”) as well as 3D (“draw_method = chem3d”). Note that this requires the glycowork[chem] optional installs
  • Fixed an issue in GlycoDraw that incorrectly parsed global losses when drawing Domon-Costello fragments
  • Fixed an issue in GlycoDraw where, if the filepath contained “svg” or “pdf”, that was sometimes read as the incorrect filepath
  • Fixed an issue in GlycoDraw where “vertical = True” occasionally resulted in empty output files

annotate

  • Added load_lectin_lib, Lectin, create_lectin_and_motif_mappings, and lectin_motif_scoring as helper functions for analysis.get_lectin_array
  • quantify_motifs now also works with log2-transformed data

network

biosynthesis

  • Added multiple testing correction (via two-stage Benjamini-Hochberg), alphaN, and significance column to get_differential_biosynthesis
  • Fixed an issue in which no significant results in get_differential_biosynthesis could error out the function

ml

models

  • The model weights of the trained LectinOracle_flex, LectinOracle, SweetNet, and NSequonPred models have been removed from the package and are now downloaded at runtime from a hosted file
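
The centered log-ratio (CLR) transformation that several of the analysis functions above now apply internally can be sketched in a few lines. This is the plain textbook CLR for a single sample; it omits the gamma scale-uncertainty term and any zero imputation that the changelog describes, and the function name clr_transform is hypothetical.

```python
import math


def clr_transform(sample):
    """Centered log-ratio: log of each part minus the mean log of all parts.

    Assumes strictly positive abundances (zeros must be imputed beforehand).
    """
    logs = [math.log(x) for x in sample]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


relative = clr_transform([1.0, 2.0, 4.0])
rescaled = clr_transform([10.0, 20.0, 40.0])  # same composition, different total
print(relative)
print(rescaled)
```

Both calls give the same result up to floating-point error because only the ratios between glycans matter, and the output always contains values on both sides of zero, which is why CLR-transformed data triggers the contrastive color map in get_heatmap.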

Published by Bribak about 2 years ago

glycowork - v1.2.0

Change Log

For Version 1.2.0

  • Added glycoworkGUI.py to build the .exe based GUI for important glycowork endpoint functions: GlycoDraw, plot_glycans_excel, and get_differential_expression
  • Removed python-louvain as a required dependency for glycowork

glycan_data

loader

  • Switched from pkg_resources to importlib for loading tabular data into the package

stats

  • Fixed an issue in TST_grouped_benjamini_hochberg that caused errors if nothing was significantly different in the entire dataset or in any group
  • test_inter_vs_intra_grouping is now robust to non-paired data and data with differing sample sizes per condition
  • Added replace_outliers_with_IQR_bounds to support outlier treatment in motif.analysis
  • Added sequence_richness, shannon_diversity_index, and simpson_diversity_index to calculate diversity indices of glycomics data

motif

processing

  • WURCS handling for universal input now encompasses more monosaccharides
  • GlycoCT handling for universal input now is robust to the declaration of substituents not immediately following their monosaccharide in the GlycoCT string
  • Added equal_repeats to check whether two repeating units of a polysaccharide are the same, just shifted
  • Modified glycan nomenclature detection in canonicalize_iupac to be less prone to over-identifying Oxford nomenclature when the input is just numbers etc.
  • Added “ß” to the typo detection in canonicalize_iupac and “(-)” as a variation of linkage uncertainty detection
  • Made canonicalize_iupac robust to the variation of using {} instead of () for linkages

graph

  • Removed the required usage of lib in glycan_to_nxGraph, compare_glycans, subgraph_isomorphism, and all downstream functions (lib only remains for stemification and deep learning model training/inference)
  • The keyword argument “wildcards_ptm” now also works as intended when providing pre-calculated graphs as input to compare_glycans or subgraph_isomorphism
  • Fixed a rare issue in which subgraph_isomorphism, when “count = False”, would sometimes erroneously output “False” because of a greedy approach to evaluating potential matches

tokenization

  • Added get_unique_topologies to retrieve all base topologies for a given composition that have been observed for a given taxonomic subset
  • Added the “obfuscate_ptm” keyword argument to map_to_basic, to allow for mapping Gal6S to Hex6S rather than the default HexOS, if that is required/advantageous
  • Support mapping of phosphorylated glycans in map_to_basic

draw

  • Fixed an issue where cross-ring fragments were not correctly rendered in GlycoDraw
  • plot_glycans_excel can now also be used with filepaths to .xlsx files (in addition to .csv files)
  • plot_glycans_excel now also supports compact glycan drawing with the “compact” keyword argument
  • Improved drawing resolution in plot_glycans_excel
  • GlycoDraw will now more strongly make use of nomenclature canonicalization in case of IUPAC dialects (still not 100%, if you suspect you use a dialect of IUPAC, pass your sequences through canonicalize_iupac first)
  • If no filepath is specified, GlycoDraw will now also display drawn glycan structures in a non-Jupyter environment (as the classic matplotlib pop-up). Note that this functionality requires the cairosvg dependency (head to https://bojarlab.github.io/glycowork/examples.html#glycodraw-code-snippets if you’re unsure about that)

analysis

  • Functions able to use .csv paths as input can now also deal with .xlsx paths as input
  • The new “annotate_volcano” keyword argument now allows for the direct insertion of SNFG images within plots from get_volcano without having to subsequently run draw.annotate_figure
  • get_pvals_motifs, get_differential_expression, get_glycanova, get_time_series, and get_jtk now use glycan_data.stats.replace_outliers_with_IQR_bounds to auto-smooth outliers
  • Moved hotellings_t2 to glycan_data.stats
  • All functions compatible with motif-level analysis now accept the “custom_motifs” keyword argument to be passed to annotate_dataset or quantify_motifs if “custom” is included in “feature_set”
  • Changed the “mode” keyword argument in get_heatmap to “motifs” as a Boolean argument, like in all other motif.analysis functions
  • Added a call to clean_up_heatmap to get_jtk to avoid redundant motifs
  • Added get_biodiversity to compare two groups of glycomics datasets with regard to the sequence diversity that is present (similar to comparable analyses for microbiome data)

regex

  • Added filter_dealbreakers to allow for the exclusion of identified matches if they have illegal components beyond the identified match (e.g., the forbidden Fuc in "Fuc-([Gal|GalNAc])?-Gal-([!Fuc]){,1}-GlcNAc"). Before this, the sequence context except the Fuc was extracted and returned.
  • Fixed an edge case in filter_matches_by_location in which internal locations sometimes had to handle triple-nested lists which led to errors
  • get_match can now also use glycan graphs, such as derived from glycan_to_nxGraph, as input
  • Added get_match_batch to process a whole list of glycans at once, with some performance improvements via first pre-compiling the pattern
  • Fixed an edge case in get_match in which pattern components consisting of a single monosaccharide with a specified linkage (e.g., “Fuca3”) could sometimes erroneously output no matches
  • Added motif_to_regex to convert glycan motifs (e.g., in IUPAC-condensed) into a regular expression suitable for get_match. Limited to simple queries for now.

annotate

  • get_terminal_structures now has a “size” keyword argument with which users can control the size of the extracted terminal motifs
  • get_k_saccharides now has a “terminal” keyword argument with which users can filter to only count motifs at non-reducing ends
  • annotate_dataset and functions using it can now add the “terminal2” and “terminal3” options in “feature_set” to also annotate & analyze terminal motifs of size 2 (e.g., Neu5Ac(a2-3)Gal(b1-4)) or size 3 (e.g., Neu5Ac(a2-3)Gal(b1-4)GlcNAc)

network

biosynthesis

  • Added the possibility of providing abundances to construct_network that are then stored as node attributes in the network
  • Added add_high_man_removal as a post-processing step in construct_network to allow for the addition of reactions removing mannoses from high-Man N-glycans occurring during maturation
  • Added estimate_weights and get_edge_weight_by_abundance to estimate reaction capacities from abundances + estimate missing abundances
  • Added get_maximum_flow, get_max_flow_path, and get_reaction_flow to calculate maximum flow paths between network root and endpoints as well as aggregate the flow by reaction type
  • Added get_differential_biosynthesis as a wrapper function to compare two groups of glycomes/networks with regard to their biosynthesis (differential flow paths or differential reaction flows)
  • Fixed an issue in construct_network in which sometimes nodes with outgoing but no incoming connections were not detected as unconnected nodes, leading to incomplete networks
  • Added the rescue_glycans decorator to construct_network, to allow for auto-fixing nomenclature variations
  • Improved performance of construct_network by reducing wasteful computation

evolution

  • Switched get_communities from using python-louvain to the Louvain implementation in networkx
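
The maximum-flow entries in the biosynthesis list above (get_maximum_flow, get_max_flow_path) rest on a standard graph algorithm. As a rough illustration only, the sketch below runs Edmonds-Karp on a toy network in pure Python. It does not call glycowork: the `max_flow` helper, the node names, and the capacities are all hypothetical, and how glycowork derives capacities from abundances is not shown here.

```python
from collections import defaultdict, deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp maximum flow on a dict-of-dicts capacity graph (illustrative helper)."""
    # Residual capacities, with reverse edges initialised to 0
    residual = defaultdict(lambda: defaultdict(float))
    for u, targets in capacity.items():
        for v, cap in targets.items():
            residual[u][v] += cap
            residual[v][u] += 0.0
    total = 0.0
    while True:
        # Breadth-first search for the shortest augmenting path
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return total
        # Collect the path, find its bottleneck, and push flow along it
        path, v = [], sink
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(residual[u][v] for u, v in path)
        for u, v in path:
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
        total += bottleneck

# Toy network with made-up capacities (root -> two intermediates -> endpoint)
toy_capacity = {
    "lactose": {"2'-FL": 3, "3-FL": 2},
    "2'-FL": {"LDFT": 2},
    "3-FL": {"LDFT": 3},
}
toy_flow = max_flow(toy_capacity, "lactose", "LDFT")
```

Each route is limited by its narrowest reaction, which is why the total flow can be lower than the capacity leaving the root.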

Published by Bribak over 2 years ago

glycowork - v1.1.0

Change Log

glycan_data

  • Updated sugarbase database and all models

stats

  • Newly added module to glycowork
  • Moved all the statistics functions from motif.processing into this module: cohen_d, mahalanobis_distance, mahalanobis_variance, variance_stabilization, MissForest, impute_and_normalize, and variance_based_filtering
  • Added fast_two_sum, two_sum, expansion_sum, hlm, update_cf_for_m_n, jtkdist, jtkinit, jtkstat, and jtkx helper functions for JTK test
  • Added get_BF to calculate Jeffreys' approximate Bayes factor based on sample size and p-value
  • Added get_alphaN to calculate sample size-appropriate significance cut-offs informed by Bayesian statistics
  • Added pi0_tst and TST_grouped_benjamini_hochberg to perform a Two-Stage adaptive Benjamini-Hochberg procedure based on groups (e.g., https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3175141/ or https://www.biorxiv.org/content/10.1101/2024.01.13.575531v1)
  • Added test_inter_vs_intra_group to estimate intra- versus inter-group correlation with a mixed-effects model for groupings of glycans based on domain expertise

motif

regex

  • Newly added module to glycowork
  • Added the get_match function and associated functions to implement a regular expression system for glycans. This allows for powerful queries to detect and extract motifs of arbitrary complexity.

processing

  • Moved cohen_d, mahalanobis_distance, mahalanobis_variance, variance_stabilization, MissForest, impute_and_normalize, and variance_based_filtering into glycan_data.stats to re-focus processing on processing glycan sequences
  • Extended canonicalize_composition to cases like ‘542_1’, ‘5421’, and ‘(Hex)2 (HexNAc)2 (Deoxyhexose)1 (NeuAc)2 + (Man)3(GlcNAc)2’
  • GlycoCT and WURCS handling for universal input now encompass more monosaccharides and more modifications
  • Expanded oxford_to_iupac to handle more complex sequences, including sulfation, LacdiNAc, hybrid structures, extended Neu5Ac, complex fucosylation, more custom linkage specifications
  • enforce_class can now deal with free glycans regardless of whether they end in ‘-ol’ or not

annotate

  • annotate_dataset and downstream functions now accept a new keyword in “feature_set”, called “custom”. If “custom” is added to “feature_set”, a list of custom motifs can and must be added via the “custom_motifs” keyword argument. “custom” can be mixed and matched with all other keywords in “feature_set”
  • annotate_dataset now also accepts glyco-regular expressions via the “custom” keyword in “feature_set”. These expressions need to be added within the “custom_motifs” keyword argument and have to start with an “r”, such as "rHex-HexNAc-([Hex|Fuc]){1,2}-HexNAc". Normal motifs and glyco-regular expressions can be freely mixed within “custom_motifs”
  • Added group_glycans_core, group_glycans_sia_fuc, and group_glycans_N_glycan_type to group glycans by core structure (for O-glycans), Sia/Fuc/FucSia/Rest, or complex/hybrid/high-man/rest (for N-glycans)
  • Fixed a bug in get_k_saccharides, in which redundant columns were not always correctly removed

analysis

  • Added get_jtk to analyze circadian expression of glycans in temporal glycomics datasets using the Jonckheere–Terpstra–Kendall (JTK) algorithm, with the typical interface for motifs, imputation, etc., analogous to differential expression.
  • get_differential_expression, get_glycanova, and get_jtk now use get_alphaN to calculate a sample size-appropriate significance cut-off (see https://journals.sagepub.com/doi/10.1177/14761270231214429) and add a ‘significant’ column to the output to display whether the corrected p-values lie below this threshold
  • Added the “zscores” keyword argument to get_pvals_motifs; set “zscores” to False to have the function perform a z-score transformation if the input data are not yet z-score transformed
  • For statistical calculations, get_pvals_motifs will now weigh the motif occurrences by z-score magnitude, rather than only using a cut-off for enrichment calculations
  • Added effect size calculations (as Cohen’s d) to get_pvals_motifs, which are also included in the output
  • Changed get_pvals_motifs such that both enrichments and depletions are now tested (with depletions resulting in negative effect sizes)
  • Added select_grouping to find out which grouping of glycans has the highest intra- versus inter-group correlation, as estimated by glycan_data.stats.test_inter_vs_intra_group
  • When “motifs = False” and “grouped_BH = True”, get_differential_expression now tries to use the Two-Stage adaptive Benjamini-Hochberg procedure based on groups for multiple testing correction, if meaningful groups can be found in the glycans [note that this makes everything at least one order of magnitude slower, though most datasets should still finish in a few seconds]

draw

  • In GlycoDraw, the “highlight_motif” keyword argument can now use glyco-regular expressions in addition to regular motifs (just add a single ‘r’ before your glyco-regular expression to indicate that it is indeed a regular expression)
  • Added plot_glycans_excel to allow for the automated insertion of GlycoDraw SNFG pictures into an Excel file containing glycan sequences

graph

  • categorical_node_match_wildcard now uses string ID for matching, instead of integer ID, which means even two graphs, generated with two different libs, can now be successfully compared via compare_glycans or subgraph_isomorphism
  • compare_glycans and subgraph_isomorphism (and all functions using these functions) now support negation, by prepending “!”. For instance, “!Fuc(a1-?)Gal(b1-4)GlcNAc” will match subsequences that have a monosaccharide that is NOT Fuc before the Gal. It is highly recommended to generate your own lib via get_lib if you use negation, as monosaccharides such as !Fuc are not within lib and will cause indexing errors.
  • Added “?1-?” as another ultimate wildcard (promoting it from a strong narrow wildcard)
  • Fixed some cases where “Monosaccharide” was not treated as an ultimate wildcard in graph operations
  • Fixed an issue in graph_to_string in which glycans of size 1 (e.g., “GalNAc”) sometimes were missing their first character

network

  • Updated pre-calculated biosynthetic networks for milk oligosaccharides

biosynthesis

  • Refactored find_diff to make networks compatible with the automated, dynamic wildcards (i.e., ? behave as they should and don’t necessarily cause over-branching of the network)
  • In highlight_network, the “motif” keyword argument can now use glyco-regular expressions in addition to regular motifs (just add a single ‘r’ before your glyco-regular expression to indicate that it is indeed a regular expression)

ml

model_training

  • In training_setup, upgraded the loss functions for all classification problems to PolyLoss with label smoothing (see https://arxiv.org/abs/2204.12511 for details).
  • In training_setup, number of classes (for multiclass or multilabel classification) can now be specified via the new “num_classes” keyword argument
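
For readers unfamiliar with PolyLoss, the Poly-1 variant adds a single correction term to cross-entropy: loss = CE + ε · (1 − p_t), where p_t is the probability assigned to the target. The sketch below computes this for one sample in pure Python. It is not glycowork's implementation: the `poly1_cross_entropy` name, the default values, and the choice to use the smoothed target for both CE and p_t are assumptions made for illustration.

```python
import math

def poly1_cross_entropy(logits, target, epsilon=1.0, label_smoothing=0.1):
    """Poly-1 loss for a single sample (illustrative helper, not glycowork's code)."""
    # Numerically stable softmax
    shift = max(logits)
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Smoothed one-hot target: mass (1 - s) on the true class, s spread uniformly
    k = len(logits)
    smooth = [label_smoothing / k + (1.0 - label_smoothing) * (i == target)
              for i in range(k)]
    cross_entropy = -sum(t * math.log(p) for t, p in zip(smooth, probs))
    # Assumption: p_t is taken with respect to the smoothed target
    p_t = sum(t * p for t, p in zip(smooth, probs))
    return cross_entropy + epsilon * (1.0 - p_t)

loss_correct = poly1_cross_entropy([5.0, 0.0, 0.0], target=0)
loss_wrong = poly1_cross_entropy([0.0, 5.0, 0.0], target=0)
```

With epsilon set to 0 the function reduces to (label-smoothed) cross-entropy, which makes the extra term easy to isolate when experimenting.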

Published by Bribak over 2 years ago

glycowork - v1.0.1

Change Log

motif

processing

  • Slightly extended WURCS parsing in wurcs_to_iupac
  • Fixed an issue in choose_correct_isoform in which errors would be caused if the input list contained only duplicate glycans
  • Fixed an issue in choose_correct_isoform in which errors would be caused if the input list contained only glycans without branching

draw

  • Adapted cairosvg imports so that, even without cairosvg dependencies, users can plot glycans inline and export as .svg files (only export as .pdf and export of annotate_figure is still restricted to cairosvg)

network

biosynthesis

  • Fixed handling of empty outputs of choose_correct_isoform in construct_network

evolution

  • Fixed dictionary handling in get_communities

Published by Bribak over 2 years ago

glycowork - v1.0.0

Change Log

  • Added a Zenodo badge, to have a release-specific DOI for glycowork

glycan_data

  • Updated sugarbase database; sugarbase is now pickled, so literal evaluations are necessary
  • Harmonized glycan column names across generated dataframes; all use ‘glycan’ now, ‘target’ has been deprecated

loader

  • Updated motif_list to be compatible with new position encoding
  • Added InternalLewisX and InternalLewisA to motif_list (renamed LewisX and LewisA to TerminalLewisX and TerminalLewisA, correspondingly)
  • Made df_species static again to speed up package import
  • Added find_nth_reverse helper function that finds the starting index of the nth occurrence of a substring from the end of the string
  • Added remove_unmatched_brackets helper function to strip unmatched opening or closing brackets from glycan strings

motif

  • Added more masses to mz_to_composition.csv / mass_dict: Acetonitrile, Formate, Cl-, HCO3-, and NH4+

processing

  • Extended canonicalize_iupac to cases like "NeuGcα3Galβ3(NeuAcα6)GalNAcol" and even more modification formulations, e.g., “6S-GlcNAc”
  • Added canonicalize_composition to convert compositions formatted either in the style of HexNAc2Hex1Fuc3Neu5Ac1 or N2H1F3A1 into dictionaries used by glycowork
  • Added GalNAc4S to permitted reducing end monosaccharides for O-linked glycans in enforce_class
  • MissForest now has a maximum number of iterations and will check for convergence each iteration (immediately finishing upon converging), yielding some speed-ups in most cases
  • The output of min_process_glycans no longer contains empty strings for glycans ending in a linkage
  • Updated choose_correct_isoform to be compatible with change in min_process_glycans
  • Added get_possible_linkages to retrieve linkages matching a wildcarded linkage
  • Added get_possible_monosaccharides to retrieve monosaccharides matching a monosaccharide type (HexNAc, etc.)
  • Added decorators, rescue_glycans and rescue_compositions, to canonicalize them in case a decorated function errors out
  • Added linearcode_to_iupac to support LinearCode as input format for glycowork (this will be called within canonicalize_iupac and the decorators); note that for now coverage may not be perfect yet
  • Added iupac_extended_to_condensed to support IUPAC-extended as input format for glycowork (this will be called within canonicalize_iupac and the decorators); note that for now coverage may not be perfect yet
  • Added glycoct_to_iupac to support GlycoCT as input format for glycowork (this will be called within canonicalize_iupac and the decorators); note that for now coverage may not be perfect yet
  • Added wurcs_to_iupac to support WURCS as input format for glycowork (this will be called within canonicalize_iupac and the decorators); note that for now coverage may not be perfect yet
  • Added oxford_to_iupac to support Oxford as input format for glycowork (this will be called within canonicalize_iupac and the decorators); note that for now coverage is limited
  • check_nomenclature (formerly in motif.tokenization) now handles outputting warning messages for trying to use non-string, non-graph nomenclatures or SMILES with glycowork functions
  • Expanded find_isomorphs to generate more isomorphic sequence variants, thereby increasing the chances that choose_correct_isoform will have access to the canonical sequence
  • Fixed a rare issue with canonicalize_iupac where sequences coming from structure_to_basic would sometimes be formatted incorrectly if they contained dHex
  • Fixed an issue in find_isomorphs in which double branches were not always correctly swapped

analysis

  • get_heatmap now no longer tries to convert data to relative abundances if negative values are detected in the input
  • All functions using dataframes as inputs in analysis can now also be used by providing full filepaths to the .csv file instead
  • Optimized some of the code for readability and speed (everything should be at least a bit faster now)

annotate

  • get_k_saccharides is now allowed to generate new dynamic motifs with tokens outside of lib (via expand_lib)
  • annotate_glycan and annotate_dataset now also support narrow wildcards
  • Fixed an issue in count_unique_subgraphs_of_size_k in which branched motifs were not always correctly formatted (i.e., opening/closing brackets)
  • get_k_saccharides now outputs dataframes with counts as default and can yield the old nested lists of motifs by setting the new keyword just_motifs to True
  • Fixed an edge case in which get_k_saccharides sometimes overcounted individual monosaccharides if their strings overlapped

graph

  • subgraph_isomorphism and compare_glycans now support using wildcards and position encoding at the same time. The extra keyword argument is now deprecated and the functions auto-detect whether anything has been specified in wildcards and/or termini_list
  • subgraph_isomorphism and compare_glycans now support automatically inferred narrow wildcards to allow for (i) matching linkages like a1-? to only specified linkages within that group (e.g., a1-3 but not b1-3 etc.) and (ii) matching monosaccharide types like HexNAc to only specified monosaccharides of that type (e.g., GlcNAc but not Glc, etc.)
  • The wildcard_list keyword argument in all graph & annotation functions is now deprecated as wildcards are inferred automatically via narrow wildcards and native full wildcards (?1-? and Monosaccharide)
  • subgraph_isomorphism now behaves as expected for testing motifs ending in linkages on glycans ending in linkages
  • subgraph_isomorphism can now return the matched subgraphs in the input glycan with the new return_matches keyword argument
  • glycan_to_nxGraph is now decorated with the rescue_glycans decorator, which auto-canonicalizes IUPAC strings if they are not in the format preferred by glycowork
  • Fixed mismatch of labels and string_labels in categorical_node_match_wildcard
  • Fixed an issue in subgraph_isomorphism in which, when using positional encoding, sometimes the mirror image of a motif was incorrectly captured if the termini aligned
  • termini_list within subgraph_isomorphism now only requires the specification of monosaccharide positions
  • Added expand_termini_list helper function to facilitate the expansion of monosaccharide-only termini_list into full termini_list behind the scenes
  • Added support for shorthand notation of position encoding, now either ‘terminal’ or ‘t’ will work
  • Improved handling of complex branching in graph_to_string; should be fewer unexpected translations now
  • Fixed an issue in graph_to_string in which induced subgraphs could cause errors due to unexpected or weirdly sorted node indices
  • Fixed an edge case in which the reducing end could be sometimes calculated as ‘internal’ when termini=’calc’ in glycan_to_nxGraph
  • Deprecated a duplicate character_to_label and string_to_labels
  • Deprecated categorical_termini_match; the functionality is now handled within categorical_node_match_wildcard
  • Deprecated the wildcards keyword argument from compare_glycans as this will now be detected internally, if wildcards are provided via wildcard_list

tokenization

  • Composition functions (e.g., composition_to_mass) are now decorated with rescue_compositions, which means that they can be used with compositions like “H3N2” (basically anything that canonicalize_composition can handle)
  • Deprecated character_to_label as it’s now handled within string_to_labels
  • Moved check_nomenclature into motif.processing
  • Optimized some of the code for readability and speed (most things should be at least a bit faster now)

draw

  • Support motif highlighting in GlycoDraw: by providing the highlight_motif keyword argument, motifs can be highlighted (everything else will be set to low opacity). Works with IUPAC-condensed motifs and named motifs from known
  • Support wildcards in motif highlighting with the highlight_wildcard_list keyword argument, for instance highlighting all Gal(?1-?)GlcNAc subunits (for Gal(b1-?)GlcNAc you don’t need highlight_wildcard_list, as narrow wildcards are handled automatically)
  • Support positional encoding in motif highlighting with the highlight_termini_list keyword argument, for instance highlighting all terminal, non-reducing end Gal(b1-?)GlcNAc subunits (yes, you can use both wildcards and positional encoding at the same time😊)
  • Support drawing of repeat structures (indicated by brackets and the number of repeats) via the new repeat keyword argument. Internal repeats can also be specified with the additional repeat_range keyword argument.
  • Optimized some of the code for readability and speed (most things should be at least a bit faster now)
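
The narrow wildcards mentioned in the highlighting entries above (e.g., Gal(b1-?)GlcNAc matching b1-3 or b1-4 but not a1-3) can be pictured as position-wise matching in which only “?” is free. The sketch below is a simplified illustration in pure Python; the `linkage_matches` helper is hypothetical and is not how glycowork implements wildcard matching on graphs.

```python
def linkage_matches(pattern, linkage):
    """Return True if `linkage` (e.g., 'a1-3') satisfies `pattern` (e.g., 'a1-?').

    Illustrative helper: every character must agree unless the pattern has '?'.
    """
    if len(pattern) != len(linkage):
        return False
    return all(p == "?" or p == c for p, c in zip(pattern, linkage))

observed = ["a1-3", "b1-3", "b1-4", "a2-6"]
narrow_hits = [l for l in observed if linkage_matches("b1-?", l)]  # narrow wildcard
full_hits = [l for l in observed if linkage_matches("?1-?", l)]    # full wildcard
```

A narrow wildcard keeps the anomeric configuration and donor position fixed, whereas the full wildcard “?1-?” only fixes the donor position.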

network

biosynthesis

  • Optimized some of the code for readability and speed (everything should be up to 2x faster now)

evolution

  • Optimized some of the code for readability and speed (everything should be at least a bit faster now)

ml

  • Optimized some of the code for readability and speed (most things should be at least a bit faster now)

Published by Bribak over 2 years ago

glycowork - v0.8.1-zenodo

Literally no code changes at this point (0.9 is expected to come in December) but Zenodo requires a new release to mint a doi

Published by Bribak over 2 years ago

glycowork - v0.8.1

Change Log

For Version 0.8.1

motif

tokenization

  • Converted chars into a dict to match libr formatting
  • Updated constrain_prot to work with the change above

ml

models

  • Changed prep_model to load trained models onto the CPU if no GPU is available

Published by Bribak almost 3 years ago

glycowork - v0.8.0

Change Log

For Version 0.8.0

  • Linted the package with flake8
  • Increased code coverage
  • Added another optional extras install, [chem], including glyles, requests, and pubchempy

glycan_data

  • Changed lib to be a dict of type glycoletters:index, as it’s faster to index a dict vs. a long list; also adapted all functions using lib to reflect this change
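
The motivation for the dict-based lib can be seen in a few lines. The sketch below uses a made-up, tiny list of glycoletters (not the real lib) to contrast the two lookups: `list.index` scans the list on every call, while a dict of glycoletter:index answers in constant time on average.

```python
# Hypothetical miniature "lib"; the real one is far longer
glycoletters = ["Fuc", "Gal", "GalNAc", "Glc", "GlcNAc", "Man", "Neu5Ac", "a1-3", "b1-4"]

# List lookup: linear scan per query
index_from_list = glycoletters.index("Man")

# Dict of glycoletter -> index: hash lookup per query
lib = {letter: i for i, letter in enumerate(glycoletters)}
index_from_dict = lib["Man"]

# Membership tests benefit in the same way
known = "GlcNAc" in lib
```

The difference is negligible for nine entries but adds up when thousands of glycans are each translated token by token.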

loader

  • Added replace_every_second helper function
  • Updated linkages list
  • Changed linkages and Hex, etc., to be sets instead of lists

motif

processing

  • Added variance_stabilization for variance stabilization normalization, both globally and group-specific
  • Added in_lib helper function to check whether all glycoletters of glycan are in lib
  • Deprecated small_motif_find
  • cohen_d now also returns the variance of the effect size and supports paired samples as well (calculating Cohen’s dz in this case)
  • Added mahalanobis_distance to calculate Mahalanobis distance as an effect size for multivariate comparisons
  • Added mahalanobis_variance to estimate variance of Mahalanobis distance via bootstrapping
  • Added MissForest for random forest based data imputation
  • Cleaned up canonicalize_iupac and made it slightly faster
  • Added variance_based_filtering
  • Added impute_and_normalize and underlying helper functions
  • Fixed numpy random seed for reproducibility
  • Sped up presence_to_matrix
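
To clarify the distinction between Cohen’s d and the paired variant dz mentioned above, here is a small pure-Python sketch using the textbook definitions. The function names and example numbers are illustrative; glycowork’s cohen_d additionally returns the variance of the effect size, which is omitted here, and its sign convention is not assumed.

```python
from statistics import mean, stdev, variance

def cohen_d_sketch(x, y):
    """Cohen's d for two independent groups, using the pooled standard deviation."""
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / pooled_var ** 0.5

def cohen_dz_sketch(x, y):
    """Cohen's dz for paired samples: mean difference over the SD of the differences."""
    diffs = [a - b for a, b in zip(x, y)]
    return mean(diffs) / stdev(diffs)

after = [2.0, 4.0, 6.0]
before = [1.0, 2.0, 3.0]
d = cohen_d_sketch(after, before)
dz = cohen_dz_sketch(after, before)
```

Because dz standardizes by the spread of the within-pair differences, it can be much larger than d when paired measurements are strongly correlated.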

tokenization

  • Deprecated mz_to_composition
  • mz_to_composition2 is now the new mz_to_composition
  • Adapted mz_to_structures, compositions_to_structures, and match_composition_relaxed to work with this change

annotate

  • Added create_correlation_network to identify clusters of highly correlated glycans/motifs
  • Added count_unique_subgraphs_of_size_k as a helper function within get_k_saccharides
  • Refactored get_k_saccharides to be faster and more complete (and be, effectively, a replacement of motif_matrix)
  • annotate_dataset now uses get_k_saccharides for mono- and disaccharides, instead of motif_matrix
  • Deprecated motif_matrix
  • annotate_dataset now also creates relevant ?-containing motifs if ‘terminal’ is in feature_set, even if they don’t explicitly occur in the glycan strings
  • Big speed-up for annotate_dataset if known=True, as we now cache the precalculated motif graphs
  • Added quantify_motifs as a wrapper around annotate_dataset to adequately distribute relative abundances across extracted motifs
  • Deprecated estimate_lower_bound as speed-ups make it no longer necessary

analysis

  • Renamed make_heatmap to get_heatmap
  • Renamed make_volcano to get_volcano
  • Deprecated replace_zero_with_random_gaussian (this is now handled by MissForest in .processing within impute_and_normalize)
  • Added hotellings_t2 for multivariate comparisons
  • Changed multiple-testing correction method from Holm-Sidak to Benjamini-Hochberg
  • Added variance_stabilization in get_differential_expression
  • Added the option to analyze highly correlated sets of glycans/motifs (via create_correlation_network) within get_differential_expression
  • Implemented usage of hotellings_t2 and the Mahalanobis distance (as effect size) for usage if sets are analyzed within get_differential_expression
  • get_heatmap and get_differential_expression now scale abundances by the actual counts of motifs per glycan, not just absence/presence
  • Added get_meta_analysis to estimate combined effect sizes from the results of multiple studies (both fixed-effects and random-effects models can be estimated)
  • Added variance_based_filtering in get_differential_expression
  • Effect size variances can now also be retrieved within get_differential_expression via the effect_size_variance keyword argument
  • get_differential_expression now also can handle paired samples when paired=True
  • get_differential_expression now also tests the homogeneity of variances using Levene’s test in all settings (also multiple-testing controlled)
  • Added get_glycanova to use ANOVA-based analyses on glycomics datasets (uses basically all the improvements of get_differential_expression, including analysis on the motif level)
  • Added get_pca to plot glycomics data (also has the motif interface)
  • Added get_pval_distribution to plot the distribution of p-values
  • Added get_ma to plot a Bland-Altman plot
  • Added get_glycan_change_over_time to detect significant changes in time-course data via OLS fitting
  • Added get_time_series as a wrapper around get_glycan_change_over_time to do time series analyses, with all the motif & normalization functionality
  • Added get_coverage to visualize glycan expression across samples (ordered by average intensity) in a coverage plot
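
As a reminder of what the switch to Benjamini-Hochberg means in practice, the sketch below computes BH-adjusted p-values in pure Python. It follows the standard step-up procedure (p · n / rank, made monotone from the largest p-value down); the function name is illustrative and this is not the code path glycowork uses internally.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvals)
    # Indices of p-values sorted from smallest to largest
    order = sorted(range(n), key=lambda i: pvals[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

raw = [0.01, 0.04, 0.03, 0.005]
corrected = benjamini_hochberg(raw)
```

Unlike Holm-Sidak, which controls the family-wise error rate, BH controls the false discovery rate and is therefore less conservative when many glycans or motifs are tested at once.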

draw

  • Added import warning if draw dependencies are not installed
  • Removed pycairo from dependencies
  • Modified annotate_figure to be compatible with .svg files from older Matplotlib versions
  • Changed “output” to “filepath” in GlycoDraw
  • If there are “?” in the provided filepath for GlycoDraw, they will now be automatically replaced with “_” to avoid saving errors

graph

  • Sped up glycan_to_graph/glycan_to_nxGraph (and all downstream functions, which are a lot)
  • Also improved the runtime of downstream functions, such as subgraph_isomorphism, independent of these advances
  • subgraph_isomorphism now also accepts precalculated motif graphs as input (in addition to the already supported precalculated glycan graphs)

ml

  • Rephrased import warnings to reflect optional install strategy for extra dependencies

model_training

  • Sped up train_ml_model

network

biosynthesis

  • create_neighbors no longer uses the libr keyword

Published by Bribak almost 3 years ago

glycowork - v0.7.0

Change Log

For Version 0.7.0

  • Removed support for Python 3.7; as we use the walrus operator in some of the re-worked functions, Python 3.8+ is now required to use glycowork
  • Added optional installs for specialized glycowork usage (‘all’, ‘ml’, and ‘draw’; for now), which install additional dependencies for these usages; more details in docs
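
For context on the Python 3.8 requirement: the walrus operator (`:=`) assigns a value inside an expression, which is why code using it cannot even be parsed by Python 3.7. The snippet below is a generic illustration with a made-up regular expression, not an excerpt from glycowork.

```python
import re

glycan = "Neu5Ac(a2-3)Gal(b1-4)GlcNAc"

# Python 3.8+: assign the match object and test it in one expression
if (match := re.search(r"\(([ab]\d-\d)\)", glycan)) is not None:
    first_linkage = match.group(1)
else:
    first_linkage = None
```

On Python 3.7 and earlier, the same logic needs a separate assignment line before the `if` statement.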

glycan_data

Updated datasets, models, lib to be bigger & better; removed many sequence duplicates with differently written branch orderings

loader

  • Added multireplace helper function, to map a dictionary of changes to a string
  • Made build_custom_df faster
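
The idea of mapping a dictionary of changes onto a string can be sketched in a few lines. The helper below is a hypothetical stand-in (named `apply_replacements` to avoid implying it mirrors the signature or behavior of glycowork’s multireplace), shown on a made-up example of normalizing Greek anomer symbols.

```python
def apply_replacements(string, replacements):
    """Apply every old -> new substitution in `replacements` to `string`, in dict order."""
    for old, new in replacements.items():
        string = string.replace(old, new)
    return string

cleaned = apply_replacements("Galβ3GalNAcα", {"α": "a", "β": "b"})
```

Because the substitutions run one after another, the output of an earlier replacement can be altered by a later one; the order of the dictionary matters if keys and values overlap.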

motif

draw

  • Added draw as a new submodule of .motif
  • Added GlycoDraw to draw glycans in SNFG style and save them as .svg/.pdf
  • Added annotate_figure to replace glycan text with glycan images in .svg figures (heatmaps, volcano plots, etc.)
  • Added text_to_glycan, which replaces glycan strings in figures with glycan images
  • Added scale_in_range to normalize a list of numbers within a range
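
Normalizing a list of numbers within a range, as scale_in_range is described above, is commonly done with min-max scaling. The sketch below shows that approach in pure Python; the `scale_to_range` name, its argument order, and the handling of constant input are assumptions for illustration and may differ from glycowork’s helper.

```python
def scale_to_range(values, low, high):
    """Linearly rescale `values` so that the minimum maps to `low` and the maximum to `high`."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:
        # All values identical: map everything to the lower bound (a choice made for this sketch)
        return [float(low) for _ in values]
    span = v_max - v_min
    return [low + (v - v_min) * (high - low) / span for v in values]

scaled = scale_to_range([1, 2, 3], 0, 10)
```

This kind of rescaling is useful when numeric values have to be translated into bounded visual properties such as sizes or opacities.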

tokenization

  • Sped up glycan_to_composition by 1000x (avoiding explicit stemification and just doing stemification of the building blocks); also speeds up all functions using glycan_to_composition
  • Sped up composition_to_mass (independent of the above)
  • glycan_to_composition (and downstream functions) now can handle more post-biosynthetic modifications: Ac, PCho, PEtN
  • Renamed calculate_theoretical_mass to glycan_to_mass
  • Sped up mz_to_composition2 by (i) filtering out duplicate compositions and (ii) selecting compositions from a chosen taxonomic kingdom
  • Reprioritized mz_to_composition2 by first searching for native compositions and only then looking for compositions + adducts and only then searching for doubly-charged compositions
  • canonicalize_iupac now also handles floating substituents and can handle many more typos / inconsistencies / IUPAC dialects (such as CFG-coded glycans), including improvements made by Kathryn Klarich
  • Moved canonicalize_iupac into motif.processing
  • Expanded get_core (and downstream functions) with HexA, HexNAc, dHex
  • Expanded map_to_basic to (some) post-biosynthetic modifications
  • mz_to_structures no longer outright fails if no m/z value can be matched
  • Deprecated structures_to_motifs; annotate_dataset can do the same
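
To illustrate what a composition-to-mass calculation involves, the sketch below sums monoisotopic residue masses and adds one water for the free reducing end. It covers only four building blocks and native, underivatized glycans; the names `RESIDUE_MASS` and `composition_mass` are illustrative, and glycowork’s composition_to_mass (with its own mass table and options) is not reproduced here.

```python
# Monoisotopic residue masses (monosaccharide minus water), in Da
RESIDUE_MASS = {
    "Hex": 162.0528,
    "HexNAc": 203.0794,
    "dHex": 146.0579,
    "Neu5Ac": 291.0954,
}
WATER = 18.0106

def composition_mass(composition):
    """Mass of a free, underivatized glycan from a {monosaccharide: count} dict."""
    return sum(RESIDUE_MASS[mono] * count for mono, count in composition.items()) + WATER

# Two hexoses, e.g., the composition of lactose
lactose_mass = composition_mass({"Hex": 2})
```

Working from compositions rather than full sequences is what makes this calculation cheap: only the counts of building blocks matter, not how they are linked.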

processing

  • Fixed bug in processing glycans with floating substituents in small_motif_find
  • Deprecated seed_wildcard
  • choose_correct_isoform has been updated to keep up with the improved find_isomorphs
  • Added more informative error message to IUPAC_to_SMILES
  • get_lib is now slightly faster

graph

  • Sped up compare_glycans with string inputs, by avoiding graph operations when the two glycans do not have the same composition
  • Added support for enabling modification wildcards in compare_glycans and subgraph_isomorphism (for instance matching GalOS and Gal6S) by setting wildcards_ptm = True
  • Sped up glycan_to_nxGraph_int by optimizing node label/attribute assignments
  • Refactored graph_to_string to be a lot more robust, streamlined, and faster. Its new integration with canonicalize_iupac may also result in string improvement upon back-translation (e.g., branch order canonicalization)
  • ensure_graph now has **kwargs that get passed to glycan_to_nxGraph
  • get_possible_topologies now supports internal additions as well, with the keyword argument ‘exhaustive’
  • possible_topology_check now supports wildcard matching via **kwargs passed on to compare_glycans
  • Made changes to make glycowork compatible with NetworkX 3.0
  • Moved bracket_removal to motif.processing
  • Fixed a small inconsistency in handling floating substituents in glycan_to_nxGraph_int that could have caused issues with custom libs
  • override_reducing_end is no longer needed in glycan_to_nxGraph to delineate linkage-ending glycans (e.g., Fuc(a1-2) ); this is auto-inferred within glycan_to_nxGraph now
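
The first speed-up above relies on a simple observation: two glycans with different compositions can never be identical, so the expensive graph comparison can be skipped. The sketch below shows such a pre-filter in pure Python with a deliberately naive tokenizer for IUPAC-condensed strings; both helper names are hypothetical and this is not glycowork’s implementation.

```python
import re
from collections import Counter

def monosaccharide_counts(glycan):
    """Count monosaccharide tokens in an IUPAC-condensed string (linkages are dropped)."""
    without_linkages = re.sub(r"\([^()]*\)", " ", glycan)
    tokens = without_linkages.replace("[", " ").replace("]", " ").split()
    return Counter(tokens)

def may_be_equal(glycan_a, glycan_b):
    """Cheap pre-filter: only glycans with equal compositions need a graph comparison."""
    return monosaccharide_counts(glycan_a) == monosaccharide_counts(glycan_b)

needs_graph_check = may_be_equal("Gal(b1-4)[Fuc(a1-3)]GlcNAc", "Fuc(a1-3)[Gal(b1-4)]GlcNAc")
can_skip = not may_be_equal("Gal(b1-4)GlcNAc", "Fuc(a1-2)Gal(b1-4)GlcNAc")
```

A True result from the pre-filter does not prove equality (linkages and topology may still differ); it only says that the full comparison is worth running.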

annotate

  • Deprecated convert_to_counts_glycoletter and glycoletter_count_matrix; motif_matrix can do both
  • Refactored motif_matrix to be substantially faster and more condensed in its output (also speeds up annotate_dataset with the ‘exhaustive’ option in the feature_set argument)
  • Expanded motif_matrix to implicitly test for subsumption enrichment (e.g., previously we only explicitly looked for “Gal(b1-?)GlcNAc”; now we also count “Gal(b1-4)GlcNAc” toward the former)
  • annotate_glycan is now dual-compatible with string and networkx graph input
  • Expanded feature_set in annotate_dataset by the option ‘terminal’, which calls get_terminal_structures
  • This usage of get_terminal_structures in annotate_dataset now also does the same implicit test for subsumption enrichment as described for motif_matrix above
  • annotate_dataset now creates its own lib, based on the motif list and the provided glycans
  • Expanded find_isomorphs to also be able to re-shuffle (some) branched branches
  • Moved find_isomorphs into motif.processing
  • Linkages-only are no longer considered by motif_matrix / annotate_dataset

analysis

  • All functions with the featureset keyword argument can now also use the ‘terminal’ keyword for analyzing non-reducing end motifs exclusively
  • Added get_differential_expression to compare glycomics data, including data cleaning and imputation
  • get_pvals_motifs and make_heatmap no longer have the lib keyword argument, as annotate_dataset will generate a suitable lib internally
  • Fixed relative abundance summation in motif-mode for make_heatmap
  • Added the clean_up_heatmap helper function to remove redundant (i.e., identical) rows in heatmaps, with a prioritization of named motifs and longer motifs containing redundant shorter motifs
  • Added make_volcano to generate a volcano plot from internally calculated differential expression using the get_differential_expression function
  • Moved cohen_d into motif.processing

ml

model_training

  • train_ml_model no longer has the lib keyword argument, as annotate_dataset will generate a suitable lib internally

network

biosynthesis

  • Refactored construct_network pipeline to be faster and more memory-efficient
  • reducing_end has been deprecated and is now handled internally
  • Added infer_roots to auto-infer permitted_roots (which no longer needs to be specified in construct_network)
  • Implemented a distance limit to prevent combinatorial explosion when outlier glycans are present
  • Deprecated subgraph_to_string and make_network_from_edges
  • Deprecated fill_with_virtuals and make_network_directed
  • Minor speed-up of process_ptm by pre-calculating stem_lib once instead of for every glycan in the network

Published by Bribak about 3 years ago

glycowork - v0.6.0

Change Log

For Version 0.6.0

  • Updated nbdev1 to nbdev2
  • Updated documentation notebooks
  • Expanded documentation examples for (i) networks and (ii) deep learning models

glycan_data

  • Updated v7_sugarbase and associated files + models
  • Improved Cellosaurus ID prefixes
  • Added glycan composition as a new column to sugarbase
  • Exchanged ‘z’ with ‘?’ as a linkage uncertainty indicator
  • Added protein column to glycan_binding, indicating the protein name whose sequence is in the target column

loader

  • Added “Ins” and “Galf” to Hex list
  • Added stringify_dict utils function to convert a dictionary into a string

motif

  • Changed functions to use “?” as a linkage uncertainty indicator rather than “z”
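For glycan strings still written in the older z-nomenclature, a conversion along these lines may help. It is a minimal sketch, assuming that ‘z’ only needs replacing inside linkage parentheses; z_to_question_mark is a hypothetical helper, not part of glycowork.

```python
import re


def z_to_question_mark(glycan: str) -> str:
    """Replace 'z' placeholders inside linkage parentheses with '?'."""
    return re.sub(
        r"\(([^)]*)\)",
        lambda match: "(" + match.group(1).replace("z", "?") + ")",
        glycan,
    )


print(z_to_question_mark("Gal(z1-z)GlcNAc(b1-z)Man"))
```

Restricting the substitution to the parenthesized linkage leaves monosaccharide and modification names untouched.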

processing

  • Added enforce_class to check whether glycan is from desired glycan class
  • Added IUPAC_to_SMILES to convert glycans from IUPAC-condensed into SMILES via GlyLES

graph

  • glycan_to_nxGraph can now use glycan strings with floating substituents, such as “{Neu5Ac(a2-3)}Gal(b1-4)GlcNAc(b1-6)[Gal(b1-3)]GalNAc”
  • added get_possible_topologies and possible_topology_check to probe whether glycans (could) match a glycan with floating substituents
  • added ensure_graph to allow functions to be dual-compatible with string & graph inputs
  • generate_graph_features, largest_subgraph, get_possible_topologies, and possible_topology_check are now dual-compatible with string & graph inputs
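The curly-brace notation for floating substituents can be separated from the main structure with a short sketch like the one below. It only illustrates the string format; split_floating is a hypothetical helper and says nothing about how glycan_to_nxGraph handles these strings internally.

```python
import re


def split_floating(glycan: str):
    """Return (floating substituents, remaining structure) for a glycan string."""
    floating = re.findall(r"\{([^{}]*)\}", glycan)  # contents of each {...} group
    core = re.sub(r"\{[^{}]*\}", "", glycan)        # string with the groups removed
    return floating, core


floating, core = split_floating("{Neu5Ac(a2-3)}Gal(b1-4)GlcNAc(b1-6)[Gal(b1-3)]GalNAc")
print(floating)
print(core)
```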

tokenization

  • Refactored match_composition_relaxed to be slightly faster and much smaller; it now uses glycan_to_composition for matching
  • Deprecated match_composition accordingly
  • mz_to_composition is now up to 100x faster, based on much better defaults / assumptions
  • added support for free oligosaccharides to mz_to_composition
  • added mz_to_composition2 as an alternative way of composition matching; better scaling and “more physiological” as it’s constrained by class-specific existing compositions within sugarbase
  • glycan_to_composition can now also handle post-biosynthetic modifications such as sulfation
  • added composition_to_mass
  • Improved linkage uncertainty handling in canonicalize_iupac
  • canonicalize_iupac can now handle sulfation and phosphorylation
  • updated stemify_glycan & structure_to_basic to correctly handle glycans of length 1
  • updated stemify_glycan to terminate the while loop if it would otherwise run indefinitely
  • updated glycan_to_composition to support floating substituents
  • get_core now also handles “Ins” correctly
  • calculate_theoretical_mass can now also handle methylation modifications correctly
  • improved reducing end calculation for modified glycans in calculate_theoretical_mass
  • added speed-up option to calculate_theoretical_mass & glycan_to_composition for non-exotic glycans
  • refactored calculate_theoretical_mass to use composition_to_mass

annotate

  • added get_terminal_structures to extract monosaccharide+linkage from all non-reducing ends of a glycan
  • improved runtime and completeness for get_k_saccharides
  • get_terminal_structures & get_k_saccharides are now also both dual-compatible with string & graph inputs
  • added get_molecular_properties to obtain chemical features of glycans via SMILES
  • ‘chemical’ is a new option in featureset of annotate_dataset, using get_molecular_properties
  • small style fix in motif_matrix to avoid warning
  • link_find (and downstream annotation findings) now also support floating substituents
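What "monosaccharide+linkage from all non-reducing ends" means can be shown with a string-level sketch. It assumes IUPAC-condensed input without floating substituents, where a chain starts either at the beginning of the string or right after an opening square bracket; terminal_residues is a hypothetical helper, not the library function.

```python
import re


def terminal_residues(glycan: str) -> list:
    """Return monosaccharide+linkage tokens found at the non-reducing ends."""
    # A terminal token sits at the very start or directly after '[' (start of a branch).
    return re.findall(r"(?:^|\[)([^\[\]()]+\([^)]*\))", glycan)


print(terminal_residues("Neu5Ac(a2-3)Gal(b1-4)GlcNAc(b1-6)[Gal(b1-3)]GalNAc"))
```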

analysis

  • added cohen_d to calculate effect size between two comparison groups
  • ‘chemical’ is a new option in featureset of get_pvals_motifs and make_heatmap, using get_molecular_properties
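For readers unfamiliar with the effect size, the textbook pooled-standard-deviation form of Cohen's d is sketched below. This is the standard formula, not necessarily identical to what glycowork computes; the function name cohen_d_sketch is hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohen_d_sketch(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation of both groups."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Two small groups whose means differ by exactly one pooled standard deviation.
print(cohen_d_sketch([2, 3, 4], [1, 2, 3]))
```

The sign follows the argument order (first group minus second group).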

ml

model_training

  • added the option to use GSAM instead of SAM for the optimizer by specifying alpha in training_setup

models

  • streamlined SweetNet architecture (credit to David Alexander) used in SweetNet and LectinOracle, for faster training and clearer code

network

biosynthesis

  • added a dictionary of pre-calculated glycan graphs to construct_network and underlying functions, for a ~2x speed-up and better scaling
  • various other performance improvements to network construction functions further increase speed
  • improved pruning of virtual root nodes in construct_network
  • modified export_network to allow for custom node attribute extraction
  • generalized find_diamonds to allow for extraction of diamonds, hexagons, etc. with a custom parameter nb_intermediates (default: 2, for diamonds)
  • generalized choose_path to compute path probabilities for non-diamond shape motifs
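A diamond in a biosynthetic network is a pair of alternative two-step paths between the same start and end glycan. The sketch below finds such shapes in a plain edge list. It covers only the diamond case (not hexagons or the nb_intermediates parameter) and is not the library's algorithm; find_diamonds_sketch is a hypothetical name.

```python
from itertools import combinations


def find_diamonds_sketch(edges):
    """Return (top, left, right, bottom) tuples for every diamond in a directed edge list."""
    successors = {}
    for source, target in edges:
        successors.setdefault(source, set()).add(target)
    diamonds = []
    for top, children in successors.items():
        # Two distinct children of 'top' that share a common successor close a diamond.
        for left, right in combinations(sorted(children), 2):
            shared = successors.get(left, set()) & successors.get(right, set())
            for bottom in sorted(shared):
                diamonds.append((top, left, right, bottom))
    return diamonds


edges = [("A", "B"), ("A", "C"), ("B", "D"), ("C", "D"), ("D", "E")]
print(find_diamonds_sketch(edges))
```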

evolution

  • small fix in calculate_distance_matrix

Published by Bribak over 3 years ago

glycowork - v0.5.0

Change Log

For Version 0.5.0

  • added more in-line documentation to all functions/modules

glycan_data

  • df_species is now being generated internally from df_glycan and is no longer a separate file
  • added build_custom_df to generate df_species, df_tissue, and df_disease from sugarbase/df_glycan
  • We are retiring ‘bond’. Instead, the default for full linkage uncertainty is now z1-z / z2-z. Replace z with ? for full compatibility with IUPAC-condensed
  • The ethanolamine modification (previously Etn) is now EtN for consistency with the style of other modifications
  • tissue associations now have either Uberon IDs (tissues etc.) or Cellosaurus IDs (cell lines)
  • disease associations now have a Disease Ontology ID
  • tissue and disease associations now also have a species designation (in tissue_species and disease_species, respectively)
  • the internal lib is now a .pkl file instead of being calculated each time the package is loaded
  • shifted glycan_representations_species.pkl into .motif, where it will be loaded upon calling .motif.analysis.plot_embeddings
  • shifted df_glysum into .alignment, where it will be loaded upon calling .alignment.glysum.pairwiseAlign
  • it should be noted that we may deviate more and more from the provided GlyTouCan IDs, as we strive towards removing unnecessary uncertainty (e.g., specifying the core Fuc as alpha, regardless of whether it has been denoted as alpha in the official GlyTouCan entry)
  • updated positional information in motif_list to account for new graph generation output

loader

  • Deprecated load_file

motif

tokenization

  • added mz_to_composition to match m/z values from glycomics with possible monosaccharide compositions
  • added mz_to_structures wrapper to directly go from m/z values to matching glycan sequences
  • changed some required arguments to optional arguments in compositions_to_structures and mz_to_structures (the default is now human glycans with no additional relative intensities)
  • fixed an issue in compositions_to_structures in which an error was returned if none of the provided compositions had any structure matches
  • updated stemify_glycan to the z-nomenclature for linkage uncertainty
  • compositions_to_structures now allows for input of custom Hex, HexNAc, and dHex lists
  • condense_composition_matching is updated to the z-linkage uncertainty nomenclature
  • sped up composition matching by only considering glycans with the correct number of monosaccharides
  • added canonicalize_iupac to allow for conversion of other IUPAC “flavors” into the version of IUPAC-condensed nomenclature optimized for glycowork
  • added structure_to_basic, glycan_to_composition, and calculate_theoretical_mass utility functions to convert glycan sequences into topologies, compositions, and their theoretical mass, respectively
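The general idea of matching a mass to monosaccharide compositions can be sketched as a brute-force search. Everything here is an assumption for illustration: the function name, the four residue classes, the commonly tabulated monoisotopic residue masses, and the simplification to a neutral, underivatized glycan with a free reducing end (no adducts or charge states, which a real m/z workflow has to handle).

```python
from itertools import product

# Monoisotopic residue masses in Da (assumed values, monosaccharide minus water).
RESIDUE_MASS = {"Hex": 162.05282, "HexNAc": 203.07937, "dHex": 146.05791, "Neu5Ac": 291.09542}
WATER = 18.01056  # added once for the free reducing end


def compositions_for_mass(neutral_mass, tolerance=0.01, max_count=6):
    """Enumerate compositions whose neutral mass lies within tolerance of the query."""
    names = list(RESIDUE_MASS)
    hits = []
    for counts in product(range(max_count + 1), repeat=len(names)):
        mass = WATER + sum(n * RESIDUE_MASS[name] for n, name in zip(counts, names))
        if abs(mass - neutral_mass) <= tolerance:
            hits.append({name: n for name, n in zip(names, counts) if n})
    return hits


print(compositions_for_mass(383.14275))
```

An exhaustive search like this scales poorly with the number of residue classes, which is one reason to constrain candidates by glycan class or by known compositions.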

processing

  • added choose_correct_isoform to distinguish glycan branch isomers
  • deprecated process_glycans and motif_find
  • refactored get_lib to use min_process_glycans
  • condensed small_motif_find
  • moved check_nomenclature into .motif.tokenization and integrated canonicalize_iupac into it

analysis

  • updated characterize_monosaccharide to work with seaborn 0.11.2+

graph

  • overhauled graph generation (glycan_to_graph, glycan_to_nxGraph, graph_to_string) to be more robust, modular, and simpler / easier to maintain
  • combined fast_compare_glycans and compare_glycans into compare_glycans (which internally detects whether strings or precomputed graphs were provided)
  • compare_glycans (and its dependencies) is also 2-3x faster now
  • subgraph_isomorphism should also be 2-3x as fast as before
  • updated graph_to_string to the z-nomenclature for linkage uncertainty
  • fixed a bug in the counting mode of subgraph_isomorphism, in which the graph was modified in-place if precomputed graphs were provided and the function was called multiple times
  • glycan_to_nxGraph received a new optional argument to enable generating graphs of glycans ending in a linkage; note that this output will not work for all downstream functions
  • correspondingly, subgraph_isomorphism can now use motifs ending in a linkage as input
  • wildcard matching for compare_glycans etc. now uses the string labels instead of the regular lib index labels to define the wildcards

query

  • dramatically sped up get_insight by first checking for string identity before doing graph isomorphisms

annotate

  • fixed scipy import for compatibility with scipy 1.8.0
  • improved get_k_saccharides to be (i) compatible with the new graph generation approach and (ii) a lot more robust and exhaustive

ml

  • modified GPU utilization to allow CPU usage of all functions (in theory)

models

  • the trained model file for LectinOracle_flex is now contained within the package instead of being loaded externally
  • deprecated functions for loading external LectinOracle_flex model

processing

  • refactored dataset_to_graphs to directly import from NetworkX graphs

train_test_split

  • renamed taxonomic_multilabel to prepare_multilabel, as it now also works for preparing training datasets for tissue and disease associations

model_training

  • SAM will now only be loaded by training_setup in case of multiclass or multilabel classification (for performance reasons)

network

  • functions working with biosynthetic networks can now use dictionaries of pre-computed networks as inputs; the default option is the set of pre-computed milk glycan biosynthetic networks stored within glycowork

biosynthesis

  • added trace_diamonds to automatically extract diamond-shaped motifs from networks and leverage evolutionary information to return likelihoods for real paths
  • replaced infuse_network with highlight_network, which allows you to highlight motifs, species-specific glycans, abundances, and degree of conservation in a network
  • added prune_network to cut away virtual paths that are unlikely to impossible (depending on threshold)
  • added evo_prune_network as a wrapper for trace_diamonds, highlight_network, and prune_network
  • fixed an issue in choose_path returning an error if a path doesn’t occur in any other species; now it returns an empty dictionary
  • fixed an issue in propagate_virtuals that prevented proper deorphanization for O-glycans
  • fixed a suffix issue in PTM detection for non-milk networks
  • made get_virtual_nodes and construct_network more robust toward unusual branch ordering
  • improved construct_network to prune virtual leaf nodes with degree > 1
  • functions requiring a filepath now require a species : network dictionary as function input

evolution

  • added check_conservation to assess the evolutionary conservation of glycans and glycan motifs via biosynthetic networks
  • added get_communities to use the Louvain community detection algorithm, e.g., in biosynthetic networks
  • refactored distance matrix calculation as a separate function, calculate_distance_matrix

alignment

  • retired alignment until significant improvements can be made

Published by Bribak about 4 years ago

glycowork - v0.4.0

Change Log

For Version 0.4.0

ml

models

  • added NSequonPred (for predicting whether N-linked sequons will be glycosylated) as a trained model
  • added LectinOracle_flex as a trained model (doing the same thing as LectinOracle but able to use raw protein sequences as input rather than ESM-1b representations; with comparable performance)
  • modified prep_model to allow for NSequonPred and LectinOracle_flex selection
  • added more model initialization options and adjusted their defaults in prep_model

model_training

  • changed default optimizer from AdamW to AdamW+SAM (Sharpness-Aware Minimization from https://arxiv.org/abs/2010.01412); typically increases model performance on test set by ~2%
  • implemented support for training models for multilabel classification

train_test_split

  • added taxonomic_multilabel to prepare taxonomic glycan data for multilabel classification

inference

  • added get_Nsequon_preds to use NSequonPred for inference
  • modified get_lectin_preds to allow for LectinOracle_flex usage

motif

graph

  • modified subgraph_isomorphism to use both string and precalculated graph inputs
  • modified subgraph_isomorphism to be able to count the number of occurring subgraphs
  • glycan_to_nxGraph now also records the actual monosaccharide/linkage strings as “string_labels” in the node labels
  • glycan_to_nxGraph and graph_to_string can now also operate on monosaccharides (glycans of length 1)
  • added largest_subgraph to identify the largest common subgraph between two glycans

annotate

  • annotate_glycan now makes use of a precalculated graph in calling subgraph_isomorphism, making motif annotation ~3x faster (also applies to many heatmap applications etc.)
  • annotate_glycan & annotate_dataset now also return the number of known/named motifs per glycan
  • replaced get_trisaccharides with get_k_saccharides, which allows for motif recognition of user-defined size
  • bug fixes

tokenization

  • added constrain_prot and prot_to_coded to process protein sequences for LectinOracle_flex
  • added mask_rare_glycoletters to mask rare monosaccharides and linkages in glycan sequences

processing

  • check_nomenclature now returns True if no red flag is raised

glycan_data

  • replaced influenza_binding with the superset glycan_binding (564,647 protein-glycan interactions from 1,392 lectins)

loader

  • added a reindex utility function
  • updated linkages list

data_entry

  • check_presence now ensures correct glycan nomenclature

network

biosynthesis

  • added functions to consider post-translational glycan modifications when constructing biosynthetic networks (either via the process_ptm wrapper or as an option in construct_network)
  • added functionality to convert biosynthesis networks into directed graphs (either via the make_network_directed wrapper or as an option in construct_network)
  • added update_network to add new information to an already constructed biosynthetic network
  • improved construct_network to enable finding paths for all nodes that can be connected to the biosynthetic root nodes
  • added infuse_network to allow for visualizing glycomics abundance data together with biosynthetic networks
  • added choose_path to leverage biosynthetic networks from other species to determine which path is taken in diamond shapes (A->B, A->C, B->D, C->D) where both paths are virtual/not observed
  • various improvements to ensure that the code functionality also works for classes other than milk glycans, such as O-linked glycans
  • better network layouts with pydot2
  • added edge types (monosaccharide, monosaccharide+linkage, biosynthetic enzyme), which can be infused with differential gene expression information
  • bug fixes & smaller improvements (e.g., pruning of virtual leaves, exporting of networks, user choice of edge type, etc.)

evolution

  • added functions to calculate a distance matrix from glycan embeddings and use this to calculate dendrograms / evolutionary networks
  • added distance_from_metric to calculate distance of networks, e.g., via Jaccard distance
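As a reminder of what a Jaccard distance between two networks measures, here is a minimal sketch on edge sets. Comparing edges (rather than nodes) is an assumption made for illustration, and jaccard_distance is a hypothetical helper rather than the library function.

```python
def jaccard_distance(edges_a, edges_b):
    """1 - |intersection| / |union| of two edge collections (0.0 for two empty networks)."""
    set_a, set_b = set(edges_a), set(edges_b)
    union = set_a | set_b
    if not union:
        return 0.0
    return 1 - len(set_a & set_b) / len(union)


network_a = [("A", "B"), ("B", "C"), ("C", "D")]
network_b = [("B", "C"), ("C", "D"), ("D", "E")]
print(jaccard_distance(network_a, network_b))
```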

Published by Bribak over 4 years ago

glycowork - v0.3.0

Change Log

For Version 0.3.0

ml

models

  • added LectinOracle as an option for prep_model & modified prep_model to allow for loading trained models

model_training

  • train_ml_model now allows for additional (optional) input features
  • changed default optimizer from Adam to AdamW
  • changed default learning rate scheduler from cosine-decay to ReduceLROnPlateau
  • modified train_model to allow for LectinOracle training

processing

  • split_data_to_train now allows for additional (optional) input features
  • label_type is now also an optional argument for split_data_to_train and all lower-level functions

representation/inference

  • renamed “representation” module into “inference”
  • added get_lectin_preds to use LectinOracle for inferring binding specificity of lectins
  • added get_esm1b_representation to retrieve ESM1b representations for new lectins, to use them as input for LectinOracle

motif

query

  • added tissue expression and disease association to get_insight
  • glytoucan_to_glycan is now more robust in dealing with missing GlyTouCan IDs

tokenization

  • added condense_composition_matching to find the minimum number of glycans to characterize matching compositions
  • added compositions_to_structures wrapper function that will take a list of compositions, find possible matches, condense them into the minimum number of structures, and match them with values, such as provided relative intensities
  • added structures_to_motifs function to convert datasets of relative intensities of glycan structures to relative intensities of the corresponding glycan motifs
  • changed default mode of match_composition_relaxed to “exact”
  • modified match_composition_relaxed to allow for filtering possible matches based on reducing end monosaccharide (e.g., O-linked glycans)
  • fixed issue in match_composition_relaxed that prevented the addition of additional monosaccharide types to the composition
  • moved motif_matrix and dependencies over to motif.annotate

glycan_data

  • replaced glyco_targets_species_seq_all_V4 (~23,000 species-specific glycans) and v4_sugarbase (~47,000 unique glycans) with glyco_targets_species_seq_all_V5 (~31,500 species-specific glycans) and v5_sugarbase (~50,500 unique glycans)
  • added directed disease associations (currently 533 associations) and tissue expression (currently 2,815 associations) for glycans in v5_sugarbase
  • changed nomenclature of glycolipids (most receive a “1Cer” at their reducing end, for instance “Glc1Cer”) and free oligosaccharides (receive an “-ol” at their reducing end, for instance “Glc-ol”)
  • made Lewis motifs in motif_list more general
  • correspondingly updated glycan ML models, representations, and substitution matrix

Published by Bribak over 4 years ago

glycowork - v0.2.0

motif

tokenization

  • added functions for stemifying glycans (by removing rare modifications)
  • added match_composition & match_composition_relaxed for finding glycan structures in stored or provided databases that match a provided composition. Can be narrowed down to, e.g., a species of interest.

graph

  • added function to translate glycan graph back to IUPAC-condensed string
  • added try_string_conversion function to check whether a glycan graph describes a valid glycan
  • modified generate_graph_features to also work with networks

analysis

  • updated plot_embeddings to use representation dataframes as inputs in addition to dictionaries
  • swapped subplots in characterize_monosaccharide and modified labelling to enhance clarity
  • get_pvals_motifs now allows for a custom motif_list via the optional motifs argument
  • plot_embeddings now allows for a custom color palette

query

  • added glytoucan_to_glycan function to interconvert GlyTouCan IDs and glycans
  • get_insight now also yields the GlyTouCan ID of a glycan (if available) + the predicted taxonomy if no taxonomy is recorded in our database

annotate

  • added get_trisaccharides to retrieve a subset of the trisaccharides occurring in a glycan
  • added estimate_lower_bound to give make_heatmap + get_pvals_motifs a speedup option with estimate_speedup = True (warning: estimate_lower_bound is an estimate and might in theory lead to missed motifs in the motif annotation); typically results in a 3x speed-up

network

  • beta version of a completely new module that is still in active development

biosynthesis

  • added functions to find neighbors in biosynthesis space (one reaction removed)
  • added functions to plot biosynthetic network for a set of glycans
  • added functions to combine/align biosynthetic networks

glycan_data

  • replaced glyco_targets_species_seq_all_V3 (~13,000 species-specific glycans) and v3_sugarbase (~20,000 unique glycans) with glyco_targets_species_seq_all_V4 (~23,000 species-specific glycans) and v4_sugarbase (~47,000 unique glycans)
  • correspondingly updated glycan ML models, representations, and substitution matrix
  • next to all the new glycans, many pre-existing glycans are now better specified (e.g., Gal3S instead of GalOS, wherever location of modification is known)
  • GlyTouCan IDs were added whenever possible
  • motif_list was expanded by two new motifs (difucosylated N-glycan core & extended core fucose)

ml

train_test_split

  • modified hierarchy_filter to ignore glycans with ‘undetermined’ taxonomy label

Published by Bribak almost 5 years ago

glycowork - v0.1.0

Published by Bribak about 5 years ago