Recent Releases of https://github.com/akikuno/dajin2
https://github.com/akikuno/dajin2 - 0.7.1
v0.7.1 (2025-07-18)
🌟 New Features
- Added Web-based Graphical User Interface (GUI): DAJIN2 now provides a user-friendly web interface that can be launched with `DAJIN2 gui`. The GUI supports both single sample analysis and batch processing with real-time progress monitoring, file upload capabilities, and cross-platform file management. See PR: #106
🐛 Bug Fixes
- Fixed KeyError for allele names containing underscores: Resolved a critical bug where allele names containing underscores (e.g., `deletion_allele`) caused a KeyError during HTML export. Updated `sequence_exporter.py` to properly parse allele names from headers with complex naming patterns. See PR: #107
📝 Documentation
- Updated README with GUI Instructions: Added comprehensive documentation for the new GUI mode in both English (README.md) and Japanese (README_JP.md) versions, including step-by-step instructions for single sample analysis and batch processing via the web interface.
- Python
Published by akikuno 7 months ago
https://github.com/akikuno/dajin2 - 0.7.0
🌟 New Features
- Added `--no-filter` option to detect rare mutations. See Issue: #83
- Added `-b/--bed` option to specify a BED file when using genomes other than UCSC reference genomes. See Issue: #26
🔧 Maintenance
- Updated Python support from 3.9 to 3.12 due to dependencies in `pysam` and `mappy`. See Issue: #101
- Python
Published by akikuno 8 months ago
https://github.com/akikuno/dajin2 - 0.6.2
💥 Breaking
- Improved overly lenient strand bias detection. Issue: #89
📝 Documentation
🐛 Bug Fixes
- Fixed CSV header validation error when processing files containing a BOM (Byte Order Mark). Issue: #88 [Commit]
- Corrected the argument order in `fastx_handler.convert_bam_to_fastq`. Issue: #94 [Commit]
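The BOM issue can be illustrated with a minimal sketch. It assumes the fix amounts to decoding the file with Python's `utf-8-sig` codec; the function name `read_csv_header` and the column names are hypothetical and are not taken from the DAJIN2 code base.

```python
import csv
import io


def read_csv_header(raw: bytes) -> list[str]:
    """Return the header row of a CSV, ignoring a leading UTF-8 BOM if present."""
    # "utf-8-sig" strips the BOM (EF BB BF) when it exists and behaves
    # like plain UTF-8 when it does not.
    text = raw.decode("utf-8-sig")
    return next(csv.reader(io.StringIO(text)))


# A file saved by a spreadsheet application often starts with a BOM.
with_bom = b"\xef\xbb\xbfsample,control,allele,name\nA,B,C,D\n"
without_bom = b"sample,control,allele,name\nA,B,C,D\n"

print(read_csv_header(with_bom))
print(read_csv_header(without_bom))
```

Decoding with plain `utf-8` would leave `\ufeff` glued to the first column name, which is the kind of mismatch that makes a header check fail.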
🔧 Maintenance
- Shortened DAJIN2 log filenames by replacing UUIDs with microsecond-based timestamps for improved readability. Issue: #95 [Commit]
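As a rough illustration of a microsecond-based log filename, the sketch below uses `datetime.strftime` with the `%f` directive. The `DAJIN2_log_` prefix follows the naming mentioned in the 0.4.6 notes; the exact timestamp layout and the helper name `make_log_filename` are assumptions.

```python
from datetime import datetime
from typing import Optional


def make_log_filename(now: Optional[datetime] = None) -> str:
    """Build a log filename from a timestamp with microsecond resolution."""
    now = now or datetime.now()
    # %f is the zero-padded microsecond field, so two runs started within the
    # same second still get different names in practice.
    stamp = now.strftime("%Y%m%d_%H%M%S_%f")
    return f"DAJIN2_log_{stamp}.txt"


print(make_log_filename(datetime(2025, 1, 2, 3, 4, 5, 678901)))
```

Compared with a UUID, the timestamp keeps the name short and sorts chronologically in a directory listing.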
- Python
Published by akikuno 9 months ago
https://github.com/akikuno/dajin2 - 0.6.1
v0.6.1 (2025-03-18)
🚀 Performance
- Use `BisectingKMeans` instead of `AgglomerativeClustering` because `BisectingKMeans` can take a `spmatrix` as input, significantly reducing memory consumption. [Commit Detail]
📝 Documentation
- Specify the Range of Bases to Be Recorded in the FASTA File. Issue #78 [Commit Detail]
🔧 Maintenance
- Explicitly unify the line endings of text files in DAJIN_Reports to `LF`. [Commit Detail]
- Upgrade to `pandas = ">=2.0.0"` because the argument specification for line terminator was changed to `lineterminator` in pandas >=1.5. [Commit Detail]
- Sort MUTATION_INFO by Allele ID. Issue #79 [Commit Detail]
- Add `sv_annotator` to reflect SV alleles in consensus midsv tags. [Commit Detail]
- Refactoring `annotate_insertions_within_deletion`: Previously, a similar function existed in `cssplit_handler`, but since this function is only called once during consensus, it has been moved to a dedicated module, `consensus.sv_annotator`. At the same time, the function has been simplified. [Commit Detail]
🐛 Bug Fixes
- Reflect Inversion Alleles When Flanked by Deletions at HTML. Issue #82 [Commit Detail]
- Fix the issue where the SV length was reflected one base longer in deletion/inversion SV alleles. [Commit Detail]
- Fix a bug where the silhouette score could not be calculated and resulted in an error when the sample and control were completely separated at a 1:1 ratio. [Commit Detail]
- Correct the mislabeling of Deletion Allele as Insertion Allele in HTML report. [Commit Detail]
- Return the region containing the insertion sequence as a deletion sequence if the region flanked by deletions is determined to be an insertion sequence. Issue #86 [Commit Detail]
- Inversions are underlined since they can coexist with other mutations, while others are highlighted. Issue #84 [Commit Detail]
- Reflect the mutations (indel, substitution) within the inversion in HTML and MUTATION_LOCI. [Commit Detail]
- Python
Published by akikuno 11 months ago
https://github.com/akikuno/dajin2 - 0.6.0
💥 Breaking
- Add `preprocess.sv_detector` to detect SV (Insertion/Deletion/Inversion) alleles. Issue #33 [Commit Detail]
- Add `html_builder` to display SV alleles. Issue #31 [Commit Detail]
📝 Documentation
- Upgrade Python version from 3.10 to 3.12 in README.md. Issue #74 [Commit Detail]
🚀 Performance
- Simplify feature extraction using `extract_n_features` to reduce computational costs. [Commit Detail]
- To avoid overlooking minor alleles, the number of reads is increased from 10,000 to 100,000 during downsampling. [Commit Detail]
🔧 Maintenance
- Increase the SV allele number to at least two digits (e.g., `deletion01`). [Commit Detail]
- Display the currently processing NAME in batch mode. [Commit Detail]
- Append a UUID to the log filename to prevent potential filename duplication. [Commit Detail]
- Python
Published by akikuno about 1 year ago
https://github.com/akikuno/dajin2 - 0.5.6
💥 Breaking
- Support for PacBio HiFi reads. [Commit Detail]
- Add `preprocess.sequence_error_handler` to exclude Nanopore sequence errors from the analysis. Issue: #60
  - Initial commit [Commit Detail]
  - Since most Nanopore sequencing errors occur due to read interruptions, `parse_midsv_from_csv` classifies entries as either Unknown or Other (M). [Commit Detail]
  - Instead of strategies like Cosine similarity or HDBSCAN, the Jaro-Winkler distance is explicitly used as a string similarity metric. Jaro-Winkler was chosen because Levenshtein would be too time-consuming. [Commit Detail]
- Add `sr` presets to all executions in `preprocess.mapping`. Issue: #55 [Commit Detail]
- Increase the sensitivity by lowering the mutation detection threshold from 0.5% to 0.1% to detect mutations around 0.75%. [Commit Detail]
- Use `AgglomerativeClustering` instead of Constrained KMeans because AgglomerativeClustering provides a more global clustering approach, and Constrained KMeans was not very useful due to the unreliability of its `min_cluster_size`. [Commit Detail]
- Output sequence error reads as `BAM/{name}/sequence_errors.bam`. Issue: #61 [Commit Detail]
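For readers unfamiliar with the Jaro-Winkler metric mentioned above, here is a self-contained sketch of the textbook definition. It is not the DAJIN2 implementation (which may rely on a library); the parameter names and the default prefix weight of 0.1 are the conventional ones and are assumptions here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1]; 1.0 means identical strings."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Transpositions: matched characters that appear in a different order.
    seq1 = [c for c, m in zip(s1, matched1) if m]
    seq2 = [c for c, m in zip(s2, matched2) if m]
    transpositions = sum(a != b for a, b in zip(seq1, seq2)) // 2
    return (matches / len1 + matches / len2 + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str, prefix_weight: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the shared prefix (up to max_prefix)."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_weight * (1 - base)


def jaro_winkler_distance(s1: str, s2: str) -> float:
    """Distance form: 0.0 for identical strings, 1.0 for nothing in common."""
    return 1.0 - jaro_winkler(s1, s2)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The matching step only looks inside a window of about half the string length, which is why the metric tends to be cheaper than a full Levenshtein alignment on long strings.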
🚀 Performance
- Downsampling the sample reads to a maximum of 10,000. Issue: #58 [Commit Detail]
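A minimal sketch of capping the read count, assuming uniform random sampling without replacement. The helper name `downsample`, the fixed seed, and the use of `random.Random.sample` are illustrative choices, not the documented DAJIN2 behaviour.

```python
import random


def downsample(reads: list[str], max_reads: int = 10_000, seed: int = 0) -> list[str]:
    """Return at most max_reads reads, sampled without replacement."""
    if len(reads) <= max_reads:
        return list(reads)
    # A dedicated, seeded generator keeps the subset reproducible between runs.
    rng = random.Random(seed)
    return rng.sample(reads, max_reads)


reads = [f"read{i}" for i in range(25_000)]
subset = downsample(reads)
print(len(subset))
```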
🐛 Bug Fixes
- Fix a bug where an element of a dict with empty values was left behind after minor insertions were removed. [Commit Detail]
🔧 Maintenance
- With the end of security support for Python 3.8 in October 2024, we have updated DAJIN2 to support Python 3.9 or later. [Commit Detail]
- Replace `typing.Generator` with `collections.abc.Iterator` since `typing.Generator` is deprecated. Issue: #53 [Commit Detail]
- Automatically retrieve version information using `importlib.metadata.version`. Issue: #59 [Commit Detail]
- Move the FASTX IO processing to `utils.io`. Issue: #66 [Commit Detail]
- Add E2E tests in GitHub Actions. [Commit Detail]
- Python
Published by akikuno about 1 year ago
https://github.com/akikuno/dajin2 - 0.5.5.1
This is a patch for version v0.5.5.
An unfinished inversion detection program had mistakenly been included in the production code.
Since the inversion detection program is scheduled for implementation in version v0.5.6 or later, the code in question has been removed.
- Python
Published by akikuno over 1 year ago
https://github.com/akikuno/dajin2 - 0.5.5
📝 Documentation
- Add `FAQ.md` and `FAQ_JP.md` to address the question: "Why is the read count of the Control sample lower in the output BAM file?". [Commit Detail]
🔧 Maintenance
- Integrating insertion and inversion detection: Issue #31
  - Add `sv_handler` [Commit Detail]
  - Modify arguments of `is_insertion` to `is_sv` [Commit Detail]
  - Rename `insertions_to_fasta.generate_insertions_fasta` to `insertion_detector.detect_insertions` because the function is not only for generating FASTA files but also for generating CSV tags. [Commit Detail]
- Remove unused dependency `networkx`: Issue #49 [Commit Detail]
- Python
Published by akikuno over 1 year ago
https://github.com/akikuno/dajin2 - 0.5.4
💥 Breaking
- Use simulated annealing to optimize cluster assignments in `clustering.constrained_kmenas` [Commit Detail]
  - Since `ortools` is not installable on osx-arm64 in Bioconda, I implemented an alternative method, simulated annealing, to solve mincostflow.
- Change the criteria for terminating clustering. [Commit Detail]
  - The following termination criteria have been added:
    - Minimum cluster size is less than or equal to 0.5% of the sample's read number.
    - Decrease in the proportion of samples with a silhouette score of 0.25 or higher.
  - The following termination criterion has been removed:
    - Adjusted Rand Index >= 0.95, as it led to early termination when minor clusters were generated.
- The threshold for `clustering.strand bias` determination has been loosened. [Commit Detail]
  - This adjustment addresses cases like `+:13, -:2` (0.87) observed in `example_flox/flox-1nt-deletion`.
  - Since the minor allele is particularly susceptible, further adjustments may be necessary in the future.
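The `+:13, -:2` (0.87) figure above is the fraction of reads on the positive strand. The sketch below reproduces that arithmetic; the cut-offs of 0.1 and 0.9 are placeholders chosen for illustration, since the actual thresholds are not stated in these notes, and both function names are hypothetical.

```python
def positive_strand_ratio(n_plus: int, n_minus: int) -> float:
    """Fraction of reads mapped to the positive strand."""
    total = n_plus + n_minus
    if total == 0:
        raise ValueError("no reads to evaluate")
    return n_plus / total


def is_strand_biased(n_plus: int, n_minus: int, lower: float = 0.1, upper: float = 0.9) -> bool:
    """Flag a cluster whose strand ratio falls outside [lower, upper].

    The default bounds are illustrative assumptions, not DAJIN2's values.
    """
    ratio = positive_strand_ratio(n_plus, n_minus)
    return not (lower <= ratio <= upper)


print(round(positive_strand_ratio(13, 2), 2))  # 13 / 15
print(is_strand_biased(13, 2))
```

With a small cluster such as 15 reads, a shift of one or two reads moves the ratio noticeably, which is consistent with the note that minor alleles are particularly susceptible.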
🌟 New Features
- Support for Apple Silicon (osx-arm64) in Bioconda. Issue: #46
- Python
Published by akikuno over 1 year ago
https://github.com/akikuno/dajin2 - 0.5.3
💥 Breaking
- Update `clustering.clustering`: Use Constrained Kmeans clustering to address the issue of cluster imbalance where extremely minor clusters were preferentially separated. Set `min_cluster_size` to 0.5% of the sample read count. [Commit Detail]
  - As a result, `clustering.label_merger.py` is no longer needed and has been removed.
- Update `consensus.call_consensus`: For mutations determined to be sequence errors, we previously replaced them with unknown (`N`), but this `N` had low interpretability. Therefore, mutations that DAJIN2 determines to be sequence errors will now be assigned the same base as the reference genome. [Commit Detail]
🐛 Bug Fixes
- Due to a bias in `classifiler.calc_match` where alleles with shorter sequences were prioritized, the operation of dividing by sequence length has been removed. [Commit Detail]
- Fix `preprocess.mapping.generate_sam` to perform alignments with `map-ont` and `splice` in addition to `sr` for sequence lengths of 500 bp or less, and select the optimal preset from these alignments. Issue: #45 [Commit Detail]
- Python
Published by akikuno over 1 year ago
https://github.com/akikuno/dajin2 - 0.5.2
📝 Documentation
- Add `FAQ.md` and `FAQ_JP.md` to provide answers to questions. [Commit Detail]
🌟 New Features
- Update `mutation_extractor` [Commit Detail]
  - Simplified the logic of the `is_dissimilar_loci` if statement. Additionally, changed the threshold for determining a mutation in Consensus from 75% to 50% (to accommodate the insertion allele in Cas3 Tyr Barcode10).
  - Updated `detect_anomalies` to use MLPClassifier to detect mutations more flexibly and accurately compared to the previous threshold setting with MiniBatchKMeans.
🔧 Maintenance
- Make DAJIN2 compatible with Python 3.11 and 3.12. Issue: #43 [Commit Detail]
  - pysam and mappy builds with Python 3.11 and 3.12 are now available on Bioconda.
- Update GitHub Actions to test with Python 3.11 and 3.12. Issue: #43 [Commit Detail]
- Resolve the B023 "Function definition does not bind loop variable `alignment_lengths`" issue. [Commit Detail]
- Add `question.yml` in GitHub Issue template. [Commit Detail]
🐛 Bug Fixes
- Update `cssplits_handler._get_index_of_large_deletions`: Modified to split large deletions when a match of 10 or more bases is found within the identified large deletion. Issue: #42 [Commit Detail]
- Python
Published by akikuno over 1 year ago
https://github.com/akikuno/dajin2 - 0.5.1
🚀 New Features
- Accept additional file formats as input. Issue: #37
- FASTA [Commit Detail]
- BAM [Commit Detail]
📝 Documentation
- Add a description of the procedure for accepting files generated by Dorado basecaller as input. Issue: #37 [Commit Detail]
🔧 Maintenance
- Specify the Python version to be between 3.8 and 3.10. [Commit Detail]
- Change `mutation_exporter.report_mutations` to return `list[list[str]]`. Update the tests accordingly. [Commit Detail]
- Apply formatting with Ruff [Commit Detail]
🐛 Bug Fixes
- Add `reallocate_insertion_within_deletion` into `report.mutation_exporter` and reflect it in the mutation info. [Commit Detail]
- Python
Published by akikuno over 1 year ago
https://github.com/akikuno/dajin2 - 0.5.0
📝 Documentation
- Update the issue template from md to yml and modify it to make it easier for users to fill out each item. [Commit Detail]
💥 Breaking
- Extremely low-frequency alleles (less than 0.05%) are considered Nanopore sequence errors and are not clustered. #36
  - Configure `clustering.extract_labels` so that alleles with a low number of reads (0.05% or fewer or 5 reads or fewer) are not clustered. [Commit Detail]
  - Change `clustering.clustering` to stop if the minimum value of the elements in the cluster is 0.5% or less. [Commit Detail]
  - Add `consensus.remove_minor_alleles` to remove minor alleles with fewer than 5 reads or less than 0.5% [Commit Detail]
- Save subsetted fastq of a control sample if the read number is too large (> 10,000 reads). The control will have a maximum of 10,000 reads to avoid excessive computational load. [Commit Detail]
- If the read length is 500 bases or less, change the mappy preset to `sr`. [Commit Detail]
- Update `extract_best_preset` to prioritize `map-ont` and remove `splice` preset if inversion is observed. [Commit Detail]
- Update the algorithms of `cssplits_handler.reallocate_insertion_within_deletion` to automate change point detection by incorporating temporal changes. [Commit Detail]
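The minor-allele rule described for `consensus.remove_minor_alleles` (fewer than 5 reads or less than 0.5%) can be sketched as follows. The function signature and the representation of alleles as one integer label per read are assumptions made for this example.

```python
from collections import Counter


def remove_minor_alleles(labels: list[int], min_reads: int = 5, min_fraction: float = 0.005) -> list[int]:
    """Drop reads whose allele label has too few reads or too small a share."""
    counts = Counter(labels)
    total = len(labels)
    # An allele is kept only if it passes BOTH the absolute and relative cut-off.
    keep = {
        label
        for label, n in counts.items()
        if n >= min_reads and n / total >= min_fraction
    }
    return [label for label in labels if label in keep]


# Label 3 has only 4 reads out of 1,000 and is removed.
labels = [1] * 900 + [2] * 96 + [3] * 4
print(Counter(remove_minor_alleles(labels)))
```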
🔧 Maintenance
- Update `deploy_pypi.yml` to use the latest version of Actions. Refer to the latest official YAML for guidance. [Commit Detail]
- Integrate `requirements.txt` and `MANIFEST.in` into `pyproject.toml` by replacing `setup.py` [Commit Detail]
- Modify to record the execution command of DAJIN2 in the log file [Commit Detail]
- Add a test to check if the version in `test_version.sh` matches the version in `pyproject.toml` and `utils.config` [Commit Detail]
- Rename `consensus.subset_clust` to `consensus.downsample_by_label` to clarify the function's purpose. [Commit Detail]
- Update `extract_unique_insertions` to merge highly similar extracted insertion sequences. [Commit Detail]
  - Fix `extract_unique_insertions`: There was a bug where removing the key twice in fastainsertionsunique caused the index and key to become misaligned in enumerate(distances) if i != key. Therefore, the removal of keys from fastainsertionsunique is now done all at once at the end. [Commit Detail]
- Add control characters for `fastx_handler.sanitize_filename` as forbidden chars. [Commit Detail]
- Change the naming convention for the temporary directory: `<sample_name>/<process_content>/<allele_name>/(<label_name>)/file_name`. Example: `flox/consensus/control/1/mutation_loci.pickle`. [Commit Detail]
- Move `sanitze_name` function from `utils.fastx_handler` to `utils.io` [Commit Detail]
🐛 Bug Fixes
- Remove `sam_handler.remove_overlapped_reads` to prevent unnecessary trimming of reads. [Commit Detail]
- Fix `preprocess.insertions_to_fasta.remove_minor_groups` to delete the keys (insertion loci) when insertions are removed and result in an empty dict. This prevents errors when accessing non-existent keys in `subset_insertions`. [Commit Detail]
- Fix the bug in `cssplits_handler.convert_cssplits_to_cstag` where the insertion cs tag is not merged with the next cs tag if they have the same operator (e.g., `+A|+A|=T, =T`: before: `+aa=T=T`, after: `+aa=TT`). [Commit Detail]
- Modify the system to separate intermediate files using a directory structure instead of underscores (`_`), ensuring that no errors occur even if users use allele names containing underscores [Commit Detail]
  - Thank you @geedrn for reporting the issue #39!
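The cs tag example above (`+A|+A|=T, =T` becoming `+aa=TT`) can be reproduced with a small sketch. This is a simplified stand-in for `cssplits_handler.convert_cssplits_to_cstag`, not the actual implementation: it only handles the `=`, `+`, and `-` operators, and the function name `cssplits_to_cstag` is hypothetical.

```python
def cssplits_to_cstag(cssplits: list[str]) -> str:
    """Merge per-base cssplits tokens into a compact cs tag string."""
    merged: list[list[str]] = []  # each entry is [operator, sequence]
    for token in cssplits:
        # An insertion is attached to the following base with "|",
        # e.g. "+A|+A|=T" means "AA inserted before a matching T".
        for part in token.split("|"):
            op, seq = part[0], part[1:]
            if op not in "=+-":
                raise ValueError(f"unsupported operator in this sketch: {part}")
            if op in "+-":
                seq = seq.lower()  # cs tags write inserted/deleted bases in lowercase
            if merged and merged[-1][0] == op:
                merged[-1][1] += seq  # same operator as the previous part: extend it
            else:
                merged.append([op, seq])
    return "".join(op + seq for op, seq in merged)


print(cssplits_to_cstag(["+A|+A|=T", "=T"]))
```

The key point is that the comparison with the previous operator happens after splitting on `|`, so the `=T` that closes an insertion token can still merge with the next `=T`.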
- Python
Published by akikuno over 1 year ago
https://github.com/akikuno/dajin2 - 0.4.6
💥 Breaking
- Update the log file Commit Detail
  - Add the version of DAJIN2 to the log file to track the version of the analysis.
  - Rename the log file to `DAJIN2_log_<current time>.txt` from `<current time>_DAJIN2.log` so that it can be opened in any text editor.
- Update `mutation_extractor.is_dissimilar_loci` Commit Detail
  - Rename to `is_dissimilar_loci` from `identify_dissimilar_loci` to explicitly indicate that a boolean is returned.
  - Changed to use cosine distance instead of cosine similarity to make "difference from control" more intuitive.
  - Added a condition to ensure that the cosine distance is not dependent on the specific index: Calculate the cosine distance for 10 bases starting from the neighbor of the corresponding indel, and add the condition that the cosine distances of these adjacent 10 bases should be similar.
- Update `preprocess.insertions_to_fasta.py` which detects unintended insertion alleles. Commit Detail
  - `clustering_insertions`: To accelerate MeanShift clustering, set `bin_seeding=True`. Additionally, because clustering decoys without variation becomes extremely slow, we have switched to using decoys that include slight variations.
  - `extract_unique_insertions`: Within `unintended insertion alleles`, alleles similar to the `intended allele` provided by the user are now excluded.
    - The similarity is defined as there being differences of more than 10 bases
- Update `preprocess.insertions_to_fasta.clustering_insertions` to consider the length of each insertion sequence during clustering. This allows two alleles, such as `N,(30-base Insertion)` and `(30-base Insertion),N`, to be weighted with different scores as `[(1, 30), (30, 1)]`, enabling correct clustering. Commit Detail
- Update `preprocess.homopolymer_handler`: Scaling data to [0, 1] for cosine similarity, normalizing to match scales due to significant differences in mutation rates between samples and controls. Commit Detail
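Several items above switch from cosine similarity to cosine distance. As a reminder of the relationship (distance = 1 - similarity), here is a dependency-free sketch; the function name and the example vectors are made up for illustration and do not come from DAJIN2.

```python
import math
from typing import Sequence


def cosine_distance(x: Sequence[float], y: Sequence[float]) -> float:
    """1 - cosine similarity: 0.0 for the same direction, 1.0 for orthogonal vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_x * norm_y)


# Hypothetical per-position mutation rates for a sample and its control.
sample = [0.02, 0.30, 0.01]
control = [0.02, 0.03, 0.01]
print(round(cosine_distance(sample, control), 3))
```

Read as a distance, a larger value means the sample differs more from the control, which is the intuition the notes refer to.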
📝 Documentation
- Add a description to README.md stating that the supported Python versions are 3.8 to 3.10 due to a Bioconda issue. Commit Detail
- Enhance the descriptions in GitHub Issue templates to clarify their purpose. Commit Detail
🔧 Maintenance
- Move `DAJIN2_VERSION` to `utils.config.py` from `main.py` to make it easier to recognize its location. Commit Detail
- Update `io.read_csv` to return a `list[dict[str, str]]`, not `list[str]`, to align the output format with `read_xlsx`. Commit Detail
- Update `utils.input_validator` and `preprocess.genome_fetcher` to temporarily disable SSL certificate verification, allowing access to UCSC servers. Commit Detail
- Add an example of flox knockin design to the `examples` Commit Detail
- Update `preprocess.insertions_to_fasta.py`: The label names for the insertions were not starting from 1, so they have been revised to begin at 1. Commit Detail
- Change installer from pip to conda to install mappy in macos-latest (macos-14-arm64) in GitHub Actions Commit Detail
🚀 Performance
- Update `consensus.similarity_searcher` to cache onehot encoded controls to avoid redundant computations and increase processing speed. Commit Detail
🐛 Bug Fixes
- Debug `clustering.strand_bias_handler` Commit Detail
  - For `positive_strand_counts_by_labels: dict`, there was a bug that caused an error and halted execution when accessing a non-existent key. It has been fixed to output 0 instead.
  - Created a wrapper function `annotate_strand_bias_by_labels` for outputting strand bias. Fixed a bug where the second and subsequent arguments were not being correctly passed when reallocating clusters with strand bias.
- Fix `preprocess.knockin_handler` to correctly identify the flox knock-in sites as deletions not present in the control. Commit Detail
- Bug fix of `reallocate_insertion_within_deletion` Commit Detail
  - In the script that considers the region between two deletions as an insertion sequence, the size of the other deletion was not taken into account. Even if there was a single base deletion, the entire sequence between the deletions was considered as an insertion sequence. Therefore, the region between two deletions is now defined only if the size of both deletions is equal to or greater than the specified threshold (default = 3).
- Python
Published by akikuno almost 2 years ago
https://github.com/akikuno/dajin2 - 0.4.5
🐛 Bug Fixes
- In version 0.4.4 of `strand_bias_handler.remove_biased_clusters`, there was an error in the continuation condition for removing biased clusters, which has now been corrected. The correct condition is "there are alleles with and without strand bias and the iteration count is less than or equal to 1000". It had been incorrectly set to "there are alleles with and without strand bias or the iteration count is less than or equal to 1000".
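The difference between the two continuation conditions can be seen in a toy loop. This is only a model of the control flow, assuming the bias would be resolved after a fixed number of rounds; `count_iterations` and its parameters are hypothetical and unrelated to the real function body.

```python
def count_iterations(rounds_until_resolved: int, use_and: bool, max_iter: int = 1000) -> int:
    """Count how many rounds a loop runs under the 'and' or the 'or' condition."""
    i = 0
    while True:
        # True while biased and unbiased alleles still coexist.
        mixed = i < rounds_until_resolved
        within_limit = i <= max_iter
        keep_going = (mixed and within_limit) if use_and else (mixed or within_limit)
        if not keep_going:
            return i
        i += 1


print(count_iterations(3, use_and=True))   # stops as soon as the bias is resolved
print(count_iterations(3, use_and=False))  # keeps running until the iteration cap
```

With `or`, the loop keeps going after the bias is gone, and it would never stop at all if the bias were never resolved, because the first operand would stay true past the iteration cap.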
- Python
Published by akikuno almost 2 years ago
https://github.com/akikuno/dajin2 - 0.4.4
💥 Breaking
- Update the threshold from 5 to 0.5 at `identify_dissimilar_loci` to capture 1% minor alleles. Commit Detail
- Return smaller allele clustering labels (`labels_previous`) when the adjusted Rand index is sufficiently high to reduce predicted allele numbers. Commit Detail
🔧 Maintenance
- Add a detailed description at `identify_dissimilar_loci` to clarify the purpose of the function. Commit Detail
- Update a function name of `utils.io.check_excel_or_csv` to `utils.io.determine_file_type` for clarity. Commit Detail
- Update examples: In tyrc230gt01, the point mutation of Tyr was previously 0.7%, but has been increased to 1.0% by adding point mutation reads from tyrc230gt50. Commit Detail
- Rename `validate_columns_of_batch_file` in test_main.py. Commit Detail
- Add tests of `strand_bias_handler` Commit Detail
- Add type hints and comments in `return_labels` Commit Detail
- Python
Published by akikuno almost 2 years ago
https://github.com/akikuno/dajin2 - 0.4.3
📝 Documentation
- Update example dataset and a description of README.md/README_JP.md Commit Detail
🐛 Bug Fixes
- Update `preprocess.genome_fetcher_fetch_seq_coordinates` to accurately verify that the entire length of the input sequence is present within the reference sequence. Previously, partial 100% matches were inadvertently accepted; this revision aims to ensure the full alignment of the input sequence with the reference. Commit Detail
- Update `report.bam_exporter` to be case-sensitive and consistent with directory names. This is to avoid errors caused by the difference between report/bam and report/BAM on Ubuntu, which is case-sensitive to directory names. Commit Detail
  - Thank you @takeiga for reporting the issue #24!
🔧 Maintenance
- Change `threshold_readnumber` at `label_merger.merge_labels` from 10 to 5 to capture 1% alleles from 500 total reads. Commit Detail
- Update the `requirements.txt` to install a newer version of the library. Commit Detail
- Update `report.report_bam` and rename to `report.bam_exporter`: Commit Detail
  - Use UUID instead of random number for the temporary file name.
  - Rename `realign` to `recalculate_sam_coodinates_to_reference` for the readability of the function name.
  - Add `convert_pos_to_one_indexed` to convert the 0-based position to 1-based position and suppress samtools warning.
    - Warning: `[W::sam_parse1] mapped query cannot have zero coordinate; treated as unmapped`
  - Add tests for the `write_sam_to_bam` function
- Move `read_sam` function from sam_handler to io module. Commit Detail
- Rename `report.report_mutation`, `report.report_files` to `report.mutation_exporter` and `report.sequence_exporter` to be more explicit. Commit Detail
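The samtools warning quoted above appears because the POS field of a SAM record is 1-based, so a value of 0 is read as "unmapped". The sketch below shows the shift, assuming every alignment position in the input is 0-based; the function name `shift_pos_to_one_based` and the list-of-fields record layout are assumptions, not the signature of `convert_pos_to_one_indexed`.

```python
def shift_pos_to_one_based(sam_records: list[list[str]]) -> list[list[str]]:
    """Add 1 to the POS column (4th field) of every alignment line."""
    converted = []
    for fields in sam_records:
        fields = list(fields)  # copy so the caller's records are not modified
        if not fields[0].startswith("@"):  # header lines have no POS field
            fields[3] = str(int(fields[3]) + 1)
        converted.append(fields)
    return converted


records = [
    ["@SQ", "SN:control", "LN:2000"],
    ["read1", "0", "control", "0", "60", "4M", "*", "0", "0", "ACGT", "IIII"],
]
print(shift_pos_to_one_based(records)[1][3])
```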
- Python
Published by akikuno almost 2 years ago
https://github.com/akikuno/dajin2 - 0.4.2
🔧 Maintenance
- Remove multi-mapping reads, as multi-mapping reads are mostly reads that are locally mapped to low-complexity regions. Commit Detail
- Create `preprocess.input_formatter.py` to gather formatting functions into one module. Commit Detail
- Refactor `directory_manager.py` Commit Detail
- Refactor `preprocess.__init__.py` Commit Detail
- To increase cohesion by grouping functions of the same category into a single module, we have migrated `preprocess.fastx_parser` to `utils.fastx_handler`. Commit Detail
- Remove the packages that are no longer in use from `requirements.txt`. Commit Detail
- Add `read_sam` in sam_handler module. Commit Detail
- Revise the docstring of `export_fasta_files`. Commit Detail
- Standardize to use `dataclass` instead of `NamedTuple`. Commit Detail
- Python
Published by akikuno almost 2 years ago
https://github.com/akikuno/dajin2 - 0.4.1
📝 Documentation
- Added documentation for a new feature in `README.md`: DAJIN2 can now detect complex mutations characteristic of genome editing, such as insertions occurring in regions where deletions have occurred.
🚀 New Features
- Introduced `cssplits_handler.detect_insertion_within_deletion` to extract insertion sequences within deletions. This addresses cases where minimap2 may align bases that partially match the reference through local alignment, potentially failing to detect them as insertions. This enhancement ensures the proper detection of insertion sequences. Commit Detail
- Added `report.insertion_refractor.py` to include original insertion information in the consensus for mappings made by insertion. This addition enables the listing of both insertions and deletions within the insertion allele on a single HTML file. Commit Detail
🔧 Maintenance
- Updated `insertions_to_fasta.py`. Commit Detail
  - Modified the approach to reduce randomness by replacing set or frozenset with list or tuple, and using `random.sample()` for subsetting reads.
  - Refactored `call_consensus_insertion_sequence`.
  - Fixed a bug in `extract_score_and_sequence` to ensure correct appending of scores for the insertionsmergedsubset.
- Changed the function name of `report` to be more explicit. Commit Detail
- Updated `utils.report_report_generator` Commit Detail
  - Capitalized "Allele" (e.g., control) and "Allele type" (e.g., intact).
  - Changed the output format of readall and readsummary from CSV to XLSX.
  - Corrected the order of the Legend to follow a logical sequence from control to sample, and then to specific insertions.
- Updated `utils.io.read_xlsx` to switch from using pandas to openpyxl due to the DeprecationWarning in Pandas being cumbersome. Commit Detail
🐛 Bug Fixes
- Added `=` to the prefix for valid cstag recognition when there is an `n` in inversion. Commit Detail
- Modified the io.loadfromcsv function to trim spaces before and after each field, addressing an error caused by spaces in batch.csv. Commit Detail
⛔️ Deprecated
- Removed `reads_all.csv`. This CSV file, which showed the allele for each read, is no longer reported due to its limited usefulness and because the same information can be obtained from the BAM file. Commit Detail
- Python
Published by akikuno about 2 years ago
https://github.com/akikuno/dajin2 - 0.4.0
💥 Breaking
- Changed the input from a path to a FASTQ file to a path to a directory: The output of Guppy is now stored in multiple FASTQ files under the `barcodeXX/` directory. Previously, it was necessary to combine the FASTQ files in the `barcodeXX/` directory into one and specify it as an argument. With this revision, it is now possible to directly specify the `barcodeXX` directory, allowing users to seamlessly proceed to DAJIN2 analysis after Guppy processing. Commit Detail
📝 Documentation
- Changed `conda config --set channel_priority strict` to `conda config --set channel_priority flexible` for installation process in TROUBLESHOOTING.md. Commit Detail
🚀 New Features
- Apple Silicon (ARM64) support. Commit Detail
- Changed the definition of the minor allele from a read number of less than or equal to 10 to less than or equal to 5. This is based on the assumption that one sample contains 1000 reads, where 0.5% corresponds to 5 reads. Commit Detail
🔧 Update
- Update `preprocess.insertion_to_fasta`: To facilitate the discrimination of insertion alleles, the reference for insertion alleles is now saved in the FASTA/HTML directory. Commit Detail
- Update `insertions_to_fasta.extract_enriched_insertions`: Previously, it calculated the presence ratio of insertion alleles separately for samples and controls, filtering at 0.5%. However, due to a threshold issue, some control insertions were narrowly missing the threshold, resulting in them being incorrectly identified as sample-specific insertions. To rectify this, the algorithm now clusters samples and controls together, excluding clusters where both types are mixed. This modification allows for the extraction of sample-specific insertion alleles. Commit Detail
- Updated the counting method of `preprocess.insertions_to_fasta.count_insertions` to treat similar insertions as identical. Previously, the same insertion was erroneously counted as different ones due to sequence errors. Commit Detail
- Updated `preprocess.insertions_to_fasta.merge_similar_insertions`: Previously, clustering was done using MiniBatchKMeans, but this method had an issue where it excessively clustered when only highly similar insertion sequences existed. Therefore, a strategy similar to `extract_enriched_insertions` was adopted, changing the algorithm to one that mixes with a uniform distribution of random scores before clustering. Commit Detail
- Added `preprocess.insertions_to_fasta.clustering_insertions`: Combined the clustering methods used in `extract_enriched_insertions` and `merge_similar_insertions` into a common function. Commit Detail
- Moved the `call_sequence` function to the `cssplits_handler` module. Commit Detail
🐛 Bug Fixes
- Debug `clustering.merge_labels` to be able to correctly revert minor labels back to parent labels. Commit Detail
- Updated `utils.input_validator.validate_genome_and_fetch_urls` to obtain `available_server` more explicitly. Previously, it relied on HTTP response codes, but there were instances where the UCSC Genome Browser showed a normal (200) response while internally being in error. Therefore, with this change, a more explicit method is employed by searching for specific keywords present in the normal HTML, to determine if the server is functioning correctly. Commit Detail
- Added `config.reset_logging` to reset the logging configuration. Previously, when batch processing multiple experiment IDs (names), a bug existed where the log settings from previous experiments remained, and the log file name was not updated. However, with this change, log files are now created for each experiment ID. Commit Detail
- Debugged `core.py`: Modified the specification of `paths_predefined_fasta` to accept input from user-entered ALLELE data. Previously, it accepted fasta files stored in the fasta directory. However, this approach had a bug where fasta files left over from a previously aborted run (which included newly created insertions) were treated as predefined. This resulted in new insertions being incorrectly categorized as predefined. Commit Detail
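The logging reset described above can be sketched with the standard `logging` module. This is an assumption about the general approach (detach old handlers before attaching a new file handler); the logger name, the helper `set_log_file`, and the file names are hypothetical and do not reflect the actual `config.reset_logging` code.

```python
import logging
import tempfile
from pathlib import Path

LOGGER_NAME = "dajin2_example"  # hypothetical logger name, used only in this sketch


def reset_logging() -> None:
    """Detach and close every handler so the next experiment starts clean."""
    logger = logging.getLogger(LOGGER_NAME)
    for handler in list(logger.handlers):
        logger.removeHandler(handler)
        handler.close()


def set_log_file(path: Path) -> None:
    """Route log records to `path`, dropping handlers from earlier experiments."""
    reset_logging()
    handler = logging.FileHandler(path)
    handler.setFormatter(logging.Formatter("%(message)s"))
    logger = logging.getLogger(LOGGER_NAME)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)


# Simulate batch mode: each experiment name gets its own log file.
tmpdir = Path(tempfile.mkdtemp())
for name in ("exp1", "exp2"):
    set_log_file(tmpdir / f"{name}.log")
    logging.getLogger(LOGGER_NAME).info("processing %s", name)
reset_logging()
```

Without the reset, the handler created for the first experiment would stay attached and keep receiving the records of every later experiment.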
- Python
Published by akikuno about 2 years ago
https://github.com/akikuno/dajin2 - 0.3.6
📝 Documentation
- Added a quick guide for installation to TROUBLESHOOTING.md. Commit Detail
🚀 Update
Preprocess
- Updated `input_validator.py`: The UCSC Blat server sometimes returns a 200 HTTP status code even when an error occurs. In such cases, "Very Early Error" is indicated in the title. Therefore, we have made it so that it returns False in those situations. Commit Detail
- Simplified `homopolymer_handler.py` for error detection using cosine similarity. Commit Detail
- Updated `mutation_extractor.py` to use cosine similarity to filter dissimilar loci. Commit Detail
- Updated the `mutation_extractor.identify_dissimilar_loci` so that it unconditionally returns True if the 'sample' shows more than 5% variation compared to the 'control'. Commit Detail
- Added `preprocess.midsv_caller.convert_consecutive_indels_to_match`: Due to alignment errors, instances where a true match is mistakenly replaced with "insertion following a deletion" are corrected. For example, "=C,=T" mistakenly replaced by "-C,+C|=T" is reverted back to "=C,=T". Commit Detail
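The `-C,+C|=T` to `=C,=T` correction can be sketched for the single-base case. The function below borrows the name from the notes but is a simplified illustration: it takes a list of tokens rather than a comma-separated string, and it only cancels one deleted base against the first inserted base of the next token, which may be narrower than what DAJIN2 actually does.

```python
def convert_consecutive_indels_to_match(cssplits: list[str]) -> list[str]:
    """Turn a deletion immediately re-inserted by the next token back into a match."""
    result = list(cssplits)
    for i in range(len(result) - 1):
        current, following = result[i], result[i + 1]
        if not current.startswith("-") or not following.startswith("+"):
            continue
        deleted_base = current[1:]
        # "+C|=T" -> first_insertion "+C", rest "=T"
        first_insertion, _, rest = following.partition("|")
        if first_insertion[1:] == deleted_base and rest:
            result[i] = "=" + deleted_base  # the base was never really lost
            result[i + 1] = rest            # drop the cancelled insertion
    return result


print(convert_consecutive_indels_to_match(["-C", "+C|=T"]))
```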
Classification
- Added `allele_merger.merge_minor_alleles` to reclassify alleles with fewer than 10 reads to suppress excessive subdivision of alleles. Commit Detail
Clustering
- Added the function `merge_minor_cluster` to revert labels clustered with fewer than 10 reads back to the previous labels to suppress excessive subdivision of alleles. Commit Detail
- Updated `generate_mutation_kmers` to consider indices not registered in mutation_loci as mutations by replacing them with "@". For example, "=G,=C,-C" and "=G,=G,=C" become "@,@,@" in both cases, making them the same and ensuring they do not affect clustering. Commit Detail
Consensus
- Implemented `LocalOutlierFactor` to filter abnormal control reads. Commit Detail
- Python
Published by akikuno about 2 years ago
https://github.com/akikuno/dajin2 - 0.3.5
Last update: 2023-12-23
📝 Documentation
- [x] Added ROADMAP.md to track the progress of the project Commit Detail
- [x] Added Prerequisites section to README.md Commit Detail
🚀 Features
Preprocessing
- [x] Updated homopolymer_handler.get_counts_homopolymer to count mutations in homopolymer regions considering only the control Commit Detail
Clustering
- [x] Changed clustering algorithm from KMeans to BisectingKMeans to handle larger datasets Commit Detail
Consensus
- [x] Added convert_consecutive_indels_to_match to offset the effect when the same base insertion/deletion occurs consecutively Commit Detail
- [x] Added similarity_searcher.py to extract control reads resembling the consensus sequence, thereby enhancing the accuracy of detecting sample-specific mutations. Commit Detail
- [x] Changed the method in clust_formatter.get_thresholds to dynamically define the thresholds for ignoring mutations, instead of using fixed values. Commit Detail
- [x] Removed code that was previously commented out Commit Detail
🐛 Bug Fixes
- None
🔧 Maintenance
- [x] Modified batch processing to run on a single CPU thread per process Commit Detail
- [x] Simplified import path Commit Detail
  - preprocess.midsv_caller.execute to preprocess.generate_midsv
  - preprocess.mapping.generate_sam to preprocess.generate_sam
- [x] Added tests to consensus.convert_consecutive_indels_to_match Commit Detail
⛔️ Deprecated
- None
- Python
Published by akikuno about 2 years ago
https://github.com/akikuno/dajin2 - 0.3.4
📖 Documentation
- Added docs/TROUBLESHOOTING.md
- Added docs/CODEOFCONDUCT.md
- Added docs/CONTRIBUTING.md
✨ New Features
- None
🔧 Maintenance
Update preprocess.mutation_extractor.py
- count_indels:
  - Change: Method of counting indels modified to use only matches as the denominator, instead of matches + indels.
  - Reason: To specifically focus on the occurrence rate of particular mutations.
- find_dissimilar_indices:
  - Change: Mutation detection modified. If the p-value remains < 0.05 after removing the target base sequence, the area is not detected as a mutation, assuming the significance is due to other parts.
  - Implication: Increases mutation detection accuracy by excluding irrelevant base sequences.
- merge_index_of_consecutive_indel:
  - Change: Merged merge_surrounding_index and merge_index_of_consecutive_insertions into a single function.
  - Benefit: Streamlines the process and enhances efficiency in handling consecutive indels.
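To make the count_indels denominator change concrete, here is a small arithmetic sketch. The function name and signature are hypothetical and only illustrate the two ways of computing the rate mentioned above.

```python
def indel_rate(match_count: int, indel_count: int, matches_only: bool = True) -> float:
    """Compute an indel rate with either denominator (illustrative helper).

    matches_only=True  -> indels / matches            (behavior after the change)
    matches_only=False -> indels / (matches + indels) (behavior before the change)
    """
    denominator = match_count if matches_only else match_count + indel_count
    if denominator == 0:
        return 0.0
    return indel_count / denominator


# 90 matching reads and 10 reads carrying the indel at one position:
rate_before = indel_rate(90, 10, matches_only=False)  # 10 / 100
rate_after = indel_rate(90, 10, matches_only=True)    # 10 / 90
```

With the matches-only denominator the same counts yield a slightly higher rate, and the value no longer depends on how many other indels were pooled into the total.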
Update consensus.consensus.py:
- Addressed a precision issue in floating-point calculations where N equals 100%, leading to 100 != 100.000002. Changed the condition to "having only one key and that key being N". Commit details
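The reasoning behind this fix can be shown with a short sketch. The dictionary layout (base symbol mapped to its percentage) and the function name are assumptions for illustration, not the actual code in consensus.py.

```python
def is_all_n(percentages: dict) -> bool:
    """Return True when 'N' is the only key, regardless of its exact value.

    Hypothetical helper: it avoids comparing a float against exactly 100.
    """
    return len(percentages) == 1 and "N" in percentages


# Accumulated rounding error can leave the total slightly off from 100.
percentages = {"N": 100.000002}

fragile_check = percentages["N"] == 100  # equality on floats misses this case
robust_check = is_all_n(percentages)     # structural check still succeeds
```

Checking the structure of the dictionary sidesteps the floating-point comparison entirely, which is why it is more robust than adding a tolerance to the equality test.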
Update mutation_extractor.py:
- Switched to the Wilcoxon signed-rank test due to false negatives in the t-test for data with peak-like shapes. Commit details
Others
- Modified batch processing to run on a single CPU thread per process.
- Added clust_formatter.cache_mutation_loci.
- Changed mutation_extractor.merge_loci to use union instead of intersection.
- Added a filter for minor insertion alleles in insertions_to_fasta.py.
- Moved insertion_to_fasta.save_fasta to utils.io.save_fasta.
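For the first item in the list above (one CPU thread per process), a common approach is to cap the thread pools of the numerical libraries through environment variables. The sketch below assumes that approach. The helper name is hypothetical, and whether DAJIN2 sets exactly these variables is not stated in the release notes.

```python
import os


def limit_numeric_threads(n_threads: int = 1) -> None:
    """Cap the thread pools used by common BLAS/OpenMP backends.

    Hypothetical helper. The variables must be set before the numerical
    libraries (e.g. NumPy) are imported in the worker process.
    """
    for name in ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS"):
        os.environ[name] = str(n_threads)


limit_numeric_threads(1)
```

Capping each worker at one thread keeps a batch of parallel processes from oversubscribing the CPU when every process would otherwise start its own multi-threaded BLAS pool.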
- Python
Published by akikuno about 2 years ago
https://github.com/akikuno/dajin2 - 0.3.3
📖 Documentation
- Added troubleshooting.md
✨ New Features
- Excluded the letter 'N' except when all bases are 'N' (which indicates reads with missing ends).
- Upon successful completion, the log file is now moved to the report directory (DAJIN_Results/{name}).
🔧 Modification
- Changed from OneClassSVM to k-means for anomaly detection (https://github.com/akikuno/DAJIN2/commit/d97d32a37a9e4c241fed414f9ec4c21b50f8f1b6)
🧰 Maintenance
- Set up weekly tests to run on GitHub Actions.
- Python
Published by akikuno over 2 years ago
https://github.com/akikuno/dajin2 - 0.3.2
📖 Documentation
- [x] Revisions to README.md and README_jp.md
- [x] Added a note to the README to install gcc and zlib when encountering installation errors for mappy via pip
✨ New Features
None
🛠️ Maintenance
- [x] Refactoring of main.py
  - config.set_single_threaded_blas
  - config.set_logging
  - utils.multiprocess
- [x] Verified operation with the latest cstag (v1.0.5)
- [x] Limited the generation of log files with every execution
  - It's troublesome to have an empty log every time you check for help or version
  - Ensured log files are only generated at appropriate times (like during logging.info) or in case of unexpected errors
- [x] Added convert_cssplits_to_cstag to utils.cssplits_handler
  - Converted cssplits to cstag, ensuring to_html operates without issues
  - However, the existing CS tag doesn't represent inversion, so further consideration is needed on how to handle this
- [x] Added tests for convert_cssplits_to_cstag
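A conversion along the lines of convert_cssplits_to_cstag might look like the sketch below. It is an illustration under stated assumptions, not the function in utils.cssplits_handler: it assumes comma-separated tokens of the form "=A" (match), "-A" (deletion), "*AG" (substitution), and "+T|+A|=C" (insertions attached to the following token), it emits the long-form cs tag, and it ignores inversions.

```python
def convert_cssplits_to_cstag(cssplits: str) -> str:
    """Convert a comma-separated cssplits string into a long-form cs tag.

    Hypothetical sketch; the token grammar is assumed and inversions are ignored.
    """
    parts = []  # list of [operator, sequence] pairs

    def push(op: str, seq: str) -> None:
        # Consecutive matches, deletions, or insertions merge into one run;
        # each substitution stays a separate "*xy" entry.
        if op in "=-+" and parts and parts[-1][0] == op:
            parts[-1][1] += seq
        else:
            parts.append([op, seq])

    for token in cssplits.split(","):
        *insertions, last = token.split("|")
        for inserted in insertions:
            push("+", inserted[1:].lower())
        op, seq = last[0], last[1:]
        # cs tags write matches in upper case and all other operations in lower case.
        push(op, seq.upper() if op == "=" else seq.lower())

    return "".join(op + seq for op, seq in parts)


cstag = convert_cssplits_to_cstag("=A,=C,-G,-T,+T|+A|=C,*AG,=T")
```

For this input the sketch yields "=AC-gt+ta=C*ag=T". Because the cs tag has no operator for inversions, any inversion information in the input is lost in this conversion.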
- Python
Published by akikuno over 2 years ago
https://github.com/akikuno/dajin2 - v0.3.1
- Release for Zenodo
- Upload to Bioconda
- Python
Published by akikuno over 2 years ago