https://github.com/bin-cao/tcgpr

[NPJ Com Mat 2023 | Small 2024] Machine Learning Algorithm : outlier identifying, feature selection

Keywords

abnormal-detection featureselection ggmf outlier-identifying tcgpr

Repository

Basic Info
  • Host: GitHub
  • Owner: Bin-Cao
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 60.4 MB
Statistics
  • Stars: 14
  • Watchers: 4
  • Forks: 5
  • Open Issues: 0
  • Releases: 2
Created almost 4 years ago · Last pushed 7 months ago

README.md

TCGPR

TCGPR is a Python library implementing a divide-and-conquer strategy for outlier identification and feature selection, tailored for small datasets in materials science and beyond.


📖 Citation

If you use this code in your research, please cite the following papers:

  • Li T., Cao B., Su T., ... Feng L., Zhang T. Machine Learning-Engineered Nanozyme System for Synergistic Anti-Tumor Ferroptosis/Apoptosis Therapy, SMALL Link to paper

  • Wei Q., Cao B., Yuan H., ... Dong Z., Zhang T. Divide and conquer: Machine learning accelerated design of lead-free solder alloys with high strength and high ductility, npj Computational Materials Link to paper


📜 Project History

  • 2022: TCGPR was first proposed and implemented, in collaboration with Mr. Hao Yuan (experiments) and Mr. Qinghua Wei (experiments). It was successfully applied to the optimization of lead-free solder alloys. → Published in npj Computational Materials News link

  • 2024: After two years of development, TCGPR was enhanced with sequential feature selection and outlier detection. In collaboration with Mr. Tianliang Li (experiments) and Mr. Tianhao Su (computations), it was applied to anti-tumor ferroptosis studies. → Published in SMALL News link


🧠 Algorithm Overview

For an in-depth explanation of the algorithm, see the TCGPR Introduction PDF.


🔧 Installation

Install TCGPR via PyPI:

```bash
pip install PyTcgpr
```

To verify the installation:

```bash
pip show PyTcgpr
```

To upgrade to the latest version:

```bash
pip install --upgrade PyTcgpr
```
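The installation check can also be done from Python. The following is a minimal sketch using only the standard library; the helper name `installed_version` is hypothetical and not part of PyTcgpr.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None if absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("PyTcgpr")
    print("PyTcgpr version:", found if found is not None else "not installed")
```

Returning `None` instead of raising keeps the check usable in setup scripts that decide whether to run `pip install`.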


🚀 Getting Started

1. Data Screening | Partition Mode

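```python
from PyTcgpr import TCGPR

TCGPR.fit(
    filePath="data.csv",   # input dataset in CSV format
    initial_set_cap=3,     # size of the initial subset
    sampling_cap=2,        # items selected per iteration
    up_search=500,         # upper limit for search iterations
    CV='LOOCV',            # leave-one-out cross-validation
    Task='Partition',
)
```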
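To try the call above without real data, a small synthetic `data.csv` can be generated first. This is only a sketch: the column layout (feature columns first, a single target column last) is an assumption that this README does not state, and `write_toy_dataset` is a hypothetical helper.

```python
import csv
import random


def write_toy_dataset(path, n_rows=12, n_features=3, seed=0):
    """Write a small CSV: feature columns first, one target column last (assumed layout)."""
    rng = random.Random(seed)
    header = [f"x{i + 1}" for i in range(n_features)] + ["y"]
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        for _ in range(n_rows):
            features = [round(rng.uniform(0.0, 1.0), 4) for _ in range(n_features)]
            # Toy response: sum of the features plus a little Gaussian noise.
            target = round(sum(features) + rng.gauss(0.0, 0.05), 4)
            writer.writerow(features + [target])
    return header


if __name__ == "__main__":
    print(write_toy_dataset("data.csv"))
```

Check the layout against the TCGPR Introduction PDF before using the same structure for a real dataset.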

2. Data Screening | Identification Mode

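```python
from PyTcgpr import TCGPR

TCGPR.fit(
    filePath="data.csv",   # input dataset in CSV format
    sampling_cap=2,        # items selected per iteration
    up_search=500,         # upper limit for search iterations
    CV='LOOCV',            # leave-one-out cross-validation
    Task='Identification',
)
```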

3. Feature Selection Mode

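```python
from PyTcgpr import TCGPR

TCGPR.fit(
    filePath="data.csv",   # input dataset in CSV format
    Mission='FEATURE',     # switch from data screening to feature selection
    sampling_cap=2,        # items selected per iteration
    up_search=500,         # upper limit for search iterations
    CV='LOOCV',            # leave-one-out cross-validation
)
```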
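The three calls above differ in only a few keyword arguments, which a small dispatch helper makes explicit. This is a sketch: `build_fit_kwargs` and its mode names are hypothetical, and the underscored spellings `initial_set_cap`, `sampling_cap` and `up_search` are assumed to be the names `TCGPR.fit` accepts. The helper only builds a dictionary and does not import PyTcgpr.

```python
def build_fit_kwargs(mode, file_path="data.csv"):
    """Return keyword arguments mirroring the three README calls (hypothetical helper)."""
    common = {"filePath": file_path, "sampling_cap": 2, "up_search": 500, "CV": "LOOCV"}
    if mode == "partition":
        return {**common, "initial_set_cap": 3, "Task": "Partition"}
    if mode == "identification":
        return {**common, "Task": "Identification"}
    if mode == "feature":
        return {**common, "Mission": "FEATURE"}
    raise ValueError(f"unknown mode: {mode!r}")


if __name__ == "__main__":
    for mode in ("partition", "identification", "feature"):
        print(mode, build_fit_kwargs(mode))
    # With PyTcgpr installed, a call would then look like:
    #   from PyTcgpr import TCGPR
    #   TCGPR.fit(**build_fit_kwargs("partition"))
```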


⚙️ Parameters

```python
:param Mission: str, default='DATA'
    - 'DATA': Perform data screening
    - 'FEATURE': Perform feature selection

:param filePath: str
    Path to input dataset in CSV format

:param initial_set_cap: int or list
    Initial subset size or index list for Partition mode

:param sampling_cap: int, default=1
    Number of items selected per iteration

:param measure: str, default='Pearson'
    Correlation type: 'Pearson' or 'Determination'

:param ratio: float
    Tolerance threshold for correlation-based filtering

:param target: int, default=1
    Number of targets in regression (for feature selection)

:param weight: float, default=0.2
    Weight coefficient in GGMF score calculation

:param up_search: int, default=500
    Upper limit for search iterations

:param exploit_coef: float, default=2
    Variance constraint for EI acquisition function

:param exploit_model: bool, default=False
    If True, disables GGMF and uses only R values

:param CV: int or str, default=10
    Cross-validation: integer (e.g., 5, 10) or 'LOOCV'
```
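Two of the parameters above, `CV` and `measure`, can be illustrated with textbook definitions. The sketch below is not taken from the PyTcgpr source: the function names are hypothetical, and it assumes that 'LOOCV' means one fold per sample and that 'Determination' refers to the usual coefficient of determination, 1 - SS_res / SS_tot.

```python
import math


def resolve_folds(cv, n_samples):
    """Number of folds implied by a CV setting: an integer k, or 'LOOCV' (one fold per sample)."""
    if cv == "LOOCV":
        return n_samples
    if isinstance(cv, int) and 2 <= cv <= n_samples:
        return cv
    raise ValueError(f"unsupported CV setting: {cv!r}")


def pearson_r(y_true, y_pred):
    """Pearson correlation coefficient between observed and predicted values."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)


def determination_r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


if __name__ == "__main__":
    observed, predicted = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
    print(resolve_folds("LOOCV", len(observed)))
    print(pearson_r(observed, predicted), determination_r2(observed, predicted))
```

The two measures can disagree sharply: predictions that are perfectly correlated with the observations but wrongly scaled, as in the example, give a Pearson coefficient of 1 and a negative coefficient of determination, so the choice of `measure` affects how strictly fits are judged.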


📤 Output

After running, TCGPR outputs a CSV file with the remaining samples:

```bash
Dataset_remained_by_TCGPR.csv
```
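The output file can be inspected with the standard library. This sketch assumes the file is written to the current working directory and that its first row is a header; neither detail is stated here. `load_remaining` is a hypothetical helper.

```python
import csv


def load_remaining(path="Dataset_remained_by_TCGPR.csv"):
    """Read the output CSV into (header, rows); assumes the first row is a header."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle)
        header = next(reader)
        rows = [row for row in reader]
    return header, rows


if __name__ == "__main__":
    import os

    if os.path.exists("Dataset_remained_by_TCGPR.csv"):
        header, rows = load_remaining()
        print(f"{len(rows)} samples remain, columns: {header}")
    else:
        print("No output file found; run TCGPR.fit first.")
```

Comparing `len(rows)` with the row count of the input CSV shows how many samples were screened out.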


📦 Source Code

PyPI - TCGPR

Compatible with Windows, Linux, and macOS.


🧾 Patent

Patent Image


👨‍🔧 Maintainer

Maintained by Bin Cao

📧 Email: bcao686@connect.hkust-gz.edu.cn

Feel free to open an issue or contact me for any questions, bugs, or collaboration opportunities.


🤝 Contributing

Contributions and suggestions are welcome!

  • Report bugs or request features via GitHub Issues
  • Submit a pull request with improvements or fixes
  • Interested in research collaboration? Please get in touch!

Owner

  • Name: 曹斌 | Bin CAO
  • Login: Bin-Cao
  • Kind: user
  • Location: Shanghai
  • Company: Shanghai University

Machine learning | Materials Informatics | Mechanics

GitHub Events

Total
  • Watch event: 5
  • Push event: 13
  • Fork event: 1
Last Year
  • Watch event: 5
  • Push event: 13
  • Fork event: 1

Packages

  • Total packages: 2
  • Total downloads:
    • pypi 166 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 18
  • Total maintainers: 1
pypi.org: pytcgpr

Tree-Classifier for gaussian process model (TCGPR) is a data preprocessing algorithm based on the Gaussian correlation among data.

  • Versions: 8
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 27 Last month
Rankings
Dependent packages count: 7.2%
Forks count: 17.2%
Average: 21.2%
Stargazers count: 23.3%
Dependent repos count: 37.1%
Maintainers (1)
Last synced: 6 months ago
pypi.org: tcgpr

Tree-Classifier for gaussian process model (TCGPR) is a data preprocessing algorithm based on the Gaussian correlation among data.

  • Versions: 10
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 139 Last month
Rankings
Dependent packages count: 4.8%
Dependent repos count: 6.3%
Average: 24.3%
Forks count: 26.6%
Stargazers count: 33.0%
Downloads: 50.9%
Maintainers (1)
Last synced: about 1 year ago