piedomains

Classify the kind of content hosted by a domain using the domain name and the text and screenshot of its homepage.

https://github.com/themains/piedomains

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file (found)
✓ codemeta.json file (found)
✓ .zenodo.json file (found)
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (8.9%) to scientific vocabulary

Last synced: 11 months ago

Repository

Basic Info

Host: GitHub
Owner: themains
License: mit
Language: Jupyter Notebook
Default Branch: main
Homepage:
Size: 114 MB

Statistics

Stars: 14
Watchers: 4
Forks: 2
Open Issues: 0
Releases: 0

Created over 4 years ago · Last pushed 11 months ago

Metadata Files

Readme License Citation

README.rst

===========================================================================================
piedomains: AI-powered domain content classification
===========================================================================================

.. image:: https://github.com/themains/piedomains/actions/workflows/python-publish.yml/badge.svg
    :target: https://github.com/themains/piedomains/actions/workflows/python-publish.yml
.. image:: https://img.shields.io/pypi/v/piedomains.svg
    :target: https://pypi.python.org/pypi/piedomains
.. image:: https://readthedocs.org/projects/piedomains/badge/?version=latest
    :target: http://piedomains.readthedocs.io/en/latest/?badge=latest

**piedomains** predicts website content categories using AI analysis of domain names, text content, and homepage screenshots. Classify domains as news, shopping, adult content, education, etc. with high accuracy.

🚀 **Quickstart**
-------------------

Install and classify domains in 3 lines:

.. code-block:: python

    # Install first from a shell: pip install piedomains

    from piedomains import DomainClassifier
    classifier = DomainClassifier()

    # Classify current content
    result = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])
    print(result[['domain', 'pred_label', 'pred_prob']])

    # Example output (illustrative; actual values will vary):
    #           domain pred_label  pred_prob
    # 0        cnn.com       news   0.876543
    # 1     amazon.com   shopping   0.923456
    # 2  wikipedia.org  education   0.891234
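
The ``pred_prob`` column can be used to set aside low-confidence predictions for manual review. A minimal sketch, assuming the result has been converted to a list of row dictionaries (for example with pandas' ``result.to_dict("records")``) that carry the ``domain``, ``pred_label``, and ``pred_prob`` columns shown above. The ``split_by_confidence`` helper, the 0.9 threshold, and the sample rows are illustrative and not part of piedomains:

```python
def split_by_confidence(rows, threshold=0.9):
    """Split prediction rows into (confident, uncertain) by pred_prob.

    Hypothetical helper, not part of piedomains.
    """
    confident = [r for r in rows if r["pred_prob"] >= threshold]
    uncertain = [r for r in rows if r["pred_prob"] < threshold]
    return confident, uncertain


# Illustrative rows mirroring the example output above (made-up values).
rows = [
    {"domain": "cnn.com", "pred_label": "news", "pred_prob": 0.876543},
    {"domain": "amazon.com", "pred_label": "shopping", "pred_prob": 0.923456},
    {"domain": "wikipedia.org", "pred_label": "education", "pred_prob": 0.891234},
]

confident, uncertain = split_by_confidence(rows, threshold=0.9)
print([r["domain"] for r in confident])  # domains at or above the threshold
print([r["domain"] for r in uncertain])  # domains to review by hand
```

The right threshold depends on how costly a wrong label is for the analysis at hand.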

📊 **Key Features**
--------------------

- **High Accuracy**: Combines text analysis and homepage screenshots for 85-95% accuracy, depending on content type and method
- **Historical Analysis**: Classify archived snapshots of a website via archive.org
- **Fast & Scalable**: Batch processing with caching for thousands of domains
- **Easy Integration**: Python API that returns pandas DataFrames
- **41 Categories**: From news/finance to adult/gambling content

⚡ **Usage Examples**
---------------------

**Basic Classification**

.. code-block:: python

    from piedomains import DomainClassifier
    
    classifier = DomainClassifier()
    
    # Combined analysis (most accurate)
    result = classifier.classify(["github.com", "reddit.com"])
    
    # Text-only (faster)
    result = classifier.classify_by_text(["news.google.com"])
    
    # Images-only (good for visual content)  
    result = classifier.classify_by_images(["instagram.com"])

**Historical Analysis**

.. code-block:: python

    # Reuses the `classifier` instance created in the Basic Classification example.
    # Compare how Facebook was classified in 2010 with how it is classified today.
    old_facebook = classifier.classify(["facebook.com"], archive_date="20100101")
    new_facebook = classifier.classify(["facebook.com"])

    print(f"2010: {old_facebook.iloc[0]['pred_label']}")
    print(f"Today: {new_facebook.iloc[0]['pred_label']}")
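
The ``archive_date`` value above is a ``YYYYMMDD`` string. A minimal sketch for validating such a string and building the corresponding Wayback Machine URL, which redirects to the capture closest to that date. The ``wayback_url`` helper is hypothetical, and how piedomains retrieves archived pages internally is not described here:

```python
from datetime import datetime


def wayback_url(domain, archive_date):
    """Build a Wayback Machine URL for a domain and a YYYYMMDD date.

    Hypothetical helper, not part of piedomains. Raises ValueError if
    archive_date is not a valid calendar date in YYYYMMDD form.
    """
    if len(archive_date) != 8 or not archive_date.isdigit():
        raise ValueError(f"expected YYYYMMDD, got {archive_date!r}")
    datetime.strptime(archive_date, "%Y%m%d")  # rejects e.g. "20100230"
    return f"https://web.archive.org/web/{archive_date}/{domain}"


print(wayback_url("facebook.com", "20100101"))
```

Opening the printed URL in a browser is a quick way to check what the archived homepage looked like around the chosen date.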

**Batch Processing**

.. code-block:: python

    # Process large lists efficiently (reuses `classifier` from the example above)
    domains = ["site1.com", "site2.com"]  # ... extend to 1000s of domains
    results = classifier.classify_batch(
        domains, 
        method="text",           # text|images|combined
        batch_size=50,           # Process 50 at a time
        show_progress=True       # Progress bar
    )
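
To make the ``batch_size`` parameter concrete, the sketch below shows the general chunking idea: a long list is split into consecutive slices of at most ``batch_size`` items, and each slice is processed in turn. This illustrates the concept only and is not the implementation of ``classify_batch``; the ``chunked`` helper and the placeholder domain names are hypothetical:

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# Placeholder domain names, for illustration only.
domains = [f"site{i}.com" for i in range(1, 121)]
batches = list(chunked(domains, batch_size=50))
print([len(b) for b in batches])  # → [50, 50, 20]
```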

🏷️ **Supported Categories**
------------------------------

News, Finance, Shopping, Education, Government, Adult Content, Gambling, Social Networks, Search Engines, and 32 more categories based on the Shallalist taxonomy.

📈 **Performance**
-------------------

- **Speed**: ~10-50 domains/minute (depends on method and network)
- **Accuracy**: 85-95% depending on content type and method
- **Memory**: <500MB for batch processing
- **Caching**: Automatic content caching for faster re-runs
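
The speed range above translates directly into wall-clock estimates for a batch. A small worked calculation, assuming throughput stays within the quoted 10 to 50 domains per minute and ignoring any speed-up from caching; the ``estimate_minutes`` helper is illustrative:

```python
def estimate_minutes(n_domains, slow_rate=10, fast_rate=50):
    """Return (best_case, worst_case) run time in minutes.

    Rates are in domains per minute, taken from the range quoted above.
    """
    best_case = n_domains / fast_rate
    worst_case = n_domains / slow_rate
    return best_case, worst_case


best, worst = estimate_minutes(1000)
print(f"1000 domains: about {best:.0f} to {worst:.0f} minutes")
# → 1000 domains: about 20 to 100 minutes
```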

🔧 **Installation**
--------------------

**Requirements**: Python 3.9+

.. code-block:: bash

    # Basic installation
    pip install piedomains
    
    # For development
    git clone https://github.com/themains/piedomains
    cd piedomains
    pip install -e .

🔄 **Migration from v0.2.x**
-----------------------------

**Old API** (still supported):

.. code-block:: python

    from piedomains import domain
    result = domain.pred_shalla_cat_with_text(["example.com"])

**New API** (recommended):

.. code-block:: python

    from piedomains import DomainClassifier
    classifier = DomainClassifier()
    result = classifier.classify_by_text(["example.com"])

📖 **Documentation**
---------------------

- **API Reference**: https://piedomains.readthedocs.io
- **Examples**: ``/examples`` directory
- **Notebooks**: ``/piedomains/notebooks`` (training & analysis)

🤝 **Contributing**
--------------------

.. code-block:: bash

    # Setup development environment
    git clone https://github.com/themains/piedomains
    cd piedomains
    pip install -e ".[dev]"
    
    # Run tests
    pytest piedomains/tests/ -v
    
    # Run linting
    flake8 piedomains/

📄 **License**
---------------

MIT License - see LICENSE file.

📚 **Citation**
----------------

If you use piedomains in research, please cite:

.. code-block:: bibtex

    @software{piedomains,
      title={piedomains: AI-powered domain content classification},
      author={Chintalapati, Rajashekar and Sood, Gaurav},
      year={2024},
      url={https://github.com/themains/piedomains}
    }

----

**Legacy Documentation**
------------------------

For legacy API documentation, see ``LEGACY_API.rst``.

Owner

Name: the mains
Login: themains
Kind: organization

Website: https://themains.github.io
Repositories: 8
Profile: https://github.com/themains

making it easier to understand web traffic

Citation (Citation.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Chintalapati"
  given-names: "Rajashekar"
- family-names: "Sood"
  given-names: "Gaurav"
title: "piedomains: Predict the kind of content hosted by a domain based on domain name and content"
version: 0.0.2
date-released: 2022-05-04
url: "https://github.com/themains/piedomains"

GitHub Events

Total

Watch event: 1
Delete event: 7
Issue comment event: 3
Push event: 38
Pull request event: 10
Create event: 10

Last Year

Watch event: 1
Delete event: 7
Issue comment event: 3
Push event: 38
Pull request event: 10
Create event: 10

Issues and Pull Requests

Last synced: 11 months ago

All Time

Total issues: 2
Total pull requests: 27
Average time to close issues: about 2 years
Average time to close pull requests: 12 days
Total issue authors: 1
Total pull request authors: 4
Average comments per issue: 0.5
Average comments per pull request: 0.04
Merged pull requests: 21
Bot issues: 0
Bot pull requests: 15

Past Year

Issues: 0
Pull requests: 8
Average time to close issues: N/A
Average time to close pull requests: 30 minutes
Issue authors: 0
Pull request authors: 2
Average comments per issue: 0
Average comments per pull request: 0.13
Merged pull requests: 4
Bot issues: 0
Bot pull requests: 1

Top Authors

Issue Authors

soodoku (2)

Pull Request Authors

dependabot[bot] (16)
soodoku (9)
snyk-bot (2)
TrellixVulnTeam (1)

Top Labels

Issue Labels

enhancement (1)

Pull Request Labels

dependencies (16)
codex (3)

Packages

Total packages: 1
Total downloads:
- pypi 1,314 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 33
Total maintainers: 2

pypi.org: piedomains

Predict categories based on domain names and their content

Homepage: https://github.com/themains/piedomains
Documentation: https://piedomains.readthedocs.io/
License: MIT License
Latest release: 0.3.10
published 11 months ago

Versions: 33
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 1,314 Last month

Rankings

Dependent packages count: 10.0%

Stargazers count: 16.5%

Average: 17.8%

Forks count: 19.1%

Dependent repos count: 21.7%

Downloads: 21.7%

Maintainers (2)

soodoku
rajashekar

Last synced: 11 months ago

Dependencies

.github/workflows/codeql.yml actions

actions/checkout v3 composite
github/codeql-action/analyze v2 composite
github/codeql-action/autobuild v2 composite
github/codeql-action/init v2 composite

.github/workflows/docker-publish.yml actions

actions/checkout v3 composite
docker/build-push-action ac9327eae2b366085ac7f6a2d02df8aa8ead720a composite
docker/login-action 28218f9b04b4f3f62068d7b6ce6ca5b26e35336c composite
docker/metadata-action 98669ae865ea3cffbcbaa878cf57c20bbf1c6c38 composite
docker/setup-buildx-action 79abd3f86f79a9d68a23c75a09a9a85889262adf composite
sigstore/cosign-installer f3c664df7af409cb4873aa5068053ba9d61a57b6 composite

.github/workflows/pylint.yml actions

actions/checkout v3 composite
actions/setup-python v3 composite

.github/workflows/python-package-conda.yml actions

actions/checkout v3 composite
actions/setup-python v3 composite

.github/workflows/python-package.yml actions

actions/checkout v3 composite
actions/setup-python v3 composite

.github/workflows/tests-macos.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite

.github/workflows/tests-ubuntu.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite

.github/workflows/tests-windows.yml actions

actions/checkout v3 composite
actions/setup-python v4 composite

Dockerfile docker

ubuntu 22.04 build

pyproject.toml pypi

python ^3.9

requirements.txt pypi

bs4 *
nltk *
tensorflow *

requirements_rtd.txt pypi

bs4 *
nltk *
tensorflow *

setup.py pypi

tqdm *
