piedomains

Classify the kind of content hosted by a domain using the domain name and the text and screenshot of its homepage.

https://github.com/themains/piedomains

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (8.9%) to scientific vocabulary
Last synced: 9 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: themains
  • License: mit
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 114 MB
Statistics
  • Stars: 14
  • Watchers: 4
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Created over 4 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.rst

===========================================================================================
piedomains: AI-powered domain content classification
===========================================================================================

.. image:: https://github.com/themains/piedomains/actions/workflows/python-publish.yml/badge.svg
    :target: https://github.com/themains/piedomains/actions/workflows/python-publish.yml
.. image:: https://img.shields.io/pypi/v/piedomains.svg
    :target: https://pypi.python.org/pypi/piedomains
.. image:: https://readthedocs.org/projects/piedomains/badge/?version=latest
    :target: http://piedomains.readthedocs.io/en/latest/?badge=latest

**piedomains** predicts a website's content category from its domain name, homepage text, and homepage screenshot. It labels domains as news, shopping, adult content, education, and other categories, with a reported accuracy of 85-95% depending on content type and method.

🚀 **Quickstart**
-------------------

Install the package from a shell with ``pip install piedomains``, then classify domains in three lines of Python:

.. code-block:: python

    from piedomains import DomainClassifier
    classifier = DomainClassifier()
    
    # Classify current content
    result = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])
    print(result[['domain', 'pred_label', 'pred_prob']])
    
    # Expected output:
    #           domain pred_label  pred_prob
    # 0        cnn.com       news   0.876543
    # 1     amazon.com   shopping   0.923456
    # 2  wikipedia.org  education   0.891234
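The predictions can be post-processed like any other tabular data. The following is a minimal sketch, assuming the rows have been exported with pandas' ``result.to_dict("records")``. The values are copied from the expected output above so the snippet runs offline, and the 0.9 threshold is a hypothetical cut-off, not a recommendation from the library:

```python
# Rows as they would look after result.to_dict("records"); the values are
# copied from the expected output above so no network access is needed.
rows = [
    {"domain": "cnn.com", "pred_label": "news", "pred_prob": 0.876543},
    {"domain": "amazon.com", "pred_label": "shopping", "pred_prob": 0.923456},
    {"domain": "wikipedia.org", "pred_label": "education", "pred_prob": 0.891234},
]

THRESHOLD = 0.9  # hypothetical cut-off; tune it for your own use case

# Keep predictions at or above the threshold.
confident = [r["domain"] for r in rows if r["pred_prob"] >= THRESHOLD]

# Anything below the threshold may deserve a manual look.
needs_review = [r["domain"] for r in rows if r["pred_prob"] < THRESHOLD]

print(confident)     # ['amazon.com']
print(needs_review)  # ['cnn.com', 'wikipedia.org']
```

Splitting by probability is one simple way to decide which labels to trust automatically and which to check by hand.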

📊 **Key Features**
--------------------

- **High Accuracy**: Combines text analysis with homepage screenshots; reported accuracy is 85-95% depending on content type and method
- **Historical Analysis**: Classify archived versions of websites using snapshots from archive.org
- **Fast & Scalable**: Batch processing with caching for thousands of domains
- **Easy Integration**: Modern Python API with pandas output
- **41 Categories**: From news/finance to adult/gambling content

⚡ **Usage Examples**
---------------------

**Basic Classification**

.. code-block:: python

    from piedomains import DomainClassifier
    
    classifier = DomainClassifier()
    
    # Combined analysis (most accurate)
    result = classifier.classify(["github.com", "reddit.com"])
    
    # Text-only (faster)
    result = classifier.classify_by_text(["news.google.com"])
    
    # Images-only (good for visual content)  
    result = classifier.classify_by_images(["instagram.com"])

**Historical Analysis**

.. code-block:: python

    from piedomains import DomainClassifier

    classifier = DomainClassifier()

    # Compare how facebook.com was classified in 2010 with how it is classified today
    old_facebook = classifier.classify(["facebook.com"], archive_date="20100101")
    new_facebook = classifier.classify(["facebook.com"])

    print(f"2010:  {old_facebook.iloc[0]['pred_label']}")
    print(f"Today: {new_facebook.iloc[0]['pred_label']}")
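The ``archive_date="20100101"`` value above appears to follow a ``YYYYMMDD`` pattern. That is inferred from this single example and not confirmed by the README. Under that assumption, a sketch for generating one date string per year could look like this (``to_archive_date`` and ``yearly_snapshots`` are hypothetical helper names, not part of piedomains):

```python
from datetime import date


def to_archive_date(day: date) -> str:
    """Format a date as YYYYMMDD, the shape of the "20100101" example."""
    return day.strftime("%Y%m%d")


def yearly_snapshots(start_year: int, end_year: int) -> list[str]:
    """Return one 1 January date string per year, both ends inclusive."""
    return [to_archive_date(date(year, 1, 1)) for year in range(start_year, end_year + 1)]


print(yearly_snapshots(2010, 2013))
# ['20100101', '20110101', '20120101', '20130101']
```

Each string could then be passed as ``archive_date`` in a loop to track how a domain's predicted label changes over time.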

**Batch Processing**

.. code-block:: python

    from piedomains import DomainClassifier

    classifier = DomainClassifier()

    # Process large lists efficiently
    domains = ["site1.com", "site2.com", ...]  # placeholder for 1000s of domains
    results = classifier.classify_batch(
        domains, 
        method="text",           # text|images|combined
        batch_size=50,           # Process 50 at a time
        show_progress=True       # Progress bar
    )
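What ``batch_size=50`` means in practice can be illustrated with plain list slicing. This is only a sketch of the general chunking idea, not piedomains' internal implementation. ``chunked`` is a hypothetical helper and the domain names are made up:

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 120 made-up domains split into batches of 50.
domains = [f"site{i}.com" for i in range(1, 121)]
batches = list(chunked(domains, 50))

print([len(batch) for batch in batches])  # [50, 50, 20]
```

Smaller batches bound how much work is lost if one batch fails, at the cost of more round trips.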

🏷️ **Supported Categories**
------------------------------

News, Finance, Shopping, Education, Government, Adult Content, Gambling, Social Networks, Search Engines, and 32 more categories based on the Shallalist taxonomy.

📈 **Performance**
-------------------

- **Speed**: ~10-50 domains/minute (depends on method and network)
- **Accuracy**: 85-95% depending on content type and method
- **Memory**: <500MB for batch processing
- **Caching**: Automatic content caching for faster re-runs
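The throughput range above translates directly into a rough runtime estimate. A small sketch of the arithmetic, using only the 10-50 domains/minute figure quoted in this section (``estimated_minutes`` is a hypothetical helper; real runtimes depend on network conditions and caching):

```python
from typing import Tuple


def estimated_minutes(
    n_domains: int, low_rate: float = 10.0, high_rate: float = 50.0
) -> Tuple[float, float]:
    """Return (best case, worst case) minutes for the given rates per minute."""
    return n_domains / high_rate, n_domains / low_rate


best, worst = estimated_minutes(1000)
print(f"{best:.0f}-{worst:.0f} minutes")  # 20-100 minutes
```

For 1,000 domains this gives roughly 20 minutes at the fast end and 100 minutes at the slow end.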

🔧 **Installation**
--------------------

**Requirements**: Python 3.9+

.. code-block:: bash

    # Basic installation
    pip install piedomains
    
    # For development
    git clone https://github.com/themains/piedomains
    cd piedomains
    pip install -e .

🔄 **Migration from v0.2.x**
-----------------------------

**Old API** (still supported):

.. code-block:: python

    from piedomains import domain
    result = domain.pred_shalla_cat_with_text(["example.com"])

**New API** (recommended):

.. code-block:: python

    from piedomains import DomainClassifier
    classifier = DomainClassifier()
    result = classifier.classify_by_text(["example.com"])

📖 **Documentation**
---------------------

- **API Reference**: https://piedomains.readthedocs.io
- **Examples**: ``/examples`` directory
- **Notebooks**: ``/piedomains/notebooks`` (training & analysis)

🤝 **Contributing**
--------------------

.. code-block:: bash

    # Setup development environment
    git clone https://github.com/themains/piedomains
    cd piedomains
    pip install -e ".[dev]"
    
    # Run tests
    pytest piedomains/tests/ -v
    
    # Run linting
    flake8 piedomains/

📄 **License**
---------------

MIT License - see LICENSE file.

📚 **Citation**
----------------

If you use piedomains in research, please cite:

.. code-block:: bibtex

    @software{piedomains,
      title={piedomains: AI-powered domain content classification},
      author={Chintalapati, Rajashekar and Sood, Gaurav},
      year={2024},
      url={https://github.com/themains/piedomains}
    }

**Legacy Documentation**
------------------------

For legacy API documentation, see ``LEGACY_API.rst``.

Owner

  • Name: the mains
  • Login: themains
  • Kind: organization

making it easier to understand web traffic

Citation (Citation.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Chintalapati"
  given-names: "Rajashekar"
- family-names: "Sood"
  given-names: "Gaurav"
title: "piedomains: Predict the kind of content hosted by a domain based on domain name and content"
version: 0.0.2
date-released: 2022-05-04
url: "https://github.com/themains/piedomains"

GitHub Events

Total
  • Watch event: 1
  • Delete event: 7
  • Issue comment event: 3
  • Push event: 38
  • Pull request event: 10
  • Create event: 10
Last Year
  • Watch event: 1
  • Delete event: 7
  • Issue comment event: 3
  • Push event: 38
  • Pull request event: 10
  • Create event: 10

Issues and Pull Requests

All Time
  • Total issues: 2
  • Total pull requests: 27
  • Average time to close issues: about 2 years
  • Average time to close pull requests: 12 days
  • Total issue authors: 1
  • Total pull request authors: 4
  • Average comments per issue: 0.5
  • Average comments per pull request: 0.04
  • Merged pull requests: 21
  • Bot issues: 0
  • Bot pull requests: 15
Past Year
  • Issues: 0
  • Pull requests: 8
  • Average time to close issues: N/A
  • Average time to close pull requests: 30 minutes
  • Issue authors: 0
  • Pull request authors: 2
  • Average comments per issue: 0
  • Average comments per pull request: 0.13
  • Merged pull requests: 4
  • Bot issues: 0
  • Bot pull requests: 1
Top Authors
Issue Authors
  • soodoku (2)
Pull Request Authors
  • dependabot[bot] (16)
  • soodoku (9)
  • snyk-bot (2)
  • TrellixVulnTeam (1)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels
dependencies (16) codex (3)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 1,314 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 33
  • Total maintainers: 2
pypi.org: piedomains

Predict categories based on domain names and their content

  • Versions: 33
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 1,314 Last month
Rankings
Dependent packages count: 10.0%
Stargazers count: 16.5%
Average: 17.8%
Forks count: 19.1%
Dependent repos count: 21.7%
Downloads: 21.7%
Maintainers (2)

Dependencies

.github/workflows/codeql.yml actions
  • actions/checkout v3 composite
  • github/codeql-action/analyze v2 composite
  • github/codeql-action/autobuild v2 composite
  • github/codeql-action/init v2 composite
.github/workflows/docker-publish.yml actions
  • actions/checkout v3 composite
  • docker/build-push-action ac9327eae2b366085ac7f6a2d02df8aa8ead720a composite
  • docker/login-action 28218f9b04b4f3f62068d7b6ce6ca5b26e35336c composite
  • docker/metadata-action 98669ae865ea3cffbcbaa878cf57c20bbf1c6c38 composite
  • docker/setup-buildx-action 79abd3f86f79a9d68a23c75a09a9a85889262adf composite
  • sigstore/cosign-installer f3c664df7af409cb4873aa5068053ba9d61a57b6 composite
.github/workflows/pylint.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
.github/workflows/python-package-conda.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
.github/workflows/python-package.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v3 composite
.github/workflows/tests-macos.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/tests-ubuntu.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
.github/workflows/tests-windows.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
Dockerfile docker
  • ubuntu 22.04 build
pyproject.toml pypi
  • python ^3.9
requirements.txt pypi
  • bs4 *
  • nltk *
  • tensorflow *
requirements_rtd.txt pypi
  • bs4 *
  • nltk *
  • tensorflow *
setup.py pypi
  • tqdm *