piedomains
Classify the kind of content hosted by the domain using the domain name, and text and screenshot of the homepage.
Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found
- ✓ codemeta.json file: found
- ✓ .zenodo.json file: found
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (8.9%) to scientific vocabulary
Last synced: 9 months ago
Repository
Basic Info
Statistics
- Stars: 14
- Watchers: 4
- Forks: 2
- Open Issues: 0
- Releases: 0
Created over 4 years ago · Last pushed 9 months ago
Metadata Files
Readme
License
Citation
README.rst
===========================================================================================
piedomains: AI-powered domain content classification
===========================================================================================
.. image:: https://github.com/themains/piedomains/actions/workflows/python-publish.yml/badge.svg
    :target: https://github.com/themains/piedomains/actions/workflows/python-publish.yml
.. image:: https://img.shields.io/pypi/v/piedomains.svg
    :target: https://pypi.python.org/pypi/piedomains
.. image:: https://readthedocs.org/projects/piedomains/badge/?version=latest
    :target: http://piedomains.readthedocs.io/en/latest/?badge=latest
**piedomains** predicts website content categories from three signals: the domain name, the homepage text, and a homepage screenshot. It assigns each domain a category such as news, shopping, adult content, or education.
🚀 **Quickstart**
-------------------
Install the package, then classify domains in a few lines:

.. code-block:: python

    # Install first, from a shell: pip install piedomains
    from piedomains import DomainClassifier

    classifier = DomainClassifier()

    # Classify current content
    result = classifier.classify(["cnn.com", "amazon.com", "wikipedia.org"])
    print(result[['domain', 'pred_label', 'pred_prob']])

    # Example output (actual probabilities will vary):
    #           domain pred_label  pred_prob
    # 0        cnn.com       news   0.876543
    # 1     amazon.com   shopping   0.923456
    # 2  wikipedia.org  education   0.891234
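
Because the result is a pandas DataFrame, ordinary pandas operations apply to it. The following is a minimal sketch, assuming the output carries the ``domain``, ``pred_label``, and ``pred_prob`` columns shown in the Quickstart output. The DataFrame is built by hand so the snippet runs without the package or network access, and the 0.89 threshold is an arbitrary illustration rather than a recommended value.

```python
import pandas as pd

# Hand-built stand-in for the DataFrame returned by classifier.classify().
# Column names follow the Quickstart output; the values are illustrative.
result = pd.DataFrame(
    {
        "domain": ["cnn.com", "amazon.com", "wikipedia.org"],
        "pred_label": ["news", "shopping", "education"],
        "pred_prob": [0.876543, 0.923456, 0.891234],
    }
)

# Keep only predictions at or above an (arbitrary) confidence threshold.
THRESHOLD = 0.89
confident = result[result["pred_prob"] >= THRESHOLD]

# Count how many confident predictions fall into each category.
label_counts = confident["pred_label"].value_counts()

print(confident[["domain", "pred_label"]])
print(label_counts)
```

Rows below the threshold can be routed to manual review or re-run with a different method.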
📊 **Key Features**
--------------------
- **High Accuracy**: Combines text analysis and visual screenshots for 85-95% accuracy, depending on content type and method
- **Historical Analysis**: Classify websites from any point in time using archive.org
- **Fast & Scalable**: Batch processing with caching for thousands of domains
- **Easy Integration**: Modern Python API with pandas output
- **41 Categories**: From news/finance to adult/gambling content
⚡ **Usage Examples**
---------------------
**Basic Classification**
.. code-block:: python

    from piedomains import DomainClassifier

    classifier = DomainClassifier()

    # Combined analysis (most accurate)
    result = classifier.classify(["github.com", "reddit.com"])

    # Text-only (faster)
    result = classifier.classify_by_text(["news.google.com"])

    # Images-only (good for visual content)
    result = classifier.classify_by_images(["instagram.com"])
**Historical Analysis**
.. code-block:: python

    from piedomains import DomainClassifier

    classifier = DomainClassifier()

    # Analyze how Facebook looked in 2010 vs. today
    old_facebook = classifier.classify(["facebook.com"], archive_date="20100101")
    new_facebook = classifier.classify(["facebook.com"])

    print(f"2010: {old_facebook.iloc[0]['pred_label']}")
    print(f"Today: {new_facebook.iloc[0]['pred_label']}")
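
The ``archive_date`` value in the example, ``"20100101"``, looks like a ``YYYYMMDD`` string. Assuming that is the expected format, a small helper can build such strings from ``datetime.date`` objects, for example to prepare one snapshot date per year. The helper name ``to_archive_date`` is hypothetical and not part of piedomains.

```python
from datetime import date


def to_archive_date(day: date) -> str:
    """Format a date as YYYYMMDD (assumed to be the archive_date format)."""
    return day.strftime("%Y%m%d")


# One snapshot date per year, e.g. to track how a site changes over time.
snapshot_dates = [to_archive_date(date(year, 1, 1)) for year in range(2010, 2013)]
print(snapshot_dates)  # ['20100101', '20110101', '20120101']
```

Each string could then be passed as ``archive_date`` in a separate ``classify`` call.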
**Batch Processing**
.. code-block:: python

    # Process large lists efficiently
    domains = ["site1.com", "site2.com"]  # extend to thousands of domains
    results = classifier.classify_batch(
        domains,
        method="text",       # text|images|combined
        batch_size=50,       # Process 50 at a time
        show_progress=True   # Progress bar
    )
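
To make the ``batch_size`` parameter concrete, the sketch below splits a domain list into consecutive chunks of a fixed size. It only illustrates the general idea of batching; it is not the library's implementation, and the ``chunked`` helper and the generated domain names are hypothetical.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    if size <= 0:
        raise ValueError("size must be a positive integer")
    for start in range(0, len(items), size):
        yield items[start:start + size]


# 120 placeholder domains split into batches of 50.
domains = [f"site{i}.com" for i in range(1, 121)]
batches = list(chunked(domains, 50))
print([len(batch) for batch in batches])  # [50, 50, 20]
```

The last batch is simply shorter when the list length is not a multiple of the batch size.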
🏷️ **Supported Categories**
------------------------------
News, Finance, Shopping, Education, Government, Adult Content, Gambling, Social Networks, Search Engines, and 32 more categories based on the Shallalist taxonomy.
📈 **Performance**
-------------------
- **Speed**: ~10-50 domains/minute (depends on method and network)
- **Accuracy**: 85-95% depending on content type and method
- **Memory**: <500MB for batch processing
- **Caching**: Automatic content caching for faster re-runs
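
The throughput range above translates directly into a rough runtime estimate. The sketch below is plain arithmetic on the stated 10-50 domains/minute figure; the function name is hypothetical, and real runtimes depend on the method, the network, and caching.

```python
def estimated_minutes(n_domains: int, rate_low: float = 10, rate_high: float = 50):
    """Return (best_case, worst_case) minutes for classifying n_domains.

    rate_low and rate_high are throughputs in domains per minute,
    defaulting to the 10-50 range quoted in the Performance section.
    """
    return n_domains / rate_high, n_domains / rate_low


best, worst = estimated_minutes(1000)
print(f"1000 domains: roughly {best:.0f} to {worst:.0f} minutes")
```

For 1,000 domains this gives about 20 minutes at the fast end and about 100 minutes at the slow end.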
🔧 **Installation**
--------------------
**Requirements**: Python 3.9+
.. code-block:: bash

    # Basic installation
    pip install piedomains

    # For development
    git clone https://github.com/themains/piedomains
    cd piedomains
    pip install -e .
🔄 **Migration from v0.2.x**
-----------------------------
**Old API** (still supported):
.. code-block:: python

    from piedomains import domain

    result = domain.pred_shalla_cat_with_text(["example.com"])
**New API** (recommended):
.. code-block:: python

    from piedomains import DomainClassifier

    classifier = DomainClassifier()
    result = classifier.classify_by_text(["example.com"])
📖 **Documentation**
---------------------
- **API Reference**: https://piedomains.readthedocs.io
- **Examples**: ``/examples`` directory
- **Notebooks**: ``/piedomains/notebooks`` (training & analysis)
🤝 **Contributing**
--------------------
.. code-block:: bash

    # Set up the development environment
    git clone https://github.com/themains/piedomains
    cd piedomains
    pip install -e ".[dev]"

    # Run tests
    pytest piedomains/tests/ -v

    # Run linting
    flake8 piedomains/
📄 **License**
---------------
MIT License - see LICENSE file.
📚 **Citation**
----------------
If you use piedomains in research, please cite:
.. code-block:: bibtex

    @software{piedomains,
      title={piedomains: AI-powered domain content classification},
      author={Chintalapati, Rajashekar and Sood, Gaurav},
      year={2024},
      url={https://github.com/themains/piedomains}
    }
---
**Legacy Documentation**
========================
For legacy API documentation, see LEGACY_API.rst
Owner
- Name: the mains
- Login: themains
- Kind: organization
- Website: https://themains.github.io
- Repositories: 8
- Profile: https://github.com/themains
making it easier to understand web traffic
Citation (Citation.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Chintalapati"
    given-names: "Rajashekar"
  - family-names: "Sood"
    given-names: "Gaurav"
title: "piedomains: Predict the kind of content hosted by a domain based on domain name and content"
version: 0.0.2
date-released: 2022-05-04
url: "https://github.com/themains/piedomains"
GitHub Events
Total
- Watch event: 1
- Delete event: 7
- Issue comment event: 3
- Push event: 38
- Pull request event: 10
- Create event: 10
Last Year
- Watch event: 1
- Delete event: 7
- Issue comment event: 3
- Push event: 38
- Pull request event: 10
- Create event: 10
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 2
- Total pull requests: 27
- Average time to close issues: about 2 years
- Average time to close pull requests: 12 days
- Total issue authors: 1
- Total pull request authors: 4
- Average comments per issue: 0.5
- Average comments per pull request: 0.04
- Merged pull requests: 21
- Bot issues: 0
- Bot pull requests: 15
Past Year
- Issues: 0
- Pull requests: 8
- Average time to close issues: N/A
- Average time to close pull requests: 30 minutes
- Issue authors: 0
- Pull request authors: 2
- Average comments per issue: 0
- Average comments per pull request: 0.13
- Merged pull requests: 4
- Bot issues: 0
- Bot pull requests: 1
Top Authors
Issue Authors
- soodoku (2)
Pull Request Authors
- dependabot[bot] (16)
- soodoku (9)
- snyk-bot (2)
- TrellixVulnTeam (1)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels
dependencies (16)
codex (3)
Packages
- Total packages: 1
- Total downloads (PyPI): 1,314 last month
- Total dependent packages: 0
- Total dependent repositories: 1
- Total versions: 33
- Total maintainers: 2
pypi.org: piedomains
Predict categories based on domain names and their content
- Homepage: https://github.com/themains/piedomains
- Documentation: https://piedomains.readthedocs.io/
- License: MIT License
- Latest release: 0.3.10 (published 9 months ago)
Rankings
Dependent packages count: 10.0%
Stargazers count: 16.5%
Average: 17.8%
Forks count: 19.1%
Dependent repos count: 21.7%
Downloads: 21.7%
Maintainers (2)
Last synced: 9 months ago
Dependencies
.github/workflows/codeql.yml
actions
- actions/checkout v3 composite
- github/codeql-action/analyze v2 composite
- github/codeql-action/autobuild v2 composite
- github/codeql-action/init v2 composite
.github/workflows/docker-publish.yml
actions
- actions/checkout v3 composite
- docker/build-push-action ac9327eae2b366085ac7f6a2d02df8aa8ead720a composite
- docker/login-action 28218f9b04b4f3f62068d7b6ce6ca5b26e35336c composite
- docker/metadata-action 98669ae865ea3cffbcbaa878cf57c20bbf1c6c38 composite
- docker/setup-buildx-action 79abd3f86f79a9d68a23c75a09a9a85889262adf composite
- sigstore/cosign-installer f3c664df7af409cb4873aa5068053ba9d61a57b6 composite
.github/workflows/pylint.yml
actions
- actions/checkout v3 composite
- actions/setup-python v3 composite
.github/workflows/python-package-conda.yml
actions
- actions/checkout v3 composite
- actions/setup-python v3 composite
.github/workflows/python-package.yml
actions
- actions/checkout v3 composite
- actions/setup-python v3 composite
.github/workflows/tests-macos.yml
actions
- actions/checkout v3 composite
- actions/setup-python v4 composite
.github/workflows/tests-ubuntu.yml
actions
- actions/checkout v3 composite
- actions/setup-python v4 composite
.github/workflows/tests-windows.yml
actions
- actions/checkout v3 composite
- actions/setup-python v4 composite
Dockerfile
docker
- ubuntu 22.04 build
pyproject.toml
pypi
- python ^3.9
requirements.txt
pypi
- bs4 *
- nltk *
- tensorflow *
requirements_rtd.txt
pypi
- bs4 *
- nltk *
- tensorflow *
setup.py
pypi
- tqdm *