https://github.com/capjamesg/getsitemap

A Python library that retrieves all URLs in the sitemaps on a website.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary

Keywords

crawling python sitemap

Keywords from Contributors

labels
Last synced: 6 months ago

Repository

Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 1
  • Open Issues: 1
  • Releases: 2
Topics
crawling python sitemap
Created over 3 years ago · Last pushed over 1 year ago
Metadata Files
Readme Changelog Contributing License

README.md

getsitemap

getsitemap is a Python library that retrieves every URL listed in the sitemaps of a website.

This project may be useful if you are building a search crawler or a sitemap URL status code validator.

You can read the documentation for this project on Read the Docs.

Installation 💻

To get started, pip install getsitemap:

pip install getsitemap

Quickstart ⚡

Get all URLs recursively in all sitemaps

```python
import getsitemap

urls = getsitemap.get_individual_sitemap("https://jamesg.blog/sitemap.xml")

print(urls)
```
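
To illustrate what "recursively" means here: under the sitemaps.org protocol, a sitemap file is either a `<urlset>` that lists pages or a `<sitemapindex>` that points at further sitemap files. The following is a minimal standard-library sketch of that traversal, not getsitemap's actual implementation. `collect_urls`, the `fetch` callable, and the example.com documents are hypothetical names introduced for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set

# XML namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def collect_urls(
    sitemap_url: str,
    fetch: Callable[[str], str],
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return page URLs under sitemap_url, following nested sitemap indexes."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:  # guard against sitemaps that reference each other
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))
    if root.tag.endswith("sitemapindex"):
        # An index lists other sitemap files, so recurse into each one.
        urls: List[str] = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            if loc.text:
                urls.extend(collect_urls(loc.text.strip(), fetch, seen))
        return urls
    # A plain sitemap lists page URLs directly.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS) if loc.text]


# Hypothetical in-memory documents standing in for HTTP responses.
PAGES = {
    "https://example.com/sitemap.xml": (
        '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<sitemap><loc>https://example.com/posts.xml</loc></sitemap>"
        "</sitemapindex>"
    ),
    "https://example.com/posts.xml": (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/a</loc></url>"
        "<url><loc>https://example.com/b</loc></url>"
        "</urlset>"
    ),
}

print(collect_urls("https://example.com/sitemap.xml", PAGES.__getitem__))
```

Passing `fetch` as a parameter keeps the traversal testable without network access. In real use it would wrap an HTTP client.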

Get all URLs in a single sitemap

```python
import getsitemap

all_urls = getsitemap.retrieve_sitemap_urls("https://sitemap")

print(all_urls)
```
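
As a sketch of the status code validator use case mentioned earlier, the following standard-library example assumes you already hold a plain list of URL strings, for instance gathered from a sitemap. `head_status` and `find_broken` are hypothetical helpers, not part of getsitemap, and nothing is assumed here about the return type of the library's functions.

```python
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable


def head_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a HEAD request to url."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # urllib raises HTTPError for 4xx and 5xx responses.
        return error.code


def find_broken(
    urls: Iterable[str],
    get_status: Callable[[str], int] = head_status,
) -> Dict[str, int]:
    """Map each URL whose status code is 400 or higher to that code."""
    broken: Dict[str, int] = {}
    for url in urls:
        status = get_status(url)
        if status >= 400:
            broken[url] = status
    return broken
```

Connection failures such as DNS errors raise `urllib.error.URLError` and are deliberately left unhandled in this sketch. Some servers also reject HEAD requests, so a real validator may need to fall back to GET.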

Code Quality

This library uses tox, pytest, and flake8 to ensure code quality.

To run code quality checks, run the following command:

tox

License 👩

This project is licensed under the MIT License.

Contributing 🛠️

We would love to have your help in improving `getsitemap`. Have an idea for a new feature or a bug to fix? Leave information in a GitHub Issue to start a discussion!

If you have

Contributors 💻

  • capjamesg

Owner

  • Name: James
  • Login: capjamesg
  • Kind: user
  • Location: Scotland
  • Company: @Roboflow

from words, wonder.

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 15
  • Total Committers: 3
  • Avg Commits per committer: 5.0
  • Development Distribution Score (DDS): 0.133
Top Committers
Name Email Commits
capjamesg j****g@j****g 13
dependabot[bot] 4****]@u****m 1
jamesg 3****g@u****m 1
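
The Development Distribution Score is commonly defined as 1 minus the share of commits made by the most active committer. Assuming that definition applies here, the 0.133 figure can be reproduced from the commit counts in the table above. The function name is hypothetical.

```python
from typing import Sequence


def development_distribution_score(commits_per_committer: Sequence[int]) -> float:
    """Return 1 minus the top committer's share of all commits (assumed definition)."""
    total = sum(commits_per_committer)
    if total == 0:
        raise ValueError("at least one commit is required")
    return 1 - max(commits_per_committer) / total


# Commit counts from the Top Committers table: 13, 1, and 1.
dds = development_distribution_score([13, 1, 1])
print(round(dds, 3))  # 0.133
```

A score near 0 means one committer wrote almost everything. A score closer to 1 indicates work spread across many committers.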

Issues and Pull Requests

Last synced: 6 months ago

All Time
  • Total issues: 2
  • Total pull requests: 8
  • Average time to close issues: 1 day
  • Average time to close pull requests: about 15 hours
  • Total issue authors: 2
  • Total pull request authors: 1
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.0
  • Merged pull requests: 8
  • Bot issues: 0
  • Bot pull requests: 8
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • capjamesg (1)
  • minism (1)
Pull Request Authors
  • dependabot[bot] (11)
Top Labels
Issue Labels
testing (1)
Pull Request Labels
dependencies (11)

Packages

  • Total packages: 3
  • Total downloads:
    • pypi 54 last-month
  • Total dependent packages: 0
    (may contain duplicates)
  • Total dependent repositories: 0
    (may contain duplicates)
  • Total versions: 8
  • Total maintainers: 1
pypi.org: getsitemap

Retrieve all URLs from a sitemap.

  • Versions: 6
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 27 Last month
Rankings
Dependent packages count: 6.6%
Forks count: 23.2%
Average: 26.1%
Dependent repos count: 30.6%
Downloads: 31.0%
Stargazers count: 39.1%
Maintainers (1)
Last synced: 6 months ago
pypi.org: disinfo-domains

Analyze the reliability of a source using Wikipedia.

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 16 Last month
Rankings
Dependent packages count: 9.6%
Average: 36.3%
Dependent repos count: 63.0%
Maintainers (1)
Last synced: 6 months ago
pypi.org: sourcetrust

Analyze the reliability of a source using Wikipedia.

  • Versions: 1
  • Dependent Packages: 0
  • Dependent Repositories: 0
  • Downloads: 11 Last month
Rankings
Dependent packages count: 9.6%
Average: 36.3%
Dependent repos count: 63.1%
Maintainers (1)
Last synced: 6 months ago

Dependencies

requirements.txt pypi
  • beautifulsoup4 ==4.11.1
  • bs4 ==0.0.1
  • certifi ==2022.9.24
  • charset-normalizer ==2.1.1
  • idna ==3.4
  • requests ==2.28.1
  • soupsieve ==2.3.2.post1
  • urllib3 ==1.26.12