https://github.com/capjamesg/web-feed-recovery

Try to identify new versions of feeds that now return a 404.

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.9%) to scientific vocabulary

Keywords

atom feed-reader-testing feed-reading rss

Repository

Basic Info
  • Host: GitHub
  • Owner: capjamesg
  • License: mit
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 27.3 KB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created about 1 year ago · Last pushed about 1 year ago

README.md

Web Feed Recovery

This repository contains a script that takes a list of feed URLs known to return a 404 and attempts to find a new, working feed for each one.

In a test on 160 broken feeds collected from the real world, the script recovered 67% of them.

Installation

First, clone this project:

git clone https://github.com/capjamesg/web-feed-recovery

Then, create a file called feeds.txt and add feeds that are known to be broken. Add one feed URL per line.

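Then, install the dependencies listed in requirements.txt and run the script:

pip install -r requirements.txt
python3 app.py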

Results will be saved to a file called results.json with the structure:

[
    {
        "original_feed": "https://blog.autumnrain.cc",
        "found_feeds": {
            "https://blog.autumnrain.cc/rss/": "application/rss+xml"
        }
    }
]

The key-value pairs are the found feed URL mapped to the found MIME type.

MIME types are only added if a feed was found through HTTP header discovery. If the feed was not found through HTTP header discovery, the MIME type will be null.
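As a minimal sketch of how the output might be consumed, the snippet below flattens the structure described above into one row per found feed. The function names `load_results` and `flatten_results` are hypothetical helpers for illustration and are not part of this repository.

```python
import json


def load_results(text):
    """Parse the contents of results.json into a list of dictionaries."""
    return json.loads(text)


def flatten_results(results):
    """Return one (original_feed, found_feed, mime_type) tuple per found feed.

    mime_type is None when the JSON value was null.
    """
    rows = []
    for entry in results:
        for feed_url, mime_type in entry["found_feeds"].items():
            rows.append((entry["original_feed"], feed_url, mime_type))
    return rows


if __name__ == "__main__":
    # Assumes app.py has already been run and results.json exists.
    with open("results.json", encoding="utf-8") as handle:
        for original, found, mime in flatten_results(load_results(handle.read())):
            print(f"{original} -> {found} ({mime or 'unknown type'})")
```

An original feed with no replacements simply contributes no rows, so the caller can tell which feeds still need manual attention by comparing against feeds.txt.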

Algorithm

  1. Go to the homepage of the site associated with the feed.
  2. Check the HTTP headers and HTML <link> tags for signals of a feed (using the indieweb-utils feed discovery implementation).
  3. Check for instances of several link anchors indicative of a feed (e.g. "RSS", "RSS Feed"). Save those as potential new feeds.
  4. Check for instances of link anchors for several blog-related terms, like "Blog" and "Writing". Go to those pages, perform HTTP header and HTML <link> tag analysis, and save any feeds.
  5. Present all discovered feeds.
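
The steps above can be sketched on a single HTML document as follows. This is an illustration only, not the code in app.py: it uses Python's standard `html.parser` as a stand-in for the indieweb-utils discovery and bs4 parsing the project depends on, it skips HTTP header checks and network requests, and the names `homepage_for`, `find_candidates`, `FEED_TYPES`, `FEED_ANCHORS` and `BLOG_ANCHORS` are assumptions made for the example. Recording the MIME type declared on a `<link>` tag is also a choice of this sketch.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Assumed values for illustration; the project's actual lists may differ.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}
FEED_ANCHORS = {"rss", "rss feed"}
BLOG_ANCHORS = {"blog", "writing"}


def homepage_for(feed_url):
    """Step 1: derive the homepage of the site that hosted the broken feed."""
    parts = urlsplit(feed_url)
    return f"{parts.scheme}://{parts.netloc}/"


class CandidateParser(HTMLParser):
    """Collect feed candidates and blog-like pages from one HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found_feeds = {}  # feed URL -> MIME type, or None if unknown
        self.blog_pages = []   # pages worth a second round of discovery
        self._href = None      # href of the <a> currently open, if any
        self._text = []        # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "link":
            # Step 2 (HTML part): <link rel="alternate" type="..."> tags.
            rel = (attrs.get("rel") or "").lower().split()
            mime = (attrs.get("type") or "").lower()
            if "alternate" in rel and mime in FEED_TYPES and attrs.get("href"):
                self.found_feeds[urljoin(self.base_url, attrs["href"])] = mime
        elif tag == "a" and attrs.get("href"):
            self._href = urljoin(self.base_url, attrs["href"])
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "a" or self._href is None:
            return
        label = " ".join("".join(self._text).split()).lower()
        if label in FEED_ANCHORS:
            # Step 3: anchor text that suggests a feed; MIME type unknown.
            self.found_feeds.setdefault(self._href, None)
        elif label in BLOG_ANCHORS:
            # Step 4: blog-like pages to fetch and analyse the same way.
            self.blog_pages.append(self._href)
        self._href = None


def find_candidates(html, base_url):
    """Return (found_feeds, blog_pages) for one page of HTML."""
    parser = CandidateParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.found_feeds, parser.blog_pages
```

A caller would fetch `homepage_for(feed_url)`, pass the response body to `find_candidates`, then repeat the call for each returned blog page and merge the results before presenting them (step 5).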

Limitations

The algorithm will not work for multi-user sites hosted on a single domain. A feed found on such a domain cannot be confidently and generally attributed to a single writer using the steps above. Further additions would be needed to support such behaviour.

UX

The feeds returned are "potential" feeds, since any feed that the user did not add to a feed reader themselves (or that a feed reader did not infer from a URL provided by a user) cannot be known to be the right replacement without confirmation from a user. Thus, use of this script in any project should be accompanied by a stage where a user is asked to confirm that the new feed matches their expectations before replacing the broken feed with the newly-found one.
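
One way such a confirmation stage could look is sketched below, assuming the results.json structure described earlier. The function `confirm_replacements` and its `ask` parameter are hypothetical and not part of this repository; `ask` defaults to the built-in `input` so the prompt can be swapped out in a graphical feed reader or in tests.

```python
def confirm_replacements(results, ask=input):
    """Ask the user to approve a replacement for each broken feed.

    results: list of dicts shaped like the entries in results.json.
    ask: callable that takes a prompt string and returns the user's answer.
    Returns a dict mapping each original feed to the candidate the user accepted.
    """
    confirmed = {}
    for entry in results:
        original = entry["original_feed"]
        for candidate in entry["found_feeds"]:
            answer = ask(f"Replace {original} with {candidate}? [y/N] ")
            if answer.strip().lower() == "y":
                confirmed[original] = candidate
                break  # one accepted replacement per broken feed
    return confirmed
```

Feeds the user declines, or for which nothing was found, are left out of the returned mapping, so the broken feed is never replaced without explicit approval.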

License

This project is licensed under the MIT License.

Owner

  • Name: James
  • Login: capjamesg
  • Kind: user
  • Location: Scotland
  • Company: @Roboflow

from words, wonder.

GitHub Events

Total
  • Push event: 12
  • Create event: 2
Last Year
  • Push event: 12
  • Create event: 2

Issues and Pull Requests

Last synced: 12 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Dependencies

requirements.txt pypi
  • bs4 *
  • indieweb-utils *
  • requests *
  • tqdm *