https://github.com/capjamesg/web-feed-recovery
Try to identify new versions of feeds that now return a 404.
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Web Feed Recovery
This repository contains a script that tries to find the new location of a web feed whose old URL now returns a 404.
The script takes a list of feed URLs that are known to return a 404 and attempts to discover a replacement feed for each one.
In a test on 160 broken real-world feeds, this project recovered 67% of them.
Installation
First, clone this project:
git clone https://github.com/capjamesg/web-feed-recovery
Then, create a file called feeds.txt and add feeds that are known to be broken. Add one feed URL per line.
Then, run:
python3 app.py
Results will be saved to a file called results.json with the structure:
```json
[
    {
        "original_feed": "https://blog.autumnrain.cc",
        "found_feeds": {
            "https://blog.autumnrain.cc/rss/": "application/rss+xml"
        }
    }
]
```
Each key in found_feeds is a discovered feed URL, and each value is the MIME type found for that feed.
A MIME type is recorded only if the feed was found through HTTP header discovery. Otherwise, the MIME type is null.
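As a rough illustration of consuming this structure, the sketch below separates feeds that carry a MIME type from those whose MIME type is null. The `split_by_discovery` helper and the second sample entry (on `example.com`) are hypothetical and are not part of the project.

```python
import json

# Hypothetical results.json content: the first entry is the README's example,
# the second is an invented entry that shows a null MIME type.
SAMPLE = """
[
  {"original_feed": "https://blog.autumnrain.cc",
   "found_feeds": {"https://blog.autumnrain.cc/rss/": "application/rss+xml"}},
  {"original_feed": "https://example.com/feed.xml",
   "found_feeds": {"https://example.com/rss": null}}
]
"""


def split_by_discovery(found_feeds):
    """Separate feeds with a recorded MIME type from feeds whose MIME type is null."""
    typed = {url: mime for url, mime in found_feeds.items() if mime is not None}
    untyped = [url for url, mime in found_feeds.items() if mime is None]
    return typed, untyped


for entry in json.loads(SAMPLE):
    typed, untyped = split_by_discovery(entry["found_feeds"])
    print(entry["original_feed"], typed, untyped)
```

To read real output, the sample string could be swapped for the contents of results.json, for example with `json.load` on the opened file.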
Algorithm
- Go to the homepage of the site associated with the feed.
- Check the HTTP headers and HTML <link> tags for signals of a feed (using the indieweb-utils feed discovery implementation).
- Check for instances of several link anchors indicative of a feed (e.g. "RSS", "RSS Feed"). Save those as potential new feeds.
- Check for instances of link anchors for several blog-related terms, like "Blog" and "Writing". Go to those pages, perform HTTP header and HTML <link> tag analysis, and save any feeds.
- Present all discovered feeds.
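The homepage and link-anchor steps above can be sketched roughly as follows. This is not the project's code: it uses the standard library's `html.parser` instead of bs4, skips the header and `<link>` discovery that indieweb-utils handles, and assumes exact, case-insensitive matches against only the anchor texts the list names. `homepage_for`, `AnchorCollector`, `classify_anchors`, and the `example.com` URLs are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit

# Assumed anchor texts: the list above names only these four terms.
FEED_ANCHORS = {"rss", "rss feed"}
BLOG_ANCHORS = {"blog", "writing"}


def homepage_for(feed_url):
    """Return the scheme://host/ root of the site that hosted the feed."""
    parts = urlsplit(feed_url)
    return f"{parts.scheme}://{parts.netloc}/"


class AnchorCollector(HTMLParser):
    """Collect (href, visible text) pairs for every <a> element with an href."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open <a href=...> element.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def classify_anchors(html, base_url):
    """Split anchors into candidate feed URLs and blog pages to visit next."""
    collector = AnchorCollector()
    collector.feed(html)
    feeds, pages = [], []
    for href, text in collector.anchors:
        label = text.lower()
        if label in FEED_ANCHORS:
            feeds.append(urljoin(base_url, href))
        elif label in BLOG_ANCHORS:
            pages.append(urljoin(base_url, href))
    return feeds, pages


# Hypothetical homepage markup for a site whose old feed URL now returns a 404.
html = '<nav><a href="/blog/">Blog</a> <a href="/rss/">RSS Feed</a> <a href="/about">About</a></nav>'
feeds, pages = classify_anchors(html, homepage_for("https://example.com/old/feed.xml"))
print(feeds, pages)
```

Each URL in `pages` would then go through the same header and `<link>` tag analysis as the homepage, and any feeds found there would join `feeds` as potential replacements.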
Limitations
The algorithm does not work for multi-user sites hosted on a single domain. A feed found on such a domain cannot be confidently attributed to a single writer using the steps above. Supporting these sites would require additional logic.
UX
The feeds returned are "potential" feeds. A feed that the user did not add to a feed reader themselves (or that a feed reader did not infer from a URL the user provided) cannot be assumed to be the right replacement without the user's confirmation. Any project that uses this script should therefore include a step that asks the user to confirm that the new feed matches their expectations before the broken feed is replaced with the newly found one.
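One way such a confirmation step might look is sketched below, assuming a result entry shaped like the results.json example. `choose_replacement` and `confirm_on_terminal` are hypothetical names. A feed reader would plug in its own interface in place of the terminal prompt.

```python
def choose_replacement(result, confirm):
    """Return the first candidate feed the user accepts, or None if all are rejected."""
    for candidate in result["found_feeds"]:
        if confirm(result["original_feed"], candidate):
            return candidate
    return None


def confirm_on_terminal(original, candidate):
    """Hypothetical terminal prompt; anything other than 'y' counts as a rejection."""
    answer = input(f"Replace {original} with {candidate}? [y/N] ")
    return answer.strip().lower() == "y"


# Result entry taken from the results.json example.
result = {
    "original_feed": "https://blog.autumnrain.cc",
    "found_feeds": {"https://blog.autumnrain.cc/rss/": "application/rss+xml"},
}
```

Calling `choose_replacement(result, confirm_on_terminal)` would prompt once per candidate. The broken feed is left untouched when the function returns None.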
License
This project is licensed under the MIT license.
Owner
- Name: James
- Login: capjamesg
- Kind: user
- Location: Scotland
- Company: @Roboflow
- Website: jamesg.blog
- Repositories: 320
- Profile: https://github.com/capjamesg
from words, wonder.
GitHub Events
Total
- Push event: 12
- Create event: 2
Last Year
- Push event: 12
- Create event: 2
Issues and Pull Requests
Last synced: 12 months ago
All Time
- Total issues: 0
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 0
- Total pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 0
- Average comments per issue: 0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Dependencies
- bs4 *
- indieweb-utils *
- requests *
- tqdm *