https://github.com/alan-turing-institute/common-crawl-readability
Scripts for processing common crawl web content through Mozilla readability.js
Science Score: 13.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ○ CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ○ .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (4.5%) to scientific vocabulary
Repository
Scripts for processing common crawl web content through Mozilla readability.js
Basic Info
- Host: GitHub
- Owner: alan-turing-institute
- Language: HTML
- Default Branch: master
- Size: 3.87 MB
Statistics
- Stars: 0
- Watchers: 3
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Readable Common Crawl
Scripts for processing common crawl web content through Mozilla readability.js
Usage
python warc_to_readable_pages.py (-f <path-to-warc.gz> | -u <url-to-warc.gz>) -o <output-directory>
Constraints:
- The URL or local path must refer to a gzipped WARC file.
- The response content is extracted from each WARC record in the file and
written to disk as one file per page, with the response headers stripped.
- Output files are named <WARC-Record-ID>.txt. The .txt extension allows
easy inspection in the default text editor. A .html extension was
deliberately avoided because some pages are likely to contain malware, so
opening them in a browser would be unsafe.
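The extraction step described above can be sketched with the Python standard library alone. This is an illustrative sketch, not the code in warc_to_readable_pages.py: the function names (`iter_warc_records`, `strip_http_headers`, `record_filename`, `warc_to_pages`) are hypothetical, and the rule that turns a record ID such as `<urn:uuid:...>` into a file name is an assumption. It covers only the local-file case and does not run the pages through readability.js.

```python
import gzip
from pathlib import Path


def iter_warc_records(stream):
    """Yield (headers, block) for each record in an uncompressed WARC byte stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.startswith(b"WARC/"):
            continue  # skip the blank lines that separate records
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break  # a blank line ends the WARC header section
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip().lower()] = value.strip()
        block = stream.read(int(headers.get("content-length", 0)))
        yield headers, block


def strip_http_headers(block):
    """Return the HTTP body: everything after the first blank line (empty if none)."""
    return block.partition(b"\r\n\r\n")[2]


def record_filename(record_id):
    """Assumed naming rule: '<urn:uuid:1234>' becomes '1234.txt'."""
    return record_id.strip("<>").rsplit(":", 1)[-1] + ".txt"


def warc_to_pages(warc_path, output_dir):
    """Write one .txt file per response record and return the written paths."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # gzip.open reads concatenated gzip members transparently, which is how
    # .warc.gz files are usually laid out (one member per record).
    with gzip.open(warc_path, "rb") as stream:
        for headers, block in iter_warc_records(stream):
            if headers.get("warc-type") != "response":
                continue  # skip warcinfo, request, metadata records
            target = out / record_filename(headers["warc-record-id"])
            target.write_bytes(strip_http_headers(block))
            written.append(target)
    return written
```

Calling `warc_to_pages("example.warc.gz", "pages")` would write one file per response record into `pages/`. Bodies are written as raw bytes with no decoding, so the output stays a plain file to be opened in a text editor rather than a browser.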
Owner
- Name: The Alan Turing Institute
- Login: alan-turing-institute
- Kind: organization
- Email: info@turing.ac.uk
- Website: https://turing.ac.uk
- Repositories: 477
- Profile: https://github.com/alan-turing-institute
The UK's national institute for data science and artificial intelligence.
Issues and Pull Requests
Last synced: 9 months ago
All Time
- Total issues: 1
- Total pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Total issue authors: 1
- Total pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 1
- Pull requests: 0
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 1
- Pull request authors: 0
- Average comments per issue: 0.0
- Average comments per pull request: 0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- mhauru (1)