https://github.com/apachecn-archive/epub-crawler

Science Score: 13.0%

This score indicates how likely this project is to be science-related based on various indicators:

○ CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (4.5%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: apachecn-archive
License: other
Language: Python
Default Branch: master
Size: 56.6 KB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created about 3 years ago · Last pushed about 3 years ago

Metadata Files

Readme Changelog License

epub-crawler

A small tool for crawling web pages and packaging their content as an EPUB.

Installation

Via pip (recommended):

pip install EpubCrawler

From source:

pip install git+https://github.com/apachecn/epub-crawler

Usage

```
crawl-epub [CONFIG]

CONFIG: configuration file in JSON format; defaults to config.json in the current working directory
```
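For scripted runs, the command can be wrapped from Python. The sketch below is only an illustration: `build_command` and `run_crawler` are hypothetical helper names, not part of EpubCrawler, and it assumes the `crawl-epub` executable is on `PATH` after installation.

```python
import subprocess
from typing import List, Optional


def build_command(config_path: Optional[str] = None) -> List[str]:
    """Return the argv list for crawl-epub.

    With no path, the tool is documented to fall back to config.json
    in the current working directory.
    """
    cmd = ["crawl-epub"]
    if config_path is not None:
        cmd.append(config_path)
    return cmd


def run_crawler(config_path: Optional[str] = None) -> None:
    """Run crawl-epub and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(build_command(config_path), check=True)
```

Calling `run_crawler("book.json")` would start a crawl with that configuration file; `run_crawler()` leaves the path off so the default is used.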

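The configuration file supports the following properties:

name: String

Book title used in the metadata; also the name of the file saved in the current working directory
url: String (either this or list)

URL of the table-of-contents page
link: String (required if url is set)

Selector for the <a> link elements
list: [String] (either this or url)

List of pages to crawl; if the list is non-empty, these pages are crawled

⚠ This option overrides url, link, and external ⚠
title: String (optional)

Selector for the title on article pages (defaults to title)
content: String (optional)

Selector for the content on article pages; if empty, the content is detected automatically
remove: String (optional)

Selector for elements to remove from article pages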
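credit: Boolean (optional)

Whether to show a link to the original article
headers: {String: String} (optional)

HTTP request headers; defaults to {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"}
retry: Integer (optional)

Number of retries for HTTP requests; defaults to 10
wait: Float (optional)

Interval between two requests, in seconds; defaults to 0
timeout: Integer (optional)

Sets both the connection timeout and the read timeout for HTTP requests, in seconds

⚠ Overrides connTimeout and readTimeout ⚠
connTimeout: Integer (optional)

Connection timeout for HTTP requests, in seconds; defaults to 1
readTimeout: Integer (optional)

Read timeout for HTTP requests, in seconds; defaults to 60
encoding: String (optional)

Page encoding; defaults to UTF-8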
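optiMode: String (optional)

Image processing mode. 'none' disables processing; for other values, see the modes supported by imgyaso. Defaults to 'quant'
colors: Integer (optional)

The colors parameter passed to imgyaso; defaults to 8
imgSrc: [String] (optional)

Attributes to read the image source from; defaults to ["data-src", "data-original-src", "src"]
proxy: String (optional)

Proxy to use, in the form <protocol>://<host>:<port>
checkStatus: Boolean (optional)

Whether to check the status code. If true and the status code is not 2XX, the request is treated as failed. Defaults to false.
textThreads: Integer (optional)

Number of threads for crawling text; defaults to 5
imgThreads: Integer (optional)

Number of threads for crawling images; defaults to 5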
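external: String (optional)

Path to an external script. The script may define get_toc and get_article functions to customize how the table of contents and the article body are obtained.

get_toc(html: string, url: string): [string]

Takes the page HTML and URL; returns the list of table-of-contents entries

get_article(html: string, url: string): {'title': string, 'content': string}

Takes the page HTML and URL; returns a dict whose title key holds the title and whose content key holds the body

⚠ This option overrides link, title, and content, but not list ⚠
sizeLimit: String (optional)

EPUB size limit, written as a number followed by a letter unit; defaults to 100m.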
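To make the external option more concrete, here is a minimal sketch of a script that defines both hooks with the documented signatures. It uses only the standard library. Several points are assumptions rather than documented behavior: that table-of-contents entries should be absolute URLs, that article pages carry an `<h1>` title and an `<article>` body, and the helper class `_LinkCollector`, which is hypothetical. The regular expressions are a deliberately crude stand-in for a real selector library.

```python
import re
from html.parser import HTMLParser
from typing import Dict, List
from urllib.parse import urljoin


class _LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in document order."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def get_toc(html: str, url: str) -> List[str]:
    """Return the de-duplicated article URLs linked from the contents page."""
    collector = _LinkCollector()
    collector.feed(html)
    seen = set()
    toc = []
    for href in collector.hrefs:
        # Assumption: the crawler wants absolute URLs, so resolve against `url`.
        absolute = urljoin(url, href)
        if absolute not in seen:
            seen.add(absolute)
            toc.append(absolute)
    return toc


def get_article(html: str, url: str) -> Dict[str, str]:
    """Return the title and body of one article page."""
    # Assumption: the page has an <h1> title and an <article> body.
    title_match = re.search(r"<h1[^>]*>(.*?)</h1>", html, re.S | re.I)
    body_match = re.search(r"<article[^>]*>(.*?)</article>", html, re.S | re.I)
    if title_match:
        # Strip inline tags such as <em> from the heading text.
        title = re.sub(r"<[^>]+>", "", title_match.group(1)).strip()
    else:
        title = url  # fall back to the URL when no heading is found
    content = body_match.group(1).strip() if body_match else html
    return {"title": title, "content": content}
```

Saved as, say, `my_hooks.py` (a hypothetical file name), the script would be referenced through the external property of the configuration file.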

Example for crawling our PyTorch 1.4 documentation:

```json
{
    "name": "PyTorch 1.4 中文文档 & 教程",
    "url": "https://gitee.com/apachecn/pytorch-doc-zh/blob/master/docs/1.4/SUMMARY.md",
    "link": ".markdown-body li a",
    "remove": "a.anchor",
    "headers": {"Referer": "https://gitee.com/"}
}
```
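A configuration like this can be checked and written out from Python before a crawl. The sketch below encodes only the rules and defaults stated in this README; the helpers `check_config`, `with_defaults`, and `save_config` are hypothetical and not part of EpubCrawler. Two points are assumptions: that name is mandatory (it is the only property besides url, link, and list not marked optional), and that external can stand in for link because it is documented to override it.

```python
import json
from typing import Any, Dict

# Defaults as stated in the README's property list (headers omitted for brevity).
DOCUMENTED_DEFAULTS: Dict[str, Any] = {
    "retry": 10,
    "wait": 0,
    "connTimeout": 1,
    "readTimeout": 60,
    "encoding": "UTF-8",
    "optiMode": "quant",
    "colors": 8,
    "imgSrc": ["data-src", "data-original-src", "src"],
    "checkStatus": False,
    "textThreads": 5,
    "imgThreads": 5,
    "sizeLimit": "100m",
}


def check_config(cfg: Dict[str, Any]) -> None:
    """Raise ValueError when cfg breaks a rule stated in the README."""
    if not cfg.get("name"):
        raise ValueError("'name' is required")  # assumption, see lead-in
    has_url = bool(cfg.get("url"))
    has_list = bool(cfg.get("list"))
    if not has_url and not has_list:
        raise ValueError("set either 'url' or 'list'")
    # A non-empty 'list' overrides 'url' and 'link', so 'link' only matters
    # without it. Assumption: 'external' may replace the 'link' selector.
    if has_url and not has_list and not (cfg.get("link") or cfg.get("external")):
        raise ValueError("'link' is required when 'url' is set")


def with_defaults(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of cfg with the documented defaults filled in."""
    merged = {**DOCUMENTED_DEFAULTS, **cfg}
    if cfg.get("timeout") is not None:
        # 'timeout' overrides both connTimeout and readTimeout.
        merged["connTimeout"] = merged["readTimeout"] = cfg["timeout"]
    return merged


def save_config(cfg: Dict[str, Any], path: str = "config.json") -> None:
    """Validate cfg and write it as UTF-8 JSON, keeping non-ASCII titles readable."""
    check_config(cfg)
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(cfg, fh, ensure_ascii=False, indent=4)
```

`with_defaults` is only a way to see the effective settings at a glance; the crawler applies its own defaults, so the file written by `save_config` needs nothing beyond the properties you want to change.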

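License

This project is released under the SATA license.

You are obliged to star this open-source project and to consider giving the author an additional, appropriate reward.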


Owner

Name: ApacheCN 归档
Login: apachecn-archive
Kind: organization
Email: wizard.z@qq.com

Repositories: 180
Profile: https://github.com/apachecn-archive

An archive set up to keep important projects from being lost


Dependencies

requirements.txt pypi

GenEpub *
imgyaso *
pyquery *
pyyaml *
readability-lxml *
requests *
selenium *

setup.py pypi
