citation

Extract citation ISBNs from Wikipedia dump

https://github.com/calil/citation

Science Score: 31.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
○ .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (3.8%) to scientific vocabulary

Keywords

code4lib-jp wikipedia-dump

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: CALIL
License: mit
Language: Python
Default Branch: master
Homepage:
Size: 54.7 KB

Statistics

Stars: 2
Watchers: 4
Forks: 0
Open Issues: 0
Releases: 0

Topics

code4lib-jp wikipedia-dump

Created almost 10 years ago · Last pushed about 1 year ago

Metadata Files

Readme License Citation

citation

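A tool that extracts cited ISBNs from Wikipedia dump files

Overview

Extracts cited ISBNs from the Japanese Wikipedia dump
Saves the extracted data in line-delimited JSON format
Tolerates common notation variants of ISBNs (hyphens, spaces, a lowercase `x`, and labels such as `ISBN-10` or `isbn=`)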
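
The notation variants mentioned above can be illustrated with a small normalization sketch. This is not the tool's own code: the helper name `normalize_isbn` and the label pattern are assumptions made for illustration, and the sketch only covers label and separator stripping, not validation.

```python
import re

# Hypothetical label pattern: "ISBN", "ISBN-10", "ISBN13", "isbn=", "ISBN：" and similar.
_LABEL = re.compile(r"^(?:ISBN(?:-?1[03])?|isbn)\s*[:：=]?\s*", re.IGNORECASE)


def normalize_isbn(raw: str) -> str:
    """Drop an optional ISBN label, remove separators, uppercase the check character."""
    body = _LABEL.sub("", raw.strip())
    # Remove hyphens, ASCII spaces and full-width spaces, then uppercase a trailing "x".
    return body.replace("-", "").replace(" ", "").replace("\u3000", "").upper()


print(normalize_isbn("ISBN 4-7722-1227-2"))  # → 4772212272
```

The result still has to pass a check-digit test before it can be treated as an ISBN.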

Installing dependencies

```bash
poetry install
```

Command line

```bash
Usage: citation.py [OPTIONS] INPUTFILENAME EXPORTFILENAME

Options:
  --show-exclusion / --no-show-exclusion
                                  Show excluded items
  --help                          Show this message and exit.
```

```bash
wget https://dumps.wikimedia.org/jawiki/20190420/jawiki-20190420-pages-articles-multistream.xml.bz2
poetry run python citation.py jawiki-20190420-pages-articles-multistream.xml.bz2 citation-jawiki-20190420.jsonl
```
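
The dump is read as a compressed text stream, one line at a time, rather than being unpacked first. The following is a minimal sketch of that reading pattern, assuming a hypothetical helper `iter_titles` and a tiny stand-in file so it runs without downloading anything. Python's `bz2` module handles multi-stream archives such as the `multistream` dump.

```python
import bz2
import re
import tempfile
from pathlib import Path

TITLE = re.compile(r"<title>([^<]*)</title>")


def iter_titles(path):
    """Stream a bz2-compressed XML dump line by line and yield page titles."""
    with bz2.open(path, "rt", encoding="utf-8") as fh:
        for line in fh:
            match = TITLE.search(line)
            if match:
                yield match.group(1)


# Tiny stand-in for a real dump (made-up content, for illustration only).
with tempfile.TemporaryDirectory() as tmp:
    dump = Path(tmp) / "sample.xml.bz2"
    with bz2.open(dump, "wt", encoding="utf-8") as fh:
        fh.write("  <page>\n    <title>地理学</title>\n  </page>\n")
    print(list(iter_titles(dump)))  # → ['地理学']
```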

Extracted data

```json
{
  "isbn": "4772212272",
  "raw": "4-7722-1227-2",
  "title": "地理学",
  "score": 2.9,
  "h1": "参考文献",
  "h2": null,
  "is_ref": true
}
```

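| Field  | Type        | Description                                                                                          |
|--------|-------------|------------------------------------------------------------------------------------------------------|
| isbn   | String      | Normalized ISBN (ISBN-10)                                                                            |
| raw    | String      | Original ISBN notation as found in the text                                                          |
| title  | String      | Wikipedia page title                                                                                 |
| score  | Number      | Accuracy of the ISBN, computed with a custom metric (a low score may indicate a false detection)     |
| h1     | String/null | Heading 1                                                                                            |
| h2     | String/null | Heading 2                                                                                            |
| is_ref | Boolean     | Whether the ISBN is explicitly marked as a citation (false for lists of works and similar sections) |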
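
Since each output line is one JSON object with the fields above, the file can be consumed as a stream. The sketch below is one way to filter it. The function name `iter_citations` and the threshold `2.0` are assumptions chosen for illustration, not values recommended by the tool.

```python
import json
from typing import Iterable, Iterator


def iter_citations(lines: Iterable[str], min_score: float = 2.0,
                   refs_only: bool = True) -> Iterator[dict]:
    """Yield records scoring at least min_score; optionally only explicit citations."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        if record["score"] < min_score:
            continue
        if refs_only and not record["is_ref"]:
            continue
        yield record


# One record in the documented format; a real run would iterate over the opened .jsonl file.
sample = [
    '{"isbn": "4772212272", "raw": "4-7722-1227-2", "title": "地理学", '
    '"score": 2.9, "h1": "参考文献", "h2": null, "is_ref": true}',
]
for rec in iter_citations(sample):
    print(rec["isbn"], rec["title"])  # → 4772212272 地理学
```

To read an actual output file, pass an open file object, for example `iter_citations(open("citation-jawiki-20190420.jsonl", encoding="utf-8"))`.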

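Downloading processed data

Notes

Because detection relies on a matching check digit, numbers that are not ISBNs may occasionally be misidentified as ISBNs. This is accepted because it causes no problem when the goal is to look up referencing articles from a known ISBN.
ISBNs with an incorrect check digit are not extracted.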
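
The check-digit rule behind both notes can be written out directly. The following is a standalone sketch of the standard ISBN-10 and ISBN-13 checks. The tool itself delegates validation to `isbnlib`, and the function names here are hypothetical.

```python
def isbn10_check_ok(isbn: str) -> bool:
    """ISBN-10: weights 10..1, the weighted sum must be divisible by 11 ("X" counts as 10)."""
    if len(isbn) != 10 or not (isbn[:9].isascii() and isbn[:9].isdigit()):
        return False
    if isbn[9] not in "0123456789X":
        return False
    total = sum((10 - i) * int(ch) for i, ch in enumerate(isbn[:9]))
    total += 10 if isbn[9] == "X" else int(isbn[9])
    return total % 11 == 0


def isbn13_check_ok(isbn: str) -> bool:
    """ISBN-13: alternating weights 1 and 3, the weighted sum must be divisible by 10."""
    if len(isbn) != 13 or not (isbn.isascii() and isbn.isdigit()):
        return False
    total = sum(int(ch) * (3 if i % 2 else 1) for i, ch in enumerate(isbn))
    return total % 10 == 0


print(isbn10_check_ok("4772212272"))  # → True
print(isbn10_check_ok("4772212273"))  # → False (wrong check digit, so it would not be extracted)
print(isbn10_check_ok("0000000000"))  # → True (not a real ISBN, yet the check digit matches)
```

The last call shows why false positives are possible: any digit string that happens to satisfy the checksum passes, whether or not it is an ISBN.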

Owner

Name: CALIL Inc.
Login: CALIL
Kind: organization
Email: contact@calil.jp
Location: Gifu, JP

Website: https://calil.jp/
Repositories: 16
Profile: https://github.com/CALIL

The more fun the library.

Citation (citation.py)

__title__ = 'Wikipedia Citation Extractor'
__copyright__ = "Copyright (C) 2023 CALIL Inc."
__author__ = "Ryuuji Yoshimoto <ryuuji@calil.jp>"

import re
import bz2
import json
import isbnlib
import click
from halo import Halo


@click.command()
@click.argument('input_filename', type=click.Path(exists=True))
@click.argument('export_filename', type=click.Path(exists=False))
@click.option('--show-exclusion/--no-show-exclusion', default=False, help='Show excluded items')
def extract_citation(input_filename, export_filename, show_exclusion):
    """
    Extract cited ISBNs from a Wikipedia dump file.
    :param input_filename: dump file to read
    :param export_filename: line-delimited JSON file to write
    :param show_exclusion: whether to print excluded items
    :return:
    """
    click.echo('| extract_citation')
    click.echo('| Input file: ' + click.format_filename(input_filename))
    click.echo('| Output file: ' + click.format_filename(export_filename))

    title = None
    topic1 = None
    topic2 = None
    isbn_regex = re.compile(r"((?:ISBN10 |ISBN13 |ISBN　|isbn=|ISBN  |isbn = |ISBN-10 |ISBN-13 |ISBN：|ISBN-|ISBN |ISBN)?)([0-9][0-9\- ]{8,20}[0-9Xx])")
    topic_regex = re.compile("([=]{2,3})([^=]+)(.*)")
    pages = 0
    count_isbn = 0
    count_error = 0
    with Halo(text='Loading', spinner='dots') as spinner:
        with open(export_filename, 'w', encoding='utf-8') as f:
            for line in bz2.open(input_filename, 'rt', encoding='utf-8'):
                if line == "\n":  # Optimize
                    continue
                if line == "  <page>\n":
                    title = None
                    topic1 = None
                    topic2 = None
                    pages += 1
                    if pages % 300 == 0:
                        spinner.text = str(pages)
                elif not title:
                    for title_ in re.findall(u"<title>([^<]*)</title>", line):
                        title = title_
                else:
                    # Look for section headings
                    if line.find("==") != -1:
                        ret = topic_regex.findall(line)
                        if len(ret) == 1:
                            if len(ret[0][0]) == 2:
                                topic1 = ret[0][1].strip()
                                topic2 = None
                            if len(ret[0][0]) == 3:
                                topic2 = ret[0][1].strip()

                    # Cheap pre-filter: skip lines that contain no ISBN label in any spelling
                    if line.find("ISBN") == -1 and line.find("isbn") == -1 and line.find("Isbn") == -1:
                        continue

                    for ret in isbn_regex.findall(line):
                        score = 0.0
                        _isbn = ret[1].replace('-', '')
                        _isbn = _isbn.replace(' ', '')
                        _isbn = _isbn.replace('x', 'X')

                        if ret[0] == 'ISBN-' and len(_isbn) == 12 and _isbn[0:2] == '10' and _isbn[2] == '4':
                            _isbn = _isbn[2:12]
                        _isbn_pattern = "?"
                        _isbn_length = len(_isbn)

                        if len(ret[0]) > 0:  # Trust the match more when an ISBN label precedes it
                            score += 0.9

                        if _isbn_length == 16 and _isbn.find('978978') == 0 and isbnlib.is_isbn13(_isbn[3:16]):
                            # Doubled "978" prefix: drop the first copy and keep the 13-digit ISBN
                            _isbn_pattern = "I13(978978Cut)"
                            score += 0.5
                            _isbn = _isbn[3:16]
                        elif _isbn_length == 10:
                            if isbnlib.is_isbn10(_isbn):
                                _isbn_pattern = "I10"
                                score += 0.5
                                if re.search("^4", _isbn):
                                    score += 1.0
                                if re.search("[X]$", _isbn):
                                    score += 0.5
                            elif _isbn.find("X") == -1 and isbnlib.is_isbn13("978" + _isbn):
                                _isbn_pattern = "I13(978+)"
                                _isbn = "978" + _isbn
                                score += 1.0
                        elif _isbn_length == 13:
                            if _isbn.find('491') == 0:  # 491 prefix: Japanese magazine code (雑誌コード), not an ISBN
                                _isbn_pattern = u"雑誌コード"
                                score = -1
                            elif (_isbn.find('978') == 0 or _isbn.find('977') == 0) and _isbn.find(
                                    "X") == -1 and isbnlib.is_isbn13(_isbn):
                                _isbn_pattern = "I13"
                                score += 1.0
                            elif isbnlib.is_isbn10(_isbn[3:]):  # Valid ISBN-10 behind a bare 3-digit prefix: strip the prefix
                                _isbn = _isbn[3:]
                                _isbn_pattern = "I10(978-)"
                                score += 0.5
                        elif _isbn_length == 11 and _isbn[0] == '8' and isbnlib.is_isbn13("97" + _isbn):  # Restore a missing leading "97" to get 13 digits
                            _isbn = "97" + _isbn
                            _isbn_pattern = "I13(97+)"
                            score += 0.5
                        elif _isbn_length > 13 and _isbn.find("978") == 0 and _isbn.find("X") == -1 and isbnlib.is_isbn13(
                                _isbn[0:13]):
                            _isbn_pattern = "I13(Cut13)"
                            _isbn = _isbn[0:13]
                            score += 0.5
                        elif _isbn_length > 13 and _isbn.find("978") == 0 and isbnlib.is_isbn10(_isbn[3:13]):
                            _isbn = _isbn[3:13]
                            _isbn_pattern = "I10(Cut13_978-)"
                            score += 0.5
                        elif _isbn_length > 10 and isbnlib.is_isbn10(_isbn[0:10]):
                            _isbn = _isbn[0:10]
                            _isbn_pattern = "I10(Cut10)"
                            score += 0.5
                        elif _isbn_length > 10 and isbnlib.is_isbn13("978" + _isbn[0:10]):
                            _isbn = "978" + _isbn[0:10]
                            _isbn_pattern = "I13(Cut10_978+)"
                            score += 0.5
                        elif _isbn_length == 9 and isbnlib.is_isbn10("4" + _isbn):
                            _isbn = '4' + _isbn
                            _isbn_pattern = "I10(4+)"
                            score += 0.5
                        if score >= 1.0:
                            count_isbn += 1
                            if line.find("&lt;ref") != -1 or line.find("{Cite book") != -1:
                                is_ref = True
                                score += 0.5
                            else:
                                is_ref = False
                            if topic1:
                                if topic1 in ["作品リスト", "作品"]:
                                    is_ref = False
                                    score -= 0.5
                                if (topic1 in ["典拠・資料", "脚注", "脚注および参考文献", "参考図書", "主な文献", "参照資料",
                                               "関連図書", "参考書籍", "参考文献", "参考資料",
                                               "関連書籍", "文献", "出典", "参照文献"]) or topic1.find("関連文献") == 0:
                                    is_ref = True
                                    score += 0.5
                            item = {'isbn': isbnlib.to_isbn10(_isbn),
                                    'raw': ret[1].strip(),
                                    'title': title,
                                    'score': score,
                                    'h1': topic1,
                                    'h2': topic2,
                                    'is_ref': is_ref}
                            f.write(json.dumps(item, ensure_ascii=False) + '\n')
                        else:
                            count_error += 1
                            if show_exclusion and len(ret[0]) > 0:
                                click.echo("\n" + " ".join([_isbn_pattern, ret[0], _isbn, title, str(score)]))

    click.echo("count_pages:" + str(pages))
    click.echo("count_isbn:" + str(count_isbn))
    click.echo("count_error:" + str(count_error))
    click.secho('Processing complete', fg='green')


if __name__ == '__main__':
    extract_citation()
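
The `h1` and `h2` fields come from tracking the most recent section headings while scanning the wikitext. The sketch below isolates that step, reusing the heading pattern from `citation.py`. The generator name `track_headings` and the sample lines are made up for illustration.

```python
import re

# Same pattern as topic_regex in citation.py: heading marks, heading text, remainder.
topic_regex = re.compile(r"([=]{2,3})([^=]+)(.*)")


def track_headings(lines):
    """Yield (h1, h2, line), where h1/h2 are the latest `==` and `===` headings seen."""
    h1 = h2 = None
    for line in lines:
        if "==" in line:
            found = topic_regex.findall(line)
            if len(found) == 1:
                marks, text = found[0][0], found[0][1].strip()
                if len(marks) == 2:
                    h1, h2 = text, None  # a new section resets the subsection
                elif len(marks) == 3:
                    h2 = text
        yield h1, h2, line


# Made-up wikitext lines, not taken from a real article.
wikitext = [
    "== 参考文献 ==\n",
    "* 著者『書名』ISBN 4-7722-1227-2\n",
    "=== 洋書 ===\n",
    "* Some Author, Some Title\n",
]
for h1, h2, line in track_headings(wikitext):
    print(h1, h2, line.strip())
```

In the extractor, a top-level heading such as 参考文献 raises the score and sets `is_ref`, while 作品リスト or 作品 lowers it.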

GitHub Events

Total

Watch event: 1
Push event: 2

Last Year

Watch event: 1
Push event: 2

Dependencies

.github/workflows/codeql-analysis.yml actions

actions/checkout v2 composite
github/codeql-action/analyze v1 composite
github/codeql-action/autobuild v1 composite
github/codeql-action/init v1 composite

poetry.lock pypi

click 8.1.3
colorama 0.4.6
halo 0.0.31
isbnlib 3.10.12
log-symbols 0.0.14
six 1.16.0
spinners 0.0.24
termcolor 2.1.1

pyproject.toml pypi

click *
colorama *
halo *
isbnlib *
python ^3.10
