citation

Extract citation ISBNs from Wikipedia dump

https://github.com/calil/citation

Science Score: 31.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (3.8%) to scientific vocabulary

Keywords

code4lib-jp wikipedia-dump
Last synced: 10 months ago

Repository

Extract citation ISBNs from Wikipedia dump

Basic Info
  • Host: GitHub
  • Owner: CALIL
  • License: mit
  • Language: Python
  • Default Branch: master
  • Homepage:
  • Size: 54.7 KB
Statistics
  • Stars: 2
  • Watchers: 4
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Topics
code4lib-jp wikipedia-dump
Created almost 10 years ago · Last pushed about 1 year ago
Metadata Files
Readme License Citation

README.md

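citation

A tool that extracts cited ISBNs from Wikipedia dump files.

Overview

  • Extracts cited ISBNs from Japanese Wikipedia dumps
  • Saves the extracted data as line-delimited JSON
  • Tolerates a reasonable amount of notational variation in how ISBNs are written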
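As a rough illustration of how notational variation can be absorbed, the sketch below collapses several spellings of the same ISBN into one string by removing hyphens and spaces and upper-casing a trailing `x`. The function name `normalize_isbn_text` is hypothetical and is not part of the tool; it only mirrors the three string replacements that `citation.py` applies before validation.

```python
def normalize_isbn_text(raw: str) -> str:
    """Remove hyphens and spaces and upper-case a lower-case check character."""
    return raw.replace("-", "").replace(" ", "").replace("x", "X")


# Three spellings of the same ISBN collapse to a single normalized form.
variants = ["4-7722-1227-2", "4 7722 1227 2", "4772212272"]
normalized = {normalize_isbn_text(v) for v in variants}
print(normalized)
```

Normalization alone does not prove that a string is an ISBN; the tool additionally checks the check digit with `isbnlib`.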

Installing dependencies

```bash
poetry install
```

Command line

```bash
Usage: citation.py [OPTIONS] INPUTFILENAME EXPORTFILENAME

Options:
  --show-exclusion / --no-show-exclusion
                                  除外した項目を表示する
  --help                          Show this message and exit.
```

The help text for `--show-exclusion` is printed in Japanese and means "show excluded items".

```bash
wget https://dumps.wikimedia.org/jawiki/20190420/jawiki-20190420-pages-articles-multistream.xml.bz2
poetry run python citation.py jawiki-20190420-pages-articles-multistream.xml.bz2 citation-jawiki-20190420.jsonl
```

Extracted data

```json
{
  "isbn": "4772212272",
  "raw": "4-7722-1227-2",
  "title": "地理学",
  "score": 2.9,
  "h1": "参考文献",
  "h2": null,
  "is_ref": true
}
```

| Field  | Type        | Description |
|--------|-------------|-------------|
| isbn   | String      | Normalized ISBN (ISBN-10) |
| raw    | String      | The ISBN as originally written, before parsing |
| title  | String      | Wikipedia page title |
| score  | Number      | Confidence that the ISBN is correct, computed with the tool's own heuristic (a low score may indicate a false detection) |
| h1     | String/null | Level-1 heading (`==` in the wikitext) |
| h2     | String/null | Level-2 heading (`===` in the wikitext) |
| is_ref | Boolean     | Whether the ISBN is explicitly marked as a citation (false for lists of works and similar) |
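Because low scores may indicate false detections, a consumer will probably want to filter records before using them. The following is a minimal sketch of reading the line-delimited JSON and keeping only high-scoring citations. The helper `load_records`, the threshold `2.0`, and the second sample record are assumptions made for illustration; the tool itself does not prescribe a cutoff.

```python
import io
import json

# Sample input in the documented shape. The second record is made up.
sample = io.StringIO(
    '{"isbn": "4772212272", "raw": "4-7722-1227-2", "title": "地理学", '
    '"score": 2.9, "h1": "参考文献", "h2": null, "is_ref": true}\n'
    '{"isbn": "4772212272", "raw": "4772212272", "title": "Example page", '
    '"score": 1.5, "h1": null, "h2": null, "is_ref": false}\n'
)


def load_records(fp, min_score=2.0, refs_only=True):
    """Yield records whose score is at least min_score (hypothetical threshold)."""
    for line in fp:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        rec = json.loads(line)
        if rec["score"] >= min_score and (rec["is_ref"] or not refs_only):
            yield rec


records = list(load_records(sample))
print([r["title"] for r in records])
```

With a real output file, the `io.StringIO` object would be replaced by `open("citation-jawiki-20190420.jsonl", encoding="utf-8")`.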

Downloading processed data

| Dump | Processed data | Records |
|------|----------------|--------:|
| jawiki-20190420-pages-articles-multistream.xml.bz2 | citation-jawiki-20190420.jsonl | 672,155 |
| jawiki-20190601-pages-articles-multistream.xml.bz2 | citation-jawiki-20190601.jsonl | 679,440 |
| jawiki-20190801-pages-articles-multistream.xml.bz2 | citation-jawiki-20190801.jsonl | 688,393 |
| jawiki-20191220-pages-articles-multistream.xml.bz2 | citation-jawiki-20191220.jsonl | 714,273 |
| jawiki-20200301-pages-articles-multistream.xml.bz2 | citation-jawiki-20200301.jsonl | 728,278 |
| jawiki-20200801-pages-articles-multistream.xml.bz2 | citation-jawiki-20200801.jsonl | 763,007 |
| jawiki-20201201-pages-articles-multistream.xml.bz2 | citation-jawiki-20201201.jsonl | 788,068 |
| jawiki-20210620-pages-articles-multistream.xml.bz2 | citation-jawiki-20210620.jsonl | 839,059 |
| jawiki-20210920-pages-articles-multistream.xml.bz2 | citation-jawiki-20210920.jsonl | 864,341 |
| jawiki-20211120-pages-articles-multistream.xml.bz2 | citation-jawiki-20211120.jsonl | 880,591 |
| enwiki-20211120-pages-articles-multistream.xml.bz2 | citation-enwiki-20211120.jsonl | 5,116,149 |
| jawiki-20221220-pages-articles-multistream.xml.bz2 | citation-jawiki-20221220.jsonl | 970,869 |
| enwiki-20221220-pages-articles-multistream.xml.bz2 | citation-enwiki-20221220.jsonl | 6,064,901 |
| jawiki-20240401-pages-articles-multistream.xml.bz2 | citation-jawiki-20240401.jsonl | 1,073,563 |
| enwiki-20240401-pages-articles-multistream.xml.bz2 | citation-enwiki-20240401.jsonl | 7,023,140 |
| jawiki-20241201-pages-articles-multistream.xml.bz2 | citation-jawiki-20241201.jsonl | 1,130,854 |
| enwiki-20241201-pages-articles-multistream.xml.bz2 | citation-enwiki-20241201.jsonl | 8,669,996 |
| jawiki-20250601-pages-articles-multistream.xml.bz2 | citation-jawiki-20250601.jsonl | 1,175,404 |
| enwiki-20250601-pages-articles-multistream.xml.bz2 | citation-enwiki-20250601.jsonl | 9,212,634 |
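One plausible use of a processed file is a reverse lookup from an ISBN to the pages that mention it. The sketch below builds such an index in memory. The function `build_isbn_index` and the second sample line are hypothetical, and for the larger enwiki files an on-disk store may be a better fit than a Python dictionary.

```python
import json
from collections import defaultdict


def build_isbn_index(lines):
    """Map each ISBN to the set of page titles that mention it."""
    index = defaultdict(set)
    for line in lines:
        if line.strip():
            rec = json.loads(line)
            index[rec["isbn"]].add(rec["title"])
    return index


# Abbreviated sample lines; real records carry more fields.
sample_lines = [
    '{"isbn": "4772212272", "title": "地理学", "score": 2.9}',
    '{"isbn": "4772212272", "title": "Hypothetical page", "score": 1.5}',
]
index = build_isbn_index(sample_lines)
print(sorted(index["4772212272"]))
```

Any iterable of lines works as input, so an open file handle for one of the `.jsonl` files listed above can be passed directly.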

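Notes

  • Because detection relies on the check digit matching, strings that are not ISBNs may occasionally be misidentified as ISBNs. This is tolerated because it causes no problems when the goal is to look up the articles that reference a given ISBN.
  • ISBNs with an incorrect check digit are not extracted.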
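To make the check-digit behavior concrete, here is a small sketch of the standard ISBN-10 check: the ten characters are weighted 10 down to 1, and the weighted sum must be divisible by 11. The function `isbn10_check_ok` is written for illustration only; the tool delegates validation to `isbnlib`.

```python
def isbn10_check_ok(isbn10: str) -> bool:
    """Return True when the ISBN-10 weighted sum is divisible by 11."""
    if len(isbn10) != 10:
        return False
    total = 0
    for position, ch in enumerate(isbn10):
        if ch in "Xx" and position == 9:
            value = 10  # "X" stands for 10 and is only valid as the check character
        elif ch in "0123456789":
            value = int(ch)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


print(isbn10_check_ok("4772212272"))  # True: weighted sum is 220 = 11 * 20
print(isbn10_check_ok("4772212273"))  # False: weighted sum is 221
```

Since the test is a single modulo-11 condition, roughly one in eleven arbitrary ten-digit strings passes it, which is why unrelated numbers can occasionally be picked up.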

Owner

  • Name: CALIL Inc.
  • Login: CALIL
  • Kind: organization
  • Email: contact@calil.jp
  • Location: Gifu, JP

The more fun the library.

Citation (citation.py)

__title__ = 'Wikipedia Citation Extractor'
__copyright__ = "Copyright (C) 2023 CALIL Inc."
__author__ = "Ryuuji Yoshimoto <ryuuji@calil.jp>"

import re
import bz2
import json
import isbnlib
import click
from halo import Halo


@click.command()
@click.argument('input_filename', type=click.Path(exists=True))
@click.argument('export_filename', type=click.Path(exists=False))
@click.option('--show-exclusion/--no-show-exclusion', default=False, help='除外した項目を表示する')  # help: "Show excluded items"
def extract_citation(input_filename, export_filename, show_exclusion):
    """
    Extract cited ISBNs from a Wikipedia dump file.
    :param input_filename: dump file (bz2-compressed XML)
    :param export_filename: output file (line-delimited JSON)
    :param show_exclusion: whether to print excluded items
    :return:
    """
    click.echo('| extract_citation')
    click.echo('| 処理するファイル:' + click.format_filename(input_filename))  # "File to process:"
    click.echo('| 出力するファイル:' + click.format_filename(export_filename))  # "Output file:"

    title = None
    topic1 = None
    topic2 = None
    isbn_regex = re.compile(r"((?:ISBN10 |ISBN13 |ISBN |isbn=|ISBN  |isbn = |ISBN-10 |ISBN-13 |ISBN:|ISBN-|ISBN |ISBN)?)([0-9][0-9\- ]{8,20}[0-9Xx])")
    topic_regex = re.compile("([=]{2,3})([^=]+)(.*)")
    pages = 0
    count_isbn = 0
    count_error = 0
    with Halo(text='Loading', spinner='dots') as spinner:
        with open(export_filename, 'w', encoding='utf-8') as f:
            for line in bz2.open(input_filename, 'rt', encoding='utf-8'):
                if line == "\n":  # Optimize
                    continue
                if line == "  <page>\n":
                    title = None
                    topic1 = None
                    topic2 = None
                    pages += 1
                    if pages % 300 == 0:
                        spinner.text = str(pages)
                elif not title:
                    for title_ in re.findall(u"<title>([^<]*)</title>", line):
                        title = title_
                else:
                    # Look for section headings
                    if line.find("==") != -1:
                        ret = topic_regex.findall(line)
                        if len(ret) == 1:
                            if len(ret[0][0]) == 2:
                                topic1 = ret[0][1].strip()
                                topic2 = None
                            if len(ret[0][0]) == 3:
                                topic2 = ret[0][1].strip()

                    # Skip lines that contain no ISBN marker in any capitalization
                    if line.find("ISBN") == -1 and line.find("isbn") == -1 and line.find("Isbn") == -1:
                        continue

                    for ret in isbn_regex.findall(line):
                        score = 0.0
                        _isbn = ret[1].replace('-', '')
                        _isbn = _isbn.replace(' ', '')
                        _isbn = _isbn.replace('x', 'X')

                        if ret[0] == 'ISBN-' and len(_isbn) == 12 and _isbn[0:2] == '10' and _isbn[2] == '4':
                            _isbn = _isbn[2:12]
                        _isbn_pattern = "?"
                        _isbn_length = len(_isbn)

                        if len(ret[0]) > 0:  # Trust the match more when an ISBN marker precedes it
                            score += 0.9

                        if _isbn_length == 16:
                            # "978" was written twice: drop the first copy and validate the remaining 13 digits
                            if _isbn.find('978978') == 0 and isbnlib.is_isbn13(_isbn[3:16]):
                                _isbn_pattern = "I13(978978Cut)"
                                score += 0.5
                                _isbn = _isbn[3:16]
                        if _isbn_length == 10:
                            if isbnlib.is_isbn10(_isbn):
                                _isbn_pattern = "I10"
                                score += 0.5
                                if re.search("^4", _isbn):
                                    score += 1.0
                                if re.search("[X]$", _isbn):
                                    score += 0.5
                            elif _isbn.find("X") == -1 and isbnlib.is_isbn13("978" + _isbn):
                                _isbn_pattern = "I13(978+)"
                                _isbn = "978" + _isbn
                                score += 1.0
                        elif _isbn_length == 13:
                            if _isbn.find('491') == 0:
                                _isbn_pattern = u"雑誌コード"  # "magazine code", not an ISBN
                                score = -1
                            elif (_isbn.find('978') == 0 or _isbn.find('977') == 0) and _isbn.find(
                                    "X") == -1 and isbnlib.is_isbn13(_isbn):
                                _isbn_pattern = "I13"
                                score += 1.0
                            elif isbnlib.is_isbn10(_isbn[3:]):  # Drop the first 3 characters when the rest is a valid ISBN-10
                                _isbn = _isbn[3:]
                                _isbn_pattern = "I10(978-)"
                                score += 0.5
                        elif _isbn_length == 11 and _isbn[0] == '8' and isbnlib.is_isbn13("97" + _isbn):  # Restore a missing leading "97" to get 13 digits
                            _isbn = "97" + _isbn
                            _isbn_pattern = "I13(97+)"
                            score += 0.5
                        elif _isbn_length > 13 and _isbn.find("978") == 0 and _isbn.find("X") == -1 and isbnlib.is_isbn13(
                                _isbn[0:13]):
                            _isbn_pattern = "I13(Cut13)"
                            _isbn = _isbn[0:13]
                            score += 0.5
                        elif _isbn_length > 13 and _isbn.find("978") == 0 and isbnlib.is_isbn10(_isbn[3:13]):
                            _isbn = _isbn[3:13]
                            _isbn_pattern = "I10(Cut13_978-)"
                            score += 0.5
                        elif _isbn_length > 10 and isbnlib.is_isbn10(_isbn[0:10]):
                            _isbn = _isbn[0:10]
                            _isbn_pattern = "I10(Cut10)"
                            score += 0.5
                        elif _isbn_length > 10 and isbnlib.is_isbn13("978" + _isbn[0:10]):
                            _isbn = "978" + _isbn[0:10]
                            _isbn_pattern = "I13(Cut10_978+)"
                            score += 0.5
                        elif _isbn_length == 9 and isbnlib.is_isbn10("4" + _isbn):
                            _isbn = '4' + _isbn
                            _isbn_pattern = "I10(4+)"
                            score += 0.5
                        if score >= 1.0:
                            count_isbn += 1
                            if line.find("&lt;ref") != -1 or line.find("{Cite book") != -1:
                                is_ref = True
                                score += 0.5
                            else:
                                is_ref = False
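                            if topic1:
                                # Lists of works ("作品リスト", "作品") are not citations
                                if topic1 in ["作品リスト", "作品"]:
                                    is_ref = False
                                    score -= 0.5
                                # Headings that mark a reference section on Japanese Wikipedia,
                                # e.g. 参考文献 "References", 脚注 "Footnotes", 出典 "Sources"
                                if (topic1 in ["典拠・資料", "脚注", "脚注および参考文献", "参考図書", "主な文献", "参照資料",
                                               "関連図書", "参考書籍", "参考文献", "参考資料",
                                               "関連書籍", "文献", "出典", "参照文献"]) or topic1.find("関連文献") == 0:
                                    is_ref = True
                                    score += 0.5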
                            item = {'isbn': isbnlib.to_isbn10(_isbn),
                                    'raw': ret[1].strip(),
                                    'title': title,
                                    'score': score,
                                    'h1': topic1,
                                    'h2': topic2,
                                    'is_ref': is_ref}
                            f.write(json.dumps(item, ensure_ascii=False) + '\n')
                        else:
                            count_error += 1
                            if show_exclusion and len(ret[0]) > 0:
                                click.echo("\n" + " ".join([_isbn_pattern, ret[0], _isbn, title, str(score)]))

    click.echo("count_pages:" + str(pages))
    click.echo("count_isbn:" + str(count_isbn))
    click.echo("count_error:" + str(count_error))
    click.secho('処理が完了しました', fg='green')  # "Processing complete"


if __name__ == '__main__':
    extract_citation()
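
To see what the candidate-matching step in the script captures, the sketch below runs the same regular expression on a few made-up wikitext lines, without any dump file or `isbnlib`. The helper `find_candidates` and the sample lines are hypothetical; the pattern and the 0.9 marker bonus are copied from `citation.py` above.

```python
import re

# Same pattern as in citation.py: an optional marker, then the digit run.
isbn_regex = re.compile(
    r"((?:ISBN10 |ISBN13 |ISBN |isbn=|ISBN  |isbn = |ISBN-10 |ISBN-13 |ISBN:|ISBN-|ISBN |ISBN)?)"
    r"([0-9][0-9\- ]{8,20}[0-9Xx])"
)


def find_candidates(line):
    """Return (marker, raw, marker_bonus) for every candidate on the line."""
    found = []
    for marker, raw in isbn_regex.findall(line):
        bonus = 0.9 if marker else 0.0  # the script adds 0.9 when a marker is present
        found.append((marker, raw, bonus))
    return found


print(find_candidates("* {{Cite book|title=Example|isbn=4-7722-1227-2}}"))
print(find_candidates("ISBN 978-4-7722-1227-4"))
print(find_candidates("4-7722-1227-2"))
```

The marker is optional in the pattern, so a bare digit run is also returned as a candidate; in the script it is the later check-digit validation and scoring that decide whether such a candidate is kept.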

GitHub Events

Total
  • Watch event: 1
  • Push event: 2
Last Year
  • Watch event: 1
  • Push event: 2

Dependencies

.github/workflows/codeql-analysis.yml actions
  • actions/checkout v2 composite
  • github/codeql-action/analyze v1 composite
  • github/codeql-action/autobuild v1 composite
  • github/codeql-action/init v1 composite
poetry.lock pypi
  • click 8.1.3
  • colorama 0.4.6
  • halo 0.0.31
  • isbnlib 3.10.12
  • log-symbols 0.0.14
  • six 1.16.0
  • spinners 0.0.24
  • termcolor 2.1.1
pyproject.toml pypi
  • click *
  • colorama *
  • halo *
  • isbnlib *
  • python ^3.10