swh-indexer

GitHub mirror of Metadata indexer

https://github.com/softwareheritage/swh-indexer

Science Score: 26.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (4.4%) to scientific vocabulary

Keywords

software-heritage swh

Repository

Statistics
  • Stars: 15
  • Watchers: 5
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created over 9 years ago · Last pushed 9 months ago
Metadata Files
Readme, License, Code of conduct, Authors, Codemeta

README.rst

Software Heritage - Indexer
===========================

Tools to compute multiple indexes on SWH's raw contents:

- content:

  - mimetype
  - fossology-license
  - metadata

- origin:

  - metadata (intrinsic, using the content indexer; and extrinsic)

An indexer is in charge of:

- looking up objects
- extracting information from those objects
- storing that information in the swh-indexer db

There are multiple indexers working on different object types:

  - content indexer: works with content sha1 hashes
  - revision indexer: works with revision sha1 hashes
  - origin indexer: works with origin identifiers
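
As a rough illustration of what a content identifier looks like, the sketch below hashes raw bytes with the standard library. It assumes the content sha1 is the plain SHA-1 digest of the raw bytes; ``content_sha1`` is a hypothetical helper name, not part of swh-indexer, and revision or origin identifiers are not covered here.

```python
import hashlib


def content_sha1(data: bytes) -> str:
    """Return the hex SHA-1 digest of raw content bytes.

    Hypothetical helper for illustration; assumes the identifier is the
    plain SHA-1 of the bytes, with no header or salt added.
    """
    return hashlib.sha1(data).hexdigest()


# Identifier of an empty file under that assumption.
empty_id = content_sha1(b"")
print(empty_id)
```

A digest like this is what a content indexer would receive in its batch of ids, rather than the content itself.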

Indexation procedure:

- receive a batch of ids
- retrieve the associated data depending on object type
- compute an index for that object
- store the result in swh's storage
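
The four steps above can be sketched as a small loop. This is a minimal sketch under stated assumptions, not the actual swh-indexer classes: ``run_indexer`` and its ``fetch``, ``compute`` and ``store`` parameters are hypothetical names, and the storage is replaced by an in-memory list.

```python
from typing import Callable, Dict, Iterable, List


def run_indexer(
    ids: Iterable[str],
    fetch: Callable[[str], bytes],
    compute: Callable[[bytes], Dict[str, object]],
    store: Callable[[List[Dict[str, object]]], None],
) -> List[Dict[str, object]]:
    """Index a batch of object ids (hypothetical helper for illustration)."""
    results: List[Dict[str, object]] = []
    for obj_id in ids:                    # 1. receive a batch of ids
        data = fetch(obj_id)              # 2. retrieve the associated data
        index = compute(data)             # 3. compute an index for the object
        results.append({"id": obj_id, **index})
    store(results)                        # 4. store the results in one call
    return results


# In-memory stand-ins for the object storage and the indexer storage.
objects = {"id-1": b"hello", "id-2": b""}
stored: List[Dict[str, object]] = []

results = run_indexer(
    ids=["id-1", "id-2"],
    fetch=objects.__getitem__,
    compute=lambda data: {"length": len(data)},
    store=stored.extend,
)
print(results)
```

Passing the three operations as callables keeps the loop independent of the object type, which mirrors the split between content, revision and origin indexers described above.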

Current content indexers:

- mimetype (queue swh_indexer_content_mimetype): detect the encoding
  and mimetype

- fossology-license (queue swh_indexer_fossology_license): compute the
  license

- metadata: translate files from ecosystem-specific formats to JSON-LD
  (using schema.org/CodeMeta vocabulary)
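
To make the metadata translation more concrete, here is a minimal sketch that turns an npm-style ``package.json`` into a CodeMeta-flavoured JSON-LD document. The key mapping and the ``translate_package_json`` function are assumptions made for illustration; the real indexer's mappings are more complete and may differ. The ``@context`` URL is the one used in this repository's own ``codemeta.json``.

```python
import json

# Illustrative mapping from package.json keys to schema.org/CodeMeta terms.
# This table is an assumption for the sketch, not the swh-indexer mapping.
NPM_TO_CODEMETA = {
    "name": "name",
    "version": "version",
    "description": "description",
    "keywords": "keywords",
}

CODEMETA_CONTEXT = (
    "https://raw.githubusercontent.com/codemeta/codemeta/2.0/codemeta.jsonld"
)


def translate_package_json(raw: bytes) -> dict:
    """Translate raw package.json bytes into a small JSON-LD document."""
    package = json.loads(raw)
    doc = {"@context": CODEMETA_CONTEXT, "@type": "SoftwareSourceCode"}
    for source_key, codemeta_term in NPM_TO_CODEMETA.items():
        if source_key in package:
            doc[codemeta_term] = package[source_key]
    return doc


raw = b'{"name": "demo", "version": "1.2.3", "private": true}'
translated = translate_package_json(raw)
print(json.dumps(translated, indent=2))
```

Keys without an entry in the mapping (``private`` in this example) are dropped rather than guessed.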

Current origin indexers:

- metadata: translate files from ecosystem-specific formats to JSON-LD
  (using schema.org/CodeMeta and ForgeFed vocabularies)

Owner

  • Name: Software Heritage
  • Login: SoftwareHeritage
  • Kind: organization
  • Email: info@softwareheritage.org

The Great Library of Source Code

CodeMeta (codemeta.json)

{
  "@context": "https://raw.githubusercontent.com/codemeta/codemeta/2.0/codemeta.jsonld",
  "@type": "SoftwareSourceCode",
  "identifier": "5682a72dc61f86ae69f2841c2184d6159c0b6d5d",
  "description": "Software Heritage Indexer for revisions and contents",
  "name": "swh-indexer",
  "isPartOf": {
    "@type": "SoftwareSourceCode",
    "name": "swh-environment",
    "identifier": "83e766feafde91242883be1bf369ed3e6865824f"
  },
  "codeRepository": "https://forge.softwareheritage.org/diffusion/78/",
  "issueTracker": "https://forge.softwareheritage.org/maniphest/",
  "license": "https://spdx.org/licenses/GPL-3.0.html",
  "version": "0.0.35",
  "author": [
    {
      "@type": "Organization",
      "name": "Software Heritage",
      "url": "https://www.softwareheritage.org",
      "email": "swh-devel@inria.fr"
    }
  ],
  "developmentStatus": "active",
  "keywords": [
    "indexer",
    "software",
    "mimetype",
    "ctags",
    "language",
    "fossology-license",
    "metadata",
    "metadata-detector",
    "metadata-translator"
  ],
  "dateCreated": "2017-06-12",
  "datePublished": "2017-06-12",
  "programmingLanguage": "Python"
}

GitHub Events

Total
  • Delete event: 2
  • Push event: 43
  • Create event: 18
Last Year
  • Delete event: 2
  • Push event: 43
  • Create event: 18

Dependencies

requirements-swh.txt pypi
  • swh.core >=2.14.1
  • swh.journal >=0.1.0
  • swh.model >=0.0.15
  • swh.objstorage >=0.2.2
  • swh.scheduler >=0.5.2
  • swh.storage >=0.22.0
requirements-test.txt pypi
  • confluent-kafka * test
  • hypothesis >=3.11.0 test
  • pytest * test
  • pytest-mock * test
  • swh.scheduler >=0.5.0 test
  • swh.storage >=0.10.0 test
  • types-click * test
  • types-pyyaml * test
requirements.txt pypi
  • click *
  • frozendict *
  • iso8601 *
  • pyld *
  • python-magic >=0.4.13
  • rdflib *
  • sentry-sdk *
  • typing-extensions *
  • xmltodict *
pyproject.toml pypi