https://github.com/chaoss/grimoirelab-elk

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    2 of 50 committers (4.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (13.0%) to scientific vocabulary

Keywords

data-enrichment elasticsearch grimoirelab hacktoberfest software-analytics

Keywords from Contributors

orchestration community chaoss project-governance mentorship handbook projection diversity-measurement archival sequences
Last synced: 5 months ago · JSON representation

Repository

Basic Info
  • Host: GitHub
  • Owner: chaoss
  • License: gpl-3.0
  • Language: Python
  • Default Branch: main
  • Homepage:
  • Size: 6.49 MB
Statistics
  • Stars: 60
  • Watchers: 15
  • Forks: 122
  • Open Issues: 53
  • Releases: 0
Topics
data-enrichment elasticsearch grimoirelab hacktoberfest software-analytics
Created over 10 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog License Authors

README.md

Welcome to GrimoireELK

GrimoireELK is the component of GrimoireLab that interacts with the ElasticSearch database. Its goal is two-fold: first, it offers a convenient way to store the data coming from Perceval; second, it processes and enriches the data into a format that can be consumed by Kibiter.

The Perceval data is stored in ElasticSearch indexes as raw documents (one per item extracted by Perceval). Those raw documents, which will be referred to as "raw data" in this documentation, include all the information coming from the original data source, which allows the platform to perform multiple analyses without downloading the same data over and over again. Once raw data is retrieved, a new phase starts in which the data is enriched according to the data source it was collected from and stored in ElasticSearch indexes. The enrichment removes information not needed by Kibiter and adds information that is not directly available within the raw data, for instance pair programming information for Git data, time to solve (i.e., close or merge) issues and pull requests for GitHub data, and identities and organization information coming from SortingHat. The enriched data is stored as JSON documents, which embed information linked to the corresponding raw documents to ease debugging and guarantee traceability.
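To make the idea of enrichment more concrete, the sketch below derives a "time to close" value from two timestamps of a raw issue. It is only an illustration: the function, the `created_at`/`closed_at` keys of the sample item and the `time_to_close_days` output name are assumptions made for this example, not GrimoireELK's actual code.

```python
from datetime import datetime


def time_to_close_days(created_at, closed_at):
    """Days elapsed between creation and closing; None while the item is open."""
    if closed_at is None:
        return None
    opened = datetime.fromisoformat(created_at)
    closed = datetime.fromisoformat(closed_at)
    return round((closed - opened).total_seconds() / 86400, 2)


# Made-up raw issue; only the two timestamps matter for this sketch.
raw_issue = {
    "created_at": "2024-01-01T00:00:00+00:00",
    "closed_at": "2024-01-03T12:00:00+00:00",
}

# The enriched document keeps the derived value instead of the full raw payload.
enriched_issue = {
    "time_to_close_days": time_to_close_days(
        raw_issue["created_at"], raw_issue["closed_at"]
    ),
}
```

Here the derived value is 2.5 days; an issue that is still open would get `None` under this sketch.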

Raw data

Each raw document stored in an ElasticSearch index contains a set of common first-level fields, regardless of the data source:

  • backend (string): Name of the Perceval backend used to retrieve the information.
  • backend_version (string): Version of the aforementioned backend.
  • perceval_version (string): Perceval version.
  • timestamp (long): When the item was retrieved by Perceval (in epoch format).
  • origin (string): Where the item was retrieved from.
  • uuid (string): Item unique identifier.
  • updated_on (long): When the item was updated in the original source (in epoch format).
  • classified_fields_filtered (list): List of data field names (strings) that contained classified information and were removed from the original item. Depends on activating the ‘--filter-classified’ flag in Perceval.
  • category (string): Type of the items to fetch (commit, pull request, etc.), depending on the data source.
  • tag (string): Custom label that can be set in Perceval for each retrieval.
  • data (object): Copy, in JSON format, of the original data as it is retrieved from the data source.

The next sections describe where GrimoireLab gets this information from.
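As a rough picture of this common envelope, a raw item can be thought of as a Python dictionary. The sketch below uses the field names as listed above; all values are made up, and `missing_common_fields` is a hypothetical helper (not part of Perceval or GrimoireELK) that only checks which first-level fields are absent.

```python
# First-level fields listed above as common to every raw document.
COMMON_RAW_FIELDS = (
    "backend", "backend_version", "perceval_version", "timestamp",
    "origin", "uuid", "updated_on", "classified_fields_filtered",
    "category", "tag", "data",
)


def missing_common_fields(raw_item):
    """Return the common fields absent from a raw item (hypothetical helper)."""
    return [name for name in COMMON_RAW_FIELDS if name not in raw_item]


# Made-up example item; every value is illustrative only.
example_raw_item = {
    "backend": "Git",
    "backend_version": "0.0.0",
    "perceval_version": "0.0.0",
    "timestamp": 1600000000.0,           # epoch: when Perceval retrieved the item
    "origin": "https://example.org/repo.git",
    "uuid": "0123456789abcdef",
    "updated_on": 1599990000.0,          # epoch: last update in the source
    "classified_fields_filtered": None,  # nothing was filtered in this example
    "category": "commit",
    "tag": "https://example.org/repo.git",
    "data": {"message": "Example commit"},
}

missing = missing_common_fields(example_raw_item)
```

For the example item `missing` is an empty list; a dictionary lacking, say, `uuid` would have that name reported instead.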

Enriched data

Each enriched index includes one or more types of documents, which are summarized below.

  • Askbot: each document can be either a question, an answer or answer's comments.
  • Bugzilla: each document corresponds to a single issue (fetched using CGI calls).
  • Bugzillarest: each document corresponds to a single issue (fetched using Bugzilla REST API).
  • Cocom: each document corresponds to a single file in a commit, with code complexity information.
  • Colic: each document corresponds to a single file in a commit, with license information.
  • Confluence: each document can be either a new page, a page edit, a comment or an attachment.
  • Crates: each document corresponds to an event.
  • Discourse: each document can be either a question or an answer.
  • Dockerhub: each document corresponds to an image.
  • Finosmeetings: each document corresponds to details about a meeting.
  • Functest: each document corresponds to details about a test.
  • Gerrit: each document can be either a changeset, a comment, a patchset or a patchset approval.
  • Git: each document corresponds to a single commit.
  • Git Areas of Code: each document corresponds to one single file.
  • GitHub issues: each document corresponds to an issue.
  • GitHub pull requests: each document corresponds to a pull request.
  • GitHub repo statistics: each document includes repo statistics (e.g., forks, watchers).
  • GitLab issues: each document corresponds to an issue.
  • GitLab merge requests: each document corresponds to a merge request.
  • Gitter: each document corresponds to a message.
  • Googlehits: each document contains hits information derived from Google.
  • Groupsio: each document corresponds to a message.
  • Hyperkitty: each document corresponds to a message.
  • Jenkins: each document corresponds to a single build.
  • Jira: each document corresponds to an issue or a comment. To simplify counting user activities, issues are duplicated and they can include assignee, reporter and creator data respectively.
  • Kitsune: each document can be either a question or an answer.
  • Launchpad: each document corresponds to a bug.
  • Mattermost: each document corresponds to a message.
  • Mbox: each document corresponds to a message.
  • Mediawiki: each document corresponds to a review.
  • Meetup: each document can be either an event, an RSVP or a comment.
  • Mozillaclub: each document includes event information.
  • Nntp: each document corresponds to a message.
  • Onion Study/Community Structure: each document corresponds to an author in a specific quarter, split by organization and project. That is, there is one entry for the author’s overall contributions in a given quarter, one entry for the author in each of the projects they contributed to in that quarter, and one entry for the author in each of the organizations they are affiliated with in that quarter. This way, the results of the onion analysis are stored overall, by project and by organization.
  • Pagure: each document corresponds to an issue.
  • Phabricator: each document corresponds to a task.
  • Pipermail: each document corresponds to a message.
  • Puppetforge: each document corresponds to a module.
  • Rocketchat: each document corresponds to a message.
  • Redmine: each document corresponds to an issue.
  • Remo activities: each document corresponds to an activity.
  • Remo events: each document corresponds to an event.
  • Remo users: each document corresponds to a user.
  • Rss: each document corresponds to an entry.
  • Slack: each document corresponds to a message.
  • Stackexchange: each document can be either a question or an answer.
  • Supybot: each document corresponds to a message.
  • Telegram: each document corresponds to a message.
  • Twitter: each document corresponds to a tweet.

Fields

Each enriched document contains a set of fields, which can be (i) common to all data sources (e.g., metadata fields, time field), (ii) specific to the data source, (iii) related to the contributor’s profile information (i.e., identity fields) or (iv) related to the project listed in the Mordred projects.json (i.e., project fields).

Metadata fields

  • metadata__timestamp (date): Date when the item was retrieved from the original data source and stored in the index with raw documents.
  • metadata__updated_on (date): Date when the item was updated in its original data source.
  • metadata__enriched_on (date): Date when the item was enriched and stored in the index with enriched documents.
  • metadata__gelk_backend_name (string): Name of the backend used to enrich information.
  • metadata__gelk_version (string): Version of the backend used to enrich information.
  • origin (string): Original URL where the repository was retrieved from.
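The relation between the raw epoch fields and the metadata dates can be sketched as follows. This is a minimal illustration under stated assumptions: `build_metadata_fields` is a hypothetical helper, the field names are written with their underscores (e.g. `metadata__updated_on`), and it assumes the raw `timestamp` and `updated_on` values are epoch seconds that map directly onto the corresponding metadata dates.

```python
from datetime import datetime, timezone


def epoch_to_iso(epoch_seconds):
    """Convert an epoch value in seconds to an ISO 8601 string in UTC."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc).isoformat()


def build_metadata_fields(raw_item, enriched_at, gelk_backend_name, gelk_version):
    """Hypothetical helper: derive the metadata fields from a raw item."""
    return {
        "metadata__timestamp": epoch_to_iso(raw_item["timestamp"]),
        "metadata__updated_on": epoch_to_iso(raw_item["updated_on"]),
        "metadata__enriched_on": enriched_at.isoformat(),
        "metadata__gelk_backend_name": gelk_backend_name,
        "metadata__gelk_version": gelk_version,
        "origin": raw_item["origin"],
    }


# Made-up raw item reduced to the fields this sketch needs.
raw_item = {
    "timestamp": 1600000000,
    "updated_on": 0,
    "origin": "https://example.org/repo.git",
}

metadata = build_metadata_fields(
    raw_item,
    enriched_at=datetime(2024, 1, 1, tzinfo=timezone.utc),
    gelk_backend_name="ExampleEnrich",  # placeholder name
    gelk_version="0.0.0",               # placeholder version
)
```

Keeping all dates in UTC and in a single string format is a choice of this sketch so that values from different data sources stay comparable.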

Identity fields

  • author_uuid (string): Author profile unique identifier. Used for counting authors and cross-referencing data among data sources in ElasticSearch and between ElasticSearch, SortingHat and Hatstall.
  • author_org_name (string): Name of the organization the author is affiliated with. The same author can have different affiliations over non-overlapping time periods. Used for aggregating contributors and contributions by organization.
  • author_name (string): Similar to author_uuid, but less useful for unique counts, as different profiles could share the same name. Nevertheless, it is more appropriate to show this field when aggregating data by author, as it is usually nicer to see a name than a hash value.
  • author_bot (boolean): True if the given author is identified as a bot.
  • author_domain (string): Domain associated to the author in SortingHat profile.
  • author_id (string): Author identifier. This id comes from SortingHat and identifies each different identity provided by SortingHat. These identifiers are grouped under a single author_uuid, so this field is not commonly used unless data needs to be debugged.
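The difference between counting by author_uuid and by author_name can be seen in a small sketch over in-memory documents. The sample documents and the `count_authors` helper are hypothetical; a real deployment would typically run an equivalent aggregation in ElasticSearch.

```python
def count_authors(enriched_docs, include_bots=False):
    """Hypothetical helper: distinct authors by author_uuid and by author_name."""
    docs = [
        doc for doc in enriched_docs
        if include_bots or not doc.get("author_bot", False)
    ]
    by_uuid = {doc["author_uuid"] for doc in docs}
    by_name = {doc["author_name"] for doc in docs}
    return len(by_uuid), len(by_name)


# Made-up documents: two different profiles share the same display name.
docs = [
    {"author_uuid": "a1", "author_name": "Alex Doe", "author_bot": False},
    {"author_uuid": "b2", "author_name": "Alex Doe", "author_bot": False},
    {"author_uuid": "a1", "author_name": "Alex Doe", "author_bot": False},
    {"author_uuid": "c3", "author_name": "ci-bot", "author_bot": True},
]

unique_by_uuid, unique_by_name = count_authors(docs)
```

With these documents the count by author_uuid is 2 while the count by author_name is 1, which is why the identifier is preferred for unique counts and the name for display.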

Project fields

  • project (string): Project name as defined in the JSON file where repositories are grouped by project.
  • project_1 (string): Project (if more than one level is allowed in project hierarchy).

Time field:

  • grimoire_creation_date (date): Date when the item was created upstream. Used by default to represent data in time series on the dashboards.

Demography fields:

  • author_max_date (date): Date of the most recent commit made by this author.
  • author_min_date (date): Date of the first commit made by this author.
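One way to picture how these two dates could be obtained is to scan an author's commits and keep the earliest and latest dates. The sketch below is an assumption-laden illustration, not the actual study code: `author_date_ranges` is a hypothetical helper, and it assumes each commit document carries author_uuid and a grimoire_creation_date stored as an ISO 8601 string.

```python
from datetime import datetime


def author_date_ranges(commits):
    """Hypothetical helper: first and last commit date per author_uuid."""
    ranges = {}
    for commit in commits:
        uuid = commit["author_uuid"]
        date = datetime.fromisoformat(commit["grimoire_creation_date"])
        first, last = ranges.get(uuid, (date, date))
        ranges[uuid] = (min(first, date), max(last, date))
    return {
        uuid: {
            "author_min_date": first.isoformat(),
            "author_max_date": last.isoformat(),
        }
        for uuid, (first, last) in ranges.items()
    }


# Made-up commit documents reduced to the two fields used here.
commits = [
    {"author_uuid": "a1", "grimoire_creation_date": "2021-03-10T09:00:00+00:00"},
    {"author_uuid": "a1", "grimoire_creation_date": "2020-01-05T14:30:00+00:00"},
    {"author_uuid": "b2", "grimoire_creation_date": "2022-07-01T08:15:00+00:00"},
]

ranges = author_date_ranges(commits)
```

An author with a single commit ends up with identical minimum and maximum dates, as happens for `b2` in this example.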

Extra fields:

  • extra_ (anything): Extra fields added using the enrich_extra_data study.

Data source specific fields

Details of the fields of each data source are available in the Schema folder.

Installation

There are several ways to install GrimoireELK on your system: packages or source code using Poetry or pip.

PyPI

GrimoireELK can be installed using pip, a tool for installing Python packages. To do so, run the following command:

$ pip install grimoire-elk

Source code

To install from the source code you will need to clone the repository first:

$ git clone https://github.com/chaoss/grimoirelab-elk
$ cd grimoirelab-elk

Then use pip or Poetry to install the package along with its dependencies.

Pip

To install the package from the local directory, run the following command:

$ pip install .

In case you are a developer, you should install GrimoireELK in editable mode:

$ pip install -e .

Poetry

We use Poetry for dependency management and packaging. You can install it following its documentation. Once you have installed it, you can install GrimoireELK and its dependencies in a project-isolated environment using:

$ poetry install

To spawn a new shell within the virtual environment, use:

$ poetry shell

Running tests

Tests are located in the tests folder. In order to run them, you need to have instances (or Docker containers) of ElasticSearch and MySQL running on your machine.

Then you need to:

  • update the tests.conf file:
    • in case your ElasticSearch instance isn't available at http://localhost:9200 (for example, if you are using the secure edition of ElasticSearch, it will be located at https://admin:admin@localhost:9200)
    • in case you are using non-default credentials for your SortingHat database, you will need to include the [Database] section of the file with both user and password parameters
  • create the databases test_sh and test_projects in your MySQL instance (e.g., mysql -u root -e "create database test_sh"; if you are running MySQL in a container, use docker exec -i <container id> mysql -u root -e "create database test_sh")
  • populate the database test_projects with the SQL file test_projects.sql (e.g., mysql -u root test_projects < tests/test_projects.sql)

The full battery of tests can be executed with run_tests.py. However, it is also possible to execute a subset of tests by running individual test files (test_* files in the tests folder).

The tests can be run in combination with the Python package coverage. The steps below show how to do it:

$ pip3 install coveralls
$ cd <path-to-ELK>/tests
$ python3 -m coverage run --source=grimoire_elk run_tests.py


Coverage will generate a file .coverage in the tests folder, which can be inspected with the following commands:

$ cd <path-to-ELK>/tests
$ python3 -m coverage report -m


The output will be similar to the following one:

```buildoutcfg
Name                                Stmts   Miss  Cover   Missing
------------------------------------------------------------------
.../ELK/grimoire_elk/__init__.py        4      0   100%
.../ELK/grimoire_elk/_version.py        1      0   100%
```

Owner

  • Name: CHAOSS
  • Login: chaoss
  • Kind: organization

GitHub Events

Total
  • Create event: 29
  • Release event: 16
  • Issues event: 3
  • Watch event: 2
  • Delete event: 6
  • Issue comment event: 11
  • Push event: 32
  • Pull request review event: 7
  • Pull request event: 32
  • Fork event: 4
Last Year
  • Create event: 29
  • Release event: 16
  • Issues event: 3
  • Watch event: 2
  • Delete event: 6
  • Issue comment event: 11
  • Push event: 32
  • Pull request review event: 7
  • Pull request event: 32
  • Fork event: 4

Committers

Last synced: 8 months ago

All Time
  • Total Commits: 3,086
  • Total Committers: 50
  • Avg Commits per committer: 61.72
  • Development Distribution Score (DDS): 0.553
Past Year
  • Commits: 92
  • Committers: 5
  • Avg Commits per committer: 18.4
  • Development Distribution Score (DDS): 0.424
Top Committers
Name Email Commits
Alvaro del Castillo a****s@b****m 1,379
Valerio Cosentino v****s@b****m 883
Santiago Dueñas s****s@b****m 269
Quan Zhou q****n@b****m 144
Jose Javier Merchante j****e@b****m 120
alpgarcia a****a@b****m 54
Luis Cañas Díaz l****s@b****m 51
Jesus M. Gonzalez-Barahona j****b@g****s 38
Venu Vardhan Reddy Tekula v****u@b****m 16
Miguel Ángel Fernández m****n@b****m 11
dpose d****e@b****m 11
inishchith i****h@g****m 9
Nitish Gupta i****g@g****m 9
dpose d****e@s****t 8
Daniel Izquierdo d****o@b****m 7
dependabot[bot] 4****] 7
Florent Kaisser c****t@l****e 5
Animesh Kumar a****1@g****m 5
Lukasz Gryglicki l****i@o****l 5
sevagenv s****v@g****m 4
Georg J.P. Link l****g@g****m 4
Alberto Martín a****n@b****m 3
Prodromos Polychroniadis p****6@h****m 3
aswanipranjal a****l@g****m 3
Rafael Dulfer r****r@g****m 3
chenqi 5****8@q****m 2
Rashmi K A k****4@g****m 2
snack0verflow f****2@h****n 2
alpgarcia a****a@g****m 2
Willem Jiang w****g@g****m 2
and 20 more...

Issues and Pull Requests

Last synced: 5 months ago

All Time
  • Total issues: 56
  • Total pull requests: 197
  • Average time to close issues: about 2 years
  • Average time to close pull requests: about 1 month
  • Total issue authors: 34
  • Total pull request authors: 26
  • Average comments per issue: 2.34
  • Average comments per pull request: 1.57
  • Merged pull requests: 129
  • Bot issues: 0
  • Bot pull requests: 43
Past Year
  • Issues: 4
  • Pull requests: 45
  • Average time to close issues: 4 days
  • Average time to close pull requests: 13 days
  • Issue authors: 3
  • Pull request authors: 5
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.6
  • Merged pull requests: 24
  • Bot issues: 0
  • Bot pull requests: 15
Top Authors
Issue Authors
  • canasdiaz (11)
  • fioddor (4)
  • k----n (3)
  • jsmanrique (2)
  • valeriocos (2)
  • xiao623 (2)
  • vchrombie (2)
  • zhquan (2)
  • GeorgLink (2)
  • chengshiwen (2)
  • lukaszgryglicki (1)
  • jjmerchante (1)
  • StingRayZA (1)
  • jayvdb (1)
  • aklapper (1)
Pull Request Authors
  • jjmerchante (79)
  • dependabot[bot] (43)
  • zhquan (30)
  • sduenas (9)
  • kaxada (4)
  • Rafaeltheraven (4)
  • shanchenqi (4)
  • VSevagen (3)
  • GeorgLink (2)
  • valeriocos (2)
  • vchrombie (2)
  • mabelbot (1)
  • mafesan (1)
  • heming6666 (1)
  • eyehwan (1)
Top Labels
Issue Labels
bug (12) enhancement (10) documentation (4) question (1) good first issue (1) help wanted (1) enrich (1)
Pull Request Labels
dependencies (43) python (9)

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 2,496 last-month
  • Total docker downloads: 69
  • Total dependent packages: 6
  • Total dependent repositories: 22
  • Total versions: 165
  • Total maintainers: 2
pypi.org: grimoire-elk

GrimoireELK processes and stores software development data to ElasticSearch

  • Versions: 165
  • Dependent Packages: 6
  • Dependent Repositories: 22
  • Downloads: 2,496 Last month
  • Docker Downloads: 69
Rankings
Dependent packages count: 1.3%
Dependent repos count: 3.1%
Docker downloads count: 3.8%
Forks count: 4.3%
Average: 5.6%
Stargazers count: 8.8%
Downloads: 12.0%
Maintainers (2)
Last synced: 6 months ago

Dependencies

poetry.lock pypi
  • coverage 5.5 develop
  • httpretty 0.9.7 develop
  • astroid 2.11.5
  • bandit 1.7.4
  • beautifulsoup4 4.11.1
  • cereslib 0.3.1
  • certifi 2022.5.18.1
  • cffi 1.15.0
  • charset-normalizer 2.0.12
  • cloc 0.2.5
  • colorama 0.4.4
  • colored 1.4.3
  • cryptography 3.4.8
  • dill 0.3.5.1
  • dulwich 0.20.42
  • elasticsearch 6.3.1
  • elasticsearch-dsl 6.3.1
  • execnet 1.9.0
  • feedparser 6.0.10
  • flake8 3.9.2
  • geographiclib 1.52
  • geopy 2.2.0
  • gitdb 4.0.9
  • gitpython 3.1.27
  • graal 0.3.1
  • grimoirelab-toolkit 0.3.0
  • idna 3.3
  • importlib-metadata 4.11.4
  • isort 5.10.1
  • jinja2 3.0.3
  • lazy-object-proxy 1.7.1
  • lizard 1.16.6
  • markupsafe 2.1.1
  • mccabe 0.6.1
  • networkx 2.6.3
  • numpy 1.18.3
  • packaging 21.3
  • pandas 0.25.3
  • patsy 0.5.2
  • pbr 5.9.0
  • perceval 0.19.0
  • perceval-mozilla 0.3.1
  • perceval-opnfv 0.2.1
  • perceval-puppet 0.2.1
  • perceval-weblate 0.2.1
  • platformdirs 2.5.2
  • pycodestyle 2.7.0
  • pycparser 2.21
  • pydot 1.4.2
  • pyflakes 2.3.1
  • pyjwt 2.4.0
  • pylint 2.13.9
  • pymysql 0.9.3
  • pyparsing 3.0.9
  • python-dateutil 2.8.2
  • pytz 2022.1
  • pyyaml 6.0
  • requests 2.26.0
  • scipy 1.6.1
  • sgmllib3k 1.0.0
  • six 1.16.0
  • smmap 5.0.0
  • sortinghat 0.7.20
  • soupsieve 2.3.2.post1
  • sqlalchemy 1.3.24
  • statsmodels 0.13.2
  • stevedore 3.5.0
  • tomli 2.0.1
  • typed-ast 1.5.4
  • typing-extensions 4.2.0
  • urllib3 1.26.5
  • wrapt 1.14.1
  • zipp 3.8.0
pyproject.toml pypi
  • coverage ^5.5 develop
  • flake8 ^3.9.2 develop
  • httpretty ^0.9.6 develop
  • PyMySQL 0.9.3
  • cereslib >=0.3
  • elasticsearch 6.3.1
  • elasticsearch-dsl 6.3.1
  • geopy ^2.0.0
  • graal >=0.3
  • grimoirelab-toolkit >=0.3
  • pandas >=0.22.0,<=0.25.3
  • perceval >=0.19
  • perceval-mozilla >=0.3
  • perceval-opnfv >=0.2
  • perceval-puppet >=0.2
  • perceval-weblate >=0.2
  • python ^3.7
  • requests 2.26.0
  • sortinghat ^0.7.20
  • statsmodels >=0.9.0
  • urllib3 1.26.5
requirements.txt pypi
  • PyMySQL ==0.9.3
  • elasticsearch ==6.3.1
  • elasticsearch-dsl ==6.3.1
  • geopy >=2.0.0
  • numpy <=1.18.3
  • pandas >=0.22.0,<=0.25.3
  • requests ==2.26.0
  • statsmodels >=0.9.0
  • urllib3 ==1.26.5
requirements_tests.txt pypi
  • httpretty >=0.9.6
setup.py pypi
  • PyMySQL ==0.9.3
  • cereslib >=0.1.0
  • elasticsearch ==6.3.1
  • elasticsearch-dsl ==6.3.1
  • geopy >=2.0.0
  • graal >=0.2.2
  • grimoirelab-toolkit >=0.1.4
  • numpy <=1.18.3
  • pandas >=0.22.0,<=0.25.3
  • perceval >=0.9.6
  • perceval-mozilla >=0.1.4
  • perceval-opnfv >=0.1.2
  • perceval-puppet >=0.1.4
  • perceval-weblate >=0.1.0
  • requests ==2.26.0
  • sortinghat >=0.6.2
  • statsmodels *
  • urllib3 ==1.26.5