https://github.com/bluebrain/dir-content-diff

Simple tool to compare directory contents and get differences using smart comparators.

Science Score: 36.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Committers with academic emails
    3 of 6 committers (50.0%) from academic institutions
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.0%) to scientific vocabulary

Keywords

compare-directories compare-files compare-folders diff dir-compare directory directory-comparator directory-comparison-tool directory-diff file-comparison file-diff file-differences folder-compare folder-comparisation folder-comparison folder-diff python
Last synced: 5 months ago

Repository

Basic Info
Statistics
  • Stars: 7
  • Watchers: 4
  • Forks: 1
  • Open Issues: 4
  • Releases: 17
Created about 4 years ago · Last pushed 6 months ago
Metadata Files
Readme Changelog Contributing License Authors

README.md

Directory Content Difference

This project provides simple tools to compare the content of a directory against a reference directory.

This is useful to check the results of a process that generates several files, such as a luigi workflow.

Installation

This package should be installed using pip:

```bash
pip install dir-content-diff
```

Usage

The dir-content-diff package introduces a framework to compare two directories. A comparator is associated with each file extension, and each file in the reference directory is compared to the file with the same relative path in the compared directory. A few comparators are provided by default for common file types, but others can be associated with new file extensions or can even replace the default ones. A comparator should report the differences between two files accurately, stating which elements of the data differ. When an extension has no associated comparator, a default comparator is used that only compares the raw binary content of the files, so it cannot report which values are different.
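The per-extension dispatch described above can be pictured with a small standalone sketch. It does not use the dir-content-diff API: `compare_bytes`, `compare_text_lines`, `COMPARATORS` and `pick_comparator` are hypothetical names chosen for illustration, and the real package may organize this differently.

```python
from pathlib import Path


def compare_bytes(ref, comp):
    """Fallback comparator: compares raw bytes, so it can only say that the files differ."""
    if Path(ref).read_bytes() == Path(comp).read_bytes():
        return None
    return f"The files '{ref}' and '{comp}' are different."


def compare_text_lines(ref, comp):
    """Toy 'smart' comparator: reports which line numbers differ."""
    ref_lines = Path(ref).read_text().splitlines()
    comp_lines = Path(comp).read_text().splitlines()
    changed = [i + 1 for i, (a, b) in enumerate(zip(ref_lines, comp_lines)) if a != b]
    if len(ref_lines) != len(comp_lines):
        # Flag the first line that exists in only one of the two files.
        changed.append(min(len(ref_lines), len(comp_lines)) + 1)
    if not changed:
        return None
    return f"The files '{ref}' and '{comp}' differ at line(s) {changed}."


# Hypothetical registry: one comparator per file extension.
COMPARATORS = {".txt": compare_text_lines}


def pick_comparator(path, comparators=COMPARATORS):
    """Return the comparator registered for the file extension, or the binary fallback."""
    return comparators.get(Path(path).suffix, compare_bytes)
```

With this registry, `pick_comparator("sub_dir_1/notes.txt")` returns the line-based comparator, while a file with an unregistered extension such as `data.bin` falls back to `compare_bytes`.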

Compare two directories

Let's compare two directories with the following structures:

```bash
└── reference_dir
    ├── sub_dir_1
    |   ├── sub_file_1.a
    |   └── sub_file_2.b
    └── file_1.c
```

```bash
└── compared_dir
    ├── sub_dir_1
    |   ├── sub_file_1.a
    |   ├── sub_file_2.b
    |   └── sub_file_3.b
    └── file_1.c
```

The reference directory contains all the files that should be checked in the compared directory, which means that extraneous files in the compared directory are just ignored.

These two directories can be compared with the following code:

```python
import dir_content_diff

dir_content_diff.compare_trees("reference_dir", "compared_dir")
```

> [!WARNING]
> The order of the parameters is important: the first path is considered as the reference directory while the second one is the compared directory. Inverting the parameters may return a different result (in this example it would return that the file sub_file_3.b is missing).

If all the files are identical, this code will return an empty dictionary because no difference was detected. As mentioned previously, this is because dir-content-diff is only looking for files in the compared directory that are also present in the reference directory, so the file sub_file_3.b is just ignored in this case.
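The asymmetry between the reference and compared directories can be reproduced with a standalone sketch built only on the standard library. The function `compare_trees_sketch` and its messages are hypothetical and only mimic the behaviour described above with a plain byte comparison; this is not the package's implementation.

```python
import tempfile
from pathlib import Path


def compare_trees_sketch(ref_dir, comp_dir):
    """Return {relative_path: message} for each reference file that is missing or differs."""
    ref_dir, comp_dir = Path(ref_dir), Path(comp_dir)
    differences = {}
    # Only files found in the reference tree are checked.
    for ref_file in sorted(ref_dir.rglob("*")):
        if not ref_file.is_file():
            continue
        relative = ref_file.relative_to(ref_dir)
        comp_file = comp_dir / relative
        if not comp_file.is_file():
            differences[relative.as_posix()] = "missing in the compared directory"
        elif ref_file.read_bytes() != comp_file.read_bytes():
            differences[relative.as_posix()] = "content differs"
    return differences


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Build the two example trees with identical shared files.
    for name in ("sub_dir_1/sub_file_1.a", "sub_dir_1/sub_file_2.b", "file_1.c"):
        for tree in ("reference_dir", "compared_dir"):
            target = root / tree / name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text("same content")
    # Extraneous file that exists only in the compared directory.
    (root / "compared_dir" / "sub_dir_1" / "sub_file_3.b").write_text("extra")

    forward = compare_trees_sketch(root / "reference_dir", root / "compared_dir")
    backward = compare_trees_sketch(root / "compared_dir", root / "reference_dir")

print(forward)   # {}
print(backward)  # {'sub_dir_1/sub_file_3.b': 'missing in the compared directory'}
```

Swapping the two arguments turns the ignored extra file into a reported missing file, which is why the argument order matters.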

If reference_dir/file_1.c is the following JSON-like file:

```json
{
    "a": 1,
    "b": [1, 2]
}
```

And compared_dir/file_1.c is the following JSON-like file:

```json
{
    "a": 2,
    "b": [10, 2, 0]
}
```

The following code registers the JsonComparator for the file extension .c and compares the two directories:

```python
import dir_content_diff

dir_content_diff.register_comparator(".c", dir_content_diff.JsonComparator())
dir_content_diff.compare_trees("reference_dir", "compared_dir")
```

The previous code will output the following dictionary:

```python
{
    'file_1.c': (
        'The files \'reference_dir/file_1.c\' and \'compared_dir/file_1.c\' are different:\n'
        'Added the value(s) \'{"2": 0}\' in the \'[b]\' key.\n'
        'Changed the value of \'[a]\' from 1 to 2.\n'
        'Changed the value of \'[b][0]\' from 1 to 10.'
    )
}
```

It is also possible to check whether the two directories are equal or not with the following code:

```python
import dir_content_diff

dir_content_diff.register_comparator(".c", dir_content_diff.JsonComparator())
dir_content_diff.assert_equal_trees("reference_dir", "compared_dir")
```

Which will output the following AssertionError:

```bash
AssertionError: The files 'reference_dir/file_1.c' and 'compared_dir/file_1.c' are different:
Added the value(s) '{"2": 0}' in the '[b]' key.
Changed the value of '[a]' from 1 to 2.
Changed the value of '[b][0]' from 1 to 10.
```

Finally, the comparators have parameters that can be passed either to be used for all files of a given extension or only for a specific file:

```python
import dir_content_diff

# Get the default comparators
comparators = dir_content_diff.get_comparators()

# Replace the comparators for JSON files to perform the comparison with a given tolerance
comparators[".json"] = dir_content_diff.JsonComparator(default_diff_kwargs={"tolerance": 0.1})

# Use a specific tolerance for the file sub_dir_1/sub_file_1.a
# In this case, the kwargs are used to compute the difference by default, except the following
# specific kwargs: return_raw_diffs, load_kwargs, format_data_kwargs, filter_kwargs,
# format_diff_kwargs, sort_kwargs, concat_kwargs and report_kwargs.
specific_args = {"sub_dir_1/sub_file_1.a": {"tolerance": 0.5}}

dir_content_diff.assert_equal_trees(
    "reference_dir",
    "compared_dir",
    comparators=comparators,
    specific_args=specific_args,
)
```

Each comparator has different arguments that are detailed in the documentation.

It's also possible to specify an arbitrary comparator for a specific file:

```python
specific_args = {
    "sub_dir_1/sub_file_1.a": {
        "comparator": dir_content_diff.JsonComparator(),
        "tolerance": 0.5,
    }
}
```

And last but not least, it's possible to use regular expressions to associate specific arguments to a set of files:

```python
specific_args = {
    "all files with *.a of *.b extensions": {
        "patterns": [r".*\.[a,b]$"],
        "comparator": dir_content_diff.BaseComparator(),
    }
}
```
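To see which relative paths a pattern like the one above selects, it can be tried directly with the standard `re` module. This is only a sketch: how the package applies the patterns internally (match, search or full match) is not stated here, so `re.match` is an assumption, and the sample paths are made up for illustration.

```python
import re

# Same pattern as in the example above.
pattern = re.compile(r".*\.[a,b]$")

# Hypothetical relative paths to test against the pattern.
paths = [
    "sub_dir_1/sub_file_1.a",
    "sub_dir_1/sub_file_2.b",
    "file_1.c",
    "odd_name.,",
]

matched = [path for path in paths if pattern.match(path)]
print(matched)
# ['sub_dir_1/sub_file_1.a', 'sub_dir_1/sub_file_2.b', 'odd_name.,']

# Inside a character class the comma is a literal character, so [a,b] also
# accepts a trailing ','. The class [ab] restricts the match to 'a' or 'b'.
stricter = re.compile(r".*\.[ab]$")
print(bool(stricter.match("odd_name.,")))  # False
```

In practice the comma rarely matters because file names seldom end with `.,`, but `[ab]` expresses the intent more precisely.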

Export formatted data

Some comparators have to format the data before comparing them. For example, when comparing data that contain file paths, it is likely that only a relative part of these paths is relevant, not the entire absolute paths. To handle this, a specific comparator can be defined with a custom format_data() method, which is called automatically after the data are loaded and before they are compared. The data can then be exported just after they have been formatted, for checking purposes for example. To do this, set the export_formatted_files argument of the dir_content_diff.compare_trees and dir_content_diff.assert_equal_trees functions to True. All the files processed by a comparator with a save() method are then exported to a new directory, whose name is the name of the compared directory with a suffix appended. By default, the suffix is _FORMATTED, but it can be overridden by passing a non-empty string to the export_formatted_files argument.
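The kind of formatting step described above can be sketched without the package. The helpers `relativize_paths` and `export_dir_name` are hypothetical and only illustrate the idea; a real comparator would put similar logic in its own format_data() method, whose exact signature is given in the package documentation.

```python
from pathlib import PurePosixPath


def relativize_paths(data, root):
    """Return a copy of ``data`` where string values located under ``root`` become relative."""
    formatted = {}
    for key, value in data.items():
        if isinstance(value, str):
            try:
                value = PurePosixPath(value).relative_to(root).as_posix()
            except ValueError:
                pass  # Not a path under ``root``: keep the value unchanged.
        formatted[key] = value
    return formatted


def export_dir_name(compared_dir, suffix="_FORMATTED"):
    """Name of the export directory: the compared directory name plus a suffix."""
    return str(compared_dir) + suffix


# Two runs that wrote the same result in different working directories.
reference = {"output": "/tmp/run_1/results/out.csv", "n_rows": 3}
compared = {"output": "/home/user/run_2/results/out.csv", "n_rows": 3}

same_after_formatting = (
    relativize_paths(reference, "/tmp/run_1")
    == relativize_paths(compared, "/home/user/run_2")
)
print(same_after_formatting)            # True
print(export_dir_name("compared_dir"))  # compared_dir_FORMATTED
```

Before formatting, the two dictionaries differ only by their absolute prefixes; after formatting, both hold `results/out.csv`, so the comparison reports no difference.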

Pytest plugin

This package can be used as a pytest plugin. When pytest is run and dir-content-diff is installed, it is automatically detected and registered as a plugin. It is then possible to trigger the export of formatted data with the following pytest option: --dcd-export-formatted-data. It is also possible to define a custom suffix for the new directory with the following option: --dcd-export-suffix.

Funding & Acknowledgment

The development of this software was supported by funding to the Blue Brain Project, a research center of the École polytechnique fédérale de Lausanne (EPFL), from the Swiss government’s ETH Board of the Swiss Federal Institutes of Technology.

For license and authors, see LICENSE.txt and AUTHORS.md respectively.

Copyright © 2021-2023 Blue Brain Project/EPFL

Owner

  • Name: The Blue Brain Project
  • Login: BlueBrain
  • Kind: organization
  • Email: bbp.opensource@epfl.ch
  • Location: Geneva, Switzerland

Open Source Software produced and used by the Blue Brain Project

GitHub Events

Total
  • Create event: 14
  • Issues event: 3
  • Release event: 2
  • Watch event: 2
  • Delete event: 11
  • Issue comment event: 9
  • Push event: 25
  • Pull request review comment event: 2
  • Pull request review event: 4
  • Pull request event: 22
  • Fork event: 1
Last Year
  • Create event: 14
  • Issues event: 3
  • Release event: 2
  • Watch event: 2
  • Delete event: 11
  • Issue comment event: 9
  • Push event: 25
  • Pull request review comment event: 2
  • Pull request review event: 4
  • Pull request event: 22
  • Fork event: 1

Committers

Last synced: almost 3 years ago

All Time
  • Total Commits: 53
  • Total Committers: 6
  • Avg Commits per committer: 8.833
  • Development Distribution Score (DDS): 0.094
Top Committers
Name Email Commits
Adrien Berchet a****t@e****h 48
bbpgithubaudit 8****t@u****m 1
Alexis Arnaudon a****n@g****m 1
alex4200 a****z@e****h 1
Adrien Berchet a****t@g****m 1
Alan Garner a****r@e****h 1
Committer Domains (Top 20 + Academic)
epfl.ch: 3

Issues and Pull Requests

Last synced: over 1 year ago

All Time
  • Total issues: 3
  • Total pull requests: 71
  • Average time to close issues: 12 months
  • Average time to close pull requests: about 10 hours
  • Total issue authors: 2
  • Total pull request authors: 5
  • Average comments per issue: 2.0
  • Average comments per pull request: 0.21
  • Merged pull requests: 71
  • Bot issues: 0
  • Bot pull requests: 10
Past Year
  • Issues: 1
  • Pull requests: 31
  • Average time to close issues: 6 days
  • Average time to close pull requests: about 12 hours
  • Issue authors: 1
  • Pull request authors: 2
  • Average comments per issue: 0.0
  • Average comments per pull request: 0.26
  • Merged pull requests: 31
  • Bot issues: 0
  • Bot pull requests: 7
Top Authors
Issue Authors
  • adrien-berchet (3)
  • arnaudon (1)
  • froh (1)
Pull Request Authors
  • adrien-berchet (77)
  • dependabot[bot] (14)
  • alex4200 (1)
  • bbpgithubaudit (1)
  • arnaudon (1)
Top Labels
Issue Labels
enhancement (1)
Pull Request Labels
dependencies (14) github_actions (1)

Dependencies

.github/workflows/codeql.yml actions
  • actions/checkout v3 composite
  • github/codeql-action/analyze v2 composite
  • github/codeql-action/autobuild v2 composite
  • github/codeql-action/init v2 composite
.github/workflows/commitlint.yml actions
  • actions/checkout v3 composite
  • actions/setup-node v3 composite
.github/workflows/publish-sdist.yml actions
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • pypa/gh-action-pypi-publish release/v1 composite
.github/workflows/run-tox.yml actions
  • actions/cache v3 composite
  • actions/checkout v3 composite
  • actions/setup-python v4 composite
  • actions/upload-artifact v3 composite
  • awalsh128/cache-apt-pkgs-action latest composite
  • codecov/codecov-action v3 composite
  • mikepenz/action-junit-report v3 composite