master-thesis-rcm

Code for a Master's degree thesis on the feasibility of automatic sensitive content detection in colonial photographic archives using image classification algorithms.

https://github.com/orsolamborrini/master-thesis-rcm

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (10.1%) to scientific vocabulary

Keywords

archival-sciences colonial-archives computer-vision image-classification machine-learning machine-learning-projects sensitive-content-detection

Repository

Basic Info
  • Host: GitHub
  • Owner: OrsolaMBorrini
  • License: gpl-3.0
  • Language: Jupyter Notebook
  • Default Branch: main
  • Homepage:
  • Size: 48 MB
Statistics
  • Stars: 1
  • Watchers: 2
  • Forks: 1
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed 9 months ago
Metadata Files
Readme License Citation

README.md

"Revealing contested memory" Master's thesis - GitHub repository

This repository contains the code for the project "Revealing contested memory: Automatic sensitive content detection in colonial photographic archives". The project experiments with fine-tuning different Machine Learning models (specifically, Computer Vision models) to assess the feasibility of automatic sensitive content detection in colonial photographic archives.

The project was developed as the final thesis for the Master's degree in "Digital Humanities and Digital Knowledge" at Alma Mater Studiorum - University of Bologna.

🧐 Abstract

Although many European archival institutions hold a wealth of visual material on colonial domination, these assets are frequently difficult to access and use, due not only to privacy, copyright, commercial and technical issues, but also to ethical concerns. Handling such unsettling and sensitive content calls for a discussion on the ethics of care and looking, especially in relation to the digitisation of colonial archives, with a focus on confronting power dynamics and amplifying underrepresented voices. The large scale of the digital archival collections produced by the digitisation efforts of GLAM institutions since the 1990s makes machine reasoning imperative for the selection, appraisal, and management of records. In this context, Machine Learning (ML) techniques can also assist in detecting potentially sensitive content.

🎈 Usage

The four Jupyter Notebook files showcase the different phases of the project, from the data annotated with Label Studio to the error analysis of the fine-tuned ML models.

  • 1_data_cleaning.ipynb
    • Data cleaning and processing performed on the annotated data (CSV files, one for each archival collection) to prepare it for the creation of the dataset. Specifically, corrupted images were deleted, unnecessary information was removed to improve readability, new columns recording the collection of provenance of each image were added, and all files were moved to a common folder, pictures, with all the information stored in a new index.csv file.
  • 2_dataset_creation.ipynb
    • The prepared data was split into three sets (train, validation, and test), and the information in the index.csv file was updated accordingly. Given the class imbalance, the split was stratified: the class proportions of the full dataset are preserved in each set, which avoids the risk of the least populated class having no instances in the train set. Finally, the images are placed in different folders based on their set and class, so the dataset has a folder-based structure.
  • 3_training.ipynb
    • Experiments with the ResNet architecture under different hyperparameter configurations. The best-performing model is then validated and evaluated on the test set using precision, recall, F1 (both micro and macro average) and accuracy, and confusion matrices are produced. All of the experiments can be examined further on a public interactive Weights & Biases dashboard.
  • 4_error_analysis.ipynb
    • For each class of the test set, the model's predictions (including the predicted score for each class for each instance) are analysed to understand the possible sources of detection errors.
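
The stratified split described for 2_dataset_creation.ipynb can be illustrated with a minimal sketch. This is not the notebook's code: the notebook may well rely on a library routine instead, and the function name `stratified_split`, the 70/15/15 fractions, the seed, and the two class names are all assumptions made for the example. It uses only the Python standard library and splits each class separately, so that every set keeps roughly the same class proportions.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, fractions=(0.7, 0.15, 0.15), seed=42):
    """Assign each item index to 'train', 'validation' or 'test', class by class.

    `labels` holds one class name per image. The fractions and the seed are
    illustrative assumptions, not the values used in the thesis.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    assignment = {}
    for label, indices in by_class.items():
        rng.shuffle(indices)
        n = len(indices)
        # Round the validation and test sizes, but keep at least one
        # instance of every class for training.
        n_val = round(n * fractions[1])
        n_test = round(n * fractions[2])
        n_train = max(1, n - n_val - n_test)
        for position, index in enumerate(indices):
            if position < n_train:
                assignment[index] = "train"
            elif position < n_train + n_val:
                assignment[index] = "validation"
            else:
                assignment[index] = "test"
    return assignment


# Hypothetical, imbalanced toy labels standing in for the annotated classes.
labels = ["non-sensitive"] * 80 + ["sensitive"] * 20
split = stratified_split(labels)

# Count how many images of each class ended up in each set.
counts = Counter((split[i], labels[i]) for i in range(len(labels)))
for (subset, label), count in sorted(counts.items()):
    print(f"{subset:<10} {label:<14} {count}")
```

With these toy labels, the minority class is divided 14/3/3 across train, validation and test, mirroring the 56/12/12 division of the majority class. A purely random split over all 100 items would not guarantee this.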
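
The difference between the micro and macro averages of F1 mentioned for 3_training.ipynb matters on imbalanced data, and a small worked example may make it concrete. The sketch below is only an illustration under assumptions: the helper `f1_scores`, the class names, and the toy predictions are hypothetical and do not come from the thesis, and the notebook itself may compute these metrics with a library.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # the predicted class gains a false positive
            fn[truth] += 1  # the true class gains a false negative

    def f1(tp_count, fp_count, fn_count):
        denominator = 2 * tp_count + fp_count + fn_count
        # A class that is never predicted correctly scores 0.
        return 2 * tp_count / denominator if denominator else 0.0

    # Micro: pool the counts of all classes, then compute one score.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute one score per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in classes) / len(classes)
    return micro, macro


# Hypothetical test-set labels: 8 majority-class and 2 minority-class images.
y_true = ["non-sensitive"] * 8 + ["sensitive"] * 2
# A model that predicts the majority class for every image.
y_pred = ["non-sensitive"] * 10

micro, macro = f1_scores(y_true, y_pred)
print(f"micro F1 = {micro:.3f}, macro F1 = {macro:.3f}")
```

In this toy case the micro average (0.800, equal to accuracy for single-label classification) looks acceptable even though the model never detects the minority class, while the macro average (0.444) exposes the failure. This is one reason to report both averages when classes are imbalanced.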

⛏️ Requirements

  • Python 3.11.5

Install the pinned dependencies with the command: pip install -r requirements.txt

Owner

  • Name: Orsola Maria Borrini
  • Login: OrsolaMBorrini
  • Kind: user
  • Location: Bologna

Student at the DHDK (Digital Humanities and Digital Knowledge) Master Degree @ Alma Mater Studiorum, University of Bologna

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Borrini"
  given-names: "Orsola Maria"
title: "'Revealing contested memory' Master Degree thesis - Github repository"
version: 1.0.0
doi: 10.5281/zenodo.1234
date-released: 2024-01-28
url: "https://github.com/OrsolaMBorrini/master-thesis-rcm"

GitHub Events

Total
  • Push event: 3
Last Year
  • Push event: 3

Dependencies

requirements.txt pypi
  • Babel ==2.13.1
  • Django ==3.2.20
  • GitPython ==3.1.41
  • Jinja2 ==3.1.2
  • MarkupSafe ==2.1.3
  • Pillow ==9.3.0
  • PyYAML ==6.0.1
  • Pygments ==2.16.1
  • QtPy ==2.4.1
  • Send2Trash ==1.8.2
  • accelerate ==0.24.1
  • aiohttp ==3.8.6
  • aiosignal ==1.3.1
  • anyio ==4.0.0
  • appdirs ==1.4.4
  • argon2-cffi ==23.1.0
  • argon2-cffi-bindings ==21.2.0
  • arrow ==1.3.0
  • asgiref ==3.7.2
  • asttokens ==2.4.1
  • async-lru ==2.0.4
  • async-timeout ==4.0.3
  • attr ==0.3.1
  • attrs ==23.1.0
  • azure-core ==1.29.5
  • azure-storage-blob ==12.19.0
  • beautifulsoup4 ==4.12.2
  • bleach ==5.0.1
  • boto ==2.49.0
  • boto3 ==1.16.63
  • botocore ==1.19.63
  • boxing ==0.1.4
  • cachetools ==5.3.2
  • certifi ==2023.7.22
  • cffi ==1.16.0
  • charset-normalizer ==3.3.2
  • click ==8.1.7
  • colorama ==0.4.6
  • comm ==0.2.0
  • contourpy ==1.2.0
  • coreapi ==2.3.3
  • coreschema ==0.0.4
  • cryptography ==41.0.5
  • cycler ==0.12.1
  • datasets ==2.16.1
  • debugpy ==1.8.0
  • decorator ==5.1.1
  • defusedxml ==0.7.1
  • dill ==0.3.7
  • django-annoying ==0.10.6
  • django-cors-headers ==3.6.0
  • django-debug-toolbar ==3.2.1
  • django-environ ==0.10.0
  • django-extensions ==3.1.0
  • django-filter ==2.4.0
  • django-model-utils ==4.1.1
  • django-ranged-fileresponse ==0.1.2
  • django-rest-swagger ==2.2.0
  • django-rq ==2.5.1
  • django-storages ==1.12.3
  • django-user-agents ==0.4.0
  • djangorestframework ==3.13.1
  • docker-pycreds ==0.4.0
  • docopt ==0.6.2
  • drf-dynamic-fields ==0.3.0
  • drf-flex-fields ==0.9.5
  • drf-generators ==0.3.0
  • drf-yasg ==1.20.0
  • et-xmlfile ==1.1.0
  • evaluate ==0.4.1
  • executing ==2.0.1
  • expiringdict ==1.2.2
  • fastjsonschema ==2.19.0
  • filelock ==3.13.1
  • fonttools ==4.46.0
  • fqdn ==1.5.1
  • frozenlist ==1.4.0
  • fsspec ==2023.10.0
  • gitdb ==4.0.11
  • google-api-core ==2.11.0
  • google-auth ==2.14.1
  • google-cloud-appengine-logging ==1.1.0
  • google-cloud-audit-log ==0.2.0
  • google-cloud-core ==2.3.2
  • google-cloud-logging ==2.7.1
  • google-cloud-storage ==2.5.0
  • google-crc32c ==1.5.0
  • google-resumable-media ==2.3.3
  • googleapis-common-protos ==1.56.4
  • grpc-google-iam-v1 ==0.12.4
  • grpcio ==1.59.2
  • grpcio-status ==1.59.2
  • htmlmin ==0.1.12
  • huggingface-hub ==0.20.3
  • idna ==3.4
  • ijson ==3.2.3
  • import-ipynb ==0.1.4
  • inflection ==0.5.1
  • ipykernel ==6.26.0
  • ipynb ==0.5.1
  • ipython ==8.17.2
  • ipywidgets ==8.1.1
  • isodate ==0.6.1
  • isoduration ==20.11.0
  • itypes ==1.2.0
  • jedi ==0.19.1
  • jmespath ==0.10.0
  • joblib ==1.3.2
  • json5 ==0.9.14
  • jsonpointer ==2.4
  • jsonschema ==3.2.0
  • jsonschema-specifications ==2023.11.1
  • jupyter ==1.0.0
  • jupyter-console ==6.6.3
  • jupyter-events ==0.9.0
  • jupyter-lsp ==2.2.0
  • jupyter_client ==8.6.0
  • jupyter_core ==5.5.0
  • jupyter_server ==2.10.0
  • jupyter_server_terminals ==0.4.4
  • jupyterlab ==4.0.8
  • jupyterlab-pygments ==0.2.2
  • jupyterlab-widgets ==3.0.9
  • jupyterlab_server ==2.25.1
  • kiwisolver ==1.4.5
  • label-studio ==1.8.2.post1
  • label-studio-converter ==0.0.54rc0
  • label-studio-tools ==0.0.3
  • launchdarkly-server-sdk ==7.5.0
  • lockfile ==0.12.2
  • lxml ==4.9.3
  • matplotlib ==3.8.2
  • matplotlib-inline ==0.1.6
  • mistune ==3.0.2
  • mpmath ==1.3.0
  • multidict ==6.0.4
  • multiprocess ==0.70.15
  • nbclient ==0.9.0
  • nbconvert ==7.11.0
  • nbformat ==5.9.2
  • nest-asyncio ==1.5.8
  • networkx ==3.2.1
  • nltk ==3.6.7
  • notebook ==7.0.6
  • notebook_shim ==0.2.3
  • numpy ==1.24.3
  • openapi-codec ==1.3.2
  • openpyxl ==3.1.2
  • ordered-set ==4.0.2
  • overrides ==7.4.0
  • packaging ==23.2
  • pandas ==2.1.3
  • pandocfilters ==1.5.0
  • parso ==0.8.3
  • pipreqs ==0.4.13
  • platformdirs ==4.0.0
  • prometheus-client ==0.18.0
  • prompt-toolkit ==3.0.41
  • proto-plus ==1.22.3
  • protobuf ==4.25.0
  • psutil ==5.9.6
  • psycopg2-binary ==2.9.6
  • pure-eval ==0.2.2
  • pyRFC3339 ==1.1
  • pyarrow ==14.0.1
  • pyarrow-hotfix ==0.5
  • pyasn1 ==0.5.0
  • pyasn1-modules ==0.3.0
  • pycparser ==2.21
  • pydantic ==1.10.13
  • pyparsing ==3.1.1
  • pyrsistent ==0.20.0
  • python-dateutil ==2.8.2
  • python-json-logger ==2.0.4
  • pytz ==2022.7.1
  • pywin32 ==306
  • pywinpty ==2.0.12
  • pyzmq ==25.1.1
  • qtconsole ==5.5.0
  • redis ==3.5.3
  • referencing ==0.31.0
  • regex ==2023.10.3
  • requests ==2.31.0
  • responses ==0.18.0
  • rfc3339-validator ==0.1.4
  • rfc3986-validator ==0.1.1
  • rpds-py ==0.12.0
  • rq ==1.10.1
  • rsa ==4.9
  • ruamel.yaml ==0.18.5
  • ruamel.yaml.clib ==0.2.8
  • rules ==2.2
  • s3transfer ==0.3.7
  • safetensors ==0.4.0
  • scikit-learn ==1.3.2
  • scipy ==1.11.3
  • seaborn ==0.13.1
  • semver ==2.13.0
  • sentry-sdk ==1.35.0
  • setproctitle ==1.3.3
  • simplejson ==3.19.2
  • six ==1.16.0
  • smmap ==5.0.1
  • sniffio ==1.3.0
  • soupsieve ==2.5
  • sqlparse ==0.4.4
  • stack-data ==0.6.3
  • sympy ==1.12
  • tensorboardX ==2.6.2.2
  • terminado ==0.18.0
  • threadpoolctl ==3.2.0
  • tinycss2 ==1.2.1
  • tokenizers ==0.14.1
  • torch ==2.1.0
  • torchaudio ==2.1.0
  • torchvision ==0.16.0
  • tornado ==6.3.3
  • tqdm ==4.66.1
  • traitlets ==5.13.0
  • transformers ==4.35.1
  • types-python-dateutil ==2.8.19.14
  • typing_extensions ==4.8.0
  • tzdata ==2023.3
  • ua-parser ==0.18.0
  • ujson ==5.8.0
  • uri-template ==1.3.0
  • uritemplate ==4.1.1
  • urllib3 ==1.26.16
  • user-agents ==2.2.0
  • wandb ==0.16.2
  • wcwidth ==0.2.10
  • webcolors ==1.13
  • webencodings ==0.5.1
  • websocket-client ==1.6.4
  • widgetsnbextension ==4.0.9
  • xmljson ==0.2.0
  • xxhash ==3.4.1
  • yarg ==0.1.9
  • yarl ==1.9.2