bachelor-thesis-information-science
Code and datasets for my bachelor's thesis
https://github.com/darwinkel/bachelor-thesis-information-science
Repository
Basic Info
- Host: GitHub
- Owner: Darwinkel
- License: gpl-3.0
- Language: Python
- Default Branch: main
- Size: 217 MB
Statistics
- Stars: 1
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Fingerprinting web servers through Transformer-encoded HTTP response headers (Darwinkel, 2023)
Code and datasets for my bachelor's thesis.
License
The code of this project is licensed under the GNU GPLv3. The data that I have collected and processed (e.g. in
the data_* folders) is licensed under the Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.
Note that the domains folder contains the list of domains used in the experiment, courtesy of the Tranco list.
Cite
@misc{darwinkel2024fingerprinting,
  title={Fingerprinting web servers through Transformer-encoded HTTP response headers},
  author={Patrick Darwinkel},
  year={2024},
  month={mar},
  eprint={2404.00056},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://doi.org/10.48550/arXiv.2404.00056}
}
Abstract
We investigated the feasibility of using state-of-the-art deep learning, big data, and natural language processing techniques to improve the ability to detect vulnerable web server versions. As having knowledge of specific version information is crucial for vulnerability analysis, we attempted to improve the accuracy and specificity of web server fingerprinting as compared to existing rule-based systems.
To investigate this, we made various ambiguous and non-standard HTTP requests to 4.77 million domains, and collected HTTP response status lines for each request. We then trained a byte-level BPE tokenizer and RoBERTa encoder to represent these individual status lines through unsupervised masked language modelling. To represent the web server of a single domain, we dimensionality-reduced each encoded response line and concatenated them. A Random Forest and multilayer perceptron classified these encoded, concatenated samples. The classifiers achieved 0.94 and 0.96 macro f1-score respectively on detecting five of the most popular publicly available origin web servers, which make up roughly half of the World Wide Web. On the task of classifying 347 major type and minor version pairs, the multilayer perceptron achieved a weighted f1-score of 0.55. Analysis of Gini impurity suggests that our test cases are meaningful discriminants of web server types, and our high f1-scores are unprecedented and demonstrate that our proposed method is a viable alternative to traditional rule-based systems. In addition, this innovative method opens up avenues for future work, many of which will likely result in even greater classification performance.
Instructions for replicating the experiment
The code was originally written in Python 3.10, guided by isort, black, mypy, pylint, vulture, and eradicate. It's a bit
of a mess, and the documentation is subpar. I haven't cleaned it up properly, as probably nobody will try to
replicate my research. If you are interested in using the code and data, don't hesitate to contact me so I can help you
get started. If desired, I can send my own processed files for reference.
You should probably read the thesis itself first.
Creating the raw data
- Run collector.py through harbinger.sh to create raw output files as batch_.tsv.
- Concatenate the generated batch files (including headers) with Unix utilities into a single file.
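The concatenation step can also be done in Python instead of Unix utilities. The following is a minimal sketch, not part of the repository: the function name `concatenate_batches` is hypothetical, and it assumes that every batch file starts with the same header row and that a single header row in the output is what you want.

```python
from pathlib import Path
from typing import Iterable, Optional


def concatenate_batches(batch_paths: Iterable[Path], out_path: Path) -> int:
    """Concatenate TSV batch files, writing the shared header row only once.

    Hypothetical helper (not in the repository). Returns the number of data
    rows written, header excluded.
    """
    rows_written = 0
    header: Optional[str] = None
    with out_path.open("w", encoding="utf-8", newline="") as out:
        for path in sorted(batch_paths):
            with path.open("r", encoding="utf-8", newline="") as batch:
                first_line = batch.readline().rstrip("\r\n")
                if not first_line:
                    continue  # skip empty batch files
                if header is None:
                    # Assumption: the first file defines the header for all files.
                    header = first_line
                    out.write(header + "\n")
                elif first_line != header:
                    raise ValueError(f"{path} has a different header row")
                for line in batch:
                    # Normalise line endings so a missing final newline
                    # cannot fuse two rows from different files.
                    out.write(line.rstrip("\r\n") + "\n")
                    rows_written += 1
    return rows_written
```

Write the output to a file name that the batch glob pattern does not match, otherwise the result could be picked up as an input on a second run.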
Cleaning and filtering the data
- Clean the data and select viable samples with preprocess_filter_by_bad_samples.py from concatenated_data.tsv into preprocessed_*.tsv:
  python3 code/create_dataset/preprocess_filter_by_bad_samples.py data_processed/concatenated_data_withheaders.tsv
- Filter the data by target classes with preprocess_filter_by_target_labels.py from preprocessed.tsv into preprocessed_filtered.tsv:
  python3 code/create_dataset/preprocess_filter_by_target_labels.py preprocessed_.tsv
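To give an idea of what filtering by target classes amounts to, here is a minimal sketch. It is not the logic of preprocess_filter_by_target_labels.py: the function name, the label column name, and the set of target labels are all hypothetical and must be adapted to the actual columns of your preprocessed file.

```python
from pathlib import Path
from typing import Set


def filter_by_target_labels(
    in_path: Path, out_path: Path, label_column: str, targets: Set[str]
) -> int:
    """Keep only the rows whose label is one of the target classes.

    Hypothetical helper (not in the repository). Returns the number of rows kept.
    """
    kept = 0
    with in_path.open("r", encoding="utf-8", newline="") as src:
        with out_path.open("w", encoding="utf-8", newline="") as dst:
            header = src.readline()
            columns = header.rstrip("\r\n").split("\t")
            if label_column not in columns:
                raise ValueError(f"column {label_column!r} not found in header")
            label_index = columns.index(label_column)
            dst.write(header)
            for line in src:
                fields = line.rstrip("\r\n").split("\t")
                # Rows that are too short to contain a label are dropped.
                if len(fields) > label_index and fields[label_index] in targets:
                    dst.write(line)
                    kept += 1
    return kept
```

The fields are split on tabs by hand rather than with the csv module, so quote characters inside response lines are passed through untouched.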
Preparing support data
- Generate a list of unique values for unsupervised training with prepare_embeddings_list.py from concatenated_data.tsv into embeddings_list_.tsv:
  python3 code/create_support_data/prepare_embeddings_list.py data_processed/concatenated_data_withheaders.tsv
- Remove some redundant HTML-only lines from embeddings_list.tsv with Unix utilities.
- Generate a list of unique classes for HuggingFace's datasets library:
  python3 code/create_support_data/prepare_classes_lists.py preprocessed_filtered_.tsv
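Building the list of unique values boils down to deduplicating the text found in the response columns. The sketch below illustrates that idea only. It is not the code of prepare_embeddings_list.py, and the function name and the column names passed to it are hypothetical.

```python
from pathlib import Path
from typing import Dict, List, Sequence


def collect_unique_values(in_path: Path, text_columns: Sequence[str]) -> List[str]:
    """Return the unique, non-empty values of the given columns in first-seen order.

    Hypothetical helper (not in the repository).
    """
    unique: Dict[str, None] = {}  # dicts preserve insertion order
    with in_path.open("r", encoding="utf-8", newline="") as src:
        columns = src.readline().rstrip("\r\n").split("\t")
        missing = [name for name in text_columns if name not in columns]
        if missing:
            raise ValueError(f"columns not found in header: {missing}")
        indices = [columns.index(name) for name in text_columns]
        for line in src:
            fields = line.rstrip("\r\n").split("\t")
            for index in indices:
                if index < len(fields) and fields[index]:
                    unique.setdefault(fields[index], None)
    return list(unique)
```

Deduplicating before training keeps very common response lines from dominating the unsupervised training data.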
Training the Transformer encoder
- Train a new tokenizer with train_tokenizer.py from an embeddings_list.tsv:
  python3 code/create_embedding_model/train_tokenizer.py embeddings_list.tsv
- Train/fine-tune a (new) model with train_embeddings_scratch.py or train_embeddings_finetune.py from an embeddings_list.tsv:
  python3 code/create_embedding_model/train_embeddings_.py embeddings_list.tsv
- Run prepare_encoded_inputs.py on preprocessed_filtered.tsv and embeddings_list.tsv to extract features from raw text columns, downsize them to 64 dimensions, and save them as a HuggingFace dataset:
  python3 code/create_embedding_model/prepare_encoded_inputs.py preprocessed_filtered.tsv embeddings_list.tsv
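As the abstract describes, each domain is represented by concatenating the reduced encodings of its individual responses. The sketch below shows only that concatenation step, on vectors that have already been encoded and reduced. It is not the code of prepare_encoded_inputs.py: the function name and the request names are hypothetical, and filling a missing response with zeros is an assumption of this sketch.

```python
from typing import Dict, List, Optional, Sequence


def concatenate_encodings(
    encoded: Dict[str, Optional[Sequence[float]]],
    request_order: Sequence[str],
    dim: int = 64,
) -> List[float]:
    """Join the per-request vectors of one domain into a single feature vector.

    Hypothetical helper (not in the repository). The requests are always
    visited in the same order so that each block of `dim` values lines up
    across domains.
    """
    sample: List[float] = []
    for name in request_order:
        vector = encoded.get(name)
        if vector is None:
            # Assumption: a missing response becomes an all-zero block.
            vector = [0.0] * dim
        if len(vector) != dim:
            raise ValueError(f"{name!r} has {len(vector)} values, expected {dim}")
        sample.extend(vector)
    return sample
```

With 64 dimensions per response, a domain probed with n requests yields a vector of 64 × n values, which is the kind of input a Random Forest or multilayer perceptron can classify.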
Training and evaluating the classifiers
Run any classifier from code/evaluate_embeddings after datasets/http-header-split-embedded-data-v1 has been created.
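The abstract reports both macro and weighted f1-scores. As a reminder of how the two differ, here is a minimal standard-library sketch. It is not the evaluation code in code/evaluate_embeddings, and the function names are hypothetical: macro f1 averages the per-class scores equally, whereas weighted f1 weights each class by its number of true samples.

```python
from collections import Counter
from typing import Dict, Sequence, Tuple


def per_class_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, Tuple[float, int]]:
    """Return {label: (f1, support)} for every label seen in y_true or y_pred."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp: Counter = Counter()
    fp: Counter = Counter()
    fn: Counter = Counter()
    for truth, prediction in zip(y_true, y_pred):
        if truth == prediction:
            tp[truth] += 1
        else:
            fp[prediction] += 1
            fn[truth] += 1
    support = Counter(y_true)
    scores = {}
    for label in set(y_true) | set(y_pred):
        denominator = 2 * tp[label] + fp[label] + fn[label]
        f1 = 2 * tp[label] / denominator if denominator else 0.0
        scores[label] = (f1, support[label])
    return scores


def macro_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Unweighted mean of the per-class f1-scores."""
    scores = per_class_f1(y_true, y_pred)
    return sum(f1 for f1, _ in scores.values()) / len(scores)


def weighted_f1(y_true: Sequence[str], y_pred: Sequence[str]) -> float:
    """Mean of the per-class f1-scores, weighted by the true sample count."""
    scores = per_class_f1(y_true, y_pred)
    total = sum(count for _, count in scores.values())
    return sum(f1 * count for f1, count in scores.values()) / total
```

On a task with many rare classes, such as the type and version pairs, the two averages can diverge considerably, so it matters which one a reported score refers to.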
Data analysis and plotting
See assorted files in code/analyze_dataset.
Playing with the Transformer encoder
See assorted files in code/play_with_model.
Owner
- Name: Patrick
- Login: Darwinkel
- Kind: user
- Location: Groningen, Netherlands
- Company: @code050 | @rijksuniversiteit-groningen
- Repositories: 1
- Profile: https://github.com/Darwinkel
BSc in Information Science; software engineer
GitHub Events
Total
- Watch event: 2
Last Year
- Watch event: 2