bachelor-thesis-information-science

Code and datasets for my bachelor's thesis

https://github.com/darwinkel/bachelor-thesis-information-science


Repository

Basic Info

Host: GitHub
Owner: Darwinkel
License: gpl-3.0
Language: Python
Default Branch: main
Size: 217 MB

Statistics

Stars: 1
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created over 3 years ago · Last pushed about 2 years ago

Fingerprinting web servers through Transformer-encoded HTTP response headers (Darwinkel, 2023)

Code and datasets for my bachelor's thesis.

License

The code of this project is licensed under the GNU GPLv3. The data that I have collected and processed (e.g. in the data_* folders) is licensed under the Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0) license.

The domains folder contains the list of domains used in the experiment, courtesy of the Tranco list.

Cite

@misc{darwinkel2024fingerprinting,
  title={Fingerprinting web servers through Transformer-encoded HTTP response headers},
  author={Patrick Darwinkel},
  year={2024},
  month=mar,
  eprint={2404.00056},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://doi.org/10.48550/arXiv.2404.00056}
}

Abstract

We investigated the feasibility of using state-of-the-art deep learning, big data, and natural language processing techniques to improve the ability to detect vulnerable web server versions. As having knowledge of specific version information is crucial for vulnerability analysis, we attempted to improve the accuracy and specificity of web server fingerprinting as compared to existing rule-based systems.

To investigate this, we made various ambiguous and non-standard HTTP requests to 4.77 million domains, and collected HTTP response status lines for each request. We then trained a byte-level BPE tokenizer and RoBERTa encoder to represent these individual status lines through unsupervised masked language modelling. To represent the web server of a single domain, we dimensionality-reduced each encoded response line and concatenated them. A Random Forest and multilayer perceptron classified these encoded, concatenated samples. The classifiers achieved 0.94 and 0.96 macro f1-score respectively on detecting five of the most popular publicly available origin web servers, which make up roughly half of the World Wide Web. On the task of classifying 347 major type and minor version pairs, the multilayer perceptron achieved a weighted f1-score of 0.55. Analysis of Gini impurity suggests that our test cases are meaningful discriminants of web server types, and our high f1-scores are unprecedented and demonstrate that our proposed method is a viable alternative to traditional rule-based systems. In addition, this innovative method opens up avenues for future work, many of which will likely result in even greater classification performance.

Instructions for replicating the experiment

The code was originally written in Python 3.10, guided by isort, black, mypy, pylint, vulture, and eradicate. It's a bit of a mess, and the documentation is subpar. I haven't cleaned it up properly, as I don't expect many people to try to replicate my research. If you are interested in using the code and data, don't hesitate to contact me so I can help you get started. If desired, I can send my own processed files for reference. You should probably read the thesis itself first.

Creating the raw data

Run collector.py through harbinger.sh to create raw output files as batch_.tsv.
Concatenate the generated batch files (including headers) with Unix utilities into a single file.
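
If you prefer Python over Unix utilities for the concatenation step, the following is a minimal sketch, not the method used in the thesis. It assumes that every batch file starts with the same header row, that only one copy of that header should end up in the output, and that no field contains an embedded newline. The function name concatenate_batches is hypothetical.

```python
def concatenate_batches(batch_paths, output_path):
    """Concatenate TSV batch files into one file with a single header row.

    Assumption: all batch files share an identical first (header) line.
    Returns the number of data rows written.
    """
    header = None
    rows_written = 0
    with open(output_path, "w", encoding="utf-8", newline="") as out:
        for path in batch_paths:
            with open(path, "r", encoding="utf-8", newline="") as batch:
                first_line = batch.readline().rstrip("\r\n")
                if not first_line:
                    continue  # skip empty batch files
                if header is None:
                    # The first non-empty batch defines the header.
                    header = first_line
                    out.write(header + "\n")
                elif first_line != header:
                    raise ValueError(f"Header mismatch in {path}")
                for line in batch:
                    line = line.rstrip("\r\n")
                    if line:  # drop blank lines, normalise line endings
                        out.write(line + "\n")
                        rows_written += 1
    return rows_written
```

Call it with the list of batch file paths and the desired output path. If the downstream scripts expect the repeated headers to remain in the concatenated file, plain concatenation without the header check is the safer choice.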

Cleaning and filtering the data

Clean the data and select viable samples with preprocess_filter_by_bad_samples.py from concatenated_data.tsv into preprocessed_*.tsv.
- python3 code/create_dataset/preprocess_filter_by_bad_samples.py data_processed/concatenated_data_withheaders.tsv
Filter the data by target classes with preprocess_filter_by_target_labels.py from preprocessed.tsv into preprocessed_filtered.tsv.
- python3 code/create_dataset/preprocess_filter_by_target_labels.py preprocessed_.tsv

Preparing support data

Generate a list of unique values for unsupervised training with prepare_embeddings_list.py from concatenated_data.tsv into embeddings_list_.tsv.
- python3 code/create_support_data/prepare_embeddings_list.py data_processed/concatenated_data_withheaders.tsv
Remove some redundant HTML-only lines from embeddings_list.tsv with Unix utilities.
Generate a list of unique classes for HuggingFace's datasets library.
- python3 code/create_support_data/prepare_classes_lists.py preprocessed_filtered_.tsv
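
The idea behind the unique-values list can be sketched as follows. This is an illustration of the concept only and not the contents of prepare_embeddings_list.py. The function name and the column names in the usage note are hypothetical, and the sketch assumes a tab-separated file with a header row and no quoted fields.

```python
import csv


def unique_column_values(tsv_path, columns):
    """Collect unique, non-empty cell values from the given columns.

    Values are returned in first-seen order. A dict is used as an
    ordered set, since dicts preserve insertion order.
    """
    seen = {}
    with open(tsv_path, encoding="utf-8", newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            for column in columns:
                # Missing or short rows yield None; treat them as empty.
                value = (row.get(column) or "").strip()
                if value:
                    seen.setdefault(value, None)
    return list(seen)
```

For example, unique_column_values("data.tsv", ["status_line"]) would return each distinct value of a hypothetical status_line column once. Deduplicating before unsupervised training keeps very common responses from dominating the training text.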

Training the Transformer encoder

Train a new tokenizer with train_tokenizer.py from an embeddings_list.tsv.
- python3 code/create_embedding_model/train_tokenizer.py embeddings_list.tsv
Train/fine-tune a (new) model with train_embeddings_scratch.py or train_embeddings_finetune.py from an embeddings_list.tsv.
- python3 code/create_embedding_model/train_embeddings_.py embeddings_list.tsv
Run prepare_encoded_inputs.py on preprocessed_filtered.tsv and embeddings_list.tsv to extract features from raw text columns, downsize them to 64 dimensions, and save them as a HuggingFace dataset.
- python3 code/create_embedding_model/prepare_encoded_inputs.py preprocessed_filtered.tsv embeddings_list.tsv
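
The abstract describes representing one domain by reducing each encoded response line and concatenating the results. The sketch below illustrates only that reduce-then-concatenate shape. This README does not state which reduction technique prepare_encoded_inputs.py uses, so a Gaussian random projection is swapped in purely as a stand-in. All function names and the dimensions in the usage note are assumptions.

```python
import random


def make_projection(input_dim, output_dim, seed=0):
    """Build a fixed Gaussian random projection matrix (stand-in reducer).

    Entries are drawn from N(0, 1/output_dim); the seed makes it reproducible.
    """
    rng = random.Random(seed)
    sigma = 1.0 / output_dim ** 0.5
    return [[rng.gauss(0.0, sigma) for _ in range(input_dim)]
            for _ in range(output_dim)]


def reduce_vector(vector, projection):
    """Project one encoded response line down to len(projection) dimensions."""
    if any(len(row) != len(vector) for row in projection):
        raise ValueError("vector length does not match projection input size")
    return [sum(w * x for w, x in zip(row, vector)) for row in projection]


def build_sample(encoded_lines, projection):
    """Reduce each encoded line, then concatenate them into one feature vector."""
    sample = []
    for vector in encoded_lines:
        sample.extend(reduce_vector(vector, projection))
    return sample
```

With a projection built by make_projection(hidden_size, 64), a domain with n encoded response lines becomes a single vector of n * 64 values, which is the kind of fixed-length input a Random Forest or multilayer perceptron expects. The order of the lines must be the same for every domain so that each position in the vector always corresponds to the same request.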

Training and evaluating the classifiers

Run any classifier from code/evaluate_embeddings after datasets/http-header-split-embedded-data-v1 has been created.
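
The abstract reports a macro F1-score for the five-class task and a weighted F1-score for the 347-class task. As a reminder of how the two averages differ, here is a small self-contained sketch. It is not taken from the evaluation scripts, and the labels in the usage note are made up for illustration.

```python
from collections import Counter


def f1_per_class(y_true, y_pred):
    """Return the F1-score of every label that occurs in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denominator = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never matches.
        scores[label] = 2 * tp / denominator if denominator else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: every class counts equally."""
    scores = f1_per_class(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by the number of true samples per class."""
    scores = f1_per_class(y_true, y_pred)
    support = Counter(y_true)
    return sum(scores[label] * support[label] for label in scores) / len(y_true)
```

With four true "nginx" samples and one true "apache" sample, all predicted as "nginx", the per-class scores are 8/9 and 0. The macro average is about 0.44 and the weighted average about 0.71, which shows why the two figures in the abstract should not be compared directly.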

Data analysis and plotting

See assorted files in code/analyze_dataset.

Playing with the Transformer encoder

See assorted files in code/play_with_model.

Owner

Name: Patrick
Login: Darwinkel
Kind: user
Location: Groningen, Netherlands
Company: @code050 | @rijksuniversiteit-groningen

Repositories: 1
Profile: https://github.com/Darwinkel

BSc in Information Science; software engineer
