adqm_analysis

modified autodqm_ml for beta-binomial meta analysis

https://github.com/chosila/adqm_analysis

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (14.9%) to scientific vocabulary
Last synced: 10 months ago · JSON representation ·

Repository

modified autodqm_ml for beta-binomial meta analysis

Basic Info
  • Host: GitHub
  • Owner: chosila
  • Language: Python
  • Default Branch: adqm_analysis
  • Size: 2.3 MB
Statistics
  • Stars: 1
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed over 1 year ago
Metadata Files
Readme Citation

README.md

Description

This is a modified version of autodqmml that takes advantage of the fetching and training scripts to run beta-binomial meta-analysis of 2022 HLT physics data. If you already have a conda environment of AutoDQMML installed, and you don't want to install another conda env that is very similar, you can just clone this repo and skip to pip install -e. inside the adqm_analysis directory. It will overide the old version of autodqm-ml package you installed earlier. Just do pip instal command again in the original autodqm-ml directory to undo this change. If you do not want the hassle of remembering which adqm-ml was installed, you can create a new conda environment following steps below:

Installation

1. Clone repository git clone https://github.com/chosila/adqm_analysis.git cd adqm_analysis 2. Install dependencies

Dependencies are listed in environment.yml and installed using conda. If you do not already have conda set up on your system, you can install (for linux) with: curl -O -L https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh bash Miniconda3-latest-Linux-x86_64.sh -b You can then set conda to be available upon login with ~/miniconda3/bin/conda init # adds conda setup to your ~/.bashrc, so relogin after executing this line

Once conda is installed and set up, install dependencies with (warning: this step may take a while) conda env create -f environment.yml <path to install conda env>

Some packages cannot be installed via conda or take too long and need to be installed with pip (after activating your conda env above): pip install yahist

Note: if you are running on lxplus, you may run into permissions errors, which may be fixed with: chmod 755 -R /afs/cern.ch/user/s/<your_user_name>/.conda and then rerunning the command to create the conda env. The resulting conda env can also be several GB in size, so it may also be advisable to specify the installation location in your work area if running on lxplus, i.e. running the conda env create command with -p /afs/cern.ch/work/....

3. Install autodqm-ml

Install with: pip install -e .

Once your setup is installed, you can activate your python environment with conda activate adqm_analysis

Note: CMSSW environments can interfere with conda environments. Recommended to unset your CMSSW environment (if any) by running eval `scram unsetenv -sh` before attempting installation and each time before activating the conda environment.

Instructions to use this package for meta-analysis

A lot of these instructions are very similar to the instructions on the AutoDQM_ML tutorial.

  1. Make JSON of histograms to use for this analysis. General shape of the JSON are name of the subsystem as the key, followed by list of path to histograms within the root file. Example of what it should look like: https://github.com/chosila/adqm_analysis/blob/adqm_analysis/metadata/histogram_lists/l1tshift_1.json

How to choose goodruns/badruns 1. go to run registry https://cmsrunregistry.web.cern.ch/online/global?. Offline or Online depends on the set of histograms you're working with. 2. GOOD RUNS: "class = Collisions22", "lsduration > 77", " = GOOD" 3. BAD RUNS: "class = Collisions22", "lsduration > 13", " = BAD"

NOTE: Since we are fetching a very large number of root files and all the histograms need to be readout, it is very memory intensive and tend to fail if too many histograms are fetched at once. I have found that around ~15 histograms at a time is within the safe limit of not encountering any memory issues. To deal with this, I split my histogram_list json files into multiple JSON files, and each one is fetched and run individually.

  1. fetch the data.

Within the metadata directory, there is also the datasetlists directory which contains HLTphysics2022.json. We will be using this data set list for the analysis. The fetching script will read the root files of runs specified in the datasetlists json file, and read out the histograms specified by your histogramlists json files. The output will be saved into a .parquet file. The command to do this is: ``` python scripts/fetchdata.py --outputdir "" --contents <path to histogramlists json> --datasets "metadata/datasetlists/HLTphysics2022.json" ``` Output parquet file will bein the the outputdir that you specified.

  1. run statistical test

python scripts/train.py --input_file "<output dir name>/<output file>.parquet" --output_dir "<new output dir name>" --algorithm "statistical_tester" --tag "beta_binom" --histograms "<path to histogram_list json file"

The new output dir should be a different one from before, as the new parquet file created in this step could override your fetched dataset if you use the same output dir. This step creates another parquet file in the new output directory. This new parquet file contains the beta-binomial chi2 and pull-value score of the comparison you just did.

Beta-binomial comparison can be run using multiple reference runs. In the AutoDQM paper, we compared the performance of 1,4, and 8 references. Currently to change how many references are used in the comparison, you need to modify this value https://github.com/chosila/adqm_analysis/blob/49529a9080ad923a70e3075029ec2328b0a7dee1/autodqm_ml/algorithms/statistical_tester.py#L25. I apologize for the jank. This was backengineered into the autodqm-ml code and I coudln't find an elegant solution to how to control nRef from the command line.

The output file from this step always has the same name, so if you want to run the same histogram_lists JSON but with different number of reference runs, you will need to rename the parquet file you just created in this step before rerunning the train command, or give it a differnt output directory.

  1. convert parquet to csv.

After producing all the histogram lists, you will need to convert this parquet file into a csv, as the final plotting script uses csv. I have a script HLT_l1tShift_addstat/combinetocsv.py that converts HLTPhysics1.parquet, HLTPhysics2.parquet....HLTPhysics4.parquet. into a csv called L1THLTPhysics.csv. If you also named your parquet files as <subsystem><number>.parquet, you can modify this file by changing the range to match the number of files you need <https://github.com/chosila/adqmanalysis/blob/49529a9080ad923a70e3075029ec2328b0a7dee1/HLTl1tShiftaddstat/combinetocsv.py#L8>, as well as the name of the input file https://github.com/chosila/adqm_analysis/blob/49529a9080ad923a70e3075029ec2328b0a7dee1/HLT_l1tShift_addstat/combinetocsv.py#L10 and the name of the output file https://github.com/chosila/adqm_analysis/blob/49529a9080ad923a70e3075029ec2328b0a7dee1/HLT_l1tShift_addstat/combinetocsv.py#L21. Move the output of this merge into a csv directory for the next step.

NOTE: parquet files corresponding to different number of references should not be merged together.

  1. Run the ROC making script

We will use the output of the merged csv to run the scripts to plot "ROC" curves that are used in the AutoDQM paper.

python make_roc.py <algo> <subsystem> <N>

This will create pdfs in the plots/ directory. This script was written specifically for my study which had csv files corresponding to a study using 1,4, and 8 reference runs. If you onlyt have 1 csv file, you can modify this line https://github.com/chosila/adqm_analysis/blob/49529a9080ad923a70e3075029ec2328b0a7dee1/make_roc.py#L30 to zip(['1_REF'], ['-rD'], ['purple']) and this line https://github.com/chosila/adqm_analysis/blob/49529a9080ad923a70e3075029ec2328b0a7dee1/make_roc.py#L33 to the name of your csv file.

Owner

  • Login: chosila
  • Kind: user

Chosila (Si) Sutantawibul

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "AutoDQM Team"
title: "AutoDQM Machine Learning Toolkit"
version: 0.0.0-alpha
doi: 10.5281/zenodo.6547594
date-released: 2022-05-13
url: "https://github.com/AutoDQM/AutoDQM_ML"

GitHub Events

Total
  • Create event: 1
Last Year
  • Create event: 1

Dependencies

setup.py pypi
environment.yml pypi