twitter-job-postings

Replication materials for the paper "280 Characters to Employment: Using Twitter to Quantify Job Vacancies"

https://github.com/socially-embedded-lab/twitter-job-postings

Science Score: 52.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
    Organization socially-embedded-lab has institutional domain (sel.sise.bgu.ac.il)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (11.1%) to scientific vocabulary

Repository


Basic Info
  • Host: GitHub
  • Owner: Socially-Embedded-Lab
  • License: other
  • Language: HTML
  • Default Branch: main
  • Size: 29.5 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 2 years ago · Last pushed about 2 years ago
Metadata Files
Readme License Citation

README.md

280 Characters to Employment

Using Twitter to Quantify Job Vacancies

Twitter Geographical Representativeness


If you use code or data from this repository, please cite our ICWSM paper.


Repository Contents

  • ar_models: Directory containing scripts that fit autoregressive (AR) models on the Twitter index. More information under the AR models section.
  • data: Directory containing the datasets used in the study, with aggregate state and occupation data for replication purposes.
    • analyze_tweets: Post-inference CSVs and figures.
    • csv: Employment data, job titles and location information.
    • job_offer: Official data to compare Twitter index to.
    • scripts: Scripts used to build the files for inference, as well as later analysis of the index.
  • job_title_classifier: Source code directory with scripts for training Job Title classifier with different variations.
  • ner_model: Source code directory with scripts for training NER model.
  • pipeline: Source code directory with scripts for our index inference.

Inference Pipeline

The inference process involves a series of scripts executed in the following order:
  1. Preparation: The build_job_offer.py script (located under data/scripts) takes the initial input file and prepares the data for the inference process.
  2. Deduplication: The dedup.py script (also under data/scripts) deduplicates the data to ensure unique job postings are processed.
  3. Inference: The pipeline_single.py script (located under pipeline) performs the inference using the prepared data.
  4. Post-Inference Processing: The post_predict.py script (also under pipeline) processes the output of the inference to generate the final results.

The whole sequence can be triggered with the run_pipeline.sh script.
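The order of the four steps can be sketched in Python as follows. This is only an illustration of the sequencing, not the contents of run_pipeline.sh: the script paths come from the list above, while the assumption that each script runs without command-line arguments, and the names PIPELINE_STEPS, build_commands and run_pipeline, are hypothetical.

```python
import subprocess
import sys
from pathlib import Path

# Order and locations as listed in this README.
PIPELINE_STEPS = [
    Path("data/scripts/build_job_offer.py"),  # 1. preparation
    Path("data/scripts/dedup.py"),            # 2. deduplication
    Path("pipeline/pipeline_single.py"),      # 3. inference
    Path("pipeline/post_predict.py"),         # 4. post-inference processing
]


def build_commands(python=sys.executable, steps=PIPELINE_STEPS):
    """Return one command per step, in execution order.

    Assumption: the scripts take no arguments; check each script before use.
    """
    return [[python, str(step)] for step in steps]


def run_pipeline(dry_run=True):
    """Print each step in order and, unless dry_run, execute it.

    check=True makes the run stop at the first failing step.
    """
    for command in build_commands():
        print("running:", " ".join(command))
        if not dry_run:
            subprocess.run(command, check=True)


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

With dry_run=True the sketch only prints the commands, which is a cheap way to confirm the order before launching the real scripts.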

Input Files Format

For inference Process

The inference process works on input files with the following columns:
  • user_id: The id of the user.
  • tweet_timestamp: The timestamp of the tweet.
  • tweet_text: The text content of the tweet.
  • tweet_urls: URLs included in the tweet.
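A quick header check before starting inference might look like the sketch below. It assumes the input is a comma-separated file with a header row, which this README does not state; the function name missing_columns and the sample row (including its timestamp format) are made up for illustration.

```python
import csv
import io

# Column names as listed in this README.
REQUIRED_COLUMNS = ("user_id", "tweet_timestamp", "tweet_text", "tweet_urls")


def missing_columns(csv_file):
    """Return the required columns absent from the header of an open CSV file."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Made-up example; real input files are not distributed with the repository.
sample = io.StringIO(
    "user_id,tweet_timestamp,tweet_text,tweet_urls\n"
    "123,2017-01-01 12:00:00,We are hiring a nurse in Austin,https://example.com/job\n"
)
print(missing_columns(sample))  # prints []
```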

For NER Model Training

To train the Named Entity Recognition (NER) model, an input file with the following columns is required:
  • Text: The text of the tweet.
  • Token: The word being tagged.
  • ORG: 1 or 0, marking whether the token is an organization.
  • LOC: 1 or 0, marking whether the token is a location.
  • JOB_TITLE: 1 or 0, marking whether the token is a job title.
  • MISC: Everything that doesn't fall under the previous categories.

You can train the model using main.py (located under ner_model).
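To make the per-token layout concrete, the sketch below turns rows of this shape into one token sequence and one label sequence per tweet. It assumes that exactly one of the four flag columns is set to 1 for each token, which the description suggests but does not state; the helper names and the example sentence are hypothetical, and this is not the loading code used by main.py.

```python
LABEL_COLUMNS = ("ORG", "LOC", "JOB_TITLE", "MISC")


def row_label(row):
    """Return the name of the single flag column set to 1 for this token."""
    flagged = [name for name in LABEL_COLUMNS if int(row[name]) == 1]
    if len(flagged) != 1:
        raise ValueError(
            f"expected exactly one label for token {row['Token']!r}, got {flagged}"
        )
    return flagged[0]


def group_by_text(rows):
    """Group token rows into (tokens, labels) pairs keyed by tweet text."""
    grouped = {}
    for row in rows:
        tokens, labels = grouped.setdefault(row["Text"], ([], []))
        tokens.append(row["Token"])
        labels.append(row_label(row))
    return grouped


def make_row(text, token, label):
    """Build one made-up input row with a one-hot label (illustration only)."""
    row = {"Text": text, "Token": token}
    row.update({name: int(name == label) for name in LABEL_COLUMNS})
    return row


text = "Acme is hiring a nurse in Austin"
tags = ["ORG", "MISC", "MISC", "MISC", "JOB_TITLE", "MISC", "LOC"]
rows = [make_row(text, tok, tag) for tok, tag in zip(text.split(), tags)]
tokens, labels = group_by_text(rows)[text]
print(list(zip(tokens, labels)))
```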

For Job Title classifier Training

To train your own Job Title classifier, you can use the provided job_titles.csv together with the train_bert.py script (located under job_title_classifier).

AR models

Input file

The input file for the AR model, merged_regression_ready.csv (located under pipeline/csv), is created during the post-inference process. It contains the following columns:
  • date: Date at monthly granularity (01/12/2016, 01/01/2017, 01/02/2017, ...).
  • state_name: The name of the state.
  • occupation_name: The name of the occupation (Assemblers, Cleaners and helpers, ...).
  • job_offer: Number of job offers found in our index.
  • employment: Employment rates from the Current Population Survey (CPS).
  • job_offer_rate: The number of job-opening tweets, normalized by the total number of tweets posted in the state in a given time period.
  • employment_rate: The rate of employment, normalized by the total employment in the state in a given time period.
  • job_offer_rate_overall: The number of job-opening tweets, normalized by the total number of tweets in a given time period.
  • employment_rate_overall: The rate of employment, normalized by the total employment in a given time period.
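Two details of this file are easy to get wrong when reading it: the example dates only form a monthly sequence if they are read day-first (01/12/2016 is December 2016), and each rate column is a count divided by a total. The sketch below illustrates both. The function names are hypothetical, and the totals are made-up numbers, since the file as described does not include a total-tweets column.

```python
from datetime import datetime


def parse_month(date_text):
    """Parse a day-first date such as 01/12/2016 (1 December 2016)."""
    return datetime.strptime(date_text, "%d/%m/%Y").date()


def rate(count, total):
    """Normalize a count by its total; return None when the total is zero."""
    return count / total if total else None


print(parse_month("01/12/2016"))  # prints 2016-12-01

# Hypothetical numbers: 50 job-opening tweets out of 20000 tweets in a state-month.
print(rate(50, 20000))  # prints 0.0025
```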

Time series prediction

To predict the employment rate time series, use pooled_effects.py (located under ar_models), which trains the autoregressive (AR) models per state.
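For readers unfamiliar with AR models, the sketch below fits the simplest case, an AR(1) model y[t] = intercept + slope * y[t-1], by ordinary least squares, once per state. It is a generic illustration and not the specification used in pooled_effects.py, which may use more lags, pooled effects or the Twitter index as an additional regressor. All function names and the example series are hypothetical.

```python
def fit_ar1(series):
    """Fit y[t] = intercept + slope * y[t-1] by ordinary least squares."""
    x = series[:-1]  # lagged values
    y = series[1:]   # current values
    n = len(x)
    if n < 2:
        raise ValueError("need at least three observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    if var_x == 0:
        raise ValueError("lagged series is constant; slope is undefined")
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return intercept, slope


def forecast_next(series, intercept, slope):
    """One-step-ahead forecast from the last observed value."""
    return intercept + slope * series[-1]


def fit_per_state(rates_by_state):
    """Fit one independent AR(1) model per state (no pooling)."""
    return {state: fit_ar1(values) for state, values in rates_by_state.items()}


# Made-up series generated by y[t] = 1 + 0.5 * y[t-1], so the fit should
# recover an intercept near 1 and a slope near 0.5.
example = {"State A": [0.0, 1.0, 1.5, 1.75, 1.875]}
models = fit_per_state(example)
intercept, slope = models["State A"]
print(intercept, slope, forecast_next(example["State A"], intercept, slope))
```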

Heat Map Plot

Finally, run plot_rmse.py to re-create the prediction gain heatmap (Fig. 6 in the paper).
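As background for the heatmap, the sketch below computes a root-mean-square error (RMSE) and one possible notion of gain, the relative reduction in RMSE compared with a baseline. This README does not define the prediction gain, so treating it as a relative RMSE reduction is an assumption, and the function names and numbers are hypothetical; see plot_rmse.py and the paper for the definition actually used.

```python
import math


def rmse(actual, predicted):
    """Root-mean-square error between two equally long, non-empty sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def relative_gain(baseline_rmse, model_rmse):
    """Fraction by which the model reduces the baseline RMSE (assumed definition)."""
    return (baseline_rmse - model_rmse) / baseline_rmse


# Made-up values for one state and occupation.
actual = [0.60, 0.62, 0.61]
baseline = [0.58, 0.65, 0.60]
with_index = [0.59, 0.63, 0.61]
print(relative_gain(rmse(actual, baseline), rmse(actual, with_index)))
```

A positive value means the second model has the lower error; a negative value means it does worse than the baseline.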


Representativeness of Twitter Job Opening Index

Input files

The input files for the Representativeness of Twitter Job Opening Index are as follows:
  • pipeline/csv/merged/: Created during the inference process. It contains monthly CSVs that hold the index data.
  • data/job_offer/: Contains files with URLs leading to the official data, which you can use to download and parse it. The data itself is not shared, as it is not ours to share. Please feel free to contact us if you run into issues.

Twitter index analysis

First, run aggragate_tweets.ipynb (located under data/scripts). This creates the analyze_tweets folder under data, along with the files required for the analysis. Next, run analyze_tweets.ipynb to re-create Fig. 4 and Fig. 5 from the paper.


Environment Setup

Before running the scripts, you need to set up your Python environment:

  1. Ensure you have Python 3.7 installed on your system.
  2. Install the required dependencies by running the following command in your terminal:

       pip install -r requirements.txt

     The requirements.txt file contains all the necessary Python packages to run the code in this repository.
  3. Before running the inference using the run_pipeline.sh script, put your environment name in the batch files under batch_files.

Code Availability Statement

The code used in this study and provided here includes the methodology for accurately extracting occupation and location information from job postings on social media, and scripts for evaluating the coverage and representativeness of the Twitter-based job vacancy index. As Twitter information cannot be shared, it is not included here.

How to Cite

Please cite our work as follows:

Sobol Portnov, B., Tonneau, M., Lee, D., Fraiberger, S., & Grinberg, N. (2024). 280 Characters to Employment: Using Twitter to Quantify Job Vacancies [Conference paper]. Proceedings of the Eighteenth International AAAI Conference on Web and Social Media (ICWSM)

or in BibTeX, using:

    @inproceedings{Sobol_Portnov_280_Characters_to_2024,
      author = {Sobol Portnov, Boris and Tonneau, Manuel and Lee, Do and Fraiberger, Samuel and Grinberg, Nir},
      booktitle = {Proceedings of the Eighteenth International AAAI Conference on Web and Social Media (ICWSM)},
      month = jun,
      series = {International Conference on Web and Social Media (ICWSM) 2024},
      title = {{280 Characters to Employment: Using Twitter to Quantify Job Vacancies}},
      year = {2024}
    }


License

The assets in this repository are subject to their respective licenses. Usage complies with the requirements set forth by the data providers.


Contact

For inquiries regarding the paper or the repository, please contact the authors at the following email addresses:
  • Boris Sobol Portnov: boris.sobol@gmail.com
  • Manuel Tonneau: manuel.tonneau@oii.ox.ac.uk
  • Do Lee: dq1204@nyu.edu
  • Samuel Fraiberger: sfraiberger@worldbank.org
  • Nir Grinberg: nirgrn@bgu.ac.il

Owner

  • Name: Socially Embedded Lab
  • Login: Socially-Embedded-Lab
  • Kind: organization

Code and data from research projects conducted at BGU's Socially Embedded Lab.

Citation (CITATION.cff)

cff-version: 1.2.0
title: >-
  280 Characters to Employment: Using Twitter to Quantify Job Vacancies
message: >-
  If you use code in this repository or the associated
  data, please cite our ICWSM paper.
type: software
authors:
  - given-names: Boris
    family-names: Sobol Portnov
    email: boris.sobol@gmail.com
  - given-names: Manuel 
    family-names: Tonneau
    email: manuel.tonneau@oii.ox.ac.uk
  - given-names: Do
    family-names: Lee
    email: dq1204@nyu.edu
  - given-names: Samuel 
    family-names: Fraiberger
    email: sfraiberger@worldbank.org
  - given-names: Nir
    family-names: Grinberg
    email: nirgrn@bgu.ac.il
    affiliation: Ben-Gurion University
    orcid: 'https://orcid.org/0000-0002-1277-894X'
repository-artifact: 'https://doi.org/10.7910/DVN/RQBWAC'
license: CC-BY-NC-SA-4.0
preferred-citation:
  type: conference-paper
  authors:
    - given-names: Boris
      family-names: Sobol Portnov
      email: boris.sobol@gmail.com
    - given-names: Manuel 
      family-names: Tonneau
      email: manuel.tonneau@oii.ox.ac.uk
    - given-names: Do
      family-names: Lee
      email: dq1204@nyu.edu
    - given-names: Samuel 
      family-names: Fraiberger
      email: sfraiberger@worldbank.org
    - given-names: Nir
      family-names: Grinberg
      email: nirgrn@bgu.ac.il
      affiliation: Ben-Gurion University
      orcid: 'https://orcid.org/0000-0002-1277-894X'
  title: "280 Characters to Employment: Using Twitter to Quantify Job Vacancies"
  month: 6
  collection-title: "Proceedings of the Eighteenth International AAAI Conference on Web and Social Media (ICWSM)"
  collection-type: proceedings
  conference:
    name: "International Conference on Web and Social Media (ICWSM) 2024"
  year: 2024


Dependencies

requirements.txt pypi
  • Fiona ==1.9.3
  • Jinja2 ==3.1.2
  • Keras-Preprocessing ==1.1.2
  • Levenshtein ==0.21.0
  • Markdown ==3.4.1
  • MarkupSafe ==2.1.1
  • Pillow ==9.4.0
  • PySocks ==1.7.1
  • PyYAML ==6.0
  • Pygments ==2.15.1
  • QtPy ==2.3.1
  • Send2Trash ==1.8.2
  • Werkzeug ==2.2.2
  • absl-py ==1.3.0
  • accelerate ==0.18.0
  • adjustText ==0.8
  • aiohttp ==3.8.4
  • aiosignal ==1.3.1
  • anyio ==3.6.2
  • argon2-cffi ==21.3.0
  • argon2-cffi-bindings ==21.2.0
  • astunparse ==1.6.3
  • async-generator ==1.10
  • async-timeout ==4.0.2
  • asynctest ==0.13.0
  • attrs ==22.2.0
  • backcall ==0.2.0
  • beautifulsoup4 ==4.12.2
  • bleach ==6.0.0
  • cachetools ==5.2.0
  • certifi ==2022.12.7
  • cffi ==1.15.1
  • charset-normalizer ==2.1.1
  • chart-studio ==1.1.0
  • click ==8.1.3
  • click-plugins ==1.1.1
  • cligj ==0.7.2
  • colorama ==0.4.6
  • colorlover ==0.3.0
  • cufflinks ==0.17.3
  • cycler ==0.11.0
  • datasets ==2.12.0
  • debugpy ==1.6.7
  • decorator ==5.1.1
  • defusedxml ==0.7.1
  • dill ==0.3.6
  • entrypoints ==0.4
  • et-xmlfile ==1.1.0
  • evaluate ==0.4.0
  • exceptiongroup ==1.1.0
  • fastjsonschema ==2.16.3
  • filelock ==3.9.0
  • flatbuffers ==22.12.6
  • fonttools ==4.38.0
  • frozenlist ==1.3.3
  • fsspec ==2023.1.0
  • fuzzywuzzy ==0.18.0
  • gast ==0.4.0
  • geographiclib ==2.0
  • geopandas ==0.10.2
  • geopy ==2.3.0
  • google-auth ==2.15.0
  • google-auth-oauthlib ==0.4.6
  • google-pasta ==0.2.0
  • grpcio ==1.51.1
  • h11 ==0.14.0
  • h5py ==3.7.0
  • huggingface-hub ==0.14.1
  • idna ==3.4
  • importlib-metadata ==6.0.0
  • importlib-resources ==5.12.0
  • ipykernel ==6.16.2
  • ipython ==7.34.0
  • ipython-genutils ==0.2.0
  • ipywidgets ==8.0.6
  • jedi ==0.18.2
  • joblib ==1.2.0
  • jsonschema ==4.17.3
  • jupyter ==1.0.0
  • jupyter-client ==7.4.9
  • jupyter-console ==6.6.3
  • jupyter-core ==4.12.0
  • jupyter-server ==1.24.0
  • jupyterlab-pygments ==0.2.2
  • jupyterlab-widgets ==3.0.7
  • kaleido ==0.2.1
  • keras ==2.8.0
  • kiwisolver ==1.4.4
  • libclang ==14.0.6
  • matplotlib ==3.5.3
  • matplotlib-inline ==0.1.6
  • mistune ==2.0.5
  • multidict ==6.0.4
  • multiprocess ==0.70.14
  • munch ==2.5.0
  • nbclassic ==0.5.6
  • nbclient ==0.7.4
  • nbconvert ==7.3.1
  • nbformat ==5.8.0
  • nest-asyncio ==1.5.6
  • nltk ==3.8.1
  • notebook ==6.5.4
  • notebook-shim ==0.2.3
  • numpy ==1.21.6
  • oauthlib ==3.2.2
  • openpyxl ==3.1.2
  • opt-einsum ==3.3.0
  • outcome ==1.2.0
  • packaging ==22.0
  • pandas ==1.3.5
  • pandocfilters ==1.5.0
  • parso ==0.8.3
  • patsy ==0.5.3
  • pickleshare ==0.7.5
  • pkgutil-resolve-name ==1.3.10
  • plotly ==5.11.0
  • prometheus-client ==0.16.0
  • prompt-toolkit ==3.0.38
  • protobuf ==3.19.6
  • psutil ==5.9.5
  • pyDeprecate ==0.3.2
  • pyarrow ==10.0.1
  • pyasn1 ==0.4.8
  • pyasn1-modules ==0.2.8
  • pycparser ==2.21
  • pyparsing ==3.0.9
  • pyproj ==3.2.1
  • pyrsistent ==0.19.3
  • python-Levenshtein ==0.21.0
  • python-dateutil ==2.8.2
  • pytz ==2023.3
  • pywin32 ==306
  • pywinpty ==2.0.10
  • pyzmq ==25.0.2
  • qtconsole ==5.4.2
  • rapidfuzz ==3.0.0
  • regex ==2022.10.31
  • requests ==2.28.1
  • requests-oauthlib ==1.3.1
  • responses ==0.18.0
  • retrying ==1.3.4
  • rsa ==4.9
  • sacremoses ==0.0.53
  • scikit-learn ==1.0.2
  • scipy ==1.7.3
  • seaborn ==0.12.2
  • selenium ==4.7.2
  • sentence-transformers ==2.2.2
  • sentencepiece ==0.1.97
  • seqeval ==1.2.2
  • shapely ==2.0.1
  • six ==1.16.0
  • sniffio ==1.3.0
  • sortedcontainers ==2.4.0
  • soupsieve ==2.4.1
  • statsmodels ==0.13.2
  • tenacity ==8.1.0
  • tensorboard ==2.8.0
  • tensorboard-data-server ==0.6.1
  • tensorboard-plugin-wit ==1.8.1
  • tensorflow ==2.8.0
  • tensorflow-estimator ==2.11.0
  • tensorflow-io-gcs-filesystem ==0.29.0
  • termcolor ==2.2.0
  • terminado ==0.17.1
  • tf-estimator-nightly ==2.8.0.dev2021122109
  • threadpoolctl ==3.1.0
  • tinycss2 ==1.2.1
  • tokenizers ==0.13.2
  • torch ==1.7.1
  • torchaudio ==0.7.2
  • torchmetrics ==0.7.2
  • torchvision ==0.8.2
  • tornado ==6.2
  • tqdm ==4.64.1
  • traitlets ==5.9.0
  • transformers ==4.16.2
  • trio ==0.22.0
  • trio-websocket ==0.9.2
  • typing-extensions ==4.5.0
  • urllib3 ==1.26.13
  • wcwidth ==0.2.6
  • webencodings ==0.5.1
  • websocket-client ==1.5.1
  • widgetsnbextension ==4.0.7
  • wrapt ==1.14.1
  • wsproto ==1.2.0
  • xxhash ==3.2.0
  • yarl ==1.9.2
  • zipp ==3.11.0