twitter-job-postings
Replication materials for the paper "280 Characters to Employment: Using Twitter to Quantify Job Vacancies"
https://github.com/socially-embedded-lab/twitter-job-postings
Repository
Basic Info
- Host: GitHub
- Owner: Socially-Embedded-Lab
- License: other
- Language: HTML
- Default Branch: main
- Size: 29.5 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
280 Characters to Employment
Using Twitter to Quantify Job Vacancies

Table of Contents
- Paper Summary
- Repository Contents
- Inference Pipeline
- Input Files Format
- AR Models
- Representativeness of Twitter Job Opening Index
- Environment Setup
- Code Availability Statement
- How to Cite
- License
- Contact

Paper Summary
This paper investigates the potential of social media, specifically Twitter, to provide a complementary signal for estimating labor market demand. We introduce a statistical approach for extracting information about the location and occupation advertised in job vacancies posted on Twitter. We construct an aggregate index of labor market demand by occupational class in every major U.S. city from 2015 to 2022 and evaluate it against official statistics and an index from a large aggregator of online job postings. The findings suggest that the Twitter-based index is strongly correlated with official statistics and can improve the prediction of official statistics across occupations and states.
If you use code or data from this repository, please cite our ICWSM paper.
Repository Contents
- ar_models: Directory containing scripts to perform AR on the Twitter index. More information under the AR Models section.
- data: Directory containing the datasets used in the study, with aggregate state and occupation data for replication purposes.
  - analyze_tweets: Post-inference CSVs and figures.
  - csv: Employment data, job titles and location information.
  - job_offer: Official data to compare the Twitter index to.
  - scripts: Scripts used to build the files for inference, as well as later analysis of the index.
- job_title_classifier: Source code directory with scripts for training the Job Title classifier with different variations.
- ner_model: Source code directory with scripts for training the NER model.
- pipeline: Source code directory with scripts for our index inference.

Inference Pipeline
The inference process involves a series of scripts executed in the following order:
- Preparation: The build_job_offer.py script (located under data/scripts) takes the initial input file and prepares the data for the inference process.
- Deduplication: The dedup.py script (also under data/scripts) deduplicates the data so that only unique job postings are processed.
- Inference: The pipeline_single.py script (located under pipeline) performs the inference using the prepared data.
- Post-Inference Processing: The post_predict.py script (also under pipeline) processes the output of the inference to generate the final results.
The whole sequence can be triggered with the run_pipeline.sh script.
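The execution order above can be sketched in Python. This is a minimal illustration, not the contents of run_pipeline.sh: the script paths follow the locations named in this README, while the `build_commands` and `run_pipeline` helpers, the `python` interpreter name, and the assumption that the scripts take no command-line arguments are hypothetical.

```python
import subprocess
from pathlib import Path

# Script locations as described in this README. Running them without
# arguments is an assumption; the real scripts may require options.
PIPELINE_STEPS = [
    ("Preparation", Path("data/scripts/build_job_offer.py")),
    ("Deduplication", Path("data/scripts/dedup.py")),
    ("Inference", Path("pipeline/pipeline_single.py")),
    ("Post-Inference Processing", Path("pipeline/post_predict.py")),
]


def build_commands(python="python"):
    """Return one command (a list of strings) per step, in execution order."""
    return [[python, script.as_posix()] for _, script in PIPELINE_STEPS]


def run_pipeline(dry_run=True):
    """Print each step and, unless dry_run is set, execute it in order."""
    for (name, _), cmd in zip(PIPELINE_STEPS, build_commands()):
        print(f"[{name}] {' '.join(cmd)}")
        if not dry_run:
            # check=True stops the sequence at the first failing step.
            subprocess.run(cmd, check=True)
```

Calling `run_pipeline()` only prints the planned commands; `run_pipeline(dry_run=False)` would actually launch them from the repository root.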
Input Files Format
For the Inference Process
The inference process works on input files with the following columns:
- user_id: The id of the user.
- tweet_timestamp: The timestamp of the tweet.
- tweet_text: The text content of the tweet.
- tweet_urls: URLs included in the tweet.
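Before starting inference, it can help to check that an input file carries these four columns. The sketch below assumes the input is a comma-separated file with a header row; the `missing_columns` helper and the sample row (including its timestamp format) are made up for illustration.

```python
import csv
import io

# Column names listed in this README for the inference input.
REQUIRED_COLUMNS = ("user_id", "tweet_timestamp", "tweet_text", "tweet_urls")


def missing_columns(csv_file):
    """Return the required columns absent from the header of an open CSV file."""
    reader = csv.DictReader(csv_file)
    header = reader.fieldnames or []
    return [col for col in REQUIRED_COLUMNS if col not in header]


# Made-up example; real files would be opened with open(path, newline="").
sample = io.StringIO(
    "user_id,tweet_timestamp,tweet_text,tweet_urls\n"
    "1,2020-01-01 12:00:00,We are hiring a nurse in Boston!,https://example.com/job\n"
)
print(missing_columns(sample))  # prints []
```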
For NER Model Training
To train the Named Entity Recognition (NER) model, an input file with the following columns is required:
- Text - The text of the tweet.
- Token - The word that was being tagged.
- ORG - 1 or 0 to mark if the token is an organization.
- LOC - 1 or 0 to mark if the token is a location.
- JOB_TITLE - 1 or 0 to mark if the token is part of a job title.
- MISC - Everything that doesn't fall under the previous categories.
You can train the model using the main.py script (located under ner_model).
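As an illustration of the column layout above, the sketch below expands a tagged tweet into one row per token. It assumes that each token carries exactly one label and that MISC is also a 0/1 column; the `to_rows` helper and the example tweet are hypothetical and not taken from the repository.

```python
import csv
import io

LABELS = ("ORG", "LOC", "JOB_TITLE", "MISC")


def to_rows(text, tagged_tokens):
    """Expand (token, label) pairs into one row per token with 0/1 label columns."""
    rows = []
    for token, label in tagged_tokens:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label}")
        row = {"Text": text, "Token": token}
        # Exactly one label column is set to 1 (an assumption of this sketch).
        row.update({name: int(name == label) for name in LABELS})
        rows.append(row)
    return rows


# Made-up example tweet and tags.
text = "Acme is hiring a nurse in Boston"
tags = [("Acme", "ORG"), ("is", "MISC"), ("hiring", "MISC"), ("a", "MISC"),
        ("nurse", "JOB_TITLE"), ("in", "MISC"), ("Boston", "LOC")]
rows = to_rows(text, tags)

# Serialize to CSV text with the column order used in this README.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["Text", "Token", *LABELS])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```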
For Job Title classifier Training
To train your own Job Title classifier, you can use the provided job_titles.csv together with the train_bert.py script (located under job_title_classifier).
AR Models
Input file
The input file for the AR model, merged_regression_ready.csv (located under pipeline/csv), is created during post-inference processing.
It contains the following columns:
- date - Date in a monthly granularity (01/12/2016, 01/01/2017, 01/02/2017, ...).
- state_name - The name of the state.
- occupation_name - The name of the occupation (Assemblers, Cleaners and helpers, ...).
- job_offer - Number of job offers found in our index.
- employment - Employment rates from the Current Population Survey (CPS).
- job_offer_rate - The number of job-opening tweets normalized by the total number of tweets posted in the state in a given time period.
- employment_rate - Employment normalized by the total employment in the state in a given time period.
- job_offer_rate_overall - The number of job-opening tweets normalized by the total number of tweets in a given time period.
- employment_rate_overall - Employment normalized by the total employment in a given time period.
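The two employment normalizations can be illustrated with a small sketch. It reflects one possible reading of the column descriptions, namely that the state total is the sum over occupations within a state and date, and the overall total is the sum over all rows for a date; the repository's scripts may compute the totals differently. The `add_employment_rates` helper and all numbers are made up.

```python
from collections import defaultdict


def add_employment_rates(rows):
    """Return copies of rows with employment_rate and employment_rate_overall added."""
    state_totals = defaultdict(float)    # keyed by (date, state_name)
    overall_totals = defaultdict(float)  # keyed by date
    for r in rows:
        state_totals[(r["date"], r["state_name"])] += r["employment"]
        overall_totals[r["date"]] += r["employment"]

    result = []
    for r in rows:
        new = dict(r)
        new["employment_rate"] = r["employment"] / state_totals[(r["date"], r["state_name"])]
        new["employment_rate_overall"] = r["employment"] / overall_totals[r["date"]]
        result.append(new)
    return result


# Made-up numbers for a single month and two hypothetical states.
rows = [
    {"date": "01/01/2017", "state_name": "A", "occupation_name": "Assemblers", "employment": 30},
    {"date": "01/01/2017", "state_name": "A", "occupation_name": "Cleaners and helpers", "employment": 70},
    {"date": "01/01/2017", "state_name": "B", "occupation_name": "Assemblers", "employment": 100},
]
for r in add_employment_rates(rows):
    print(r["state_name"], r["occupation_name"], r["employment_rate"], r["employment_rate_overall"])
```

The job-offer rates would follow the same pattern, but they need tweet totals that are not among the columns listed above.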
Time series prediction
To perform the time series prediction of employment rates, use pooled_effects.py (located under ar_models) to train the autoregressive (AR) models per state.
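For readers unfamiliar with AR models, the sketch below fits the simplest case, an AR(1) model y[t] = intercept + slope * y[t-1], by ordinary least squares. It is not the model in pooled_effects.py, whose specification (lag order, pooling, use of the Twitter index as a regressor) is not described here; the helper names and the toy series are assumptions.

```python
def fit_ar1(series):
    """Least-squares fit of y[t] = intercept + slope * y[t-1]."""
    x = series[:-1]  # lagged values
    y = series[1:]   # current values
    n = len(x)
    if n < 2:
        raise ValueError("need at least three observations")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("lagged series is constant; slope is undefined")
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict_next(series, intercept, slope):
    """One-step-ahead forecast from the last observed value."""
    return intercept + slope * series[-1]


# Toy series generated by y[t] = 1 + 0.5 * y[t-1], starting at 0.
toy = [0.0, 1.0, 1.5, 1.75, 1.875]
intercept, slope = fit_ar1(toy)
print(intercept, slope)  # close to 1.0 and 0.5
print(predict_next(toy, intercept, slope))
```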
Heat Map Plot
Finally, run plot_rmse.py to re-create the prediction gain heatmap (Fig. 6 in the paper).
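A prediction gain of this kind is commonly expressed as the relative reduction in root mean squared error (RMSE) when a model is augmented with an extra signal. The sketch below uses that definition as an assumption; the exact quantity plotted by plot_rmse.py is not specified here, and the function names and numbers are made up.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally long sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def relative_gain(actual, baseline_pred, augmented_pred):
    """Fraction by which the augmented model lowers the baseline RMSE."""
    base = rmse(actual, baseline_pred)
    if base == 0:
        raise ValueError("baseline RMSE is zero; relative gain is undefined")
    return (base - rmse(actual, augmented_pred)) / base


# Made-up values: the augmented forecasts halve the baseline error.
actual = [1.0, 2.0, 3.0]
baseline = [2.0, 3.0, 4.0]
augmented = [1.5, 2.5, 3.5]
print(relative_gain(actual, baseline, augmented))  # prints 0.5
```

A positive value means the augmented model predicts better than the baseline; a negative value means it predicts worse.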
Representativeness of Twitter Job Opening Index
Input files
The input files for the Representativeness of Twitter Job Opening Index are as follows:
- pipeline/csv/merged/ - This folder is created during the inference process. It contains monthly CSVs that hold the index data.
- data/job_offer/ - This folder contains files with URLs leading to the official data, which you can use to download and parse it. The data itself is not shared as it is not ours to share. Please feel free to reach out to us if you face any issues.
Twitter index analysis
First you will need to run aggragate_tweets.ipynb (located under data/scripts).
This creates the analyze_tweets folder under data, along with the files required for the analysis.
Next you can run analyze_tweets.ipynb to re-create Fig. 4 and Fig. 5 from the paper.
Environment Setup
Before running the scripts, you need to set up your Python environment:
- Ensure you have Python 3.7 installed on your system.
- Install the required dependencies by running the following command in your terminal:
  pip install -r requirements.txt
  The requirements.txt file contains all the necessary Python packages to run the code in this repository.
- Before running the inference using the run_pipeline.sh script, please modify the batch files under batch_files to use your environment name.

Code Availability Statement
The code used in this study and provided here includes the methodology for accurately extracting occupation and location information from job postings on social media and scripts for evaluating the coverage and representativeness of the Twitter-based job vacancy index. Because Twitter data cannot be shared, it is not included here.

How to Cite
Please cite our work as follows:
Sobol Portnov, B., Tonneau, M., Lee, D., Fraiberger, S., & Grinberg, N. (2024). 280 Characters to Employment: Using Twitter to Quantify Job Vacancies [Conference paper]. Proceedings of the Eighteenth International AAAI Conference on Web and Social Media (ICWSM).
or in BibTeX, using:
@inproceedings{Sobol_Portnov_280_Characters_to_2024,
author = {Sobol Portnov, Boris and Tonneau, Manuel and Lee, Do and Fraiberger, Samuel and Grinberg, Nir},
booktitle = {Proceedings of the Eighteenth International AAAI Conference on Web and Social Media (ICWSM)},
month = jun,
series = {International Conference on Web and Social Media (ICWSM) 2024},
title = {{280 Characters to Employment: Using Twitter to Quantify Job Vacancies}},
year = {2024}
}
License
The assets in this repository are subject to their respective licenses. Usage complies with the requirements set forth by the data providers.
Contact
For inquiries regarding the paper or the repository, please contact the authors at the following email addresses:
- Boris Sobol Portnov: boris.sobol@gmail.com
- Manuel Tonneau: manuel.tonneau@oii.ox.ac.uk
- Do Lee: dq1204@nyu.edu
- Samuel Fraiberger: sfraiberger@worldbank.org
- Nir Grinberg: nirgrn@bgu.ac.il
Owner
- Name: Socially Embedded Lab
- Login: Socially-Embedded-Lab
- Kind: organization
- Website: https://sel.sise.bgu.ac.il/
- Repositories: 2
- Profile: https://github.com/Socially-Embedded-Lab
Code and data from research projects conducted at BGU's Socially Embedded Lab.
Citation (CITATION.cff)
cff-version: 1.2.0
title: >-
  280 Characters to Employment: Using Twitter to Quantify Job Vacancies
message: >-
  If you use code in this repository or the associated
  data, please cite our ICWSM paper.
type: software
authors:
  - given-names: Boris
    family-names: Sobol Portnov
    email: boris.sobol@gmail.com
  - given-names: Manuel
    family-names: Tonneau
    email: manuel.tonneau@oii.ox.ac.uk
  - given-names: Do
    family-names: Lee
    email: dq1204@nyu.edu
  - given-names: Samuel
    family-names: Fraiberger
    email: sfraiberger@worldbank.org
  - given-names: Nir
    family-names: Grinberg
    email: nirgrn@bgu.ac.il
    affiliation: Ben-Gurion University
    orcid: 'https://orcid.org/0000-0002-1277-894X'
repository-artifact: 'https://doi.org/10.7910/DVN/RQBWAC'
license: CC-BY-NC-SA-4.0
preferred-citation:
  type: conference-paper
  authors:
    - given-names: Boris
      family-names: Sobol Portnov
      email: boris.sobol@gmail.com
    - given-names: Manuel
      family-names: Tonneau
      email: manuel.tonneau@oii.ox.ac.uk
    - given-names: Do
      family-names: Lee
      email: dq1204@nyu.edu
    - given-names: Samuel
      family-names: Fraiberger
      email: sfraiberger@worldbank.org
    - given-names: Nir
      family-names: Grinberg
      email: nirgrn@bgu.ac.il
      affiliation: Ben-Gurion University
      orcid: 'https://orcid.org/0000-0002-1277-894X'
  title: "280 Characters to Employment: Using Twitter to Quantify Job Vacancies"
  month: 6
  collection-title: "Proceedings of the Eighteenth International AAAI Conference on Web and Social Media (ICWSM)"
  collection-type: proceedings
  conference:
    name: "International Conference on Web and Social Media (ICWSM) 2024"
  year: 2024
Dependencies
- Fiona ==1.9.3
- Jinja2 ==3.1.2
- Keras-Preprocessing ==1.1.2
- Levenshtein ==0.21.0
- Markdown ==3.4.1
- MarkupSafe ==2.1.1
- Pillow ==9.4.0
- PySocks ==1.7.1
- PyYAML ==6.0
- Pygments ==2.15.1
- QtPy ==2.3.1
- Send2Trash ==1.8.2
- Werkzeug ==2.2.2
- absl-py ==1.3.0
- accelerate ==0.18.0
- adjustText ==0.8
- aiohttp ==3.8.4
- aiosignal ==1.3.1
- anyio ==3.6.2
- argon2-cffi ==21.3.0
- argon2-cffi-bindings ==21.2.0
- astunparse ==1.6.3
- async-generator ==1.10
- async-timeout ==4.0.2
- asynctest ==0.13.0
- attrs ==22.2.0
- backcall ==0.2.0
- beautifulsoup4 ==4.12.2
- bleach ==6.0.0
- cachetools ==5.2.0
- certifi ==2022.12.7
- cffi ==1.15.1
- charset-normalizer ==2.1.1
- chart-studio ==1.1.0
- click ==8.1.3
- click-plugins ==1.1.1
- cligj ==0.7.2
- colorama ==0.4.6
- colorlover ==0.3.0
- cufflinks ==0.17.3
- cycler ==0.11.0
- datasets ==2.12.0
- debugpy ==1.6.7
- decorator ==5.1.1
- defusedxml ==0.7.1
- dill ==0.3.6
- entrypoints ==0.4
- et-xmlfile ==1.1.0
- evaluate ==0.4.0
- exceptiongroup ==1.1.0
- fastjsonschema ==2.16.3
- filelock ==3.9.0
- flatbuffers ==22.12.6
- fonttools ==4.38.0
- frozenlist ==1.3.3
- fsspec ==2023.1.0
- fuzzywuzzy ==0.18.0
- gast ==0.4.0
- geographiclib ==2.0
- geopandas ==0.10.2
- geopy ==2.3.0
- google-auth ==2.15.0
- google-auth-oauthlib ==0.4.6
- google-pasta ==0.2.0
- grpcio ==1.51.1
- h11 ==0.14.0
- h5py ==3.7.0
- huggingface-hub ==0.14.1
- idna ==3.4
- importlib-metadata ==6.0.0
- importlib-resources ==5.12.0
- ipykernel ==6.16.2
- ipython ==7.34.0
- ipython-genutils ==0.2.0
- ipywidgets ==8.0.6
- jedi ==0.18.2
- joblib ==1.2.0
- jsonschema ==4.17.3
- jupyter ==1.0.0
- jupyter-client ==7.4.9
- jupyter-console ==6.6.3
- jupyter-core ==4.12.0
- jupyter-server ==1.24.0
- jupyterlab-pygments ==0.2.2
- jupyterlab-widgets ==3.0.7
- kaleido ==0.2.1
- keras ==2.8.0
- kiwisolver ==1.4.4
- libclang ==14.0.6
- matplotlib ==3.5.3
- matplotlib-inline ==0.1.6
- mistune ==2.0.5
- multidict ==6.0.4
- multiprocess ==0.70.14
- munch ==2.5.0
- nbclassic ==0.5.6
- nbclient ==0.7.4
- nbconvert ==7.3.1
- nbformat ==5.8.0
- nest-asyncio ==1.5.6
- nltk ==3.8.1
- notebook ==6.5.4
- notebook-shim ==0.2.3
- numpy ==1.21.6
- oauthlib ==3.2.2
- openpyxl ==3.1.2
- opt-einsum ==3.3.0
- outcome ==1.2.0
- packaging ==22.0
- pandas ==1.3.5
- pandocfilters ==1.5.0
- parso ==0.8.3
- patsy ==0.5.3
- pickleshare ==0.7.5
- pkgutil-resolve-name ==1.3.10
- plotly ==5.11.0
- prometheus-client ==0.16.0
- prompt-toolkit ==3.0.38
- protobuf ==3.19.6
- psutil ==5.9.5
- pyDeprecate ==0.3.2
- pyarrow ==10.0.1
- pyasn1 ==0.4.8
- pyasn1-modules ==0.2.8
- pycparser ==2.21
- pyparsing ==3.0.9
- pyproj ==3.2.1
- pyrsistent ==0.19.3
- python-Levenshtein ==0.21.0
- python-dateutil ==2.8.2
- pytz ==2023.3
- pywin32 ==306
- pywinpty ==2.0.10
- pyzmq ==25.0.2
- qtconsole ==5.4.2
- rapidfuzz ==3.0.0
- regex ==2022.10.31
- requests ==2.28.1
- requests-oauthlib ==1.3.1
- responses ==0.18.0
- retrying ==1.3.4
- rsa ==4.9
- sacremoses ==0.0.53
- scikit-learn ==1.0.2
- scipy ==1.7.3
- seaborn ==0.12.2
- selenium ==4.7.2
- sentence-transformers ==2.2.2
- sentencepiece ==0.1.97
- seqeval ==1.2.2
- shapely ==2.0.1
- six ==1.16.0
- sniffio ==1.3.0
- sortedcontainers ==2.4.0
- soupsieve ==2.4.1
- statsmodels ==0.13.2
- tenacity ==8.1.0
- tensorboard ==2.8.0
- tensorboard-data-server ==0.6.1
- tensorboard-plugin-wit ==1.8.1
- tensorflow ==2.8.0
- tensorflow-estimator ==2.11.0
- tensorflow-io-gcs-filesystem ==0.29.0
- termcolor ==2.2.0
- terminado ==0.17.1
- tf-estimator-nightly ==2.8.0.dev2021122109
- threadpoolctl ==3.1.0
- tinycss2 ==1.2.1
- tokenizers ==0.13.2
- torch ==1.7.1
- torchaudio ==0.7.2
- torchmetrics ==0.7.2
- torchvision ==0.8.2
- tornado ==6.2
- tqdm ==4.64.1
- traitlets ==5.9.0
- transformers ==4.16.2
- trio ==0.22.0
- trio-websocket ==0.9.2
- typing-extensions ==4.5.0
- urllib3 ==1.26.13
- wcwidth ==0.2.6
- webencodings ==0.5.1
- websocket-client ==1.5.1
- widgetsnbextension ==4.0.7
- wrapt ==1.14.1
- wsproto ==1.2.0
- xxhash ==3.2.0
- yarl ==1.9.2
- zipp ==3.11.0