Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (16.4%) to scientific vocabulary

Repository

Basic Info
  • Host: GitHub
  • Owner: viktorveterinarov
  • License: other
  • Language: Python
  • Default Branch: package
  • Size: 1.58 MB
Statistics
  • Stars: 0
  • Watchers: 1
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 3 years ago · Last pushed almost 3 years ago
Metadata Files
Readme Changelog Contributing License Citation Authors

README.md

Cleaning and Retrieving Occupational Codes

This branch fixes bugs encountered when importing and using the 'occupationcoder' package. It adds a program that exports job codes in CSV format after applying a comprehensive text-cleaning function to variables in the data.

The program included here is based on the algorithm originally written by Jyldyz Djumalieva, Arthur Turrell <http://aeturrell.github.io/home>, David Copple, James Thurgood, and Bradley Speigner, and on the efficiency changes made by Martin Wood <MartinWoodONS.github.io>.

Aim

Create a dataset containing the UK 3-digit Standard Occupational Classification (SOC) code for each record, given a job title, job description, and job sector.

The algorithm uses the SOC 2010 standard, more details of which can be found on the ONS' website <https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2010>.
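SOC 2010 codes are hierarchical: a four-digit unit group sits inside a three-digit minor group, which sits inside a two-digit sub-major group and a one-digit major group. As an illustrative sketch only (the helper name `soc_levels` is hypothetical and not part of this package), the 3-digit level can be read off a longer code by taking a prefix:

```python
def soc_levels(code) -> dict:
    """Split a SOC 2010 code into its hierarchy levels by prefix length."""
    digits = str(code).strip()
    if not digits.isdigit() or not 1 <= len(digits) <= 4:
        raise ValueError(f"not a SOC 2010 code: {code!r}")
    # Level names follow the SOC 2010 hierarchy, from broadest to most specific.
    names = ["major", "sub_major", "minor", "unit"]
    return {name: digits[: i + 1] for i, name in enumerate(names) if i < len(digits)}


if __name__ == "__main__":
    print(soc_levels("2136")["minor"])
```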

Changes

- Addresses debugging and installation issues when running from the terminal.

- Fixes issues when importing datasets in formats other than .csv (such as .dta).

- Adds a requirements file for installing the packages the algorithm needs to run smoothly.

- Adds a program that exports the SOC code to a .csv file after applying an extensive text-cleaning function that addresses issues related to HTML text encoding.

- User-friendly syntax: the user only needs to provide the arguments for the input and output datasets.

- Flexible variable names for job_description, job_title and job_sector when applying the function.
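To give a feel for the kind of HTML-related cleaning mentioned above, here is a minimal sketch using only the Python standard library. It is not the repository's cleaning function: the name `clean_text`, the regular expressions, and the order of the steps are all assumptions made for illustration.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")      # anything that looks like an HTML tag
WHITESPACE_RE = re.compile(r"\s+")   # runs of whitespace, including non-breaking spaces


def clean_text(raw: str) -> str:
    """Strip HTML tags, decode HTML entities, and collapse whitespace."""
    no_tags = TAG_RE.sub(" ", raw)                       # drop markup first
    decoded = html.unescape(no_tags)                     # "&amp;" -> "&", "&nbsp;" -> U+00A0
    normalised = unicodedata.normalize("NFKC", decoded)  # fold compatibility characters
    return WHITESPACE_RE.sub(" ", normalised).strip()


if __name__ == "__main__":
    print(clean_text("<p>Senior&nbsp;Nurse &amp; Midwife</p>"))
```

Removing tags before decoding entities keeps literal text such as `&lt;` from being mistaken for markup; input that has been HTML-encoded twice would need a second pass.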

Installation via terminal

  1. Clone this repository into your desired root folder and open a terminal.
  2. Set the path to the repo:

    cd <path to repo>

  3. Create and activate a virtual environment in the terminal (recommended):

Mac:

    pip install virtualenv
    virtualenv venv
    source venv/bin/activate

Windows:

    pip install virtualenv
    python -m venv venv
    venv\Scripts\activate

  4. Then run the package setup in the terminal:

    python setup.py sdist
    cd <path to sdist>
    pip install occupationcoder-<version>.tar.gz

The first line creates the .tar.gz file, the second navigates to the directory with the packaged code in, and the third line installs the package. The version number to use will be evident from the name of the .tar.gz file.

  5. Install the extra dependencies (package requirements) from the root folder after returning to the repo path:

    cd <path to repo>
    pip install -r requirements.txt
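After installing, one way to confirm that the package and its requirements are visible to the active environment is a small check like the following. This is a sketch, not part of the repository: the function name `missing_modules` is hypothetical, and the import names in the list are assumed to match the installed distributions.

```python
from importlib.util import find_spec


def missing_modules(names):
    """Return the top-level module names that cannot be located for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    # Assumed import names for the package and a few of its requirements.
    wanted = ["occupationcoder", "bs4", "nltk", "numpy", "sklearn", "spacy"]
    missing = missing_modules(wanted)
    print("All modules found" if not missing else "Missing: " + ", ".join(missing))
```

`find_spec` only locates modules and does not import them, so the check stays quick even for heavy packages.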

How to use the program

After installation, run the program from the terminal:

    python new_main.py <input_file_path> <output_file_path> <title_column> <sector_column> <description_column>

Here title_column, sector_column and description_column are optional. If they are omitted, the defaults are 'job_title', 'job_sector' and 'job_description' respectively.
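A command line of this shape (two required paths followed by three optional column names) can be expressed with `argparse` as sketched below. How the actual script parses its arguments is not shown here, so treat this as one possible approach; the default column names are written with underscores, which is an assumption about their original spelling.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Two required paths, then three optional column names with defaults."""
    parser = argparse.ArgumentParser(description="Append SOC codes to a dataset.")
    parser.add_argument("input_file_path")
    parser.add_argument("output_file_path")
    # nargs="?" makes a positional argument optional and falls back to `default`.
    parser.add_argument("title_column", nargs="?", default="job_title")
    parser.add_argument("sector_column", nargs="?", default="job_sector")
    parser.add_argument("description_column", nargs="?", default="job_description")
    return parser


if __name__ == "__main__":
    # Parsing an explicit list keeps the example independent of the real command line.
    args = build_parser().parse_args(["vacancies.dta", "coded.csv", "title"])
    print(args.title_column, args.sector_column, args.description_column)
```

Because the optional arguments are positional, they are filled in order: to override only the description column, the title and sector columns must also be given.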

The output file will be accessible at the specified path and will contain a new dataframe with the SOC codes appended in a new column.

You must provide the path to the input dataset containing the text and the desired path for the output dataset.
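The shape of the output (the input rows plus one extra column) can be illustrated with the standard `csv` module. This is only a sketch: `append_soc_column`, `placeholder_coder`, and the output column name `SOC_code` are assumptions, and the placeholder does no real coding.

```python
import csv
import io


def append_soc_column(source, target, coder, column="SOC_code"):
    """Copy CSV rows from `source` to `target`, adding one column filled by `coder`."""
    reader = csv.DictReader(source)
    fieldnames = list(reader.fieldnames or []) + [column]
    writer = csv.DictWriter(target, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        row[column] = coder(row)  # `coder` maps one input row to a code string
        writer.writerow(row)


def placeholder_coder(row):
    """Stand-in for the real coder; always returns an empty code."""
    return ""


if __name__ == "__main__":
    source = io.StringIO("job_title,job_sector,job_description\nnurse,health,ward care\n")
    target = io.StringIO()
    append_soc_column(source, target, placeholder_coder)
    print(target.getvalue())
```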

File and folder description

  • cleaning-retrieving-occ-code/occupationcoder/coder.py applies SOC codes to job descriptions
  • cleaning-retrieving-occ-code/occupationcoder/cleaner.py contains helper functions that mostly manipulate strings
  • cleaning-retrieving-occ-code/occupationcoder/createdictionaries turns the ONS' index of SOC codes into dictionaries used by occupationcoder/coder.py
  • cleaning-retrieving-occ-code/occupationcoder/dictionaries contains the dictionaries used by cleaning-retrieving-occ-code/occupationcoder/coder.py
  • cleaning-retrieving-occ-code/occupationcoder/outputs is the default output directory
  • cleaning-retrieving-occ-code/occupationcoder/tests/test_vacancies.csv contains 'test' vacancies to run the code on, used by unittests, accessible by you!
  • cleaning-retrieving-occ-code/occupationcoder/new_main_.py is the main script to run the program

This code was originally written by Jyldyz Djumalieva, Arthur Turrell <http://aeturrell.github.io/home>, David Copple, James Thurgood, and Bradley Speigner. If you use this code, please cite:

Turrell, A., Speigner, B., Djumalieva, J., Copple, D., & Thurgood, J. (2019). Transforming Naturally Occurring Text Data Into Economic Statistics: The Case of Online Job Vacancy Postings <https://www.nber.org/papers/w25837> (No. w25837). National Bureau of Economic Research.

@techreport{turrell2019transforming,
  title={Transforming naturally occurring text data into economic statistics: The case of online job vacancy postings},
  author={Turrell, Arthur and Speigner, Bradley and Djumalieva, Jyldyz and Copple, David and Thurgood, James},
  year={2019},
  institution={National Bureau of Economic Research}
}
  • Documentation: https://occupationcoder.readthedocs.io.

Owner

  • Name: Viktor Veterinarov
  • Login: viktorveterinarov
  • Kind: user
  • Location: Paris
  • Company: Sciences Po

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Turrell"
  given-names: "Arthur"
  orcid: "https://orcid.org/0000-0002-2525-0773"
- family-names: "Speigner"
  given-names: "Bradley"
- family-names: "Djumalieva"
  given-names: "Jyldyz"
- family-names: "Copple"
  given-names: "David"
- family-names: "Thurgood"
  given-names: "James"
title: "occupationcoder"
version: 1.0.0
doi: 10.3386/w25837
date-released: 2019-05-01
url: "https://github.com/aeturrell/occupationcoder"
preferred-citation:
  type: techreport
  authors:
  - family-names: "Turrell"
    given-names: "Arthur"
    orcid: "https://orcid.org/0000-0002-2525-0773"
  - family-names: "Speigner"
    given-names: "Bradley"
  - family-names: "Djumalieva"
    given-names: "Jyldyz"
  - family-names: "Copple"
    given-names: "David"
  - family-names: "Thurgood"
    given-names: "James"
  doi: "10.3386/w25837"
  journal: "National Bureau of Economic Research Working Papers"
  title: "Transforming Naturally Occurring Text Data Into Economic Statistics: The Case of Online Job Vacancy Postings"
  year: 2019
  number: "No. w25837"

Dependencies

requirements.txt pypi
  • STOP_WORDS *
  • bs4 *
  • emoji *
  • fuzzywuzzy *
  • gensim *
  • nltk *
  • numpy *
  • pillow *
  • sklearn *
  • spacy *
  • unidecode *
  • wordcloud *
requirements_dev.txt pypi
  • Sphinx ==3.5.4 development
  • bump2version ==1.0.1 development
  • coverage ==5.5 development
  • flake8 ==3.9.0 development
  • pip ==21.1 development
  • tox ==3.23.0 development
  • twine ==3.4.1 development
  • watchdog ==2.0.2 development
  • wheel ==0.36.2 development
setup.py pypi