nlp

https://github.com/viktorveterinarov/nlp

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: Found CITATION.cff file
✓ codemeta.json file: Found codemeta.json file
✓ .zenodo.json file: Found .zenodo.json file
○ DOI references
○ Academic publication links
○ Academic email domains
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: Low similarity (16.4%) to scientific vocabulary

Last synced: 10 months ago

Repository

Basic Info

Host: GitHub
Owner: viktorveterinarov
License: other
Language: Python
Default Branch: package
Size: 1.58 MB

Statistics

Stars: 0
Watchers: 1
Forks: 0
Open Issues: 0
Releases: 0

Created almost 3 years ago · Last pushed almost 3 years ago

Metadata Files

Readme, Changelog, Contributing, License, Citation, Authors

Cleaning and Retrieving Occupational Codes

This branch fixes several bugs that occur when importing and using the 'occupationcoder' package. It adds a program that applies a comprehensive text-cleaning function to the text variables in the data and then exports the resulting job codes in CSV format.

The program included here builds on the algorithm originally written by Jyldyz Djumalieva, Arthur Turrell <http://aeturrell.github.io/home>, David Copple, James Thurgood, and Bradley Speigner, and on the efficiency changes made by Martin Wood <MartinWoodONS.github.io>.

Aim

Create a dataset exporting a UK 3-digit standard occupational classification (SOC), given a job title, job description, and job sector.

The algorithm uses the SOC 2010 standard, more details of which can be found on the ONS' website <https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2010>.
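
As a reminder of what "3-digit" means here: SOC 2010 is hierarchical, with 4-digit unit groups nested inside 3-digit minor groups. The following is only an illustrative sketch of that relationship, not code from this package; the function name `soc_minor_group` is hypothetical.

```python
def soc_minor_group(unit_group_code: str) -> str:
    """Return the 3-digit minor group that contains a 4-digit SOC 2010 unit group.

    Hypothetical helper for illustration only; it is not part of occupationcoder.
    """
    code = unit_group_code.strip()
    if len(code) != 4 or not code.isdigit():
        raise ValueError(f"expected a 4-digit SOC unit group code, got {unit_group_code!r}")
    # The minor group is simply the first three digits of the unit group.
    return code[:3]
```

For example, `soc_minor_group("2136")` returns `"213"`.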

Changes

- Addresses debugging and installation issues when running from the terminal.

- Fixes issues when importing datasets in formats other than .csv (such as .dta).

- Adds a requirements file for installing the packages needed to run the algorithm smoothly.

- Adds a program that exports the SOC code to a .csv file after applying an extensive text-cleaning function that addresses issues related to HTML text encoding.

- User-friendly syntax: you only need to provide the arguments for the input and output datasets.

- Flexible variable names for job_description, job_title and job_sector when applying the function.
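
To make the HTML-related cleaning step more concrete, here is a minimal sketch of the kind of function involved. It is an assumption about the general approach, not the cleaning function shipped in this repository: the name `clean_text` is hypothetical, and it uses only the Python standard library rather than the packages listed in requirements.txt.

```python
import html
import re
import unicodedata

_TAG_RE = re.compile(r"<[^>]+>")   # matches HTML tags such as <p> or <br/>
_SPACE_RE = re.compile(r"\s+")     # matches runs of whitespace


def clean_text(raw: str) -> str:
    """Hypothetical cleaner: strip tags, decode entities, normalise whitespace."""
    text = _TAG_RE.sub(" ", raw)                # drop tags, keep their contents
    text = html.unescape(text)                  # "&amp;" -> "&", "&nbsp;" -> no-break space
    text = unicodedata.normalize("NFKC", text)  # fold compatibility characters, e.g. no-break space -> space
    text = _SPACE_RE.sub(" ", text)             # collapse repeated whitespace
    return text.strip().lower()
```

Tags are removed before entities are decoded so that text such as `&lt;` in a description is not mistaken for the start of a tag.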

Installation via terminal

Clone this repository into your desired root folder. Open a terminal and set the path to the repo:

    cd <path to repo>

Creating and activating a virtual environment is recommended in the terminal

Mac:

    pip install virtualenv
    virtualenv venv
    source venv/bin/activate

Windows:

    pip install virtualenv
    python -m venv venv
    venv\Scripts\activate

Then build and install the package from the terminal:

    python setup.py sdist
    cd <path to sdist>
    pip install occupationcoder-<version>.tar.gz

The first line creates the .tar.gz file, the second navigates to the directory with the packaged code in, and the third line installs the package. The version number to use will be evident from the name of the .tar.gz file.

Finally, return to the root folder of the repo and install the extra dependencies (package requirements):

    cd <path repo>
    pip install -r requirements.txt

How to use the program

After installation, run the following in the terminal:

    python new_main.py <input_file_path> <output_file_path> <title_column> <sector_column> <description_column>

Here title_column, sector_column and description_column are optional. If these arguments are omitted, the default values are 'job_title', 'job_sector' and 'job_description' respectively.
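
One way a script could accept two required paths followed by three optional column names is with `argparse` and `nargs="?"`. The sketch below is an assumption about how such an interface might be built, not the actual contents of the main script; `build_parser` is a hypothetical name, and the underscore spellings of the default column names are assumed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the command line described above."""
    parser = argparse.ArgumentParser(description="Append SOC codes to a dataset.")
    # Required positional arguments: where to read from and where to write to.
    parser.add_argument("input_file_path")
    parser.add_argument("output_file_path")
    # Optional positional arguments: nargs="?" falls back to the default when omitted.
    parser.add_argument("title_column", nargs="?", default="job_title")
    parser.add_argument("sector_column", nargs="?", default="job_sector")
    parser.add_argument("description_column", nargs="?", default="job_description")
    return parser
```

Because the optional arguments are positional, they must be supplied in order: to override only the description column, the title and sector columns have to be given as well.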

The output file is written to the specified path. It contains the input data with the SOC code entries appended in a new column.

You must provide both the path to the input dataset containing the text and the desired path for the output dataset.
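
The shape of the output (input rows plus one extra column) can be sketched as follows. This is a simplified illustration using the standard-library `csv` module, not the program's own implementation; the function name `append_soc_column`, the column name `SOC_code`, and the `code_row` callback standing in for the real coder are all hypothetical.

```python
import csv
from typing import Callable, Dict


def append_soc_column(
    in_path: str,
    out_path: str,
    code_row: Callable[[Dict[str, str]], str],
    soc_column: str = "SOC_code",
) -> int:
    """Copy a CSV file, adding one column whose value is computed per row.

    `code_row` is a stand-in for whatever assigns a SOC code to a row.
    Returns the number of data rows written.
    """
    with open(in_path, newline="", encoding="utf-8") as src, \
            open(out_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        fieldnames = list(reader.fieldnames or []) + [soc_column]
        writer = csv.DictWriter(dst, fieldnames=fieldnames)
        writer.writeheader()
        count = 0
        for row in reader:
            row[soc_column] = code_row(row)  # append the code to the existing row
            writer.writerow(row)
            count += 1
    return count
```

All original columns are preserved in their original order, and the new column is always written last.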

File and folder description

cleaning-retrieving-occ-code/occupationcoder/coder.py applies SOC codes to job descriptions
cleaning-retrieving-occ-code/occupationcoder/cleaner.py contains helper functions which mostly manipulate strings
cleaning-retrieving-occ-code/occupationcoder/createdictionaries turns the ONS' index of SOC codes into dictionaries used by occupationcoder/coder.py
cleaning-retrieving-occ-code/occupationcoder/dictionaries contains the dictionaries used by cleaning-retrieving-occ-code/occupationcoder/coder.py
cleaning-retrieving-occ-code/occupationcoder/outputs is the default output directory
cleaning-retrieving-occ-code/occupationcoder/tests/test_vacancies.csv contains 'test' vacancies to run the code on, used by unittests, accessible by you!
cleaning-retrieving-occ-code/occupationcoder/new_main_.py is the main script to run the program

This code was originally written by Jyldyz Djumalieva, Arthur Turrell <http://aeturrell.github.io/home>, David Copple, James Thurgood, and Bradley Speigner. If you use this code please cite:

Turrell, A., Speigner, B., Djumalieva, J., Copple, D., & Thurgood, J. (2019). Transforming Naturally Occurring Text Data Into Economic Statistics: The Case of Online Job Vacancy Postings <https://www.nber.org/papers/w25837> (No. w25837). National Bureau of Economic Research.

@techreport{turrell2019transforming,
  title={Transforming naturally occurring text data into economic statistics: The case of online job vacancy postings},
  author={Turrell, Arthur and Speigner, Bradley and Djumalieva, Jyldyz and Copple, David and Thurgood, James},
  year={2019},
  institution={National Bureau of Economic Research}
}

Documentation: https://occupationcoder.readthedocs.io.

Owner

Name: Viktor Veterinarov
Login: viktorveterinarov
Kind: user
Location: Paris
Company: Sciences Po

Repositories: 1
Profile: https://github.com/viktorveterinarov

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Turrell"
  given-names: "Arthur"
  orcid: "https://orcid.org/0000-0002-2525-0773"
- family-names: "Speigner"
  given-names: "Bradley"
- family-names: "Djumalieva"
  given-names: "Jyldyz"
- family-names: "Copple"
  given-names: "David"
- family-names: "Thurgood"
  given-names: "James"
title: "occupationcoder"
version: 1.0.0
doi: 10.3386/w25837
date-released: 2019-05-01
url: "https://github.com/aeturrell/occupationcoder"
preferred-citation:
  type: techreport
  authors:
  - family-names: "Turrell"
    given-names: "Arthur"
    orcid: "https://orcid.org/0000-0002-2525-0773"
  - family-names: "Speigner"
    given-names: "Bradley"
  - family-names: "Djumalieva"
    given-names: "Jyldyz"
  - family-names: "Copple"
    given-names: "David"
  - family-names: "Thurgood"
    given-names: "James"
  doi: "10.3386/w25837"
  journal: "National Bureau of Economic Research Working Papers"
  title: "Transforming Naturally Occurring Text Data Into Economic Statistics: The Case of Online Job Vacancy Postings"
  year: 2019
  number: "No. w25837"

Dependencies

requirements.txt pypi

STOP_WORDS *
bs4 *
emoji *
fuzzywuzzy *
gensim *
nltk *
numpy *
pillow *
sklearn *
spacy *
unidecode *
wordcloud *

requirements_dev.txt pypi

Sphinx ==3.5.4 development
bump2version ==1.0.1 development
coverage ==5.5 development
flake8 ==3.9.0 development
pip ==21.1 development
tox ==3.23.0 development
twine ==3.4.1 development
watchdog ==2.0.2 development
wheel ==0.36.2 development

setup.py pypi
