Science Score: 44.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: Found CITATION.cff file
- ✓ codemeta.json file: Found codemeta.json file
- ✓ .zenodo.json file: Found .zenodo.json file
- ○ DOI references
- ○ Academic publication links
- ○ Academic email domains
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: Low similarity (16.4%) to scientific vocabulary
Repository
Basic Info
- Host: GitHub
- Owner: viktorveterinarov
- License: other
- Language: Python
- Default Branch: package
- Size: 1.58 MB
Statistics
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
- Releases: 0
Metadata Files
README.md
Cleaning and Retrieving Occupational Codes
This branch fixes bugs that occur when importing and using the 'occupationcoder' package. It adds a program that exports job codes in CSV format after applying a comprehensive text-cleaning function to the text variables in the data.
The program included here is based on the algorithm originally written by Jyldyz Djumalieva, Arthur
Turrell <http://aeturrell.github.io/home>, David Copple, James
Thurgood, and Bradley Speigner, and on the efficiency changes made by Martin Wood <MartinWoodONS.github.io>.
Aim
Produce a dataset that assigns a UK 3-digit Standard Occupational Classification (SOC) code to each record, given a job title, job description, and job sector.
The algorithm uses the SOC 2010 standard, more details of which can
be found on the ONS'
website <https://www.ons.gov.uk/methodology/classificationsandstandards/standardoccupationalclassificationsoc/soc2010>.
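SOC 2010 codes are hierarchical: one digit identifies the major group, two the sub-major group, three the minor group, and four the unit group. As a minimal sketch of what "3-digit" means here, the helper below truncates a code to its minor group. The function name `minor_group` is hypothetical and is not part of the occupationcoder package.

```python
def minor_group(soc_code):
    """Return the 3-digit SOC 2010 minor group for a 3- or 4-digit code.

    Hypothetical helper for illustration only; it is not taken from
    occupationcoder.
    """
    digits = str(soc_code).strip()
    if not digits.isdigit() or not 3 <= len(digits) <= 4:
        raise ValueError(f"expected a 3- or 4-digit SOC code, got {soc_code!r}")
    # The minor group is the first three digits of the code.
    return digits[:3]
```

For example, `minor_group("2136")` and `minor_group(2136)` both give `"213"`.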
Changes
- Fixes debugging and installation issues when running from the terminal.
- Fixes errors when importing datasets in formats other than .csv (such as .dta).
- Adds a requirements file for installing the packages the algorithm needs to run smoothly.
- Adds a program that exports the SOC code to a .csv file after applying an extensive text-cleaning function that addresses HTML text-encoding issues.
- User-friendly syntax: only the arguments for the input and output datasets need to be provided.
- Flexible variable names for job_description, job_title and job_sector when applying the function.
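The cleaning function in this repository is not reproduced here. As a rough sketch of the kind of HTML-encoding cleanup described above, the snippet below uses only the Python standard library. The name `clean_html_text` and the exact steps are assumptions made for illustration, not the package's actual implementation.

```python
import html
import re

TAG_RE = re.compile(r"<[^>]+>")      # anything that looks like an HTML tag
WHITESPACE_RE = re.compile(r"\s+")   # runs of whitespace, including non-breaking spaces


def clean_html_text(raw):
    """Decode HTML entities, drop tags, and normalise whitespace and case.

    Hypothetical helper for illustration; not the function shipped in
    occupationcoder.
    """
    if raw is None:
        return ""
    text = html.unescape(str(raw))       # "&amp;" -> "&", "&nbsp;" -> "\xa0"
    text = TAG_RE.sub(" ", text)         # replace tags with a space
    text = WHITESPACE_RE.sub(" ", text)  # collapse whitespace to single spaces
    return text.strip().lower()
```

Entities are decoded before tags are removed so that escaped markup such as `&lt;b&gt;` is also stripped. The trade-off is that literal `<` and `>` characters in free text can be mistaken for tags.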
Installation via terminal
- Clone this repository into your desired root folder and open a terminal.
- Change into the repository directory:
cmd
cd <path to repo>
- Creating and activating a virtual environment is recommended:
Mac:
cmd
pip install virtualenv
virtualenv venv
source venv/bin/activate
Windows:
cmd
pip install virtualenv
python -m venv venv
venv\Scripts\activate
- Then build and install the package from the terminal:
cmd
python setup.py sdist
cd <path to sdist>
pip install occupationcoder-<version>.tar.gz
The first line creates the .tar.gz file, the second navigates to the directory containing the packaged code, and the third installs the package. The version number to use will be evident from the name of the .tar.gz file.
- Return to the repository root and install the extra dependencies (package requirements):
cmd
cd <path to repo>
pip install -r requirements.txt
How to use the program:
After installation, run the following in the terminal:
cmd
python new_main.py <input_file_path> <output_file_path> <title_column> <sector_column> <description_column>
Here title_column, sector_column and description_column are optional. If they are omitted, the defaults are 'job_title', 'job_sector' and 'job_description' respectively.
The output file is written to the specified path; it contains the input data with the SOC code entries appended in a new column.
Both the path to the input dataset containing the text and the desired path for the output dataset must be provided.
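The argument handling described above, two required paths followed by three optional column names, could be sketched with `argparse` as follows. This is an illustration rather than the contents of the main script. The function name `build_parser` is hypothetical, and the defaults are written with underscores on the assumption that these are the intended column names.

```python
import argparse


def build_parser():
    """Build a parser for two required paths and three optional column names.

    Hypothetical sketch; the actual script may parse its arguments differently.
    """
    parser = argparse.ArgumentParser(
        description="Clean job text and export SOC codes to a CSV file."
    )
    parser.add_argument("input_file_path", help="dataset containing the job text")
    parser.add_argument("output_file_path", help="where to write the coded dataset")
    # nargs="?" makes a positional argument optional, falling back to its default.
    parser.add_argument("title_column", nargs="?", default="job_title")
    parser.add_argument("sector_column", nargs="?", default="job_sector")
    parser.add_argument("description_column", nargs="?", default="job_description")
    return parser
```

Calling `build_parser().parse_args(["jobs.dta", "coded.csv"])` leaves all three column names at their defaults. Because the optional arguments are positional, a later one can only be overridden when the earlier ones are supplied too.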
File and folder description
- cleaning-retrieving-occ-code/occupationcoder/coder.py applies SOC codes to job descriptions
- cleaning-retrieving-occ-code/occupationcoder/cleaner.py contains helper functions which mostly manipulate strings
- cleaning-retrieving-occ-code/occupationcoder/createdictionaries turns the ONS' index of SOC codes into dictionaries used by occupationcoder/coder.py
- cleaning-retrieving-occ-code/occupationcoder/dictionaries contains the dictionaries used by cleaning-retrieving-occ-code/occupationcoder/coder.py
- cleaning-retrieving-occ-code/occupationcoder/outputs is the default output directory
- cleaning-retrieving-occ-code/occupationcoder/tests/test_vacancies.csv contains 'test' vacancies to run the code on, used by unittests, accessible by you!
- cleaning-retrieving-occ-code/occupationcoder/new_main_.py is the main script to run the program
This code was originally written by Jyldyz Djumalieva, Arthur
Turrell <http://aeturrell.github.io/home>, David Copple, James
Thurgood, and Bradley Speigner. If you use this code please cite:
Turrell, A., Speigner, B., Djumalieva, J., Copple, D., & Thurgood, J.
(2019). Transforming Naturally Occurring Text Data Into Economic
Statistics: The Case of Online Job Vacancy
Postings <https://www.nber.org/papers/w25837> (No. w25837). National
Bureau of Economic Research.
@techreport{turrell2019transforming,
title={Transforming naturally occurring text data into economic statistics: The case of online job vacancy postings},
author={Turrell, Arthur and Speigner, Bradley and Djumalieva, Jyldyz and Copple, David and Thurgood, James},
year={2019},
institution={National Bureau of Economic Research}
}
- Documentation: https://occupationcoder.readthedocs.io.
Owner
- Name: Viktor Veterinarov
- Login: viktorveterinarov
- Kind: user
- Location: Paris
- Company: Sciences Po
- Repositories: 1
- Profile: https://github.com/viktorveterinarov
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: "Turrell"
    given-names: "Arthur"
    orcid: "https://orcid.org/0000-0002-2525-0773"
  - family-names: "Speigner"
    given-names: "Bradley"
  - family-names: "Djumalieva"
    given-names: "Jyldyz"
  - family-names: "Copple"
    given-names: "David"
  - family-names: "Thurgood"
    given-names: "James"
title: "occupationcoder"
version: 1.0.0
doi: 10.3386/w25837
date-released: 2019-05-01
url: "https://github.com/aeturrell/occupationcoder"
preferred-citation:
  type: techreport
  authors:
    - family-names: "Turrell"
      given-names: "Arthur"
      orcid: "https://orcid.org/0000-0002-2525-0773"
    - family-names: "Speigner"
      given-names: "Bradley"
    - family-names: "Djumalieva"
      given-names: "Jyldyz"
    - family-names: "Copple"
      given-names: "David"
    - family-names: "Thurgood"
      given-names: "James"
  doi: "10.3386/w25837"
  journal: "National Bureau of Economic Research Working Papers"
  title: "Transforming Naturally Occurring Text Data Into Economic Statistics: The Case of Online Job Vacancy Postings"
  year: 2019
  number: "No. w25837"
Dependencies
- STOP_WORDS *
- bs4 *
- emoji *
- fuzzywuzzy *
- gensim *
- nltk *
- numpy *
- pillow *
- sklearn *
- spacy *
- unidecode *
- wordcloud *
- Sphinx ==3.5.4 development
- bump2version ==1.0.1 development
- coverage ==5.5 development
- flake8 ==3.9.0 development
- pip ==21.1 development
- tox ==3.23.0 development
- twine ==3.4.1 development
- watchdog ==2.0.2 development
- wheel ==0.36.2 development