occupationcoder

Given a job title and job description, the algorithm assigns a standard occupational classification (SOC) code to the job.

https://github.com/aeturrell/occupationcoder

Science Score: 44.0%

This score indicates how likely this project is to be science-related based on various indicators:

✓ CITATION.cff file: found
✓ codemeta.json file: found
✓ .zenodo.json file: found
○ DOI references
○ Academic publication links
○ Committers with academic emails
○ Institutional organization owner
○ JOSS paper metadata
○ Scientific vocabulary similarity: low similarity (14.5%) to scientific vocabulary

Keywords

economics jobs python soc text-analysis tf-idf vacancies

Keywords from Contributors

cryptocurrencies dynamics pypi simulator metaheuristics mesh sequences interactive hacking

Last synced: 6 months ago

Repository

Basic Info

Host: GitHub
Owner: aeturrell
License: other
Language: Python
Default Branch: package
Homepage: https://occupationcoder.readthedocs.io/
Size: 1.61 MB

Statistics

Stars: 74
Watchers: 7
Forks: 27
Open Issues: 10
Releases: 0

Created over 8 years ago · Last pushed over 1 year ago

Metadata Files

Readme Changelog Contributing License Citation Authors

README.rst

===============
occupationcoder
===============



.. image:: https://img.shields.io/pypi/v/occupationcoder.svg
        :target: https://pypi.python.org/pypi/occupationcoder

.. image:: https://img.shields.io/travis/aeturrell/occupationcoder.svg
        :target: https://travis-ci.com/aeturrell/occupationcoder

.. image:: https://readthedocs.org/projects/occupationcoder/badge/?version=latest
        :target: https://occupationcoder.readthedocs.io/en/latest/?version=latest
        :alt: Documentation Status


A tool to assign standard occupational classification codes to job vacancy descriptions
---------------------------------------------------------------------------------------

Given a job title, job description, and job sector, the algorithm assigns
a UK 3-digit standard occupational classification (SOC) code to the job.
The algorithm uses the **SOC 2010** standard, more details of which can
be found on the ONS website.

This code was originally written by Jyldyz Djumalieva, Arthur Turrell,
David Copple, James Thurgood, and Bradley Speigner. Martin Wood has
provided more recent code updates and improvements.

If you use this code, please cite:

Turrell, A., Speigner, B., Djumalieva, J., Copple, D., & Thurgood, J.
(2019). Transforming Naturally Occurring Text Data Into Economic
Statistics: The Case of Online Job Vacancy Postings (No. w25837).
National Bureau of Economic Research.

::

    @techreport{turrell2019transforming,
      title={Transforming naturally occurring text data into economic statistics: The case of online job vacancy postings},
      author={Turrell, Arthur and Speigner, Bradley and Djumalieva, Jyldyz and Copple, David and Thurgood, James},
      year={2019},
      institution={National Bureau of Economic Research}
    }

* Documentation: https://occupationcoder.readthedocs.io.

Pre-requisites
~~~~~~~~~~~~~~

See ``setup.py`` for the full list of required Python packages.

occupationcoder is built on top of NLTK and uses 'Wordnet' (a corpus,
number 82 on their list) and the Punkt Tokenizer Models (number 106 on
their list). When the coder is run, it will expect to find these in
their usual directories. If you have nltk installed, you can get these
resources using ``nltk.download()``, which will install them in the right
directories, or you can go to http://www.nltk.org/nltk_data/ to
download them manually (and follow the install instructions).
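
As a minimal sketch of the ``nltk.download()`` route, the helper below fetches
only the resources that are not already installed. The function name
``ensure_nltk_data`` and the resource paths ``corpora/wordnet`` and
``tokenizers/punkt`` are assumptions made for illustration and are not part of
occupationcoder; ``nltk.data.find`` and ``nltk.download`` are NLTK's own
functions.

```python
def ensure_nltk_data(resources=None, find=None, download=None):
    """Download any missing NLTK resources and return the names fetched.

    `resources` maps a download name to the path NLTK looks it up under.
    `find` and `download` default to nltk.data.find and nltk.download; they
    are parameters only so the helper can be exercised without NLTK.
    """
    if resources is None:
        # Assumed identifiers for Wordnet and the Punkt Tokenizer Models.
        resources = {"wordnet": "corpora/wordnet", "punkt": "tokenizers/punkt"}
    if find is None or download is None:
        import nltk  # imported lazily, only when the defaults are needed

        find = find or nltk.data.find
        download = download or nltk.download

    fetched = []
    for name, path in resources.items():
        try:
            find(path)  # raises LookupError when the resource is absent
        except LookupError:
            download(name)
            fetched.append(name)
    return fetched


if __name__ == "__main__":
    print(ensure_nltk_data())
```

If a resource is stored in a form that ``find`` does not recognise, the helper
simply asks NLTK to download it again, which should be harmless.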

A couple of the other packages, such as rapidfuzz, do not come
with the Anaconda distribution of Python. You can install these via pip
(if you have access to the internet) or download the relevant binaries
and install them manually.

File and folder description
~~~~~~~~~~~~~~~~~~~~~~~~~~~

-  ``occupationcoder/coder.py`` applies SOC codes to job descriptions
-  ``occupationcoder/cleaner.py`` contains helper functions that mostly
   manipulate strings
-  ``occupationcoder/createdictionaries`` turns the ONS' index of SOC
   codes into dictionaries used by ``occupationcoder/coder.py``
-  ``occupationcoder/dictionaries`` contains the dictionaries used by
   ``occupationcoder/coder.py``
-  ``occupationcoder/outputs`` is the default output directory
-  ``occupationcoder/tests/test_vacancies.csv`` contains 'test' vacancies
   to run the code on; it is used by the unit tests and is available for
   you to use too

Installation via terminal using pip
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Download the package and navigate to the download directory. Then use

.. code-block:: shell

    python setup.py sdist
    cd dist
    pip install occupationcoder-.tar.gz

The first line creates the .tar.gz file, the second navigates to the
directory with the packaged code in, and the third line installs the
package. The version number to use will be evident from the name of the
.tar.gz file.

Running the code as a Python script
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Import the coder and create an instance:

.. code-block:: python

    import pandas as pd
    from occupationcoder.coder import SOCCoder
    myCoder = SOCCoder()

To run the code with a single query, use the following syntax with the
``code_record(job_title,job_description,job_sector)`` method:

.. code-block:: python

    if __name__ == '__main__':
        # Print the SOC code assigned to a single vacancy
        print(myCoder.code_record('Physicist', 'Calculations of the universe', 'Professional scientific'))

Note that you can leave some of the fields blank and the algorithm will still
return a SOC code.
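
The sketch below shows one way to code several vacancies in a loop using only
the ``code_record(job_title, job_description, job_sector)`` method described
above. The helper name ``code_records`` is hypothetical, and passing empty
strings for blank fields is an assumption about how blanks are represented.

```python
def code_records(coder, records):
    """Return one result per (title, description, sector) tuple.

    `coder` is any object with a code_record(title, description, sector)
    method, for example an instance of occupationcoder's SOCCoder.
    """
    results = []
    for title, description, sector in records:
        results.append(coder.code_record(title, description, sector))
    return results


# Example use (requires occupationcoder to be installed):
#
#     from occupationcoder.coder import SOCCoder
#     vacancies = [
#         ("Physicist", "Calculations of the universe", "Professional scientific"),
#         ("Physicist", "", ""),  # blank description and sector (assumed form)
#     ]
#     print(code_records(SOCCoder(), vacancies))
```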

To run the code on a file (e.g. a CSV file named 'foo.csv') with the structure

+--------------+-------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+
| job\_title   | job\_description                                                                                                  | job\_sector                                       |
+==============+===================================================================================================================+===================================================+
| Physicist    | Make calculations about the universe, do research, perform experiments and understand the physical environment.   | Professional, scientific & technical activities   |
+--------------+-------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+

use

.. code-block:: python

    df = pd.read_csv('path/to/foo.csv')
    df = myCoder.code_data_frame(df, title_column='job_title', sector_column='job_sector', description_column='job_description')

The column name arguments are optional; the values shown above are the
defaults. This will return a new dataframe with SOC code entries appended
in a new column:

+--------------+-------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+-------------+
| job\_title   | job\_description                                                                                                  | job\_sector                                       | SOC\_code   |
+==============+===================================================================================================================+===================================================+=============+
| Physicist    | Make calculations about the universe, do research, perform experiments and understand the physical environment.   | Professional, scientific & technical activities   | 211         |
+--------------+-------------------------------------------------------------------------------------------------------------------+---------------------------------------------------+-------------+
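
Before handing a file to ``code_data_frame``, it can help to confirm that the
header contains the default column names shown above. This is a minimal sketch
using only the standard library; the helper name ``missing_columns`` is
hypothetical and not part of occupationcoder.

```python
import csv

# Default column names, as listed in the example call above.
EXPECTED_COLUMNS = ("job_title", "job_description", "job_sector")


def missing_columns(lines, expected=EXPECTED_COLUMNS):
    """Return the expected column names absent from a CSV header.

    `lines` is any iterable of text lines, such as an open file.
    """
    header = next(csv.reader(lines), [])
    present = {name.strip() for name in header}
    return [name for name in expected if name not in present]


# Example use:
#
#     with open('path/to/foo.csv', newline='') as handle:
#         print(missing_columns(handle))
```

If your file uses different names, pass them through the ``title_column``,
``sector_column`` and ``description_column`` arguments instead of renaming.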

Running the code from the command line
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If you have installed all the relevant packages in requirements.txt,
download the code and navigate to the occupationcoder folder (which
contains the README). Then run

.. code-block:: shell

    python -m occupationcoder.coder path/to/foo.csv

This will create a 'processed\_jobs.csv' file in the outputs/ folder
which has the original text and an extra 'SOC\_code' column with the
assigned SOC codes.
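
As an illustration of working with that output, the sketch below tallies how
often each code appears in the 'SOC\_code' column. It uses only the standard
library; the helper name ``count_soc_codes`` and the exact output path in the
usage comment are assumptions.

```python
import csv
from collections import Counter


def count_soc_codes(lines, column="SOC_code"):
    """Count occurrences of each value in `column` of a CSV.

    `lines` is any iterable of text lines, such as an open file.
    Rows with an empty or missing value are skipped.
    """
    reader = csv.DictReader(lines)
    return Counter(row[column] for row in reader if row.get(column))


# Example use (path assumed from the description above):
#
#     with open('outputs/processed_jobs.csv', newline='') as handle:
#         for code, count in count_soc_codes(handle).most_common(5):
#             print(code, count)
```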

Testing
~~~~~~~

To run the tests in your virtual environment, use

.. code-block:: shell

    python -m unittest

in the top-level occupationcoder directory. Look in
``test_occupationcoder.py`` for what is run and for examples of use.
The output appears in the 'processed\_jobs.csv' file in the outputs/
folder.

Acknowledgements
~~~~~~~~~~~~~~~~

We are very grateful to Emmet Cassidy for testing this algorithm.

Disclaimer
~~~~~~~~~~

This code is provided 'as is'. We would love it if you made it better or
extended it to work for other countries. All views expressed are our
personal views, not those of any employer.


Credits
-------

The development of this package was supported by the Bank of England.

This package was created with Cookiecutter_ and the `audreyr/cookiecutter-pypackage`_ project template.

.. _Cookiecutter: https://github.com/audreyr/cookiecutter
.. _`audreyr/cookiecutter-pypackage`: https://github.com/audreyr/cookiecutter-pypackage

Owner

Login: aeturrell
Kind: user

Website: www.aeturrell.com
Repositories: 9
Profile: https://github.com/aeturrell

Citation (CITATION.cff)

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
- family-names: "Turrell"
  given-names: "Arthur"
  orcid: "https://orcid.org/0000-0002-2525-0773"
- family-names: "Speigner"
  given-names: "Bradley"
- family-names: "Djumalieva"
  given-names: "Jyldyz"
- family-names: "Copple"
  given-names: "David"
- family-names: "Thurgood"
  given-names: "James"
title: "occupationcoder"
version: 1.0.0
doi: 10.3386/w25837
date-released: 2019-05-01
url: "https://github.com/aeturrell/occupationcoder"
preferred-citation:
  type: techreport
  authors:
  - family-names: "Turrell"
    given-names: "Arthur"
    orcid: "https://orcid.org/0000-0002-2525-0773"
  - family-names: "Speigner"
    given-names: "Bradley"
  - family-names: "Djumalieva"
    given-names: "Jyldyz"
  - family-names: "Copple"
    given-names: "David"
  - family-names: "Thurgood"
    given-names: "James"
  doi: "10.3386/w25837"
  journal: "National Bureau of Economic Research Working Papers"
  title: "Transforming Naturally Occurring Text Data Into Economic Statistics: The Case of Online Job Vacancy Postings"
  year: 2019
  number: "No. w25837"

GitHub Events

Total

Watch event: 2
Fork event: 2

Last Year

Watch event: 2
Fork event: 2

Committers

Last synced: about 2 years ago

All Time

Total Commits: 89
Total Committers: 9
Avg Commits per committer: 9.889
Development Distribution Score (DDS): 0.652

Past Year

Commits: 1
Committers: 1
Avg Commits per committer: 1.0
Development Distribution Score (DDS): 0.0

Top Committers

Name	Email	Commits
aeturrell	a**l@g**m	31
djyldyz	d****z	21
aeturrell	a****l	18
Ozzy Cavendish	5****S	14
Wood	M**d@o**k	1
dependabot[bot]	4****]	1
Ozzy Cavendish	m**d@g**m	1
pyup-bot	g**t@p**o	1
James Thurgood	j**d@b**k	1

Committer Domains (Top 20 + Academic)

bankofengland.co.uk: 1 pyup.io: 1 ons.gov.uk: 1

Issues and Pull Requests

Last synced: 6 months ago

All Time

Total issues: 12
Total pull requests: 7
Average time to close issues: about 2 months
Average time to close pull requests: 2 months
Total issue authors: 4
Total pull request authors: 5
Average comments per issue: 0.58
Average comments per pull request: 0.71
Merged pull requests: 4
Bot issues: 0
Bot pull requests: 3

Past Year

Issues: 0
Pull requests: 1
Average time to close issues: N/A
Average time to close pull requests: 1 minute
Issue authors: 0
Pull request authors: 1
Average comments per issue: 0
Average comments per pull request: 2.0
Merged pull requests: 0
Bot issues: 0
Bot pull requests: 0

Top Authors

Issue Authors

aeturrell (8)
baranberkay96 (2)
pyup-bot (1)
ccomunello (1)

Pull Request Authors

dependabot[bot] (3)
jeroenminderman (2)
MartinWoodONS (1)
jamesthurgood34 (1)
pyup-bot (1)

Top Labels

Issue Labels

bug (2) help wanted (1) enhancement (1)

Pull Request Labels

dependencies (3)

Packages

Total packages: 1
Total downloads:
- pypi 24 last-month

Total dependent packages: 0
Total dependent repositories: 1
Total versions: 1
Total maintainers: 1

pypi.org: occupationcoder

A tool to use job text, such as job description, to assign standard occupational classification codes.

Homepage: https://github.com/aeturrell/occupationcoder
Documentation: https://occupationcoder.readthedocs.io/
License: Custom
Latest release: 0.2.0 (published almost 5 years ago)

Versions: 1
Dependent Packages: 0
Dependent Repositories: 1
Downloads: 24 Last month

Rankings

Forks count: 8.0%

Stargazers count: 8.3%

Dependent packages count: 10.1%

Average: 21.2%

Dependent repos count: 21.6%

Downloads: 58.0%

Maintainers (1)

aeturrell

Last synced: 6 months ago

Dependencies

requirements_dev.txt pypi

Sphinx ==3.5.4 development
bump2version ==1.0.1 development
coverage ==5.5 development
flake8 ==3.9.0 development
pip ==21.1 development
tox ==3.23.0 development
twine ==3.4.1 development
watchdog ==2.0.2 development
wheel ==0.36.2 development

setup.py pypi