keywords-analysis

A tool to analyse words in a collection of corpora and identify whether certain words are over- or under-represented in a particular corpus compared to their representation in other corpora.

https://github.com/australian-text-analytics-platform/keywords-analysis

Science Score: 52.0%

This score indicates how likely this project is to be science-related based on various indicators:

  • CITATION.cff file
    Found CITATION.cff file
  • codemeta.json file
    Found codemeta.json file
  • .zenodo.json file
    Found .zenodo.json file
  • DOI references
  • Academic publication links
  • Academic email domains
  • Institutional organization owner
    Organization australian-text-analytics-platform has institutional domain (atap.edu.au)
  • JOSS paper metadata
  • Scientific vocabulary similarity
    Low similarity (12.7%) to scientific vocabulary
Last synced: 6 months ago

Repository

Basic Info
  • Host: GitHub
  • Owner: Australian-Text-Analytics-Platform
  • License: apache-2.0
  • Language: Python
  • Default Branch: main
  • Size: 11.6 MB
Statistics
  • Stars: 2
  • Watchers: 3
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created over 3 years ago · Last pushed 10 months ago
Metadata Files
Readme · License · Citation

README.md

Keywords Analysis

Abstract: In this notebook, you will use the KeywordsAnalysis tool to analyse words in a collection of corpora and identify whether certain words are over- or under-represented in a particular corpus (the study corpus) compared to their representation in another corpus (the reference corpus).

Setup

This tool has been designed to need minimal setup. You can run it in the cloud, and any packages it depends on will be installed for you automatically. To launch and use the tool, just click the icon below.

Binder

Note: CILogon authentication is required. You can use your institutional, Google or Microsoft account to log in.

If you have trouble authenticating, please refer to the CILogon troubleshooting guide.

If you do not have access to any of the above accounts, you can use the link below to access the tool (this is a free Binder version, limited to 2 GB of memory).

Binder

It may take a few minutes for Binder to launch the notebook and install the dependencies for the tool. Please be patient.

User Guide

For instructions on how to use the Keywords Analysis tool, please refer to the Keyword Analysis User Guide.

Load the data

This tool allows you to upload text data as one or more text files. Alternatively, you can upload an Excel spreadsheet that holds the text in one of its columns.

Note: If your text files are large (more than 10 MB in total), we suggest you compress (zip) them and upload the zip file instead. If you need help compressing your files, please check the user guide.
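For readers who want to see what loading a zipped set of text files involves, here is a minimal sketch using only the Python standard library. It is not the tool's own loader: the function name `read_texts_from_zip` and the sample file names are hypothetical, and UTF-8 encoding is an assumption.

```python
import io
import zipfile
from pathlib import Path


def read_texts_from_zip(zip_bytes: bytes) -> dict[str, str]:
    """Return {file name: text} for every .txt member of a zip archive.

    Hypothetical helper for illustration; assumes UTF-8 encoded files.
    """
    texts = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            # Skip anything that is not a plain text file.
            if name.lower().endswith(".txt"):
                raw = archive.read(name)
                texts[Path(name).name] = raw.decode("utf-8", errors="replace")
    return texts


# Build a tiny in-memory archive so the example runs without files on disk.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("study/doc1.txt", "The cat sat on the mat.")
    archive.writestr("study/doc2.txt", "A dog barked.")
    archive.writestr("notes.md", "not a text file, so it is skipped")

corpus = read_texts_from_zip(buffer.getvalue())
print(sorted(corpus))
```

For the spreadsheet route, `pandas.read_excel` can read a sheet into a DataFrame, after which the text column can be selected by name.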

Calculate Word Statistics

Once your texts have been uploaded, you can calculate statistics for the words in each corpus and visualise them in charts.

You can also save your analysis to an Excel spreadsheet and download it to your local computer.
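To make "over- or under-represented" concrete, the sketch below compares word counts between a study text and a reference text. This README does not list the exact statistics the tool computes, so the log-likelihood (G2) keyness score used here is an assumption, chosen because it is a common measure in corpus linguistics. The function names and the simple regex tokeniser are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lower-case the text and split it into word tokens (simplified)."""
    return re.findall(r"[a-z']+", text.lower())


def log_likelihood(a: int, b: int, c: int, d: int) -> float:
    """G2 score for a word seen a times in a study corpus of c tokens
    and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected count in the study corpus
    e2 = d * (a + b) / (c + d)  # expected count in the reference corpus
    total = 0.0
    if a > 0:
        total += a * math.log(a / e1)
    if b > 0:
        total += b * math.log(b / e2)
    return 2 * total


def keyword_table(study: str, reference: str) -> list[dict]:
    """One row per word, sorted from most to least distinctive."""
    study_counts = Counter(tokenize(study))
    ref_counts = Counter(tokenize(reference))
    c, d = sum(study_counts.values()), sum(ref_counts.values())
    rows = []
    for word in sorted(set(study_counts) | set(ref_counts)):
        a, b = study_counts[word], ref_counts[word]
        if a / c > b / d:
            direction = "over"
        elif a / c < b / d:
            direction = "under"
        else:
            direction = "equal"
        rows.append({
            "word": word,
            "study_count": a,
            "reference_count": b,
            "direction": direction,  # relative to the study corpus
            "log_likelihood": log_likelihood(a, b, c, d),
        })
    rows.sort(key=lambda row: row["log_likelihood"], reverse=True)
    return rows


table = keyword_table(
    "the cat sat on the mat the cat slept",
    "the dog sat on the rug the dog barked",
)
for row in table[:4]:
    print(row)
```

A word with the same relative frequency in both corpora scores 0. The score grows as the frequencies diverge, and the `direction` field says which way.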

Welch t-test and Fisher permutation test

In this notebook, you can also use a statistical test (the Welch t-test or the Fisher permutation test) to investigate whether the use of a certain word in one corpus differs significantly from the use of that same word in another corpus.

You can also plot the distribution of that word as a histogram to see how often it is used in the corpus.
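The sketch below shows the idea behind both tests using only the Python standard library. It is an illustration, not the tool's implementation: the per-document frequencies are made-up sample values, the function names are hypothetical, and the permutation test shown is a two-sided test on the difference in means, which may differ in detail from what the notebook runs.

```python
import math
import random
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    vx, vy = variance(x) / len(x), variance(y) / len(y)
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df


def permutation_p_value(x, y, n_resamples=5000, seed=0):
    """Two-sided permutation test on the difference in means."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    n = len(x)
    observed = abs(sum(x) / n - sum(y) / len(y))
    pooled = list(x) + list(y)
    hits = 0
    for _ in range(n_resamples):
        # Randomly reassign documents to the two corpora.
        rng.shuffle(pooled)
        diff = abs(sum(pooled[:n]) / n - sum(pooled[n:]) / len(y))
        if diff >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    # Add one to both counts so the p-value is never exactly zero.
    return (hits + 1) / (n_resamples + 1)


# Made-up frequencies of one word (per 1,000 tokens) in each document.
study = [4.0, 5.0, 6.0, 5.5, 4.5]
reference = [1.0, 2.0, 1.5, 2.5, 1.0]

t_stat, dof = welch_t(study, reference)
p_perm = permutation_p_value(study, reference)
print(f"Welch t = {t_stat:.2f}, df = {dof:.2f}, permutation p = {p_perm:.4f}")
```

To get a p-value for the Welch test, `scipy.stats.ttest_ind(study, reference, equal_var=False)` performs the same test and returns the statistic together with its p-value.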

Reference

The statistical calculations used in this tool are a Python implementation of the statistical calculations on this website.

Citation

If you find the Keywords Analysis tool useful in your research, please cite the following:

Jufri, Sony & Sun, Chao (2022). Keywords Analysis. v1.0. Australian Text Analytics Platform. Software. https://github.com/Australian-Text-Analytics-Platform/keywords-analysis

Owner

  • Name: Australian-Text-Analytics-Platform
  • Login: Australian-Text-Analytics-Platform
  • Kind: organization

Citation (CITATION.cff)

# This CITATION.cff file was generated with cffinit.
# Visit https://bit.ly/cffinit to generate yours today!

cff-version: 1.2.0
title: Keywords Analysis
message: >-
  If you use this software, please cite it using the
  metadata from this file.
type: software
authors:
  - given-names: Sony
    family-names: Jufri
    email: sony.jufri@sydney.edu.au
    affiliation: >-
      Sydney Informatics Hub, a core research facility of
      the University of Sydney
  - given-names: Chao
    family-names: Sun
    email: chao.sun@sydney.edu.au
    affiliation: >-
      Sydney Informatics Hub, a core research facility of
      the University of Sydney
repository-code: >-
  https://github.com/Australian-Text-Analytics-Platform/keywords-analysis
abstract: >-
  The Keywords Analysis tool is a text analytic tool that
  you can use to analyse keywords in a collection of corpus
  and identify whether certain words are over or
  under-represented in a particular corpus (the study
  corpus) compared to their representation in other corpus
  (the reference corpus). This tool allows you to calculate
  different word statistics from each corpus and display the
  analysis on the charts for your analysis.
keywords:
  - keywords analysis
  - keywords
  - word statistic
  - most used word
  - least used word
license: Apache-2.0
version: '1.0'
date-released: '2023-02-09'

GitHub Events

Total
  • Push event: 1
  • Pull request event: 1
  • Create event: 1
Last Year
  • Push event: 1
  • Pull request event: 1
  • Create event: 1

Dependencies

environment.yml conda
  • bokeh 2.4.3.*
  • ipywidgets 8.0.2.*
  • matplotlib 3.5.2.*
  • nltk 3.7.*
  • numpy 1.23.1.*
  • openpyxl 3.0.10.*
  • pandas 1.4.4.*
  • pip
  • python 3.9.13.*
  • scikit-learn 1.1.1.*
  • scipy 1.9.3.*
  • seaborn 0.11.2.*
  • tqdm 4.64.1.*