https://github.com/chasmani/zipfanalysis

Tools for analysing Zipf's law from text samples

Repository

Basic Info
  • Host: GitHub
  • Owner: chasmani
  • License: mit
  • Language: Python
  • Default Branch: master
  • Size: 2.3 MB
Statistics
  • Stars: 0
  • Watchers: 2
  • Forks: 0
  • Open Issues: 0
  • Releases: 0
Created almost 6 years ago · Last pushed almost 4 years ago

README.rst

============
zipfanalysis
============

Python tools for analysing Zipf's law in text samples.

The package can be installed from the Python Package Index (PyPI) with the terminal command:
::

	pip install zipfanalysis
	
WARNING: This tool is still in development and should not be relied upon wholly for academic research. 

-----
Usage
-----

The package can be used from within Python scripts to estimate Zipf exponents, assuming a simple power law model for 
word frequencies and ranks. To use the package, import it with:
::

	import zipfanalysis
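
The simple power law model referred to above can be written as ``p(rank) = C * rank^(-alpha)``. As a minimal standalone sketch of that model, assuming a finite vocabulary so that ``C`` can be computed by direct summation (``zipf_probabilities`` is a hypothetical helper for illustration and is not part of the ``zipfanalysis`` API):
::

	def zipf_probabilities(alpha, n_ranks):
	    # Unnormalised power law weight rank^(-alpha) for ranks 1..n_ranks.
	    weights = [rank ** -alpha for rank in range(1, n_ranks + 1)]
	    # Dividing by the total plays the role of the constant C.
	    total = sum(weights)
	    return [weight / total for weight in weights]

	# Example: probabilities of the three highest ranks for alpha = 1.
	probs = zipf_probabilities(1.0, 3)

A larger ``alpha`` concentrates more probability on the highest-ranked words.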

-------------
Simple Method
-------------

The easiest way to carry out an analysis on a book or text file, using different estimators, is:
::

	alpha_clauset = zipfanalysis.clauset("path_to_book.txt")

	alpha_pdf = zipfanalysis.ols_pdf("path_to_book.txt", min_frequency=3)

	alpha_cdf = zipfanalysis.ols_cdf("path_to_book.txt", min_frequency=3)

	alpha_abc = zipfanalysis.abc("path_to_book.txt")

---------------
In Depth Method
---------------

Convert a book or text file to the frequency of words, ranked from highest to lowest: 
::

	word_counts = zipfanalysis.preprocessing.preprocessing.get_rank_frequency_from_text("path_to_book.txt")
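
To illustrate what a rank-frequency representation looks like, here is a minimal sketch using only the standard library. It assumes the result is a list of word counts ordered from the most to the least frequent word; the tokenisation and return type used by the package itself may differ, and ``rank_frequency_from_string`` is a hypothetical name:
::

	import re
	from collections import Counter

	def rank_frequency_from_string(text):
	    # Lower-case the text and treat runs of letters and apostrophes as words.
	    words = re.findall(r"[a-z']+", text.lower())
	    # Count occurrences of each distinct word.
	    counts = Counter(words)
	    # Keep only the counts, ordered from most to least frequent.
	    return sorted(counts.values(), reverse=True)

	# "the" appears 3 times, "and" twice, and the other words once each.
	example = rank_frequency_from_string("the cat and the dog and the bird")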
	

Carry out different types of analysis to fit a power law to the data:
::

	# Clauset et al estimator
	alpha_clauset = zipfanalysis.estimators.clauset.clauset_estimator(word_counts)

	# Ordinary Least Squares regression on log(rank) ~ log(frequency) 
	# Optional low frequency cut-off
	alpha_pdf = zipfanalysis.estimators.ols_regression_pdf.ols_regression_pdf_estimator(word_counts, min_frequency=2)

	# Ordinary least squares regression on the complementary cumulative distribution function of ranks
	# OLS on log(P(R>rank)) ~ log(rank) 
	# Optional low frequency cut-off 
	alpha_cdf = zipfanalysis.estimators.ols_regression_cdf.ols_regression_cdf_estimator(word_counts)

	# Approximate Bayesian computation (regression method)
	# Assumes model of p(rank) = C prob_rank^(-alpha)
	# prob_rank is a word's rank in an underlying probability distribution
	alpha_abc = zipfanalysis.estimators.approximate_bayesian_computation.abc_estimator(word_counts)
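
As a rough illustration of the regression idea behind the OLS estimators, the sketch below fits a straight line to log(frequency) against log(rank) and reports the negated slope as the exponent. It is a standalone example under stated assumptions, not the package's implementation: the function name ``ols_loglog_exponent`` is hypothetical, and the direction of the regression and the handling of ``min_frequency`` are assumptions.
::

	import math

	def ols_loglog_exponent(counts, min_frequency=1):
	    # Pair each count with its rank (starting at 1) and apply the cut-off.
	    points = [(math.log(rank), math.log(count))
	              for rank, count in enumerate(counts, start=1)
	              if count >= min_frequency]
	    if len(points) < 2:
	        raise ValueError("need at least two points to fit a line")
	    n = len(points)
	    mean_x = sum(x for x, _ in points) / n
	    mean_y = sum(y for _, y in points) / n
	    # Ordinary least squares slope: covariance over variance.
	    sxx = sum((x - mean_x) ** 2 for x, _ in points)
	    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
	    slope = sxy / sxx
	    # A power law frequency ~ rank^(-alpha) has slope -alpha on log-log axes.
	    return -slope

	# Frequencies that follow 1000 / rank exactly, so the exponent is close to 1.
	alpha = ols_loglog_exponent([1000.0 / rank for rank in range(1, 11)])

The ``min_frequency`` default of 1 also guards against taking the logarithm of a zero count.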

------------------
Development Notes
------------------
The general workflow should be:

1. Import data into an ``n`` vector, e.g.
   ::

	n = zipfanalysis.import_book("filename.txt")
	n = zipfanalysis.import_list(list_of_words)
	n = zipfanalysis.import_counter(counter_of_words)

2. Carry out analysis on the data, e.g.
   ::

	zipfanalysis.n_pdf_regression(n)

3. Convert to different representations if needed, e.g.
   ::

	zipfanalysis.convert_to_f(n)



Owner

  • Name: Chasmani
  • Login: chasmani
  • Kind: user

Committers

Last synced: over 3 years ago

All Time
  • Total Commits: 37
  • Total Committers: 2
  • Avg Commits per committer: 18.5
  • Development Distribution Score (DDS): 0.027
Top Committers
Name             Email          Commits
chasmani         p****2@g****m  36
Charlie Pilgrim  m****m@t****k  1
Issues and Pull Requests

Last synced: 10 months ago

All Time
  • Total issues: 0
  • Total pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Total issue authors: 0
  • Total pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0

Packages

  • Total packages: 1
  • Total downloads:
    • pypi 17 last-month
  • Total dependent packages: 0
  • Total dependent repositories: 1
  • Total versions: 3
  • Total maintainers: 1
pypi.org: zipfanalysis

  • Versions: 3
  • Dependent Packages: 0
  • Dependent Repositories: 1
  • Downloads: 17 Last month
Rankings
Dependent packages count: 10.0%
Dependent repos count: 21.7%
Forks count: 29.8%
Average: 32.1%
Stargazers count: 38.8%
Downloads: 60.2%