https://github.com/chainsawriot/rstyle

The evolution of R programming styles.

Repository

Basic Info
  • Host: GitHub
  • Owner: chainsawriot
  • Language: R
  • Default Branch: master
  • Size: 306 MB
Statistics
  • Stars: 42
  • Watchers: 4
  • Forks: 2
  • Open Issues: 0
  • Releases: 0
Created over 7 years ago · Last pushed about 4 years ago

README.md

rstyle

Citation

Please cite this as: Yen, C.Y., Chang, M.H.W., Chan, C.H. (2019) A Computational Analysis of the Dynamics of R Style Based on 94 Million Lines of Code from All CRAN Packages in the Past 20 Years. Paper presented at the useR! 2019 conference, Toulouse, France. doi:10.31235/osf.io/ts2wq

A preprint of this paper is available here.

Assumptions

  1. Clone the entire CRAN into the ./cran subdirectory. [^1]

         rsync -rtlzv --delete cran.r-project.org::CRAN ./cran

     It takes 220 GB of disk space.

  2. Create code.db using the Makefile (skip this step if you already have code.db).

Files and dependencies

Key RDS files (in the data directory):

  1. target_meta.RDS - packages, one (randomly selected) submission per year.

  2. pkgsfunctionswithsyntaxfeature.RDS - package information with syntactic features.

R files:

0prep - collecting data and sampling

  1. 0prep01extractmetadata.R (requires: cloned CRAN mirror): extract metadata from tarballs. Generate target_meta.RDS and final_meta.RDS in the data directory.

  2. `cat code.sql | sqlite3 code.db`: generate the schema of the SQLite database code.db. Generate code.db (schema only).

  3. 0prep02_dump.R (requires: cloned CRAN mirror, target_meta.RDS): dump the source code, NAMESPACEs and DESCRIPTIONs into code.db. Generate code.db with data. The file is very large (> 20 GB).

  4. 0prep03extractdesc.R (requires: cloned CRAN mirror, target_meta.RDS): add the text description to target_meta.RDS as a column named desc. Generate target_meta.RDS (overwritten) in the data directory.
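
The "one randomly selected submission per year" rule behind target_meta.RDS can be sketched in base R as follows. This is only an illustration under stated assumptions, not the repository's code: the data frame `meta`, its columns `pkg_name`, `version` and `year`, and the helper `sample_one_per_year()` are all hypothetical.

```r
# Hypothetical metadata: one row per CRAN submission (not the real schema).
meta <- data.frame(
  pkg_name = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgB"),
  version  = c("0.1", "0.2", "1.0", "0.1", "0.2"),
  year     = c(2010, 2010, 2011, 2010, 2010),
  stringsAsFactors = FALSE
)

# Keep one randomly chosen submission for every package-year combination.
sample_one_per_year <- function(meta) {
  # Row indices grouped by package and year; empty combinations are dropped.
  groups <- split(seq_len(nrow(meta)), list(meta$pkg_name, meta$year), drop = TRUE)
  # Index into each group so that single-row groups are handled correctly.
  picked <- vapply(groups, function(idx) idx[sample.int(length(idx), 1)], integer(1))
  meta[sort(unname(picked)), , drop = FALSE]
}

set.seed(42)  # fixed seed so the sample is reproducible
target_meta <- sample_one_per_year(meta)
```

Indexing with `sample.int()` rather than calling `sample(idx, 1)` avoids the base R pitfall where `sample()` on a single number `n` draws from `1:n`.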

1functionnames - Analysis of function names

  1. 1functionnames01extractfunction_name.R (requires: code.db): extract the names of all exported functions from each package. Generate multiple fxdatayr...RDS files in the data directory.

  2. 1functionnames02functionname_analysis.R (requires: fxdatayr...RDS files): analyse the style in function names by year. Generate fxstyleby_year.RDS in the data directory.

  3. 1functionnames03functionname_vis.R (requires: fxstyleby_year.RDS): visualize the time trends of styles in function names. Generate images. (END)
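
To make the idea of "style in function names" concrete, here is a minimal sketch that assigns a function name to a naming convention with regular expressions. The category labels, the regular expressions and the helper `classify_name_style()` are assumptions made for illustration; the scripts above may define the styles differently.

```r
# Classify one function name into a (hypothetical) naming-style category.
classify_name_style <- function(fx_name) {
  if (grepl("^[a-z0-9]+$", fx_name)) return("alllower")
  if (grepl("^[A-Z0-9]+$", fx_name)) return("ALLUPPER")
  if (grepl("^[a-z0-9]+(_[a-z0-9]+)+$", fx_name)) return("lower_snake")
  if (grepl("^[a-z0-9]+(\\.[a-z0-9]+)+$", fx_name)) return("dotted.func")
  if (grepl("^[a-z][a-z0-9]*([A-Z][a-z0-9]*)+$", fx_name)) return("lowerCamel")
  if (grepl("^([A-Z][a-z0-9]*)+$", fx_name)) return("UpperCamel")
  "other"  # mixed conventions such as as.Date
}

fx_names <- c("read.csv", "read_csv", "readRDS", "DataFrame", "GET", "print", "as.Date")
styles <- vapply(fx_names, classify_name_style, character(1))

# Share of each style among the example names.
style_share <- prop.table(table(styles))
```

Computing `style_share` separately for the functions of each year would give the kind of time trend that the visualization step plots.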

2syntax - Analysis of style elements

  1. 2syntax01extractfeatures.R (requires: target_meta.RDS, code.db): extract syntactic features. This procedure is both CPU and I/O intensive. On a normal i5 computer, it would take a month to run. Generate syntaxfeature_yr...RDS files in the data directory.

  2. 2syntax02genpkgsfunctionswithsyntaxfeature.R (requires: syntaxfeature_yr...RDS files): combine all .RDS files into one. Generate pkgsfunctionswithsyntaxfeature.RDS.

  3. 2syntax03_vis.R (requires: pkgsfunctionswithsyntaxfeature.RDS): visualize the time trends of syntactic features. Generate images. (END)
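
As an illustration of what a syntactic feature can look like, the sketch below counts assignment operators using R's own parser via `utils::getParseData()`. Which features the scripts above actually extract is not listed in this README, so the choice of feature and the helper `count_assign_ops()` are assumptions.

```r
# Count assignment operators in a character vector of R source lines.
count_assign_ops <- function(code) {
  parsed <- parse(text = code, keep.source = TRUE)
  tokens <- utils::getParseData(parsed)$token
  c(
    left_arrow = sum(tokens == "LEFT_ASSIGN"),  # `<-` (the parser also uses this token for `<<-` and `:=`)
    equal_sign = sum(tokens == "EQ_ASSIGN")     # `=` used as an assignment
  )
}

code <- c(
  "x <- 1",
  "y = 2",
  "f <- function(a = 3) a + x + y"  # `a = 3` is a default argument, not an assignment
)
assign_counts <- count_assign_ops(code)
```

Working on parser tokens rather than raw text means that `=` in default arguments and in function calls is not mistaken for an assignment.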

3linelength - Analysis of line length

  1. 3linelength01_extraction.R (requires: code.db): generate comment_dist.RDS in data directory.

  2. 3linelength02_animation.R (requires: comment_dist.RDS): generate shiny app.
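
A line-length distribution of the kind analysed here can be computed with `nchar()`. The following is a minimal sketch on a few made-up lines; the variable names and the split into comment and code lines are assumptions, and the structure of comment_dist.RDS may differ.

```r
# A few made-up source lines, including one that is longer than 80 characters.
code_lines <- c(
  "x <- 1",
  "",
  "# a comment line",
  "result <- some_function(argument_one, argument_two)",
  paste0("long_call(", strrep("x", 90), ")")
)

line_length <- nchar(code_lines, type = "chars")
is_comment  <- grepl("^\\s*#", code_lines)  # lines that start with a comment marker

# Frequency of each line length, split by comment vs. code lines.
length_dist <- table(n_chars = line_length, comment = is_comment)

# Share of lines that exceed the conventional 80-character limit.
share_over_80 <- mean(line_length > 80)
```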

4communities - Community-based analysis

  1. 4communities01extractcran_dependency.R (requires: code.db): extract dependencies of packages from CRAN. Generate cran_dependency.RDS. (END)

  2. 4communities02buildcran_graph.R (requires: cran_dependency.RDS): build the CRAN dependency graph based on two fields, "Imports" and "Suggests". Generate crangraph.RDS. (END)

  3. 4communities03detectcrancommunityby_walktrap.R (requires: crangraph.RDS): detect CRAN communities using the walktrap algorithm. Generate commwalktrap.RDS and comm_size.RDS. In addition, it examines the robustness of the identified communities with respect to the choice of random seeds. (END)

  4. 4communities04communitybasedfeaturescorrection.R (requires: pkgsfunctionswithsyntaxfeature.RDS, commwalktrap.RDS, comm_size.RDS, commname.csv): assign community labels to each package, so that package-level summaries of syntax features and naming features can be used for analyzing the style variations among communities. Only applies to the 20 largest communities. Generate commlargest_feature.RDS. (END)

  5. 4communities05viscommunityposterimages.R (requires: commlargestfeature.RDS, crangraph.RDS, commwalktrap.RDS, comm_size.RDS, namingconvention.csv): visualize community-related analysis. (END)
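
The graph-building and community-detection steps can be sketched with the igraph package roughly as follows. This is an illustration only: the toy data frame `deps`, its column names and the helper `parse_dep_field()` are hypothetical, and treating the graph as undirected is a simplifying assumption.

```r
library(igraph)

# Hypothetical dependency records: package name plus raw DESCRIPTION fields.
deps <- data.frame(
  pkg      = c("pkgA", "pkgB", "pkgC", "pkgD", "pkgE", "pkgF"),
  imports  = c("pkgB, pkgC (>= 1.0)", "pkgC", "pkgD", "pkgE", "pkgF", ""),
  suggests = c("", "", "", "pkgF", "", ""),
  stringsAsFactors = FALSE
)

# Split a comma-separated dependency field and drop version constraints.
parse_dep_field <- function(field) {
  if (is.na(field) || !nzchar(field)) return(character(0))
  parts <- strsplit(field, ",", fixed = TRUE)[[1]]
  parts <- trimws(sub("\\(.*\\)", "", parts))
  parts[nzchar(parts)]
}

# One edge per (package, dependency) pair found in either field.
edges <- do.call(rbind, lapply(seq_len(nrow(deps)), function(i) {
  targets <- unique(c(parse_dep_field(deps$imports[i]),
                      parse_dep_field(deps$suggests[i])))
  if (length(targets) == 0) return(NULL)
  data.frame(from = deps$pkg[i], to = targets, stringsAsFactors = FALSE)
}))

cran_graph <- graph_from_data_frame(
  edges, directed = FALSE, vertices = data.frame(name = deps$pkg)
)

# Walktrap groups vertices that short random walks tend to stay within.
comm <- cluster_walktrap(cran_graph, steps = 4)
comm_membership <- membership(comm)  # community id per package
comm_size <- sizes(comm)             # number of packages per community
```

On the real graph, `comm_size` would be the natural basis for restricting later steps to the largest communities.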

5conversion - Convert key RDS files to csv for preservation

  1. 5conversion01makecsv.R (requires: target_meta.RDS, pkgsfunctionswithsyntax_feature.RDS): convert RDS files to CSV. (END)
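
Converting an RDS file that holds a data frame into CSV needs only base R. The sketch below uses a hypothetical helper `rds_to_csv()` and a temporary example file; it assumes the object has no list columns, which `write.csv()` cannot serialise.

```r
# Read a data frame from an RDS file and write it next to it as CSV.
rds_to_csv <- function(rds_path,
                       csv_path = sub("\\.RDS$", ".csv", rds_path, ignore.case = TRUE)) {
  obj <- readRDS(rds_path)
  stopifnot(is.data.frame(obj))
  write.csv(obj, csv_path, row.names = FALSE)
  invisible(csv_path)
}

# Usage with a throwaway file, so no real data is touched.
tmp_rds <- file.path(tempdir(), "example_meta.RDS")
saveRDS(data.frame(pkg_name = c("pkgA", "pkgB"), year = c(2010, 2011)), tmp_rds)
csv_path <- rds_to_csv(tmp_rds)
```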

Related projects

  • baaugwo - this project depends on this experimental package to extract metadata and dump code from R packages.

How to use Docker to build and launch the Docker instance?

  • Build the Docker image using the provided Dockerfile.

    • It is much faster to build the image inside the docker directory because less data is copied:

          cd docker/ ; docker build -t rstudio/rstyle -f Dockerfile . ; cd ../ ;

  • By default, Docker launches RStudio Server and mounts folders as the root user. This leaves the user rstudio (the default user of RStudio Server) without write access.

  • One solution to this problem is to make Docker launch RStudio Server using the current UID:

        docker run -v $(pwd):/home/$USER/rstyle -e USER=$USER -e PASSWORD=xxxx -e USERID=$UID -p 8787:8787 rstudio/rstyle

  • Alternatively, you can launch a development dashboard by executing the following command:

        bash dev-tmux.sh

If you are developing under Windows Subsystem for Linux (WSL), you may find that Docker cannot see the folder you mounted in the container. In that case, try to soft link /mnt/c/ to the root directory, as illustrated in this blog post.

Then clone this repository anywhere inside /c/Users/{YOUR_USERNAME} and specify the `PATH_RSTYLE` environment variable as shown below, so that you can launch the dashboard successfully.

    PATH_RSTYLE=/c/Users/{YOUR_USERNAME}/{PATH_TO_RSTYLE}/rstyle bash dev-tmux.sh

Labeling the communities identified by the walktrap algorithm

We manually assigned a name to each of the largest identified communities based on its 3 most important package members. We ranked the importance of packages within a community using the PageRank algorithm.

| commid|commname               |top                                 |
|-------:|:----------------------|:-----------------------------------|
|      6|base                   |methods, stats, MASS                |
|      4|Rstudio                |testthat, knitr, rmarkdown          |
|     28|Rcpp                   |Rcpp, tinytest, pinp                |
|      3|Statistical Analysis   |survival, Formula, sandwich         |
|      9|Machine Learning       |nnet, rpart, randomForest           |
|     16|Geography 1            |sp, rgdal, maptools                 |
|     15|GNU                    |gsl, expint, mnormt                 |
|     25|Bioconductor: Graph    |graph, Rgraphviz, bnlearn           |
|     49|Text Analysis          |tm, SnowballC, NLP                  |
|     42|GUI                    |tcltk, tkrplot, tcltk2              |
|     13|Infrastructure 1       |rsp, listenv, globals               |
|     17|Numerical Optimization |polynom, magic, numbers             |
|     40|Bioconductor: Genomics |Biostrings, IRanges, S4Vectors      |
|     77|RUnit                  |RUnit, ADGofTest, fAsianOptions     |
|     24|Survival Analysis      |kinship2, CompQuadForm, coxme       |
|      2|Sparse Matrix          |slam, ROI, registry                 |
|     44|Infrastructure 2       |RGtk2, gWidgetstcltk, gWidgetsRGtk2 |
|     75|Bioinformatics         |limma, affy, marray                 |
|     37|IO                     |RJSONIO, Rook, base64               |
|     45|rJava                  |rJava, xlsxjars, openNLP            |
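
The "top" column above lists the highest-ranked members of each community. A minimal sketch of such a ranking with igraph's `page_rank()` is shown below. The toy edge list, the community assignment and the helper `top_packages()` are hypothetical, and pointing each edge from a package to its dependency is an assumption about the graph's direction.

```r
library(igraph)

# Toy dependency edges: each edge points from a package to a package it depends on.
edges <- data.frame(
  from = c("pkgA", "pkgB", "pkgC", "pkgD", "pkgE", "pkgF"),
  to   = c("stats", "stats", "stats", "pkgA", "Rcpp", "Rcpp"),
  stringsAsFactors = FALSE
)
dep_graph <- graph_from_data_frame(edges, directed = TRUE)

# Hypothetical community assignment, named by package.
comm_membership <- c(pkgA = 1, pkgB = 1, pkgC = 1, pkgD = 1, stats = 1,
                     pkgE = 2, pkgF = 2, Rcpp = 2)

# Return the n packages with the highest PageRank in each community.
top_packages <- function(graph, membership, n = 3) {
  pr <- page_rank(graph)$vector  # PageRank scores, named by vertex
  by_comm <- split(names(pr), membership[names(pr)])
  lapply(by_comm, function(pkgs) {
    head(pkgs[order(pr[pkgs], decreasing = TRUE)], n)
  })
}

top_members <- top_packages(dep_graph, comm_membership)
```

Packages that many others depend on collect the most PageRank, so here `stats` and `Rcpp` lead their respective communities.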


[^1]: CRAN mirror HOWTO/FAQ

Owner

  • Login: chainsawriot
  • Kind: user
  • Location: Germany
  • Company: @gesistsa

Issues and Pull Requests

All Time
  • Total issues: 16
  • Total pull requests: 16
  • Average time to close issues: 7 months
  • Average time to close pull requests: 4 months
  • Total issue authors: 3
  • Total pull request authors: 2
  • Average comments per issue: 1.44
  • Average comments per pull request: 0.13
  • Merged pull requests: 12
  • Bot issues: 0
  • Bot pull requests: 0
Past Year
  • Issues: 0
  • Pull requests: 0
  • Average time to close issues: N/A
  • Average time to close pull requests: N/A
  • Issue authors: 0
  • Pull request authors: 0
  • Average comments per issue: 0
  • Average comments per pull request: 0
  • Merged pull requests: 0
  • Bot issues: 0
  • Bot pull requests: 0
Top Authors
Issue Authors
  • yenchiayi (9)
  • chainsawriot (6)
  • pymia (1)
Pull Request Authors
  • yenchiayi (11)
  • pymia (5)
Top Labels
Issue Labels
paper (2) analysis (2) enhancement (1)