GlobalSearchRegression
Julia's HPC command for automatic feature/model selection using all-subset-regression approaches
Science Score: 54.0%
This score indicates how likely this project is to be science-related based on various indicators:
- ✓ CITATION.cff file: found CITATION.cff file
- ✓ codemeta.json file: found codemeta.json file
- ✓ .zenodo.json file: found .zenodo.json file
- ○ DOI references
- ✓ Academic publication links: links to researchgate.net, zenodo.org
- ○ Committers with academic emails
- ○ Institutional organization owner
- ○ JOSS paper metadata
- ○ Scientific vocabulary similarity: low similarity (14.5%) to scientific vocabulary
Keywords
Repository
Julia's HPC command for automatic feature/model selection using all-subset-regression approaches
Basic Info
Statistics
- Stars: 18
- Watchers: 3
- Forks: 4
- Open Issues: 7
- Releases: 10
Topics
Metadata Files
README.md
GlobalSearchRegression

Abstract
GlobalSearchRegression is both the world's fastest all-subset-regression command (a widespread tool for automatic model/feature selection) and a first step toward a coherent framework that merges Machine Learning and Econometric algorithms.
Written in Julia, it is a High Performance Computing version of the Stata gsreg command (get the original code here). On a multicore personal computer (we use a Threadripper 1950X build for benchmarks), it runs up to 3165 times faster than the original Stata code and up to 197 times faster than well-known R alternatives (pdredge).
Nevertheless, GlobalSearchRegression's main focus is not only on execution times but also on progressively combining Machine Learning algorithms with Econometric diagnostic tools in a friendly Graphical User Interface (GUI) to simplify embarrassingly parallel quantitative research.
In a Machine Learning environment (e.g. problems focusing on predictive analysis / forecasting accuracy) there is an increasing universe of "training/test" algorithms (many of them showing very interesting performance in Julia) to compare alternative results and find a suitable model.
However, problems focusing on causal inference require five important econometric features:
1. Parsimony (to avoid very large atheoretical models);
2. Interpretability (for causal inference, rejecting "intuition-loss" transformations and/or complex combinations);
3. Across-models sensitivity analysis (uncertainty is the only certainty; parameter distributions are preferred to "best-model" unique results);
4. Robustness to time series and panel data information (preventing the use of raw bootstrapping or random subsample selection for training and test sets); and
5. Advanced residual properties (e.g. going beyond the i.i.d. assumption and looking for additional panel structure properties for each model being evaluated, which forces a departure from many traditional machine learning algorithms).
For all these reasons, researchers increasingly prefer advanced all-subset-regression approaches, choosing among alternative models by means of in-sample and/or out-of-sample criteria, model-averaging results, Bayesian priors for theoretical bounds on covariate coefficients, and different residual constraints. While still unfeasible for large problems (choosing among hundreds of covariates), hardware and software innovations allow researchers to implement this approach in many different scientific projects, choosing among one billion models in a few hours on standard personal computers.
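As a rough illustration of that combinatorial scale (this snippet is not part of the package; it only evaluates the 2^k - 1 formula for the number of candidate models given k potential covariates):
```julia
# Number of all-subset-regression candidate models for k potential covariates:
# every non-empty subset of the k covariates defines one model.
n_models(k) = big(2)^k - 1

n_models(10)   # 1_023 models
n_models(20)   # 1_048_575 models
n_models(30)   # 1_073_741_823 models, roughly the "one billion models" mentioned above
```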
Installation
GlobalSearchRegression requires Julia 1.6.7 (or a newer release) to be installed on your computer. Then start Julia and type "]" (without double quotes) to open the package manager.
```julia
julia> ]
pkg>
```
After that, just install GlobalSearchRegression by typing "add GlobalSearchRegression":
```julia
pkg> add GlobalSearchRegression
```
Optionally, some users may also find it useful to install the CSV and DataFrames packages to enable additional I/O functionality.
```julia
pkg> add CSV DataFrames
```
Basic Usage
To run the simplest analysis just type:
```julia
julia> using GlobalSearchRegression, DelimitedFiles
julia> dataname = readdlm("path_to_your_data/your_data.csv", ',', header=true)
```
and
```julia
julia> gsreg("your_dependent_variable your_explanatory_variable_1 your_explanatory_variable_2 your_explanatory_variable_3 your_explanatory_variable_4", dataname)
```
or
```julia
julia> gsreg("your_dependent_variable *", dataname)
```
It performs an Ordinary Least Squares all-subset-regression (OLS-ASR) approach to choose the best model among 2^n - 1 alternatives (in terms of in-sample accuracy, using the adjusted R2), where:
* DelimitedFiles is the Julia built-in package we use to read data from csv files (through its readdlm function);
* "path_to_your_data/your_data.csv" is a string that identifies your comma-separated database, allowing for missing observations. The first row of your database is assumed to contain variable names;
* gsreg is the GlobalSearchRegression function that estimates all subset regressions (i.e. all possible covariate combinations). In its simplest form, it has two arguments separated by a comma;
* The first gsreg argument is the general unrestricted model (GUM). It must be typed between double quotes. Its first string is the dependent variable name (csv-file names must be respected; remember that Julia is case sensitive). After that, you can include as many explanatory variables as you want. Alternatively, you can replace covariates with wildcards as in the example above (e.g. * for all other variables in the csv-file, or qwert* for all other variables in the csv-file whose names start with "qwert"; see the sketch after this list); and
* The second gsreg argument is the name of the object containing your database. Following the example above, it must match the name you use in dataname = readdlm("path_to_your_data/your_data.csv", ',', header=true)
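For example, a minimal sketch of the prefix-wildcard form described above (the file path and variable names are placeholders, not real data):
```julia
julia> using GlobalSearchRegression, DelimitedFiles
julia> dataname = readdlm("path_to_your_data/your_data.csv", ',', header=true)

# every variable in the csv-file whose name starts with "qwert" becomes a candidate covariate
julia> gsreg("your_dependent_variable qwert*", dataname)
```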
Advanced usage
Alternative data input
Databases can also be handled with the CSV and DataFrames packages. To do so, remember to install them using the add command in the Julia package manager. Once that is done, just type:
```julia
]
pkg> add CSV, DataFrames
```
then press backspace to return to the main REPL prompt and type:
```julia
julia> using GlobalSearchRegression, CSV, DataFrames
julia> data = CSV.read("path_to_your_data/your_data.csv", DataFrame)
julia> gsreg("y *", data)
```
Alternative GUM syntax
The general unrestricted model (GUM; the first argument of the gsreg function) can be written in many different ways, providing a smooth transition for R and Stata users.
```julia
# Stata-like
julia> gsreg("y x1 x2 x3", data)

# R-like
julia> gsreg("y ~ x1 + x2 + x3", data)
julia> gsreg("y ~ x1 + x2 + x3", data=data)

# Strings separated with commas
julia> gsreg("y,x1,x2,x3", data)

# Array of strings
julia> gsreg(["y", "x1", "x2", "x3"], data)

# Using wildcards
julia> gsreg("y *", data)
julia> gsreg("y x*", data)
julia> gsreg("y x1 z*", data)
julia> gsreg("y ~ x*", data)
julia> gsreg("y ~ .", data)
```
Additional options
GlobalSearchRegression advanced properties include almost all Stata-GSREG options as well as additional features. Overall, our Julia version has the following options:
* intercept::Union{Nothing, Bool}: by default the GUM includes an intercept as a fixed covariate (i.e. it is included in every model). Alternatively, users can remove it by setting the boolean option intercept=false.
* estimator::Union{Nothing, String}: can be either "ols" or "olsfe". The latter performs the OLS estimator on the modified panel dataset obtained by applying the "within transformation" to the original data. The panelid and time options must be identified to use estimator="olsfe".
* fixedvars::Union{Nothing, Symbol, Vector{Symbol}}: if you have some variables of interest that should remain ubiquitous, use this gsreg option to identify variables that will be used in every regression (e.g. fixedvars = [:x1, :x2]). fixedvars cannot also be included in the equation.
* outsample::Union{Nothing, Int}: identifies how many observations will be left out for forecasting purposes (e.g. outsample = 10 indicates that the last 10 observations will not be used in the OLS estimation, remaining available for out-of-sample accuracy calculations). In a panel data context, outsample observations are identified on a panelid basis (i.e. the last 10 observations of each panel group).
* criteria::Union{Nothing, Symbol, Vector{Symbol}}: there are several different criteria (which must be included as symbols) to evaluate alternative models. For in-sample fit, users can choose one or many among the following: adjusted R2 (:r2adj, the default), Bayesian information criterion (:bic), Akaike and corrected Akaike information criteria (:aic and :aicc), Mallows's Cp statistic (:cp), sum of squared errors (also known as residual sum of squares, :sse) and root mean square error (:rmse). For out-of-sample accuracy, the out-of-sample root mean square error (:rmseout) is available. Users are free to combine in-sample and out-of-sample information criteria, as well as several different in-sample criteria. For each alternative model, GlobalSearchRegression calculates a composite ordering variable defined as the equally-weighted average of normalized (to guarantee equal weights) and harmonized (to ensure that higher values always identify better models) user-specified criteria.
* ttest::Union{Nothing, Bool}: by default there is no t-test (to resemble similar R packages), but users can activate it with the boolean option ttest=true.
* vce::Union{Nothing, String}: if ttest=true, you can also use the available covariance corrections to improve t-test calculations. For cross-section and time series data, you can set vce="robust" to use the White correction. For panel data, the option vce="cluster" is also available to adjust for correlated errors within clusters (panelid units).
* method::Union{Nothing, String}: this option has 9 valid entries ("qr64", "cho64", "svd64", "qr32", "cho32", "svd32", "qr16", "cho16", "svd16") that can be used to run estimations with three different matrix factorization alternatives (QR, Cholesky and Singular Value Decomposition) and three different datatype alternatives (Float16, Float32 or Float64). The default is method="qr32". It must be noticed that Float16 only improves performance on architectures whose FPU allows Float16 arithmetic operations without conversion to Float32 (like Aarch64).
* modelavg::Union{Nothing, Bool}: by default, GlobalSearchRegression identifies the best model in terms of the user-specified criteria. Complementarily, by setting the boolean modelavg option to true (modelavg=true), users can obtain across-models average coefficients, t-tests and additional statistics (using exponential weights based on the -potentially composite- ordering variable defined in the criteria option). Each alternative model has a weight given by w1/sum(w1), where w1 is defined as exp(-delta/2) and delta is equal to max(ordering variable) - (ordering variable); see the short sketch after this list.
* time::Union{Nothing, Symbol, String}: this option determines which variable will be used to date (and pre-sort) observations. The time variable must be included as a Symbol or String (e.g. time=:x1 or time="x1"). Neither gaps nor missing observations are allowed in this variable (missing observations are allowed in any other variable). By using this option, additional residual tests are enabled.
* panelid::Union{Nothing, Symbol, String}: this option identifies the name of the variable containing panel group identifiers (it accepts the argument either as a String or a Symbol). The content of this variable must be of an Integer type (with the same -and unique- identifier for each element of the same group).
* residualtests::Union{Nothing, Bool}: normality, heteroskedasticity and serial correlation tests will be performed when this boolean option is set to true (the default is residualtests = nothing). The Jarque-Bera statistic will be used to test for Gaussian residuals. If panelid is not defined, data will be treated as cross-section or time series (depending on the time option) and the White test will be used to check for heteroskedasticity. Otherwise, the modified Wald statistic will be computed to test for heteroskedasticity in a panel data context (as in the xttest3 Stata package). If a time variable is defined (in the time option), serial correlation tests will also be estimated: the Breusch-Godfrey test for time series and the Wooldridge test for panel data (as in the xtserial Stata package). For each model, residual test p-values will be saved into the model results, which can be exported to a csv file using the resultscsv option.
* paneltests::Union{Nothing, Bool}: if panelid is identified, fixed-effects estimations can include additional tests. When paneltests=true, the (ANOVA) F-test (to check whether all ui = 0) and the Breusch-Pagan Lagrange Multiplier test (to check for cross-sectional independence, as in the xttest2 Stata package) will be computed and stored in model results.
* resultscsv::Union{Nothing, String}: the string used in this option defines the name of the CSV file to be created in the working directory with output results. By default, no CSV file is created (only main results are displayed in the REPL).
* orderresults::Union{Nothing, Bool}: a boolean option to determine whether models should be sorted (by the user-specified information criteria) or not. By default there is no sorting (orderresults=false). It must be noticed that setting orderresults=true and method="svd64" will significantly increase execution times.
* parallel::Union{Nothing, Int}: the most important option. It defines how many workers will be assigned to GlobalSearchRegression in order to parallelize calculations. Using physical cores, the speed-up is impressive. It is even superlinear with small databases (exploiting LLC multiplication). Notwithstanding, speed-up efficiency decreases with logical cores (e.g. enabling hyperthreading). In order to use this option, Julia must be initialized with the -p auto option or additional processes must be enabled (with the addprocs(#) command, see the example below). Otherwise, Julia will only use one core and the parallel option of GlobalSearchRegression will not be available.
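The following is a minimal, stand-alone sketch of the exponential weighting formula described in the modelavg option (the ordering vector below is hypothetical and only serves to illustrate w1/sum(w1); it is not how the package exposes its internals):
```julia
# hypothetical composite ordering-variable values, one per model (higher is better)
ordering = [0.93, 0.88, 0.75, 0.40]

delta   = maximum(ordering) .- ordering   # distance of each model from the best one
w1      = exp.(-delta ./ 2)               # unnormalized exponential weights
weights = w1 ./ sum(w1)                   # model-averaging weights, summing to one
```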
Full-syntax example
This is a full-syntax example, assuming Julia 1.6.7 (or a newer release), GlobalSearchRegression and DataFrames are already installed on a quad-core personal computer. To enable parallelism, the Distributed package (including its addprocs command) must be activated before GlobalSearchRegression (in three different lines: one for Distributed, one for addprocs() and another for GlobalSearchRegression, see the example below).
- For simulated time series data
```julia
# The first four lines are used to simulate data with random variables
julia> using DataFrames
julia> data = DataFrame([Vector{Union{Float64, Missing}}(rand(100)) for _ in 1:16], :auto)
julia> headers = [ :y ; [ Symbol("x$i") for i = 1:size(data, 2) - 1 ] ]
julia> rename!(data, headers)

# The following two lines enable multicore calculations
julia> using Distributed
julia> addprocs(4)

# Next line defines the working directory (where output results will be saved), for example:
julia> cd("c:\\")    # in Windows, or
julia> cd("/home/")  # in Linux

# Final two lines are used to perform the all-subset regression
julia> using GlobalSearchRegression
julia> gsreg("y x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15", data, intercept=true, outsample=10, criteria=[:r2adj, :bic, :aic, :aicc, :cp, :rmse, :rmseout, :sse], ttest=true, method="svd64", modelavg=true, residualtests=true, time=:x1, resultscsv="output.csv", parallel=4, orderresults=false)
```
- For panel data
```julia
julia> using DataFrames, Distributions, LinearAlgebra

# Artificial panel database creation
julia> N, T, K = 20, 500, 4
julia> correlationmatrix = [
           1.0 0.8 0.3 0.2
           0.8 1.0 0.4 0.3
           0.3 0.4 1.0 0.2
           0.2 0.3 0.2 1.0
       ]
julia> L = cholesky(correlationmatrix).L
julia> X = [L * rand(Normal(), K) for _ in 1:(N * T)]
julia> X = reduce(hcat, X)'
julia> paneldata = DataFrame(X, [:x1, :x2, :x3, :x4])
julia> paneldata[!, :id] = repeat(1:N, inner = T)
julia> paneldata[!, :time] = repeat(1:T, outer = N)
julia> alpha = rand(Normal(), N) .* 10
julia> beta = [1.0, 0.6, 0.3, 0.15]
julia> constant = 2
julia> paneldata[!, :y] = [constant + alpha[paneldata[i, :id]] + dot(beta, paneldata[i, [:x1, :x2, :x3, :x4]]) + rand(Normal()) for i in 1:(N * T)]

# Parallel execution settings
julia> using Distributed
julia> addprocs()
julia> n = nworkers()

# GlobalSearchRegression implementation
julia> using GlobalSearchRegression
julia> model = gsreg("y x1 x3", paneldata; fixedvars=[:x2, :x4], time=:time, panelid=:id, estimator="olsfe", method="cho32", ttest=true, residualtests=true, paneltests=true, criteria=[:r2adj, :bic, :aic], modelavg=true, resultscsv="panel_data4.csv", parallel=2)
julia> println(model)
```
Limitations
GlobalSearchRegression.jl is not able to handle databases with perfectly-collinear covariates. An error message will be returned and users will have to supply a new database keeping just one of these perfectly-collinear variables. Similarly, it is not yet possible to include categorical variables as potential covariates; they should be transformed into dummy variables before using GlobalSearchRegression.jl (a minimal sketch is shown below). Finally, string variables are not allowed.
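The following is only an illustrative sketch of that dummy-variable transformation using DataFrames (the column names y, x1 and region are hypothetical; this is not part of GlobalSearchRegression itself):
```julia
using DataFrames

# hypothetical dataset with a categorical string column :region
df = DataFrame(y = rand(6), x1 = rand(6),
               region = ["north", "south", "east", "north", "east", "south"])

# create one 0/1 dummy column per category, dropping the first level to avoid
# perfect collinearity with the intercept
for level in unique(df.region)[2:end]
    df[!, Symbol("region_", level)] = Float64.(df.region .== level)
end

select!(df, Not(:region))   # drop the original string column before calling gsreg
```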
Credits
The GSReg module, which performs the regression analysis, was written primarily by Demian Panigo, Pablo Gluzmann, Valentín Mari, Adán Mauri Ungaro and Nicolás Monzón with the collaboration of Esteban Mocskos. The GlobalSearchRegression.jl module was inspired by GSReg for Stata, written by Pablo Gluzmann and Demian Panigo.
Owner
- Name: ParallelGSReg
- Login: ParallelGSReg
- Kind: organization
- Repositories: 3
- Profile: https://github.com/ParallelGSReg
Citation (CITATION.cff)
cff-version: 1.2.0
message: "If you use this software, please cite it as below."
authors:
  - family-names: Panigo
    given-names: Demian Tupac
    orcid: https://orcid.org/0000-0003-2468-1632
  - family-names: Mauri Ungaro
    given-names: Adan
    orcid: https://orcid.org/0000-0003-3106-3912
  - family-names: Mari
    given-names: Valentin
    orcid: https://orcid.org/0000-0003-1489-6525
  - family-names: Glüzmann
    given-names: Pablo
    orcid: https://orcid.org/0000-0001-6570-383X
  - family-names: Monzón
    given-names: Nicolás
    orcid: https://orcid.org/0000-0002-4846-7916
title: "GlobalSearchRegression.jl"
version: 1.0.7
identifiers:
  - type: doi
    value: 10.5281/zenodo.7932823
date-released: 2023-05-13
GitHub Events
Total
- Push event: 2
- Pull request event: 3
- Fork event: 1
Last Year
- Push event: 2
- Pull request event: 3
- Fork event: 1
Committers
Last synced: almost 3 years ago
All Time
- Total Commits: 287
- Total Committers: 9
- Avg Commits per committer: 31.889
- Development Distribution Score (DDS): 0.652
Top Committers
| Name | Email | Commits |
|---|---|---|
| Valentin Mari | v****i@u****m | 100 |
| adanmauri | a****i@g****m | 61 |
| Demian Panigo | 3****o@u****m | 58 |
| nicomzn | 5****n@u****m | 55 |
| nicomzn | n****4@g****m | 5 |
| Valentín Mari | v****i@n****m | 4 |
| Demian Panigo | p****o@g****m | 2 |
| José Bayoán Santiago Calderón | n****r@g****m | 1 |
| Valentín Mari | v****i@h****m | 1 |
Committer Domains (Top 20 + Academic)
Issues and Pull Requests
Last synced: 6 months ago
All Time
- Total issues: 6
- Total pull requests: 10
- Average time to close issues: 12 days
- Average time to close pull requests: 4 days
- Total issue authors: 6
- Total pull request authors: 4
- Average comments per issue: 4.5
- Average comments per pull request: 0.1
- Merged pull requests: 3
- Bot issues: 0
- Bot pull requests: 0
Past Year
- Issues: 0
- Pull requests: 6
- Average time to close issues: N/A
- Average time to close pull requests: N/A
- Issue authors: 0
- Pull request authors: 1
- Average comments per issue: 0
- Average comments per pull request: 0.0
- Merged pull requests: 0
- Bot issues: 0
- Bot pull requests: 0
Top Authors
Issue Authors
- mthelm85 (1)
- dpanigo (1)
- clibassi (1)
- Nosferican (1)
- JuliaTagBot (1)
- EvoArt (1)
Pull Request Authors
- gluzmanngmail (6)
- vmari (2)
- dpanigo (1)
- Nosferican (1)
Top Labels
Issue Labels
Pull Request Labels
Packages
- Total packages: 1
- Total downloads: julia 1 total
- Total dependent packages: 1
- Total dependent repositories: 4
- Total versions: 10
juliahub.com: GlobalSearchRegression
Julia's HPC command for automatic feature/model selection using all-subset-regression approaches
- Documentation: https://docs.juliahub.com/General/GlobalSearchRegression/stable/
- License: MIT
- Latest release: 1.0.9 (published over 1 year ago)
Rankings
Dependencies
- JuliaRegistries/TagBot v1 composite
- actions/checkout v1 composite