.. note::
    :class: sphx-glr-download-link-note

    Click :ref:`here <sphx_glr_download_auto_examples_plot_group_lasso.py>` to download the full example code

.. rst-class:: sphx-glr-example-title

.. _sphx_glr_auto_examples_plot_group_lasso.py:

==========================
Group Lasso Regularization
==========================

This example demonstrates Pyglmnet with group lasso regularization, which is
typical in regression problems where it is reasonable to penalize model
parameters in a group-wise fashion based on domain knowledge.

.. code-block:: default


    # Author: Matthew Antalek
    # License: MIT

.. code-block:: default


    from pyglmnet import GLMCV
    from pyglmnet.datasets import fetch_group_lasso_datasets
    import matplotlib.pyplot as plt

Group Lasso example applied to the same dataset found in:
ftp://ftp.stat.math.ethz.ch/Manuscripts/buhlmann/lukas-sara-peter.pdf

The task here is to determine which base pairs and positions within a 7-mer
sequence are predictive of whether the sequence contains a splice site or not.

Read and preprocess data

.. code-block:: default


    df, group_idxs = fetch_group_lasso_datasets()
    print(df.head())

.. rst-class:: sphx-glr-script-out

 Out:
 .. code-block:: none

    Downloading data ... 100%, 2 MB

         0    1    2    3    4    5    6    7    8    9  ...  930  931  932  933  934  935  936  937  938  Label
    0  1.0  1.0  0.0  1.0  0.0  1.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    1.0
    1  1.0  0.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0  1.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    1.0
    2  1.0  1.0  0.0  1.0  0.0  1.0  0.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    1.0
    3  1.0  1.0  1.0  0.0  0.0  1.0  1.0  0.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    1.0
    4  1.0  0.0  0.0  0.0  1.0  0.0  1.0  1.0  0.0  0.0  ...  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    1.0

    [5 rows x 940 columns]

Set up the training and testing sets

.. code-block:: default


    from sklearn.model_selection import train_test_split  # noqa

    X = df[df.columns.difference(["Label"])].values
    y = df.loc[:, "Label"].values
    Xtrain, Xtest, ytrain, ytest = \
        train_test_split(X, y, test_size=0.2, random_state=42)

Set up the models

.. code-block:: default


    # set up the group lasso GLM model
    gl_glm = GLMCV(distr="binomial", tol=1e-3,
                   group=group_idxs, score_metric="pseudo_R2",
                   alpha=1.0, learning_rate=3, max_iter=100, cv=3, verbose=True)

    # set up the lasso model
    glm = GLMCV(distr="binomial", tol=1e-3,
                score_metric="pseudo_R2",
                alpha=1.0, learning_rate=3, max_iter=100, cv=3, verbose=True)

    print("gl_glm: ", gl_glm)
    print("glm: ", glm)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    gl_glm:  <
    Distribution | binomial
    alpha | 1.00
    max_iter | 100.00
    lambda: 0.50 to 0.01
    >
    glm:  <
    Distribution | binomial
    alpha | 1.00
    max_iter | 100.00
    lambda: 0.50 to 0.01
    >

Fit models

.. code-block:: default


    gl_glm.fit(Xtrain, ytrain)
    glm.fit(Xtrain, ytrain)

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /Users/mainak/Documents/github_repos/pyglmnet/pyglmnet/pyglmnet.py:864: UserWarning: Reached max number of iterations without convergence.
      "Reached max number of iterations without convergence.")

Visualize model scores on test set

.. code-block:: default


    plt.figure()
    plt.semilogx(gl_glm.reg_lambda, gl_glm.scores_, 'go-')
    plt.semilogx(glm.reg_lambda, glm.scores_, 'ro--')
    plt.legend(['Group Lasso', 'Lasso'], frameon=False, loc='best')
    plt.xlabel(r'$\lambda$')
    plt.ylabel('pseudo-$R^2$')
    plt.tick_params(axis='y', right='off')
    plt.tick_params(axis='x', top='off')
    ax = plt.gca()
    ax.spines['top'].set_visible(False)
    ax.spines['right'].set_visible(False)
    plt.show()
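For intuition about what distinguishes the two models, the group lasso penalty
:math:`\sum_g \|\beta_g\|_2` can be sketched with NumPy as below. This is a
minimal illustration under hypothetical coefficients and group labels (one
integer label per feature, mirroring the ``group_idxs`` convention), not
pyglmnet's internal implementation. Because whole groups enter or leave the
penalty together, an entire group of coefficients tends to be zeroed out at
once.

.. code-block:: python

    import numpy as np


    def group_lasso_penalty(beta, groups):
        """Sum of the L2 norms of the coefficients within each group."""
        beta = np.asarray(beta)
        groups = np.asarray(groups)
        return sum(np.linalg.norm(beta[groups == g]) for g in np.unique(groups))


    # Hypothetical example: 4 features split into 2 groups of 2.
    # Group 1 contributes ||(3, 4)||_2 = 5; group 2 is all zeros and adds nothing.
    beta = [3.0, 4.0, 0.0, 0.0]
    groups = [1, 1, 2, 2]
    print(group_lasso_penalty(beta, groups))  # 5.0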
.. image:: /auto_examples/images/sphx_glr_plot_group_lasso_001.png
    :class: sphx-glr-single-img

.. rst-class:: sphx-glr-script-out

 Out:

 .. code-block:: none

    /Users/mainak/Documents/github_repos/pyglmnet/examples/plot_group_lasso.py:89: UserWarning: Matplotlib is currently using agg, which is a non-GUI backend, so cannot show the figure.
      plt.show()

.. rst-class:: sphx-glr-timing

   **Total running time of the script:** ( 3 minutes 21.727 seconds)

.. _sphx_glr_download_auto_examples_plot_group_lasso.py:

.. only:: html

 .. container:: sphx-glr-footer
    :class: sphx-glr-footer-example

  .. container:: sphx-glr-download

     :download:`Download Python source code: plot_group_lasso.py <plot_group_lasso.py>`

  .. container:: sphx-glr-download

     :download:`Download Jupyter notebook: plot_group_lasso.ipynb <plot_group_lasso.ipynb>`

.. only:: html

 .. rst-class:: sphx-glr-signature

    `Gallery generated by Sphinx-Gallery <https://sphinx-gallery.github.io>`_