Community and CrimeΒΆ

This is a real dataset of per capita violent crime, with demographic data comprising 128 attributes from 1994 counties in the US.

The original dataset can be found here:

The target variables (per capita violent crime) are normalized to lie in a [0, 1] range. We preprocessed this dataset to exclude attributes with missing values.

# Author: Vinicius Marques <>
# License: MIT


import numpy as np
import matplotlib.pyplot as plt
from sklearn.cross_validation import train_test_split
from pyglmnet import GLM, datasets

Download and preprocess data files

X, y = datasets.fetch_community_crime_data('/tmp/glm-tools')
n_samples, n_features = X.shape

Split the data into training and test sets

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.33, random_state=0)

Fit a gaussian distributed GLM with elastic net regularization

# use the default value for reg_lambda
glm = GLM(distr='gaussian', alpha=0.05, score_metric='pseudo_R2')

# fit model, y_train)

# score the test set prediction
y_test_hat = glm[-1].predict(X_test)
print ("test set pseudo $R^2$ = %f" % glm[-1].score(X_test, y_test))


test set pseudo $R^2$ = 0.625327

Plot the true and predicted test set target values

plt.plot(y_test[:50], 'ko-')
plt.plot(y_test_hat[:50], 'ro-')
plt.legend(['true', 'pred'], frameon=False)
plt.ylabel('Per capita violent crime')

plt.tick_params(axis='y', right='off')
plt.tick_params(axis='x', top='off')
ax = plt.gca()

Total running time of the script: (0 minutes 24.451 seconds)

Download Python source code:
Download IPython notebook: plot_community_crime.ipynb