
@brentp
Last active September 17, 2024 14:48
beta regression in statsmodels
import numpy as np
from scipy.special import gammaln as lgamma
from statsmodels.base.model import GenericLikelihoodModel
from statsmodels.api import GLM
from statsmodels.genmod.families import Binomial

# see http://cran.r-project.org/web/packages/betareg/vignettes/betareg-ext.pdf


def ilogit(a):
    return 1 / (1. + np.exp(-a))


def logit(p):
    return np.log(p / (1. - p))


class BetaReg(GenericLikelihoodModel):

    def __init__(self, endog, exog, Z=None, **kwds):
        super(BetaReg, self).__init__(endog, exog, **kwds)
        if Z is None:
            # default precision model: intercept only (a column of ones).
            self.Z = np.ones((self.endog.shape[0], 1), dtype='f')
        else:
            self.Z = np.asarray(Z)
            assert len(self.Z) == len(self.endog)

    def nloglikeobs(self, params):
        return -self._ll_br(self.endog, self.exog, self.Z, params)

    def fit(self, start_params=None, maxiter=1000000, maxfun=50000, **kwds):
        if start_params is None:
            # initialize mean-model coefficients from a binomial GLM fit,
            # and precision-model coefficients at 0.5.
            start_params = GLM(self.endog, self.exog, family=Binomial()).fit().params
            start_params = np.append(start_params, [0.5] * self.Z.shape[1])
        return super(BetaReg, self).fit(start_params=start_params,
                                        maxiter=maxiter,
                                        maxfun=maxfun,
                                        **kwds)

    def _ll_br(self, y, X, Z, params):
        # split params into mean-model (X) and precision-model (Z) parts.
        nz = self.Z.shape[1]
        Xparams = params[:-nz]
        Zparams = params[-nz:]
        mu = ilogit(np.dot(X, Xparams))
        phi = np.exp(np.dot(Z, Zparams))
        ll = lgamma(phi) - lgamma(mu * phi) - lgamma((1 - mu) * phi) \
            + (mu * phi - 1) * np.log(y) + (((1 - mu) * phi) - 1) * np.log(1 - y)
        return ll


if __name__ == "__main__":
    import pandas as pd
    dat = pd.read_table('gasoline.txt')
    m = BetaReg.from_formula('iyield ~ C(batch) + temp', dat)
    print(m.fit().summary())
    # print(GLM.from_formula('iyield ~ C(batch) + temp', dat, family=Binomial()).fit().summary())
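As a sanity check on the parameterization above (not part of the original gist), the per-observation log-likelihood in `_ll_br` should agree with scipy's beta log-density under the mean/precision parameterization, where the usual shape parameters are alpha = mu * phi and beta = (1 - mu) * phi:

```python
# Verify that the gist's log-likelihood term matches scipy.stats.beta.logpdf
# with alpha = mu * phi and beta = (1 - mu) * phi.
import numpy as np
from scipy.special import gammaln as lgamma
from scipy.stats import beta as beta_dist

y, mu, phi = 0.3, 0.4, 5.0  # arbitrary values in the valid ranges
ll = (lgamma(phi) - lgamma(mu * phi) - lgamma((1 - mu) * phi)
      + (mu * phi - 1) * np.log(y) + ((1 - mu) * phi - 1) * np.log(1 - y))
ll_scipy = beta_dist.logpdf(y, mu * phi, (1 - mu) * phi)
assert np.isclose(ll, ll_scipy)
```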
# Equivalent model in R with the betareg package, for comparison:
library(betareg)
data("GasolineYield", package = "betareg")
m = betareg(yield ~ batch + temp, data = GasolineYield)
print(summary(m))
iyield gravity pressure temp10 temp batch
0.122 50.8 8.6 190 205 1
0.223 50.8 8.6 190 275 1
0.347 50.8 8.6 190 345 1
0.457 50.8 8.6 190 407 1
0.08 40.8 3.5 210 218 2
0.131 40.8 3.5 210 273 2
0.266 40.8 3.5 210 347 2
0.074 40 6.1 217 212 3
0.182 40 6.1 217 272 3
0.304 40 6.1 217 340 3
0.069 38.4 6.1 220 235 4
0.152 38.4 6.1 220 300 4
0.26 38.4 6.1 220 365 4
0.336 38.4 6.1 220 410 4
0.144 40.3 4.8 231 307 5
0.268 40.3 4.8 231 367 5
0.349 40.3 4.8 231 395 5
0.1 32.2 5.2 236 267 6
0.248 32.2 5.2 236 360 6
0.317 32.2 5.2 236 402 6
0.028 41.3 1.8 267 235 7
0.064 41.3 1.8 267 275 7
0.161 41.3 1.8 267 358 7
0.278 41.3 1.8 267 416 7
0.05 38.1 1.2 274 285 8
0.176 38.1 1.2 274 365 8
0.321 38.1 1.2 274 444 8
0.14 32.2 2.4 284 351 9
0.232 32.2 2.4 284 424 9
0.085 31.8 0.2 316 365 10
0.147 31.8 0.2 316 379 10
0.18 31.8 0.2 316 428 10
@dburdett commented Apr 13, 2018

As it stands, it doesn't seem to work out of the box; it raises an error suggesting that the Hessian inversion fails for the first example:

C:\Python27\lib\site-packages\statsmodels\tools\numdiff.py:159: RuntimeWarning: invalid value encountered in double_scalars
f(*((x-ei,)+args), **kwargs))/(2 * epsilon[k])
C:\Python27\lib\site-packages\statsmodels\base\model.py:473: HessianInversionWarning: Inverting hessian failed, no bse or cov_params available
'available', HessianInversionWarning)
C:\Python27\lib\site-packages\statsmodels\base\model.py:496: ConvergenceWarning: Maximum Likelihood optimization failed to converge. Check mle_retvals
"Check mle_retvals", ConvergenceWarning)
Traceback (most recent call last):
File ".\betareg.py", line 181, in
print m.fit().summary()
File "C:\Python27\lib\site-packages\statsmodels\base\model.py", line 2095, in summary
use_t=False)
File "C:\Python27\lib\site-packages\statsmodels\iolib\summary.py", line 861, in add_table_params
use_t=use_t)
File "C:\Python27\lib\site-packages\statsmodels\iolib\summary.py", line 445, in summary_params
std_err = results.bse
File "C:\Python27\lib\site-packages\statsmodels\tools\decorators.py", line 97, in get
_cachedval = self.fget(obj)
File "C:\Python27\lib\site-packages\statsmodels\base\model.py", line 1029, in bse
return np.sqrt(np.diag(self.cov_params()))
File "C:\Python27\lib\site-packages\statsmodels\base\model.py", line 1104, in cov_params
raise ValueError('need covariance of parameters for computing '
ValueError: need covariance of parameters for computing (unnormalized) covariances

The other two examples, the foodexpenditure.csv and methylation-test.csv inputs, work fine though.

Perhaps I'm missing a step due to my unfamiliarity with beta regression, but is there any simple way of understanding the relationship between the coefficients supplied by summary and the raw inputs? From reading around it doesn't seem there is, but given that you wrote this, I was hoping you might be able to shed some light.
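For what it's worth, one concrete reading (a sketch with made-up coefficient values, not taken from this gist's output): with a logit link, mu = ilogit(X @ b), so exp(b_j) is the factor by which the odds mu / (1 - mu) change per one-unit increase in x_j:

```python
# Illustrative only: intercept and b_temp are made-up numbers, not fitted values.
import numpy as np

def ilogit(a):
    return 1 / (1 + np.exp(-a))

intercept, b_temp = -6.16, 0.011  # hypothetical coefficients
temp = 300.0
mu_lo = ilogit(intercept + b_temp * temp)        # fitted mean at temp
mu_hi = ilogit(intercept + b_temp * (temp + 1))  # fitted mean at temp + 1
odds_ratio = (mu_hi / (1 - mu_hi)) / (mu_lo / (1 - mu_lo))
# the odds ratio per unit of temp equals exp(b_temp), regardless of temp
assert np.isclose(odds_ratio, np.exp(b_temp))
```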

@saccosj commented Jun 16, 2018

@dburdett, I can't explain the Hessian error. However, since I've recently been working with beta regression, I figured you and others might like more info on it (keep in mind, I am not an expert on this topic):

  1. Coefficient interpretation seems straightforward to some and confusing to others. Here's the best resource I have found on this.

  2. Some suggest beta problems can be simplified into binomial or logistic problems and handled from there (e.g. unpacking data from percentage values into successes and failures). I see problems with this, but others seem not to. In some situations, the data may allow the distribution to be treated as normal: given the right beta parameters, the distribution can be close to normal and/or transformed to it. In my short experience, though, the beta parameters may differ within subsets of the predictor(s), making this process difficult.

  3. Beta regression cannot handle exact zeroes or ones in the outcome variable. Thus, any observations where the outcome is zero or one must be removed, and obviously, imputing a very small value such as 0.000001 can create major issues. If the data contain a lot of zeroes or ones, an inflated beta distribution may be more appropriate. Here's a link to an article neatly describing this.

  4. This code seems to handle precision models oddly by default. Different combinations of variables can appear in both the logit and precision models, and this changes the results. Working in R with betareg, I have obtained far more reasonable estimates for known data. I am not 100% sure why the outcome variable is being put into the precision model, separate from the intercept, predicting itself.
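On point 3, a commonly cited workaround is the transformation from Smithson and Verkuilen (2006), which squeezes the outcome into the open interval (0, 1) rather than dropping observations; a minimal sketch, assuming y holds the raw proportions:

```python
# Squeeze proportions away from exact 0 and 1 (Smithson & Verkuilen, 2006):
# y' = (y * (n - 1) + 0.5) / n, where n is the sample size.
import numpy as np

def squeeze(y):
    y = np.asarray(y, dtype=float)
    n = y.size
    return (y * (n - 1) + 0.5) / n

y = np.array([0.0, 0.25, 0.5, 1.0])
ys = squeeze(y)
# all transformed values now lie strictly inside (0, 1)
assert ys.min() > 0 and ys.max() < 1
```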

Beta regression is a newer topic and most appear to try to circumvent it, so I hope this helps you or anyone else looking.
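On point 4, it may help to know what the precision model controls: in this mean/precision parameterization the beta variance is mu * (1 - mu) / (1 + phi), so a larger phi means a tighter distribution around the fitted mean. A quick check against scipy:

```python
# Verify the variance identity Var(y) = mu * (1 - mu) / (1 + phi)
# for a Beta(mu * phi, (1 - mu) * phi) distribution.
import numpy as np
from scipy.stats import beta as beta_dist

mu, phi = 0.3, 8.0  # arbitrary mean and precision
var = beta_dist.var(mu * phi, (1 - mu) * phi)
assert np.isclose(var, mu * (1 - mu) / (1 + phi))
```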

@bhishanpdl

This is very useful code. It would be best to submit it as a pull request to the statsmodels module.

@brentp (Author) commented Apr 4, 2020

there's a pull request (not sure if it's open or closed now) on the statsmodels repo where someone else did much more work on this.

@maciejkos

> there's a pull request (not sure if it's open or closed now) on the statsmodels repo where someone else did much more work on this.

@brentp - thank you for your work on this!

I searched open and closed PRs on the statsmodels repo for "beta regression" and couldn't find the PR you mentioned. I only found #2030 and #4238.

Do you remember if code in that PR was substantially more robust than yours? If yes, would you by any chance remember what it was about, so that I can try other search queries?

@hadir-mohsen

Thank you for your work.

If I have generated data that follows a beta distribution, how do I fit a beta regression model to it in R? Can I use the author's estimation approach?
