Hi All,

I recently found a very nice formula for calculating the Gini index of a predictive model (in my opinion it generalizes easily). The formula is really simple: it only uses expected values. The article is in Polish, but I will translate it and summarize the most important points.

I have not seen this formula before, so I assume it was proposed by Mariusz Gromada, the author of the blog post.

**The formula for Gini index**

**1. The object and the class**

Let’s assume we have a variable *y* indicating the class of an object *x*, where *y* takes one of two values, 0 or 1.

[; y(x)\in\{0,1\} ;]

[; x\in X ;]

*y = 1* means *x* is in the positive class, *y = 0* means *x* is in the negative class, and *X* is the object space (*X* is finite).

**2. The model estimating class probability for the object x**

Let’s assume we additionally have a model

[; p:X\to[0,1] ;]

The model *p* maps *X* into the continuous interval [0,1]. It is interpreted as follows: for an object

[; x\in X ;]

the value *p(x)* estimates the probability that *y(x) = 1*.

Another way of thinking about *p* is to treat *X* and *y* as random variables; then

[; p(x)=p(y(x)=1|X=x)=p(y=1|x)=p(1|x) ;]

In the rest of the text I will use the *p(1|x)* notation:

[; p(1|x) ;]

[; p(0|x)=1-p(1|x) ;]

Additionally, let’s define the a priori probabilities:

[; N=\#\{x\in X\} ;]

[; \pi_1=p(1)=\frac{\#\{x\in X~:~y(x) = 1\}}{N} ;]

[; \pi_0=p(0)=\frac{\#\{x\in X~:~y(x) = 0\}}{N}=1-\pi_1 ;]
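To make the priors concrete, here is a minimal Python sketch. The toy `labels` list and the helper name `class_priors` are my own illustration, not something from the blog post.

```python
# Hypothetical toy data: class labels y(x) for N = 8 objects.
labels = [1, 0, 1, 1, 0, 0, 1, 0]


def class_priors(labels):
    """Return (pi_0, pi_1): the shares of class 0 and class 1 in X."""
    n = len(labels)           # N = #{x in X}
    pi_1 = sum(labels) / n    # share of objects with y(x) = 1
    return 1.0 - pi_1, pi_1   # pi_0 = 1 - pi_1


pi_0, pi_1 = class_priors(labels)
print(pi_0, pi_1)  # 0.5 0.5
```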

**3. The model strength**

The Gini index is a good measure of model strength because it captures statistical dispersion. For the model defined above, the higher *p(1|x)*, the higher the share of objects with *y(x)=1*; the lower *p(1|x)*, the higher the share of objects with *y(x)=0*.

**4. Sorting the X and getting the position**

Let’s sort the set *X* by *p(1|x)* in **descending order**:

[; r:X\to\{1,2,\ldots,N\} ;]

*r(x)* is the position of object *x* after sorting by *p(1|x)* in descending order. The ranking is normalized by the function *R(x)*, defined as follows:

[; R(x)=\frac{r(x)}{N} ;]

[; R(x)\in[0,1] ;]

*R(x)* is a kind of normalized position in the set, and it can even be interpreted as a random variable *R*.
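The sorting step can be sketched in a few lines of Python. The function name `normalized_ranks` and the toy scores are hypothetical, and the tie-breaking rule (original order) is my assumption, since ties are not discussed here.

```python
def normalized_ranks(scores):
    """Return R(x) = r(x) / N for each object, where r(x) is the 1-based
    position of x after sorting by score in descending order.

    Assumption: tied scores keep their original relative order
    (Python's sorted() is stable).
    """
    n = len(scores)
    # Indices of the objects, ordered from highest to lowest score.
    order = sorted(range(n), key=lambda i: -scores[i])
    ranks = [0.0] * n
    for position, i in enumerate(order, start=1):
        ranks[i] = position / n
    return ranks


scores = [0.9, 0.2, 0.6, 0.4]    # hypothetical values of p(1|x)
print(normalized_ranks(scores))  # [0.25, 1.0, 0.5, 0.75]
```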

**5. Final Gini for p (two formulas!)**

[; Gini(p)=\frac{2\times E(R|y=0)-1}{\pi_1} ;]

[; Gini(p)=\frac{1-2\times E(R|y=1)}{\pi_0} ;]

where *E(R|y=0)* **is the average position of objects from class 0** and *E(R|y=1)* **is the average position of objects from class 1**.
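Here is a rough Python sketch of both formulas, checked against a brute-force Gini computed as 2 × AUC − 1 over all (positive, negative) pairs. The function names and toy data are my own, and the pairwise reference is a standard definition I am swapping in for comparison, not something taken from the blog post. One thing the sketch makes visible: with *R = r/N* and no tied scores, the two versions differ by exactly *1/(N·π_0·π_1)* and sit on either side of the pairwise value, so they should be read as large-*N* approximations.

```python
def gini_from_ranks(scores, labels):
    """Return both versions of the formula: (via class 0, via class 1)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: -scores[i])  # descending by score
    R = [0.0] * n
    for position, i in enumerate(order, start=1):
        R[i] = position / n                             # R(x) = r(x) / N
    n_1 = sum(labels)
    n_0 = n - n_1
    pi_1, pi_0 = n_1 / n, n_0 / n
    mean_R0 = sum(r for r, y in zip(R, labels) if y == 0) / n_0  # E(R|y=0)
    mean_R1 = sum(r for r, y in zip(R, labels) if y == 1) / n_1  # E(R|y=1)
    return (2 * mean_R0 - 1) / pi_1, (1 - 2 * mean_R1) / pi_0


def gini_pairwise(scores, labels):
    """Reference value: 2 * AUC - 1, by brute force over all pairs."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # A pair counts 1 if the positive scores higher, 0.5 if tied.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return 2 * wins / (len(pos) * len(neg)) - 1


# Hypothetical toy data: N = 8, pi_0 = pi_1 = 0.5, no tied scores.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

print(gini_from_ranks(scores, labels))  # (0.875, 0.375)
print(gini_pairwise(scores, labels))    # 0.625
```

On this tiny set the gap between the two versions is 1/(8 × 0.25) = 0.5; it shrinks as *N* grows.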

Very nice and very simple to calculate!

I hope you will like it.

Please let me know what other nice formulas for Gini index you know.

Best regards

submitted by /u/leroykegan
