A long-tail distribution can be quantified in primarily three ways (see Newman's review paper on power laws):
- Power law distribution
- Zipf distribution
- Pareto distribution
Here, I talk about the Zipf distribution, which describes the relationship between rank order and frequency (probability).
As Wikipedia puts it: "Zipf's law now refers more generally to frequency distributions of 'rank data,' in which the relative frequency of the nth-ranked item is given by the Zeta distribution."
DGBD
In Gustavo Martínez-Mekler et al.'s article, they propose a discrete generalized beta distribution: f(r) = A (N + 1 − r)^b / r^a, where r is the rank, N is the maximum rank, A is a normalization constant, and (a, b) are two fitting exponents.
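As a quick sanity check, the DGBD can be written directly as an R function (a minimal sketch; the parameter values below are arbitrary illustrations, not fitted values):

```r
# DGBD: f(r) = A * (N + 1 - r)^b / r^a
# r: rank, N: maximum rank, A: normalization constant, (a, b): fitting exponents
dgbd <- function(r, A, a, b, N) {
  A * (N + 1 - r)^b / r^a
}

# With b = 0 the DGBD reduces to the pure Zipf form A / r^a
dgbd(1:5, A = 1, a = 1, b = 0, N = 5)  # 1, 1/2, 1/3, 1/4, 1/5
```

The extra (N + 1 − r)^b factor is what lets the DGBD bend downward at the tail, where a pure power law stays straight on log-log axes.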
It is interesting to find that Wu and Zhang (2011) adopt the DGBD to quantify online social systems. Basically, they are interested in accelerating growth in human online behaviors.
Defining P as the number of active users in a day and T as the total activity generated by these users, they find a power-law relationship T ∝ P^γ between them. When γ is larger than 1, there is an accelerating-growth phenomenon in online systems (e.g., tagging, microblogging).
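The accelerating-growth relation is easy to illustrate on synthetic data (a sketch with made-up numbers, not Wu and Zhang's data): fitting log T against log P by linear regression recovers the exponent γ.

```r
# Synthetic daily totals: P = active users, T = total activity, with gamma = 1.2
P <- c(100, 200, 400, 800, 1600)
T_total <- 0.5 * P^1.2

fit <- lm(log(T_total) ~ log(P))
gamma <- coef(fit)[[2]]  # recovered slope: 1.2 > 1, i.e. accelerating growth
```

Since the data here are generated without noise, the regression recovers γ = 1.2 exactly; on real daily totals the slope is an estimate.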
They denote the activity of a user in one day by t(r), where r is the decreasing rank of that activity among all individual activities in the day. Thus the maximum value of r equals the population P. The DGBD model of individual activities is then t(r) = A (P + 1 − r)^b / r^a.
They explain that a determines the activities of the highly active users (it corresponds to the exponent in Zipf's law), while b determines the activities of the less active users. Using the DGBD, they can fit the empirical curves with R² > 0.9.
Here I start with an empirical dataset: the tweets of the city of Milan in December 2013. Since we know the language of each tweet, we have the frequency of each language.
freq = c(1116, 2067, 137, 124, 643, 2042, 55, 47186, 7504, 1488, 211, 1608, 3517, 7, 896, 378, 17, 3098, 164977, 601, 196, 637, 149, 44, 2, 1801, 882, 636, 5184, 1851, 776, 343, 851, 33, 4011, 209, 715, 937, 20, 6922, 2028, 23, 3045, 16, 334, 31, 2)
lan = c("af","ar","bg","cs","da","de","el","en","es","et","fa","fi","fr","he","hr","hu","id","it","ja","ko","lt","lv","mk","ne","nl","no","pl","pt","ro","ru","sk","sl","so","sq","sv","sw","th","tl","tr","uk","und","ur","vi","zh-cn","zh-tw")
Thus, we can calculate the decreasing rank for each language.
Rank = rank(-freq, ties.method = c("first") )
data = data.frame(lan, freq, Rank)
data$Probability = data$freq/sum(data$freq)
We can write a simple function to fit the rank-ordered data and capture the distribution.
get_dgbd = function(freq){
  N = length(freq)
  Rank = rank(-freq, ties.method = "first")
  p = freq/sum(as.numeric(freq))
  # log-transform frequency, probability, and rank
  log.f = log(freq)
  log.p = log(p)
  log.rank = log(Rank)
  log.inverse.rank = log(N + 1 - Rank)
  # linear regression of Zipf: for probability
  cozp = coef(lm(log.p ~ log.rank))
  zipf.p = function(x) exp(cozp[[1]] + cozp[[2]]*log(x))
  # linear regression of Zipf: for frequency
  cozf = coef(lm(log.f ~ log.rank))
  zipf.f = function(x) exp(cozf[[1]] + cozf[[2]]*log(x))
  # linear regression of DGBD: for probability
  codp = coef(lm(log.p ~ log.inverse.rank + log.rank))
  dgbd.p = function(x) exp(codp[[1]] + codp[[2]]*log(N + 1 - x) + codp[[3]]*log(x))
  # linear regression of DGBD: for frequency
  codf = coef(lm(log.f ~ log.inverse.rank + log.rank))
  dgbd.f = function(x) exp(codf[[1]] + codf[[2]]*log(N + 1 - x) + codf[[3]]*log(x))
  # N is captured by the returned closures, so the DGBD curves
  # stay correct no matter how many points they are evaluated at
  return(list(zipf.p, zipf.f, dgbd.p, dgbd.f))
}
fits = get_dgbd(data$freq)
zipf.p = fits[[1]]
zipf.f = fits[[2]]
dgbd.p = fits[[3]]
dgbd.f = fits[[4]]
plot(freq~Rank,log="xy", xlab = "Rank (log)", ylab = "Frequency (log)", data = data)
curve(zipf.f, col="red", add = T, n = length(data$Rank))
curve(dgbd.f, col="blue", add = T, n = length(data$Rank))
Remember to pass n = length(data$Rank) to curve() so that the fitted curves are evaluated at each integer rank.
Finally, we can plot it with ggplot2.
require(ggplot2)
P = ggplot(data = data, aes(x = Rank, y = freq, label = lan)) + geom_point() +
  coord_trans(x = "log10", y = "log10") +
  stat_function(fun = dgbd.f, n = length(data$Rank), colour = 'red', size = 1) +
  geom_text(hjust = 1.5, vjust = 0, angle = 45, size = 3)
png(file = "./language_rank_order_distribution2.png",
width=8, height=5,
units="in", res=700)
P
dev.off()


How to construct your model?
Using the DGBD model, we can fit the frequency-rank data almost perfectly. However, there are many other alternative functional forms, for example the Zipf-Mandelbrot model and its modifications. How can we decide which specific model is the right one? We need to:
- learn more about the underlying mechanisms.
- observe the patterns in the data.
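One simple observational check is to compare the candidate fits on the same data by adjusted R². Below is a sketch on toy rank-frequency data that I generate from a DGBD (so by construction the DGBD should win); the parameter values are arbitrary:

```r
# Toy rank-frequency data generated from a DGBD with a = 0.8, b = 0.3
set.seed(7)
N <- 50
r <- 1:N
f <- 2 * (N + 1 - r)^0.3 / r^0.8 * exp(rnorm(N, sd = 0.05))  # small noise

fit.zipf <- lm(log(f) ~ log(r))                    # pure power law
fit.dgbd <- lm(log(f) ~ log(r) + log(N + 1 - r))   # DGBD

summary(fit.zipf)$adj.r.squared   # good, but misses the downturn in the tail
summary(fit.dgbd)$adj.r.squared   # higher: matches the generating model
```

A higher adjusted R² alone does not prove a model is "right," which is why the mechanism question in the list above matters; it only tells you which functional form describes this sample better after penalizing the extra parameter.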
In the paper titled Modeling bursts and heavy tails in human dynamics, Alexei Vázquez et al. proposed two queuing models to explain the value of the scaling exponent α in the temporal patterns of human behaviors. They capture the temporal patterns with two measurements: the interevent time and the waiting time.
The time between two consecutive events is called the interevent time, τ; the waiting (or response) time, τw, is the amount of time a task waits on an individual's priority list before being executed.
Assuming that the tasks are executed independently of each other at a constant rate λ, the timing can be approximated by a Poisson process, whose interevent times follow an exponential distribution:
P(τ) = λ exp(−λτ).
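For a homogeneous Poisson process the exponential interevent times are easy to check by simulation (a sketch with an arbitrary rate):

```r
# Interevent times of a Poisson process with rate lambda are Exp(lambda)
set.seed(1)
lambda <- 2
tau <- rexp(1e5, rate = lambda)

mean(tau)  # close to 1/lambda = 0.5; light-tailed, so no bursts
```

Because the exponential distribution decays fast, very long gaps are essentially absent, which is exactly what real bursty activity violates.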
However, we know that in many human behaviors, the Poisson process fails to capture the burst phenomenon. The temporal patterns can be generally categorized into two classes:
- A. The α = 1 universality class: web browsing, email, and library datasets.
- B. The α = 3/2 universality class: the correspondence of Einstein, Darwin, and Freud, which is characterized by a power-law decay combined with an exponential cutoff:
P(τw) ∼ τw^(−3/2) exp(−τw/τ0),
in which τ0 = 1/(√μ − √λ)² and ρ = λ/μ. Recall that λ is the arrival rate of new tasks and μ is the response rate, thus ρ is the task/job/traffic intensity.
- Subcritical regime: when ρ < 1, there are fewer jobs and more queuing space.
- Critical regime: when ρ = 1, the arrival rate equals the response rate.
- Supercritical regime: when ρ > 1, there are more jobs and less queuing space.
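A minimal priority-queue simulation in the spirit of these models (a sketch, not Vázquez et al.'s exact protocol): a fixed list of L tasks; at each step the highest-priority task is executed with probability p, otherwise a random one, and the executed task is replaced by a new task with a fresh random priority. Highly deterministic selection (p close to 1) produces heavy-tailed waiting times.

```r
simulate_queue <- function(steps = 1e4, L = 2, p = 0.99) {
  prio  <- runif(L)          # random task priorities
  age   <- rep(0L, L)        # how long each task has been waiting
  waits <- integer(steps)    # waiting time recorded at each execution
  for (s in seq_len(steps)) {
    # execute the highest-priority task with probability p, else a random one
    i <- if (runif(1) < p) which.max(prio) else sample.int(L, 1)
    waits[s] <- age[i]
    prio[i] <- runif(1)      # replace the executed task with a new one
    age <- age + 1L
    age[i] <- 0L
  }
  waits
}

set.seed(42)
w <- simulate_queue()
# Most tasks are executed almost immediately, but a few wait very long:
quantile(w, c(0.5, 0.99, 1))
```

The intuition is that a task which happens to draw a low priority keeps losing to a stream of fresh higher-priority tasks, so its waiting time has a power-law tail rather than the exponential tail of a Poisson picture.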
However, they also examine the interevent time distribution between two consecutive transactions made by a stock broker; that distribution follows a power law with an exponential cutoff.
References
Martínez-Mekler G, Martínez RA, del Río MB, Mansilla R, Miramontes P, et al. (2009) Universality of rank-ordering distributions in the arts and sciences. PLoS ONE 4(3): e4791. doi:10.1371/journal.pone.0004791
Wu L, Zhang J (2011) Accelerating growth and size-dependent distribution of human online activities. Physical Review E 84, 026113.
Vázquez A, Oliveira JG, Dezső Z, Goh K-I, Kondor I, Barabási A-L (2006) Modeling bursts and heavy tails in human dynamics. Physical Review E 73, 036127.