A long-tail distribution can be quantified in primarily three ways (see Newman's review paper on power laws):
- Power law distribution
- Zipf distribution
- Pareto distribution
Here, I talk about the Zipf distribution, which describes the relationship between rank order and frequency (probability).
As Wikipedia puts it: "Zipf's law now refers more generally to frequency distributions of 'rank data,' in which the relative frequency of the nth-ranked item is given by the Zeta distribution."
DGBD
In Gustavo Martínez-Mekler et al.'s article, they propose a discrete generalized beta distribution: f(r) = A (N + 1 − r)^b / r^a, where r is the rank, N is the maximum rank, A is a normalization constant, and (a, b) are two fitting exponents.
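As a quick sanity check, the DGBD can be written directly as an R function (a minimal sketch; the parameter values below are arbitrary illustrations, not fitted values):

```r
# DGBD: f(r) = A * (N + 1 - r)^b / r^a
# r: rank, N: maximum rank, A: normalization constant, (a, b): fitting exponents
dgbd <- function(r, A, a, b, N) {
  A * (N + 1 - r)^b / r^a
}

# With b = 0 the DGBD reduces to the pure Zipf form A / r^a
dgbd(1:5, A = 1, a = 1, b = 0, N = 5)  # 1, 1/2, 1/3, 1/4, 1/5
```

The extra (N + 1 − r)^b factor is what lets the DGBD bend downward at the tail, where a pure power law stays straight on log-log axes.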
It is interesting to find that Wu and Zhang (2011) adopt the DGBD to quantify online social systems. Basically, they are interested in accelerating growth in human online behaviors.
Defining P as the number of active users in a day and T as the total activity generated by these users, they find a power-law relationship T ∝ P^γ between them. When γ is larger than 1, there is an accelerating-growth phenomenon in online systems (e.g., tagging, microblogging).
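The accelerating-growth relation is easy to illustrate on synthetic data (a sketch with made-up numbers, not Wu and Zhang's data): fitting log T against log P by linear regression recovers the exponent γ.

```r
# Synthetic daily totals: P = active users, T = total activity, with gamma = 1.2
P <- c(100, 200, 400, 800, 1600)
T_total <- 0.5 * P^1.2

fit <- lm(log(T_total) ~ log(P))
gamma <- coef(fit)[[2]]  # recovered slope: 1.2 > 1, i.e. accelerating growth
```

Since the data here are generated without noise, the regression recovers γ = 1.2 exactly; on real daily totals the slope is an estimate.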
They denote the activity of a user in one day by t(r), where r is the decreasing rank of that activity among all individual activities in the day. Thus the maximum value of r equals the population P. The DGBD model of individual activities is then t(r) = A (P + 1 − r)^b / r^a.
They explain that a determines the activities of the highly active users (it corresponds to the exponent in Zipf's law), while b determines the activities of the less active users. Using the DGBD, they can fit the empirical curves with R² > 0.9.
Here I start with an empirical dataset: the tweets of the city of Milan in December 2013. Since we know the language of each tweet, we have the frequency of each language.
freq = c(1116, 2067, 137, 124, 643, 2042, 55, 47186, 7504, 1488, 211, 1608, 3517, 7, 896, 378, 17, 3098, 164977, 601, 196, 637, 149, 44, 2, 1801, 882, 636, 5184, 1851, 776, 343, 851, 33, 4011, 209, 715, 937, 20, 6922, 2028, 23, 3045, 16, 334, 31, 2)
lan = c("af","ar","bg","cs","da","de","el","en","es","et","fa","fi","fr","he","hr","hu","id","it","ja","ko","lt","lv","mk","ne","nl","no","pl","pt","ro","ru","sk","sl","so","sq","sv","sw","th","tl","tr","uk","und","ur","vi","zh-cn","zh-tw")
Thus, we can calculate the decreasing rank for each language.
Rank = rank(-freq, ties.method = c("first") )
data = data.frame(lan, freq, Rank)
data$Probability = data$freq/sum(data$freq)
We can write a simple function to fit the rank-ordered data and capture the distribution.
get_dgbd = function(freq){
  N = length(freq)
  Rank = rank(-freq, ties.method = "first")
  p = freq/sum(as.numeric(freq))
  # log-transform frequency, probability, and rank
  log.f = log(freq)
  log.p = log(p)
  log.rank = log(Rank)
  log.inverse.rank = log(N + 1 - Rank)
  # linear regression of Zipf: for probability
  cozp = coef(lm(log.p ~ log.rank))
  zipf.p = function(x) exp(cozp[[1]] + cozp[[2]]*log(x))
  # linear regression of Zipf: for frequency
  cozf = coef(lm(log.f ~ log.rank))
  zipf.f = function(x) exp(cozf[[1]] + cozf[[2]]*log(x))
  # linear regression of DGBD: for probability
  codp = coef(lm(log.p ~ log.inverse.rank + log.rank))
  dgbd.p = function(x) exp(codp[[1]] + codp[[2]]*log(N + 1 - x) + codp[[3]]*log(x))
  # linear regression of DGBD: for frequency
  codf = coef(lm(log.f ~ log.inverse.rank + log.rank))
  dgbd.f = function(x) exp(codf[[1]] + codf[[2]]*log(N + 1 - x) + codf[[3]]*log(x))
  # N is captured by the returned closures, so the DGBD curves
  # stay correct no matter how many points they are evaluated at
  return(list(zipf.p, zipf.f, dgbd.p, dgbd.f))
}
fits = get_dgbd(data$freq)
zipf.p = fits[[1]]
zipf.f = fits[[2]]
dgbd.p = fits[[3]]
dgbd.f = fits[[4]]
plot(freq~Rank,log="xy", xlab = "Rank (log)", ylab = "Frequency (log)", data = data)
curve(zipf.f, col="red", add = T, n = length(data$Rank))
curve(dgbd.f, col="blue", add = T, n = length(data$Rank))
Remember to pass n = length(data$Rank) to curve() so that the fitted curves are evaluated at each integer rank.
Finally, we can plot it with ggplot2.
require(ggplot2)
P = ggplot(data = data, aes(x = Rank, y = freq, label = lan)) + geom_point() +
  coord_trans(x = "log10", y = "log10") +
  stat_function(fun = dgbd.f, n = length(data$Rank), colour = 'red', size = 1) +
  geom_text(hjust = 1.5, vjust = 0, angle = 45, size = 3)
png(file = "./language_rank_order_distribution2.png",
width=8, height=5,
units="in", res=700)
P
dev.off()


How to construct your model?
Using the DGBD model, we can fit the frequency-rank data almost perfectly. However, there are many other alternative functional forms, for example the Zipf-Mandelbrot model and its modifications. How can we decide which specific model is the right one? We need to:
- learn more about the underlying mechanisms.
- observe the patterns in the data.
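One simple observational check is to compare the candidate fits on the same data by adjusted R². Below is a sketch on toy rank-frequency data that I generate from a DGBD (so by construction the DGBD should win); the parameter values are arbitrary:

```r
# Toy rank-frequency data generated from a DGBD with a = 0.8, b = 0.3
set.seed(7)
N <- 50
r <- 1:N
f <- 2 * (N + 1 - r)^0.3 / r^0.8 * exp(rnorm(N, sd = 0.05))  # small noise

fit.zipf <- lm(log(f) ~ log(r))                    # pure power law
fit.dgbd <- lm(log(f) ~ log(r) + log(N + 1 - r))   # DGBD

summary(fit.zipf)$adj.r.squared   # good, but misses the downturn in the tail
summary(fit.dgbd)$adj.r.squared   # higher: matches the generating model
```

A higher adjusted R² alone does not prove a model is "right," which is why the mechanism question in the list above matters; it only tells you which functional form describes this sample better after penalizing the extra parameter.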
In the paper titled Modeling bursts and heavy tails in human dynamics, Alexei Vázquez et al. proposed two queuing models to explain the value of the scaling exponent α in the temporal patterns of human behaviors. They capture the temporal patterns with two measurements: the interevent time and the waiting time.
The time between two consecutive events is called the interevent time, τ; the waiting (or response) time, τw, is the amount of time a task waits on an individual's priority list before being executed.
Assuming that the tasks are executed independently of each other at a constant rate λ, the timing can be approximated by a Poisson process, whose interevent times follow an exponential distribution:
P(τ) = λ exp(−λτ).
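For a homogeneous Poisson process the exponential interevent times are easy to check by simulation (a sketch with an arbitrary rate):

```r
# Interevent times of a Poisson process with rate lambda are Exp(lambda)
set.seed(1)
lambda <- 2
tau <- rexp(1e5, rate = lambda)

mean(tau)  # close to 1/lambda = 0.5; light-tailed, so no bursts
```

Because the exponential distribution decays fast, very long gaps are essentially absent, which is exactly what real bursty activity violates.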
However, we know that in many human behaviors, the Poisson process fails to capture the burst phenomenon. The temporal patterns can be generally categorized into two classes:
- A. The α = 1 universality class: web browsing, email, and library datasets.
- B. The α = 3/2 universality class: the correspondence of Einstein, Darwin, and Freud, which is characterized by a power-law decay combined with an exponential cutoff:
P(τw) ∼ τw^(−3/2) exp(−τw/τ0),
in which τ0 = 1/(√μ − √λ)² and ρ = λ/μ. Recall that λ is the arrival rate of new tasks and μ is the response rate, thus ρ is the task/job/traffic intensity.
- Subcritical regime: when ρ < 1, there are fewer jobs and more queuing space.
- Critical regime: when ρ = 1, the arrival rate equals the response rate.
- Supercritical regime: when ρ > 1, there are more jobs and less queuing space.
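A minimal priority-queue simulation in the spirit of these models (a sketch, not Vázquez et al.'s exact protocol): a fixed list of L tasks; at each step the highest-priority task is executed with probability p, otherwise a random one, and the executed task is replaced by a new task with a fresh random priority. Highly deterministic selection (p close to 1) produces heavy-tailed waiting times.

```r
simulate_queue <- function(steps = 1e4, L = 2, p = 0.99) {
  prio  <- runif(L)          # random task priorities
  age   <- rep(0L, L)        # how long each task has been waiting
  waits <- integer(steps)    # waiting time recorded at each execution
  for (s in seq_len(steps)) {
    # execute the highest-priority task with probability p, else a random one
    i <- if (runif(1) < p) which.max(prio) else sample.int(L, 1)
    waits[s] <- age[i]
    prio[i] <- runif(1)      # replace the executed task with a new one
    age <- age + 1L
    age[i] <- 0L
  }
  waits
}

set.seed(42)
w <- simulate_queue()
# Most tasks are executed almost immediately, but a few wait very long:
quantile(w, c(0.5, 0.99, 1))
```

The intuition is that a task which happens to draw a low priority keeps losing to a stream of fresh higher-priority tasks, so its waiting time has a power-law tail rather than the exponential tail of a Poisson picture.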
However, they also examine the interevent time distribution between two consecutive transactions made by a stock broker; that distribution follows a power law with an exponential cutoff.
References
Martínez-Mekler G, Martínez RA, del Río MB, Mansilla R, Miramontes P, et al. (2009) Universality of rank-ordering distributions in the arts and sciences. PLoS ONE 4(3): e4791. doi:10.1371/journal.pone.0004791
Wu L, Zhang J (2011) Accelerating growth and size-dependent distribution of human online activities. Physical Review E 84, 026113.
Vázquez A, Oliveira JG, Dezső Z, Goh K-I, Kondor I, Barabási A-L (2006) Modeling bursts and heavy tails in human dynamics. Physical Review E 73, 036127.