In the field of information retrieval, divergence from randomness (DFR) is a generalization of one of the very first models, Harter's 2-Poisson indexing model. It is one type of probabilistic model, and it is used to test the amount of information carried in documents. The 2-Poisson model is based on the hypothesis that the level of a document is related to a set of documents in which certain words occur to a relatively greater extent than in the rest of the documents. DFR is not a single 'model', but a framework for weighting terms using probabilistic methods, and it has a special relationship to term weighting based on the notion of eliteness. Term weights are treated as the standard of whether a specific word belongs to that elite set or not. Term weights are computed by measuring the divergence between a term distribution produced by a random process and the actual term distribution.
Divergence from randomness models are set up by instantiating the three main components of the framework: first selecting a basic randomness model, then applying the first normalization, and finally normalizing the term frequencies. The basic models are listed in the tables below.
Definition
The divergence from randomness is based on this idea: "The more the divergence of the within-document term-frequency from its frequency within the collection, the more the information carried by the word t in document d. In other words, the term-weight is inversely related to the probability of term-frequency within the document d obtained by a model M of randomness."
w = -log2 ProbM(t | d)   (Formula 1)

# M represents the type of randomness model employed to calculate the probability.
# d is the total number of words in the document.
# t is the number of occurrences of a specific word in d.
# k is a parameter defined by M.
Different urn models can be used to choose the appropriate model M of randomness. In information retrieval, there are documents instead of urns, and terms instead of colors. There are several ways to choose M, and each choice yields a basic divergence from randomness model to support it.
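As a concrete illustration of one such choice, the geometric approximation of the Bose–Einstein model (model G in the table below) assigns Prob(tf) = (1/(1+λ)) * (λ/(1+λ))^tf with λ = F/N, and the informative content is -log2 of that probability. A minimal sketch under these assumptions (the function name and the example numbers are illustrative, not from the source):

```python
import math

def inf_G(tf: int, F: int, N: int) -> float:
    """Informative content -log2 Prob_G(tf) under the geometric
    (Bose-Einstein approximation) randomness model.
    F: collection frequency of the term; N: number of documents."""
    lam = F / N  # mean term frequency over the collection
    # Geometric distribution: Prob(tf) = (1/(1+lam)) * (lam/(1+lam))**tf
    return -math.log2(1 / (1 + lam)) - tf * math.log2(lam / (1 + lam))

# A term occurring 200 times in a 1000-document collection (lam = 0.2):
# the more often it occurs in a document, the more information it carries.
weights = [inf_G(tf, 200, 1000) for tf in (1, 2, 4)]
```

The weight grows linearly in tf for this model, which is why the first normalization and term frequency normalization described later are needed to temper it.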
Basic models
* D: Divergence approximation of the binomial
* P: Approximation of the binomial
* BE: Bose–Einstein distribution
* G: Geometric approximation of the Bose–Einstein
* I(n): Inverse Document Frequency model
* I(F): Inverse Term Frequency model
* I(ne): Inverse Expected Document Frequency model

DFR models
* BB2: Bernoulli–Einstein model with Bernoulli after-effect and normalization 2.
* IFB2: Inverse Term Frequency model with Bernoulli after-effect and normalization 2.
* In-expB2: Inverse Expected Document Frequency model with Bernoulli after-effect and normalization 2. The logarithms are base 2. This model can be used for classic ad-hoc tasks.
* In-expC2: Inverse Expected Document Frequency model with Bernoulli after-effect and normalization 2. The logarithms are base e. This model can be used for classic ad-hoc tasks.
* InL2: Inverse Document Frequency model with Laplace after-effect and normalization 2. This model can be used for tasks that require early precision.
* PL2: Poisson model with Laplace after-effect and normalization 2. This model can be used for tasks that require early precision.
First normalization
When a specific rare term cannot be found in a document, that term has approximately zero probability of being informative for the document. Conversely, if a rare term occurs frequently in a document, it may have a very high, near-100% probability of being informative for the topic the document discusses. Applying Ponte and Croft's language model can also provide further data. The DFR framework considers a risk component: if the term-frequency in the document is relatively high, then the risk of the term not being informative is relatively small. If Formula 1 gives a high value, there is minimal risk that it has the negative effect of showing a small information gain. As a result, the weight from Formula 1 is adjusted to consider only the portion that is the amount of information gained with the term. The more the term occurs in the elite set, the less its term-frequency is due to randomness, and thus the smaller the associated risk. Two models are used to compute the information gain with a term within a document: the Laplace model L, and the ratio of two Bernoulli processes, B.
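The Laplace after-effect mentioned above can be sketched as follows: the risk of the term not being informative is modeled as 1/(tf + 1), and only that fraction of the Formula 1 weight is retained as information gain (a minimal sketch; the function name is ours):

```python
def first_normalization_L(inf1: float, tf: int) -> float:
    """Laplace after-effect (model L): retain only the fraction of the
    informative content inf1 attributable to information gain.
    The risk of non-informativeness is taken to be 1/(tf + 1)."""
    return inf1 / (tf + 1)

# The higher the within-document frequency, the smaller the risk share:
gains = [first_normalization_L(8.0, tf) for tf in (1, 3, 7)]  # 4.0, 2.0, 1.0
```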
Term frequency normalization
Before using the within-document frequency tf of a term, the document length dl is normalized to a standard length sl. The term frequencies tf are then recalculated with respect to the standard document length, that is:

tfn = tf * log(1 + sl/dl)   (normalization 1)

tfn represents the normalized term frequency. Another version of the normalization formula is the following:

tfn = tf * log(1 + c*(sl/dl))   (normalization 2)

Normalization 2 is usually considered more flexible, since there is no fixed value for c.

# tf is the term-frequency of the term t in the document d.
# dl is the document length.
# sl is the standard length.
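Normalization 2 can be sketched as below; the source writes log without specifying a base, so base 2 is assumed here (the function name and defaults are ours):

```python
import math

def tfn(tf: float, dl: float, sl: float, c: float = 1.0) -> float:
    """Normalization 2: rescale the within-document frequency tf with
    respect to a standard document length sl; c tunes the strength."""
    return tf * math.log2(1 + c * sl / dl)

# A document exactly at the standard length is unchanged (log2(2) = 1):
# tfn(3, 100, 100) -> 3.0
# The same tf in a shorter document is boosted: tfn(3, 50, 100) > 3.0
```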
Mathematical and statistical tools
The probability space
Sampling space V
Utility-theoretic indexing, developed by Cooper and Maron, is a theory of indexing based on utility theory. To reflect the value that users expect from documents, index terms are assigned to documents. Utility-theoretic indexing is also related to an "event space" in the statistical sense. There are several basic spaces Ω in information retrieval. A very simple basic space Ω is the set V of terms t, which is called the vocabulary of the document collection. Since Ω = V is the set of all mutually exclusive events, Ω is also the certain event, with probability:

P(Ω) = Σ_{t ∈ V} P(t) = 1

Thus P, the probability distribution, assigns probabilities to all sets of terms of the vocabulary. Notice that the basic problem of information retrieval is to find an estimate for P(t). Estimates are computed on the basis of sampling, and the experimental text collection furnishes the samples needed for the estimation. The main concern is then how to treat two arbitrary but heterogeneous pieces of text appropriately: for example, a chapter in a science magazine and an article from a sports newspaper. They can be considered two different samples, since they aim at different populations.
Sampling with a document
The relationship of the document to the experiment is made by the way in which the sample space is chosen. In IR, the term experiment, or trial, is used here with a technical meaning rather than in the common sense. For example, a document could be an experiment, meaning that the document is a sequence of outcomes t ∈ V, or just a sample of a population. We will talk about the event of observing a number Xt = tf of occurrences of a given word t in a sequence of experiments. In order to introduce this event space, we should introduce the product of the probability spaces associated with the experiments of the sequence. We can introduce our sample space to associate a point with each possible configuration of the outcomes. The one-to-one correspondence for the sample space can be defined as:

Ω = V^ld

where ld is the number of trials of the experiment, or in this example, the length of the document. We can assume that each outcome may or may not depend on the outcomes of the previous experiments. If the experiments are designed so that an outcome influences the next outcomes, then the probability distribution on V is different at each trial. But, more commonly, in order to establish the simpler case in which the probability space is invariant, the term-independence assumption is often made in IR. Therefore, all possible configurations of Ω = V^ld are considered equiprobable. Under this assumption, each document can be considered a Bernoulli process. The probability spaces of the product are invariant, and the probability of a given sequence is the product of the probabilities at each trial. Consequently, if p = P(t) is the prior probability that the outcome is t, and the number of experiments is ld, the probability of Xt = tf is equal to:

P(Xt = tf | p) = C(ld, tf) * p^tf * (1 - p)^(ld - tf)

which is the sum of the probabilities of all possible configurations having tf outcomes out of ld. P(Xt = tf | p) is a probability distribution, because summing it over all possible tf gives (p + (1 - p))^ld = 1.

# ld is the length of document d.
# tf is the term frequency of t in document d.
# Xt is the number of occurrences of a specific word in one list.
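The binomial probability above is easy to check numerically; the sketch below (function name ours) verifies that the probabilities over all possible tf sum to 1, which is the claim made at the end of this section:

```python
from math import comb

def prob_tf(tf: int, ld: int, p: float) -> float:
    """Binomial probability of observing exactly tf occurrences of a term
    in a document of length ld, given prior term probability p = P(t)."""
    return comb(ld, tf) * p**tf * (1 - p)**(ld - tf)

# P(Xt = tf | p) is a probability distribution: summing over tf = 0..ld
# gives (p + (1 - p))**ld = 1.
total = sum(prob_tf(tf, 100, 0.01) for tf in range(101))
```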
Multiple samplings
Having already considered the hypothesis of a single sample, we now consider having several samples, for example a collection D of documents. The situation of having a collection of N documents is abstractly equivalent to the scheme of placing a certain number Tot of balls of V colored types in a collection of N cells. For each term t ∈ V, a possible configuration of ball placement satisfies the equation:

tf1 + ... + tfN = Ft

and the condition

F1 + ... + FV = Tot

where Ft is the number of balls of the same color t to be distributed among the N cells. We have thus changed the basic space. The outcome of our experiment will be the documents d in which the balls are placed, and there will be many possible configurations consistent with the number of colored balls.

# Ft is the total number of tokens of t in the collection.
# Tot is the total number of tokens in the collection D.
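The two constraints on ball placements can be checked on a toy collection (the documents and terms here are invented for illustration):

```python
# Toy collection: N = 3 documents (cells), V = 2 terms (colors).
counts = {
    "d1": {"apple": 2, "ball": 1},
    "d2": {"apple": 0, "ball": 3},
    "d3": {"apple": 1, "ball": 0},
}

# F_t = tf_1 + ... + tf_N for each term t (all balls of one color).
F = {t: sum(doc.get(t, 0) for doc in counts.values())
     for t in ("apple", "ball")}

# Tot = F_1 + ... + F_V (every token placed somewhere in the collection).
Tot = sum(F.values())  # here: 3 + 4 = 7
```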
Distributions
* Binomial distribution
* Hypergeometric distribution
* Bose–Einstein statistics
* Fat-tailed distribution
Conclusion
The divergence from randomness model is based on the Bernoulli model and its limiting forms, the hypergeometric distribution, Bose–Einstein statistics and its limiting forms, the compound of the binomial distribution with the beta distribution, and the fat-tailed distribution. The divergence from randomness model offers a unifying framework with the potential to construct many different effective IR models.
Applications
Applications and characteristics
# The divergence from randomness model can be applied in automatic indexing in information retrieval. It builds on the notion of eliteness, the informative content of a term within a document.
# The effectiveness of the models based on divergence from randomness is very high in comparison with both BM25 and the language model. For short queries, the performance of the divergence from randomness models is definitely better than that of the BM25 model, which since 1994 has been used as a standard baseline for the comparison of models.
# The divergence from randomness model can achieve the best performance with only a few documents, compared with other query expansion techniques.
# The framework of the divergence from randomness model is very general and flexible. With the query expansion provided for each component, different technologies can be applied in order to get the best performance.
Proximity
Proximity can be handled within divergence from randomness to consider the number of occurrences of a pair of query terms within a window of pre-defined size. Specifically, the DFR Dependence Score Modifier (DSM) implements both the pBiL and pBiL2 models, which calculate the randomness divided by the document's length, rather than from the statistics of the pair in the corpus.
Examples of divergence from randomness
Let t be a term and c be a collection. Let the term occur in tfc = nL(t,c) = 200 locations, and in df(t,c) = nD(t,c) = 100 documents. The expected average term frequency is avgtf(t,c) = 200/100 = 2; this is the average over the documents in which the term occurs.
Let ND(c) = 1000 be the total number of documents. The term occurs in 10% of the documents: PD(t,c) = 100/1000. The expected average term frequency over all documents is 200/1000 = 1/5. The term frequency is denoted kt = 0, ..., 6.

The following table shows, in the column nD, the number of documents that contain kt occurrences of t, written nD(t,c,kt). The column nL gives the number of locations at which the term occurs, following the equation nL = kt * nD. The columns to the right show the observed and Poisson probabilities.
Pobs,all(kt) is the observed probability over all documents. PPoisson,all,λ(kt) is the Poisson probability, where λ(t,c) = nL(t,c)/ND(c) = 0.20 is the Poisson parameter. The table illustrates how the observed probability differs from the Poisson probability: PPoisson(1) is greater than Pobs(1), whereas for kt > 1 the observed probabilities are greater than the Poisson probabilities. There is more mass in the tail of the observed distribution than the Poisson distribution assumes.
Moreover, the columns to the right illustrate the usage of the elite documents instead of all documents. There, the single-event probability is based on the locations of elite documents only.
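The Poisson column described above can be reproduced directly from the numbers in the example (λ = nL(t,c)/ND(c) = 0.20); a minimal sketch:

```python
import math

def poisson(k: int, lam: float) -> float:
    """Poisson probability of k occurrences with parameter lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)

lam = 200 / 1000  # nL(t,c) / ND(c) = 0.20 from the example
probs = [poisson(kt, lam) for kt in range(7)]  # kt = 0..6
# Nearly all Poisson mass sits at kt = 0 or 1; real term-frequency
# distributions put more mass in the tail, as the table illustrates.
```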
Further examples of interest
# Adjusting document length
# Applying DFR in content-only XML documents
External links
Glasgow IR group Wiki DFR page