The bag-of-words (BoW) model is a model of text which uses an unordered collection (a "bag") of words. It is used in natural language processing and information retrieval (IR). It disregards word order (and thus most of syntax or grammar) but captures multiplicity. The bag-of-words model is commonly used in methods of document classification where, for example, the (frequency of) occurrence of each word is used as a feature for training a classifier. It has also been used for computer vision. An early reference to "bag of words" in a linguistic context can be found in Zellig Harris's 1954 article on ''Distributional Structure''.


Definition

The following models a text document using bag-of-words. Here are two simple text documents:

(1) John likes to watch movies. Mary likes movies too.
(2) Mary also likes to watch football games.

Based on these two text documents, a list is constructed as follows for each document:

"John","likes","to","watch","movies","Mary","likes","movies","too"
"Mary","also","likes","to","watch","football","games"

Representing each bag-of-words as a JSON object, and attributing to the respective JavaScript variable:

BoW1 = {"John":1,"likes":2,"to":1,"watch":1,"movies":2,"Mary":1,"too":1};
BoW2 = {"Mary":1,"also":1,"likes":1,"to":1,"watch":1,"football":1,"games":1};

Each key is the word, and each value is the number of occurrences of that word in the given text document. The order of elements is free, so, for example, {"too":1,"Mary":1,"movies":2,"John":1,"watch":1,"likes":2,"to":1} is also equivalent to ''BoW1''. This is also what we expect from a strict ''JSON object'' representation.

Note: if another document is like a union of these two,

(3) John likes to watch movies. Mary likes movies too. Mary also likes to watch football games.

its JavaScript representation will be:

BoW3 = {"John":1,"likes":3,"to":2,"watch":2,"movies":2,"Mary":2,"too":1,"also":1,"football":1,"games":1};

So, as we see in the bag algebra, the "union" of two documents in the bags-of-words representation is, formally, the disjoint union, summing the multiplicities of each element:

BoW3 = BoW1 ⊎ BoW2
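
The same construction can be sketched in plain Python with collections.Counter (the tokenization below, which merely strips periods and splits on whitespace, is a simplification assumed for illustration):

from collections import Counter

doc1 = "John likes to watch movies. Mary likes movies too."
doc2 = "Mary also likes to watch football games."

def bag_of_words(text):
    # Strip periods and split on whitespace (a simplistic tokenizer for illustration).
    return Counter(text.replace(".", "").split())

bow1 = bag_of_words(doc1)
bow2 = bag_of_words(doc2)
print(bow1)  # Counter({'likes': 2, 'movies': 2, 'John': 1, ...})

# The "union" of the two documents sums the multiplicities of each word,
# giving the counts shown for BoW3 above.
bow3 = bow1 + bow2
print(bow3)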


Word order

The BoW representation of a text removes all word ordering. For example, the BoW representations of "man bites dog" and "dog bites man" are the same, so any algorithm that operates on a BoW representation of text must treat them in the same way. Despite this lack of syntax or grammar, the BoW representation is fast and may be sufficient for simple tasks that do not require word order. For instance, for document classification, if the words "stocks", "trade", and "investors" appear multiple times, then the text is likely a financial report, even though this would be insufficient to distinguish between

Yesterday, investors were rallying, but today, they are retreating.

and

Yesterday, investors were retreating, but today, they are rallying.

and so the BoW representation would be insufficient to determine the detailed meaning of the document.
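
This order-invariance can be checked directly with a small sketch (again assuming a simple whitespace tokenization):

from collections import Counter

def bow(text):
    return Counter(text.split())

# Both sentences contain the same multiset of words, so their BoW
# representations compare equal even though their meanings differ.
print(bow("man bites dog") == bow("dog bites man"))  # True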


Implementations

Implementations of the bag-of-words model might involve using frequencies of words in a document to represent its contents. The frequencies can be "normalized" by the inverse of document frequency, or tf–idf. Additionally, for the specific purpose of classification, supervised alternatives have been developed to account for the class label of a document. Lastly, binary (presence/absence or 1/0) weighting is used in place of frequencies for some problems (e.g., this option is implemented in the WEKA machine learning software system).
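
As an illustrative sketch of these weighting choices (using scikit-learn, which is not mentioned above and is assumed here purely for demonstration), raw counts, binary presence/absence weights, and tf–idf weights can be computed as follows:

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "Mary also likes to watch football games.",
]

counts = CountVectorizer().fit_transform(docs)             # raw term frequencies
binary = CountVectorizer(binary=True).fit_transform(docs)  # presence/absence (1/0) weights
tfidf = TfidfVectorizer().fit_transform(docs)              # frequencies scaled by inverse document frequency

print(counts.toarray())
print(binary.toarray())
print(tfidf.toarray())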


Python implementation

# Make sure to install the necessary packages first
# pip install --upgrade pip
# pip install tensorflow

from tensorflow import keras
from typing import List
from keras.preprocessing.text import Tokenizer

sentence = ["John likes to watch movies. Mary likes movies too."]

def print_bow(sentence: List[str]) -> None:
    # Build the vocabulary (word -> index) from the input texts.
    tokenizer = Tokenizer()
    tokenizer.fit_on_texts(sentence)
    # Convert each text to a sequence of word indices.
    sequences = tokenizer.texts_to_sequences(sentence)
    word_index = tokenizer.word_index
    # Count how often each vocabulary word occurs in the first text.
    bow = {}
    for key in word_index:
        bow[key] = sequences[0].count(word_index[key])

    print(f"Bag of word sentence 1:\n{bow}")
    print(f"We found {len(word_index)} unique tokens.")

print_bow(sentence)


Hashing trick

A common alternative to using dictionaries is the hashing trick, where words are mapped directly to indices with a hashing function. Thus, no memory is required to store a dictionary. Hash collisions are typically handled by using the freed-up memory to increase the number of hash buckets. In practice, hashing simplifies the implementation of bag-of-words models and improves scalability.
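
A minimal sketch of the idea in Python (the bucket count is arbitrary, and Python's built-in hash is used only for illustration; a real implementation would use a stable hash function):

def hashed_bow(text, n_buckets=16):
    # Map each token straight to a bucket index with a hash function,
    # instead of maintaining a word-to-index dictionary.
    vec = [0] * n_buckets
    for token in text.split():
        vec[hash(token) % n_buckets] += 1  # colliding words share a bucket
    return vec

print(hashed_bow("John likes to watch movies. Mary likes movies too."))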


See also

* Additive smoothing
* Feature extraction
* Machine learning
* MinHash
* Vector space model
* w-shingling

