Statistically improbable phrase

A statistically improbable phrase (SIP) is a phrase or set of words that occurs more frequently in a document (or collection of documents) than in some larger corpus. Amazon.com uses this concept in determining keywords for a given book or chapter, since keywords of a book or chapter are likely to appear disproportionately within that section.
Christian Rudder has also used this concept with data from online dating profiles and Twitter posts to determine the phrases most characteristic of a given race or gender in his book ''Dataclysm''. SIPs of two or three words (for example adjective, adjective, noun, or adverb, adverb, verb) can signal the author's attitude, premise or conclusions to the reader, or express an important idea.

Another use of SIPs is as a detection tool for plagiarism. (Almost) unique combinations of words can be searched for online, and if they have appeared in a published text, the search will identify where. This method only checks texts that have been published and digitized online. For example, a student submission that contained the phrase "garden style, praising irregularity in design" might be searched for using Google.com, and the search would yield the original Wikipedia article about Sir William Temple, English political figure and essayist.
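The plagiarism check described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: instead of querying a web search engine, it compares a submission against a small local collection of known texts, and the function names (`normalize`, `shared_phrases`), the five-word phrase length, and the sample sentences are all invented for the example.

```python
import re


def normalize(text):
    """Lowercase the text and reduce it to words separated by single spaces."""
    return " ".join(re.findall(r"[a-z0-9']+", text.lower()))


def shared_phrases(submission, known_texts, n=5):
    """Map each n-word phrase of `submission` that appears verbatim in a
    known text to the names of the texts containing it.

    `known_texts` is a hypothetical name -> text mapping standing in for
    the published, digitized material a search engine would cover.
    """
    words = normalize(submission).split()
    # Pad with spaces so that only whole-word matches are counted.
    sources = {name: f" {normalize(text)} " for name, text in known_texts.items()}
    hits = {}
    for i in range(len(words) - n + 1):
        phrase = " ".join(words[i:i + n])
        found = [name for name, text in sources.items() if f" {phrase} " in text]
        if found:
            hits[phrase] = found
    return hits


# Invented sample texts; only the quoted phrase comes from the article.
known = {"known article": "He wrote on the garden style, praising irregularity in design."}
essay = "My essay discusses garden style, praising irregularity in design as a virtue."
for phrase, names in shared_phrases(essay, known).items():
    print(phrase, "->", names)
```

Like the search-based method, this sketch can only flag overlap with texts it has access to, and a long shared phrase is a reason to look closer rather than proof of copying.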


Example

In a document about computers, the most common word is likely to be "the", but since "the" is the most commonly used word in the English language, almost any given document will use it very frequently. However, a phrase like "explicit Boolean algorithm" might occur in the document at a much higher rate than its average rate in the English language. Hence, it is a phrase unlikely to occur in any given document, but one that ''did'' occur in the document given. "Explicit Boolean algorithm" would therefore be a statistically improbable phrase.

Statistically improbable phrases of Darwin's ''On the Origin of Species'' could be: ''temperate productions, genera descended, transitional gradations, unknown progenitor, fossiliferous formations, our domestic breeds, modified offspring, doubtful forms, closely allied forms, profitable variations, enormously remote, transitional grades, very distinct species'' and ''mongrel offspring'' ("Sociologically Improbable Phrases", Crooked Timber, April 2005).
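The comparison in this example can be made concrete with a small Python sketch. It is not the method used by Amazon.com or any other service, which the article does not describe; it simply assumes one plausible scoring rule, the logarithm of a phrase's relative frequency in the document divided by its relative frequency in a background corpus, with add-one smoothing for phrases the background never contains. The function names and sample strings are made up for illustration.

```python
import math
import re
from collections import Counter


def ngrams(text, n):
    """Return every n-word phrase in `text`, lowercased."""
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def improbable_phrases(document, background, n=2, top=5):
    """Rank n-word phrases by how over-represented they are in `document`.

    Score = log(relative frequency in document / smoothed relative
    frequency in background). The smoothing is an assumption of this sketch.
    """
    doc_counts = Counter(ngrams(document, n))
    bg_counts = Counter(ngrams(background, n))
    doc_total = sum(doc_counts.values())
    bg_total = sum(bg_counts.values())
    vocab = len(set(doc_counts) | set(bg_counts))

    scores = {}
    for phrase, count in doc_counts.items():
        p_doc = count / doc_total
        # Add-one smoothing keeps unseen background phrases from dividing by zero.
        p_bg = (bg_counts[phrase] + 1) / (bg_total + vocab)
        scores[phrase] = math.log(p_doc / p_bg)
    return sorted(scores, key=scores.get, reverse=True)[:top]


# Toy data: a tiny "document" and an even tinier stand-in for a large corpus.
doc = "the explicit boolean algorithm is the explicit boolean algorithm of the machine"
corpus = "the cat sat on the mat and the dog sat on the log of the machine is the best"
print(improbable_phrases(doc, corpus, n=2, top=3))
```

With these toy inputs, common pairs such as "of the" score low because the background contains them too, while pairs built around "explicit", "boolean", and "algorithm" rise to the top. A real system would need a much larger background corpus for the ratios to be meaningful.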


See also

* Collocation – Any series of words that co-occur more often than would be expected by chance
* Googlewhack – A pair of words occurring on a single webpage, as indexed by Google
* tf-idf – A statistic used in information retrieval and text mining

