A scraper site is a website that copies content from other websites using web scraping. The content is then mirrored with the goal of creating revenue, usually through advertising and sometimes by selling user data.
Scraper sites come in various forms. Some provide little if any material or information and are intended to obtain user information, such as e-mail addresses that are then targeted with spam e-mail. Price aggregation and shopping sites access multiple listings of a product and allow a user to rapidly compare the prices.
Examples of scraper websites
Search engines such as Google could be considered a type of scraper site. Search engines gather content from other websites, save it in their own databases, index it and present the scraped content to the search engines' own users. The majority of content scraped by search engines is copyrighted.
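The gather, save, index and present steps described above can be illustrated with a toy inverted index. This is only a minimal sketch: the function names `build_index` and `search` and the sample pages are assumptions made for illustration, and real search engines are far more elaborate.

```python
import re
from collections import defaultdict


def build_index(pages):
    """Map each lowercase word to the set of page URLs that contain it.

    `pages` is a dict of {url: page_text}; both are hypothetical sample data.
    """
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def search(index, query):
    """Return the URLs that contain every word of the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    # Start from the pages matching the first word, then intersect.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


if __name__ == "__main__":
    sample_pages = {
        "https://example.com/a": "Red apple pie recipe",
        "https://example.com/b": "Green apple varieties",
    }
    idx = build_index(sample_pages)
    print(search(idx, "apple"))
    print(search(idx, "red apple"))
```

The index stores only which pages contain which words; presenting snippets or cached copies, as search engines do, would require keeping the gathered text as well.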
The scraping technique has been used on various dating websites as well. These sites often combine their scraping activities with facial recognition.
Scraping is also used on general image analysis (recognition) websites, as well as websites specifically made to identify images of crops with pests and diseases.
Made for advertising
Some scraper sites are created to make money by using advertising programs. In such cases, they are called ''Made for AdSense'' sites or MFA. This derogatory term refers to websites that have no redeeming value except to lure visitors to the website for the sole purpose of clicking on advertisements.
''Made for AdSense'' sites are considered search engine spam that dilutes the search results with less-than-satisfactory entries. The scraped content is redundant: it duplicates content that the search engine would have shown under normal circumstances, had no MFA website been found in the listings.
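One common way to measure this kind of redundancy is to compare pages by their overlapping word sequences. The sketch below uses word shingles and Jaccard similarity; it is an illustrative assumption, not a description of how any particular search engine detects scraped content, and the names `shingles` and `jaccard` are hypothetical.

```python
import re


def shingles(text, k=3):
    """Return the set of k-word sequences (shingles) in the text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < k:
        # Short texts are treated as a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(text_a, text_b, k=3):
    """Jaccard similarity of the two texts' shingle sets, from 0.0 to 1.0."""
    set_a, set_b = shingles(text_a, k), shingles(text_b, k)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


if __name__ == "__main__":
    original = "the quick brown fox jumps over the lazy dog"
    copied = "the quick brown fox jumps over the lazy dog"
    print(jaccard(original, copied))
```

A score near 1.0 suggests that one page largely reproduces the other; the threshold at which a page counts as a copy is a judgment call and is not specified here.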
Some scraper sites link to other sites in order to improve their search engine ranking through a private blog network. Prior to Google's update to its search algorithm known as Panda, a type of scraper site known as an auto blog was quite common among black-hat marketers who used a method known as spamdexing.
Legality
Scraper sites may violate copyright law. Even taking content from an open content site can be a copyright violation, if done in a way which does not respect the license. For instance, the GNU Free Documentation License (GFDL) and Creative Commons ShareAlike (CC-BY-SA) licenses used on Wikipedia require that a republisher of Wikipedia inform its readers of the conditions on these licenses, and give credit to the original author.
Techniques
The methods by which websites are targeted differ depending on the objective of the scraper. For example, sites with large amounts of content, such as those of airlines, consumer electronics retailers and department stores, might be routinely targeted by their competitors just to stay abreast of pricing information.
Another type of scraper will pull snippets and text from websites that rank high for keywords they have targeted. This way they hope to rank highly in the search engine results pages (SERPs), piggybacking on the original page's page rank.
RSS feeds are vulnerable to scrapers.
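Part of the reason is that a feed publishes content in a standardized, computer-readable format, so very little code is needed to read it. The sketch below parses an inline sample with Python's standard `xml.etree.ElementTree` module; the feed contents, the `SAMPLE_FEED` constant and the `extract_items` helper are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 feed used only as sample input.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Blog</title>
    <item>
      <title>First post</title>
      <link>https://example.com/first</link>
      <description>Hello world</description>
    </item>
    <item>
      <title>Second post</title>
      <link>https://example.com/second</link>
      <description>More text</description>
    </item>
  </channel>
</rss>"""


def extract_items(feed_xml):
    """Return a list of dicts with the title, link and description of each item."""
    root = ET.fromstring(feed_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items


if __name__ == "__main__":
    for entry in extract_items(SAMPLE_FEED):
        print(entry["title"], entry["link"])
```

Publishers who want to limit this exposure sometimes put only summaries rather than full articles in their feeds, although that is a general observation and not something the text above states.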
Other scraper sites consist of advertisements and paragraphs of words randomly selected from a dictionary. Often a visitor will click on a pay-per-click advertisement on such a site because it is the only comprehensible text on the page. Operators of these scraper sites gain financially from these clicks. Advertising networks claim to be constantly working to remove these sites from their programs, although these networks benefit directly from the clicks generated at this kind of site. From the advertisers' point of view, the networks do not seem to be making enough effort to stop this problem.
Scrapers tend to be associated with link farms and are sometimes perceived as the same thing when multiple scrapers link to the same target site. A frequently targeted victim site might be accused of link-farm participation because of the artificial pattern of incoming links from multiple scraper sites.
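A site owner who wants to see how much of an inbound link profile comes from scrapers could compute a simple share, as sketched below. This assumes the owner already has a list of referring URLs and a list of hostnames believed to be scraper sites; the function name `inbound_scraper_share` and all sample data are hypothetical, and no search engine is known to use this exact measure.

```python
from urllib.parse import urlparse


def inbound_scraper_share(backlinks, known_scrapers):
    """Return the fraction of referring URLs hosted on known scraper hostnames.

    backlinks: list of referring page URLs (assumed input).
    known_scrapers: set of lowercase hostnames flagged as scraper sites (assumed input).
    """
    if not backlinks:
        return 0.0
    hosts = [urlparse(url).hostname or "" for url in backlinks]
    flagged = sum(1 for host in hosts if host in known_scrapers)
    return flagged / len(hosts)


if __name__ == "__main__":
    links = [
        "https://s1.example/copy-a",
        "https://s2.example/copy-b",
        "https://news.example/story",
        "https://s1.example/copy-c",
    ]
    print(inbound_scraper_share(links, {"s1.example", "s2.example"}))
```

A high share does not show that the target site took part in a link farm; it only quantifies the pattern that could lead to such an accusation.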
Domain hijacking
Some programmers who create scraper sites may purchase a recently expired domain name to reuse its SEO power in Google. Whole businesses exist that focus on understanding all expired domains and utilising them for their historical ranking ability. Doing so allows SEOs to utilize the already-established backlinks to the domain name. Some spammers may try to match the topic of the expired site or copy the existing content from the Internet Archive to maintain the authenticity of the site so that the backlinks don't drop. For example, an expired website about a photographer may be re-registered to create a site about photography tips, or the domain name may be used in a private blog network to power the spammer's own photography site.
Services at some expired domain name registration agents provide both the facility to find these expired domains and to gather the HTML that the domain name used to have on its web site.
See also
* Scraping
* Contact scraping
* Domain parking
* Web scraping
* Blog scraping
* Multi-protocol messengers: can connect to several networks, yet require an account on each of them, so they do not violate any terms of the networks
* Content farm
* Search engine optimization (SEO)