Googlebot is the web crawler software used by Google that collects documents from the web to build a searchable index for the Google Search engine. The name actually refers to two different types of web crawlers: a desktop crawler (to simulate desktop users) and a mobile crawler (to simulate mobile users).


Behavior

A website will probably be crawled by both Googlebot Desktop and Googlebot Mobile. However, starting from September 2020, all sites were switched to mobile-first indexing, meaning Google crawls the web using a smartphone Googlebot. The subtype of Googlebot can be identified by looking at the user agent string in the request. Both crawler types nevertheless obey the same product token (user agent token) in robots.txt, so a developer cannot selectively target either Googlebot Mobile or Googlebot Desktop using robots.txt.

Google provides various methods that enable website owners to manage the content displayed in Google's search results. If a webmaster chooses to restrict the information on their site available to Googlebot, or another spider, they can do so with the appropriate directives in a robots.txt file, or by adding a robots meta tag to the web page. Googlebot requests to web servers are identifiable by a user-agent string containing "Googlebot" and a host address containing "googlebot.com".

Currently, Googlebot follows HREF links and SRC links. There is increasing evidence that Googlebot can execute JavaScript and parse content generated by Ajax calls as well. There are many theories about how advanced Googlebot's JavaScript processing is, with opinions at the low end positing only a minimal ability derived from custom interpreters. Currently, Googlebot uses a web rendering service (WRS) that is based on the Chromium rendering engine (version 74 as of 7 May 2019).

Googlebot discovers pages by harvesting every link on every page that it can find. Unless prohibited by a nofollow tag, it then follows these links to other web pages. New web pages must be linked to from other known pages on the web in order to be crawled and indexed, or be manually submitted by the webmaster.

A problem that webmasters with low-bandwidth web hosting plans have often noted is that Googlebot consumes an enormous amount of bandwidth. This can cause websites to exceed their bandwidth limit and be taken down temporarily. It is especially troublesome for mirror sites that host many gigabytes of data. Google provides Search Console, which allows website owners to throttle the crawl rate.

How often Googlebot will crawl a site depends on the crawl budget. Crawl budget is an estimate of how often a website is updated. Technically, Googlebot's development team (the Crawling and Indexing team) uses several defined terms internally to cover what "crawl budget" stands for.

Since May 2019, Googlebot has used the latest Chromium rendering engine, which supports ECMAScript 6 features. This makes the bot more "evergreen" and ensures that it does not rely on a rendering engine that is outdated compared to browser capabilities.
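
The robots.txt behavior described in this section can be illustrated with a minimal sketch that uses Python's standard `urllib.robotparser` module. The robots.txt content, the `/private/` path, the `example.com` URLs, and the helper name `googlebot_may_fetch` are all hypothetical and chosen only for illustration. The sketch shows how a single rule group addressed to the shared "Googlebot" product token is evaluated; it does not reproduce Google's own parser.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one rule group for the shared "Googlebot" token.
# Because both Googlebot subtypes obey this same token, the rule below
# cannot single out only the desktop or only the mobile crawler.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/
"""


def googlebot_may_fetch(robots_txt: str, url: str) -> bool:
    """Return True if the 'Googlebot' token is allowed to fetch url."""
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    # Pass the product token, not a full user agent string.
    return parser.can_fetch("Googlebot", url)


if __name__ == "__main__":
    for url in ("https://example.com/index.html",
                "https://example.com/private/report.html"):
        print(url, googlebot_may_fetch(ROBOTS_TXT, url))
```

Passing the bare product token rather than a complete user agent string is deliberate here: the standard-library parser matches rule groups against the token portion of the name it is given.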


Mediabot

Mediabot is the web crawler that Google uses for analyzing content so that Google AdSense can serve contextually relevant advertising to a web page. Mediabot identifies itself with the user agent string "Mediapartners-Google/2.1". Unlike other crawlers, Mediabot does not follow links to discover new crawlable URLs, instead only visiting URLs that have included the AdSense code. Where that content resides behind a login, the crawler can be given a login so that it is able to crawl protected content.
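
As a rough sketch of how a server-side log filter might tell these crawlers apart, the function below checks a request's user agent string for the substrings mentioned in this article ("Googlebot" and "Mediapartners-Google"). The function name `classify_crawler` and the return labels are hypothetical. A user agent header is supplied by the client and can be spoofed, so a match on its own should be treated as a hint and not as proof of origin.

```python
def classify_crawler(user_agent: str) -> str:
    """Label a request by the crawler substring found in its user agent.

    Returns "mediabot", "googlebot", or "other". The labels are
    illustrative only; they are not identifiers defined by Google.
    """
    if "Mediapartners-Google" in user_agent:
        return "mediabot"
    if "Googlebot" in user_agent:
        return "googlebot"
    return "other"


if __name__ == "__main__":
    samples = [
        "Mediapartners-Google/2.1",
        "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
        "Mozilla/5.0 (X11; Linux x86_64)",
    ]
    for ua in samples:
        print(classify_crawler(ua), "<-", ua)
```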


Inspection Tool Crawlers

InspectionTool is the crawler used by Search testing tools such as the Rich Result Test and URL inspection in Google Search Console. Apart from the user agent and user agent token, it mimics Googlebot. A guide to the crawlers was independently published. It details four distinct crawler agents based on web server directory index data: one non-Chrome crawler and three Chrome crawlers.


External links


Google's official Googlebot FAQ