Googlebot is the web crawler software used by Google that collects documents from the web to build a searchable index for the Google Search engine. The name actually refers to two different types of web crawlers: a desktop crawler (to simulate a desktop user) and a mobile crawler (to simulate a mobile user).
Behavior
A website will probably be crawled by both Googlebot Desktop and Googlebot Mobile. However, Google announced that, starting from September 2020, all sites would be switched to mobile-first indexing, meaning that Google crawls the web using a smartphone Googlebot. The subtype of Googlebot can be identified by looking at the user agent string in the request. Both crawler types, however, obey the same product token (user agent token) in robots.txt, so a developer cannot selectively target either Googlebot Mobile or Googlebot Desktop using robots.txt.
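As a rough illustration of identifying the subtype from the user agent string, the following sketch classifies a request as coming from the mobile or desktop crawler. The function name `googlebot_subtype` is hypothetical, the sample strings are illustrative and may not match what the crawler currently sends, and the rule "a Googlebot string containing `Mobile` is the smartphone crawler" is an assumption rather than a documented contract.

```python
def googlebot_subtype(user_agent):
    """Return "mobile", "desktop", or None if the string does not name Googlebot.

    Heuristic only: assumes the smartphone crawler's string carries a
    "Mobile" token and the desktop crawler's string does not.
    """
    ua = user_agent.lower()
    if "googlebot" not in ua:
        return None
    return "mobile" if "mobile" in ua else "desktop"


# Illustrative sample strings (assumed shapes, not authoritative values).
DESKTOP_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
MOBILE_UA = (
    "Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) "
    "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 "
    "(compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
)
```

Because a user agent string can be sent by any client, a check like this only labels what the request claims to be; it does not prove the request came from Google.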
If a webmaster chooses to restrict the information on their site available to Googlebot, or another spider, they can do so with the appropriate directives in a robots.txt file, or by adding the meta tag to the web page. Googlebot requests to web servers are identifiable by a user-agent string containing "Googlebot" and a host address containing "googlebot.com".
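The two mechanisms in this paragraph can be sketched with Python's standard library. The first function uses `urllib.robotparser` to check whether a robots.txt file permits a given product token to fetch a URL; the second applies the user-agent and host-address test to values the caller has already obtained. The robots.txt content, the function names, and the sample hostname are all hypothetical, and the host check uses a suffix match, which is an assumption that is stricter than "containing".

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks Googlebot from one directory.
EXAMPLE_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/
"""


def may_fetch(robots_txt, product_token, url):
    """Return True if robots_txt allows product_token to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # Pass the bare product token (e.g. "Googlebot"), not the full
    # user agent string, so that the parser can match the group name.
    return parser.can_fetch(product_token, url)


def looks_like_googlebot(user_agent, reverse_dns_host):
    """Check the user agent and an already-resolved reverse DNS hostname.

    The caller is assumed to have looked up the hostname for the
    requesting IP address; no network access happens here.
    """
    host = reverse_dns_host.lower().rstrip(".")
    host_ok = host == "googlebot.com" or host.endswith(".googlebot.com")
    return "googlebot" in user_agent.lower() and host_ok
```

Matching on the end of the hostname rather than on any substring avoids accepting a name such as `googlebot.com.evil.example`, which contains the expected text but belongs to a different domain.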
Currently, Googlebot follows HREF links and SRC links. There is increasing evidence that Googlebot can execute JavaScript and parse content generated by Ajax calls as well. There are many theories regarding how advanced Googlebot's ability to process JavaScript is, with some opinions holding that it has only minimal ability derived from custom interpreters. Currently, Googlebot uses a web rendering service (WRS) that is based on the Chromium rendering engine (version 74 as of 7 May 2019).
Googlebot discovers pages by harvesting every link on every page that it can find. Unless prohibited by a nofollow tag, it then follows these links to other web pages. New web pages must be linked to from other known pages on the web in order to be crawled and indexed, or be manually submitted by the webmaster.
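Link harvesting of this kind can be sketched with Python's built-in `html.parser`. The example below collects the targets of `href` and `src` attributes from a page and skips any element marked `rel="nofollow"`. It is a simplified illustration, not a description of how Googlebot is implemented; the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect absolute URLs from href/src attributes, skipping nofollow."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # rel may hold several space-separated tokens, or be absent.
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            return
        for name in ("href", "src"):
            value = attributes.get(name)
            if value:
                # Resolve relative links against the page's own URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return the followable link targets found in html, in document order."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

A crawler built on this would add each returned URL to a queue of pages to visit, which is why a new page with no inbound links from known pages is never reached this way.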
A problem that webmasters with low-bandwidth web hosting plans have often noted with Googlebot is that it takes up an enormous amount of bandwidth. This can cause websites to exceed their bandwidth limit and be taken down temporarily. This is especially troublesome for mirror sites, which host many gigabytes of data. Google provides "Search Console", which allows website owners to throttle the crawl rate.
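To see how crawl traffic can add up against a hosting plan's limit, a back-of-the-envelope estimate is sketched below. The function names and every number used with them are hypothetical; real figures would have to come from a site's own server logs.

```python
def estimate_monthly_crawl_gb(requests_per_day, avg_response_bytes, days=30):
    """Estimate crawler bandwidth in gigabytes (1 GB = 10**9 bytes)."""
    total_bytes = requests_per_day * avg_response_bytes * days
    return total_bytes / 1e9


def exceeds_limit(requests_per_day, avg_response_bytes, monthly_limit_gb):
    """Return True if the estimated crawl traffic alone exceeds the plan limit."""
    return estimate_monthly_crawl_gb(requests_per_day, avg_response_bytes) > monthly_limit_gb
```

For instance, under the assumed figures of 10,000 crawler requests per day at an average of 50,000 bytes each, the estimate comes to 15 GB over 30 days, which would exceed a hypothetical 10 GB monthly allowance before any human visitor is counted.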
How often Googlebot will crawl a site depends on the crawl budget. Crawl budget is an estimate of how frequently a website is updated. Technically, Googlebot's development team (the Crawling and Indexing team) uses several defined terms internally to cover what "crawl budget" stands for. Since May 2019, Googlebot has used the latest Chromium rendering engine, which supports ECMAScript 6 features. This makes the bot more "evergreen" and ensures that it is not relying on an outdated rendering engine compared to browser capabilities.
Mediabot
Mediabot is the web crawler that Google uses for analyzing content so that Google AdSense can serve contextually relevant advertising to a web page. Mediabot identifies itself with the user agent string "Mediapartners-Google/2.1".
Unlike other crawlers, Mediabot does not follow links to discover new crawlable URLs; instead, it only visits URLs that include the AdSense code. Where that content resides behind a login, the crawler can be given login credentials so that it is able to crawl protected content.
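Since Mediabot announces itself with its own user agent string, a server-side log filter can tell its requests apart from Googlebot's. The sketch below is one way to do that; the function name and return labels are hypothetical, and the match relies only on the strings quoted in this article.

```python
def classify_google_crawler(user_agent):
    """Return "mediabot", "googlebot", or None based on the user agent string."""
    ua = user_agent.lower()
    # Check Mediabot first: its token is distinct from "Googlebot".
    if "mediapartners-google" in ua:
        return "mediabot"
    if "googlebot" in ua:
        return "googlebot"
    return None
```

Separating the two can be useful when reading access logs, because Mediabot visits reflect pages carrying AdSense code rather than link-based discovery.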
External links
Google's official Googlebot FAQ