Introduction

To implement SEO strategies and tactics properly, one must first understand the indexing and ranking life cycle of web pages. This article describes how it happens under the hood at Google, using publicly available information. Little to no such information is available for other search engines.

Website Crawling

(Figure: Google crawling and indexing)
Before a page can be indexed, it must first be discovered. This is the job of Google's web crawlers (also called Googlebots or Google spiders). They rely on a database of links which is tied to a scheduler. This database is fed, among other sources, with links found in sitemap.xml files submitted in Google Search Console or discovered on websites.
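For illustration, here is a minimal sitemap.xml sketch with hypothetical URLs and dates; each <url> entry gives crawlers a page to discover, optionally with a last modification date and change frequency:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- Hypothetical URLs: replace with your site's real pages -->
  <url>
    <loc>https://www.example.com/</loc>
    <lastmod>2013-06-01</lastmod>
    <changefreq>weekly</changefreq>
  </url>
  <url>
    <loc>https://www.example.com/blog/how-google-crawls-and-indexes</loc>
    <lastmod>2013-05-28</lastmod>
  </url>
</urlset>
```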

Web crawlers pick links from the database and crawl the corresponding pages. They extract all links from these pages, except the NOFOLLOW ones, and add them back to the database. The crawling information is updated in the scheduler. Web pages can be crawled (or revisited) within minutes of each other or several months apart.
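As a quick example (hypothetical URLs), a link is excluded from this extraction when it carries the rel="nofollow" attribute:

```html
<!-- Followed link: its target is added back to the crawl database -->
<a href="https://www.example.com/some-page">Regular link</a>

<!-- NOFOLLOW link: crawlers do not queue its target from this page -->
<a href="https://www.example.com/untrusted-page" rel="nofollow">Untrusted link</a>
```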

SEO Mistakes & Best Practices

- Crawling rates are set per server; if many sites are hosted on the same server, each site's share depends on the value of its content and how often it changes.
- Crawling rates adapt to the server's capacity and response time.
- There is an upper limit to crawling rates.
- In critical situations, you can prevent Google from crawling your website by blocking access to content in your robots.txt (see the robots.txt sketch after this list).
- The revisit-after META tag is not used to define crawling rates.
- Setting a crawling rate means: don't crawl more than this rate!
- More crawling does not mean content is more relevant or will rank higher; it happens when content changes more often.
- Extra crawling can happen when Google has extra capacity.
- The more Google trusts a website, the more time it will spend crawling it.
- Google checks for modifications in sitemap.xml to identify new pages to crawl.
- If a website's robots.txt file cannot be read, crawling will be suspended.
- If a website goes down, crawling stops temporarily until it is live again.
- A lot of changes on a website can trigger a temporary crawling rate boost.
- Wildcard subdomains are harder to crawl and should be avoided.
- Setting canonical links on duplicate content can reduce the amount of crawling (see the canonical link example after this list).
- If not abused, submitting a new sitemap.xml with recent last modification dates accelerates the recrawling of the corresponding URLs.
- If you systematically set the server's current date as the page's last change date, it will be ignored.
- Public pages on social websites (Facebook, Google+...) are crawled.
- Home pages are crawled a bit more frequently than other pages.
- Feeds are treated like sitemaps, but they are crawled more often.
- Use PubSubHubbub to optimise feed crawling rates and bandwidth usage.
- Images are crawled at a lower rate than text content.
- Not all URLs are scheduled for recrawling.
- Google crawls at most 10 MB per page.
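
As mentioned in the list above, robots.txt can block crawling in critical situations. A minimal sketch, assuming the whole site temporarily needs to be kept out of crawlers' queues (adjust the rules to your own paths):

```
# Emergency measure: ask all crawlers to stay away from the entire site
User-agent: *
Disallow: /
```

Keep in mind that blocking crawling only stops content from being fetched; it does not by itself remove already indexed pages from the index.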
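
And here is the canonical link mentioned above, sketched with a hypothetical duplicate URL: the duplicate variant declares the preferred URL in its <head>, so Google can consolidate the variants and spend less crawling on them.

```html
<!-- On the duplicate variant, e.g. https://www.example.com/article?sessionid=123 -->
<head>
  <link rel="canonical" href="https://www.example.com/article" />
</head>
```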