What’s the secret to Google loving your website and indexing it? The secret is to steer crawlers towards the most important and useful pages of your site and away from pages with little value to searchers. You may think that you want Google or Bing to index every URL of your website, but you could be preventing them from crawling your most important pages by using up your ‘crawl budget’ on poor or irrelevant pages.
When Googlebot (or Bingbot, for that matter) comes to your website, it has a finite amount of resources to spend on accessing your pages; after all, Google has millions of other websites to crawl that day too.
Your crawl budget is influenced by the existing value or quality of your website, including but not limited to the quality of the backlinks pointing to it. I won’t speculate on more factors here, but there are some good research pieces on the web where people have tried to identify these factors.
By controlling where Googlebot is allowed to crawl within your website, you are increasing the likelihood that important and valuable pages will be crawled every time Google visits your website.
Examples of these could be your product or service pages, blog post pages or even your contact details page. All of these are pages you want ranked highly in the search results so users can find this information more quickly.
There are a number of different ways in which you can help Googlebot access your website. The more of the following you can adjust or implement, the more control you should have over Googlebot or Bingbot.
The first thing to look at is setting disallow rules in your robots.txt file for all pages, folders or file types on your site that do not need to be crawled. Upon visiting a site, the first place a crawler will look is your robots.txt file, which must sit at the root of the domain (for example https://www.mydomain.com/robots.txt). It tells crawlers which parts of your website they should not attempt to crawl, and you can set different rules for different crawler bots.
You can learn all about robots.txt and common issues in this Koozai blog post from Irish Wonder. Always test your rules in Google Search Console’s robots.txt tester tool before you put any changes live, as some rules could block your whole website or pages you didn’t want blocked.
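As a rough local sanity check before you reach for the Search Console tester, the disallow rules described above can be run against a handful of URLs with Python’s standard library. This is only a sketch: the rules, paths and URLs below are made-up examples, and Python’s parser uses simple prefix matching, so it will not necessarily agree with Google on wildcard rules.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; substitute your own robots.txt content.
ROBOTS_TXT = """\
User-agent: *
Disallow: /basket/
Disallow: /search
"""


def blocked_urls(robots_txt, urls, user_agent="Googlebot"):
    """Return the URLs that the given rules would stop user_agent from crawling."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    sample = [
        "https://www.mydomain.com/products/widget",
        "https://www.mydomain.com/basket/",
        "https://www.mydomain.com/search?q=widgets",
    ]
    for url in blocked_urls(ROBOTS_TXT, sample):
        print("blocked:", url)
```

If a page you care about shows up in the blocked list, the rule is too broad and needs tightening before it goes live.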
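To help prevent certain pages from being indexed, it is also recommended that you add a noindex robots meta tag to the <head> section of those pages. Bear in mind that a crawler has to be able to fetch a page to see this tag, so a page that is also disallowed in robots.txt will never have its noindex tag read. Once added to a page, you should test these tags by doing a ‘Fetch as Google’ request on the URLs in Google Search Console.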
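If you want to confirm that the tag has actually made it into a template, a small script can look for it in the page source. The sketch below assumes the tag takes the usual form of a meta element named "robots" (or "googlebot") with "noindex" in its content; the class and function names are my own, and it does not check the X-Robots-Tag HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots"> and <meta name="googlebot"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() in ("robots", "googlebot"):
            for directive in attributes.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def has_noindex(html):
    """Return True if the HTML carries a noindex robots meta directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(has_noindex(page))
```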
If your site is powered by a CMS or an e-commerce system, you’ll need to be careful with dynamically generated URLs causing duplicate pages. Googlebot can easily get caught up and waste time crawling these URLs. The URL parameters section in Google Search Console shows you which dynamic URLs Google has found and lets you set a preference for those it can ignore.
Be aware that this is a powerful tool and you must use it with caution as it could prevent crawling of important parts of your website.
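To get a feel for how many duplicates your parameters are generating before you set any rules, you can group a list of URLs by what they look like once the content-neutral parameters are stripped. This is a sketch under assumptions: the parameter names in IGNORED_PARAMS are hypothetical, and you would need to replace them with the ones your own CMS produces that do not change page content.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameters that change ordering or tracking, not page content.
IGNORED_PARAMS = {"sort", "sessionid", "utm_source"}


def canonical_url(url):
    """Strip ignored parameters and sort the rest so equivalent URLs compare equal."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def group_duplicates(urls):
    """Map each canonical URL to the variants that collapse onto it (duplicates only)."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_url(url)].append(url)
    return {canonical: found for canonical, found in groups.items() if len(found) > 1}


if __name__ == "__main__":
    crawled = [
        "https://www.mydomain.com/shoes?sort=price",
        "https://www.mydomain.com/shoes?sessionid=abc123",
        "https://www.mydomain.com/shoes?colour=red",
    ]
    for canonical, variants in group_duplicates(crawled).items():
        print(canonical, "<-", variants)
```

Any group with lots of variants is a good candidate for a parameter rule, but check that the parameters you ignore really do leave the content unchanged.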
Although Google won’t take your XML sitemap as a rule of which pages to crawl, it takes it as a hint – so make sure it’s up to date to help reinforce which pages of your site it should be indexing.
Remove any old pages from the sitemap and add any new ones.
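Keeping the sitemap in step with the site is easy to check mechanically. The sketch below assumes you can export a list of your live URLs from somewhere (your CMS, for instance) and that the sitemap is a standard urlset file; the function names are my own.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the set of <loc> URLs listed in a urlset sitemap."""
    root = ET.fromstring(xml_text)
    return {
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    }


def sitemap_diff(xml_text, live_urls):
    """Return (stale, missing): URLs to remove from and to add to the sitemap."""
    listed = sitemap_urls(xml_text)
    live = set(live_urls)
    return sorted(listed - live), sorted(live - listed)


if __name__ == "__main__":
    sitemap = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://www.mydomain.com/</loc></url>"
        "<url><loc>https://www.mydomain.com/old-offer</loc></url>"
        "</urlset>"
    )
    live = ["https://www.mydomain.com/", "https://www.mydomain.com/new-product"]
    stale, missing = sitemap_diff(sitemap, live)
    print("remove from sitemap:", stale)
    print("add to sitemap:", missing)
```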
Googlebot will follow links it finds in your webpage content so make sure you aren’t going to waste its time by letting it crawl links to missing pages. Use a crawling tool such as Screaming Frog’s SEO Spider tool to find these broken internal links and fix them at the source.
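To show the idea behind what a crawling tool does here, the sketch below takes pages you already have the HTML for and reports internal links that point at a URL not in that set. It is a simplification: a real crawler requests each link and checks the response code, whereas this treats "not in my list of known pages" as broken, and the function names are my own.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def broken_internal_links(pages):
    """pages maps each known URL to its HTML; returns (source, target) pairs."""
    broken = []
    for source, html in pages.items():
        extractor = LinkExtractor()
        extractor.feed(html)
        host = urlsplit(source).netloc
        for href in extractor.hrefs:
            # Resolve relative links and drop any #fragment before comparing.
            target = urldefrag(urljoin(source, href)).url
            if urlsplit(target).netloc == host and target not in pages:
                broken.append((source, target))
    return broken


if __name__ == "__main__":
    site = {
        "https://www.mydomain.com/": '<a href="/about">About</a> <a href="/old-page">Old</a>',
        "https://www.mydomain.com/about": '<a href="/">Home</a>',
    }
    for source, target in broken_internal_links(site):
        print(source, "links to missing page", target)
```

Reporting the source page alongside the missing target is what lets you fix the link where it lives rather than just redirecting the dead URL.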
Googlebot will need to load each of your pages when it visits them so by reducing the load time of each you can allow it to crawl and index more pages within the same overall time. There are a number of free tools available to help you analyse and improve site speed.
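The arithmetic behind this is simple enough to sketch. Google does not publish crawl budget as a number of seconds, so the time budget below is purely an assumed figure to illustrate the relationship between load time and pages crawled.

```python
def pages_per_visit(time_budget_seconds, avg_load_seconds):
    """Pages that fit in a fixed time budget, assuming one page is loaded at a time."""
    if avg_load_seconds <= 0:
        raise ValueError("avg_load_seconds must be positive")
    return int(time_budget_seconds // avg_load_seconds)


if __name__ == "__main__":
    budget = 60  # assumed seconds the crawler spends on the site per visit
    for load_time in (2.0, 1.0, 0.5):
        print(f"{load_time}s per page -> {pages_per_visit(budget, load_time)} pages")
```

Under this simplified model, halving the average load time doubles the number of pages that fit into the same visit.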
A good site structure is an underrated way of helping Googlebot crawl your website much more easily. Clearly categorising page content and not hiding pages away too deep in your site structure increases the likelihood they’ll be found by the crawler.
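One way to put a number on "too deep" is to count how many clicks each page is from the homepage. The sketch below assumes you already have your internal links as a simple mapping of page to linked pages (the paths shown are made up) and uses a breadth-first search to find the shortest click path to each one.

```python
from collections import deque


def click_depths(links, home):
    """Return the minimum number of clicks from home to each reachable page."""
    depths = {home: 0}
    queue = deque([home])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in depths:
                depths[target] = depths[page] + 1
                queue.append(target)
    return depths


if __name__ == "__main__":
    # Hypothetical internal link graph: page -> pages it links to.
    site_links = {
        "/": ["/shoes", "/blog"],
        "/shoes": ["/shoes/trainers"],
        "/shoes/trainers": ["/shoes/trainers/red-size-9"],
        "/blog": [],
    }
    for page, depth in sorted(click_depths(site_links, "/").items(), key=lambda item: item[1]):
        print(depth, page)
```

Pages that sit many clicks down, or that never appear in the result at all because nothing links to them, are the ones a crawler is least likely to reach.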
If you’ve managed to implement some or all of the above recommendations and tested them using the tools mentioned, you should begin to see some changes in crawl stats shown within Google Search Console.
Here we’re looking for the number of pages crawled to be similar to, or just over, the number of actual pages on your site in the first blue graph. If you previously had lots of unnecessary pages being crawled, the reduction in kilobytes downloaded (in red) should mirror the reduction in pages crawled.
Below is an example of a site with a significant number of URL parameter issues in which Googlebot crawled up to 12,000 URLs when in fact there were just a few hundred actual pages of the site. Through the application of URL parameter rules and the other factors mentioned above, the number of pages crawled became much more consistent and realistic.
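If you have access to your server logs, you can cross-check this kind of gap yourself by comparing how many distinct URLs Googlebot requested with how many distinct paths they boil down to. The sketch below assumes a common access log layout with the request in quotes and the user agent somewhere on the same line; the sample lines are invented, and since a user agent string can be spoofed, treat the numbers as indicative only.

```python
import re
from urllib.parse import urlsplit

# Matches the quoted request part of a typical access log line.
REQUEST_RE = re.compile(r'"(?:GET|HEAD) (\S+) HTTP/[\d.]+"')


def googlebot_crawl_summary(log_lines):
    """Count distinct URLs and distinct paths requested by anything calling itself Googlebot."""
    requested = set()
    for line in log_lines:
        if "Googlebot" not in line:
            continue
        match = REQUEST_RE.search(line)
        if match:
            requested.add(match.group(1))
    distinct_paths = {urlsplit(url).path for url in requested}
    return {"urls_crawled": len(requested), "distinct_paths": len(distinct_paths)}


if __name__ == "__main__":
    agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
    # Invented sample lines for illustration only.
    sample_log = [
        f'66.249.66.1 - - [01/Jan/2020:10:00:00 +0000] "GET /shoes?sort=price HTTP/1.1" 200 5120 "-" "{agent}"',
        f'66.249.66.1 - - [01/Jan/2020:10:00:02 +0000] "GET /shoes?sort=name HTTP/1.1" 200 5120 "-" "{agent}"',
        f'66.249.66.1 - - [01/Jan/2020:10:00:04 +0000] "GET /contact HTTP/1.1" 200 2048 "-" "{agent}"',
    ]
    print(googlebot_crawl_summary(sample_log))
```

A large gap between the two counts suggests the crawler is spending its visits on parameter variations of the same pages.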
If Google is crawling your useful pages on each visit, the rankings of those pages are more likely to change frequently, and most likely for the better. Fresh content will get indexed and ranked a lot more quickly, and your ‘crawl budget’ won’t be wasted.