Get to Know SEO: How does Google find a website - and how can it go wrong?

Hey, you! Welcome to Get to Know SEO - the weekly newsletter for small marketing teams who want to embrace the power of search to get more from their online presence.

This week’s topic: How does Google find a website - and how can it go wrong?

You’ve just spent ages perfecting your shiny new web page. Now what? If you’re looking for SEO success, you need the page to appear in search results to make the effort worthwhile.

To get a page from unknown to appearing in the search results, there’s a process it has to go through. Once the page appears in results, you can focus on optimising it for higher rankings – but step 1 is to make sure it gets there!

In this week’s blog, we’re going to take a look at the key steps of this process and how you can fix some common problems which prevent a page from appearing in search.

What’s the process of appearing in Google search results?

There are three key stages that take your page from unknown to appearing in search results for relevant keywords. These are:

Discovery & Crawling

Before Google can decide which results to show your page for, it needs to know that it exists! That’s the ‘discovery’ stage. Once Google knows about a URL, the page may be crawled (although it’s never guaranteed that a page will be crawled or indexed – more on that later). Crawling is when a piece of software – known as a crawler – visits the page to see what content is on there. The crawlers browse the web by following links, which can lead to the discovery of new URLs.

Indexing

At the indexing stage, Google’s systems render the web page using a recent version of Chrome (just as a browser would), and take note of lots of information about the page, including:

  • what the title tag says
  • what language it’s written in
  • the alt text of images
  • what images and media are on the page
  • how usable the page is for visitors
  • how recently the page was updated
  • which keywords appear in the content

The page is then listed in the index under every word which appears in its content. Google uses this massive index when deciding which pages to display in search results.

During the indexing process, Google also groups together pages with the same or very similar content and decides which version will be the primary page shown to users. An example of this is when using filters on search pages creates different URLs with the same results, depending on the order in which the filters are applied – e.g.

www.website.com/dresses.asp?brand=yumi&color=teal

instead of

www.website.com/dresses.asp?color=teal&brand=yumi

– both URLs would show the same content. This example shouldn’t cause a problem with ranking, but it can affect site performance as Googlebot will be crawling more pages than necessary.

When duplicate content happens across different domains, it can cause more of an issue with ranking. This often happens in e-commerce, when the same product descriptions are used by multiple distributors stocking the same product. Google doesn’t want to show 10 pages with the same content in the search results, so unless you’ve worked to add extra value to this content, you might be competing with bigger sites which have more domain authority – and you’ll struggle to rank.

If someone has scraped your content and put it on their site, again Google can usually use signals to work out which site was the original creator, so this shouldn’t negatively affect your rankings. If this starts to cause issues, you can file a DMCA request with Google to try and get the pages removed from the index.

Serving results

When someone enters a search query, Google responds by presenting the pages it thinks are most relevant. The user’s search intent is deciphered by the algorithms, which also take into account hundreds of factors, including the user’s location, which device they’re using and the language the search was made in.

How can I help Google discover a web page?

In many cases, Google finds new URLs on its own via links within the website or from another site. But there’s no guarantee that the page the link comes from is getting crawled very frequently, meaning it could take some time for your URL to be discovered.

The best way to manage URL discovery is by creating a sitemap. This is a file created for search engines which lists every page of your website. In many cases this sitemap will update automatically when you add new pages, but it’s always worth double-checking this is the case with your CMS (the platform your site is built on, e.g. WordPress or Shopify).
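
For illustration, this is roughly what a sitemap file looks like – a minimal XML sitemap following the sitemaps.org format, with placeholder URLs (your CMS will normally generate something like this for you, typically at yourwebsite.com/sitemap.xml):

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://www.yourwebsite.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://www.yourwebsite.com/services/</loc>
  </url>
</urlset>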

You can submit your sitemap URL directly to Google via Google Search Console, and from there Google will automatically be told about new URLs when your sitemap updates. Creating and submitting a sitemap is an essential step for new website owners, as it allows you to start the process of getting your pages indexed straight away, rather than waiting for Google to eventually come across a link to your site.

Websites also have a file called robots.txt, which tells crawlers such as Googlebot which pages they can crawl on your site. For larger sites in particular, this file helps you manage the crawlers so that they don’t overload your website with requests. It’s also best practice to reference the location of your sitemap in robots.txt, which helps crawlers find it easily.
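
As a rough illustration (the Disallow path is a made-up example), a simple robots.txt might look like this, with the sitemap referenced at the end:

User-agent: *
Disallow: /admin/

Sitemap: https://www.yourwebsite.com/sitemap.xml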

If you’ve already submitted a sitemap to Google Search Console, but you’ve just published a new page and want to reduce the time it takes Google to find it, you can use the ‘inspect’ feature of Google Search Console to request indexing of URLs on an individual basis. You should also look for opportunities to link from established pages of your website to a new page as this helps crawlers to discover the new URL.

How do I know whether my pages have been indexed?

If you want a quick indication of whether any of your site has been indexed, go to Google and search for site:yourwebsite.com and see what results come up. If there’s nothing showing, the site hasn’t yet been indexed and you’ll have to do some digging to work out why (take a look at the next steps!).

If you want to get an overview of how your site as a whole has been indexed, then use the page indexing report on Google Search Console. It’s located on the left side under ‘indexing’ then ‘pages’. This report shows you how many pages are indexed and how many aren’t, with a graph showing how those numbers have changed over the last 3 months. There are different views available – you can take a look at all of the URLs that Google knows about, or select a sitemap to see more specific data.

If you want to check if an individual page has been indexed, ‘inspect’ the URL by using the inspect search bar at the top of the page. This will give you the latest data on a specific URL, including whether the page is currently indexed, or if not, the reason why.

What can go wrong in the process, and how can I fix it?

On the page indexing report, you’ve probably noticed a section titled ‘Why pages aren’t indexed’. Under this, you’ll find different reasons why pages on your site haven’t been indexed, and if you click through to these individual reports you’ll find a list of the URLs affected (or a selection of affected pages if the list is over 1000 pages).

Here are some of the common issues and some steps you can take to fix them:

Not found (404)

A 404 error means that Google wasn’t able to find the URL, usually because it’s been deleted. Googlebot may have either revisited a page it knew existed previously, or followed a link and found the page is no longer there. If you’re deleting a page, the best thing to do is put a redirect in place to send a visitor (and search engine) to the most relevant alternative page. Depending on how your website’s set up, you can either do this through your CMS (e.g. Wix, Squarespace) or through your .htaccess file. If it’s the latter and you haven’t edited a .htaccess file before, speak to an SEO or web developer first, as errors in the code could break your website.
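
If you do end up editing .htaccess (on an Apache server), a single permanent redirect is just one line – the paths below are placeholders for this example:

Redirect 301 /old-page/ https://www.yourwebsite.com/new-page/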

Soft 404

Similar to a 404 message, a soft 404 means that Google saw a user-friendly ‘not found’ message on the page, but the page didn’t properly return a 404 HTTP response code to the search engine. If the page should exist (and you can see it when you visit the page), but has missing or very little content, improving the content could fix the error. A soft 404 can also be caused by an issue with the code which is preventing Google from seeing the page correctly, in which case you’ll likely need the support of your web developer to try and fully diagnose the cause.
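
If you want to confirm what response code a page is actually returning, a quick spot check is possible with a few lines of Python using the requests library (the URL below is a placeholder):

import requests

# A page that genuinely doesn't exist should return a 404 status code.
# A soft 404 returns 200 (OK) even though the visible page says 'not found'.
response = requests.get("https://www.yourwebsite.com/missing-page/")
print(response.status_code)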

Page with redirect

This report shows pages which now redirect to another URL. You can click on an individual URL to check that a page is redirecting to the right place.

After you’ve put a redirect in place, Google often shows the old URL under this report for a while. As long as the redirects are working and pointing to where you want them to, this isn’t a cause for concern.

Redirect error

Unlike ‘page with redirect’, a redirect error is something that needs to be fixed. It can be easy to make errors with redirects, and over time redirects often build into redirect chains, which can cause problems.

Googlebot will only follow a limited number of redirects in a chain (up to around 10 hops), so if you’ve ended up with a long redirect chain, the original URL will eventually end up on this report. Errors within the chain, such as misspelling a URL, can cause the chain to break – as can URLs which become too long.

Redirect loops happen when pages redirect to each other, creating an endless circle. This means that there’s never a final page for the browser to land on, so the URL can’t be shown and the page hits an error.

Redirect errors cause issues for users as well as search engines. Long chains mean that a user has a longer wait for a page to be shown, while redirect loops will never show a page and display an error to users instead.

You can review redirects using tools such as Screaming Frog or httpstatus.
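
If you’d rather script a quick check yourself, Python’s requests library records every hop it follows, so you can see the full chain – a minimal sketch with a placeholder URL:

import requests

# Follow the redirects and print each hop in the chain.
response = requests.get("https://www.yourwebsite.com/old-page/", allow_redirects=True)
for hop in response.history:
    print(hop.status_code, hop.url)
print(response.status_code, response.url)  # the final destination
# Note: a genuine redirect loop will raise requests.exceptions.TooManyRedirects instead.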

Crawled – currently not indexed or Discovered – currently not indexed

Indexing is never guaranteed. These two reports show pages that Google knows about but hasn’t indexed – either it has crawled the page and decided not to index it for now, or it has discovered the URL but not crawled it yet. There are a number of possible reasons why Google has decided the page isn’t a good fit for search results, and it can take some digging if you have important pages included in this report.

Some of the reasons URLs appear in this report include duplicate content issues or poor quality content. There could also be URLs in this list which weren’t intended to be indexed, or wouldn’t offer a great user experience if they were, such as URLs for RSS feeds, paginated pages or URLs created from on-site search results.

If you’re seeing these reasons for a page not being indexed, check the lists for any important pages of your site which you need to be indexed. Start by looking at the content and the intended search intent, and see if improving the content helps to get the page indexed. Remember that content should be original, helpful and detailed enough to fully answer a search query.

It’s also worth inspecting the URL on Google Search Console and using the ‘test live page’ option to view a screenshot of how Google is rendering the page – this will help to confirm whether the content is displaying correctly to Google, or whether there’s an error in the coding preventing this from happening.

URL marked ‘noindex’

This means that a tag has been put on the page which asks Google (and other search engines) not to index a page. Unsurprisingly, this means the page is then eventually included in this report!
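
The tag itself sits in the <head> section of the page’s HTML and looks like this (the same instruction can also be sent as an X-Robots-Tag HTTP header):

<meta name="robots" content="noindex">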

If the page should be indexed, you’ll need to remove the tag and resubmit the URL to Google (either through the Inspect tool for individual URLs, or via the ‘validate fix’ button on the report page for a batch of URLs).

A key time when this can go wrong is when a site is launched or updated from a development environment, for example during a website migration or major redesign. Part of the process when this happens should be to run your own crawl and check that noindex tags haven’t been pushed across to the live site. A tool like Screaming Frog can help to do this in bulk. This means that the problem can be picked up quickly and remedied before the pages start dropping out of the index.
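
If you don’t have a crawler to hand, a rough spot check of a handful of key URLs can be scripted in Python – this sketch simply looks for a robots meta tag containing ‘noindex’ in each page’s HTML (the URLs are placeholders, and a full crawl is still the more thorough option):

import re
import requests

# Placeholder list of key URLs to spot-check after a launch or migration.
urls = [
    "https://www.yourwebsite.com/",
    "https://www.yourwebsite.com/services/",
]

for url in urls:
    html = requests.get(url).text
    # Flag any robots meta tag that contains 'noindex'
    if re.search(r'<meta[^>]+name=["\']robots["\'][^>]*noindex', html, re.IGNORECASE):
        print("noindex found:", url)
    else:
        print("looks fine:", url)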

URL blocked by robots.txt

Sometimes people try to keep a page out of search results by blocking it in the robots.txt file. This doesn’t work if Google knows (or finds out) that the page exists via a link from elsewhere, and can guess from that link what the page might contain. If this happens, Google can still include a version of the page in the search results, often without displaying a description or media.

If there’s a page in this report that shouldn’t be blocked, you’ll need to review your robots.txt file to allow Googlebot to crawl it properly. If you want to prevent a page from being in the index, you can do this by using a noindex tag on the page, or by password protecting the page. To urgently remove a page from Google’s index, go to Google Search Console and click ‘removals’ under the ‘indexing’ header. You can then request a temporary removal to quickly get a page out of the index. This removal lasts for around 6 months, giving you plenty of time to implement a noindex tag.
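
As an illustration, blocking a section of a site in robots.txt looks like this (the path is a placeholder):

User-agent: *
Disallow: /private-page/

One gotcha to remember: a noindex tag only works if Googlebot is allowed to crawl the page and see the tag, so don’t block the same URL in robots.txt at the same time.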

Server error (5xx)

This issue means that when Googlebot tried to access the URL, it wasn’t able to – possibly because the site was too busy or the request timed out. This isn’t always a long-term issue, and you may struggle to replicate the same problem when you review the report.

To make sure there aren’t wider issues around your server, use the crawl stats report. This is located on Google Search Console under settings > crawling > crawl stats. This report will provide information such as the average response time and how many crawl requests have been made in the last 3 months. Under the ‘hosts’ heading, you’ll see which versions of your domain have been crawled in the last 3 months and whether there have been any issues with the hosting.

If there are regular issues, you’ll need to look into whether your web hosting is suitable for your needs. You might need to upgrade to a better package, or consider moving web hosts if you’re not getting a reliable and fast service. UptimeRobot is a quick way to check whether requests timing out could be affecting users too – it monitors your site and sends you an alert if it goes down.

What does success look like?

100% of pages indexed might seem like a good number to aim for, but it’s not necessary (or even realistic!) for most websites to get every single page indexed. While I’ve seen this happen for new and small sites or sitemaps, over time there are often URLs which won’t be indexed for one or more of the reasons given above. Less than 100% indexation is completely normal and isn’t in itself something that needs to be fixed.

Getting the important pages indexed is the aim, so you should be keeping an eye on the reports for any key pages where this isn’t happening. By default, Google Search Console will send you an email alert when it finds any new issues, which can help you keep track of this.

Success for most sites means that the page indexation graph will show an increasing number of ‘green’ indexed pages over time, as new content is added to the website. However, as with everything in SEO, your actual measure of success depends on what your goals are for your website and business.

Once a page has been indexed, you’ll be able to start collecting data on how well it ranks for various keywords, the number of times it’s been seen and how much traffic is coming through from the search results. This data will then go on to inform your next steps to improve your ranking and get more traffic and customers to your website.

Looking for some help to get more traffic from search results? Get in touch via [email protected] or 0115 990 2779 to have a chat about how we can work together.

Want more tips like this? Sign up to the Get to Know SEO newsletter for SEO tips, updates and ideas every Friday.


Content originally published on HeyYou! Digital
