We check a number of different data points (currently 14 of them) to identify any issues that might affect a URL's ability to be indexed by the search engines. Here are some of them, with what they mean and how they can affect the URLs they appear on.
The robots.txt file is often the first port of call for most web crawlers, and tells them which parts of a website the site owner gives them permission to access. Most legitimate web crawlers and robots will abide by these rules, but more nefarious 'bots' will just flat out ignore them. Search engine spiders tend to follow them to the letter, so we need to make sure that the URLs you want to get indexed are not blocked in this robots.txt file.
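A check like this can be sketched with Python's standard-library robots.txt parser. The rules and URLs below are hypothetical examples, not real site data:

```python
# Sketch: test whether a URL is blocked by a site's robots.txt rules,
# using the standard-library parser. Rules and URLs are hypothetical.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# can_fetch(user_agent, url) returns True if the rules allow crawling
print(parser.can_fetch("*", "https://example.com/blog/post"))     # True
print(parser.can_fetch("*", "https://example.com/private/page"))  # False
```

In a real check the rules would be fetched from the live site (e.g. with `parser.set_url(...)` and `parser.read()`) rather than supplied as a string.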
The robots meta tag is an HTML element that gives specific instructions to web crawlers about what they are allowed to do with the content on the page. Directives such as 'noindex' tell spiders that the page should NOT be indexed. So if you want a page to appear in Google's search results, you ideally do not want the 'noindex' directive anywhere in that page's HTML code.
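Detecting that directive can be done with the standard-library HTML parser. This is a minimal sketch; the HTML sample is hypothetical:

```python
# Sketch: scan a page's HTML for a robots meta tag containing 'noindex'.
# The HTML sample is a hypothetical example.
from html.parser import HTMLParser

class RobotsMetaFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if (tag == "meta"
                and (attrs.get("name") or "").lower() == "robots"
                and "noindex" in (attrs.get("content") or "").lower()):
            self.noindex = True

html = '<head><meta name="robots" content="noindex, nofollow"></head>'
finder = RobotsMetaFinder()
finder.feed(html)
print(finder.noindex)  # True -> the page asks not to be indexed
```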
Every web request and response contains HTTP headers at the start of the transferred content. These headers aren't typically shown by a web browser, but contain technical information about what content type is being served, when it was requested, what encoding it uses, and so on. One of these possible headers is 'X-Robots-Tag'. This can be used in place of a robots meta tag to tell spiders whether they can index the page in question. As it is 'hidden' away from most users, it can be hard for the average person to spot.
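Because the header never appears in the page source, the check has to look at the response itself. A minimal sketch, using a hypothetical dict standing in for the headers a real HTTP response would carry:

```python
# Sketch: inspect response headers for an X-Robots-Tag directive that
# blocks indexing. In a real check the headers would come from an HTTP
# response (e.g. via urllib.request); these dicts are hypothetical.
def x_robots_blocks_indexing(headers):
    value = headers.get("X-Robots-Tag", "")
    # 'noindex' (and the shorthand 'none') both block indexing
    return "noindex" in value.lower() or "none" in value.lower()

print(x_robots_blocks_indexing({"Content-Type": "text/html"}))          # False
print(x_robots_blocks_indexing({"X-Robots-Tag": "noindex, nofollow"}))  # True
```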
The 'canonical' link element appears in a page's HTML and is meant to help prevent issues with duplicate content. It specifies the URL that search engines should treat as the 'original' document. If a page declares a canonical URL that differs from the URL it actually occupies, search engines can get confused about what to display in their results, and will most of the time display the URL specified as canonical.
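Spotting such a mismatch amounts to extracting the canonical link and comparing it with the URL that was fetched. A rough sketch (the regex assumes `rel` appears before `href`, and the HTML and URLs are hypothetical):

```python
# Sketch: extract rel="canonical" and compare it with the URL actually
# fetched. Assumes rel comes before href in the tag; sample data is
# hypothetical.
import re

def canonical_url(html):
    match = re.search(
        r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)["\']',
        html, re.IGNORECASE)
    return match.group(1) if match else None

html = '<head><link rel="canonical" href="https://example.com/article"></head>'
fetched_url = "https://example.com/article?utm_source=feed"

canon = canonical_url(html)
print(canon)                # https://example.com/article
print(canon == fetched_url) # False -> the canonical URL may be indexed instead
```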
HTTP status codes are issued by a server in the response to a client request, and give information on how successful the request was. Ideally this status code should be 200, which means the request was received and accepted - basically, the server gave you the page you asked for with no issues. Other status codes may mean that the page you have requested is elsewhere (a redirection), the page does not exist (a 404 error), or the request has failed somewhere along the line (a 500 error).
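The categories above follow the standard numeric ranges, which a check can bucket like this (a simple sketch, not the tool's actual implementation):

```python
# Sketch: classify HTTP status codes into the buckets described above,
# following the standard ranges (2xx success, 3xx redirect, 4xx client
# error, 5xx server error).
def describe_status(code):
    if 200 <= code < 300:
        return "ok"
    if 300 <= code < 400:
        return "redirect"
    if 400 <= code < 500:
        return "client error (e.g. 404 not found)"
    if 500 <= code < 600:
        return "server error"
    return "unknown"

for code in (200, 301, 404, 500):
    print(code, "->", describe_status(code))
```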
For years search engines have told us that we should be using SSL on all our websites, even touting a ranking boost to get us on board. That boost never really materialised, but using HTTPS is pretty much the standard these days. However, it is not uncommon for a site's webmaster to let an SSL certificate lapse, which breaks HTTPS connectivity. Googlebot will see this as a security issue, and is likely to not want to feature such URLs until the problem is resolved.
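One part of such a check is comparing the certificate's expiry date against the current time. Python's `ssl` module exposes expiry dates in the `notAfter` format returned by `ssl.getpeercert()`; the date and timestamps below are hypothetical examples, and a real check would fetch the certificate over the network:

```python
# Sketch: decide whether a certificate's 'notAfter' date has passed.
# The date string format matches what ssl.getpeercert() returns; the
# example values are hypothetical.
import ssl
from datetime import datetime, timezone

def cert_expired(not_after, now=None):
    expires = ssl.cert_time_to_seconds(not_after)  # seconds since epoch
    if now is None:
        now = datetime.now(timezone.utc).timestamp()
    return now > expires

# A certificate that expired in 2015, checked from a later date:
later = datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp()
print(cert_expired("Feb 16 10:09:50 2015 GMT", now=later))  # True
```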
With the internet becoming increasingly bloated with billions upon billions of pages, search engines are becoming more and more selective about which URLs they will bother to index. So, given the choice between a forum profile with no unique relevant information on it except for a link to another site, and a blog page with an article containing at least 200 words of unique text, which do you think Google will choose to include? Give the spiders something meaty to index, and they'll most likely index it. Give them a bare profile, and they're likely to discard it as useless and irrelevant.
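A very crude version of that judgement is just a word count against a threshold. This sketch uses the 200-word figure mentioned above as a hypothetical cutoff; real search engines weigh far more signals than this:

```python
# Sketch: a crude check of whether a page offers substantive text,
# using a hypothetical 200-word threshold. Sample texts are made up.
import re

def has_substantive_text(text, minimum_words=200):
    words = re.findall(r"[A-Za-z0-9']+", text)
    return len(words) >= minimum_words

article = "unique relevant content " * 80   # 240 words
profile = "Joined 2024. Website: example.com"

print(has_substantive_text(article))  # True
print(has_substantive_text(profile))  # False
```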
These points are some of the major issues that can affect the indexing of a URL in the search engines, but a number of other subtleties can have some sway as well. We do our best to ensure that our tool covers every eventuality and checks for every issue that can arise. If you need more features or think you've found an error, please don't hesitate to contact us.