Detecting search engine bots

Intro

Search engine bots are also known as crawlers, spiders, or robots. They are programs designed to run on a large number of computers with the express purpose of crawling all of the pages on the Internet. They read almost everything on the pages they visit and also follow any links found on those pages.

All of the gathered data is processed, and the important parts are indexed so that search engines like Google can return results for queries related to those pages. Without search engine bots, a search on Google would return no results.

Why do I need to detect search engine bots?

If you own or operate a website, it will sooner or later be visited by these crawlers. From the IP address of a visitor, you can tell whether it is a human or a bot. If you have a lot of pages or content, you might be inclined to monetize your visitors by showing only part of a page unless they sign up or pay. This is known as a paywall and is a common tactic used by large media companies to get you to sign up as a subscriber.

Before you can monetize, though, you will need to generate traffic to your pages. That’s where the search engines come in. If your website shows only part of a page to the search engines, your pages may not appear in search results because their contents are only partially indexed. If you can detect that the visitor to your page is actually a search engine spider, you can serve the page in full so that every bit of it is properly indexed.

How do I detect search engine crawlers?

Fortunately, IP2Location makes it very easy to find out if an IP is a search engine bot. The usage type field in IP2Location databases includes a value called “SES” which stands for Search Engine Spider. Whenever an IP has this “SES” value in the usage type field, that’s a bot.
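
In code, the check itself is just a string comparison once the usage type has been retrieved by either of the methods described below. A minimal sketch in Python:

```python
def is_search_engine_bot(usage_type: str) -> bool:
    # "SES" is the Search Engine Spider value in the IP2Location
    # usage type field; any other value means the visitor is not
    # a search engine bot.
    return usage_type == "SES"
```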

Web developers have two options if they wish to include this detection capability in their website code. The first option, which is maintenance-free, is to call the IP2Location Web Service and pass in the IP address to check. The response from the web service can then be checked for the usage type.
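
As a sketch of this first option, the snippet below queries a JSON lookup endpoint with the requests library. The endpoint URL and parameter names follow the IP2Location.io web service, and the usage_type field is only returned on plans that include it; treat these details as assumptions and check the current Web Service documentation for your subscription.

```python
import requests

API_KEY = "YOUR_API_KEY"  # issued with your web service subscription

def lookup_usage_type(ip: str) -> str:
    # Assumed endpoint and parameters; verify against the official
    # IP2Location web service documentation for your plan.
    resp = requests.get(
        "https://api.ip2location.io/",
        params={"key": API_KEY, "ip": ip},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json().get("usage_type", "")

if lookup_usage_type("203.0.113.10") == "SES":
    print("Visitor is a search engine spider")
```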

The alternative applies if you have your own database server, such as SQL Server or MySQL, running. You can subscribe to the IP2Location DB24 database, which comes in the form of a CSV file that you import into your database server. Your webpage code then performs an SQL query to determine the usage type. This option requires updating the data monthly.
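
Here is a sketch of the database option, assuming the CSV was imported into MySQL with the table and column names commonly used for this database (ip2location_db24 with ip_from, ip_to, and usage_type); adjust the names to match your own schema. The IPv4 address is converted to its integer form before the range lookup:

```python
import ipaddress

import mysql.connector  # pip install mysql-connector-python

conn = mysql.connector.connect(
    host="localhost", user="dbuser", password="dbpass", database="ip2location"
)

def usage_type_for(ip: str) -> str:
    # IP2Location stores IPv4 ranges as integers, so convert the
    # dotted-quad address to a number before the BETWEEN lookup.
    ip_num = int(ipaddress.ip_address(ip))
    cur = conn.cursor()
    cur.execute(
        "SELECT usage_type FROM ip2location_db24 "
        "WHERE %s BETWEEN ip_from AND ip_to LIMIT 1",
        (ip_num,),
    )
    row = cur.fetchone()
    cur.close()
    return row[0] if row else ""

if usage_type_for("203.0.113.10") == "SES":
    print("Visitor is a search engine spider")
```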

What if I just want to whitelist or blacklist the search engine bots?

Fear not: IP2Location has a quick and easy tool to generate the list of IPs to either blacklist or whitelist in your firewall. The Firewall List by Search Engine is a simple page where you select the search engine and then choose the output format to match the firewall or program you use.
