
Intro
Search engine bots, also known as crawlers, spiders, or web robots, are automated software programs used by search engines to discover and analyze content across the Internet. These bots continuously visit websites, scan web pages, and follow links from one page to another to collect information about available content.
During the crawling process, the bots examine various elements on a page, including text, images, metadata, keywords, headings, and links. The collected information is then processed and organized into a massive search index. When users perform a search on search engines such as Google, the search engine retrieves relevant results from this index instead of scanning the Internet in real time.
Without search engine bots and indexing systems, search engines would not be able to discover websites or provide accurate search results to users.
Why do I need to detect search engine bots?
If you own or operate a website, these crawlers will visit it sooner or later. The IP address of a visitor can tell you whether it is a human or a bot. If you have a lot of pages or content, you might be inclined to monetize your visitors by showing them only part of a page unless they sign up or pay. This is known as a paywall, a common tactic that large media companies use to turn readers into subscribers.
Before you can monetize, though, you need to generate traffic to your pages. That’s where the search engines come in. If your website shows only part of a page to the search engines, your pages may not appear in search results because their content was indexed incompletely. If you can detect that the visitor to your page is actually a search engine spider, you can display the page in full so that all of its content is properly indexed.
How do I detect search engine crawlers?
Fortunately, IP2Location makes it very easy to find out if an IP belongs to a search engine bot. The usage type field in IP2Location databases includes a value called “SES”, which stands for Search Engine Spider. Whenever an IP has this “SES” value in the usage type field, it belongs to a search engine bot.
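As a minimal sketch of that check, the Python snippet below tests a usage type string for the “SES” code and picks which version of a page to serve. The function names and the "full"/"teaser" labels are hypothetical, and splitting on "/" assumes a value may list more than one code; neither comes from IP2Location itself.

```python
def is_search_engine_spider(usage_type):
    """Return True if the usage type value contains the SES code."""
    if not usage_type:
        return False
    # Assumption: a value may list several codes separated by "/".
    codes = [code.strip().upper() for code in usage_type.split("/")]
    return "SES" in codes


def choose_page_view(usage_type):
    """Serve the full page to crawlers and the teaser to everyone else."""
    # "full" and "teaser" are illustrative labels for a paywalled page.
    return "full" if is_search_engine_spider(usage_type) else "teaser"


print(choose_page_view("SES"))  # full
print(choose_page_view("ISP"))  # teaser
```

How the usage type is obtained for a given visitor IP depends on which of the lookup options you choose.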
Web developers have two options for adding this detection capability to their website code. The first option, which is maintenance-free, is to call the IP2Location Web Service and pass in the IP address to check. The usage type can then be read from the web service result.
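The web service flow might look roughly like the Python sketch below. The endpoint URL, the `ip` and `key` query parameter names, and the `usage_type` response key are all placeholders chosen for illustration; take the real values from the web service documentation. The HTTP call is passed in as a function so the logic can be tried without network access.

```python
import json
import urllib.parse
import urllib.request


def fetch_json(url):
    """Fetch a URL and decode its JSON body."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return json.loads(response.read().decode("utf-8"))


def is_bot_via_web_service(ip, endpoint, api_key, fetch=fetch_json):
    """Look up an IP through a web service and check its usage type for SES.

    The parameter names and the "usage_type" key are assumptions,
    not the documented API.
    """
    query = urllib.parse.urlencode({"ip": ip, "key": api_key})
    result = fetch(f"{endpoint}?{query}")
    usage_type = result.get("usage_type") or ""
    # Assumption: a value may list several codes separated by "/".
    return "SES" in [code.strip() for code in usage_type.upper().split("/")]


# Offline demo: a stub stands in for the real HTTP call.
def stub_fetch(url):
    return {"usage_type": "SES"}


print(is_bot_via_web_service("192.0.2.10", "https://example.invalid/lookup",
                             "YOUR_KEY", fetch=stub_fetch))  # True
```

In production you would drop the `fetch` argument so the default HTTP call is used, and you would probably cache results per IP to avoid one request for every page view.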
The alternative applies if you run your own database server, such as SQL Server or MySQL. You can subscribe to the IP2Location DB24 database, which comes as a CSV file that you import into your database server. Your webpage code then performs an SQL query to determine the usage type. This option requires you to update the data every month.
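To illustrate the database option, here is a sketch in Python that uses an in-memory SQLite table in place of SQL Server or MySQL so that it runs on its own. It assumes each row stores an IP range as two integers (`ip_from`, `ip_to`) next to a `usage_type` column. The table name, column names, and sample rows are made up for the demo; the actual DB24 layout should be taken from its documentation.

```python
import ipaddress
import sqlite3


def ip_to_number(ip):
    """Convert a dotted IPv4 address to its integer form."""
    return int(ipaddress.IPv4Address(ip))


def usage_type_for_ip(conn, ip):
    """Return the usage type of the range containing ip, or None."""
    ip_number = ip_to_number(ip)
    # Find the first range that ends at or after the address ...
    row = conn.execute(
        "SELECT ip_from, usage_type FROM ip_ranges "
        "WHERE ip_to >= ? ORDER BY ip_to LIMIT 1",
        (ip_number,),
    ).fetchone()
    # ... and make sure the address is not in a gap before that range.
    if row is None or row[0] > ip_number:
        return None
    return row[1]


# Demo data: documentation-only address blocks with made-up usage types.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ip_ranges ("
    "ip_from INTEGER, ip_to INTEGER PRIMARY KEY, usage_type TEXT)"
)
conn.executemany(
    "INSERT INTO ip_ranges VALUES (?, ?, ?)",
    [
        (ip_to_number("192.0.2.0"), ip_to_number("192.0.2.255"), "SES"),
        (ip_to_number("198.51.100.0"), ip_to_number("198.51.100.255"), "ISP"),
    ],
)

print(usage_type_for_ip(conn, "192.0.2.10"))    # SES
print(usage_type_for_ip(conn, "198.51.100.7"))  # ISP
```

Keying the table on `ip_to` lets the database answer each lookup with a single index seek, which matters when the query runs on every page request.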
What if I just want to whitelist or blacklist the search engine bots?
Fear not, IP2Location has a quick and easy tool to generate the list of IPs to either blacklist or whitelist in your firewall. The Firewall List by Search Engine is a simple page where you select the search engine and then choose the output format that matches your firewall or program.
