Web crawler

Steadily improving Web crawler technology brings website operators more and more benefit, but it also raises Web security issues. The reasons are explained below; to begin with, we give a brief introduction to the term.

A brief introduction to Web crawlers

A Web crawler, also called a Web spider or Web robot, is widely used on the Internet. It is a program or script that automatically fetches information from websites. With it, a search engine can access, organize and manage the files, images, audio and video published on a website, and then serve the fetched information in response to users' queries.

Web security issues brought by Web crawlers

Because a crawler's main mode of operation is to fetch high-value information, bandwidth consumption and the processing load on the Web server rise while it runs. Webmasters of small websites often notice a marked increase in network traffic while a crawler is fetching their content. Malicious users can exploit this behavior to mount denial-of-service (DoS) attacks. Worse still, sensitive information collected by a crawler can cause the webmaster unexpected losses.

Solutions for eliminating the threats

Given the potential menaces posed by Web crawlers, many website managers want to restrict crawlers' access to their information. In practice, it is advisable to handle such programs or scripts according to the site's actual needs; for websites that store confidential or sensitive data, managers can restrict access strictly. The following are concrete measures for hardening Web security.

1. Set up a robots.txt file

Setting up a robots.txt file is the simplest way to block Web crawlers. Robots.txt is the first file a search engine checks, and it tells the crawler which paths on the server may be visited. For instance, "Disallow: /" means that no path may be crawled. Unfortunately, not all crawlers follow the convention, so a robots.txt file alone is far from enough to stop them. (A minimal example is sketched after this list.)

2. Identify and restrict crawlers by User-Agent

To restrict crawlers that do not obey robots.txt, website managers must first separate the traffic generated by crawlers from the traffic generated by ordinary users. For common crawlers, the User-Agent field of an HTTP request reveals the client's operating system, CPU, browser version, rendering engine and language, and a crawler's User-Agent differs from a browser's. Website managers can therefore filter out unwanted crawlers by checking the User-Agent field. (A filtering sketch follows the list.)

3. Identify and restrict crawlers according to behavioral traits

To deal with crawlers that disguise themselves as browsers in the User-Agent field, website managers can identify them by their behavior: crawlers visit a website at regular, sustained intervals, whereas real users browse casually and irregularly. (A simple frequency check is sketched below.)
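
A minimal robots.txt sketch is shown below. The /admin/ and /private/ paths and the crawler name ExampleBot are placeholders rather than values from the text above; the file must be served from the site root (e.g. https://example.com/robots.txt) and only takes effect for crawlers that choose to honor it.

    # Allow well-behaved crawlers everywhere except sensitive paths
    User-agent: *
    Disallow: /admin/
    Disallow: /private/

    # Block one specific (hypothetical) crawler entirely
    User-agent: ExampleBot
    Disallow: /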
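
The following Python sketch illustrates User-Agent filtering at the application layer, assuming a Flask application; the blocklist keywords and the route are made-up examples, and production sites more often apply such rules in the web server or a WAF.

    from flask import Flask, request, abort

    app = Flask(__name__)

    # Hypothetical blocklist of crawler signatures; extend to suit the site.
    BLOCKED_AGENT_KEYWORDS = ("python-requests", "scrapy", "curl", "examplebot")

    @app.before_request
    def reject_unwanted_crawlers():
        user_agent = request.headers.get("User-Agent", "").lower()
        # An empty User-Agent is also suspicious for a normal browser.
        if not user_agent or any(k in user_agent for k in BLOCKED_AGENT_KEYWORDS):
            abort(403)  # Forbidden

    @app.route("/")
    def index():
        return "Hello, human visitor!"

    if __name__ == "__main__":
        app.run()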
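
Finally, a rough illustration of behavior-based detection: the sketch below counts requests per client IP over a sliding time window and flags clients whose request rate looks machine-like. The window length and threshold are illustrative values and would need tuning for a real site.

    import time
    from collections import defaultdict, deque

    WINDOW_SECONDS = 60   # illustrative sliding window
    MAX_REQUESTS = 120    # illustrative per-IP threshold within the window

    # client IP -> timestamps of that client's recent requests
    _request_log = defaultdict(deque)

    def looks_like_crawler(client_ip, now=None):
        """Record one request and return True if the client exceeds the threshold."""
        now = time.time() if now is None else now
        timestamps = _request_log[client_ip]
        timestamps.append(now)
        # Drop timestamps that have fallen outside the window.
        while timestamps and now - timestamps[0] > WINDOW_SECONDS:
            timestamps.popleft()
        return len(timestamps) > MAX_REQUESTS

    # Example: call on every request and throttle or challenge flagged clients.
    if looks_like_crawler("203.0.113.7"):
        pass  # e.g. return HTTP 429 or present a CAPTCHA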