If a spider visits the site frequently, it likes the site and keeps checking whether the content has been updated, which is a good sign. If it returns only once every ten days or half a month, the content is probably updated too rarely; update more often and add external links to guide spiders to the site. Stay duration reflects how much a spider likes the site. One thing to note: if the stay is long but the crawl volume is low, something is wrong. The spider may be having difficulty crawling the content, or the content may be of low quality. These three indicators (visit count, stay duration, crawl volume) should be read together to get more valuable information.
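As a rough sketch of how the three indicators could be computed together for one spider, the example below groups request times into visits. The 30-minute gap that separates two visits, the function name `summarize`, and the sample timestamps are all assumptions for illustration, not something a log format defines.

```python
from datetime import datetime, timedelta

# Assumption: requests more than 30 minutes apart belong to separate visits.
SESSION_GAP = timedelta(minutes=30)

def summarize(timestamps):
    """Return (visits, total_stay, pages) for one spider's request times."""
    visits, total_stay = 0, timedelta()
    start = previous = None
    for ts in sorted(timestamps):
        if previous is None or ts - previous > SESSION_GAP:
            if start is not None:
                total_stay += previous - start  # close the previous visit
            visits += 1
            start = ts
        previous = ts
    if start is not None:
        total_stay += previous - start  # close the last visit
    return visits, total_stay, len(timestamps)

# Hypothetical request times for a single spider.
hits = [
    datetime(2023, 10, 10, 9, 0, 0),
    datetime(2023, 10, 10, 9, 0, 20),
    datetime(2023, 10, 10, 9, 1, 0),
    datetime(2023, 10, 10, 15, 0, 0),
    datetime(2023, 10, 10, 15, 0, 30),
]
visits, stay, pages = summarize(hits)
print(visits, stay.total_seconds(), pages)
```

A long total stay combined with a small page count is the warning pattern described above.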
With so much information available, which points should we focus on?
What is the
Track which pages spiders crawl. After observing for a while, you will find that spiders crawl certain pages often. Analyze why spiders favor these pages, how they differ from other pages, and whether other pages can learn anything from them. Analyzing the crawled pages can also reveal site problems such as duplicate pages and URL normalization issues.
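The page statistics described here can be sketched in a few lines. The spider name list, the sample records, and the rule that treats URLs differing only in their query string as possible duplicates are assumptions chosen for illustration.

```python
from collections import Counter, defaultdict

# Assumed list of user-agent tokens to treat as spiders.
SPIDER_TOKENS = ("Baiduspider", "Googlebot", "bingbot")

# Hypothetical (user_agent, path) pairs already extracted from a log.
records = [
    ("Baiduspider/2.0", "/article/1.html"),
    ("Baiduspider/2.0", "/article/1.html"),
    ("Googlebot/2.1", "/article/1.html?from=feed"),
    ("Googlebot/2.1", "/article/2.html"),
    ("Mozilla/5.0 (Windows NT 10.0)", "/article/2.html"),
]

def is_spider(agent):
    """True if the user agent contains one of the assumed spider tokens."""
    agent = agent.lower()
    return any(token.lower() in agent for token in SPIDER_TOKENS)

# How often each URL was crawled by any spider.
page_hits = Counter(path for agent, path in records if is_spider(agent))

# URLs that differ only by query string hint at a normalization problem.
variants = defaultdict(set)
for path in page_hits:
    variants[path.split("?", 1)[0]].add(path)
duplicates = {base: urls for base, urls in variants.items() if len(urls) > 1}

print(page_hits.most_common(3))
print(duplicates)
```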
Track which site directories spiders crawl. It is normal for spiders to crawl only some directory levels, so concentrate on whether the directories you are promoting are being crawled. If they are not, adjust the site's internal links or add external links to raise the weight of that section and guide spiders to crawl it. Spiders may also crawl directories that serve no purpose, such as ones holding information you do not want search engines to see; you can block those directories.
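One possible way to run this directory check is sketched below. The sample paths, the `priority` and `unwanted` directory lists, and the helper `top_directory` are hypothetical. The generated `Disallow` lines follow standard robots.txt syntax.

```python
from collections import Counter

# Hypothetical paths crawled by spiders, taken from a log.
crawled_paths = [
    "/news/2023/a.html",
    "/news/2023/b.html",
    "/products/x.html",
    "/tmp/cache/1.html",
    "/",
]

def top_directory(path):
    """Return the first-level directory of a URL path, or "/" for root files."""
    parts = path.split("?", 1)[0].strip("/").split("/")
    return "/" + parts[0] + "/" if len(parts) > 1 else "/"

dir_hits = Counter(top_directory(p) for p in crawled_paths)

# Assumed directories we want crawled, and ones we want hidden.
priority = ["/products/", "/solutions/"]
unwanted = ["/tmp/"]

# Promoted directories with no spider hits need more internal or external links.
neglected = [d for d in priority if dir_hits[d] == 0]

# robots.txt rules that block the unwanted directories for all spiders.
robots_rules = "User-agent: *\n" + "".join(f"Disallow: {d}\n" for d in unwanted)

print(dir_hits)
print(neglected)
print(robots_rules)
```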
4. Visit count, stay duration, and crawl volume
2. Which directories are crawled
II. HTTP status codes
Web log: Baidu Baike defines it as follows: "A web log records the raw information about the requests a web server receives and processes, along with runtime errors, in files ending in .log; strictly speaking, it should be called a server log." As this definition shows, a web log records fairly complete site information, including visitor information (IP address, browser, operating system, access time, and so on), spider crawling of the site (which directories were crawled, which spiders visited, and so on), and runtime error information (mainly HTTP status codes).
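To make these fields concrete, here is a minimal sketch that parses one log line. It assumes the server writes the common "combined" log format used by default Apache and Nginx configurations; a site with a custom format would need a different pattern. The sample line and the name `parse_line` are made up for illustration.

```python
import re

# Assumption: the log uses the "combined" format (IP, time, request,
# status, size, referer, user agent).
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None

# Hypothetical log line showing a spider request.
sample = (
    '203.0.113.7 - - [10/Oct/2023:13:55:36 +0800] '
    '"GET /news/index.html HTTP/1.1" 200 5120 '
    '"-" "Mozilla/5.0 (compatible; Baiduspider/2.0; '
    '+http://www.baidu.com/search/spider.html)"'
)
record = parse_line(sample)
print(record["ip"], record["path"], record["status"], record["agent"])
```

The `agent` field identifies the spider, `path` gives the directory and page, and `status` carries the HTTP status code.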
Focus mainly on 404, 500, 302, and similar codes. 404 needs little explanation; it is best to check for it regularly.
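A regular status-code check could look like the sketch below. The sample requests and the `WATCHED` set are assumptions; in practice the pairs would come from parsed log lines.

```python
from collections import Counter

# Hypothetical (path, status) pairs already extracted from a log.
requests = [
    ("/index.html", 200),
    ("/old-page.html", 404),
    ("/old-page.html", 404),
    ("/promo", 302),
    ("/search", 500),
]

# Status codes worth reviewing, as named in the text.
WATCHED = {302, 404, 500}

# Overall distribution of status codes.
status_counts = Counter(status for _, status in requests)

# Which URLs produce the watched codes, and how often.
problem_urls = Counter(
    (status, path) for path, status in requests if status in WATCHED
)

for (status, path), hits in problem_urls.most_common():
    print(status, path, hits)
```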
I. Spider crawling
Check whether the mainstream spiders have visited the site. If they have not, the site may be blocking them; check the site's robots.txt file.
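Python's standard `urllib.robotparser` module can test whether a robots.txt file blocks a given spider. The robots.txt content below is a made-up example of an accidental block, not taken from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks one spider from the whole site.
robots_txt = """\
User-agent: Baiduspider
Disallow: /

User-agent: *
Disallow: /admin/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# Ask whether each spider may fetch an ordinary content page.
for bot in ("Baiduspider", "Googlebot"):
    allowed = parser.can_fetch(bot, "/news/index.html")
    print(bot, "allowed" if allowed else "blocked")
```

To check a live site instead, `set_url()` followed by `read()` fetches the file over HTTP before `can_fetch()` is called.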
3. Which pages are crawled
1. Which spiders visit