Escaped Fragment URI Analysis
Due to a bit of miscommunication between developers, and some code being released late to production, we regrettably ended up with three diverging URL formats: one in our content body, one in our canonical URLs, and one in our sitemap. The spiders crawled all three for a period of two weeks, which gave us the opportunity to go back and see what happened.
Sample size: 240,914 URLs crawled over 14 days (2013/03/30 - 2013/04/12). Here are the dates and the number of URLs crawled on each:
2013/03/30 - 16,375
2013/03/31 - 18,452
2013/04/01 - 17,865
2013/04/02 - 19,431
2013/04/03 - 15,443
2013/04/04 - 4,832
2013/04/05 - 21,053
2013/04/06 - 24,316
2013/04/07 - 27,477
2013/04/08 - 23,862
2013/04/09 - 22,050
2013/04/10 - 15,488
2013/04/11 - 6,396
2013/04/12 - 7,874
We will break down the number of URLs spidered, and also discuss the encoding of the _escaped_fragment_ parameter, which was outlined by Google but also appears to have been adopted by Bing, Yandex, Yahoo, Facebook and others.
Canonical URL Results (82%)
Example: http://www.domain.com/something#!v=1
URLs: 5,858
Yandex (199.21.99.97) & Facebook (69.171.229.116) will fetch:
http://www.domain.com/something?_escaped_fragment_=v%3D1
URLs: 192,774
Google & Bing do not escape v=1, so they will request:
http://www.domain.com/something?_escaped_fragment_=v=1
(This is because they assume, incorrectly, that we have already escaped the canonical URL.)
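To make the two request styles concrete, here is a minimal Python sketch of the #! to _escaped_fragment_ mapping. The function name escaped_fragment_url and the encode_all flag are our own illustration, not anything a crawler exposes; it only reproduces the two forms we saw in the logs for this simple example.

```python
from urllib.parse import quote


def escaped_fragment_url(hashbang_url: str, encode_all: bool = True) -> str:
    """Rewrite a #! URL into its ?_escaped_fragment_= form (hypothetical helper).

    encode_all=True  percent-encodes every reserved character (the Yandex/Facebook style above).
    encode_all=False leaves '=' untouched (the Google/Bing style above).
    """
    base, sep, fragment = hashbang_url.partition("#!")
    if not sep:
        # No hash-bang present, nothing to rewrite.
        return hashbang_url
    safe = "" if encode_all else "="
    joiner = "&" if "?" in base else "?"
    return f"{base}{joiner}_escaped_fragment_={quote(fragment, safe=safe)}"


print(escaped_fragment_url("http://www.domain.com/something#!v=1"))
# http://www.domain.com/something?_escaped_fragment_=v%3D1
print(escaped_fragment_url("http://www.domain.com/something#!v=1", encode_all=False))
# http://www.domain.com/something?_escaped_fragment_=v=1
```

Either form decodes back to the same fragment, so a server that percent-decodes the parameter value should be able to treat both kinds of request identically.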
Sitemap URL (0.53%)
Example: http://www.domain.com/something#!sitemap
Requested URL: http://www.domain.com/something?_escaped_fragment_=sitemap
1,284 links crawled
NOTE: We found this exceptionally low. The sitemap had thousands of files and is clearly, in effect, not being used when compared to the canonical and content body URLs: of all URLs crawled, fewer than 1% came from the sitemap. Google (and other search engines) clearly prefer organic content.
Content URL (17%)
Example: http://www.domain.com/something#!pagetype?key1=value1&key2=value2
Requested URL: http://www.domain.com/something?_escaped_fragment_=pagetype?key1=value1&key2=value2
40,998 total links crawled
40,913 by Google
85 by everybody else
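The percentages in the section headings follow directly from the raw counts. A quick Python check, using only the numbers reported in this post (the variable names are ours):

```python
# URL counts per source, as reported in the sections above.
counts = {
    "canonical": 5858 + 192774,  # Yandex/Facebook style + Google/Bing style
    "sitemap": 1284,
    "content": 40998,
}

total = sum(counts.values())  # should equal the overall sample size
shares = {name: 100 * n / total for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]} URLs, {pct:.2f}% of {total}")
```

The three buckets add up to the full sample of 240,914 URLs, so every crawled URL falls into exactly one of them.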
In canonical and content URLs, Google encodes *SOME* special characters before requesting the _escaped_fragment_, but it did not work how we expected. Specifically, GoogleBot leaves characters such as ., ? and = alone, and encodes other characters, including the ampersand (seriously wtf!). For example:
http://www.domain.com/something#!pagetype?key1=value 1&key2=value-2
would be requested as:
http://www.domain.com/something?_escaped_fragment_=pagetype?key1=value%201%26key2=value%2D2
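Since different crawlers encode different characters, one way to cope on the server side is to ignore which characters were encoded and simply percent-decode everything after _escaped_fragment_=. Below is a minimal Python sketch of that idea. The helper name to_hashbang is hypothetical, and it assumes _escaped_fragment_ is the last parameter in the query string, which held for the requests shown in this post but is still an assumption.

```python
from urllib.parse import unquote, urlsplit

MARKER = "_escaped_fragment_="


def to_hashbang(requested_url: str) -> str:
    """Recover the original #! URL from a crawler request (hypothetical helper)."""
    parts = urlsplit(requested_url)
    query = parts.query
    idx = query.find(MARKER)
    if idx == -1:
        # Not an escaped-fragment request, leave it alone.
        return requested_url
    # Anything before the marker is an ordinary query string we keep as-is.
    leading = query[:idx].rstrip("&")
    # Assumption: everything after the marker belongs to the fragment.
    fragment = unquote(query[idx + len(MARKER):])
    base = f"{parts.scheme}://{parts.netloc}{parts.path}"
    if leading:
        base += "?" + leading
    return f"{base}#!{fragment}"


print(to_hashbang(
    "http://www.domain.com/something"
    "?_escaped_fragment_=pagetype?key1=value%201%26key2=value%2D2"
))
```

Note that the sketch deliberately avoids a generic query-string parser: a parser that splits on & would cut the fragment apart whenever a crawler sends the ampersand unencoded.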
Also of interest was the fact that GoogleBot's IP address was _clearly_ and undeniably accessing API functions which return JSON (more on this later).