March, 2010 – Graham Miln

Property hunting in Lyon, Web::Scraper, and tbody tags

Over the last few weekends, I have needed to brush up on my web page parsing skills. The tools available have moved on nicely since my last dip into this topic.

I am currently keeping an eye on properties in Lyon, France. The process has been tedious and cried out for some automation. Megan and I plan to return to France in the future, and this little project should ease the burden of finding an apartment or house.

This morning I discovered the Perl module Web::Scraper. It is a port of a Ruby-based tool called scrAPI. The approach avoids regular expression matching and instead uses XPath and DOM tree selector matching. Both are more resilient ways of addressing specific sections of a web page.

I found one stumbling block that took a while to overcome. After a little trial and error, I discovered that the Firefox browser returned misleading XPaths for elements embedded in tables.

The XPaths provided by Firebug and XPather included browser-inserted tbody tags. These tags did not appear in the source of the web pages. As a result, the browser's XPath did not match the structure seen by Web::Scraper, and Web::Scraper missed the desired content.

The solution was easy: strip the tbody steps out of the XPath, and Web::Scraper returns to working as advertised.
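
To make the fix concrete, here is a minimal sketch of the idea. It is written in Python using only the standard library, purely as an illustration; it is not the Perl I used with Web::Scraper. The function names strip_tbody and find_text, the sample markup, and the sample XPath are all hypothetical. It assumes the page source contains no tbody elements of its own, as was the case for my pages.

```python
import re
import xml.etree.ElementTree as ET


def strip_tbody(xpath: str) -> str:
    """Remove tbody steps, with or without a [n] predicate, from an XPath."""
    return re.sub(r"/tbody(\[\d+\])?(?=/|$)", "", xpath)


# Hypothetical page source: the table is written without a tbody element.
SOURCE = (
    "<html><body><table>"
    "<tr><td>Lyon 3e</td><td>650 EUR</td></tr>"
    "</table></body></html>"
)


def find_text(root, absolute_xpath):
    """Look up an absolute /html/... path in a parsed document.

    ElementTree paths are relative to the root element, so the
    leading /html step is replaced with ".".
    """
    relative = "." + absolute_xpath[len("/html"):]
    node = root.find(relative)
    return None if node is None else node.text


root = ET.fromstring(SOURCE)

# An XPath as a browser tool might report it, including the inserted tbody.
browser_xpath = "/html/body/table/tbody/tr/td[2]"
source_xpath = strip_tbody(browser_xpath)

print(source_xpath)                      # /html/body/table/tr/td[2]
print(find_text(root, browser_xpath))    # None: no tbody in the source
print(find_text(root, source_xpath))     # 650 EUR
```

The regular expression only removes whole tbody steps, so an XPath without tbody passes through unchanged. If a page really does include tbody in its source, stripping it would break the match instead, so it is worth checking the raw HTML first.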

With this problem overcome, the project is already proving useful.