Perl, XML, Television and something about encodings
I’ve been busy hacking away on a script that reads the XMLTV format and pours it into a MySQL database. It’s fun, but as I’ve rediscovered once again, nothing is quite as easy as it should be. I had never tried to parse XML from Perl before, but a quick visit to Google suggested it should be a breeze, since everyone else seems to do it. I haven’t done my homework and figured out the differences between the many ways you can parse XML, so I basically found a few examples and picked the module they used: XML::Parser.
That was only the first step of the way, though. XML::Parser has several “styles” you can work with – tree, subs, objects and stream – and picking the right one is a pain when you (still) don’t know much about XML.
I’ve figured out how one of the styles sort of works and got a parser going that retrieves the data I need from the XML file and stuffs it into the database.
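Something along these lines gives the general idea – it’s a rough sketch rather than my actual script, and the file name, database name and table layout are just placeholders (the <programme> element with its channel and start attributes and a <title> child is how XMLTV actually structures things):

    use strict;
    use XML::Parser;
    use DBI;

    # Placeholder database and table -- adjust to taste.
    my $dbh = DBI->connect('DBI:mysql:database=tvguide', 'user', 'password',
                           { RaiseError => 1 });
    my $sth = $dbh->prepare(
        'INSERT INTO programmes (channel, start, title) VALUES (?, ?, ?)');

    my %current;        # the programme currently being parsed
    my $in_title = 0;   # true while we are inside a <title> element

    my $parser = XML::Parser->new(Handlers => {
        Start => \&start_tag,
        Char  => \&char_data,
        End   => \&end_tag,
    });

    sub start_tag {
        my ($expat, $element, %attrs) = @_;
        if ($element eq 'programme') {
            %current = (channel => $attrs{channel}, start => $attrs{start});
        } elsif ($element eq 'title') {
            $in_title = 1;
            $current{title} = '';
        }
    }

    sub char_data {
        my ($expat, $text) = @_;
        $current{title} .= $text if $in_title;
    }

    sub end_tag {
        my ($expat, $element) = @_;
        if ($element eq 'title') {
            $in_title = 0;
        } elsif ($element eq 'programme') {
            $sth->execute($current{channel}, $current{start}, $current{title});
        }
    }

    $parser->parsefile('listings.xml');
    $dbh->disconnect;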
The original XML file was in ISO-8859-1 (aka Latin1), so you would think it would be no trouble getting the extracted data into MySQL. Well, it was. Expat – the XML parser toolkit XML::Parser is built upon – seems to return everything UTF-8 encoded, which in a common browser made our three Danish letters (æ, ø and å) look quite strange.
I really didn’t need this challenge, but after hammering away for a while, it seems my newly built function (which basically reuses some CPAN modules) can convert the strings back from UTF-8 to plain ol’ ISO-8859-1.
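The conversion boils down to something like this – a simplified sketch rather than my exact function, using Unicode::String from CPAN (Text::Iconv would do the trick too):

    use Unicode::String qw(utf8);

    # Treat the string Expat hands back as UTF-8 and ask for it
    # back as Latin1 (ISO-8859-1) bytes before it goes into MySQL.
    sub utf8_to_latin1 {
        my ($text) = @_;
        return utf8($text)->latin1;
    }

Every string coming out of the parser gets run through that before it hits the database.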
So now my TV parser is flying. In celebration, I’ve even started hacking on the interface, which uses the data from MySQL to make a far nicer TV guide than any other (Danish) site. The “multi-channel today” view is going to look something like Porjes’ design.