Reviews - What do customers think about Data Munging with Perl?
No-nonsense resource for meat and potatoes Perl scripting Jul 21, 2007
The quintessential Perl activity is data processing, particularly in a Unix environment, where output is piped into a script from some other program, transformed, and spat out again. Many people's first encounter with Perl will probably be with this kind of task. David Cross's book shows how to do it with the minimum of fuss and the maximum of flexibility. It's not a Perl tutorial, however, so you will need some basic knowledge of Perl; having read The Llama is enough. There is an appendix of 'essential Perl' to refresh your memory if you're a bit rusty.
The book begins by revising some of those basic Perl practices that come in handy for scripting, e.g. command line options, regular expressions and sorting. The second part of the book deals with parsing fairly simple data: traditional fixed-width record data (e.g. the column-based stuff that you often find as the output of old Fortran and C programs), unstructured data (e.g. doing word counts on text files), and formats such as CSV, PNG and MP3. This is the strongest section of the book, and contains lots of useful hands-on information.
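To give a flavour of the fixed-width record parsing the review mentions, here is a minimal sketch in the style the book teaches (the record layout and field names are my own illustration, not an example from the book):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical record layout: 10-char name, 3-char age, 8-char city.
my $record = "Armstrong 42 Houston ";

# unpack with 'A' templates splits on column widths and strips
# trailing spaces from each field.
my ($name, $age, $city) = unpack 'A10 A3 A8', $record;

print "$name is $age, in $city\n";   # Armstrong is 42, in Houston
```

This one-line `unpack` is the idiomatic alternative to a pile of `substr` calls when dealing with column-based output from old Fortran and C programs.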
The third part of the book deals with more modern forms of data files, in the shape of XML. Parsing HTML also gets a chapter to itself, after the author usefully demonstrates the limitations of any simple solution (e.g. using regexes), which provides pretty strong evidence in favour of the standard 'don't try it yourself, use a CPAN module' argument. The XML chapter itself covers the XML::Parser module in reasonable detail. However, there are now many more XML parsers in Perl out there, and XML::Parser is probably no longer the best solution (Grant McLean's Perl XML FAQ on the net has a good overview of the options). Excluding the seemingly obligatory 'here's a bunch of books and websites to learn more' chapter, the last proper chapter is on parsing and the Parse::RecDescent module, and it's a very good gentle introduction.
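The point about the limitations of regexes for HTML is easy to demonstrate in miniature (the HTML string and regex below are my own illustration, not taken from the book):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A naive tag-stripping regex breaks on perfectly legal HTML --
# here a '>' inside an attribute value ends the match too early.
my $html = '<a href="x" title="a > b">link</a>';
(my $stripped = $html) =~ s/<[^>]*>//g;

print "$stripped\n";   # prints ' b">link', not 'link'
```

Handling attributes, comments and malformed markup correctly is exactly why the 'use a CPAN module' advice holds.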
If you're not working in a command line environment, there's not a whole lot here you're going to need. Equally, if you've been doing this sort of thing for a while, there's not much here that will be new to you, as not all the subjects are explored in any great depth. And some of it (particularly the XML chapter) is a bit outdated and superficial, so I would knock off a star from my rating if you're more interested in the XML/HTML chapters.
But for the simpler tasks, e.g. parsing column-based data, this is recommended. You're shown all the handy tricks you need, such as piping, taking input from standard input as well as files, slurping paragraphs, etc. My 4-star rating applies if this sounds like what you need: it's a clear, short and to-the-point book, and definitely worth taking with you on your first journey into data munging.
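The tricks mentioned here combine neatly in a few lines of Perl; a minimal sketch of the Unix filter pattern with paragraph slurping (my own example, not the book's):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Classic Unix filter: <> reads STDIN when no filenames are given,
# otherwise each file named on the command line in turn.
# Setting $/ to '' switches to paragraph mode, so each read returns
# one blank-line-separated paragraph ("slurping paragraphs").
local $/ = '';

my $paragraphs = 0;
while (my $para = <>) {
    $paragraphs++;
}
print "$paragraphs paragraph(s)\n";
```

Run as `script.pl file.txt` or `cat file.txt | script.pl` -- the same code handles both, which is what makes the filter model so convenient.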
I wish I had purchased this book years ago Jan 2, 2007
As a DBA, I bought this book to enhance my data manipulation skills with Perl, but I found so much more in this compact book. David Cross provides many excellent code examples and explanations for common, non-database data manipulation tasks: for example, working on delimited and fixed-width text files and managing complex data structures in Perl with array and hash refs. David has excellent communication skills, as his examples and explanations taught me much about Perl that I did not previously understand completely. I also found Chapter 4, on regular expressions, to be one of the best and most concise I've read. The only downside of this book is that I wish it had more pages to read! Regardless, it's a must-have Perl book.
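The combination of delimited files and hash/array refs that this reviewer mentions typically looks something like the following sketch (the department data is my own illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Group colon-delimited records by their first field using a hash of
# array references; push autovivifies each list on first use.
my %names_by_dept;
while (my $line = <DATA>) {
    chomp $line;
    my ($dept, $name) = split /:/, $line;
    push @{ $names_by_dept{$dept} }, $name;
}

for my $dept (sort keys %names_by_dept) {
    print "$dept: @{ $names_by_dept{$dept} }\n";
}

__DATA__
sales:alice
it:bob
sales:carol
```

This prints `it: bob` and `sales: alice carol` -- the hash-of-arrayrefs pattern is the workhorse for almost any grouping task on delimited data.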
Belongs on every sysadmin's desk Jul 2, 2002
This book isn't about arcane corners of Perl theory. It's about how to write Perl programs that perform the "simple" task of converting data from one format to another.
Need to get every headline from an RSS feed? Or report the three users with the most processes running, as listed by `ps`? Or extract the first paragraph from each of a thousand HTML files? Or make a .tsv file based on all the "From:" and "Subject:" lines in your mailbox file? If those sorts of tasks sound familiar to you, then this is the book you've been looking for. It has working code for doing these sorts of things, involving lots of different common kinds of formats.
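The mailbox task in that list can be sketched in a handful of lines; this is my own rough take on it (it assumes a classic mbox file, where a line starting with `From ` -- no colon -- separates messages), not code from the book:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Print a tab-separated From/Subject pair for each message in an
# mbox file read from STDIN or named on the command line.
my ($from, $subject) = ('', '');
while (my $line = <>) {
    if ($line =~ /^From /) {                 # a new message starts
        print "$from\t$subject\n" if $from ne '';
        ($from, $subject) = ('', '');
    }
    elsif ($line =~ /^From:\s*(.*)/)    { $from    = $1 }
    elsif ($line =~ /^Subject:\s*(.*)/) { $subject = $1 }
}
print "$from\t$subject\n" if $from ne '';    # flush the last message
```

Real mail headers can be folded across lines and encoded, so a robust version would lean on a CPAN module, but the shape of the solution is the same.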
By tech book standards, this book is short (300 pages), but it's clear and direct and to the point -- no bloat here. Every page tells you something you need to know, with useful examples for every idea that it explains.
Valuable for its _clarity_ Jul 25, 2001
After reading this book I rewrote a pretty massive PostScript parsing and munging system that I was having a lot of trouble with, and felt like I did it the _right_ way. If you follow the author through his examples and actually read the book (which I was able to read almost straight through), I think that you will find yourself with a more long-view approach. And I think that makes this book valuable. And admit it, every time you read through a regex chapter you get a little more in the old noggin...
Good for data-processing *beginners* Jul 6, 2001
It's a guide. David takes you through the different "data munging" tasks (record-oriented data, binary data, fixed-width data, XML) and shows you his proper ways of dealing with them (or, at least, of thinking about them). It's not an encyclopedia of "data munging": the book is 300 pages, and many of them (too many, maybe) are detailed descriptions of useful CPAN modules (which I wasn't reading as carefully as the rest of the book, since the POD was always enough), so it covers only the usual data processing tasks, letting you go deeper by yourself into more advanced topics. After you finish it, far fewer "data sources" will scare you - the solutions and references are inside.
As I said, it may be good for data-processing beginners, but Perl experts will hardly find lots of new information in it.
P.S. I trust him, and therefore follow his advice in every script I start to think of (especially the advice about the "UNIX filter model").