My first-ever blog post documents my experiences so far with Scala pickling.
Since I’ve started in the middle, a quick precap.
I’ve been learning Scala for three or four years in my spare time and have a little project built with Scala, ScalaFX and a bit of Akka. It’s a map viewer that started off as a viewer for my implementation of the Revised R*-Tree. The original implementation was in Java, but I rewrote the R*-Tree in Scala and ported the rest of the Java code to Scala. I have a test data set based on data published by the Minnesota Department of Transportation, which amounts to about half a million features, mostly county road segments. The data was imported from shapefiles (an ESRI format) and is stored in a PostGIS database. The PostGIS database lives on my home server, so for development purposes I serialize the objects to a local file and, for the most part, work from the local file instead.
At the moment the code uses Java serialization via Scala. It’s OK, but it’s a little fragile, not easily human-readable and a bit slow to work with. This is where Scala pickling comes in. My plan was to use Scala pickling to serialize the same objects to JSON. JSON is about as human-readable as my kind of structured data is likely to get, and it’s more resilient, and more fixable by hand, than binary data. If the hype is to be believed, Scala pickling should also be more efficient. It sounded better all round, so I thought I’d give it a try.
So, this morning I fired up my IDE (IntelliJ IDEA) and started to add pickling. I naively thought that it might be included in Scala 2.11.5, but it turns out that it’s a short download away via SBT (the dependency itself is a one-liner; see the sketch after this list). I do this in a slightly clunky way, because my IntelliJ project pre-dates solid SBT integration: in theory I use my new SBT-based project to do the download, and then copy the files over to the lib directory in the old-style project. Easy … except … I work on a MacBook Pro running OS X, and getting Finder to see files in a directory starting with a ‘.’ involves twelve flavours of witchcraft. Thirteen on a bad day. After a 90-minute digression in which I discovered that:
- a half-decent file manager for OSX costs at least $18;
- if it’s any good it definitely isn’t in the App store;
- since everybody and his aunt needs a better file manager than Finder, there are thousands of unofficial file managers to choose from;
- at $18+ a pop, I definitely need the once and future king of file managers
… I really couldn’t decide which was the undisputed Joffrey, First of His Name, and used cp on the command line instead. 93 minutes to copy three files. At this rate … but I digress.
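Back on topic: the SBT side really is tiny. Something like the following in build.sbt pulled the library down for me (the coordinates and version below are the ones I believe were current at the time; treat them as placeholders and check the project’s README if they’ve moved):

```scala
// build.sbt – one line pulls in scala-pickling
// (version is my guess at the then-current release)
libraryDependencies += "org.scala-lang.modules" %% "scala-pickling" % "0.10.1"
```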
A few seconds later, Scala pickling was set up as a library ready to be imported and used. The import is a little confusing since it used to be “import scala.pickling._” but at some point this changed to “import scala.pickling.Defaults._”, so about half the examples on the web don’t work straight away. The next step is to choose the pickling format, which in my case was achieved with “import scala.pickling.json._”, exactly as advertised.
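Concretely, the two imports that set everything up look like this (the commented line is the older form you’ll still see in examples around the web):

```scala
// Older examples: import scala.pickling._
import scala.pickling.Defaults._ // the pickle/unpickle operations and default picklers
import scala.pickling.json._     // choose JSON as the pickle format
```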
In theory, the next and final step should be to pickle the target object, something like this: “val pickledObject = targetObject.pickle.value”. In my case this ran, but it turned my container, a Java ArrayList holding half a million objects, into a JSON file of 105 bytes. I don’t know why; perhaps the Java ArrayList causes a Scala-Java-Scala interop problem, as the fix was to use a Scala Array instead.
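Putting the pieces together, the working version boils down to something like this, where Feature is a hypothetical stand-in for my real map-feature class:

```scala
import scala.pickling.Defaults._
import scala.pickling.json._

// Hypothetical stand-in for my real map-feature class
case class Feature(id: Long, roadName: String)

// A Scala Array, not a java.util.ArrayList
val features: Array[Feature] =
  Array(Feature(1L, "CSAH 12"), Feature(2L, "CSAH 15"))

val pickled = features.pickle                    // pickle the whole array
val json: String = pickled.value                 // the JSON string written to disk
val restored = pickled.unpickle[Array[Feature]]  // and back again, as a sanity check
```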
Java serialization produces a 290MB file and runs happily in the 4GB maximum allocated to the JVM. Unfortunately, Scala pickling broke with an OutOfMemoryError after whirring away for a couple of minutes. At first I thought this might be down to cycles in my object graph, but a swift Google revealed that cycles are supported and work. After experimenting with subsets of the original array, I found that increasing the maximum heap size to 8GB gave Scala enough headroom to pickle the whole array.
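For anyone hitting the same OutOfMemoryError: in IntelliJ the heap is just a VM option (-Xmx8g) on the run configuration. If you launch through SBT instead, I believe the equivalent is to fork the run and set the option in build.sbt:

```scala
// build.sbt – fork the run so the JVM option actually applies, then raise the heap
fork := true
javaOptions += "-Xmx8g"
```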
A working solution, then. As usual, the hardest parts of the process were:
- getting started;
- losing so much time to glitches.
The end result is memory-hungry and quite slow: ~80s to pickle the objects to a JSON string and ~15s to write the string to a file. The file is about 840MB, but it compresses to a tiny 3.4MB using “tar -cvzf <output file> <input file>”. For comparison, Java serialization’s file is 290MB uncompressed and compresses to 66MB.
In conclusion, this was an interesting experiment, but Scala pickling to JSON is probably the wrong tool for this job. If I’d set out to write an export-to-JSON feature I’d have been delighted with this result, since it took so few lines of code to implement the whole thing. I’ll probably stick with Java serialization for now, but if I were to dig deeper – by chunking my data, by using a different format or by exploring the customization options – I think there’s a way forward to a much more promising result.
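To give a flavour of the chunking idea – and this is only a sketch I haven’t run against the full data set, reusing the hypothetical Feature class from earlier with a guessed chunk size – the point is that the JSON for all half a million features never has to exist as one giant in-memory string:

```scala
import java.io.{BufferedWriter, FileWriter}
import scala.pickling.Defaults._
import scala.pickling.json._

case class Feature(id: Long, roadName: String) // hypothetical stand-in, as before

// Write one JSON document per chunk, newline-delimited, so peak memory is
// bounded by the chunk size rather than by the whole half-million features.
def pickleInChunks(features: Array[Feature], path: String, chunkSize: Int = 50000): Unit = {
  val out = new BufferedWriter(new FileWriter(path))
  try {
    features.grouped(chunkSize).foreach { chunk =>
      out.write(chunk.pickle.value)
      out.newLine()
    }
  } finally out.close()
}
```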
Also, by writing this post, I realised that a shockingly small proportion of the effort went into the real work of implementing the feature. The rest went on getting past hurdles that shouldn’t have been there in the first place. In retrospect I’m surprised at how much each one slowed me down.