Even More Pickling

Puzzled at my lack of success with Scala pickling, I filed an issue on GitHub.[1] I will be the first to admit that, as bug reports go, “I’ve got 20,000 lines of code and it doesn’t work with your library” is pretty much a lost cause. I did hope to get one or two tips on how I could help myself, or possibly to be flamed for asking a stupid question, but I didn’t expect to be completely ignored for two weeks. I did eventually receive a reply, which took quite a lot of work to follow up.

The reply suggested that I build my code with Java 7 outside of the IntelliJ environment. Doing this (via a shell script) identified a few small issues[2] but increased my confidence that it was pickling that was broken, not my environment. This led me to read through all of the open issues, and #296[3] turned out to be a real gem.

It turns out that pickling identifies any Scala class with a getter or setter method as a Java Bean. This leads it to do strange things; in particular, it will generate a pickler at compile time for traits and base classes without full knowledge of the runtime type, which produces a lot of lost or empty objects in the output. For my scenario, with a lot of code converted from Java, this is more or less a disaster.
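To illustrate (with hypothetical class names), this is the kind of ported Java-style class that trips the bean detection, next to the idiomatic Scala shape that avoids it:

```scala
// Hypothetical example: a Java-style class ported to Scala.
// scala/pickling treats any class with getX/setX methods as a
// Java Bean, which sends it down the bean-pickling path.
class PortedFeature {
  private var name: String = ""
  def getName: String = name                   // bean-style getter
  def setName(n: String): Unit = { name = n }  // bean-style setter
}

// The idiomatic Scala equivalent exposes plain fields, which the
// static pickler handles as an ordinary case class.
case class Feature(name: String)

object BeanDemo {
  def main(args: Array[String]): Unit = {
    val ported = new PortedFeature
    ported.setName("county road")
    val idiomatic = Feature("county road")
    println(ported.getName)    // county road
    println(idiomatic.name)    // county road
  }
}
```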

Fast forward another few weeks and I’ve converted most of the getters and setters to something closer to idiomatic Scala. A single feature definition produces about 12K of JSON output, which is probably not complete, but getting there.

Pickling still isn’t usable because it falls over with a strange error:

Caused by: java.lang.RuntimeException: error: cannot find class or module with type name 'scala.collection.immutable.ListMap.Node'
full type string: 'scala.collection.immutable.ListMap.Node'

I can get past this for a single instance by capturing it as a val and declaring the concrete static type, although there is no ListMap in the output. If it is declared as a trait, it fails. This looks like another case of selecting the wrong static pickler at compile time, when deferring to a runtime pickler would be the safer choice.
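The static-versus-runtime type mismatch can be seen without the library at all; when a ListMap is held behind the broader trait, the compiler sees only the trait while the runtime value is one of ListMap's internal classes:

```scala
import scala.collection.immutable.ListMap

object StaticVsRuntime {
  def main(args: Array[String]): Unit = {
    // Held behind the trait, the compiler only sees Map[String, Int];
    // the runtime value is an internal ListMap class.
    val asTrait: Map[String, Int] = ListMap("a" -> 1, "b" -> 2)

    // The workaround: capture the value with its concrete static type,
    // so a compile-time pickler is chosen for ListMap itself.
    val asConcrete: ListMap[String, Int] = ListMap("a" -> 1, "b" -> 2)

    println(asTrait.getClass.getName)
    println(asConcrete.getClass.getName)
  }
}
```

A pickler generated from the trait type has no way to know that the runtime class will be an internal node, which is consistent with the error above.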

It seems that the development team has the features they need in working condition. Perhaps it is good enough to write papers about, good enough to solve some big-data problems, good enough to use if you know what works and what doesn’t. I don’t think the current version is a general-purpose competitor to Java Serialization: it isn’t robust enough, and the documentation doesn’t point out the areas in which valid Scala code will cause it to break. If there were a bit more activity on the GitHub project I’d have waited for the library to mature, but it looks like the current version is here to stay for a while.

What puzzles me most of all is my grim determination to force it to work. To start with it looked like a lost cause. Many hours of spare-time programming later, it looks like I might eventually trick it into working … for this one line of code:

val pickled = featureDefinition.pickle.value

It seems I still have some good-will left, despite the obvious frustrations of spending so long getting nowhere.

References
1 https://github.com/scala/pickling/issues/326
2 I discovered a couple of Java files still lurking in the project
3 https://github.com/scala/pickling/issues/296

More Pickling

Following on from my experiments with Scala Pickling, I thought I would try unpickle() on the JSON file. This is where I discovered that things hadn’t gone well at all. First of all, I had to increase the maximum heap – 8GB was not enough. I used 14GB, but perhaps 9 would have been enough. It’s still a lot. At this point I started to get exceptions from the unpickle() call. It turned out that a lot of my data structure had turned into references like this:

{
  "$type": "java.lang.Object"
}

Unsurprisingly, Scala can’t unpickle this into anything useful.[1]

Now I am going through my code converting the Java collections to Scala collections – mostly mutable.Map and mutable.ArrayBuffer to start with. I’ve spent quite a few hours on this now. The code is no doubt better for it, but pickling is still losing most of my object graph. One class retains one String field but loses another. This is puzzling to say the least. If this was some random library from the internet I would give up and move on, but this is the right way to do Scala serialization from EPFL. Although it’s fairly new, I’m pretty sure it should work a lot better than this.
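The conversion work looks something like this (with hypothetical data; the converter import is `scala.jdk.CollectionConverters` on recent Scala, `scala.collection.JavaConverters` in the 2.11 era):

```scala
import java.util
import scala.collection.mutable
import scala.jdk.CollectionConverters._

object CollectionPort {
  def main(args: Array[String]): Unit = {
    // A Java collection as it might appear in the ported code.
    val javaList = new util.ArrayList[String]()
    javaList.add("segment-1")
    javaList.add("segment-2")

    // The Scala equivalents used in the conversion.
    val buffer: mutable.ArrayBuffer[String] =
      mutable.ArrayBuffer(javaList.asScala.toSeq: _*)
    val counts: mutable.Map[String, Int] =
      mutable.Map("segment-1" -> 1, "segment-2" -> 1)

    println(buffer.length)       // 2
    println(counts("segment-1")) // 1
  }
}
```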

References
1 I suppose it could create the Java object instance, but it wouldn’t be usable anywhere with a type declaration.

Scala Pickling

My first-ever blog post documents my experiences so far with Scala pickling.

Since I’ve started in the middle, a quick precap.[1]

I’ve been learning Scala for three or four years in my spare time and have a little project built with Scala, ScalaFX and a bit of Akka. It’s a map viewer that started off as a viewer for my implementation of the Revised R*-Tree.[2] The original implementation was in Java, but I rewrote the R*-Tree in Scala and ported the rest of the Java code to Scala. I have a test data set based on data published by the Minnesota Department of Transportation, which amounts to about half a million features – mostly county road segments. The data was imported from Shape files (an ESRI format) and is stored in a PostGIS database.[3] The PostGIS database lives on my home server, so for development purposes I serialize the objects to a local file and (for the most part) work from the local file instead.

At the moment the code uses Java serialization via Scala. It’s OK, but it’s a little fragile, not easily human-readable and a bit slow to work with. This is where Scala Pickling comes in. My plan was to use Scala pickling to serialize the same objects to JSON. JSON is about as human-readable as my kind of structured data is likely to get, and more resilient and more human-fixable than binary data too. If the hype is to be believed, Scala pickling should also be more efficient. It sounded better all round, so I thought I’d give it a try.
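For context, the Java-serialization path amounts to a round trip like this (the class and file names are hypothetical stand-ins for the real feature types):

```scala
import java.io._

// Hypothetical stand-in for the real feature classes; Scala case
// classes are Serializable by default.
case class RoadSegment(id: Long, name: String)

object JavaSerDemo {
  def main(args: Array[String]): Unit = {
    val file = File.createTempFile("features", ".ser")
    val original = RoadSegment(42L, "County Road 12")

    // Serialize to the local file ...
    val out = new ObjectOutputStream(new FileOutputStream(file))
    try out.writeObject(original) finally out.close()

    // ... and read it back.
    val in = new ObjectInputStream(new FileInputStream(file))
    val restored =
      try in.readObject().asInstanceOf[RoadSegment] finally in.close()

    println(restored == original)   // true
    file.delete()
  }
}
```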

So, this morning I fired up my IDE (IntelliJ IDEA) and started to add pickling. I naively thought that it might be included in Scala 2.11.5, but it turns out that it’s a short download away via SBT. I do this in a slightly clunky way, because my IntelliJ project pre-dates solid SBT integration: in theory I use my new SBT-based project to do the download, and then copy the files over to the lib directory in the old-style project. Easy … except … I work on a MacBook Pro running OSX, and getting Finder to see files in a directory starting with a ‘.’[4] involves twelve flavours of witchcraft. Thirteen on a bad day. After a 90-minute digression in which I discovered that:

  • a half-decent file manager for OSX costs at least $18;
  • if it’s any good it definitely isn’t in the App Store;
  • since everybody and his aunt needs a better file manager than Finder, there are thousands of unofficial file managers to choose from;
  • at $18+ a pop I definitely need the once and future king of file managers

… I really couldn’t decide which was the undisputed Joffrey the first of his name and used cp on the command line instead. 93 minutes to copy three files. At this rate … but I digress.
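For reference, the SBT side of the download is a one-liner. This is a sketch from memory; the artifact coordinates and the 0.10.x version number are what I believe was current at the time, so check them before use:

```scala
// build.sbt (hypothetical sketch)
scalaVersion := "2.11.5"

libraryDependencies +=
  "org.scala-lang.modules" %% "scala-pickling" % "0.10.1"
```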

A few seconds later, Scala pickling was set up as a library ready to be imported and used. The import is a little confusing since it used to be “import scala.pickling._” but at some point this changed to “import scala.pickling.Defaults._”, so about half the examples on the web don’t work straight away. The next step is to choose the pickling format, which in my case was achieved with “import scala.pickling.json._”, exactly as advertised.
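Putting the pieces together, the working incantation looks like this. It requires scala-pickling on the classpath and will not compile without it; featureDefinition stands for whatever object is being serialized:

```scala
import scala.pickling.Defaults._   // used to be: import scala.pickling._
import scala.pickling.json._       // choose the JSON pickle format

val pickled: String = featureDefinition.pickle.value
```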

In theory, the next and final step should be to pickle the target object, something like this: “val pickledObject = targetObject.pickle.value”. In my case this ran, but turned my container of a Java ArrayList holding half a million objects into a JSON file of 105 bytes.[5] I don’t know why; perhaps the Java ArrayList causes a Scala-Java-Scala effect, as the fix[6] was to use a Scala Array instead.

Java serialization produces a 290MB file and runs happily in the 4GB maximum allocated to the JVM. Unfortunately, Scala pickling broke with an OutOfMemoryError after whirring away for a couple of minutes. At first I thought this might be down to cycles in my object graph, but a swift[7] Google revealed that this is supported and works. After experimenting with subsets of the original array[8] I found that increasing the maximum heap size to 8GB gave Scala enough head-room to pickle the whole array.
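The subset experiment might look something like this (hypothetical data; the pickle call itself is commented out since it needs the library on the classpath):

```scala
object SubsetExperiment {
  def main(args: Array[String]): Unit = {
    // Stand-in for the half-million-element array of features.
    val features: Array[Int] = Array.tabulate(500000)(identity)

    // Try progressively larger prefixes to see where memory runs out.
    for (n <- Seq(1000, 10000, 100000)) {
      val subset = features.take(n)
      println(s"subset of ${subset.length} features")
      // subset.pickle.value   // with scala-pickling on the classpath
    }
    // The full array eventually worked with a larger heap, e.g.:
    //   scala -J-Xmx8g SubsetExperiment
  }
}
```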

A working solution and, as usual, the hardest parts of the process were:

  1. getting started,
  2. losing so much time to glitches.

The end result is memory-hungry and quite slow: ~80s to pickle the objects to a JSON string and ~15s to write the string to a file. The file is about 840MB, but it compresses to a very tiny 3.4MB using “tar -cvzf <output file> <input file>”. For comparison, the Java-serialized file is 290MB uncompressed, which compresses to 66MB.

In conclusion, this was an interesting experiment, but Scala pickling to JSON is probably the wrong tool for this job. If I’d set out to write an export-to-JSON feature I’d have been delighted with this result, since it took so few lines of code[9] to implement the whole feature. I’ll probably stick with Java serialization for now, but if I were to dig deeper – by chunking my data, by using a different format or by exploring the customization options – I think there are ways forward to a much more promising result.

Also, by writing this post, I realised that the proportion of the effort that went into the real work of implementing this feature was a shockingly small part of the total. The rest was about getting past hurdles that shouldn’t have been there in the first place. In retrospect I’m surprised at how much each one slowed me down.


References
1 A “precap” is like a recap, but informs the reader about details that might be written in a future post. Although it sounds like it should be short for “precapitulation” that would imply being resigned to defeat and surrender, so it’s a complete word in its own right. No defeatism here … at least, not yet!
2 Beckmann and Seeger, ACM 978-1-60558-551-2/09/06
3 I also have a copy stored in Neo4J, but that’s another story
4 the character known as “period” if you’re in the US, “dot” for the rest of us
5 via a standard FileWriter.
6 when I say “fix” I obviously mean “caused it to break in a new and even more fascinating way”.
7 when I say “swift” I’m glossing over dealing with the malware on one of the dodgy answer aggregation sites that scrapes answers from StackOverflow and adds … a whole world of pain. After killing Safari, carefully reopening the tabs and killing the dodgy one before it started playing its evil tricks, I was back in control of my browser
8 courtesy of take()
9 about 5 plus about 10 more to integrate it into the rest of the program