Announcing sv 1.0
Posted on July 20, 2018
sv, a CSV library for Haskell by the Queensland Functional Programming Lab, was released over four months ago. Since then, we’ve had feedback on what potential users want from the library, and we have altered the design to the point where we are now releasing a new version: sv 1.0.
The Story Thus Far
With sv, QFPL wanted a library that could parse a CSV file and build a data structure which preserved all syntactic information about the input document. This allowed users to manipulate the document using lenses and output a new file that was minimally different from the input. With this, a user could write custom sanitisation tools and linters for their data.
We wrote such a data structure, and gave it a parser with the parsers package.
parsers abstracts over the common parsing primitives and combinators, allowing someone to instantiate a parsers parser with other parsing libraries. sv exposed its parser so that a user could use it as a component of some larger parser. Since it used parsers, our parser was likely to work with whatever parsing library the user was already using.
sv was not only for writing custom linters and sanitisation tools. It was also for writing decoders to extract values from CSV documents into Haskell programs. Importantly, sv did not use type classes for this, but rather an Applicative, combinator-based approach. Users could pass decoders around and manipulate them as values, and create multiple decoders for the same type.
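To illustrate the combinator style, here is a minimal, self-contained sketch of an Applicative decoder over a record of fields. This is an illustration of the approach only: the Decode type and the field and int helpers below are invented for this example, not sv’s actual API.

```haskell
-- A toy Applicative decoder in the spirit of sv's combinator-based
-- decoding. Everything here is illustrative, not sv's real interface.

-- A decoder consumes fields from a record, yielding a value and the
-- remaining fields, or an error message.
newtype Decode a = Decode { runDecode :: [String] -> Either String (a, [String]) }

instance Functor Decode where
  fmap f (Decode g) = Decode $ \fs -> fmap (\(a, rest) -> (f a, rest)) (g fs)

instance Applicative Decode where
  pure a = Decode $ \fs -> Right (a, fs)
  Decode df <*> Decode da = Decode $ \fs -> do
    (f, rest)  <- df fs
    (a, rest') <- da rest
    pure (f a, rest')

-- Decode the next field as-is.
field :: Decode String
field = Decode $ \fs -> case fs of
  (x : rest) -> Right (x, rest)
  []         -> Left "unexpected end of record"

-- Decode the next field as an Int.
int :: Decode Int
int = Decode $ \fs -> case fs of
  (x : rest) | [(n, "")] <- reads x -> Right (n, rest)
  _                                 -> Left "expected an Int"

data Person = Person { name :: String, age :: Int } deriving (Eq, Show)

-- Decoders are ordinary values, built with the Applicative combinators.
personD :: Decode Person
personD = Person <$> field <*> int

main :: IO ()
main = print (fst <$> runDecode personD ["Alice", "30"])
-- prints: Right (Person {name = "Alice", age = 30})
```

Because decoders are ordinary values rather than type class instances, a program can define several different decoders for the same type and combine or transform them like any other value.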
All was not well. sv’s parser was very slow, mainly due to extravagant memory usage. There was also keen interest in streaming support: parsing and decoding a file without keeping the whole thing in memory at once. This appeared very difficult to integrate into sv’s design without significant changes to its syntax tree.
The Times They Are a-Changin’
At YOW! Lambda Jam 2018 in Sydney, I (George Wilson) gave a talk about the design of sv. This prompted many useful conversations about sv throughout the rest of the conference and afterwards. The key insight was this: sv’s two broad goals were at odds with each other. From a decoding library, users demand speed and streaming support. By giving sv a syntax-preserving representation and making its parser as generic as possible, it can also serve as a toolkit for building custom linting and sanitisation tools, but it becomes much harder to give decoding users the speed they crave. I have decided that a sensible way forward is to split sv into a handful of libraries. The decoding will be paired with a much higher-performance parser that does not keep all syntax information. Syntax-preserving manipulation will be available separately and, for now, will retain its performance problems.
Edward Kmett suggested I speak to John Ky about John’s high-performance CSV parser, which has now been released to Hackage as hw-dsv. This library uses rank-select data structures to index into CSV documents, and offers both a strict and a lazy (streaming) interface. After Edward explained succinct data structures to me, and after further conversations with John, I was very keen to try this library as a new parser for sv. And that is what I have done.
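To give a flavour of the rank-select idea, here is a toy, self-contained sketch of the two queries such structures answer. This is an O(n) illustration only: real succinct rank-select structures answer both queries in constant time, and none of the names below are hw-dsv’s API.

```haskell
-- rank i bs: how many True bits occur among the first i positions of bs.
rank :: Int -> [Bool] -> Int
rank i = length . filter id . take i

-- select n bs: the index of the n-th True bit (1-based), if it exists.
select :: Int -> [Bool] -> Maybe Int
select n bs = case [ i | (i, True) <- zip [0 ..] bs ] of
  ps | n >= 1 && n <= length ps -> Just (ps !! (n - 1))
     | otherwise                -> Nothing

-- Marking every comma in a row with a set bit lets select jump straight
-- to the k-th field separator without re-scanning the text.
commaBits :: String -> [Bool]
commaBits = map (== ',')

main :: IO ()
main = do
  let row = "name,age,city"
  print (select 2 (commaBits row))          -- position of the 2nd comma: Just 8
  print (rank (length row) (commaBits row)) -- total commas in the row: 2
```

By indexing the positions of delimiters and newlines as bit vectors up front, a parser can locate any field or record directly instead of walking the document character by character.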
sv’s decoding layer now sits atop hw-dsv.

The release of sv 1.0 will be structured as follows:
- sv-core: the decoding/encoding of sv, agnostic of any parser
- sv: sv-core atop its new default parser (hw-dsv)
- sv-cassava: sv-core atop cassava’s parser instead
- svfactor: the syntax-preserving parsing/printing/manipulation of sv 0.1, packaged as its own library, with no dependency on any other sv package
- sv-svfactor: sv-core atop svfactor, for those wanting the behaviour of the earlier sv
Most users interested in decoding and encoding should use sv, being careful to set the right cabal flags, as explained in the README and cabal package description, in order to get the best performance. Those running GHC versions below 8.4 will get better performance by using sv-cassava instead. Those interested not in decoding or encoding, but in a syntax-preserving CSV data structure, can use svfactor. Finally, anyone interested in writing decoders which depend on structural information of the CSV, as was the case in the first release of sv, can use sv-svfactor. Alternatively, you can use sv’s decoders atop cassava’s parser by using sv-cassava instead of sv.
sv is still missing key features, such as column-name-based decoding. Although the parser is streaming, sv itself still is not. Furthermore, encoding data as CSV is likely to still be slow. I intend to work on these features and performance concerns.
svfactor’s parser still has performance problems, so it should be altered or rewritten. When I get around to this, I intend to blog about it, as well as about benchmarking.
Links and Further Reading
George Wilson is an enthusiastic programmer at the Queensland Functional Programming Lab. George enjoys writing code, and speaking and writing about Haskell and functional programming.