Streaming parsing of JSON in Haskell with Pipes.Aeson - json

The Pipes.Aeson library exposes the following function:
decode :: (Monad m, ToJSON a) => Parser ByteString m (Either DecodingError a)
If I use evalStateT with this parser and a file handle as an argument, a single JSON object is read from the file and parsed.
The problem is that the file contains several objects (all of the same type) and I'd like to fold or reduce them as they are read.
Pipes.Parse provides:
foldAll :: Monad m => (x -> a -> x) -> x -> (x -> b) -> Parser a m b
but as you can see this returns a new parser - I can't think of a way of supplying the first parser as an argument.
It looks like a Parser is actually a Producer in a StateT monad transformer. I wondered whether there's a way of extracting the Producer from the StateT so that evalStateT can be applied to the foldAll Parser, and the Producer from the decode Parser.
This is probably completely the wrong approach though.
My question, in short:
When parsing a file using Pipes.Aeson, what's the best way to fold all the objects in the file?

Instead of using decode, you can use the decoded parsing lens from Pipes.Aeson.Unchecked. It turns a producer of ByteString into a producer of parsed JSON values.
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Pipes
import qualified Pipes.Prelude as P
import qualified Pipes.Aeson as A
import qualified Pipes.Aeson.Unchecked as AU
import qualified Data.ByteString as B
import Control.Lens (view)
byteProducer :: Monad m => Producer B.ByteString m ()
byteProducer = yield "1 2 3 4"
intProducer :: Monad m => Producer Int m (Either (A.DecodingError, Producer B.ByteString m ()) ())
intProducer = view AU.decoded byteProducer
The return value of intProducer is a bit scary, but it only means that intProducer finishes either with a parsing error and the unparsed bytes after the error, or with the return value of the original producer (which is () in our case).
We can ignore the return value:
intProducer' :: Monad m => Producer Int m ()
intProducer' = intProducer >> return ()
And plug the producer into a fold from Pipes.Prelude, like sum:
main :: IO ()
main = do
total <- P.sum intProducer'
putStrLn $ show total
In ghci:
λ :main
10
Note also that the functions purely and impurely let you apply to producers folds defined in the foldl package.

Related

Reading nested JSON data encoded as a nested string with Aeson

I have this weird JSON to parse containing nested JSON ... a string. So instead of
{\"title\": \"Lord of the rings\", \"author\": {\"666\": \"Tolkien\"}\"}"
I have
{\"title\": \"Lord of the rings\", \"author\": \"{\\\"666\\\": \\\"Tolkien\\\"}\"}"
Here's my (failed) attempt to parse the nested string using decode, inside an instance of FromJSON :
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.Maybe
import GHC.Generics
import Data.Aeson
import qualified Data.Map as M
type Authors = M.Map Int String
data Book = Book
{
title :: String,
author :: Authors
}
deriving (Show, Generic)
decodeAuthors x = fromJust (decode x :: Maybe Authors)
instance FromJSON Book where
parseJSON = withObject "Book" $ \v -> do
t <- v .: "title"
a <- decodeAuthors <?> v .: "author"
return $ Book t a
jsonTest = "{\"title\": \"Lord of the rings\", \"author\": \"{\\\"666\\\": \\\"Tolkien\\\"}\"}"
test = decode jsonTest :: Maybe Book
Is there a way to decode the whole JSON in a single pass ? Thanks !
A couple problems here.
First, your use of <?> is nonsensical. I'm going to assume it's a typo, and what you actually meant was <$>.
Second, the type of decodeAuthors is ByteString -> Authors, which means its parameter is of type ByteString, which means that the expression v .: "author" must be of type Parser ByteString, which means that there must be an instance FromJSON ByteString, but such instance doesn't exists (for reasons that escape me at the moment).
What you actually want is for v .: "author" to return a Parser String (or perhaps Parser Text), and then have decodeAuthors accept a String and convert it to ByteString (using pack) before passing to decode:
import Data.ByteString.Lazy.Char8 (pack)
decodeAuthors :: String -> Authors
decodeAuthors x = fromJust (decode (pack x) :: Maybe Authors)
(also note: it's a good idea to give you declarations type signatures that you think they should have. This lets the compiler point out errors earlier)
Edit:
As #DanielWagner correctly points out, pack may garble Unicode text. If you want to handle it correctly, use Data.ByteString.Lazy.UTF8.fromString from utf8-string to do the conversion:
import Data.ByteString.Lazy.UTF8 (fromString)
decodeAuthors :: String -> Authors
decodeAuthors x = fromJust (decode (fromString x) :: Maybe Authors)
But in that case you should also be careful about the type of jsonTest: the way your code is written, its type would be ByteString, but any non-ASCII characters that may be inside would be cut off because of the way IsString works. To preserve them, you need to use the same fromString on it:
jsonTest = fromString "{\"title\": \"Lord of the rings\", \"author\": \"{\\\"666\\\": \\\"Tolkien\\\"}\"}"

How to handle variability of JSON objects in Haskell?

Some REST service has variable returning JSONs, for example some fields can appear or disappear depending on the parameters of the request, the structure itself may change, nesting, etc.
So, this leads to avalanche-type growth in the number of types (along with FromJSON instances). Options are to:
try to make a lot of fields under Maybe (but this does not help very much with the variability in structure)
to introduce a lot of types
to create different phantom types (actually no big difference with prev.)
The 1. has drawback that if your call with some fixed parameters always returns good knows fields, you have to handle Nothing cases too, code becomes more complex. The 2. and 3. is tiring.
What is the most simple/convenient way to handle such variability in Haskell (if you use Aeson, sure, another option is to avoid Aeson usage)?
A possible solution to the existing/non-existing fields problem using type-level computation.
Some required extensions and imports:
{-# LANGUAGE DeriveGeneric, ScopedTypeVariables, DataKinds, KindSignatures,
TypeApplications, TypeFamilies, TypeOperators, FlexibleContexts #-}
import Data.Aeson
import Data.Proxy
import GHC.Generics
import GHC.TypeLits
Here's a data type (to be used promoted) that indicates if some field is absent or present. Also a type family that maps absent types to ():
data Presence = Present
| Absent
type family Encode p v :: * where
Encode Present v = v
Encode Absent v = ()
Now we can define a parameterized record containing all possible fields, like this:
data Foo (a :: Presence)
(b :: Presence)
(c :: Presence) = Foo {
field1 :: Encode a Int,
field2 :: Encode b Bool,
field3 :: Encode c Char
} deriving Generic
instance (FromJSON (Encode a Int),
FromJSON (Encode b Bool),
FromJSON (Encode c Char)) => FromJSON (Foo a b c)
One problem: writing the full type for each combination of occurrences/absences would be tedious, especially if only a few fields are present each time. But perhaps we could define an auxiliary type synonym FooWith that let us mention only those fields that are present:
type family Mentioned (ns :: [Symbol]) (n :: Symbol) :: Presence where
Mentioned '[] _ = Absent
Mentioned (n ': _) n = Present
Mentioned (_ ': ns) n = Mentioned ns n
-- the field names are repeated as symbols, how to avoid this?
type FooWith (ns :: [Symbol]) = Foo (Mentioned ns "field1")
(Mentioned ns "field2")
(Mentioned ns "field3")
Example of use:
ghci> :kind! FooWith '["field2","field3"]
FooWith '["field2","field3"] :: * = Foo 'Absent 'Present 'Present
Another problem: for each request, we must repeat the list of required fields two times: one in the URL ("fields=a,b,c...") and another in the expected type. It would be better to have a single source of truth.
We can deduce the term-level list of fields to be added to the URL from the type-level list of fields, by using an auxiliary type class Demote:
class Demote (ns :: [Symbol]) where
demote :: Proxy ns -> [String]
instance Demote '[] where
demote _ = []
instance (KnownSymbol n, Demote ns) => Demote (n ': ns) where
demote _ = symbolVal (Proxy #n) : demote (Proxy #ns)
For example:
ghci> demote (Proxy #["field2","field3"])
["field2","field3"]

Trouble with JSON (Data.Aeson)

I'm new to Haskell and in order to learn the language I am working on a project that involves dealing with JSON. I am currently getting the feeling Haskell is the wrong language for the job, but that isn't the point here.
I've been struggling to understand how this works for a few days. I have searched and everything I have found does not seem to work. Here's the issue:
I have some JSON in the following format:
>>>less "path/to/json"
{
"stringA1_stringA2": {"stringA1":floatA1,
"stringA2":foatA2},
"stringB1_stringB2": {"stringB1":floatB1,
"stringB2":floatB2}
...
}
Here floatX1 and floatX2 are actually strings of the form "0.535613567", "1.221362183" etc. What I want to do is parse this into the following data
data Mydat = Mydat { name :: String, num :: Float} deriving (Show)
where name would correspond to "stringX1_stringX2" and num to floatX1 for X = A,B,...
So far I have reached a 'solution' which feels fairly hackish and convoluted and doesn't work properly.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}
import Data.Functor
import Data.Monoid
import Data.Aeson
import Data.List
import Data.Text
import Data.Map (Map)
import qualified Data.HashMap.Strict as DHM
--import qualified Data.HashMap as DHM
import qualified Data.ByteString.Lazy as LBS
import System.Environment
import GHC.Generics
import Text.Read
data Mydat = Mydat {name :: String, num :: Float} deriving (Show)
test s = do
d <- LBS.readFile s
let v = decode d :: Maybe (DHM.HashMap String Object)
case v of
-- Just v -> print v
Just v -> return $ Prelude.map dataFromList $ DHM.toList $ DHM.map (DHM.lookup "StringA1") v
good = ['1','2','3','4','5','6','7','8','9','0','.']
f x = elem x good
dataFromList :: (String, Maybe Value) -> Mydat
dataFromList (a,b) = Mydat a (read (Prelude.filter f (show b)) :: Float)
Now I can compile this and run
test "path/to/json"
in ghci and it prints a list of Mydat's in the case where "stringX1"="stringA1" for all X. In reality there are two values for "stringX1" so aside from the hackyness this is not satisfactory. There must be a better way to do this. I get that I need to write my own parser probably but I am confused about how this works so any suggestions would be great. Thanks in advance.
The structure of your JSON is pretty nasty, but here's a basic working solution:
#!/usr/bin/env stack
-- stack --resolver lts-11.5 script --package containers --package aeson
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.Map as Map
import qualified Data.Aeson as Aeson
data Mydat = Mydat { name :: String
, num :: Float
} deriving (Show)
instance Eq Mydat where
(Mydat _ x1) == (Mydat _ x2) = x1 == x2
instance Ord Mydat where
(Mydat _ x1) `compare` (Mydat _ x2) = x1 `compare` x2
type MydatRaw = Map.Map String (Map.Map String String)
processRaw :: MydatRaw -> [Mydat]
processRaw = Map.foldrWithKey go []
where go key value accum =
accum ++ (Mydat key . read <$> Map.elems value)
main :: IO ()
main =
do let json = "{\"stringA1_stringA2\":{\"stringA1\":\"0.1\",\"stringA2\":\"0.2\"}}"
print $ fmap processRaw (Aeson.eitherDecode json)
Note that read is partial and generally not a good idea. But I'll leave it to you to flesh out a safer version :)
As I commented, the best thing would probably be to make your JSON file well-formed in the sense that the float fields should really be floats, not strings.
If that's not an option, I would recommend you phrase out the type that the JSON file seems to represent as simple as possible (but without dynamic Objects), and then convert that to the type you actually want.
import Data.Map (Map)
import qualified Data.Map as Map
type GarbledJSON = Map String (Map String String)
-- ^ you could also stick with hash maps for everything, but
-- usually `Map` is actually more sensible in Haskell.
data MyDat = MyDat {name :: String, num :: Float} deriving (Show)
test :: FilePath -> IO [MyDat]
test s = do
d <- LBS.readFile s
case decode d :: Maybe GarbledJSON of
Just v -> return [ MyDat iName ( read . filter (`elem`good)
$ iVals Map.! valKey )
| (iName, iVals) <- Map.toList v
, let valKey = takeWhile (/='_') iName ]
Note that this will crash completely if any of the items don't contain the first part of the name as a string of float format, and likely give bogus items when you filter out characters that aren't good. If you just want to ignore any malformed items (which is also not a very clean approach...), you can do it this way:
test :: FilePath -> IO [MyDat]
test s = do
d <- LBS.readFile s
return $ case decode d :: Maybe GarbledJSON of
Just v -> [ MyDat iName iVal
| (iName, iVals) <- Map.toList v
, let valKey = takeWhile (/='_') iName
, Just iValStr <- [iVals Map.!? valKey]
, [(iVal,"")] <- [reads iValStr] ]
Nothing -> []

How to use Data.Text.Lazy.IO to parse JSON files with Aeson

I want to parse all json files in a given directory into a data type Result.
So i have a decode function
decodeResult :: Data.ByteString.Lazy.ByteString -> Maybe Result
I began with Data.Text.Lazy.IO to load file into Lazy ByteString,
import qualified Data.Text.Lazy.IO as T
import qualified Data.Text.Lazy.Encoding as T
getFileContent :: FilePath -> IO B.ByteString
getFileContent path = T.encodeUtf8 `fmap` T.readFile path
It compiled, but I ran into Too many files opened problem, so I thought maybe I should use withFile.
import System.IO
import qualified Data.ByteString.Lazy as B
import qualified Data.Text.Lazy.IO as T
import qualified Data.Text.Lazy.Encoding as T
getFileContent :: FilePath -> IO (Maybe Result)
getFileContent path = withFile path ReadMode $ \hnd -> do
content <- T.hGetContents hnd
return $ (decodeAnalytic . T.encodeUtf8) content
loadAllResults :: FilePath -> IO [Result]
loadAllResults path = do
paths <- listDirectory path
results <- sequence $ fmap getFileContent (fmap (path ++ ) $ filter (endswith ".json") paths)
return $ catMaybes results
In this version, the lazy io seems never got evaluated, it always return empty list. But If i print content inside getFileContent function, then everything seems work correctly.
getFileContent :: FilePath -> IO (Maybe Result)
getFileContent path = withFile path ReadMode $ \hnd -> do
content <- T.hGetContents hnd
print content
return $ (decodeAnalytic . T.encodeUtf8) content
So I am not sure what am I missing, should I use conduit for this type of things?
Generally speaking I would recommend using a streaming library for parsing arbitrarily sized data like a JSON file. However, in the specific case of parsing JSON with aeson, the concerns of overrunning memory are not as significant IMO, since the aeson library itself will ultimately represent the entire file in memory as a Value type. So given that, you may choose to simply use strict bytestring I/O. I've given an example of using both conduit and strict I/O for parsing a JSON value. (I think the conduit version exists in some libraries already, I'm not sure.)
#!/usr/bin/env stack
{- stack --resolver lts-7.14 --install-ghc runghc
--package aeson --package conduit-extra
-}
import Control.Monad.Catch (MonadThrow, throwM)
import Control.Monad.IO.Class (MonadIO, liftIO)
import Data.Aeson (FromJSON, Result (..), eitherDecodeStrict',
fromJSON, json, Value)
import Data.ByteString (ByteString)
import qualified Data.ByteString as B
import Data.Conduit (ConduitM, runConduitRes, (.|))
import Data.Conduit.Attoparsec (sinkParser)
import Data.Conduit.Binary (sourceFile)
sinkFromJSON :: (MonadThrow m, FromJSON a) => ConduitM ByteString o m a
sinkFromJSON = do
value <- sinkParser json
case fromJSON value of
Error e -> throwM $ userError e
Success x -> return x
readJSONFile :: (MonadIO m, FromJSON a) => FilePath -> m a
readJSONFile fp = liftIO $ runConduitRes $ sourceFile fp .| sinkFromJSON
-- Or using strict I/O
readJSONFileStrict :: (MonadIO m, FromJSON a) => FilePath -> m a
readJSONFileStrict fp = liftIO $ do
bs <- B.readFile fp
case eitherDecodeStrict' bs of
Left e -> throwM $ userError e
Right x -> return x
main :: IO ()
main = do
x <- readJSONFile "test.json"
y <- readJSONFileStrict "test.json"
print (x :: Value)
print (y :: Value)
EDIT Forgot to mention: I'd strongly recommend against using textual I/O for reading your JSON files. JSON files should be encoded with UTF-8, while the textual I/O functions will use whatever your system settings specify for character encoding. Relying on Data.ByteString.readFile and similar is more reliable. I went into more detail in a recent blog post.

Haskell aeson package basic usage

Working my way through Haskell and I'm trying to learn how to serialized to/from JSON.
I'm using aeson-0.8.0.2 & I'm stuck at basic decoding. Here's what I have:
file playground/aeson.hs:
{-# LANGUAGE OverloadedStrings #-}
import Data.Text
import Data.Aeson
data Person = Person
{ name :: Text
, age :: Int
} deriving Show
instance FromJSON Person where
parseJSON (Object v) = Person <$>
v .: "name" <*>
v .: "age"
parseJSON _ = mzero
main = do
let a = decode "{\"name\":\"Joe\",\"age\":12}" :: Maybe Person
print "aa"
ghc --make playground/aeson.hs yields:
[1 of 1] Compiling Main ( playground/aeson.hs,
playground/aeson.o )
playground/aeson.hs:13:35: Not in scope: `'
playground/aeson.hs:14:40: Not in scope: `<*>'
playground/aeson.hs:17:28: Not in scope: `mzero'
Any idea what I'm doing wrong?
Why is OverloadedString needed here?
Also, I have no idea what <$>, <*>, or mzero are supposed to mean; I'd appreciate tips on where I can read about any of these.
You need to import Control.Applicative and Control.Monad to get <$>, <*> and mzero. <$> just an infix operator for fmap, and <*> is the Applicative operator, you can think of it as a more generalized form of fmap for now. mzero is defined for the MonadPlus class, which is a class representing that a Monad has the operation
mplus :: m a -> m a -> m a
And a "monadic zero" element called mzero. The simplest example is for lists:
> mzero :: [Int]
[]
> [1, 2, 3] `mplus` [4, 5, 6]
[1, 2, 3, 4, 5, 6]
Here mzero is being used to represent a failure to parse. For looking up symbols in the future, I recommend using hoogle or FP Complete's version. Once you find the symbol, read the documentation, the source, and look around for examples of its use on the internet. You'll learn a lot by looking for it yourself, although it'll take you a little while to get used to this kind of research.
The OverloadedStrings extension is needed here because the Aeson library works with the Text type from Data.Text instead of the built-in String type. This extension lets you use string literals as Text instead of String, just as the numeric literal 0 can be an Int, Integer, Float, Double, Complex and other types. OverloadedStrings makes string literals have type Text.String.IsString s => s instead of just String, so it makes it easy to use alternate string-y types.
For <$> you need to import Control.Applicative and for mzero you need to import Control.Monad.
You can determine this by using the web-based version of hoogle (http://www.haskell.org/hoogle) or the command line version:
$ hoogle '<$>'
$ hoogle mzero