Cassava parsing error in Haskell - csv

I'm trying to convert a CSV file into a vector using cassava. The file is the Fisher iris data set, commonly used for machine learning. Each row consists of four doubles and one string.
My code is the following:
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.Csv
import qualified Data.ByteString.Lazy as BS
import qualified Data.Vector as V
data Iris = Iris
  { sepal_length :: !Double
  , sepal_width :: !Double
  , petal_length :: !Double
  , petal_width :: !Double
  , iris_type :: !String
  } deriving (Show, Eq, Read)
instance FromNamedRecord Iris where
  parseNamedRecord r =
    Iris
      <$> r .: "sepal_length"
      <*> r .: "sepal_width"
      <*> r .: "petal_length"
      <*> r .: "petal_width"
      <*> r .: "iris_type"
printIris :: Iris -> IO ()
printIris r = putStrLn $ show (sepal_length r) ++ show (sepal_width r)
  ++ show (petal_length r) ++ show (petal_width r) ++ "hola"
main :: IO ()
main = do
  csvData <- BS.readFile "./iris/test-iris"
  print csvData
  case decodeByName csvData of
    Left err -> putStrLn err
    -- forM_ : O(n) Apply the monadic action to all elements of the vector,
    -- discarding the results.
    Right (h, v) -> V.forM_ v $ printIris
When I run this, the csvData appears to be correctly formatted. The first lines printed by print csvData are the following:
"5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n4.7,3.2,1.3,0.2,Iris-setosa\n4.6,3.1,1.5,0.2,Iris-setosa\n5.0,3.6,1.4,0.2,Iris-setosa\n5.4,3.9,1.7,0.4,Iris-setosa\n4.6,3.4,1.4,0.3,Iris-setosa\n5.0,3.4,1.5,0.2,Iris-setosa\n4.4,2.9,1.4,0.2,Iris-setosa\n4.9,3.1,1.5,0.1,Iris-setosa\n5.4,3.7,1.5,0.2,Iris-setosa\n4.8,3.4,1.6,0.2,Iris-setosa\n4.8,3.0,1.4,0.1,Iris-setosa\n4.3,3.0,1.1,0.1,Iris-setosa\n5.8,4.0,1.2,0.2,Iris-setosa\n5.7,4.4,1.5,0.4,Iris-set
But I get the following error:
parse error (Failed reading: conversion error: no field named "sepal_length") at
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4 (truncated)
Does anybody have any idea why I'm getting this error? The csv has no missing values, and if I replace the line that produces the error with another row, I get the same error.

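It appears your data does not have a header, which decodeByName requires. From its documentation: "The data is assumed to be preceded by a header." Your first line, 5.1,3.5,1.4,0.2,Iris-setosa, is therefore consumed as the header, so no column is named "sepal_length" when the next row is converted.
Add a header line to the file, or use decode NoHeader and the FromRecord type class.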
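As a minimal sketch of the second option, assuming the file really has five positional columns and no header line, a FromRecord instance could look like the following. The camelCase field names and the inline sample data are illustrative and not taken from the original post.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Csv
import qualified Data.ByteString.Lazy as BL
import qualified Data.Vector as V

data Iris = Iris
  { sepalLength :: !Double
  , sepalWidth  :: !Double
  , petalLength :: !Double
  , petalWidth  :: !Double
  , irisType    :: !String
  } deriving (Show, Eq)

-- Positional parsing: columns are addressed by index, so no header is needed.
instance FromRecord Iris where
  parseRecord r
    | V.length r == 5 =
        Iris <$> r .! 0 <*> r .! 1 <*> r .! 2 <*> r .! 3 <*> r .! 4
    | otherwise = fail "expected 5 fields"

main :: IO ()
main = do
  -- Illustrative inline data; the real program would read the file instead.
  let csvData = "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa\n" :: BL.ByteString
  case decode NoHeader csvData of
    Left err -> putStrLn err
    Right v  -> V.forM_ v (print :: Iris -> IO ())
```

The other option keeps the FromNamedRecord instance unchanged and prepends a line such as sepal_length,sepal_width,petal_length,petal_width,iris_type to the file.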

Related

Haskell cassava (Data.Csv): Ignore missing columns/fields

How can I set up cassava to ignore missing columns/fields and fill the respective data type with a default value? Consider this example:
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
import Data.ByteString.Lazy.Char8
import Data.Csv
import Data.Vector
import GHC.Generics
data Foo = Foo {
    a :: String
  , b :: Int
  } deriving (Eq, Show, Generic)
instance FromNamedRecord Foo
decodeAndPrint :: ByteString -> IO ()
decodeAndPrint csv = do
  print $ (decodeByName csv :: Either String (Header, Vector Foo))
main :: IO ()
main = do
  decodeAndPrint "a,b,ignore\nhu,1,pu" -- [1]
  decodeAndPrint "ignore,b,a\npu,1,hu" -- [2]
  decodeAndPrint "ignore,b\npu,1" -- [3]
[1] and [2] work perfectly fine, but [3] fails with
Left "parse error (Failed reading: conversion error: no field named \"a\") at \"\""
How could I make decodeAndPrint capable of handling this incomplete input?
I could of course manipulate the input bytestring, but maybe there is a more elegant solution.
A Solution thanks to the input of Daniel Wagner below:
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative
import Data.ByteString.Lazy.Char8
import Data.Csv
import Data.Vector
import GHC.Generics
data Foo = Foo {
    a :: Maybe String
  , b :: Maybe Int
  } deriving (Eq, Show, Generic)
instance FromNamedRecord Foo where
  parseNamedRecord rec = pure Foo
    <*> ((Just <$> Data.Csv.lookup rec "a") <|> pure Nothing)
    <*> ((Just <$> Data.Csv.lookup rec "b") <|> pure Nothing)
decodeAndPrint :: ByteString -> IO ()
decodeAndPrint csv = do
  print $ (decodeByName csv :: Either String (Header, Vector Foo))
main :: IO ()
main = do
  decodeAndPrint "a,b,ignore\nhu,1,pu" -- [1]
  decodeAndPrint "ignore,b,a\npu,1,hu" -- [2]
  decodeAndPrint "ignore,b\npu,1" -- [3]
(Warning: completely untested! Code is for idea transmission only, not suitable for any use, etc. etc.)
The Parser type demanded by FromNamedRecord is an Alternative, so just toss a default on with (<|>).
-- lookup is Data.Csv.lookup here; qualify it or hide Prelude's lookup to avoid ambiguity.
instance FromNamedRecord Foo where
  parseNamedRecord rec = pure Foo
    <*> (lookup rec "a" <|> pure "missing")
    <*> (lookup rec "b" <|> pure 0)
If you want to know later whether the field was there or not, make your fields rich enough to record that:
data RichFoo = RichFoo
  { a :: Maybe String
  , b :: Maybe Int
  }
instance FromNamedRecord RichFoo where
  parseNamedRecord rec = pure RichFoo
    <*> ((Just <$> lookup rec "a") <|> pure Nothing)
    <*> ((Just <$> lookup rec "b") <|> pure Nothing)
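One thing to keep in mind with (<|>): the fallback fires on any failure, so a column that is present but malformed also ends up as the default. If that distinction matters, a possible sketch is to check for the column in the NamedRecord (a HashMap from column name to raw field) before parsing. The helper name optionalField is hypothetical and not part of cassava, and the example assumes the unordered-containers package is available.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.ByteString.Lazy (ByteString)
import Data.Csv
import qualified Data.HashMap.Strict as HM
import Data.Vector (Vector)

data Foo = Foo
  { a :: Maybe String
  , b :: Maybe Int
  } deriving (Eq, Show)

-- Hypothetical helper (not part of cassava): Nothing when the column is
-- absent; a parse failure when the column is present but malformed.
optionalField :: FromField x => NamedRecord -> Name -> Parser (Maybe x)
optionalField r name = traverse parseField (HM.lookup name r)

instance FromNamedRecord Foo where
  parseNamedRecord r = Foo <$> optionalField r "a" <*> optionalField r "b"

decodeFoo :: ByteString -> Either String (Header, Vector Foo)
decodeFoo = decodeByName

main :: IO ()
main = do
  print (decodeFoo "ignore,b\npu,1")    -- column "a" is absent
  print (decodeFoo "a,b\nhu,notanint")  -- column "b" is present but malformed
```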

Cassava Skip or Ignore Columns

So I have a decent grasp of Cassava for simple use cases, but I have this one .csv that has 25 columns, and I only care about the 2nd and 5th columns. Is there any way to do a partial parse of each line, or do I need to make 20 _ :: Text parameter declarations in the lambda like below?
Right v -> V.forM_ v $ \ (_ :: Text, thingA :: Text, _ :: Text, _ :: Text, _ :: Text, thingB :: Text, _ :: Text, _ :: Text, _ :: Text ..... etc
Edit: Also just discovered there's no instance for a 25-column CSV anyhow, so even my ridiculous 336-character signature can't work.
Edit': It seems that one solution might be named records (suggested as a fix for dealing with super-wide documents).
You can write your own FromRecord instance for this. You just need to write a parseRecord method that takes a Record (which is type Vector Field), extracts the desired columns at indexes 1 and 4, and loads them into your data type.
Something like the following will work:
{-# LANGUAGE OverloadedStrings #-}
import Data.Csv
import qualified Data.Vector as V
import Data.Text (Text)
data SomeFields = SomeFields Text Text deriving (Show)
instance FromRecord SomeFields where
  parseRecord r = SomeFields <$> parseField (r V.! 1) <*> parseField (r V.! 4)
main = do
  print $ (decode NoHeader "1,2,3,4,5,6,7\na,b,c,d,e,f,g\n"
    :: Either String (V.Vector SomeFields))
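Note that V.! is partial, so a row with fewer than five columns would raise an exception rather than produce a Left. A possible variant, sketched here with a hypothetical helper named fieldAt (not part of cassava), uses V.!? so that a short row becomes an ordinary parse failure:

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Csv
import Data.Text (Text)
import qualified Data.Vector as V

data SomeFields = SomeFields Text Text deriving (Show)

-- Hypothetical helper (not part of cassava): look up a column by index and
-- fail inside the Parser when the row is too short.
fieldAt :: FromField a => Record -> Int -> Parser a
fieldAt r i = maybe (fail ("no column at index " ++ show i)) parseField (r V.!? i)

instance FromRecord SomeFields where
  parseRecord r = SomeFields <$> fieldAt r 1 <*> fieldAt r 4

main :: IO ()
main = do
  -- The second row has only three columns, so decoding returns a Left.
  print (decode NoHeader "1,2,3,4,5,6,7\na,b,c\n" :: Either String (V.Vector SomeFields))
```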
Ended up solving it this way.
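{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString.Lazy as BL
import Data.Csv
import Data.Text (Text)
import qualified Data.Vector as V
data ThingAtoThinBMapping = ThingAtoThinBMapping {
    thingA :: Text
  , thingB :: Text
  } deriving (Eq, Show, Read)
instance FromNamedRecord ThingAtoThinBMapping where
  parseNamedRecord r = ThingAtoThinBMapping
    <$> r .: "thing_a"
    <*> r .: "thing_b"
-- decodeOpts is a DecodeOptions value whose definition was not shown in the post.
printABMap = do
  csvData <- BL.readFile "reallywide.csv"
  case decodeByNameWith decodeOpts csvData of
    Left err -> putStrLn err
    Right (h, v) -> do
      putStrLn $ show h
      V.forM_ v $ \ m ->
        putStrLn $ show (thingA m, thingB m)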

Haskell type mismatch with csv parsing

I'm trying to parse a csv file where I want to ignore the first line and the last line, as in:
Someheader
foo, 1000,
bah, 2000,
somefooter
I wrote some Haskell using the cassava library:
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative
import qualified Data.ByteString.Lazy as BL
import Data.Csv
import qualified Data.Vector as V
import Control.Monad (mzero)
data Demand = Demand
  { name :: !String
  , amount :: !Int
  } deriving Show
instance FromRecord Demand where
  parseRecord r
    | length == 2 = Demand <$> r .! 0
                           <*> r .! 1
    | otherwise = mzero
main :: IO ()
main = do
  csvData <- BL.readFile "demand.csv"
  case decode HasHeader csvData of
    Left err -> putStrLn err
    Right (_, v) -> V.forM_ v $ \ p ->
      putStrLn $ (name p) ++ " amount " ++ show (amount p)
When I compile this I get a type mismatch that I can't figure out:
parser.hs:34:15: error:
    • Couldn't match expected type ‘V.Vector a2’
                  with actual type ‘(a1, V.Vector Demand)’
    • In the pattern: (_, v)
      In the pattern: Right (_, v)
My guess is that I haven't unpacked the Vector in the record correctly? Any help, gratefully received.
decode has the type FromRecord a => HasHeader -> ByteString -> Either String (Vector a), according to the cassava documentation. Unlike decodeByName, it does not return the header alongside the records.
So the correct pattern is Right v instead of Right (_, v).
Another problem in the code is that length is a function that you didn't apply to anything in the guard | length == 2 = .... The guard should instead be | length r == 2 = ...
Here's the complete code after those changes:
{-# LANGUAGE OverloadedStrings #-}
import Control.Applicative
import qualified Data.ByteString.Lazy as BL
import Data.Csv
import qualified Data.Vector as V
import Control.Monad (mzero)
data Demand = Demand
  { name :: !String
  , amount :: !Int
  } deriving Show
instance FromRecord Demand where
  parseRecord r
    | length r == 2 = Demand <$> r .! 0
                             <*> r .! 1
    | otherwise = mzero
main :: IO ()
main = do
  csvData <- BL.readFile "demand.csv"
  case decode HasHeader csvData of
    Left err -> putStrLn err
    Right v -> V.forM_ v $ \ p ->
      putStrLn $ (name p) ++ " amount " ++ show (amount p)
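This fixes the type error, but decode still returns Left for the whole file as soon as one record fails to convert, which a one-field footer line would do under the guard above. One possible way to skip such rows, sketched here under the assumption that the data is otherwise clean (no stray spaces or trailing commas), is the incremental interface in Data.Csv.Streaming, which reports each record as its own Either. The helper name successes and the inline sample data are illustrative.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Monad (mzero)
import qualified Data.ByteString.Lazy as BL
import Data.Csv
import qualified Data.Csv.Streaming as S
import qualified Data.Vector as V

data Demand = Demand
  { name   :: !String
  , amount :: !Int
  } deriving Show

instance FromRecord Demand where
  parseRecord r
    | V.length r == 2 = Demand <$> r .! 0 <*> r .! 1
    | otherwise       = mzero

-- Illustrative helper: keep the rows that converted, drop the ones that did not.
successes :: S.Records a -> [a]
successes (S.Cons (Right x) rest) = x : successes rest
successes (S.Cons (Left _) rest)  = successes rest
successes (S.Nil _ _)             = []

main :: IO ()
main = do
  -- Inline, cleaned-up sample data (assumption: no spaces, no trailing commas).
  let csvData = "Someheader\nfoo,1000\nbah,2000\nsomefooter\n" :: BL.ByteString
  mapM_ print (successes (S.decode HasHeader csvData) :: [Demand])
```

Silently dropping every failed row also hides genuinely malformed data, so in practice it may be worth logging the Left values instead of discarding them.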

In Haskell, how do I decode a JSON value that could be of two different types?

I'm trying to parse some bibliographic data, more specifically, pull out the 'subject' field for each item. The data is json and looks something like this:
{"rows": [
{"doc":{"sourceResource": {"subject": ["fiction", "horror"]}}},
{"doc":{"sourceResource": {"subject": "fantasy"}}}
]}
I can pull out 'subject' if every entry is either Text or [Text], but I'm stumped as to how to accommodate both. Here is my program in its current state:
{-# LANGUAGE OverloadedStrings #-}
import Debug.Trace
import Data.Typeable
import Data.Aeson
import Data.Text
import Control.Applicative
import Control.Monad
import qualified Data.ByteString.Lazy as B
import Network.HTTP.Conduit (simpleHttp)
import qualified Data.HashMap.Strict as HM
import qualified Data.Map as Map
jsonFile :: FilePath
jsonFile = "bib.json"
getJSON :: IO B.ByteString
getJSON = B.readFile jsonFile
data Document = Document { rows :: [Row]}
  deriving (Eq, Show)
data Row = SubjectList [Text]
         | SubjectText Text
  deriving (Eq, Show)
instance FromJSON Document where
  parseJSON (Object o) = do
    rows <- parseJSON =<< (o .: "rows")
    return $ Document rows
  parseJSON _ = mzero
instance FromJSON Row where
  parseJSON (Object o) = do
    item <- parseJSON =<< ((o .: "doc") >>=
                           (.: "sourceResource") >>=
                           (.: "subject"))
    -- return $ SubjectText item
    return $ SubjectList item
  parseJSON _ = mzero
main :: IO ()
main = do
  d <- (decode <$> getJSON) :: IO (Maybe Document)
  print d
Any help would be appreciated.
Edit:
the working FromJSON Row instance:
instance FromJSON Row where
  parseJSON (Object o) =
    (SubjectList <$> (parseJSON =<< val)) <|>
    (SubjectText <$> (parseJSON =<< val))
    where
      val = ((o .: "doc") >>=
             (.: "sourceResource") >>=
             (.: "subject"))
  parseJSON _ = mzero
First, look at the type of
((o .: "doc") >>=
(.: "sourceResource") >>=
(.: "subject")) :: FromJSON b => Parser b
We can get out of it anything that's an instance of FromJSON. Now, clearly, this can work for Text or [Text] individually, but your problem is that you want to get either Text or [Text]. Fortunately, it should be fairly easy to deal with this. Rather than letting it decode it for you further, just get a Value out of it. Once you've got a Value, you could decode it as a Text and put it in a SubjectText:
SubjectText <$> parseJSON val :: Parser Row
Or as a [Text] and put it in a SubjectList:
SubjectList <$> parseJSON val :: Parser Row
But wait, either one of these will do, and they have the same output type. Notice that Parser is an instance of Alternative, which lets us say exactly that (“either one will do”). Thus,
(SubjectList <$> parseJSON val) <|> (SubjectText <$> parseJSON val) :: Parser Row
Ta-da! (Actually, it wasn't necessary to pull it out as a Value; we could have instead embedded that long ((o .: "doc") >>= (.: "sourceResource") >>= (.: "subject")) chain into each subexpression. But that's ugly.)
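Put together, the Value-based variant described above might look like the following self-contained sketch. It uses withObject instead of matching on the Object constructor, which is a stylistic choice of this example rather than something the answer prescribes, and it decodes two inline rows instead of reading bib.json.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Control.Applicative ((<|>))
import Data.Aeson
import Data.Aeson.Types (Parser)
import Data.Text (Text)

data Row = SubjectList [Text]
         | SubjectText Text
  deriving (Eq, Show)

instance FromJSON Row where
  parseJSON = withObject "Row" $ \o -> do
    -- Pull the subject out as an undecoded Value first.
    val <- o .: "doc" >>= (.: "sourceResource") >>= (.: "subject") :: Parser Value
    -- Then try the list form, falling back to the single-string form.
    (SubjectList <$> parseJSON val) <|> (SubjectText <$> parseJSON val)

main :: IO ()
main = do
  print (decode "{\"doc\":{\"sourceResource\":{\"subject\":[\"fiction\",\"horror\"]}}}" :: Maybe Row)
  print (decode "{\"doc\":{\"sourceResource\":{\"subject\":\"fantasy\"}}}" :: Maybe Row)
```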

Conduit with aeson / attoparsec, how to exit cleanly without exception once source has no more data

I'm using aeson / attoparsec and conduit / conduit-http connected by conduit-attoparsec to parse JSON data from a file / webserver. My problem is that my pipeline always throws this exception...
ParseError {errorContexts = ["demandInput"], errorMessage = "not enough bytes", errorPosition = 1:1}
...once the socket closes or we hit EOF. Parsing and passing on the resulting data structures through the pipeline etc. works just fine, but it always ends with the sinkParser throwing this exception. I invoke it like this...
j <- CA.sinkParser json
...inside of my conduit that parses ByteStrings into my message structures.
How can I have it just exit the pipeline cleanly once there is no more data (no more top-level expressions)? Is there any decent way to detect / distinguish this exception without having to look at error strings?
Thanks!
EDIT: Example:
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Control.Applicative
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as B8
import qualified Data.Conduit.Attoparsec as CA
import Data.Aeson
import Data.Conduit
import Data.Conduit.Binary
import Control.Monad.IO.Class
data MyMessage = MyMessage String deriving (Show)
parseMessage :: (MonadIO m, MonadResource m) => Conduit B.ByteString m B.ByteString
parseMessage = do
  j <- CA.sinkParser json
  let msg = fromJSON j :: Result MyMessage
  yield $ case msg of
    Success r -> B8.pack $ show r
    Error s -> error s
  parseMessage
main :: IO ()
main =
  runResourceT $ do
    sourceFile "./input.json" $$ parseMessage =$ sinkFile "./out.txt"
instance FromJSON MyMessage where
  parseJSON j =
    case j of
      (Object o) -> MyMessage <$> o .: "text"
      _ -> fail $ "Expected Object - " ++ show j
Sample input (input.json):
{"text":"abc"}
{"text":"123"}
Outputs:
out: ParseError {errorContexts = ["demandInput"], errorMessage = "not enough bytes", errorPosition = 3:1}
and out.txt:
MyMessage "abc"MyMessage "123"
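This is a perfect use case for conduitParserEither. It applies the parser repeatedly until the input is exhausted and yields each result as an Either, so reaching EOF ends the stream normally and a genuine parse error arrives as a Left value instead of an exception:
parseMessage :: (MonadIO m, MonadResource m) => Conduit B.ByteString m B.ByteString
parseMessage =
  CA.conduitParserEither json =$= awaitForever go
  where
    go (Left s) = error $ show s
    go (Right (_, msg)) = yield $ B8.pack $ show msg ++ "\n"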