Cassava Skip or Ignore Columns - csv

So I have a decent grasp on Cassava for simple use cases, but I have this one .csv that has 25 columns, and I only care about the 2nd and 5th columns. Is there any way to do a partial parse of each line, or do I need to make 20 _ :: Text parameter declarations in the lambda like below?
Right v -> V.forM_ v $ \ (_ :: Text, thingA :: Text, _ :: Text, _ :: Text, _ :: Text, thingB :: Text, _ :: Text, _ :: Text, _ :: Text ..... etc
Edit: Also just discovered there's no instance for a 25-column CSV anyhow, so even my ridiculous 336-character signature can't work.
Edit 2: Seems that one solution might be named records (suggested here as a fix for dealing with super-wide documents).

You can write your own FromRecord instance for this. You just need to write a parseRecord method that takes a Record (which is type Vector Field), extracts the desired columns at indexes 1 and 4, and loads them into your data type.
Something like the following will work:
{-# LANGUAGE OverloadedStrings #-}

import Data.Csv
import qualified Data.Vector as V
import Data.Text (Text)

data SomeFields = SomeFields Text Text deriving (Show)

instance FromRecord SomeFields where
    parseRecord r = SomeFields <$> parseField (r V.! 1) <*> parseField (r V.! 4)

main = do
    print $ (decode NoHeader "1,2,3,4,5,6,7\na,b,c,d,e,f,g\n"
               :: Either String (V.Vector SomeFields))
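For the sample input above, this should print something along the lines of Right [SomeFields "2" "5",SomeFields "b" "e"], since cassava indexes fields from zero.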

Ended up solving it this way.
{-# LANGUAGE OverloadedStrings #-}

import Data.Csv
import Data.Text (Text)
import qualified Data.ByteString.Lazy as BL
import qualified Data.Vector as V

data ThingAtoThingBMapping = ThingAtoThingBMapping
    { thingA :: Text
    , thingB :: Text
    } deriving (Eq, Show, Read)

instance FromNamedRecord ThingAtoThingBMapping where
    parseNamedRecord r = ThingAtoThingBMapping
        <$> r .: "thing_a"
        <*> r .: "thing_b"

-- decodeOpts wasn't shown in the original post; defaultDecodeOptions is assumed here.
decodeOpts = defaultDecodeOptions

printABMap :: IO ()
printABMap = do
    csvData <- BL.readFile "reallywide.csv"
    case decodeByNameWith decodeOpts csvData of
        Left err -> putStrLn err
        Right (h, v) -> do
            putStrLn $ show h
            V.forM_ v $ \m ->
                putStrLn $ show (thingA m, thingB m)

Related

Haskell cassava (Data.Csv): Ignore missing columns/fields

How can I set up cassava to ignore missing columns/fields and fill the respective data type with a default value? Consider this example:
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}

import Data.ByteString.Lazy.Char8
import Data.Csv
import Data.Vector
import GHC.Generics

data Foo = Foo
    { a :: String
    , b :: Int
    } deriving (Eq, Show, Generic)

instance FromNamedRecord Foo

decodeAndPrint :: ByteString -> IO ()
decodeAndPrint csv = do
    print $ (decodeByName csv :: Either String (Header, Vector Foo))

main :: IO ()
main = do
    decodeAndPrint "a,b,ignore\nhu,1,pu"   -- [1]
    decodeAndPrint "ignore,b,a\npu,1,hu"   -- [2]
    decodeAndPrint "ignore,b\npu,1"        -- [3]
[1] and [2] work perfectly fine, but [3] fails with
Left "parse error (Failed reading: conversion error: no field named \"a\") at \"\""
How could I make decodeAndPrint capable of handling this incomplete input?
I could of course manipulate the input bytestring, but maybe there is a more elegant solution.
A solution, thanks to the input of Daniel Wagner below:
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative
import Data.ByteString.Lazy.Char8
import Data.Csv
import Data.Vector
import GHC.Generics

data Foo = Foo
    { a :: Maybe String
    , b :: Maybe Int
    } deriving (Eq, Show, Generic)

instance FromNamedRecord Foo where
    parseNamedRecord rec = pure Foo
        <*> ((Just <$> Data.Csv.lookup rec "a") <|> pure Nothing)
        <*> ((Just <$> Data.Csv.lookup rec "b") <|> pure Nothing)

decodeAndPrint :: ByteString -> IO ()
decodeAndPrint csv = do
    print $ (decodeByName csv :: Either String (Header, Vector Foo))

main :: IO ()
main = do
    decodeAndPrint "a,b,ignore\nhu,1,pu"   -- [1]
    decodeAndPrint "ignore,b,a\npu,1,hu"   -- [2]
    decodeAndPrint "ignore,b\npu,1"        -- [3]
(Warning: completely untested! Code is for idea transmission only, not suitable for any use, etc. etc.)
The Parser type demanded by FromNamedRecord is an Alternative, so just toss a default on with (<|>).
instance FromNamedRecord Foo where
    parseNamedRecord rec = pure Foo
        <*> (lookup rec "a" <|> pure "missing")
        <*> (lookup rec "b" <|> pure 0)
If you want to know later whether the field was there or not, make your fields rich enough to record that:
data RichFoo = RichFoo
    { a :: Maybe String
    , b :: Maybe Int
    }

instance FromNamedRecord RichFoo where
    parseNamedRecord rec = pure RichFoo
        <*> ((Just <$> lookup rec "a") <|> pure Nothing)
        <*> ((Just <$> lookup rec "b") <|> pure Nothing)
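As an aside, the (Just <$> p) <|> pure Nothing pattern is exactly what optional from Control.Applicative does, so the same instance can be written a little more compactly; a minimal sketch:

instance FromNamedRecord RichFoo where
    parseNamedRecord rec = RichFoo
        <$> optional (lookup rec "a")   -- Nothing if the column is missing
        <*> optional (lookup rec "b")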

Cassava parsing error in Haskell

I'm trying to convert a CSV into a vector using cassava. The CSV I'm trying to convert is the Fisher iris data set, used for machine learning. It consists of four doubles and one string.
My code is the following:
{-# LANGUAGE OverloadedStrings #-}

module Main where

import Data.Csv
import qualified Data.ByteString.Lazy as BS
import qualified Data.Vector as V

data Iris = Iris
    { sepal_length :: !Double
    , sepal_width  :: !Double
    , petal_length :: !Double
    , petal_width  :: !Double
    , iris_type    :: !String
    } deriving (Show, Eq, Read)

instance FromNamedRecord Iris where
    parseNamedRecord r =
        Iris
            <$> r .: "sepal_length"
            <*> r .: "sepal_width"
            <*> r .: "petal_length"
            <*> r .: "petal_width"
            <*> r .: "iris_type"

printIris :: Iris -> IO ()
printIris r = putStrLn $ show (sepal_length r) ++ show (sepal_width r)
    ++ show (petal_length r) ++ show (petal_length r) ++ "hola"

main :: IO ()
main = do
    csvData <- BS.readFile "./iris/test-iris"
    print csvData
    case decodeByName csvData of
        Left err -> putStrLn err
        -- forM_ : O(n) Apply the monadic action to all elements of the vector.
        Right (h, v) -> V.forM_ v $ printIris
When I run this, the csvData seems to be correctly formatted; the first lines from print csvData look like this:
"5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris- setosa\n4.7,3.2,1.3,0.2,Iris-setosa\n4.6,3.1,1.5,0.2,Iris-setosa\n5.0,3.6,1.4,0.2,Iris-setosa\n5.4,3.9,1.7,0.4,Iris-setosa\n4.6,3.4,1.4,0.3,Iris-setosa\n5.0,3.4,1.5,0.2,Iris-setosa\n4.4,2.9,1.4,0.2,Iris-setosa\n4.9,3.1,1.5,0.1,Iris-setosa\n5.4,3.7,1.5,0.2,Iris-setosa\n4.8,3.4,1.6,0.2,Iris-setosa\n4.8,3.0,1.4,0.1,Iris-setosa\n4.3,3.0,1.1,0.1,Iris-setosa\n5.8,4.0,1.2,0.2,Iris-setosa\n5.7,4.4,1.5,0.4,Iris-set
But I get the following error:
parse error (Failed reading: conversion error: no field named "sepal_length") at
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4 (truncated)
Does anybody have any idea why I might be getting this error? The CSV has no missing values, and if I replace the line which produces the error with another row, I get the same error.
It appears your data does not have a header row, which decodeByName assumes:
The data is assumed to be preceded by a header.
Add a header, or use decode NoHeader and the FromRecord type class.
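If you go the NoHeader route, a minimal sketch (assuming the five columns appear in the order used by the Iris record above, and adding import Control.Monad (mzero) to the question's code) could look like this:

import Control.Monad (mzero)

instance FromRecord Iris where
    parseRecord r
        | V.length r == 5 = Iris
            <$> r .! 0
            <*> r .! 1
            <*> r .! 2
            <*> r .! 3
            <*> r .! 4
        | otherwise = mzero  -- reject rows with the wrong number of fields

-- then, in main, instead of decodeByName:
--     case (decode NoHeader csvData :: Either String (V.Vector Iris)) of
--         Left err -> putStrLn err
--         Right v  -> V.forM_ v printIris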

Trouble with JSON (Data.Aeson)

I'm new to Haskell and in order to learn the language I am working on a project that involves dealing with JSON. I am currently getting the feeling Haskell is the wrong language for the job, but that isn't the point here.
I've been struggling to understand how this works for a few days. I have searched and everything I have found does not seem to work. Here's the issue:
I have some JSON in the following format:
>>> less "path/to/json"
{
  "stringA1_stringA2": {"stringA1": floatA1,
                        "stringA2": floatA2},
  "stringB1_stringB2": {"stringB1": floatB1,
                        "stringB2": floatB2}
  ...
}
Here floatX1 and floatX2 are actually strings of the form "0.535613567", "1.221362183" etc. What I want to do is parse this into the following data
data Mydat = Mydat { name :: String, num :: Float} deriving (Show)
where name would correspond to "stringX1_stringX2" and num to floatX1 for X = A,B,...
So far I have reached a 'solution' which feels fairly hackish and convoluted and doesn't work properly.
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE DeriveGeneric #-}

import Data.Functor
import Data.Monoid
import Data.Aeson
import Data.List
import Data.Text
import Data.Map (Map)
import qualified Data.HashMap.Strict as DHM
--import qualified Data.HashMap as DHM
import qualified Data.ByteString.Lazy as LBS
import System.Environment
import GHC.Generics
import Text.Read

data Mydat = Mydat {name :: String, num :: Float} deriving (Show)

test s = do
    d <- LBS.readFile s
    let v = decode d :: Maybe (DHM.HashMap String Object)
    case v of
        -- Just v -> print v
        Just v -> return $ Prelude.map dataFromList $ DHM.toList $ DHM.map (DHM.lookup "StringA1") v

good = ['1','2','3','4','5','6','7','8','9','0','.']

f x = elem x good

dataFromList :: (String, Maybe Value) -> Mydat
dataFromList (a,b) = Mydat a (read (Prelude.filter f (show b)) :: Float)
Now I can compile this and run
test "path/to/json"
in GHCi, and it prints a list of Mydat values for the case where "stringX1" = "stringA1" for all X. In reality there are two values for "stringX1", so aside from the hackiness this is not satisfactory. There must be a better way to do this. I get that I probably need to write my own parser, but I am confused about how this works, so any suggestions would be great. Thanks in advance.
The structure of your JSON is pretty nasty, but here's a basic working solution:
#!/usr/bin/env stack
-- stack --resolver lts-11.5 script --package containers --package aeson
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Map as Map
import qualified Data.Aeson as Aeson

data Mydat = Mydat
    { name :: String
    , num  :: Float
    } deriving (Show)

instance Eq Mydat where
    (Mydat _ x1) == (Mydat _ x2) = x1 == x2

instance Ord Mydat where
    (Mydat _ x1) `compare` (Mydat _ x2) = x1 `compare` x2

type MydatRaw = Map.Map String (Map.Map String String)

processRaw :: MydatRaw -> [Mydat]
processRaw = Map.foldrWithKey go []
  where
    go key value accum =
        accum ++ (Mydat key . read <$> Map.elems value)

main :: IO ()
main = do
    let json = "{\"stringA1_stringA2\":{\"stringA1\":\"0.1\",\"stringA2\":\"0.2\"}}"
    print $ fmap processRaw (Aeson.eitherDecode json)
Note that read is partial and generally not a good idea. But I'll leave it to you to flesh out a safer version :)
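For illustration, one possible safer variant (my own sketch, not part of the original answer) swaps read for readMaybe and silently drops entries that don't parse; processRawSafe is a hypothetical name, and the Mydat/MydatRaw types are the ones defined above:

import Data.Maybe (mapMaybe)
import Text.Read (readMaybe)

processRawSafe :: MydatRaw -> [Mydat]
processRawSafe = Map.foldrWithKey go []
  where
    go key value accum =
        -- readMaybe returns Nothing for non-numeric strings; mapMaybe drops those
        accum ++ mapMaybe (fmap (Mydat key) . readMaybe) (Map.elems value)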
As I commented, the best thing would probably be to make your JSON file well-formed, in the sense that the float fields should really be floats, not strings.
If that's not an option, I would recommend you write out the type that the JSON file seems to represent as simply as possible (but without dynamic Objects), and then convert that to the type you actually want.
import Data.Map (Map)
import qualified Data.Map as Map

type GarbledJSON = Map String (Map String String)
-- ^ you could also stick with hash maps for everything, but
--   usually `Map` is actually more sensible in Haskell.

data MyDat = MyDat {name :: String, num :: Float} deriving (Show)

test :: FilePath -> IO [MyDat]
test s = do
    d <- LBS.readFile s
    case decode d :: Maybe GarbledJSON of
        Just v -> return [ MyDat iName ( read . filter (`elem` good)
                                           $ iVals Map.! valKey )
                         | (iName, iVals) <- Map.toList v
                         , let valKey = takeWhile (/='_') iName ]
Note that this will crash completely if any of the items doesn't contain the first part of the name as a string in float format, and it will likely give bogus items once you filter out the characters that aren't in good. If you just want to ignore any malformed items (which is also not a very clean approach...), you can do it this way:
test :: FilePath -> IO [MyDat]
test s = do
    d <- LBS.readFile s
    return $ case decode d :: Maybe GarbledJSON of
        Just v  -> [ MyDat iName iVal
                   | (iName, iVals) <- Map.toList v
                   , let valKey = takeWhile (/='_') iName
                   , Just iValStr <- [iVals Map.!? valKey]
                   , [(iVal,"")] <- [reads iValStr] ]
        Nothing -> []

Haskell type mismatch with csv parsing

I'm trying to parse a csv file where I want to ignore the first line and the last line, as in:
Someheader
foo, 1000,
bah, 2000,
somefooter
I wrote some Haskell using the cassava library:
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative
import qualified Data.ByteString.Lazy as BL
import Data.Csv
import qualified Data.Vector as V
import Control.Monad (mzero)

data Demand = Demand
    { name   :: !String
    , amount :: !Int
    } deriving Show

instance FromRecord Demand where
    parseRecord r
        | length == 2 = Demand <$> r .! 0
                               <*> r .! 1
        | otherwise = mzero

main :: IO ()
main = do
    csvData <- BL.readFile "demand.csv"
    case decode HasHeader csvData of
        Left err -> putStrLn err
        Right (_, v) -> V.forM_ v $ \p ->
            putStrLn $ (name p) ++ " amount " ++ show (amount p)
When I run this get a type mismatch, that I can't figure out:
parser.hs:34:15: error:
• Couldn't match expected type ‘V.Vector a2’
with actual type ‘(a1, V.Vector Demand)’
• In the pattern: (_, v)
In the pattern: Right (_, v)
My guess is that I haven't unpacked the Vector in the record correctly. Any help gratefully received.
decode has the type FromRecord a => HasHeader -> ByteString -> Either String (Vector a), based on the cassava documentation.
So the correct pattern is Right v instead of Right (_, v).
Another problem in the code is that length is a function and you didn't apply it to anything in the guard | length == 2 = .... I believe the correct code should instead be | length r == 2 = ...
Here's the complete code after those changes:
{-# LANGUAGE OverloadedStrings #-}

import Control.Applicative
import qualified Data.ByteString.Lazy as BL
import Data.Csv
import qualified Data.Vector as V
import Control.Monad (mzero)

data Demand = Demand
    { name   :: !String
    , amount :: !Int
    } deriving Show

instance FromRecord Demand where
    parseRecord r
        | length r == 2 = Demand <$> r .! 0
                                 <*> r .! 1
        | otherwise = mzero

main :: IO ()
main = do
    csvData <- BL.readFile "demand.csv"
    case decode HasHeader csvData of
        Left err -> putStrLn err
        Right v -> V.forM_ v $ \p ->
            putStrLn $ (name p) ++ " amount " ++ show (amount p)

Writing custom instances for JSON date data in Aeson

I have JSON date data in the following form:
{"date": "2015-04-12"}
and a corresponding haskell type:
data Date = Date
    { year  :: Int
    , month :: Int
    , day   :: Int
    }
How can I write the custom FromJSON and ToJSON functions for the Aeson library?
Deriving the instances does not work because of the formatting.
Why reinvent the wheel? There is a semi-standard representation for what you call Date in the time package: it is called Day. It gets better: not only does that package give you the utilities for parsing Day from the format you have, those utilities are already wired up in aeson, which ships ToJSON and FromJSON instances for Day:
ghci> :set -XOverloadedStrings
ghci> import Data.Time.Calendar
ghci> import Data.Aeson
ghci> fromJSON "2015-04-12" :: Result Day
Success 2015-04-12
ghci> toJSON (fromGregorian 2015 4 12)
String "2015-04-12"
If you really want to extract the days, months, and years, you can always use toGregorian :: Day -> (Integer, Int, Int). Sticking to the standard abstraction is probably a good long-term choice though.
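If you do end up needing separate year/month/day fields, a minimal sketch of the conversion (toDate and fromDate are hypothetical helper names; the Date record is the one from the question):

import Data.Time.Calendar (Day, fromGregorian, toGregorian)

-- the Date record from the question
data Date = Date { year :: Int, month :: Int, day :: Int }

-- Convert a Day into the question's Date type.
toDate :: Day -> Date
toDate d = let (y, m, dd) = toGregorian d
           in Date (fromInteger y) m dd

-- And back again.
fromDate :: Date -> Day
fromDate (Date y m dd) = fromGregorian (toInteger y) m dd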
You have to convert y/m/d to/from a string:
{-# LANGUAGE OverloadedStrings #-}
{-# OPTIONS_GHC -fno-warn-tabs #-}

import Control.Monad
import Data.Aeson
import qualified Data.Text as T
import Text.Read (readMaybe)
-- import qualified Data.Attoparsec.Text as A

data Date = Date Int Int Int deriving (Read, Show)

instance ToJSON Date where
    toJSON (Date y m d) = toJSON $ object [
        "date" .= T.pack (str 4 y ++ "-" ++ str 2 m ++ "-" ++ str 2 d)]
      where
        str n = pad . show where
            pad s = replicate (n - length s) '0' ++ s

instance FromJSON Date where
    parseJSON = withObject "date" $ \v -> do
        str <- v .: "date"
        let ps@(~[y, m, d]) = T.split (== '-') str
        guard (length ps == 3)
        Date <$> readNum y <*> readNum m <*> readNum d
      where
        readNum = maybe (fail "not num") return . readMaybe . T.unpack

    -- -- or with attoparsec
    -- parseJSON = withObject "date" $ \v -> do
    --     str <- v .: "date"
    --     [y, m, d] <- either fail return $
    --         A.parseOnly (A.decimal `A.sepBy` A.char '-') str
    --     return $ Date y m d
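A quick GHCi check of the round trip with these instances should look roughly like this (my own expected output, so treat it as approximate):

ghci> :set -XOverloadedStrings
ghci> encode (Date 2015 4 12)
"{\"date\":\"2015-04-12\"}"
ghci> decode "{\"date\":\"2015-04-12\"}" :: Maybe Date
Just (Date 2015 4 12)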