Parse CSV/TSV file in Haskell - Unicode Characters

I'm trying to parse a tab-delimited file using cassava/Data.Csv in Haskell. However, I run into problems when the file contains "strange" (Unicode) characters: decoding then fails with a parse error (endOfInput).
According to the command-line tool "file", my file has a "UTF-8 Unicode text" encoding. My Haskell code looks like this:
{-# LANGUAGE ScopedTypeVariables #-}
{-# LANGUAGE OverloadedStrings #-}
import qualified Data.ByteString as C
import qualified System.IO.UTF8 as U
import qualified Data.ByteString.UTF8 as UB
import qualified Data.ByteString.Lazy.Char8 as DL
import qualified Codec.Binary.UTF8.String as US
import qualified Data.Text.Lazy.Encoding as EL
import qualified Data.ByteString.Lazy as L
import Data.Text.Encoding as E
-- Handle CSV / TSV files with ...
import Data.Csv
import qualified Data.Vector as V
import Data.Char -- ord
csvFile :: FilePath
csvFile = "myFile.txt"
-- Set delimiter to \t (tabulator)
myOptions = defaultDecodeOptions {
    decDelimiter = fromIntegral (ord '\t')
  }
main :: IO ()
main = do
    csvData <- L.readFile csvFile
    case EL.decodeUtf8' csvData of
        Left err -> print err
        Right dat ->
            case decodeWith myOptions NoHeader $ EL.encodeUtf8 dat of
                Left err -> putStrLn err
                Right v -> V.forM_ v $ \ (category :: String ,
                                          user :: String ,
                                          date :: String,
                                          time :: String,
                                          message :: String) -> do
                    print message
I tried using decodeUtf8', preprocessing (filtering) the input with predicates from Data.Char, and much more. However, the endOfInput error persists.
My CSV-file looks like this:
a - - - RT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel
a - - - Uhm .. wat dan ook ????!!!! 👋
Or more literally:
a\t-\t-\t-\tRT USE " Kenny" • Hahahahahahahahaha. #Emmen #Brandstapel
a\t-\t-\t-\tUhm .. wat dan ook ????!!!! 👋
The problem chars are the 👋 and • (and in my complete file, there are many more of similar characters). What can I do, so that cassava / Data.Csv can read my file properly?
EDIT:
I've created the following preprocessor, which quotes and escapes my Text before decoding it with cassava (see tibbe's answer). There's probably a better way, but so far it works fine!
import qualified Data.Text as T
preprocess :: T.Text -> T.Text
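-- Wrap the whole input in quotes; escaper closes and reopens them around
-- every delimiter, so each field ends up quoted.
preprocess txt = T.cons '\"' $ T.snoc escaped '\"'
  where escaped = T.concatMap escaper txt
escaper :: Char -> T.Text
escaper c
  | c == '\t' = "\"\t\""
  | c == '\n' = "\"\n\""
  | c == '\"' = "\"\""
  | otherwise = T.singleton c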

Per the cassava documentation:
Non-escaped fields may contain any characters except double-quotes, commas, carriage returns, and newlines.
Escaped fields may contain any characters (but double-quotes need to be escaped).
Since the last field in your first record contains double quotes, the whole field needs to be wrapped in double quotes and every double quote inside it needs to be doubled, like so:
a - - - "RT USE "" Kenny"" • Hahahahahahahahaha. #Emmen #Brandstapel"
This code works for me:
{-# LANGUAGE OverloadedStrings #-}
-- Explicit import lists avoid name clashes between these modules and the Prelude.
import Data.ByteString.Lazy (fromStrict)
import Data.Char (ord)
import Data.Csv
import Data.Text.Encoding (encodeUtf8)
import Data.Vector (Vector)
test :: Either String (Vector (String, String, String, String, String))
test = decodeWith
    defaultDecodeOptions {decDelimiter = fromIntegral $ ord '\t' }
    NoHeader
    (fromStrict $ encodeUtf8 "a\t-\t-\t-\t\"RT USE \"\" Kenny\"\" • Hahahahahahahahaha. #Emmen #Brandstapel\"")
(Note that I had to make sure to use encodeUtf8 on a literal of type Text rather than using a ByteString literal directly. The IsString instance for ByteString, which is what converts the literal to a ByteString, truncates each Unicode code point to its low 8 bits.)
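To make the truncation concrete, here is a small sketch that compares the two routes for the bullet character from the question. It assumes the ByteString IsString instance behaves like Data.ByteString.Char8.pack (my understanding of the bytestring package); the names bullet, truncatedBytes and utf8Bytes are made up for illustration.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import qualified Data.Text.Encoding as TE
import Data.Word (Word8)

-- U+2022 BULLET, one of the problem characters from the question.
bullet :: String
bullet = "\x2022"

-- Char8.pack keeps only the low 8 bits of each code point.
truncatedBytes :: [Word8]
truncatedBytes = B.unpack (BC.pack bullet)

-- Going through Text and encodeUtf8 yields the real UTF-8 byte sequence.
utf8Bytes :: [Word8]
utf8Bytes = B.unpack (TE.encodeUtf8 (T.pack bullet))

main :: IO ()
main = do
  print truncatedBytes  -- [34]
  print utf8Bytes       -- [226,128,162]
```

Byte 34 is the ASCII double quote, so a truncated bullet can show up to the CSV parser as a stray quote character.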

Related

Haskell - how do I convert piped-in JSON to a data record?

I'm building a reinforcement learning library where I'd like to pass certain instance information into the executables via a piped JSON.
Using aeson's Simplest.hs, I'm able to get the following basic example working as intended. Note that the parameters are sitting in Main.hs as a String params as a placeholder.
I tried to modify Main.hs so I would pipe the Nim game parameters in from a JSON file via getContents, but am running into the expected [Char] vs. IO String issue. I've tried to read up as much as possible about IO, but can't figure out how to lift my JSON parsing method to deal with IO.
How would I modify the below so that I can work with piped-in JSON?
Main.hs
module Main where
import qualified System.Random as Random
import qualified Data.ByteString.Lazy.Char8 as BL
import qualified Games.Engine as Engine
import qualified Games.IO.Nim as NimIO
import qualified Games.Rules.Nim as Nim
import qualified Games.Learn.ValueIteration as VI
main :: IO ()
main = do
    let params = "{\"players\":[\"Bob\", \"Alice\", \"Charlie\"], \"initialPiles\": [3, 4, 5], \"isMisere\": false}"
    let result = NimIO.decode $ BL.pack params :: Maybe NimIO.NimGame
    case result of
        Nothing -> putStrLn "Parameter errors."
        Just game -> do
            putStrLn "Let's play some Nim! Remainder of code omitted"
Games.IO.Nim.hs
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE RecordWildCards #-}
module Games.IO.Nim
    ( decode
    , NimGame
    , players
    , initialPiles
    , isMisere
    ) where
import Control.Applicative (empty)
import qualified Data.ByteString.Lazy.Char8 as BL
import Data.Aeson
    ( pairs,
      (.:),
      object,
      FromJSON(parseJSON),
      Value(Object),
      KeyValue((.=)),
      ToJSON(toJSON, toEncoding),
      decode)
data NimGame = NimGame
    { players :: [String]
    , initialPiles :: [Int]
    , isMisere :: Bool
    } deriving (Show)
instance ToJSON NimGame where
    toJSON (NimGame playersV initialPilesV isMisereV) = object [ "players" .= playersV,
                                                                 "initialPiles" .= initialPilesV,
                                                                 "isMisere" .= isMisereV]
    toEncoding NimGame{..} = pairs $
        "players" .= players <>
        "initialPiles" .= initialPiles <>
        "isMisere" .= isMisere
instance FromJSON NimGame where
    parseJSON (Object v) = NimGame <$>
        v .: "players" <*>
        v .: "initialPiles" <*>
        v .: "isMisere"
    parseJSON _ = empty
Alternative Main.hs that generates compile error
module Main where
import qualified System.Random as Random
import qualified Data.ByteString.Lazy.Char8 as BL
import qualified Games.Engine as Engine
import qualified Games.IO.Nim as NimIO
import qualified Games.Rules.Nim as Nim
import qualified Games.Learn.ValueIteration as VI
main :: IO ()
main = do
    --let params = "{\"players\":[\"Bob\", \"Alice\", \"Charlie\"], \"initialPiles\": [3, 4, 5], \"isMisere\": false}"
    let params = getContents
    let result = NimIO.decode $ BL.pack params :: Maybe NimIO.NimGame
    case result of
        Nothing -> putStrLn "Parameter errors."
        Just game -> do
            putStrLn "Let's play some Nim!"
Compile Error
(base) randm#pearljam ~/Projects/gameshs $ stack build
gameshs-0.1.0.0: unregistering (local file changes: app/Nim.hs)
gameshs> configure (lib + exe)
Configuring gameshs-0.1.0.0...
gameshs> build (lib + exe)
Preprocessing library for gameshs-0.1.0.0..
Building library for gameshs-0.1.0.0..
Preprocessing executable 'nim-exe' for gameshs-0.1.0.0..
Building executable 'nim-exe' for gameshs-0.1.0.0..
[2 of 2] Compiling Main
/home/randm/Projects/gameshs/app/Nim.hs:17:41: error:
• Couldn't match expected type ‘[Char]’
with actual type ‘IO String’
• In the first argument of ‘BL.pack’, namely ‘params’
In the second argument of ‘($)’, namely ‘BL.pack params’
In the expression:
NimIO.decode $ BL.pack params :: Maybe NimIO.NimGame
|
17 | let result = NimIO.decode $ BL.pack params :: Maybe NimIO.NimGame
| ^^^^^^
-- While building package gameshs-0.1.0.0 (scroll up to its section to see the error) using:
/home/randm/.stack/setup-exe-cache/x86_64-linux-tinfo6/Cabal-simple_mPHDZzAJ_3.2.1.0_ghc-8.10.4 --builddir=.stack-work/dist/x86_64-linux-tinfo6/Cabal-3.2.1.0 build lib:gameshs exe:nim-exe --ghc-options " -fdiagnostics-color=always"
Process exited with code: ExitFailure 1
getContents does not return a String, as you apparently expect, but an IO String: a "program" which, when executed, produces a String. So when you hand this program to BL.pack (and from there to decode), of course that doesn't work: BL.pack converts a String, it cannot convert a program.
So how do you execute this program to obtain the String? There are two ways: either you make it part of another program or you call it main and it becomes your entry point.
In your case, the sensible thing to do would be to make getContents part of your main program. To do that, use the left arrow <-, like this:
main = do
    params <- getContents
    let result = NimIO.decode $ BL.pack params :: Maybe NimIO.NimGame
    ...
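As a variation on the same idea, the input can be read as a lazy ByteString straight away, which avoids the String round trip through BL.pack. The following is only a sketch: it assumes aeson's eitherDecode and a Generic-derived FromJSON instance instead of the hand-written one in Games.IO.Nim, and parseGame is a made-up helper name.

```haskell
{-# LANGUAGE DeriveGeneric #-}
import Data.Aeson (FromJSON, eitherDecode)
import qualified Data.ByteString.Lazy as BL
import GHC.Generics (Generic)

-- Same shape as the NimGame record from the question, repeated here so the
-- example is self-contained.
data NimGame = NimGame
  { players :: [String]
  , initialPiles :: [Int]
  , isMisere :: Bool
  } deriving (Show, Generic)

-- The Generic default uses the field names as JSON keys.
instance FromJSON NimGame

-- Pure parsing step: no IO involved, so it is easy to test on its own.
parseGame :: BL.ByteString -> Either String NimGame
parseGame = eitherDecode

main :: IO ()
main = do
  input <- BL.getContents          -- run the "program" and bind its result
  case parseGame input of
    Left err -> putStrLn ("Parameter errors: " ++ err)
    Right game -> print game
```

Assuming the executable is called nim-exe as in the build log, it could be fed with something like `cat params.json | nim-exe`, where params.json is a hypothetical file holding the JSON from the question.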

Mustache not rendering JSON value as JSON encoded string

The full example is here:
{-# LANGUAGE OverloadedStrings #-}
module Test2 where
import Data.Aeson
import Text.Mustache
main :: IO ()
main = do
    let example = Data.Aeson.object [ "key" .= (5 :: Integer), "somethingElse" .= (2 :: Integer) ]
    print . encode $ example
    print ("Start" :: String)
    case compileTemplate "" "{{{jsonData}}}" of
        Right x -> do
            print $ substituteValue x (Text.Mustache.object ["jsonData" ~= example])
        Left e -> error . show $ e
The above produces the following output:
"{\"somethingElse\":2,\"key\":5}"
"Start"
"fromList [(\"somethingElse\",2.0),(\"key\",5.0)]"
My expectation is it would produce:
"{\"somethingElse\":2,\"key\":5}"
"Start"
"{\"somethingElse\":2,\"key\":5}"
Mustache doesn't seem to support straight substitution of JSON objects. Setting up a similar example here, I receive
[object Object]
as output. Not identical to yours, but it indicates that the issue is not necessarily with the Haskell implementation.
In other words, I believe the issue is with your template {{{jsonData}}}.
If you change it to {{{jsonData.somethingElse}}}, it works fine (I know this is not what you want).
Alternatively, encode the JSON data as text prior to passing it to the substitution function as suggested here. Basically something like this:
substituteValue x (Text.Mustache.object ["jsonData" ~= (encodeToLazyText jsonData)])
This results in your desired output. encodeToLazyText is found in Data.Aeson.Text.
Working code:
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.Aeson ((.=))
import qualified Data.Aeson as A
import Data.Aeson.Text (encodeToLazyText)
import Text.Mustache ((~=))
import qualified Text.Mustache as M
import qualified Text.Mustache.Types as M
main :: IO ()
main = do
  print . A.encode $ jsonData
  putStrLn "Start"
  case M.compileTemplate "" "in mustache: {{{jsonData}}}" of
    Right template ->
      print (M.substituteValue template mustacheVals)
    Left e ->
      error . show $ e
jsonData :: A.Value
jsonData =
  A.object
    [ "key" .= (5 :: Integer)
    , "somethingElse" .= (2 :: Integer)
    ]
mustacheVals :: M.Value
mustacheVals =
  M.object
    [ "jsonData" ~= encodeToLazyText jsonData
    ]

Creating a csv file in haskell

I suspect this is a bit of a newbie question...
So I have a list myList that I would like to write to a csv file, using the cassava library:
import qualified Data.ByteString.Lazy as BL
import qualified Data.Vector as V
import Data.Csv
main = BL.writeFile "myFileLocation" (encode $ V.fromList myList)
As far as I can establish, encode has type ToRecord a => Vector a -> ByteString, yet I'm getting the following error:
Couldn't match expected type `[a0]'
with actual type `V.Vector (a,b,c,d)'
In the return type of a call of `V.fromList'
In the second argument of `($)', namely `V.fromList myList'
In the second argument of `BL.writeFile', namely
`(encode $ V.fromList myList)'
I'm confused!
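The error message itself hints at a likely cause: the compiler expects `[a0]`, which suggests that the installed cassava version has encode :: ToRecord a => [a] -> ByteString, taking a plain list rather than a Vector. Under that assumption, a minimal sketch is to drop V.fromList and pass the list directly. The contents of myList below are invented sample data, since the question does not show them.

```haskell
import qualified Data.ByteString.Lazy as BL
import Data.Csv (encode)

-- Hypothetical sample data; the original myList is not shown in the question.
myList :: [(Int, String, Double, String)]
myList = [(1, "alpha", 1.5, "x"), (2, "beta", 2.5, "y")]

-- encode consumes the list as-is; each 4-tuple becomes one CSV record.
csvBytes :: BL.ByteString
csvBytes = encode myList

main :: IO ()
main = BL.writeFile "myFileLocation" csvBytes
```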

Using blaze-html, how do I create leading non-breaking spaces in html

Using the package blaze-html, I want to create some html that looks like this
<p>&nbsp;&nbsp;Some indented text
I can't figure out how to create the non-breaking spaces. What's the best way to do that?
One way of course is to give a string containing a Haskell-encoded version of that character to toHtml. Another way is to use preEscapedToMarkup:
preEscapedToMarkup "&nbsp;"
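The two approaches mentioned above might look roughly like this side by side. This is a sketch rather than a definitive recipe: it uses preEscapedToHtml (the Html-typed counterpart of preEscapedToMarkup) and the Text.Blaze.Html.Renderer.String renderer, and the names viaCharacter and viaEntity are made up for illustration.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Text (Text)
import Text.Blaze.Html5 (Html, p, preEscapedToHtml, toHtml)
import Text.Blaze.Html.Renderer.String (renderHtml)

-- Option 1: put the character itself into the text. "\160" is the Haskell
-- escape for U+00A0 NO-BREAK SPACE; the rendered HTML then contains the raw
-- character, not the entity.
viaCharacter :: Html
viaCharacter = p (toHtml ("\160\160Some indented text" :: Text))

-- Option 2: emit the entity verbatim, bypassing blaze's escaping.
viaEntity :: Html
viaEntity = p (preEscapedToHtml ("&nbsp;&nbsp;" :: Text) <> "Some indented text")

main :: IO ()
main = do
  putStrLn (renderHtml viaCharacter)
  putStrLn (renderHtml viaEntity)
```

Both variants should display the same way in a browser; they differ only in whether the HTML source contains the literal character or the `&nbsp;` entity.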
Following the advice of ertes, I tried "to give a string containing a Haskell-encoded version of the character". This code is verbose so I could better understand how the types fit together.
{-# LANGUAGE OverloadedStrings #-}
-- * base
import Data.Monoid ((<>))
import Data.Char (chr)
-- * text
import qualified Data.Text.Lazy as L (Text)
import Data.Text (Text, singleton)
import Data.Text.Lazy.Read (hexadecimal)
-- * blaze-html
import Text.Blaze.Html5
import qualified Text.Blaze.Html5 as H
import Text.Blaze.Renderer.String
nbspHex = "00A0" :: L.Text -- Unicode codepoint for nbsp
Right (nbspInt, _) = hexadecimal nbspHex
nbspChar = chr nbspInt :: Char
nbspTxt = singleton nbspChar :: Text
nbspHtml = toHtml nbspTxt :: Html
someHtml :: Html
someHtml = docTypeHtml $
    body $ do
        p "some text"
        p (nbspHtml <> nbspHtml <> "some indented text")
main :: IO ()
main = do
    let s = renderHtml $ someHtml
    putStrLn s
This code displayed the indented "some indented text" in the browser as I had hoped, but looking at the html, I didn't see any "&nbsp;", as I had expected. So this isn't really the answer I was looking for.
Thanks to Jukka for pointing out a better solution.

Fault tolerant JSON parsing

I'm using Data.Aeson to parse some JSON into a Record type. From time to time data is added to the JSON and this breaks my code as Aeson complains something to the effect of:
expected Object with 21
name/value pairs but got 23 name/value
I'd really prefer to parse the JSON in a fault tolerant way -- I don't care if more fields are added to the JSON at a later date, just parse whatever you can! Is there a way to achieve this fault tolerance? Here's my code:
myRecordFromJSONString :: BS.ByteString -> Maybe MyRecord
myRecordFromJSONString s = case Data.Attoparsec.parse json s of
    Done _rest res -> Data.Aeson.Types.parseMaybe parseJSON res
    _ -> Nothing
I should add that I'm using deriveJSON from Data.Aeson.TH to generate the parsing code. If I write the FromJSON code manually it's fault tolerant but I'd like to not have to do that...
If you are using GHC 7.2 or 7.4, the new generics support in aeson doesn't check for extra fields. I'm not sure if this is by design or not but we use it for the same reason.
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson
import qualified Data.Aeson.Types
import Data.Attoparsec
import qualified Data.ByteString as BS
import Data.ByteString.Char8 ()
import GHC.Generics
data MyRecord = MyRecord
    { field1 :: Int
    } deriving (Generic, Show)
instance FromJSON MyRecord
myRecordFromJSONString :: BS.ByteString -> Maybe MyRecord
myRecordFromJSONString s = case Data.Attoparsec.parse json s of
    Done _rest res -> Data.Aeson.Types.parseMaybe parseJSON res
    _ -> Nothing
main :: IO ()
main = do
    let parsed = myRecordFromJSONString "{ \"field1\": 1, \"field2\": 2 }"
    print parsed
Running this would fail with the TH derived instance due to 'field2' not existing in the record. The Generic instance returns the desired result:
Just (MyRecord {field1 = 1})
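For comparison, the hand-written route that the question mentions as already being fault tolerant could look like the sketch below. It assumes a reasonably recent aeson that provides withObject and the lazy-ByteString decode; parseRecord is a made-up helper name. The instance only looks up the keys it needs, so additional keys in the input are simply ignored.

```haskell
{-# LANGUAGE OverloadedStrings #-}
import Data.Aeson (FromJSON (parseJSON), decode, withObject, (.:))
import qualified Data.ByteString.Lazy as BL

data MyRecord = MyRecord
  { field1 :: Int
  } deriving (Show, Eq)

-- Only "field1" is looked up; any other keys in the object are never inspected.
instance FromJSON MyRecord where
  parseJSON = withObject "MyRecord" $ \o ->
    MyRecord <$> o .: "field1"

parseRecord :: BL.ByteString -> Maybe MyRecord
parseRecord = decode

main :: IO ()
main = print (parseRecord "{ \"field1\": 1, \"field2\": 2 }")
-- prints: Just (MyRecord {field1 = 1})
```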