Records from <tr>s in an HTML table using Arrows and HXT in Haskell

I'm looking to extract records from a very well-formed HTML table using HXT. I've reviewed a couple of examples on SO and the HXT documentation, such as:
Extracting Values from a Subtree
http://adit.io/posts/2012-04-14-working_with_HTML_in_haskell.html
https://www.schoolofhaskell.com/school/advanced-haskell/xml-parsing-with-validation
Running Haskell HXT outside of IO?
extract multiples html tables with hxt
Parsing html in haskell
http://neilbartlett.name/blog/2007/08/01/haskell-explaining-arrows-through-xml-transformationa/
https://wiki.haskell.org/HXT/Practical/Simple2
https://wiki.haskell.org/HXT/Practical/Simple1
Group html table rows with HXT in Haskell
Parsing multiple child nodes in Haskell with HXT
My problem is:
I want to identify a table uniquely by a known id, and then for each
tr within that table, create a record object and return this as a list
of records.
Here's my HTML
<!DOCTYPE html>
<head>
<title>FakeHTML</title>
</head>
<body>
<table id="fakeout-dont-get-me">
<thead><tr><td>Null</td></tr></thead>
<tbody><tr><td>Junk!</td></tr></tbody>
</table>
<table id="Greatest-Table">
<thead>
<tr><td>Name</td><td>Favorite Rock</td></tr>
</thead>
<tbody>
<tr id="rock1">
<td>Fred</td>
<td>Igneous</td>
</tr>
<tr id="rock2">
<td>Bill</td>
<td>Sedimentary</td>
</tr>
</tbody>
</table>
</body>
</html>
Here's the code I'm trying, along with 2 different approaches to parsing this. First, imports ...
{-# LANGUAGE Arrows, OverloadedStrings, DeriveDataTypeable, FlexibleContexts #-}
import Text.XML.HXT.Core
import Text.HandsomeSoup
import Text.XML.HXT.XPath.XPathEval
import Data.Tree.NTree.TypeDefs
import Text.XML.HXT.XPath.Arrows
What I want is a list of Rockrecs, eg from...
recs = [("rock1", "Name", "Fred", "Favorite Rock", "Igneous"),
        ("rock2", "Name", "Bill", "Favorite Rock", "Sedimentary")]
data Rockrec = Rockrec { rockID   :: String,
                         rockName :: String,
                         rockFav  :: String } deriving Show
rocks = [(\(a,_,b,_,c) -> Rockrec a b c ) r | r <- recs]
-- [Rockrec {rockID = "rock1", rockName = "Fred", rockFav = "Igneous"},
-- Rockrec {rockID = "rock2", rockName = "Bill", rockFav = "Sedimentary"}]
Here's my first way, which uses a bind on runLA after I get back a list of trees ([XmlTree]). That is, I do a first parse just to get the right table, then I process the rows of that tree after the first grab.
Attempt 1
getTab = do
  dt <- Prelude.readFile "fake.html"
  let html = parseHtml dt
  tab <- runX $ html //> hasAttrValue "id" (== "Greatest-Table")
  return tab
-- hmm, now this gets tricky...
-- table <- getTab
node tag = multi (hasName tag)
-- a la https://stackoverflow.com/questions/3901492/running-haskell-hxt-outside-of-io?rq=1
getIt :: ArrowXml cat => cat (Data.Tree.NTree.TypeDefs.NTree XNode) (String, String)
getIt = (node "tr" >>>
         (getAttrValue "id" &&& (node "td" //> getText)))
This kinda works. I need to massage a bit, but can get it to run...
-- table >>= runLA getIt
-- [("","Name"),("","Favorite Rock"),("rock1","Fred"),("rock1","Igneous"),("rock2","Bill"),("rock2","Sedimentary")]
This is a second approach, inspired by https://wiki.haskell.org/HXT/Practical/Simple1. Here, I think I'm relying on {-# LANGUAGE Arrows #-} (which coincidentally breaks my list comprehension for rec above) to use proc notation and write this as a more readable do block. That said, I can't even get a minimal version of this to compile:
Attempt 2
getR :: ArrowXml cat => cat XmlTree Rockrec
getR = (hasAttrValue "id" (== "Greatest-Table")) >>>
  proc x -> do
    rockId <- getText -< x
    rockName <- getText -< x
    rockFav <- getText -< x
    returnA -< Rockrec rockId rockName rockFav
EDIT
Trouble with the types, in response to the comment below from Alec
λ> getR [table]
<interactive>:56:1-12: error:
• Couldn't match type ‘NTree XNode’ with ‘[[XmlTree]]’
Expected type: [[XmlTree]] -> Rockrec
Actual type: XmlTree -> Rockrec
• The function ‘getR’ is applied to one argument,
its type is ‘cat0 XmlTree Rockrec’,
it is specialized to ‘XmlTree -> Rockrec’
In the expression: getR [table]
In an equation for ‘it’: it = getR [table]
λ> getR table
<interactive>:57:1-10: error:
• Couldn't match type ‘NTree XNode’ with ‘[XmlTree]’
Expected type: [XmlTree] -> Rockrec
Actual type: XmlTree -> Rockrec
• The function ‘getR’ is applied to one argument,
its type is ‘cat0 XmlTree Rockrec’,
it is specialized to ‘XmlTree -> Rockrec’
In the expression: getR table
In an equation for ‘it’: it = getR table
END EDIT
Even if I'm not selecting elements, I can't get the above to run. I'm also a little puzzled at how I should do something like put the first td in rockName and the second td in rockFav, how to include an iterator on these (supposing I have a lot of td fields, instead of just 2.)
Any further general tips on how to do this more painlessly appreciated.

From HXT/Practical/Google1 I think I am able to piece together a solution.
{-# LANGUAGE Arrows #-}
{-# LANGUAGE ScopedTypeVariables #-}
module Hanzo where
import Text.HandsomeSoup
import Text.XML.HXT.Core
-- Select every descendant element with the given tag name.
atTag tag =
  deep (isElem >>> hasName tag)
-- Extract the text of the text nodes below the current node.
text =
  deep isText >>> getText
data Rock = Rock String String String deriving Show
rocks =
  atTag "tbody" //> atTag "tr"
  >>> proc x -> do
        rowID <- getAttrValue "id" -< x
        name <- atTag "td" >. (!! 0) >>> text -< x
        kind <- atTag "td" >. (!! 1) >>> text -< x
        returnA -< Rock rowID name kind
main = do
  dt <- readFile "html.html"
  result <- runX $ parseHtml dt
                   //> hasAttrValue "id" (== "Greatest-Table")
                   >>> rocks
  print result
The key takeaways are these:
Your arrows work on streams of elements, but not individual elements. This is the ArrowList constraint. Thus, calling getText three times will produce surprising behavior because getText represents all the different possible text values you could get in the course of streaming <table> elements through your proc x -> do {...}.
What we can do instead is focus on the stream we want: a stream of <tr>s inside the <tbody>. For each table row, we grab the ID attribute value and the text of the first two <td>s.
This does not seem the most elegant solution, but one way we can index into a stream is to filter it down with the (>.) :: ArrowList cat => cat a b -> ([b] -> c) -> cat a c combinator.
One last trick, one that I noticed in the practical wiki examples: we can use deep and isElem/isText to focus on just the nodes we want. XML trees are noisy!
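To see why binding the same arrow several times misbehaves, and why collapsing the stream with (>.) helps, here is a simplified model that treats a list arrow as a plain function a -> [b] and uses the list monad in place of proc notation. It is a sketch under that assumption, not HXT itself; Row, ListArrow, cellsOf, collect, naive and indexed are hypothetical names introduced only for illustration.

```haskell
-- A list arrow, modelled as a function that returns every result.
type ListArrow a b = a -> [b]

-- Hypothetical stand-in for a <tr>: its id and the text of its <td>s.
data Row = Row { rowId :: String, cells :: [String] }

-- Streams each cell of a row, one result per <td>.
cellsOf :: ListArrow Row String
cellsOf = cells

-- Collapses a whole stream into a single result (the role (>.) plays here).
collect :: ListArrow a b -> ([b] -> c) -> ListArrow a c
collect f g = \x -> [g (f x)]

-- Binding the same stream twice yields every combination of its results.
naive :: Row -> [(String, String)]
naive r = do
  name <- cellsOf r
  kind <- cellsOf r
  return (name, kind)

-- Indexing into the collected stream picks exactly one cell per field.
-- Note that (!!) is partial: a row with fewer cells would raise an error.
indexed :: Row -> [(String, String, String)]
indexed r = do
  name <- collect cellsOf (!! 0) r
  kind <- collect cellsOf (!! 1) r
  return (rowId r, name, kind)
```

For Row "rock1" ["Fred", "Igneous"], naive produces four pairs, while indexed produces the single triple ("rock1", "Fred", "Igneous").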

Related

How to extract the name of a Party?

In a DAML contract, how do I extract the name of a party from a Party field?
Currently, toText p gives me Party(Alice). I'd like to only keep the name of the party.
That you care about the precise formatting of the resulting string suggests that you are implementing a codec in DAML. As a general principle DAML excels as a modelling/contract language, but consequently has limited features to support the sort of IO-oriented work this question implies. You are generally better off returning DAML values, and implementing codecs in Java/Scala/C#/Haskell/etc interfacing with the DAML via the Ledger API.
Still, once you have a Text value you also have access to the standard List manipulation functions via unpack, so converting "Party(Alice)" to "Alice" is not too difficult:
daml 1.0 module PartyExtract where
import Base.List
def pack (cs: List Char) : Text =
foldl (fun (acc: Text) (c: Char) -> acc <> singleton c) "" cs;
def partyToText (p: Party): Text =
pack $ reverse $ drop 2 $ reverse $ drop 7 $ unpack $ toText p
test foo : Scenario {} = scenario
let p = 'Alice'
assert $ "Alice" == partyToText p
In DAML 1.2 the standard library has been expanded, so the code above can be simplified:
daml 1.2
module PartyExtract2
where
import DA.Text
traceDebug : (Show a, Show b) => b -> a -> a
traceDebug b a = trace (show b <> show a) $ a
partyToText : Party -> Text
partyToText p = dropPrefix "'" $ dropSuffix "'" $ traceDebug "show party: " $ show p
foo : Scenario ()
foo = do
  p <- getParty "Alice"
  assert $ "Alice" == (traceDebug "partyToText party: " $ partyToText p)
NOTE: I have left the definition and calls to traceDebug so you can see the exact strings being generated in the scenario trace output.

How to stop Haskell Parsec parser at EOF

So, I'm writing a small parser that will extract all <td> tag content with specific class, like this one <td class="liste">some content</td> --> Right "some content"
I will be parsing a large HTML file, but I don't really care about all the noise, so the idea was to consume all characters until I reach <td class="liste">, then consume all characters (the content) until </td> and return the content string.
This works fine if the last element in the file is my td.liste tag, but if there is some text after it, or EOF, then my parser consumes it and throws unexpected end of input, as you can see if you execute parseMyTest test3.
-- EDIT
See end of test3 to understand what is the edge case.
Here is my code so far :
import Text.Parsec
import Text.Parsec.String
import Data.ByteString.Lazy (ByteString)
import Data.ByteString.Char8 (pack)
colOP :: Parser String
colOP = string "<td class=\"liste\">"
colCL :: Parser String
colCL = string "</td>"
col :: Parser String
col = do
  manyTill anyChar (try colOP)
  content <- manyTill anyChar $ try colCL
  return content
cols :: Parser [String]
cols = many col
test1 :: String
test1 = "<td class=\"liste\">Hello world!</td>"
test2 :: String
test2 = read $ show $ pack test1
test3 :: String
test3 = "\n\r<html>asdfasd\n\r<td class=\"liste\">Hello world 1!</td>\n<td class=\"liste\">Hello world 2!</td>\n\rasldjfasldjf<td class=\"liste\">Hello world 3!</td><td class=\"liste\">Hello world 4!</td>adsafasd"
parseMyTest :: String -> Either ParseError [String]
parseMyTest test = parse cols "test" test
btos :: ByteString -> String
btos = read . show
I created a combinator skipTill p end which applies p until end matches and then returns what end returns.
By contrast, manyTill p end applies p until end matches and then
returns what the p parsers matched.
import Text.Parsec
import Text.Parsec.String
skipTill :: (Stream s m t) => ParsecT s u m a -> ParsecT s u m end -> ParsecT s u m end
skipTill p end = scan
  where
    scan = end <|> do { p; scan }
td :: Parser String
td = do
  string "("
  manyTill anyChar (try (string ")"))
tds = do r <- many (try (skipTill anyChar (try td)))
         many anyChar -- discard stuff at end
         return r
test1 = parse tds "" "111(abc)222(def)333" -- Right ["abc", "def"]
test2 = parse tds "" "111" -- Right []
test3 = parse tds "" "111(abc" -- Right []
test4 = parse tds "" "111(abc)222(de" -- Right ["abc"]
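The scanning logic itself (try to match a delimited chunk at the current position, otherwise skip one character, and stop cleanly at end of input) can also be sketched without Parsec, using only base. This is an illustration of the idea rather than a replacement for the parser above; breakOn and extractBetween are hypothetical helper names.

```haskell
import Data.List (stripPrefix)

-- Split the input at the first occurrence of the closing marker,
-- returning the text before it and the text after it.
breakOn :: String -> String -> Maybe (String, String)
breakOn close = go []
  where
    go _ [] = Nothing                       -- hit end of input: no closing marker
    go acc s@(c:cs) =
      case stripPrefix close s of
        Just rest -> Just (reverse acc, rest)
        Nothing   -> go (c : acc) cs

-- Collect every chunk found between an opening and a closing marker.
-- Unterminated chunks and trailing noise are skipped, not reported as errors.
extractBetween :: String -> String -> String -> [String]
extractBetween open close = scan
  where
    scan [] = []                            -- end of input ends the scan
    scan s@(_:cs) =
      case stripPrefix open s >>= breakOn close of
        Just (content, rest) -> content : scan rest
        Nothing              -> scan cs     -- no complete chunk here: skip one char
```

With "(" and ")" as markers this gives the same results as the four tests above; with "<td class=\"liste\">" and "</td>" it handles input that has trailing text after the last cell.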
Update
This also appears to work:
tds' = scan
  where scan = (eof >> return [])
           <|> do { r <- try td; rs <- scan; return (r:rs) }
           <|> do { anyChar; scan }

Convert bitstring to tuple

I'm trying to find out how to convert an Erlang bitstring to a tuple, but so far without any luck.
What I want is to get from for example <<"{1,2}">> the tuple {1,2}.
You can use the modules erl_scan and erl_parse, as in this answer. Since erl_scan:string requires a string, not a binary, you have to convert the value with binary_to_list first:
> {ok, Scanned, _} = erl_scan:string(binary_to_list(<<"{1,2}">>)).
{ok,[{'{',1},{integer,1,1},{',',1},{integer,1,2},{'}',1}],1}
Then, you'd use erl_parse:parse_term to get the actual term. However, this function expects the term to end with a dot, so we have to add it explicitly:
> {ok, Parsed} = erl_parse:parse_term(Scanned ++ [{dot,0}]).
{ok,{1,2}}
Now the variable Parsed contains the result:
> Parsed.
{1,2}
You can use binary functions and erlang:list_to_tuple/1
1> B = <<"{1,2}">>.
<<"{1,2}">>
2> list_to_tuple([list_to_integer(binary_to_list(X)) || X <- binary:split(binary:part(B, 1, byte_size(B)-2), <<",">>, [global])]).
{1,2}

Read An Input.md file and output a .html file Haskell

I had a question concerning some basic transformations in Haskell.
Basically, I have an input file named Input.md. It contains some markdown text that is read in by my project file, and I want to write a few functions that transform the text. Once these are combined in a function called convertToHTML, I want to output the result as an .html file in the correct format.
module Main
(
convertToHTML,
main
) where
import System.Environment (getArgs)
import System.IO
import Data.Char (toLower, toUpper)
process :: String -> String
process s = head $ lines s
convertToHTML :: String -> String
convertToHTML str = do
    x <- str
    if (x == '#')
        then "<h1>"
        else return x
--convertToHTML x = map toUpper x
main = do
    args <- getArgs -- command line args
    let (infile,outfile) = (\(x:y:ys)->(x,y)) args
    putStrLn $ "Input file: " ++ infile
    putStrLn $ "Output file: " ++ outfile
    contents <- readFile infile
    writeFile outfile $ convertToHTML contents
So,
How would I read through my input file, and transform any line that starts with a # to an html tag
How would I read through my input file once more and transform any WORD that is surrounded by _word_ (1 underscore) to another html tag
Replace any Character with an html string.
I tried using such functions such as Map, Filter, ZipWith, but could not figure out how to iterate through the text and transform each text. If anybody has any suggestions, please share them. I've been working on this for 2 days straight and have a bunch of failed code to show for it.
I tried using such functions such as Map, Filter, ZipWith, but could not figure out how to iterate through the text and transform each text.
That is because each of them works on a collection of the appropriate element type. They don't really "iterate"; you simply have to feed them the appropriate data. Let's tackle the # problem as an example.
Our file is one giant String, and what we'd like is to have it nicely split in lines, so [String]. What could do it for us? I have no idea, so let's just search Hoogle for String -> [String].
Ah, there we go, lines function! Its counterpart, unlines, is also going to be useful. Now we can write our line wrapper:
convertHeader :: String -> String
convertHeader [] = [] -- that prevents us from calling head on an empty line
convertHeader x = if head x == '#' then "<h1>" ++ x ++ "</h1>"
                  else x
and so:
convertHeaders :: String -> String
convertHeaders = unlines . map convertHeader . lines
-- ^String ^[String] ^[String] ^String
As you can see, the function first splits the file into lines, maps convertHeader over each line, and then puts the file back together.
See it live on Ideone
Try now doing the same with words to replace your formatting patterns. As a bonus exercise, change convertHeader to count the number of # in front of the line and output <h1>, <h2>, <h3> and so on accordingly.
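One possible shape for the bonus exercise is sketched below. It is only one way to do it, and it makes two choices the answer does not specify: it strips the leading # characters and spaces from the heading text, and it caps the level at 6 because HTML has no <h7>. convertHeading and convertHeadings are hypothetical names.

```haskell
-- Turn a line that starts with one or more '#' into an <hN> element,
-- where N is the number of leading '#' characters (capped at 6).
-- Lines that do not start with '#' are returned unchanged.
convertHeading :: String -> String
convertHeading line =
  case span (== '#') line of
    ("", _) -> line
    (hashes, rest) ->
      let level = min 6 (length hashes)
          tag   = "h" ++ show level
      in "<" ++ tag ++ ">" ++ dropWhile (== ' ') rest ++ "</" ++ tag ++ ">"

-- Apply the conversion to every line of the input.
convertHeadings :: String -> String
convertHeadings = unlines . map convertHeading . lines
```

Using span instead of head also removes the need for a separate empty-line case, since span on an empty line simply returns two empty strings.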

How do exceptions work in Haskell (part two)?

I have the following code:
{-# LANGUAGE DeriveDataTypeable #-}
import Prelude hiding (catch)
import Control.Exception (throwIO, Exception)
import Control.Monad (when)
import Data.Maybe
import Data.Word (Word16)
import Data.Typeable (Typeable)
import System.Environment (getArgs)
data ArgumentParserException = WrongArgumentCount | InvalidPortNumber
    deriving (Show, Typeable)
instance Exception ArgumentParserException
data Arguments = Arguments Word16 FilePath String
main = do
    args <- return []
    when (length args /= 3) (throwIO WrongArgumentCount)
    let [portStr, cert, pw] = args
    let portInt = readMaybe portStr :: Maybe Integer
    when (portInt == Nothing) (throwIO InvalidPortNumber)
    let portNum = fromJust portInt
    when (portNum < 0 || portNum > 65535) (throwIO InvalidPortNumber)
    return $ Arguments (fromInteger portNum) cert pw
-- Newer 'base' has Text.Read.readMaybe but alas, that doesn't come with
-- the latest Haskell platform, so let's not rely on it
readMaybe :: Read a => String -> Maybe a
readMaybe s = case reads s of
    [(x, "")] -> Just x
    _         -> Nothing
Its behavior differs when compiled with optimizations on and off:
crabgrass:~/tmp/signserv/src% ghc -fforce-recomp Main.hs && ./Main
Main: WrongArgumentCount
crabgrass:~/tmp/signserv/src% ghc -O -fforce-recomp Main.hs && ./Main
Main: Main.hs:20:9-34: Irrefutable pattern failed for pattern [portStr, cert, pw]
Why is this? I am aware that imprecise exceptions can be chosen from arbitrarily; but here we are choosing from one precise and one imprecise exception, so that caveat should not apply.
I would agree with hammar, this looks like a bug. It seems to have been fixed in HEAD for some time: with an older ghc-7.7.20130312 as well as with today's HEAD ghc-7.7.20130521, the WrongArgumentCount exception is raised and all the other code of main is removed (bully for the optimiser). It is still broken in 7.6.3, however.
The behaviour changed with the 7.2 series, I get the expected WrongArgumentCount from 7.0.4, and the (optimised) core makes that clear:
Main.main1 =
  \ (s_a11H :: GHC.Prim.State# GHC.Prim.RealWorld) ->
    case GHC.List.$wlen
           @ GHC.Base.String (GHC.Types.[] @ GHC.Base.String) 0
    of _ {
      __DEFAULT ->
        case GHC.Prim.raiseIO#
               @ GHC.Exception.SomeException @ () Main.main7 s_a11H
        of _ { (# new_s_a11K, _ #) ->
        Main.main2 new_s_a11K
        };
      3 -> Main.main2 s_a11H
    }
when the length of the empty list is different from 3, raise WrongArgumentCount, otherwise try to do the rest.
With 7.2 and later, the evaluation of the length is moved behind the parsing of portStr:
Main.main1 =
  \ (eta_Xw :: GHC.Prim.State# GHC.Prim.RealWorld) ->
    case Main.main7 of _ {
      [] -> case Data.Maybe.fromJust1 of wild1_00 { };
      : ds_dTy ds1_dTz ->
        case ds_dTy of _ { (x_aOz, ds2_dTA) ->
        case ds2_dTA of _ {
          [] ->
            case ds1_dTz of _ {
              [] ->
                case GHC.List.$wlen
                       @ [GHC.Types.Char] (GHC.Types.[] @ [GHC.Types.Char]) 0
                of _ {
                  __DEFAULT ->
                    case GHC.Prim.raiseIO#
                           @ GHC.Exception.SomeException @ () Main.main6 eta_Xw
                    of wild4_00 {
                    };
                  3 ->
where
Main.main7 =
  Text.ParserCombinators.ReadP.run
    @ GHC.Integer.Type.Integer Main.main8 Main.main3
Main.main8 =
  GHC.Read.$fReadInteger5
    GHC.Read.$fReadInteger_$sconvertInt
    Text.ParserCombinators.ReadPrec.minPrec
    @ GHC.Integer.Type.Integer
    (Text.ParserCombinators.ReadP.$fMonadP_$creturn
       @ GHC.Integer.Type.Integer)
Main.main3 = case lvl_r1YS of wild_00 { }
lvl_r1YS =
  Control.Exception.Base.irrefutPatError
    @ ([GHC.Types.Char], [GHC.Types.Char], [GHC.Types.Char])
    "Except.hs:21:9-34|[portStr, cert, pw]"
Since throwIO is supposed to respect ordering of IO actions,
The throwIO variant should be used in preference to throw to raise an exception within the IO monad because it guarantees ordering with respect to other IO operations, whereas throw does not.
that should not happen.
You can force the correct ordering by using a NOINLINE variant of when, or by performing an effectful IO action before throwing, so it seems that when the inliner sees that the when does nothing except possibly throwing, it decides that order doesn't matter.
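A minimal sketch of the first workaround, a NOINLINE copy of when, might look like the following. whenNoInline, ArgError, checkArgs and safeCheckArgs are hypothetical names, and whether the pragma is sufficient on a given GHC version is an assumption taken from the observation above, not something this sketch verifies. The sketch also swaps the irrefutable let pattern for a case expression, so a wrong argument count cannot surface as a pattern-match failure regardless of how the checks are ordered.

```haskell
import Control.Exception (Exception, throwIO, try)

data ArgError = WrongArgumentCount
  deriving (Show, Eq)

instance Exception ArgError

-- Behaves like Control.Monad.when specialised to IO, but is never inlined,
-- so the optimiser cannot look through it at the call site.
whenNoInline :: Bool -> IO () -> IO ()
whenNoInline cond action = if cond then action else return ()
{-# NOINLINE whenNoInline #-}

-- Validate the argument count first, then destructure with a total match.
checkArgs :: [String] -> IO (String, String, String)
checkArgs args = do
  whenNoInline (length args /= 3) (throwIO WrongArgumentCount)
  case args of
    [port, cert, pw] -> return (port, cert, pw)
    _                -> throwIO WrongArgumentCount

-- Run the check and return the exception as a value instead of propagating it.
safeCheckArgs :: [String] -> IO (Either ArgError (String, String, String))
safeCheckArgs = try . checkArgs
```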
(Sorry, not a real answer, but try to fit that in a comment ;)