Parse HTML in Scala

Parse HTML in Scala - html

Task: HTML - Parser in Scala. Im pretty new to scala.
So far: I have written a little Parser in Scala to parse a random html document.
import scala.xml.Elem
import scala.xml.Node
import scala.collection.mutable.Queue
import scala.xml.Text
import scala.xml.PrettyPrinter
object Reader {
def loadXML = {
val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
val parser = parserFactory.newSAXParser()
val source = new org.xml.sax.InputSource("http://www.randomurl.com")
val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
val feed = adapter.loadXML(source, parser)
feed
}
def proc(node: Node): String =
node match {
case <body>{ txt }</body> => "Partial content: " + txt
case _ => "grmpf"
}
def main(args: Array[String]): Unit = {
val content = Reader.loadXML
Console.println(content)
Console.println(proc(content))
}
}
The problem is that the "proc" does not work. Basically, I would like to get exactly the content of one node. Or is there another way to achieve that without matching?
Does the "feed" in the loadxml-function give me back the right format for parsing or is there a better way to achieve that? Feed gives me back the root node, right?
Thanks in advance

You're right: adapter.loadXML(source, parser) gives you the root node. The problem is that that root node probably isn't going to match the body case in in your proc method. Even if the root node were body, it still wouldn't match unless the element contained nothing but text.
You probably want something more like this:
def proc(node: Node): String = (node \\ "body").text
Where \\ is a selector method that's roughly equivalent to XPath's //—i.e., it returns all the descendants of node named body. If you know that body is a child (as opposed to a deeper descendant) of the root node, which is probably the case for HTML, you can use \ instead of \\.

Related

akka-stream - How to treat the last element of a stream differently in a Flow/Graph

I'm trying to implement an Akka Streams Flow that will convert a stream of JSON objects to a stream of a single array of JSON objects. I can use Concat to add an "[" before and "]" after, as well as Zip to insert commas in between elements, but I can't figure out how to not insert the final comma.
The code I have so far is:
trait JsonStreamSupport {
protected def toJsonArrayString[T : Writes] = Flow[T].map(Json.toJson(_)).map(_.toString()).via(jsonArrayWrapper)
private[this] val jsonArrayWrapper: Flow[String, String, NotUsed] = Flow.fromGraph(GraphDSL.create() { implicit b =>
import GraphDSL.Implicits._
val start = Source.single("[")
val comma = Source.repeat(",")
val end = Source.single("]")
val concat = b.add(Concat[String](3))
val zip = b.add(Zip[String,String])
comma ~> zip.in1
start ~> concat.in(0)
zip.out.map({case (msg,delim) => msg + delim}) ~> concat.in(1)
end ~> concat.in(2)
FlowShape(zip.in0, concat.out)
})
}
Currently the output is:
[{"key":"value},{"key","value"},]
but I need it to be
[{"key":"value},{"key","value"}] (without final comma), where each element of the array is still a distinct element of the stream so can be, for example, sent over chunked HTTP separately.

just found out about intersperse which is exactly what you need, and much simpler than what I suggested in the first place:
http://doc.akka.io/api/akka/2.4.4/index.html#akka.stream.scaladsl.Flow#intersperse[T%3E:Out]%28start:T,inject:T,end:T%29:FlowOps.this.Repr[T]

Play Framework 2.2.1 / How should controller handle a queryString in JSON format

My client side executes a server call encompassing data (queryString) in a JSON object like this:
?q={"title":"Hello"} //non encoded for the sample but using JSON.stringify actually
What is an efficient way to retrieve the title and Hello String?
I tried this:
val params = request.queryString.map {case(k,v) => k->v.headOption}
that returns the Tuple: (q,Some({"title":"hello"}))
I could further extract to retrieve the values (although I would need to manually map the JSON object to a Scala object), but I wonder whether there is an easier and shorter way.
Any idea?

First, if you intend to pluck only the q parameter from a request and don't intend to do so via a route you could simply grab it directly:
val q: Option[String] = request.getQueryString("q")
Next you'd have to parse it as a JSON Object:
import play.api.libs.json._
val jsonObject: Option[JsValue] = q.map(raw: String => json.parse(raw))
with that you should be able to check for the components the jsonObject contains:
val title: Option[String] = jsonObject.flatMap { json: JsValue =>
(json \ "title").asOpt[String]
}
In short, omitting the types you could use a for comprehension for the whole thing like so:
val title = for {
q <- request.getQueryString("q")
json <- Try(Json.parse(q)).toOption
titleValue <- (json \ "title").asOpt[String]
} yield titleValue
Try is defined in scala.util and basically catches Exceptions and wraps it in a processable form.
I must admit that the last version simply ignores Exceptions during the parsing of the raw JSON String and treats them equally to "no title query has been set".
That's not the best way to know what actually went wrong.
In our productive code we're using implicit shortcuts that wraps a None and JsError as a Failure:
val title: Try[String] = for {
q <- request.getQueryString("q") orFailWithMessage "Query not defined"
json <- Try(Json.parse(q))
titleValue <- (json \ "title").validate[String].asTry
} yield titleValue
Staying in the Try monad we gain information about where it went wrong and can provide that to the User.
orFailWithMessage is basically an implicit wrapper for an Option that will transform it into Succcess or Failure with the specified message.
JsResult.asTry is also simply a pimped JsResult that will be Success or Failure as well.

Optimal way to read out JSON from MongoDB into a Scalatra API

I have a pre-formatted JSON blob stored as a string in MongoDB as a field in one of collections. Currently in my Scalatra based API, I have a before filter that renders all of my responses with a JSON content type. An example of how I return the content looks like the following:
get ("/boxscore", operation(getBoxscore)) {
val game_id:Int = params.getOrElse("game_id", "3145").toInt
val mongoColl = mongoDb.apply("boxscores")
val q: DBObject = MongoDBObject("game_id" -> game_id)
val res = mongoColl.findOne(q)
res match {
case Some(j) => JSON.parseFull(j("json_body").toString)
case None => NotFound("Requested document could not be found.")
}
}
Now this certainly does work. It doesn't seem the "Scala" way of doing things and I feel like this can be optimized. The worrisome part to me is when I add a caching layer and a cache does not hit that I am spending additional CPU time on re-parsing a String I already formatted as JSON in MongoDB:
JSON.parseFull(j("json_body").toString)
I have to take the result from findOne(), run .toString on it, then re-parse it into JSON afterwards. Is there a more optimal route? Since the JSON is already stored as a String in MongoDB, I'm guessing a serializer / case class isn't the right solution here. Of course I can just leave what's here - but I'd like to learn if there's a way that would be more Scala-like and CPU friendly going forward.

There is the option to extend Scalatra's render pipeline with handling for MongoDB classes. The following two routes act as an example. They return a MongoCursor and a DBObject as result. We are going to convert those to a string.
get("/") {
mongoColl.find
}
get("/:key/:value") {
val q = MongoDBObject(params("key") -> params("value"))
mongoColl.findOne(q) match {
case Some(x) => x
case None => halt(404)
}
}
In order to handle the types we need to define a partial function which takes care of the conversion and sets the appropriate content type.
There are two cases, the first one handles a DBObject. The content type is set to "application/json" and the object is converted to a string by calling the toString method. The second case handles a MongoCursor. Since it implements TraversableOnce the map function can be used.
def renderMongo = {
case dbo: DBObject =>
contentType = "application/json"
dbo.toString
case xs: TraversableOnce[_] => // handles a MongoCursor, be aware of type erasure here
contentType = "application/json"
val ls = xs map (x => x.toString) mkString(",")
"[" + ls + "]"
}: RenderPipeline
(Note the following type definition: type RenderPipeline = PartialFunction[Any, Any])
Now the method needs to get hooked in. After a HTTP call has been handled the result is forwarded to the render pipeline for further conversion. Custom handling can be added by overriding the renderPipeline method from ScalatraBase. With the following definition the renderMongo function is called first:
override protected def renderPipeline = renderMongo orElse super.renderPipeline
This is a basic approach to handle MongoDB types. There are other options as well, for example by making use of json4s-mongo.
Here is the previous code in a working sample project.

Hocon: Read an array of objects from a configuration file

I have created an Play application (2.1) which uses the configuration in conf/application.conf in the Hocon format.
I want to add an array of projects in the configuration. The file conf/application.conf looks like this:
...
projects = [
{name: "SO", url: "http://stackoverflow.com/"},
{name: "google", url: "http://google.com"}
]
I try to read this configuration in my Scala project:
import scala.collection.JavaConversions._
case class Project(name: String, url: String)
val projectList: List[Project] =
Play.maybeApplication.map{x =>
val simpleConfig = x.configration.getObjectList("projects").map{y =>
y.toList.map{z =>
Project(z.get("name").toString, z.get("url").toString) // ?!? doesn't work
...
}}}}}}}} // *arg*
This approach seems to be very complicated, I am lost in a lot of Options, and my Eclipse IDE cannot give me any hints about the classes.
Has anybody an example how you can read an array of objects from a Hocon configuration file?
Or should I use for this a JSON-file with an JSON-parser instead of Hocon?

The following works for me in Play 2.1.2 (I don't have a .maybeApplication on my play.Play object though, and I'm not sure why you do):
import play.Play
import scala.collection.JavaConversions._
case class Project(name: String, url: String)
val projectList: List[Project] = {
val projs = Play.application.configuration.getConfigList("projects") map { p =>
Project(p.getString("name"), p.getString("url")) }
projs.toList
}
println(projectList)
Giving output:
List(Project(SO,http://stackoverflow.com/), Project(google,http://google.com))
There's not a whole lot different, although I don't get lost in a whole lot of Option instances either (again, different from the API you seem to have).
More importantly, getConfigList seems to be a closer match for what you want to do, since it returns List[play.Configuration], which enables you to specify types on retrieval instead of resorting to casts or .toString() calls.

What are you trying to accomplish with this part y.toList.map{z =>? If you want a collection of Project as the result, why not just do:
val simpleConfig = x.configration.getObjectList("projects").map{y =>
Project(y.get("name").toString, y.get("url").toString)
}
In this case, the map operation should be taking instances of ConfigObject which is what y is. That seems to be all you need to get your Project instances, so I'm not sure why you are toListing that ConfigObject (which is a Map) into a List of Tuple2 and then further mapping that again.

If a normal HOCON configuration then similar to strangefeatures answer this will work
import javax.inject._
import play.api.Configuration
trait Barfoo {
def configuration: Configuration
def projects = for {
projectsFound <- configuration.getConfigList("projects").toList
projectConfig <- projectsFound
name <- projectConfig.getString("name").toList
url <- projectConfig.getString("url").toList
} yield Project(name,url)
}
class Foobar #Inject() (val configuration: Configuration) extends Barfoo
(Using Play 2.4+ Injection)

Given that the contents of the array are Json and you have a case class, you could try to use the Json Play API and work with the objects in that way. The Inception part should make it trivial.

Groovy - NekoHTML Sax parser

I am having hard time with my NekoHTML parser.
It is working fine on URL's but when I want to test in on a simple XML test, it does not read it properly.
Here is how I declare it:
def createAndSetParser() {
SAXParser parser = new SAXParser() //Default Sax NekoHTML parser
def charset = "Windows-1252" // The encoding of the page
def tagFormat = "upper" // Ensures all the tags and consistently written, by putting all of them in upper-case. We can choose "lower", "upper" of "match"
def attrFormat = "lower" // Same thing for attributes. We can choose "upper", "lower" or "match"
Purifier purifier = new Purifier() //Creating a purifier, in order to clean the incoming HTML
XMLDocumentFilter[] filter = [purifier] //Creating a filter, and adding the purifier to this filter. (NekoHTML feature)
parser.setProperty("http://cyberneko.org/html/properties/filters", filter)
parser.setProperty("http://cyberneko.org/html/properties/default-encoding", charset)
parser.setProperty("http://cyberneko.org/html/properties/names/elems", tagFormat)
parser.setProperty("http://cyberneko.org/html/properties/names/attrs", attrFormat)
parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true) // Forces the parser to use the charset we provided to him.
parser.setFeature("http://cyberneko.org/html/features/override-doctype", false) // To let the Doctype as it is.
parser.setFeature("http://cyberneko.org/html/features/override-namespaces", false) // To make sure no namespace is added or overridden.
parser.setFeature("http://cyberneko.org/html/features/balance-tags", true)
return new XmlSlurper(parser) // A groovy parser that does not download the all tree structure, but rather supply only the information it is asked for.
}
Again it is working very fine when I use it on websites.
Any guess why I cannot do so on simple XML text samples ??
Any help greatly apreciated :)

I made your script executable on the Groovy Console to try it out easily using Grape to fetch the required NekoHTML library from the Maven Central Repository.
#Grapes(
#Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.15')
)
import groovy.xml.StreamingMarkupBuilder
import org.apache.xerces.xni.parser.XMLDocumentFilter
import org.cyberneko.html.parsers.SAXParser
import org.cyberneko.html.filters.Purifier
def createAndSetParser() {
SAXParser parser = new SAXParser()
parser.setProperty("http://cyberneko.org/html/properties/filters", [new Purifier()] as XMLDocumentFilter[])
parser.setProperty("http://cyberneko.org/html/properties/default-encoding", "Windows-1252")
parser.setProperty("http://cyberneko.org/html/properties/names/elems", "upper")
parser.setProperty("http://cyberneko.org/html/properties/names/attrs", "lower")
parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true)
parser.setFeature("http://cyberneko.org/html/features/override-doctype", false)
parser.setFeature("http://cyberneko.org/html/features/override-namespaces", false)
parser.setFeature("http://cyberneko.org/html/features/balance-tags", true)
return new XmlSlurper(parser)
}
def printResult(def gPathResult) {
println new StreamingMarkupBuilder().bind { out << gPathResult }
}
def parser = createAndSetParser()
printResult parser.parseText('<html><body>Hello World</body></html>')
printResult parser.parseText('<house><room>bedroom</room><room>kitchen</room></house>')
When being executed this way the result of the two printResult-statements looks like shown below and can explain your issues parsing the XML string because it is wrapped into <html><body>...</body></html> tags and looses the root tag called <house/>:
<HTML><tag0:HEAD xmlns:tag0='http://www.w3.org/1999/xhtml'></tag0:HEAD><BODY>Hello World</BODY></HTML>
<HTML><BODY><ROOM>bedroom</ROOM><ROOM>kitchen</ROOM></BODY></HTML>
All this is caused by the http://cyberneko.org/html/features/balance-tags feature which you enabled in your script. If I disable this feature (it must be explicitly set to false because it defaults to true) the results looks like this:
<HTML><BODY>Hello World</BODY></HTML>
<HOUSE><ROOM>bedroom</ROOM><ROOM>kitchen</ROOM></HOUSE>

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008

Parse HTML in Scala - html

Related

akka-stream - How to treat the last element of a stream differently in a Flow/Graph

Play Framework 2.2.1 / How should controller handle a queryString in JSON format

Optimal way to read out JSON from MongoDB into a Scalatra API

Hocon: Read an array of objects from a configuration file

Groovy - NekoHTML Sax parser

Categories

Resources