Groovy - NekoHTML SAX parser - HTML

I am having a hard time with my NekoHTML parser.
It works fine on URLs, but when I test it on a simple XML sample, it does not read it properly.
Here is how I declare it:
def createAndSetParser() {
    SAXParser parser = new SAXParser() // Default NekoHTML SAX parser
    def charset = "Windows-1252" // The encoding of the page
    def tagFormat = "upper" // Ensures all tags are written consistently by upper-casing them. We can choose "lower", "upper" or "match"
    def attrFormat = "lower" // Same thing for attributes. We can choose "upper", "lower" or "match"
    Purifier purifier = new Purifier() // Creating a purifier, in order to clean the incoming HTML
    XMLDocumentFilter[] filter = [purifier] // Creating a filter and adding the purifier to it (NekoHTML feature)
    parser.setProperty("http://cyberneko.org/html/properties/filters", filter)
    parser.setProperty("http://cyberneko.org/html/properties/default-encoding", charset)
    parser.setProperty("http://cyberneko.org/html/properties/names/elems", tagFormat)
    parser.setProperty("http://cyberneko.org/html/properties/names/attrs", attrFormat)
    parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true) // Forces the parser to use the charset we provided
    parser.setFeature("http://cyberneko.org/html/features/override-doctype", false) // To leave the doctype as it is
    parser.setFeature("http://cyberneko.org/html/features/override-namespaces", false) // To make sure no namespace is added or overridden
    parser.setFeature("http://cyberneko.org/html/features/balance-tags", true)
    return new XmlSlurper(parser) // A Groovy parser that does not load the whole tree structure, but rather supplies only the information it is asked for
}
Again, it works fine when I use it on websites.
Any guess why I cannot do the same on simple XML text samples?
Any help greatly appreciated :)

I made your script executable on the Groovy Console to try it out easily using Grape to fetch the required NekoHTML library from the Maven Central Repository.
@Grapes(
    @Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.15')
)
import groovy.xml.StreamingMarkupBuilder
import org.apache.xerces.xni.parser.XMLDocumentFilter
import org.cyberneko.html.parsers.SAXParser
import org.cyberneko.html.filters.Purifier
def createAndSetParser() {
    SAXParser parser = new SAXParser()
    parser.setProperty("http://cyberneko.org/html/properties/filters", [new Purifier()] as XMLDocumentFilter[])
    parser.setProperty("http://cyberneko.org/html/properties/default-encoding", "Windows-1252")
    parser.setProperty("http://cyberneko.org/html/properties/names/elems", "upper")
    parser.setProperty("http://cyberneko.org/html/properties/names/attrs", "lower")
    parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true)
    parser.setFeature("http://cyberneko.org/html/features/override-doctype", false)
    parser.setFeature("http://cyberneko.org/html/features/override-namespaces", false)
    parser.setFeature("http://cyberneko.org/html/features/balance-tags", true)
    return new XmlSlurper(parser)
}
def printResult(def gPathResult) {
    println new StreamingMarkupBuilder().bind { out << gPathResult }
}
def parser = createAndSetParser()
printResult parser.parseText('<html><body>Hello World</body></html>')
printResult parser.parseText('<house><room>bedroom</room><room>kitchen</room></house>')
When executed this way, the output of the two printResult statements looks like shown below, which explains your issue parsing the XML string: it gets wrapped in <html><body>...</body></html> tags and loses its root tag <house/>:
<HTML><tag0:HEAD xmlns:tag0='http://www.w3.org/1999/xhtml'></tag0:HEAD><BODY>Hello World</BODY></HTML>
<HTML><BODY><ROOM>bedroom</ROOM><ROOM>kitchen</ROOM></BODY></HTML>
All this is caused by the http://cyberneko.org/html/features/balance-tags feature, which you enabled in your script. If I disable this feature (it must be explicitly set to false because it defaults to true), the results look like this:
<HTML><BODY>Hello World</BODY></HTML>
<HOUSE><ROOM>bedroom</ROOM><ROOM>kitchen</ROOM></HOUSE>
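In other words, for plain XML input the only change needed is that one feature; a minimal sketch of the changed line (everything else in createAndSetParser() stays as above):
parser.setFeature("http://cyberneko.org/html/features/balance-tags", false) // do not wrap input in <HTML><BODY>
Keep the trade-off in mind: with tag balancing off, NekoHTML no longer repairs malformed HTML, so this variant only suits well-formed input.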

Related

How to properly merge multiple FlowFiles?

I use MergeContent 1.3.0 in order to merge FlowFiles from 2 sources: 1) from ListenHTTP and 2) from QueryElasticsearchHTTP.
The problem is that the merge result is a list of JSON strings. How can I convert them into a single JSON string?
{"event-date":"2017-08-08T00:00:00"}{"event-date":"2017-02-23T00:00:00"}{"eid":1,"zid":1,"latitude":38.3,"longitude":2.4}
I would like to get this result:
{"event-date":["2017-08-08T00:00:00","2017-02-23T00:00:00"],"eid":1,"zid":1,"latitude":38.3,"longitude":2.4}
Is it possible?
UPDATE:
After changing the data structure in Elasticsearch, I was able to come up with the following MergeContent output. Now I have a common field eid in both JSON strings. I would like to merge these strings by eid in order to get a single JSON file. Which processor should I use?
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4}{"eid":"1","dates":{"event-date":["2017-08-08","2017-02-23"]}}
I need to get the following output:
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4,"dates":{"event-date":["2017-08-08","2017-02-23"]}}
It was suggested to use ExecuteScript to merge the files. However, I cannot figure out how to do this. This is what I tried:
import json
import java.io
from org.apache.commons.io import IOUtils
from java.nio.charset import StandardCharsets
from org.apache.nifi.processor.io import StreamCallback

class ModJSON(StreamCallback):
    def __init__(self):
        pass
    def process(self, inputStream, outputStream):
        text = IOUtils.toString(inputStream, StandardCharsets.UTF_8)
        obj = json.loads(text)
        newObj = {
            "eid": obj['eid'],
            "zid": obj['zid'],
            ...
        }
        outputStream.write(bytearray(json.dumps(newObj, indent=4).encode('utf-8')))

flowFile1 = session.get()
flowFile2 = session.get()
if (flowFile1 != None and flowFile2 != None):
    # WHAT SHOULD I PUT HERE??
    flowFile = session.write(flowFile, ModJSON())
    flowFile = session.putAttribute(flowFile, "filename", flowFile.getAttribute('filename').split('.')[0] + '_translated.json')
    session.transfer(flowFile, REL_SUCCESS)
    session.commit()
Here is an example of how to read multiple files from the incoming queue using filtering.
Assume you have multiple pairs of flow files with the following content:
{"eid":"1","zid":1,"latitude":38.3,"longitude":2.4}
and
{"eid":"1","dates":{"event-date":["2017-08-08","2017-02-23"]}}
The same value of the eid field links the files in a pair.
Before merging, we have to extract the value of the eid field and put it into an attribute of the flow file for fast filtering.
Use the EvaluateJsonPath processor with properties:
Destination : flowfile-attribute
eid : $.eid
After this, you'll have a new eid attribute on the flow file.
Then use the ExecuteScript processor with Groovy as the script language and the following code:
import org.apache.nifi.flowfile.FlowFile
import org.apache.nifi.processor.FlowFileFilter
import org.apache.nifi.processor.io.OutputStreamCallback
import groovy.json.JsonSlurper
import groovy.json.JsonBuilder

//get the first flow file
def ff0 = session.get()
if (!ff0) return
def eid = ff0.getAttribute('eid')
//try to find files with the same attribute in the incoming queue
def ffList = session.get(new FlowFileFilter() {
    public FlowFileFilterResult filter(FlowFile ff) {
        if (eid == ff.getAttribute('eid')) return FlowFileFilterResult.ACCEPT_AND_CONTINUE
        return FlowFileFilterResult.REJECT_AND_CONTINUE
    }
})
//let's assume you require one additional file in the queue with the same attribute
if (!ffList || ffList.size() < 1) {
    //if fewer than required, roll back the current session and penalize the retrieved files,
    //so they go to the end of the incoming queue
    //with the pre-configured penalty delay (default 30 sec)
    session.rollback(true)
    return
}
//let's put all files into one list to simplify later iterations
ffList.add(ff0)
if (ffList.size() > 2) {
    //unexpected situation: you have more files than expected,
    //redirect all of them to failure
    session.transfer(ffList, REL_FAILURE)
    return
}
//create an empty map (aka json object)
def json = [:]
//iterate through the files, parse them, and merge their keys
ffList.each { ff ->
    session.read(ff).withStream { rawIn ->
        def fjson = new JsonSlurper().parse(rawIn)
        json.putAll(fjson)
    }
}
//create a new flow file and write the merged json as its content
def ffOut = session.create()
ffOut = session.write(ffOut, { rawOut ->
    rawOut.withWriter("UTF-8") { writer ->
        new JsonBuilder(json).writeTo(writer)
    }
} as OutputStreamCallback)
//set mime-type
ffOut = session.putAttribute(ffOut, "mime.type", "application/json")
session.remove(ffList)
session.transfer(ffOut, REL_SUCCESS)
Joining together two different types of data is not really what MergeContent was made to do.
You would need to write a custom processor, or custom script, that understood your incoming data formats and created the new output.
If you have ListenHttp connected to QueryElasticSearchHttp, meaning that you are triggering the query based on the flow file coming out of ListenHttp, then you may want to make a custom version of QueryElasticSearchHttp that takes the content of the incoming flow file and joins it together with any of the outgoing results.
Here is where the query result is currently written to a flow file:
https://github.com/apache/nifi/blob/master/nifi-nar-bundles/nifi-elasticsearch-bundle/nifi-elasticsearch-processors/src/main/java/org/apache/nifi/processors/elasticsearch/QueryElasticsearchHttp.java#L360
Another option is to use ExecuteScript and write a script that could take multiple flow files and merge them together in the way you described.

YAML or JSON library that supports inheritance

We are building a service. It has to read config from a file. We are currently using YAML and Jackson to deserialize the YAML. We have a situation where our YAML file needs to inherit/extend one or more other YAML files. E.g., something like:
extends: base.yaml
appName: my-awesome-app
...
thus part of the config is stored in base.yaml. Is there any library that supports this? Bonus points if it allows inheriting from more than one file. We could switch to JSON instead of YAML.
Neither JSON nor YAML has the ability to include files. Whatever you do will be a pre-processing step in which you put base.yaml and your actual file together.
A crude way of doing this would be:
#include base.yaml
appName: my-awesome-app
Let this be your file. Upon loading, you first read the first line, and if it starts with #include, you replace it with the content of the included file. You need to do this recursively. This is basically what the C preprocessor does with C files and includes.
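As an illustration, a minimal Groovy sketch of that crude preprocessor (the helper name expandIncludes is my own, and it assumes included paths are relative to the including file):
// Hypothetical helper: recursively replaces "#include <path>" lines with file content.
String expandIncludes(File file) {
    file.readLines().collect { line ->
        if (line.startsWith('#include ')) {
            def included = new File(file.parentFile, line.substring('#include '.length()).trim())
            return expandIncludes(included) // recurse: the included file may itself contain includes
        }
        return line
    }.join('\n')
}
Feeding expandIncludes(new File('your.yaml')) to your YAML loader would then see the expanded document.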
Drawbacks are:
even if both files are valid YAML, the result may not be.
if either file includes a directive end or document end marker (--- or ...), you will end up with two separate documents in one file.
you cannot replace any values from base.yaml inside your file.
So an alternative would be to actually operate on the YAML structure. For this, you need the API of the YAML parser (SnakeYAML in your case) and parse your file with that. You should use the compose API:
import java.io.FileReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringWriter;
import java.util.List;
import org.yaml.snakeyaml.DumperOptions;
import org.yaml.snakeyaml.Yaml;
import org.yaml.snakeyaml.emitter.Emitter;
import org.yaml.snakeyaml.nodes.MappingNode;
import org.yaml.snakeyaml.nodes.Node;
import org.yaml.snakeyaml.nodes.NodeTuple;
import org.yaml.snakeyaml.nodes.ScalarNode;
import org.yaml.snakeyaml.resolver.Resolver;
import org.yaml.snakeyaml.serializer.Serializer;

private Node preprocess(final Reader myInput) throws IOException {
    final Yaml yaml = new Yaml();
    final Node node = yaml.compose(myInput);
    processIncludes(node);
    return node;
}

private void processIncludes(final Node node) throws IOException {
    if (node instanceof MappingNode) {
        final List<NodeTuple> values = ((MappingNode) node).getValue();
        for (final NodeTuple tuple : values) {
            if ("!include".equals(tuple.getKeyNode().getTag().getValue())) {
                final String includedFilePath =
                        ((ScalarNode) tuple.getValueNode()).getValue();
                final Node content = preprocess(new FileReader(includedFilePath));
                // now merge the content in your preferred way into the values list.
                // that will change the content of the node.
            }
        }
    }
}

public String executePreprocessor(final Reader source) throws IOException {
    final Node node = preprocess(source);
    final StringWriter writer = new StringWriter();
    final DumperOptions dOptions = new DumperOptions();
    Serializer ser = new Serializer(new Emitter(writer, dOptions),
            new Resolver(), dOptions, null);
    ser.open();
    ser.serialize(node);
    ser.close();
    return writer.toString();
}
This code would parse includes like this:
!include : base.yaml
appName: my-awesome-app
I used the private tag !include so that there will not be name clashes with any normal mapping key. Mind the space behind !include. I didn't give code to merge the included file because I did not know how you want to handle duplicate mapping keys. It should not be hard to implement though. Be aware of bugs, I have not tested this code.
The resulting String can be the input to Jackson.
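To illustrate the merge step left open above, here is one possible sketch in Groovy (untested, like the answer's own code; mergeInclude is a hypothetical helper, and letting keys from the including file win over keys from base.yaml is my assumed policy, not the author's):
import org.yaml.snakeyaml.nodes.MappingNode
import org.yaml.snakeyaml.nodes.NodeTuple
import org.yaml.snakeyaml.nodes.ScalarNode

// Replace the !include tuple with the included mapping's entries; on duplicate
// keys, keep the entry already present in the including file.
void mergeInclude(MappingNode target, NodeTuple includeTuple, MappingNode included) {
    List<NodeTuple> merged = new ArrayList<NodeTuple>(target.getValue())
    merged.remove(includeTuple)
    Set<String> ownKeys = merged.collect { t ->
        t.keyNode instanceof ScalarNode ? ((ScalarNode) t.keyNode).value : null
    } as Set
    included.getValue().each { t ->
        def key = t.keyNode instanceof ScalarNode ? ((ScalarNode) t.keyNode).value : null
        if (!ownKeys.contains(key)) merged.add(t)
    }
    target.setValue(merged)
}
Here included would be the node returned by preprocess for the included file, cast to MappingNode.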
Out of the same need, I have created this tool: jq-front.
You can do it with the following syntax, combined with the yq command:
extends: [ base.yaml ]
appName: my-awesome-app
...
$ yq -j . your.yaml | jq-front | yq -y .
Note that you need to place the file names to be extended in an array, since the tool supports multiple inheritance.
Potential drawbacks are:
It's quite slow. (But for configuration data this might be acceptable, since you can convert it to an expanded file once and never need the original one after that.)
Objects inside an array may not behave as expected, since the tool relies on the * operator of jq.

Parse HTML in Scala

Task: an HTML parser in Scala. I'm pretty new to Scala.
So far, I have written a little parser in Scala to parse a random HTML document.
import scala.xml.Elem
import scala.xml.Node
import scala.collection.mutable.Queue
import scala.xml.Text
import scala.xml.PrettyPrinter

object Reader {
  def loadXML = {
    val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
    val parser = parserFactory.newSAXParser()
    val source = new org.xml.sax.InputSource("http://www.randomurl.com")
    val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
    val feed = adapter.loadXML(source, parser)
    feed
  }

  def proc(node: Node): String =
    node match {
      case <body>{ txt }</body> => "Partial content: " + txt
      case _ => "grmpf"
    }

  def main(args: Array[String]): Unit = {
    val content = Reader.loadXML
    Console.println(content)
    Console.println(proc(content))
  }
}
The problem is that proc does not work. Basically, I would like to get exactly the content of one node. Or is there another way to achieve that without matching?
Does feed in the loadXML function give me back the right format for parsing, or is there a better way to achieve that? It gives me back the root node, right?
Thanks in advance
You're right: adapter.loadXML(source, parser) gives you the root node. The problem is that that root node probably isn't going to match the body case in your proc method. Even if the root node were body, it still wouldn't match unless the element contained nothing but text.
You probably want something more like this:
def proc(node: Node): String = (node \\ "body").text
Where \\ is a selector method roughly equivalent to XPath's //, i.e. it returns all the descendants of node named body. If you know that body is a child (as opposed to a deeper descendant) of the root node, which is probably the case for HTML, you can use \ instead of \\.

Possible to pretty print JSON in Grails 1.3.7?

The JSON in question is being read in from a RESTful service, and I would like to print it out (to the console, although in a .gsp would be fine also) for debugging purposes. Grails 1.3.7 (current as of August 2011) uses Groovy 1.7.8, which does not have the JsonOutput class introduced in Groovy 1.8.
Note that I am currently reading it in like this, which I am not convinced is the grooviest or grails-est way to do it; perhaps I could take advantage of the converters and pretty printing if done differently? A code sample would be appreciated.
def serviceURL = new URL(theURL)
def json = new JSONObject(serviceURL.text)
println json
You can pretty print JSON with the toString(int indentFactor) method. Example:
def json = new JSONObject()
json.put('foo', 'bar')
json.put('blat', 'greep')
println json
===>{"foo":"bar","blat","greep"}
println json.toString(4)
===>{
"foo": "bar",
"blat": "greep"
}
You can use grails.converters.JSON (the most commonly used JSON library in Grails):
In your Config.groovy file, add this line to enable pretty printing:
grails.converters.default.pretty.print=true
Then, in your controller:
import grails.converters.*
def serviceURL = new URL(theURL)
def json = JSON.parse(serviceURL.text)
println "JSON RESPONSE: ${json.toString()"
If you're in a Grails controller and plan to render the json, then you use something like this (using Grails 2.3.5):
public prettyJson() {
    JSON json = ['status': 'OK'] as JSON
    json.prettyPrint = true
    json.render response
}
I found that solution here: http://www.intelligrape.com/blog/2012/07/16/rendering-json-with-formatting/
Apart from setting the default pretty print option in Config.groovy, JSON's toString() method accepts a boolean parameter that controls whether the result is pretty printed.
import grails.converters.*
import my.data.*
def accountJson = Account.get(1001) as JSON
println(accountJson.toString(true))
println(accountJson.toString(false))
Tested in Grails 1.3.9.
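For reference, on Groovy 1.8 or later (newer than what this question's Grails version ships) the built-in JsonOutput mentioned above handles this directly; a minimal sketch:
import groovy.json.JsonOutput
println JsonOutput.prettyPrint('{"foo":"bar","blat":"greep"}')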

extracting parts of HTML with groovy

I need to extract a part of the HTML from a given HTML page. So far, I use XmlSlurper with TagSoup to parse the HTML page and then try to get the needed part using the StreamingMarkupBuilder:
import groovy.xml.StreamingMarkupBuilder
def html = "<html><body>a <b>test</b></body></html>"
def dom = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText(html)
println new StreamingMarkupBuilder().bindNode(dom.body)
However, the result I get is
<html:body xmlns:html='http://www.w3.org/1999/xhtml'>a <html:b>test</html:b></html:body>
which looks great, but I would like to get it without the html namespace.
How do I avoid the namespace?
Turn off the namespace feature on the TagSoup parser. Example:
import groovy.xml.StreamingMarkupBuilder
def html = "<html><body>a <b>test</b></body></html>"
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature, false)
def dom = new XmlSlurper(parser).parseText(html)
println new StreamingMarkupBuilder().bindNode(dom.body)
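With the namespace feature disabled, the slurped nodes no longer carry the XHTML namespace, so the output should be plain (my expectation based on the answer above, not verified against a specific TagSoup version):
<body>a <b>test</b></body>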