I need to extract a part of HTML from a given HTML page. So far, I use the XmlSlurper with tagsoup to parse the HTML page and then try to get the needed part by using the StreamingMarkupBuilder:
import groovy.xml.StreamingMarkupBuilder
def html = "<html><body>a <b>test</b></body></html>"
def dom = new XmlSlurper(new org.ccil.cowan.tagsoup.Parser()).parseText(html)
println new StreamingMarkupBuilder().bindNode(dom.body)
However, the result I get is
<html:body xmlns:html='http://www.w3.org/1999/xhtml'>a <html:b>test</html:b></html:body>
which looks great, but I would like to get it without the html-namespace.
How do I avoid the namespace?
Turn off the namespace feature on the TagSoup parser. Example:
import groovy.xml.StreamingMarkupBuilder
def html = "<html><body>a <b>test</b></body></html>"
def parser = new org.ccil.cowan.tagsoup.Parser()
parser.setFeature(parser.namespacesFeature, false)
def dom = new XmlSlurper(parser).parseText(html)
println new StreamingMarkupBuilder().bindNode(dom.body)
Related
I am using Python3 and Scrapy.
I have a simple spider (shown bellow) for which I want to save as items the response.url and the response.text. I would like to save the response.text to Notepad++ as a JSON. Is there any way it can be saved with a nested structure? Such as the one that appears in the native HTML og the page?
class Spider1(scrapy.Spider):
name = "Spider1"
allowed_domains = []
start_urls = ['http://www.uam.es/']
def parse(self, response):
items = Spider1Item()
items['url'] = response.url
items['body'] = response.text
yield items
pass
EDIT:
Here is a snippet of the target structure I would like when I export to a JSON.
HTML snippet of target structure
I have a webpage were I take the RSS links from. The links are XML and I would like to use the XMLFeedSpider functionality to simplify the parsing.
Is that possible?
This would be the flow:
GET example.com/rss (return HTML)
Parse html and get RSS links
foreach link parse XML
I found a simply way based on the existing example in the documentation and looking at the source code. Here is my solution:
from scrapy.spiders import XMLFeedSpider
from myproject.items import TestItem
class MySpider(XMLFeedSpider):
name = 'example.com'
allowed_domains = ['example.com']
start_urls = ['http://www.example.com/feed.xml']
iterator = 'iternodes' # This is actually unnecessary, since it's the default value
itertag = 'item'
def start_request(self):
urls = ['http://www.example.com/get-feed-links']
for url in urls:
yield scrapy.Request(url=url, callback=self.parse_main)
def parse_main(self, response):
for el in response.css("li.feed-links"):
yield scrapy.Request(el.css("a::attr(href)").extract_first(),
callback=self.parse)
def parse_node(self, response, node):
self.logger.info('Hi, this is a <%s> node!: %s', self.itertag, ''.join(node.extract()))
item = TestItem()
item['id'] = node.xpath('#id').extract()
item['name'] = node.xpath('name').extract()
item['description'] = node.xpath('description').extract()
return item
Task: HTML - Parser in Scala. Im pretty new to scala.
So far: I have written a little Parser in Scala to parse a random html document.
import scala.xml.Elem
import scala.xml.Node
import scala.collection.mutable.Queue
import scala.xml.Text
import scala.xml.PrettyPrinter
object Reader {
def loadXML = {
val parserFactory = new org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl
val parser = parserFactory.newSAXParser()
val source = new org.xml.sax.InputSource("http://www.randomurl.com")
val adapter = new scala.xml.parsing.NoBindingFactoryAdapter
val feed = adapter.loadXML(source, parser)
feed
}
def proc(node: Node): String =
node match {
case <body>{ txt }</body> => "Partial content: " + txt
case _ => "grmpf"
}
def main(args: Array[String]): Unit = {
val content = Reader.loadXML
Console.println(content)
Console.println(proc(content))
}
}
The problem is that the "proc" does not work. Basically, I would like to get exactly the content of one node. Or is there another way to achieve that without matching?
Does the "feed" in the loadxml-function give me back the right format for parsing or is there a better way to achieve that? Feed gives me back the root node, right?
Thanks in advance
You're right: adapter.loadXML(source, parser) gives you the root node. The problem is that that root node probably isn't going to match the body case in in your proc method. Even if the root node were body, it still wouldn't match unless the element contained nothing but text.
You probably want something more like this:
def proc(node: Node): String = (node \\ "body").text
Where \\ is a selector method that's roughly equivalent to XPath's //—i.e., it returns all the descendants of node named body. If you know that body is a child (as opposed to a deeper descendant) of the root node, which is probably the case for HTML, you can use \ instead of \\.
I am having hard time with my NekoHTML parser.
It is working fine on URL's but when I want to test in on a simple XML test, it does not read it properly.
Here is how I declare it:
def createAndSetParser() {
SAXParser parser = new SAXParser() //Default Sax NekoHTML parser
def charset = "Windows-1252" // The encoding of the page
def tagFormat = "upper" // Ensures all the tags and consistently written, by putting all of them in upper-case. We can choose "lower", "upper" of "match"
def attrFormat = "lower" // Same thing for attributes. We can choose "upper", "lower" or "match"
Purifier purifier = new Purifier() //Creating a purifier, in order to clean the incoming HTML
XMLDocumentFilter[] filter = [purifier] //Creating a filter, and adding the purifier to this filter. (NekoHTML feature)
parser.setProperty("http://cyberneko.org/html/properties/filters", filter)
parser.setProperty("http://cyberneko.org/html/properties/default-encoding", charset)
parser.setProperty("http://cyberneko.org/html/properties/names/elems", tagFormat)
parser.setProperty("http://cyberneko.org/html/properties/names/attrs", attrFormat)
parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true) // Forces the parser to use the charset we provided to him.
parser.setFeature("http://cyberneko.org/html/features/override-doctype", false) // To let the Doctype as it is.
parser.setFeature("http://cyberneko.org/html/features/override-namespaces", false) // To make sure no namespace is added or overridden.
parser.setFeature("http://cyberneko.org/html/features/balance-tags", true)
return new XmlSlurper(parser) // A groovy parser that does not download the all tree structure, but rather supply only the information it is asked for.
}
Again it is working very fine when I use it on websites.
Any guess why I cannot do so on simple XML text samples ??
Any help greatly apreciated :)
I made your script executable on the Groovy Console to try it out easily using Grape to fetch the required NekoHTML library from the Maven Central Repository.
#Grapes(
#Grab(group='net.sourceforge.nekohtml', module='nekohtml', version='1.9.15')
)
import groovy.xml.StreamingMarkupBuilder
import org.apache.xerces.xni.parser.XMLDocumentFilter
import org.cyberneko.html.parsers.SAXParser
import org.cyberneko.html.filters.Purifier
def createAndSetParser() {
SAXParser parser = new SAXParser()
parser.setProperty("http://cyberneko.org/html/properties/filters", [new Purifier()] as XMLDocumentFilter[])
parser.setProperty("http://cyberneko.org/html/properties/default-encoding", "Windows-1252")
parser.setProperty("http://cyberneko.org/html/properties/names/elems", "upper")
parser.setProperty("http://cyberneko.org/html/properties/names/attrs", "lower")
parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true)
parser.setFeature("http://cyberneko.org/html/features/override-doctype", false)
parser.setFeature("http://cyberneko.org/html/features/override-namespaces", false)
parser.setFeature("http://cyberneko.org/html/features/balance-tags", true)
return new XmlSlurper(parser)
}
def printResult(def gPathResult) {
println new StreamingMarkupBuilder().bind { out << gPathResult }
}
def parser = createAndSetParser()
printResult parser.parseText('<html><body>Hello World</body></html>')
printResult parser.parseText('<house><room>bedroom</room><room>kitchen</room></house>')
When being executed this way the result of the two printResult-statements looks like shown below and can explain your issues parsing the XML string because it is wrapped into <html><body>...</body></html> tags and looses the root tag called <house/>:
<HTML><tag0:HEAD xmlns:tag0='http://www.w3.org/1999/xhtml'></tag0:HEAD><BODY>Hello World</BODY></HTML>
<HTML><BODY><ROOM>bedroom</ROOM><ROOM>kitchen</ROOM></BODY></HTML>
All this is caused by the http://cyberneko.org/html/features/balance-tags feature which you enabled in your script. If I disable this feature (it must be explicitly set to false because it defaults to true) the results looks like this:
<HTML><BODY>Hello World</BODY></HTML>
<HOUSE><ROOM>bedroom</ROOM><ROOM>kitchen</ROOM></HOUSE>
The JSON in question is being read in from a RESTful service, and I would like to print it out (to console, although in .gsp would be fine also) for debugging purposes. Groovy 1.3.7 (current as of August 2011) uses Groovy 1.7.8 (which does not have the JsonOutput introduced in 1.8)
Note I am currently reading it in like this, which I am not convinced is the 'grooviest or grail-est' way to do it - perhaps I could take advantage of the converters and pretty printing if done differently? Code sample would be appreciated.
def serviceURL = new URL(theURL)
def json = new JSONObject(serviceURL.text)
println json
You can pretty print JSON with the toString(int indentFactor) method. Example:
def json = new JSONObject()
json.put('foo', 'bar')
json.put('blat', 'greep')
println json
===>{"foo":"bar","blat","greep"}
println json.toString(4)
===>{
"foo": "bar",
"blat": "greep"
}
You can use grails.converters.JSON (which is the most commonly used library for JSON):
In your config.groovy file, add the line to set prettyPrint to true:
grails.converters.default.pretty.print=true
Then, in your controller:
import grails.converters.*
def serviceURL = new URL(theURL)
def json = JSON.parse(serviceURL.text)
println "JSON RESPONSE: ${json.toString()"
If you're in a Grails controller and plan to render the json, then you use something like this (using Grails 2.3.5):
public prettyJson() {
JSON json = ['status': 'OK'] as JSON
json.prettyPrint = true
json.render response
}
I found that solution here: http://www.intelligrape.com/blog/2012/07/16/rendering-json-with-formatting/
Apart from set default pretty print in Config.groovy, JSON's toString() method accepts one boolean parameter. It controls whether pretty print the result or not.
import grails.converters.*
import my.data.*
def accountJson = Account.get(1001) as JSON
println(accountJson.toString(true))
println(accountJson.toString(false))
Tested in Grails 1.3.9.