GoLang - XmlPath Selectors with HTML

I am looking at the documented example here, but it is iterating purely over an XML tree, and not HTML. Therefore, I am still partly confused.
For example, if I wanted to find a specific meta tag within the head tag by name, it seems I cannot? Instead, I need to find it by the order it is in the head tag. In this case, I want the 8th meta tag, which I assume is:
headTag, err := getByID(xmlroot, "/head/meta[8]/")
But of course, this is using a getByID function for a tag name - which I don't believe will work. What is the full list of "getBy..." commands?
Then, the problem is, how do I access the meta tag's contents? The documentation only provides examples for the inner tag node content. However, will this example work?:
resp.Query = extractValue(headTag, @content)
The @ selector confuses me; is it appropriate for this case?
In other words:
Is there a proper HTML example available?
Is there a list of correct selectors for IDs, Tags, etc?
Can Tags be found by name, and content extracted from its inner content tag?
Thank you very much!

I know this answer is late, but I still want to recommend htmlquery, a simple and powerful package based on XPath expressions.
The code below is based on @Time-Cooper's example.
package main

import (
	"fmt"

	"github.com/antchfx/htmlquery"
)

func main() {
	doc, err := htmlquery.LoadURL("https://example.com")
	if err != nil {
		panic(err)
	}

	s := htmlquery.Find(doc, "//meta[@name='viewport']")
	if len(s) == 0 {
		fmt.Println("could not find viewport")
		return
	}
	fmt.Println(htmlquery.SelectAttr(s[0], "content"))

	// Alternative, simpler method: select the attribute directly.
	s2 := htmlquery.FindOne(doc, "//meta[@name='viewport']/@content")
	fmt.Println(htmlquery.InnerText(s2))
}

XPath does not seem suitable here; you should be using goquery, which is designed for HTML.
Here is an example:
package main

import (
	"fmt"

	"github.com/PuerkitoBio/goquery"
)

func main() {
	doc, err := goquery.NewDocument("https://example.com")
	if err != nil {
		panic(err)
	}

	s := doc.Find(`html > head > meta[name="viewport"]`)
	if s.Length() == 0 {
		fmt.Println("could not find viewport")
		return
	}
	fmt.Println(s.Eq(0).AttrOr("content", ""))
}


Wrapping json member fields to object

My objective is to add fields to json on user request.
Everything is great, but when displaying the fields with
fmt.Printf("%s: %s\n", content.Date, content.Description)
an error occurs:
invalid character '{' after top-level value
And that is because after adding new fields the file looks like this:
{"Date":"2017-03-20 10:46:48","Description":"new"}
{"Date":"2017-03-20 10:46:51","Description":"new .go"}
The biggest problem is with writing to the file:
reminder := &Name{dateString[:19], text} //text - input string
newReminder, _ := json.Marshal(&reminder)
I don't really know how to do this properly.
My question is: how should I wrap all the member fields into one object?
And what is the best way to iterate through the member fields?
The code is available here: https://play.golang.org/p/NunV_B6sud
You should store the reminders in an array inside the JSON file, as mentioned by @Gerben Jacobs. Then, every time you want to add a new reminder, read the full contents of rem.json, append the new reminder in Go, truncate the file, and write the new slice back. Here's a quick implementation: https://play.golang.org/p/UKR91maQF2.
If you have lots of reminders and reading, decoding, encoding, and writing the whole file becomes a pain, you could open the file, truncate only the final ] from the file contents, and then append , + new reminder + ].
So after some research, people in the go-nuts group helped me and suggested using a streaming JSON parser that decodes items individually.
So I needed to change my reminder listing function:
func listReminders() error {
	f, err := os.Open("rem.json")
	if err != nil {
		return err
	}
	defer f.Close()
	dec := json.NewDecoder(f)
	for {
		var content Name
		switch err := dec.Decode(&content); err {
		case nil:
			fmt.Printf("%#v\n", content)
		case io.EOF:
			return nil
		default:
			return err
		}
	}
}
Now everything works the way I wanted.

HTML Validation with Golang

Within my API I have a POST end point. One of the expected parameters being posted to that end point is a block of (loosely) valid HTML.
The POST will be in the format of JSON.
Within golang, how can I ensure that the HTML which is posted is valid? I have been looking for something for a few days now and still haven't managed to find anything.
The term "valid" is kind of loose. I'm trying to ensure that tags are opened and closed, speech marks are in the right places, etc.
A bit late to the game, but here are a couple of parsers in Go that will work if you just want to validate the structure of the HTML (e.g. you don't care whether a div is inside a span, which is not allowed but is a schema-level problem):
x/net/html
The golang.org/x/net/html package contains a very loose parser. Almost anything will parse as valid HTML, similar to what a lot of web browsers try to do (e.g. it will ignore problems with unescaped values in many cases).
For example, something like <span>></span> will likely validate (I didn't check this particular one, I just made it up) as a span containing the '>' character.
It can be used something like this:
r := strings.NewReader(`<span>></span>`)
z := html.NewTokenizer(r)
for {
	tt := z.Next()
	if tt == html.ErrorToken {
		err := z.Err()
		if err == io.EOF {
			// Not an error, we're done and it's valid!
			return nil
		}
		return err
	}
}
encoding/xml
If you need something a tiny bit more strict, but which is still okay for HTML you can configure an xml.Decoder to work with HTML (this is what I do, it lets me be a bit more flexible about how strict I want to be in any given situation):
r := strings.NewReader(`<html></html>`)
d := xml.NewDecoder(r)

// Configure the decoder for HTML; leave off Strict and AutoClose for XHTML
d.Strict = false
d.AutoClose = xml.HTMLAutoClose
d.Entity = xml.HTMLEntity

for {
	_, err := d.Token()
	switch err {
	case io.EOF:
		return nil // We're done, it's valid!
	case nil:
		// Keep going.
	default:
		return err // Oops, something wasn't right
	}
}
You can check that the HTML blob provided parses correctly using html.Parse from this package. For validation only, all you have to do is check for errors.

In Go templates, I can get Parse to work but cannot get ParseFiles to work in like manner. Why?

I have the following code:
t, err := template.New("template").Funcs(funcMap).Parse("Howdy {{ myfunc . }}")
In this form everything works fine. But if I do exactly the same thing with ParseFiles, placing the text above in template.html, it's a no-go:
t, err := template.New("template").Funcs(funcMap).ParseFiles("template.html")
I was able to get ParseFiles to work in the following form, but cannot get Funcs to take effect:
t, err := template.ParseFiles("template.html")
t.Funcs(funcMap)
Of course, this last form is a direct call to a function instead of a call to the receiver method, so not the same thing.
Anyone have any ideas what's going on here? It's difficult to find much detail on templates out in the ether.
Did some digging and found this comment for template.ParseFiles in the source:
First template becomes return value if not already defined, and we use that one for subsequent New calls to associate all the templates together. Also, if this file has the same name as t, this file becomes the contents of t, so t, err := New(name).Funcs(xxx).ParseFiles(name) works. Otherwise we create a new template associated with t.
So the format should be as follows, given my example above:
t, err := template.New("template.html").Funcs(funcMap).ParseFiles("path/template.html")
.New("template.html") creates an empty template with the given name, .Funcs(funcMap) associates any custom functions we want to apply to our templates, and then .ParseFiles("path/template.html") parses one or more templates with an awareness of those functions and associates the contents with a template of that name.
Note that the base name of the first file MUST be the same as the name used in New. Any content parsed will be associated with either an empty preexisting template having the same base name of the first file in the series or with a new template having that base name.
So in my example above, one empty template named "template" was created and had a function map associated with it. Then a new template named "template.html" was created. THESE ARE NOT THE SAME! And since ParseFiles was called last, t ends up being the "template.html" template, without any functions attached.
What about the last example? Why didn't this work? template.ParseFiles calls the receiver method Parse, which in turn applies any previously registered functions:
trees, err := parse.Parse(t.name, text, t.leftDelim, t.rightDelim, t.parseFuncs, builtins)
This means that custom functions have to be registered prior to parsing. Adding the functions after parsing has no effect, leading to errors when a template that references a custom function is parsed or executed.
So I think that covers my original question. The design here seems a little clunky. It doesn't make sense that I can chain ParseFiles onto one template and end up with a different template from the one I am chaining from, just because they don't happen to share a name. That is counterintuitive and probably ought to be addressed in future releases to avoid confusion.
ParseFiles registers each parsed template under the base name of its file, but calling template.New here creates a separate template named "error". So you need to select the template you want with Lookup.
foo.go
package main

import (
	"log"
	"os"
	"strings"
	"text/template"
)

func main() {
	tmpl, err := template.New("error").Funcs(template.FuncMap{
		"trim": strings.TrimSpace,
	}).ParseFiles("foo.tmpl")
	if err != nil {
		log.Fatal(err)
	}
	tmpl = tmpl.Lookup("foo.tmpl")
	err = tmpl.Execute(os.Stdout, " string contains spaces both ")
	if err != nil {
		log.Fatal(err)
	}
}
foo.tmpl
{{. | trim}}
try this:
var templates = template.Must(template.New("").Funcs(fmap).ParseFiles("1.tmpl", "2.tmpl"))

Parse broken HTML with golang

I need to find elements in an HTML string. Unfortunately the HTML is pretty much broken (e.g. closing tags without an opening pair).
I tried to use XPath with launchpad.net/xmlpath, but it can't parse an HTML file this badly broken.
How can I find elements in a broken HTML with golang? I would prefer using XPath, but I am open for other solutions too if I can use it to look for tags with a specific id or class.
It seems net/html does the job.
So that's what I am doing now:
package main

import (
	"bytes"
	"log"
	"strings"

	"golang.org/x/net/html"
	"gopkg.in/xmlpath.v2"
)

func main() {
	brokenHtml := `<!DOCTYPE html><html><body><h1 id="someid">My First Heading</h1><p>paragraph</body></html>`

	reader := strings.NewReader(brokenHtml)
	root, err := html.Parse(reader)
	if err != nil {
		log.Fatal(err)
	}

	var b bytes.Buffer
	html.Render(&b, root)
	fixedHtml := b.String()

	reader = strings.NewReader(fixedHtml)
	xmlroot, xmlerr := xmlpath.ParseHTML(reader)
	if xmlerr != nil {
		log.Fatal(xmlerr)
	}

	path := xmlpath.MustCompile(`//h1[@id='someid']`)
	if value, ok := path.String(xmlroot); ok {
		log.Println("Found:", value)
	}
}

Equivalent to Python's HTML parsing function/module in Go?

I'm now learning Go myself and am stuck in getting and parsing HTML/XML. In Python, I usually write the following code when I do web scraping:
from urllib.request import urlopen, Request
url = "http://stackoverflow.com/"
req = Request(url)
html = urlopen(req).read()
Then I can get the raw HTML/XML as a string or bytes and proceed to work with it. In Go, how can I do the same? What I hope to get is raw HTML data stored either in a string or a []byte (it can be easily converted, so I don't mind which). I'm considering the gokogiri package for web scraping in Go (not sure I'll actually end up using it!), but it looks like it requires the raw HTML text before doing any work with it...
So how can I acquire such object?
Or is there any better way to do web scraping work in Go?
Thanks.
From the Go http.Get Example:
package main

import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)

func main() {
	res, err := http.Get("http://www.google.com/robots.txt")
	if err != nil {
		log.Fatal(err)
	}
	robots, err := ioutil.ReadAll(res.Body)
	res.Body.Close()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s", robots)
}
This reads the contents of http://www.google.com/robots.txt into the variable robots (a []byte, which %s prints as a string).
For XML parsing look into the Go encoding/xml package.