Parse broken HTML with golang

I need to find elements in an HTML string. Unfortunately the HTML is pretty much broken (e.g. closing tags without an opening pair).
I tried to use XPath with launchpad.net/xmlpath, but it can't parse HTML that is this badly broken.
How can I find elements in broken HTML with Go? I would prefer XPath, but I am open to other solutions as long as I can use them to look for tags with a specific id or class.

It seems net/html does the job.
So that's what I am doing now:
package main
import (
	"bytes"
	"log"
	"strings"
	"golang.org/x/net/html"
	"gopkg.in/xmlpath.v2"
)
func main() {
	brokenHtml := `<!DOCTYPE html><html><body><h1 id="someid">My First Heading</h1><p>paragraph</body></html>`
	// Let net/html repair the markup while parsing.
	root, err := html.Parse(strings.NewReader(brokenHtml))
	if err != nil {
		log.Fatal(err)
	}
	// Render the repaired tree back to a string.
	var b bytes.Buffer
	if err := html.Render(&b, root); err != nil {
		log.Fatal(err)
	}
	fixedHtml := b.String()
	// Parse the repaired HTML with xmlpath and query it with XPath.
	xmlroot, xmlerr := xmlpath.ParseHTML(strings.NewReader(fixedHtml))
	if xmlerr != nil {
		log.Fatal(xmlerr)
	}
	// Attributes are selected with @ in XPath.
	path := xmlpath.MustCompile(`//h1[@id='someid']`)
	if value, ok := path.String(xmlroot); ok {
		log.Println("Found:", value)
	}
}

Related

Handling malformed HTML with Go's net/html tokenizer?

I’ve found that the html.NewTokenizer() doesn’t auto-fix some things. So it’s possible that you can end up with a stray closing tag (html.EndTagToken). So <div></p></div> would be html.StartTagToken, html.EndTagToken, html.EndTagToken.
Is there a recommended solution for handling ignoring/removing/fixing these tags?
My first guess would be manually keeping a []atom.Atom slice and push/pop to the list as you start/end each tag (after comparing the tag to make sure you don’t get an unexpected end tag).
Here is some code to demonstrate the problem:
var err error
htm := `<div><div><p></p></p></div>`
tokenizer := html.NewTokenizer(strings.NewReader(htm))
for {
	if tokenizer.Next() == html.ErrorToken {
		err = tokenizer.Err()
		if err == io.EOF {
			err = nil
		}
		return
	}
	token := tokenizer.Token()
	switch token.Type {
	case html.DoctypeToken:
		continue
	case html.CommentToken:
		continue
	case html.SelfClosingTagToken:
		fmt.Println(token.Data)
		continue
	case html.StartTagToken:
		fmt.Printf("<%s>\n", token.Data)
	case html.EndTagToken:
		fmt.Printf("</%s>\n", token.Data)
	case html.TextToken:
		continue
	default:
		continue
	}
}
Output:
<div>
<div>
<p>
</p>
</p>
</div>

FWIW, it seems that net/html can fix such issues when you use its Parse method. Here's an example adapted from another SO answer, using your malformed HTML snippet:
package main
import (
	"bytes"
	"fmt"
	"log"
	"strings"
	"golang.org/x/net/html"
)
func main() {
	brokenHtml := `<div><div><p></p></p></div>`
	reader := strings.NewReader(brokenHtml)
	root, err := html.Parse(reader)
	if err != nil {
		log.Fatal(err)
	}
	var b bytes.Buffer
	html.Render(&b, root)
	fixedHtml := b.String()
	fmt.Println(fixedHtml)
}

GoLang - XmlPath Selectors with HTML

I am looking at the documented example here, but it is iterating purely over an XML tree, and not HTML. Therefore, I am still partly confused.
For example, if I wanted to find a specific meta tag within the head tag by name, it seems I cannot? Instead, I need to find it by the order it is in the head tag. In this case, I want the 8th meta tag, which I assume is:
headTag, err := getByID(xmlroot, "/head/meta[8]/")
But of course, this is using a getByID function for a tag name - which I don't believe will work. What is the full list of "getBy..." commands?
Then, the problem is, how do I access the meta tag's contents? The documentation only provides examples for the inner tag node content. However, will this example work?:
resp.Query = extractValue(headTag, @content)
The @ selector confuses me; is it appropriate for this case?
In other words:
Is there a proper HTML example available?
Is there a list of correct selectors for IDs, Tags, etc?
Can Tags be found by name, and content extracted from its inner content tag?
Thank you very much!

I know this answer is late, but I still want to recommend the htmlquery package, which is simple and powerful and is based on XPath expressions.
The code below is based on @Time-Cooper's example.
package main
import (
	"fmt"
	"github.com/antchfx/htmlquery"
)
func main() {
	doc, err := htmlquery.LoadURL("https://example.com")
	if err != nil {
		panic(err)
	}
	// Select the meta element, then read its content attribute.
	s := htmlquery.Find(doc, "//meta[@name='viewport']")
	if len(s) == 0 {
		fmt.Println("could not find viewport")
		return
	}
	fmt.Println(htmlquery.SelectAttr(s[0], "content"))
	// Alternative, simpler method: select the attribute node directly.
	s2 := htmlquery.FindOne(doc, "//meta[@name='viewport']/@content")
	fmt.Println(htmlquery.InnerText(s2))
}

XPath does not seem suitable here; you should be using goquery, which is designed for HTML.
Here is an example:
package main
import (
	"fmt"
	"github.com/PuerkitoBio/goquery"
)
func main() {
	// Note: newer goquery releases mark NewDocument as deprecated in favor of NewDocumentFromReader.
	doc, err := goquery.NewDocument("https://example.com")
	if err != nil {
		panic(err)
	}
	s := doc.Find(`html > head > meta[name="viewport"]`)
	if s.Length() == 0 {
		fmt.Println("could not find viewport")
		return
	}
	fmt.Println(s.Eq(0).AttrOr("content", ""))
}

Golang: Type [type] is not an expression; json config parsing

I'm trying to work out a bit of code to pull in config from a JSON file.
When I attempt to build, I get this error
type ConfigVars is not an expression
Below is the config and program code I'm trying to work with. Every example I've found so far is similar to the below code. Any suggestion of what I'm doing incorrectly?
-- Config File
{"beaconUrl":"http://test.com/?id=1"}
-- Program Code
package main
import (
	"encoding/json"
	"fmt"
	"os"
)
type ConfigVars struct {
	BeaconUrl string
}
func main() {
	configFile, err := os.Open("config.json")
	defer configFile.Close()
	if err != nil {
		fmt.Println("Opening config file", err.Error())
	}
	jsonParser := json.NewDecoder(configFile)
	if err = jsonParser.Decode(&ConfigVars); err != nil {
		fmt.Println("Parsing config file", err.Error())
	}
}

What you're doing there is trying to pass a pointer to the ConfigVars type (which obviously doesn't really mean anything). What you want to do is make a variable whose type is ConfigVars and pass a pointer to that instead:
var cfg ConfigVars
err = jsonParser.Decode(&cfg)
...

For others who come across this problem: you may find that you've forgotten to initialize the variable during assignment using the := operator, as described in Point 3 at the end of this GoLang tutorial.
var cfg ConfigVars
err := jsonParser.Decode(&cfg)

Differences in parsing json with a custom unmarshaller between golang versions

I am trying to port some code written against go1.3 to current versions and ran into a case where the json parsing behavior is different between versions. We are using a custom unmarshaller for parsing some specific date format. It looks like recent versions pass in the string with additional quotes which 1.3 did not.
Is this a bug or an intentional change? And what's the best way to write code that is compatible with different versions in this situation? Should I just find every place where a custom unmarshaller is used and always strip any extra quotes? It would be a pity to have to do that, so I am hoping there is a better way.
package main
import "encoding/json"
import "fmt"
import "time"
type Timestamp1 time.Time
func (t *Timestamp1) UnmarshalJSON(b []byte) (err error) {
	fmt.Println("String to parse as timestamp:", string(b))
	parsedTime, err := time.Parse("2006-01-02T15:04:05", string(b))
	if err == nil {
		*t = Timestamp1(parsedTime)
		return nil
	} else {
		return err
	}
}
type S struct {
	LastUpdatedDate Timestamp1 `json:"last_updated_date,string"`
}
func main() {
	s := `{"last_updated_date" : "2015-11-03T10:00:00"}`
	var s1 S
	err := json.Unmarshal([]byte(s), &s1)
	fmt.Println(err)
	fmt.Println(s1)
}

There was a bug concerning the json:",string" tag that was fixed in 1.5. Unless you have a particular reason to need the tag, you can remove it and simply adjust your format:
// N.B. time is in quotes.
parsedTime, err := time.Parse(`"2006-01-02T15:04:05"`, string(b))
Playground: http://play.golang.org/p/LgWuKcPEuI.
This should work in 1.3 as well as 1.5.

Equivalent to Python's HTML parsing function/module in Go?

I'm learning Go on my own and am stuck on fetching and parsing HTML/XML. In Python, I usually write the following when web scraping:
from urllib.request import urlopen, Request
url = "http://stackoverflow.com/"
req = Request(url)
html = urlopen(req).read()
This gives me the raw HTML/XML as either a string or bytes, and I can work with it from there. In Go, how can I do the same? I'd like the raw HTML stored as either a string or []byte (they convert easily, so I don't mind which). I'm considering the gokogiri package for web scraping in Go (not sure I'll end up using it), but it looks like it requires the raw HTML text before doing any work with it...
So how can I acquire such object?
Or is there any better way to do web scraping work in Go?
Thanks.

From the Go http.Get Example:
package main
import (
	"fmt"
	"io/ioutil"
	"log"
	"net/http"
)
func main() {
	res, err := http.Get("http://www.google.com/robots.txt")
	if err != nil {
		log.Fatal(err)
	}
	// Read the whole response body into memory, then close it.
	robots, err := ioutil.ReadAll(res.Body)
	res.Body.Close()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s", robots)
}
This reads the contents of http://www.google.com/robots.txt into the variable robots, which is a []byte (convert it with string(robots) if you need a string).
For XML parsing look into the Go encoding/xml package.
