Handling malformed HTML with Go's net/html tokenizer? - html

I’ve found that html.NewTokenizer() doesn’t auto-fix some things, so it’s possible to end up with a stray closing tag (html.EndTagToken). For example, <div></p></div> tokenizes as html.StartTagToken, html.EndTagToken, html.EndTagToken.
Is there a recommended way to handle (ignore, remove, or fix) these stray tags?
My first guess would be to manually keep a []atom.Atom slice and push/pop it as each tag starts/ends (comparing tags so an unexpected end tag is caught); see the sketch after the output below.
Here is some code to demonstrate the problem:
var err error
htm := `<div><div><p></p></p></div>`
tokenizer := html.NewTokenizer(strings.NewReader(htm))

for {
    if tokenizer.Next() == html.ErrorToken {
        err = tokenizer.Err()
        if err == io.EOF {
            err = nil
        }
        return
    }

    token := tokenizer.Token()
    switch token.Type {
    case html.DoctypeToken:
        continue
    case html.CommentToken:
        continue
    case html.SelfClosingTagToken:
        fmt.Println(token.Data)
        continue
    case html.StartTagToken:
        fmt.Printf("<%s>\n", token.Data)
    case html.EndTagToken:
        fmt.Printf("</%s>\n", token.Data)
    case html.TextToken:
        continue
    default:
        continue
    }
}
Output:
<div>
<div>
<p>
</p>
</p>
</div>
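For reference, here's a rough, untested sketch of the atom-stack idea mentioned above; it simply drops end tags that don't match the most recently opened element:

package main

import (
    "fmt"
    "io"
    "strings"

    "golang.org/x/net/html"
    "golang.org/x/net/html/atom"
)

func main() {
    htm := `<div><div><p></p></p></div>`
    tokenizer := html.NewTokenizer(strings.NewReader(htm))

    var stack []atom.Atom // currently open tags, most recent last
    for {
        if tokenizer.Next() == html.ErrorToken {
            if err := tokenizer.Err(); err != io.EOF {
                fmt.Println(err)
            }
            return
        }
        token := tokenizer.Token()
        switch token.Type {
        case html.StartTagToken:
            stack = append(stack, token.DataAtom)
            fmt.Printf("<%s>\n", token.Data)
        case html.EndTagToken:
            // Drop end tags that don't match the most recently opened tag.
            if len(stack) == 0 || stack[len(stack)-1] != token.DataAtom {
                continue
            }
            stack = stack[:len(stack)-1]
            fmt.Printf("</%s>\n", token.Data)
        }
    }
}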

FWIW, it seems that net/html can fix such issues when you use its Parse method. Here's an example adapted from another SO answer, using your malformed HTML snippet:
package main

import (
    "bytes"
    "fmt"
    "log"
    "strings"

    "golang.org/x/net/html"
)

func main() {
    brokenHtml := `<div><div><p></p></p></div>`
    reader := strings.NewReader(brokenHtml)

    root, err := html.Parse(reader)
    if err != nil {
        log.Fatal(err)
    }

    var b bytes.Buffer
    html.Render(&b, root)
    fixedHtml := b.String()
    fmt.Println(fixedHtml)
}
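Note that html.Parse always builds a complete document tree, so the rendered output is the (now balanced) fragment wrapped in <html><head></head><body>...</body></html>; the stray </p> is handled according to the HTML5 parsing rules.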

Related

Golang: Type [type] is not an expression; json config parsing

I'm trying to work out a bit of code to pull in config from a JSON file.
When I attempt to build, I get this error:
type ConfigVars is not an expression
Below are the config file and program code I'm trying to work with. Every example I've found so far is similar to the code below. Any suggestions about what I'm doing incorrectly?
-- Config File
{"beaconUrl":"http://test.com/?id=1"}

-- Program Code
package main

import (
    "encoding/json"
    "fmt"
    "os"
)

type ConfigVars struct {
    BeaconUrl string
}

func main() {
    configFile, err := os.Open("config.json")
    defer configFile.Close()
    if err != nil {
        fmt.Println("Opening config file", err.Error())
    }

    jsonParser := json.NewDecoder(configFile)
    if err = jsonParser.Decode(&ConfigVars); err != nil {
        fmt.Println("Parsing config file", err.Error())
    }
}
What you're doing there is trying to pass a pointer to the ConfigVars type (which obviously doesn't really mean anything). What you want to do is make a variable whose type is ConfigVars and pass a pointer to that instead:
var cfg ConfigVars
err = jsonParser.Decode(&cfg)
...
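Putting that together, a minimal corrected version of the original program might look like this (a sketch; it also moves the defer after the error check and bails out early on failure):

package main

import (
    "encoding/json"
    "fmt"
    "os"
)

type ConfigVars struct {
    BeaconUrl string
}

func main() {
    configFile, err := os.Open("config.json")
    if err != nil {
        fmt.Println("Opening config file:", err.Error())
        return
    }
    defer configFile.Close()

    var cfg ConfigVars // a value of type ConfigVars, not the type itself
    if err := json.NewDecoder(configFile).Decode(&cfg); err != nil {
        fmt.Println("Parsing config file:", err.Error())
        return
    }
    fmt.Println(cfg.BeaconUrl)
}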
For others who come across this problem: you may find that you've forgotten to initialize the variable during assignment using the := operator, as described in Point 3 at the end of this GoLang tutorial.
var cfg ConfigVars
err := jsonParser.Decode(&cfg)

Differences in parsing json with a custom unmarshaller between golang versions

I am trying to port some code written against go1.3 to current versions and ran into a case where the JSON parsing behavior differs between versions. We are using a custom unmarshaller for parsing a specific date format. It looks like recent versions pass in the string with additional quotes, which 1.3 did not.
Is this a bug or an intentional change? And what's the best way of writing code that is compatible with different versions in this situation? Should I go looking for all places where a custom unmarshaller is in use and always strip out extra quotes, if any? It would be a pity to have to do that, so I am hoping there is a better way.
package main

import (
    "encoding/json"
    "fmt"
    "time"
)

type Timestamp1 time.Time

func (t *Timestamp1) UnmarshalJSON(b []byte) (err error) {
    fmt.Println("String to parse as timestamp:", string(b))
    parsedTime, err := time.Parse("2006-01-02T15:04:05", string(b))
    if err == nil {
        *t = Timestamp1(parsedTime)
        return nil
    } else {
        return err
    }
}

type S struct {
    LastUpdatedDate Timestamp1 `json:"last_updated_date,string"`
}

func main() {
    s := `{"last_updated_date" : "2015-11-03T10:00:00"}`
    var s1 S
    err := json.Unmarshal([]byte(s), &s1)
    fmt.Println(err)
    fmt.Println(s1)
}
There was a bug concerning the json:",string" tag that was fixed in Go 1.5. If there isn't a particular reason you need the tag, you can remove it and simply adjust your format:
// N.B. time is in quotes.
parsedTime, err := time.Parse(`"2006-01-02T15:04:05"`, string(b))
Playground: http://play.golang.org/p/LgWuKcPEuI.
This should work in 1.3 as well as 1.5.
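If you do need to keep the json:",string" tag, or have to run the same code on multiple Go versions, another option is to make the unmarshaller tolerant of surrounding quotes by trimming them before parsing. A sketch (note it adds a strings import):

func (t *Timestamp1) UnmarshalJSON(b []byte) error {
    // Accept the value with or without surrounding quotes, so the same
    // code works regardless of how encoding/json hands us the bytes.
    s := strings.Trim(string(b), `"`)
    parsedTime, err := time.Parse("2006-01-02T15:04:05", s)
    if err != nil {
        return err
    }
    *t = Timestamp1(parsedTime)
    return nil
}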

Golang - Parsing JSON string arrays from Twitch TV RESTful service

I've been working on parsing a JSON object that I retrieve through an HTTP GET request using Go's built-in HTTP library. I initially tried to do this with Go's standard JSON library, but I was having a difficult time (I am still a novice in Go). I eventually resorted to a different library and had little trouble after that, as shown below:
package main

import (
    "fmt"
    "net/http"

    "github.com/antonholmquist/jason"
)

func main() {
    resp, err := http.Get("http://tmi.twitch.tv/group/user/deernadia/chatters")
    if nil != err {
        panic(err)
    }
    defer resp.Body.Close()

    body, err := jason.NewObjectFromReader(resp.Body)
    chatters, err := body.GetObject("chatters")
    if nil != err {
        panic(err)
    }

    moderators, err := chatters.GetStringArray("moderators")
    if nil != err {
        panic(err)
    }

    for _, moderator := range moderators {
        fmt.Println(moderator)
    }
}
Where github.com/antonholmquist/jason corresponds to the custom JSON library I used.
This code produces something similar to the following output when run in a Linux shell (the RESTful service will update about every 30 seconds or so, which means the values in the JSON object will potentially change):
antwan250
bbrock89
boxception22
cmnights
deernadia
fartfalcon
fijibot
foggythought
fulc_
h_ov
iceydefeat
kingbobtheking
lospollogne
nightbot
nosleeptv
octaviuskhan
pateyy
phosphyg
poisonyvie
shevek18
trox94
trox_bot
uggasmesh
urbanelf
walmartslayer
wift3
And the raw JSON looks similar to this (with some of the users removed for brevity):
{
    "_links": {},
    "chatter_count": 469,
    "chatters": {
        "moderators": [
            "antwan250",
            "bbrock89",
            "boxception22",
            "cmnights",
            "deernadia",
            "fartfalcon",
            "fijibot",
            "foggythought",
            "fulc_",
            "h_ov",
            "iceydefeat",
            "kingbobtheking",
            "lospollogne",
            "nightbot",
            "nosleeptv",
            "octaviuskhan",
            "pateyy",
            "phosphyg",
            "poisonyvie",
            "shevek18",
            "trox94",
            "trox_bot",
            "uggasmesh",
            "urbanelf",
            "walmartslayer",
            "wift3"
        ],
        "staff": [
            "tnose"
        ],
        "admins": [],
        "global_mods": [],
        "viewers": [
            "03xuxu30",
            "0dominic0",
            "3389942",
            "812mfk",
            "910dan",
            "aaradabooti",
            "admiralackbar99",
            "adrian97lol",
            "aequitaso_o",
            "aethiris",
            "afropigeon",
            "ahhhmong",
            "aizaix",
            "aka_magosh",
            "akitoalexander",
            "alex5761",
            "allenhei",
            "allou_fun_park",
            "amilton_tkm",
            "... more users that I removed...",
            "zachn17",
            "zero_x1",
            "zigslip",
            "ziirbryad",
            "zonato83",
            "zorr03body",
            "zourtv"
        ]
    }
}
As I said before, I'm using a custom library hosted on GitHub to accomplish what I need, but for the sake of learning I'm curious: how would I accomplish this same thing using Go's built-in JSON library?
To be clear, what I'd like to do is harvest the users from each JSON array embedded within the JSON object returned by the HTTP GET request. I'd also like to get the lists of viewers, admins, global moderators, etc. in the same way, but I figured that if I can see the moderator example using the default Go library, reproducing that code for the other user types will be trivial.
Thank you in advance!
If you want to unmarshal moderators only, use the following:
var v struct {
    Chatters struct {
        Moderators []string
    }
}
if err := json.Unmarshal(data, &v); err != nil {
    // handle error
}
for _, mod := range v.Chatters.Moderators {
    fmt.Println(mod)
}
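Note that encoding/json matches JSON keys to exported struct fields case-insensitively, so the lowercase "chatters" and "moderators" keys map onto the Chatters and Moderators fields here without any struct tags.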
If you want to get all types of chatters, use the following:
var v struct {
    Chatters map[string][]string
}
if err := json.Unmarshal(data, &v); err != nil {
    // handle error
}
for kind, users := range v.Chatters {
    for _, user := range users {
        fmt.Println(kind, user)
    }
}
run the code on the playground
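For completeness, here is a self-contained sketch that wires the map-based version up to the same HTTP endpoint using only the standard library (it assumes the endpoint still returns the shape shown in the question):

package main

import (
    "encoding/json"
    "fmt"
    "net/http"
)

func main() {
    resp, err := http.Get("http://tmi.twitch.tv/group/user/deernadia/chatters")
    if err != nil {
        panic(err)
    }
    defer resp.Body.Close()

    // Decode only the "chatters" object; each key ("moderators",
    // "viewers", ...) maps to a list of user names.
    var v struct {
        Chatters map[string][]string `json:"chatters"`
    }
    if err := json.NewDecoder(resp.Body).Decode(&v); err != nil {
        panic(err)
    }

    for kind, users := range v.Chatters {
        for _, user := range users {
            fmt.Println(kind, user)
        }
    }
}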

goroutine channels over a for loop

My main function reads json from a file, unmarshals it into a struct, converts it into another struct type and spits out formatted JSON through stdout.
I'm trying to implement goroutines and channels to add concurrency to my for loop.
func main() {
    muvMap := map[string]string{"male": "M", "female": "F"}

    fileA, err := os.Open("serviceAfileultimate.json")
    if err != nil {
        panic(err)
    }
    defer fileA.Close()

    data := make([]byte, 10000)
    count, err := fileA.Read(data)
    if err != nil {
        panic(err)
    }
    dataBytes := data[:count]

    var servicesA ServiceA
    json.Unmarshal(dataBytes, &servicesA)

    var servicesB = make([]ServiceB, servicesA.Count)
    goChannels := make(chan ServiceB, servicesA.Count)

    for i := 0; i < servicesA.Count; i++ {
        go func() {
            reflect.ValueOf(&servicesB[i]).Elem().FieldByName("Address").SetString(Merge(&servicesA.Users[i].Location))
            reflect.ValueOf(&servicesB[i]).Elem().FieldByName("Date_Of_Birth").SetString(dateCopyTransform(servicesA.Users[i].Dob))
            reflect.ValueOf(&servicesB[i]).Elem().FieldByName("Email").SetString(servicesA.Users[i].Email)
            reflect.ValueOf(&servicesB[i]).Elem().FieldByName("Fullname").SetString(Merge(&servicesA.Users[i].Name))
            reflect.ValueOf(&servicesB[i]).Elem().FieldByName("Gender").SetString(muvMap[servicesA.Users[i].Gender])
            reflect.ValueOf(&servicesB[i]).Elem().FieldByName("Phone").SetString(servicesA.Users[i].Cell)
            reflect.ValueOf(&servicesB[i]).Elem().FieldByName("Username").SetString(servicesA.Users[i].Username)
            goChannels <- servicesB[i]
        }()
    }

    for index := range goChannels {
        json.NewEncoder(os.Stdout).Encode(index)
    }
}
It compiles but is returning messages like:
goroutine 1 [chan receive]: main.main() C://.....go.94 +0x55b.
You're printing the channel's info, not the data it contains. You don't want a loop there; you just want to receive and then print:
data := <-goChannels
json.NewEncoder(os.Stdout).Encode(data)
Now, I do need to point out that that code is not going to block. If you want to keep reading until all the work is done, you need some kind of locking/coordination mechanism.
You'll often see things like
for {
    select {
    case json := <-jsonChannel:
        // do stuff
    case <-abort:
        // get out of here
    }
}
To deal with that. Also, just FYI: you're initializing your channel with a capacity (meaning it's a buffered channel), which is pretty odd. I'd recommend reviewing some tutorials on the topic, because overall your design needs some work to actually be an improvement over a non-concurrent implementation. Lastly, you can find libraries that abstract some of this work for you, and most people would probably recommend you use one. Here's an example: https://github.com/lytics/squaredance
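For what it's worth, a common coordination pattern is to close the channel once every worker is done (via a sync.WaitGroup) so that a range loop over it terminates on its own. A minimal, self-contained sketch using a stand-in struct rather than your ServiceB:

package main

import (
    "encoding/json"
    "fmt"
    "os"
    "sync"
)

// result is a hypothetical stand-in for the ServiceB struct in the question.
type result struct {
    Index int `json:"index"`
}

func main() {
    const n = 5
    results := make(chan result, n)

    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func(i int) { // pass i so each goroutine gets its own copy
            defer wg.Done()
            results <- result{Index: i}
        }(i)
    }

    // Close the channel once every worker is done, so the range loop
    // below terminates instead of blocking forever.
    go func() {
        wg.Wait()
        close(results)
    }()

    enc := json.NewEncoder(os.Stdout)
    for r := range results {
        if err := enc.Encode(r); err != nil {
            fmt.Fprintln(os.Stderr, err)
        }
    }
}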

Parse broken HTML with golang

I need to find elements in an HTML string. Unfortunately the HTML is pretty badly broken (e.g. closing tags without an opening pair).
I tried to use XPath with launchpad.net/xmlpath, but it can't parse an HTML file that is this buggy.
How can I find elements in broken HTML with golang? I would prefer XPath, but I'm open to other solutions too, as long as I can use them to look for tags with a specific id or class.
It seems net/html does the job.
So that's what I am doing now:
package main

import (
    "bytes"
    "log"
    "strings"

    "golang.org/x/net/html"
    "gopkg.in/xmlpath.v2"
)

func main() {
    brokenHtml := `<!DOCTYPE html><html><body><h1 id="someid">My First Heading</h1><p>paragraph</body></html>`
    reader := strings.NewReader(brokenHtml)

    root, err := html.Parse(reader)
    if err != nil {
        log.Fatal(err)
    }

    var b bytes.Buffer
    html.Render(&b, root)
    fixedHtml := b.String()

    reader = strings.NewReader(fixedHtml)
    xmlroot, xmlerr := xmlpath.ParseHTML(reader)
    if xmlerr != nil {
        log.Fatal(xmlerr)
    }

    xpath := `//h1[@id='someid']`
    path := xmlpath.MustCompile(xpath)
    if value, ok := path.String(xmlroot); ok {
        log.Println("Found:", value)
    }
}