HTML Validation with Golang - html

Within my API I have a POST end point. One of the expected parameters being posted to that end point is a block of (loosely) valid HTML.
The POST will be in the format of JSON.
Within golang how can I ensure that the HTML which is posted is valid? I have been looking for something for a few days now and still haven't managed to find anything?
The term "valid" is kind of loose. I trying to ensure that tags are opened and closed, speech marks are in the right places etc.

A bit late to the game, but here are a couple of parsers in Go that will work if you just want to validate the structure of the HTML (eg. you don't care if a div is inside a span, which is not allowed but is a schema level problem):
x/net/html
The golang.org/x/net/html package contains a very loose parser. Almost anything will result in valid HTML, similar to what a lot of web browsers try to do (eg. it will ignore problems with unescaped values in many cases).
For example, something like <span>></span> will likely validate (I didn't check this particular one, I just made it up) as a span with the '>' character in it.
It can be used something like this:
r := strings.NewReader(`<span>></span>`)
z := html.NewTokenizer(r)
for {
tt := z.Next()
if tt == html.ErrorToken {
err := z.Err()
if err == io.EOF {
// Not an error, we're done and it's valid!
return nil
}
return err
}
}
encoding/xml
If you need something a tiny bit more strict, but which is still okay for HTML you can configure an xml.Decoder to work with HTML (this is what I do, it lets me be a bit more flexible about how strict I want to be in any given situation):
r := strings.NewReader(`<html></html>`)
d := xml.NewDecoder(r)
// Configure the decoder for HTML; leave off strict and autoclose for XHTML
d.Strict = false
d.AutoClose = xml.HTMLAutoClose
d.Entity = xml.HTMLEntity
for {
tt, err := d.Token()
switch err {
case io.EOF:
return nil // We're done, it's valid!
case nil:
default:
return err // Oops, something wasn't right
}
}

You check that the HTML blob provided parses correctly using html.Parse from this package. For validation only, all you have to do is check for errors.

Related

Golang struct unmarshal xss

I have a struct which has XSS injected in it. In order to remove it, I json.Marshal it, then run json.HTMLEscape. Then I json.Unmarshal it into a new struct.
The problem is the new struct has XSS injected still.
I simply can't figure how to remove the XSS from the struct. I can write a function to do it on the field but considering there is json.HTMLEscape and we can Unmarshal it back it should work fine, but its not.
type Person struct {
Name string `json:"name"`
}
func main() {
var p, p2 Person
// p.Name has XSS
p.Name = "<script>alert(1)</script>"
var tBytes bytes.Buffer
// I marshal it so I can use json.HTMLEscape
marshalledJson, _ := json.Marshal(p)
json.HTMLEscape(&tBytes, marshalledJson)
// here I insert it into a new struct, sadly the p2 struct has the XSS still
err := json.Unmarshal(tBytes.Bytes(), &p2)
if err != nil {
fmt.Printf(err.Error())
}
fmt.Print(p2)
}
expected outcome is p2.Name to be sanitized like <script>alert(1)</script>
First, json.HTMLEscape doesn't do what you want:
HTMLEscape appends to dst the JSON-encoded src with <, >, &, U+2028 and U+2029 characters inside string literals changed to \u003c, \u003e, \u0026, \u2028, \u2029 so that the JSON will be safe to embed inside HTML <script> tags.
but what you want is:
p2.Name to be sanitized like <script>alert(1)</script>
which you can get by calling html.EscapeString, but not by any of the json encoder routines.1
Second, if you inspect the result of json.Marshal you'll see that it has already replaced < with \u003c and so forth—it's already done the json.HTMLEscape, so json.HTMLEscape does not have any characters to replace! See https://play.golang.org/p/Zergs3bwElY for an example.
As Ahmed Hashem noted, if you really want to do this sort of thing, you can use reflection to find string fields (as in Implement XSS protection in Golang)—but in general it's probably wiser to do this at the point of input. Note that the answer there does not recurse into inner objects that might contain strings.
1JSON is not HTML, nor XML, etc. Keep them separate in your head, and in your code.
See also https://medium.com/#oazzat19/what-is-the-difference-between-html-vs-xml-vs-json-254864972bbb, a short summary of how we got here that as far as I can tell has no errors, which is pretty good for a random web article. :-) When using JSON, we get very simple typed data: objects, strings, numbers, lists/arrays, boolean, and null; see https://www.w3schools.com/js/js_json_syntax.asp, https://www.w3schools.com/js/js_json_objects.asp, and https://cswr.github.io/JsonSchema/spec/basic_types/ for instance.

Wrapping json member fields to object

My objective is to add fields to json on user request.
Everything is great, but when displaying the fields with
fmt.Printf("%s: %s\n", content.Date, content.Description)
an error occurs:
invalid character '{' after top-level value
And that is because after adding new fields the file looks like this:
{"Date":"2017-03-20 10:46:48","Description":"new"}
{"Date":"2017-03-20 10:46:51","Description":"new .go"}
The biggest problem is with the writting to file
reminder := &Name{dateString[:19], text} //text - input string
newReminder, _ := json.Marshal(&reminder)
I dont really know how to do this properly
My question is how should I wrap all member fields into one object?
And what is the best way to iterate through member fields?
The code is available here: https://play.golang.org/p/NunV_B6sud
You should store the reminders into an array inside the json file, as mentioned by #Gerben Jacobs, and then, every time you want to add a new reminder to the array you need to read the full contents of rem.json, append the new reminder in Go, truncate the file, and write the new slice into the file. Here's a quick implentation https://play.golang.org/p/UKR91maQF2.
If you have lots of reminders and the process of reading, decoding, encoding, and writing the whole content becomes a pain you could open the file, implement a way to truncate only the last ] from the file contents, and then write only , + new reminder + ].
So after some research, people in the go-nuts group helped me and suggested me to use a streaming json parser that parses items individually.
So I needed to change my reminder listing function:
func listReminders() error {
f, err := os.Open("rem.json")
if err != nil {
return err
}
dec := json.NewDecoder(f)
for {
var content Name
switch dec.Decode(&content) {
case nil:
fmt.Printf("%#v\n", content)
case io.EOF:
return nil
default:
return err
}
}
}
Now everything works the way I wanted.

GoLang - XmlPath Selectors with HTML

I am looking at the documented example here, but it is iterating purely over an XML tree, and not HTML. Therefore, I am still partly confused.
For example, if I wanted to find a specific meta tag within the head tag by name, it seems I cannot? Instead, I need to find it by the order it is in the head tag. In this case, I want the 8th meta tag, which I assume is:
headTag, err := getByID(xmlroot, "/head/meta[8]/")
But of course, this is using a getByID function for a tag name - which I don't believe will work. What is the full list of "getBy..." commands?
Then, the problem is, how do I access the meta tag's contents? The documentation only provides examples for the inner tag node content. However, will this example work?:
resp.Query = extractValue(headTag, #content)
The # selector confuses me, is this appropriate for this case?
In other words:
Is there a proper HTML example available?
Is there a list of correct selectors for IDs, Tags, etc?
Can Tags be found by name, and content extracted from its inner content tag?
Thank you very much!
I know this answer is late, but I still want to recommend an htmlquery package that is simple and powerful, based on XPath expressions*.
The below code based on #Time-Cooper example.
package main
import (
"fmt"
"github.com/antchfx/htmlquery"
)
func main() {
doc, err := htmlquery.LoadURL("https://example.com")
if err != nil {
panic(err)
}
s := htmlquery.Find(doc, "//meta[#name='viewport']")
if len(s) == 0 {
fmt.Println("could not find viewpoint")
return
}
fmt.Println(htmlquery.SelectAttr(s[0], "content"))
// alternative method,but simple more.
s2 := htmlquery.FindOne(doc, "//meta[#name='viewport']/#content")
fmt.Println(htmlquery.InnerText(s2))
}
XPath does not seem suitable here; you should be using goquery, which is designed for HTML.
Here is an example:
package main
import (
"fmt"
"github.com/PuerkitoBio/goquery"
)
func main() {
doc, err := goquery.NewDocument("https://example.com")
if err != nil {
panic(err)
}
s := doc.Find(`html > head > meta[name="viewport"]`)
if s.Length() == 0 {
fmt.Println("could not find viewpoint")
return
}
fmt.Println(s.Eq(0).AttrOr("content", ""))
}

Why does json.Encoder add an extra line?

json.Encoder seems to behave slightly different than json.Marshal. Specifically it adds a new line at the end of the encoded value. Any idea why is that? It looks like a bug to me.
package main
import "fmt"
import "encoding/json"
import "bytes"
func main() {
var v string
v = "hello"
buf := bytes.NewBuffer(nil)
json.NewEncoder(buf).Encode(v)
b, _ := json.Marshal(&v)
fmt.Printf("%q, %q", buf.Bytes(), b)
}
This outputs
"\"hello\"\n", "\"hello\""
Try it in the Playground
Because they explicitly added a new line character when using Encoder.Encode. Here's the source code to that func, and it actually states it adds a newline character in the documentation (see comment, which is the documentation):
https://golang.org/src/encoding/json/stream.go?s=4272:4319
// Encode writes the JSON encoding of v to the stream,
// followed by a newline character.
//
// See the documentation for Marshal for details about the
// conversion of Go values to JSON.
func (enc *Encoder) Encode(v interface{}) error {
if enc.err != nil {
return enc.err
}
e := newEncodeState()
err := e.marshal(v)
if err != nil {
return err
}
// Terminate each value with a newline.
// This makes the output look a little nicer
// when debugging, and some kind of space
// is required if the encoded value was a number,
// so that the reader knows there aren't more
// digits coming.
e.WriteByte('\n')
if _, err = enc.w.Write(e.Bytes()); err != nil {
enc.err = err
}
encodeStatePool.Put(e)
return err
}
Now, why did the Go developers do it other than "makes the output look a little nice"? One answer:
Streaming
The go json Encoder is optimized for streaming (e.g. MB/GB/PB of json data). It is typical that when streaming you need a way to deliminate when your stream has completed. In the case of Encoder.Encode(), that is a \n newline character. Sure, you can certainly write to a buffer. But you can also write to an io.Writer which would stream the block of v.
This is opposed to the use of json.Marshal which is generally discouraged if your input is from an untrusted (and unknown limited) source (e.g. an ajax POST method to your web service - what if someone posts a 100MB json file?). And, json.Marshal would be a final complete set of json - e.g. you wouldn't expect to concatenate a few 100 Marshal entries together. You'd use Encoder.Encode() for that to build a large set and write to the buffer, stream, file, io.Writer, etc.
Whenever in doubt if it's a bug, I always lookup the source - that's one of the advantages to Go, it's source and compiler is just pure Go. Within [n]vim I use \gb to open the source definition in a browser with my .vimrc settings.
You can erease the newline by backward stream:
f, _ := os.OpenFile(fname, ...)
encoder := json.NewEncoder(f)
encoder.Encode(v)
f.Seek(-1, 1)
f.WriteString("other data ...")
They should let user control this strange behavior:
a build option to disable it
Encoder.SetEOF(eof string)
Encoder.SetIndent(prefix, indent, eof string)
The Encoder writes a stream of documents. The extra whitespace terminates a JSON document in the stream.
A terminator is required for stream readers. Consider a stream containing these JSON documents: 1, 2, 3. Without the extra whitespace, the data on the wire is the sequence of bytes 123. This is a single JSON document with the number 123, not three documents.

In Go templates, I can get Parse to work but cannot get ParseFiles to work in like manner. Why?

I have the following code:
t, err := template.New("template").Funcs(funcMap).Parse("Howdy {{ myfunc . }}")
In this form everything works fine. But if I do exactly the same thing with ParseFiles, placing the text above in template.html it's a no go:
t, err := template.New("template").Funcs(funcMap).ParseFiles("template.html")
I was able to get ParseFiles to work in the following form, but cannot get Funcs to take effect:
t, err := template.ParseFiles("template.html")
t.Funcs(funcMap)
Of course, this last form is a direct call to a function instead of a call to the receiver method, so not the same thing.
Any one have any ideas what's going on here? Difficult to find a lot of detail on templates out in the either.
Did some digging and found this comment for template.ParseFiles in the source:
First template becomes return value if not already defined, and we use that one for subsequent New calls to associate all the templates together. Also, if this file has the same name as t, this file becomes the contents of t, so t, err := New(name).Funcs(xxx).ParseFiles(name) works. Otherwise we create a new template associated with t.
So the format should be as follows, given my example above:
t, err := template.New("template.html").Funcs(funcMap).ParseFiles("path/template.html")
.New("template.html") creates an empty template with the given name, .Funcs(funcMap) associates any custom functions we want to apply to our templates, and then .ParseFiles("path/template.html") parses one or more templates with an awareness of those functions and associates the contents with a template of that name.
Note that the base name of the first file MUST be the same as the name used in New. Any content parsed will be associated with either an empty preexisting template having the same base name of the first file in the series or with a new template having that base name.
So in my example above, one empty template named "template" was created and had a function map associated with it. Then a new template named "template.html" was created. THESE ARE NOT THE SAME! And since, ParseFiles was called last, t ends up being the "template.html" template, without any functions attached.
What about the last example? Why didn't this work? template.ParseFiles calls the receiver method Parse, which in turn applies any previously registered functions:
trees, err := parse.Parse(t.name, text, t.leftDelim, t.rightDelim, t.parseFuncs, builtins)
This means that custom functions have to be registered prior to parsing. Adding the functions after parsing the templates doesn't have any affect, leading to nil pointer errors at runtime when attempting to call a custom function.
So I think that covers my original question. The design here seems a little clunky. Doesn't make sense that I can chain on ParseFiles to one template and end up returning a different template from the one I am chaining from if they don't happen to be named the same. That is counterintuitive and probably ought to be addressed in future releases to avoid confusion.
ParseFiles should have some names of templates which is basename of filename. But you call template.New, it's create new one of template named error. So you should select one of templates.
foo.go
package main
import (
"text/template"
"log"
"os"
"strings"
)
func main() {
tmpl, err := template.New("error").Funcs(template.FuncMap{
"trim": strings.TrimSpace,
}).ParseFiles("foo.tmpl")
if err != nil {
log.Fatal(err)
}
tmpl = tmpl.Lookup("foo.tmpl")
err = tmpl.Execute(os.Stdout, " string contains spaces both ")
if err != nil {
log.Fatal(err)
}
}
foo.tmpl
{{. | trim}}
try this:
var templates = template.Must(template.New("").Funcs(fmap).ParseFiles("1.tmpl, "2.tmpl"))