Extracting positional offset of *html.Node in Golang

How can I extract the positional offset of a specific node in an already parsed HTML document? For example, for the document <div>Hello, <b>World!</b></div> I want to be able to tell that the offset of World! is 15:21. The document may be modified while parsing.
I have a solution that renders the whole document with special marks, but that's really bad for performance. Any ideas?
package main

import (
    "bytes"
    "log"
    "strings"

    "golang.org/x/net/html"
    "golang.org/x/net/html/atom"
)

func nodeIndexOffset(context *html.Node, node *html.Node) (int, int) {
    if node.Type != html.TextNode {
        node = node.FirstChild
    }
    originalData := node.Data

    var buf bytes.Buffer
    node.Data = "|start|" + originalData
    _ = html.Render(&buf, context.FirstChild)
    start := strings.Index(buf.String(), "|start|")

    buf = bytes.Buffer{}
    node.Data = originalData + "|end|"
    _ = html.Render(&buf, context.FirstChild)
    end := strings.Index(buf.String(), "|end|")

    node.Data = originalData
    return start, end
}

func main() {
    s := "<div>Hello, <b>World!</b></div>"
    context := html.Node{
        Type:     html.ElementNode,
        Data:     "body",
        DataAtom: atom.Body,
    }
    nodes, err := html.ParseFragment(strings.NewReader(s), &context)
    if err != nil {
        log.Fatal(err)
    }
    for _, node := range nodes {
        context.AppendChild(node)
    }
    world := nodes[0].FirstChild.NextSibling.FirstChild
    log.Println("target", world)
    log.Println(nodeIndexOffset(&context, world))
}

Not an answer, but too long for a comment. The following could work to some extent:
Use a Tokenizer and step through each element one by one.
Wrap your input in a custom reader which records line and column offsets as the Tokenizer reads from it.
Query your custom reader for the position before and after calling Next() to record the approximate position information you need.
This is a bit painful and not too accurate, but probably the best you can do; a rough sketch of the counting-reader idea follows.
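For illustration, here is a minimal sketch of that idea (my sketch; offsetReader is a made-up name, and it tracks only byte offsets, though line/column tracking would follow the same pattern). Because the Tokenizer reads ahead in chunks, the recorded window only bounds the token's true position:

package main

import (
    "fmt"
    "io"
    "strings"

    "golang.org/x/net/html"
)

// offsetReader counts how many bytes the Tokenizer has consumed
// from the underlying reader.
type offsetReader struct {
    r   io.Reader
    pos int // bytes read so far
}

func (o *offsetReader) Read(p []byte) (int, error) {
    n, err := o.r.Read(p)
    o.pos += n
    return n, err
}

func main() {
    or := &offsetReader{r: strings.NewReader("<div>Hello, <b>World!</b></div>")}
    z := html.NewTokenizer(or)
    for {
        before := or.pos
        if z.Next() == html.ErrorToken {
            break
        }
        // Approximate: the token lies somewhere in [before:or.pos],
        // because the Tokenizer buffers ahead.
        fmt.Printf("%-10q read window [%d:%d]\n", z.Token().Data, before, or.pos)
    }
}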

I came up with a solution (please correct me if there's another way to do it) where we extend the original HTML package with an additional custom.go file containing a new exported function. This function can access the unexported data field of the Tokenizer, which holds exactly the start and end positions of the current node. We have to adjust the positions after each buffer read; see globalBufDif.
I don't really like that I have to fork the package just to access a couple of fields, but it seems this is the Go way.
func parseWithIndexes(p *parser) (map[*Node][2]int, error) {
    // Iterate until EOF. Any other error will cause an early return.
    var err error
    var globalBufDif int
    var prevEndBuf int
    var tokenIndex [2]int
    tokenMap := make(map[*Node][2]int)
    for err != io.EOF {
        // CDATA sections are allowed only in foreign content.
        n := p.oe.top()
        p.tokenizer.AllowCDATA(n != nil && n.Namespace != "")
        // Walk to the last sibling, i.e. the most recently inserted node.
        t := p.top().FirstChild
        for t != nil && t.NextSibling != nil {
            t = t.NextSibling
        }
        tokenMap[t] = tokenIndex
        // The tokenizer's buffer is reset from time to time; accumulate the
        // difference so positions stay relative to the whole input.
        if prevEndBuf > p.tokenizer.data.end {
            globalBufDif += prevEndBuf
        }
        prevEndBuf = p.tokenizer.data.end
        // Read and parse the next token.
        p.tokenizer.Next()
        tokenIndex = [2]int{p.tokenizer.data.start + globalBufDif, p.tokenizer.data.end + globalBufDif}
        p.tok = p.tokenizer.Token()
        if p.tok.Type == ErrorToken {
            err = p.tokenizer.Err()
            if err != nil && err != io.EOF {
                return tokenMap, err
            }
        }
        p.parseCurrentToken()
    }
    return tokenMap, nil
}
// ParseFragmentWithIndexes parses a fragment of HTML and returns the nodes
// that were found. If the fragment is the InnerHTML for an existing element,
// pass that element in context.
func ParseFragmentWithIndexes(r io.Reader, context *Node) ([]*Node, map[*Node][2]int, error) {
    contextTag := ""
    if context != nil {
        if context.Type != ElementNode {
            return nil, nil, errors.New("html: ParseFragment of non-element Node")
        }
        // The next check isn't just context.DataAtom.String() == context.Data because
        // it is valid to pass an element whose tag isn't a known atom. For example,
        // DataAtom == 0 and Data = "tagfromthefuture" is perfectly consistent.
        if context.DataAtom != a.Lookup([]byte(context.Data)) {
            return nil, nil, fmt.Errorf("html: inconsistent Node: DataAtom=%q, Data=%q", context.DataAtom, context.Data)
        }
        contextTag = context.DataAtom.String()
    }
    p := &parser{
        tokenizer: NewTokenizerFragment(r, contextTag),
        doc: &Node{
            Type: DocumentNode,
        },
        scripting: true,
        fragment:  true,
        context:   context,
    }
    root := &Node{
        Type:     ElementNode,
        DataAtom: a.Html,
        Data:     a.Html.String(),
    }
    p.doc.AppendChild(root)
    p.oe = nodeStack{root}
    p.resetInsertionMode()
    for n := context; n != nil; n = n.Parent {
        if n.Type == ElementNode && n.DataAtom == a.Form {
            p.form = n
            break
        }
    }
    tokenMap, err := parseWithIndexes(p)
    if err != nil {
        return nil, nil, err
    }
    parent := p.doc
    if context != nil {
        parent = root
    }
    var result []*Node
    for c := parent.FirstChild; c != nil; {
        next := c.NextSibling
        parent.RemoveChild(c)
        result = append(result, c)
        c = next
    }
    return result, tokenMap, nil
}
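For completeness, usage of the forked package would look roughly like this (a sketch only: ParseFragmentWithIndexes and the [2]int offsets exist just in the fork described above, so this assumes you import your fork instead of golang.org/x/net/html):

s := "<div>Hello, <b>World!</b></div>"
context := html.Node{Type: html.ElementNode, Data: "body", DataAtom: atom.Body}
nodes, tokenMap, err := html.ParseFragmentWithIndexes(strings.NewReader(s), &context)
if err != nil {
    log.Fatal(err)
}
for _, n := range nodes {
    if pos, ok := tokenMap[n]; ok {
        log.Printf("%q spans bytes %d:%d", n.Data, pos[0], pos[1])
    }
}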

Related

How to extract the text of a custom html tag with goquery?

I am trying to extract the text of a custom html tag (<prelogin-cookie>):
someHtml := `<html><body>Login Successful!</body><!-- <saml-auth-status>1</saml-auth-status><prelogin-cookie>4242424242424242</prelogin-cookie><saml-username>my-username</saml-username><saml-slo>no</saml-slo> --></html>`
query, _ := goquery.NewDocumentFromReader(strings.NewReader(someHtml))
sel := query.Find("prelogin-cookie")
println(sel.Text())
But it does not return anything, just an empty string. How can I get the actual text of that html tag, i.e. 4242424242424242?
<prelogin-cookie> is not found because it's inside an HTML comment.
Your comment is actually a series of XML or HTML tags, so it may be processed as HTML if you use it as the input document.
Warning: only the first solution below handles "all" HTML documents properly. The other solutions are simpler and will handle your case just fine, but they might not handle some edge cases. Decide whether they are worth using in your situation.
1. By searching the HTML node tree
One way to find and extract the comment would be to traverse the HTML node tree and look for a node with type html.CommentNode.
For this, we'll use a recursive helper function to traverse a node tree:
func findComment(n *html.Node) *html.Node {
    if n == nil {
        return nil
    }
    if n.Type == html.CommentNode {
        return n
    }
    if res := findComment(n.FirstChild); res != nil {
        return res
    }
    if res := findComment(n.NextSibling); res != nil {
        return res
    }
    return nil
}
And using it:
doc, err := goquery.NewDocumentFromReader(strings.NewReader(someHtml))
if err != nil {
    panic(err)
}
var comment *html.Node
for _, node := range doc.Nodes {
    if comment = findComment(node); comment != nil {
        break
    }
}
if comment == nil {
    fmt.Println("no comment")
    return
}
// comment.Data holds the text between <!-- and -->, which is itself
// parseable HTML, so feed it through goquery again.
doc, err = goquery.NewDocumentFromReader(strings.NewReader(comment.Data))
if err != nil {
    panic(err)
}
sel := doc.Find("prelogin-cookie")
fmt.Println(sel.Text())
This will print (try it on the Go Playground):
4242424242424242
2. With strings
If you just have to handle the "document at hand", a simpler solution may be to use the strings package to find the start and end indices of the comment:
start := strings.Index(someHtml, "<!--")
if start < 0 {
    panic("no comment")
}
end := strings.Index(someHtml[start:], "-->")
if end < 0 {
    panic("no comment")
}
And using this as the input (note that end is relative to start, so the slice is someHtml[start+4 : start+end]):
doc, err := goquery.NewDocumentFromReader(strings.NewReader(someHtml[start+4 : start+end]))
if err != nil {
    panic(err)
}
sel := doc.Find("prelogin-cookie")
fmt.Println(sel.Text())
This will output the same. Try it on the Go Playground.
3. Using regexp
A simpler (but less efficient) alternative to the previous solution is to use regexp to get the comment out of the original document:
comments := regexp.MustCompile(`<!--(.*?)-->`).FindAllString(someHtml, -1)
if len(comments) == 0 {
    fmt.Println("no comment")
    return
}
// comments[0] still includes the "<!--" and "-->" delimiters; slice them off.
doc, err := goquery.NewDocumentFromReader(strings.NewReader(
    comments[0][4 : len(comments[0])-3]))
Try this one on the Go Playground.
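As a small variant (my note, not part of the original answer), FindAllStringSubmatch returns the capture group directly, which avoids the manual slicing:

m := regexp.MustCompile(`<!--(.*?)-->`).FindAllStringSubmatch(someHtml, -1)
if len(m) == 0 {
    fmt.Println("no comment")
    return
}
// m[0][1] is the first capture group: the comment body without the delimiters.
doc, err := goquery.NewDocumentFromReader(strings.NewReader(m[0][1]))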

How to replace/update a key value inside a json array in golang?

I have a json array which contains some flags as keys, and I have set the default values for those keys to false. This is my json array:
var flags = map[string]bool{
    "terminationFlag": false,
    "transferFlag":    false,
    "jrCancelledFlag": false,
    "jrFilledFlag":    false,
}
While performing an operation in a for loop, I have to update one field in the above json array to true. During the next iteration, it has to update the second field to true. After all the fields in the json array are set to true, I have to return the json array.
The code I tried:
Keystrings := []string{"terminationReport - 2019-1", "transferReport - 2019-1", "jrCancelledReport - 2019-1", "jrFilledReport - 2019-1"}
fmt.Println("Keystrings ", Keystrings)
for i, value := range Keystrings {
    bytesread, err = stub.GetState(value)
    var result []string
    _ = json.Unmarshal(bytesread, &result)
    fmt.Println("result ", result)
    if result[0] == "yes" {
        fmt.Println("result in if ", result)
        flags[i] = true
    }
}
Since it's very hard to understand from the question what is being asked, here's a simple attempt at working with data similar to the question's, in the hope that you can take the right parts from this sample and adapt them to your issue. Follow the comments in the code to understand what's going on.
package main

import (
    "encoding/json"
    "fmt"
    "log"
)

var jsonBlob = []byte(`["jrCancelledFlag", "yes"]`)

var flags = map[string]bool{
    "terminationFlag": false,
    "transferFlag":    false,
    "jrCancelledFlag": false,
    "jrFilledFlag":    false,
}

func main() {
    // Parse jsonBlob into a slice of strings
    var parsed []string
    if err := json.Unmarshal(jsonBlob, &parsed); err != nil {
        log.Fatalf("JSON unmarshal: %s", err)
    }
    // Expect the slice to be of length 2, first item flag name, second item
    // yes/no.
    if len(parsed) != 2 {
        log.Fatalf("parsed len %d, expected 2", len(parsed))
    }
    // Assume parsed[0] actually appears in flags... otherwise more error checking
    // is needed.
    if parsed[1] == "yes" {
        flags[parsed[0]] = true
    }
    // Emit updated flags as json
    json, err := json.Marshal(flags)
    if err != nil {
        log.Fatalf("JSON marshal: %s", err)
    }
    fmt.Println(string(json))
}
This can be achieved cleanly by using the JSON unmarshaller interface to define your own unmarshaller:
https://medium.com/@nate510/dynamic-json-umarshalling-in-go-88095561d6a0
package main

import (
    "encoding/json"
    "fmt"
    "log"
)

var jsonBlob = []byte(`["jrCancelledFlag", "yes"]`)

// Flags ...
type Flags struct {
    TerminationFlag bool `json:"terminationFlag,omitempty"`
    TransferFlag    bool `json:"transferFlag,omitempty"`
    JRCancelledFlag bool `json:"jrCancelledFlag,omitempty"`
    JRFilledFlag    bool `json:"jrFilledFlag,omitempty"`
}

// UnmarshalJSON satisfies the JSON unmarshaller interface.
// Note: it must unmarshal its data argument, not the global jsonBlob.
func (f *Flags) UnmarshalJSON(data []byte) error {
    var parsed []string
    if err := json.Unmarshal(data, &parsed); err != nil {
        return err
    }
    if len(parsed)%2 != 0 {
        return fmt.Errorf("expected string to be evenly paired")
    }
    for i := 0; i < len(parsed); i++ {
        j := i + 1
        if j < len(parsed) {
            switch parsed[i] {
            case "terminationFlag":
                f.TerminationFlag = toBool(parsed[j])
            case "transferFlag":
                f.TransferFlag = toBool(parsed[j])
            case "jrCancelledFlag":
                f.JRCancelledFlag = toBool(parsed[j])
            case "jrFilledFlag":
                f.JRFilledFlag = toBool(parsed[j])
            }
        }
    }
    return nil
}

func toBool(s string) bool {
    return s == "yes"
}

func main() {
    var flags Flags
    err := json.Unmarshal(jsonBlob, &flags)
    if err != nil {
        log.Fatal(err)
    }
    b, _ := json.Marshal(flags)
    fmt.Println(string(b))
}

Check if strings are JSON format

How do I check whether a given string is in the form of multiple JSON strings separated by spaces/newlines?
For example,
given: "test" 123 {"Name": "mike"} (3 JSON values concatenated with spaces)
return: true, since each item ("test", 123 and {"Name": "mike"}) is valid JSON.
In Go, I can write an O(N^2) function like:
// check given string is json or multiple json strings concatenated with spaces/newlines
func validateJSON(str string) error {
    // only one json string
    if isJSON(str) {
        return nil
    }
    // multiple json strings concatenated with spaces
    str = strings.TrimSpace(str)
    arr := []rune(str)
    start := 0
    end := 0
    for start < len(str) {
        for end < len(str) && !unicode.IsSpace(arr[end]) {
            end++
        }
        substr := str[start:end]
        if isJSON(substr) {
            for end < len(str) && unicode.IsSpace(arr[end]) {
                end++
            }
            start = end
        } else {
            if end == len(str) {
                return errors.New("error when parsing input: " + substr)
            }
            for end < len(str) && unicode.IsSpace(arr[end]) {
                end++
            }
        }
    }
    return nil
}

func isJSON(str string) bool {
    var js json.RawMessage
    return json.Unmarshal([]byte(str), &js) == nil
}
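As an aside (my note, not part of the original question), the isJSON helper could use the standard library's json.Valid instead of a throw-away unmarshal, though that doesn't change the overall O(N^2) behavior:

func isJSON(str string) bool {
    // json.Valid reports whether str is a single valid JSON value.
    return json.Valid([]byte(str))
}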
But this won't work for large input.
There are two options. The simplest, from a coding standpoint, is going to be just to decode the JSON string normally. You can make this most efficient by decoding to an empty struct:
package main

import "encoding/json"

func main() {
    input := []byte(`{"a":"b", "c": 123}`)
    var x struct{}
    if err := json.Unmarshal(input, &x); err != nil {
        panic(err)
    }
    input = []byte(`{"a":"b", "c": 123}xxx`) // This one fails
    if err := json.Unmarshal(input, &x); err != nil {
        panic(err)
    }
}
(playground link)
This method has a few potential drawbacks:
It only works with a single JSON object. That is, a list of objects (as requested in the question) will fail without additional logic.
As pointed out by @icza in the comments, it only works with JSON objects, so bare arrays, numbers, or strings will fail. To accommodate these types, interface{} must be used, which introduces the potential for some serious performance penalties (see the sketch below).
The throw-away x value must still be allocated, and at least one reflection call is likely under the sheets, which may introduce a noticeable performance penalty for some workloads.
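For illustration, here is a minimal sketch (my addition, not from the original answer) of the interface{} variant that accepts bare values:

package main

import (
    "encoding/json"
    "fmt"
)

func main() {
    // Unlike struct{}, interface{} accepts any JSON value: objects,
    // arrays, numbers, and strings.
    var v interface{}
    for _, input := range []string{`"test"`, `123`, `{"Name": "mike"}`} {
        if err := json.Unmarshal([]byte(input), &v); err != nil {
            fmt.Println(input, "-> invalid:", err)
            continue
        }
        fmt.Printf("%s -> valid (%T)\n", input, v)
    }
}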
Given these limitations, my recommendation is to use the second option: loop through the entire JSON input, ignoring the actual contents. This is made simple with the standard library json.Decoder:
package main

import (
    "bytes"
    "encoding/json"
    "io"
)

func main() {
    input := []byte(`{"a":"b", "c": 123}`)
    dec := json.NewDecoder(bytes.NewReader(input))
    for {
        _, err := dec.Token()
        if err == io.EOF {
            break // End of input, valid JSON
        }
        if err != nil {
            panic(err) // Invalid input
        }
    }
    input = []byte(`{"a":"b", "c": 123}xxx`) // This input fails
    dec = json.NewDecoder(bytes.NewReader(input))
    for {
        _, err := dec.Token()
        if err == io.EOF {
            break // End of input, valid JSON
        }
        if err != nil {
            panic(err) // Invalid input
        }
    }
}
(playground link)
As Volker mentioned in the comments, use a *json.Decoder to decode all json documents in your input successively:
package main

import (
    "encoding/json"
    "io"
    "log"
    "strings"
)

func main() {
    input := `"test" 123 {"Name": "mike"}`
    dec := json.NewDecoder(strings.NewReader(input))
    for {
        var x json.RawMessage
        switch err := dec.Decode(&x); err {
        case nil:
            // not done yet
        case io.EOF:
            return // success
        default:
            log.Fatal(err)
        }
    }
}
Try it on the playground: https://play.golang.org/p/1OKOii9mRHn
Try fastjson.Scanner, which iterates over a whitespace-separated stream of JSON values (note: the nil check was inverted in the original snippet; a nil error after the loop means the input was valid):
s := `"test" 123 {"Name": "mike"}`
var sc fastjson.Scanner
sc.Init(s)
// Iterate over the stream of json values
for sc.Next() {
}
if sc.Error() == nil {
    fmt.Println("ok")
} else {
    fmt.Println("error:", sc.Error())
}

Appending to json file without writing entire file

I have a JSON file in which one attribute's value is an array, and I need to keep appending values to that array and writing the result to a file. Is there a way to avoid rewriting the existing data and only append the new values?
----- Moving the next question to a different thread ---------------
What is the recommended way of writing big data sets to a file: incremental writes, or a single dump at the end of the process?
A general solution makes the most sense if the existing JSON is actually an array, or if it's an object that has an array as the last or only pair, as in your case. Otherwise, you're inserting instead of appending. You probably don't want to read the entire file either.
One approach is not much different from what you were thinking, but handles several details:
Read the end of the file to verify that it "ends with an array".
Retain that part.
Position the file at that ending array bracket.
Take the output from a standard encoder for an array of new data, dropping its opening bracket, and inserting a comma if necessary.
The end of the new output replaces the original ending array bracket.
Tack the rest of the tail back on.
import (
    "bytes"
    "errors"
    "io"
    "io/ioutil"
    "os"
    "regexp"
    "unicode"
)

const (
    tailCheckLen = 16
)

var (
    arrayEndsObject = regexp.MustCompile(`(\[\s*)?](\s*}\s*)$`)
    justArray       = regexp.MustCompile(`(\[\s*)?](\s*)$`)
)

type jsonAppender struct {
    f               *os.File
    strippedBracket bool
    needsComma      bool
    tail            []byte
}

// Write uses a pointer receiver so that strippedBracket and needsComma
// survive between calls.
func (a *jsonAppender) Write(b []byte) (int, error) {
    trimmed := 0
    if !a.strippedBracket {
        t := bytes.TrimLeftFunc(b, unicode.IsSpace)
        if len(t) == 0 {
            return len(b), nil
        }
        if t[0] != '[' {
            return 0, errors.New("not appending array: " + string(t))
        }
        trimmed = len(b) - len(t) + 1
        b = t[1:]
        a.strippedBracket = true
    }
    if a.needsComma {
        a.needsComma = false
        n, err := a.f.Write([]byte(", "))
        if err != nil {
            return n, err
        }
    }
    n, err := a.f.Write(b)
    return trimmed + n, err
}

func (a *jsonAppender) Close() error {
    if _, err := a.f.Write(a.tail); err != nil {
        defer a.f.Close()
        return err
    }
    return a.f.Close()
}

func JSONArrayAppender(file string) (io.WriteCloser, error) {
    f, err := os.OpenFile(file, os.O_RDWR, 0664)
    if err != nil {
        return nil, err
    }
    pos, err := f.Seek(0, io.SeekEnd)
    if err != nil {
        return nil, err
    }
    if pos < tailCheckLen {
        pos = 0
    } else {
        pos -= tailCheckLen
    }
    _, err = f.Seek(pos, io.SeekStart)
    if err != nil {
        return nil, err
    }
    tail, err := ioutil.ReadAll(f)
    if err != nil {
        return nil, err
    }
    hasElements := false
    if len(tail) == 0 {
        _, err = f.Write([]byte("["))
        if err != nil {
            return nil, err
        }
    } else {
        var g [][]byte
        if g = arrayEndsObject.FindSubmatch(tail); g != nil {
        } else if g = justArray.FindSubmatch(tail); g != nil {
        } else {
            return nil, errors.New("does not end with array")
        }
        hasElements = len(g[1]) == 0
        _, err = f.Seek(-int64(len(g[2])+1), io.SeekEnd) // 1 for ]
        if err != nil {
            return nil, err
        }
        tail = g[2]
    }
    return &jsonAppender{f: f, needsComma: hasElements, tail: tail}, nil
}
Usage is then as in this test fragment:
a, err := JSONArrayAppender(f)
if err != nil {
    t.Fatal(err)
}
added := []struct {
    Name string `json:"name"`
}{
    {"Wonder Woman"},
}
if err = json.NewEncoder(a).Encode(added); err != nil {
    t.Fatal(err)
}
if err = a.Close(); err != nil {
    t.Fatal(err)
}
You can use whatever settings on the Encoder you want. The only hard-coded part is handling needsComma, but you can add an argument for that.
If your JSON array is simple, you can use something like the following code, in which I create the JSON array manually.
package main

import (
    "encoding/json"
    "log"
    "os"
)

type item struct {
    Name string
}

func main() {
    fd, err := os.Create("hello.json")
    if err != nil {
        log.Fatal(err)
    }
    fd.Write([]byte{'['})
    for i := 0; i < 10; i++ {
        b, err := json.Marshal(item{
            "parham",
        })
        if err != nil {
            log.Fatal(err)
        }
        if i != 0 {
            fd.Write([]byte{','})
        }
        fd.Write(b)
    }
    fd.Write([]byte{']'})
}
If you want to have a valid array after each step, you can write ']' at the end of each iteration and then seek back over it at the start of the next iteration, as in the sketch below.
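A minimal sketch of that seek-back idea (my illustration, not part of the original answer):

package main

import (
    "encoding/json"
    "io"
    "log"
    "os"
)

type item struct {
    Name string
}

func main() {
    fd, err := os.Create("hello.json")
    if err != nil {
        log.Fatal(err)
    }
    defer fd.Close()
    fd.Write([]byte("[]")) // start with a valid empty array
    for i := 0; i < 10; i++ {
        b, err := json.Marshal(item{"parham"})
        if err != nil {
            log.Fatal(err)
        }
        // Seek back over the closing ']' so this iteration overwrites it.
        if _, err := fd.Seek(-1, io.SeekEnd); err != nil {
            log.Fatal(err)
        }
        if i != 0 {
            fd.Write([]byte{','})
        }
        fd.Write(b)
        fd.Write([]byte{']'}) // the file is valid JSON again after every iteration
    }
}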

Don't read unneeded JSON key-values into memory

I have a JSON file with a single field that takes a huge amount of space when loaded into memory. The other fields are reasonable, but I'm trying to take care not to load that particular field unless I absolutely have to.
{
    "Field1": "value1",
    "Field2": "value2",
    "Field3": "a very very long string that potentially takes a few GB of memory"
}
When reading that file into memory, I'd want to ignore Field3 (because loading it could crash my app). Here's some code that I would assume does that, because it uses io streams rather than passing a []byte to Unmarshal.
package main

import (
    "encoding/json"
    "os"
)

func main() {
    type MyStruct struct {
        Field1 string
        Field2 string
    }
    fi, err := os.Open("myJSONFile.json")
    if err != nil {
        os.Exit(2)
    }
    // create an instance and populate
    var mystruct MyStruct
    err = json.NewDecoder(fi).Decode(&mystruct)
    if err != nil {
        os.Exit(2)
    }
    // do some other stuff
}
The issue is that the built-in json.Decoder type reads the entire file into memory on Decode before throwing away key-values that don't match the struct's fields (as has been pointed out on StackOverflow before: link).
Are there any ways of decoding JSON in Go without keeping the entire JSON object in memory?
You could write a custom io.Reader that you feed to json.Decoder and that pre-reads your json file and skips that specific field.
The other option is to write your own decoder; that's more complicated and messy.
//edit: it seemed like a fun exercise, so here goes:
package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io"
    "os"
    "strings"
)

type IgnoreField struct {
    io.Reader
    Field string
    buf   bytes.Buffer
}

func NewIgnoreField(r io.Reader, field string) *IgnoreField {
    return &IgnoreField{
        Reader: r,
        Field:  field,
    }
}

func (iF *IgnoreField) Read(p []byte) (n int, err error) {
    if n, err = iF.Reader.Read(p); err != nil {
        return
    }
    s := string(p[:n]) // only the bytes actually read
    fl := `"` + iF.Field + `"`
    if i := strings.Index(s, fl); i != -1 {
        l := strings.LastIndex(s[0:i], ",")
        if l == -1 {
            l = i
        }
        iF.buf.WriteString(s[0:l])
        s = s[i+1+len(fl):]
        i = strings.Index(s, `"`) // opening quote of the value
        if i != -1 {
            s = s[i+1:]
        }
        for {
            i = strings.Index(s, `"`) // end quote
            if i != -1 {
                s = s[i+1:]
                fmt.Println("Skipped")
                break
            } else {
                // the value continues beyond this chunk, keep reading
                if n, err = iF.Reader.Read(p); err != nil {
                    return
                }
                s = string(p[:n])
            }
        }
        iF.buf.WriteString(s)
    } else {
        // the field is not in this chunk, pass it through untouched
        iF.buf.WriteString(s)
    }
    ln := iF.buf.Len()
    if ln >= len(p) {
        tmp := append([]byte(nil), iF.buf.Bytes()...) // copy before Reset
        iF.buf.Reset()
        copy(p, tmp[0:len(p)])
        iF.buf.Write(tmp[len(p):]) // keep the remainder buffered
        ln = len(p)
    } else {
        copy(p, iF.buf.Bytes())
        iF.buf.Reset()
    }
    return ln, nil
}
func main() {
    type MyStruct struct {
        Field1 string
        Field2 string
    }
    fi, err := os.Open("myJSONFile.json")
    if err != nil {
        os.Exit(2)
    }
    // create an instance and populate
    var mystruct MyStruct
    err = json.NewDecoder(NewIgnoreField(fi, "Field3")).Decode(&mystruct)
    if err != nil {
        fmt.Println(err)
    }
    fmt.Println(mystruct)
}
playground