Equivalent to Python's HTML parsing function/module in Go?

I'm learning Go on my own and am stuck on fetching and parsing HTML/XML. In Python, I usually write the following code when I do web scraping:
from urllib.request import urlopen, Request
url = "http://stackoverflow.com/"
req = Request(url)
html = urlopen(req).read()
After that I have the raw HTML/XML as either a string or bytes and can go on to work with it. How do I do the same in Go? What I hope to get is the raw HTML data stored in either a string or a []byte (the two convert easily, so I don't mind which). I'm considering the gokogiri package for web scraping in Go (not sure I'll actually end up using it), but it looks like it requires the raw HTML text before it can do any work.
So how can I get such an object?
Or is there a better way to do web scraping in Go?
Thanks.

From the Go http.Get Example:
package main
import (
	"fmt"
	"io"
	"log"
	"net/http"
)
func main() {
	res, err := http.Get("http://www.google.com/robots.txt")
	if err != nil {
		log.Fatal(err)
	}
	// io.ReadAll replaces ioutil.ReadAll, which is deprecated since Go 1.16.
	robots, err := io.ReadAll(res.Body)
	res.Body.Close()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s", robots)
}
This reads the contents of http://www.google.com/robots.txt into the variable robots, which is a []byte; use string(robots) if you need a string.
For XML parsing, look into the Go encoding/xml package.

Related

How to read "interfaces" map of json without defining structure in Golang?

Following this tutorial, I'm trying to read a JSON file in Go. It says there are two ways of doing that:
unmarshal the JSON using a set of predefined structs
or unmarshal the JSON using a map[string]interface{}
Since I'll probably have a lot of different JSON formats, I prefer to interpret it on the fly. So I now have the following code:
package main
import (
"fmt"
"os"
"io/ioutil"
"encoding/json"
)
func main() {
// Open our jsonFile
jsonFile, err := os.Open("users.json")
// if we os.Open returns an error then handle it
if err != nil {
fmt.Println(err)
}
fmt.Println("Successfully Opened users.json")
// defer the closing of our jsonFile so that we can parse it later on
defer jsonFile.Close()
byteValue, _ := ioutil.ReadAll(jsonFile)
var result map[string]interface{}
json.Unmarshal([]byte(byteValue), &result)
fmt.Println(result["users"])
fmt.Printf("%T\n", result["users"])
}
This prints out:
Successfully Opened users.json
[map[type:Reader age:23 social:map[facebook:https://facebook.com twitter:https://twitter.com] name:Elliot] map[name:Fraser type:Author age:17 social:map[facebook:https://facebook.com twitter:https://twitter.com]]]
[]interface {}
At this point I don't understand how I can read the age of the first user (23). I tried some variations:
fmt.Println(result["users"][0])
fmt.Println(result["users"][0].age)
But apparently, type interface {} does not support indexing.
Is there a way that I can access the items in the json without defining the structure?
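Probably you want
fmt.Println(result["users"].([]interface{})[0].(map[string]interface{})["age"])
As your %T output shows, result["users"] holds a []interface{}, and each element of that slice is a map[string]interface{}. Because result is declared as map[string]interface{}, every value in it has the static type interface{} and has to be converted with a type assertion ([]interface{} for a JSON array, map[string]interface{} for a JSON object) before you can index it or look up a key. Note that JSON numbers decode to float64 in this mode.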
Defining a struct is much easier. My top tip for doing this is to use a website that converts JSON to a Go struct definition, like Json-To-Go
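If you do stay with map[string]interface{}, a plain type assertion panics whenever the data has a different shape than expected. Here is a minimal sketch using the comma-ok form instead; the firstUserAge helper and the inline JSON (a trimmed version of the users.json output shown in the question) are assumptions for illustration.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"log"
)

// firstUserAge is a hypothetical helper that walks the decoded JSON step by
// step, checking each type assertion instead of letting it panic.
func firstUserAge(data []byte) (float64, error) {
	var result map[string]interface{}
	if err := json.Unmarshal(data, &result); err != nil {
		return 0, err
	}
	// A JSON array decodes to []interface{}.
	users, ok := result["users"].([]interface{})
	if !ok || len(users) == 0 {
		return 0, errors.New("users is missing, empty, or not an array")
	}
	// A JSON object decodes to map[string]interface{}.
	first, ok := users[0].(map[string]interface{})
	if !ok {
		return 0, errors.New("first user is not an object")
	}
	// A JSON number decodes to float64.
	age, ok := first["age"].(float64)
	if !ok {
		return 0, errors.New("age is missing or not a number")
	}
	return age, nil
}

func main() {
	data := []byte(`{"users":[{"name":"Elliot","type":"Reader","age":23},{"name":"Fraser","type":"Author","age":17}]}`)
	age, err := firstUserAge(data)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(age) // prints 23
}
```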

ASCII to Json in go

I'm fairly new to programming, but I found that Python didn't have the speed I needed, so I switched to Go. I'm building a scraper and need to convert what looks like an ASCII-formatted string to JSON, but I can't find any good documentation on how to do this in Go.
The string I need converted looks something like this: debug%22%3Afalse%2C%22pageOpts%22%3A%7B%22noBidIfUnsold%22%3Atrue%2C%22keywords%22%3A%7B%22no-sno-finn-object_type%22%3A%22private%22%2C%22no-sno-finn-car_make%22%3A%22796%22%2C%22aa-sch-publisher%22%3A%22finn%22%2C%22aa-sch-inventory_type%22%3A%22classified%22%2C%22aa-sch-country_code%22%3A%22no%22%2C%22no-sno-finn-section%22%3A%22car%22%2C%22no-sno-finn-ad_owner%22%3A%22false%22%2C%22no-sno-publishergroup%22%3A%22schibsted%22%2C%22aa-sch-supply_type%22%3A%22web_desktop%22%2C%22no-sno-finn-subsection%22%3A%22car_used%22%2C%22aa-sch-page_type%22%3A%22object%22%7D
Thanks in advance!
As mentioned by a commenter, your string is URL encoded and can be decoded using url.QueryUnescape(...):
package main
import (
"fmt"
"net/url"
)
func main() {
querystr := "debug%22%3Afalse%2C%22pageOpts%22%3A%7B%22noBidIfUnsold%22%3Atrue%2C%22keywords%22%3A%7B%22no-sno-finn-object_type%22%3A%22private%22%2C%22no-sno-finn-car_make%22%3A%22796%22%2C%22aa-sch-publisher%22%3A%22finn%22%2C%22aa-sch-inventory_type%22%3A%22classified%22%2C%22aa-sch-country_code%22%3A%22no%22%2C%22no-sno-finn-section%22%3A%22car%22%2C%22no-sno-finn-ad_owner%22%3A%22false%22%2C%22no-sno-publishergroup%22%3A%22schibsted%22%2C%22aa-sch-supply_type%22%3A%22web_desktop%22%2C%22no-sno-finn-subsection%22%3A%22car_used%22%2C%22aa-sch-page_type%22%3A%22object%22%7D"
// Parse the URL encoded string.
plainstr, err := url.QueryUnescape(querystr)
if err != nil {
panic(err)
}
fmt.Println(plainstr)
// debug":false,"pageOpts":{"noBidIfUnsold":true,"keywords":{"no-sno-finn-object_type":"private","no-sno-finn-car_make":"796","aa-sch-publisher":"finn","aa-sch-inventory_type":"classified","aa-sch-country_code":"no","no-sno-finn-section":"car","no-sno-finn-ad_owner":"false","no-sno-publishergroup":"schibsted","aa-sch-supply_type":"web_desktop","no-sno-finn-subsection":"car_used","aa-sch-page_type":"object"}
}
Your example string appears to be incomplete (the decoded text lacks the opening {" and the final closing braces), but once you have the complete text it can be decoded into a struct or map using json.Unmarshal(...).
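A minimal sketch of both steps together, assuming a complete payload. The decodeEscapedJSON helper and the shortened input string are made up for illustration; the string is a cut-down, well-formed variant of the one in the question, not the real data.

```go
package main

import (
	"encoding/json"
	"fmt"
	"log"
	"net/url"
)

// decodeEscapedJSON is a hypothetical helper: it URL-decodes the input and
// then unmarshals the resulting JSON text into a generic map.
func decodeEscapedJSON(escaped string) (map[string]interface{}, error) {
	plain, err := url.QueryUnescape(escaped)
	if err != nil {
		return nil, err
	}
	var out map[string]interface{}
	if err := json.Unmarshal([]byte(plain), &out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	// Assumed complete payload; decodes to
	// {"debug":false,"pageOpts":{"noBidIfUnsold":true}}
	escaped := "%7B%22debug%22%3Afalse%2C%22pageOpts%22%3A%7B%22noBidIfUnsold%22%3Atrue%7D%7D"

	data, err := decodeEscapedJSON(escaped)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(data["debug"])
	if pageOpts, ok := data["pageOpts"].(map[string]interface{}); ok {
		fmt.Println(pageOpts["noBidIfUnsold"])
	}
}
```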

Output a simple json file to a rest api with golang

This is my first go project. All I want to do is read a file.json on my server, then make it available to others via a REST API. But I'm getting errors. Here's what I have so far.
main.go
package main
import (
"encoding/json"
"github.com/gorilla/mux"
"log"
"net/http"
"io/ioutil"
"fmt"
)
func GetDetail(w http.ResponseWriter, r *http.Request) {
b,_ := ioutil.ReadFile("file.json");
rawIn := json.RawMessage(string(b))
var objmap map[string]*json.RawMessage
err := json.Unmarshal(rawIn, &objmap)
if err != nil {
fmt.Println(err)
}
fmt.Println(objmap)
json.NewEncoder(w).Encode(objmap)
}
func main() {
router := mux.NewRouter()
router.HandleFunc("/detail", GetDetail).Methods("GET")
log.Fatal(http.ListenAndServe(":8000", router))
}
file.json
{
favourite_color:"blue",
attribute:{density:23,allergy:"peanuts",locations:["USA","Canada","Jamaica"]},
manufacture_year:1998
}
When I run go build; ./sampleproject, then go to my web browser at http://localhost:8000/detail, I get the error message:
invalid character 'f' looking for beginning of object key string
map[]
I've tried a few marshal methods, but they all give me different errors. I just need a working example to study from to better understand how all this works.
I should also mention that file.json does not have a fixed schema. It can change drastically at any minute to have a random set of data.
How do I get around the error invalid character f message and get my file.json to render at http://localhost:8000/detail?
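As cerise mentioned, it's just a formatting error in the JSON file. JSON requires every object key to be a double-quoted string, so favourite_color has to be written as "favourite_color", and the same goes for all the other keys.
Once the keys are quoted, the rest of the code works.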

Golang: Type [type] is not an expression; json config parsing

I'm trying to work out a bit of code to pull in config from a JSON file.
When I attempt to build, I get this error
type ConfigVars is not an expression
Below is the config and program code I'm trying to work with. Every example I've found so far is similar to the below code. Any suggestion of what I'm doing incorrectly?
-- Config File
{"beaconUrl":"http://test.com/?id=1"}
-- Program Code
package main
import (
"encoding/json"
"fmt"
"os"
)
type ConfigVars struct {
BeaconUrl string
}
func main() {
configFile, err := os.Open("config.json")
defer configFile.Close()
if err != nil {
fmt.Println("Opening config file", err.Error())
}
jsonParser := json.NewDecoder(configFile)
if err = jsonParser.Decode(&ConfigVars); err != nil {
fmt.Println("Parsing config file", err.Error())
}
}
What you're doing there is trying to take the address of the ConfigVars type itself. A type is not a value, so &ConfigVars doesn't mean anything, and that is what the compiler error is telling you. What you want to do is make a variable whose type is ConfigVars and pass a pointer to that instead:
var cfg ConfigVars
err = jsonParser.Decode(&cfg)
...
For others who run into this problem: you may find that you've forgotten to initialize the variable during assignment using the := operator, as described in Point 3 at the end of this GoLang tutorial.
var cfg ConfigVars
err := jsonParser.Decode(&cfg)

Parse broken HTML with golang

I need to find elements in an HTML string. Unfortunately the HTML is pretty badly broken (e.g. closing tags without a matching opening tag).
I tried to use XPath with launchpad.net/xmlpath, but it can't parse HTML that is this malformed.
How can I find elements in broken HTML with Go? I would prefer using XPath, but I am open to other solutions too, as long as I can use them to look for tags with a specific id or class.
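It seems golang.org/x/net/html does the job: it parses the broken HTML into a tree, and rendering that tree back out gives well-formed markup that xmlpath can handle.
So that's what I am doing now: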
package main
import (
"strings"
"golang.org/x/net/html"
"log"
"bytes"
"gopkg.in/xmlpath.v2"
)
func main() {
brokenHtml := `<!DOCTYPE html><html><body><h1 id="someid">My First Heading</h1><p>paragraph</body></html>`
reader := strings.NewReader(brokenHtml)
root, err := html.Parse(reader)
if err != nil {
log.Fatal(err)
}
var b bytes.Buffer
html.Render(&b, root)
fixedHtml := b.String()
reader = strings.NewReader(fixedHtml)
xmlroot, xmlerr := xmlpath.ParseHTML(reader)
if xmlerr != nil {
log.Fatal(xmlerr)
}
var xpath string
xpath = `//h1[@id='someid']`
path := xmlpath.MustCompile(xpath)
if value, ok := path.String(xmlroot); ok {
log.Println("Found:", value)
}
}