Use regex on htmlParseTree in R

I have an HTML internal document that I want to extract character strings from. Specifically, I am trying to parse Google search results.
library(RCurl)
##create search query
vcSearchInput <- "Alberta+Alabama+USA+latitude+longitude"
##scrape google results (getURL returns a character string)
vcSearchOutput <- getURL(paste0("http://www.google.com/search?q=", vcSearchInput))
From here, I can see that exactly what I want comes after:
<a href="http://maps.google.com/maps?um=1&ie=UTF-8&fb=1&gl=us&sa=X&ll=
I have figured out converting to character:
vcSearchOutput <- paste(capture.output(vcSearchOutput, file="test.txt"), collapse="")
But, of course, my search string has TONS of special characters that require escaping.
I tried:
gregexpr("http\\:\\/\\/maps\\.google\\.com\\/maps\\?um\\=1\\&amp\\;ie\\=UTF\\-8\\&amp\\;fb\\=1\\&amp\\;gl\\=us\\&amp\\;sa\\=X\\&amp\\;ll\\=",vaSearchOutput,ignore.case=T)
I tried:
regmatches(regexpr("maps\\.google\\.com.*", vcSearchOutput, ignore.case=T), vcSearchOutput)
and received:
Error in so + attr(m, "match.length")[ind] :
non-numeric argument to binary operator
So how can I work with these kinds of variable types to find regular expressions?
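A sketch of one way through this (untested against live Google results, since the markup changes): getURL() already returns a character vector, so the capture.output() round-trip isn't needed, and passing fixed=TRUE makes the pattern a literal string so nothing needs escaping. Also note that regmatches() takes the string first and the match object second; swapping them is exactly what produces the "non-numeric argument to binary operator" error above.
library(RCurl)
## the anchor text as one plain literal -- with fixed = TRUE, no escaping is needed
vcAnchor <- "http://maps.google.com/maps?um=1&amp;ie=UTF-8&amp;fb=1&amp;gl=us&amp;sa=X&amp;ll="
m <- regexpr(vcAnchor, vcSearchOutput, fixed = TRUE)
## string first, match object second
regmatches(vcSearchOutput, m)
## to pull out what follows the anchor (up to the closing quote of the href),
## a regex is still handy -- here only . and ? need escaping
m2 <- regexpr('maps\\.google\\.com/maps\\?[^"]*', vcSearchOutput)
regmatches(vcSearchOutput, m2)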

Related

Double quote handling when exporting JSON field with BigQuery

I am making use of the JSON datatype in BigQuery and I have a table that looks like this:
myStringField | myJSONField
----------------|----------------------------------
someStringValue | {"key1":"value1", "key1":"value2"}
In SQL, everything works fine. But when it comes to exporting data, it gets messy. For instance, if I click the "Save results" button and choose the "CSV (local file)" option, I obtain the following content in my CSV:
myStringField,myJSONField
someStringValue,"{""key1"":""value1"", ""key1"":""value2""}"
As you can see, I get "double double quotes" inside my JSON and it makes things complicated to parse for the downstream system that receives the file.
I tried to fix it by using different combinations of JSON functions such as PARSE_JSON(), TO_JSON_STRING(), STRING() but nothing worked and, in some cases, it even made things worse ("triple double quotes").
Ideally, the expected output of my CSV should resemble this:
myStringField,myJSONField
someStringValue,{"key1":"value1", "key1":"value2"}
Any workaround?
According to the docs, exporting JSON fields to a CSV format has some limitations:
When you export data in JSON format, INT64 (integer) data types are encoded as JSON strings to preserve 64-bit precision when the data is read by other systems.
When you export a table in JSON format, the symbols <, >, and & are converted by using the unicode notation \uNNNN, where N is a hexadecimal digit. For example, profit&loss becomes profit\u0026loss. This unicode conversion is done to avoid security vulnerabilities.
Check out the export limitations here: https://cloud.google.com/bigquery/docs/exporting-data#export_limitations
Regarding the export format you mentioned, that is the expected way to escape double quote characters in CSV, so this is the expected output.
The outer quotes are there because of the CSV encoding mechanism for strings, and every double quote inside that string is escaped with another double quote:
"{""key1"":""value1""}"
Any CSV parser out there should support this format with the right setup.
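To illustrate, a minimal sketch in Python (the filename results.csv is a hypothetical stand-in for the "Save results" export): the standard csv module collapses the doubled quotes back to single ones by default, after which the JSON column parses cleanly.
import csv
import json

# results.csv is the exported file from the question (hypothetical name)
with open("results.csv", newline="") as f:
    for row in csv.DictReader(f):
        # csv has already turned "" back into " by the time we read the value
        payload = json.loads(row["myJSONField"])
        print(payload["key1"])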

Spark reading multiple files : double quotes replaced by %22

I have a requirement to read random JSON files in different folders where data has changed, so I can't apply a regex pattern to read them. I know which files they are and I can list them. But when I build a string with all the file paths and try to read the JSON in Spark, the double quotes are replaced by %22 and reading the files fails. Could anyone please help?
val FilePath = "\"/path/2019/02/01/*\"" + "," + "\"/path/2019/02/05/*\"" + "," + "\"/path/2019/02/24/*\""
FilePath: String = "/path/2019/02/01/*","/path/2019/02/05/*","/path/2019/02/24/*"
Now when I use this variable to read the JSON files, it fails with an error and the quotes are replaced by %22.
spark.read.json(FilePath)
java.lang.IllegalArgumentException: java.net.URISyntaxException: Illegal character in scheme name at index 0: "/path/2019/02/01/*%22,%22/path/2019/02/05/*%22,%22/path/2019/02/24/*%22
I've just tried this with an older version of Spark (1.6.0) and it works fine if you supply separate paths or wildcard patterns as varargs to the json method, i.e.:
sqlContext.read.json("foo/*", "bar/*")
When you pass multiple patterns in a single string, Spark tries to construct a single URI from them, which is incorrect, and it URL-encodes the quote characters as %22.
As an aside, trying to create a URI is failing because your string starts with a double-quote, which is an illegal character in that position (RFC 3986):
Scheme names consist of a sequence of characters beginning with a
letter and followed by any combination of letters, digits, plus
("+"), period ("."), or hyphen ("-").
Add the paths to a list (e.g. pathList) and use it as below:
spark.read.option("basePath", basePath).json(pathList: _*)
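A minimal sketch of that, assuming the three folders from the question:
// build the paths as a proper collection instead of one quoted string
val pathList = Seq(
  "/path/2019/02/01/*",
  "/path/2019/02/05/*",
  "/path/2019/02/24/*"
)

// pathList: _* expands the collection into varargs, so each path
// is handled as its own URI instead of one malformed string
val df = spark.read.json(pathList: _*)
(The basePath option is only needed if you want partition discovery relative to a common root directory.)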

Change single backslash in R character string to valid JSON string

I have a string in R which escapes quotation marks:
my_text = {\"stim\":[\"platery\",\"denial\",\"generic\"]}
When using cat() I get:
{"stim":["platery","denial","generic"]}
Now my whole string is a JSON string that needs to be parsed, but it is evaluated as invalid by JSONLint. If I copy & paste the cat() version, it is valid JSON, so I think I am just missing some pre-processing here.
I saw this SO post here, and this one, and this really good one, so I tried to replace the single backslashes with double backslashes for the JSON parser:
gsub("\\\\", "\\\\\\\\", my_text, fixed=TRUE)
but it didn't change my string as I wanted. How can I change the string to become valid JSON?
As Wiktor said, your gsub didn't work because you are attempting to replace backslashes, but there aren't any backslashes in your string; R is just using backslashes as a way to display the double quotes. The third SO post you link does a good job explaining R's string literals, which addresses this: a literal backslash is itself written as a double backslash in R.
My first piece of advice would be to use the R package jsonlite to construct your JSON from an R object as opposed to a string, if possible (here's the vignette).
Example:
myJSON <- jsonlite::toJSON(list(stim=c("platery","denial","generic")))
# {"stim":["platery","denial","generic"]}
Second (as the third SO post again explains well), copying/pasting the print method of the string may not be the best way to validate the JSON. I'm not sure of the use case, but R displaying the double quotes with escape characters is probably not a bad thing.
If you want a prettier print method you can use numerous tricks in R (noquote(), cat(), print(quote=F)), but this won't change the fact that the default print method shows the double quotes with backslashes:
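For example (illustrative; output shown in comments):
my_text <- '{"stim":["platery","denial","generic"]}'

print(my_text)    # [1] "{\"stim\":[\"platery\",\"denial\",\"generic\"]}"
noquote(my_text)  # [1] {"stim":["platery","denial","generic"]}
cat(my_text)      # {"stim":["platery","denial","generic"]}
nchar(my_text)    # [1] 39 -- no backslashes are actually stored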
Additionally, in some cases constructing the JSON isn't necessary. I have an API built with the plumber package that returns a list as JSON without any modifications (recJSON <- list(message=messages, recommendations=list(name=names, link=URLs))).

Reading unescaped backslashes in JSON into R

I'm trying to read some data from the Facebook Graph API Explorer into R to do some text analysis. However, it looks like there are unescaped backslashes in the JSON feed, which is causing rjson to barf. The following is a minimal example of the kind of input that's causing problems.
library(rjson)
txt <- '{"data":[{"id":2, "value":"I want to \\"post\\" a picture\\video"}]}'
fromJSON(txt)
(Note that the double backslashes at \\" and \\video will convert to single backslashes after parsing, which is what's in my actual data.)
I also tried the RJSONIO package which also gave errors, and even crashed R at times.
Has anyone come across this problem before? Is there a way to fix this short of manually hunting down every error that crops up? There's potentially megabytes of JSON being parsed, and the error messages aren't very informative about where exactly the problematic input is.
Just replace backslashes that aren't escaping double quotes, tabs or newlines with double backslashes.
In the regular expression, '\\\\' is converted to one backslash (two levels of escaping are needed: one for R, one for the regular expression engine). A negative lookahead leaves the valid escape sequences alone; older stringr versions needed the perl() modifier for this, but the current default (ICU) engine supports lookahead directly.
library(stringr)
## double any backslash not followed by t, n or "
txt2 <- str_replace_all(txt, '\\\\(?![tn"])', '\\\\\\\\')
fromJSON(txt2)
The problem is that you are trying to parse invalid JSON:
library(jsonlite)
txt <- '{"data":[{"id":2, "value":"I want to \\"post\\" a picture\\video"}]}'
validate(txt)
The problem is the picture\\video part because \v is not a valid JSON escape sequence, even though it is a valid escape sequence in R and some other languages. Perhaps you mean:
library(jsonlite)
txt <- '{"data":[{"id":2, "value":"I want to \\"post\\" a picture\\/video"}]}'
validate(txt)
fromJSON(txt)
Either way, the problem is with the JSON data source that is generating invalid JSON. If this data really comes from Facebook, you have found a bug in their API. But more likely you are not retrieving it correctly.

Unescaping Characters in a JSON response string

I made a JSON request that gives me a string containing Unicode escape sequences, which looks like:
s = "\u003Cp\u003E"
And I want to convert it to:
s = "<p>"
What's the best way to do this in Python?
Note, this is the same question as this one, only in Python instead of Ruby. I am also using the Posterous API.
>>> "\\u003Cp\\u003E".decode('unicode-escape')
u'<p>'
If the data came from JSON, the json module should already have decoded these escapes for you:
>>> import json
>>> json.loads('"\u003Cp\u003E"')
u'<p>'
EDIT: The original question, "Unescaping Characters in a String with Python", did not clarify whether the string was to be written or to be read (later on, the words "JSON response" were added to clarify that the intention was to read).
So I answered the opposite question: how to write JSON-serialized data, dumping it to an unescaped string (rather than loading data from the string).
My use case was producing a JSON file from my own data dictionary, but the file contained escaped non-ASCII characters. So I did it like this:
import json

with open(filename, 'w') as jsonfile:
    jsonstr = json.dumps(myDictionary, ensure_ascii=False)
    print(jsonstr)           # to screen
    jsonfile.write(jsonstr)  # to file
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
Taken from here: https://docs.python.org/3/library/json.html