Excuse me for not being more specific in the title, but I don't know how to explain this without an example.
I have a .html file that looks like this:
<TR><TD>log p-value:</TD><TD>-2.797e+02</TD></TR>
<TR><TD>Information Content per bp:</TD><TD>1.736</TD></TR>
<TR><TD>Number of Target Sequences with motif</TD><TD>894.0</TD></TR>
<TR><TD>Percentage of Target Sequences with motif</TD><TD>47.58%</TD></TR>
<TR><TD>Number of Background Sequences with motif</TD><TD>10864.6</TD></TR>
<TR><TD>Percentage of Background Sequences with motif</TD><TD>22.81%</TD></TR>
<TR><TD>Average Position of motif in Targets</TD><TD>402.4 +/- 261.2bp</TD></TR>
<TR><TD>Average Position of motif in Background</TD><TD>400.6 +/- 246.8bp</TD></TR>
<TR><TD>Strand Bias (log2 ratio + to - strand density)</TD><TD>-0.0</TD></TR>
<TR><TD>Multiplicity (# of sites on avg that occur together)</TD><TD>1.48</TD></TR>
I read it in:
html = readLines("file.html")
I am interested in whatever is between </TD><TD> and </TD></TR>. When I run the following, I get the result I want:
mypattern = '<TR><TD>log p-value:</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
[1] "-2.797e+02"
It works well for almost all lines I want to match, but when I do the same thing for the last two lines, it does not extract anything.
mypattern = '<TR><TD>Strand Bias (log2 ratio + to - strand density)</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
character(0)
mypattern = '<TR><TD>Multiplicity (# of sites on avg that occur together)</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
character(0)
Why is this happening?
Thank you for your help.
If your data really is structured like this, you have an XML file of key/value pairs, so I assume it is easier to use an XML parser:
library(xml2)
xd <- read_xml("file.html", as_html = TRUE)
key_values <- xml_text(xml_find_all(xd, "//td"))
is_key <- as.logical(seq_along(key_values) %% 2)
setNames(key_values[!is_key], key_values[is_key])
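For comparison, the same alternating key/value pairing can be sketched in Python with the standard library (a hypothetical two-row sample, wrapped in a root and lowercased since ElementTree parses case-sensitive XML):

```python
import xml.etree.ElementTree as ET

# Two sample rows, wrapped in a root element and lowercased (ElementTree
# parses XML, which is case-sensitive, unlike the uppercase tags in the file).
html = """<table>
<tr><td>log p-value:</td><td>-2.797e+02</td></tr>
<tr><td>Information Content per bp:</td><td>1.736</td></tr>
</table>"""

cells = [td.text for td in ET.fromstring(html).iter("td")]
pairs = dict(zip(cells[0::2], cells[1::2]))   # even indices = keys, odd = values
print(pairs)
```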
First, I'll say that I would actually solve this problem like this:
gsub(".+>([^<]+)</TD></TR>", "\\1", html)
#> [1] "-2.797e+02" "1.736" "894.0"
#> [4] "47.58%" "10864.6" "22.81%"
#> [7] "402.4 +/- 261.2bp" "400.6 +/- 246.8bp" "-0.0"
#> [10] "1.48"
But, to answer the question of why your way didn't work, we need to check out the help file for R regular expressions (help("regex")):
Any metacharacter with special meaning may be quoted by preceding it with a backslash. The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ? ...
The patterns that you had trouble with included parentheses, which you needed to escape (note the double backslash, since backslashes themselves need to be escaped):
mypattern = '<TR><TD>Multiplicity \\(# of sites on avg that occur together\\)</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
# [1] "1.48"
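The same pitfall exists in other regex engines. A Python sketch (using one line from the file above) shows the unescaped parentheses forming a capture group that never matches the literal text, and an escaping helper fixing it:

```python
import re

line = "<TR><TD>Multiplicity (# of sites on avg that occur together)</TD><TD>1.48</TD></TR>"
label = "Multiplicity (# of sites on avg that occur together)"

# Unescaped, the parentheses form a capture group, so the pattern expects the
# text WITHOUT literal parentheses -- and never matches.
bad = re.search(label + r"</TD><TD>([^<]*)", line)
print(bad)                    # None

# re.escape() backslash-escapes every metacharacter, so the match succeeds.
good = re.search(re.escape(label) + r"</TD><TD>([^<]*)", line)
print(good.group(1))          # 1.48
```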
I made a very simple Octave script
a = [10e6, 11e6, 12e6];
b = [10, 11, 12];
plot(a, b, 'rd-')
which outputs the following graph.
Is it possible to set the numbering on the x-axis to engineering notation, rather than scientific, and have it display "10.5e+6, 11e+6, 11.5e+6" instead of "1.05e+7, 1.1e+7, 1.15e+7"?
While Octave provides a 'short eng' formatting option, which does what you're asking for when printing to the terminal, it does not appear to provide this functionality in plots or when formatting strings via sprintf.
Therefore you'll have to find a way to do this by yourself, with some creative string processing of the initial xticks, and substituting the plot's ticklabels accordingly. Thankfully it's not that hard :)
Using your example:
a = [10e6, 11e6, 12e6];
b = [10, 11, 12];
plot(a, b, 'rd-')
format short eng % display stdout in engineering format
TickLabels = disp( xticks ) % collect string as it would be displayed on the stdout
TickLabels = strsplit( TickLabels ) % tokenize at spaces
TickLabels = TickLabels( 2 : end - 1 ) % discard start and end empty tokens
TickLabels = regexprep( TickLabels, '\.0+e', 'e' ) % remove purely zero decimals using a regular expression
TickLabels = regexprep( TickLabels, '(\.[1-9]*)0+e', '$1e' ) % remove non-significant zeros in non-zero decimals using a regular expression
xticklabels( TickLabels ) % set the new ticklabels to the plot
format % reset short eng format back to default, if necessary
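For comparison only, the same engineering-notation idea (forcing the exponent to a multiple of 3 and stripping non-significant zeros) can be sketched in Python; eng_format is a hypothetical helper, not an Octave function:

```python
import math

def eng_format(x):
    """Hypothetical helper: scientific notation whose exponent is a multiple of 3."""
    if x == 0:
        return "0"
    exp = 3 * math.floor(math.log10(abs(x)) / 3)
    mantissa = x / 10 ** exp
    sign = "+" if exp >= 0 else "-"
    # %g drops trailing zeros from the mantissa, like the regexprep steps above
    return f"{mantissa:g}e{sign}{abs(exp):02d}"

for x in (10.5e6, 11e6, 11.5e6):
    print(eng_format(x))   # 10.5e+06, 11e+06, 11.5e+06
```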
I have a data.table where one of the columns contains JSON. I am trying to extract the content so that each variable is a column.
library(jsonlite)
library(data.table)
df<-data.table(a=c('{"tag_id":"34","response_id":2}',
'{"tag_id":"4","response_id":1,"other":4}',
'{"tag_id":"34"}'),stringsAsFactors=F)
The desired result, which does not include the "other" variable:
tag_id response_id
1 "34" 2
2 "4" 1
3 "34" NA
I have tried several versions of:
parseLog <- function(x){
  if (is.na(x))
    e = c(tag_id=NA, response_id=NA)
  else {
    j = fromJSON(x)
    e = c(tag_id=as.integer(j$tag_id), response_id=j$response_id)
  }
  e
}
That seems to work well for retrieving a list of vectors (or lists, if c is replaced by list), but when I try to convert the list to a data.table, something doesn't work as expected.
parsed<-lapply(df$a,parseLog)
rparsed<-do.call(rbind.data.frame,parsed)
colnames(rparsed)<-c("tag_id","response_id")
This fails because of the missing value in the third row. How can I solve it in a clean, R-ish way? How can I make my parse function return NA for the missing value? Alternatively, is there a "fill" parameter, like the one rbind has, that can be used in rbind.data.frame or an analogous method?
The dataset I am using has 11M rows so performance is important.
Additionally, is there an equivalent of rbind.data.frame that returns a data.table, and how would it be used? When I check the documentation it refers me to rbindlist, but it complains that the parameter is not used, and if I call it directly (without do.call) it complains about the type of parsed:
rparsed<-do.call(rbindlist,fill=T,parsed)
EDIT: The case I need to cover is more general, in a set of 11M records all the possible circumstances happen:
df<-data.table(a=c('{"tag_id":"34","response_id":2}',
'{"trash":"34","useless":2}',
'{"tag_id":"4","response_id":1,"other":4}',
NA,
'{"response_id":"34"}',
'{"tag_id":"34"}'),stringsAsFactors=F)
and the output should only contain tag_id and response_id columns.
There might be a simpler way but this seems to be working:
library(data.table)
library(jsonlite)
df[, json := sapply(a, fromJSON)][, rbindlist(lapply(json, data.frame), fill=TRUE)]
#or if you need all the columns :
#df[, json := sapply(a, fromJSON)][,
# c('tag_id', 'response_id') := rbindlist(lapply(json, data.frame), fill=TRUE)]
Output:
> df[, json := sapply(a, fromJSON)][, rbindlist(lapply(json, data.frame), fill=TRUE)]
tag_id response_id
1: 34 2
2: 4 1
3: 34 NA
EDIT:
This solution comes after the question was edited with additional requirements.
There are lots of ways to do this, but I find the simplest is to drop the unwanted columns when each data.frame is created, like this:
df[, json := sapply(a, fromJSON)][,
rbindlist(lapply(json, function(x) data.frame(x)[-3]), fill=TRUE)]
# tag_id response_id
#1: 34 2
#2: 4 1
#3: 34 NA
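For comparison, the same tag_id/response_id extraction with NA-style filling can be sketched in Python using the edited example rows (None plays the role of R's NA):

```python
import json

rows = ['{"tag_id":"34","response_id":2}',
        '{"trash":"34","useless":2}',
        '{"tag_id":"4","response_id":1,"other":4}',
        None,                      # stands in for R's NA entry
        '{"response_id":"34"}',
        '{"tag_id":"34"}']

wanted = ("tag_id", "response_id")

def parse_row(r):
    # A missing row yields an empty dict; .get() fills absent keys with
    # None, the Python analogue of NA. Extra keys are simply ignored.
    d = json.loads(r) if r is not None else {}
    return {k: d.get(k) for k in wanted}

parsed = [parse_row(r) for r in rows]
```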
So I am trying to open some JSON files to look for a publication year and sort them accordingly. But before doing this, I decided to experiment on a single file. I am having trouble though, because although I can get the files and the strings, when I try to print one word, it prints individual characters instead.
For example:
print data2[1]
# THE BRIDES ORNAMENTS, Viz. Fiue MEDITATIONS, Morall and Diuine.
but now
print data2[1][0]  # should print THE
# T
This is my code right now:
json_data = open(path)
data = json.load(json_data)
data2 = []
for x in range(0, len(data)):
    data2.append(data[x]['section'])
    if len(data[x]['content']) > 0:
        for i in range(0, len(data[x]['content'])):
            data2.append(data[x]['content'][i])
I probably need to look at your json file to be absolutely sure, but it seems to me that the data2 list is a list of strings. Thus, data2[1] is a string. When you do data2[1][0], the expected result is what you are getting - the character at the 0th index in the string.
>>> data2[1]
'THE BRIDES ORNAMENTS, Viz. Fiue MEDITATIONS, Morall and Diuine.'
>>> data2[1][0]
'T'
To get the first word, naively, you can split the string by spaces
>>> data2[1].split()
['THE', 'BRIDES', 'ORNAMENTS,', 'Viz.', 'Fiue', 'MEDITATIONS,', 'Morall', 'and', 'Diuine.']
>>> data2[1].split()[0]
'THE'
However, this will cause issues with punctuation, so you probably need to tokenize the text. This link should help - http://www.nltk.org/_modules/nltk/tokenize.html
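A minimal sketch of the difference between naive splitting and a word-character regex (a rough stand-in for a real tokenizer such as NLTK's):

```python
import re

line = "THE BRIDES ORNAMENTS, Viz. Fiue MEDITATIONS, Morall and Diuine."

print(line.split()[2])        # 'ORNAMENTS,' -- punctuation stays attached
words = re.findall(r"\w+", line)
print(words[:4])              # ['THE', 'BRIDES', 'ORNAMENTS', 'Viz']
```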
I have string:
string <- "{'text': u'Kandydaci PSL do Parlamentu Europejskiego \\u2013 OKR\\u0118G nr 1: Obejmuje obszar wojew\\xf3dztwa pomorskiego z siedzib\\u0105 ok... http://t.co/aZbjK7ME1O', 'created_at': u'Mon May 19 11:30:07 +0000 2014'}"
As you can see I have escape codes instead of letters. As far as I know these are Unicode escape codes for Polish characters like ą, ć, ź, ó and so on. How can I convert this string to obtain the output
"{'text': u'Kandydaci PSL do Parlamentu Europejskiego \\u2013 OKRĄG nr 1: Obejmuje obszar województwa pomorskiego z siedzibą ok... http://t.co/aZbjK7ME1O', 'created_at': u'Mon May 19 11:30:07 +0000 2014'}"
Here's a regular expression to find all escaped characters of the form \uhhhh and \xhh (h = hex digit). We then take those values and re-parse them to turn them into characters. Finally, we replace the original matched values with the true characters:
m <- gregexpr("\\\\u[0-9A-Fa-f]{4}|\\\\x[0-9A-Fa-f]{2}", string)
a <- enc2utf8(sapply(parse(text=paste0('"', regmatches(string,m)[[1]], '"')), eval))
regmatches(string,m)[[1]] <- a
This will do them all. If you only want to do a subset, you could filter the vector of possible replacements.
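For comparison, Python can decode the same \uhhhh / \xhh escapes directly via the unicode_escape codec (shown on a toy substring of the example string):

```python
# The string below contains literal backslash escapes, as in the R example.
s = "OKR\\u0118G nr 1: wojew\\xf3dztwa pomorskiego z siedzib\\u0105"

# unicode_escape re-parses \uhhhh and \xhh sequences into real characters.
decoded = s.encode("ascii").decode("unicode_escape")
print(decoded)   # OKRĘG nr 1: województwa pomorskiego z siedzibą
```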
So, I parsed HTML code from FIFA worldcup website, and want to get all the matches:
library(XML)
wcup <- htmlTreeParse("http://www.fifa.com/worldcup/matches/", useInternalNodes=TRUE)
However, the class attribute for one country is 't-nText kern', while for the rest of the countries it is 't-nText '.
<span class="t-nText kern">Bosnia and Herzegovina</span>
Therefore, if I use the following command, I will miss 'Bosnia and Herzegovina':
xpathSApply(wcup, "//span[@class='t-nText ']", xmlValue)
So, is there any way that I can search for both attributes 't-nText ' and 't-nText kern' at the same time? Or do you have any other solution? I want to keep the order of the matches as is.
XPath doesn't support C-style || for logical OR:
xpathSApply(wcup, "//span[@class='t-nText ' || 't-nText kern']", xmlValue)
XPath error : Invalid expression
//span[@class='t-nText ' || 't-nText kern']
^
XPath error : Invalid expression
//span[@class='t-nText ' || 't-nText kern']
^
Error in xpathApply.XMLInternalDocument(doc, path, fun, ..., namespaces = namespaces, :
error evaluating xpath expression //span[@class='t-nText ' || 't-nText kern']
Use 'or', or perhaps 'starts-with()':
wcup["//span[@class='t-nText kern' or @class='t-nText ']"]
wcup["//span[starts-with(@class, 't-nText ')]"]
I originally posted this, then noticed order was needed, so I searched SO for "XPath OR".
Why not just append the results of the two searches together:
c( xpathSApply(wcup, "//span[@class='t-nText kern']", xmlValue),
xpathSApply(wcup, "//span[@class='t-nText ']", xmlValue)
)
Lo and behold I came up with:
xpathSApply(wcup, "//*[starts-with(@class,'t-nText')]", xmlValue)
Which appears mighty similar to Martin Morgan's solution. I had not realized that XPath was its own language. Guess I'm at least 10 years behind the times.
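For comparison, the same prefix-matching idea can be sketched in Python with the standard library, which also preserves document order (toy markup standing in for the FIFA page):

```python
import xml.etree.ElementTree as ET

# Toy markup standing in for the FIFA page; note the trailing space in the
# plain 't-nText ' class, as in the original HTML.
html = """<div>
<span class="t-nText ">Brazil</span>
<span class="t-nText kern">Bosnia and Herzegovina</span>
<span class="t-nText ">Croatia</span>
</div>"""

# Filtering on the class prefix keeps document order, like starts-with() in XPath.
teams = [s.text for s in ET.fromstring(html).iter("span")
         if s.get("class", "").startswith("t-nText")]
print(teams)   # ['Brazil', 'Bosnia and Herzegovina', 'Croatia']
```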