Detecting encoding of string and converting it - json

I have a string:
string <- "{'text': u'Kandydaci PSL do Parlamentu Europejskiego \\u2013 OKR\\u0118G nr 1: Obejmuje obszar wojew\\xf3dztwa pomorskiego z siedzib\\u0105 ok... http://t.co/aZbjK7ME1O', 'created_at': u'Mon May 19 11:30:07 +0000 2014'}"
As you can see, I have some escape codes instead of letters. As far as I know these are Unicode escapes for Polish characters like ą, ć, ź, ó and so on. How can I convert this string to obtain the output
"{'text': u'Kandydaci PSL do Parlamentu Europejskiego \\u2013 OKRĄG nr 1: Obejmuje obszar województwa pomorskiego z siedzibą ok... http://t.co/aZbjK7ME1O', 'created_at': u'Mon May 19 11:30:07 +0000 2014'}"

Here's a regular expression to find all escaped characters of the form \uXXXX and \xXX. We then take those values and re-parse them to turn them into actual characters. Finally, we replace the original matched values with the true characters:
m <- gregexpr("\\\\u\\d{4}|\\\\x[0-9A_Fa-f]{2}", string)
a <- enc2utf8(sapply(parse(text=paste0('"', regmatches(string,m)[[1]], '"')), eval))
regmatches(string,m)[[1]] <- a
This will do them all. If you only want to do a subset, you could filter the vector of possible replacements.
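For the sample string above, this should produce the following (a sketch of the expected result; note that every escape gets converted, including the en dash \u2013):
string
# [1] "{'text': u'Kandydaci PSL do Parlamentu Europejskiego – OKRĘG nr 1: Obejmuje obszar województwa pomorskiego z siedzibą ok... http://t.co/aZbjK7ME1O', 'created_at': u'Mon May 19 11:30:07 +0000 2014'}"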

Related

Why does Octave error in the function huffmandeco about large index types?

I've got a little MATLAB script which I'm trying to understand. It doesn't do very much: it only reads text from a file and encodes and decodes it with the Huffman functions.
But it throws an error while decoding:
"error: out of memory or dimension too large for Octave's index type
error: called from huffmandeco>dict2tree at line 95 column 19"
I don't know why, because I debugged it and don't see a large index type.
I added the part which calculates p from the input text.
% text is a random input text file in ASCII
% calculate the relative frequency of every symbol
for i = 0:127
    nlet = length(find(text == i));
    p(i+1) = nlet / length(text);
end
symb = 0:127;
dict = huffmandict(symb, p);         % Create dictionary
compdata = huffmanenco(fdata, dict); % Encode the data
dsig = huffmandeco(compdata, dict);  % Decode the Huffman code
I can only use Octave instead of MATLAB. I don't know if this is an unexpected error. I use Octave version 6.2.0 on Windows 10. I tried the large-data build; it didn't change anything. Does anyone know what causes the error in this context?
EDIT:
I debugged the code again. Inside huffmandeco I found the following function:
function tree = dict2tree (dict)
  L = length (dict);
  lengths = zeros (1, L);
  ## the depth of the tree is limited by the maximum word length
  for i = 1:L
    lengths(i) = length (dict{i});
  endfor
  m = max (lengths);
  tree = zeros (1, 2^(m+1)-1) - 1;
  for i = 1:L
    pointer = 1;
    word = dict{i};
    for bit = word
      pointer = 2 * pointer + bit;
    endfor
    tree(pointer) = i;
  endfor
endfunction
The maximum word length m in this case is 82, so the function tries to allocate:
tree = zeros (1, 2^(82+1)-1) - 1
So it's obvious why the error complains about a too-large index type.
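A quick sanity check at the Octave prompt shows how hopeless that allocation is (a sketch; the exact printout may vary):
>> 2^(82+1) - 1
ans = 9.6714e+24
That is roughly 10^25 array elements, far more than any index type (or any machine's memory) can address.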
But there must be a solution, or another error on my side, because this code has been tested before.
I haven't weeded through the code enough to know why yet, but huffmandict is not ignoring zero-probability symbols the way it claims to. Nor have I been able to find a bug report on Savannah, but again I haven't searched thoroughly.
A workaround is to limit the symbol list and their probabilities to only the symbols that actually occur. Using containers.Map would be ideal, but in Octave you can do that with a couple of the outputs from unique:
% Create a symbol table of the unique characters in the input string
% and the indices into the table for each character in the string.
[symbols, ~, inds] = unique(textstr);
inds = inds.'; % just make it easier to read
For the string
textstr = 'Random String Input.';
the result is:
>> symbols
symbols = .IRSadgimnoprtu
>> inds
inds =
Columns 1 through 19:
4 6 11 7 12 10 1 5 15 14 9 11 8 1 3 11 13 16 15
Column 20:
2
So the first symbol in the input string is symbols(4), the second is symbols(6), and so on.
From there, you just use symbols and inds to create the dictionary and encode/decode the signal. Here's a quick demo script:
textstr = 'Random String Input.';
fprintf("Starting string: %s\n", textstr);
% Create a symbol table of the unique characters in the input string
% and the indices into the table for each character in the string.
[symbols, ~, inds] = unique(textstr);
inds = inds.'; % just make it easier to read
% Calculate the frequency of each symbol in table
% max(inds) == numel(symbols)
p = histc(inds, 1:max(inds))/numel(inds);
dict = huffmandict(symbols, p);
compdata = huffmanenco(inds, dict);
dsig = huffmandeco(compdata, dict);
fprintf("Decoded string: %s\n", symbols(dsig));
And the output:
Starting string: Random String Input.
Decoded string: Random String Input.
To encode strings other than the original input string, you would have to map the characters to symbol indices (ensuring that all symbols in the string are actually present in the symbol table, obviously):
>> [~, s_idx] = ismember('trogdor', symbols)
s_idx =
15 14 12 8 7 12 14
>> compdata = huffmanenco(s_idx, dict);
>> dsig = huffmandeco(compdata, dict);
>> fprintf("Decoded string: %s\n", symbols(dsig));
Decoded string: trogdor

Unexpected behavior of gsub in R

Excuse me for not being more specific in the title, but I don't know how to explain this without an example.
I have a .html file that looks like this:
<TR><TD>log p-value:</TD><TD>-2.797e+02</TD></TR>
<TR><TD>Information Content per bp:</TD><TD>1.736</TD></TR>
<TR><TD>Number of Target Sequences with motif</TD><TD>894.0</TD></TR>
<TR><TD>Percentage of Target Sequences with motif</TD><TD>47.58%</TD></TR>
<TR><TD>Number of Background Sequences with motif</TD><TD>10864.6</TD></TR>
<TR><TD>Percentage of Background Sequences with motif</TD><TD>22.81%</TD></TR>
<TR><TD>Average Position of motif in Targets</TD><TD>402.4 +/- 261.2bp</TD></TR>
<TR><TD>Average Position of motif in Background</TD><TD>400.6 +/- 246.8bp</TD></TR>
<TR><TD>Strand Bias (log2 ratio + to - strand density)</TD><TD>-0.0</TD></TR>
<TR><TD>Multiplicity (# of sites on avg that occur together)</TD><TD>1.48</TD></TR>
I read it in:
html = readLines("file.html")
I am interested in whatever is between </TD><TD> and </TD></TR>. When I run the following, I get the result I want:
mypattern = '<TR><TD>log p-value:</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
[1] "-2.797e+02"
It works well for almost all lines I want to match, but when I do the same thing for the last two lines, it does not extract anything.
mypattern = '<TR><TD>Strand Bias (log2 ratio + to - strand density)</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
character(0)
mypattern = '<TR><TD>Multiplicity (# of sites on avg that occur together)</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
character(0)
Why is this happening?
Thank you for your help.
If your data structure is really like this, you have an XML-style file of keys and values, so it may be easier to exploit that structure:
library(xml2)
# parse the file as HTML
xd <- read_xml("file.html", as_html = TRUE)
# extract the text of every <td>; they alternate key, value, key, value, ...
key_values <- xml_text(xml_find_all(xd, "//td"))
# odd positions are keys, even positions are values
is_key <- as.logical(seq_along(key_values) %% 2)
setNames(key_values[!is_key], key_values[is_key])
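The result is a named character vector: the names are the keys (e.g. "log p-value:") and the values are the contents of the second <TD> of each row (e.g. "-2.797e+02").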
First, I'll say that I would actually solve this problem like this:
gsub(".+>([^<]+)</TD></TR>", "\\1", html)
#> [1] "-2.797e+02" "1.736" "894.0"
#> [4] "47.58%" "10864.6" "22.81%"
#> [7] "402.4 +/- 261.2bp" "400.6 +/- 246.8bp" "-0.0"
#> [10] "1.48"
But, to answer the question of why your way didn't work, we need to check out the help file for R regular expressions (help("regex")):
Any metacharacter with special meaning may be quoted by preceding it with a backslash. The metacharacters in extended regular expressions are . \ | ( ) [ { ^ $ * + ? ...
The patterns that you had trouble with included parentheses, which you needed to escape (note the double backslash, since backslashes themselves need to be escaped):
mypattern = '<TR><TD>Multiplicity \\(# of sites on avg that occur together\\)</TD><TD>([^<]*)</TD></TR>'
gsub(mypattern,'\\1',grep(mypattern,html,value=TRUE))
# [1] "1.48"

How does the 'k' modifier in FINDC() work in SAS?

I'm reading through the book "SAS Functions by Example, Second Edition" and I'm having trouble understanding a certain function because of the example and output they give.
Function: FINDC
Purpose: To locate a character that appears or does not appear within a string. With optional arguments, you can define the starting point for the search, set the direction of the search, ignore case or trailing blanks, or look for characters except the ones listed.
Syntax: FINDC(character-value, find-characters <,'modifiers'> <,start>)
Two of the modifiers are i and k:
i ignore case
k count only characters that are not in the list of find-characters
So now one of the examples has this:
Note: STRING1 = "Apples and Books"
FINDC(STRING1,"aple",'ki')
For the output, they say it returns 1 because of the position of "A" in "Apples". However, this is what confuses me: I thought the k modifier says to find characters that are not in the find-characters list. So why is it finding "A" when that letter, case-ignored, is in the find-characters list? To me, this example should output 6, for the "s" in "Apples".
Is anyone able to help explain the k modifier to me any better, and why the output for this answer is 1 instead of 6?
Edit 1
Reading the SAS documentation online, I found this example which seems to contradict the book I'm reading:
Example 3: Searching for Characters and Using the K Modifier
This example searches a character string and returns the characters that do
not appear in the character list.
data _null_;
   string = 'Hi, ho!';
   charlist = 'hi';
   j = 0;
   do until (j = 0);
      j = findc(string, charlist, "k", j+1);
      if j = 0 then put +3 "That's all";
      else do;
         c = substr(string, j, 1);
         put +3 j= c=;
      end;
   end;
run;
SAS writes the following output to the log:
j=1 c=H
j=3 c=,
j=4 c=
j=6 c=o
j=7 c=!
That's all
So, is the book wrong?
The book is wrong.
511 data _null_;
512 STRING1 = "Apples and Books" ;
513 x=FINDC(STRING1,"aple",'ki');
514 put x=;
515 if x then do;
516 ch=char(string1,x);
517 put ch=;
518 end;
519 run;
x=6
ch=s
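For comparison, here's a quick sketch (my own, not from the book) showing that without the k modifier FINDC does what the book described, returning the position of the first character that is in the list, ignoring case:
data _null_;
   string1 = "Apples and Books";
   x = findc(string1, "aple", 'i'); * no k: first character IN the list, case-insensitive;
   put x=;                          * logs x=1, the position of the A in Apples;
run;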

Parsing JSON files

So I am trying to open some JSON files to look for a publication year and sort them accordingly. But before doing this, I decided to experiment on a single file. I'm having trouble, though, because although I can get the files and the strings, when I try to print one word, it starts printing individual characters.
For example:
print data2[1] #prints
THE BRIDES ORNAMENTS, Viz. Fiue MEDITATIONS, Morall and Diuine. #results
but now
print data2[1][0] #should print THE
T #prints T
This is my code right now:
import json

json_data = open(path)
data = json.load(json_data)
data2 = []
for x in range(0, len(data)):
    data2.append(data[x]['section'])
    if len(data[x]['content']) > 0:
        for i in range(0, len(data[x]['content'])):
            data2.append(data[x]['content'][i])
I probably need to look at your json file to be absolutely sure, but it seems to me that the data2 list is a list of strings. Thus, data2[1] is a string. When you do data2[1][0], the expected result is what you are getting - the character at the 0th index in the string.
>>> data2[1]
'THE BRIDES ORNAMENTS, Viz. Fiue MEDITATIONS, Morall and Diuine.'
>>> data2[1][0]
'T'
To get the first word, naively, you can split the string by spaces
>>> data2[1].split()
['THE', 'BRIDES', 'ORNAMENTS,', 'Viz.', 'Fiue', 'MEDITATIONS,', 'Morall', 'and', 'Diuine.']
>>> data2[1].split()[0]
'THE'
However, this will cause issues with punctuation, so you probably need to tokenize the text. This link should help - http://www.nltk.org/_modules/nltk/tokenize.html
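For instance, a minimal sketch using NLTK's word_tokenize (this assumes the nltk package and its punkt tokenizer data are installed; the exact splits may vary):
from nltk.tokenize import word_tokenize

# punctuation becomes separate tokens instead of sticking to the words,
# e.g. 'ORNAMENTS' and ',' come out as two tokens
tokens = word_tokenize(data2[1])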

Convert bitstring to tuple

I'm trying to find out how to convert an Erlang bitstring to a tuple, but so far without any luck.
What I want is to get from for example <<"{1,2}">> the tuple {1,2}.
You can use the modules erl_scan and erl_parse, as in this answer. Since erl_scan:string requires a string, not a binary, you have to convert the value with binary_to_list first:
> {ok, Scanned, _} = erl_scan:string(binary_to_list(<<"{1,2}">>)).
{ok,[{'{',1},{integer,1,1},{',',1},{integer,1,2},{'}',1}],1}
Then, you'd use erl_parse:parse_term to get the actual term. However, this function expects the term to end with a dot, so we have to add it explicitly:
> {ok, Parsed} = erl_parse:parse_term(Scanned ++ [{dot,0}]).
{ok,{1,2}}
Now the variable Parsed contains the result:
> Parsed.
{1,2}
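If you need this in more than one place, you could wrap both steps in a small helper (a sketch; bin_to_term is a made-up name, and it assumes the binary contains a well-formed term without a trailing dot):
bin_to_term(Bin) ->
    {ok, Tokens, _} = erl_scan:string(binary_to_list(Bin)),
    {ok, Term} = erl_parse:parse_term(Tokens ++ [{dot, 0}]),
    Term.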
You can use binary functions and erlang:list_to_tuple/1
1> B = <<"{1,2}">>.
<<"{1,2}">>
2> list_to_tuple([list_to_integer(binary_to_list(X)) || X <- binary:split(binary:part(B, 1, byte_size(B)-2), <<",">>, [global])]).
{1,2}
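Note that this one-liner assumes the binary has exactly the shape <<"{N,M}">>: a flat tuple of integers with no spaces or nesting. For arbitrary terms, the erl_scan/erl_parse approach above is more robust.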