Why does Octave's huffmandeco throw an error about a too-large index type?

I have a small MATLAB script that I am trying to understand. It doesn't do much: it reads text from a file, then encodes and decodes it with the Huffman functions.
But it throws an error while decoding:
"error: out of memory or dimension too large for Octave's index type
error: called from huffmandeco>dict2tree at line 95 column 19"
I don't know why; I stepped through it in the debugger and couldn't see where an overly large index would come from.
Below is the part that calculates p from the input text.
% text is a random input text file in ASCII
% calculate the relative frequency of every symbol
for i=0:127
    nlet=length(find(text==i));
    p(i+1)=nlet/length(text);
end
symb = 0:127;
dict = huffmandict(symb,p); % Create dictionary
compdata = huffmanenco(fdata,dict); % Encode the data
dsig = huffmandeco(compdata,dict); % Decode the Huffman code
I can only use Octave, not MATLAB, so I don't know whether this is an unexpected error. I am using Octave 6.2.0 on Windows 10. I also tried the build for large data; it didn't change anything.
Does anyone know what causes the error in this context?
EDIT:
I debugged the code again. In the function huffmandeco I found the following function:
function tree = dict2tree (dict)
  L = length (dict);
  lengths = zeros (1, L);
  ## the depth of the tree is limited by the maximum word length.
  for i = 1:L
    lengths(i) = length (dict{i});
  endfor
  m = max (lengths);
  tree = zeros (1, 2^(m+1)-1)-1;
  for i = 1:L
    pointer = 1;
    word = dict{i};
    for bit = word
      pointer = 2 * pointer + bit;
    endfor
    tree(pointer) = i;
  endfor
endfunction
The maximum length m in this case is 82, so the function calculates:
tree = zeros (1, 2^(82+1)-1)-1.
That explains the error: an array with 2^83-1 elements is far too large for Octave's index type.
But there must be a solution or some other error, because the code has been tested before.
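To put a number on that allocation, here is a minimal Python sketch (not part of the Octave communications package) of the heap-style layout that dict2tree uses: each bit of a codeword doubles the pointer, so a maximum codeword length m needs 2^(m+1)-1 array slots. The names tree_slots, codeword_position and MAX_INDEX_64 are hypothetical, and the 64-bit signed index limit is an assumption about the Octave build.

```python
# Illustrative sketch only; the function names are hypothetical.

def tree_slots(max_code_len):
    """Number of array slots dict2tree allocates for a maximum codeword length."""
    return 2 ** (max_code_len + 1) - 1


def codeword_position(word):
    """Heap-style array position of a codeword, mirroring dict2tree's inner loop."""
    pointer = 1
    for bit in word:
        pointer = 2 * pointer + bit
    return pointer


# Assumption: a 64-bit signed index type, whose largest value is 2**63 - 1.
MAX_INDEX_64 = 2 ** 63 - 1

if __name__ == "__main__":
    print(tree_slots(82))                  # slots requested when m = 82
    print(tree_slots(82) > MAX_INDEX_64)   # True: beyond the assumed index limit
    print(codeword_position([1, 0, 1]))    # 13
```

Under these assumptions, any dictionary whose longest codeword exceeds roughly 60 bits cannot be laid out this way, whatever the amount of installed memory.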

I haven't weeded through the code enough to know why yet, but huffmandict is not ignoring zero-probability symbols the way it claims to. Nor have I been able to find a bug report on Savannah, but again I haven't searched thoroughly.
A workaround is to limit the symbol list and their probabilities to only the symbols that actually occur. Using containers.Map would be ideal, but in Octave you can do that with a couple of the outputs from unique:
% Create a symbol table of the unique characters in the input string
% and the indices into the table for each character in the string.
[symbols, ~, inds] = unique(textstr);
inds = inds.'; % just make it easier to read
For the string
textstr = 'Random String Input.';
the result is:
>> symbols
symbols =  .IRSadgimnoprtu
>> inds
inds =
Columns 1 through 19:
4 6 11 7 12 10 1 5 15 14 9 11 8 1 3 11 13 16 15
Column 20:
2
So the first symbol in the input string is symbols(4), the second is symbols(6), and so on.
From there, you just use symbols and inds to create the dictionary and encode/decode the signal. Here's a quick demo script:
textstr = 'Random String Input.';
fprintf("Starting string: %s\n", textstr);
% Create a symbol table of the unique characters in the input string
% and the indices into the table for each character in the string.
[symbols, ~, inds] = unique(textstr);
inds = inds.'; % just make it easier to read
% Calculate the frequency of each symbol in table
% max(inds) == numel(symbols)
p = histc(inds, 1:max(inds))/numel(inds);
dict = huffmandict(symbols, p);
compdata = huffmanenco(inds, dict);
dsig = huffmandeco(compdata, dict);
fprintf("Decoded string: %s\n", symbols(dsig));
And the output:
Starting string: Random String Input.
Decoded string: Random String Input.
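For readers who want to check the symbol table and probabilities outside Octave, the following Python sketch mirrors the unique/histc steps of the demo script. It is only an illustration: symbol_table is a hypothetical helper, and it performs no Huffman coding itself. The point it shows is that every probability is strictly positive once the table is restricted to symbols that actually occur.

```python
from collections import Counter


def symbol_table(text):
    """Return (symbols, inds, p) like unique() plus histc() in the demo script.

    symbols: sorted unique characters of text
    inds:    1-based index into symbols for each character of text
    p:       relative frequency of each symbol, in symbols order
    """
    counts = Counter(text)
    symbols = sorted(counts)
    position = {ch: i + 1 for i, ch in enumerate(symbols)}  # 1-based, as in Octave
    inds = [position[ch] for ch in text]
    p = [counts[ch] / len(text) for ch in symbols]
    return "".join(symbols), inds, p


if __name__ == "__main__":
    symbols, inds, p = symbol_table("Random String Input.")
    print(repr(symbols))
    print(inds)
    print(min(p) > 0)  # True: no zero-probability symbols remain
```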
To encode strings other than the original input string, you would have to map the characters to symbol indices (ensuring that all symbols in the string are actually present in the symbol table, obviously):
>> [~, s_idx] = ismember('trogdor', symbols)
s_idx =
15 14 12 8 7 12 14
>> compdata = huffmanenco(s_idx, dict);
>> dsig = huffmandeco(compdata, dict);
>> fprintf("Decoded string: %s\n", symbols(dsig));
Decoded string: trogdor
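The caveat about symbols missing from the table can be made explicit. This Python sketch does the same lookup as the ismember call above, but raises an error instead of silently returning a zero index; symbol_indices is a hypothetical name used only for illustration.

```python
def symbol_indices(text, symbols):
    """Map each character of text to its 1-based position in symbols.

    Raises ValueError if text contains a character that is not in symbols,
    because such a character has no codeword in the dictionary.
    """
    position = {ch: i + 1 for i, ch in enumerate(symbols)}
    missing = sorted(set(text) - set(position))
    if missing:
        raise ValueError("symbols not in table: " + repr(missing))
    return [position[ch] for ch in text]


if __name__ == "__main__":
    print(symbol_indices("trogdor", " .IRSadgimnoprtu"))
    # [15, 14, 12, 8, 7, 12, 14]
```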

Related

Dimension problem when converting a MATLAB .m script into an Octave compatible syntax

I want to run a MATLAB script M-file to reconstruct a point cloud in Octave. Therefore I had to rewrite some parts of the code to make it compatible with Octave. Actually the M-file works fine in Octave (I don't get any errors) and also the plotted point cloud looks good at first glance, but it seems that the variables are only half the size of the original MATLAB variables. In the attached screenshots you can see what I mean.
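Octave: [screenshot of the Octave workspace, not reproduced]
MATLAB: [screenshot of the MATLAB workspace, not reproduced]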
You can see that the dimension of e.g. M in Octave is 1311114x3 but in MATLAB it is 2622227x3. The actual number of rows in my raw file is 2622227 as well.
Here you can see an extract of the raw file (original data) that I use.
Rotation angle Measured distance
-0,090 26,295
-0,342 26,294
-0,594 26,294
-0,846 26,295
-1,098 26,294
-1,368 26,296
-1,620 26,296
-1,872 26,296
In MATLAB I created my output variable as follows.
data = table;
data.Rotationangle = cell2mat(raw(:, 1));
data.Measureddistance = cell2mat(raw(:, 2));
As there is no table function in Octave I wrote
data = cellfun(@(x)str2num(x), strrep(raw, ',', '.'))
instead.
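As a cross-check of what that conversion should produce, here is a small Python sketch of the same idea: replace the decimal comma with a point, then convert each cell to a number. The helper name parse_decimal_comma and the two sample rows (taken from the extract above) are for illustration only; it assumes the cells contain no thousands separators.

```python
def parse_decimal_comma(cell):
    """Convert text such as '-0,090' to a float, treating ',' as the decimal mark.

    Assumption: the text contains no thousands separators.
    """
    return float(cell.strip().replace(",", "."))


rows = [("-0,090", "26,295"), ("-0,342", "26,294")]
data = [[parse_decimal_comma(cell) for cell in row] for row in rows]

if __name__ == "__main__":
    print(data)       # one numeric row per input row
    print(len(data))  # 2: the row count is unchanged by the conversion
```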
Octave also has no struct2array function, so I had to replace it as well.
In MATLAB I wrote:
data = table2array(data);
In Octave this was a bit more difficult to do. I had to create a struct2array function, which I did by means of this bug report.
%% Create a struct2array function
function retval = struct2array (input_struct)
  %input check
  if (~isstruct (input_struct) || (nargin ~= 1))
    print_usage;
  endif
  %convert to cell array and flatten/concatenate output.
  retval = [ (struct2cell (input_struct)){:}];
endfunction
clear b;
b.a = data;
data = struct2array(b);
Did I make a mistake somewhere and could someone help me to solve this problem?
edit:
Here's the part of my script where I'm using raw.
delimiter = '\t';
startRow = 5;
formatSpec = '%s%s%[^\n\r]';
fileID = fopen(filename,'r');
dataArray = textscan(fileID, formatSpec, 'Delimiter', delimiter, 'HeaderLines' ,startRow-1, 'ReturnOnError', false, 'EndOfLine', '\r\n');
fclose(fileID);
%% Convert the contents of columns containing numeric text to numbers.
% Replace non-numeric text with NaN.
raw = repmat({''},length(dataArray{1}),length(dataArray)-1);
for col=1:length(dataArray)-1
    raw(1:length(dataArray{col}),col) = mat2cell(dataArray{col}, ones(length(dataArray{col}), 1));
end
numericData = NaN(size(dataArray{1},1),size(dataArray,2));
for col=[1,2]
    % Converts text in the input cell array to numbers. Replaced non-numeric
    % text with NaN.
    rawData = dataArray{col};
    for row=1:size(rawData, 1)
        % Create a regular expression to detect and remove non-numeric prefixes and
        % suffixes.
        regexstr = '(?<prefix>.*?)(?<numbers>([-]*(\d+[\.]*)+[\,]{0,1}\d*[eEdD]{0,1}[-+]*\d*[i]{0,1})|([-]*(\d+[\.]*)*[\,]{1,1}\d+[eEdD]{0,1}[-+]*\d*[i]{0,1}))(?<suffix>.*)';
        try
            result = regexp(rawData(row), regexstr, 'names');
            numbers = result.numbers;
            % Detected commas in non-thousand locations.
            invalidThousandsSeparator = false;
            if numbers.contains('.')
                thousandsRegExp = '^\d+?(\.\d{3})*\,{0,1}\d*$';
                if isempty(regexp(numbers, thousandsRegExp, 'once'))
                    numbers = NaN;
                    invalidThousandsSeparator = true;
                end
            end
            % Convert numeric text to numbers.
            if ~invalidThousandsSeparator
                numbers = strrep(numbers, '.', '');
                numbers = strrep(numbers, ',', '.');
                numbers = textscan(char(numbers), '%f');
                numericData(row, col) = numbers{1};
                raw{row, col} = numbers{1};
            end
        catch
            raw{row, col} = rawData{row};
        end
    end
end
You don't see any raw in my workspaces because I clear all temporary variables before I reconstruct my point cloud.
Also my original data in row 1311114 and 1311115 look normal.
edit 2:
As suggested here is a small example table to clarify what I want and what MATLAB does with the table2array function in my case.
data =
-0.0900 26.2950
-0.3420 26.2940
-0.5940 26.2940
-0.8460 26.2950
-1.0980 26.2940
-1.3680 26.2960
-1.6200 26.2960
-1.8720 26.2960
With the struct2array function I used in Octave I get the following array.
data =
-0.090000 26.295000
-0.594000 26.294000
-1.098000 26.294000
-1.620000 26.296000
-2.124000 26.295000
-2.646000 26.293000
-3.150000 26.294000
-3.654000 26.294000
If you compare the Octave array with my original data, you can see that every second row is skipped. This seems to be the reason for 1311114 instead of 2622227 rows.
edit 3:
I tried to solve my problem with the suggestions from @Tasos Papastylianou, but unfortunately without success.
First I did the variant with a struct.
data = struct();
data.Rotationangle = [raw(:,1)];
data.Measureddistance = [raw(:,2)];
data = cell2mat( struct2cell (data ).' )
But this leads to the following structure in my script. (Unfortunately the result is not what I would like to have as shown in edit 2. Don't be surprised, I only used a small part of my raw file to accelerate the run of my script, so here are only 769 lines.)
[766,1] = -357,966
[767,1] = -358,506
[768,1] = -359,010
[769,1] = -359,514
[1,2] = 26,295
[2,2] = 26,294
[3,2] = 26,294
[4,2] = 26,296
Furthermore I get the following error.
error: unary operator '-' not implemented for 'cell' operands
error: called from
Cloud_reconstruction at line 137 column 11
Also the approach with the dataframe octave package didn't work. When I run the following code it leads to the error you can see below.
dataframe2array = @(df) cell2mat( struct(df).x_data );
pkg load dataframe;
data = dataframe();
data.Rotationangle = [raw(:, 1)];
data.Measureddistance = [raw(:, 2)];
dataframe2array(data)
error:
warning: Trying to overwrite colum names
warning: called from
df_matassign at line 147 column 13
subsasgn at line 172 column 14
Cloud_reconstruction at line 106 column 20
warning: Trying to overwrite colum names
warning: called from
df_matassign at line 176 column 13
subsasgn at line 172 column 14
Cloud_reconstruction at line 106 column 20
warning: Trying to overwrite colum names
warning: called from
df_matassign at line 147 column 13
subsasgn at line 172 column 14
Cloud_reconstruction at line 107 column 23
warning: Trying to overwrite colum names
warning: called from
df_matassign at line 176 column 13
subsasgn at line 172 column 14
Cloud_reconstruction at line 107 column 23
error: RHS(_,2): but RHS has size 768x1
error: called from
df_matassign at line 179 column 11
subsasgn at line 172 column 14
Cloud_reconstruction at line 107 column 23
Both error messages refer to the following part of my script where I'm doing the reconstruction of the point cloud in cylindrical coordinates.
distLaserCenter = 47; % Distance between the pipe centerline and the blind zone in mm
m = size(data,1); % Find the length of the first dimension of data
zincr = 0.4/360; % z increment in mm per deg
data(:,1) = -data(:,1);
for i = 1:m
    data(i,2) = data(i,2) + distLaserCenter;
    if i == 1
        data(i,3) = 0;
    elseif abs(data(i,1)-data(i-1)) < 100
        data(i,3) = data(i-1,3) + zincr*(data(i,1)-data(i-1));
    else abs(data(i,1)-data(i-1)) > 100;
        data(i,3) = data(i-1,3) + zincr*(data(i,1)-(data(i-1)-360));
    end
end
To give some background information for a better understanding. The script is used to reconstruct a pipe as a point cloud. The surface of the pipe was scanned from inside with a laser and the laser measured several points (distance from laser to the inner wall of the pipe) at each deg of rotation. I hope this helps to understand what I want to do with my script.
Not sure exactly what you're trying to do, but here's a toy example of how a struct could be used in an equivalent manner to a table:
matlab:
data = table;
data.A = [1;2;3;4;5];
data.B = [10;20;30;40;50];
table2array(data)
octave:
data = struct();
data.A = [1;2;3;4;5];
data.B = [10;20;30;40;50];
cell2mat( struct2cell (data ).' )
Note the transposition operation (.') before passing the result to cell2mat, since in a table, the 'fieldnames' are arranged horizontally in columns, whereas the struct2cell ends up arranging what used to be the 'fieldnames' as rows.
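The orientation point can be illustrated in a language-neutral way. In this Python sketch (an analogy, not Octave code), a dict of equal-length columns plays the role of the struct, and zip places the columns side by side so that each output row holds one value per field, which is what table2array returns. columns_to_rows is a hypothetical helper name.

```python
def columns_to_rows(columns):
    """Turn a mapping of equal-length columns into a list of rows.

    The field order follows the insertion order of the mapping, like the
    left-to-right column order of a table.
    """
    lengths = {len(col) for col in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    return [list(row) for row in zip(*columns.values())]


if __name__ == "__main__":
    data = {"A": [1, 2, 3, 4, 5], "B": [10, 20, 30, 40, 50]}
    for row in columns_to_rows(data):
        print(row)
```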
You might also be interested in the dataframe octave package, which performs similar functions to matlab's table (or in fact, R's dataframe object): https://octave.sourceforge.io/dataframe/ (you can install this by typing pkg install -forge dataframe in your console)
Unfortunately, the way to display the data as an array is still not ideal (see: https://stackoverflow.com/a/55417141/4183191), but you can easily convert that into a tiny function, e.g.
dataframe2array = @(df) cell2mat( struct(df).x_data );
Your code can then become:
pkg load dataframe;
data = dataframe();
data.A = [1;2;3;4;5];
data.B = [10;20;30;40;50];
dataframe2array(data)

How does the 'k' modifier in FINDC() work in SAS?

I'm reading through the book, "SAS Functions by Example - Second Edition" and having trouble trying to understand a certain function due to the example and output they get.
Function: FINDC
Purpose: To locate a character that appears or does not appear within a string. With optional arguments, you can define the starting point for the search, set the direction of the search, ignore case or trailing blanks, or look for characters except the ones listed.
Syntax: FINDC(character-value, find-characters <,'modifiers'> <,start>)
Two of the modifiers are i and k:
i ignore case
k count only characters that are not in the list of find-characters
So now one of the examples has this:
Note: STRING1 = "Apples and Books"
FINDC(STRING1,"aple",'ki')
For the output, the book says it returns 1 because of the position of "A" in Apples. This is what confuses me, because I thought the k modifier says to find characters that are not in the find-characters list. So why does it match "A" when that letter, with case ignored, is in the find-characters list? To me, this example should output 6, for the "s" in Apples.
Can anyone explain the k modifier more clearly, and why the output for this example is 1 instead of 6?
Edit 1
Reading the SAS documentation online, I found this example which seems to contradict the book I'm reading:
Example 3: Searching for Characters and Using the K Modifier
This example searches a character string and returns the characters that do
not appear in the character list.
data _null_;
   string = 'Hi, ho!';
   charlist = 'hi';
   j = 0;
   do until (j = 0);
      j = findc(string, charlist, "k", j+1);
      if j = 0 then put +3 "That's all";
      else do;
         c = substr(string, j, 1);
         put +3 j= c=;
      end;
   end;
run;
SAS writes the following output to the log:
j=1 c=H
j=3 c=,
j=4 c=
j=6 c=o
j=7 c=!
That's all
So, is the book wrong?
The book is wrong.
data _null_;
   STRING1 = "Apples and Books" ;
   x=FINDC(STRING1,"aple",'ki');
   put x=;
   if x then do;
      ch=char(string1,x);
      put ch=;
   end;
run;
x=6
ch=s
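To make the semantics concrete, here is a rough Python emulation of the two modifiers discussed in this thread. It is a sketch under stated assumptions, not SAS's implementation: the function findc below is a hypothetical stand-in that supports only the i and k modifiers and a positive start position, and it ignores SAS's handling of trailing blanks.

```python
def findc(string, charlist, modifiers="", start=1):
    """Return the 1-based position of the first matching character, or 0.

    Without 'k', a character matches if it is in charlist.
    With 'k', a character matches if it is NOT in charlist.
    With 'i', the comparison ignores case.
    """
    mods = modifiers.lower()
    ignore_case = "i" in mods
    complement = "k" in mods
    chars = set(charlist.lower() if ignore_case else charlist)
    for pos in range(start, len(string) + 1):
        ch = string[pos - 1]
        if ignore_case:
            ch = ch.lower()
        if (ch in chars) != complement:
            return pos
    return 0


if __name__ == "__main__":
    print(findc("Apples and Books", "aple", "ki"))  # 6, the "s" in Apples
    print(findc("Apples and Books", "aple", "i"))   # 1, the "A" in Apples
```

With 'ki' the first five letters of "Apples" are all in the list once case is ignored, so the first position returned is 6, which agrees with the log above and with the documentation example.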

Matlab structure to MySql database

I have a structure in MATLAB. The structure contains 2.1 million rows with a mixture of doubles, integers and chars. The structure is called TaqQ:
TaqQ.time 2100000x1 uint32,
TaqQ.bid 2100000x1 double,
TaqQ.ex 2100000x4 char,
How can I transfer that structure to MySQL quickly?
Maybe by saving the structure to a CSV file and then importing it into MySQL. I tried that:
csvwrite('test.csv',[TaqQ.time TaqQ.bid TaqQ.ex]) %this is very slow
csvwrite('test.csv',[TaqQ.time' ; TaqQ.bid'; TaqQ.ex']) % fast but don't know how to deal with it in MySql!?
I also tried using fastinsert, but it was way too slow.
I also tried:
connHandle = conn.Handle;
stmt = connHandle.createStatement;
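for i = 1:2100000
    % Build each statement from the actual values; ex holds a 4-character code per row
    stmt.addBatch(sprintf('INSERT INTO Quotes (time,bid,ex) VALUES (%u,%.15g,''%s'')', ...
        TaqQ.time(i), TaqQ.bid(i), TaqQ.ex(i,:)));
end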
stmt.executeBatch;
stmt.close;
But it was also way too slow.
Can somebody help me?
This will take a lot of RAM, but it will do things in memory, minimize the number of function calls, and will write everything at once, so it should be fast:
% Test struct
TaqQ = struct( ...
'time', [1;2;3;4;5], ...
'bid', [110.1;120.1;130.1;140.1;150.1], ...
'ex', ['ABCD';'EFGH';'IJKL';'MNOP';'QRST'] ...
);
% Writing the file
NL = 10; % ASCII new line
DQ = double('"'); % ASCII double quote
f = fopen('test.csv', 'wb');
fwrite(f, [ ...
reshape(sprintf('% 10u,', TaqQ.time), 11, []); ... % 10 digits
reshape(sprintf('% 10.3f,', TaqQ.bid), 11, []); ... % 10 digits + point + sign
char(DQ*(ones(1,size(TaqQ.ex,1)))); ... % open quotes
TaqQ.ex.'; ... % already char
char(DQ*(ones(1,size(TaqQ.ex,1)))); ... % end quotes
char(NL*(ones(1,size(TaqQ.ex,1)))) ... % newlines
]);
fclose(f);
Actually I'm curious whether it's fast; I could not test it on huge data. Please let me know.
If you want to squeeze out all the spaces, then this is a (slower) variant:
% Preparing content
NL = 10; % ASCII new line
DQ = double('"'); % ASCII double quote
bytes = [ ...
reshape(sprintf('% 10u,', TaqQ.time), 11, []); ... % 10 digits
reshape(sprintf('% 10.3f,', TaqQ.bid), 11, []); ... % 10 digits + point + sign
char(DQ*(ones(1,size(TaqQ.ex,1)))); ... % open quotes
TaqQ.ex.'; ... % already char
char(DQ*(ones(1,size(TaqQ.ex,1)))); ... % end quotes
char(NL*(ones(1,size(TaqQ.ex,1)))) ... % newlines
];
% Writing content without spaces
f = fopen('test.csv', 'wb');
fwrite(f,bytes(bytes~=' '));
fclose(f);
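For comparison, the target file layout (one record per line, the char field in double quotes) can be sketched in Python with the standard csv module. This is only an illustration of the row-wise format, not a benchmark or a replacement for the MATLAB code above; quotes_to_csv is a hypothetical helper and the sample values come from the test struct.

```python
import csv
import io


def quotes_to_csv(time, bid, ex):
    """Return CSV text with one 'time,bid,"ex"' record per line."""
    if not (len(time) == len(bid) == len(ex)):
        raise ValueError("all columns must have the same number of rows")
    buffer = io.StringIO()
    # QUOTE_NONNUMERIC leaves numbers bare and puts strings in double quotes.
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerows(zip(time, bid, ex))
    return buffer.getvalue()


if __name__ == "__main__":
    print(quotes_to_csv([1, 2], [110.1, 120.1], ["ABCD", "EFGH"]), end="")
```

A file in this layout is the kind of input MySQL's LOAD DATA INFILE statement is designed to read, with FIELDS TERMINATED BY ',' and OPTIONALLY ENCLOSED BY '"'; whether it beats batched inserts for 2.1 million rows would have to be measured.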

LZW Compression In Lua

Here is the Pseudocode for Lempel-Ziv-Welch Compression.
pattern = get input character
while ( not end-of-file ) {
    K = get input character
    if ( <<pattern, K>> is NOT in the string table ) {
        output the code for pattern
        add <<pattern, K>> to the string table
        pattern = K
    }
    else { pattern = <<pattern, K>> }
}
output the code for pattern
output EOF_CODE
I am trying to code this in Lua, but it is not really working. Here is the code I modeled after an LZW function in Python, but I am getting an "attempt to call a string value" error on line 8.
function compress(uncompressed)
    local dict_size = 256
    local dictionary = {}
    w = ""
    result = {}
    for c in uncompressed do
        -- while c is in the function compress
        local wc = w + c
        if dictionary[wc] == true then
            w = wc
        else
            dictionary[w] = ""
            -- Add wc to the dictionary.
            dictionary[wc] = dict_size
            dict_size = dict_size + 1
            w = c
        end
        -- Output the code for w.
        if w then
            dictionary[w] = ""
        end
    end
    return dictionary
end
compressed = compress('TOBEORNOTTOBEORTOBEORNOT')
print (compressed)
I would really like some help either getting my code to run, or helping me code the LZW compression in Lua. Thank you so much!
Assuming uncompressed is a string, you'll need to use something like this to iterate over it:
for i = 1, #uncompressed do
    local c = string.sub(uncompressed, i, i)
    -- etc
end
There's another issue on line 10; .. is used for string concatenation in Lua, so this line should be local wc = w .. c.
You may also want to read this with regard to the performance of string concatenation. Long story short, it's often more efficient to keep each element in a table and return it with table.concat().
You should also take a look here to download the source for a high-performance LZW compression algorithm in Lua...
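For reference while porting, here is a compact Python sketch of the compression loop from the pseudocode in the question. It assumes single-byte input characters (codes 0 to 255 pre-loaded into the string table) and, unlike the pseudocode, it does not emit an EOF_CODE; lzw_compress is a hypothetical name. It is not the Python function the question was modeled on.

```python
def lzw_compress(uncompressed):
    """Return the list of LZW output codes for a string of single-byte characters."""
    # The string table starts with every single character mapped to its code.
    dictionary = {chr(i): i for i in range(256)}
    next_code = 256
    w = ""
    result = []
    for c in uncompressed:
        wc = w + c
        if wc in dictionary:
            w = wc  # keep extending the current pattern
        else:
            result.append(dictionary[w])   # output the code for the pattern
            dictionary[wc] = next_code     # add <<pattern, K>> to the table
            next_code += 1
            w = c
    if w:
        result.append(dictionary[w])       # output the code for the last pattern
    return result


if __name__ == "__main__":
    print(lzw_compress("TOBEORNOTTOBEORTOBEORNOT"))
```

Two differences from the Lua attempt are worth noting: codes are appended to a separate result list rather than written into the dictionary, and the dictionary test is a membership check rather than a comparison with true.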

Retrieve blob field from mySQL database with MATLAB

I'm accessing a public MySQL database using JDBC and the MySQL Java connector. exonCount is int(10); exonStarts and exonEnds are longblob fields.
javaaddpath('mysql-connector-java-5.1.12-bin.jar')
host = 'genome-mysql.cse.ucsc.edu';
user = 'genome';
password = '';
dbName = 'hg18';
jdbcString = sprintf('jdbc:mysql://%s/%s', host, dbName);
jdbcDriver = 'com.mysql.jdbc.Driver';
dbConn = database(dbName, user , password, jdbcDriver, jdbcString);
gene.Symb = 'CDKN2B';
% Check to make sure that we successfully connected
if isconnection(dbConn)
    qry = sprintf('SELECT exonCount, exonStarts, exonEnds FROM refFlat WHERE geneName=''%s''',gene.Symb);
    result = get(fetch(exec(dbConn, qry)), 'Data');
else
    fprintf('Connection failed: %s\n', dbConn.Message);
end
Here is the result:
result =
[2] [18x1 int8] [18x1 int8]
[2] [18x1 int8] [18x1 int8]
result{1,2}'
ans =
50 49 57 57 50 57 48 49 44 50 49 57 57 56 54 55 51 44
This is wrong. The length of the vectors in the 2nd and 3rd columns should match the numbers in the 1st column.
The 1st blob, for example, should be [21992901; 21998673]. How can I convert it?
Update:
Just after submitting this question I thought the values might be the character codes of a string.
And that was confirmed:
>> char(result{1,2}')
ans =
21992901,21998673,
So now I need to convert the character data in all the blobs into numeric vectors. I'm still looking for a vectorized way to do it, since the number of rows can be large.
This will convert your character data to numeric vectors for all except the first column of data in result, placing the results back into the appropriate cells:
result(:,2:end) = cellfun(@(x) str2num(char(x'))',... %# Apply fcn to each cell
result(:,2:end),... %# Input cells
'UniformOutput',false); %# Output as a cell array
I suggest using textscan:
exons = cellfun(@(x) textscan(char(x'),'%d','Delimiter',','),...
result(:,2:end),'UniformOutput',false);
To get a cell array for each of the two numbers, you can replace the format string by %d,%d and drop the Delimiter option.
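The decoding step itself is simple enough to check by hand. This Python sketch applies it to the 18 byte values shown in the question: interpret them as ASCII characters, split on commas, and convert each non-empty token to an integer. blob_to_ints is a hypothetical helper; it assumes the blob contains only ASCII digits and commas.

```python
def blob_to_ints(blob):
    """Decode a sequence of ASCII byte values such as '21992901,21998673,'."""
    text = bytes(blob).decode("ascii")
    # The trailing comma produces an empty last token, which is skipped.
    return [int(token) for token in text.split(",") if token]


if __name__ == "__main__":
    blob = [50, 49, 57, 57, 50, 57, 48, 49, 44,
            50, 49, 57, 57, 56, 54, 55, 51, 44]
    print(blob_to_ints(blob))  # [21992901, 21998673]
```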
Here is what I do:
function res = blob2num(x)
    res = str2double(regexp(char(x'),'[^,]+','match')');
then
exons = cellfun(@blob2num,result(:,2:3)','UniformOutput',0)
exons =
[2x1 double] [2x1 double]
[2x1 double] [2x1 double]
Is there a better solution? Maybe at the data-retrieval step?