Reading ascii file to binary - binary

I've created a file, file.txt, containing the string "7F". I read it using Apache's library:
byte[] byteArray = IOUtils.toByteArray(new Base64InputStream(new java.io.FileInputStream(fileName)));
And this is the array I get:
[-20]
which equates to 1110 1100, when I was expecting 1111 1111.
I guess my question is: how do I encode a string in ASCII so that it produces the byte 1111 1111?

1111 1111 binary = 255 decimal. Standard ASCII only covers 0 to 127; according to this extended ASCII (ISO-8859-1) chart, 255 would be the ÿ character.

You will have to use the extended ASCII character 'ÿ'.
The following code should get you what you want:
Character s= 'ÿ';
System.out.println(Integer.toBinaryString(s));
You can use an online utility like:
https://www.branah.com/ascii-converter
to help you out.
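For comparison, here is a minimal Python sketch (not the Java code from the question) of what seems to be happening: Base64 decoding treats "7F" as two 6-bit symbols rather than as a hexadecimal byte. The explicit "==" padding is an adjustment needed by Python's decoder, and the helper name bits is made up for this illustration.

```python
import base64

def bits(data):
    """Format each byte as an 8-digit binary string."""
    return " ".join(f"{b:08b}" for b in data)

# Base64 reads "7F" as the 6-bit values 59 and 5, which yields one byte.
# Python's decoder wants explicit padding, hence the added "==".
b64_bytes = base64.b64decode("7F==")

# Interpreting the text as hexadecimal digits instead.
hex_7f = bytes.fromhex("7F")
hex_ff = bytes.fromhex("FF")

# The character U+00FF (ÿ) encoded as ISO-8859-1 is the single byte 0xFF.
latin1_ff = "\u00ff".encode("latin-1")

print(bits(b64_bytes))                                # 11101100
print(int.from_bytes(b64_bytes, "big", signed=True))  # -20, as a signed byte
print(bits(hex_7f))                                   # 01111111
print(bits(hex_ff))                                   # 11111111
print(bits(latin1_ff))                                # 11111111
```

Note that even as hexadecimal, "7F" is 0111 1111; the text "FF" is what maps to 1111 1111.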

Related

Why does Octave's huffmandeco throw an error about large index types?

I have a small MATLAB script that I'm trying to understand. It doesn't do much: it reads text from a file, then encodes and decodes it with the Huffman functions.
But it throws an error while decoding:
"error: out of memory or dimension too large for Octave's index type
error: called from huffmandeco>dict2tree at line 95 column 19"
I don't know why, because I debugged it and don't see a large index type.
I added the part which calculates p from the input text.
%text is a random input text file in ASCII
%calculate the relative frequency of every Symbol
for i=0:127
nlet=length(find(text==i));
p(i+1)=nlet/length(text);
end
symb = 0:127;
dict = huffmandict(symb,p); % Create dictionary
compdata = huffmanenco(fdata,dict); % Encode the data
dsig = huffmandeco(compdata,dict); % Decode the Huffman code
I can only use Octave, not MATLAB, so I don't know whether this is an unexpected error. I'm using Octave 6.2.0 on Windows 10. I also tried the build for large data, but it didn't change anything.
Does anyone recognize this error in this context?
EDIT:
I debugged the code again. In the function huffmandeco I found the following function:
function tree = dict2tree (dict)
L = length (dict);
lengths = zeros (1, L);
## the depth of the tree is limited by the maximum word length.
for i = 1:L
lengths(i) = length (dict{i});
endfor
m = max (lengths);
tree = zeros (1, 2^(m+1)-1)-1;
for i = 1:L
pointer = 1;
word = dict{i};
for bit = word
pointer = 2 * pointer + bit;
endfor
tree(pointer) = i;
endfor
endfunction
The maximum length m in this case is 82. So the function calculates:
tree = zeros (1, 2^(82+1)-1)-1.
So it's obvious why the error complains about an index type that is too large.
But there must be a solution, or some other error, because the code has been tested before.
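The arithmetic behind that allocation can be checked with a few lines of Python. This is only a sketch: the signed 64-bit index limit and the 8 bytes per element are assumptions about the Octave build, not values taken from the error message.

```python
def tree_size(max_word_length):
    """Number of elements dict2tree allocates for a given maximum code length."""
    return 2 ** (max_word_length + 1) - 1

m = 82                          # longest code word reported above
needed = tree_size(m)

max_index_64 = 2 ** 63 - 1      # assumed limit of a signed 64-bit index type
bytes_needed = needed * 8       # assuming 8-byte double elements

print(needed)
print(needed > max_index_64)    # True: the array cannot even be indexed
print(bytes_needed)

# For comparison, a maximum code length of 7 needs only 255 elements.
print(tree_size(7))
```

The size doubles with every extra bit of code length, so any dictionary with very long code words runs into this limit.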
I haven't weeded through the code enough to know why yet, but huffmandict is not ignoring zero-probability symbols the way it claims to. Nor have I been able to find a bug report on Savannah, but again I haven't searched thoroughly.
A workaround is to limit the symbol list and their probabilities to only the symbols that actually occur. Using containers.Map would be ideal, but in Octave you can do that with a couple of the outputs from unique:
% Create a symbol table of the unique characters in the input string
% and the indices into the table for each character in the string.
[symbols, ~, inds] = unique(textstr);
inds = inds.'; % just make it easier to read
For the string
textstr = 'Random String Input.';
the result is:
>> symbols
symbols = .IRSadgimnoprtu
>> inds
inds =
Columns 1 through 19:
4 6 11 7 12 10 1 5 15 14 9 11 8 1 3 11 13 16 15
Column 20:
2
So the first symbol in the input string is symbols(4), the second is symbols(6), and so on.
From there, you just use symbols and inds to create the dictionary and encode/decode the signal. Here's a quick demo script:
textstr = 'Random String Input.';
fprintf("Starting string: %s\n", textstr);
% Create a symbol table of the unique characters in the input string
% and the indices into the table for each character in the string.
[symbols, ~, inds] = unique(textstr);
inds = inds.'; % just make it easier to read
% Calculate the frequency of each symbol in table
% max(inds) == numel(symbols)
p = histc(inds, 1:max(inds))/numel(inds);
dict = huffmandict(symbols, p);
compdata = huffmanenco(inds, dict);
dsig = huffmandeco(compdata, dict);
fprintf("Decoded string: %s\n", symbols(dsig));
And the output:
Starting string: Random String Input.
Decoded string: Random String Input.
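The same bookkeeping (unique symbols, per-character indices, and probabilities for only the symbols that occur) can be sketched in plain Python. The function name symbol_table is made up for this illustration, and the indices are 0-based here, unlike Octave's 1-based inds.

```python
from collections import Counter

def symbol_table(text):
    """Return (symbols, inds, p) for the characters that actually occur in text."""
    symbols = sorted(set(text))                      # like unique(textstr)
    index_of = {ch: i for i, ch in enumerate(symbols)}
    inds = [index_of[ch] for ch in text]             # 0-based, unlike Octave
    counts = Counter(inds)
    p = [counts[i] / len(inds) for i in range(len(symbols))]
    return symbols, inds, p

textstr = "Random String Input."
symbols, inds, p = symbol_table(textstr)

print("".join(symbols))
print([i + 1 for i in inds])                         # shifted to 1-based for comparison
print(min(p) > 0)                                    # no zero-probability symbols

# Mapping the indices back through the table recovers the input.
decoded = "".join(symbols[i] for i in inds)
print(decoded)
```

Because every entry in p is strictly positive, the dictionary never has to deal with symbols of probability zero.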
To encode strings other than the original input string, you would have to map the characters to symbol indices (ensuring that all symbols in the string are actually present in the symbol table, obviously):
>> [~, s_idx] = ismember('trogdor', symbols)
s_idx =
15 14 12 8 7 12 14
>> compdata = huffmanenco(s_idx, dict);
>> dsig = huffmandeco(compdata, dict);
>> fprintf("Decoded string: %s\n", symbols(dsig));
Decoded string: trogdor

How does the 'k' modifier in FINDC() work in SAS?

I'm reading through the book, "SAS Functions by Example - Second Edition" and having trouble trying to understand a certain function due to the example and output they get.
Function: FINDC
Purpose: To locate a character that appears or does not appear within a string. With optional arguments, you can define the starting point for the search, set the direction of the search, ignore case or trailing blanks, or look for characters except the ones listed.
Syntax: FINDC(character-value, find-characters <,'modifiers'> <,start>)
Two of the modifiers are i and k:
i ignore case
k count only characters that are not in the list of find-characters
So now one of the examples has this:
Note: STRING1 = "Apples and Books"
FINDC(STRING1,"aple",'ki')
For the output, they say it returns 1 because of the position of "A" in Apples. This is what confuses me, because I thought the k modifier means to find characters that are not in the find-characters list. So why does it match "A" when the letter "A", ignoring case, is in the find-characters list? To me, this example should output 6, for the "s" in Apples.
Is anyone able to help explain the k modifier to me any better, and why the output for this answer is 1 instead of 6?
Edit 1
Reading the SAS documentation online, I found this example which seems to contradict the book I'm reading:
Example 3: Searching for Characters and Using the K Modifier
This example searches a character string and returns the characters that do
not appear in the character list.
data _null_;
string = 'Hi, ho!';
charlist = 'hi';
j = 0;
do until (j = 0);
j = findc(string, charlist, "k", j+1);
if j = 0 then put +3 "That's all";
else do;
c = substr(string, j, 1);
put +3 j= c=;
end;
end;
run;
SAS writes the following output to the log:
j=1 c=H
j=3 c=,
j=4 c=
j=6 c=o
j=7 c=!
That's all
So, is the book wrong?
The book is wrong.
data _null_;
STRING1 = "Apples and Books" ;
x=FINDC(STRING1,"aple",'ki');
put x=;
if x then do;
ch=char(string1,x);
put ch=;
end;
run;
x=6
ch=s
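To make the rule concrete, here is a rough Python emulation of the two modifiers discussed here. It is a sketch only: the function findc below is hypothetical, handles just the i and k modifiers with a forward search from a positive start position, and does not reproduce the rest of SAS's FINDC behavior.

```python
def findc(string, chars, modifiers="", start=1):
    """Return the 1-based position of the first matching character, or 0.

    'i' ignores case; 'k' matches characters NOT in chars.
    Only a forward search from a positive start is emulated.
    """
    mods = modifiers.lower()
    ignore_case = "i" in mods
    complement = "k" in mods
    if ignore_case:
        chars = chars.lower()
    for pos in range(start, len(string) + 1):
        ch = string[pos - 1]
        if ignore_case:
            ch = ch.lower()
        # Without 'k', match members of chars; with 'k', match non-members.
        if (ch in chars) != complement:
            return pos
    return 0

print(findc("Apples and Books", "aple", "ki"))   # 6, the "s" in Apples

# The documentation example: every position whose character is not in 'hi'.
positions = []
j = findc("Hi, ho!", "hi", "k", 1)
while j:
    positions.append(j)
    j = findc("Hi, ho!", "hi", "k", j + 1)
print(positions)                                 # [1, 3, 4, 6, 7]
```

Both results agree with the SAS log output shown above, not with the value given in the book.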

Detecting encoding of string and converting it

I have a string:
string <- "{'text': u'Kandydaci PSL do Parlamentu Europejskiego \\u2013 OKR\\u0118G nr 1: Obejmuje obszar wojew\\xf3dztwa pomorskiego z siedzib\\u0105 ok... http://t.co/aZbjK7ME1O', 'created_at': u'Mon May 19 11:30:07 +0000 2014'}"
As you can see, I have some escape codes instead of letters. As far as I know, these are UTF-8 codes for Polish characters like ą, ć, ź, ó and so on. How can I convert this string to obtain the following output?
"{'text': u'Kandydaci PSL do Parlamentu Europejskiego \\u2013 OKRĘG nr 1: Obejmuje obszar województwa pomorskiego z siedzibą ok... http://t.co/aZbjK7ME1O', 'created_at': u'Mon May 19 11:30:07 +0000 2014'}"
Here's a regular expression to find all escaped characters of the form \udddd and \xdd, where each d is a hexadecimal digit. We then take those values and re-parse them to turn them into characters. Finally, we replace the original matched values with the true characters.
m <- gregexpr("\\\\u[0-9A-Fa-f]{4}|\\\\x[0-9A-Fa-f]{2}", string)
a <- enc2utf8(sapply(parse(text=paste0('"', regmatches(string,m)[[1]], '"')), eval))
regmatches(string,m)[[1]] <- a
This will do them all. If you only want to do a subset, you could filter the vector of possible replacements.
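The same find-and-replace idea can be sketched in Python, assuming the escapes are always exactly \u followed by four hex digits or \x followed by two. The names ESCAPE and decode_escapes are invented for this example.

```python
import re

# Matches a literal backslash-u with four hex digits, or backslash-x with two.
ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})|\\x([0-9A-Fa-f]{2})")

def decode_escapes(text):
    """Replace each escape sequence with the character it encodes."""
    def to_char(match):
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))
    return ESCAPE.sub(to_char, text)

# Raw string, so the backslashes are literal characters as in the R input.
sample = r"Europejskiego \u2013 OKR\u0118G wojew\xf3dztwa z siedzib\u0105"
print(decode_escapes(sample))
```

To convert only a subset, to_char could return match.group(0) unchanged for the code points that should stay escaped.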

How to store text containing escape sequences in MS Access

When I try to store text containing C code in an MS Access table (programmatically), it replaces escape sequences ('\n', '\t') with some question-mark symbol.
Example :
code to store :
#include<stdio.h>
int main()
{
printf("\n\n\t Hi there...");
return 0;
}
When I look at the MS Access table for the code inserted above, every newline and '\t' character has been replaced with a '?'-like symbol.
My question is: "Is there any other data type for an MS Access field which stores code as it is, without replacing escape sequences with some symbol?"
and
"Will a 'raw' data type, as found in other DBMSs like MySQL, do the job?"
This is how it shows in access-07 :
It looks like the line breaks in your source text are not the Windows-standard CRLF (carriage return, line feed). Find out the character codes of those mystery characters.
Using the procedure below, I can feed it a text string, and it will list the code of each character. Here is an example from the Immediate window.
AsciiValues "a" & vbcrlf & "b"
position Asc AscW
1 97 97
2 13 13
3 10 10
4 98 98
If I want to examine the value stored in a table text field, I can use DLookup to fetch that value and feed it to the function.
AsciiValues DLookup("memo_field", "tblFoo", "id=1")
position Asc AscW
1 108 108
2 105 105
3 110 110
4 101 101
5 32 32
Once you determine the codes of the problem characters, you can execute an UPDATE statement to replace the problem character codes with suitable alternatives.
UPDATE YourTable
SET YourField = Replace(YourField, Chr(x), Chr(y));
And this is the procedure ...
Public Sub AsciiValues(ByVal pInput As String)
Dim i As Long
Dim lngSize As Long
lngSize = Len(pInput)
Debug.Print "position", "Asc", "AscW"
For i = 1 To lngSize
Debug.Print i, Asc(Mid(pInput, i, 1)), AscW(Mid(pInput, i, 1))
Next
End Sub
I'd say it's probably that you're lacking the whole newline. A newline in Access consists of a Carriage Return (ASCII 13) AND a Line Feed (ASCII 10). This is abbreviated as CRLF. You probably only have one or the other, but not both.
Use HansUp's AsciiValues procedure to take a look.
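As a language-neutral illustration of the two steps (inspect the character codes, then replace lone line breaks with CRLF), here is a small Python sketch. It works on an ordinary string rather than on an Access table, and the function names char_codes and normalize_newlines are made up for this example.

```python
import re

def char_codes(text):
    """Return (position, code point) pairs with 1-based positions."""
    return [(pos, ord(ch)) for pos, ch in enumerate(text, start=1)]

def normalize_newlines(text):
    """Rewrite every CRLF, lone CR, or lone LF as CRLF."""
    return re.sub(r"\r\n|\r|\n", "\r\n", text)

source = "a\nb"                     # line feed only, no carriage return
print(char_codes(source))           # [(1, 97), (2, 10), (3, 98)]

fixed = normalize_newlines(source)
print(char_codes(fixed))            # [(1, 97), (2, 13), (3, 10), (4, 98)]
```

Normalizing the text before it is inserted avoids having to repair the stored values afterwards with an UPDATE.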

HTMLEntities decodes &nbsp; to ascii 194, shouldn't it be 160?

I'm using HTMLEntities to decode HTML strings. Today I saw that &nbsp; is decoded to 194 instead of 160.
jruby-1.6.2 :002 > HTMLEntities.new.decode( "&nbsp;" )[0]
=> 194
Is 194 correct, or am I doing something wrong (maybe something with UTF-8-Strings in Ruby)?
(JRuby = 1.6.2, Rails = 2.3.11, HTMLEntities = 4.3.0)
What you are seeing is the first byte of a two-byte UTF-8 sequence. Try unpacking it to see the expected Unicode code point:
HTMLEntities.new.decode( "&nbsp;" ).unpack('U*')[0]
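The relationship between the code point and its UTF-8 bytes can be shown with Python's standard library (used here instead of the Ruby HTMLEntities gem, purely as an illustration):

```python
import html

decoded = html.unescape("&nbsp;")      # the no-break space, U+00A0
utf8_bytes = decoded.encode("utf-8")

print(ord(decoded))                    # 160, the Unicode code point
print(list(utf8_bytes))                # [194, 160], the two UTF-8 bytes
```

So 194 is just the lead byte (0xC2) of the two-byte sequence, and 160 appears both as the second byte and as the code point itself.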