HTMLEntities decodes &nbsp; to ascii 194, shouldn't it be 160?

I'm using HTMLEntities to decode HTML strings. Today I saw that &nbsp; is decoded to 194 instead of 160.
jruby-1.6.2 :002 > HTMLEntities.new.decode( "&nbsp;" )[0]
=> 194
Is 194 correct, or am I doing something wrong (maybe something with UTF-8-Strings in Ruby)?
(JRuby = 1.6.2, Rails = 2.3.11, HTMLEntities = 4.3.0)

What you are seeing is the first byte of a two-byte UTF-8 sequence: &nbsp; decodes to U+00A0 (code point 160), which UTF-8 encodes as the bytes 194, 160. In Ruby 1.8 mode (which JRuby 1.6 uses by default), String#[] returns a single byte, hence 194. Try unpacking the string to see the expected Unicode code point instead:
HTMLEntities.new.decode( "&nbsp;" ).unpack('U*')[0]
=> 160
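To see why, here's a quick illustration in Python (the same bytes apply in any language):
# U+00A0 (NO-BREAK SPACE, what &nbsp; decodes to) is code point 160,
# but UTF-8 encodes it as the two bytes 0xC2 0xA0, i.e. 194 and 160.
nbsp = "\u00a0"
print(list(nbsp.encode("utf-8")))  # [194, 160]
print(ord(nbsp))                   # 160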

python error on string format with "\n" exec(compile(contents+"\n", file, 'exec'), glob, loc)

I'm trying to construct JSON from strings that contain "\n", like this:
ver_str= 'Package ID: version_1234\nBuild\nnumber: 154\nBuilt\n'
proj_ver_str = 'Version_123'
comb = '{"r_content": {0}, "s_version": {1}}'.format(ver_str,proj_ver_str)
json_content = json.loads()
d =json.dumps(json_content )
I'm getting this error:
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Dev/python/new_tester/simple_main.py", line 18, in <module>
comb = '{"r_content": {0}, "s_version": {1}}'.format(ver_str,proj_ver_str)
KeyError: '"r_content"'
The error arises not because of newlines in your values, but because of { and } characters in your format string other than the placeholders {0} and {1}. If you want to have an actual { or a } character in your string, double them.
Try replacing the line
comb = '{"r_content": {0}, "s_version": {1}}'.format(ver_str,proj_ver_str)
with
comb = '{{"r_content": {0}, "s_version": {1}}}'.format(ver_str,proj_ver_str)
However, this will give you a different error on the next line, loads() missing 1 required positional argument: 's'. This is because you presumably forgot to pass comb to json.loads().
Replacing json.loads() with json.loads(comb) gives you another error: json.decoder.JSONDecodeError: Expecting value: line 1 column 15 (char 14). This tells you that you've given json.loads malformed JSON to parse. If you print out the value of comb, you see the following:
{"r_content": Package ID: version_1234
Build
number: 154
Built
, "s_version": Version_123}
This isn't valid JSON, because the string values aren't surrounded by quotes. So a JSON parsing error is to be expected.
At this point, let's take a look at what your code is doing and what you seem to want it to do. You want to construct a JSON string from your data, but instead your code assembles a JSON string by hand, parses it into a dict, and then formats that dict back into a JSON string.
If you want to create a JSON string from your data, it's far simpler to create a dict with your values and use json.dumps on that:
d = json.dumps({"r_content": ver_str, "s_version": proj_ver_str})
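For instance, with the values from the question, json.dumps quotes the strings and escapes the embedded newlines for you:
import json

ver_str = 'Package ID: version_1234\nBuild\nnumber: 154\nBuilt\n'
proj_ver_str = 'Version_123'

d = json.dumps({"r_content": ver_str, "s_version": proj_ver_str})
print(d)
# {"r_content": "Package ID: version_1234\nBuild\nnumber: 154\nBuilt\n", "s_version": "Version_123"}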

Python csv data logging doesn't work in while loop

I have been trying to log the data received from an Arduino through the USB port. The strange thing is that the code works just fine on my Mac, but on Windows it won't write anything. At the start I expected the initial "DATA" header, but it didn't even write that. And when I commented out the entire loop, it worked (it says "DATA" in the csv file).
import serial
count = 1
port = serial.Serial('COM4', baudrate=9600, bytesize=8)
log = open("data_log.csv", "w")
log.write("DATA")
log.write("\n")
while 1:
    value = str(port.read(8), 'utf-8')
    value = value.replace('\r', '').replace('\n', '')
    if value.strip():
        log.write(str(count))
        log.write(',')
        log.write(value)
        log.write('\n')
        print(count)
        count += 1
    print(value)
\n = LF (Line Feed) // used as the newline character in Unix
\r = CR (Carriage Return) // used as the newline character in classic Mac OS
\r\n = CR + LF // used as the newline sequence in Windows
I think it's not working in Windows because you need to write a CR LF.
You might try using os.linesep (Python's equivalent of .NET's Environment.NewLine), as it expands to whichever of the above the operating system uses.
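For instance, a minimal sketch (assuming the same data_log.csv file; newline='' stops Python's text mode from translating the '\n' inside os.linesep a second time):
import os

# os.linesep is '\r\n' on Windows and '\n' on Unix.
with open("data_log.csv", "w", newline="") as log:
    log.write("DATA" + os.linesep)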

Read a file in R with mixed character encodings

I'm trying to read tables into R from HTML pages that are mostly encoded in UTF-8 (and declare <meta charset="utf-8">) but have some strings in some other encodings (I think Windows-1252 or ISO 8859-1). Here's an example. I want everything decoded properly into an R data frame. XML::readHTMLTable takes an encoding argument but doesn't seem to allow one to try multiple encodings.
So, in R, how can I try several encodings for each line of the input file? In Python 3, I'd do something like:
with open('file', 'rb') as o:
    for line in o:
        try:
            line = line.decode('UTF-8')
        except UnicodeDecodeError:
            line = line.decode('Windows-1252')
There do seem to be R library functions for guessing character encodings, like stringi::stri_enc_detect, but when possible, it's probably better to use the simpler deterministic method of trying a fixed set of encodings in order. It looks like the best way to do this is to take advantage of the fact that when iconv fails to convert a string, it returns NA.
linewise.decode = function(path)
    sapply(readLines(path), USE.NAMES = F, function(line) {
        if (validUTF8(line))
            return(line)
        l2 = iconv(line, "Windows-1252", "UTF-8")
        if (!is.na(l2))
            return(l2)
        l2 = iconv(line, "Shift-JIS", "UTF-8")
        if (!is.na(l2))
            return(l2)
        stop("Encoding not detected")
    })
If you create a test file with
$ python3 -c 'with open("inptest", "wb") as o: o.write(b"This line is ASCII\n" + "This line is UTF-8: I like π\n".encode("UTF-8") + "This line is Windows-1252: Müller\n".encode("Windows-1252") + "This line is Shift-JIS: ハローワールド\n".encode("Shift-JIS"))'
then linewise.decode("inptest") indeed returns
[1] "This line is ASCII"
[2] "This line is UTF-8: I like π"
[3] "This line is Windows-1252: Müller"
[4] "This line is Shift-JIS: ハローワールド"
To use linewise.decode with XML::readHTMLTable, just say something like XML::readHTMLTable(paste(linewise.decode("http://example.com"), collapse = "\n")).

JSONDecodeError: Expecting value: line 1 column 1

I am receiving this error in Python 3.5.1.
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Here is my code:
import json
import urllib.request
connection = urllib.request.urlopen('http://python-data.dr-chuck.net/comments_220996.json')
js = connection.read()
print(js)
info = json.loads(str(js))
If you look at the output you receive from print() and also in your Traceback, you'll see the value you get back is not a string, it's a bytes object (prefixed by b):
b'{\n "note":"This file .....
If you fetch the URL using a tool such as curl -v, you will see that the content type is
Content-Type: application/json; charset=utf-8
So it's JSON, encoded as UTF-8, but Python is treating it as a byte stream, not a string. (Wrapping it in str() doesn't help: that just produces the literal text b'…', which is not valid JSON.) In order to parse this, you need to decode it into a string first.
Change the last line of code to this:
info = json.loads(js.decode("utf-8"))
In my case, stray characters such as " : ' { } [ ] were corrupting the JSON format, so wrap json.loads in a try/except to check your input.
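A minimal sketch of that check (raw_text is a hypothetical placeholder for whatever string you are parsing):
import json

raw_text = '{"note": "unterminated'  # hypothetical malformed input
try:
    info = json.loads(raw_text)
except json.JSONDecodeError as e:
    print("Invalid JSON at line {}, column {}: {}".format(e.lineno, e.colno, e.msg))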

What characters have to be escaped to prevent (My)SQL injections?

I'm using MySQL API's function
mysql_real_escape_string()
Based on the documentation, it escapes the following characters:
\0
\n
\r
\
'
"
\Z
Now, I looked into OWASP.org's ESAPI security library and in the Python port it had the following code (http://code.google.com/p/owasp-esapi-python/source/browse/esapi/codecs/mysql.py):
"""
Encodes a character for MySQL.
"""
lookup = {
0x00 : "\\0",
0x08 : "\\b",
0x09 : "\\t",
0x0a : "\\n",
0x0d : "\\r",
0x1a : "\\Z",
0x22 : '\\"',
0x25 : "\\%",
0x27 : "\\'",
0x5c : "\\\\",
0x5f : "\\_",
}
Now, I'm wondering whether all those characters really need to be escaped. I understand why % and _ are there; they are metacharacters in the LIKE operator. But I simply can't understand why they added the backspace and tab characters (\b, \t). Is there a security issue if you run a query:
SELECT a FROM b WHERE c = '...user input ...';
where the user input contains tab or backspace characters?
My question is: why did they include \b and \t in the ESAPI security library? Are there situations where you might need to escape those characters?
A guess concerning the backspace character: Imagine I send you an email "Hi, here's the query to update your DB as you wanted" and an attached textfile with
INSERT INTO students VALUES ("Bobby Tables",12,"abc",3.6);
You cat the file, see it's okay, and just pipe it to MySQL. What you didn't know, however, is that I put
DROP TABLE students;\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b\b
before the INSERT statement, which you didn't see because the backspaces made the console output overwrite it. Bamm!
Just a guess, though.
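You can try the effect yourself with a quick Python demo (illustrative only; how backspaces render depends on the terminal):
# "DROP TABLE students;" is 20 characters, so 20 backspaces step back over it
# and the text printed afterwards overwrites it on most terminals.
payload = "DROP TABLE students;" + "\b" * 20 + 'INSERT INTO students VALUES ("Bobby Tables",12,"abc",3.6);'
print(payload)  # the terminal shows only the INSERT statement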
Edit (couldn't resist):
The MySQL manual page for strings says:
\0   An ASCII NUL (0x00) character.
\'   A single quote (“'”) character.
\"   A double quote (“"”) character.
\b   A backspace character.
\n   A newline (linefeed) character.
\r   A carriage return character.
\t   A tab character.
\Z   ASCII 26 (Control-Z). See note following the table.
\\   A backslash (“\”) character.
\%   A “%” character. See note following the table.
\_   A “_” character. See note following the table.
Blacklisting (identifying bad characters) is never the way to go, if you have any other options.
You need to use a combination of whitelisting and, more importantly, bound-parameter approaches.
Whilst this particular answer has a PHP focus, it still helps plenty and explains why just running a string through a character filter doesn't work in many cases. Please, please see Do htmlspecialchars and mysql_real_escape_string keep my PHP code safe from injection?
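To illustrate the bound-parameter approach, here's a minimal sketch using Python's sqlite3 module (chosen only because it needs no server; MySQL drivers such as PyMySQL follow the same placeholder pattern, with %s instead of ?):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE b (a TEXT, c TEXT)")
user_input = "x'; DROP TABLE b; --"  # hypothetical hostile input

# The placeholder binds the input as data; it can never change the SQL's
# structure, so nothing needs to be escaped by hand.
rows = conn.execute("SELECT a FROM b WHERE c = ?", (user_input,)).fetchall()
print(rows)  # [] -- and table b still exists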
Where user input contains tabulators or backspace characters?
It's quite remarkable that, to this day, most users believe it's user input that has to be escaped, and that such escaping "prevents injections".
Java solution:
public static String filter( String s ) {
    StringBuffer buffer = new StringBuffer();
    int i;
    for ( byte b : s.getBytes() ) {
        i = (int) b;
        switch ( i ) {
            case 9  : buffer.append( " " );    break;
            case 10 : buffer.append( "\\n" );  break;
            case 13 : buffer.append( "\\r" );  break;
            case 34 : buffer.append( "\\\"" ); break;
            case 39 : buffer.append( "\\'" );  break;
            case 92 : buffer.append( "\\" );   // fall through: also emit the backslash itself
            default :
                // pass printable ASCII through unchanged
                if ( i > 31 && i < 127 ) buffer.append( new String( new byte[] { b } ) );
        }
    }
    return buffer.toString();
}
Couldn't one just delete the single quote(s) from user input?
eg: $input =~ s/\'|\"//g;