Parsing HTML5 EventSource stream from network - html

I've studied the HTML5 EventSource specification and can't figure out how to parse and handle a carriage return at the end of received data.
The app receives a data stream composed of lines. Each line can be terminated by \r\n, \n or \r. On a blank line, the event should be considered complete and fired to listeners.
data: foobar\r\n
id: 1\r\n
\r\n
An equally valid event with the same content:
data: foobar\n
id: 1\r\n
\r
Full spec here: http://dev.w3.org/html5/eventsource/ (chapter 6 describes the BNF of the input).
The problem is a carriage return seen at the end of received data. As far as I can understand, the proper way of parsing is to do a longest-match search and thus wait for the next data batch. The problem is that if the \r truly was the empty-line marker, the event won't be fired until the next data batch arrives and the parser has enough data to attempt the longest match.
Current data batch
data: foobar\r\n
id: 1\r\n
\r
Next data batch
\n
data: foobar2\r\n
id: 1\r\n
\r\n
Alternative case: next data batch
data: foobar2\r\n
id: 1\r\n
\r\n
This would not be a problem in traditional parsing, but it is in EventSource because I need to trigger events as soon as possible: if the implementation waits for the next data batch to get the longest match, it might wait for a long time if the sender used the single character '\r' as the empty-line marker and is not going to send anything else for a while.

Interesting problem! I assume you are not using a browser but writing your own client? (If writing server-side code, always send just \n or just \r!)
The solution, when reading from the socket, is to convert any "\r\n" sequence to a single "\r".
In other words, as soon as you get the "\r" you can treat it as end-of-line, do whatever processing you need, and set a CR_just_received flag. If you receive a "\n" while CR_just_received == true, quietly swallow it. Make sure CR_just_received is cleared whenever any byte other than \r is received.
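Here is a minimal sketch of that approach in Java (the class and method names are made up for illustration; only the CR_just_received logic comes from the answer above). Each line, including the empty one that terminates an event, is dispatched the moment its terminator is seen, so a lone '\r' at the end of a batch fires immediately and a following '\n' in the next batch is swallowed.

import java.util.function.Consumer;

// Illustrative sketch of an EventSource line splitter.
class EventStreamLineSplitter {
    private final StringBuilder line = new StringBuilder();
    private boolean crJustReceived = false;
    private final Consumer<String> lineHandler;   // receives each complete line

    EventStreamLineSplitter(Consumer<String> lineHandler) {
        this.lineHandler = lineHandler;
    }

    // Call this with each chunk of characters as it arrives from the socket.
    void feed(CharSequence chunk) {
        for (int i = 0; i < chunk.length(); i++) {
            char c = chunk.charAt(i);
            if (c == '\n' && crJustReceived) {
                crJustReceived = false;          // second half of "\r\n": swallow it
                continue;
            }
            crJustReceived = (c == '\r');
            if (c == '\r' || c == '\n') {        // end of line: dispatch immediately
                lineHandler.accept(line.toString());
                line.setLength(0);
            } else {
                line.append(c);
            }
        }
    }
}

An empty string handed to the line handler is the blank line that ends an event, so the caller can fire the event to its listeners at that point without waiting for more data.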


github api "message":"Problems parsing JSON" for large Base64 string

I'm trying to learn to use the GitHub API from Java.
I've created a simple program that can read and commit new versions of a file.
I have tested this with many short text files and I think I'm correctly using MIME Base64.
I'm now trying to upload a larger file, on the order of 5 MB.
And this means having a JSON in the body looking like this:
{
"owner": "example42gdrive",
"repo": "Example1",
"message": "FileSystem 42 module on github",
"content": "rO0ABXNyABdpcy5MNDIuZ2V ...5MB of JS string here... ABGluZm90AB5MfqIDcQB+AAU=",
"sha":"a7ef93d3eb50383028578cb916b70060067d9c8a"
}
And I get back as a response
400
{"message":"Problems parsing JSON","documentation_url":"https://docs.github.com/rest/reference/repos#create-or-update-file-contents"}
Notes:
The same exact code works for smaller content
Java's Base64.getMimeEncoder() will insert some \n in the result to separate it into lines. I'm removing those newlines in order to get a valid JSON string.
Does anyone know what I'm doing wrong, or what I should do instead?
EDIT: after some experimentation, the problem seems to be the \n:
If I produce a Base64 string short enough that Base64.getMimeEncoder() does not insert any \n, all is fine. Of course, a string with a \n cannot be 'stringified' by simply adding (") before and after, so I tried:
removing the \n, no effect -> Problems parsing JSON
replacing the \n with \\n (so that the parser will see them as \n inside a string) -> Problems parsing JSON
replacing the \n with \\\\n (so that the parser will see them as \\n; this might help if there were somehow two levels of escaping server-side) -> Problems parsing JSON
replacing the \n with a space -> Problems parsing JSON
Wikipedia (https://en.wikipedia.org/wiki/Base64) clearly states that
"newlines and white spaces may be present anywhere but are to be ignored on decoding".
I'm starting to think that there is something I do not understand, something so obvious that the GitHub API docs do not mention it.
Ok, I did it.
I found the answer indirectly by reading
Java 8 Base64 Encode (Basic) doesn't add new line anymore. How can I reimplement this?
Basically, just looking at the screen I believed the 'newline' was just "\n"; instead it was "\r\n". Thus, I was replacing only the "\n" and leaving the "\r" in place.
Replacing "\r\n" with "" works.
However, replacing "\r\n" with "\\r\\n" does not. This suggests a bug in the GitHub decoder (if Wikipedia is right and newlines must be allowed).
I hate this! I hate that we have more than one way to express 'new line' and that in most contexts they are rendered the same graphically!
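For reference, a minimal sketch of the two ways to get a JSON-safe, single-line Base64 string (the file name is illustrative):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class EncodeForGithub {
    public static void main(String[] args) throws Exception {
        byte[] fileBytes = Files.readAllBytes(Paths.get("big-file.bin")); // illustrative path

        // Option 1: the basic encoder never inserts line separators,
        // so the result is already a single physical line.
        String oneLine = Base64.getEncoder().encodeToString(fileBytes);

        // Option 2: if you must use the MIME encoder, strip its "\r\n"
        // separators (not just "\n") before embedding it in the JSON body.
        String mime = Base64.getMimeEncoder().encodeToString(fileBytes);
        String cleaned = mime.replace("\r\n", "");

        System.out.println(oneLine.equals(cleaned)); // true
    }
}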

Golang CSV read : extraneous " in field error

I am using a simple program to read a CSV file. I noticed that when I create the CSV using Excel or a Windows-based computer, the Go library fails to read it; even when I use the cat command it only shows me the last line on the terminal. It always results in this error: extraneous " in field.
I researched a bit and found it is related to carriage-return differences between operating systems.
But I really want to ask how to make a generic CSV reader. I tried reading the same CSV using pandas and it read successfully, but I have not been able to achieve this using my Go code.
Also, a screenshot of the correct CSV is here.
Your file clearly shows that you've got an extra quote at the end of the content. While programs like pandas may be fine with that, I assume it's not valid CSV, so Go returns an error.
Quick example of what's wrong with your data: https://play.golang.org/p/KBikSc1nzD
Update: After your update and a little bit of searching, I have to apologize; the carriage return does matter and seems to be the main culprit here. Go seems to be OK handling the \r\n Windows variant but not the lone \r one. In that case, what you can do is wrap the bytes.Reader in a custom reader that replaces the \r byte with the \n byte.
Here's an example: https://play.golang.org/p/vNjzwAHmtg
Please note that the example is just that, an example; it does not handle all the possible cases where \r might be a legitimate byte.

Entry delimiter of JSON files for Hive table

We are collecting JSON data (public social media posts in particular) via REST API invocations, which we plan to dump into HDFS and then abstract a Hive table on top of it using a SerDe. I wonder, though, what would be the appropriate delimiter per JSON entry in a file? Is it a newline ("\n")? So it would look like this:
{ id: entry1 ... post: }
{ id: entry2 ... post: }
...
{ id: entryn ... post: }
What if we encounter a newline character within the JSON data itself, for example in post?
The best way would be one record per line, separated by "\n" exactly as you guessed.
This also means that you should be careful to escape "\n" that may be inside the JSON elements.
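As a minimal sketch of what that looks like in Java (assuming Jackson's ObjectMapper is available; any JSON library that escapes control characters inside strings behaves the same way), a record whose post contains a real newline still ends up on exactly one physical line:

import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.FileWriter;
import java.io.Writer;
import java.util.Map;

public class JsonLinesWriter {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();

        // A post whose body contains a real newline.
        Map<String, String> post = Map.of(
                "id", "entry1",
                "post", "first line\nsecond line");

        try (Writer out = new FileWriter("posts.json")) {      // illustrative file name
            // writeValueAsString() escapes the embedded newline as \n,
            // so the whole record occupies exactly one line.
            out.write(mapper.writeValueAsString(post));
            out.write("\n");                                    // record delimiter
        }
    }
}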
Indented JSON won't work well with Hadoop/Hive, since to distribute processing Hadoop must be able to tell where a record ends, so it can split the processing of a file of N bytes across W workers into W chunks of size roughly N/W.
The splitting is done by the particular InputFormat that's being used; in the case of text, TextInputFormat.
TextInputFormat will basically split the file at the first instance of "\n" found after byte i*N/W (for i from 1 to W-1).
For this reason, having other "\n" around would confuse Hadoop and it will give you incomplete records.
As an alternative (I wouldn't recommend it, but if you really wanted to), you could use a character other than "\n" by configuring the property "textinputformat.record.delimiter" when reading the file through Hadoop/Hive, using a character that won't appear in the JSON (for instance \001, i.e. CTRL-A, which Hive commonly uses as a field delimiter). That can be tricky, though, since it also has to be supported by the SerDe.
Also, if you change the record delimiter, anybody who copies or uses the file on HDFS must be aware of the delimiter, or they won't be able to parse it correctly and will need special code to do so. Keeping "\n" as the delimiter, the files remain normal text files and can be used by other tools.
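If you did go down that road, the Hadoop side of it is just a job configuration property. A sketch (whether Hive and the SerDe honor it depends on their versions, so treat this as an assumption to verify):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class CustomDelimiterJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Tell TextInputFormat to split records on \001 (CTRL-A) instead of "\n".
        conf.set("textinputformat.record.delimiter", "\001");
        Job job = Job.getInstance(conf, "read-json-with-custom-delimiter");
        // ... set input/output paths, mapper, etc. as usual
    }
}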
As for the SerDe, I'd recommend this one, with the disclaimer that I wrote it :)
https://github.com/rcongiu/Hive-JSON-Serde

org.supercsv.exception.SuperCsvException: unexpected end of file while reading quoted column beginning on line

I'm reading CSV files using the Super CSV reader and got the following exception. The file has 80000 lines. Even when I remove the last lines the exception still happens, so there's some line in the file that's causing this problem. How do I fix this?
org.supercsv.exception.SuperCsvException: unexpected end of file while reading quoted column beginning on line 80000 and ending on line 80000
context=null
at org.supercsv.io.Tokenizer.readColumns(Tokenizer.java:198)
at org.supercsv.io.AbstractCsvReader.readRow(AbstractCsvReader.java:179)
at org.supercsv.io.CsvListReader.read(CsvListReader.java:69)
at csv.filter.CSVFilter.filterFile(CSVFilter.java:400)
at csv.filter.CSVFilter.filter(CSVFilter.java:369)
at csv.filter.CSVFilter.main(CSVFilter.java:292)
ICsvListReader reader = null;
String[] line = null;
List<String> lineList = null;
try {
    reader = new CsvListReader(new FileReader(inputFile), CsvPreference.STANDARD_PREFERENCE);
    while ((lineList = reader.read()) != null) {
        line = lineList.toArray(new String[lineList.size()]);
    }
} catch (Exception exp) {
    exp.printStackTrace();
    error = true;
}
The fact that the exception states it begins and ends on line 80000 should mean that there's an incorrect number of quotes on that line.
You should get the same error with the following CSV (but the exception will say line 1):
one,two,"three,four
The 3rd column is missing its trailing quote, so Super CSV reaches the end of the file and doesn't know how to interpret the input.
FYI here is the relevant unit test for this scenario from the project source.
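A minimal, self-contained repro of that situation (Super CSV 2.x assumed; the class name is made up for the example):

import java.io.StringReader;
import org.supercsv.io.CsvListReader;
import org.supercsv.prefs.CsvPreference;

public class UnbalancedQuoteDemo {
    public static void main(String[] args) throws Exception {
        // The third column is missing its closing quote.
        String csv = "one,two,\"three,four";
        CsvListReader reader =
                new CsvListReader(new StringReader(csv), CsvPreference.STANDARD_PREFERENCE);
        // read() throws SuperCsvException: "unexpected end of file while reading
        // quoted column beginning on line 1 and ending on line 1"
        reader.read();
    }
}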
You can try removing lines to find the culprit; just remember that CSV records can span multiple lines, so make sure you remove whole records.
The line shown in the error message is not necessarily the one with the problem, since unbalanced quote chars throw off Super CSV's line detection.
If possible, open the CSV in a spreadsheet program (for instance LibreOffice Calc) and search (as in Ctrl-F search) for the quote char.
Calc will usually import the file fine even if there is a mismatch, but you will see the quote char somewhere if you search for it. Then check in the CSV whether it is properly escaped. If it is, make sure Super CSV knows about it; if it isn't, complain to the producer of the CSV.

Mysql fails to save UTF string in some cases

During spam fighting, I found some spam comments stored without any content...
After trying to isolate the problem, here is what I found after saving similar comments to a file as well as to the MySQL database...
This is what the first few "chars" of the comment look like (in hex, because the input encoding is unknown):
D1EA E0F7 E0F2 FC20 EFEE EFF3 EBFF F0ED FBE5 20EF F0EE E3F0 E0EC ECFB
After executing INSERT INTO test VALUES (0xD1EAE0F7E0F2FC20EFEEEFF3EBFFF0EDFBE520EFF0EEE3F0E0ECECFB21),(0x21D1EAE0F7E0F2FC20EFEEEFF3EBFFF0EDFBE520EFF0EEE3F0E0ECECFB), (0x21), the test MySQL table (utf-8) contains 3 rows: the first without any text, the second and third with the single character "!" as text... (Note that the hex code 21 for "!" is also at the end of the first entry, yet it is not saved.) (With latin1 encoding, some useless replacement text was saved for every byte, but this post is not about that.)
Of course, D1EA isn't a valid UTF-8 character (D1 = 1101 0001 should be followed by one 10xxxxxx byte, not an 1110xxxx one), but a robust system like a database server should be able to deal with it...
My guess is that MySQL (ver. 5.1.66-0+squeeze1) shouldn't choose when to save data and when not to, even if it's not a valid UTF-8 encoded character... Or at least, it should not claim the query was successful when it decides not to store the data!
Is it a bug in MySQL, or what?
Thanks
Encoding is Windows-1251, and decodes to
Скачать популярные программы
//"Download popular software" google translated
You should reject non-UTF8 input in your code before doing anything with it.
if (!mb_check_encoding($input, "UTF-8")) {
    header("HTTP/1.1 400 Bad Request");
    die("Invalid encoding");
}
FTR, your queries are hex literals, not misencoded text.