Apache Drill cannot parse CSV files with Windows EOL correctly? - csv

Ok, let's save someone 8 hours of clueless debugging.
TL;DR: Apache Drill cannot correctly parse CSV files generated on Windows machines. That's because their EOL is \r\n by default, unlike Unix systems, where it is \n. And this leads to horribly undebuggable errors, because the trailing \r probably stays glued to the last field's value. And what's funny, you won't notice it, because the character is invisible.
Let's have two files, one created on Linux and the second on Windows: hello.linux.csv and hello.win.csv. The content is the same (at least it looks like it is ...)
field_a,field_b
Hello,0.5
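You can make the invisible difference visible by looking at the raw bytes, e.g. with a quick Python check (assuming the two file names above):
# The files print identically, but the raw bytes differ: the Windows file
# ends each line with \r\n, the Linux one with \n.
print(open("hello.linux.csv", "rb").read())   # something like b'field_a,field_b\nHello,0.5\n'
print(open("hello.win.csv", "rb").read())     # something like b'field_a,field_b\r\nHello,0.5\r\n'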
Let's have a query.
SELECT * from (...)/hello.linux.csv;
---
field_a, field_b
Hello, "0.5"
SELECT * from (...)/hello.win.csv;
---
field_a, field_b
Hello, "0.5"
Fine! Let's do something with the data. Casting "0.5" to a number should be fine (and necessary).
SELECT
field_a, CAST (field_b as DECIMAL(10, 2)) as test
from (...)/hello.linux.csv;
---
field_a, test
Hello, 0.5
-- ... aaand, here we go!
SELECT
field_a, CAST (field_b as DECIMAL(10, 2)) as test
from (...)/hello.win.csv;
[30038]Query execution error. Details:[
SYSTEM ERROR: NumberFormatException
Fragment 0:0
Please, refer to logs for more information. -- In the logs, there is only a useless Java stack trace, of course.
[Error Id: 3551c939-3f5b-42c1-9b58-d600da5f12a0 on drill-develop-7bdb45c597-52rnz:31010]
]
...
(And now, imagine how much time it would take to reveal this in a complex production setup where the queries, data, and other factors are somewhat more complicated.)
The question: Is there a way to force Apache Drill (v 1.15) to process CSV files created with Windows EOLs?

You can update the CSV format's line delimiter to \r\n, but that would apply to all CSV files in the scope of your text format plugin. To change the delimiter per table, use a table function.
https://drill.apache.org/docs/plugin-configuration-basics/
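If you cannot touch the storage plugin config at all, a cruder workaround (not part of the answer above) is to normalize the line endings before Drill reads the file, dos2unix-style; a minimal Python sketch using the hypothetical file names from the question:
# Rewrite the Windows CSV with Unix line endings so Drill's default
# "\n" line delimiter parses it cleanly.
with open("hello.win.csv", "rb") as src:
    data = src.read().replace(b"\r\n", b"\n")
with open("hello.fixed.csv", "wb") as dst:
    dst.write(data)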

Related

What does 'multiline strings are different' mean in RIDE (Robot Framework) output?

I am trying to compare the data of two CSV files and followed the process below in RIDE -
${csvA} =    Get File    ${filePathA}
${csvB} =    Get File    ${filePathB}
Should Be Equal As Strings    ${csvA}    ${csvB}
Here are the contents of my two CSV files -
csvA data
Harshil,45,8.03,DMJ
Divy,55,8,VVN
Parth,1,9,vvn
kjhjmb,44,0.5,bugg
csvB data
Harshil,45,8.03,DMJ
Divy,55,78,VVN
Parth,1,9,vvnbcb
acc,5,6,afafa
As a few of the values do not match, when I run the code in RIDE, the result is FAIL. But the data below is shown in the log -
Multiline strings are different:
--- first
+++ second
@@ -1,4 +1,4 @@
Harshil,45,8.03,DMJ
-Divy,55,8,VVN
-Parth,1,9,vvn
-kjhjmb,44,0.5,bugg
+Divy,55,78,VVN
+Parth,1,9,vvnbcb
+acc,5,6,afafa
I would like to know the meaning of the --- first, +++ second, and @@ -1,4 +1,4 @@ parts of that output.
Thanks in advance!
When robot compares multiline strings (data that has newlines in it), it shows the differences in the format of the standard unix tool diff. Those characters are all part of what's called a unified diff. Even though you pass in raw data, it treats the data as two files and shows the differences between the two in a format familiar to most programmers.
Here are two references to read more about the format:
What does "## -1 +1 ##" mean in Git's diff output?. (stackoverflow)
the diff man page (gnu.org)
In short, the @@ line gives you a reference for which line numbers differ, and the + and - prefixes show you which lines were added and removed.
In your specific example it's telling you that three lines were different between the two strings: the line beginning with Divy, the line beginning with Parth, and the line beginning with acc. Since the line beginning with Harshil does not show a + or -, that means it was identical between the two strings.
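For illustration, the same kind of output can be reproduced with Python's difflib, which emits exactly this unified-diff format; a minimal sketch using the data above:
import difflib

csv_a = "Harshil,45,8.03,DMJ\nDivy,55,8,VVN\nParth,1,9,vvn\nkjhjmb,44,0.5,bugg\n"
csv_b = "Harshil,45,8.03,DMJ\nDivy,55,78,VVN\nParth,1,9,vvnbcb\nacc,5,6,afafa\n"

# unified_diff labels the two inputs "first" and "second", like the RIDE log does
diff = difflib.unified_diff(csv_a.splitlines(), csv_b.splitlines(),
                            fromfile="first", tofile="second", lineterm="")
print("\n".join(diff))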

Entry delimiter of JSON files for Hive table

We are collecting JSON data (public social media posts in particular) via REST API invocations, which we plan to dump into HDFS and then abstract a Hive table on top of it using a SerDe. I wonder, though, what would be the appropriate delimiter per JSON entry in a file? Is it the newline ("\n")? So it would look like this:
{ id: entry1 ... post: }
{ id: entry2 ... post: }
...
{ id: entryn ... post: }
What about when we encounter a newline character within the JSON data itself, for example in post?
The best way would be one record per line, separated by "\n" exactly as you guessed.
This also means that you should be careful to escape "\n" that may be inside the JSON elements.
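For example, a minimal Python sketch of writing one record per line; the standard json.dumps already escapes embedded newlines, so nothing extra is needed as long as every record goes through a JSON serializer (the file name and records are made up):
import json

records = [
    {"id": "entry1", "post": "first line\nsecond line"},   # embedded newline in the text
    {"id": "entry2", "post": "single line"},
]

# json.dumps writes the embedded newline as the escape sequence \n,
# so each record stays on exactly one physical line of the file.
with open("posts.json", "w") as out:
    for rec in records:
        out.write(json.dumps(rec) + "\n")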
Indented JSON won't work well with hadoop/hive, since to distribute processing, hadoop must be able to tell when a record ends, so it can split the processing of a file of N bytes among W workers into W chunks of size roughly N/W.
The splitting is done by the particular InputFormat that's been used, in case of text, TextInputFormat.
TextInputFormat will basically split the file at the first instance of "\n" found after byte i*N/W (for i from 1 to W-1).
For this reason, having other "\n" around would confuse Hadoop and it will give you incomplete records.
As an alternative (I wouldn't recommend it), if you really wanted to, you could use a character other than "\n" by configuring the property "textinputformat.record.delimiter" when reading the file through hadoop/hive, using a character that won't appear in the JSON (for instance \001, i.e. CTRL-A, which is commonly used by Hive as a field delimiter), but that can be tricky since it also has to be supported by the SerDe.
Also, if you change the record delimiter, anybody who copies or uses the file on HDFS must be aware of the delimiter, or they won't be able to parse it correctly and will need special code to do so; if you keep "\n" as the delimiter, the files remain normal text files and can be used by other tools.
As for the SerDe, I'd recommend this one, with the disclaimer that I wrote it :)
https://github.com/rcongiu/Hive-JSON-Serde

Pandas read_csv errors on number of fields, but visual inspection looks fine

I'm trying to load a large csv file, 3,715,259 lines.
I created this file myself and there are 9 fields separated by commas.
Here's the error:
import pandas as pd
df = pd.read_csv("avaya_inventory_rev2.csv", error_bad_lines=False)
Skipping line 2924525: expected 9 fields, saw 11
Skipping line 2924526: expected 9 fields, saw 10
Skipping line 2924527: expected 9 fields, saw 10
Skipping line 2924528: expected 9 fields, saw 10
This doesn't make sense to me; I inspected the offending lines using:
sed -n "2924524,2924525p" infile.csv
I can't list the outputs as they contain proprietary information for a client. I'll try to synthesize a meaningful replacement.
Lines 2924524 and 2924525 look to have the same number of fields to me.
Also, I was able to load the same file into a mySQL table with no error.
create table Inventory (path varchar (255), isText int, ext varchar(5), type varchar(100), size int, sloc int, comments int, blank int, tot_lines int);
I don't know enough about MySQL to understand why that may or may not be a valid test, or why pandas would have a different outcome when loading the same file.
TIA !
UPDATE: I tried to read with engine='python':
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
When I create this csv, I'm using a shell script I wrote. I feed lines to the file with redirect >>
I tried the suggested fix:
input = open(input, 'rU')
df = pd.read_csv(input, engine='python')
Back to the same error:
ValueError: Expected 9 fields in line 5157, saw 11
I'm guessing it has to do with my csv creation script and how I dealt with
quoting in that. I don't know how to investigate this further.
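One way to investigate further is to print every line whose field count differs from the header's; a minimal sketch with Python's csv module, reusing the file name from above:
import csv

with open("avaya_inventory_rev2.csv", newline="") as f:
    reader = csv.reader(f)
    expected = len(next(reader))          # number of fields in the header row
    for lineno, row in enumerate(reader, start=2):
        if len(row) != expected:
            print(lineno, len(row), row)  # printing the parsed fields makes stray \r or ,v suffixes visible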
I opened the csv input file in vim and on line 5157 there's a ^M, which Google says is a Windows CR.
OK... I'm closer, although I did kinda suspect something like this and used dos2unix on the csv input.
I removed the ^M using vim and re-ran with the same error about 11 fields. However, I can now see the 11 fields, whereas before I just saw 9. There are ,v's, which are likely some kind of Windows holdover?
SUMMARY: Somebody thought it'd be cute to name files like foobar.sh,v. So my profiler didn't mess up; it was just a naming weirdness... plus the random CR/LF from Windows that snuck in....
Cheers

PowerShell converting a character to ASCII

Currently I have a PowerShell process that is scanning a SQL Server table and reading a column containing text. We have characters out in extended-ASCII land that are causing our downstream processes to break. I was originally identifying these differences in SQL Server, but it is terrible at text parsing, so I decided to write a PowerShell script that does this with regular expressions. I will post the code for that as well, to help other lost souls looking for such a regex.
$x = [regex]::Escape("\``~!@#$%^&*()_|{}=+:;`"'<,>.?/-")
$y = "([^A-z0-9 \0x005D\0x005B\t\n"+$x+"])"
$a = [regex]::match( $($Row[1]), $y)
The problem comes when I want to display some of the ASCII values back in an email saying that I'm scrubbing the data. The numbers don't come out the same as in SQL Server. Caution: I'm not sure if your results will be the same copying from your browser, because these are extended ASCII characters.
In PowerShell
[int]"–"[-0]; #result 8211 that appears to be wrong
[int]" "[-0]; #result 160 this appears to be right
In SQL Server
select ASCII('–') --result 150
select ASCII(' ') --result 160
What in PowerShell will help you get the same results as SQL Server on the ASCII lookup, if there is such a thing?
TL;DR: So my question is: is the above the correct method to look up ASCII values in PowerShell? It works for most values but doesn't work for the ASCII value 150 (the long dash that comes from Word).
In SQL Server,
select UNICODE('–')
will return 8211.
I don't think PowerShell supports ANSI, except for I/O; it works in Unicode internally.
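To illustrate the mismatch outside both tools: 8211 is the Unicode code point of the en dash, while 150 is its byte value in the Windows-1252 "extended ASCII" code page that the varchar column is presumably using; a quick Python check:
dash = "\u2013"                     # the long dash that Word inserts
print(ord(dash))                    # 8211, the Unicode code point (what PowerShell reports)
print(dash.encode("cp1252")[0])     # 150, the Windows-1252 byte (what SQL Server's ASCII() reports)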

How can I check if a binary string is UTF-8 in MySQL?

I've found a Perl regexp that can check whether a string is UTF-8 (the regexp is from the W3C site).
$field =~
m/\A(
[\x09\x0A\x0D\x20-\x7E] # ASCII
| [\xC2-\xDF][\x80-\xBF] # non-overlong 2-byte
| \xE0[\xA0-\xBF][\x80-\xBF] # excluding overlongs
| [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2} # straight 3-byte
| \xED[\x80-\x9F][\x80-\xBF] # excluding surrogates
| \xF0[\x90-\xBF][\x80-\xBF]{2} # planes 1-3
| [\xF1-\xF3][\x80-\xBF]{3} # planes 4-15
| \xF4[\x80-\x8F][\x80-\xBF]{2} # plane 16
)*\z/x;
But I'm not sure how to port it to MySQL, as it seems that MySQL doesn't support hex representation of characters; see this question.
Any thoughts on how to port the regexp to MySQL?
Or maybe you know some other way to check whether a string is valid UTF-8?
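Outside the database, the same kind of validity test can be done with a UTF-8 decoder; a minimal Python sketch (note the regexp above is slightly stricter, since it also rejects most ASCII control characters):
def is_valid_utf8(raw: bytes) -> bool:
    # Strict UTF-8 decoding rejects overlong sequences and surrogates,
    # much like the W3C regexp does.
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"caf\xc3\xa9"))   # True: valid UTF-8 for "café"
print(is_valid_utf8(b"caf\xe9"))       # False: latin1-encoded "café"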
UPDATE:
I need this check working in MySQL because I need to run it on the server to correct broken tables. I can't pass the data through a script, as the database is around 1 TB.
I've managed to repair my database using a test that works only if your data can be represented using a one-byte encoding; in my case it was latin1.
I've used the fact that MySQL changes the bytes that aren't UTF-8 to '?' when converting to latin1.
Here is how the check looks:
SELECT (CONVERT(CONVERT(potentially_broken_column USING latin1) USING utf8)
        != potentially_broken_column) AS INVALID ....
If you are in control of both the input and output side of this DB, then you should be able to verify that your data is UTF-8 on whichever side you like and implement constraints as necessary. If you are dealing with a system where you don't control the input side, then you are going to have to check it after you pull it out, and possibly convert it in your language of choice (Perl, it sounds like).
The database is a REALLY good storage facility but should not be used aggressively for other applications. I think this is one spot where you should just let MySQL hold the data until you need to do something further with it.
If you want to continue on the path you are on then check out this MySQL Manual Page: http://dev.mysql.com/doc/refman/5.0/en/regexp.html
REGEX is normally VERY similar between languages (in fact I can almost always copy between JavaScript, PHP, and Perl with only minor adjustments for their wrapping functions), so if that is a working regex then you should be able to port it easily.
GL!
EDIT: Look at this Stack article; you might want to use Stored Procedures considering you cannot use scripting to handle the data: Regular expressions in stored procedures
With Stored Procedures you can loop through the data and do a lot of handling without ever leaving MySQL. That second article is going to refer you right back to the one I listed, though, so I think you need to first prove out your REGEX and get it working, then look into Stored Procedures.