Pandas read_csv errors on number of fields, but visual inspection looks fine - mysql

I'm trying to load a large csv file, 3,715,259 lines.
I created this file myself and there are 9 fields separated by commas.
Here's the error:
df = pd.read_csv("avaya_inventory_rev2.csv", error_bad_lines=False)
Skipping line 2924525: expected 9 fields, saw 11
Skipping line 2924526: expected 9 fields, saw 10
Skipping line 2924527: expected 9 fields, saw 10
Skipping line 2924528: expected 9 fields, saw 10
This doesn't make sense to me, I inspected the offending lines using:
sed -n "2924524,2924525p" infile.csv
I can't list the outputs as they contain proprietary information for a client. I'll try to synthesize a meaningful replacement.
Lines 2924524 and 2924525 look to have the same number of fields to me.
Also, I was able to load the same file into a MySQL table with no error.
create table Inventory (path varchar (255), isText int, ext varchar(5), type varchar(100), size int, sloc int, comments int, blank int, tot_lines int);
I don't know enough about MySQL to understand why that may or may not be a valid test, or why pandas would have a different outcome loading the same file.
TIA !
UPDATE: I tried to read with engine='python':
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
When I create this csv, I'm using a shell script I wrote. I feed lines to the file with redirect >>
I tried the suggested fix:
f = open("avaya_inventory_rev2.csv", 'rU')
df = pd.read_csv(f, engine='python')
Back to the same error:
ValueError: Expected 9 fields in line 5157, saw 11
I'm guessing it has to do with my csv creation script and how I dealt with
quoting in that. I don't know how to investigate this further.
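One way to investigate (a Python sketch; only the filename and expected field count come from the question) is to let Python's csv module tokenize the file and report every record whose field count is off:

```python
import csv
import io

def find_bad_lines(lines, expected):
    """Yield (record_number, field_count) for rows whose field count differs."""
    for recno, row in enumerate(csv.reader(lines), start=1):
        if len(row) != expected:
            yield recno, len(row)

# Demo on synthetic data; for the real file you would pass
# open("avaya_inventory_rev2.csv", newline="") instead of a StringIO.
sample = "a,b,c\n1,2,3\n1,2,3,4\n5,6\n"
bad = list(find_bad_lines(io.StringIO(sample), expected=3))
print(bad)  # [(3, 4), (4, 2)]
```

Because csv.reader honors quoting, a record it flags is genuinely malformed rather than just containing quoted commas.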
I opened the csv input file in vim, and on line 5157 there's a ^M, which Google says is a Windows carriage return.
OK... I'm closer, although I did kind of suspect something like this and had already run dos2unix on the csv input.
I removed the ^M using vim and re-ran, with the same error about
11 fields. However, I can now see the 11 fields, whereas before I just saw
9. There are stray `,v` suffixes, which are likely some kind of version-control holdover?
SUMMARY: Somebody thought it'd be cute to name files like fobar.sh,v (RCS-style names end in `,v`).
So my profiler didn't mess up; it was just name weirdness, plus the random \r\n from Windows that snuck in.
Cheers

Error in a line on big CSV imported to BigQuery

I'm trying to import a big CSV file to BigQuery (2.2 GB+). This is the error I get:
"Error while reading data, error message: CSV table references column position 33, but line starting at position:254025076 contains only 26 columns."
There are more errors in that file – and in that file only, out of one file per state. Usually I would skip the faulty lines, but then I would lose a lot of data.
What can be a good way to check and correct the errors in a file that big?
EDIT: This is what seems to happen in the file. It's one single line and it breaks between "Instituto" and "Butantan". As a result, BigQuery parses it as one line with 26 columns and another with nine columns. That repeats a lot.
As far as I've seen, it's just with Butantan, but sometimes the first word is described differently (I caught "Instituto" and "Fundação"). Can I correct that maybe with grep on the command line? If so, what syntax?
Actually, 2.2 GB is quite a manageable size. It can be quickly pre-processed with command-line tools or a simple Python script on any more-or-less modern laptop/desktop, or on a small VM in GCP.
You can start by looking at the problematic row:
head -n 254025076 your_file.csv | tail -n 1
If problematic rows just have missing values for last columns - you can use "--allow_jagged_rows" loading CSV option.
Otherwise, I usually use a simple Python script like this:
import fileinput

def process_line(line):
    # your logic to fix the line
    return line

if __name__ == '__main__':
    for line in fileinput.input():
        print(process_line(line), end='')  # end='' avoids doubling the newline
and run it with:
cat your_file.csv | python3 preprocess.py > new_file.csv
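For the specific break between "Instituto" and "Butantan" described in the question, process_line alone isn't quite enough, since the fix spans two physical lines; a small generator can buffer and rejoin them. This is a hedged sketch — the trigger words and the single-space join are assumptions about the data:

```python
def join_broken_lines(lines, endings=("Instituto", "Fundação")):
    """Merge a line that was split mid-value (e.g. after 'Instituto')
    with the line that follows it."""
    buffered = None
    for line in lines:
        stripped = line.rstrip("\n")
        if buffered is not None:
            yield buffered + " " + stripped   # assumed: one space was lost at the break
            buffered = None
        elif stripped.endswith(endings):
            buffered = stripped
        else:
            yield stripped
    if buffered is not None:                  # dangling fragment at end of file
        yield buffered

rows = ["1,2,Instituto", "Butantan,4,5", "6,7,8"]
print(list(join_broken_lines(rows)))  # ['1,2,Instituto Butantan,4,5', '6,7,8']
```

The same logic could be folded into the preprocess.py pattern above by iterating over fileinput.input() instead of a list.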
UPDATE:
For newline characters inside values, try BigQuery's "Allow quoted newlines" option.

Octave - dlmread and csvread convert the first value to zero

When I try to read a csv file in Octave, I find that the very first value in it is converted to zero. I tried both csvread and dlmread and I receive no errors. I am able to open the file in a plain text editor and I can see the correct value there. From what I can tell, there are no funny hidden characters, spacings, or similar in the csv file. The files also contain only numbers. The only thing that might be important is that I have five columns/groups that each have a different number of values in them.
I went through the commands' documentation on Octave Forge and I do not know what may be causing this. Does anyone have an idea what I can troubleshoot?
To try to illustrate the issue, if I try to load a file with the contents:
1.1,2.1,3.1,4.1,5.1
,2.2,3.2,4.2,5.2
,2.3,3.3,4.3,
,,3.4,4.4
,,3.5,
Command window will return:
0.0,2.1,3.1,4.1,5.1
,2.2,3.2,4.2,5.2
,2.3,3.3,4.3,
,,3.4,4.4
,,3.5,
(with additional trailing zeros after the decimal point).
Command syntaxes I'm using are:
dt = csvread("FileName.csv")
and
dt = dlmread("FileName.csv",",")
and they both return the same.
Your csv file contains a Byte Order Mark right before the first number. You can confirm this if you open the file in a hex editor, you will see the sequence EF BB BF before the numbers start.
This causes the first entry to be interpreted as a 'string', and since strings are parsed based on whether there are numbers in 'front' of the string sequence, this is parsed as the number zero. (see also this answer for more details on how csv entries are parsed).
In my text editor, if I start at the top left of the file and press the right arrow key once, the cursor doesn't move (meaning I've just gone over the invisible byte order mark, which takes no visible space). Pressing backspace at this point to delete the byte order mark allows the csv to be read properly. Alternatively, you may have to fix your file in a hex editor, or find some other way to convert it to a proper ASCII file (or UTF without the byte order mark).
Also, it may be worth checking how this file was produced; if you have any control over that process, perhaps you can find out why this mark was placed there in the first place and prevent it. E.g., if this was exported from Excel, you can choose the plain 'CSV' format instead of 'CSV UTF-8'.
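For reference, stripping the mark can also be scripted; here is a Python sketch (Octave itself has no dedicated helper for this that I know of):

```python
def strip_bom(data: bytes) -> bytes:
    """Remove a UTF-8 byte order mark (EF BB BF) if the file starts with one."""
    bom = b"\xef\xbb\xbf"
    return data[len(bom):] if data.startswith(bom) else data

raw = b"\xef\xbb\xbf1.1,2.1,3.1,4.1,5.1"
print(strip_bom(raw))  # b'1.1,2.1,3.1,4.1,5.1'
```

Equivalently, opening the file in Python with encoding="utf-8-sig" skips the mark automatically on decode.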
UPDATE
In fact, this issue seems to have already been submitted as a bug and fixed in the development branch of octave. See #58813 :)

How do I preserve the leading 0 of a number using Unoconv when converting from a .csv file to a .xls file?

I have a 3 column csv file. The 2nd column contains numbers with a leading zero. For example:
044934343
I need to convert a .csv file into a .xls and to do that I'm using the command line tool called 'unoconv'.
It's converting as expected, however when I load up the .xls in Excel, instead of showing '044934343' the cell shows '44934343' (the leading 0 has been removed).
I have tried surrounding the number in the .csv file with a single quote and a double quote however the leading 0 is still removed after conversion.
Is there a way to tell unoconv that a particular column should be of a TEXT type? I've tried to read the man page of unocov however the options are little confusing.
Any help would be greatly appreciated.
Perhaps I came to the scene too late, but just in case someone is looking for an answer to a similar question, this is how to do it:
unoconv -i FilterOptions=44,34,76,1,1/1/2/2/3/1 --format xls <csvFileName>
The key here is the "1/1/2/2/3/1" part, which tells unoconv that the second column's type should be "Text", leaving the first and third as "Standard".
You can find more info here: https://wiki.openoffice.org/wiki/Documentation/DevGuide/Spreadsheets/Filter_Options#Token_7.2C_csv_import
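To make the token layout concrete, here is a small Python helper (hypothetical, not part of unoconv) that assembles the same option string: the first two tokens are the character codes of the separator and quote, then the encoding token and the first row to import, then column/format pairs, where per the linked docs 1 = Standard and 2 = Text:

```python
def filter_options(sep=",", quote='"', encoding=76, first_row=1,
                   col_formats=(1, 2, 1)):
    """Build a LibreOffice/unoconv CSV import FilterOptions string."""
    pairs = "/".join(f"{col}/{fmt}" for col, fmt in enumerate(col_formats, start=1))
    return f"{ord(sep)},{ord(quote)},{encoding},{first_row},{pairs}"

print(filter_options())  # 44,34,76,1,1/1/2/2/3/1
```

Changing col_formats lets you mark any column as Text without memorizing the pair syntax.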
BTW this is my first post here...

Read a log file in R

I'm trying to read a log file in R.
It looks like an extract from a JSON file to me, but when trying to read it using jsonlite I get the following error message: "Error: parse error: trailing garbage".
Here is what my log file looks like:
{"date":"2017-05-11T04:37:15.587Z","userId":"admin","module":"Quote","action":"CreateQuote","identifier":"-.admin1002"},
{"date":"2017-05-11T05:12:24.939Z","userId":"a145fhyy","module":"Quote","action":"Call","identifier":"RunUY"},
{"date":"2017-05-11T05:12:28.174Z","userId":"a145fhyy","license":"named","usage":"External","module":"Catalog","action":"OpenCatalog","identifier":"wks.klu"},
As you can see, the column name is given directly before the content on each line (e.g. "date": or "action":),
and some lines omit some columns and add others.
What I want to get as output would be to have 7 columns with the corresponding data filled in each:
date
userId
license
usage
module
action
identifier
Does anyone have a suggestion about how to get there?
Thanks a lot in advance
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Thanks everyone for your answers. Here are some precisions about my issue:
The data that I gave as an example is an extract of one of my log files. I've got a lot of them that I need to read as one single table.
I haven't added any commas or anything to it.
#r2evans
I've tried the following:
Log3 <- read.table("/Projects/data/analytics.log.agregated.2017-05-11.log")
jsonlite::stream_in(textConnection(gsub(",$", "", Log3)))
It returns the following error:
Error: lexical error: invalid char in json text.
c(17, 18, 19, 20, 21, 22, 23, 2
(right here) ------^
I'm not sure how to use sed -e 's/,$//g' infile > outfile and Sys.which("sed"); that's something I'm not familiar with. I'm looking into it, but if you have any more details to give me about its usage, that would be great.
I have saved your example as a file "test.json" and was able to read and parse it like this:
library(jsonlite)
library(readr)  # read_file() comes from readr
rf <- read_file("test.json")
rfr <- gsub("\\},", "\\}", rf)
data <- stream_in(textConnection(rfr))
It parses and simplifies into a neat data frame exactly like you want. What I do is look for "}," rather than ",$", because the very last comma is not (necessarily) followed by a newline character(s).
However, this might not be the best solution for very large files. For those, you may need to first look for a way to modify the text file itself by getting rid of the commas. Or, if that's possible, ask the people who exported this file to export it in normal ndjson format :-)
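The same trailing-comma fix translates outside R as well; for comparison, a minimal Python sketch over the sample lines from the question:

```python
import json

log_text = (
    '{"date":"2017-05-11T04:37:15.587Z","userId":"admin","module":"Quote",'
    '"action":"CreateQuote","identifier":"-.admin1002"},\n'
    '{"date":"2017-05-11T05:12:24.939Z","userId":"a145fhyy","module":"Quote",'
    '"action":"Call","identifier":"RunUY"},\n'
)
# Strip the trailing comma from each line, then parse each as its own JSON object.
records = [json.loads(line.rstrip().rstrip(","))
           for line in log_text.splitlines() if line.strip()]
print(records[0]["action"])  # CreateQuote
# Columns a line omits (e.g. "license") are simply absent keys in that dict.
```

Assembling the dicts into a 7-column table then just means treating missing keys as NA.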

Golang CSV read : extraneous " in field error

I am using a simple program to read a CSV file. I noticed that when the CSV is created with Excel or on a Windows-based computer, the Go library fails to read it; even the cat command only shows me the last line on the terminal. It always results in this error: extraneous " in field.
After some research, I found it is related to carriage-return differences between operating systems.
But I really want to ask how to make a generic CSV reader. I tried reading the same CSV using pandas and it read successfully, but I have not been able to achieve this with my Go code.
Also, a screenshot of the correct CSV is here.
Your file clearly shows that you've got an extra quote at the end of the content. While programs like pandas may be fine with that, I assume it's not valid CSV, so Go does return an error.
Quick example of what's wrong with your data: https://play.golang.org/p/KBikSc1nzD
Update: After your update and a little bit of searching, I have to apologize; the carriage return does matter and seems to be the main culprit here. Go seems to be OK handling the \r\n Windows variant but not the bare \r one. In that case, what you can do is wrap the bytes.Reader in a custom reader that replaces the \r byte with the \n byte.
Here's an example: https://play.golang.org/p/vNjzwAHmtg
Please note, that the example is just that, an example, it's not handling all the possible cases where \r might be a legit byte.
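The same bare-\r normalization can be sketched in Python for comparison (the playground links above show the Reader-wrapping version in Go):

```python
import csv
import io

def normalize_newlines(text: str) -> str:
    """Turn Windows (\\r\\n) and bare-\\r (old-Mac style) line endings into \\n."""
    return text.replace("\r\n", "\n").replace("\r", "\n")

raw = "a,b\r1,2\r"   # bare carriage returns, as some older Mac-side exports produce
rows = list(csv.reader(io.StringIO(normalize_newlines(raw))))
print(rows)  # [['a', 'b'], ['1', '2']]
```

As with the Go version, this is only safe when \r cannot be a legitimate byte inside a quoted field.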