Error in a line on big CSV imported to BigQuery

I'm trying to import a big CSV file to BigQuery (2.2 GB+). This is the error I get:
"Error while reading data, error message: CSV table references column position 33, but line starting at position:254025076 contains only 26 columns."
There are more errors in that file – and in that file only, out of one file per state. Usually I would just skip the faulty lines, but then I would lose a lot of data.
What would be a good way to check and correct the errors in a file that big?
EDIT: This is what seems to happen in the file: a single line breaks between "Instituto" and "Butantan". As a result, BigQuery parses it as one line with 26 columns and another with 9 columns. That repeats a lot.
As far as I've seen, it's just with "Butantan", but sometimes the first word differs (I caught "Instituto" and "Fundação"). Could I maybe correct that with grep on the command line? If so, with what syntax?

Actually, 2.2 GB is quite a manageable size. It can be quickly pre-processed with command-line tools or a simple Python script on any more-or-less modern laptop/desktop, or on a small VM in GCP.
You can start by looking at the problematic row. Note that the "position" in the BigQuery error message appears to be a byte offset into the file rather than a line number, so extract the row by byte offset (tail -c +K starts output at byte K, counting from 1; add 1 if the reported offset is zero-based):
tail -c +254025077 your_file.csv | head -n 1
If the problematic rows are just missing values in their last columns, you can use the "--allow_jagged_rows" CSV loading option.
Otherwise, I usually use a simple Python script like this:
import fileinput

def process_line(line):
    # your logic to fix the line goes here
    return line

if __name__ == '__main__':
    for line in fileinput.input():
        # lines keep their trailing newline, so print with end=''
        # to avoid inserting blank lines into the output
        print(process_line(line), end='')
and run it with:
cat your_file.csv | python3 preprocess.py > new_file.csv
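For the specific breakage described in the question (a row split by a stray newline inside a value, e.g. between "Instituto" and "Butantan"), a minimal standalone sketch could stitch short rows back together. EXPECTED_COLS is a placeholder for your real schema width, and the naive comma count assumes no quoted commas in the data – verify both before trusting the output:

import sys

EXPECTED_COLS = 34  # placeholder: set to the real column count of your schema

def main():
    pending = ''
    for raw in sys.stdin:
        line = pending + raw.rstrip('\n')
        # naive field count; assumes no commas inside quoted values
        if line.count(',') + 1 < EXPECTED_COLS:
            # row is incomplete ("...,Instituto"), so glue the next
            # physical line onto it ("Instituto Butantan")
            pending = line + ' '
        else:
            print(line)
            pending = ''
    if pending:
        print(pending.rstrip())

if __name__ == '__main__':
    main()

Run it the same way, e.g. cat your_file.csv | python3 stitch_rows.py > new_file.csv (stitch_rows.py being whatever you name the script).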
UPDATE: for newline characters inside a quoted value, try BigQuery's "Allow quoted newlines" option (--allow_quoted_newlines in the bq CLI).

Related

Admin import - Group not found

I am trying to load multiple CSV files into a new DB using the neo4j-admin import tool on a machine running Debian 11. To try to ensure there are no collisions in the ID fields, I've given every one of my node and relationship files its own ID space.
However, I'm getting this error:
org.neo4j.internal.batchimport.input.HeaderException: Group 'INVS' not found. Available groups are: [CUST]
This is super frustrating, as I know that the INVS group definitely exists. I've checked every file that uses that ID space and they all include it. Another strange thing is that there are more ID spaces than just the CUST and INVS ones. It feels like it's trying to load relationships before it finishes loading all of the nodes for some reason.
Here is what I see when I search through my input files:
$ grep -r -h "(INV" ./import | sort | uniq
:ID(INVS),total,:LABEL
:START_ID(INVS),:END_ID(CUST),:TYPE
:START_ID(INVS),:END_ID(ITEM),:TYPE
The top one is from my $NEO4J_HOME/import/nodes folder, the other two are in my $NEO4J_HOME/import/relationships folder.
Is there a nice solution to this? Or have I just stumbled upon a bug here?
Edit: here's the command I've been using from within my $NEO4J_HOME directory:
neo4j-admin import --force=true --high-io=true --skip-duplicate-nodes --nodes=import/nodes/\.* --relationships=import/relationships/\.*
Indeed, such a thing would be great, but I don't think it's possible at the moment.
Anyway, it doesn't seem to be a bug; I suppose it may be intended behavior and/or a feature not yet implemented.
In fact, the documentation regarding regular expressions says:
Assume that you want to include a header and then multiple files that matches a pattern, e.g. containing numbers.
In this case a regular expression can be used
while the description of the --nodes option says:
Node CSV header and data. Multiple files will be logically seen as one big file from the perspective of the importer. The first line must contain the header. Multiple data sources like these can be specified in one import, where each data source has its own header.
So, it appears that neo4j-admin import treats --nodes=import/nodes/\.* as a single CSV whose header is the first one found, hence the error.
Conversely, with multiple --nodes options there are no problems – e.g. one --nodes option per header group, so the INVS node files and the CUST node files each come with their own header.

line feed within a column in csv

I have a CSV like the one below. Some of the columns contain a line break, like column B. When I do wc -l file.csv, Unix returns 4, but these are actually 3 records. I don't want to replace the line break with a space; I am going to load the data into a database using SQL*Loader and want to load it as-is. What should I do so that Unix treats a record containing a line break as one record?
A,B,C,D
1,"hello
world",sds,sds
2,sdsd,sdds,sdds
Unless you're dealing with trivial cases (no quoted fields, no embedded commas, no embedded newlines, etc.), CSV data is best processed with tools that understand the format. Languages like Perl and Python have CSV parsing libraries available, there are packages like csvkit that provide useful utilities, and more.
Using csvstat from csvkit on your example:
$ csvstat -H --count foo.csv
Row count: 3
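If csvkit isn't available, Python's built-in csv module counts records the same way, because the reader keeps the quoted newline inside a single field. A short sketch, assuming the sample above is saved as foo.csv:

import csv

# csv.reader understands quoting, so "hello\nworld" stays one field
# and the file yields 4 records: the header row plus 3 data rows
with open('foo.csv', newline='') as f:
    rows = list(csv.reader(f))

print(len(rows) - 1)   # 3 data records
print(rows[1])         # ['1', 'hello\nworld', 'sds', 'sds']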

Read a log file in R

I'm trying to read a log file in R.
It looks like an extract from a JSON file to me, but when trying to read it using jsonlite I get the following error message: "Error: parse error: trailing garbage".
Here is what my log file looks like:
{"date":"2017-05-11T04:37:15.587Z","userId":"admin","module":"Quote","action":"CreateQuote","identifier":"-.admin1002"},
{"date":"2017-05-11T05:12:24.939Z","userId":"a145fhyy","module":"Quote","action":"Call","identifier":"RunUY"},
{"date":"2017-05-11T05:12:28.174Z","userId":"a145fhyy","license":"named","usage":"External","module":"Catalog","action":"OpenCatalog","identifier":"wks.klu"},
As you can see, the column name is specified directly in front of the content on each line (e.g. "date": or "action":).
And some lines can skip some columns and add others.
What I want as output is 7 columns, with the corresponding data filled into each:
date
userId
license
usage
module
action
identifier
Does anyone have a suggestion about how to get there?
Thanks a lot in advance
/////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////////
Thanks everyone for your answers. Here are some clarifications about my issue:
The data that I gave as an example is an extract from one of my log files. I've got a lot of them that I need to read as one single table.
I haven't added any commas or anything to it.
@r2evans
I've tried the following:
Log3 <- read.table("/Projects/data/analytics.log.agregated.2017-05-11.log")
jsonlite::stream_in(textConnection(gsub(",$", "", Log3)))
It returns the following error:
Error: lexical error: invalid char in json text.
c(17, 18, 19, 20, 21, 22, 23, 2
(right here) ------^
I'm not sure how to use sed -e 's/,$//g' infile > outfile and Sys.which("sed"); that's something I'm not familiar with. I'm looking into it, but if you have any more details to give me about its usage, that would be great.
I have saved your example as a file "test.json" and was able to read and parse it like this:
library(jsonlite)
library(readr)  # read_file() comes from readr, not jsonlite

rf <- read_file("test.json")
rfr <- gsub("\\},", "\\}", rf)
data <- stream_in(textConnection(rfr))
It parses and simplifies into a neat data frame exactly like you want. What I do is look for "}," rather than ",$", because the very last comma is not (necessarily) followed by a newline character.
However, this might not be the best solution for very large files. For those you may need to first look for a way to modify the text file itself by getting rid of the commas, as sketched below. Or, if that's possible, ask the people who exported this file to export it in a normal NDJSON format :-)
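One such pre-processing option (my own sketch, not part of the original answer) is to stream the log through a small filter that strips the trailing comma from each line, producing valid NDJSON that jsonlite::stream_in() can then read from a file connection without holding everything in memory. A Python version, with strip_commas.py as a hypothetical script name:

import sys

# Drop the trailing comma after each JSON object so that every
# line becomes a standalone JSON document (NDJSON)
for line in sys.stdin:
    line = line.rstrip('\n').rstrip()
    if line.endswith(','):
        line = line[:-1]
    if line:
        print(line)

Run python3 strip_commas.py < analytics.log > analytics.ndjson, then in R: jsonlite::stream_in(file("analytics.ndjson")).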

Very weird behaviour in Neo4j load CSV

What I'm trying to import is a CSV file with phone calls, representing phone numbers as nodes and each call as an arrow.
The file is separated by pipes.
I have tried a first version:
load csv from 'file:///com.csv' as line FIELDTERMINATOR '|'
with line
merge (a:line {number:COALESCE(line[1],"" )})
return line
limit 5
and it worked as expected: one node (outgoing number) is created for each row.
After that I could test what I've done with a simple
Match (a) return a
So the following step I tried was creating the second node of the call (the receiver):
load csv from 'file:///com.csv' as line FIELDTERMINATOR '|'
with line
merge (a:line {number:COALESCE(line[1],"" )})
merge (b:line {number:COALESCE(line[2],"" )})
return line
limit 5
After I run this code I receive no response from the operation (I'm using the browser GUI at localhost:7474/browser), and if I try to perform any query on this server I get no result either.
So again if I run
match (a) return a
nothing happens.
The only way I've found to get back to life is stopping the server and starting it again.
Any ideas?
It is possible that opening that big file twice causes the problem, because how big files are handled depends heavily on the operating system.
Anyway, it can also happen if you accidentally run it without the 'limit 5' clause, since then you are trying to load the 26 GB in a single transaction.
Since LOAD CSV is meant for medium-sized datasets, I recommend two solutions:
- using the neo4j-import tool, or
- splitting the file into smaller parts (a sketch follows below) and using periodic commit to prevent out-of-memory situations and hangs, like this:
USING PERIODIC COMMIT 100000
LOAD CSV FROM ...
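If you go the splitting route, here is a minimal sketch (file names are placeholders; the file has no header line, which keeps the split simple):

CHUNK_LINES = 1_000_000  # rows per output file

with open('com.csv') as src:
    chunk, n = None, 0
    for i, line in enumerate(src):
        if i % CHUNK_LINES == 0:
            # start a new numbered output file every CHUNK_LINES rows
            if chunk:
                chunk.close()
            n += 1
            chunk = open(f'com_part{n:03d}.csv', 'w')
        chunk.write(line)
    if chunk:
        chunk.close()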

Pandas read_csv errors on number of fields, but visual inspection looks fine

I'm trying to load a large CSV file, 3,715,259 lines long.
I created this file myself and there are 9 fields separated by commas.
Here's the error:
df = pd.read_csv("avaya_inventory_rev2.csv", error_bad_lines=False)
Skipping line 2924525: expected 9 fields, saw 11
Skipping line 2924526: expected 9 fields, saw 10
Skipping line 2924527: expected 9 fields, saw 10
Skipping line 2924528: expected 9 fields, saw 10
This doesn't make sense to me. I inspected the offending lines using:
sed -n "2924524,2924525p" infile.csv
I can't list the outputs as they contain proprietary information for a client. I'll try to synthesize a meaningful replacement.
Lines 2924524 and 2924525 look to have the same number of fields to me.
Also, I was able to load the same file into a MySQL table with no error:
create table Inventory (path varchar (255), isText int, ext varchar(5), type varchar(100), size int, sloc int, comments int, blank int, tot_lines int);
I don't know enough about MySQL to understand why that may or may not be a valid test, or why pandas would have a different outcome when loading the same file.
TIA !
UPDATE: I tried to read with engine='python':
Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
When I create this CSV, I'm using a shell script I wrote; I feed lines to the file with the >> redirect.
I tried the suggested fix:
f = open("avaya_inventory_rev2.csv", 'rU')  # 'rU' = universal-newline mode
df = pd.read_csv(f, engine='python')
Back to the same error:
ValueError: Expected 9 fields in line 5157, saw 11
I'm guessing it has to do with my CSV creation script and how I dealt with quoting in it. I don't know how to investigate this further.
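One way to look for the bad rows is a field-count scan (a sketch; EXPECTED is the 9-field schema from above, and the file name is the one from the read_csv call):

import csv

EXPECTED = 9  # the schema has 9 comma-separated fields

# Report every record whose parsed field count differs from the schema,
# along with its record number, so the bad rows can be pulled up directly
with open('avaya_inventory_rev2.csv', newline='') as f:
    for lineno, row in enumerate(csv.reader(f), start=1):
        if len(row) != EXPECTED:
            print(lineno, len(row), row)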
I opened the CSV input file in vim, and on line 5157 there's a ^M, which Google says is a Windows CR.
OK... I'm closer, although I did kind of suspect something like this and had used dos2unix on the CSV input.
I removed the ^M using vim and re-ran: same error about 11 fields. However, I can now see the 11 fields, whereas before I just saw 9. There are ,v suffixes, which I took for some kind of Windows holdover?
SUMMARY: somebody thought it'd be cute to name files like fobar.sh,v.
So my profiler didn't mess up; it was just name weirdness... plus the random CR/LF from Windows that snuck in.
Cheers
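For anyone hitting the same thing: the general fix is to generate the CSV with a writer that quotes fields containing the delimiter, rather than with raw >> redirects. A minimal sketch (the row is made up to mirror the 9-field schema above):

import csv

# hypothetical row mirroring the 9-field schema above
rows = [('src/fobar.sh,v', 1, 'sh', 'script', 1024, 120, 10, 5, 135)]

with open('avaya_inventory_rev2.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    # csv.writer quotes any field containing the delimiter, so the
    # comma in "fobar.sh,v" can no longer split the row into extra fields
    writer.writerows(rows)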