I am trying to import a bunch of .tsv files into a MySQL database. However, some of the files have errors in some of the rows (the files were generated from another system where data is manually inputted, so these errors are human errors). When I use LOAD DATA INFILE to import them and the command reaches a row of bad data, it writes NULL values for that field and then stops, whereas I need it to keep going.
The bad rows look like this:
value1, value 2, value 3
bob, 3, st
john, 4, rd
dianne4ln
jack, 7, cir
I've made sure the line terminators are correct, and I have used the Ignore and Replace parameters to no avail.
Use IGNORE in your query to skip the bad lines and proceed with the load.
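For example, a minimal sketch (the table name, file path, and terminators are placeholders, not taken from your setup):
-- Sketch only: with IGNORE, errors on malformed rows become warnings and the
-- load continues; you can inspect them afterwards with SHOW WARNINGS.
LOAD DATA INFILE '/path/to/file.tsv'
IGNORE
INTO TABLE my_table
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n';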
I'm having some difficulties creating a table in Google BigQuery using CSV data that we download from another system.
The goal is to have a bucket in Google Cloud Platform to which we will upload one CSV file per month. These CSV files have around 3,000-10,000 rows of data, depending on the month.
The error I am getting from the job history in the Big Query API is:
Error while reading data, error message: CSV table encountered too
many errors, giving up. Rows: 2949; errors: 1. Please look into the
errors[] collection for more details.
When I am uploading the CSV files, I am selecting the following:
file format: csv
table type: native table
auto detect: tried automatic and manual
partitioning: no partitioning
write preference: WRITE_EMPTY (cannot change this)
number of errors allowed: 0
ignore unknown values: unchecked
field delimiter: comma
header rows to skip: 1 (also tried 0 and manually deleting the header rows from the csv files).
Any help would be greatly appreciated.
This usually points to an error in the structure of the data source (in this case your CSV file). Since your CSV file is small, you can run a little validation script to check that the number of columns is exactly the same across all rows in the CSV before running the load.
Maybe something like:
cat myfile.csv | awk -F, '{ a[NF]++ } END { for (n in a) print a[n], "rows have", n, "columns" }'
Or you can wrap it in a condition (let's say your number of columns should be 5):
ncols=$(awk -F, '{ print NF }' myfile.csv | sort -nu); if [ "$ncols" = "5" ]; then python myexportscript.py; else echo "number of columns invalid: $ncols"; fi
It's impossible to point out the error without seeing an example CSV file, but it's very likely that your file is incorrectly formatted, and a single typo can confuse BQ into reporting thousands of errors. Let's say you have the following csv file:
Sally Whittaker,2018,McCarren House,312,3.75
Belinda Jameson 2017,Cushing House,148,3.52 //Missing a comma after the name
Jeff Smith,2018,Prescott House,17-D,3.20
Sandy Allen,2019,Oliver House,108,3.48
With the following schema:
Name(String) Class(Int64) Dorm(String) Room(String) GPA(Float64)
Since the second row is missing a comma, everything after the name is shifted one column over. If you have a large file, this results in thousands of errors as BigQuery attempts to insert Strings into Ints/Floats.
I suggest you run your csv file through a csv validator before uploading it to BQ. It might find something that breaks it. It's even possible that one of your fields has a comma inside the value which breaks everything.
Another thing to investigate is whether all required columns receive an appropriate (non-null) value. A common cause of this error is casting data incorrectly, which returns a null value for a specific field in every row.
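As a small illustration of that last point (BigQuery Standard SQL, values are made up): a failed SAFE_CAST quietly returns NULL instead of raising an error, and that NULL then violates any REQUIRED column it is written into:
-- Illustration only: the cast fails because of the trailing text,
-- so the result is NULL rather than an error.
SELECT SAFE_CAST('3.75 GPA' AS FLOAT64) AS gpa;  -- returns NULL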
As mentioned by Scicrazed, this issue is usually generated because some file rows have an incorrect format, in which case you need to validate the content of the data in order to figure out the specific error behind the failure.
I recommend checking the errors[] collection, which might contain additional information about what is making the process fail. You can do this by using the Jobs: get method, which returns detailed information about your BigQuery job, or by referring to the additionalErrors field of the JobStatus Stackdriver logs, which contains the same complete error data that is reported by the service.
I'm probably too late for this, but it seems the file has some errors (it can be a character that cannot be parsed or just a string in an int column) and BigQuery cannot upload it automatically.
You need to understand what the error is and fix it somehow. An easy way to do it is by running this command on the terminal:
bq --format=prettyjson show -j <JobID>
and you will be able to see additional logs for the error to help you understand the problem.
If the error happens only a few times, you can just increase the number of errors allowed.
If it happens many times you will need to manipulate your CSV file before you upload it.
Hope it helps
I'm attempting to COPY a CSV file to Redshift from an S3 bucket. When I execute the command, I don't get any error messages, however the load doesn't work.
Command:
COPY temp FROM 's3://<bucket-redacted>/<object-redacted>.csv'
CREDENTIALS 'aws_access_key_id=<redacted>;aws_secret_access_key=<redacted>'
DELIMITER ',' IGNOREHEADER 1;
Response:
Load into table 'temp' completed, 0 record(s) loaded successfully.
I attempted to isolate the issue via the system tables, but there is no indication there are issues.
Table Definition:
CREATE TABLE temp ("id" BIGINT);
CSV Data:
id
123,
The lines in your csv file probably don't end with a Unix newline character, so the COPY command probably sees your file as:
id123,
Given you have the IGNOREHEADER option enabled, and the line endings in the file aren't what COPY is expecting (my assumption based on past experience), the file contents get treated as one line, and then skipped.
I had this occur for some files created from a Windows environment.
I guess one thing to remember is that CSV is not a standard, more a convention, and different products/vendors have different implementations for csv file creation.
I repeated your instructions, and it worked just fine:
First, the CREATE TABLE
Then, the LOAD (from my own text file containing just the two lines you show)
This resulted in:
Code: 0 SQL State: 00000 --- Load into table 'temp' completed, 1 record(s) loaded successfully.
So, there's nothing obviously wrong with your commands.
At first, I thought that the comma at the end of your data line could cause Amazon Redshift to think that there is an additional column of data that it can't map to your table, but it worked fine for me. Nonetheless, you might try removing the comma, or creating an additional column to store this 'empty' value.
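A minimal sketch of that second suggestion (the extra column name is just an example):
-- Add a throwaway column so the field after the trailing comma has somewhere to go.
CREATE TABLE temp ("id" BIGINT, "extra" VARCHAR(1));

COPY temp FROM 's3://<bucket-redacted>/<object-redacted>.csv'
CREDENTIALS 'aws_access_key_id=<redacted>;aws_secret_access_key=<redacted>'
DELIMITER ',' IGNOREHEADER 1;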
I want to read a csv file into SAS, but I only want to read in part of the file.
For example, I want my first row of data to start at row 18, while I want to read in columns 9, 11, 12, 13, 19, 20, 36. Is there an efficient way of doing this manually in a data step to read in the file portions I want, or is my best bet just to read in the entire file using the import wizard and just keep the columns of desire?
You can change the row you start at with the DATAROW option on PROC IMPORT, or the FIRSTOBS option on a data step INFILE statement.
You cannot easily read in only select columns, however. You would have to read in all columns up to the last column you are interested in, and then drop the uninteresting ones. You could read them all in with a $1 character called "blank" or something (even the same name each time), but you do have to ask for them.
The only workaround would be to write a regular expression to read in your data, in which case you could tell it to look for ,.*?,.*?, etc. for each skipped column.
If you can use variable names instead of column numbers, this will work. I'd recommend using variable names instead of numbers anyways, as it adds substantive meaning to your code and might help you catch a problem if the input file columns are ever changed.
PROC IMPORT datafile = "filename.csv"
out = data_read (keep = var1 var2 var3)
dbms = csv
replace;
datarow = 18;
RUN;
ICE Version: infobright-3.5.2-p1-win_32
I’m trying to load a large file but keep running into problems with errors such as:
Wrong data or column definition. Row: 989, field: 5.
This is row 989, field 5:
”(450)568-3***"
Note: the last 3 chars are numbers as well, but I didn't want to post somebody's phone number on here.
It’s really no different to any of the other entries in that field.
The datatype of that field is VARCHAR(255) NOT NULL
Also, if you upgrade to the current release, 4.0.6, we now support row-level error checking during LOAD and a reject file.
To enable the reject file functionality, you must specify BH_REJECT_FILE_PATH and one of the associated parameters (BH_ABORT_ON_COUNT or BH_ABORT_ON_THRESHOLD). For example, if you want to load data from the file DATAFILE.csv into table T but you expect that 10 rows in this file might be wrongly formatted, you would run the following commands:
set @BH_REJECT_FILE_PATH = '/tmp/reject_file';
set @BH_ABORT_ON_COUNT = 10;
load data infile 'DATAFILE.csv' into table T;
If less than 10 rows are rejected, a warning will be output, the load will succeed and all problematic rows will be output to the file /tmp/reject_file. If the Infobright Loader finds a tenth bad row, the load will terminate with an error and all bad rows found so far will be output to the file /tmp/reject_file.
I've run into this issue when the last line of the file is not terminated by the value of --lines-terminated-by="\n".
For example, if I am importing a file with 9,000 lines of data, I have to make sure there is a newline at the end of the file.
Depending on the size of the file, you can just open it with a text editor and hit the return key at the end of the last line.
I have found this to be consistent with the '\r\n' vs. '\n' difference. Even when running the loader on Windows, '\n' succeeds 100% of the time (assuming you don't have real issues with your data vs. the column definitions).
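If it helps, a sketch of stating the terminator explicitly for a file with Windows-style line endings (the table and file names just follow the earlier example):
-- Sketch only: spell out the line terminator so '\r\n' endings don't get
-- folded into the last field of each row.
LOAD DATA INFILE 'DATAFILE.csv'
INTO TABLE T
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\r\n';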
I've started learning SQL over the past few days, but am stuck while attempting to get my data into the table.
The data's stored in a text file with the format:
ColumnName1=SomeInteger
ColumnName2=SomeInteger
ColumnName3=SomeString
... etc
So far I've managed to create a table (which has about 150 columns that I'm hoping to split up and group separately once I know more) by stripping the =SomeValue in Python and then wrapping the column names with CREATE TABLE in a spreadsheet. A bit messy, but it works for now.
Now I'm stuck at the following point:
LOAD DATA INFILE 'path/to/file.txt'
INTO TABLE tableName
COLUMNS TERMINATED BY '\n'
LINES STARTING BY '=';
I'm trying to get SQL to insert the data into the column names specified (in case they're not always in the same order), ignore the equals sign, and use the unique filename as my index.
I've also tried escaping the equals character with '\=', because the MySQL documentation mentions that everything before the LINES STARTING BY string should be ignored. Typing LINES STARTING BY 'ColumnName1=' manages to ignore the first instance, but it's not exactly what I want, and it doesn't work for the remaining lines.
I'm not averse to reading more documentation or tutorials, if someone could point me in the right direction.
edit: To clarify how the rows are delimited: I've been given about 100,000 ini files, each named FirstName_LastName.ini (uniqueness is guaranteed), and each file contains the data for one row. I need to bring this archaic method of account storage into the 21st century.
MySQL's LOAD DATA is rumored to be especially fast for this type of task, which is why I began looking into it as an option. I was just wondering if it's possible to manipulate it to work with data in my format, or if I'm better off putting all 100k files through a parser. I'm still open to suggestions that use SQL if there are any magicians reading this.
p.s: If anyone has better ideas for how to get my data (from this text format) into individual tables, I'd love to hear them too.
Personally, I would probably do the whole thing in python, using the MySQLdb module (probably available in a package named something like python-mysqldb or MySQL-python in your favorite distribution). Format your data into a list of tuples and then insert it. Example from http://mysql-python.sourceforge.net/MySQLdb.html:
import MySQLdb
datalist = [("Spam and Sausage Lover's Plate", 5, 1, 8, 7.95),
            ("Not So Much Spam Plate", 3, 2, 0, 3.95),
            ("Don't Want ANY SPAM! Plate", 0, 4, 3, 5.95)]

db = MySQLdb.connect(user='dude', passwd='foo', db='mydatabase')
c = db.cursor()
c.executemany(
    """INSERT INTO breakfast (name, spam, eggs, sausage, price)
       VALUES (%s, %s, %s, %s, %s)""",
    datalist)
db.commit()  # MySQLdb does not autocommit, so the inserts are lost without this
db.close()