I have to move a table from MS SQL Server to MySQL (~8M rows with 8 columns). One of the columns (DECIMAL type) is exported as an empty string when I export to a CSV file with "bcp". When I use this CSV file to load the data into the MySQL table, it fails with "Incorrect decimal value".
I'm looking for possible workarounds or suggestions.
I would create a view in MS SQL which converts the decimal column to a varchar column:
CREATE VIEW MySQLExport AS
SELECT [...]
COALESCE(CAST(DecimalColumn AS VARCHAR(50)),'') AS DecimalColumn
FROM SourceTable;
Then, import into a staging table in MySQL, and use a CASE statement for the final INSERT:
INSERT INTO DestinationTable ([...])
SELECT [...]
CASE DecimalColumn
WHEN '' THEN NULL
ELSE CAST(DecimalColumn AS DECIMAL(10,5))
END AS DecimalColumn,
[...]
FROM ImportMSSQLStagingTable;
This is safe because the only way the value can be an empty string in the export file is if it's NULL.
Note that I doubt you can cheat by exporting it with COALESCE(CAST(DecimalColumn AS VARCHAR(50)),'\N'), because LOAD DATA INFILE would see that as the string '\N', which is not the same as \N.
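If a staging table is not wanted, a file-level alternative can be sketched as follows. This is only a sketch under assumptions: the bcp output is taken to be a plain comma-delimited file with no quoting and no commas inside values, LOAD DATA is assumed to run with the default backslash escape character (so that a bare \N field is read as NULL), and the names null_empty_field, rewrite_file and col_index are made up for the illustration.

```python
def null_empty_field(line, col_index, sep=","):
    """Replace an empty field at col_index with \\N (the LOAD DATA NULL marker)."""
    fields = line.rstrip("\r\n").split(sep)
    if fields[col_index] == "":
        fields[col_index] = r"\N"
    return sep.join(fields)


def rewrite_file(src_path, dst_path, col_index, sep=","):
    # Stream line by line so a file with millions of rows never has to fit in memory.
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        for line in src:
            dst.write(null_empty_field(line, col_index, sep) + "\n")
```

Only the one column that is known to be DECIMAL is touched, so empty strings in other columns stay empty strings.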
I am working on some benchmarks and need to compare ORC, Parquet and CSV formats. I have exported TPC/H (SF1000) to ORC based tables. When I want to export it to Parquet I can run:
CREATE TABLE hive.tpch_sf1_parquet.region
WITH (format = 'parquet')
AS SELECT * FROM hive.tpch_sf1_orc.region
When I try a similar approach with CSV, I get the error Hive CSV storage format only supports VARCHAR (unbounded). I would have assumed that it would convert the other data types (e.g. bigint) to text and store the column format in the Hive metadata.
I can export the data to CSV using trino --server trino:8080 --catalog hive --schema tpch_sf1_orc --output-format=CSV --execute 'SELECT * FROM nation', but then it gets emitted to a file. Although this works for SF1, it quickly becomes unusable at the SF1000 scale factor. Another disadvantage is that my Hive metastore wouldn't have the appropriate metadata (although I could patch it manually if nothing else works).
Does anyone have an idea how to convert my ORC/Parquet data to CSV using Hive?
In the Trino Hive connector, a CSV table can contain only varchar columns.
You need to cast the exported columns to varchar when creating the table:
CREATE TABLE region_csv
WITH (format='CSV')
AS SELECT CAST(regionkey AS varchar) AS regionkey, CAST(name AS varchar) AS name, CAST(comment AS varchar) AS comment
FROM region_orc
Note that you will need to update your benchmark queries accordingly, e.g. by applying reverse casts.
DISCLAIMER: Read the full post before using anything discussed here. It's not real CSV and you might screw up!
It is possible to create typed CSV-ish tables when using the TEXTFILE format and use ',' as the field separator:
CREATE TABLE hive.test.region (
regionkey bigint,
name varchar(25),
comment varchar(152)
)
WITH (
format = 'TEXTFILE',
textfile_field_separator = ','
);
This will create a typed version of the table in the Hive catalog using the TEXTFILE format. By default it uses the ^A character (ASCII 1) as the field separator, but when the separator is set to ',' the file resembles CSV.
IMPORTANT: Although it looks like CSV, it is not real CSV. It doesn't follow RFC 4180, because it doesn't properly quote and escape. The following INSERT will not be written correctly:
INSERT INTO hive.test.region VALUES (
1,
'A "quote", with comma',
'The comment contains a newline
in it');
The text will be copied unmodified to the file without escaping quotes or commas. This should have been written like this to be proper CSV:
1,"A ""quote"", with comma","The comment contains a newline
in it"
Unfortunately, it is written as:
1,A "quote", with comma,The comment contains a newline
in it
This results in invalid data that will be represented by NULL columns. For this reason, this method can only be used when you have full control over the text-based data and are sure that it doesn't contain newlines, quotes, commas, ...
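The difference can be reproduced outside Trino with a small sketch. It uses Python's standard csv module as a stand-in for an RFC 4180 style writer and a plain separator join as a stand-in for what TEXTFILE does; neither is the actual Hive code path, and the variable names are illustrative only.

```python
import csv
import io

row = [1, 'A "quote", with comma', "The comment contains a newline\nin it"]

# A plain separator join: no quoting, no escaping (TEXTFILE-like behaviour).
naive = ",".join(str(v) for v in row) + "\n"

# A writer that quotes fields containing commas, quotes or newlines.
buf = io.StringIO(newline="")
csv.writer(buf).writerow(row)
proper = buf.getvalue()

# Read both texts back with a CSV parser and compare the shapes.
naive_rows = list(csv.reader(io.StringIO(naive, newline="")))
proper_rows = list(csv.reader(io.StringIO(proper, newline="")))

print(len(naive_rows), [len(r) for r in naive_rows])
print(len(proper_rows), [len(r) for r in proper_rows])
```

The quoted text parses back into a single row with three columns, while the joined text breaks apart into more rows and columns than were written.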
I've got a process that creates a csv file that contains ONE set of values that I need to import into a field in a MySQL database table. This process creates a specific file name that identifies the values of the other fields in that table. For instance, the file name T001U020C075.csv would be broken down as follows:
T001 = Test 001
U020 = User 020
C075 = Channel 075
The file contains a single row of data separated by commas for all of the test results for that user on a specific channel and it might look something like:
12.555, 15.275, 18.333, 25.000 ... (there are hundreds, maybe thousands, of results per user, per channel).
What I'm looking to do is to import directly from the CSV file adding the field information from the file name so that it looks something like:
insert into results (test_no, user_id, channel_id, result) values (1, 20, 75, 12.555)
I've tried to use "Bulk Insert" but that seems to want to import all of the fields where each ROW is a record. Sure, I could go into each file and convert the row to a column and add the data from the file name into the columns preceding the results but that would be a very time consuming task as there are hundreds of files that have been created and need to be imported.
I've found several "import CSV" solutions but they all assume all of the data is in the file. Obviously, it's not...
The process that generated these files is unable to be modified (yes, I asked). Even if it could be modified, it would only provide the proper format going forward and what is needed is analysis of the historical data. And, the new format would take significantly more space.
I'm limited to using either MATLAB or MySQL Workbench to import the data.
Any help is appreciated.
Bob
A possible SQL approach to getting the data loaded into the table would be to run a statement like this:
LOAD DATA LOCAL INFILE '/dir/T001U020C075.csv'
INTO TABLE results
FIELDS TERMINATED BY '|'
LINES TERMINATED BY ','
( result )
SET test_no = '001'
, user_id = '020'
, channel_id = '075'
;
We need the comma to be the line separator. For the field separator, we can specify some character that is guaranteed not to appear in the data. That way LOAD DATA sees a single "field" on each "line".
(If there isn't a trailing comma at the end of the file, after the last value, we need to test to make sure we are getting the last value, i.e. the last "line" as we're telling LOAD DATA to look at the file.)
We could use user-defined variables in place of the literals, but that leaves the part about parsing the filename. That's really ugly in SQL, but it could be done, assuming a consistent filename format...
-- parse filename components into user-defined variables
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(f.n,'T',-1),'U',1) AS t
, SUBSTRING_INDEX(SUBSTRING_INDEX(f.n,'U',-1),'C',1) AS u
, SUBSTRING_INDEX(f.n,'C',-1) AS c
, f.n AS n
FROM ( SELECT SUBSTRING_INDEX(SUBSTRING_INDEX( i.filename ,'/',-1),'.csv',1) AS n
FROM ( SELECT '/tmp/T001U020C075.csv' AS filename ) i
) f
INTO @ls_t
, @ls_u
, @ls_c
, @ls_n
;
While we're testing, we probably want to see the result of the parsing.
-- for debugging/testing
SELECT @ls_t
, @ls_u
, @ls_c
, @ls_n
;
Then comes running the actual LOAD DATA statement. We've got to specify the filename again, and we need to make sure we're using the same filename ...
LOAD DATA LOCAL INFILE '/tmp/T001U020C075.csv'
INTO TABLE results
FIELDS TERMINATED BY '|'
LINES TERMINATED BY ','
( result )
SET test_no = @ls_t
, user_id = @ls_u
, channel_id = @ls_c
;
(The client will need read permission on the .csv file.)
Unfortunately, we can't wrap this in a procedure, because running a LOAD DATA statement is not allowed from a stored program.
Some would correctly point out that as a workaround, we could compile/build a user-defined function (UDF) to execute an external program, and a procedure could call that. Personally, I wouldn't do it. But it is an alternative we should mention, given the constraints.
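If a scripting step outside the database ever becomes available, the same filename parsing and row splitting can be sketched in a few lines. This is only an illustration, assuming the T...U...C... filename pattern from the question holds for every file; the helper names parse_filename and rows_from_line are made up for the sketch.

```python
import re
from pathlib import Path

# Hypothetical pattern for names such as T001U020C075 (without the extension).
FILENAME_RE = re.compile(r"^T(\d+)U(\d+)C(\d+)$")


def parse_filename(path):
    """Return (test_no, user_id, channel_id) as integers."""
    match = FILENAME_RE.match(Path(path).stem)
    if match is None:
        raise ValueError(f"unexpected filename: {path}")
    return tuple(int(group) for group in match.groups())


def rows_from_line(path, line):
    """Turn the single comma-separated line into one tuple per result."""
    test_no, user_id, channel_id = parse_filename(path)
    values = [v.strip() for v in line.split(",")]
    # Skip empty pieces so a trailing comma does not produce a blank row.
    return [(test_no, user_id, channel_id, v) for v in values if v]
```

The results are kept as strings so that values such as 25.000 reach the database exactly as written in the file.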
I am migrating a MySQL 5.5 physical host database to a MySQL 5.6 AWS Aurora database. I noticed that when data is written to a file using INTO OUTFILE, 5.5 writes NULL value as '\N' and empty string as ''. However, 5.6 writes both empty string and NULL as ''.
Query
SELECT * FROM $databasename.$tablename INTO OUTFILE $filename CHARACTER SET utf8 FIELDS ESCAPED BY '\\\\' TERMINATED BY $delimiter;
I found official documents about this:
https://dev.mysql.com/doc/refman/5.6/en/load-data.html
With fixed-row format (which is used when FIELDS TERMINATED BY and
FIELDS ENCLOSED BY are both empty), NULL is written as an empty
string. This causes both NULL values and empty strings in the table to
be indistinguishable when written to the file because both are written
as empty strings. If you need to be able to tell the two apart when
reading the file back in, you should not use fixed-row format.
How do I export NULL as '\N'?
First of all, that's unusual; why do you want to do that? But if for some reason you want to export it that way, you will have to change your query from select * to one that uses a CASE expression, like
select
case when col1 is null then '\\N' else col1 end as col1,
...
from $databasename.$tablename....
As noted in the comments, you can also use the IFNULL() or COALESCE() function for the same purpose.
I'm building an AWS pipeline to insert CSV files from S3 to an RDS MySQL DB. The problem I'm facing is that when it attempts to load the file, it treats blanks as empty strings instead of NULLs. For example, Line 1 of the CSV is:
"3","John","Doe",""
Where the value is an integer in the MySQL table, and of course the error in the pipeline is:
Incorrect integer value: '' for column 'col4' at row 1
I was researching the JDBC MySQL parameters to modify the connection string:
jdbc:mysql://my-rds-endpoint:3306/my_db_name?
jdbcCompliantTruncation=false
jdbcCompliantTruncation is just an example. Is there any parameter among these that can help me insert those blanks as NULLs?
Thanks!
EDIT:
A little context, the CSV files are UNLOADS from redshift, so the blanks are originally NULLs when I put them in S3.
the csv files are UNLOADS from redshift
Then look at the documentation for the Redshift UNLOAD command and add the NULL AS option. For example:
NULL AS 'NULL'
Use null as '\\N' in the UNLOAD so that NULLs are written as \N rather than blanks:
unload ('SELECT * FROM table')
to 's3://path' credentials
'aws_access_key_id=sdfsdhgfdsjfhgdsjfhgdsjfh;aws_secret_access_key=dsjfhsdjkfhsdjfksdhjkfsdhfjkdshfs'
delimiter '|' null as '\\N' ;
I resolved this issue using the NULLIF function:
insert into table values (NULLIF(?,''),NULLIF(?,''),NULLIF(?,''),NULLIF(?,''))
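The same blank-to-NULL mapping can also be done on the client side before the values are bound to the statement. A minimal sketch, assuming the rows are read with a CSV parser and that the database driver in use maps None to SQL NULL (as Python DB-API drivers do); blanks_to_none is a name invented for this example.

```python
import csv


def blanks_to_none(row):
    """Mirror NULLIF(?, ''): empty strings become None, everything else is kept."""
    return [None if value == "" else value for value in row]


line = '"3","John","Doe",""'
params = blanks_to_none(next(csv.reader([line])))
print(params)
```

Like NULLIF(?,''), this treats every empty string as NULL, so it only fits when an empty string is never a legitimate value in the data.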
I need to export a large number of tables (~50) as TSVs from a remote MySQL server (so SELECT INTO OUTFILE is not an option, per the documentation). I am using mysql -e 'SELECT * FROM table' > file.tsv (in a script that loops over each of the ~50 tables). The problem is that this way, NULL values in all the tables are represented as the string 'NULL' instead of \N. The 'NULL' then gets cast to odd, undesirable values when the TSVs are loaded back into a local DB (using LOAD DATA INFILE). For example, a date column with NULL is read as '0000-00-00', and a varchar(3) column is read as 'NUL'.
I've confirmed that when using SELECT INTO OUTFILE, NULLs are correctly represented as \N in the TSV and are therefore loaded back into the DB correctly. I've also confirmed that if I change the 'NULL' in the TSV to \N, the data is loaded correctly.
My question is how to export the data from the remote server such that the TSVs contain \N in the first place. Are there better ways than doing a SELECT * and redirecting the output to a file? I'd appreciate any tips, as this NULL issue is bothersome.
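The manual fix mentioned above (changing 'NULL' to \N in the TSV) can be automated with a small post-processing step. This is only a sketch of that workaround, not a way to make the client emit \N itself, and it rests on a loud assumption: no column ever holds the literal four-character string NULL, because after the export a real NULL and that string look identical. The function name is invented for the example.

```python
def null_tokens_to_backslash_n(line):
    """Rewrite fields that are exactly NULL as \\N in one tab-separated line."""
    fields = line.rstrip("\n").split("\t")
    # Only whole-field matches are replaced; values that merely contain
    # the letters NULL (for example NULLABLE) are left untouched.
    return "\t".join(r"\N" if field == "NULL" else field for field in fields)


def convert_file(src_path, dst_path):
    # Process the export line by line to keep memory use flat.
    with open(src_path, newline="") as src, open(dst_path, "w", newline="") as dst:
        for line in src:
            dst.write(null_tokens_to_backslash_n(line) + "\n")
```

If the export includes a header line with column names, it passes through unchanged unless a column happens to be named NULL.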