Importing Scalding-produced CSV into MySQL

I produced a CSV file using Scalding's default Csv writer (specifying only the p parameter, the path to write to, and none of the other parameters that control how the CSV data is written) that I am looking to import into MySQL. I am running into a problem on the import.
Example queries to load the data:
CREATE TABLE `example_table` (
`a` varchar(255) DEFAULT NULL,
`b` varchar(255) DEFAULT NULL,
`c` varchar(255) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
LOAD DATA LOCAL INFILE '~/example.csv'
INTO TABLE example_table
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(a,b,c)
Example data (i.e. ~/example.csv):
row1,this is neither quoted nor enclosed,"This is quoted, and contains a comma"
row2,"this is enclosed","This is quoted, and contains a comma"
row3,""this"" is quoted at the start,"This is quoted, and contains a comma"
row4,"""this"" is quoted at the start and enclosed","This is quoted, and contains a comma"
When I run the queries with the data file, the resulting table is (excuse the formatting, I can't figure out how to make a table nicely here):
row1|this is neither quoted nor enclosed|This is quoted, and contains a comma
row2|this is enclosed|This is quoted, and contains a comma
row3|"this" is quoted at the start,"This is quoted, and contains a comma|NULL
row4|"this" is quoted at the start and enclosed|This is quoted, and contains a comma
Row 3 is malformed, and this is how Scalding outputs CSV when a field value is "this" is quoted at the start (i.e. the string has quotes at the beginning but does not contain the field delimiter; if it did, Scalding would enclose the whole field, as in row 4).
Is there a way, by tweaking the FIELDS TERMINATED BY, OPTIONALLY ENCLOSED BY, etc. options in MySQL, to get it to import the fields correctly?
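One reading of MySQL's documented behavior (my inference, not from the original thread): with OPTIONALLY ENCLOSED BY '"', a field that merely begins with the quote character is treated as enclosed, and a quote closes the field only when followed by the field or line terminator, which is why row 3's field swallows the following comma. A hedged workaround sketch, with all staging names invented for illustration, is to sidestep quote handling entirely: load each raw line into a one-column staging table, repair the quoting in SQL, then split the fields.
CREATE TABLE example_staging (raw_line TEXT);
LOAD DATA LOCAL INFILE '~/example.csv'
INTO TABLE example_staging
FIELDS TERMINATED BY '\b' ESCAPED BY ''  -- backspace: assumed absent from the data; escaping disabled
LINES TERMINATED BY '\n'
(raw_line);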

Related

Impala create external table prevent parsing of newline in double quotes

Impala version: impalad version 4.0.0.2022.0.11.0-122
I have a CSV in S3 that has a field containing newlines, but the field is wrapped in double quotes. Viewing the CSV, the newlines sit correctly inside the field, but when issuing the CREATE statement in Impala, each newline is taken as an actual row terminator instead of part of the field value, which breaks the structure of the ingested CSV.
What can I do to ensure that newlines inside double-quoted field values are treated as data, not row terminators, in the Impala table?
SQL CREATE statement:
CREATE EXTERNAL TABLE IF NOT EXISTS schema_name.table_name (
`week` VARCHAR(10),
notes STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
-- ESCAPED BY '"' -- tried this, didn't work
STORED AS TEXTFILE
LOCATION 's3a://bucket_name/folder_name/'
TBLPROPERTIES("skip.header.line.count"="1")
-- Also tried this (get syntax error, also tried without ROW FORMAT keywords):
-- ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde' WITH SERDEPROPERTIES ( "separatorChar" = ",", "quoteChar" = """ )
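Two notes on the commented-out attempts, offered as assumptions rather than a confirmed fix: the SERDEPROPERTIES line fails to parse because """ is not a valid string literal (the quote character needs escaping), and even with OpenCSVSerde, a TEXTFILE is split into records on newlines before the SerDe ever runs, so embedded newlines in quoted fields generally still break rows. A corrected sketch of that attempt (Hive accepts this syntax, and OpenCSVSerde treats all columns as strings; some Impala versions reject ROW FORMAT SERDE and need the table created via Hive):
CREATE EXTERNAL TABLE IF NOT EXISTS schema_name.table_name (
`week` VARCHAR(10),
notes STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
STORED AS TEXTFILE
LOCATION 's3a://bucket_name/folder_name/'
TBLPROPERTIES ("skip.header.line.count"="1");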

Import CSV with comma in a field with load data infile

I have a CSV file with format as below
nconst,primaryName,birthYear,deathYear,primaryProfession,knownForTitles
nm0000001,Fred,1899,1987,"soundtrack,actor,miscellaneous","tt0072308,tt0045537"
nm0000002,Lauren,1924,2014,"actress,soundtrack","tt0038355,tt0117057,tt0037382"
Some fields are enclosed in double quotes ("), which by itself is not an issue, and the fields are delimited by commas (,), which also appear inside some of the quoted fields.
I am using below command in mysql command line:
load data local infile '/home/ec2-user/sample.csv' into table movies.`sample` fields terminated by ',' enclosed by '"' lines terminated by '\n' ignore 1 lines;
which in turn gives no error, but the data is inserted into the table in the wrong format below:
**nm0000001** Fred 1899 1987 soundtrack,actor,miscellaneous tt0043044,tt0072308,tt0050419,tt0045537" **nm0000002**,Lauren Bacall,1924,2014,"actress,soundtrack
As we can clearly see, data from the second row is appended to the first row.
Thanks in advance
EDIT:
Table definition:
CREATE TABLE `sample` (
`nconst` text,
`primaryName` text,
`birthYear` int(11) DEFAULT NULL,
`deathYear` text,
`primaryProfession` text,
`knownForTitles` text
) ENGINE=InnoDB DEFAULT CHARSET=latin1;
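A hedged guess at the cause, based on the symptom: the stray trailing " on knownForTitles and the run-together rows are the classic sign of Windows CRLF line endings, where the closing quote is followed by '\r' rather than '\n' and so never terminates the field. If that is the case, declaring the terminator explicitly should fix it:
load data local infile '/home/ec2-user/sample.csv' into table movies.`sample` fields terminated by ',' enclosed by '"' lines terminated by '\r\n' ignore 1 lines;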

How to modify text file so that it can be loaded into mysql database?

I have over 100 columns in my table and some can be null. I want to input from a txt file. So, some rows may not have all 100-odd entries.
For example, a line in my text file might have around 50 entries separated by a tab, but how do I specify which columns these 50 entries should map to out of the 100+ columns?
I've read about LOAD DATA INFILE but I'm still confused about my problem. Any help is appreciated.
Suppose the text file is:
Toby, LA
Carl, 246
and I want the resultant table to be:
Toby NULL LA
Carl 246 NULL
How do I do this?
The text file needs to be organized such that it has a consistent column set. If you have 50 tab-delimited values in one line and then 100 tab-delimited values in the next line then you'll never be able to load the data cleanly into MySQL.
Missing values still need to be accounted for in the text file, using empty string, NULL, etc to signify the missing value.
If you have some existing data in your table that includes NULL values then you can dump a subset of that data to disk in order to get an example of the proper way to format missing values in a tab-delimited file:
SELECT *
INTO OUTFILE '/tmp/test.tsv'
FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
FROM your_table
LIMIT 1000;
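A companion step, added here as a suggestion: once your text file matches the dumped format, import it with the same FIELDS/LINES options, so that \N (the default NULL marker under ESCAPED BY '\\') round-trips back to NULL:
LOAD DATA INFILE '/tmp/test.tsv' INTO TABLE your_table
FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';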
With this instruction:
LOAD DATA INFILE '/tmp/test.txt' INTO TABLE test
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
This is the path of your txt file:
'/tmp/test.txt'
You load the txt into the table named test.
Your txt should have this structure:
Toby, 245, LA
Carl, 246, CA
For a result like:
Toby NULL LA
Carl 246 NULL
The txt should be like:
Toby, NULL, LA
Carl, 246 , NULL
That's because FIELDS ESCAPED BY is not specified.
From the MySQL documentation:
Handling of NULL values varies according to the FIELDS and LINES options in use:
For the default FIELDS and LINES values, NULL is written as a field value of \N for output, and a field value of \N is read as NULL for input (assuming that the ESCAPED BY character is “\”).
If FIELDS ENCLOSED BY is not empty, a field containing the literal word NULL as its value is read as a NULL value. This differs from the word NULL enclosed within FIELDS ENCLOSED BY characters, which is read as the string 'NULL'.
If FIELDS ESCAPED BY is empty, NULL is written as the word NULL.
Three fields separated by ',' with lines ending in '\n'; your table test should have the same structure.
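Per the documentation excerpt above, a safer encoding of the asker's Toby/Carl example under the default ESCAPED BY '\\' would be \N rather than the word NULL; a minimal sketch:
-- /tmp/test.txt would contain (\N is read back as SQL NULL by default):
--   Toby,\N,LA
--   Carl,246,\N
LOAD DATA INFILE '/tmp/test.txt' INTO TABLE test
FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n';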

error 1265. Data truncated for column when trying to load data from txt file

I have a table in MySQL.
The table looks like:
create table Pickup
(
PickupID int not null,
ClientID int not null,
PickupDate date not null,
PickupProxy varchar (40) ,
PickupHispanic bit default 0,
EthnCode varchar(2),
CategCode varchar (2) not null,
AgencyID int(3) not null,
Primary Key (PickupID),
FOREIGN KEY (CategCode) REFERENCES Category(CategCode),
FOREIGN KEY (AgencyID) REFERENCES Agency(AgencyID),
FOREIGN KEY (ClientID) REFERENCES Clients (ClientID),
FOREIGN KEY (EthnCode) REFERENCES Ethnicity (EthnCode)
);
sample data from my txt file
1065535,7709,1/1/2006,,0,,SR,6
1065536,7198,1/1/2006,,0,,SR,7
1065537,11641,1/1/2006,,0,W,SR,24
1065538,9805,1/1/2006,,0,N,SR,17
1065539,7709,2/1/2006,,0,,SR,6
1065540,7198,2/1/2006,,0,,SR,7
1065541,11641,2/1/2006,,0,W,SR,24
When I try to load it using
LOAD DATA INFILE 'Pickup_withoutproxy2.txt' INTO TABLE pickup;
it throws error
Error Code: 1265. Data truncated for column 'PickupID' at row 1
I am using MySQL 5.2
This error means that at least one row in your Pickup_withoutproxy2.txt file has a value in its first column that is too large for an int (your PickupID field).
An int can only accept values between -2147483648 and 2147483647.
Review your data to see what's going on. You could try to load it into a temp table with a varchar data type if your txt file is extremely large and difficult to see. Easy enough to check for an int once loaded in the database.
Good luck.
You're missing FIELDS TERMINATED BY ','; without it, MySQL assumes the fields are delimited by tabs.
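A sketch of the corrected load, with two added assumptions beyond that answer: PickupDate is a DATE column while the file holds dates like 1/1/2006, so the value is routed through a user variable and STR_TO_DATE; likewise, BIT columns read raw bytes under LOAD DATA, so PickupHispanic goes through a CAST.
LOAD DATA INFILE 'Pickup_withoutproxy2.txt' INTO TABLE Pickup
FIELDS TERMINATED BY ','
(PickupID, ClientID, @d, PickupProxy, @h, EthnCode, CategCode, AgencyID)
SET PickupDate = STR_TO_DATE(@d, '%c/%e/%Y'),  -- e.g. 1/1/2006 -> 2006-01-01
PickupHispanic = CAST(@h AS UNSIGNED);  -- '0'/'1' text into the BIT column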
I had the same problem. I wanted to edit the ENUM values in a table's structure. The problem was that rows saved earlier held values that the new ENUM list no longer contained.
The solution was to update the old saved rows in the MySQL table first.
I had this issue when trying to convert an existing varchar column to enum. For me the issue was that the column held existing values that were not part of the enum's list of accepted values. So if your enum will only allow, say, ('dog', 'cat') but there is a row with 'bird' in your table, the MODIFY COLUMN will fail with this error.
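A sketch of that fix with invented names: normalize (or delete) the offending rows before the conversion.
-- Hypothetical table 'pets' with a varchar column 'kind' being converted:
UPDATE pets SET kind = 'cat' WHERE kind NOT IN ('dog', 'cat');
ALTER TABLE pets MODIFY COLUMN kind ENUM('dog', 'cat');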
I ran into this problem with a column of type ENUM('0','1').
When trying to save a new record, I was assigning the number 0 to the ENUM column.
As the solution, I changed the ENUM values from '0' to '1', and from '1' to '2'.
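The underlying rule, for context: a number assigned to an ENUM column is treated as a 1-based index into the member list, and index 0 is the reserved empty error value, so the number 0 can never match a member. A sketch with a hypothetical table:
CREATE TABLE t (flag ENUM('0','1'));
INSERT INTO t VALUES ('0');  -- OK: the string '0' matches the first member
INSERT INTO t VALUES (0);    -- error 1265 in strict mode: numeric 0 is the error index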
The reason is that MySQL expects an end-of-row marker in the text file after the last specified column, and by default this marker is char(10), i.e. '\n'. Depending on the operating system where the text file was created (or how you created it yourself), it can be a different combination: Windows uses '\r\n' (chr(13)+chr(10)) as the row separator. Thus, if you use a Windows-generated text file, add the following suffix to your LOAD command: LINES TERMINATED BY '\r\n'. Otherwise, check how rows are separated in your text file; by default MySQL expects char(10) as the row separator.
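Applied to this question's command, that suffix would look like this (assuming a Windows-generated file):
LOAD DATA INFILE 'Pickup_withoutproxy2.txt' INTO TABLE pickup
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\r\n';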
I had the same problem; my mistake was that I was trying to load a value that I hadn't previously defined in the ENUM().
For example, with ENUM('sell','lend') I was trying to load the value 'return' into that column. I needed to load it, so I added it to the ENUM and loaded it again.
I have seen the same warning when my data had extra spaces, tabs, newlines or other characters in a column of type decimal(10,2); to solve it, I had to remove those characters from the values.
Here is how I handled it:
LOAD DATA LOCAL INFILE 'c:/Users/Hitesh/Downloads/InventoryMasterReportHitesh.csv'
INTO TABLE stores_inventory_tmp
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(@col1, @col2, @col3, @col4, @col5)
SET sku = TRIM(REPLACE(REPLACE(REPLACE(REPLACE(@col1,'\t',''), '$',''), '\r', ''), '\n', ''))
, product_name = TRIM(REPLACE(REPLACE(REPLACE(REPLACE(@col2,'\t',''), '$',''), '\r', ''), '\n', ''))
, department_number = TRIM(REPLACE(REPLACE(REPLACE(REPLACE(@col3,'\t',''), '$',''), '\r', ''), '\n', ''))
, department_name = TRIM(REPLACE(REPLACE(REPLACE(REPLACE(@col4,'\t',''), '$',''), '\r', ''), '\n', ''))
, price = TRIM(REPLACE(REPLACE(REPLACE(REPLACE(@col5,'\t',''), '$',''), '\r', ''), '\n', ''))
;
I got that hint from this answer.
I was facing the same issue while importing CSV files into MySQL under XAMPP. Using LINES TERMINATED BY '\r' worked for me, instead of LINES TERMINATED BY '\n' or LINES TERMINATED BY '\r\n'.
This error can also be the result of not having the line
FIELDS TERMINATED BY ','
(if you're using commas to separate the fields) in your MySQL syntax, as described in this page of the MySQL docs.

MySQL: How to escape backslashes in outfile?

I want to output some fields into file using this query:
SELECT
CONCAT('[',
GROUP_CONCAT(
CONCAT(CHAR(13), CHAR(9), '{"name":"', name, '",'),
CONCAT('"id":', CAST(rid AS UNSIGNED), '}')
),
CHAR(13), ']')
AS json FROM `role`
INTO OUTFILE '/tmp/roles.json'
In the output file I'm getting something like this:
[
\ {"name":"anonymous user","rid":1},
\ {"name":"authenticated user","rid":2},
\ {"name":"admin","rid":3},
\ {"name":"moderator","rid":4}
]
As you can see, the newlines (char(13)) have no backslashes, but the tab characters (char(9)) do. How can I get rid of them?
UPDATE
Sundar G gave me a cue, so I modified the query to this:
SELECT
CONCAT('"name":', name),
CONCAT('"rid":', rid)
INTO outfile '/tmp/roles.json'
FIELDS TERMINATED BY ','
LINES STARTING BY '\t{' TERMINATED BY '},\n'
FROM `role`
I don't know why, but this syntax strips the backslashes from the output file:
{"name":"anonymous user","rid":1},
{"name":"authenticated user","rid":2},
{"name":"admin","rid":3},
{"name":"moderator","rid":4}
This is already pretty nice output, but I would also like to add opening and closing square brackets at the beginning and end of the file. Can I do this by means of MySQL syntax, or do I have to do it manually?
As described in SELECT ... INTO Syntax:
The syntax for the export_options part of the statement consists of the same FIELDS and LINES clauses that are used with the LOAD DATA INFILE statement. See Section 13.2.6, “LOAD DATA INFILE Syntax”, for information about the FIELDS and LINES clauses, including their default values and permissible values.
That referenced page says:
If you specify no FIELDS or LINES clause, the defaults are the same as if you had written this:
FIELDS TERMINATED BY '\t' ENCLOSED BY '' ESCAPED BY '\\'
LINES TERMINATED BY '\n' STARTING BY ''
and later explains:
For output, if the FIELDS ESCAPED BY character is not empty, it is used to prefix the following characters on output:
The FIELDS ESCAPED BY character
The FIELDS [OPTIONALLY] ENCLOSED BY character
The first character of the FIELDS TERMINATED BY and LINES TERMINATED BY values
ASCII 0 (what is actually written following the escape character is ASCII “0”, not a zero-valued byte)
If the FIELDS ESCAPED BY character is empty, no characters are escaped and NULL is output as NULL, not \N. It is probably not a good idea to specify an empty escape character, particularly if field values in your data contain any of the characters in the list just given.
Therefore, since you have not explicitly specified a FIELDS clause, any occurrence of the default TERMINATED BY character (i.e. tab) within a field will be escaped by the default ESCAPED BY character (i.e. backslash): so the tab character that you are creating gets escaped. To avoid that, explicitly specify either a different field termination character or an empty string as the escape character.
However, you should also note that the size of your results will be limited by group_concat_max_len. Perhaps a better option would be:
SELECT json FROM (
SELECT 1 AS sort_col, '[' AS json
UNION ALL
SELECT 2, CONCAT('\t{"name":', QUOTE(name), ',"id":', CAST(rid AS UNSIGNED), '}')
FROM role
UNION ALL
SELECT 3, ']'
) t
ORDER BY sort_col
INTO OUTFILE '/tmp/roles.json' FIELDS ESCAPED BY ''
Try a query like this:
SELECT your_fields
INTO OUTFILE '/path/file' FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n'
FROM table;
Hope this works.