Suppose I would like to import a CSV file into the following table:
CREATE TABLE example_table (
id int PRIMARY KEY,
comma_delimited_str_list list<ascii>,
space_delimited_str_list list<ascii>
);
where comma_delimited_str_list and space_delimited_str_list are two list columns that use a comma and a space, respectively, as their delimiters.
An example csv record would be:
12345,"hello,world","stack overflow"
where I would like to treat "hello,world" and "stack overflow" as two multi-valued attributes.
How can I import such a CSV file into its corresponding table in Cassandra, preferably using CQL COPY?
CQL 1.2 can import a CSV file with multi-valued fields directly into a table. However, the format of those multi-valued fields must match the CQL literal format.
For example, lists must be in the form ['abc','def','ghi'], and sets must be in the form {'123','456','789'}.
Below is an example of importing CSV-formatted data into the example_table from the question via STDIN:
cqlsh:demo> copy example_table from STDIN;
[Use \. on a line by itself to end input]
[copy] 12345,"['hello','world']","['stack','overflow']"
[copy] 56780,"['this','is','a','test','list']","['here','is','another','one']"
[copy] \.
2 rows imported in 11.304 seconds.
cqlsh:demo> select * from example_table;
id | comma_delimited_str_list | space_delimited_str_list
-------+---------------------------+--------------------------
12345 | [hello, world] | [stack, overflow]
56780 | [this, is, a, test, list] | [here, is, another, one]
Importing incorrectly formatted list or set values from a CSV file raises an error:
cqlsh:demo> copy example_table from STDIN;
[Use \. on a line by itself to end input]
[copy] 9999,"hello","world"
Bad Request: line 1:108 no viable alternative at input ','
Aborting import at record #0 (line 1). Previously-inserted values still present.
The above input should be replaced by 9999,"['hello']","['world']":
cqlsh:demo> copy example_table from STDIN;
[Use \. on a line by itself to end input]
[copy] 9999,"['hello']","['world']"
[copy] \.
1 rows imported in 16.859 seconds.
cqlsh:demo> select * from example_table;
id | comma_delimited_str_list | space_delimited_str_list
-------+---------------------------+--------------------------
9999 | [hello] | [world]
12345 | [hello, world] | [stack, overflow]
56780 | [this, is, a, test, list] | [here, is, another, one]
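The same approach works when reading from a file instead of STDIN, as long as each list value in the file is written as a CQL list literal and wrapped in double quotes so that its embedded commas survive CSV parsing. A minimal sketch (the file name example.csv is an assumption):
cat example.csv
12345,"['hello','world']","['stack','overflow']"
56780,"['this','is','a','test','list']","['here','is','another','one']"
cqlsh:demo> copy example_table from 'example.csv';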
I need to load the contents of a file into a table. The file contains text separated by commas. It is a very large file, and I cannot change it; it was given to me like this.
12.com,128.15.8.6,TEXT1,no1,['128.15.8.6']
23com,122.14.10.7,TEXT2,no2,['122.14.10.7']
45.com,91.33.10.4,TEXT3,no3,['91.33.10.4']
67.com,88.22.88.8,TEXT4,no4,['88.22.88.8', '5.112.1.10']
I need to load the file into a table of five columns. So, for example, the last row above should end up in the table as follows:
table.col1: 67.com
table.col2: 88.22.88.8
table.col3: TEXT4
table.col4: no4
table.col5: ['88.22.88.8', '5.112.1.10']
Using MySQL Workbench, I created a table with five columns, all of type VARCHAR. Then I ran the following SQL command:
LOAD DATA INFILE '/var/lib/mysql-files/myfile.txt'
INTO TABLE `mytable`.`myscheme`
fields terminated BY ','
The last column string (which contains commas that I do not want to separate) causes an issue.
Error:
Error Code: 1262. Row 4 was truncated; it contained more data than there were input columns
How can I overcome this problem?
This is not that difficult using LOAD DATA INFILE; note the use of a user variable.
drop table if exists t;
create table t(col1 varchar(20), col2 varchar(20), col3 varchar(20), col4 varchar(20), col5 varchar(100));
truncate table t;
-- read each whole line into @var1, then split off the first four comma-separated
-- fields and keep everything from the '[' onwards as the fifth column
load data infile 'test.csv' into table t lines terminated by '\r\n' (@var1)
set col1 = substring_index(@var1,',',1),
col2 = substring_index(substring_index(@var1,',',2),',',-1),
col3 = substring_index(substring_index(@var1,',',3),',',-1),
col4 = substring_index(substring_index(@var1,',',4),',',-1),
col5 = concat('[',(substring_index(@var1,'[',-1)))
;
select * from t;
+--------+-------------+-------+------+------------------------------+
| col1 | col2 | col3 | col4 | col5 |
+--------+-------------+-------+------+------------------------------+
| 12.com | 128.15.8.6 | TEXT1 | no1 | ['128.15.8.6'] |
| 23com | 122.14.10.7 | TEXT2 | no2 | ['122.14.10.7'] |
| 45.com | 91.33.10.4 | TEXT3 | no3 | ['91.33.10.4'] |
| 67.com | 88.22.88.8 | TEXT4 | no4 | ['88.22.88.8', '5.112.1.10'] |
+--------+-------------+-------+------+------------------------------+
4 rows in set (0.00 sec)
In this case, to avoid the problem caused by the embedded commas, you could import the rows into a single-column staging table (of type TEXT or MEDIUMTEXT, as you need).
Then, using LOCATE (once for the 1st comma, once for the 2nd, once for the 3rd, ...) and SUBSTRING, you can extract the four columns you need from each row.
Finally, with an INSERT ... SELECT you can populate the destination table, separating the columns as you need.
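A minimal sketch of that staging-table approach, using SUBSTRING_INDEX rather than LOCATE/SUBSTRING for brevity (the staging table name, and the reuse of table t and file test.csv from the previous answer, are assumptions):
create table staging (raw_line text);
-- each whole line lands in raw_line because the file contains no tab characters
load data infile 'test.csv' into table staging lines terminated by '\r\n' (raw_line);
-- split off the first four comma-separated fields and keep the bracketed list intact
insert into t (col1, col2, col3, col4, col5)
select substring_index(raw_line,',',1),
       substring_index(substring_index(raw_line,',',2),',',-1),
       substring_index(substring_index(raw_line,',',3),',',-1),
       substring_index(substring_index(raw_line,',',4),',',-1),
       concat('[', substring_index(raw_line,'[',-1))
from staging;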
This is too long for a comment.
You have a horrible data format in your CSV file. I think you should regenerate the file.
MySQL has facilities to help you handle this data, particularly the OPTIONALLY ENCLOSED BY option of LOAD DATA INFILE. The only caveat is that it allows a single enclosure character rather than a different opening and closing pair (such as [ and ]).
My first suggestion would be to replace the field separators with another character -- tab or | come to mind -- any character that is not used within the field values.
The second is to use a double quote for OPTIONALLY ENCLOSED BY. Then replace every [ with "[ and every ] with ]" in the data file. Even if you cannot regenerate the file, you can pre-process it with something like sed, Perl, or Python to make this simple substitution.
Then you can use the import facilities for MySQL to load the file.
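For example, if the file is pre-processed so that each bracketed list is wrapped in double quotes (e.g. 67.com,88.22.88.8,TEXT4,no4,"['88.22.88.8', '5.112.1.10']"), a hedged sketch of the load might look like this (the table and file names are assumptions):
load data infile 'myfile_quoted.csv' into table t
fields terminated by ',' optionally enclosed by '"'
lines terminated by '\r\n'
(col1, col2, col3, col4, col5);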
I know AWS mentions in their documentation that a CSV file is treated much like a TXT file. But why is there no entry for the CSV file?
For example:
If I am running a query like:
COPY "systemtable" FROM 's3://test/example.txt' <credentials> IGNOREHEADER 1 delimiter as ','
then it creates an entry in stl_load_commits, which I can query with:
select query, curtime as updated from stl_load_commits where query = pg_last_copy_id();
But when I try the same thing with:
COPY "systemtable" FROM 's3://test/example.csv'
<credentials> IGNOREHEADER 1 delimiter as ',' format csv;
then the result of
select query, curtime as updated from stl_load_commits where query = pg_last_copy_id();
is blank. Why does AWS not create an entry for the CSV file?
That is the first part of the question. Secondly, there must be some way to check the status of the loaded file.
How can we check whether a CSV file has loaded successfully into the database?
The format of the file does not affect the visibility of success or error information in system tables.
When you run COPY it returns confirmation of success and a count of rows loaded. Some SQL clients may not return this information to you but here's what it looks like using psql:
COPY public.web_sales from 's3://my-files/csv/web_sales/'
FORMAT CSV
GZIP
CREDENTIALS 'aws_iam_role=arn:aws:iam::01234567890:role/redshift-cluster'
;
-- INFO: Load into table 'web_sales' completed, 72001237 record(s) loaded successfully.
-- COPY
If the load succeeded you can see the files in stl_load_commits:
SELECT query, TRIM(file_format) format, TRIM(filename) file_name, lines, errors FROM stl_load_commits WHERE query = pg_last_copy_id();
query | format | file_name | lines | errors
---------+--------+---------------------------------------------+---------+--------
1928751 | Text | s3://my-files/csv/web_sales/0000_part_03.gz | 3053206 | -1
1928751 | Text | s3://my-files/csv/web_sales/0000_part_01.gz | 3053285 | -1
If the load fails you should get an error. Here's an example error (note the table I try to load):
COPY public.store_sales from 's3://my-files/csv/web_sales/'
FORMAT CSV
GZIP
CREDENTIALS 'aws_iam_role=arn:aws:iam::01234567890:role/redshift-cluster'
;
--ERROR: Load into table 'store_sales' failed. Check 'stl_load_errors' system table for details.
You can see the error details in stl_load_errors.
SELECT query, TRIM(filename) file_name, TRIM(colname) "column", line_number line, TRIM(err_reason) err_reason FROM stl_load_errors where query = pg_last_copy_id();
query | file_name | column | line | err_reason
---------+------------------------+-------------------+------+---------------------------
1928961 | s3://…/0000_part_01.gz | ss_wholesale_cost | 1 | Overflow for NUMERIC(7,2)
1928961 | s3://…/0000_part_02.gz | ss_wholesale_cost | 1 | Overflow for NUMERIC(7,2)
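As an extra sanity check that works regardless of the file format, PG_LAST_COPY_COUNT() reports how many rows the most recent COPY in the current session loaded:
SELECT pg_last_copy_id() AS copy_query, pg_last_copy_count() AS rows_loaded;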
I'm trying to import a large dataset (this one: https://www.kaggle.com/secareanualin/football-events/data) into Cassandra, but I'm stuck. I created the table with the following command:
create table test.football_event(id_odsp text, id_event text, sort_order text, time text, text text, event_type text, event_type2 text, side text, event_team text, opponent text, player text, player2 text, player_in text, player_out text, shot_place text, shot_outcome text, is_goal text, location text, bodypart text, assist_method text, situation text, fast_break text, primary key(id_odsp));
This table matches the csv containing the data. When I try to import with this command
copy test.football_event(id_odsp, id_event, sort_order, time, text, event_type, event_type2, side, event_team, opponent, player, player2, player_in, player_out, shot_place, shot_outcome, is_goal, location, bodypart, assist_method, situation, fast_break) from '/path/to/events_import.csv' with delimiter = ',';
I'm getting the following error: Failed to import XX rows: ParseError - Invalid row length 24 should be 23, given up without retries, or the same error with row length 23 should be 22. I assume the data in the CSV isn't perfect and contains some errors, so I increased the number of columns in my table to 24, but this didn't resolve the problem.
I was wondering whether there is an option to manage the level of "strictness" during the import, but I didn't find anything about it. I would like an option that lets me fill the entire table row when the length is 24, or add one or two nulls in the last fields when the row length is 23 or 22.
If it matters, I'm running Cassandra on Linux Mint 18.1.
Thanks in advance
Cassandra/Scylla are schema-enforced systems; the schema should include every required column. The COPY command expects each row to provide the same number of values as specified in the column list of the command.
In Cassandra/Scylla, the COPY command creates an error file on your loader node; the error file includes the rows that caused the issue. You can review the offending rows, decide whether they are of interest to you, and remove or fix them.
This does not mean the other rows were not uploaded correctly. See the example below.
The CSV file looks like the following:
cat myfile.csv
id,col1,col2,col3,col4
1,bob,alice,charlie,david
2,bob,charlie,david,bob
3,alice,bob,david
4,david,bob,alice
cqlsh> create KEYSPACE myks WITH replication = {'class':'SimpleStrategy', 'replication_factor': 1};
cqlsh> USE myks ;
cqlsh:myks> create TABLE mytable (id int PRIMARY KEY,col1 text,col2 text,col3 text ,col4 text);
cqlsh> COPY myks.mytable (id, col1, col2, col3 , col4 ) FROM 'myfile.csv' WITH HEADER= true ;
Using 1 child processes
Starting copy of myks.mytable with columns [id, col1, col2, col3, col4].
Failed to import 2 rows: ParseError - Invalid row length 4 should be 5, given up without retries
Failed to process 2 rows;
failed rows written to import_myks_mytable.err
Processed: 4 rows; Rate: 7 rows/s; Avg. rate: 10 rows/s
4 rows imported from 1 files in 0.386 seconds (0 skipped).
cqlsh> SELECT * FROM myks.mytable ;
id | col1 | col2 | col3 | col4
----+------+---------+---------+-------
1 | bob | alice | charlie | david
2 | bob | charlie | david | bob
The error file shows which rows have an issue:
cat import_myks_mytable.err
3,alice,bob,david
4,david,bob,alice
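Once you have reviewed and corrected the rows in the error file (for example, by adding the missing trailing fields), you can re-import just those rows with another COPY FROM; the corrected file name below is an assumption:
cqlsh> COPY myks.mytable (id, col1, col2, col3, col4) FROM 'import_myks_mytable_fixed.csv';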
Table Name : Product
uid | productcount | term | timestamp
304ad5ac-4b6d-4025-b4ea-8b7991a3fe72 | 26 | dress | 1433110980000
6097e226-35b5-4f71-b158-a1fe39a430c1 | 0 | #751104 | 1433861040000
Command :
COPY product (uid, productcount, term, timestamp) TO 'temp.csv';
Error:
Improper COPY command.
Am I missing something?
The syntax of your original COPY command is also fine. The problem is with your column named timestamp, which is a data type and is a reserved word in this context. For this reason you need to escape your column name as follows:
COPY product (uid, productcount, term, "timestamp") TO 'temp.csv';
Even better, try to use a different field name, because this can cause other problems as well.
I was able to export the data into CSV files by using the command below.
Avoiding the column names did the trick.
copy product to 'temp.csv' ;
Use the following commands to get data from Cassandra tables into CSV files.
This command will copy the first 100 rows (the default page size) to a CSV file:
cqlsh -e"SELECT * FROM employee.employee_details" > /home/hadoop/final_Employee.csv
This command will copy all the rows to a CSV file:
cqlsh -e"PAGING OFF;SELECT * FROM employee.employee_details" > /home/hadoop/final_Employee.csv
I'm importing data from a .csv file into a MySQL database using a "LOAD DATA LOCAL INFILE" query.
The .csv file contains the following:
ID | Name | Date | Price
1. 01 | abc | 13-02-2013 | 1500
2. 02 | nbd | blahblahbl | 1000
3. 03 | kgj | 11-02-2012 | jghj
My MySQL table contains the following columns:
Id INTEGER
NAME VARCHAR(100)
InsertionTimeStamp DATE
Price INTEGER
MySQL query to load the .csv data into the table above:
LOAD DATA LOCAL INFILE 'UserDetails.csv' INTO TABLE userdetails
FIELDS TERMINATED BY ','
IGNORE 1 LINES
(Id,Name,InsertionTimeStamp,Price)
set InsertionTimeStamp = str_to_date(@d,'%d-%m-%Y');
When I executed the query, 3 records were inserted into the table, with 2 warnings:
a) Incorrect datetime value: 'blahblahbl' for function str_to_date
b) Data truncated at row 3 (because of an invalid INTEGER for the Price column)
Question
1. Is there any way to avoid inserting rows that produce warnings/errors, i.e. rows that contain invalid data?
I don't want Row 2 and Row 3 to be inserted, as they contain invalid data.
2. For the 'Incorrect datetime value' warning above, can I also get the row number?
Basically, I want the exact warning/error together with the row number.
I think it would be much easier if you validated the input with some other language (for example PHP).
You'd just have to iterate through the lines of the CSV and run the appropriate validation on each field before inserting it.
If you have to do it purely in SQL, there are ways to handle the validation there as well.
You can try the Data Import tool (CSV or TXT import format) in dbForge Studio for MySQL.
In the Data Import wizard, on the Modes page, uncheck the Use bulk insert option; on the Errors handling page, check the Ignore all errors option. This will let you skip import errors.
I know you are trying to skip problematic rows but don't you think you have a mistake in your LOAD DATA command? Should it be:
LOAD DATA LOCAL INFILE 'UserDetails.csv' INTO TABLE userdetails
FIELDS TERMINATED BY ','
IGNORE 1 LINES
(Id,Name,@d,Price)
set InsertionTimeStamp = str_to_date(@d,'%d-%m-%Y');
Shouldn't you be using the variable name (@d) in the list of columns instead of the actual column name (InsertionTimeStamp)? That could be the reason you are getting the error message about the datetime value.
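Building on that, if you also want to keep the invalid rows out of the table entirely (question 1), one option is to load everything into a staging table of plain strings first and only copy across the rows that validate. This is only a sketch; the staging table name and the validation rules are assumptions:
CREATE TABLE userdetails_staging (
  Id VARCHAR(20),
  Name VARCHAR(100),
  d VARCHAR(20),
  Price VARCHAR(20)
);
LOAD DATA LOCAL INFILE 'UserDetails.csv' INTO TABLE userdetails_staging
FIELDS TERMINATED BY ','
IGNORE 1 LINES
(Id, Name, d, Price);
-- keep only rows whose date and price actually parse
INSERT INTO userdetails (Id, Name, InsertionTimeStamp, Price)
SELECT Id, Name, STR_TO_DATE(d, '%d-%m-%Y'), Price
FROM userdetails_staging
WHERE STR_TO_DATE(d, '%d-%m-%Y') IS NOT NULL
  AND Price REGEXP '^[0-9]+$';
-- the rows left behind are the ones that would have been rejected (useful for question 2)
SELECT * FROM userdetails_staging
WHERE STR_TO_DATE(d, '%d-%m-%Y') IS NULL
   OR Price NOT REGEXP '^[0-9]+$';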