Cassandra invalid row length on copy (import) - csv

I'm trying to import a large dataset (this one: https://www.kaggle.com/secareanualin/football-events/data) into Cassandra, but I'm stuck. I created the table with the following command:
create table test.football_event(id_odsp text, id_event text, sort_order text, time text, text text, event_type text, event_type2 text, side text, event_team text, opponent text, player text, player2 text, player_in text, player_out text, shot_place text, shot_outcome text, is_goal text, location text, bodypart text, assist_method text, situation text, fast_break text, primary key(id_odsp));
This table matches the CSV containing the data. When I try to import it with this command:
copy test.football_event(id_odsp, id_event, sort_order, time, text, event_type, event_type2, side, event_team, opponent, player, player2, player_in, player_out, shot_place, shot_outcome, is_goal, location, bodypart, assist_method, situation, fast_break) from '/path/to/events_import.csv' with delimiter = ',';
I'm getting the following error: Failed to import XX rows: ParseError - Invalid row length 24 should be 23, given up without retries, or the same error with row length 23 should be 22. I assume the data in the CSV isn't perfect and contains some errors, so I increased the number of columns in my table to 24, but this didn't resolve the problem.
I was wondering whether there is an option to control the level of "strictness" during the import, but I didn't find anything about it. I would like an option that fills the entire table row when the length is 24, or adds one or two nulls in the last fields when the row length is 23 or 22.
In case it matters, I'm running Cassandra on Linux Mint 18.1.
Thanks in advance

Cassandra/Scylla are schema-enforced systems; the schema must include every required column. The COPY command expects each row to have the same number of fields as the columns listed in the command.
In Cassandra/Scylla, the COPY command should create an error file on your loader node; the error file contains the rows that caused the issue. You can review the offending rows, decide whether they are of interest to you, and remove or fix them.
It does not mean the other rows were not uploaded correctly. See the example below:
The CSV file looks like the following:
cat myfile.csv
id,col1,col2,col3,col4
1,bob,alice,charlie,david
2,bob,charlie,david,bob
3,alice,bob,david
4,david,bob,alice
cqlsh> create KEYSPACE myks WITH replication = {'class':'SimpleStrategy', 'replication_factor': 1};
cqlsh> USE myks ;
cqlsh:myks> create TABLE mytable (id int PRIMARY KEY,col1 text,col2 text,col3 text ,col4 text);
cqlsh> COPY myks.mytable (id, col1, col2, col3 , col4 ) FROM 'myfile.csv' WITH HEADER= true ;
Using 1 child processes
Starting copy of myks.mytable with columns [id, col1, col2, col3, col4].
Failed to import 2 rows: ParseError - Invalid row length 4 should be 5, given up without retries
Failed to process 2 rows;
failed rows written to import_myks_mytable.err
Processed: 4 rows; Rate: 7 rows/s; Avg. rate: 10 rows/s
4 rows imported from 1 files in 0.386 seconds (0 skipped).
cqlsh> SELECT * FROM myks.mytable ;
 id | col1 | col2    | col3    | col4
----+------+---------+---------+-------
  1 | bob  | alice   | charlie | david
  2 | bob  | charlie | david   | bob
The Error file explains which rows have an issue:
cat import_myks_mytable.err
3,alice,bob,david
4,david,bob,alice
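As far as I know there is no COPY option that pads short rows for you, but once the rows in the error file have been corrected (for example by adding the missing trailing fields), you can re-run COPY against just that file. A sketch, assuming the corrected rows were saved as import_myks_mytable.fixed.csv (a made-up name):
cqlsh> COPY myks.mytable (id, col1, col2, col3, col4) FROM 'import_myks_mytable.fixed.csv' WITH HEADER = false;
Newer cqlsh versions also accept COPY FROM options such as MAXPARSEERRORS and ERRFILE to control how many bad rows are tolerated and where they are written (check HELP COPY in your version), but none of them pad or truncate rows to fit the column list.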

Related

Mysql select from mediumint(8) to unique Binary(2)

I am trying to create a query to populate unique data from one table into part of another table. The two columns I'm selecting from the source table are of types mediumint(8) and varchar(20). In the destination table I have a Binary(2) and a varchar(20) that I'm trying to populate from the source table. I'm also only selecting the top 100 unique values.
The problem I have is with the mediumint(8) to Binary(2) conversion. I can see all my mediumint(8) values are unique, and they are all less than 200 numerically, but for some reason I can't get them to fit in a 2 byte unique binary column. My query keeps rejecting the insert because it says the values are larger than 2 bytes.
sample query:
select distinct id, part from partsTable
group by id
limit 100;
This comes back with a table that looks kind of like this
id| part
_________
1 | A
2 | B
3 | C
21 | AB
22 | AC
...
I have tried doing things like (id & 0xFFFF) to force my id column into 2 bytes, and CAST(id as UNSIGNED) or CAST(id as BINARY(2)), but these always truncate the data so that I get duplicate values. I'm not really sure what I'm missing, or how my id values, which I can see are all unique, get translated into non-unique values when I try to cast them to binary.
What's even more confusing is as a test I just inserted two records into my destination table:
INSERT into table1 (id, part) VALUES (0xFFFF, 'A');
INSERT into table1 (id, part) VALUES (0xFFFE, 'B');
And then I just wrote a query to see what that data looked like:
SELECT bin(id) as id, hex(id) as hex, part FROM table1;
Which returns a result like:
id | hex | part
__________________
0 | 0xFFFF| A
0 | 0xFFFE| B
So I have no idea what's going on
2 bytes gives a numeric range of 0-65535. The problem here is that the BINARY data type in MySQL still stores a string, not a number, just with a collation of binary instead of something like utf8 as the CHAR data type would have. The difference is that the BINARY data type restricts the value based on the number of bytes rather than the number of characters of the string it is storing. In other words, a 3-digit value such as 150 stored as a string is 3 characters, totaling 3 bytes. So if this value were converted to BINARY(2), it would lose anything past its 2nd byte, leaving 15.
Because it stores a string, the BINARY data type is not restricted to the characters 0 and 1 as you might expect, either. So you could make the values fit in a BINARY(2) by using HEX(): HEX(150) is '96', which fits in a BINARY(2). You would then have to use UNHEX() to retrieve any of these values. If this table already has other values stored in it, I doubt this would work for you, though.
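A minimal sketch of that idea (the table name parts_dest is made up, and CONV() is shown as one way to turn the stored hex string back into a decimal number):
-- HEX(150) is the 2-character string '96', so it fits in 2 bytes
SELECT HEX(150);
-- hypothetical destination table with the 2-byte binary id
CREATE TABLE parts_dest (id BINARY(2), part VARCHAR(20));
INSERT INTO parts_dest (id, part) VALUES (HEX(150), 'A');
-- convert the stored hex string back to a decimal value when reading
SELECT CONV(id, 16, 10) AS id_num, part FROM parts_dest;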

MySQL Error Code: 1262. Row x was truncated; it contained more data than there were input columns

I need to load the content of a file into a table. The file contains text separated by commas, and it is a very large file. I cannot change the file; it was already given to me like this.
12.com,128.15.8.6,TEXT1,no1,['128.15.8.6']
23com,122.14.10.7,TEXT2,no2,['122.14.10.7']
45.com,91.33.10.4,TEXT3,no3,['91.33.10.4']
67.com,88.22.88.8,TEXT4,no4,['88.22.88.8', '5.112.1.10']
I need to load the file into a table of five columns. So, for example, the last row above should be stored in the table as follows:
table.col1: 67.com
table.col2: 88.22.88.8
table.col3: TEXT4
table.col4: no4
table.col5: ['88.22.88.8', '5.112.1.10']
Using MySQL Workbench, I created a table with five columns, all of type varchar. Then I ran the following SQL command:
LOAD DATA INFILE '/var/lib/mysql-files/myfile.txt'
INTO TABLE `mytable`.`myscheme`
fields terminated BY ','
The last column string (which contains commas that I do not want to separate) causes an issue.
Error:
Error Code: 1262. Row 4 was truncated; it contained more data than there were input columns
How can I overcome this problem, please?
Not that difficult, simply using LOAD DATA INFILE - note the use of a user variable.
drop table if exists t;
create table t(col1 varchar(20),col2 varchar(20), col3 varchar(20), col4 varchar(20),col5 varchar(100));
truncate table t;
load data infile 'test.csv' into table t LINES TERMINATED BY '\r\n' (@var1)
set col1 = substring_index(@var1,',',1),
col2 = substring_index(substring_index(@var1,',',2),',',-1),
col3 = substring_index(substring_index(@var1,',',3),',',-1),
col4 = substring_index(substring_index(@var1,',',4),',',-1),
col5 = concat('[',(substring_index(@var1,'[',-1)))
;
select * from t;
+--------+-------------+-------+------+------------------------------+
| col1   | col2        | col3  | col4 | col5                         |
+--------+-------------+-------+------+------------------------------+
| 12.com | 128.15.8.6  | TEXT1 | no1  | ['128.15.8.6']               |
| 23com  | 122.14.10.7 | TEXT2 | no2  | ['122.14.10.7']              |
| 45.com | 91.33.10.4  | TEXT3 | no3  | ['91.33.10.4']               |
| 67.com | 88.22.88.8  | TEXT4 | no4  | ['88.22.88.8', '5.112.1.10'] |
+--------+-------------+-------+------+------------------------------+
4 rows in set (0.00 sec)
In this case, to avoid the problem caused by the improper presence of commas, you could import the rows into a single-column staging table (of type TEXT or MEDIUMTEXT, as you need).
Then, using LOCATE (once for the 1st comma, once for the 2nd, once for the 3rd, and so on) and SUBSTRING, you can extract the four leading columns you need from each row.
Finally, with an INSERT ... SELECT you can populate the destination table, separating the columns as you need, as sketched below.
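A sketch of that approach (the staging table name is an assumption, the destination table is assumed to be the five-column table from the question, here called mytable, and the splitting reuses SUBSTRING_INDEX as in the answer above):
-- single-column staging table; with the default tab field terminator
-- the whole comma-separated line lands in the one column
CREATE TABLE staging (line TEXT);
LOAD DATA INFILE '/var/lib/mysql-files/myfile.txt'
INTO TABLE staging
LINES TERMINATED BY '\n';
-- split the first four comma-separated fields and keep everything
-- from the first '[' onwards as the fifth column
INSERT INTO mytable (col1, col2, col3, col4, col5)
SELECT substring_index(line,',',1),
       substring_index(substring_index(line,',',2),',',-1),
       substring_index(substring_index(line,',',3),',',-1),
       substring_index(substring_index(line,',',4),',',-1),
       concat('[',substring_index(line,'[',-1))
FROM staging;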
This is too long for a comment.
You have a horrible data format in your CSV file. I think you should regenerate the file.
MySQL has facilities to help you handle this data, particularly the OPTIONALLY ENCLOSED BY option of LOAD DATA INFILE. The only caveat is that it allows a single enclosure character rather than different opening and closing characters.
My first suggestion would be to replace the field separators with another character - tab or | come to mind - any character that is not used for values within a field.
The second is to use a double quote for OPTIONALLY ENCLOSED BY, then replace '[' with '"[' and ']' with ']"' in the data file. Even if you cannot regenerate the file, you can pre-process it with something like sed, Perl, or Python to make this simple substitution.
Then you can use the import facilities for MySQL to load the file.
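As a sketch of that second suggestion: after pre-processing the file so that the bracketed list is wrapped in double quotes, the load could look like this (the file and table names are assumptions):
-- the fifth field is now "['88.22.88.8', '5.112.1.10']", so its commas are protected
LOAD DATA INFILE '/var/lib/mysql-files/myfile_quoted.txt'
INTO TABLE mytable
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';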

HP Vertica - How do I specify date formats for the CSV parsers

In Vertica 7.2, I'm using COPY with fdelimitedparser. I would like to be able to specify a date or datetime format for some but not all of the columns. Different date columns can have different formats.
I can't list all columns like when using COPY without a parser, since I have many files with different column combinations, and I would rather avoid writing a script to generate my copy command for each file.
Is there any way to do this?
Additionally, how do I know which parser natively accepts which date formats?
Thanks!
You can use the Vertica FILLER option when loading data.
See the example here:
Transform data during load in Vertica
A small example as well:
dbadmin=> \! cat /tmp/file.csv
2016-19-11
dbadmin=> copy tbl1 (v_col_1 FILLER date FORMAT 'YYYY-DD-MM',col1 as v_col_1) from '/tmp/file.csv';
Rows Loaded
-------------
1
(1 row)
dbadmin=> select * from tbl1;
col1
------------
2016-11-19
(1 row)
dbadmin=> copy tbl1 (v_col_1 FILLER date FORMAT 'YYYY-MM-DD',col1 as v_col_1) from '/tmp/file.csv';
Rows Loaded
-------------
1
(1 row)
dbadmin=> select * from tbl1;
col1
------------
2016-11-19
2017-07-14
(2 rows)
hope this helped
You can use the FORMAT keyword as part of the COPY command.
See the example below, from the Vertica forum:
create table test3 (id int, Name varchar(16), dt date, f2 int);
CREATE TABLE
vsql=> \!cat /tmp/mydata.data
1|foo|29-Jan-2013|100.0
2|bar|30-Jan-2013|200.0
3|egg|31-Jan-2013|300.0
4|tux|01-Feb-2013|59.9
vsql=> copy test3
vsql-> ( id, Name, dt format 'DD#MON#YYYY', f2)
vsql-> from '/tmp/mydata.data' direct delimiter '|' abort on error;
Rows Loaded
-------------
4
(1 row)
vsql=> select * from test3;
 id | Name |     dt     |    f2
----+------+------------+----------
  1 | foo  | 2013-01-29 | 100.0000
  2 | bar  | 2013-01-30 | 200.0000
  3 | egg  | 2013-01-31 | 300.0000
  4 | tux  | 2013-02-01 |  59.9000
I understand you need to choose between "simple to load" and "fast to consume"; a flex table will add some impact on the consumers. Some info on that: a flex table is row-based storage, so it will consume more disk space and has no ability to encode the data. You have the ability to materialize the relevant columns as columnar, but then the data is persisted twice, in both the row and the columnar storage (load time will be slower and it will require more storage). At query time, if you plan to query only the materialized columns you should be OK, but if not, you should expect performance issues.

Postgresql 9.2 trigger to separate subfields in a stored string

Postgresql 9.2 DB which automatically collects data from various machines.
The DB stores all the data including the machine id, the firmware, the manufacturer id etc as well as the actual result data. In one stored field (varchar) there are 5 sub fields which are separated by the ^ character.
ACT18!!!8246-EN-2.00013151!1^7.00^F5260046959^H1P1O1R1C1Q1L1^1 (Machine 1)
The order of this data seems to vary from one machine to another, e.g. machines 1, 2 and 3. The string above shows the firmware version, in this case "7.0", and it appears in sub-field 2. However, another machine sends the data in a different sub-field - in this case sub-field 3, where the value is "1":
BACT/ALERT^A.00^1^^ (Machine 2)
I want to store the values "7.0" and "1" in a different field in a separate table using a CREATE TRIGGER t_machine_id AFTER INSERT function where I can choose which sub-field is used depending on the machine the data has come from.
Is split_part the best function to do this? Can anyone supply an example code that will do this? I can't find anything in the documentation.
You need to (a) split the data using something like regexp_split_to_table then (b) match which parts are which using some criteria, since you don't have field position-order to rely on. Right now I don't see any reliable rule to decide what's the firmware version and what's the machine number; you can't really say where field <> machine_number because if machine 1 had firmware version 1 you'd get no results.
Given dummy data:
CREATE TABLE machine_info(data text, machine_no integer);
INSERT INTO machine_info(data,machine_no) (VALUES
('ACT18!!!8246-EN-2.00013151!1^7.00^F5260046959^H1P1O1R1C1Q1L1^1',1),
('BACT/ALERT^A.00^1^^',2)
);
Something like:
SELECT machine_no, regexp_split_to_table(data,'\^')
FROM machine_info;
will give you a table of split data elements with machine number, but then you need to decide which fields are which:
 machine_no |    regexp_split_to_table
------------+------------------------------
          1 | ACT18!!!8246-EN-2.00013151!1
          1 | 7.00
          1 | F5260046959
          1 | H1P1O1R1C1Q1L1
          1 | 1
          2 | BACT/ALERT
          2 | A.00
          2 | 1
          2 |
          2 |
(10 rows)
You may find the output of substituting regexp_split_to_array more useful, depending on whether you can get any useful info from field order and how you intend to process the data.
regress=# SELECT machine_no, regexp_split_to_array(data,'\^')
FROM machine_info;
 machine_no |                     regexp_split_to_array
------------+------------------------------------------------------------------
          1 | {ACT18!!!8246-EN-2.00013151!1,7.00,F5260046959,H1P1O1R1C1Q1L1,1}
          2 | {BACT/ALERT,A.00,1,"",""}
(2 rows)
Say there are two firmware versions; version 1 sends code^blah^fwvers^^ and version 2 and higher sends code^fwvers^blah^blah2^machineno. You can then differentiate between the two because you know that version 1 leaves the last two fields blank:
SELECT
machine_no,
CASE WHEN info_arr[4:5] = ARRAY['',''] THEN info_arr[3] ELSE info_arr[2] END AS fw_vers
FROM (
SELECT machine_no, regexp_split_to_array(data,'\^')
FROM machine_info
) string_parts(machine_no, info_arr);
results:
 machine_no | fw_vers
------------+---------
          1 | 7.00
          2 | 1
(2 rows)
Of course, you've only provided two data samples, so the real matching rules are likely to be more complex. Consider writing an SQL function to extract the desired field(s) from the array (or string) passed and return them.
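For example, a sketch of such a function using split_part (the machine-to-field mapping simply mirrors the two samples above and is an assumption):
-- returns the firmware version for a raw data string, picking the
-- sub-field based on which machine sent it
CREATE OR REPLACE FUNCTION extract_fw_version(data text, machine_no integer)
RETURNS text LANGUAGE sql IMMUTABLE AS $$
  SELECT CASE machine_no
           WHEN 1 THEN split_part(data, '^', 2)  -- machine 1: firmware in sub-field 2
           WHEN 2 THEN split_part(data, '^', 3)  -- machine 2: firmware in sub-field 3
         END;
$$;
SELECT machine_no, extract_fw_version(data, machine_no) AS fw_vers FROM machine_info;
An AFTER INSERT trigger could then call the same function and write the result into your separate table.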

mySQL: mysqlimport to import comma delimited file - first column = ID which is NOT in the file to be imported

Hi folks, I am trying to import a very large file that has moisture data recorded every minute, daily, for 20 cities in the US.
I have 1 table that I named "cityname" and this table has 2 columns:
-city_ID <- INT and is the primary key which increments automatically
-city_name <- character
I have created another table named "citymoisture" and this table has 7 columns:
-city_ID <- INT and is the primary key but does NOT increment automatically
-date_holding VARCHAR(30)
-time_holding VARCHAR(30)
-open
-high
-low
-close
The date_holding column is meant to hold the date data, but because the format isn't what MySQL expects (i.e. it is m/d/y), I want to initially store it in this column and then convert it later (unless there is a way to convert it while the data is being imported?). Similarly, the time_holding column holds the time, which appears as hh:mm:ss AM (or PM). I want to import only the hh:mm:ss and leave out the AM (or PM).
In any case the file that I want to import has SIX columns:
date, time, open, high, low, close.
I want to ensure that the data being imported has the correct city_ID set to match the city_ID in the 'cityname' table. So for example:
city_ID city_name
20 Boston
19 Atlanta
So when the moisture data for Boston is imported into the citymoisture table the city_ID column is set to 20. Similarly when the data for Atlanta is imported into the citymoisture table the city_ID column is set to 19. The citymoisture table will be very large and will store the 1 minute moisture data for 20 cities going forward.
So my questions are:
1) Is there a way to import the contents of the files into columns 2-7 and manually specify the value of the first column (city_ID)?
2) Is there any way to convert dates on the fly while I import, or do I have to first store the data and then convert and store it into what would then be a final table?
3) Same question as #2, but for the time column.
I greatly appreciate your help.
The sample of the moisture data file appears below:
1/4/1999,9:31:00 AM,0.36,0.43,0.23,0.39
1/4/1999,9:32:00 AM,0.39,0.49,0.39,0.43
.
.
.
I'm not sure how the city_ID in the citymoisture table is going to get set, but if there were a way to do that, then I could run join queries based on both tables, i.e. there is one record per city per date/time.
STR_TO_DATE should work for getting your date and time
mysql> SELECT STR_TO_DATE('01/01/2001', '%m/%d/%Y');
+---------------------------------------+
| STR_TO_DATE('01/01/2001', '%m/%d/%Y') |
+---------------------------------------+
| 2001-01-01 |
+---------------------------------------+
1 row in set (0.00 sec)
mysql> SELECT STR_TO_DATE('10:53:11 AM','%h:%i:%s %p');
+------------------------------------------+
| STR_TO_DATE('10:53:11 AM','%h:%i:%s %p') |
+------------------------------------------+
| 10:53:11 |
+------------------------------------------+
1 row in set (0.00 sec)
mysql>
How are you going to determine which city the data in each row belongs to "manually"? Can you include a sample row of what the import data file looks like? Assuming you somehow know which city the file belongs to (replace it in the code below):
It looks like you are going to want to use LOAD DATA INFILE.
If the city you want to insert data for is Boston, the file is named 'Boston.dat', and an entry exists in your cityname table:
SET @c_name = 'Boston';
SET @filename = CONCAT(@c_name,'.dat');
-- LOAD DATA INFILE needs a literal file name, so substitute it in directly:
LOAD DATA INFILE 'Boston.dat'
INTO TABLE citymoisture
FIELDS TERMINATED BY ','
(@date, @time, open, high, low, close)
SET city_ID=(SELECT city_ID FROM cityname WHERE city_name=@c_name),
    date_holding=STR_TO_DATE(@date, '%m/%d/%Y'),
    time_holding=STR_TO_DATE(@time, '%h:%i:%s %p');
Leaving off the AM/PM portion of the time just sounds like a bad idea.