How to parse CSV and import into MySQL - mysql

I am really stuck at the moment. I have searched and researched and did not find anything similar, so maybe what I want is not doable.
I have a .xls file in this format
|contact| col1 |col2 | col3 |
---------------------------------
|name1 | info1 |info2 | info3 |
|address1| | | |
|phone1 | | | |
| | | | |
|name2 | | | |
|address2| info1 |info2 | info3 |
|phone2 | | | |
| | | | |
|name_n | info1 | info2| info3 |
|addres_n| | | |
|phone_n | | | |
----------------------------------
So I was thinking about creating a table called contact and another called info. The contact table would contain id (primary key), name, address and phone as fields, and the info table would have id (primary key), name (foreign key), col1, col2 and col3. That way, if I want to know the details of a name (in the info table), I can go to the other table and see all its values.
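Something like this is what I have in mind for the two tables (column names and types are just placeholders I made up):
-- Placeholder schema: contact holds the person, info holds the per-contact details
CREATE TABLE contact (
    id      INT AUTO_INCREMENT PRIMARY KEY,
    name    VARCHAR(100) NOT NULL UNIQUE,
    address VARCHAR(255),
    phone   VARCHAR(50)
);
CREATE TABLE info (
    id   INT AUTO_INCREMENT PRIMARY KEY,
    name VARCHAR(100) NOT NULL,
    col1 VARCHAR(255),
    col2 VARCHAR(255),
    col3 VARCHAR(255),
    FOREIGN KEY (name) REFERENCES contact (name)
);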
So I researched how to import this xls file, and the most practical solution seems to be converting the xls file to a CSV file delimited by commas.
Therefore, the code I was thinking of using (after converting to CSV) is shown below. The point is that I just want to end up with the following:
"Info" table: name1, info1, info2, info3.
"Contact" table: name1, address1, phone1.
Related to the "Info" table:
LOAD DATA INFILE '/home/username/myfile.csv'
INTO TABLE numbers FIELDS TERMINATED BY ';'
LINES STARTING BY '-' TERMINATED BY '\r\n'
IGNORE 1 LINES
(contact,col1,col2,col3);
This works when filling that table, because I just have to add a '-' at the beginning of each row I want loaded (the rows whose fields are not empty).
Related to "Contact" table:
This one is more difficult: if I want to take just the contact column, I have to add some symbol and then operate on it. I was thinking of adding a '*' (for instance) at the beginning of name1 and another at the end of phone1 to define the boundaries of the LINE, and a ';' to define the fields as before. However, LOAD DATA INFILE also takes the empty fields, and I only want the fields that contain something other than an empty string/NULL. So the question is whether I can tell it to skip the empty values, because the MySQL documentation says this:
An empty field value is interpreted different from a missing field:
For string types, the column is set to the empty string.
For numeric types, the column is set to 0.
For date and time types, the column is set to the appropriate “zero” value for the type. See Section 11.3, “Date and Time Types”.
But it does not say anything about skipping it (I mean: if the value is empty, move on to the next field and evaluate again until a non-empty value is found).
I am asking because, with my current approach, it would fill a row like this:
Name1,null,null,null,address1,null,null,null,phone1 and so on.
Any ideas?

Focusing just on Contact for now: what if your Excel spreadsheet had 4 columns, where column A is an action flag set to N for all rows? You change it to Y for the rows you ultimately want in your real table.
Your csv now has 4 columns:
action, Name, address, phone
You load data infile into a worktable. Now you have your data in (some of it not actually wanted), and an insert ... select where action='Y' from the worktable into your real table finishes it up, as sketched below.
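A minimal sketch of that worktable approach (the file path, table and column names are assumptions, not from the question):
-- Staging table mirroring the 4-column CSV: action flag first
CREATE TABLE contact_work (
    action  CHAR(1),
    name    VARCHAR(100),
    address VARCHAR(255),
    phone   VARCHAR(50)
);
LOAD DATA INFILE '/home/username/contacts.csv'
INTO TABLE contact_work
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(action, name, address, phone);
-- Only the rows flagged Y make it into the real table
INSERT INTO contact (name, address, phone)
SELECT name, address, phone
FROM contact_work
WHERE action = 'Y';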

In Excel you can create two worksheets, the first with contact data, the second with info data. Those sheets can be separately exported as csv files.
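Each exported sheet then gets its own simple load; a sketch (the file names and the comma delimiter here are assumptions):
-- One plain load per exported worksheet
LOAD DATA INFILE '/home/username/contact.csv'
INTO TABLE contact
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(name, address, phone);
LOAD DATA INFILE '/home/username/info.csv'
INTO TABLE info
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(name, col1, col2, col3);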

Related

Create hive external table with complex data type and load from CSV or TSV having few columns with serialized JSON object

I have a CSV (or TSV) with one column ('nw_day' in the example below) holding a serialized array and another column ('res_m' in the example below) holding a serialized JSON object. It also has columns with STRING, TIMESTAMP, and FLOAT data types.
The TSV looks somewhat like this (showing the first row):
+----------+---------------------+-------+-----------------------------------------------+------------------------------------------------------------------------+
| com_id | w_start_time | cap | nw_day | res_m |
+----------+---------------------+-------+-----------------------------------------------+------------------------------------------------------------------------+
| dtf_id | 2019-04-24 06:00:03 | 444.3 | {'Fri','Mon','Sat','Sun','Thurs','Tue','Wed'} | {"some_str":"str_one","some_n":1,"some_t":2019-04-24 06:00:03.700+0000}|
+----------+---------------------+-------+-----------------------------------------------+------------------------------------------------------------------------+
I have tried the following statement, but it is not giving me perfect results.
CREATE EXTERNAL TABLE IF NOT EXISTS table_name(
com_id STRING,
w_start_time TIMESTAMP,
cap FLOAT,
nw_day array <STRING>,
res_m STRUCT <
some_str: STRING,
some_n: BIGINT,
some_t: TIMESTAMP
>)
COMMENT 's_e_s'
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
COLLECTION ITEMS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/location/to/folder/containing/csv'
TBLPROPERTIES ("skip.header.line.count"="1");
So I'm thinking I can deserialize those objects into Hive complex data types with ARRAY and STRUCT. But that is not exactly what I get when I run
select * from table_name limit 1;
which gives me
+----------+---------------------+-------+----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| com_id | w_start_time | cap | nw_day | res_m |
+----------+---------------------+-------+----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
| dtf_id | 2019-04-24 06:00:03 | 444.3 | ["{'Fri'"," 'Mon'"," 'Sat'"," 'Sun'"," 'Thurs'"," 'Tue'"," 'Wed'}"] | {"some_str":"{\"some_str\":\"str_one\",\"some_n\":1,\"some_t\":2019-04-24 06:00:03.700+0000}\","some_n":null,"some_t":null}|
+----------+---------------------+-------+----------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------+
So it is treating the whole object as a string and splitting the string by the delimiter.
I need some help understanding how to load data from CSV/TSV to complex data types in Hive.
I found a similar question, but the requirement there is a little different and no complex data types are involved.
Any help would be much appreciated. If this cannot be done and a preprocessing step has to be included prior to loading, some example of input data for complex data type loads in Hive would help me. Thanks in advance!
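For what it's worth, a rough sketch of the kind of line the DDL above (with its '\t' field delimiter and ',' collection delimiter) could actually split into an ARRAY and a STRUCT; the values are invented and the real files in the question are not in this shape:
-- Expected raw line: tab-separated fields; array items and struct members
-- comma-separated, with no braces, quotes or key names, e.g.:
--   dtf_id<TAB>2019-04-24 06:00:03<TAB>444.3<TAB>Fri,Mon,Sat<TAB>str_one,1,2019-04-24 06:00:03
-- With data shaped like that, the complex columns become addressable:
SELECT com_id, nw_day[0] AS first_day, res_m.some_n FROM table_name LIMIT 1;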

Bulk insert - create some columns' values based on another column's known values

I am creating a database of country and year specific data. I have a table of countries that includes each country's name, UN code (numeric), 2-digit alpha code, 3-digit alpha code, and ISO code.
There will be many other tables in this database whose rows each include country codes, a year, and a data point of interest. For instance, a "total population" table's rows would each include a year, a population figure, and the UN, alpha-2, alpha-3, and ISO codes for the country to which the record corresponds. So, for any given country, there would be many records (one per year).
The challenge: I'm getting data from several sources, and different sources use different coding systems. I am using CSV files to import all of the data. For instance, here's the query that loads the data for the Countries table.
LOAD DATA LOCAL INFILE 'data/countryCodes.csv'
INTO TABLE Countries
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(country_name, alpha2_code, alpha3_code, un_code, iso_code);
Of course, any given UN code only corresponds to one 2-digit alpha code, one 3-digit alpha code, and one ISO code. I want to be able to import a CSV that only includes one of these codes, and have the database automatically populate the other codes' entries for each row. For instance, if I imported population data coded by UN code, the database would automatically reference the corresponding other codes in the Countries table and insert the appropriate values.
Is there a way to do this with SQL? If I create this functionality in the database, it will be far easier to systematize the server- and client-side associations between different types of data.
Honestly, I am having a hard time figuring out what your issue is / what you really want to do...
To be usable on the database level, your final dataset should look like this:
Table Country Code
+----+----+-----+----+----+
| id | un | iso | a2 | a3 |
+----+----+-----+----+----+
| 1 | FR | FR | FR | FR |
| 2 | .. | .. | .. | .. |
+----+----+-----+----+----+
Table Population
+----+------+-----------+----------+
| id | year | idCountry | value |
+----+------+-----------+----------+
| 1 | 1979 | 1 | 50000000 |
| 2 | 1980 | 1 | 50000000 |
+----+------+-----------+----------+
To convert "direct" value from CSV to the index value something like this could be done:
ALTER TABLE population ADD extCC CHAR(2);
LOAD DATA LOCAL INFILE 'data/population.csv'
INTO TABLE population
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(extCC, year, value);
UPDATE population, countryCode SET population.idCountry=countryCode.id WHERE countryCode.iso = population.extCC;
ALTER TABLE population DROP extCC;
Decide which country_code to use everywhere. (You will keep the table you described that shows the mapping between ISO, UN, etc.)
LOAD DATA ... - but not directly into the real table. Instead into table t.
Add a column to t, then lookup each code in the ISO/UN/etc table and put the country_code value in.
Then copy the rows from t to the real table. Note the 'real' table will have only the preferred country_code.
The general principle here is to cleanse and canonicalize disparate data as part of the loading process. Sure, it takes an extra step, but it is worth it. Keep your 'real' table clean.
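A minimal sketch of that load-then-canonicalize flow, assuming incoming population data keyed by UN code (the table and column names are assumptions):
-- Staging table matching the incoming CSV exactly
CREATE TABLE t (
    un_code    INT,
    year       INT,
    population BIGINT
);
LOAD DATA LOCAL INFILE 'data/population_by_un.csv'
INTO TABLE t
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS
(un_code, year, population);
-- Translate the UN code to the one preferred country_code and copy
-- the cleansed rows into the real table
INSERT INTO Population (country_code, year, value)
SELECT c.iso_code, t.year, t.population
FROM t
JOIN Countries c ON c.un_code = t.un_code;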
What will you do about Czechoslovakia --> Czech Republic + Slovakia? And Yugoslavia. And Upper Volta --> Burkina Faso? Etc.

MySQL Error Code: 1262. Row x was truncated; it contained more data than there were input columns

I need to load file content into a table. The file contains text separated by commas. It is a very large file. I cannot change the file; it was already given to me like this.
12.com,128.15.8.6,TEXT1,no1,['128.15.8.6']
23com,122.14.10.7,TEXT2,no2,['122.14.10.7']
45.com,91.33.10.4,TEXT3,no3,['91.33.10.4']
67.com,88.22.88.8,TEXT4,no4,['88.22.88.8', '5.112.1.10']
I need to load the file into a table of five columns. So, for example, the last row above should be in the table as follows:
table.col1: 67.com
table.col2: 88.22.88.8
table.col3: TEXT4
table.col4: no4
table.col5: ['88.22.88.8', '5.112.1.10']
Using MySQL Workbench, I created a table with five columns, all of type varchar. Then I ran the following SQL command:
LOAD DATA INFILE '/var/lib/mysql-files/myfile.txt'
INTO TABLE `mytable`.`myscheme`
fields terminated BY ','
The last column string (which contains commas that I do not want to separate) causes an issue.
Error:
Error Code: 1262. Row 4 was truncated; it contained more data than there were input columns
How can I overcome this problem, please?
Not that difficult simply using load data infile - note the use of a variable.
drop table if exists t;
create table t(col1 varchar(20),col2 varchar(20), col3 varchar(20), col4 varchar(20),col5 varchar(100));
truncate table t;
load data infile 'test.csv' into table t LINES TERMINATED BY '\r\n' (@var1)
set col1 = substring_index(@var1,',',1),
col2 = substring_index(substring_index(@var1,',',2),',',-1),
col3 = substring_index(substring_index(@var1,',',3),',',-1),
col4 = substring_index(substring_index(@var1,',',4),',',-1),
col5 = concat('[',substring_index(@var1,'[',-1))
;
select * from t;
+--------+-------------+-------+------+------------------------------+
| col1 | col2 | col3 | col4 | col5 |
+--------+-------------+-------+------+------------------------------+
| 12.com | 128.15.8.6 | TEXT1 | no1 | ['128.15.8.6'] |
| 23com | 122.14.10.7 | TEXT2 | no2 | ['122.14.10.7'] |
| 45.com | 91.33.10.4 | TEXT3 | no3 | ['91.33.10.4'] |
| 67.com | 88.22.88.8 | TEXT4 | no4 | ['88.22.88.8', '5.112.1.10'] |
+--------+-------------+-------+------+------------------------------+
4 rows in set (0.00 sec)
In this case, to avoid the problem caused by the extra commas, you could import the rows into a single-column table (of type TEXT or MEDIUMTEXT, as you need).
Then, using LOCATE (one call for the 1st comma, one for the 2nd, one for the 3rd, ...) and SUBSTRING, you can extract from each row the columns you need.
Finally, with an INSERT ... SELECT you can populate the destination table, separating the columns as you need. A sketch of this approach follows below.
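A sketch of that approach (the table and column names are assumptions; only the first four commas are treated as separators, so everything after the fourth comma lands in col5):
-- Single-column holding table for the raw lines
CREATE TABLE raw_lines (line TEXT);
LOAD DATA INFILE '/var/lib/mysql-files/myfile.txt'
INTO TABLE raw_lines
LINES TERMINATED BY '\n'
(line);
-- LOCATE finds the 1st..4th commas, SUBSTRING cuts the pieces out
INSERT INTO mytable (col1, col2, col3, col4, col5)
SELECT
    SUBSTRING(line, 1, p1 - 1),
    SUBSTRING(line, p1 + 1, p2 - p1 - 1),
    SUBSTRING(line, p2 + 1, p3 - p2 - 1),
    SUBSTRING(line, p3 + 1, p4 - p3 - 1),
    SUBSTRING(line, p4 + 1)
FROM (
    SELECT line, p1, p2, p3, LOCATE(',', line, p3 + 1) AS p4
    FROM (
        SELECT line, p1, p2, LOCATE(',', line, p2 + 1) AS p3
        FROM (
            SELECT line, p1, LOCATE(',', line, p1 + 1) AS p2
            FROM (
                SELECT line, LOCATE(',', line) AS p1
                FROM raw_lines
            ) AS a
        ) AS b
    ) AS c
) AS d;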
This is too long for a comment.
You have a horrible data format in your CSV file. I think you should regenerate the file.
MySQL has facilities to help you handle this data, particularly the OPTIONALLY ENCLOSED BY option in LOAD DATA INFILE. The only caveat is that the enclosure is a single character, not a separate opening and closing pair.
My first suggestion would be to replace the field separators with another character -- tab or | come to mind -- any character that is not used in values within a field.
The second is to use a double quote for OPTIONALLY ENCLOSED BY. Then replace '[' with '"[' and ']' with ']"' in the data file. Even if you cannot regenerate the file, you can pre-process it using something like grep or Perl or Python to make this simple substitution.
Then you can use the import facilities for MySQL to load the file.
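A sketch of the load after that pre-processing (assuming the last field is now wrapped in double quotes and the field separator is still a comma; the path and table name are taken from the question):
-- Each pre-processed line is assumed to look like:
--   67.com,88.22.88.8,TEXT4,no4,"['88.22.88.8', '5.112.1.10']"
LOAD DATA INFILE '/var/lib/mysql-files/myfile.txt'
INTO TABLE `mytable`.`myscheme`
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(col1, col2, col3, col4, col5);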

"You cannot add or change a record because a related record is required", but related record exists?

I have two related tables, results and userID.
results looks like this:
+----+--------+--------+
| ID | userID | result |
+----+--------+--------+
| 1 | abc | 124 |
| 2 | abc | 792 |
| 3 | def | 534 |
+----+--------+--------+
userID looks like this:
+----+--------+---------+
| id | userID | name |
+----+--------+---------+
| 1 | abc | Angela |
| 2 | def | Gerard |
| 3 | zxy | Enrico |
+----+--------+---------+
In results, the userID field is a lookup field; it stores userID.id but the combo box has userID.userID as its choices.
When I try to enter data into results by setting the userID combo box and entering a value for result, I get this error message:
You cannot add or change a record because a related record
is required in table `userID`.
This is strange, because I'm specifically selecting a value that's provided in the userID combo box.
Oddly, there are about 100 rows of data already in results with the same value for userID.
I thought this might be a database corruption issue, so I created a blank database and imported all the tables into it. But I still got the same error. What's going on here?
Both tables include a text field named LanID. You are using that field in the relationship between the two tables, which enforces referential integrity.
The problem you're facing is due to the Lookup field properties. This is the Row Source:
SELECT [LanID].ID, [LanID].LanID FROM LanID ORDER BY [LanID];
But the value which gets stored (the Bound Column property) is the first column from that SELECT statement, which is the Long Integer [LanID].ID. So that number will not satisfy the relationship, which requires results.LanID = [LanID].LanID.
You must change the relationship or change the Lookup properties so both reference the same field value.
But if it were me, I would just eliminate the Lookup on the grounds that simple operations (such as this) become unnecessarily confusing when Lookup fields are involved. Make results.LanID a plain numeric or text field. If you want some kind of user-friendly drop-down for data entry, build a form with a combo or list box.
For additional arguments against Lookup fields, see The Evils of Lookup Fields in Tables.
If you are using a parameter query, make sure you have them in the same order as the table you are modifying and the query you have created. You might have one parameter inserting the conflicting data. Parameters are used in the order they are created...not the name of the parameter. I had the same problem and all I had to do was switch the order they were in so they matched the query. This is an old thread, so I hope this helps someone who is just now having this problem.

Is it possible to use a LOAD DATA INFILE type command to UPDATE rows in the db?

Pseudo table:
| primary_key | first_name | last_name | date_of_birth |
| 1 | John Smith | | 07/04/1982 |
At the moment first_name contains a user's full name for many rows. The desired outcome is to split the data, so first_name contains "John" and last_name contains "Smith".
I have a CSV file which contains the desired format of data:
| primary_key | first_name | last_name |
| 1 | John | Smith |
Is there a way of using the LOAD DATA INFILE command to process the CSV file to UPDATE all rows in this table using the primary_key - and not replace any other data in the row during the process (i.e. date_of_birth)?
In this situation I usually LOAD DATA INFILE to a temp table with identical structure. Then I do INSERT with ON DUPLICATE KEY UPDATE from the temp table to the real table. This allows for data type checking without wrecking your real table; it's relatively quick and it doesn't require fiddling with your .csv file.
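A minimal sketch of that pattern for the example above (the target table name users, the file name, and the column list are assumptions):
-- Temp table with the same structure as the CSV
CREATE TABLE names_tmp (
    primary_key INT PRIMARY KEY,
    first_name  VARCHAR(100),
    last_name   VARCHAR(100)
);
LOAD DATA INFILE '/path/to/names.csv'
INTO TABLE names_tmp
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(primary_key, first_name, last_name);
-- Rows matched on the primary key get only these columns overwritten;
-- date_of_birth is left untouched
INSERT INTO users (primary_key, first_name, last_name)
SELECT primary_key, first_name, last_name
FROM names_tmp
ON DUPLICATE KEY UPDATE
    first_name = VALUES(first_name),
    last_name  = VALUES(last_name);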
No. While LOAD DATA INFILE has a REPLACE option, it will actually replace the row in question - that is, delete the existing one and insert a new one.
If you configure your LOAD DATA INFILE to only insert certain columns all others will be set to their default values, not to values they currently contain.
Can you modify your CSV file to contain a bunch of UPDATE statements instead? Should be reasonably straightforward via some regex replaces.
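If you go that route, the rewritten file would just be a SQL script full of statements like this, one per CSV row (users here stands in for the real table name):
-- Generated from the row "1,John,Smith"; only the name columns are touched
UPDATE users
SET first_name = 'John',
    last_name  = 'Smith'
WHERE primary_key = 1;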