Hive imports data from CSV into the wrong columns of a table

Below are my table creation statement and a sample from my CSV:
DROP TABLE IF EXISTS xxx.fbp;
CREATE TABLE IF NOT EXISTS xxx.fbp (id bigint, p_name string, h_name string, ufi int, city string, country string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
74905,xxx,xyz,-5420642,City One,France
74993,xxx,zyx,-874432,City,Germany
75729,xxx,yzx,-1284248,City Two Long Name,France
I then load the data into a hive table with the following query:
LOAD DATA
INPATH '/user/xxx/hdfs_import/fbp.csv'
INTO TABLE xxx.fbp;
It seems that there is data leaking from the 5th csv "column" into the 6th column of the table. So, I'm seeing city data in my country column.
SELECT country, count(country) from xxx.fbp group by country
+---------+------+
| country | _c1  |
+---------+------+
| Germany | 1143 |
| City    | 1    |
+---------+------+
I'm not sure why city data is occasionally being imported to the country column. The csv is downloaded from Google Sheets and I've removed the header.

The reason could be that your line terminator is not '\n'; Windows-based tools add extra characters, which creates issues. It may also be that some fields contain the column separator, which would cause this.
Solution:
1. Try printing the problem lines with a WHERE country = 'City' clause; this will give you some idea of how Hive created the record.
2. Try a binary storage format to be 100% sure about the data processed by Hive.
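For example, to see how Hive split the offending records (using the table from the question):
SELECT * FROM xxx.fbp WHERE country = 'City' LIMIT 10;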
Hope it helps.

The issue was within the CSV itself. Some columns, such as p_name, contained , in several fields, which caused those fields to terminate sooner than expected. I had to clean the data and remove all commas. After that, it imported correctly. Quickly done with Python:
# Strip every comma from each line of the raw file before loading it into Hive
with open("fbp.csv") as infile, open("outfile.csv", "w") as outfile:
    for line in infile:
        outfile.write(line.replace(",", ""))
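If the stray commas actually sit inside quoted fields (Google Sheets quotes any cell containing a comma), an alternative that keeps the data intact is Hive's bundled OpenCSVSerde (available in Hive 0.14+). A sketch, assuming the file uses standard double-quote CSV quoting; note that this SerDe reads every column back as string, so id and ufi would need casting on read:
DROP TABLE IF EXISTS xxx.fbp;
CREATE TABLE IF NOT EXISTS xxx.fbp (id string, p_name string, h_name string, ufi string, city string, country string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES ("separatorChar" = ",", "quoteChar" = "\"")
STORED AS TEXTFILE;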

Create a Pandas table via string matching

I have a 3-column table where two columns are URLs and one column is a string that might be contained in the URLs. The first 100,000 rows can be found at this link:
https://raw.githubusercontent.com/Slusks/LeagueDataAnalysis/main/short_urls.csv
In theory, values in eurl and surl should be the same, and for every value of each there should be a gameid that matches both, i.e.:
https://datapoint/Identity1/foobar.com | Identity1 | https://datapoint/Identity1/foobar.com
I've tried some SQL queries on the data and can't get them to line up:
SELECT *
FROM table
WHERE eurl = surl;
Since the values started out in different tables, I also tried joining on table1.url = table2.url, and that hasn't worked either; it just comes back blank:
SELECT s.url, e.gameid
FROM elixerdata e
JOIN scrapeddata s ON e.url = s.url;
I'm trying to get the gameids matched up to the surl column, using the eurl column as validation to confirm that it worked correctly. I'm probably not providing enough code or steps to get good feedback, but I figure I might as well ask, since I'm low on ideas myself.
EDIT1:
I cleaned the quotes off by loading the table into Python and re-writing it to a CSV with pandas. The data in the CSV appears not to have any quotes; I then load it into SQL with the following:
drop table if exists urltable;
create table urltable(
eurl varchar(255),
gameid varchar(20),
surl varchar(255));
LOAD DATA LOCAL INFILE 'csvfile.csv' into table urltable
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 LINES;
When I read the table in MySQL Workbench there are no quotes, but if I export that table back to a csv, all the quotes are back, only for the surl column though.
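For what it's worth, the quotes could also be stripped at load time instead of in pandas. A minimal sketch, assuming the file uses standard CSV double-quoting (if the export came from Windows, '\r\n' may be needed as the line terminator):
LOAD DATA LOCAL INFILE 'csvfile.csv' INTO TABLE urltable
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;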

Multiple input file loading to single mysql table

I am kind of new here. I have been searching for two days with no luck, so I am posting my question here. Simply put, I need to load data into a table in MySQL. The thing is, the input data for this table will come from two different sources.
For example, below is how the two input files look.
Input_file1
Fields: Cust_ID1, Acct_ID1, MODIFIED, Cust_Name, Acct_name, Comp_name1, Add2, Comp_name2, Add4
Sample: C1001, A1001, XXXXXX, JACK, TIM, KFC, SINGAPORE, YUM BRAND, SINGAPORE
Input_file2
Fields: ID, MODIFIEDBY, Ref_id, Sys_id
Sample: 3001, TONY, 4001, 5001
Sorry, I was not able to copy the data as it appears in Excel, so I improvised: the ',' separates the values, 'Fields' gives the column names, and each corresponding value is under 'Sample'.
And the table that the above data needs to be loaded into is as follows.
Sample table structure:
ID
Cust_ID1
Acct_ID1
Ref_id
Sys_id
MODIFIED
MODIFIEDBY
Cust_Name
Acct_name
Comp_name1
Add2
Comp_name2
Add4
What I need to do is load the data from both input files into this table in one go. Is this possible? As you can see, the column order does not match the table either, so I cannot simply append the files and load them, which is the main issue for me.
And no, changing the input sequence is not an option; the data is huge, so that would take too much effort. I would appreciate any help with this. I would also like to know whether a shell or Perl script could be used to do this.
Thanks in advance for your help and time.
load data local infile 'c:\\temp\\file.csv'
into table table_name fields terminated by ',' LINES TERMINATED BY '\r\n' ignore 1 lines
(@col1, @col2, @col3, @col4, @col5, @col6, @col7, @col8, @col9)
set Cust_ID1   = @col1,
    Acct_ID1   = @col2,
    MODIFIED   = @col3,
    Cust_Name  = @col4,
    Acct_name  = @col5,
    Comp_name1 = @col6,
    Add2       = @col7,
    Comp_name2 = @col8,
    Add4       = @col9;

load data local infile 'c:\\temp\\file2.csv'
into table table_name fields terminated by ',' LINES TERMINATED BY '\r\n' ignore 1 lines
(@col1, @col2, @col3, @col4) -- map the variables to the respective table columns
set ID         = @col1,
    MODIFIEDBY = @col2,
    Ref_id     = @col3,
    Sys_id     = @col4;
This way you can import each file into the table.
Note: save the Excel file in CSV format first, and then import it.

What would be the fastest way to insert this data

Okay, so I have a MySQL table called entries which contains the columns name VARCHAR(255) NOT NULL and address VARCHAR(255).
The table has about a million sets of data.
For every set of data, name has a value (e.g. "john") whilst address is NULL.
For example:
+------+---------+
| name | address |
+------+---------+
| john | NULL    |
| jake | NULL    |
| zach | NULL    |
+------+---------+
I received a CSV file which contains names along with their corresponding address in the format of name:address.
Like I said, the entries table has nearly a million entries, so the csv file has about 800,000 lines.
I want to take each line in the csv, and insert the address where the name is the same which would be:
UPDATE `entries` SET `address` = <address from csv> WHERE `name` = <name from csv>;
I made a Python script that opens the CSV file and reads it line by line. For each line, it stores the name and address in separate variables and then executes the query above, but it was taking far too long to update all the rows.
Is there any way I could do this in MySQL, and if so, what is the fastest way?
Thanks.
You can import the CSV file into a separate table using MySQL's LOAD DATA INFILE and then update the entries table using a JOIN on the common name column.
E.g.:
update entries a inner join new_table b on a.name = b.name set a.address = b.address;
Here new_table is the table imported from the CSV file.
Don't forget to add an index on the name column of both tables so that the join is fast.
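A minimal end-to-end sketch of this approach, assuming the file really is one name:address pair per line (the staging-table name and the file path are placeholders):
CREATE TABLE new_table (
  name VARCHAR(255) NOT NULL,
  address VARCHAR(255)
);
LOAD DATA LOCAL INFILE '/path/to/file.csv' INTO TABLE new_table
FIELDS TERMINATED BY ':'
LINES TERMINATED BY '\n';
CREATE INDEX idx_new_table_name ON new_table (name);
CREATE INDEX idx_entries_name ON entries (name);
UPDATE entries a INNER JOIN new_table b ON a.name = b.name
SET a.address = b.address;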
Create table1 and table2
LOAD DATA INFILE '/path/theFile1.csv'
INTO TABLE table1
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
Ditto for file2 into table2.
Then proceed with the join update as above.
A batch query would definitely be faster. You could use a for loop to scan through your CSV file and build a string containing one large batch query.
For example (pseudo code):
String query = "UPDATE entries SET address = CASE name ";
For (begin of file to end)
    name  = NameFromFile;
    value = ValueFromFile;
    query += "WHEN '" + name + "' THEN '" + value + "' ";
End
query += "ELSE address END;";
Of course you would need to escape and quote those values when concatenating, and with 800,000 rows the statement may need to be sent in chunks to stay under max_allowed_packet. I wouldn't claim this is the fastest way, but it is definitely faster.
Sorry for the poor formatting, I'm on my phone.

SQL*Loader control file few columns are blank

My .dat file includes 1000 records in the below format.
SIC_ID|NAME|CHANGED_DATE|LOGICALLY_DELETED
110|Wheat|31/10/2010 29:46:00|N
The table into which I want to feed this content has a few more columns. I wish to leave those columns blank, as their content is not in the .dat file.
Table Columns:
SIC_ID, NAME, CREATED_USER_ID ,CREATED_DATE ,CHANGED_USER_ID ,CHANGED_DATE,LOGICALLY_DELETED,RECORD_VERSION
My control file is as below:-
OPTIONS (DIRECT=TRUE,SKIP=1)
LOAD DATA CHARACTERSET WE8MSWIN1252
INFILE "mic_file.dat"
BADFILE "sql/mic_file.bad"
REPLACE
INTO TABLE SDS_SIC
FIELDS TERMINATED BY "|"
TRAILING NULLCOLS
(SIC_ID, NAME,
DATE "DD/MM/YYYY HH24:MI:SS" NULLIF (CHANGED_DATE=BLANKS),
LOGICALLY_DELETED)
After running SQL*Loader, I see the below output and errors:
Column Name Position Len Term Encl Datatype
------------------------------ ---------- ----- ---- ---- ---------------------
SIC_ID FIRST * | CHARACTER
NAME NEXT * | CHARACTER
CHANGED_DATE NEXT * | CHARACTER
LOGICALLY_DELETED NEXT * | CHARACTER
Record 1: Rejected - Error on table SDS_SIC, column CHANGED_DATE.
ORA-26041: DATETIME/INTERVAL datatype conversion error
The last two lines of the error are thrown multiple times. This is fixed now :)
Error 2: LOGICALLY_DELETED has only two possible values, Y or N.
Record 51: Rejected - Error on table SDS_SIC, column LOGICALLY_DELETED.
ORA-12899: value too large for column LOGICALLY_DELETED (actual: 2, maximum: 1)
The above error is displayed multiple times.
Remember that the control file column list follows the order of the fields in the data file, while the data is matched to the table columns by name. Your control file has the 3rd and 4th fields mapped to FILLER, which is why they are blank; FILLER only applies to a field in the data file that you don't want.
You need something like this in your column list section; TRAILING NULLCOLS will handle the rest of the table's columns:
(SIC_ID,
NAME,
CHANGED_DATE DATE "DD/MM/YYYY HH24:MI:SS" NULLIF (CHANGED_DATE=BLANKS),
LOGICALLY_DELETED
)
See this recent post which happens to describe the relationship by giving an example: Skipping data fields while loading delimited data using SQLLDR
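Putting the pieces together, the whole control file would look something like this (assembled from the OPTIONS and INFILE lines of the original plus the corrected column list above):
OPTIONS (DIRECT=TRUE,SKIP=1)
LOAD DATA CHARACTERSET WE8MSWIN1252
INFILE "mic_file.dat"
BADFILE "sql/mic_file.bad"
REPLACE
INTO TABLE SDS_SIC
FIELDS TERMINATED BY "|"
TRAILING NULLCOLS
(SIC_ID,
NAME,
CHANGED_DATE DATE "DD/MM/YYYY HH24:MI:SS" NULLIF (CHANGED_DATE=BLANKS),
LOGICALLY_DELETED)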
You can go to the MySQL command line client and insert the values into the desired columns directly. For example, to insert values only into
SIC_ID|NAME|CHANGED_DATE|LOGICALLY_DELETED
and not into those extra columns, you would type:
insert into your_table_name (SIC_ID, NAME, CHANGED_DATE, LOGICALLY_DELETED)
values (112, 'wheat', '31/10/2010 19:46:00', 'N');
Use single quotes only around values whose columns are declared as character types such as VARCHAR.
Try it, it'll work.

LOAD XML LOCAL INFILE with Inconsistent Column Names

MySQL has a nice statement: LOAD XML LOCAL INFILE
For example, if you have this table:
CREATE TABLE person (
person_id INT NOT NULL PRIMARY KEY,
fname VARCHAR(40) NULL,
lname VARCHAR(40) NULL
);
and the following XML file called person.xml:
<list>
<person>
<person_id>1</person_id>
<fname>Mikael</fname>
<lname>Ronström</lname>
</person>
<person>
<person_id>2</person_id>
<fname>Lars</fname>
<lname>Thalmann</lname>
</person>
</list>
You can do this:
LOAD XML LOCAL INFILE 'person.xml'
INTO TABLE person
ROWS IDENTIFIED BY '<person>';
My question is, what if the column names were different in the XML file than they are in the table? For example:
<list>
<person>
<PersonId>1</PersonId>
<FirstName>Mikael</FirstName>
<LastName>Ronström</LastName>
</person>
<person>
<PersonId>2</PersonId>
<FirstName>Lars</FirstName>
<LastName>Thalmann</LastName>
</person>
</list>
How can you accomplish the same thing with a MySQL statement without manipulating the XML file? I searched everywhere but couldn't find an answer.
The fields in the XML file that don't correspond to physical column names are ignored. And columns in the table that don't have corresponding fields in the XML are set NULL.
What I'd do is load into a temp table as @Kolink suggests, but with additional columns. Add a SET clause as you load the data from the XML.
CREATE TEMPORARY TABLE person_xml LIKE person;
ALTER TABLE person_xml
ADD COLUMN FirstName VARCHAR(40),
ADD COLUMN LastName VARCHAR(40),
ADD COLUMN PersonId INT;
LOAD XML LOCAL INFILE 'person.xml' INTO TABLE person_xml
SET person_id = PersonId, fname = FirstName, lname = LastName;
SELECT * FROM person_xml;
+-----------+--------+-------------+-----------+-------------+----------+
| person_id | fname | lname | FirstName | LastName | PersonId |
+-----------+--------+-------------+-----------+-------------+----------+
| 1 | Mikael | Ronström | Mikael | Ronström | 1 |
| 2 | Lars | Thalmann | Lars | Thalmann | 2 |
+-----------+--------+-------------+-----------+-------------+----------+
Then copy to the real table, selecting a subset of columns.
INSERT INTO person SELECT person_id, fname, lname FROM person_xml;
Alternatively, drop the extra columns and use SELECT *.
ALTER TABLE person_xml
DROP COLUMN PersonId,
DROP COLUMN FirstName,
DROP COLUMN LastName;
INSERT INTO person SELECT * FROM person_xml;
SELECT * FROM person;
+-----------+--------+-------------+
| person_id | fname | lname |
+-----------+--------+-------------+
| 1 | Mikael | Ronström |
| 2 | Lars | Thalmann |
+-----------+--------+-------------+
A little bit hacky but working solution using the good old LOAD DATA INFILE:
LOAD DATA LOCAL INFILE '/tmp/xml/loaded.xml'
INTO TABLE person
CHARACTER SET binary
LINES STARTING BY '<person>' TERMINATED BY '</person>'
(@person)
SET
person_id = ExtractValue(@person:=CONVERT(@person using utf8), 'PersonId'),
fname = ExtractValue(@person, 'FirstName'),
lname = ExtractValue(@person, 'LastName')
;
P.S. You will probably need to additionally play with the field delimiter if the data contains commas.
The following were the options available to me:
Option 1: Create a temporary table with different field names (as suggested by the other answers). This would have been a satisfactory approach. However, when I tried it, a new problem emerged: the LOAD XML statement does not, for some reason, accept minimized format empty elements (for example <person />). So, the statement failed because the XML files I need to load occasionally have empty elements in that format.
Option 2: Transform the XML file with XSLT before running the LOAD XML statement to change the element names and modify the empty element formats. This was not feasible because the XML files are very large and XSLT processing engines load the entire XML into memory before processing.
Option 3: Bypass the LOAD XML statement entirely and use a SAX parser to parse the XML file and insert the records directly into the database using JDBC and prepared statements. Even though raw JDBC and prepared statements are generally efficient, this proved to be too slow. MUCH slower than the LOAD XML statement.
Option 4: Use the LOAD DATA statement instead of the LOAD XML statement and play around with the optional clauses associated with that statement to fit my needs (e.g. lines separated by, etc.). This could have worked but would have been error prone and unstable.
Option 5: Parse the file with a fast forward-only parser, reading and writing XML elements simultaneously, and generate a new XML file with the modified names in the desired format for the LOAD XML statement.
I ended up using option 5. I used the Java Streaming API for XML (StAX) both for reading the XML file and for generating the modified XML file, then ran LOAD XML LOCAL INFILE through JDBC from inside the web application. It works perfectly and it is super fast.
You could create a temporary table using the column names from the XML file (although this would have to be done manually in the CREATE TEMPORARY TABLE query), load the XML file into that table, and then run INSERT INTO person SELECT * FROM tmp_table_name.
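A minimal sketch of that approach against the person.xml example above (tmp_person is a placeholder name; its columns are declared in the same order as in person so that SELECT * lines up):
CREATE TEMPORARY TABLE tmp_person (
  PersonId INT,
  FirstName VARCHAR(40),
  LastName VARCHAR(40)
);
LOAD XML LOCAL INFILE 'person.xml'
INTO TABLE tmp_person
ROWS IDENTIFIED BY '<person>';
INSERT INTO person SELECT * FROM tmp_person;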
MySQL table schema: organization_type(id, name)
organizationtype.xml:
<NewDataSet>
<row>
<ItemID>1</ItemID>
<ItemCreatedBy>53</ItemCreatedBy>
<ItemCreatedWhen>2014-03-10T22:53:43.947+10:00</ItemCreatedWhen>
<ItemModifiedBy>53</ItemModifiedBy>
<ItemModifiedWhen>2014-03-10T22:53:43.99+10:00</ItemModifiedWhen>
<ItemOrder>1</ItemOrder>
<ItemGUID>e2ad051f-b7ea-4feb-b91e-f558f6f632a0</ItemGUID>
<Name>Company Type 1</Name>
</row>
</NewDataSet>
and the MySQL import query will look like this:
LOAD XML INFILE '/var/lib/mysql-files/organizationtype.xml'
INTO TABLE organization_type (@ItemID, @Name)
SET id = @ItemID, name = @Name;