LOAD XML LOCAL INFILE with Inconsistent Column Names - mysql

MySQL has a nice statement: LOAD XML LOCAL INFILE
For example, if you have this table:
CREATE TABLE person (
person_id INT NOT NULL PRIMARY KEY,
fname VARCHAR(40) NULL,
lname VARCHAR(40) NULL
);
and the following XML file called person.xml:
<list>
<person>
<person_id>1</person_id>
<fname>Mikael</fname>
<lname>Ronström</lname>
</person>
<person>
<person_id>2</person_id>
<fname>Lars</fname>
<lname>Thalmann</lname>
</person>
</list>
You can do this:
LOAD XML LOCAL INFILE 'person.xml'
INTO TABLE person
ROWS IDENTIFIED BY '<person>';
My question is, what if the column names were different in the XML file than they are in the table? For example:
<list>
<person>
<PersonId>1</PersonId>
<FirstName>Mikael</FirstName>
<LastName>Ronström</LastName>
</person>
<person>
<PersonId>2</PersonId>
<FirstName>Lars</FirstName>
<LastName>Thalmann</LastName>
</person>
</list>
How can you accomplish the same thing with a MySQL statement without manipulating the XML file? I searched everywhere but couldn't find an answer.

The fields in the XML file that don't correspond to physical column names are ignored, and columns in the table that don't have corresponding fields in the XML are set to NULL.
What I'd do is load into a temp table, as @Kolink suggests, but with additional columns. Add a SET clause as you load the data from the XML.
CREATE TEMPORARY TABLE person_xml LIKE person;
ALTER TABLE person_xml
ADD COLUMN FirstName VARCHAR(40),
ADD COLUMN LastName VARCHAR(40),
ADD COLUMN PersonId INT;
LOAD XML LOCAL INFILE 'person.xml' INTO TABLE person_xml
ROWS IDENTIFIED BY '<person>'
SET person_id = PersonId, fname = FirstName, lname = LastName;
SELECT * FROM person_xml;
+-----------+--------+----------+-----------+----------+----------+
| person_id | fname  | lname    | FirstName | LastName | PersonId |
+-----------+--------+----------+-----------+----------+----------+
|         1 | Mikael | Ronström | Mikael    | Ronström |        1 |
|         2 | Lars   | Thalmann | Lars      | Thalmann |        2 |
+-----------+--------+----------+-----------+----------+----------+
Then copy to the real table, selecting a subset of columns.
INSERT INTO person SELECT person_id, fname, lname FROM person_xml;
Alternatively, drop the extra columns and use SELECT *.
ALTER TABLE person_xml
DROP COLUMN PersonId,
DROP COLUMN FirstName,
DROP COLUMN LastName;
INSERT INTO person SELECT * FROM person_xml;
SELECT * FROM person;
+-----------+--------+----------+
| person_id | fname  | lname    |
+-----------+--------+----------+
|         1 | Mikael | Ronström |
|         2 | Lars   | Thalmann |
+-----------+--------+----------+

A little bit hacky but working solution using the good old LOAD DATA INFILE:
LOAD DATA LOCAL INFILE '/tmp/xml/loaded.xml'
INTO TABLE person
CHARACTER SET binary
LINES STARTING BY '<person>' TERMINATED BY '</person>'
(@person)
SET
person_id = ExtractValue(@person:=CONVERT(@person using utf8), 'PersonId'),
fname = ExtractValue(@person, 'FirstName'),
lname = ExtractValue(@person, 'LastName')
;
P.S. You will probably need to additionally play with the field delimiter if the data contains commas.
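For example, one way to sidestep that entirely is to pick a field delimiter that should never occur in the XML text (the '~|~' string below is just an illustrative choice):
-- same statement as above, but with an explicit field delimiter that cannot appear in the data
LOAD DATA LOCAL INFILE '/tmp/xml/loaded.xml'
INTO TABLE person
CHARACTER SET binary
FIELDS TERMINATED BY '~|~'
LINES STARTING BY '<person>' TERMINATED BY '</person>'
(@person)
SET
person_id = ExtractValue(@person:=CONVERT(@person using utf8), 'PersonId'),
fname = ExtractValue(@person, 'FirstName'),
lname = ExtractValue(@person, 'LastName')
;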

The following were the options available to me:
Option 1: Create a temporary table with different field names (as suggested by the other answers). This would have been a satisfactory approach. However, when I tried it, a new problem emerged: the LOAD XML statement does not, for some reason, accept minimized format empty elements (for example <person />). So, the statement failed because the XML files I need to load occasionally have empty elements in that format.
Option 2: Transform the XML file with XSLT before running the LOAD XML statement to change the element names and modify the empty element formats. This was not feasible because the XML files are very large and XSLT processing engines load the entire XML into memory before processing.
Option 3: Bypass the LOAD XML statement entirely and use a SAX parser to parse the XML file and insert the records directly into the database using JDBC and prepared statements. Even though raw JDBC and prepared statements are generally efficient, this proved to be too slow. MUCH slower than the LOAD XML statement.
Option 4: Use the LOAD DATA statement instead of the LOAD XML statement and play around with the optional clauses associated with that statement to fit my needs (e.g. lines separated by, etc.). This could have worked but would have been error prone and unstable.
Option 5: Parse the file with a fast forward-only parser, reading and writing XML elements simultaneously, to generate a new XML file with the modified names in the format the LOAD XML statement expects.
I ended up using option 5. I used the Java Streaming API for XML (StAX) for both reading the XML file and generating the modified XML file and then running the LOAD XML LOCAL INFILE through JDBC from inside the web application. It works perfectly and it is super fast.

You could create a temporary table using the column names from the XML file (although this would have to be done manually in the create temporary table query), load the XML file into that table, then insert into person select * from tmp_table_name.
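A minimal sketch of that approach, assuming the element names from the second XML sample above (the tmp_person table name is just for illustration):
-- temporary table whose columns match the XML element names
CREATE TEMPORARY TABLE tmp_person (
PersonId INT,
FirstName VARCHAR(40),
LastName VARCHAR(40)
);
LOAD XML LOCAL INFILE 'person.xml'
INTO TABLE tmp_person
ROWS IDENTIFIED BY '<person>';
-- column order matches person(person_id, fname, lname), so SELECT * lines up
INSERT INTO person SELECT * FROM tmp_person;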

mysql table schema: organization_type(id, name)
organizationtype.xml:
<NewDataSet>
<row>
<ItemID>1</ItemID>
<ItemCreatedBy>53</ItemCreatedBy>
<ItemCreatedWhen>2014-03-10T22:53:43.947+10:00</ItemCreatedWhen>
<ItemModifiedBy>53</ItemModifiedBy>
<ItemModifiedWhen>2014-03-10T22:53:43.99+10:00</ItemModifiedWhen>
<ItemOrder>1</ItemOrder>
<ItemGUID>e2ad051f-b7ea-4feb-b91e-f558f6f632a0</ItemGUID>
<Name>Company Type 1</Name>
</row>
</NewDataSet>
and the mysql import query will look like this:
LOAD XML INFILE '/var/lib/mysql-files/organizationtype.xml'
INTO TABLE organization_type (@ItemID, @Name)
SET id=@ItemID, name=@Name

Related

Loading a huge CSV file with high performance in Oracle table

I have a CSV file whose size is about 20 GB. The file has three Persian columns. I want to load it into an Oracle table. I searched and found that SQL*Loader has high performance. But when I load the file into the table, the Persian data is not loaded in the right order; I think this is because Persian is a right-to-left language.
I use this control file:
OPTIONS (SKIP=0, ERRORS=500, PARALLEL=TRUE, MULTITHREADING=TRUE, DIRECT=TRUE,
SILENT=(ALL))
load data
CHARACTERSET UTF8
infile '/home/oracle/part.csv'
APPEND
into table Fact_test
fields terminated by ','
trailing nullcols(
A_ID INTEGER,
T_ID,
G_ID,
TRTYPE,
ORRETURNED,
MECH,
AMN,
TRAM INTEGER,
USERID INTEGER,
USERS INTEGER,
VERID INTEGER,
TRSTAMP CHAR(4000),
OPR_BRID INTEGER
)
File is like this:
A_ID,T_ID,g_id,TrType,ORRETURNED,Mech,Amn,Tram,UserID,UserS,VerID,TRSTAMP,OPR_BRID
276876739075,154709010853,4302,بروفق,اصلی,غیر سبک,بررسی,86617.1,999995,NULL,NULL,1981-11-16 13:23:16,2516
When I export the table in Excel format, I get this; some numbers have become negative:
(A_ID,T_ID,g_id,TrType,ORRETURNED,Mech,Amn,Tram,UserID,UserS,VerID,TRSTAMP,OPR_BRID) values (276876739075,'154709010853',411662402610,'4302','غیر بررسی','اصلي','سبک',-1344755500,-1445296167,-1311201320,909129772,'77.67',960051513);
The problem is that when the data is loaded, some columns contain negative numbers and the order of some columns changes.
Would you please guide me on how to solve this issue?
Any help is really appreciated.
Problem solved:
I changed the control file to this one:
load data
CHARACTERSET UTF8
infile '/home/oracle/test_none.csv'
APPEND
into table Fact_test
FIELDS terminated BY ','
trailing nullcols(
A_ID CHAR(50),
T_ID CHAR(50),
G_ID CHAR(50),
TRTYPE,
ORRETURNED,
MECH,
AMN,
TRAM CHAR(50),
USERID,
USERS CHAR(50),
VERID CHAR(50),
TRSTAMP,
OPR_BRID CHAR(50)
)

JSON_VALUE for Nameless JSON payload

Just started playing with JSON_VALUE in SQL Server. I am able to pull values from name/value pairs of JSON but I happen to have an object that looks like this:
["first.last#domain.com"]
When I attempt what works for name/value pairs:
SELECT TOP 1
jsonemail,
JSON_VALUE(jsonemail, '$') as pleaseWorky
FROM MyTable
I get back the full input, not first.last@domain.com. Am I out of luck? I don't control the upstream source of the data. I think it's a string collection being converted into a JSON payload. If it were name: first.last@domain.com, I would be able to get it with $.name.
Thanks in advance.
It is a JSON array, so you just need to specify the element index, i.e. 0.
Please try the following solution.
SQL
-- DDL and sample data population, start
DECLARE @tbl TABLE (ID INT IDENTITY PRIMARY KEY, jsonemail NVARCHAR(MAX));
INSERT INTO @tbl (jsonemail) VALUES
('["first.last@domain.com"]');
-- DDL and sample data population, end
SELECT ID
, jsonemail AS [Before]
, JSON_VALUE(jsonemail, '$[0]') as [After]
FROM @tbl;
Output
+----+---------------------------+-----------------------+
| ID | Before                    | After                 |
+----+---------------------------+-----------------------+
| 1  | ["first.last@domain.com"] | first.last@domain.com |
+----+---------------------------+-----------------------+
From the docs:
Array elements. For example, $.product[3]. Arrays are zero-based.
So you need JSON_VALUE(..., '$[0]') when the root is an array and you want the first value.
To break it out into rows, you would need OPENJSON:
SELECT TOP 1
jsonemail
,j.[value] as pleaseWorky
FROM MyTable
CROSS APPLY OPENJSON(jsonemail) j

How do I return a JSON updated document in Oracle?

From the docs I see an example:
SELECT json_mergepatch(po_document, '{"Special Instructions":null}'
RETURNING CLOB PRETTY)
FROM j_purchaseorder;
But when I try this code in SQL Developer, I get a squiggly line under CLOB and an error when I run the query.
It works in Oracle 18c:
SELECT json_mergepatch(
po_document,
'{"Special Instructions":null}'
RETURNING CLOB PRETTY
) AS updated_po_document
FROM j_purchaseorder;
Which for the test data:
CREATE TABLE j_purchaseorder( po_document CLOB CHECK ( po_document IS JSON ) );
INSERT INTO j_purchaseorder ( po_document )
VALUES ( '{"existing":"value", "Special Instructions": 42}' );
Outputs:
UPDATED_PO_DOCUMENT
-------------------
{
  "existing" : "value"
}
Removing the Special Instructions attribute as per the documentation you linked to:
When merging object members that have the same field:
If the patch field value is null then the field is dropped from the source — it is not included in the result.
Otherwise, the field is kept in the result, but its value is the result of merging the source field value with the patch field value. That is, the merging operation in this case is recursive — it dives down into fields whose values are themselves objects.
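For comparison, here is a hedged sketch of the other branch against the same test data: a non-null patch value replaces the field instead of dropping it.
SELECT json_mergepatch(
po_document,
'{"Special Instructions":"Handle with care"}' -- patch value is just an example
RETURNING CLOB PRETTY
) AS updated_po_document
FROM j_purchaseorder;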
db<>fiddle here

Hive imports data from CSV to incorrect columns in table

Below is my table creation and a sample from my CSV:
DROP TABLE IF EXISTS xxx.fbp;
CREATE TABLE IF NOT EXISTS xxx.fbp (id bigint, p_name string, h_name string, ufi int, city string, country string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE;
74905,xxx,xyz,-5420642,City One,France
74993,xxx,zyx,-874432,City,Germany
75729,xxx,yzx,-1284248,City Two Long Name,France
I then load the data into a hive table with the following query:
LOAD DATA
INPATH '/user/xxx/hdfs_import/fbp.csv'
INTO TABLE xxx.fbp;
It seems that there is data leaking from the 5th csv "column" into the 6th column of the table. So, I'm seeing city data in my country column.
SELECT country, count(country) from xxx.fbp group by country
+----------+-------+
| country  | _c1   |
+----------+-------+
| Germany  | 1143  |
| City     | 1     |
+----------+-------+
I'm not sure why city data is occasionally being imported to the country column. The csv is downloaded from Google Sheets and I've removed the header.
The reason could be that your line termination is not '\n'; Windows-based tools add extra characters, which creates issues. It may also be that some fields contain the column separator, which would cause this.
Solution:
1. Try printing the lines that have the issue with a WHERE country = 'City' clause; this will give you some idea of how Hive created the record (see the example below).
2. Try a binary storage format to be 100% sure about the data processed by Hive.
Hope it helps.
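For example, step 1 could look like this (assuming the xxx.fbp table from the question):
-- show how Hive split the offending rows
SELECT * FROM xxx.fbp WHERE country = 'City';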
The issue was within the CSV itself. Some columns, such as p_name, contained , in several fields. This would cause a field to end sooner than expected, shifting the remaining values. I had to clean the data and remove all commas; after that, it imported correctly. This was quickly done with Python.
with open("fbp.csv") as infile, open("outfile.csv", "w") as outfile:
for line in infile:
outfile.write(line.replace(",", ""))

What would be the fastest way to insert this data

Okay, so I have a MySQL table called entries which contains the columns name VARCHAR(255) NOT NULL and address VARCHAR(255)
The table has about a million sets of data.
For every set of data, name has a value e.g. "john" whilst address is NULL.
For example:
+------+---------+
| name | address |
+------+---------+
| john | NULL    |
+------+---------+
| jake | NULL    |
+------+---------+
| zach | NULL    |
+------+---------+
I received a CSV file which contains names along with their corresponding address in the format of name:address.
Like I said, the entries table has nearly a million entries, so the csv file has about 800,000 lines.
I want to take each line in the CSV and insert the address where the name is the same, which would be:
UPDATE `entries` SET `address` = <address from csv> WHERE `name` = <name from csv>;
I made a Python script that opens the CSV file and reads it line by line. For each line, it stores the name and address in separate variables and then executes the query above, but it was taking too long to insert the data into the columns.
Is there any way I could do this in MySQL? If so, what is the fastest way?
Thanks.
You can import the CSV file into a separate table using MySQL's LOAD DATA INFILE and then update the entries table with a JOIN on the common name column.
E.g.:
update entries a inner join new_table b on a.name = b.name set a.address = b.address;
Here new_table is the table imported from the CSV file.
Don't forget to add an index on the name column in both tables so that it is fast (see the sketch below).
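A sketch of the whole flow, assuming the CSV is named names_addresses.csv and uses the name:address format described in the question (the table and index names are just for illustration):
-- staging table for the CSV
CREATE TABLE new_table (
name VARCHAR(255) NOT NULL,
address VARCHAR(255)
);
-- the file uses ':' between name and address
LOAD DATA LOCAL INFILE 'names_addresses.csv'
INTO TABLE new_table
FIELDS TERMINATED BY ':'
LINES TERMINATED BY '\n'
(name, address);
-- index both sides of the join
ALTER TABLE entries ADD INDEX idx_name (name);
ALTER TABLE new_table ADD INDEX idx_name (name);
UPDATE entries a
INNER JOIN new_table b ON a.name = b.name
SET a.address = b.address;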
Create table1 and table2
LOAD DATA INFILE '/path/theFile1.csv'
INTO TABLE table1
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
Do the same for file2 into table2, then proceed with the join update shown above.
Well, using a batch query would definitely be faster. You could use a for loop to scan through your CSV file and build a string that forms one large batch query.
For example (pseudo code):
String Query ="UPDATE entries SET Value = ( CASE ";
For (begin of file to end)
Name = NameFromFile;
Value = ValueFromFile;
Query += "WHEN NameField = ";
Query += Name + " THEN " +Value;
End
Query+= " )";
Of course you would need to convert those values to strings (and quote/escape them) when concatenating. I wouldn't claim this is the fastest option, but it is definitely faster.
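For illustration, the generated statement would end up looking roughly like this (the addresses are invented; ELSE address leaves rows whose names are not in the CSV untouched):
UPDATE entries
SET address = CASE name
WHEN 'john' THEN '12 Example Street'
WHEN 'jake' THEN '34 Sample Road'
ELSE address
END;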
Sorry for the poor formatting, I'm on my phone.