Set timestamp on insert when loading CSV [duplicate] - mysql

This question already has answers here:
How can i add date as auto update when import data from csv file?
(2 answers)
Closed 9 years ago.
I have a TIMESTAMP field that is defined to be automatically updated with the CURRENT_TIMESTAMP value.
It works fine when I run a query directly, but when I import a CSV (which I'm forced to do since one of the fields is longtext), the update does not work.
I have tried to:
put now() in the timestamp column of the CSV
manually enter a timestamp like 2013-08-08 in the CSV
Neither approach works.

From what I gather, after the update to your question, you're actually updating rows using a CSV and expecting the ON UPDATE clause to set the value of your timestamp field.
Sadly, when loading a CSV into a database you're not updating rows, you're inserting them. At least when using a LOCAL INFILE: if the INFILE isn't LOCAL, duplicate-key errors abort the query; if it is a local file, those duplicates only produce warnings and the operation continues.
If this isn't the case for you, perhaps consider following one of the examples on the doc pages:
LOAD DATA INFILE 'your.csv'
INTO TABLE tbl
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(field_name1, field_name2, field_name3)
SET updated = NOW();
Just in case you can't/won't/forget to add additional information, loading a CSV into a MySQL table is quite easy:
LOAD DATA
LOCAL INFILE '/path/to/file/filename1.csv'
INTO TABLE db.tbl
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(`field_name1`,`field_name2`,`field_name3`)
If you create a table along the lines of:
CREATE TABLE tbl(
id INT AUTO_INCREMENT PRIMARY KEY, -- since your previous question mentioned auto-increment
field_name1 VARCHAR(255) NOT NULL, -- normal fields
field_name2 INTEGER(11) NOT NULL,
field_name3 VARCHAR(255) NOT NULL DEFAULT '',
-- when not specified, this field will receive CURRENT_TIMESTAMP as value:
inserted TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP,
-- if the row is updated, this field will hold the timestamp of update time:
updated TIMESTAMP NOT NULL DEFAULT 0
ON UPDATE CURRENT_TIMESTAMP
) ENGINE = INNODB
CHARACTER SET utf8 COLLATE utf8_general_ci;
This query is untested, so please be careful with it; it's just meant to give a general idea of what you need to do to get the insert timestamp in there.
This example table will work like so:
> INSERT INTO tbl (field_name1, field_name2) VALUES ('foobar', 123);
> SELECT * FROM tbl WHERE field_name1 = 'foobar' AND field_name2 = 123;
This will show:
+----+-------------+-------------+-------------+---------------------+---------------------+
| id | field_name1 | field_name2 | field_name3 | inserted            | updated             |
+----+-------------+-------------+-------------+---------------------+---------------------+
|  1 | foobar      | 123         |             | 2013-08-07 00:00:00 | 0000-00-00 00:00:00 |
+----+-------------+-------------+-------------+---------------------+---------------------+
As you can see, because we didn't explicitly insert a value into the last three fields, MySQL used their DEFAULT values. For field_name3, an empty string was used; for inserted, the default was CURRENT_TIMESTAMP; for updated, the default value was 0, which, because the field type is TIMESTAMP, is represented as 0000-00-00 00:00:00. If you were to run the following query next:
UPDATE tbl
SET field_name3 = 'an update'
WHERE field_name1 = 'foobar'
AND field_name2 = 123
AND id = 1;
The row would look like this:
+----+-------------+-------------+-------------+---------------------+---------------------+
| id | field_name1 | field_name2 | field_name3 | inserted            | updated             |
+----+-------------+-------------+-------------+---------------------+---------------------+
|  1 | foobar      | 123         | an update   | 2013-08-07 00:00:00 | 2013-08-07 00:00:20 |
+----+-------------+-------------+-------------+---------------------+---------------------+
That's all. Some basics can be found here, on mysqltutorial.org, but best keep the official manual at hand. It's not bad once you get used to it.
Perhaps this question might be worth a quick peek, too.

Related

How to INSERT or UPDATE a large number of rows (regarding the auto_increment value of a table)

I have a MySQL table with around 3 million rows (listings) at the moment. These listings are updated 24/7 (around 30 listings/sec) by a Python script (Scrapy) using pymysql - so the performance of the queries is relevant!
If a listing doesn't exist (i.e. the UNIQUE url), a new record will be inserted (which is around every hundredth listing). The id is set to auto_increment and I am using an INSERT INTO listings ... ON DUPLICATE KEY UPDATE last_seen_at = CURRENT_TIMESTAMP. The update on last_seen_at is necessary to check whether the item is still online, as I am crawling the search results page with multiple listings on it rather than checking each individual URL every time.
+--------------+-------------------+-----+----------------+
| Field | Type | Key | Extra |
+--------------+-------------------+-----+----------------+
| id | int(11) unsigned | PRI | auto_increment |
| url | varchar(255) | UNI | |
| ... | ... | | |
| last_seen_at | timestamp | | |
| ... | ... | | |
+--------------+-------------------+-----+----------------+
The problem:
At first, it all went fine. Then I noticed larger and larger gaps in the auto_incremented id column and found out it's due to the INSERT INTO ... statement: MySQL attempts to do the insert first. This is when the id gets auto incremented. Once incremented, it stays. Then the duplicate is detected and the update happens.
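A quick illustration of the behaviour described above - a sketch with simplified columns and made-up URLs (it assumes url carries the UNIQUE index; the exact ids reserved depend on innodb_autoinc_lock_mode):
-- First statement inserts a new row; it typically gets id 1.
INSERT INTO listings (url) VALUES ('http://example.com/a')
ON DUPLICATE KEY UPDATE last_seen_at = CURRENT_TIMESTAMP;
-- Same URL again: the duplicate is detected and only the update runs,
-- but an auto_increment value has already been reserved and is lost.
INSERT INTO listings (url) VALUES ('http://example.com/a')
ON DUPLICATE KEY UPDATE last_seen_at = CURRENT_TIMESTAMP;
-- A genuinely new URL now typically gets id 3 instead of 2, leaving a gap.
INSERT INTO listings (url) VALUES ('http://example.com/b')
ON DUPLICATE KEY UPDATE last_seen_at = CURRENT_TIMESTAMP;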
Now my question is: which is the best solution with regard to performance over the long term?
Option A: Set the id column to unsigned INT or BIGINT and just ignore the gaps. The problem here is that I'm afraid of hitting the maximum after a couple of years of updating. I'm already at an auto_increment value of around 12,000,000 for around 3,000,000 listings after two days of updating...
Option B: Switch to an INSERT IGNORE ... statement, check the affected rows and UPDATE ... if necessary (see the sketch after this list).
Option C: SELECT ... the existing listings, check existence within Python and INSERT ... or UPDATE ... accordingly.
Any other wise options?
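A minimal sketch of Option B, showing only the url and last_seen_at columns (the other columns are omitted, and the URL is made up):
INSERT IGNORE INTO listings (url, last_seen_at)
VALUES ('http://example.com/listing/123', CURRENT_TIMESTAMP);
-- If the affected-rows count is 0, the listing already exists,
-- so only then touch last_seen_at:
UPDATE listings
SET last_seen_at = CURRENT_TIMESTAMP
WHERE url = 'http://example.com/listing/123';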
Additional Info: I need an id for information related to a listing stored in other tables (e.g. listings_images, listings_prices etc.). IMHO using the URL (which is unique) won't be the best option for foreign keys.
+------------+-------------------+
| Field | Type |
+------------+-------------------+
| listing_id | int(11) unsigned |
| price | int(9) |
| created_at | timestamp |
+------------+-------------------+
I was in the exact same situation as yours.
I had millions of records being entered into a table by a scraper, and the scraper ran every day.
I tried the following, but both attempts failed:
Load all URLs into a Python tuple or list and, while scraping, only scrape those which are not in the list - FAILED, because at the time of loading the URLs into a Python tuple or list the script consumed too much of the server's RAM.
Check each record before inserting - FAILED, because it made the INSERT process far too slow: it first had to query the table with millions of rows and then decide whether to INSERT or not.
SOLUTION THAT WORKED FOR ME (for a table with millions of rows):
I removed the id column, because it is irrelevant and I do not need it.
Make url the PRIMARY KEY, since it will be unique.
Add a UNIQUE INDEX - this is a must; it will increase your table's performance drastically.
Do bulk inserts instead of inserting one by one (see the pipeline code below).
Notice it is using INSERT IGNORE INTO, so only new records will be entered; if a record already exists, it will be ignored completely.
If you use REPLACE INTO instead of INSERT IGNORE INTO in MySQL, new records will still be entered, but if a record already exists, it will be replaced with the new row.
class BatchInsertPipeline(object):
    def __init__(self):
        self.items = []
        self.query = None

    def process_item(self, item, spider):
        # The item carries its destination table name; pop it before building the row.
        table = item['_table_name']
        del item['_table_name']

        # Build the INSERT IGNORE statement once, from the first item's keys.
        if self.query is None:
            placeholders = ', '.join(['%s'] * len(item))
            columns = '`' + '`, `'.join(item.keys()).rstrip(' `') + '`'
            self.query = 'INSERT IGNORE INTO ' + table + ' ( %s ) VALUES ( %s )' \
                % (columns, placeholders)

        self.items.append(tuple(item.values()))

        # Flush to MySQL in batches of 500 rows.
        if len(self.items) >= 500:
            self.insert_current_items(spider)

        return item

    def insert_current_items(self, spider):
        spider.cursor.executemany(self.query, self.items)
        self.items = []

    def close_spider(self, spider):
        # Flush whatever is left when the spider finishes.
        self.insert_current_items(spider)
        self.items = []

mysql change HEX number to decimal in LOAD DATA LOCAL INFILE

I have a test.csv file with Id in HEX numbers as below:
Id, DateTime,...
66031851, ...
2E337E4E, ...
The table_test is created in MySQL as below:
CREATE TABLE table_test(
Id BIGINT NOT NULL,
DateTime DATETIME NOT NULL,
OtherId BIGINT NOT NULL,
...,
PRIMARY KEY (Id, DateTime, OtherId)
)ENGINE=InnoDB DEFAULT CHARSET=utf8;
The created table_test is as below:
+---------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+---------------------+--------------+------+-----+---------+-------+
| Id | bigint(20) | NO | PRI | NULL | |
| DateTime | datetime | NO | PRI | NULL | |
I am using the following MySQL statement to load the data into the table:
load data local infile 'test.csv'
replace into table table_test
character set utf8mb4
fields terminated by ','
ENCLOSED BY '\"'
lines terminated by '\n'
ignore 1 lines
SET Id=CONV(Id, 16, 10);
Also tried:
SET Id=cast(CONV(Id, 16, 10) AS UNSIGNED)
and
SET Id=cast(CONV(CONVERT(Id,CHAR), 16, 10) AS UNSIGNED)
But the hex numbers containing letters, like "2E337E4E", do not work. They become a very big number, larger than a BIGINT. But when I try the following in MySQL:
select CONV('2E337E4E', 16, 10);
It works as expected with the correct result "775126606". So I think I am missing a step in "LOAD DATA" to make the Id a string for CONV(). I searched for some time but did not find a solution.
Does anyone have an idea or a hint?
The typical solution for this type of problem is to load the value into a user-defined variable and then do the conversion in a SET clause; if you reference the column directly, the hex string has already been coerced into the BIGINT column before CONV() ever sees it.
Something like this should work for you:
load data local infile 'test.csv'
replace into table table_test
character set utf8mb4
fields terminated by ','
ENCLOSED BY '\"'
lines terminated by '\n'
ignore 1 lines
(#Id, `DateTime`, <explicitly list all other columns>)
SET Id=CONV(#Id, 16, 10);

Load data from CSV inside bit field in mysql

What's the right syntax to insert a value into a column of type bit(1) in MySQL?
My column definition is:
payed bit(1) NOT NULL
I'm loading the data from a CSV where the value is saved as 0 or 1.
I've tried to do the insert using:
b'value' or 0bvalue (for example b'1' or 0b1)
as indicated in the manual.
But I keep getting this error:
Warning | 1264 | Out of range value for column 'payed' at row 1
What's the right way to insert a bit value?
I'm not doing the insert manually; I'm loading the data from a CSV (using load data infile) in which the data for the column is 0 or 1.
This is my load query. I've renamed the fields for privacy reasons; there's no error in that part of the definition:
load data local infile 'input_data.csv' into table table
fields terminated by ',' lines terminated by '\n'
(id, year, field1, #date2, #date1, field2, field3, field4, field5, field6, payed, field8, field9, field10, field11, project_id)
set
date1 = str_to_date(#date1, '%a %b %d %x:%x:%x UTC %Y'),
date2 = str_to_date(#date2, '%a %b %d %x:%x:%x UTC %Y');
show warnings;
This is an example row of my CSV:
200014,2013,0.0,Wed Feb 09 00:00:00 UTC 2014,Thu Feb 28 00:00:00 UTC 2013,2500.0,21,Business,0,,0,40.0,0,PROSPECT,1,200013
Update:
I didn't find a solution with the bit, so I've changed the column data type from bit to tinyint to make it work.
I've finally found the solution and I'm posting it here for future reference. I've found help in the mysql load data manual page.
So for test purpose my table structure is:
+--------+-------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------+-------------+------+-----+---------+-------+
| id | int(11) | NO | PRI | NULL | |
| nome | varchar(45) | YES | | NULL | |
| valore | bit(1) | YES | | NULL | |
+--------+-------------+------+-----+---------+-------+
My csv test file is:
1,primo_valore,1
2,secondo_valore,0
3,terzo_valore,1
The query to load the csv into the table is:
load data infile 'test.csv' into table test
fields terminated by ',' lines terminated by '\n'
(id, nome, #valore) set
valore=cast(#valore as signed);
show warnings;
As you can see, to load the CSV you need to do a cast, cast(#valore as signed), and in your CSV you can use the integer notation 1 or 0 to indicate the bit value. This is because BIT values cannot be loaded using binary notation (for example, b'011010').
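To verify the load you can read the BIT column back as an integer; a quick sketch against the test table above:
select id, nome, valore+0 as valore from test;
-- valore+0 converts the bit(1) value to 0 or 1 for display.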
Replace the "0" values in the csv by no value at all. That worked for me.
You can use the BIN() function like this:
INSERT INTO `table` (`column`) VALUES (BIN(1)), (BIN(0));
Let me guess, but I think you should ignore the first line of your CSV file in the LOAD query.
See "IGNORE number LINES".

Seeking coding example for TSQLTimeStamp

Delphi XE2 and MySql.
My previous question led to the recommendation that I should be using MySql's native TIMESTAMP datatype to store date/time.
Unfortunately, I can't seem to find any coding examples, and I am getting weird results.
Given this table:
mysql> describe test_runs;
+------------------+-------------+------+-----+---------------------+-------+
| Field | Type | Null | Key | Default | Extra |
+------------------+-------------+------+-----+---------------------+-------+
| start_time_stamp | timestamp | NO | PRI | 0000-00-00 00:00:00 | |
| end_time_stamp | timestamp | NO | | 0000-00-00 00:00:00 | |
| description | varchar(64) | NO | | NULL | |
+------------------+-------------+------+-----+---------------------+-------+
3 rows in set (0.02 sec)
I would like to:
declare a variable into which I can store the result of SELECT CURRENT_TIMESTAMP - what type should it be? TSQLTimeStamp?
insert a row at test start which has start_time_stamp = the variable above
and end_time_stamp = some "NULL" value ... "0000-00-00 00:00:00"? Can I use that directly, or do I need to declare a TSQLTimeStamp and set each field to zero? (there doesn't seem to be a TSQLTimeStamp.Clear; it's a structure, not a class)
update the end_time_stamp when the test completes
calculate the test duration
Can someone please point me at a URL with some Delphi code which I can study to see how to do this sort of thing? GINMF.
I don't know why you want to hassle around with that TIMESTAMP and why you want to retrieve the CURRENT_TIMESTAMP just to put it back.
And, as already stated, it is not good advice to use a TIMESTAMP field as the PRIMARY KEY.
So my suggestion is to use this TABLE SCHEMA
CREATE TABLE `test_runs` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`start_time_stamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP,
`end_time_stamp` timestamp NULL DEFAULT NULL,
`description` varchar(64) NOT NULL,
PRIMARY KEY (`id`)
);
Starting a test run is handled by
INSERT INTO test_runs ( description ) VALUES ( :description );
SELECT LAST_INSERT_ID() AS id;
and to finalize the record you simply call
UPDATE test_runs SET end_time_stamp = CURRENT_TIMESTAMP WHERE id = :id
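The test duration the question asks for can then be computed on the MySQL side as well; a small sketch, assuming the schema above:
SELECT TIMESTAMPDIFF(SECOND, start_time_stamp, end_time_stamp) AS duration_seconds
FROM test_runs
WHERE id = :id;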
Just declare a TSQLQuery (or the correct component for the data access layer of your choice), attach it to a valid connection and populate its SQL property with:
select * from test_runs;
Double-click on the query to launch its Fields Editor and select Add all fields from that editor's context menu.
It will create the correct field type, according to the data access layer and driver you're using to access your data.
Once that's done, if you need to use the value in code, you usually do it via the AsDateTime property of the field, so you just use a plain TDateTime Delphi type and let the database access layer deal with the database-specific details of storing that field.
For example, if your query object is named qTest and the table field is named start_time_stamp, your Delphi variable associated with that persistent field will be named qTeststart_time_stamp, so you can do something like this:
var
  StartTS: TDateTime;
begin
  qTest.Open;
  StartTS := qTeststart_time_stamp.AsDateTime;
  ShowMessage('start date is ' + DateTimeToStr(StartTS));
end;
If you use dbExpress and are new to it, read A Guide to Using dbExpress in Delphi database applications
I don't know about MySQL, but if the TField subclass generated is a TSQLTimeStampField, you will need to use the type and functions in the SqlTimSt unit (Data.SqlTimSt for XE2+).
You want to declare the local variables as TSQLTimeStamp
uses Data.SqlTimSt, ...;
....
var
  StartTS: TSQLTimeStamp;
  EndTS: TSQLTimeStamp;
begin
  StartTS := qTeststart_time_stamp.AsSQLTimeStamp;
SqlTimSt also includes functions to convert to and from TSQLTimeStamp, e.g. SQLTimeStampToDateTime and DateTimeToSQLTimeStamp.
P.S. I tend to agree that using a timestamp as a primary key is likely to cause problems. I would tend to use an auto-incrementing surrogate key, as Sir Rufo suggests.

Fastest way to diff datasets and update/insert lots of rows into large MySQL table?

The schema
I have a MySQL database with one large table (5 million rows say). This table has several fields for actual data, an optional comment field, and fields to record when the row was first added and when the data is deleted. To simplify to one "data" column, it looks a bit like this:
+----+------+---------+---------+---------+
| id | data | comment | created | deleted |
+----+------+---------+---------+---------+
|  1 | val1 | NULL    |       1 |       2 |
|  2 | val2 | nice    |       1 |    NULL |
|  3 | val3 | NULL    |       2 |    NULL |
|  4 | val4 | NULL    |       2 |       3 |
|  5 | val5 | NULL    |       3 |    NULL |
+----+------+---------+---------+---------+
This schema allows us to look at any past version of the data thanks to the created and deleted fields e.g.
SET #version=1;
SELECT data, comment FROM MyTable
WHERE created <= #version AND
(deleted IS NULL OR deleted > #version);
+------+---------+
| data | comment |
+------+---------+
| val1 | NULL    |
| val2 | nice    |
+------+---------+
The current version of the data can be fetched more simply:
SELECT data, comment FROM MyTable WHERE deleted IS NULL;
+------+---------+
| data | comment |
+------+---------+
| val2 | nice    |
| val3 | NULL    |
| val5 | NULL    |
+------+---------+
DDL:
CREATE TABLE `MyTable` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`data` varchar(32) NOT NULL,
`comment` varchar(32) DEFAULT NULL,
`created` int(11) NOT NULL,
`deleted` int(11) DEFAULT NULL,
PRIMARY KEY (`id`),
KEY `data` (`data`,`comment`)
) ENGINE=InnoDB;
Updating
Periodically a new set of data and comments arrives. This can be fairly large, half a million rows say. I need to update MyTable so that this new data set is stored in it. This means:
"Deleting" old rows. Note the "scare quotes" - we don't actually delete rows from MyTable. We have to set the deleted field to the new version N. This has to be done for all rows in MyTable that are in the previous version N-1, but are not in the new set.
Inserting new rows. All rows that are in the new set and are not in version N-1 in MyTable must be added as new rows with the created field set to the new version N, and deleted as NULL.
Some rows in the new set may match existing rows in MyTable at version N-1 in which case there is nothing to do.
My current solution
Given that we have to "diff" two sets of data to work out the deletions, we can't just read over the new data and do insertions as appropriate. I can't think of a way to do the diff operation without dumping all the new data into a temporary table first. So my strategy goes like this:
-- temp table uses MyISAM for speed.
CREATE TEMPORARY TABLE tempUpdate (
`data` char(32) NOT NULL,
`comment` char(32) DEFAULT NULL,
PRIMARY KEY (`data`),
KEY (`data`, `comment`)
) ENGINE=MyISAM;
-- Bulk insert thousands of rows
INSERT INTO tempUpdate VALUES
('some new', NULL),
('other', 'comment'),
...
-- Start transaction for the update
BEGIN;
SET #newVersion = 5; -- Worked out out-of-band
-- Do the "deletions". The join selects all non-deleted rows in MyTable for
-- which the matching row in tempUpdate does not exist (tempUpdate.data is NULL)
UPDATE MyTable
LEFT JOIN tempUpdate
ON MyTable.data = tempUpdate.data AND
MyTable.comment <=> tempUpdate.comment
SET MyTable.deleted = #newVersion
WHERE tempUpdate.data IS NULL AND
MyTable.deleted IS NULL;
-- Delete all rows from the tempUpdate table that match rows in the current
-- version (deleted is null) to leave just new rows.
DELETE tempUpdate.*
FROM MyTable RIGHT JOIN tempUpdate
ON MyTable.data = tempUpdate.data AND
MyTable.comment <=> tempUpdate.comment
WHERE MyTable.id IS NOT NULL AND
MyTable.deleted IS NULL;
-- All rows left in tempUpdate are new so add them.
INSERT INTO MyTable (data, comment, created)
SELECT DISTINCT tempUpdate.data, tempUpdate.comment, #newVersion
FROM tempUpdate;
COMMIT;
DROP TEMPORARY TABLE IF EXISTS tempUpdate;
The question (at last)
I need to find the fastest way to do this update operation. I can't change the schema for MyTable, so any solution must work with that constraint. Can you think of a faster way to do the update operation, or suggest speed-ups to my existing method?
I have a Python script for testing the timings of different update strategies and checking their correctness over several versions. It's fairly long, but I can edit it into the question if people think it would be useful.
One speed-up is for the loading step -- LOAD DATA INFILE.
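A minimal sketch of that idea, assuming the new data has been exported to a CSV (the path is hypothetical) and using the tempUpdate table from the question:
LOAD DATA LOCAL INFILE '/tmp/new_data.csv'
INTO TABLE tempUpdate
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
(`data`, `comment`);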
In so far as I've experienced audit-logging, you'll be better off with two tables, e.g.:
yourtable (id, col1, col2, version) -- pkey on id
yourtable_logs (id, col1, col2, version) -- pkey on (id, version)
Then add an update trigger on yourtable which inserts the previous version into yourtable_logs.
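A minimal sketch of such a trigger, assuming the two-table layout above (the column names are illustrative):
CREATE TRIGGER yourtable_log_update
BEFORE UPDATE ON yourtable
FOR EACH ROW
  -- Archive the row as it looked before this update.
  INSERT INTO yourtable_logs (id, col1, col2, version)
  VALUES (OLD.id, OLD.col1, OLD.col2, OLD.version);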