How to solve a near real-time DWH delete process? - mysql

I am trying to build a near real-time DWH. My first attempt is to load a table from my DWH into my application every 15 minutes.
I would like to avoid all the possible problems that a near real-time DWH can face. One of those problems is querying an empty table that supplies the values for a multiselect HTML tag.
To solve this I have thought of the following solution, but I do not know whether there is a standard way to handle this kind of problem.
I create a table like this to save the possible values of the multiselect:
CREATE TABLE providers (
    provider_id INT AUTO_INCREMENT PRIMARY KEY,
    provider_name VARCHAR(20) NOT NULL,
    delete_flag INT NOT NULL
);
Before the insert I mark all existing rows for deletion:
UPDATE providers SET delete_flag = 1;
I insert rows with an ETL process like this:
INSERT INTO providers (provider_name, delete_flag) VALUES ('Provider1',0)
From my app I query the table like this:
SELECT DISTINCT provider_name FROM providers
While the app keeps working and selecting all providers without duplicates (the source can delete, add, or update a provider, so I always have to stay in sync with it) and without ever hitting an empty table, I can run this statement just after the insert:
DELETE FROM providers WHERE delete_flag=1
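Putting the whole refresh cycle together, it would look like this (a sketch; staging_providers is a hypothetical table holding the freshly extracted rows):
START TRANSACTION;
-- mark every current row for deletion
UPDATE providers SET delete_flag = 1;
-- load the fresh rows from the ETL staging table
INSERT INTO providers (provider_name, delete_flag)
SELECT provider_name, 0 FROM staging_providers;
-- drop the previous generation
DELETE FROM providers WHERE delete_flag = 1;
COMMIT;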
I think this is a good solution for small tables, or for big tables with few changes, but what happens when a table is big? Is there a standard way to solve this kind of problem?
We cannot put user-facing usability at risk just because we are updating data.

There are two approaches to publishing a bulk change of a dimension without taking a maintenance window that would interrupt queries.
The first one simply uses a transactional approach, but it performs badly for large data volumes.
DELETE the replaced dimension records
INSERT the new or changed dimension records
COMMIT;
Note that you need no logical DELETE flag, as the changes become visible only after the COMMIT - so the table is never seen empty.
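A minimal sketch of that pattern, using the dim table referenced below and a hypothetical dim_changes staging table (assuming InnoDB, so readers keep seeing the old rows until the COMMIT):
START TRANSACTION;
-- replace only the changed dimension records
DELETE FROM dim WHERE id IN (SELECT id FROM dim_changes);
INSERT INTO dim (id, col1, col2)
SELECT id, col1, col2 FROM dim_changes;
COMMIT;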
As mentioned, this approach is not suitable if you have a large dimension with a lot of changes. In such a case you may use the EXCHANGE PARTITION feature, available as of MySQL 5.6.
You define a temporary table with the same structure as your dimension table, partitioned with only one partition that contains all the data.
CREATE TABLE dim_tmp (
id INT NOT NULL,
col1 VARCHAR(30),
col2 VARCHAR(30)
)
PARTITION BY RANGE (id) (
PARTITION pp VALUES LESS THAN (MAXVALUE)
);
Populate the table with the complete new dimension definition and switch this temporary table with your dimension table.
ALTER TABLE dim_tmp EXCHANGE PARTITION pp WITH TABLE dim;
After this statement the data from the temporary table will be stored (published) in your dimension table (the new state), and the old state of the dimension will be stored in the temporary table.
Please check the MySQL documentation for the constraints of this feature.
Disclaimer: I use this feature in Oracle DB and I have no experience with it in MySQL.

Related

What is the best and safest method to update/change the data type of a column in a MySQL table that has ~5.5 million rows (TINYINT to SMALLINT)?

Similar questions have been asked, but I have had issues in the past when using
ALTER TABLE tablename MODIFY columnname SMALLINT
I had a server crash and had to recover my table when I ran this the last time. Is it safe to use this command when there is that much data in the table? What if there are other queries that may be running on the table in parallel? Should I copy the table and run the query on the new table? Should I copy the column and move the data to the new column?
Please let me know if there are any best or "safest" practices when doing this.
Also, I know this depends on a lot of factors, but does anyone know how long the query should take on an InnoDB table with ~5.5 million rows (rough estimate)? The column in question is a TINYINT and has data in it. I want to upgrade to a SMALLINT to handle larger values.
Thanks!
On a slow disk, and with lots of columns in the table, it could take hours to finish.
The ALTER is "safe" because it used to do the following:
1. Lock the table.
2. Create a similar table, but with SMALLINT instead of TINYINT.
3. Copy all the rows over to the new table.
4. Rename the tables and drop the old one.
5. Unlock.
Step 3 is the slow part. The only vulnerability is in step 4, which is very fast.
A server crash during steps 1-3 should have left the old table intact, but possibly left behind a partially created tmp table named something like #sql....
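If you would rather drive those steps yourself, a manual equivalent might look like this (a sketch; table and column names are placeholders, and writes to the table must be blocked for the duration of the copy):
-- steps 1-2: create the new table with the changed column type
CREATE TABLE tablename_new LIKE tablename;
ALTER TABLE tablename_new MODIFY columnname SMALLINT;
-- step 3: copy all the rows (the slow part)
INSERT INTO tablename_new SELECT * FROM tablename;
-- step 4: atomically swap the tables, then drop the old one
RENAME TABLE tablename TO tablename_old, tablename_new TO tablename;
DROP TABLE tablename_old;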
Percona's pt-online-schema-change has the advantage of being virtually lockless.
This cannot be easily answered.
It depends on things like
Does the table have its own file (innodb_file_per_table), or is it shared with others?
How big is the table in terms of bytes?
etc.
It can take anywhere from a few minutes to, indeed, several hours, and it involves copying over the whole content of the table, so you need quite a lot of free disk space.
You can add a new SMALLINT column to the table:
ALTER TABLE tablename ADD columnname_new SMALLINT AFTER columnname;
then copy the data from the old column to the new one:
UPDATE tablename SET columnname_new = columnname WHERE columnname_new IS NULL LIMIT 100000
Repeat the statement above until it affects no more rows.
Then you can drop the old column:
ALTER TABLE tablename DROP COLUMN columnname
and finally rename the new column:
ALTER TABLE tablename CHANGE columnname_new columnname SMALLINT
Doing the copy from the old column to the new one in batches of 100,000 rows keeps each chunk small, just to be sure not to run into any issues.
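The "repeat until done" step can be wrapped in a stored procedure - a minimal sketch, assuming the old column never contains NULL (otherwise rows whose old value is NULL stay matched forever and can make the loop stop early):
DELIMITER //
CREATE PROCEDURE copy_column_in_batches()
BEGIN
  REPEAT
    -- copy the next batch of not-yet-copied rows
    UPDATE tablename SET columnname_new = columnname
    WHERE columnname_new IS NULL
    LIMIT 100000;
  UNTIL ROW_COUNT() = 0 END REPEAT;
END //
DELIMITER ;
CALL copy_column_in_batches();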
I would add a new column, change the code to check whether a value exists in the new column and read/write it there if it does, and otherwise read from the old column while writing to the new one. At this point you can migrate the data at will, copying values from the old column into the new column wherever no value exists in the new column yet.
Once all of the data has been migrated you can drop the old column.

Is there a way to cache a View so that queries against it are quick?

I'm extremely new to views, so please forgive me if this is a silly question, but I have a view that is really helpful in optimizing a pretty unwieldy query and lets me select against a small subset of its columns. However, I was hoping the view would actually be stored somewhere, so that selecting from it wouldn't take very long.
I may be mistaken, but I get the sense (from the speed with which CREATE VIEW executes and from the duration of my queries against the view) that the view's query is actually run before the outer query, every time I select from it.
I'm really hoping that I'm overlooking some mechanism whereby CREATE VIEW does the hard work of running the view's query up front, so that my subsequent selects against this static view would be really swift.
BTW, I totally understand that obviously this VIEW would be a snapshot of the data that existed at the time the VIEW was created and wouldn't reflect any new info that was inserted/updated subsequent to the VIEW's creation. That's actually EXACTLY what I need.
TIA
What you want to do is materialize your view. Have a look at http://www.fromdual.com/mysql-materialized-views.
What you're talking about are materialised views, a feature of (at least) DB2 but not MySQL as far as I know.
There are ways to emulate them by creating/populating a table periodically, or on demand, but a true materialised view knows when the underlying data has changed, and only recalculates if required.
If the data will never change once the view is created (as you seem to indicate in a comment), just create a brand new table to hold the subset of data and query that. People always complain about slow speed but rarely about data storage requirements :-)
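A minimal sketch of that snapshot approach (the table and column names here are illustrative):
-- materialize the expensive query's result into a plain table
CREATE TABLE order_summary_snapshot AS
SELECT customer_id, SUM(total) AS total_spent
FROM orders
GROUP BY customer_id;
-- queries now hit the static snapshot instead of re-running the view
SELECT total_spent FROM order_summary_snapshot WHERE customer_id = 42;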
You can do this with:
A MySQL Event
A separate table (for caching)
The REPLACE INTO ... SELECT statement.
Here's a working example.
-- create dummy data for testing
CREATE TABLE MyTable (
id INT NOT NULL,
groupvar INT NOT NULL,
myvar INT
);
INSERT INTO MyTable VALUES
(1,1,1),
(2,1,1),
(3,2,1);
-- create the view, making sure rows have a unique identifier (groupvar)
CREATE VIEW MyView AS
SELECT groupvar, SUM(myvar) as myvar_sum
FROM MyTable
GROUP BY groupvar;
-- create cache table, setting primary key to unique identifier (groupvar)
CREATE TABLE MyView_Cache (PRIMARY KEY (groupvar))
SELECT *
FROM MyView;
-- create a table to keep track of when the cache has been updated (optional)
CREATE TABLE MyView_Cache_updated (
    update_id INT NOT NULL AUTO_INCREMENT,
    last_updated DATETIME,
    PRIMARY KEY (update_id)
);
-- create event to update the cache table (e.g., daily); the event scheduler must be enabled (SET GLOBAL event_scheduler = ON)
DELIMITER |
CREATE EVENT MyView_Cache_Event
ON SCHEDULE EVERY 1 DAY STARTS CURRENT_TIMESTAMP + INTERVAL 1 HOUR
DO
BEGIN
REPLACE INTO MyView_Cache
SELECT *
FROM MyView;
INSERT INTO MyView_Cache_updated
SELECT NULL, NOW() AS last_updated;
END |
DELIMITER ;
You can now query MyView_Cache for faster response times, and query MyView_Cache_updated to inform users of the last time the cache was updated (in this example, daily).
Since a view is basically a SELECT statement, you can use the query cache to improve performance.
But first you should check whether:
you can add indexes to the tables involved to speed up the query (use EXPLAIN)
the data isn't changing very often - in that case you can materialize the view (take snapshots)
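For example, using the MyView from the earlier answer, EXPLAIN shows whether the underlying tables can use an index for the view's query (a sketch):
EXPLAIN SELECT * FROM MyView WHERE groupvar = 1;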
Use a materialised view. It can store aggregates such as counts and sums, but after the base table is updated you need to refresh the view to get correct results, as it is not updated automatically. Moreover, after you query the view the results are stored in the cache, so subsequent reads cost far less than querying the base table itself; only the first query against the view has to fetch the data from main storage, and it is cached after that. So it becomes efficient from the second query onwards.

Changing a MySQL database retrospectively

Is there a method to track changes to a MySQL database? I develop offline and then commit all the changes to my server. For the app itself I use Git and it works nicely.
However, for the database, I'm changing everything manually because the live database contains customer data and I cannot just replace it with the development database.
Is there a way to apply only the structural changes, without completely replacing one DB with another?
The term you're looking for is 'database migrations' (and no, it doesn't refer to moving from one RDBMS to another). Migrations are a way to programmatically version-control your database structure. Most languages have some kind of migrations toolkit, often as part of an ORM library/framework (a bare-bones sketch of the idea follows the examples below).
For PHP you can look at Doctrine
For Ruby it's Rails of course
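In its simplest form, a migration is just a versioned SQL script that is applied exactly once, in order - a hand-rolled sketch (the file name, table, and column are hypothetical):
-- 001_add_facebook_account.sql
ALTER TABLE customer ADD COLUMN facebook_account VARCHAR(255);
-- a schema_version table records which migrations have already run
CREATE TABLE IF NOT EXISTS schema_version (
    version INT PRIMARY KEY,
    applied_at DATETIME NOT NULL
);
INSERT INTO schema_version VALUES (1, NOW());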
The key to keeping track of your changes is snapshots, my friend.
Now, it's a wide field. The first thing you have to decide is whether you want to keep track of your database with some kind of data in it. If that's the case you have several options, from using LVM snapshots, to copying the binary logs, to a simple mysqldump.
Now, if what you want is a smooth transition between your database changes (say you added a column, for example), you have some other options.
The first one is replication. That's a great option, but a little complex. With replication you can alter one slave and, after it's done, with some locking, promote it to master and swap out the old master, and so on. It's really involved, but it's the better option.
If you cannot afford replication, what you must do is apply the changes to your single-master DB with the minimum downtime. A good approach is this:
Suppose you want to replace your Customer table to add a "facebook_account" field. First, you can use an alias table, like this:
The original table (it has data):
CREATE TABLE `customer` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
The new one:
CREATE TABLE `new_customer` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`name` varchar(255) NOT NULL,
`facebook_account` varchar(255) DEFAULT NULL,
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
Or simply:
CREATE TABLE new_customer LIKE customer;
ALTER TABLE new_customer add column facebook_account VARCHAR(255);
Now we're going to copy the data to the new table. We'll need to issue a few other statements first; I'll explain each in turn.
First, we can't allow other connections to modify the customer table while we're making the switch, so we take a write lock (see the MySQL documentation on LOCK TABLES if you want to learn more):
LOCK TABLES customer WRITE, new_customer WRITE;
Now I flush the table to write any cached content to the filesystem:
FLUSH TABLES customer;
Now we can do the insert. First I disable the keys for performance reasons; after the data is inserted I enable the keys again. (Note that DISABLE KEYS only has an effect on MyISAM tables; on InnoDB tables like these it is a no-op.)
ALTER TABLE new_customer DISABLE KEYS;
INSERT INTO new_customer (id, name, facebook_account) SELECT id, name, NULL FROM customer;
ALTER TABLE new_customer ENABLE KEYS;
Now we can switch the tables.
ALTER TABLE customer RENAME old_customer;
ALTER TABLE new_customer RENAME customer;
Finally we have to release the lock.
UNLOCK TABLES;
That's it. If you want to keep track of your modified tables you may want to rename the old_customer table to something else, or move it to another database.
The only issue I didn't cover here is triggers. You have to pay attention to any enabled triggers, but that will depend on your schema.
That's it, hope it helps.

Adding a time dimension to MySQL cells

Is there a way to keep a timestamped record of every change to every column of every row in a MySQL table? This way I would never lose any data and keep a history of the transitions. Row deletion could be just setting a "deleted" column to true, but would be recoverable.
I was looking at HyperTable, an open-source implementation of Google's BigTable, and this feature really made my mouth water. It would be great to have it in MySQL, because my apps don't handle the huge amounts of data that would justify deploying HyperTable. More details about how this works can be found in the HyperTable documentation.
Is there any configuration, plugin, fork or whatever that would add just this one functionality to MySQL?
I've implemented this in the past in a php model similar to what chaos described.
If you're using MySQL 5, you could also accomplish this with triggers that fire on the UPDATE and DELETE events of your table.
http://dev.mysql.com/doc/refman/5.0/en/stored-routines.html
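A minimal sketch of that trigger-based approach (all table, column, and trigger names here are hypothetical):
-- base table and its audit log
CREATE TABLE person (
    id INT PRIMARY KEY,
    name VARCHAR(50)
);
CREATE TABLE person_log (
    log_id INT AUTO_INCREMENT PRIMARY KEY,
    person_id INT,
    old_name VARCHAR(50),
    action VARCHAR(10),
    logged_at DATETIME
);
-- record the previous state of every updated row
DELIMITER //
CREATE TRIGGER person_after_update
AFTER UPDATE ON person
FOR EACH ROW
BEGIN
    INSERT INTO person_log (person_id, old_name, action, logged_at)
    VALUES (OLD.id, OLD.name, 'update', NOW());
END //
DELIMITER ;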
I do this in a custom framework. Each table definition also generates a Log table related many-to-one with the main table, and when the framework does any update to a row in the main table, it inserts the current state of the row into the Log table. So I have a full audit trail on the state of the table. (I have time records because all my tables have LoggedAt columns.)
No plugin, I'm afraid, more a method of doing things that needs to be baked into your whole database interaction methodology.
Create a table that stores the following info...
CREATE TABLE MyData (
ID INT IDENTITY,
DataID INT )
CREATE TABLE Data (
ID INT IDENTITY,
MyID INT,
Name VARCHAR(50),
Timestamp DATETIME DEFAULT CURRENT_TIMESTAMP)
Now create a sproc that does this...
INSERT Data (MyID, Name)
VALUES (@MyID, @Name)
UPDATE MyData SET DataID = @@IDENTITY
WHERE ID = @MyID
In general, the MyData table is just a key table. You point it at the record in the Data table that is the most current. Whenever you need to change data, you simply call the sproc, which inserts the new data into the Data table and then updates MyData to point to the most recent record. All of the other tables in the system would key off of MyData.ID for foreign-key purposes.
This arrangement sidesteps the need for a second log table(and keeping them in sync when the schema changes), but at the cost of an extra join and some overhead when creating new records.
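That extra join looks like this when reading the current state of a record (illustrative):
SELECT d.Name, d.Timestamp
FROM MyData m
JOIN Data d ON d.ID = m.DataID
WHERE m.ID = 42;  -- current version of record 42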
Do you need it to remain queryable, or will this just be for recovering from bad edits? If the latter, you could just set up a cron job to back up the actual files where MySQL stores the data and send them to a version control server.

Does MS-SQL support in-memory tables?

Recently, I started changing some of our applications to support MS SQL Server as an alternative back end.
One of the compatibility issues I ran into is the use of MySQL's CREATE TEMPORARY TABLE to create in-memory tables that hold data for very fast access during a session with no need for permanent storage.
What is the equivalent in MS SQL?
A requirement is that I need to be able to use the temporary table just like any other, especially JOIN it with the permanent ones.
You can create table variables (in memory), and two different types of temp table:
--visible only to me, in memory (SQL 2000 and above only)
declare @test table (
Field1 int,
Field2 nvarchar(50)
);
--visible only to me, stored in tempDB
create table #test (
Field1 int,
Field2 nvarchar(50)
)
--visible to everyone, stored in tempDB
create table ##test (
Field1 int,
Field2 nvarchar(50)
)
Edit:
Following feedback I think this needs a little clarification.
#table and ##table will always be in TempDB.
@Table variables will normally be in memory, but are not guaranteed to be. SQL decides based on the query plan, and uses TempDB if it needs to.
@Keith
This is a common misconception: table variables are NOT necessarily stored in memory. In fact, SQL Server decides whether to keep the variable in memory or to spill it to TempDB. There is no reliable way (at least in SQL Server 2005) to ensure that table data is kept in memory. For more detail, see the SQL Server documentation on table variables.
You can declare a "table variable" in SQL Server 2005, like this:
declare @foo table (
Id int,
Name varchar(100)
);
You then refer to it just like a variable:
select * from @foo f
join bar b on b.Id = f.Id
No need to drop it - it goes away when the variable goes out of scope.
It is possible with MS SQL Server 2014.
See: http://msdn.microsoft.com/en-us/library/dn133079.aspx
Here is an example of SQL generation code (from MSDN):
-- create a database with a memory-optimized filegroup and a container.
CREATE DATABASE imoltp
GO
ALTER DATABASE imoltp ADD FILEGROUP imoltp_mod CONTAINS MEMORY_OPTIMIZED_DATA
ALTER DATABASE imoltp ADD FILE (name='imoltp_mod1', filename='c:\data\imoltp_mod1') TO FILEGROUP imoltp_mod
ALTER DATABASE imoltp SET MEMORY_OPTIMIZED_ELEVATE_TO_SNAPSHOT=ON
GO
USE imoltp
GO
-- create a durable (data will be persisted) memory-optimized table
-- two of the columns are indexed
CREATE TABLE dbo.ShoppingCart (
ShoppingCartId INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,
UserId INT NOT NULL INDEX ix_UserId NONCLUSTERED HASH WITH (BUCKET_COUNT=1000000),
CreatedDate DATETIME2 NOT NULL,
TotalPrice MONEY
) WITH (MEMORY_OPTIMIZED=ON)
GO
-- create a non-durable table. Data will not be persisted, data loss if the server turns off unexpectedly
CREATE TABLE dbo.UserSession (
SessionId INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT=400000),
UserId int NOT NULL,
CreatedDate DATETIME2 NOT NULL,
ShoppingCartId INT,
INDEX ix_UserId NONCLUSTERED HASH (UserId) WITH (BUCKET_COUNT=400000)
) WITH (MEMORY_OPTIMIZED=ON, DURABILITY=SCHEMA_ONLY)
GO
There is a good blog post on this, but basically: prefix local temp tables with # and global temp tables with ## - e.g.
CREATE TABLE #localtemp
I understand what you're trying to achieve. Welcome to the world of database variety!
SQL Server 2000 supports temporary tables, created by prefixing # to the table name for a locally accessible temporary table (local to the session) and prefixing ## to the table name for a globally accessible temporary table, e.g. #MyLocalTable and ##MyGlobalTable respectively.
SQL Server 2005 and above support both temporary tables (local, global) and table variables - watch out for the new table-variable functionality in SQL 2008 and R2! The difference between temporary tables and table variables is not so big, but lies in the way the database server handles them.
I would not wish to talk about older versions of SQL server like 7, 6, though I have worked with them and it's where I came from anyway :-)
It's common to think that table variables always reside in memory, but this is wrong. Depending on memory pressure and the database server's transaction volume, a table variable's pages can be pushed out of memory and written to tempdb, and the rest of the processing then takes place there (in tempdb).
Please note that tempdb is a database on an instance with no permanent objects of its own; it is responsible for workloads involving side operations such as sorting and other processing work that is temporary in nature. Table variables (usually holding smaller data sets), on the other hand, are kept in memory (RAM), which makes them faster to access and causes less disk IO on the tempdb drive than temporary tables, which are always logged in tempdb.
Table variables cannot be indexed after declaration (apart from a PRIMARY KEY or UNIQUE constraint declared inline), while temporary tables (both local and global) can be indexed for faster processing when the amount of data is large. So that tells you your choice when you need faster processing of larger data volumes. It's also worth noting that transactions on table variables alone are not logged and can't be rolled back, while those done on temporary tables can be.
In summary, table variables are better for smaller data sets, while temporary tables are better for larger data sets being processed temporarily. If you also want proper transaction control using transaction blocks, table variables are not an option for rolling back transactions, so you're better off with temporary tables in that case.
Lastly, temporary tables always increase disk IO since they always use tempdb, while table variables may not increase it, depending on the memory pressure.
Let me know if you want tips on how to tune your tempdb for much faster performance!
The syntax you want is:
create table #tablename
The # prefix identifies the table as a temporary table.
CREATE TABLE #tmptablename
Use the hash/pound sign prefix