Data partitioning in SAP HANA

I have a scenario here: I have 3 tables, and I anticipate they are going to grow very large, possibly over 2 billion records in the future.
The application using these tables is yet to be deployed, and I am looking for a strategy for partitioning them.
Should I create partitions for these tables after the application is deployed to production, once I have seen the growth rate of the table data (monitoring some threshold on the number of records, which I am not aware of right now), or should I design the partitions in the schema before I deploy to production?
If I have to partition, can I use range partitioning on a timestamp column, i.e. on the month or year value of the timestamp?

If you can already anticipate the growth with a good degree of certainty, then you should start with a partitioning approach right away.
SAP HANA supports partitioning on elements of a timestamp (e.g. YEAR or MONTH). However, pay attention to whether this actually gives you a good distribution of data, as not all data is generated at a constant volume over time (e.g. sensor data may be constant, while retail sales data likely grows over time and varies seasonally).
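As a rough sketch of what that could look like, assuming a column table partitioned by the year of its timestamp column (the table and column names are made up, and the exact clause syntax should be verified against the SAP HANA SQL reference for your release):

-- Sketch only: yearly range partitions derived from a TIMESTAMP column.
CREATE COLUMN TABLE sensor_readings (
    reading_id   BIGINT,
    device_id    INTEGER,
    recorded_at  TIMESTAMP,
    reading      DOUBLE
)
PARTITION BY RANGE (YEAR(recorded_at)) (
    PARTITION '2023' <= VALUES < '2024',
    PARTITION '2024' <= VALUES < '2025',
    PARTITION OTHERS   -- catch-all for rows outside the defined ranges
);

Whether yearly or monthly ranges make sense depends on how evenly the rows spread across those ranges, keeping in mind HANA's limit of 2 billion rows per partition.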

Related

Best practice to archive less frequently accessed data in MySQL

I have an e-commerce application that has been running on a MySQL server for the last 10 years. There are two tables in particular:
orders (approx. 30 million rows and 3 GB of data)
order_item (approx. 45 million rows and 3.5 GB of data)
These tables contain all orders and their respective products purchased in the last 10 years. Generally, the last 6 months of data is accessed most frequently; all older data is used either in reports by the business/marketing teams or when a customer checks their order history.
As these tables constantly grow in size, we are running into increased query times for both reads and writes. Therefore, I am considering archiving some data from these tables to improve performance. The main concern is that archived data should still be available for reads by reports and customers.
What would be the right archiving strategy for this?
PARTITIONing is unlikely to provide any performance benefit.
Please provide SHOW CREATE TABLE and some of the "slow" queries so I can analyze the access patterns, clustered index choice, the potential for Partitioning, etc.
Sometimes a change to the PK (which is "clustered" with the data) can greatly improve the "locality of reference", which provides performance benefits. A user going back in time may also blow out the cache (the "buffer pool"). Changing the PK to start with user_id may significantly improve on that issue.
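Purely to illustrate that idea (the real column names depend on the SHOW CREATE TABLE output, which isn't shown here; this assumes orders currently has PRIMARY KEY (order_id) and a user_id column):

-- Sketch: re-cluster orders by customer so each user's rows sit together.
-- InnoDB stores rows in primary-key order, so leading with user_id improves
-- locality for "this customer's history" queries.
ALTER TABLE orders
    DROP PRIMARY KEY,
    ADD PRIMARY KEY (user_id, order_id),
    ADD UNIQUE KEY uk_order_id (order_id);   -- keep single-order lookups fast

This is an expensive table rebuild on 30 million rows, so it would need a maintenance window or an online schema-change tool.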
Another possibility is to use replication to segregate "read-only" queries that go back in time from the "active" records (read/write in the last few months). This can benefit by decreasing the I/O by having customers on the Primary while relegating the "reports" to a Replica.
How much RAM? When the data gets "big", running a "report" may blow out the cache, making everything slower. Summary tables are a very good cure for that.
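To make the summary-table idea concrete (a sketch with invented column names; the real grouping columns depend on what the reports actually need):

-- Sketch: a daily rollup that reports can query instead of the big tables.
CREATE TABLE daily_sales_summary (
    sales_date   DATE NOT NULL,
    order_count  INT UNSIGNED NOT NULL,
    total_amount DECIMAL(14,2) NOT NULL,
    PRIMARY KEY (sales_date)
);

-- Refreshed incrementally, e.g. nightly for the previous day.
INSERT INTO daily_sales_summary (sales_date, order_count, total_amount)
SELECT DATE(created_at), COUNT(*), SUM(total)
FROM orders
WHERE created_at >= CURRENT_DATE - INTERVAL 1 DAY
  AND created_at <  CURRENT_DATE
GROUP BY DATE(created_at)
ON DUPLICATE KEY UPDATE
    order_count  = VALUES(order_count),
    total_amount = VALUES(total_amount);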
The optimal indexes for a Partitioned table are necessarily different than for the equivalent non-partitioned table.

Database design suggestions for a data scraping/warehouse application?

I'm looking into the database design for a data-warehouse kind of project which involves a large number of inserts daily. The archived data will later be used to generate reports. I will have a list of users (for example, a set of 2 million users) for whom I need to monitor daily social-networking activity.
For example, let there be a set of 100 users, say U1, U2, ..., U100.
I need to insert their daily status count into my database.
Consider that the total status count obtained for user U1 for the period June 30 - July 6 is as follows:
June 30 - 99
July 1 - 100
July 2 - 102
July 3 - 102
July 4 - 105
July 5 - 105
July 6 - 107
The database should keep the daily status count of each user, like:
For user U1,
July 1- 1 (100-99)
July 2- 2 (102-100)
July 3- 0 (102-102)
July 4- 3 (105-102)
July 5- 0 (105-105)
July 6- 2 (107-105)
Similarly, the database should hold archived details for the full set of users.
In a later phase, I envision generating aggregate reports from this data, such as total points scored per day, week, month, etc., and comparing them with older data.
I need to start from scratch. I am experienced with PHP as a server-side language and with MySQL, but I am unsure about the database side. Since I need to process about a million insertions daily, what should be taken care of?
I am also unsure how to design a MySQL database for this: which storage engine to use and which design patterns to follow, keeping in mind that the data will later be used with aggregate functions.
Currently I envision a design with one table storing all the user IDs, referenced via a foreign key, and a separate status-count table for each day. Would a large number of tables create overhead?
Does MySQL fit my requirements? 2 million or more DB operations will be performed every day. What should be considered regarding the server and other aspects in this case?
1) The database should handle concurrent inserts, which should enable 1-2 million inserts per day.
Before inserting, I intend to calculate the daily status count, i.e. the difference between today's count and yesterday's.
2) In a later phase, the archived data (collected over past days) will be used as a data warehouse, and aggregation tasks will be performed on it.
Comments:
I have read that MyISAM is the best choice for data-warehousing projects, but at the same time I have heard that InnoDB excels in many ways. Many have suggested proper tuning to get this right, and I would like to get thoughts on that as well.
When creating a data warehouse, you don't have to worry about normalization. You're inserting rows and reading rows.
I'd just have one table like this.
Status Count
------------
User id
Date
Count
The primary (clustering) key would be (User id, Date). Another unique index would be (Date, User id).
As far as whether or not MySQL can handle this data warehouse, that depends on the hardware that MySQL is running on.
Since you don't need referential integrity, I'd use MyISAM as the engine.
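As a concrete sketch of that single table (names are illustrative; note that the PK physically clusters rows only under InnoDB, while MyISAM treats it as just another unique index):

CREATE TABLE status_count (
    user_id      INT UNSIGNED NOT NULL,
    status_date  DATE NOT NULL,
    status_count INT UNSIGNED NOT NULL,
    PRIMARY KEY (user_id, status_date),            -- clustering key under InnoDB
    UNIQUE KEY uk_date_user (status_date, user_id) -- supports per-day scans
);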
As for table design, a dimensional model with a star schema is usually a good choice for a datamart where there are mostly inserts and reads. I see two different granularities for the status data, one for status per day and one for status per user, so I would recommend tables similar to:
user_status_fact(user_dimension_id int, lifetime_status int)
daily_status_fact (user_dimension_id int, calendar_dimension_id int, daily_status int)
user_dimension(user_dimension_id, user_id, name, ...)
calendar_dimension(calendar_dimension_id, calendar_date, day_of_week, etc..)
You might also consider having the most detailed data available even though you don't have a current requirement for it as it may make it easier to build aggregates in the future:
status_fact (user_dimension_id int, calendar_dimension_id int, hour_dimension_id, status_dimension_id, status_count int DEFAULT 1)
hour_dimension(hour_dimension_id, hour_of_day_24, hour_of_day_12, ...)
status_dimension(status_dimension_id, status_description string, ...)
If you aren't familiar with the dimensional model, I would recommend the book The Data Warehouse Toolkit by Kimball.
I would also recommend MyISAM since you don't need the transactional integrity provided by InnoDB when dealing with a read-mostly warehouse.
I would question whether you want to do concurrent inserts into a production database though. Often in a warehouse environment this data would get batched over time and inserted in bulk and perhaps go through a promotion process.
As for scalability, mysql can certainly handle 2M write operations per day on modest hardware. I'm inserting 500K+ rows/day (batched hourly) on a cloud based server with 8GB of ram running apache + php + mysql and the inserts aren't really noticeable to the php users hitting the same db.
I'm assuming you will get one new row per user per day inserted (not 2M rows a day, as some users will have more than one status). You should look at how many new rows per day you expect to be created. When you get to a very large number of rows you might have to consider partitioning, sharding and other performance tricks. There are many books out there that can help you with that. Or you could also consider moving to an analytics DB such as Amazon Redshift.
I would create a fact table for each user status for each day. This fact table would connect to a date dimension via a date_key and to a user dimension via a user_key. The primary key for the fact table should be a surrogate key = status_key.
So, your fact table now has four fields: status_key, date_key, user_key, status.
Once the dimension and fact tables have been loaded, then do the processing and aggregating.
Edit: I assumed you knew something about datamarts and star schemas. Here is a simple star schema to base your design on.
This design will store any user's status for a given day. (If the user status can change during the day, just add a time dimension).
This design will work on MySQL or SQL Server. You will have to manage a million inserts per day, so don't bog that down with comparisons to previous data points. You can do those with the datamart (star schema) after it's loaded - that's what it is for: analysis and aggregation.
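A minimal sketch of such a fact table (names and types are illustrative, not a prescribed schema):

-- Star-schema fact table keyed by a surrogate status_key.
CREATE TABLE status_fact (
    status_key BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    date_key   INT UNSIGNED NOT NULL,   -- joins to the date dimension
    user_key   INT UNSIGNED NOT NULL,   -- joins to the user dimension
    status     INT NOT NULL,
    PRIMARY KEY (status_key),
    KEY idx_date_user (date_key, user_key)
);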
If there are a large number of DML operations and you are mostly selecting records from the database, the MyISAM engine would be preferable. InnoDB is mainly used where you need transaction control and referential integrity. You can also specify the engine at the table level.
If you need to generate reports, the MyISAM engine also works faster than InnoDB. Look at which tables or data you need for your reports.
Remember that if you generate reports from the MySQL database by processing millions of rows in PHP, that could create problems; you may frequently encounter 500 or 501 errors.
So from a report-generation point of view, using the MyISAM engine for the required tables will be useful.
You can also store the data in multiple tables to reduce overhead; otherwise there is a chance of a table crash.
It looks like you need a schema that will keep a single count per user per day. Very simple. You should create a single table with the columns DAY, USER_ID, and STATUS_COUNT.
Create an index on DAY and USER_ID together, and if possible keep the data in the table sorted by DAY and USER_ID also. This will give you very fast access to the data, as long as you are querying it by day ranges for any (or all) users.
For example:
select * from table where DAY = X and USER_ID in (Y, Z);
would be very fast because the data is ordered on disk sequentially by day, then by user_id, so there are very few seeks to satisfy the query.
On the other hand, if you are more interested in finding a particular user's activity for a range of days:
select * from table where USER_ID = X and DAY between Y and Z;
then the previous method is less optimal because finding the data will require many seeks instead of a sequential scan. Index first by USER_ID, then DAY, and keep the data sorted in that order; this will require more maintenance though, as the table would need to be re-sorted often. Again, it depends on your use case, and how fast you want your queries against the table to respond.
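For example, the two access patterns above map to two different composite indexes (standard MySQL syntax; the table name is hypothetical and the column names follow this answer):

-- Favors day-range queries across many users (the first query above).
CREATE INDEX idx_day_user ON status_count (day, user_id);

-- Favors a single user's history over a range of days (the second query above).
CREATE INDEX idx_user_day ON status_count (user_id, day);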
I don't use MySQL extensively, but I believe MyISAM is faster for inserts at the expense of transaction isolation. This should not be a problem for the system you're describing.
Also, 2MM records per day should be child's play (only 23 inserts / second) if you're using decent hardware. Especially if you can batch load the records using mysqlimport. If that's not possible, 23 inserts/second should still be very doable.
I would not do the computation of the delta from the previous day at insert time, however. There is an analytic function called LAG() that will do that for you very handily (http://explainextended.com/2009/03/10/analytic-functions-first_value-last_value-lead-lag/), not to mention that the delta doesn't seem to serve any practical purpose at the detail level.
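For instance, assuming MySQL 8.0+ (earlier versions lack window functions and need the emulation tricks from that link) and the same hypothetical table as above, the per-day delta can be derived at query time:

-- Compute each user's daily delta from the raw cumulative counts.
SELECT user_id,
       day,
       status_count
         - LAG(status_count) OVER (PARTITION BY user_id ORDER BY day)
         AS daily_delta
FROM status_count;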
With this detail data, you can aggregate it any way you'd like, truncating the DAY column down to WEEK or MONTH, but be careful how you build aggregates. You're talking about hundreds of millions of records per year (billions within a few years), and re-building aggregates over so many rows can be very costly, especially on a single database. You might consider doing the aggregation processing with Hadoop (I'd recommend Spark over plain old Map/Reduce as well; it's far more powerful). This will take the computation burden off your database server (which can't easily scale to multiple servers) and allow it to do its job of recording and storing new data.
You should consider partitioning your table as well. Some purposes of partitioning tables are to distribute query load, ease archival of data, and possibly increase insert performance. I would consider partitioning along the month boundary for an application such as you've described.

Database sharding vs partitioning

I have been reading about scalable architectures recently. In that context, two words that keep on showing up with regards to databases are sharding and partitioning. I looked up descriptions but still ended up confused.
Could the experts at stackoverflow help me get the basics right?
What is the difference between sharding and partitioning?
Is it true that 'all sharded databases are essentially partitioned (over different nodes), but all partitioned databases are not necessarily sharded' ?
Partitioning is more a generic term for dividing data across tables or databases. Sharding is one specific type of partitioning, part of what is called horizontal partitioning.
Here you replicate the schema across (typically) multiple instances or servers, using some kind of logic or identifier to know which instance or server to look for the data. An identifier of this kind is often called a "Shard Key".
A common, key-less logic is to use the alphabet to divide the data: A-D is instance 1, E-G is instance 2, etc. Customer data is well suited for this, but the instances will be somewhat uneven in size if the partitioning does not take into account that some letters are more common than others.
Another common technique is to use a key-synchronization system or logic that ensures unique keys across the instances.
A well known example you can study is how Instagram solved their partitioning in the early days (see link below). They started out partitioned on very few servers, using Postgres to divide the data from the get-go. I believe it was several thousand logical shards on those few physical shards. Read their awesome writeup from 2012 here: Instagram Engineering - Sharding & IDs
See here as well: http://www.quora.com/Whats-the-difference-between-sharding-and-partition
I've been diving into this as well, and although I'm by no means the reference on the matter, there are a few key facts I've gathered and points I'd like to share:
A partition is a division of a logical database or its constituent elements into distinct independent parts. Database partitioning is normally done for manageability, performance or availability reasons, as for load balancing.
https://en.wikipedia.org/wiki/Partition_(database)
Sharding is a type of partitioning, such as Horizontal Partitioning (HP)
There is also Vertical Partitioning (VP) whereby you split a table into smaller distinct parts. Normalization also involves this splitting of columns across tables, but vertical partitioning goes beyond that and partitions columns even when already normalized.
https://en.wikipedia.org/wiki/Shard_(database_architecture)
I really like Tony Baco's answer on Quora where he makes you think in terms of schema (rather than columns and rows). He states that...
"Horizontal partitioning", or sharding, is replicating [copying] the schema, and then dividing the data based on a shard key.
"Vertical partitioning" involves dividing up the schema (and the data goes along for the ride).
https://www.quora.com/Whats-the-difference-between-sharding-DB-tables-and-partitioning-them
Oracle's Database Partitioning Guide has some nice figures. I have copied a few excerpts from the article.
https://docs.oracle.com/cd/B28359_01/server.111/b32024/partition.htm
When to Partition a Table
Here are some suggestions for when to partition a table:
Tables greater than 2 GB should always be considered candidates for partitioning.
Tables containing historical data, in which new data is added into the newest partition. A typical example is a historical table where only the current month's data is updatable and the other 11 months are read only.
When the contents of a table need to be distributed across different types of storage devices.
Partition Pruning
Partition pruning is the simplest and also the most substantial means to improve performance using partitioning. Partition pruning can often improve query performance by several orders of magnitude. For example, suppose an application contains an Orders table containing a historical record of orders, and that this table has been partitioned by week. A query requesting orders for a single week would only access a single partition of the Orders table. If the Orders table had 2 years of historical data, then this query would access one partition instead of 104 partitions. This query could potentially execute 100 times faster simply because of partition pruning.
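To make the pruning example concrete, here is a hedged sketch in Oracle syntax (hypothetical names; only the first few weekly partitions shown):

-- Orders partitioned by week; a one-week query touches a single partition.
CREATE TABLE orders (
    order_id   NUMBER,
    order_date DATE,
    amount     NUMBER(12, 2)
)
PARTITION BY RANGE (order_date) (
    PARTITION p_2024_w01 VALUES LESS THAN (DATE '2024-01-08'),
    PARTITION p_2024_w02 VALUES LESS THAN (DATE '2024-01-15'),
    PARTITION p_2024_w03 VALUES LESS THAN (DATE '2024-01-22')
    -- ... one partition per week ...
);

-- The optimizer prunes this to the single partition covering the week.
SELECT * FROM orders
WHERE order_date >= DATE '2024-01-08'
  AND order_date <  DATE '2024-01-15';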
Partitioning Strategies
Range
Hash
List
You can read their text and visualize their images which explain everything pretty well.
And lastly, it is important to understand that databases are extremely resource intensive:
CPU
Disk
I/O
Memory
Many DBAs will partition on the same machine, where the partitions share all the resources but provide an improvement in disk and I/O by splitting up the data and/or indexes.
Other strategies employ a "shared nothing" architecture, where the shards reside on separate and distinct computing units (nodes), each having 100% of the CPU, disk, I/O and memory to itself. This comes with its own set of advantages and complexities.
https://en.wikipedia.org/wiki/Shared_nothing_architecture
Looks like this answers both your questions:
Horizontal partitioning splits one or more tables by row, usually within a single instance of a schema and a database server. It may offer an advantage by reducing index size (and thus search effort) provided that there is some obvious, robust, implicit way to identify in which table a particular row will be found, without first needing to search the index, e.g., the classic example of the 'CustomersEast' and 'CustomersWest' tables, where their zip code already indicates where they will be found.
Sharding goes beyond this: it partitions the problematic table(s) in the same way, but it does this across potentially multiple instances of the schema. The obvious advantage would be that search load for the large partitioned table can now be split across multiple servers (logical or physical), not just multiple indexes on the same logical server.
Source: Wiki-Shard.
Sharding is the process of storing data records across multiple machines and is MongoDB's approach to meeting the demands of data growth. As the size of the data increases, a single machine may not be sufficient to store the data nor provide an acceptable read and write throughput. Sharding solves the problem with horizontal scaling. With sharding, you add more machines to support data growth and the demands of read and write operations.
Source: MongoDB.
Consider a table in a database with 1 million rows and 100 columns.
In partitioning, you can divide the table into 2 or more tables with properties like:
0.4 million rows (table1), 0.6 million rows (table2)
1 million rows & 60 columns (table1) and 1 million rows & 40 columns (table2)
There could be multiple cases like that.
This is partitioning in general.
But sharding refers to the first case only, where we divide the data on the basis of rows. Since we are dividing the table into multiple tables, we need to maintain multiple similar copies of the schema, as we now have multiple tables.
When talking about partitioning, please do not use the terms replicate or replication. Replication is a different concept and out of scope for this page.
When we talk about partitioning, the better word is divide; when we talk about sharding, the better word is distribute.
In partitioning (normally and in common understanding, though not always), the rows of a large table are divided into two or more disjoint (not sharing any row) groups. You can call each group a partition. These groups, or all the partitions, remain under the control of one RDBMS instance, and this is all logical. The basis for each group can be a hash, a range, etc. If you have ten years of data in a table, then you can store each year's data in a separate partition, and this can be achieved by setting partition boundaries on the basis of a non-null CREATE_DATE column. Once you query the DB and specify a create date between 01-01-1999 and 31-12-2000, only two partitions will be hit, and the access will be sequential. I did something similar on a DB with a billion+ records, and SQL time came down from 30 seconds to 50 milliseconds using partitioning together with indexes and so on.
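To illustrate that year-based layout (MySQL syntax as one possible dialect; names are hypothetical and only a few of the yearly partitions are shown):

-- Yearly range partitions keyed on a non-null CREATE_DATE column.
CREATE TABLE history (
    id          BIGINT NOT NULL,
    create_date DATE NOT NULL,
    payload     VARCHAR(255),
    PRIMARY KEY (id, create_date)   -- the partitioning column must be in every unique key
)
PARTITION BY RANGE (YEAR(create_date)) (
    PARTITION p1999 VALUES LESS THAN (2000),
    PARTITION p2000 VALUES LESS THAN (2001),
    -- ... one partition per year ...
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Only the 1999 and 2000 partitions are scanned for this range.
SELECT * FROM history
WHERE create_date BETWEEN '1999-01-01' AND '2000-12-31';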
Sharding means that you host each partition on a different node/machine. Searching inside the partitions/shards can then happen in parallel.
Sharding is a special case of horizontal partitioning, where the partitions span multiple database instances. If a database is sharded, it is partitioned by definition.
A horizontal partition, when moved to another database instance, becomes a database shard.
The database instance can be on the same machine or on a different machine.

Can I have several 'similar' database tables to reduce retrieval time

It is best to explain my question in terms of a concrete example.
Consider an order management application that restaurants use to receive orders from their customers. I have a table called orders which stores all of them.
Now every day the tables keep growing in size but the amount of data accessed is constant. Generally the restaurants are only interested in orders received in the last day or so. After 100 days, for example, 'interesting' data is only about 1/100 of the table size; after 1 year it's 1/365 and so on.
Of course, I want to keep all the old orders, but performance for applications that are only interested in current orders keeps reducing. So what is the best way to not have old data interfere with the data that is 'interesting'?
From my limited database knowledge, one solution that occurred to me was to have two identical tables - order_present and order_past - within the same database. New orders would come into 'order_present', and a cron job would transfer all processed orders older than two days to 'order_past', keeping the size of 'order_present' constant.
Is this considered an acceptable solution to deal with this problem? What other solutions exist?
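For reference, the proposed nightly move could be sketched roughly like this (a sketch only, using the table names above plus a hypothetical created_at column and processed flag):

-- Copy processed orders older than two days into the archive table,
-- then remove them from the hot table. Run in small batches (or a
-- transaction) to limit locking on the live table.
INSERT INTO order_past
SELECT * FROM order_present
WHERE processed = 1
  AND created_at < NOW() - INTERVAL 2 DAY;

DELETE FROM order_present
WHERE processed = 1
  AND created_at < NOW() - INTERVAL 2 DAY;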
Database servers are pretty good at handling volume, but performance can be limited by the physical hardware. If it is the I/O latency that is bothering you, there are several solutions available. You really need to evaluate what fits best for your use case.
For example:
you can Partition the table to distribute it onto multiple physical disks.
you can do Sharding to put data on to different physical servers
you can evaluate using another Storage Engine which best fits your data and application. MyISAM delivers better read performance compared to InnoDB at the cost of being less ACID compliant
you can use Read Replicas to delegate all (or most) "select" queries to replicas (slaves) of the main database server (master)
Finally, MySQL Performance Blog is a great resource on this topic.

storing time-series data in a database or binary file

I am storing large amounts of time-series financial market tick data.
Generally, this data is written sequentially (ie - data is timestamped as it comes in, and then written to db).
I need to read the data based on timestamp (only) - ie a general query would be something like "select all data between 1-Jan-2012 and 1-Feb-2012".
Question: Am I better off storing this data in a binary file or a MySQL database, if READ performance is paramount?
It seems to me that the characteristics of the data may be better suited to a file, and my preliminary testing seems to indicate that this is faster (ie, I can read the data back faster).
Your description only talks about the time dimension. But what is/are the other dimension(s)? Probably the different financial instruments (MSFT, IBM, AAPL etc.).
The nature of financial market data usually is that it is received ordered by the time dimension (you get daily updates of hundred thousands of stock prices) but queried by the financial instrument dimension (you query all prices of a single instrument, possibly somewhat restricted by time).
So if you want maximum read performance, you have to make sure that your data isn't stored the way it's received but the way it will be queried, i.e. on the disk, it has to be physically ordered by financial instrument.
I've successfully implemented this in the past in Oracle. There you basically create an index-organized table with the financial instrument identifier and the date as the primary key (the identifier needs to be first). Oracle will then more or less store the data sorted by financial instrument identifier and date. So if you query the stock prices of a single instrument for a given time range, all the required data will be on consecutive disk pages, will already be in the desired order and thus the query will be very fast.
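For what it's worth, the Oracle variant described above is an index-organized table, roughly like this (hypothetical names):

-- Rows are stored in the B-tree itself, ordered by (ticker, price_date).
CREATE TABLE prices (
    ticker      VARCHAR2(10),
    price_date  DATE,
    close_price NUMBER(10, 4),
    CONSTRAINT prices_pk PRIMARY KEY (ticker, price_date)
) ORGANIZATION INDEX;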
I don't have much experience with MySQL. But as far as I understand it, you can achieve the same with the InnoDB storage engine and clustered indexes:
CREATE TABLE prices (
    ticker CHAR(10),
    date   DATE,
    close  DECIMAL(10, 4),      -- fixed-point price (MySQL's equivalent of Oracle's NUMBER)
    PRIMARY KEY (ticker, date)  -- InnoDB clusters rows by this key: instrument first, then date
) ENGINE=InnoDB;
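With that layout, a typical instrument-over-time query reads mostly consecutive pages, for example:

SELECT date, close
FROM prices
WHERE ticker = 'MSFT'
  AND date BETWEEN '2012-01-01' AND '2012-02-01'
ORDER BY date;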
And please don't use binary files. You'll regret it.