Let's say I have a table:
id  col1  col2  col3
1   ABC   DEF   XYZ
2   XXX   YYY   ZZZ
The most frequent queries are going to be:
SELECT * from XYZ where col1='abc' and col2='def'
SELECT * from XYZ where col1='abc' and col2='def' and col3='xyz'
As per the VoltDB docs:
Partitioning should be done on the column on which most searches are going to be performed.
Partitioning should be done on one column.
I couldn't find any example where the search is performed on multiple columns.
What is the best way to partition a table for multi-column searches?
EDIT:
Or what if my query looks like this:
SELECT * from XYZ where col1 IN ('abc', ..., ...) and col2 IN ('def', ...) and col3 IN ('xyz', ...)
Guidelines for picking a column:
First off, you should pick a partitioning column that has many different values. To illustrate, a Male/Female column partitions poorly if you have more than two partitions, which is common.
It's also a bad idea to pick a column with a few values that dominate other values. If 20% of your values are NULL, then more than 20% of your rows will partition to the same place. Distributions don't have to be even, but if you have "hot" values, it's helpful to at least have a lot more "hot" values than partitions.
Picking a timestamp can also be tricky if the timestamp advances more slowly than the rate of ingestion. In this case your load will round-robin through the partitions one at a time as the timestamp advances. In practice, though, a single partition can often handle 10-50k inserts per second, so this actually works for non-extreme use cases.
So if you partition on a column with lots of values that are pretty evenly distributed, your inserts will partition nicely and you will be able to ingest some serious load.
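To make the cardinality point concrete, here is a rough sketch in Python. It is not VoltDB's actual hashing: `partition_of` uses MD5 purely as a stable stand-in, and the partition count and sample columns are made up for illustration.

```python
import hashlib
from collections import Counter

def partition_of(value, n_partitions):
    """Map a value to a partition with a stable hash (a stand-in, not VoltDB's hash)."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_partitions

def partition_counts(values, n_partitions):
    """Count how many rows would land on each partition."""
    counts = Counter(partition_of(v, n_partitions) for v in values)
    return [counts.get(p, 0) for p in range(n_partitions)]

N_PARTITIONS = 8                      # hypothetical cluster size
gender = ["M", "F"] * 5000            # low-cardinality column: only two distinct values
customer_id = list(range(10000))      # high-cardinality column: 10,000 distinct values

gender_counts = partition_counts(gender, N_PARTITIONS)
id_counts = partition_counts(customer_id, N_PARTITIONS)

print("gender      ->", gender_counts)   # at most two partitions receive any rows
print("customer_id ->", id_counts)       # rows spread across all partitions
```

With only two distinct values, at most two of the eight partitions ever receive rows, no matter how good the hash is.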
Picking a column to optimize queries:
Now the question becomes, given a set of candidate columns, can you pick one to make your queries run faster?
Any query that matches on an equality test to the partition column can be sent to a single partition. In your example above, if you partitioned on col1 or col2, then both queries would be single-partition. If you partitioned on col3, only the second query would be single-partition.
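The rule above can be written down as a tiny check. This is only a model of the reasoning, not how VoltDB's planner is implemented; `is_single_partition` and the query sets are hypothetical names for the two example queries.

```python
def is_single_partition(partition_column, equality_columns):
    """A query can be routed to one partition if it tests the partition column for equality."""
    return partition_column in equality_columns

# Columns that each example query compares with '='
query1 = {"col1", "col2"}            # WHERE col1='abc' AND col2='def'
query2 = {"col1", "col2", "col3"}    # WHERE col1='abc' AND col2='def' AND col3='xyz'

for candidate in ("col1", "col2", "col3"):
    print(candidate,
          "query1:", is_single_partition(candidate, query1),
          "query2:", is_single_partition(candidate, query2))
```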
A lot of times the partitioning column will be obvious, perhaps a customer id or ticker symbol. But even if it's obvious, and especially if it's not, you're going to want to run queries that don't partition. The good news is that VoltDB 4.0 has made read-only cross-partition queries dramatically faster than in previous versions. Our internal benchmarks show that tens of thousands of queries per second are possible.
This level of cross-partition read performance is often better than the read performance of non-partitioned RDBMSs. So in VoltDB 4.0, it's now more important to partition for write operations than for reads. This makes partitioning a bit simpler.
Here are some criteria that may help in selecting a partition column:
Base considerations:
Should have values with sufficient cardinality so that it uses all of the partitions
Ideally, the values should hash evenly so that the distribution to the partitions is even.
That may leave you with several choices. Any would be fine if the workload was mostly inserts, because inserts will always provide the partitioning column value, so inserts will always be executed in a single partition and will therefore scale very well. To decide which column is best, you might consider:
For the queries and other transactions, which column is most commonly provided as an input parameter?
If there are transactions that involve multiple tables, which column is shared by all of the relevant tables?
If you need to join the table with another partitioned table, you must partition on one of the join keys.
Hopefully that will make it clear what the best choice is. There can be trade-offs, so sometimes it is worthwhile to test different approaches. You might also consider denormalizing slightly to give related tables a common partitioning key, which can result in a higher percentage of single-partition transactions or enable more joins. Also, it is perfectly OK to have queries that run as multi-partition transactions. These can scale to thousands per second, in some cases many thousands per second. So while you do want to maximize the percentage of the workload that is single-partition, you can still have a percentage that isn't.
Indexing is also very important. In your example, if you chose col1 or col2, then both queries would be executed as a single-partition transaction, but within a single partition there may be many records with different partition key values. Defining a column as the partition key does not automatically create an index on that column. You still want to define indexes to support the queries you need to perform quickly and frequently. VoltDB is a row store, so many of the same considerations you would use in creating indexes on a traditional RDBMS will apply. Based on the example queries, an index on (col1,col2,col3) would support both queries. If you have a lot of different search queries that need to run frequently, it may help to create multiple indexes.
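As a quick way to see one composite index serving both example queries, here is a sketch that uses Python's built-in sqlite3 module as a stand-in. SQLite is not VoltDB and its planner differs, so treat this only as an illustration of the leading-column idea; the index name `idx_xyz_cols` is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xyz (id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT, col3 TEXT)")
conn.executemany(
    "INSERT INTO xyz VALUES (?, ?, ?, ?)",
    [(1, "ABC", "DEF", "XYZ"), (2, "XXX", "YYY", "ZZZ")],
)
# One composite index; both queries filter on a leading prefix of its columns.
conn.execute("CREATE INDEX idx_xyz_cols ON xyz (col1, col2, col3)")

def plan(sql):
    """Return SQLite's query plan text for a statement (last column of each plan row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(str(row[-1]) for row in rows)

plan_two = plan("SELECT * FROM xyz WHERE col1 = 'ABC' AND col2 = 'DEF'")
plan_three = plan("SELECT * FROM xyz WHERE col1 = 'ABC' AND col2 = 'DEF' AND col3 = 'XYZ'")
print(plan_two)
print(plan_three)
```

Both plans should mention idx_xyz_cols rather than a full table scan. The same check in VoltDB is done with its own explain output.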
When designing the indexes it helps to examine the explain plans for your queries. You can do that in VoltDB's SQL interface using the following commands:
https://voltdb.com/docs/UsingVoltDB/sysprocexplain.php
https://voltdb.com/docs/UsingVoltDB/...xplainproc.php
You can also see these explain plans in the HTML catalog report that is output when you run "voltdb compile". The catalog report is also available through the web interface on port 8080.
The plan will show if the query execution would involve a table scan, or if it will use an index.
Related
I have a transactions table that is partitioned by client ID (currently there are 4 clients, so 4 partitions). If I query for client ID IN (1,2), is there any performance penalty compared to running the same query on a table without partitions?
I hear that MySQL maintains separate files for each partition, so a query on a partitioned table has to open multiple files internally and will slow down. Is this correct?
PARTITION BY LIST? BY RANGE? BY HASH? other? It can make a big difference.
Use EXPLAIN PARTITIONS SELECT ... to see if it is doing any "pruning". If it is not, then partitioning is a performance drag for that query.
In general, there are very few cases where partitioning provides any performance benefit. It sounds like your case will not benefit from partitioning. Think of it this way... First, it must decide which partition(s) to look in, then it will dig into the index to finish locating the row(s). Without partitioning, the first step is avoided, hence potentially faster.
If you grow to hundreds of "clients", hence "partitions", then the pruning is inefficient since each partition is essentially a "table".
See http://mysql.rjweb.org/doc.php/partitionmaint for a list of the only 4 use cases that I have found for partitioning.
For example, using a modulo 1024 hash on the auto-incrementing index to specify which table the content is in, then querying that table. This way, if there are millions of posts in the future, table sorting and selecting won't be slow, at the expense of not being as easily searchable. Are there any other downsides to partitioning a large table into many smaller tables, such as for blog post comments or forum thread replies?
Partitioning a table doesn't automatically make all queries against it faster. For instance, you could run a query searching for a particular userid, which is not the partitioning column. Then the query would have to search every partition anyway.
So you have to design the partitioning to match the query terms you want to optimize for.
Sometimes there's no way to do this, either because you have a variety of searches with different terms, or else the column in your search term can't be put into the primary key of the table (remember that MySQL partitioning columns must be included in the primary/unique keys of the table).
That said, in cases when you can partition in a manner that allows partition pruning to speed up the queries you want to prioritize, then yes, partitioning can give a lot of benefit. How much benefit depends on a lot of other factors.
I am trying to build a database that would contain a large number of records, each with a lot of columns (fields), maybe around 200-300 fields total for all tables. Let's say that in a few years I would have about 40,000,000 to 60,000,000 records.
I plan to normalize the database, so I will have a lot of tables (about 30-40) -> and lots of joins for queries.
The database will be strictly related to the US, meaning that queries will be based on one of the 50 states alone (a query won't be allowed to search, insert, etc. in multiple states, just one).
What can I do to have better performance?
Someone came up with the idea of having a separate set of tables for each state, meaning I would have 50 times the 30-40 data tables (about 1,500-2,000 tables)! Should I even consider this type of approach?
The next idea was to use partitioning based on the US 50 states. How about this?
Any other way?
The best optimization is determined by the queries you run, not by your tables' structure.
If you want to use partitioning, this can be a great optimization, if the partitioning scheme supports the queries you need to optimize. For instance, you could partition by US state, and that would help queries against data for a specific state. MySQL supports "partition pruning" so that the query would only run against the specific partition -- but only if your query mentions a specific value for the column you used as the partition key.
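You can always check whether partition pruning is effective by using EXPLAIN PARTITIONS (in MySQL 5.7 and later, plain EXPLAIN also shows a partitions column, and MySQL 8.0 no longer accepts the PARTITIONS keyword):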
EXPLAIN PARTITIONS
SELECT ... FROM MyTable WHERE state = 'NY';
That should report that the query uses a single partition.
Whereas if you need to run queries by date for example, then the partitioning wouldn't help; MySQL would have to repeat the query against all 50 partitions.
EXPLAIN PARTITIONS
SELECT ... FROM MyTable WHERE date > '2013-05-01';
That would list all partitions. There's a bit of overhead to query all partitions, so if this is your typical query, you should probably use range partitioning by date.
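To illustrate the pruning idea itself, here is a toy model in Python. It is not how MySQL implements partitioning; the rows, `partitions_to_scan`, and `query` are all made up to show which partitions must be touched for each kind of filter.

```python
from datetime import date

# Hypothetical rows, list-partitioned by state
rows = [
    {"state": "NY", "date": date(2013, 4, 20)},
    {"state": "NY", "date": date(2013, 5, 3)},
    {"state": "CA", "date": date(2013, 5, 10)},
    {"state": "TX", "date": date(2013, 3, 1)},
]

partitions = {}
for row in rows:
    partitions.setdefault(row["state"], []).append(row)

def partitions_to_scan(state=None):
    """Prune to one partition only when the query names a value for the partition key."""
    if state is not None:
        return [state] if state in partitions else []
    return sorted(partitions)  # no filter on the key: every partition must be scanned

def query(state=None, after=None):
    """Return (partitions scanned, matching rows) for an optional state and date filter."""
    scanned = partitions_to_scan(state)
    matches = [
        r for name in scanned for r in partitions[name]
        if after is None or r["date"] > after
    ]
    return scanned, matches

print(query(state="NY")[0])               # ['NY']
print(query(after=date(2013, 5, 1))[0])   # ['CA', 'NY', 'TX']
```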
So choose your partition key with the queries in mind.
Any other optimization technique follows a similar pattern -- it helps some queries, possibly to the disadvantage of other queries. So be sure you know which queries you need to optimize for, before you decide on the optimization method.
Re your comment:
Certainly there are many databases that have 40 million rows or more, but have good performance. They use different methods, including (in no particular order):
Indexing
Partitioning
Caching
Tuning MySQL configuration variables
Archiving
Increasing hardware capacity (e.g. more RAM, solid state drives, RAID)
My point above is that you can't choose the best optimization method until you know the queries you need to optimize. Furthermore, the best choice may be different for different queries, and may even change over time as data or traffic grows. Optimization is a continual process, because you won't know where your bottlenecks are until after you see how your data grows and the query traffic your database receives.
I now have an application requirement to build a database that can be queried on every field. Say the table is supposed to have 30 fields.
| f1 | f2 | f3 | ... |f30|
The frontend may need to query based on multiple or even all fields. For example, it may need to query all rows with f1 == x AND f2 < y AND f3 > z AND ... AND f30 = abc.
If I create an index for each field, insert and update operations would be slow. If I index only some fields, queries on un-indexed fields would be slow.
I suppose this is a common problem in a lot of application areas. Is there any mature solution for this kind of case?
You should set it up as a name/value pair table. One "field" for the field name and one "field" for the value. You would have a third field that would be the "record ID" linking all the records together. So in your example, each "entry" would have 30 records. Then you only need 1 index on the field name+field value, and you can add as many "fields" as you like without needing to alter the table structure.
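A minimal sketch of that layout, using Python's sqlite3 module so it can be run as-is. The table and column names (`field_values`, `record_id`, `field_name`, `field_value`) are hypothetical, values are stored as text, and only equality conditions are handled, so range filters like f2 < y would need extra work. It also assumes each record has at most one row per field name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE field_values ("
    " record_id INTEGER NOT NULL,"
    " field_name TEXT NOT NULL,"
    " field_value TEXT NOT NULL)"
)
# The single index on field name + field value described above
conn.execute("CREATE INDEX idx_name_value ON field_values (field_name, field_value)")
conn.executemany(
    "INSERT INTO field_values VALUES (?, ?, ?)",
    [
        (1, "f1", "x"), (1, "f2", "y"), (1, "f3", "z"),
        (2, "f1", "x"), (2, "f2", "other"), (2, "f3", "z"),
    ],
)

def find_records(conditions):
    """Return IDs of records matching every (field name, value) pair in a non-empty dict."""
    clauses = " OR ".join("(field_name = ? AND field_value = ?)" for _ in conditions)
    params = [item for pair in conditions.items() for item in pair]
    sql = (
        "SELECT record_id FROM field_values WHERE " + clauses +
        " GROUP BY record_id HAVING COUNT(*) = ? ORDER BY record_id"
    )
    return [row[0] for row in conn.execute(sql, params + [len(conditions)])]

print(find_records({"f1": "x", "f2": "y"}))
print(find_records({"f1": "x", "f3": "z"}))
```

A record qualifies only when the number of matched name/value rows equals the number of conditions, which is how the AND across fields is expressed.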
Indexes implement a space/time tradeoff. An index on every column
consumes more disk space,
makes some SELECT statements faster, and
makes some INSERT, UPDATE, and DELETE statements slower (because the dbms has to maintain the index as well as the row).
Very few user queries will select a random set of columns from your table. You'll probably find that two or three columns are in almost every query. Some kind of index on those columns will speed up all the queries that use them. A good query engine will use the indexes to isolate a subset of all the rows, then do a sequential scan on that subset for all the unindexed columns in the WHERE clause.
Often, that's fast enough for everybody. (Test, don't assume.)
If it isn't fast enough for everybody, then you examine query execution plans and user query patterns, take some performance measurements, add another index, and ask yourself whether you can live with the results. Each additional index will consume disk space, speed up some SELECT statements, and slow down some INSERT, UPDATE, and DELETE statements. (It's not common for users to notice how INSERT, UPDATE, and DELETE statements have slowed down; they usually don't slow down by very much.)
At some point, you might find that the SELECTers start complaining about the INSERTers, and vice versa. Unless you're willing to consider more invasive performance improvements
faster hardware,
server tuning,
moving some tables or indexes to faster disks,
perhaps even changing to a different dbms,
you now have a political problem, not a technical one.
I am looking at storing some JMX data from JVMs on many servers for about 90 days. This data would be statistics like heap size and thread count. This will mean that one of the tables will have around 388 million records.
From this data I am building some graphs so you can compare the stats retrieved from the Mbeans. This means I will be grabbing some data at an interval using timestamps.
So the real question is: is there any way to optimize the table or query so you can perform these queries in a reasonable amount of time?
Thanks,
Josh
There are several things you can do:
Build your indexes to match the queries you are running. Run EXPLAIN to see the types of queries that are run and make sure that they all use an index where possible.
Partition your table. Partitioning is a technique for splitting a large table into several smaller ones by a specific (aggregate) key. MySQL has supported this internally since version 5.1.
If necessary, build summary tables that cache the costlier parts of your queries. Then run your queries against the summary tables. Similarly, temporary in-memory tables can be used to store a simplified view of your table as a pre-processing stage.
3 suggestions:
index
index
index
p.s. for timestamps you may run into performance issues -- depending on how MySQL handles DATETIME and TIMESTAMP internally, it may be better to store timestamps as integers. (# secs since 1970 or whatever)
Well, for a start, I would suggest you use "offline" processing to produce 'graph ready' data (for most of the common cases) rather than trying to query the raw data on demand.
If you are using MySQL 5.1 you can use the new features,
but be warned that they contain a lot of bugs.
First, you should use indexes.
If this is not enough, you can try to split the tables by using partitioning.
If this also won't work, you can try load balancing.
A few suggestions.
You're probably going to run aggregate queries on this stuff, so after (or while) you load the data into your tables, you should pre-aggregate it: for instance, pre-compute totals by hour, by user, or by week, and store them in cache tables that you use for your reporting graphs. If you can shrink your dataset by an order of magnitude, then good for you!
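As a rough sketch of the pre-aggregation step, here is an hourly roll-up in Python. The sample format (Unix seconds plus a value) and the function name `aggregate_by_hour` are assumptions for illustration. In practice the result would be written to a summary table that the graphs read from.

```python
from collections import defaultdict

def aggregate_by_hour(samples):
    """Roll (unix_seconds, value) samples up to {hour_start: (count, mean)}."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[ts - ts % 3600].append(value)   # truncate timestamp to the start of its hour
    return {
        hour: (len(values), sum(values) / len(values))
        for hour, values in sorted(buckets.items())
    }

# Hypothetical heap-size samples: two in the first hour, two in the second
samples = [(0, 100.0), (1800, 200.0), (3600, 300.0), (7199, 500.0)]
summary = aggregate_by_hour(samples)
print(summary)   # {0: (2, 150.0), 3600: (2, 400.0)}
```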
This means I will be grabbing some data at an interval using timestamps.
So this means you only use data from the last X days?
Deleting old data from tables can be horribly slow if you have a few tens of millions of rows to delete. Partitioning is great for that (just drop the old partition). It also groups all records from the same time period close together on disk, so it's a lot more cache-efficient.
Now if you use MySQL, I strongly suggest using MyISAM tables. You don't get crash-proofness or transactions and locking is dumb, but the size of the table is much smaller than InnoDB, which means it can fit in RAM, which means much quicker access.
Since big aggregates can involve lots of rather sequential disk IO, a fast IO system like RAID10 (or SSD) is a plus.
Is there any way to optimize the table or query so you can perform these queries in a reasonable amount of time?
That depends on the table and the queries; I can't give any advice without knowing more.
If you need complicated reporting queries with big aggregates and joins, remember that MySQL does not support any fancy JOINs, or hash-aggregates, or anything else useful really. Basically the only thing it can do is a nested-loop index scan, which is good on a cached table and absolutely atrocious in other cases if some random access is involved.
I suggest you test with Postgres. For big aggregates the smarter optimizer does work well.
Example:
CREATE TABLE t (id INTEGER PRIMARY KEY AUTO_INCREMENT, category INT NOT NULL, counter INT NOT NULL) ENGINE=MyISAM;
INSERT INTO t (category, counter) SELECT n%10, n&255 FROM serie;
(serie contains 16M lines with n = 1 .. 16000000)
MySQL   Postgres   Operation
58 s    100 s      INSERT
75 s    51 s       CREATE INDEX on (category,id) (useless)
9.3 s   5 s        SELECT category, sum(counter) FROM t GROUP BY category;
1.7 s   0.5 s      SELECT category, sum(counter) FROM t WHERE id>15000000 GROUP BY category;
On a simple query like this pg is about 2-3x faster (the difference would be much larger if complex joins were involved).
EXPLAIN Your SELECT Queries
LIMIT 1 When Getting a Unique Row
SELECT * FROM user WHERE state = 'Alabama' -- wrong
SELECT 1 FROM user WHERE state = 'Alabama' LIMIT 1
Index the Search Fields
Indexes are not just for the primary keys or the unique keys. If there are any columns in your table that you will search by, you should almost always index them.
Index and Use Same Column Types for Joins
If your application contains many JOIN queries, you need to make sure that the columns you join by are indexed on both tables. This affects how MySQL internally optimizes the join operation.
Do Not ORDER BY RAND()
If you really need random rows out of your results, there are much better ways of doing it. Granted it takes additional code, but you will prevent a bottleneck that gets worse as your data grows. The problem is, MySQL will have to perform the RAND() operation (which takes processing power) for every single row in the table before sorting it and giving you just 1 row.
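One common alternative is to count the rows, pick a random offset in application code, and fetch just that row. Here is a sketch using Python's sqlite3 module as a stand-in for MySQL; the `posts` table and `random_row` function are hypothetical. Note that a large OFFSET still has to skip over rows, so this avoids the per-row RAND() and sort but is not free.

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO posts (title) VALUES (?)",
    [("post %d" % i,) for i in range(1, 101)],
)

def random_row(conn, rng=random):
    """Fetch one random row by choosing a random offset instead of ORDER BY RAND()."""
    (count,) = conn.execute("SELECT COUNT(*) FROM posts").fetchone()
    if count == 0:
        return None
    offset = rng.randrange(count)   # uniform in [0, count)
    return conn.execute(
        "SELECT id, title FROM posts ORDER BY id LIMIT 1 OFFSET ?", (offset,)
    ).fetchone()

row = random_row(conn)
print(row)
```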
Use ENUM over VARCHAR
ENUM type columns are very fast and compact. Internally they are stored as small integers (1 or 2 bytes, depending on the number of values), yet they can contain and display string values.
Use NOT NULL If You Can
Unless you have a very specific reason to use a NULL value, you should always set your columns as NOT NULL.
"NULL columns require additional space in the row to record whether their values are NULL. For MyISAM tables, each NULL column takes one bit extra, rounded up to the nearest byte."
Store IP Addresses as UNSIGNED INT
In your queries you can use INET_ATON() to convert an IP to an integer, and INET_NTOA() for vice versa. There are also similar functions in PHP called ip2long() and long2ip().
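If the conversion is done in application code rather than in SQL, the same mapping can be sketched in Python with the standard ipaddress module. The helper names `ip_to_int` and `int_to_ip` are made up here, and this covers IPv4 only.

```python
import ipaddress

def ip_to_int(ip):
    """Convert a dotted-quad IPv4 string to its unsigned 32-bit integer form."""
    return int(ipaddress.IPv4Address(ip))

def int_to_ip(number):
    """Convert an unsigned 32-bit integer back to a dotted-quad IPv4 string."""
    return str(ipaddress.IPv4Address(number))

print(ip_to_int("192.168.0.1"))   # 3232235521
print(int_to_ip(3232235521))      # 192.168.0.1
```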