How to do this as mysql query - mysql

I need help with a MySQL query. I have a table like this:
I have "type" 1 to 15. I would like help with a query to automatically change the "type".

I wouldn't do it as an update. You're saying that points determine type, and you want both points and type in the table, which means they must always be kept in sync. Of the two, points is the more fine-grained value: points can be used to determine type, but not the other way round. So we can devise a strategy to determine the type from the current points, let points increase freely, and the type will change automatically upon querying:
Make another table with the type and the lower and upper bound for points, then join it in to find the type:
CREATE TABLE TypeRanges (
  playerType INT,
  fromPoints INT,
  toPoints   INT
);
INSERT INTO TypeRanges VALUES (1, 0, 1599);
...
SELECT *
FROM username p
INNER JOIN TypeRanges t ON p.points BETWEEN t.fromPoints AND t.toPoints
Remember that BETWEEN is inclusive at both ends, so for "< 1600 points" you want the upper bound to be 1599; for 1600 to 14000 you probably want 1600 and 13999, and so on.
If you want, you can make a view out of this query and then use that view anywhere you want to know the points and the type together. See the comments for a bit more on what a view is used for.
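As a minimal sketch of that view (the view name is mine, and it assumes the username table has no column already called playerType):
CREATE VIEW playersWithType AS  -- view name is mine; uses the username/TypeRanges tables above
SELECT p.*, t.playerType
FROM username p
INNER JOIN TypeRanges t ON p.points BETWEEN t.fromPoints AND t.toPoints;
Anything that needs the type can then just SELECT from playersWithType as if it were a table.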
Footnote on dynamism / performance considerations:
Every time you run this query it will calculate the type from the points. Calculating the type at query time, rather than updating a stored type whenever the points change, means you can redefine the bounds or add new ones just by altering the ranges table. Because it is calculated on every read it is always in sync with the data, but it will be marginally slower than reading a stored type. In most cases the benefits of recalculating outweigh that cost, but if you're going to be querying it thousands of times per second and updating it once a year (as an extreme example) it may make sense to store the type instead. In a typical use case I would calculate the type from the points and only look to optimize if it proves to be a problem at scale (large numbers of users and lots of activity); assuming up front that the lookup will be unusably slow and storing the type instead would be a premature optimization - databases are engineered for rapid joining and retrieval. If you do determine that storing it is better, you can keep the sync transparent by using a trigger to recompute the type on each update.
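For completeness, a rough sketch of that trigger approach, assuming you add a playerType column to the username table and keep the TypeRanges table above (you would want a matching BEFORE INSERT trigger as well):
-- assumes a playerType column has been added to username (hypothetical name)
CREATE TRIGGER usernameSetTypeBeforeUpdate
BEFORE UPDATE ON username
FOR EACH ROW
SET NEW.playerType = (SELECT t.playerType
                      FROM TypeRanges t
                      WHERE NEW.points BETWEEN t.fromPoints AND t.toPoints);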
Side note: avoid using reserved words/keywords like TYPE as column names. They can be quoted, but it usually pays to find a more descriptive label that doesn't need to be quoted in queries or treated specially in front-end languages.

If you are happy storing your logic in the table itself, then you could use a generated ("calculated") column.
CREATE TABLE members (
  points INT,
  type DOUBLE AS
    (CASE
       WHEN points < 1600 THEN 1
       WHEN points < 14000 THEN 2
       -- TODO: implement other cases
       ELSE 3
     END)
);
So when the points column is updated, there is no need to "update" the type column - the type is automatically calculated from the points whenever the table is read.
https://www.db-fiddle.com/f/fqEL2jtNio1Srt5xLkyLy2/0
Edit:
As Caius Jard mentions in their answer, the performance of selects would start to degrade at scale, but how you optimize depends on how volatile the points are, and how frequent your reads are.

Designing Database Schema for Event-based Analytics

I'm trying to figure out the best way to model the schema for this event-based analytics system I'm writing. My main concern is writing this in a way that makes queries simple and fast. I'm going to be using MySQL as well. I'll go over some of the requirements and present an outline of a possible (but I think poor) schema.
Requirements
Track events (e.g. track occurrences of the "APP_LAUNCH" event)
Define custom events
Ability to segment events on >1 custom properties (e.g. get occurrences of "APP_LAUNCH" segmented on the "APP_VERSION" property)
Track sessions
Perform queries based on timestamp range
Possible Modeling
The main problem that I'm having is how to model segmentation and the queries to perform to get the overall counts of an event.
My original idea was to define an EVENTS table with an id, int count, timestamp, property (?), and a foreign key to an EVENTTYPE. An EVENTTYPE has an id, name, and additional information belonging to a generic event type.
For example, the "APP_LAUNCH" event would have an entry in the EVENTS table with unique id, count representing the number of times the event happened, the timestamp (unsure about what this is stamped on), and a property or list of properties (e.g. "APP_VERSION", "COUNTRY", etc.) and a foreign key to an EVENTTYPE with name "APP_LAUNCH".
Comments and Questions
I'm pretty sure this isn't a good way to model this for the following reasons. It makes it difficult to do timestamp ranged queries ("Number of APP_LAUNCHES between time x and y"). The EVENTTYPE table doesn't really serve a purpose. Finally, I'm unsure as to how I would even perform queries for different segmentations. The last one is the one I'm most worried about.
I would appreciate any help in modeling this correctly, or pointers to resources that would help.
A final question (which is probably dumb): Is it bad to insert a row for every event? For example, say my client-side library makes the following call to my API:
track("APP_LAUNCH", {count: 4, segmentation: {"APP_VERSION": 1.0}})
How would I actually store this in the table (this is closely related to the schema design, obviously)? Is it bad to simply insert a row for each one of these calls, of which there may be a significant number? My gut reaction is that I'm really interested mainly in the overall aggregated counts. I don't have enough experience with SQL to know how these queries perform over possibly hundreds of thousands of these entries. Would an aggregate table or an in-memory cache help to alleviate problems when I want the client to actually get the analytics?
I realize there are lots of questions here, but I would really appreciate any and all help. Thanks!
I think most of your concerns are unnecessary. Taking your questions one after another:
1) The biggest issue is the custom attributes, which differ for each event. For this, you have to use an EAV (entity-attribute-value) design. The important question is: what types can these attributes have? If more than one - e.g. string and integer - then it is more complicated. There are in general two ways to design this:
use one table and one column for values of all types, converting everything to string (not a scalable solution)
have separate tables for each data type (very scalable, I'd go for this)
So, the tables would look like:
Events             (EventId int, EventTypeId varchar, TS timestamp)
EventAttrValueInt  (EventId int, AttrName varchar, Value int)
EventAttrValueChar (EventId int, AttrName varchar, Value varchar)
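Spelled out as MySQL DDL, that could look roughly like this (the VARCHAR sizes, keys and foreign keys are my assumptions):
-- column sizes, keys and foreign keys below are assumptions
CREATE TABLE Events (
  EventId     INT AUTO_INCREMENT PRIMARY KEY,
  EventTypeId VARCHAR(50) NOT NULL,
  TS          TIMESTAMP NOT NULL
);

CREATE TABLE EventAttrValueInt (
  EventId  INT NOT NULL,
  AttrName VARCHAR(50) NOT NULL,
  Value    INT,
  PRIMARY KEY (EventId, AttrName),
  FOREIGN KEY (EventId) REFERENCES Events (EventId)
);

CREATE TABLE EventAttrValueChar (
  EventId  INT NOT NULL,
  AttrName VARCHAR(50) NOT NULL,
  Value    VARCHAR(255),
  PRIMARY KEY (EventId, AttrName),
  FOREIGN KEY (EventId) REFERENCES Events (EventId)
);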
2) What do you mean by segmentation? Querying various parameters of the event? In the EAV design mentioned above, you can do this:
select e.*
from Events e
join EventAttrValueInt i on i.EventId = e.EventId
    and i.AttrName = 'APPVERSION' and i.Value > 4
join EventAttrValueChar c on c.EventId = e.EventId
    and c.AttrName = 'APP_NAME' and c.Value like '%Office%'
where e.EventTypeId = 'APP_LAUNCH'
This will select all events of APP_LAUNCH type where APPVERSION is > 4 and APP_NAME contains "Office".
3) The EVENTTYPE table could serve the purpose of consistency, i.e. you could:
table EVENTS (.... EVENTTYPE_ID varchar - foreign key to EVENTTYPE ...)
table EVENTTYPE (EVENTTYPE_ID varchar)
Or, you could use a numeric ID and keep the event name only in the EVENTTYPE table - this saves space and lets you rename events easily, but you will need to join this table in every query (resulting in slightly slower queries). It depends on whether you prioritize saving storage space or lower query time / simplicity.
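As a sketch of that numeric-id variant (column names and sizes are assumptions; Events.EventTypeId would then become an INT pointing at this table):
-- hypothetical lookup-table variant; names and sizes assumed
CREATE TABLE EventType (
  EventTypeId INT PRIMARY KEY,
  Name        VARCHAR(50) NOT NULL UNIQUE
);

SELECT e.*
FROM Events e
JOIN EventType et ON et.EventTypeId = e.EventTypeId
WHERE et.Name = 'APP_LAUNCH';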
4) timestamp ranged queries are actually very simple in your design:
select *
from EVENTS
where EVENTTYPE_ID = 'APP_LAUNCH' and TIMESTAMP > '2013-11-01'
5) "Is it bad to insert a row for every event?"
This depends entirely on your needs! If you need the timestamp and/or different parameters for every such event, then you should probably have a row for every event. If there is a huge number of events with the same type and parameters, you can do what most logging systems do: aggregate the repeated events into a single row. If that matches your gut feeling, it's probably the way to go.
6) " I don't have enough experience with SQL to know how these queries perform over possibly hundreds of thousands of these entries"
Hundreds of thousands of such entries will be handled without problems. When you reach millions, you will have to think much harder about efficiency.
7) "Would an aggregate table or a in-memory cache help to alleviate problems when I want the client to actually get the analytics?"
Of course, this is also a solution if the queries get slow and you need to respond fast, but then you must introduce some mechanism to refresh the cache periodically. It is more complicated overall; it may be better to aggregate the events on input instead, see 5).
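If you do go the aggregate-on-input route from 5), one common MySQL pattern is a counter table keyed by the dimensions you care about, updated with INSERT ... ON DUPLICATE KEY UPDATE (all names and sizes here are assumptions):
-- hypothetical counter table; names, sizes and dimensions are assumptions
CREATE TABLE EventCounts (
  EventTypeId VARCHAR(50) NOT NULL,
  AppVersion  VARCHAR(20) NOT NULL,
  Day         DATE        NOT NULL,
  EventCount  INT         NOT NULL,
  PRIMARY KEY (EventTypeId, AppVersion, Day)
);

-- e.g. for track("APP_LAUNCH", {count: 4, segmentation: {"APP_VERSION": 1.0}})
INSERT INTO EventCounts (EventTypeId, AppVersion, Day, EventCount)
VALUES ('APP_LAUNCH', '1.0', CURRENT_DATE, 4)
ON DUPLICATE KEY UPDATE EventCount = EventCount + VALUES(EventCount);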

mySQL - Do the operations and store in table or do the math every time I fetch the data

As simple as that, what's better in terms of performance? I'm doing some calculations based on user data input (simple arithmetic), should I do the operations and store the result in the database or do the operation each time I do the SELECT query?
Option 1: operate each time I fetch data from the table
SELECT
some_random_fields,
salary,
extra_days,
extra_days * salary * 0.05 AS extra_income
FROM
table
WHERE
user_id = 'xxx'
Option 2: operate once, INSERT INTO table, fetch result without operating (extra column)
SELECT
some_random_fields,
salary,
extra_days,
extra_income
FROM
table
WHERE
user_id = 'xxx'
The answer is that it depends.
In the case you've described, it would clearly be advantageous to do the calculation at the time it's needed, because the salary or the percentage (0.05) can change over time (people get raises or demotions, hours are reduced, or the current economy calls for using 0.04 instead of 0.05). It's better to calculate it as needed than to have to update the entire table to store a new extra_income. The cost of the calculation (especially when limited to a single user by the WHERE clause) is negligible compared to the accuracy of the result and the elimination of the need to remember to update all of the data when things change.
If the data is static (rarely or never changes), or you need to retain the values (for some historical reason, or for an audit trail), do the calculation up front and store it. The extra space used isn't typically an issue, and since the data is static there's no need to repeat the calculations every time you're doing a SELECT.
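If you do end up storing the value (option 2) but still want it kept in sync automatically, a STORED generated column is one option in MySQL 5.7+; this is only a sketch, and the table name and column types are my assumptions:
-- hypothetical table; name and column types assumed
CREATE TABLE payroll (
  user_id      VARCHAR(20),
  salary       DECIMAL(10,2),
  extra_days   INT,
  extra_income DECIMAL(12,2) AS (extra_days * salary * 0.05) STORED
);
Note that a generated column still re-derives the value from the current salary, so for a true historical snapshot (the audit-trail case) you would want a plain column written at calculation time instead.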

Need correct database structure to reduce the size

I want to design my database correctly. Maybe someone could help me with that.
I have a device which writes around 100 keys/values to a table every 3 seconds.
Someone suggested to store it like this:
^ timestamp ^ key1 ^ key2    ^ [...] ^ key150 ^
| 12/06/12  | null | 2243466 | [...] | null   |
But I think that's completely wrong and not dynamic, because I could have many null values.
So I tried to do my best and designed it how I learned it at school:
http://ondras.zarovi.cz/sql/demo/?keyword=tempidi
The problem here is that I write the timestamp for every value, which means that within those 100 values the timestamp is always the same, and this produces a large amount of data.
Could someone give a me hint how to reduce the database size? Am I basically correct with my ERM?
I wouldn't worry so much about the database size. Your bigger problem is maintenance and flexibility.
Here's what I would do. First, define and fill this table with possible keys your device can write:
tblDataKey
(
ID int primary key (auto-increment - not sure how mysql does this)
Name varchar(32)
)
Next define a 'data event' table:
tblEvent
(
ID int primary key (auto-inc)
TimeStamp
...anything else you need - device ID's? ...
)
Then match events with keys and their values:
tblEventData
(
EventID INT FK-to-tblEvent
KeyID INT FK-to-tblDataKey
DataValue varchar(???)
)
Now, every however-many seconds your data comes in, you create a single entry in tblEvent and multiple entries in tblEventData with key-values as needed. Not every event needs every key, and you can expand the number of keys in the future.
This really shines in that space isn't wasted and you can easily do queries for events with specific data keys and values. Where this kind of structure falls down is when you need to produce 'cross-tab-like' tables of events and data items. You'll have to decide if that's a problem or not.
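For reference, a minimal MySQL version of those three tables might look like this (the VARCHAR sizes, NOT NULL/UNIQUE constraints and keys are my assumptions):
-- sizes, constraints and keys below are assumptions
CREATE TABLE tblDataKey (
  ID   INT AUTO_INCREMENT PRIMARY KEY,
  Name VARCHAR(32) NOT NULL UNIQUE
);

CREATE TABLE tblEvent (
  ID        INT AUTO_INCREMENT PRIMARY KEY,
  TimeStamp DATETIME NOT NULL
  -- ...anything else you need - device IDs? ...
);

CREATE TABLE tblEventData (
  EventID   INT NOT NULL,
  KeyID     INT NOT NULL,
  DataValue VARCHAR(64),
  PRIMARY KEY (EventID, KeyID),
  FOREIGN KEY (EventID) REFERENCES tblEvent (ID),
  FOREIGN KEY (KeyID)   REFERENCES tblDataKey (ID)
);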
If you must implement a key-value store in MySQL, it doesn't make any sense to make it more complicated than this.
create table key_value_store (
run_time datetime not null,
key_name varchar(15) not null,
key_value varchar(15) not null,
primary key (run_time, key_name)
);
If the average length of both your keys and values is 10 bytes, you're looking at about 86 million rows and 2.5GB per month, and you don't need any joins. If all your values (column key_value) are either integers or floats, you can change the data type and reduce space a little more.
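To see where those numbers come from (assuming the 3-second / 100-key rate from the question): 100 rows every 3 seconds is about 33 rows per second, which is roughly 2.9 million rows per day and about 86 million rows per 30-day month; at roughly 30 bytes per row (a datetime plus two ~10-byte strings plus row overhead) that works out to about 2.5 GB.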
One of the main problems with implementing key-value stores in SQL is that, unless all values are the same data type, you have to use something like varchar(n) for all values. You lose type safety and declarative constraints. (You can't check that the value for key3 is between 1 and 15, while the value for key7 is between 0 and 3.)
Is this feasible?
This kind of structure (known as "EAV"--Google that) is a well-known table design anti-pattern. Part of the problem is that you're essentially storing columns as rows. (You're storing column names in key_value_store.key_name.) If you ever have to write out data in the format of a normal table, you'll discover three things.
It's hard to write queries to output the right format (see the sketch after this list).
It takes forever to run. If you have to write hundreds of columns, it might never run to completion.
You'll wish you had much faster hardware. Much, much faster hardware.
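To make that concrete, turning key_value_store rows back into columns means writing one conditional aggregate per key; the key names below are made up:
SELECT run_time,
       MAX(CASE WHEN key_name = 'key1' THEN key_value END) AS key1,  -- 'key1', 'key2' are made-up key names
       MAX(CASE WHEN key_name = 'key2' THEN key_value END) AS key2
       -- ...one more line like the above for every key you want as a column
FROM key_value_store
GROUP BY run_time;
With 100-plus keys that select list becomes enormous, and every output row has to aggregate over all of its key rows, which is why these reports get slow.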
What I look for
Opportunities to group keys into logical tables. This has to do with the first design, and it might not apply to you. It sounds like your application is basically storing a log file, and you don't know which keys will have values on each run.
Opportunities to reduce the number of rows. I'd ask, "Can we write less often?" So I'd be looking at writing to the database every 5 or 6 seconds instead of every 3 seconds, assuming that means I'm writing fewer rows. (The real goal is fewer rows, not fewer writes.)
The right platform. PostgreSQL 9.2 might be a better choice for this. Version 9.2 has index-only scans, and it has an hstore module that implements a key-value store.
Test before you decide
If I were in your shoes, I'd build this table in both MySQL and PostgreSQL. I'd load each with about a million rows of random-ish data. Then I'd try some queries and reports on each. (Reports are important.) Measure the performance. Increase the load to 10 million rows, retune the server and the dbms, and run the same queries and reports again. Measure again.
Repeat with 100 million rows. Quit when you're confident. Expect all this to take a couple of days.

How does a hash table work? Is it faster than "SELECT * from .."

Let's say, I have :
Key | Indexes | Key-values
----+---------+------------
001 | 100001 | Alex
002 | 100002 | Micheal
003 | 100003 | Daniel
Let's say we want to search for 001 - how does the fast searching process using a hash table work?
Isn't it the same as using "SELECT * from .." in MySQL? I've read a lot; they say "SELECT *" searches from beginning to end, but a hash table doesn't. Why and how?
By using a hash table, are we reducing the number of records we search? How?
Can anyone demonstrate how to insert and retrieve hash table process in mysql query code? e.g.,
SELECT * from table1 where hash_value="bla" ...
Another scenario:
If the indexes are like S0001, S0002, T0001, T0002, etc., in MySQL I could use:
SELECT * from table WHERE value = S*
isn't it the same and faster?
A simple hash table works by keeping the items on several lists, instead of just one. It uses a very fast and repeatable (i.e. non-random) method to choose which list to keep each item on. So when it is time to find the item again, it repeats that method to discover which list to look in, and then does a normal (slow) linear search in that list.
By dividing the items up into 17 lists, the search becomes 17 times faster, which is a good improvement.
Although of course this is only true if the lists are roughly the same length, so it is important to choose a good method of distributing the items between the lists.
In your example table, the first column is the key - the thing we use to find the item. And let's suppose we will maintain 17 lists. To insert something, we perform an operation on the key called hashing. This just turns the key into a number. It doesn't return a random number, because it must always return the same number for the same key. But at the same time, the numbers must be "spread out" widely.
Then we take the resulting number and use modulus to shrink it down to the number of lists:
Hash(key) % 17
This all happens extremely fast. Our lists are in an array, so:
_lists[Hash(key) % 17].Add(record);
And then later, to find the item using that key:
Record found = _lists[Hash(key) % 17].Find(key);
Note that each list can just be any container type, or a linked list class that you write by hand. When we execute a Find in that list, it works the slow way (examine the key of each record).
Do not worry about what MySQL is doing internally to locate records quickly. The job of a database is to do that sort of thing for you. Just run a SELECT [columns] FROM table WHERE [condition]; query and let the database generate a query plan for you. Note that you generally don't want to use SELECT *: if you ever add a column to the table, it can break old code that relied on the result having a certain number of columns in a certain order.
If you really want to know what's going on under the hood (it's good to know, but do not implement it yourself: that is the purpose of a database!), you need to know what indexes are and how they work. If a table has no index on the columns involved in the WHERE clause, then, as you say, the database will have to search through every row in the table to find the ones matching your condition. But if there is an index, the database will search the index to find the exact location of the rows you want, and jump directly to them. Indexes are usually implemented as B+-trees, a type of search tree that uses very few comparisons to locate a specific element. Searching a B-tree for a specific key is very fast. MySQL is also capable of using hash indexes, but these tend to be slower for database uses. Hash indexes usually only perform well on long keys (character strings especially), since they reduce the size of the key to a fixed hash size. For data types like integers and real numbers, which have a well-defined ordering and fixed length, the easy searchability of a B-tree usually provides better performance.
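If you want to see this for yourself, you can add an index on the column in your WHERE clause and ask MySQL for the query plan; the table and column names below are just the placeholders from your example:
-- placeholder names from the question; without an index on hash_value the WHERE clause forces a scan of every row
CREATE INDEX idx_table1_hash_value ON table1 (hash_value);

-- the "key" column of EXPLAIN's output shows whether the index is being used
EXPLAIN SELECT * FROM table1 WHERE hash_value = 'bla';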
You might like to look at the chapters in the MySQL manual and PostgreSQL manual on indexing.
http://en.wikipedia.org/wiki/Hash_table
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structures; database indices sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.
I guess you could use a hash function to get the ID you want to select from. Like
SELECT * FROM table WHERE value = hash_fn(whatever_input_you_build_your_hash_value_from)
Then you don't need to know the id of the row you want to select and can do an exact query, since the row will always get the same id: you build the hash value from the same input, so you can always recreate that id through the hash function.
However, this isn't always true, depending on the size of the table and the maximum number of hash values (you often have "X mod hash-table-size" somewhere in your hash). To take care of this you need a deterministic strategy to apply whenever two inputs produce the same id. Check Wikipedia for more info on this strategy; it's called collision handling and is covered in the same article as hash tables.
MySQL probably uses hash tables somewhere because of the O(1) lookup that norheim.se mentioned above.
Hash tables are great for locating entries at O(1) cost where the key (that is used for hashing) is already known. They are in widespread use both in collection libraries and in database engines. You should be able to find plenty of information about them on the internet. Why don't you start with Wikipedia or just do a Google search?
I don't know the details of mysql. If there is a structure in there called "hash table", that would probably be a kind of table that uses hashing for locating the keys. I'm sure someone else will tell you about that. =)
EDIT: (in response to comment)
Ok. I'll try to make a grossly simplified explanation: A hash table is a table where the entries are located based on a function of the key. For instance, say that you want to store info about a set of persons. If you store it in a plain unsorted array, you would need to iterate over the elements in sequence in order to find the entry you are looking for. On average, this will need N/2 comparisons.
If, instead, you put all entries at indexes based on the first character of the person's first name (A=0, B=1, C=2, etc.), you will immediately be able to find the correct entry as long as you know the first name. This is the basic idea. You probably realize that some special handling (rehashing, or allowing lists of entries) is required in order to support multiple entries having the same first letter. If you have a well-dimensioned hash table, you should be able to get straight to the item you are searching for. This means approximately one comparison, with the disclaimer of the special handling I just mentioned.

char vs varchar for performance in stock database

I'm using mySQL to set up a database of stock options. There are about 330,000 rows (each row is 1 option). I'm new to SQL so I'm trying to decide on the field types for things like option symbol (varies from 4 to 5 characters), stock symbol (varies from 1 to 5 characters), company name (varies from 5 to 60 characters).
I want to optimize for speed: both for creating the database (which happens every 5 minutes as new price data comes out - I don't have a real-time data feed, but it's near real-time in that I get a new text file with 330,000 rows delivered to me every 5 minutes, and this new data completely replaces the prior data) and for lookup speed (there will be a web-based front end where many users can run ad hoc queries).
If I'm not concerned about space (since the database's lifetime is 5 minutes, and each row contains maybe 300 bytes, so maybe 100 MB for the whole thing), then what is the fastest way to structure the fields?
Same question for numeric fields, actually: Is there a performance difference between int(11) and int(7)? Does one length work better than another for queries and sorting?
Thanks!
In MyISAM, there is some benefit to making fixed-width records. VARCHAR is variable width; CHAR is fixed width. If your rows contain only fixed-width data types, then the whole row is fixed width, and MySQL gains some advantage calculating the space requirements and offsets of rows in that table. That said, the advantage may be small, and the possible tiny gain is easily outweighed by other costs (such as cache efficiency) of having fixed-width, padded CHAR columns where VARCHAR would store more compactly.
The breakpoint where it becomes more efficient depends on your application, and this is not something that can be answered except by you testing both solutions and using the one that works best for your data under your application's usage.
Regarding INT(7) versus INT(11), this is irrelevant to storage or performance. It is a common misunderstanding that MySQL's argument to the INT type has anything to do with size of the data -- it doesn't. MySQL's INT data type is always 32 bits. The argument in parentheses refers to how many digits to pad if you display the value with ZEROFILL. E.g. INT(7) will display 0001234 where INT(11) will display 00000001234. But this padding only happens as the value is displayed, not during storage or math calculation.
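If you want to see the display-width behaviour for yourself, a throwaway table like this (the name is mine; ZEROFILL is deprecated in MySQL 8.0 but still works) shows that both columns hold the same 32-bit value:
-- demo table; name is mine
CREATE TABLE width_demo (
  a INT(7)  ZEROFILL,
  b INT(11) ZEROFILL
);
INSERT INTO width_demo VALUES (1234, 1234);
SELECT a, b FROM width_demo;
-- a displays as 0001234, b as 00000001234; the stored values are identical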
If the actual data in a field can vary a lot in size, varchar is better because it leads to smaller records, and smaller records mean a faster DB (more records can fit into cache, smaller indexes, etc.). For the same reason, using smaller ints is better if you need maximum speed.
OTOH, if the variance is small, e.g. a field has a maximum of 20 chars, and most records actually are nearly 20 chars long, then char is better because it allows some additional optimizations by the DB. However, this really only matters if it's true for ALL the fields in a table, because then you have fixed-size records. If speed is your main concern, it might even be worth it to move any non-fixed-size fields into a separate table, if you have queries that use only the fixed-size fields (or if you only have shotgun queries).
In the end, it's hard to generalize because a lot depends on the access patterns of your actual app.
Given your system constraints I would suggest VARCHAR, since anything you do with the data will have to accommodate whatever padding you put in place to make use of a fixed-width CHAR. That means more code somewhere, which means more to debug and more potential for errors. That being said:
The major bottleneck in your application is due to dropping and recreating your database every five minutes. You're not going to get much performance benefit out of microenhancements like choosing char over varchar. I believe you have some more serious architectural problems to address instead. – Princess
I agree with the above comment. You have bigger fish to fry in your architecture before you can afford to worry about the difference between a char and varchar. For one, if you have a web user attempting to run an ad hoc query and the database is in the process of being recreated, you are going to get errors (i.e. "database doesn't exist" or simply "timed out" type issues).
I would suggest that instead you build (at the least) a quote table for the most recent quote data (with a time stamp), a ticker symbol table and a history table. Your web users would query against the ticker table to get the most recent data. If a symbol comes over in your 5-minute file that doesn't exist, it's simple enough to have the import script create it before posting the new info to the quote table. All others get updated and queries default to the current day's data.
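As a rough sketch of that layout (all the names, sizes and the DECIMAL price type below are assumptions, not a finished design):
-- all names, sizes and types below are assumptions
CREATE TABLE ticker (
  ticker_id INT AUTO_INCREMENT PRIMARY KEY,
  symbol    VARCHAR(5)  NOT NULL UNIQUE,  -- stock symbol, 1-5 chars
  company   VARCHAR(60)
);

CREATE TABLE quote (
  option_symbol VARCHAR(5) PRIMARY KEY,   -- option symbol, 4-5 chars
  ticker_id     INT NOT NULL,
  price         DECIMAL(10,2),
  quoted_at     DATETIME NOT NULL,
  FOREIGN KEY (ticker_id) REFERENCES ticker (ticker_id)
);

CREATE TABLE quote_history (
  option_symbol VARCHAR(5) NOT NULL,
  price         DECIMAL(10,2),
  quoted_at     DATETIME NOT NULL,
  KEY idx_history_symbol_time (option_symbol, quoted_at)
);
Each 5-minute file would then update or insert rows in quote and append to quote_history, rather than rebuilding the whole database.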
I would definitely not recreate the database each time. Instead I would do the following:
read in the update/snapshot file and create some object based on each row.
for each row get the symbol/option name (unique) and set that in the database
If it were me I would also have an in-memory cache of all the symbols and the current price data.
Price data is never an int, so you can store it as characters (though a DECIMAL type is usually the better fit).
The company name is probably not unique, as there are many options for a particular company. That should be a separate, indexed table, and you can save space by just storing the id of the company on each row.
As someone else also pointed out - your web clients do not need to have to hit the actual database and do a query - you can probably just hit your cache. (though that really depends on what tables and data you expose to your clients and what data they want)
Having query access for other users is also a reason NOT to keep removing and creating a database.
Also remember that creating databases is subject to whatever actual database implementation you use. If you ever port from MySQL to, say, PostgreSQL, you will discover the very unpleasant fact that creating databases in PostgreSQL is a comparatively slow operation - orders of magnitude slower than reading and writing table rows, for instance.
It looks like there is an application design problem to address first, before you optimize for performance choosing proper data types.