How does a relational database organize data? - mysql

I was thinking that a relational database will store every possible query and the values to return for that query in a hash table.
So like, if each entry in your table had 5 attributes, then you would make a copy of that element for each subset of the 5 attributes that appear in any given query that should return that specific entry. So every individual entry would appear 2^5 = 32 times in the table. This seems like it would be very memory inefficient for large data sets with many entries, but it also allows for the fastest possible query time.
Do real world relational-databases have a mixed version of this where some response time for queries/lookups is traded off for more memory efficiency? If so, how would this be implemented?

That's not how relational databases store data. Keep in mind it's a lot more than 2^32, because you can make queries that have expressions, not simply references to attribute columns. Also queries that are joins, which expands the possibilities immensely.
Even if you could store all possible combinations, it would be a waste because most of them will never be needed.
Instead, databases typically store records, where a record includes all columns of one table. If you run a query that only needs some columns, the DBMS still fetches the whole record, and simply ignores columns that you didn't ask for. Then it evaluates any expressions in your query. And finally returns the result set.
MySQL does not use hash tables to store these records, it uses a B+Tree data structure, so looking up a record by its primary key takes O(log n) time.

Related

Database Optimisation through denormalization and smaller rows

Does tables with many columns take more time than the tables with less columns during SELECT or UPDATE query? (row count is same and I will update/select same number of columns in both cases)
example: I have a database to store user details and to store their last active time-stamp. In my website, I only need to show active users and their names.
Say, one table named userinfo has the following columns: (id,f_name,l_name,email,mobile,verified_status). Is it a good idea to store last active time also in the same table? Or its better to make a separate table(say, user_active) to store the last activity timestamp?
The reason I am asking, If I make two tables, userinfo table will only be accessed during new signups(to INSERT new user row) and I will use user_active table (table with less columns) to UPADATE timestamp and SELECT active users frequently.
But the cost I have to pay for creating two tables is data duplication as user_active table columns will be (id, f_name, timestamp).
The answer to your question is that, to a close approximation, having more columns in a table does not really take more time than having fewer columns for accessing a single row. This may seem counter-intuitive, but you need to understand how data is stored in databases.
Rows of a table are stored on data pages. The cost of a query is highly dependent on the number of pages that need to be read and written during the course of the query. Parsing the row from the data page is usually not a significant performance issue.
Now, wider rows do have a very slight performance disadvantage, because more data would (presumably) be returned to the user. This is a very minor consideration for rows that fit on a single page.
On a more complicated query, wider rows have a larger performance disadvantage, because more data pages need to be read and written for a given number of rows. For a single row, though, one page is being read and written -- assuming you have an index to find that row (which seems very likely in this case).
As for the rest of your question. The structure of your second table is not correct. You would not (normally) include fname in two tables -- that is data redundancy and causes all sort of other problems. There is a legitimate question whether you should store a table of all activity and use that table for the display purposes, but that is not the question you are asking.
Finally, for the data volumes you are talking about, having a few extra columns would make no noticeable difference on any reasonable transaction volume. Use one table if you have one attribute per entity and no compelling reason to do otherwise.
When returning and parsing a single row, the number of columns is unlikely to make a noticeable difference. However, searching and scanning tables with smaller rows is faster than tables with larger rows.
When searching using an index, MySQL utilizes a binary search so it would require significantly larger rows (and many rows) before any speed penalty is noticeable.
Scanning is a different matter. When scanning, it's reading through all of the data for all of the rows, so there's a 1-to-1 performance penalty for larger rows. Yet, with proper indexes, you shouldn't be doing much scanning.
However, in this case, keep the date together with the user info because they'll be queried together and there's a 1-to-1 relationship, and a table with larger rows is still going to be faster than a join.
Only denormalize for optimization when performance becomes an actual problem and you can't resolve it any other way (adding an index, improving hardware, etc.).

Save data as array or in multiple rows? Which way is more efficient?

Currently I have lots of rows in mysql db
venue_id
venue_name
venue_location
venue_geolocation
venue_type
venue_url
venue_manager
venue_phone
venue_logo
venue_company
venue_zip
venue_vat
venue_visible
Would it be more efficient to store most of the data in one array and in one row like venue_data. Then it would leave only 3 rows venue_id, venue_data, venue_visible. Then in my application I could explode that array. Would it save time, server load?
Storing the values as array (concatenating different values into a string?) is definitely a bad idea because:
You will loose the readability,
you won't be able to easily search on concatenated columns,
you cannot index these columns properly.
Furthermore it does not have an impact to the performance - see also Is there a performance decrease if there are too many columns in a table?
If you are unhappy with the many columns, you should consider normalizing (DB Normalization) your db schema.
You must ask yourself whether the amount of time and space you 'might' save is worth the cost.
Consider:
Combining columns into one will still have a comparable length as all of them separately
More space could potentially be saved by using appropriately sized data types
Disk space is cheap
Having distinct columns gives you the power to query any of those columns
Distinct columns also allows you to easily add or remove columns at a later date without having to re-construct every row's combined column
Distinct columns you can use $result->fetch_assoc() to immediately get your result row in an array, vs. spending processing time parsing a complex string
Parsing such a string may be prone to errors that selecting specific columns is not
You can add foreign key constraints and indexes on individual columns which would not work if you combined them
You can easily search on distinct columns, but not if you combine them
I can think of plenty more reasons why distinct columns are a better choice than trying to optimize code in a way that likely will not even save you any time. The query may be a few milliseconds faster, but you lost that time processing the string.

Does this de-normalization make sense?

I have 2 tables which I join very often. To simplify this, the join gives back a range of IDs that I use in another (complex) query as part of an IN.
So I do this join all the time to get back specific IDs.
To be clear, the query is not horribly slow. It takes around 2 mins. But since I call this query over a web page, the delay is noticeable.
As a concrete example let's say that the tables I am joining is a Supplier table and a table that contains the warehouses the supplier equipped specific dates. Essentially I get the IDs of suppliers that serviced specific warehouses at specific dates.
The query it self can not be improved since it is a simple join between 2 tables that are indexed but since there is a date range this complicates things.
I had the following idea which, I am not sure if it makes sense.
Since the data I am querying (especially for previous dates) do not change, what if I created another table that has as primary key, the columns in my where and as a value the list of IDs (comma separated).
This way it is a simple SELECT of 1 row.
I.e. this way I "pre-store" the supplier ids I need.
I understand that this is not even 1st normal formal but does it make sense? Is there another approach?
It makes sense as a denormalized design to speed up that specific type of query you have.
Though if your date range changes, couldn't it result in a different set of id's?
The other approach would be to really treat the denormalized entries like entries in a key/value cache like memcached or redis. Store the real data in normalized tables, and periodically update the cached, denormalized form.
Re your comments:
Yes, generally storing a list of id's in a string is against relational database design. See my answer to Is storing a delimited list in a database column really that bad?
But on the other hand, denormalization is justified in certain cases, for example as an optimization for a query you run frequently.
Just be aware of the downsides of denormalization: risk of data integrity failure, poor performance for other queries, limiting the ability to update data easily, etc.
In the absence of knowing a lot more about your application it's impossible to say whether this is the right approach - but to collect and consider that volume of information goes way beyond the scope of a question here.
Essentially I get the IDs of suppliers that serviced specific warehouses at specific dates.
While it's far from clear why you actually need 2 tables here, nor if denormalizing the data woul make the resulting query faster, one thing of note here is that your data is unlikely to change after capture, hence maintaining the current structure along with a materialized view would have minimal overhead. You first need to test the query performance by putting the sub-query results into a properly indexed table. If you get a significant performance benefit, then you need to think about how you maintain the new table - can you substitute one of the existing tables with a view on the new table, or do you keep both your original tables and populate data into the new table by batch, or by triggers.
It's not hard to try it out and see what works - and you'll get a far beter answer than anyone here can give you.

sql query LIKE % on Index

I am using a mysql database.
My website is cut in different elements (PRJ_12 for projet 12, TSK_14 for task 14, DOC_18 for document 18, etc). We currently store the references to these elements in our database as VARCHAR. The relation columns are Indexed so it is faster to select.
We are thinking of currint these columns in 2 columns (on column "element_type" with PRJ and one "element_id" with 12). We are thinking on this solution as we do a lot of requests containing LIKE ...% (for example retrieve all tasks of one user, no matter the id of the task).
However, splitting these columns in 2 will increase the number of Indexed columns.
So, I have two questions :
Is a LIKE ...% request in an Indexed column realy more slow than a a simple where query (without like). I know that if the column is not indexed, it is not advisable to do where ... LIKE % requests but I don't realy know how Index work).
The fact that we split the reference columns in two will double the number of Indexed table. Is that a problem?
Thanks,
1) A like is always more costly than a full comparison (with = ), however it all comes down to the field data types and the number of records (unless we're talking of a huge table you shouldn't have issues)
2) Multicolumn indexes are not a problem, yes it makes the index bigger, but so what? Data types and ammount of total rows matter, but thats what indexes are for.
So go for it
There are a number of factors involved, but in general, adding one more index on a table that has only one index already is unlikely to be a big problem. Some things to consider.
If the table most mostly read-only, then it is almost certainly not a problem. If updates are rare, then the indexes won't need to be modified often meaning there will be very little extra cost (aside from the additional disk space).
If updates to existing records do not change either of those key values, then no index modification should be needed and so again there would be no additional runtime cost.
DELETES and INSERTS will need to update both indexes. So if that is the majority of the operations (and far exceeding reads), then an additional index might incur measurable performance degradation (but it might not be a lot and not noticeable from a human perspective).
The like operator as you describe the usage should be fully optimized. In other words, the clause WHERE combinedfield LIKE 'PRJ%' should perform essentially the same as WHERE element_type = 'PRJ' if there is an index existing in both situations. The more expensive situation is if you use the wild card at the beginning (e.g., LIKE '%abc%'). You can think of a LIKE search as being equivalent to looking up a word in a dictionary. The search for 'overf%' is basically the same as a search for 'overflow'. You can do a "manual" binary search in the dictionary and quickly find the first word beginning with 'overf'. Searching for '%low', though is much more expensive. You have to scan the entire dictionary in order to find all the words that end with "low".
Having two separate fields to represent two separate values is almost always better in the long run since you can construct more efficient queries, easily perform joins, etc.
So based on the given information, I would recommend splitting it into two fields and index both fields.

How does a hash table work? Is it faster than "SELECT * from .."

Let's say, I have :
Key | Indexes | Key-values
----+---------+------------
001 | 100001 | Alex
002 | 100002 | Micheal
003 | 100003 | Daniel
Lets say, we want to search 001, how to do the fast searching process using hash table?
Isn't it the same as we use the "SELECT * from .. " in mysql? I read alot, they say, the "SELECT *" searching from beginning to end, but hash table is not? Why and how?
By using hash table, are we reducing the records we are searching? How?
Can anyone demonstrate how to insert and retrieve hash table process in mysql query code? e.g.,
SELECT * from table1 where hash_value="bla" ...
Another scenario:
If the indexes are like S0001, S0002, T0001, T0002, etc. In mysql i could use:
SELECT * from table WHERE value = S*
isn't it the same and faster?
A simple hash table works by keeping the items on several lists, instead of just one. It uses a very fast and repeatable (i.e. non-random) method to choose which list to keep each item on. So when it is time to find the item again, it repeats that method to discover which list to look in, and then does a normal (slow) linear search in that list.
By dividing the items up into 17 lists, the search becomes 17 times faster, which is a good improvement.
Although of course this is only true if the lists are roughly the same length, so it is important to choose a good method of distributing the items between the lists.
In your example table, the first column is the key, the thing we need to find the item. And lets suppose we will maintain 17 lists. To insert something, we perform an operation on the key called hashing. This just turns the key into a number. It doesn't return a random number, because it must always return the same number for the same key. But at the same time, the numbers must be "spread out" widely.
Then we take the resulting number and use modulus to shrink it down to the size of our list:
Hash(key) % 17
This all happens extremely fast. Our lists are in an array, so:
_lists[Hash(key % 17)].Add(record);
And then later, to find the item using that key:
Record found = _lists[Hash(key % 17)].Find(key);
Note that each list can just be any container type, or a linked list class that you write by hand. When we execute a Find in that list, it works the slow way (examine the key of each record).
Do not worry about what MySQL is doing internally to locate records quickly. The job of a database is to do that sort of thing for you. Just run a SELECT [columns] FROM table WHERE [condition]; query and let the database generate a query plan for you. Note that you don't want to use SELECT *, since if you ever add a column to the table that will break all your old queries that relied on there being a certain number of columns in a certain order.
If you really want to know what's going on under the hood (it's good to know, but do not implement it yourself: that is the purpose of a database!), you need to know what indexes are and how they work. If a table has no index on the columns involved in the WHERE clause, then, as you say, the database will have to search through every row in the table to find the ones matching your condition. But if there is an index, the database will search the index to find the exact location of the rows you want, and jump directly to them. Indexes are usually implemented as B+-trees, a type of search tree that uses very few comparisons to locate a specific element. Searching a B-tree for a specific key is very fast. MySQL is also capable of using hash indexes, but these tend to be slower for database uses. Hash indexes usually only perform well on long keys (character strings especially), since they reduce the size of the key to a fixed hash size. For data types like integers and real numbers, which have a well-defined ordering and fixed length, the easy searchability of a B-tree usually provides better performance.
You might like to look at the chapters in the MySQL manual and PostgreSQL manual on indexing.
http://en.wikipedia.org/wiki/Hash_table
Hash tables may be used as in-memory data structures. Hash tables may also be adopted for use with persistent data structures; database indices sometimes use disk-based data structures based on hash tables, although balanced trees are more popular.
I guess you could use a hash function to get the ID you want to select from. Like
SELECT * FROM table WHERE value = hash_fn(whatever_input_you_build_your_hash_value_from)
Then you don't need to know the id of the row you want to select and can do an exact query. Since you know that the row will always have the same id because of the input you build the hash value form and you can always recreate this id through the hash function.
However this isn't always true depending on the size of the table and the maximum number of hashvalues (you often have "X mod hash-table-size" somewhere in your hash). To take care of this you should have a deterministic strategy you use each time you get two values with the same id. You should check Wikipedia for more info on this strategy, its called collision handling and should be mentioned in the same article as hash-tables.
MySQL probably uses hashtables somewhere because of the O(1) feature norheim.se (up) mentioned.
Hash tables are great for locating entries at O(1) cost where the key (that is used for hashing) is already known. They are in widespread use both in collection libraries and in database engines. You should be able to find plenty of information about them on the internet. Why don't you start with Wikipedia or just do a Google search?
I don't know the details of mysql. If there is a structure in there called "hash table", that would probably be a kind of table that uses hashing for locating the keys. I'm sure someone else will tell you about that. =)
EDIT: (in response to comment)
Ok. I'll try to make a grossly simplified explanation: A hash table is a table where the entries are located based on a function of the key. For instance, say that you want to store info about a set of persons. If you store it in a plain unsorted array, you would need to iterate over the elements in sequence in order to find the entry you are looking for. On average, this will need N/2 comparisons.
If, instead, you put all entries at indexes based on the first character of the persons first name. (A=0, B=1, C=2 etc), you will immediately be able to find the correct entry as long as you know the first name. This is the basic idea. You probably realize that some special handling (rehashing, or allowing lists of entries) is required in order to support multiple entries having the same first letter. If you have a well-dimensioned hash table, you should be able to get straight to the item you are searching for. This means approx one comparison, with the disclaimer of the special handling I just mentioned.