Why do databases limit row/value size? - mysql

I have been reading about databases, and it looks like most of them impose a limit on the size of a value (key-value/document stores) or the size of a row (relational databases). I understand the limitation on the size of the key/primary key: a small key increases the branching factor of the B-tree, so each B-tree node can be fetched in a single read of a file-system block. For values, I assumed the key stores just a pointer to the file containing the value, which would allow values to be arbitrarily large. Is that pointer scheme used only for text/blob-type data, with all other values stored inside the B-tree node itself? Storing values inside the node saves only a single IO (following the pointer to start reading the value's file), so the size restriction seems like a lot to give up for that trade-off.
References:
Limit on mysql: https://dev.mysql.com/doc/mysql-reslimits-excerpt/5.7/en/column-count-limit.html
Limit on dynamodb: Maximum size of DynamoDB item

Cursor-based result-set traversal is one factor I would suggest: DB clients never fetch half a row at a time, so if there were no limit on row size, the client-side library would have to be prepared to handle arbitrarily long binary streams, which obviously makes it much harder to design a wire protocol for client/server communication that is both efficient and correct.
But I don't think that's the whole story; many other concerns likely come into play as well.

Related

What's the best way to read in an entire LOB using ODBC?

Reading in an entire LOB whose size you don't know beforehand (without a max allocation + copy) should be a fairly common problem, but finding good documentation and/or examples on the "right" way to do this has proved utterly maddening for me.
I wrestled with SQLBindCol but couldn't see any good way to make it work. SQLDescribeCol and SQLColAttribute return column metadata that seemed to be a default or an upper bound on the column size and not the current LOB's actual size. In the end, I settled on using the following:
1) Put any / all LOB columns as the highest numbered columns in your SELECT statement
2) SQLPrepare the statement
3) SQLBindCol any earlier non-LOB columns that you want
4) SQLExecute the statement
5) SQLFetch a result row
6) SQLGetData on your LOB column with a buffer of size 0 just to query its actual size
7) Allocate a buffer just big enough to hold your LOB
8) SQLGetData again on your LOB column with your correctly sized allocated buffer this time
9) Repeat Steps 6-8 for each later LOB column
10) Repeat Steps 5-9 for any more rows in your result set
11) SQLCloseCursor when you are done with your result set
This seems to work for me, but also seems rather involved.
Are the calls to SQLGetData going back to the server or just processing the results already sent to the client?
Are there any gotchas where the server and/or client will refuse to process very large objects this way (e.g. - some size threshold is exceeded so they generate an error instead)?
Most importantly, is there a better way to do this?
Thanks!
I see several improvements that could be made.
If you need to allocate a buffer, then you should do it once for all the records and columns. So you could use the technique suggested by @RickJames, improved with a MAX, like this:
SELECT MAX(LENGTH(blob1)) AS max1, MAX(LENGTH(blob2)) AS max2, ...
You could use max1 and max2 to allocate the buffers up front, or perhaps allocate just the largest one and reuse it for all columns.
The buffer length returned by that query might still be too large for your application, in which case you can decide at runtime how large a buffer to use. In any case, SQLGetData is designed to be called multiple times for each column: calling it again with the same column number fetches the next chunk. The count of available bytes is written to the location StrLen_or_IndPtr (the last argument) points to, and that count decreases after each call by the number of bytes fetched.
And there will certainly be round trips to the server for each call, because the whole point of this is to keep the driver from fetching more than the application can handle.
The trick of passing NULL as the buffer pointer in order to get the length is not allowed in this case; see the SQLGetData page in Microsoft's docs.
However, you could allocate a minimal buffer, say 8 bytes, and pass it along with its length. The function returns the count of bytes written (7 in our case, because it appends a null terminator) and stores the count of remaining bytes at the location StrLen_or_IndPtr points to. But you probably won't need this if you allocate the buffer as explained above.
Note: The LOBs need to be at the end of the select list and must be fetched in that order precisely.
SQLGetData
SQLGetData retrieves data from the row that has already been fetched. For example, if you have called SQLFetch for the first row of your table, SQLGetData gives you back data from that first row. It is used when you don't know whether you can SQLBindCol the result.
But how this is handled depends on your driver and is not described in the standard. In a SQL database the cursor cannot go backward, so the fetched result may still be held in memory.
Large object query
The server may refuse to process a large object, depending on the server's own limits and on your ODBC driver; this behavior is not described in the ODBC standard.
To avoid a maximum-size allocation and an extra copy, and to be efficient:
Getting the size first is not a bad approach -- it takes virtually no extra time to do
SELECT LENGTH(your_blob) FROM ...
Then do the allocation and actually fetch the blob.
If there are multiple BLOB columns, grab all the lengths in a single pass:
SELECT LENGTH(blob1), LENGTH(blob2), ... FROM ...
In MySQL, the length of a BLOB or TEXT is stored right in front of the data bytes, so it is readily available. But even if the server must read the column to get the length, think of that as merely priming the cache. That is, the overall time is not hurt much in either case.

How would I use DynamoDB to move this usage from my mysql db to nosql?

I'm currently experiencing issues with a service I've developed that relies heavily on large payload reads from the db (500 rows). I'm seeing huge throughput, in the range of 35,000+ requests per minute at up to 500 rows per request going through the DB, and it is not handling the scaling at all.
The data in question is retrieved primarily by a latitude/longitude WHERE clause that checks whether the latitude and longitude of the row fall between a minimum latitude/longitude coordinate and a maximum latitude/longitude coordinate. This is effectively checking whether the row in question is within the bounding box created by the min/max passed into the WHERE.
This is the where portion of the query we rely on for reference.
s.latitude > {minimumLatitude} AND
s.longitude > {minimumLongitude} AND
s.latitude < {maximumLatitude} AND
s.longitude < {maximumLongitude}
So, with that said: MySQL is handling this fine, but I'm presently on RDS and having to rely heavily on an r3.8XL master and 3 r3.8XL read replicas just to get the throughput capacity I need to keep the application from slowing down and pushing the CPU to 100% usage.
Obviously, with how heavy the payload is and how frequently it's queried, this data needs to be moved to a more fitting service, something like ElastiCache or DynamoDB.
I've been leaning towards DynamoDB, but my only option there seems to be using SCAN, as there is no useful primary key I can put on my data to reduce the result set; everything relies on calculating whether the latitude/longitude of a point is within a bounding box. DynamoDB's filters on attributes would work great, as they support the basic conditions needed, but on a table that would hold 250,000+ rows and grow by nearly 200,000 a day or more, that would be unusably expensive.
Another option to reduce the result set was to use a map-binning technique to associate a map region with the data, use that as the primary key in Dynamo, and then further filter down on the latitude/longitude attributes. This wouldn't be ideal though; we'd prefer to get data within specific bounds and not have excess redundant data passed back, since the min/max lat/lng can overlap multiple bins and would then pull in data from bins most of which may not be needed.
At this point I'm continuously having to deploy read replicas to keep the service up and it's definitely not ideal. Any help would be greatly appreciated.
You seem to be overlooking what would be the obvious first thing to try... indexing the data using an index structure suited to the nature of the data... in MySQL.
B-trees are of limited help since you still have to examine all possible matches in one dimension after eliminating impossible matches in the other.
Aside: Assuming you already have an index on (lat,long), you will probably be able to gain some short-term performance improvement by adding a second index with the columns reversed (long,lat). Try this on one of your replicas¹ and see if it helps. If you have no indexes at all, then of course that is your first problem.
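For example, a minimal sketch (the table name stores is hypothetical; substitute your own table, which the question aliases as s):

-- assumes an existing index on (latitude, longitude); the reversed
-- composite index gives the optimizer a second option to choose from
ALTER TABLE stores ADD INDEX idx_long_lat (longitude, latitude);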
Now, the actual solution. This requires MySQL 5.7, because before 5.7 the feature works with MyISAM but not with InnoDB, and RDS doesn't like it at all if you try to use MyISAM.
This is effective checking if the row in question is within the bounding box created by the min / max passed into the where.
What you need is an R-Tree index. These indexes actually store the points (or lines, polygons, etc.) in an order that understands and preserves their proximity in more than one dimension... proximate points are closer in the index and minimum bounding rectangles ("bounding box") are easily and quickly identified.
The MySQL spatial extensions support this type of index.
There's even an MBRContains() function that compares the points in the index to the points in the query, using the R-Tree to find all the points contained in the MBR you're searching. Unlike the usual optimization rule that you should not use column names as function arguments in the WHERE clause (to avoid triggering a table scan), this function is an exception: the optimizer does not actually evaluate the function against every row but uses the meaning of the expression to evaluate it against the index.
There's a bit of a learning curve needed in order to understand the design of the spatial extensions but once you understand the principles, it falls into place nicely and the performance will exceed your expectations. You'll want a single column of type GEOMETRY and you'll want to store lat and long together in that one indexed column as a POINT.
To safely test this without disruption, make a replica, then detach it from your master, promoting it to become its own independent master, and upgrade it to 5.7 if necessary. Create a new table with the same structure plus a GEOMETRY column and a SPATIAL KEY, then populate it with INSERT ... SELECT.
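A minimal sketch of what that could look like in MySQL 5.7 (the table name, column names, and bounding-box coordinates are all placeholders, not taken from the question):

-- new table: same data plus an indexed POINT column
CREATE TABLE stores_spatial (
  id        BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  latitude  DECIMAL(10,7)   NOT NULL,
  longitude DECIMAL(10,7)   NOT NULL,
  pt        POINT           NOT NULL,   -- stored as POINT(longitude, latitude)
  SPATIAL KEY idx_pt (pt)
) ENGINE=InnoDB;

-- populate from the existing table
INSERT INTO stores_spatial (id, latitude, longitude, pt)
  SELECT id, latitude, longitude, POINT(longitude, latitude)
  FROM stores;

-- bounding-box query: the polygon is the rectangle built from the
-- min/max corners (placeholder coordinates shown), closed on itself
SELECT id, latitude, longitude
FROM stores_spatial
WHERE MBRContains(
        ST_GeomFromText('POLYGON((-74.1 40.6, -73.7 40.6, -73.7 40.9, -74.1 40.9, -74.1 40.6))'),
        pt);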
Note that DynamoDB scan is a very "expensive" operation. On a table I was testing against just yesterday, a single scan consistently cost 112 read units each time it was run, regardless of the number of records, presumably because a scan always reads 1MB of data, which is 256 blocks of 4K (definition of a read unit) but not with strong consistency (so, half the cost). 1 MB ÷ 4KB ÷ 2 = 128 which I assume is close enough to 112 that this explains that number.
¹ It's a valid, supported operation to add an index to a MySQL replica but not the master, even in RDS. You need to temporarily make the replica writable by creating a new parameter group identical to the existing one, and then flipping read_only to 0 in that group. Associate the replica with the new parameter group, then wait for the state to change from applying to in-sync, log in to the replica, and add the index. Then put the parameter group back when done.

The effect of the fields length on the querying time

I have a mysql database in which I keep information of item and also I keep description.
The thing is that the description column can hold up to 150 chars, which I think is long, and I wondered whether it slows down query time. I also wanted to know whether it's recommended to shrink the size of the int columns: if I have a price that is normally not that big, should I limit the column to SMALLINT/MEDIUMINT?
The columns are something like this:
id name category publisher mail price description
Thanks in advance.
Store your character data as varchar() and not as char(), and read up on these data types in the MySQL documentation. A varchar stores only the characters actually in the description, plus a few more bytes of overhead.
As for whether longer fields imply worse-performing queries: that is a complicated subject. Obviously, at the extreme, maximum-size records are going to slow things down compared to a 10-byte record. The reason has to do with I/O performance. MySQL reads in pages, and a page can contain one or more records; the records on the page are then processed.
The more records that fit on the page, the fewer the I/Os.
But then it gets more complicated, depending on the hardware and the storage engine. Disks, nowadays, do read-aheads as do operating systems. So, the next read of a page (if pages are not fragmented and are adjacent to each other) may be much faster than the read of the initial page. In fact, you might have the next page in memory before processing on the first page has completed. At that point, it doesn't really matter how many records are on each page.
And, 200 bytes for a record is not very big. You should worry first about getting your application working and second about getting it to meet performance goals. Along the way, make reasonable choices, such as using varchar() instead of char() and appropriately sized numerics (you might consider fixed point numeric types rather than float for monetary values).
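A sketch of what that might look like for the table in the question (the column names come from the question; the specific lengths and types are assumptions to adjust to your data):

CREATE TABLE item (
  id          INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  name        VARCHAR(100) NOT NULL,
  category    VARCHAR(50)  NOT NULL,
  publisher   VARCHAR(100) NOT NULL,
  mail        VARCHAR(255) NOT NULL,
  price       DECIMAL(8,2) NOT NULL,   -- fixed-point rather than FLOAT for money
  description VARCHAR(150)             -- stores only the characters actually used
);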
It is only you who considers 150 characters long; the database most likely does not, since databases are designed to handle far more than that at once. Do not sacrifice your data for "performance". If the nature of your application requires you to store up to 150 characters of text at once, don't be afraid to do so, but do look up optimization tips.
Using proper data types, though, can help you save space. For instance, if you have a field which is meant to store values 0 to 20, there's no need for an INT field type. A TINYINT will do.
The documentation lists the data types and provides information on how much space they use and how they're managed.

What is the algorithm for query search in the database?

Good day everyone. I'm currently doing research on search algorithm optimization.
At the moment, I'm looking into how databases handle searches.
In a database with SQL support, I can write a query against a specific table:
SELECT Number FROM Table1 WHERE Name = 'Test';
SELECT * FROM Table1 WHERE Name = 'Test';
Query 1 returns the Number column from Table1 for rows where Name is Test, and query 2 returns all columns for those rows.
I understand what the queries do; what I'm interested in learning is how the search itself is carried out.
Is it just a plain linear search, scanning from the first row to the nth row and collecting whatever satisfies the condition, giving O(n) time, or is there a specialized algorithm that speeds up the process?
If there are no indexes, then yes, a linear search is performed.
But databases typically use a B-tree index when you specify a column (or columns) as a key. These are special data structures specifically tuned (high B-tree branching factors) to perform well on magnetic disk hardware, where the most significant time-consuming factor is the seek operation (the magnetic head has to move to a different part of the file).
You can think of the index as a sorted/structured copy of the values in a column. It can quickly be determined whether the value being searched for is in the index. If it is found, the index also yields a pointer back to the location of the corresponding row in the main data file (so the engine can go and read the other columns of the row). Sometimes a multi-column index contains all the data requested by the query; then it doesn't need to skip back to the main file at all, it can just read what it found and it's done.
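For instance, in MySQL, using the table and columns from the question (the index names are made up):

-- a single-column index: lookups on Name no longer need a full table scan
CREATE INDEX idx_name ON Table1 (Name);

-- a multi-column index that "covers" the first query: it can be answered
-- from the index alone, without going back to the main data file
CREATE INDEX idx_name_number ON Table1 (Name, Number);
SELECT Number FROM Table1 WHERE Name = 'Test';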
There's other types of indexes, but I think you get the idea - duplicate data and arrange it in a way that's fast to search.
On a large database, indexes make the difference between waiting a fraction of a second, vs possibly days for a complex query to complete.
By the way, B-trees aren't a simple, easy-to-understand data structure, and the traversal algorithm is also complex. In addition, real implementations are even uglier than most of the code you will find, because in a database chunks of data are constantly being loaded from and unloaded to disk and managed in memory, which significantly complicates the code. But if you're familiar with binary search trees, then I think you understand the concept well enough.
Well, it depends on how the data is stored and what you are trying to do.
As already indicated, a common structure for maintaining entries is a B+ tree. The tree is well optimized for disk since the actual data is stored only in leaves - and the keys are stored in the internal nodes. It usually allows a very small number of disk accesses since the top k levels of the tree can be stored in RAM, and only the few bottom levels will be stored on disk and require a disk read for each.
Another alternative is a hash table. You maintain in memory (RAM) an array of "pointers"; each pointer indicates a disk address containing a bucket that holds all entries with the corresponding hash value. Using this method you need only O(1) disk accesses (which are usually the bottleneck when dealing with databases), so it should be relatively fast.
However, a hash table does not allow efficient range queries (which can be efficiently done in a B+ tree).
The disadvantage of all of the above is that it requires a single key - i.e. if the hash table or B+ tree is built according to the field "id" of the relation, and then you search according to "key" - it becomes useless.
If you want to guarantee fast search for all fields of the relation - you are going to need several structures, each according to a different key - which is not very memory efficient.
Now, there are many optimizations to be considered according to the specific usage. If, for example, the number of searches is expected to be very small (say, smaller than log log N of the total operations), maintaining a B+ tree is overall less efficient than just storing the elements as a list and, on the rare occasion of a search, doing a linear scan.
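As a small MySQL-flavored illustration of the trade-off (the table and column names are made up; MySQL's MEMORY engine supports hash indexes, while InnoDB indexes are B-trees):

-- hash index: fast equality lookups, no efficient range scans
CREATE TABLE lookup_hash (
  id  INT NOT NULL,
  val VARCHAR(50),
  KEY idx_id (id) USING HASH
) ENGINE=MEMORY;

-- B-tree index: supports equality and range queries
CREATE TABLE lookup_btree (
  id  INT NOT NULL,
  val VARCHAR(50),
  KEY idx_id (id) USING BTREE
) ENGINE=InnoDB;

SELECT val FROM lookup_hash  WHERE id = 42;                -- fine with either index
SELECT val FROM lookup_btree WHERE id BETWEEN 10 AND 20;   -- efficient only with the B-tree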
Very good question, but it can have many answers depending on the structure of your table and how it is normalized...
Usually, to perform the search in a SELECT query, the DBMS sorts the table (it uses merge sort rather than quicksort, because merge sort is better suited to disk I/O); then, depending on the indexes (if the table has any), it simply matches the values. If the structure is more complex, the DBMS can perform a search in a tree, but that goes deeper than I can recall offhand; I'd have to go back to my notes.
I recommend enabling the query execution plan; there are examples of how to do so in SQL Server 2008. Then execute your SELECT statement with the WHERE clause and you will be able to start understanding what is going on inside the DBMS.
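If you are on MySQL (which the question's examples suggest), the equivalent tool is EXPLAIN. A minimal sketch using the query from the question (the index name is made up):

EXPLAIN SELECT Number FROM Table1 WHERE Name = 'Test';
-- with no index on Name, the plan reports type: ALL (a full table scan)

CREATE INDEX idx_name ON Table1 (Name);

EXPLAIN SELECT Number FROM Table1 WHERE Name = 'Test';
-- now the plan reports type: ref and key: idx_name (an index lookup)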

To BLOB or not to BLOB

I am in the process of writing a web app backed by a MySQL database, where one of the tables has the potential to get very large (on the order of gigabytes), with a significant proportion of table operations being writes. One of the table columns needs to store a string sequence that can be quite big. In my tests thus far it has reached a size of 289 bytes, but to be on the safe side I want to design for a maximum size of 1 KB. Currently I am storing that column as a MySQL MEDIUMBLOB field in an InnoDB table.
At the same time I have been googling to establish the relative merits and demerits of BLOBs vs. other forms of storage. There is a plethora of information out there, perhaps too much. What I have gathered is that InnoDB stores the first few bytes (768, if memory serves me right) of the BLOB in the table row itself and the rest elsewhere. I have also gotten the notion that if a row has more than one BLOB column (which my table does not), the "elsewhere" is a different location for each BLOB. Apart from that, I have the impression that accessing BLOB data is significantly slower than accessing row data (which sounds reasonable).
My question is just this: in light of my BLOB size and the large potential size of the table, should I bother with a BLOB at all? Also, if I use some form of in-row storage instead, won't that have an adverse effect on the maximum number of rows that the table will be able to accommodate?
MySQL is neat and lets me get away with pretty much everything in my development environment. But... that ain't the real world.
I'm sure you've already looked here but it's easy to overlook some of the details since there is a lot to keep in mind when it comes to InnoDB limitations.
The easy answer to one of your questions (maximum size of a table) is 64TBytes. Using variable size types to move that storage into a separate file would certainly change the upper limit on number of rows but 64TBytes is quite a lot of space so the ratio might be very small.
Having a column with a 1KByte string type that is stored inside the table seems like a viable solution since it's also very small compared to 64TBytes. Especially if you have very strict requirements for query speed.
Also, keep in mind that the InnoDB 64TByte limit might be pushed down by the maximum file size of the OS you're using. You can always link several files together to get more space for your table, but then it starts to get a bit messier.
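As a sketch of the two options being weighed (table and column names are made up; MEDIUMBLOB and the 1 KB ceiling come from the question):

-- option 1: current design; values longer than the in-row prefix go off-page
CREATE TABLE seq_blob (
  id  BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  seq MEDIUMBLOB NOT NULL
) ENGINE=InnoDB;

-- option 2: a bounded in-row type, since the value is known to stay under 1 KB
CREATE TABLE seq_inrow (
  id  BIGINT UNSIGNED NOT NULL PRIMARY KEY,
  seq VARBINARY(1024) NOT NULL
) ENGINE=InnoDB;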
If the BLOB data is more than 250 KB, it is not worth it. In your case I wouldn't bother with BLOBs.