I have a column of data I would like to add to a mysql database table. The column is raw text and the longest piece of text contains approximately 300,000 characters. Is it possible to store this in the table? How?
I have been reading that even LONGTEXT columns are limited somewhat.
Presumably you have ruled out the alternative of storing these items of text in files, and storing their pathnames in your table. If you have not considered that choice, please do. It's often the most practical way to handle this sort of application. That's especially true if you're using a web server to deliver your information to your users: by putting those objects in your file system you avoid a very serious production bottleneck (fetching the objects from the DBMS and then sending them to the user).
MySQL's LOBs (large objects) will take 300k characters without problems. MEDIUMTEXT handles 16 megabytes. But the programming work necessary to load those objects into the DBMS and get them out again can be a bit challenging. You haven't mentioned your application stack, so it's hard to give you specific advice about that. Where to start? read about the MySQL server parameter max_allowed_packet.
If this were my project, and for some reason using the file system was out of the question, I would store the large textual articles as segments in shorter rows. For example, instead of
textid textval
(int) (MEDIUMTEXT)
number lots and lots and lots of text.
I'd make a table like this:
textid segmentid textval
(int) (int) (VARCHAR(250))
number 1 Lots and
number 2 lots and
number 3 lots of
number 4 text.
The segment lengths should probably be around 250 characters each in length. I think you'd be smart to break the segments on word boundaries if you can; it will make stuff like FULLTEXT search easier. This will end up with many shorter rows for your big text items, but that will make your programming, your backups, and everything else about your system. easier to handle all around.
There is an upfront cost, but it's probably worth it.
Related
I have a mysql database in which I keep information of item and also I keep description.
The thing is that the description column can hold up to 150 chars which I think is long and I wondered if it slows the querying time. Also I wanted to know if its recommended to shorten the size of the int I mean if I have a price which is normally not that big should I limit the column to small/medium int?
The columns are something like this:
id name category publisher mail price description
Thanks in advance.
Store your character data as varchar() and not as char() and read up on the MySQL documentation on these data types (here). This only stores the characters actually in the description, plus a few more bytes of overhead.
As for whether or not the longer fields imply worse-performing queries. That is a complicated subject. Obviously, at the extreme, having the maximum size records is going to slow things down versus a 10-byte record. The reason has to do with I/O performance. MySQL reads in pages and a page can contain one or more records. The records on the page are then processed.
The more records that fit on the page, the fewer the I/Os.
But then it gets more complicated, depending on the hardware and the storage engine. Disks, nowadays, do read-aheads as do operating systems. So, the next read of a page (if pages are not fragmented and are adjacent to each other) may be much faster than the read of the initial page. In fact, you might have the next page in memory before processing on the first page has completed. At that point, it doesn't really matter how many records are on each page.
And, 200 bytes for a record is not very big. You should worry first about getting your application working and second about getting it to meet performance goals. Along the way, make reasonable choices, such as using varchar() instead of char() and appropriately sized numerics (you might consider fixed point numeric types rather than float for monetary values).
It is only you that considers 150 long - the database most likely does not, as they're designed to handle much more at once. Do not consider sacrificing your data for "performance". If the nature of your application requires you to store up to 150 characters of text at once, don't be afraid to do so, but do look up optimization tips.
Using proper data types, though, can help you save space. For instance, if you have a field which is meant to store values 0 to 20, there's no need for an INT field type. A TINYINT will do.
The documentation lists the data types and provides information on how much space they use and how they're managed.
I am in the process of writing a web app backed up by a MySQL database where one of the tables has a potential for getting very large (order of gigabytes) with a significant proportion of table operations being writes. One of the table columns needs to store a string sequence that can be quite big. In my tests thus far it has reached a size of 289 bytes but to be on the safe side I want to design for a maximum size of 1 kb. Currently I am storing that column as a MySQL MediumBlob field in an InnoDB table.
At the same time I have been googling to establish the relative merits and demerits of BLOBs vs other forms of storage. There is a plethora of information out there, perhaps too much. What I have gathered is that InnoDB stores the first few bytes (789 if memory serves me right) of the BLOB in the table row itself and the rest elsewhere. I have also got the notion that if a row has more than one BLOB (which my table does not) per column then the "elsewhere" is a different location for each BLOB. That apart I have got the impression that accessing BLOB data is significantly slower than accessing row data (which sounds reasonable).
My question is just this - in light of my BLOB size and the large potential size of the table should I bother with a blob at all? Also, if I use some form of inrow storage instead will that not have an adverse effect on the maximum number of rows that the table will be able to accommodate?
MySQL is neat and lets me get away with pretty much everything in my development environment. But... that ain't the real world.
I'm sure you've already looked here but it's easy to overlook some of the details since there is a lot to keep in mind when it comes to InnoDB limitations.
The easy answer to one of your questions (maximum size of a table) is 64TBytes. Using variable size types to move that storage into a separate file would certainly change the upper limit on number of rows but 64TBytes is quite a lot of space so the ratio might be very small.
Having a column with a 1KByte string type that is stored inside the table seems like a viable solution since it's also very small compared to 64TBytes. Especially if you have very strict requirements for query speed.
Also, keep in mind that the InnoDB 64TByte limit might be pushed down by the the maximum file size for the OS you're using. You can always link several files together to get more space for your table but then it's starting to get a bit more messy.
if the BLOB data is more then 250kb it is not worth it. In your case i wouldn't bother myself whit BLOB'n. Read this
I have table posts which contains LONGTEXT. My issue is that I want to retrieve parts of a specific post (basically paging)
I use the following query:
SELECT SUBSTRING(post_content,1000,1000) FROM posts WHERE id=x
This is somehow good, but the problem is the position and the length. Most of the time, the first word and the last word is not complete, which makes sense.
How can I retrieve complete words from position x for length y?
Presumably you're doing this for the purpose of saving on network traffic overhead between the MySQL server and the machine on which your application is running. As it happens, you're not saving any other sort of workload on the MySQL server. It has to fetch the LONGTEXT item from disk, then run it through SUBSTRING.
Presumably you've already decided based on solid performance analysis that you must save this network traffic. You might want to revisit this analysis now that you know it doesn't save much MySQL server workload. Your savings will be marginal, unless you have zillions of very long LONGTEXT items and lots of traffic to retrieve and display parts of them.
In other words, this is an optimization task. YAGNI? http://en.wikipedia.org/wiki/YAGNI
If you do need it you are going to have to create software to process the LONGTEXT item word by word. Your best bet is to do this in your client software. Start by retrieving the first page plus a k or two of the article. Then, parse the text looking for complete words. After you find the last complete word in the first page and its following whitespace, then that character position is the starting place for the next page.
This kind of task is a huge pain in the neck in a MySQL stored procedure. Plus, when you do it in a stored procedure you're going to use processing cycles on a shared and hard-to-scale-up resource (the MySQL server machine) rather than on a cloneable client machine.
I know I didn't give you clean code to just do what you ask. But it's not obviously a good idea to do what you're suggesting.
Edit:
An observation: A gigabyte of server RAM costs roughly USD20. A caching system like memcached does a good job of exploiting USD100 worth of memory efficiently. That's plenty for the use case you have described.
Another observation: many companies who serve large-scale documents use file systems rather than DBMSs to store them. File systems can be shared or replicated very easily among content servers, and files can be random-accessed trivially without any overhead.
It's a bit innovative to store whole books in single BLOBs or CLOBs. If you can break up the books by some kind of segment -- page? chapter? thousand-word chunk? -- and create separate data rows for each segment, your DBMS will scale up MUCH MUCH better than what you have described.
If you're going to do it anyway, here's what you do:
always retrieve 100 characters more than you need in each segment. For example, when you need characters 30000 - 35000, retrieve 30000 - 35100.
after you retrieve the segment, look for the first word break in the data (except on the very first segment) and display starting from that word.
similarly, find the very first word break in the 100 extra bytes, and display up to that word break.
So your fetched data might be 30000 - 35100 and your displayed data might be 30013 - 35048, but it would be whole words.
I have url table in mysql which has only two fields id and varchar(255) for url. There are currently more than 50 million urls there and my boss has just given my clue about the expansion of our current project which will result in more urls to be added in that url table and expected numbers are well around 150 million in the mid of the next year.
Currently database size is about 6GB so I can safely say that if things are left same way then it will cross 20GB which is not good. So, I am thinking of some solution which can reduce the disk space of url storage.
I also want to make it clear that this table is not a busy table and there are not too many queries at the momen so I am just looking to save disk space and more importantly I am looking to explore new ideas of short text compression and its storage in mysql
BUT in future that table can also be accessed heavily so its better to optimize the table well before the time come.
I worked quite a bit to change the url into numeric form and store using BIGINT but as it has limitations of 64 bits so it didn't work out quite well. And same is the problem with BIT data type and imposes the limit of 64 bits too.
My idea behind converting to numeric form is basically as 8byte BIGINT stores 19 digits so if each digit points a character in a character set of all possible characters then it can store 19 characters in 8 bytes if all characters are ranged from 1-10 but as in real world scenario there are 52 characters of English and 10 numeric plus few symbols so its well around 100 character set. So, in worst case scenario BIGINT can still point to 6 characters and yes its not a final verdict it still needs some workout to know exactly what each digit is point to it is 10+ digit or 30+ digit or 80+ digit but you have got pretty much the idea of what I am thinking about.
One more important thing is that as url are of variable length so I am also trying to save disk space of small urls so I don't want to give a fixed length column type.
I have also looked into some text compression algo like smaz and Huffman compression algo but not pretty much convinced because they use some sort of dictionary words but I am looking for a clean method.
And I don't want to use binary data type because it also take too many space like varchars in bytes.
Another idea to try might be to identify common strings and represent them with a bitmap. For instance, have two bits to represent the protocol (http, https, ftp or something else), another bit to indicate whether the domain starts with "wwww", two bits to indicate whether the domain ends with ".com", ".org", ".edu" or something else. You'd have to do some analysis on your data and see if these make sense, and if there are any other common strings you can identify.
If you have a lot of URLs to the same site, you could also consider splitting your table into two different ones, one holding the domain and the other containing the domain-relative path (and query string & fragment id, if present). You'd have a link table that had the id of the URL, the id of the domain and the id of the path, and you'd replace your original URL table with a view that joined the three tables. The domain table wouldn't have to be restricted to the domain, you could include as much of the URL as was common (e.g., 'http://stackoverflow.com/questions'). This wouldn't take too much code to implement, and has the advantage of still being readable. Your numeric encoding could be more efficient, once you get it figured out, you'll have to analyze your data to see which one makes more sense.
If you are looking for 128 bit integers then you can use binary(16) here 16 is bytes. And you can extend it to 64 bytes (512 bits) so it doesn't take more space than bit data type. You can say Binary data type as an expansion of BIT data type but its string variant.
Having said that I would suggest dictionary algorithms to compress URLs and short strings but with the blend of techniques used by url shortening services like using A-Z a-z 0-9 combination of three words to replace large dictionary words and you would have more combinations available than available words 62 X 62 X 62.
Though I am not sure what level of compression you would achieve but its not a bad idea to implement url compression this way.
I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems, I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order. (search_id unsigned bigint, item_id char(20), primary key (search_id, item_id). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.