Efficiently re-saving about 20kb of text to MySQL

I have a contenteditable div where users can enter up to about 20kb of text. As they're typing I want to keep autosaving that div to a mysql table so that they can later retrieve that text on their phone, tablet etc. So far I've been simply autosaving every 20 seconds by taking the whole text block and doing an update to the row in my table that holds that text.
I realize that method is really inefficient and won't work once my user base grows. Is there a better way to do this? E.g., somehow take a diff between the original text and what changed and save only that? Use node.js? If you've got any ideas, let me know. Thanks.

Not all texts will be 20kb, so this isn't necessarily a problem yet. You could also choose to autosave less often, or autosave to a session first and only write to the database at longer intervals.
You could calculate the differences between the stored version (which can be kept in the session too) and the typed version. The advantage is that you don't have to use the database so often, but processing will become harder, so the load on your webserver will increase too.
I'd maybe choose to do autosaves to a memory table (you can choose the storage engine in MySQL), and write a separate process/cron job that updates the physical table in bulk at regular intervals.
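As a rough sketch of that staged approach (assuming a Node/Express backend with the mysql2 driver; the documents table and column names are made up): the answer suggests a MySQL MEMORY table plus a cron job, while here the staging buffer simply lives in the Node process, a simpler variant of the same idea. Error handling on the flush is omitted.

// Sketch: autosaves land in an in-process buffer; a timer flushes them to the
// real table in bulk. Table/column names (documents, body) are hypothetical.
const express = require('express');
const mysql = require('mysql2/promise');

const pool = mysql.createPool({ host: 'localhost', user: 'app', password: 'secret', database: 'app' });
const pending = new Map();                      // docId -> latest text, overwritten on every autosave

const app = express();
app.use(express.json({ limit: '64kb' }));       // a 20kb document fits comfortably

app.post('/autosave/:docId', (req, res) => {
  pending.set(req.params.docId, req.body.text); // cheap: no DB work on the hot path
  res.sendStatus(204);
});

// Flush the buffer to MySQL every 30 seconds, one UPDATE per dirty document.
setInterval(async () => {
  for (const [docId, text] of pending) {
    pending.delete(docId);
    await pool.query('UPDATE documents SET body = ? WHERE id = ?', [text, docId]);
  }
}, 30000);

app.listen(3000);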

This reminds me of Google Docs; they have a pretty nice system of autosaving large text to the database. Now I don't know about the internal components of Google Docs, but that is something you might want to look into.
Google has much more bandwidth, so their method might not work for you, though it seems like they are constantly saving data to their servers.
What you could do is use JavaScript and save the data on the user's side, and only load the data into the database when the user leaves or clicks "save". This way, when they come back to the page, as long as the saved draft is there, they can get back to their file.
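A rough browser-side sketch of that idea (the editor and save-button element ids and the /save endpoint are made up): the draft lives in localStorage while the user types and only goes to the server when they save or leave.

// Sketch: keep the draft in localStorage while typing, push to the server
// only on "save" or when the page is being closed. The /save endpoint is hypothetical.
const editor = document.getElementById('editor');
const KEY = 'draft-text';

// Restore a previous draft when the page loads.
editor.innerText = localStorage.getItem(KEY) || editor.innerText;

// Every keystroke only touches localStorage - no network traffic.
editor.addEventListener('input', () => localStorage.setItem(KEY, editor.innerText));

function saveToServer() {
  // sendBeacon survives page unload better than a normal XHR/fetch.
  navigator.sendBeacon('/save', localStorage.getItem(KEY) || '');
}

document.getElementById('save-button').addEventListener('click', saveToServer);
window.addEventListener('beforeunload', saveToServer);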

One way:
Use AJAX to send the text back to the server every x seconds, or even every x characters (e.g. every 10, 20, 30 chars and so on).
Use server-side code (e.g. PHP) to do most of the work for this request (as it's fast and efficient), and hold the text in the session.
On each request the server-side code can compare against the session text from the last request, which gives you the possibility to do one of the following:
a) If the start of the new text is the same as the old session text, just append the difference to the field in the database using, say, CONCAT:
UPDATE table
SET field = CONCAT(field, 'extra data...')
WHERE id = 'xx'
and then set the session text to the new text.
b) If the start of the text is different, you know you need to do a straight UPDATE statement to change the field value to the new text (see the sketch after this answer).
Another way:
You only do a full UPDATE when necessary; otherwise just a CONCAT.
You can "hold" a certain number of requests in the session before you actually do the update, e.g. run the AJAX every 5 seconds or characters, but your server-side script only runs the UPDATE statement on every 5th call, or every additional 20 chars if the first part hasn't changed.

Related

Search autocomplete for usernames: Is it better to query DB ONCE for all users, or query multiple times on each keystroke for 'users like'?

I'm using a node/express backend, and angular on the front end. I have a feature where users can search for other users to add as a friend. My question is which method is less expensive. Let's, for the sake of argument, pretend there are 1 million user accounts.
1) When you enter the first keystroke in the user search field, a database query is done to find every user whose username begins with that letter, and the server responds with that entire JSON payload (your Angular front end does a GET request to that endpoint). Every consecutively typed letter no longer triggers a DB query, but instead filters that giant JSON array held in the browser's memory to pull up autocompleted suggestions.
or
2) Usernames must be at least 6 characters long. Once you've entered 6 characters (significantly narrowing the possible username matches), the DB query is done and the entire JSON payload of usernames matching your typed prefix is sent as the response. Every consecutive letter typed filters for autocomplete from that point on.
OR
3) Same as number 2, except that instead of sending a single giant JSON object after 6 typed characters, each new consecutive letter triggers a new, smaller DB query targeted at exactly the string typed so far (basically an 'on change' listener in Angular). E.g. Johnny returns an object with all usernames that start with Johnny, but when I type Johnny1, only usernames starting with Johnny1 are queried for and returned if they exist; then Johnny12 does a DB query for the exact string (lower-cased, of course) Johnny12, and so on.
Which of these methods is less expensive? Is it more expensive to do a SINGLE enormous DB query, send that over once, and hold it in front-end memory? Or to do many small queries, one per keystroke, each returning only a small fraction of the data, but giving far more opportunities to crash a server at the scale of 1 million users each firing a request on every keystroke?
Thanks.
Your database will be a driver for this answer. What DB are you using? If you are using a SQL database, make sure that the username field is indexed. Whatever the database, using its structure to your best advantage will drive this answer more than the interface will. A database like Elasticsearch (available as an AWS service) would give (almost) real-time responses, so you could just query on every letter.
Your tree of possible responses will certainly shrink the more letters you include in the initial search. Getting to two or three initial letters before the first query would significantly help reduce the resulting payload.
Caching those initial search values could be handy as well. A caching DB (Redis, Memcached) would be nifty, but even just storing those results on your instance would help reduce the latency of future queries coming through that instance (you should probably expire them in some fashion or the cached data will grow stale relative to the user DB, or else each instance has to listen for added/removed users to update its caches).
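As a rough illustration, a per-instance cache with a TTL might look like this (Node with the mysql2 driver is assumed, as is a users table with an indexed username column; Redis or Memcached would play the same role across instances):

// Sketch: cache the result of the first (prefix) query per instance, with a TTL
// so stale entries age out instead of drifting from the user table forever.
const mysql = require('mysql2/promise');
const pool = mysql.createPool({ host: 'localhost', user: 'app', password: 'secret', database: 'app' });

const cache = new Map();          // prefix -> { rows, expires }
const TTL_MS = 5 * 60 * 1000;     // five minutes; tune to how fast usernames change

async function usernamesForPrefix(prefix) {
  const hit = cache.get(prefix);
  if (hit && hit.expires > Date.now()) return hit.rows;

  const [rows] = await pool.query(
    'SELECT username FROM users WHERE username LIKE ? LIMIT 50',
    [prefix + '%']                // a prefix LIKE can use the index on username
  );
  cache.set(prefix, { rows, expires: Date.now() + TTL_MS });
  return rows;
}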
One last thought: add some "debouncing" on your client side, so that you do not query on every keystroke but give the user some time to finish typing. That will lighten the load on your DB, and reduce the number of queries you have to process.
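A minimal client-side debounce sketch, in plain JavaScript for illustration (the 300 ms delay, the /search endpoint, and the renderSuggestions helper are placeholders for whatever your Angular code already does):

// Sketch: only query once the user has stopped typing for 300 ms.
const input = document.getElementById('user-search');
let timer = null;

input.addEventListener('input', () => {
  clearTimeout(timer);
  timer = setTimeout(async () => {
    const q = input.value.trim();
    if (q.length < 2) return;                          // skip very short queries
    const res = await fetch('/search?q=' + encodeURIComponent(q));
    renderSuggestions(await res.json());               // placeholder for your render code
  }, 300);
});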

MySQL LONGTEXT pagination

I have a table posts which contains a LONGTEXT column. My issue is that I want to retrieve parts of a specific post (basically paging).
I use the following query:
SELECT SUBSTRING(post_content,1000,1000) FROM posts WHERE id=x
This is somewhat workable, but the problem is the position and the length: most of the time the first word and the last word are not complete, which makes sense.
How can I retrieve complete words from position x for length y?
Presumably you're doing this for the purpose of saving on network traffic overhead between the MySQL server and the machine on which your application is running. As it happens, you're not saving any other sort of workload on the MySQL server. It has to fetch the LONGTEXT item from disk, then run it through SUBSTRING.
Presumably you've already decided based on solid performance analysis that you must save this network traffic. You might want to revisit this analysis now that you know it doesn't save much MySQL server workload. Your savings will be marginal, unless you have zillions of very long LONGTEXT items and lots of traffic to retrieve and display parts of them.
In other words, this is an optimization task. YAGNI? http://en.wikipedia.org/wiki/YAGNI
If you do need it you are going to have to create software to process the LONGTEXT item word by word. Your best bet is to do this in your client software. Start by retrieving the first page plus a k or two of the article. Then, parse the text looking for complete words. After you find the last complete word in the first page and its following whitespace, then that character position is the starting place for the next page.
This kind of task is a huge pain in the neck in a MySQL stored procedure. Plus, when you do it in a stored procedure you're going to use processing cycles on a shared and hard-to-scale-up resource (the MySQL server machine) rather than on a cloneable client machine.
I know I didn't give you clean code to just do what you ask. But it's not obviously a good idea to do what you're suggesting.
Edit:
An observation: A gigabyte of server RAM costs roughly USD20. A caching system like memcached does a good job of exploiting USD100 worth of memory efficiently. That's plenty for the use case you have described.
Another observation: many companies who serve large-scale documents use file systems rather than DBMSs to store them. File systems can be shared or replicated very easily among content servers, and files can be random-accessed trivially without any overhead.
It's a bit innovative to store whole books in single BLOBs or CLOBs. If you can break up the books by some kind of segment -- page? chapter? thousand-word chunk? -- and create separate data rows for each segment, your DBMS will scale up MUCH MUCH better than what you have described.
If you're going to do it anyway, here's what you do:
always retrieve 100 characters more than you need in each segment. For example, when you need characters 30000 - 35000, retrieve 30000 - 35100.
after you retrieve the segment, look for the first word break in the data (except on the very first segment) and display starting from that word.
similarly, find the very first word break in the 100 extra bytes, and display up to that word break.
So your fetched data might be 30000 - 35100 and your displayed data might be 30013 - 35048, but it would be whole words.
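A rough sketch of those steps in Node with the mysql2 driver (the posts table comes from the question; the 100-character over-fetch and whitespace-based word breaks follow the description above, but the helper itself is just an illustration):

// Sketch: fetch the requested slice plus 100 extra chars, then trim both ends
// to whole words. Positions are 1-based, as SUBSTRING in MySQL expects.
const mysql = require('mysql2/promise');
const pool = mysql.createPool({ host: 'localhost', user: 'app', password: 'secret', database: 'app' });

async function fetchPage(postId, start, length) {
  const [rows] = await pool.query(
    'SELECT SUBSTRING(post_content, ?, ?) AS chunk FROM posts WHERE id = ?',
    [start, length + 100, postId]               // step 1: over-fetch by 100 chars
  );
  let chunk = rows[0].chunk;

  // Step 2: unless this is the very first segment, drop the leading partial word.
  if (start > 1) {
    const firstBreak = chunk.search(/\s/);
    if (firstBreak !== -1) chunk = chunk.slice(firstBreak + 1);
  }

  // Step 3: cut at the last word break within the requested length.
  if (chunk.length > length) {
    const lastBreak = chunk.lastIndexOf(' ', length);
    if (lastBreak !== -1) chunk = chunk.slice(0, lastBreak);
  }
  return chunk;
}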

Storing large, session-level datasets?

I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems; I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is a way of showing bytes in human readable form. If you store them properly, you're at 9.5MB instead of 22.
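For illustration, converting the hex form to raw bytes in Node and storing it in a BINARY(20) column might look like the sketch below; the saved_items table name is made up, and the mysql2 driver is assumed.

// Sketch: store the 20-byte digest, not the hex string.
// Assumes item_id is declared BINARY(20) and the saved_items table exists.
const mysql = require('mysql2/promise');

async function demo() {
  const pool = mysql.createPool({ host: 'localhost', user: 'app', password: 'secret', database: 'app' });

  const hexId = 'da39a3ee5e6b4b0d3255bfef95601890afd80709'; // 40 hex chars = 20 bytes
  const rawId = Buffer.from(hexId, 'hex');                   // Buffer of length 20

  await pool.query('INSERT INTO saved_items (search_id, item_id) VALUES (?, ?)', [42, rawId]);

  // Reading it back: HEX() restores the familiar text form.
  const [rows] = await pool.query('SELECT HEX(item_id) AS id FROM saved_items WHERE search_id = ?', [42]);
  console.log(rows);
}

demo();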
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order (search_id BIGINT UNSIGNED, item_id BINARY(20), PRIMARY KEY (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing great amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
If your search items have an incrementing primary id such that any new insertion to the database will have a higher value than anything that is already in the database, that is the most efficient. Alternately, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use zero space by always setting a maximum id on the saved query.
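A sketch of that first option (the table definition follows the answer's parenthetical; the bulk insert, the sorting step, and the mysql2 driver are assumptions):

// Sketch: one rolling table for all saved searches; keys are inserted in sorted
// order so each search stays clustered under its search_id.
const mysql = require('mysql2/promise');

async function saveSearch(pool, searchId, hexIds) {
  // Sort the raw 20-byte ids so the composite primary key is filled sequentially.
  const sorted = hexIds.map(h => Buffer.from(h, 'hex')).sort(Buffer.compare);
  const values = sorted.map(id => [searchId, id]);
  await pool.query('INSERT INTO saved_searches (search_id, item_id) VALUES ?', [values]);
}

/* Table definition as described above:
   CREATE TABLE saved_searches (
     search_id BIGINT UNSIGNED NOT NULL,
     item_id   BINARY(20)      NOT NULL,
     PRIMARY KEY (search_id, item_id)
   ) ENGINE=InnoDB;
   Expiring a dataset is then a single range delete:
   DELETE FROM saved_searches WHERE search_id = ?;
*/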

implementing a blacklist of usernames

I have a block list of names/words with about 500,000+ entries. The use of the data is to prevent people from entering these words as their username or name. The table structure is simple: word_id, word, create_date.
When the user clicks submit, I want the system to look up whether the entered name is an exact match or a word% match.
Is this the only way to implement a block or is there a better way? I don't like the idea of doing lookups of this many rows on a submit as it slows down the submit process.
Consider a few points:
Keep your blacklist (business logic) checking in your application, and perform the comparison in your application. That's where it most belongs, and you'll likely have richer programming languages to implement that logic.
Load your half million records into your application, and store them in a cache of some kind. On each signup, perform your check against the cache (see the sketch after these points). This will avoid hitting your table on each signup. It'll all be in memory in your application, and will be much more performant.
Ensure myEnteredUserName doesn't have a blacklisted word at the beginning, end, and anywhere in between. Your question specifically had a begins-with check, but ensure that you don't miss out on 123_BadWord999.
Caching brings its own set of new challenges; consider reloading from the database every n minutes, or at a certain time or event. This will allow new blacklisted words to be loaded, and old ones to be thrown out.
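A rough sketch of that in-application check (assumptions: a Node backend, a table named blacklist matching the question's word_id/word/create_date structure, and a periodic reload rather than a one-off load):

// Sketch: load the half-million words into a Set once (and on a timer),
// then check signups entirely in memory: exact match plus contains-a-bad-word.
const mysql = require('mysql2/promise');
const pool = mysql.createPool({ host: 'localhost', user: 'app', password: 'secret', database: 'app' });

let blacklist = new Set();
let minLen = Infinity, maxLen = 0;

async function reloadBlacklist() {
  const [rows] = await pool.query('SELECT word FROM blacklist');
  const next = new Set();
  for (const { word } of rows) {
    const w = word.toLowerCase();
    next.add(w);
    minLen = Math.min(minLen, w.length);
    maxLen = Math.max(maxLen, w.length);
  }
  blacklist = next;                       // swap in the fresh set atomically
}

function isBlocked(username) {
  const name = username.toLowerCase();
  if (blacklist.has(name)) return true;   // exact match
  // Check every substring whose length lies between the shortest and longest word.
  for (let len = minLen; len <= Math.min(maxLen, name.length); len++) {
    for (let i = 0; i + len <= name.length; i++) {
      if (blacklist.has(name.slice(i, i + len))) return true; // catches 123_BadWord999
    }
  }
  return false;
}

reloadBlacklist();
setInterval(reloadBlacklist, 15 * 60 * 1000); // refresh every 15 minutes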
You can't do where 'loginName' = word%. The % wildcard can only be used in a literal LIKE pattern, not as part of the column data.
You would need to say where 'logi' = word or 'login' = word or ... where you compare substrings of the login name with the bad words. You'll need to test each substring whose length is between the shortest and longest bad word, inclusive.
Make sure you have an index on the word column of your table, and see what performance is like.
Other ways to do this would be:
Use Lucene; it's good at quickly searching text, especially if you just need to know whether or not your substring exists. Of course Lucene might not fit technically in your environment -- it's a Java library.
Take a hash of each bad word, and record them in a bitset in memory -- this will be small and fast to look up, and you'll only need to go to the database to make sure that a positive isn't false.
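A minimal sketch of that bitset idea in Node (the bitset size, the two positions derived from one SHA-1 digest, and the confirm-against-the-database step are just one way to do it; a real implementation might use a proper Bloom filter library):

// Sketch: a hit in the bitset may be a false positive, so it is confirmed
// against the database; a miss is definitive and needs no query at all.
const crypto = require('crypto');

const BITSET_BITS = 8 * 1024 * 1024;            // 1 MB of bits in memory
const bitset = Buffer.alloc(BITSET_BITS / 8);

// Derive two independent bit positions from one SHA-1 digest of the word.
function positions(word) {
  const digest = crypto.createHash('sha1').update(word.toLowerCase()).digest();
  return [digest.readUInt32BE(0) % BITSET_BITS, digest.readUInt32BE(4) % BITSET_BITS];
}

function addWord(word) {
  for (const pos of positions(word)) bitset[pos >> 3] |= 1 << (pos & 7);
}

function mightContain(word) {
  return positions(word).every(pos => bitset[pos >> 3] & (1 << (pos & 7)));
}

// Load the blacklist once at startup, e.g. rows.forEach(r => addWord(r.word));
// On signup: only hit the database when the bitset says "maybe", e.g.
// if (mightContain(username)) { /* SELECT 1 FROM blacklist WHERE word = ? */ }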

high load on mysql DB how to avoid?

I have a table containing the cities around the world; it has more than 70,000 cities.
I also have an auto-suggest input on my home page (which is used intensively) that runs a SQL query (a LIKE search) for every character entered, after the second letter.
I'm afraid of that heavy load, so I'm looking for any solution or technique that can help in such a situation.
Cache the table, preferably in memory. 70,000 cities is not that much data. If each city takes up 50 bytes, that's only 70000 * 50 / (1024 ^ 2) ≈ 3.3 MB. And after all, a list of cities doesn't change that fast.
If you are using AJAX calls exclusively, you could cache the data for every combination of the first two letters in JSON. Assuming a Latin-like alphabet, that would be around 680 combinations. Save each of those to a text file in JSON format, and have jQuery access the text files directly.
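As an illustration, the pre-generation step could be a one-off Node script along these lines (the cities table and the ./suggest output directory are assumptions); jQuery could then fetch ./suggest/ka.json directly with no database involved:

// Sketch: dump one JSON file per two-letter prefix so the front end never
// touches the database for suggestions.
const fs = require('fs');
const mysql = require('mysql2/promise');

async function buildSuggestFiles() {
  const pool = mysql.createPool({ host: 'localhost', user: 'app', password: 'secret', database: 'app' });
  const [rows] = await pool.query('SELECT name FROM cities ORDER BY name');

  // Group all ~70k city names by their first two letters.
  const byPrefix = {};
  for (const { name } of rows) {
    const prefix = name.slice(0, 2).toLowerCase();
    (byPrefix[prefix] = byPrefix[prefix] || []).push(name);
  }

  fs.mkdirSync('./suggest', { recursive: true });
  for (const [prefix, names] of Object.entries(byPrefix)) {
    fs.writeFileSync(`./suggest/${prefix}.json`, JSON.stringify(names));
  }
  await pool.end();
}

buildSuggestFiles();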
Create an index on the city name column to begin with. This speeds up queries that look like:
SELECT name FROM cities WHERE name LIKE 'ka%'
Also try making your autocomplete form a little 'lazy': the more letters a user enters, the fewer records your database has to deal with.
What resources exist for Database performance-tuning?
You should cache as much data as you can on the web server. Data that does not change often, like lists of countries, cities, etc., is a good candidate for this. Realistically, how often do you add a country? Even if you change the list, a simple refresh of the cache will handle this.
You should make sure that your queries are tuned properly to make the best use of indexes and joins.
You may have load on your DB from other queries as well. You may want to look into techniques to improve performance of MySQL databases.
Just get your table to fit in memory, which should be trivial for 70k rows.
Then you can do a scan very easily. Maybe don't even use a sql database for this (as it doesn't change very often), just dump the cities into a text file and scan that. That'd definitely be better if you have many web servers but only one db server as each could keep its own copy of the file.
How many queries per second are you seeing peak? I can't imagine there being that many people typing city names in, even if it is a very busy site.
Also you could cache the individual responses (e.g. in memcached) if you get a good hit rate (e.g. because people tend to type the same things in)
Actually you could also probably precalculate the responses for all one- to three-letter combinations; that's only about 26*26*26 (≈17k) entries. As a four or more letter input must logically be a subset of one of those, you could then scan the appropriate one of the 17k entries.
If you have an index on the city name it should be handled by the database efficiently. (Edit: this statement is wrong, see the comments below.)
To lower the demands on your server resources, you can offer autocompletion only after the first n characters. Also allow for some timeout, i.e. don't make a request while a user is still typing.
Once the user has stopped typing for a while, you can request autocompletion.