I am trying to understand the best way to read and send a very large number of database rows (50K-100K) to the client.
Should I simply read all the rows from the database on the backend at once and then send them all as JSON? This isn't very responsive, since the user just waits for a long time, but it is faster for a small number of rows.
Or should I stream the rows from the database and call socket.emit() for each row as it is read? That causes a lot of socket emits and ends up slower overall, but it is more responsive.
I am using Node.js and socket.io.
Rethink the Interface
First off, a user interface that shows 50-100k rows on the client is probably not the best design in the first place. Not only is that a large amount of data to send down and for the client to manage (and perhaps impractical on some mobile devices), it's also obviously far more rows than any single user will actually read in a given interaction with the page. So the first order of business might be to rethink the user interface and create some sort of more demand-driven design (paged, virtual scroll, keyed by letter, etc.). There are lots of possibilities for a different (and hopefully better) design that lessens the amount of data transferred. Which design is best depends entirely upon the data and the likely usage patterns of your users.
Send Data in Chunks
That said, if you are going to transfer that much data to the client, you'll probably want to send it in chunks (groups of rows at a time). The idea is that each chunk is a consumable amount of data: the client can parse it, process it, show the results, and then be ready for the next chunk. The client can stay active the whole time, since it has cycles available between chunks to process other user events, while chunking still avoids the overhead of sending a separate message for every single row. If your server uses compression, chunks also give the compression a better chance to be effective. How big a chunk should be (i.e., how many rows it should contain) depends on a bunch of factors and is best determined through experimentation with likely clients, or with the lowest-powered client you expect. For example, you might send 100 rows per message.
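A minimal sketch of what the chunked approach could look like with socket.io. It assumes an existing socket.io server instance io; the getRowsBatch(offset, limit) database helper and the event names (requestRows, rowsChunk, rowsDone) are made up for this example, not part of any library:

// Server side: send rows in fixed-size chunks instead of one per row
// or all at once. A CHUNK_SIZE of 100 is just a starting point to tune.
const CHUNK_SIZE = 100;

io.on('connection', (socket) => {
  socket.on('requestRows', async () => {
    let offset = 0;
    while (true) {
      // getRowsBatch is a placeholder for your own query,
      // e.g. SELECT ... LIMIT ? OFFSET ? or a keyset-paginated query.
      const rows = await getRowsBatch(offset, CHUNK_SIZE);
      if (rows.length === 0) break;

      socket.emit('rowsChunk', rows);   // one message per 100 rows
      offset += rows.length;

      // Yield to the event loop so other clients and events are not starved.
      await new Promise((resolve) => setImmediate(resolve));
    }
    socket.emit('rowsDone');
  });
});

// Client side: render each chunk as it arrives so the page stays responsive.
socket.emit('requestRows');
socket.on('rowsChunk', (rows) => appendRowsToTable(rows));   // your own rendering
socket.on('rowsDone', () => hideLoadingIndicator());         // your own UI hook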
Use an Efficient Transfer Format for the Data
And, if you're using socket.io to transfer large amounts of data, you may want to revisit how you use the JSON format. For example, sending 100,000 objects that all repeat exactly the same property names is not very efficient. You can often invent your own optimizations that avoid repeating property names that are exactly the same in every object. For example, rather than sending 100,000 of these:
{"firstname": "John", "lastname": "Bundy", "state": "Az", "country": "US"}
if every single object has exactly the same properties, then you can either hard-code the property names in your own code or send the property names once and then send each record as a plain array of values that the receiving code can turn back into an object with the appropriate property names:
["John", "Bundy", "Az", "US"]
Data size can sometimes be reduced by 2-3x by simply removing redundant information.
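As a rough illustration of that idea (the helper names here are invented for the sketch):

// Encode: send the property names once, then each row as a plain array of values.
function packRows(rows) {
  const columns = Object.keys(rows[0] || {});   // ["firstname", "lastname", "state", "country"]
  return {
    columns,
    values: rows.map((row) => columns.map((c) => row[c])),
  };
}

// Decode on the receiving side: rebuild the objects from the shared column list.
function unpackRows({ columns, values }) {
  return values.map((vals) =>
    Object.fromEntries(columns.map((c, i) => [c, vals[i]]))
  );
}

// packRows([{ firstname: "John", lastname: "Bundy", state: "Az", country: "US" }])
//   => { columns: ["firstname", ...], values: [["John", "Bundy", "Az", "US"]] }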
Related
I'm learning about sharding approaches: how to achieve good horizontal scalability with a large number of shards in an IO-heavy application. Below I describe a case that I expect to see in my app. I think this would be relatively common in the wild, but I was unable to find much information on it.
Let's say that we need to shard a table/collection where each row is associated with a client. All queries will include a single client id (uuid). Updates and reads are mostly evenly distributed among clients.
From what I've read in this case I would want to use a hashed sharding key on the client id. Reads would touch a single shard providing best performance. Writes would be evenly distributed as long as clients produce relatively the same load.
But what to do if there is a very small subset of clients that produce so much IO load that a single shard would have trouble handling it?
If we instead shard on a random record ID, then writes for all clients would be distributed across all shards. But reads would have to hit all shards, which is not efficient, especially when there are a lot of them.
How do we achieve a balance: have average clients be evenly distributed, and at the same time allow large clients to occupy multiple shards? Are there any DB solutions that would be able to do this automatically? Or do we have to write custom logic for tracking DB load and redistributing large clients between shards? What should I read on the topic?
I'd suggest adding a new attribute to the client's records; for example, we could call it part. Assign a single value to simple clients, and store that same value in part for all of their records.
But heavy clients would be assigned multiple values for part, up to the number of shards. Every record for such a client would set its part to one of these values, assigned either randomly or round-robin, however you think is most efficient. The point is to use each part value with approximately even frequency.
Your hashing algorithm for mapping clients to a shard would then use the client id + the part attribute. So each simple client would still store all their data on a single shard. But heavy clients will distribute their data over multiple shards.
This does mean that for the heavy clients, a read query would need to search multiple shards. Code your searches to loop over the part values for the client. For most clients, this loop will only need to execute once. For the heavy clients, the loop will execute once for each part value associated with that client.
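A rough sketch of that routing logic in JavaScript; the hash choice, shard count, and the shape of the client object are all assumptions for illustration:

const crypto = require('crypto');

const SHARD_COUNT = 16;   // placeholder; use your real shard count

// Map (clientId, part) to a shard index with a stable hash.
function shardFor(clientId, part) {
  const digest = crypto.createHash('md5').update(`${clientId}:${part}`).digest();
  return digest.readUInt32BE(0) % SHARD_COUNT;
}

// Writes: pick one of the client's part values (random here; round-robin also works).
function shardForWrite(client) {
  const part = client.parts[Math.floor(Math.random() * client.parts.length)];
  return { shard: shardFor(client.id, part), part };
}

// Reads: a simple client (one part value) touches one shard; a heavy client
// loops over all of its part values, and therefore over multiple shards.
function shardsForRead(client) {
  return [...new Set(client.parts.map((p) => shardFor(client.id, p)))];
}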
To be honest, I've never seen a load so great that this would be necessary. It's more likely that the traffic for one client is too much for one database instance because the queries are not optimized well, or the application is running more queries than it should. It's important to make sure you analyze query efficiency before you make your sharding architecture more complex.
You've tagged your question with cockroachdb so you probably already suspect this, but CockroachDB handles sharding transparently. If your primary key is composite and the first column is the client id, data with the same client id will all fall in a contiguous key range, and therefore be generally stored on the same node. If a range gets bigger than a configurable limit, and/or gets much more traffic, CockroachDB will automatically split the range to rebalance storage and traffic across nodes. You'll mostly not have to pay attention to this, and for your pattern you won't want to do any explicit sharding. However, if you do need to inspect or tweak the behavior there are tools to do so such as SHOW RANGES.
I'm not sure about the correct approach when using MySQL. When I started on a big website, I used to load articles with all their info using one function, load_articles, which loaded every article that was supposed to be displayed in some way.
However, often only some of the article info was used, for example only the title, or only the image icon. So I created an object-oriented model where Article uses MySQL to fetch properties as they're needed, via __get and ArrayAccess. This results in a higher number of queries in general, but reduces the amount of data fetched from MySQL.
Of course, the ideal approach would be to buffer the "data needed" and then send one query. But if that is too complicated for me, which way should I aim?
Bulk-fetch all the data that may be needed and discard the unnecessary data, reducing the number of queries
Lazy-load the individual properties as they're needed when generating the page, fetching little data with many queries
If the second is better, should I go as far as avoiding SELECT * and instead running separate selects for individual properties as they are needed?
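For illustration, here is the lazy-load pattern sketched in JavaScript (the real code is PHP with __get); the articles table, the column access helper, and the mysql2-style ?? identifier placeholder are assumptions of the sketch:

// Lazy-load: each property is fetched only the first time it is asked for,
// then cached, so unused properties never hit the database.
class LazyArticle {
  constructor(id, query) {
    this.id = id;
    // query is an async (sql, params) => [rows, fields] helper,
    // e.g. (sql, p) => conn.query(sql, p) with mysql2/promise.
    this._query = query;
    this._cache = new Map();
  }

  async get(column) {
    if (!this._cache.has(column)) {
      const [rows] = await this._query(
        'SELECT ?? FROM articles WHERE id = ? LIMIT 1', [column, this.id]
      );
      this._cache.set(column, rows[0] ? rows[0][column] : undefined);
    }
    return this._cache.get(column);
  }
}

// const title = await article.get('title');   // one row, one column, one query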
First of all, the answer depends entirely on how your webpage is loaded, what your users require, and what your SLAs are.
Suppose your page has 5 elements on it; the two approaches then behave like this:
Fetch the data in bulk, store it locally, and load it from there
This is a good approach when the user needs to see all the data at once, or when something very computational has to happen on the user's end. Even then, fetch only the required attributes; avoid SELECT *, which always fetches more than you need.
Keep an eye on network bandwidth while transferring the data, and if possible serve images and other static assets from a CDN.
Fetch only the base data first, then fetch more according to what the user needs
This is a good approach when the user generally wants to see only the first section of the page, or will at least be happy to see that first section on screen within a second.
You can then load more data gradually as the user scrolls down and interacts with the page.
This way you save memory and CPU cycles on the app server that would otherwise go into processing the bulk data, and the user stays engaged because something shows up quickly while the rest continues to load.
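A rough client-side sketch of that incremental approach; the /api/articles endpoint, the page size, and the rendering helper are placeholders:

// Load the first page immediately, then fetch the next page when the user
// nears the bottom of the list.
let page = 0;
let loading = false;

async function loadNextPage() {
  if (loading) return;
  loading = true;
  const res = await fetch(`/api/articles?page=${page}&size=25`);
  const rows = await res.json();
  appendRowsToList(rows);          // your own rendering function
  page += 1;
  loading = false;
}

loadNextPage();                    // the first section renders quickly

window.addEventListener('scroll', () => {
  const nearBottom =
    window.innerHeight + window.scrollY >= document.body.offsetHeight - 200;
  if (nearBottom) loadNextPage();
});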
That covers the page-load SLAs. Both options suit different conditions (these days the second is generally preferred).
As for slow SQL queries: normalize the database and add proper indexes wherever required, and write your queries so that only the data you actually need is fetched, efficiently.
If you have something that cannot be normalized further and is getting complex, then you can look at NoSQL options.
Applying these techniques efficiently will help you achieve your desired performance.
I hope this clears up your confusion a bit.
So, I've been coding web apps for some time now; typically I've done both the data structures/retrieval and the client-side coding. I now have a data admin teammate working with me whose sole job is to return data from a database to an API that serves JSON; standard stuff.
Recently, I have been having a disagreement with him about how this data should be returned. Essentially, we have two JSON objects: the first is loaded remotely once when the application starts (and includes racer name, racer number, etc.), and the second is a recurring timed data call during the race that delivers positions incrementally (a racer's lat/lon, speed, etc.).
Where we differ is that he says it is "inefficient" to return the racer name (from the first call) in the telemetry string (the second call). That forces me to keep the first data object in a global, and then join the racer's lat/lon and speed from the second object "on the fly" with a lookup function keyed on racer id, which returns a new JSON object that I feed into a racer grid with jqGrid (something like getRaceDataByID(id): look up the telemetry in the second object where the ids match, and return a new row object to populate the grid).
The result seems to be an over-coded and slow client-side (jQuery) application.
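For context, the client-side stitching looks roughly like this (simplified; staticData stands for the parsed first call, and the field and helper names just follow the description above):

// First call: static racer info, kept in a global and indexed by id once.
const racersById = {};
staticData.forEach((r) => { racersById[r.id] = r; });

// Second call (recurring telemetry): join the name/number back in before
// handing the rows to jqGrid.
function buildGridRows(telemetry) {
  return telemetry.map((t) => {
    const racer = racersById[t.id] || {};
    return {
      id: t.id,
      name: racer.name,        // re-stitched from the first call
      number: racer.number,
      lat: t.lat,
      lon: t.lon,
      spd: t.spd,
    };
  });
}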
My question is about theory. Of course I understand traditional data structures, normalization, SQL, etc. But in today's world of "webapps", larger web services seem to be moving away from strictly "traditional SQL" data structures and toward simply returning the data in the shape the client needs. Here, that would mean adding about 3 fields (name, bib number, vehicle type, etc.) to the SQL call behind each position telemetry call, so I can display the data on the client per the interface's requirements (a data table showing real-time speed, lat/lon, etc.).
So finally, my question: has anyone dealt with a situation like this, and am I "all wet" in thinking that 3 extra fields per row, in today's world of massive data-dependent web applications, is not a huge issue to be squabbling over?
Please note: I understand that traditionally you would not want to send more data than you need, and that his understanding of data structures and of inefficient data transfers is actually correct.
But many times when I'm coding a web app, it's looked at a bit differently because of the stateless nature of the browser, and IMHO it's much easier to just send the data that is needed. My question is not driven by not wanting to code the solution, but by trying to put less load on the client by not having to re-stitch the JSON objects into the shape I needed in the first place.
I think it makes sense to send these 3 fields with the rest of the data, even if this warrants some sort of duplication. You get the following advantages:
You don't have to maintain the names of racers from the first call in your browser
Your coding logic is simplified (don't have to match up racer names to subsequent calls, the packet contains the info. already)
As far as speed goes, you are doing the majority of the work in your remote call; adding another 3 fields doesn't matter much, IMHO, and it makes your app cleaner.
So I guess I agree w/you.
I'm working on building a web application that consists of users doing the following:
Browse and search against a Solr server containing millions of entries. (This part of the app is working really well.)
Select a privileged piece of this data (the results of some particular search), and temporarily save it as a "dataset". (I'd like dataset size to be limited to something really large, say half a million results.)
Perform some sundry operations on that dataset.
(The frontend's built in Rails, though I doubt that's really relevant to how to solve this particular problem.)
Step two, and how to retrieve the data for step 3, are what's giving me trouble. I need to be able to temporarily save datasets, recover them when they're needed, and expire them after a while. The problem is, my results have SHA1 checksum IDs, so each ID is 48 characters. A 500,000 record dataset, even if I only store IDs, is 22 MB of data. So I can't just have a single database table and throw a row in it for each dataset that a user constructs.
Has anybody out there ever needed something like this before? What's the best way to approach this problem? Should I generate a separate table for each dataset that a user constructs? If so, what's the best way to expire/delete these tables after a while? I can deploy a MySQL server if needed (though I don't have one up yet, all the data's in Solr), and I'd be open to some crazier software as well if something else fits the bill.
EDIT: Some more detailed info, in response to Jeff Ferland below.
The data objects are immutable, static, and reside entirely within the Solr database. It might be more efficient as files, but I would much rather (for reasons of search and browse) keep them where they are. Neither the data nor the datasets need to be distributed across multiple systems; I don't expect we'll ever get that kind of load. For now, the whole damn thing runs inside a single VM (I can cross that bridge if I get there).
By "recovering when needed," what I mean is something like this: The user runs a really carefully crafted search query, which gives them some set of objects as a result. They then decide they want to manipulate that set. When they (as a random example) click the "graph these objects by year" button, I need to be able to retrieve the full set of object IDs so I can take them back to the Solr server and run more queries. I'd rather store the object IDs (and not the search query), because the result set may change underneath the user as we add more objects.
A "while" is roughly the length of a user session. There's a complication, though, that might matter: I may wind up needing to implement a job queue so that I can defer processing, in which case the "while" would need to be "as long as it takes to process your job."
Thanks to Jeff for prodding me to provide the right kind of further detail.
First trick: don't represent your SHA1 as text, but rather as the 20 bytes it takes up. The hex value you see is just a way of showing those bytes in human-readable form. Store them properly and you're at 9.5 MB instead of 22.
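In Node.js, for example, converting between the two forms is trivial (assuming the IDs are plain 40-character hex SHA-1 strings):

// 40 hex chars -> 20 raw bytes for storage, and back again for display.
const raw = Buffer.from('a94a8fe5ccb19ba61c4c0873d391e987982fbbd3', 'hex'); // 20 bytes
const hex = raw.toString('hex');                                            // 40 chars

// 500,000 ids * 20 bytes = 10,000,000 bytes, roughly 9.5 MB instead of 22.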
Second, you haven't really explained the nature of what you're doing. Are your saved datasets references to immutable objects in the existing database? What do you mean by recovering them when needed? How long is "a while" when you talk about expiration? Is the underlying data that you're referencing static or dynamic? Can you save the search pattern and an offset, or do you need to save the individual reference?
Does the data related to a session need to be inserted into a database? Might it be more efficient in files? Does that need to be distributed across multiple systems?
There are a lot of questions left in my answer. For that, you need to better express or even define the requirements beyond the technical overview you've given.
Update: There are many possible solutions for this. Here are two:
Write those to a single table (saved_searches or such) that has an incrementing search id. Bonus points for inserting your keys in sorted order: (search_id BIGINT UNSIGNED, item_id BINARY(20), PRIMARY KEY (search_id, item_id)). That will really limit fragmentation, keep each search clustered, and free up pages in a roughly sequential order. It's almost a rolling table, and that's about the best case for doing large amounts of insertions and deletions. In that circumstance, you pay a cost for insertion, and double that cost for deletion. You must also iterate the entire search result.
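A sketch of that first option with Node and the mysql2 driver; the table layout follows the schema above, with BINARY(20) as one way to hold the raw SHA-1 bytes, and conn assumed to be an existing mysql2/promise connection:

const mysql = require('mysql2/promise');
// const conn = await mysql.createConnection({ /* your config */ });

const CREATE_SQL = `
  CREATE TABLE IF NOT EXISTS saved_searches (
    search_id BIGINT UNSIGNED NOT NULL,
    item_id   BINARY(20)      NOT NULL,
    PRIMARY KEY (search_id, item_id)
  )`;

// Save one result set: insert the ids in sorted order so each search's
// rows stay clustered within the primary key.
async function saveSearch(conn, searchId, hexIds) {
  const rows = hexIds
    .map((h) => Buffer.from(h, 'hex'))
    .sort(Buffer.compare)
    .map((buf) => [searchId, buf]);
  await conn.query('INSERT INTO saved_searches (search_id, item_id) VALUES ?', [rows]);
}

// Expire a whole search when the session ends or the job finishes.
async function deleteSearch(conn, searchId) {
  await conn.query('DELETE FROM saved_searches WHERE search_id = ?', [searchId]);
}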
If your search items have an incrementing primary id, such that any new insertion into the database will have a higher value than anything already in it, that is the most efficient. Alternatively, inserting a datestamp would achieve the same effect with less efficiency (every row must actually be checked in a query instead of just the index entries). If you take note of that maximum id, and you don't delete records, then you can save searches that use essentially zero space by simply storing the query along with that maximum id.
This is a technical question regarding browser limitations for parsing and sorting JSON.
We are looking at running a clustering algorithm on large data sets (potentially 50k rows, with perhaps 10 fields per row) that are returned from a query and displayed to users in a table, 25 rows per page, sortable on all fields. The clustering will take place server-side, and the clustered results will then be sent back to the client as JSON.
Currently, the clustered result data will not exist in any database table. This creates some issues for sorting and paging, and for back-button support too.
Instead of rerunning the query for "next page" and "re-sort", I'm wondering if I could send all the data back at once as one potentially very large JSON payload and then display only 25 records at a time to implement paging. But what about when a user wants to re-sort? Could a browser handle re-sorting 50k+ rows and still maintain the paging feature?
Or would it just be better to create a temp table for the user's query results?
You might get a faster result with JSONP. I don't know if there are any limits on size, other than practical ones, since you basically make the browser treat the JSON as a script. You would have to process the result into some sort of data structure to support your paging, but that shouldn't be difficult.
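Whichever transport you use, the client-side re-sorting and paging is just an in-memory sort plus a slice; a sketch, not tied to any particular grid library:

// Keep the full result set in memory once it has arrived.
let allRows = [];          // e.g. 50k parsed row objects
const PAGE_SIZE = 25;

// Re-sort in place by any field; modern JS engines handle 50k rows quickly
// with a simple comparator like this.
function sortBy(field, descending = false) {
  allRows.sort((a, b) => {
    if (a[field] < b[field]) return descending ? 1 : -1;
    if (a[field] > b[field]) return descending ? -1 : 1;
    return 0;
  });
}

// Paging is then just a slice of the sorted array.
function getPage(pageNumber) {
  return allRows.slice(pageNumber * PAGE_SIZE, (pageNumber + 1) * PAGE_SIZE);
}

// Usage: sortBy('lastname'); renderTable(getPage(0));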