Is it a good idea to build in-memory indexes and circumvent the DB when operating intensively on a small subset? - language-agnostic

I'm working on a program to automatically find optimal shift assignments, subject to lots of constraints. I'm using Grails, i.e. the data about workers, shifts and assignments will be kept in a DBMS.
For the optimization itself, I'll have to work very intensively on a small subset of the data (about 600 rows total from about 5 different tables). I'll have to iterate over and search through various sub-subsets dozens of times to compute fitness functions, change some values, compute fitness again, lather, rinse, repeat, perhaps hundreds of times.
Now, while searching and iteration are exactly what a DBMS is for, I believe that in this case the overhead of hundreds of DB requests would dwarf the actual work being done, even for an in-memory DBMS like HSQLDB. So instead, I'm planning to slurp the entire subset into memory at the beginning, build my own indexes (HashMap, mainly) for the lookups I'll have to do, and then work only with those, staying away from the DB until I'm done and write my result to it.
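A minimal sketch of what such in-memory indexes could look like, written in Python rather than Grails for brevity. The row layout, the field names (worker_id, shift_id) and the reassign helper are assumptions for illustration, not part of the actual schema:

```python
from collections import defaultdict

# Hypothetical rows, as they might come back from one bulk query per table.
assignments = [
    {"id": 1, "worker_id": 10, "shift_id": 100},
    {"id": 2, "worker_id": 10, "shift_id": 101},
    {"id": 3, "worker_id": 11, "shift_id": 100},
]

# Build the lookup indexes once, up front.
by_id = {row["id"]: row for row in assignments}
by_worker = defaultdict(list)
by_shift = defaultdict(list)
for row in assignments:
    by_worker[row["worker_id"]].append(row)
    by_shift[row["shift_id"]].append(row)


def reassign(assignment_id, new_worker_id):
    """Move an assignment to another worker and keep the indexes in sync."""
    row = by_id[assignment_id]
    by_worker[row["worker_id"]].remove(row)
    row["worker_id"] = new_worker_id
    by_worker[new_worker_id].append(row)


reassign(2, 11)
print(len(by_worker[10]), len(by_worker[11]))  # 1 2
```

The part that needs care is that every change made during the optimization has to update each index that covers the changed field, otherwise the lookups drift out of sync with the data.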
Is this a sound approach? Any better ideas?

I'm assuming you must issue hundreds of commands to the database? There's no way to execute the code inside the DB?
The main thing I'd be worried about is integrity; make sure you handle locking correctly. You'd probably want a version number stored somewhere so you don't need to lock the entire set of data for the duration of processing. In the update transaction, you'd first ensure the version number is the same as when you started reading.
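The version check could be sketched roughly like this, using Python's sqlite3 module as a stand-in for the real DBMS. The plan_version table, its single row and the function names are assumptions made up for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE plan_version (id INTEGER PRIMARY KEY, version INTEGER)")
conn.execute("INSERT INTO plan_version VALUES (1, 7)")
conn.commit()


def read_version(conn):
    return conn.execute("SELECT version FROM plan_version WHERE id = 1").fetchone()[0]


def save_if_unchanged(conn, version_at_read):
    """Bump the version only if nobody else changed it; return True on success."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            "UPDATE plan_version SET version = version + 1 WHERE id = 1 AND version = ?",
            (version_at_read,),
        )
        if cur.rowcount != 1:
            return False
        # ... write the optimized assignments here, inside the same transaction ...
        return True


seen = read_version(conn)
first = save_if_unchanged(conn, seen)   # version still matches
second = save_if_unchanged(conn, seen)  # stale version, rejected
print(first, second)  # True False
```

No lock is held while the optimization runs; a conflicting change is only detected at write time, at which point the caller would have to reload and recompute.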
Finally, benchmark it? I've done some apps over the last year or so that had a similar very intensive compute process per request. Using in-process objects to represent the data was orders of magnitude more efficient than hitting the database per request. But every app is different and there might be things not considered that'll impact it.

Related

Multiple large MySQL SELECT queries - better to run in parallel or in a queue?

I have looked up answers to this question a bunch and couldn't find a specific answer - sorry in advance if I missed something! Also, I'm a SQL optimization noob.
I have an analytics dashboard which pulls data based on users' requests from a large database.
Each page the user loads runs a number of different queries to populate different parts of the page (different charts, tables, etc). Some of these pages can take quite some time to load as the user might request several years of data.
Currently, each part of the page pings off one SELECT query to the SQL server but as there are several parts of the page, those queries end up running in parallel.
Would it be faster to run these queries in a queue - to allow the server to process one query at a time? Or to keep everything in parallel, as is?
The added benefit of running them one at a time is that we could run the queries to fill in the "above-the-fold" part of the page first...
Hope that all makes sense and take it easy on me please :)
I also say "it depends", but I lean toward parallelism.
Probably should not have more parallelism than the number of CPU cores.
I rarely see a system that chews up all the CPU cores -- unless it does not have good enough indexes. That is, fix the indexes before asking the question.
If the data is bigger than can be cached, it may be faster to queue, since you may have a choke point -- I/O.
If the table(s) are continually being changed, turn off the Query Cache.
If your goal is to get some results on the page early (a likely Human Interface goal), add a small delay in all but one AJAX callee (not caller).
If multiple pages could be computing at the same time, things get more complex. For example, you can't really control the parallelism.
Let's see the queries. Perhaps we can speed them up enough to obviate the question.
There is no right answer to this question. Up to a point, running parallel SELECT queries is (generally) going to be faster than one running query. Whether that point is 2 queries or 200 depends on the nature of the queries, the hardware configuration, the data, and the speeds of various components.
The situation becomes even more complex when you consider how many different users may be involved and whether or not the data is being updated. You can get into really bad situations with parallel queries and updates if the locks start cascading. Of course, this can happen with multiple simultaneous users as well.
My guess is that you want a throttling mechanism that will run, say, n queries at a time and put the rest into a queue.
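A rough sketch of such a throttle, using Python's concurrent.futures. The run_query function is a stand-in that only sleeps, and MAX_PARALLEL = 3 is an arbitrary assumption; the right value has to come from benchmarking:

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_PARALLEL = 3  # the "n" from the answer; tune by benchmarking

_lock = threading.Lock()
_running = 0
peak_running = 0


def run_query(name):
    """Stand-in for a real SELECT; it only sleeps and records concurrency."""
    global _running, peak_running
    with _lock:
        _running += 1
        peak_running = max(peak_running, _running)
    time.sleep(0.05)  # pretend the database is working
    with _lock:
        _running -= 1
    return f"{name}: done"


queries = [f"chart_{i}" for i in range(8)]

# The pool runs at most MAX_PARALLEL queries at once and queues the rest.
with ThreadPoolExecutor(max_workers=MAX_PARALLEL) as pool:
    results = list(pool.map(run_query, queries))

print(results[0], peak_running)
```

Since the pool starts work in submission order, putting the "above-the-fold" queries first in the list means they are also started first.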

sqlite or mysql for large datasets

I am working with large datasets (10s of millions of records, at times, 100s of millions), and want to use a database program that links well with R. I am trying to decide between mysql and sqlite. The data is static, but there are a lot of queries that I need to do.
In this link to sqlite help, it states that:
"With the default page size of 1024 bytes, an SQLite database is limited in size to 2 terabytes (2^41 bytes). And even if it could handle larger databases, SQLite stores the entire database in a single disk file and many filesystems limit the maximum size of files to something less than this. So if you are contemplating databases of this magnitude, you would do well to consider using a client/server database engine that spreads its content across multiple disk files, and perhaps across multiple volumes."
I'm not sure what this means. When I have experimented with mysql and sqlite, it seems that mysql is faster, but I haven't constructed very rigorous speed tests. I'm wondering if mysql is a better choice for me than sqlite due to the size of my dataset. The description above seems to suggest that this might be the case, but my data is nowhere near 2TB.
I'd appreciate any insights into understanding this constraint of maximum file size from the filesystem and how this could affect speed for indexing tables and running queries. This could really help me in my decision of which database to use for my analysis.
The SQLite database engine stores the entire database in a single file. This may not be very efficient for incredibly large files (SQLite's limit is 2TB, as you've found in the help). In addition, SQLite allows only one writer at a time. If your application is web based or might end up being multi-threaded (like an AsyncTask on Android), mysql is probably the way to go.
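For what the 2 TB figure in the quoted help text amounts to, here is the arithmetic spelled out. It uses only the two numbers from the quote (a 1024-byte page and a limit of 2 to the power 41 bytes); the page count is derived from those, not taken from the SQLite documentation:

```python
PAGE_SIZE = 1024       # bytes, the default page size named in the quote
LIMIT_BYTES = 2 ** 41  # the size limit named in the quote

print(LIMIT_BYTES)               # 2199023255552
print(LIMIT_BYTES / 2 ** 40)     # 2.0 (tebibytes)
print(LIMIT_BYTES // PAGE_SIZE)  # 2147483648 pages, i.e. 2**31
```

In other words, the limit is a cap on the number of pages in the file, and it only matters once the database itself approaches terabytes.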
Personally, since you've done tests and mysql is faster, I'd just go with mysql. It will be more scalable going into the future and will allow you to do more.
"I'm not sure what this means. When I have experimented with mysql and sqlite, it seems that mysql is faster, but I haven't constructed very rigorous speed tests."
The short short version is:
If your app needs to fit on a phone or some other embedded system, use SQLite. That's what it was designed for.
If your app might ever need more than one concurrent connection, do not use SQLite. Use PostgreSQL, MySQL with InnoDB, etc.
It seems that (in R, at least) SQLite is awesome for ad hoc analysis. With the RSQLite or sqldf packages it is really easy to load data and get started. But for data that you'll use over and over again, it seems to me that MySQL (or SQL Server) is the way to go because it offers a lot more features in terms of modifying your database (e.g., adding or changing keys).
MySQL, if you are mainly using this as a web service.
SQLite, if you want it to be able to function offline.
SQLite is generally much, much faster, as the majority (or all) of the data and indexes will be cached in memory. In my experience so far, if the data is split across multiple tables, or even multiple SQLite database files, then even for millions of records (I have yet to reach hundreds of millions) it is far more effective than SQL, because it avoids the network latency and similar overhead. However, that holds only when the records are split across different tables and each query targets a specific table (don't query all tables).
An example would be an item database used in a simple game. While this may not sound like much, a UID was issued even for variations, so the generator quickly worked out to more than a million sets of 'stats' with variations. The speed was mainly due to each 1000 sets of records being split among different tables (as we mainly pull records via their UID). Though the performance of the splitting was not properly measured, we were getting queries that were easily 10 times faster than SQL (mainly due to network latency).
Amusingly though, we ended up reducing the database to a few thousand entries, having an item [prefix] / [suffix] determine the variations (like Diablo, only that it was hidden), which proved to be much faster at the end of the day.
On a side note, my case was mainly due to the queries being lined up one after another (each waiting for the one before it). If, however, you are able to run multiple connections / queries against the server at the same time, the performance drop in SQL is more than compensated for on your client side, assuming these queries do not branch on or interact with one another (e.g. if this query returns a result, run this, else run that).

DB design and optimization considerations for a social application

The usual case. I have a simple app that will allow people to upload photos and follow other people. As a result, every user will have something like a "wall" or an "activity feed" where he or she sees the latest photos uploaded from his/her friends (people he or she follows).
Most of the functionalities are easy to implement. However, when it comes to this history activity feed, things can easily turn into a mess because of pure performance reasons.
I have come to the following dilemma here:
I can easily design the activity feed as a normalized part of the database, which will save me writing cycles, but will enormously increase the complexity when selecting those results for each user (for each photo uploaded within a certain time period, select a certain number whose uploaders I am following / for each person I follow, select his photos).
An optimization option could be the introduction of a series of threshold constraints which, for instance would allow me to order the people I follow on the basis of the date of their last upload, even exclude some, to save cycles, and for each user, select only the 5 (for example) last uploaded photos.
The second approach is to introduce a completely denormalized schema for the activity feed, in which every row represents a notification for one of my followers. This means that every time I upload a photo, the DB will put n rows in this "drop bucket", n being the number of people who follow me, i.e. lots of writing cycles. If I have such a table, though, I could easily apply some optimization techniques such as clever indexing, as well as pruning entries older than a certain period of time (queue).
Yet, a third approach that comes to mind, is even a less denormalized schema where the server side application will take some part of the complexity off the DB. I saw that some social apps such as friendfeed, heavily rely on the storage of serialized objects such as JSON objects in the DB.
I am definitely still mastering the skill of scalable DB design, so I am sure that there are many things I've missed, or still to learn. I would highly appreciate it if someone could give me at least a light in the right direction.
If your application is successful, then it's a good bet that you'll have more reads than writes - I only upload a photo once (write), but each of my friends reads it whenever they refresh their feed. Therefore you should optimize for fast reads, not fast writes, which points in the direction of a denormalized schema.
The problem here is that the amount of data you create could quickly get out of hand if you have a large number of users. Very large tables are hard on the db to query, so again there's a potential performance issue. (There's also the question of having enough storage, but that's much more easily solved).
If, as you suggest, you can delete rows after a certain amount of time, then this could be a good solution. You can reduce that amount of time (up to a point) as you grow and run into performance issues.
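To make the denormalized option concrete, here is a small sketch using Python's sqlite3 module. The follows and feed tables, their columns and the integer timestamps are assumptions chosen for the example, not a recommended schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE follows (follower_id INTEGER, followee_id INTEGER);
    CREATE TABLE feed (owner_id INTEGER, photo_id INTEGER, created_at INTEGER);
    CREATE INDEX feed_owner_time ON feed (owner_id, created_at);
""")
# Users 2 and 3 follow user 1.
conn.executemany("INSERT INTO follows VALUES (?, ?)", [(2, 1), (3, 1)])


def upload_photo(uploader_id, photo_id, now):
    """Fan out on write: one feed row per follower of the uploader."""
    conn.execute(
        "INSERT INTO feed (owner_id, photo_id, created_at) "
        "SELECT follower_id, ?, ? FROM follows WHERE followee_id = ?",
        (photo_id, now, uploader_id),
    )


def read_feed(user_id, limit=5):
    """Cheap read: a single indexed range scan, newest first."""
    rows = conn.execute(
        "SELECT photo_id FROM feed WHERE owner_id = ? "
        "ORDER BY created_at DESC LIMIT ?",
        (user_id, limit),
    )
    return [photo_id for (photo_id,) in rows]


def prune(older_than):
    """Drop feed rows past the retention window."""
    conn.execute("DELETE FROM feed WHERE created_at < ?", (older_than,))


upload_photo(1, photo_id=500, now=1000)
upload_photo(1, photo_id=501, now=2000)
print(read_feed(2))  # [501, 500]
prune(older_than=1500)
print(read_feed(2))  # [501]
```

Each upload costs one row per follower, while each feed read touches a single table through one index, which is the read-optimized trade described above.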
Regarding storing serialized objects, it's a good option if these objects are immutable (you won't change them after writing) and you don't need to index them or query on them. Note that if you denormalize your data, it probably means that you have a single table for the activity feed. In that case I see little gain in storing blobs.
If you're going the serialized objects way, consider using some NoSQL solution, such as CouchDB - they're better optimized for handling that kind of data, so in principle you should get better performance for the same hardware setup.
Note that I'm not suggesting that you move all your data to NoSQL - only for that part where it's a better solution.
Finally, a word of caution, spoken from experience: building an application that can scale is hard and takes time better spent elsewhere. You should spend your time worrying about how to get millions of users to your app before you worry about how you're going to serve those millions - the first is the more difficult problem. When you get to the point that you're hugely successful, you can re-architect and rebuild your application.
There are many options you can take
Add more hardware, memory, CPU -- enter cloud hosting
How does 24GB of memory sound? Most of your frequently accessed DB information can fit in memory.
Choose a host with expandable SSDs.
Use an event-based system in your application to write the "history" of all users. It would look like this: id, user_id, event_name, date, event_parameters. An example would be: 1, 8, CHANGED_PROFILE_PICTURE, 26-03-2011 12:34, <id of picture>. Most important of all, this table will be in memory, so you no longer need to worry about write performance. After the records are older than, say, 3 days, they can be purged into another table (not in memory) and included in the query results if the user chooses to go back that far. By having all this in one table you avoid having to do multiple queries and SELECTs to build up this information.
Consider using InnoDB for the history/feeds table.
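The events table with a purge step could be sketched like this, with Python's sqlite3 module standing in for the real database. The table names events_recent and events_archive, the integer timestamps and the helper functions are assumptions for illustration; the columns follow the list above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events_recent (id INTEGER PRIMARY KEY AUTOINCREMENT,
        user_id INTEGER, event_name TEXT, date INTEGER, event_parameters TEXT);
    CREATE TABLE events_archive (id INTEGER PRIMARY KEY,
        user_id INTEGER, event_name TEXT, date INTEGER, event_parameters TEXT);
""")

THREE_DAYS = 3 * 24 * 3600  # retention window for the hot table, in seconds


def record(user_id, event_name, date, params):
    conn.execute(
        "INSERT INTO events_recent (user_id, event_name, date, event_parameters) "
        "VALUES (?, ?, ?, ?)",
        (user_id, event_name, date, params),
    )


def purge(now):
    """Move events older than the window from the hot table to the archive."""
    cutoff = now - THREE_DAYS
    with conn:  # both statements in one transaction
        conn.execute(
            "INSERT INTO events_archive SELECT * FROM events_recent WHERE date < ?",
            (cutoff,),
        )
        conn.execute("DELETE FROM events_recent WHERE date < ?", (cutoff,))


def history(user_id, include_archive=False):
    sql = "SELECT event_name, date FROM events_recent WHERE user_id = ?"
    args = [user_id]
    if include_archive:
        sql += " UNION ALL SELECT event_name, date FROM events_archive WHERE user_id = ?"
        args.append(user_id)
    sql = f"SELECT event_name FROM ({sql}) ORDER BY date DESC"
    return [name for (name,) in conn.execute(sql, args)]


record(8, "CHANGED_PROFILE_PICTURE", 0, "42")
record(8, "UPLOADED_PHOTO", 4 * 24 * 3600, "43")
purge(now=4 * 24 * 3600)
print(history(8))                        # ['UPLOADED_PHOTO']
print(history(8, include_archive=True))  # ['UPLOADED_PHOTO', 'CHANGED_PROFILE_PICTURE']
```

The archive is only touched when the user asks to go further back, so the common case stays on the small, hot table.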
Good Resources to read
Exploring the software behind Facebook, the world’s largest site
Digg: 4000% Performance Increase by Sorting in PHP Rather than MySQL
Caching & Performance: Lessons from Facebook
I would probably start with a normalized schema so that you can write quickly and compactly. Then use non-transactional (no locking) reads to pull the information back out, making sure to use a cursor so that you can process the results as they come back instead of waiting for the entire result set. Since it doesn't sound like the information has any particularly critical implications, you don't really need to worry about a lot of the concerns that would normally push you away from non-transactional reads.
These kinds of problems are why NoSQL solutions are used these days. What I did in my previous projects is really simple: I don't keep the user->wall and user->history relations, which contain purely feed ids, in the database but in memory stores (my favorite is Redis). So on every insert I do 1 insert operation on the database and (n * read optimization) insert operations on the memory store. I design the memory store to optimize my reads: if I want to filter a user's history (or wall) for videos, I push the feed id to a list like user::{userid}::wall::videos.
Well, of course you can build the system purely in memory stores as well, but it's nice to have 2 systems each doing what it does best.
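A sketch of the key layout, with a plain Python dict of deques standing in for the memory store so that it runs without a server. The followers mapping, WALL_LIMIT and the publish function are assumptions for the example; with Redis the two appendleft calls would roughly correspond to LPUSH, plus LTRIM to cap the list:

```python
from collections import defaultdict, deque

WALL_LIMIT = 100  # hypothetical cap on how many feed ids each list keeps

# Stand-in for the memory store: key -> capped list of feed ids, newest first.
store = defaultdict(lambda: deque(maxlen=WALL_LIMIT))
followers = {1: [2, 3]}  # hypothetical: users 2 and 3 follow user 1


def publish(author_id, feed_id, kind):
    """After the single database insert, push the feed id to each read list."""
    for user_id in followers.get(author_id, []):
        store[f"user::{user_id}::wall"].appendleft(feed_id)
        store[f"user::{user_id}::wall::{kind}"].appendleft(feed_id)


publish(1, feed_id=900, kind="photos")
publish(1, feed_id=901, kind="videos")
print(list(store["user::2::wall"]))          # [901, 900]
print(list(store["user::2::wall::videos"]))  # [901]
```

Every filtered view gets its own list, so a read is a single key lookup and the filtering cost is paid once at write time.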
edit :
checkout these applications to get an idea:
http://retwis.antirez.com/
http://twissandra.com/
I'm reading more and more about NoSQL solutions and people suggesting them, however no one ever mentions drawbacks of such choice.
Most obvious for me is lack of transactions - imagine if you lost a few records every now and then (there are cases reporting this happens often).
But, what I'm surprised with is that no one mentions MySQL being used as NoSQL - here's a link for some reading.
In the end, no matter what solution you choose (relational database or NoSQL storage), they scale in a similar manner - by sharding data across the network (naturally, there are more choices but this is the most obvious one). Since NoSQL does less work (no SQL layer, so CPU cycles aren't wasted on interpreting SQL), it's faster, but it can hit the roof too.
As Elad already pointed out - building an app that's scalable from the get go is a painful process. It's better that you spend time focusing on making it popular and then scale it out.

memcached use cases

What are some use cases that will benefit from using memcached with a MySQL DB? I would guess it would be good for data that does not change much over time.
More specifically, if my data changes often then it's not worth using memcached, right?
Even more specifically, I am trying to use the DB as a data structure for a multiplayer game. So the records are going to change with every move the players make, and all players' views should be updated with the latest moves. So my app is getting read and write intensive. Trying to see what I can do about it. If I use memcached, for every write we read 3 times max, since 4 players max can play the game at a time.
Thanks.
Pav
Use case: a webshop with a lot of products. These products are assigned to various pages, and per product a user gets to see certain specs. The specs are fetched with a "getSpec" function. This is expensive: one query each time.
If we put these in memcached, it's much quicker. Every time someone changes something about the product, you just update memcached.
So if your data changes, it can still be worth it! Not everything might change at once.
edit: In your case, you could make your write also update memcached: no stale cache. But that's just a random thought; I don't know if making your writes heavier like that has any disadvantages. This would essentially mean you're running everything from memcached, and are just using your DB as a sort of backup :)
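The "write also updates memcached" idea is usually called a write-through cache. A minimal sketch, with two plain dicts standing in for the MySQL table and for memcached; the function names and the game-state layout are made up for the example:

```python
database = {}  # stand-in for the MySQL table holding game state
cache = {}     # stand-in for memcached
db_reads = 0


def save_move(game_id, state):
    """Write-through: update the database, then refresh the cache entry."""
    database[game_id] = state
    cache[game_id] = state


def load_state(game_id):
    """Serve from the cache; fall back to the database on a miss."""
    global db_reads
    if game_id not in cache:
        db_reads += 1
        cache[game_id] = database[game_id]
    return cache[game_id]


save_move("game-1", {"turn": 1})
views = [load_state("game-1") for _ in range(3)]  # the other three players
print(views[0], db_reads)  # {'turn': 1} 0
```

With this pattern the three reads that follow each move never reach the database, at the cost of one extra cache write per move.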
Caching is a tradeoff between speed and (potentially) stale data. You have to determine if the speed gain is appropriate given your own use cases.
We cache everything that doesn't require real-time data. Some things that are typically cached: Reports, user content, entire pages (though you may consider caching these to disk via some other system), etc..
Our API allows clients to query for huge amounts of data. We use memcached to store that for quick paging on the client's end.
If you plan ahead, you can set up your application to cache most everything and just invalidate parts of the cache as needed (for instance, when some data in your db is updated).
It's going to depend on how often "often" is and how busy your app is. For example, if you have a piece of data that changes hourly, but that data is queried 500 times per hour, it would probably make sense to cache it even though it changes relatively frequently.
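The hourly example can be simulated in a few lines. This is a sketch with dicts standing in for the database and for memcached, using the numbers from the paragraph above (one change and 500 reads per hour); the key name and functions are hypothetical:

```python
database = {"report": "v1"}  # stand-in for the real table
cache = {}                   # stand-in for memcached
db_queries = 0


def get(key):
    """Cache-aside read: query the database only on a miss."""
    global db_queries
    if key not in cache:
        db_queries += 1
        cache[key] = database[key]
    return cache[key]


def update(key, value):
    """Write to the database and invalidate the cached copy."""
    database[key] = value
    cache.pop(key, None)


for hour in range(3):
    update("report", f"v{hour + 2}")  # the data changes once per hour
    for _ in range(500):              # and is read 500 times in that hour
        get("report")

print(db_queries)  # 3 database reads for 1500 requests
```

Even though the value changes every hour, only the first read after each change reaches the database.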

how to do fast read data and write data in mysql?

Hi Friends
I am using a MySQL DB for one of my products. About 250 schools have signed up for it so far, which comes to about 1,500,000 insertions per hour and about 12,000,000 insertions per day. I think my current setup, just a single server, may crash within hours, and the read load is about the same as the write load. How can I make the DB server crash-free? The main problem I am facing now is that both writing and reading data are slow. How can I overcome that? It is very difficult for me to find a solution. Guys, please help me. Which model would be a good one for solving this?
It is difficult to get both fast reads and writes simultaneously. To get fast reads you need to add indexes. To get fast writes you need to have few indexes. And to get both to be fast they must not lock each other.
Depending on your needs, one solution is to have two databases. Write new data to your live database and every so often when it is quiet you can synchronize the data to another database where you can perform queries. The disadvantage of this approach is that data you read will be a little old. This may or may not be a problem depending on what it is you need to do.
~500 inserts per second is nothing to sneeze at indeed.
For a flexible solution, you may want to implement some sort of sharding. Probably the easiest solution is to separate schools into groups upfront and store data for different groups of schools on different servers. E.g., data for schools 1-10 is stored on server A, schools 11-20 on server B, etc. This is almost infinitely scalable, assuming that there are few relationships between data from different schools.
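The range-based split could be routed in the application with something as small as this. The server names and the group size of 10 are placeholders taken from the example above, not a recommendation:

```python
SCHOOLS_PER_SHARD = 10
SHARDS = ["server-a", "server-b", "server-c"]  # hypothetical host names


def shard_for(school_id):
    """Map school ids 1-10 to the first server, 11-20 to the second, and so on."""
    index = (school_id - 1) // SCHOOLS_PER_SHARD
    if not 0 <= index < len(SHARDS):
        raise ValueError(f"no shard configured for school {school_id}")
    return SHARDS[index]


print(shard_for(1), shard_for(10), shard_for(11), shard_for(25))
# server-a server-a server-b server-c
```

Every query then has to carry the school id so that it can be sent to the right server, which is why this works best when data from different schools is rarely joined.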
Also you could just try throwing more horsepower at the problem and invest into a RAID of SSD drives and, assuming that you have enough processing power, you should be OK. Of course, if it's a huge database, the capacity of SSD drives may not be enough.
Finally, see if you can cut down on the number of insertions, for example by denormalizing the database. Say, instead of storing attendance for each student in a separate row put attendance of the entire class as a vector in a single row. Of course, such changes will heavily limit your querying capabilities.
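To illustrate the attendance example: a sketch comparing one row per student with one row per class, where attendance is packed into a string of 0s and 1s in roster order. The class name, roster and encoding are assumptions for the example:

```python
students = ["ann", "bob", "cy", "dee"]  # hypothetical class roster
present = {"ann": True, "bob": False, "cy": True, "dee": True}

# Normalized: one row (one INSERT) per student.
rows_per_student = [("class-7b", "2011-03-26", s, present[s]) for s in students]

# Denormalized: one row per class, attendance packed in roster order.
vector = "".join("1" if present[s] else "0" for s in students)
row_per_class = ("class-7b", "2011-03-26", vector)

print(len(rows_per_student), "inserts vs 1:", row_per_class)
# 4 inserts vs 1: ('class-7b', '2011-03-26', '1011')

# The cost: "was bob present?" now needs decoding in the application.
bob_present = row_per_class[2][students.index("bob")] == "1"
print(bob_present)  # False
```

The insert count drops by the class size, but the database can no longer answer per-student questions with a plain WHERE clause.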
My laid back advice is:
Build your application lightweight. Don't use a high-level database abstraction layer like Active Record. They suck at scaling.
Learn a lot about mysql performance.
Learn about mysql replication.
Learn about load balancing.
Learn about in memory caches. (memcached)
Hire an administrator (with decent mysql knowledge) or web app performance guru/consultant.
The concrete strategy depends on your application and how it is used. Mysql replication may or may not be appropriate (the same applies to the sharding strategy mentioned above). But it's a rather simple way to achieve some scaling, because it doesn't impact your application design too much. In-memory caches can keep some load away from your databases, but they take some work to apply and come with trade-offs. In the end you need a good overall understanding of how to handle a database-driven application under heavy load. If you have a tight deadline, add external manpower, because you won't do this right within 6 weeks without experience.