Optimal DB structure for creating user segments - mysql

I want to create a segmentation engine and can't figure out the optimal DB or DB structure for the task.
Currently I use MySQL as my primary DB, but the segmentation engine is a separate software component and thus can have a different DB if applicable.
Basically, I have 10 million unique users identified by a UserID (integer). An administrator of the segmentation engine dynamically creates segments with some predefined rules (like age range, geolocation, transaction history, etc.). The application should execute the rules of each segment periodically (once every 15 minutes) to get the current list of all users that belong to the segment (which can be up to 1 million users each) and store it.
Later, the application exposes an API that allows external systems to use the segmentation functionality, namely:
1. Get list of all segments that a particular UserID belongs to.
2. Get list of all UserIDs that a particular segment contains.
Note that because segments need to be updated very frequently (every 15 minutes), "maintaining" the segments causes a massive volume of transactions in the DB, where non-applicable users must be removed and new ones added all the time.
I have considered several approaches so far:
1. Plain MySQL, with a table of users belonging to segments (SegmentID, UserID). (This approach has two drawbacks: storage space, and the constant delete/insert/update churn in MySQL, which will degrade InnoDB performance by introducing page splitting.)
2. Using the JSON data type in MySQL, with a table (UserID, Segments), where Segments is a JSON array of SegmentIDs. (Drawbacks here are slow search and slow updates.)
3. Using Redis with sets (UserID, Segments), where UserID is the key and Segments is the set of SegmentIDs. (The drawback here is that there is no simple way to search by SegmentID.)
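To make option 3 concrete, here is a minimal sketch of what I mean, extended with a reverse set per segment, which would be one way around the search-by-SegmentID drawback (redis-py; the key names are just illustrative):

    import redis

    r = redis.Redis(decode_responses=True)

    def assign(user_id, segment_id):
        # Keep both directions of the mapping in sync via one MULTI/EXEC pipeline.
        pipe = r.pipeline()
        pipe.sadd(f"user:{user_id}:segments", segment_id)
        pipe.sadd(f"segment:{segment_id}:users", user_id)
        pipe.execute()

    def remove(user_id, segment_id):
        pipe = r.pipeline()
        pipe.srem(f"user:{user_id}:segments", segment_id)
        pipe.srem(f"segment:{segment_id}:users", user_id)
        pipe.execute()

    def segments_of_user(user_id):       # API call 1
        return r.smembers(f"user:{user_id}:segments")

    def users_in_segment(segment_id):    # API call 2
        return r.smembers(f"segment:{segment_id}:users")

The cost is double the writes and storage, and the 15-minute refresh would have to diff old and new membership rather than rebuild each segment from scratch.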
Has anyone worked on a similar task and can provide any guidance?
Any feedback will be appreciated, so I can be pointed in a direction I can research further.

I think you can use Elasticsearch for this task.
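A hedged sketch of what that could look like with the Python client (the index name, field names, and client-version details are assumptions, not a definitive design):

    from elasticsearch import Elasticsearch  # elasticsearch-py client

    es = Elasticsearch("http://localhost:9200")

    # One document per user; segment membership is a plain array field.
    es.index(index="users", id=42, document={"segments": [1, 7, 19]})

    # 1. All segments a given UserID belongs to: fetch the document.
    print(es.get(index="users", id=42)["_source"]["segments"])

    # 2. All UserIDs in a given segment: a term query on the array field.
    resp = es.search(index="users", query={"term": {"segments": 7}}, size=10000)
    print([hit["_id"] for hit in resp["hits"]["hits"]])

For segments approaching a million users you would page through the results with search_after or a point-in-time scroll rather than one oversized request.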

Related

Using Redis to cache data that is in use in a Real Time Single Page App

I've got a web application; it has the normal features, user settings, etc. These are all stored in MySQL along with the user.
A particular part of the application is a table of data for the user to edit.
I would like to make this table real-time across multiple users, i.e. multiple users can open the page, edit the data, and see changes made in real time by other users editing the table.
My thinking is to cache the data for the table in Redis, then perform all the actions in Redis, like keeping all the clients up to date.
Once all the connections have closed for a particular table, save the data back to MySQL for persistence. I know Redis can be used as a persistent NoSQL database, but as RAM is limited and all my other data is stored in MySQL, MySQL seems the better option.
Is this a correct use case for Redis? Is my thinking correct?
It depends on the scalability: the number of records you are going to deal with and the structure you are going to use for saving them.
I will discuss the pros and cons of using Redis. The decision is up to you.
Advantages of using Redis:
1) It can handle heavy writes and reads better than MySQL.
2) It has flexible structures (hash, sorted set, etc.) which can localise your writes instead of blocking the whole table.
3) Read queries will be much faster, as they are served from cache.
Disadvantages of using Redis:
1) Maintaining transactions. What happens if two users try to access a particular cell at the same time? Do you have the right data structure in Redis to handle this case?
2) What if the data is huge? It will exceed the memory limit.
3) What happens if there is an outage?
4) If you plan for persistence in Redis, say using RDB or AOF: will you handle those 5-10 seconds of downtime?
Things to focus on:
1) How much data are you going to deal with? Assume a table of 10,000 rows with 10 columns takes 1 GB of memory in Redis (just an assumption; the actual memory use will be much less). If your Redis cluster is 10 GB, then you can handle only 10 such tables. Do the math on roughly how many rows × columns × live tables you are going to work with, and the memory they will consume.
2) Redis optimises memory for data within a certain range (http://redis.io/topics/memory-optimization). Say you decide to save the table with hashes; you have two options: one hash per column, or one hash per row. The second option is the optimal one, because storing 1,000 hashes (rows) of 20 fields each (columns) will take about 10 times less memory than storing it the other way. Also, this way, if a cell is changed you can localise the write to a hash of only 20 values (see the sketch after this list).
3) Loading the data back into MySQL: how often is this going to happen? If the write-back load is high, MySQL will begin to perform worse for other operations.
4) How are you going to notify multiple clients of changes? Will you load the whole table or only the part that changed? Loading only the changed part is the optimal choice. In that case, where will you maintain the list of cells which have been altered?
Evaluate your system against these questions and you will find out whether it is feasible.
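To illustrate point 2, a minimal sketch of the hash-per-row layout, assuming redis-py and invented key names:

    import redis

    r = redis.Redis(decode_responses=True)

    # One small hash per row: an edit to one cell only touches that row's hash,
    # which is the "localised write" from point 2 above.
    def save_row(table_id, row_id, row):          # row: dict of column -> value
        r.hset(f"table:{table_id}:row:{row_id}", mapping=row)

    def update_cell(table_id, row_id, column, value):
        r.hset(f"table:{table_id}:row:{row_id}", column, value)

    def load_row(table_id, row_id):
        return r.hgetall(f"table:{table_id}:row:{row_id}")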

MySQL Cluster and big blobs

I decided to use MySQL Cluster for a bigger project of mine. Besides storing documents in a simple table scheme with only three indexes, a need arises to store information 1 MB to 50 MB in size. This information will be serialized custom tables that are aggregates of data feeds.
How will this information be stored, and how many nodes will it hit? I understand that with a replication factor of three this information will be written three times, and I understand that there are coordinator nodes (named differently), so I ask myself: what will be the impact of storing this information?
Am I right in understanding that for a read the cluster will send those blobs across three servers (one that requested the information, one coordinator, and one data server), and for a write it is five (1 + 1 + 3)?
Generally speaking, MySQL Cluster only supports NoOfReplicas=2 right now; using 3 or 4 is generally not supported and not very well tested. This is noted here:
http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-ndbd-definition.html#ndbparam-ndbd-noofreplicas
"The maximum possible value is 4; currently, only the values 1 and 2 are actually supported."
As also described at the above URL, the data is stored with the same number of replicas as this setting. So with NoOfReplicas=2 you get 2 copies. These are stored on the ndbd (or ndbmtd) nodes; the management nodes (ndb_mgmd) act as coordinators and the source of configuration. They do not store any data, and neither does the mysqld node.
If you had 4 data nodes, you would have your entire dataset split in half and then each half is stored on 2 of the 4 data nodes. If you had 8 data nodes, your entire data set would be split into four parts and then each part stored on 2 of the 8 data nodes.
This process is sometimes known as "partitioning". When a query runs, the work is split up and sent to each partition, which processes it locally as much as possible (for example, by removing non-matching rows using indexes; this is called engine condition pushdown, see http://dev.mysql.com/doc/refman/5.6/en/condition-pushdown-optimization.html). The results are then aggregated in mysqld for final processing (which may include calculations, joins, sorting, etc.) and returned to the client. The ndb_mgmd nodes do not get involved in the actual data processing in any way.
Data is by default partitioned by the PRIMARY KEY, but you can change this to partition by other columns. Some people use this to ensure that a given query is only processed on a single data node much of the time, for example by partitioning a table to ensure all rows for the same customer are on a single data node rather than spread across them. This may be better, or worse, depending on what you are trying to do.
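As an illustration of partitioning by another column, a hedged sketch (the table and column names are made up; run this against a mysqld node of the cluster):

    import mysql.connector  # assumes the mysql-connector-python package

    conn = mysql.connector.connect(host="sql-node", user="app",
                                   password="app-password", database="test")
    cur = conn.cursor()

    # Hypothetical table: the partition key must be part of the primary key,
    # so every row for one customer_id lands in the same partition.
    cur.execute("""
        CREATE TABLE orders (
            customer_id INT NOT NULL,
            order_id    BIGINT NOT NULL,
            payload     VARBINARY(8000),
            PRIMARY KEY (customer_id, order_id)
        ) ENGINE=NDBCLUSTER
        PARTITION BY KEY (customer_id)
    """)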
You can read more about data partitioning and replication here:
http://dev.mysql.com/doc/refman/5.6/en/mysql-cluster-nodes-groups.html
Note that MySQL Cluster is really not ideal for storing such large data; in any case you will likely need to tune some settings and try hard to keep your transactions small. There are some specific extra limitations/implications of using BLOBs, which you can find discussed here:
http://dev.mysql.com/doc/mysql-cluster-excerpt/5.6/en/mysql-cluster-limitations-transactions.html
If you go ahead, I would run comprehensive tests to ensure it performs well under high load, and make sure you set up good monitoring and test your failure scenarios.
Lastly, I would also strongly recommend getting pre-sales support and a support contract from Oracle, as MySQL Cluster is quite a complicated product and needs to be configured and used correctly to get the best out of it. In the interest of disclosure, I work for Oracle in MySQL Support -- so you can take that recommendation as either biased or very well informed.

redis as write-back view count cache for mysql

I have a very high-throughput site for which I'm trying to store "view counts" for each page in a MySQL database (for legacy reasons they must ultimately end up in MySQL).
The sheer number of views makes it impractical to run SQL statements like "UPDATE ITEM SET VIEW_COUNT=VIEW_COUNT+1". There are millions of items, but most are viewed only a small number of times; others are viewed many times.
So I'm considering using Redis to gather the view counts, with a background thread that writes the counts to MySQL. What is the recommended method for doing this? There are some issues with the approach:
how often does the background thread run?
how does it determine what to write back to mySQL?
should I store a Redis KEY for every ITEM that gets hit?
what TTL should I use?
is there already some pre-built solution or powerpoint presentation that gets me halfway there, etc.
I have seen very similar questions on StackOverflow but none with a great answer...yet! Hoping there's more Redis knowledge out there at this point.
I think you need to step back and look at some of your questions from a different angle to get to your answers.
"how often does the background thread run?"
To answer this you need to answer these questions: How much data can you lose? What is the reason for the data being in MySQL, and how often is that data accessed? For example, if the DB is only needed to be consulted once per day for a report, you might only need it to be updated once per day. On the other hand, what if the Redis instance dies? How many increments can you lose and still be "ok"? These will provide the answers to the question of how often to update your MySQL instance and aren't something we can answer for you.
I would use a very different strategy for storing this in redis. For the sake of the discussion let us assume you decide you need to "flush to db" every hour.
Store each hit in hashes with a key name structure along these lines:
interval_counter:DD:HH
interval_counter:total
Use the page ID (such as the MD5 sum of the URI, the URI itself, or whatever ID you currently use) as the hash key, and do two increments on a page view: one for each hash. This provides you with a current total for each page and a subset of pages to be updated.
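A minimal sketch of that write path with redis-py, using the key names above (the exact DD:HH formatting is my assumption):

    import redis
    from datetime import datetime, timezone

    r = redis.Redis()

    def record_hit(page_id):
        now = datetime.now(timezone.utc)
        hour_key = now.strftime("interval_counter:%d:%H")   # interval_counter:DD:HH
        pipe = r.pipeline()
        pipe.hincrby(hour_key, page_id, 1)                  # this hour's subset
        pipe.hincrby("interval_counter:total", page_id, 1)  # running total per page
        pipe.execute()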
You would then have your cron job run a minute or so after the start of the hour to pull down all pages with updated view counts by grabbing the previous hour's hash. This provides you with a very fast means of getting the data to update the MySQL DB with, while avoiding any need to do math or play tricks with timestamps etc. By pulling data from a key which is no longer being incremented, you avoid race conditions due to clock skew.
You could set an expiration on the daily key, but I'd rather use the cron job to delete it when it has successfully updated the DB. This means your data is still there if the cron job fails or fails to be executed. It also provides the front-end with a full set of known hit counter data via keys that do not change. If you wanted, you could even keep the daily data around to be able to do window views of how popular a page is. For example if you kept the daily hash around for 7 days by setting an expire via the cron job instead of a delete, you could display how much traffic each page has had per day for the last week.
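A sketch of that cron job under the same assumptions (the ITEM table and VIEW_COUNT column come from the question; the ID column and connection details are my guesses):

    import redis
    import mysql.connector
    from datetime import datetime, timedelta, timezone

    r = redis.Redis(decode_responses=True)

    def flush_previous_hour():
        prev = datetime.now(timezone.utc) - timedelta(hours=1)
        key = prev.strftime("interval_counter:%d:%H")
        counts = r.hgetall(key)              # page_id -> hits in that hour
        if not counts:
            return
        conn = mysql.connector.connect(host="db", user="app",
                                       password="app-password", database="site")
        cur = conn.cursor()
        cur.executemany(
            "UPDATE ITEM SET VIEW_COUNT = VIEW_COUNT + %s WHERE ID = %s",
            [(int(n), page_id) for page_id, n in counts.items()],
        )
        conn.commit()
        r.delete(key)  # remove the hour's hash only after the DB update succeeded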
Executing the two HINCRBY operations, either individually or pipelined, still performs quite well, and is more efficient than doing calculations and munging data in code.
Now for the question of expiring the low-traffic pages vs. memory use. First, your data set doesn't sound like one which will require huge amounts of memory. Of course, much of that depends on how you identify each page. If you have a numerical ID, the memory requirements will be rather small. If you still wind up with too much memory, you can tune it via the config, and if need be you could even use a 32-bit compile of Redis for a significant reduction in memory use. For example, I used the data layout I describe in this answer to manage one of the ten busiest forums on the Internet, and it consumed less than 3 GB of data. I also stored the counters in far more "temporal window" keys than I am describing here.
That said, in this use case Redis is the cache. If you are still using too much memory after the above options, you could set an expiration on keys and add an expire command to each hit. More specifically, if you follow the above pattern you will be doing the following per hit:
hincr -> total
hincr -> daily
expire -> total
This lets you keep anything that is actively used fresh by extending its expiration every time it is accessed. Of course, to do this you'd need to wrap your display call to catch a null answer from HGET on the totals hash and populate it from the MySQL DB, then increment. You could even do both as an increment. This would preserve the above structure, and would likely be the same codebase needed to repopulate the Redis server from the MySQL DB if the Redis node ever needed it. For that you'll need to consider and decide which data source will be considered authoritative.
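A sketch of that display-side wrapper (the TTL value and the SELECT are assumptions):

    import redis

    r = redis.Redis(decode_responses=True)

    def get_total(page_id, cur):   # cur: an open MySQL cursor
        total = r.hget("interval_counter:total", page_id)
        if total is None:
            # Cache miss: repopulate from MySQL, the authoritative source here.
            cur.execute("SELECT VIEW_COUNT FROM ITEM WHERE ID = %s", (page_id,))
            row = cur.fetchone()
            total = row[0] if row else 0
            r.hset("interval_counter:total", page_id, total)
            # (Alternatively, "do both as an increment": HINCRBY the DB value in.)
        r.expire("interval_counter:total", 86400)  # refresh the hash's TTL per hit
        return int(total)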
You can tune the cron job's performance by modifying the interval in accordance with the data-integrity parameters you determined from the earlier questions. To get a faster-running cron job, you decrease the window. With this method, decreasing the window means you will have a smaller collection of pages to update. A big advantage here is that you don't need to figure out which keys you need to update and then go fetch them: you can do an HGETALL and iterate over the hash's keys to do the updates. This also saves many round trips by retrieving all the data at once. In either case, you will likely want to consider a second Redis instance, slaved to the first, to do your reads from. You would still do deletes against the master, but those operations are much quicker and less likely to introduce delays in your write-heavy instance.
If you need disk persistence of the Redis DB, then certainly put that on the slave instance. Otherwise, if you have a lot of data changing often, your RDB dumps will be running constantly.
I hope that helps. There are no "canned" answers, because to use Redis properly you need to think first about how you will access the data, and that differs greatly from user to user and project to project. Here I based the route taken on this description: two consumers accessing the data, one to display only and the other to update another data source.
Consolidation of my other answer:
Define a time interval in which the transfer from Redis to MySQL should happen, i.e. a minute, an hour, or a day. Define it in such a way that an identifying key can be obtained quickly and easily. This key must be ordered, i.e. a smaller time should give a smaller key.
Let it be hourly and the key be YYYYMMDD_HH for readability.
Define a prefix like "hitcount_".
Then for every time-interval you set a hash hitcount_<timekey> in redis which contains all requested items of that interval in the form ITEM => count.
The solution has two parts:
The actual page that has to count:
a) get the current $timekey, e.g. via date functions
b) get the value of $ITEM
c) send the Redis command HINCRBY hitcount_$timekey $ITEM 1
A cron job which runs in that given interval, not too close to the boundary of the intervals (for example: not at the full hour). This cron job:
a) extracts the current time key (for now it would be 20130527_08)
b) requests all matching keys from Redis with KEYS hitcount_* (those should be a small number)
c) compares every such hash against the current hitcount_<timekey>
d) if that key is smaller than the current key, processes it as $processing_key:
read all pairs ITEM => counter with HGETALL $processing_key as $item, $cnt
update the database with UPDATE ITEM SET VIEW_COUNT=VIEW_COUNT+$cnt WHERE ITEM=$item
delete that key from the hash with HDEL $processing_key $item
no need to DEL the hash itself; there are no empty hashes in Redis, as far as I have tried
If you want a TTL involved, say because the cleanup cron job may not be reliable (it might not run for many hours), then you could create the future hashes from the cron job with an appropriate TTL: that means for now we could create hash 20130527_09 with a TTL of 10 hours, 20130527_10 with a TTL of 11 hours, and 20130527_11 with a TTL of 12 hours. The problem is that you would need a pseudo-key, because empty hashes seem to be deleted automatically.
See EDIT3 for the current state of the answer.
I would write a key for every ITEM. A few tens of thousands of keys are definitely no problem at all.
Do the pages change very much? I mean, do you get a lot of pages that will never be called again? Otherwise I would simply:
increment the value for an ITEM on page request.
every minute or every 5 minutes, run a cron job that reads the Redis keys, reads the value (say 7), and reduces it with DECRBY ITEM 7. In MySQL, you then increment the value for that ITEM by 7 (as sketched below).
If you have a lot of pages/ITEMs which will never be called again, you could run a cleanup job once a day to delete keys with value 0. This should be locked against incrementing that key again from the website.
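A minimal sketch of that read-then-DECRBY step (redis-py; the WHERE clause is an assumption):

    import redis

    r = redis.Redis(decode_responses=True)

    def flush_item(item, cur):   # cur: an open MySQL cursor
        cnt = int(r.get(item) or 0)
        if cnt == 0:
            return
        cur.execute(
            "UPDATE ITEM SET VIEW_COUNT = VIEW_COUNT + %s WHERE ITEM = %s",
            (cnt, item),
        )
        # Subtract exactly what was written back, so hits that arrived
        # between the GET and now are not lost.
        r.decrby(item, cnt)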
I would set no TTL at all, so the values live forever. You could check the memory usage, but with current GBs of memory a lot of different possible pages will fit.
EDIT: INCR is very nice for that, because it sets the key if it was not set before.
EDIT2: Given the large number of different pages, instead of the slow KEYS * command you could use hashes with HINCRBY (http://redis.io/commands/hincrby). Still, I am not sure if HGETALL is much faster than KEYS *, and a hash does not allow a TTL for individual fields.
EDIT3: Oh well, sometimes the good ideas come late. It is so simple: just prefix the key with a timeslot (say day-hour), or make a hash with the name prefix "requests_". Then no overlap of delete and increment can happen! Every hour you take the possible keys with older "day_hour_*" values, update MySQL, and delete those old keys. The only condition is that your servers do not differ too much in their clocks, so use UTC and synchronized servers, and don't start the cron at x:01 but at x:20 or so.
That means: a called page converts a hit on ITEM1 at 23:37, May 26 2013 into hash 20130526_23, ITEM1: HINCRBY count_20130526_23 ITEM1 1
One hour later the list of keys count_* is checked, and all up to count_20130526_23 are processed (read the key values with HGETALL, update MySQL), then deleted one by one after processing (HDEL). After finishing, you check whether HLEN is 0 and DEL count_...
So you only have a small number of keys (one per unprocessed hour), which makes KEYS count_* fast, and then you process the actions of that hour. You can give the hash a TTL of a few hours, in case your cron is delayed, time-jumped, or down for a while, or something like that.

Medium-term temporary tables - creating tables on the fly to last 15-30 days?

Context
I'm currently developing a tool for managing orders and communicating between technicians and services. The industrial context is broadcast and TV. Multiple clients expecting media files each made to their own specs imply widely varying workflows even within the restricted scope of a single client's orders.
One client can ask one day for a single SD file and the next for a full-blown HD package containing up to fourteen files... In a MySQL db I am trying to store accurate information about all the small tasks composing the workflow, in multiple forms:
DATETIME values every time a task is accomplished, for accurate tracking
paths to the newly created files in the company's file system in VARCHARs
archiving background info in TEXT values (info such as user comments, e.g. when an incident happens and prevents moving forward, they can comment about it in this feed)
Multiply that by 30 different file types and this is way too much for a single table. So I thought I'd break it up by client: one table per client, so that any order only ever requires the use of that one table, which doesn't manipulate more than 15 fields. Still, this is a pretty rigid solution when a client has 9 different transcoding specs and a particular order only requires one. I figure I'd need to add a flag field for each transcoding field to indicate which ones are required for that particular order.
Concept
I then had this crazy idea: maybe I could create a temporary table that lasts while the order is running (which can range from about 1 day to 1 month). We rarely have more than 25 orders running simultaneously, so it wouldn't get too crowded.
The idea is to make a table tailored to each order, eliminating the need for flags and unnecessary, forever-empty fields. Once the order is complete, the table would get flushed, JSON-encoded, into a TEXT or BLOB so it can be restored later if changes need to be made.
Do you have experience with DBMSs (MySQL in particular) struggling under such practices, where they have existed? Does this sound like a viable option? I am happy to try (and have already started), but I am seeking advice on whether to keep going or stop right here.
Thanks for your input!
Well, of course that is possible to do. However, you cannot use MySQL temporary tables for such long-term storage; you will have to use "normal" tables and have some clean-up routine...
However, I do not see why that amount of data would be too much for a single table. If your queries start to run slowly due to the amount of data, then you should add some indexes to your database. I also see another con: it will be much harder to build reports later on. When you have 25 tables with the same kind of data, you will have to run 25 queries and merge the data.
I do not see the point, really. The same kinds of data should be in the same table.
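To make "the same kinds of data in the same table" concrete, a hedged sketch of a single task-log table with indexes (every name here is invented):

    import mysql.connector

    conn = mysql.connector.connect(host="db", user="app",
                                   password="app-password", database="orders")
    cur = conn.cursor()

    # One row per task event for all clients and orders, instead of one
    # table per client or per order.
    cur.execute("""
        CREATE TABLE order_tasks (
            id           BIGINT AUTO_INCREMENT PRIMARY KEY,
            order_id     INT NOT NULL,
            client_id    INT NOT NULL,
            task_type    VARCHAR(64) NOT NULL,  -- e.g. 'transcode_hd', 'qc_check'
            completed_at DATETIME NULL,
            file_path    VARCHAR(255) NULL,
            comments     TEXT NULL,
            INDEX idx_order (order_id),
            INDEX idx_client_type (client_id, task_type)
        )
    """)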

MySQL scalable data model

I'd like to get feedback on how to model the following:
Two main objects: collections and resources.
Each user has multiple collections. I'm not saving user information per se: every collection has a "user ID" field.
Each collection comprises multiple resources.
Any given collection belongs to only one user.
Any given resource may be associated with multiple collections.
I'm committed to using MySQL for the time being, though there is the possibility of migrating to a different database down the road. My main concern is scalability with the following assumptions:
The number of users is about 200 and will grow.
On average, each user has five collections.
About 30,000 new distinct resources are "consumed" daily: when a resource is consumed, the application associates that resource with every collection that is relevant to it. Assume that typically a resource is relevant to about half of the collections (200 users × 5 collections = 1,000 collections in total), so that's 30,000 × (1,000 / 2) = 15,000,000 inserts a day.
The collection and resource objects are both composed of about a half-dozen fields, some of which may reach lengths of 100 characters.
Every user has continual polling set up to periodically retrieve their collections and associated resources--assume that this happens once a minute.
Please keep in mind that I'm using MySQL. Given the expected volume of data, how normalized should the data model be? Would it make sense to store this data in a flat table? What kind of sharding approach would be appropriate? Would MySQL's NDB clustering solution fit this use case?
Given the expected volume of data, how normalized should the data model be?
Perfectly.
Your volumes are small. You're doing 10,000 to 355,000 transactions each day? Let's assume your peak usage is a 12-hour window. That's 0.23/sec up to 8/sec. Until you get to rates like 30/sec (over 1 million rows in a 12-hour period), you've got little to worry about.
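What "perfectly" could look like here, as a hedged sketch (all names and column sizes are assumptions):

    import mysql.connector

    conn = mysql.connector.connect(host="db", user="app",
                                   password="app-password", database="app")
    cur = conn.cursor()

    for ddl in (
        # One row per collection; the owner is just a field, as described.
        """CREATE TABLE collections (
               id      INT AUTO_INCREMENT PRIMARY KEY,
               user_id INT NOT NULL,
               name    VARCHAR(100) NOT NULL,
               INDEX idx_user (user_id)
           )""",
        # One row per distinct resource.
        """CREATE TABLE resources (
               id    BIGINT AUTO_INCREMENT PRIMARY KEY,
               title VARCHAR(100) NOT NULL
           )""",
        # Many-to-many: a resource may be associated with many collections.
        """CREATE TABLE collection_resources (
               collection_id INT NOT NULL,
               resource_id   BIGINT NOT NULL,
               PRIMARY KEY (collection_id, resource_id),
               INDEX idx_resource (resource_id)
           )""",
    ):
        cur.execute(ddl)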
Would it make sense to store this data in a flat table?
No.
What kind of sharding approach would be appropriate?
Doesn't matter. Pick any one that makes you happy.
You'll need to test these empirically. Build a realistic volume of fake data. Write some benchmark transactions. Run under load to benchmark the sharding alternatives.
Would MySQL's NDB clustering solution fit this use case?
It's doubtful. You can often create a large-enough single server to handle this load.
This doesn't sound anything like any of the requirements of your problem.
"MySQL Cluster is designed not to have any single point of failure. In a shared-nothing system, each component is expected to have its own memory and disk, and the use of shared storage mechanisms such as network shares, network file systems, and SANs is not recommended or supported."