I have a table which stores the location of my user very frequently. I want to query this table frequently and return the newest rows I haven't read from.
What would be the best-practice way to do this? My ideas are:
Add a boolean read flag, query all results where this is false, return them and then update them ALL. This might slow things down with the extra writes
Save the id of the last read row on the client side, and query for rows greater than this. Only issue here is that my client could lose their place
Some stream of data
There will eventually be multiple users and readers of the locations, so this will need to scale somewhat.
If what you have is a SQL database storing rows of things, I'd suggest something like option 2.
What I would probably do is keep a timestamp rather than an ID, and an index on that (a clustered index on MSSQL, or a similar construct so that new rows are physically sorted by time). Then just query for anything newer than that.
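For illustration, here is a minimal sketch of that in Go, assuming a hypothetical locations table with an indexed created_at column (table, column, and driver choices are invented for the example):

    package locations

    import (
        "database/sql"
        "time"

        // With this driver you'd add parseTime=true to the DSN so created_at
        // scans into time.Time.
        _ "github.com/go-sql-driver/mysql"
    )

    type Location struct {
        UserID    int64
        Lat, Lng  float64
        CreatedAt time.Time
    }

    // fetchNewer returns rows created after the last timestamp the client saw,
    // plus the new watermark to remember for the next poll.
    func fetchNewer(db *sql.DB, since time.Time) ([]Location, time.Time, error) {
        rows, err := db.Query(
            `SELECT user_id, lat, lng, created_at
               FROM locations
              WHERE created_at > ?
              ORDER BY created_at`, since)
        if err != nil {
            return nil, since, err
        }
        defer rows.Close()

        var out []Location
        newest := since
        for rows.Next() {
            var l Location
            if err := rows.Scan(&l.UserID, &l.Lat, &l.Lng, &l.CreatedAt); err != nil {
                return nil, since, err
            }
            if l.CreatedAt.After(newest) {
                newest = l.CreatedAt
            }
            out = append(out, l)
        }
        return out, newest, rows.Err()
    }

The returned watermark is what the client keeps (in memory or persisted) as "the last row I saw".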
That does have the "losing their place" issue. If the client MUST read every row published, then I'd either delete them after processing, or have a flag in the database to indicate that they have been processed. If the client just needs to restart reading current data, then I would do as above, but initialize the time with the most recent existing row.
If you MUST process every record and aren't limited to a database, what you're really talking about is a message queue. If you need to be able to access the individual data points after processing, then one step of the message handling could be to insert into a database for later querying (in addition to whatever else it is doing with the data read).
Edit per comments:
If there's no processing that needs to be done when receiving, and you just want to periodically update data, then you'd be fine with the solution of keeping the last received time or ID and not deleting the data. In that case I would recommend not persisting the last known ID/timestamp across restarts/reconnects, since you might end up inadvertently loading a bunch of data. Just reset it to the max when you restart.
On another note, when I did stuff like this I had good success using MQTT to transmit the data, and for the "live" updates. That is a pub/sub messaging protocol. You could have a process subscribing on the back end and forwarding data to the database, while the thing that wants the data frequently can subscribe directly to the stream of data for live updates. There's also a feature to hold onto the last published message and forward that to new subscribers so you don't start out completely empty.
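As a rough illustration of that setup, here is a sketch using the Eclipse Paho Go client; the broker address and topic layout are invented for the example, and the retained flag on publish is what gives new subscribers the last known value:

    package main

    import (
        "fmt"
        "log"

        mqtt "github.com/eclipse/paho.mqtt.golang"
    )

    func main() {
        opts := mqtt.NewClientOptions().
            AddBroker("tcp://broker.example.com:1883").
            SetClientID("location-live-reader")

        client := mqtt.NewClient(opts)
        if token := client.Connect(); token.Wait() && token.Error() != nil {
            log.Fatal(token.Error())
        }

        // Live consumer: subscribe to every user's location topic.
        sub := client.Subscribe("users/+/location", 1, func(_ mqtt.Client, m mqtt.Message) {
            fmt.Printf("%s: %s\n", m.Topic(), m.Payload())
        })
        sub.Wait()

        // The publisher side would do something like this; `true` marks the
        // message as retained, so new subscribers immediately get the last value.
        client.Publish("users/42/location", 1, true, `{"lat":51.5,"lng":-0.1}`)

        select {} // block forever in this sketch
    }

A separate subscriber process would do the "forward to the database" part, so the live consumers and the persistence path stay independent.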
We're developing a large chat app using a MySQL DB; conversations happen between only 2 people at any one time.
Looking for opinions as to which DB schema option would perform better.
Option 1. A traditional approach: insert one row per message/response.
Simply inserting into the DB with no prior lookups; however, rebuilding
the chat thread would require ORDER BY.
Option 2. Append each message to one single message field.
Selecting would be faster, as there would be no need for ORDER BY.
However, on every new message there would be a lookup first.
Also, with option 2 there would be fewer overall rows in the DB.
Any ideas?
This depends entirely on what you want to do with the field. In almost all cases, though, the first solution -- a separate row for each message -- is the right approach.
You would only want to use the second approach -- a single field for all of them -- if you were treating the conversation as a "blob". That is, if you did not want to select particular messages, search within a message, and so on. Essentially, the column would be an archive of the messages, rather than something as useful as a regular column.
I should also add that in a conversation, storing the messages in a single column loses the information of when the message was sent and who sent it. Of course, you could try to encapsulate that, say by using a JSON column. But why bother? SQL already has good mechanisms for representing such information.
I've participated in the design of two large systems (tens of millions of active users). Both used relational DBs for storage; one used MySQL. In both cases, one message per row was stored. Indexing by [thread_id, message_timestamp | message_sequential_number | message_auto_increment_id] was fine for both fetching and ordering.
Keep in mind conversations may grow to multiple megabytes. If you store the entire conversation in a single row, you will have to read/write the entire thing on each new message, or keep the entire thing in memory, just to show maybe the last 50 messages in most cases. Easily 200x inefficiency.
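As an illustration of the row-per-message layout, here is a sketch of a schema and the "last 50 messages" query, embedded as Go constants; table and column names are invented:

    package chatschema

    // One row per message, with a composite index so the last N messages of a
    // thread can be fetched and ordered cheaply.
    const createMessages = `
    CREATE TABLE messages (
        id         BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
        thread_id  BIGINT UNSIGNED NOT NULL,
        sender_id  BIGINT UNSIGNED NOT NULL,
        body       TEXT NOT NULL,
        created_at DATETIME NOT NULL,
        PRIMARY KEY (id),
        KEY idx_thread_time (thread_id, created_at)
    )`

    // Last 50 messages of one conversation; the (thread_id, created_at) index
    // serves both the WHERE filter and the ORDER BY.
    const lastMessages = `
    SELECT sender_id, body, created_at
      FROM messages
     WHERE thread_id = ?
     ORDER BY created_at DESC
     LIMIT 50`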
On the other hand, if you feel adventurous, take a look at Cassandra. It's designed for efficient storage of an entire conversation in one record.
We have been tracking user login events for a while now in a MongoDB collection. Each event contains the userID, datetime, and a couple other fundamental attributes about the event.
For a new feature, we want to present a graph of these login events, with different groups representing cohorts related to the user who did the event. Specifically, we want to group by the "Graduation Year" attribute of the user.
In our event log, we do not record the Graduation Year of the user who's logging in, so we cannot easily query that directly. We see two ways to go forward, plus a third "in-between" option:
Option 1: Instead of making a single MongoDB query to get the logins, we make that query PLUS a second one to our relational DB to get the secondary user data we require, and merge the two together.
We could either query for all the users, load them into memory, and loop through the events, or go through the events, collect only the user IDs that actually logged in, and query for those specific user IDs. (Then loop again, merging them in; a rough sketch of this merge follows this option's downsides.)
The post-processing could be done on the server-side or we could send all the data to the client. (Currently our plan is to just send the raw event data to the client for processing into the graph.)
Upsides: The event log is made to track events. User "Graduation Year" is not relevant to the event in question; it's relevant to the user who did the event. This seems to separate concerns more properly. As well, if we later decide we want to group on a different piece of metadata (let's say: male vs female), it's easy to just join that data in as well.
Downsides: Part of the beauty of our event log is that it can quickly spit out tons of aggregate data that's ready to use. If there are 10,000 users, we may have 100,000 logins. It seems crazy to need to loop through 100,000 logins whenever this data is requested anew (as in, not cached).
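Here is that rough sketch of the in-application merge, with the two fetches assumed to have happened already; types and field names are invented:

    package logingraph

    import "time"

    // Login is the shape of one event pulled from the event log (illustrative).
    type Login struct {
        UserID int64
        At     time.Time
    }

    // distinctUserIDs collects the user IDs that actually logged in, so the
    // second query only has to load those users.
    func distinctUserIDs(logins []Login) []int64 {
        seen := make(map[int64]struct{})
        var ids []int64
        for _, l := range logins {
            if _, ok := seen[l.UserID]; !ok {
                seen[l.UserID] = struct{}{}
                ids = append(ids, l.UserID)
            }
        }
        return ids
    }

    // loginsByGradYear merges the two result sets in memory: gradYearByUser is
    // whatever the relational query returned for those IDs.
    func loginsByGradYear(logins []Login, gradYearByUser map[int64]int) map[int][]Login {
        grouped := make(map[int][]Login)
        for _, l := range logins {
            year, ok := gradYearByUser[l.UserID]
            if !ok {
                continue // no metadata for this user; skip or bucket separately
            }
            grouped[year] = append(grouped[year], l)
        }
        return grouped
    }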
Option 2: We can write a script that does a one-time load of all the events (presumably in batches), then requests the user metadata and merges it in, rewriting the event log to include the relevant data.
Upsides: The event log is our single point of interaction when loading the data. Client requests all the logins; gets 100,000 rows; sorts them and groups them according to Graduation Year; [Caches it;] and graphs it. Will have a script ready to re-add more data if it came to that, down the road.
Downsides: We're essentially rewriting history. We're polluting our event log with secondary data that isn't explicitly about the event we claim to be tracking. Need to rewrite or modify the script to add more data that we didn't know we wanted to track, if we had to, down the road.
Option 3: We replicate the Users table in MongoDB, perhaps only as-needed (say, when an event's metadata is unavailable), and do a join (I guess that's a "$lookup" in Mongo) to this table.
Upsides: MongoDB does the heavy lifting of merging the data.
Downsides: We need to replicate, and somehow keep up to date, a secondary collection of our users' relevant metadata. I don't think MongoDB's $lookup works like a join in MySQL, and maybe it isn't really any more performant at all? I'd look into this before we implemented it.
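For reference, a hypothetical sketch of what a $lookup-based version of option 3 might look like with the official MongoDB Go driver; collection and field names (logins, users, userId, graduationYear) are assumptions for the example:

    package logingraph

    import (
        "context"
        "log"

        "go.mongodb.org/mongo-driver/bson"
        "go.mongodb.org/mongo-driver/mongo"
    )

    // countLoginsByGradYear joins each login to its user via $lookup and then
    // groups by graduation year.
    func countLoginsByGradYear(ctx context.Context, db *mongo.Database) error {
        pipeline := mongo.Pipeline{
            {{Key: "$lookup", Value: bson.D{
                {Key: "from", Value: "users"},
                {Key: "localField", Value: "userId"},
                {Key: "foreignField", Value: "_id"},
                {Key: "as", Value: "user"},
            }}},
            {{Key: "$unwind", Value: "$user"}},
            {{Key: "$group", Value: bson.D{
                {Key: "_id", Value: "$user.graduationYear"},
                {Key: "logins", Value: bson.D{{Key: "$sum", Value: 1}}},
            }}},
        }

        cur, err := db.Collection("logins").Aggregate(ctx, pipeline)
        if err != nil {
            return err
        }
        defer cur.Close(ctx)

        for cur.Next(ctx) {
            var row bson.M
            if err := cur.Decode(&row); err != nil {
                return err
            }
            log.Printf("grad year %v: %v logins", row["_id"], row["logins"])
        }
        return cur.Err()
    }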
For the sake of estimation, let's just say that any given visitor to our site will never have to load more than 100,000 logins and 10,000 users.
For what it's worth, Option #2 seems most preferable to me, even though it involves rewriting history, for performance reasons. Although I am aware that, at some point, if we were sending a user's browser multiple years of login data (that is, all 100,000 imaginary logins), maybe that's already too much data for their browser to process and render quickly, and perhaps we'd already be better off grouping it and aggregating it as some sort of regularly-scheduled process on the backend. (I don't know!)
As data warehouses go, 100K rows is quite small.
Performance in a DW depends on building and maintaining "Summary Tables". These make a pre-determined set of possible queries very efficient, without having to scan the entire 'Fact' table. My discussion of summary tables (in MySQL): http://mysql.rjweb.org/doc.php/summarytables
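As a small illustration of the summary-table idea, assuming the login events have landed in a relational fact table (all names invented, MySQL syntax, embedded as Go constants):

    package dwsummary

    // One row per day per graduation year; the graph reads this table instead
    // of scanning 100K raw login events each time.
    const createSummary = `
    CREATE TABLE login_daily_summary (
        login_date      DATE NOT NULL,
        graduation_year SMALLINT NOT NULL,
        login_count     INT NOT NULL,
        PRIMARY KEY (login_date, graduation_year)
    )`

    // Re-run this after each incremental load, covering whole days, so each
    // affected day's count is recomputed from scratch.
    const refreshSummary = `
    INSERT INTO login_daily_summary (login_date, graduation_year, login_count)
    SELECT DATE(l.logged_in_at), u.graduation_year, COUNT(*)
      FROM logins l
      JOIN users  u ON u.id = l.user_id
     WHERE l.logged_in_at >= ?  -- start of the first day to (re)build
     GROUP BY DATE(l.logged_in_at), u.graduation_year
    ON DUPLICATE KEY UPDATE login_count = VALUES(login_count)`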
I want to limit my users to 25k requests per hour/day/whatever.
My first idea was to simply use MySQL and have a column in the users table where I would store the request count, incrementing it each time the user makes a request. The problem with this approach is that sometimes you end up with concurrent writes to the column and you get a deadlock from MySQL, so this isn't really a good way to go about it, is it?
Another way would be, instead of incrementing a counter column, to insert log records into a separate table and then count those records for a given timespan, but this way you can easily end up with a table of millions of records, and the query can be too slow.
When using an RDBMS, another aspect to take into consideration is that on each request you'd have to count the user's quota from the database, and this can take time with either of the above-mentioned methods.
My second idea: use something like Redis/memcached (not sure of the alternatives or which of them is faster) and store the request counters there. This would be fast enough to query and increment the counters, certainly faster than an RDBMS, but I haven't tried it with huge amounts of data, so I am not sure how it will perform just yet.
My third idea: keep the quota data in memory in a map, something like map[int]int, where the key would be the user_id and the value would be the quota usage, and protect map access with a mutex. This would be the fastest solution of all, but what do you do if for some reason your app crashes? You lose all the data about the number of requests each user made. One way would be to catch the app when crashing, loop through the map, and update the database. Is this feasible?
Not sure if any of the above is the right approach, but I am open to suggestions.
I'm not sure what you mean by "get a deadlock from mysql" when you try to update a row at the same time. But a simple UPDATE rate_limit SET count = count + 1 WHERE user_id = ? should do what you want.
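One way to extend that into a per-hour limiter is an upsert plus a read-back. A sketch in Go, assuming a rate_limit table with a UNIQUE key on (user_id, window_start); all names are invented:

    package ratelimit

    import "database/sql"

    // allow bumps the per-user counter for the current hour and reports whether
    // the request is still under the limit.
    func allow(db *sql.DB, userID int64, limit int) (bool, error) {
        const window = `DATE_FORMAT(NOW(), '%Y-%m-%d %H:00:00')`

        // Atomic upsert: either create the row for this hour or increment it.
        _, err := db.Exec(
            `INSERT INTO rate_limit (user_id, window_start, request_count)
             VALUES (?, `+window+`, 1)
             ON DUPLICATE KEY UPDATE request_count = request_count + 1`, userID)
        if err != nil {
            return false, err
        }

        var count int
        err = db.QueryRow(
            `SELECT request_count FROM rate_limit
              WHERE user_id = ? AND window_start = `+window, userID).Scan(&count)
        if err != nil {
            return false, err
        }
        return count <= limit, nil
    }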
Personally I have had great success with Redis for doing rate limiting. There are lots of resources out there to help you understand the appropriate approach for your use case. Here is one I just glanced at that seems to handle things correctly: https://www.binpress.com/tutorial/introduction-to-rate-limiting-with-redis/155. Using pipelines (MULTI) or Lua scripts may make things even nicer.
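A minimal fixed-window version of that in Go with the go-redis client (key layout invented for the example): INCR a per-user, per-hour key inside a pipeline and let old keys expire.

    package ratelimit

    import (
        "context"
        "fmt"
        "time"

        "github.com/redis/go-redis/v9"
    )

    // allowRedis increments the counter for the current hour's window and
    // reports whether the caller is still under the limit.
    func allowRedis(ctx context.Context, rdb *redis.Client, userID int64, limit int64) (bool, error) {
        key := fmt.Sprintf("rate:%d:%s", userID, time.Now().UTC().Format("2006010215"))

        pipe := rdb.TxPipeline()
        incr := pipe.Incr(ctx, key)
        pipe.Expire(ctx, key, time.Hour) // old windows clean themselves up
        if _, err := pipe.Exec(ctx); err != nil {
            return false, err
        }
        return incr.Val() <= limit, nil
    }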
You can persist your map[int]int to an RDBMS or just to the file system from time to time, and in a deferred function. You can even use it as a cache instead of Redis. It will surely be faster than connecting to a third-party service on every request. You could also store counters on the user side, simply in cookies. A smart user can clear cookies, of course, but is that really so dangerous? And you could include some identification info in the cookies to make clearing them inconvenient.
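A sketch of that map-with-mutex approach with periodic persistence, so a crash loses at most one flush interval of counts; persist is a hypothetical hook that could write to the DB or a file:

    package ratelimit

    import (
        "sync"
        "time"
    )

    type Quota struct {
        mu     sync.Mutex
        counts map[int]int // user_id -> requests in the current window
    }

    func NewQuota() *Quota { return &Quota{counts: make(map[int]int)} }

    // Inc bumps the user's counter and reports whether they are under the limit.
    func (q *Quota) Inc(userID, limit int) bool {
        q.mu.Lock()
        defer q.mu.Unlock()
        q.counts[userID]++
        return q.counts[userID] <= limit
    }

    // Snapshot copies the map under the lock so persistence can run without
    // blocking request handling.
    func (q *Quota) Snapshot() map[int]int {
        q.mu.Lock()
        defer q.mu.Unlock()
        cp := make(map[int]int, len(q.counts))
        for k, v := range q.counts {
            cp[k] = v
        }
        return cp
    }

    // FlushEvery periodically hands a snapshot to the persist hook.
    func (q *Quota) FlushEvery(d time.Duration, persist func(map[int]int)) {
        t := time.NewTicker(d)
        go func() {
            for range t.C {
                persist(q.Snapshot())
            }
        }()
    }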
I've got a scenario where Data Stream B is dependent on Data Stream A. Whenever there is a change in Data Stream A, Stream B needs to be re-processed. So a common process is required to identify changes across data streams and trigger the re-processing tasks.
Is there a good way to do this besides triggers?
Your question is rather unclear, and I think any answer depends very heavily on what your data looks like, how you load it, how you can identify changes, whether you need to show multiple versions of one fact or dimension value to users, etc.
Here is a short description of how we handle it, it may or may not help you:
1. We load raw data incrementally daily, i.e. we load all data generated in the last 24 hours in the source system (I'm glossing over timing issues, but they aren't important here).
2. We insert the raw data into a loading table; that table already contains all data that we have previously loaded from the same source.
3. If rows are completely new (i.e. the PK value in the raw data is new) they are processed normally.
4. If we find a row where we already have the PK in the table, we know it is an updated version of data that we've already processed.
5. Where we find updated data, we flag it for special processing and re-generate any data depending on it (this is all done in stored procedures; see the sketch after this list).
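Here is that sketch of steps 3-5 as a single upsert plus a flag column, in MySQL-flavoured SQL embedded as Go constants; table and column names are invented, and the real implementation uses stored procedures:

    package loading

    // New PKs are inserted normally; existing PKs are overwritten and flagged
    // so dependent data can be re-generated.
    const upsertRaw = `
    INSERT INTO loading_orders (order_id, payload, loaded_at, needs_reprocess)
    VALUES (?, ?, NOW(), 0)
    ON DUPLICATE KEY UPDATE
        payload         = VALUES(payload),
        loaded_at       = NOW(),
        needs_reprocess = 1`

    // The re-processing job then picks up whatever was flagged.
    const pickFlagged = `
    SELECT order_id FROM loading_orders WHERE needs_reprocess = 1`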
I think you're asking how to do step 5, but it depends on the data that changes and what your users expect to happen. For example, if one item in an order changes, we re-process the entire order to ensure that the order-level values are correct. If a customer address changes, we have to re-assign him to a new sales region.
There is no generic way to identify data changes and process them, because everyone's data and requirements are different and everyone has a different toolset and different constraints and so on.
If you can make your question more specific then maybe you'll get a better answer, e.g. if you already have a working solution based on triggers then why do you want to change? What problem are you having that is making you look for an alternative?
I have noticed that using something like delayed_job without a UNIQUE constraint on a table column can still create double entries in the DB. I had assumed delayed_job would run jobs one after another. The Rails app runs on Apache with Phusion Passenger. I am not sure if that is why this happens, but I would like to make sure that every item in the queue is persisted to AR/the DB one after another, in sequence, and that there is never more than one write to this DB table at the same time. Is this possible? What would be some of the issues I would have to deal with?
update
The race conditions arise because an AJAX API is used to send data to the application. The application receives batches of data, and each batch is identified as belonging together by a Session ID (SID). In the end, the final state of the database has to reflect the latest, most up-to-date AJAX PUT request to the API. Sometimes requests arrive at exactly the same time for the same SID, so I need a way to make sure they aren't all persisted at the same time, but one after the other, or that only the last one sent by AJAX request to the API wins.
I hope that makes my particular use-case easier to understand...
You can lock a specific table (or tables) with the LOCK TABLES statement.
In general I would say that relying on this is poor design and will likely lead to scalability problems down the road, since you're creating a bottleneck in your application flow.
With your further explanations, I'd be tempted to add some extra columns to the table used by delayed_job, with a unique index on them. If (for example) you only ever wanted 1 job per user you'd add a user_id column and then do
something.delay(:user_id => user_id).some_method
You might need more attributes if the pattern is more sophisticated, e.g. there are lots of different types of jobs and you only wanted one per person, per type, but the principle is the same. You'd also want to be sure to rescue ActiveRecord::RecordNotUnique and deal with it gracefully.
For non delayed_job stuff, optimistic locking is often a good compromise between handling the concurrent cases well without slowing down the non concurrent cases.
If you are worried about multiple processes writing to the 'same' rows - as in multiple users updating the same order_header row - I'd suggest you set a marker bound to current_user.id on the row once /order_headers/:id/edit is called, and remove it again once the current_user releases the row, either by updating or canceling the edit.
Your use case (from your description) seems a bit different to me, so I'd suggest you leave it to the DB: with a fairly recent (as in post-5.1) MySQL, you'd add a trigger/function that does the actual update, and there you could implement logic similar to the above - some marker bound to the sequenced job ID, of sorts.