Rails - Hot Swap Tables - MySQL

I'm getting daily dumps for a table (let's say a students table) from an external source. In order to reduce downtime while the table is being truncated and reloaded with the new data, I'm planning to maintain two copies of this table (students_1 and students_2).
Both of these need to be mapped to the Student model on an alternating daily basis. So if today I am using data from students_1, then tomorrow, once data has been loaded into students_2, I'll need to switch seamlessly to that one.
So my questions are:
1) Is this approach good enough, or is there a better one?
2) For hot-swapping tables, is it fine to just maintain a file indicating the current table being used, and then call set_table_name via a method which reads this file? Is there a more elegant solution?

You can do it as part of your data loading strategy; I wouldn't mess with storing table names or using non-standard table names. After the data is done loading, execute a table rename command instead: it is done atomically and should not interrupt your app.
RENAME TABLE students TO students_secondary_temp, students_secondary TO students, students_secondary_temp TO students_secondary;
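In Rails terms, the whole reload can live in one task: keep the Student model pointed at students permanently, bulk-load the dump into the idle students_secondary copy, and finish with the atomic rename. A minimal sketch, assuming a hypothetical students:reload rake task and with the dump-loading step left to fill in:

# lib/tasks/students_reload.rake (hypothetical task name)
namespace :students do
  desc "Load the daily dump into the idle copy, then swap it in atomically"
  task reload: :environment do
    conn = ActiveRecord::Base.connection

    # 1. Refill the idle copy while the live `students` table keeps serving reads
    conn.execute("TRUNCATE TABLE students_secondary")
    # ... bulk-load the daily dump into students_secondary here (LOAD DATA INFILE, inserts, etc.) ...

    # 2. Swap the tables in one atomic statement; readers never see a missing table
    conn.execute(<<~SQL)
      RENAME TABLE students TO students_secondary_temp,
                   students_secondary TO students,
                   students_secondary_temp TO students_secondary
    SQL
  end
end

Because the swap is a single RENAME TABLE statement, the app never sees a half-loaded table and no table-name bookkeeping is needed.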

Related

How to scale a MySQL table to keep historic data without increasing table size

I have a questionnaire in my app, from which I am creating data recording which user submitted it and at what time (I have to apply further processing on the last object/questionnaire per user). This data is saved in my server's MySQL DB. As this questionnaire is open to all my users and will be submitted multiple times, I do not want new entries to be created every time for the same user, because this would increase the size of the table (the user count could be around 10M). But I also want to keep the old data as a history for later processing.
Now I have this option in mind:
Create two tables: one main table to keep new objects and one history table to keep history objects. Whenever a questionnaire is submitted, it will create a new entry in the history table but update the existing entry in the main table.
So, is there any better approach to this and how do other companies tackle such situations?
I think you should go through the SCD (Slowly Changing Dimension) concepts and decide which type is the better approach for you.
Reading up on the different SCD types should point you to the best way for yourself.
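The two-table layout you describe is essentially SCD Type 4: a current table plus a separate, append-only history table. A minimal sketch of the write path, assuming hypothetical QuestionnaireResponse and QuestionnaireResponseHistory models backed by those two tables:

# Hypothetical models backed by the main (current) and history tables
class QuestionnaireResponse < ApplicationRecord; end          # one row per user
class QuestionnaireResponseHistory < ApplicationRecord; end   # append-only history

def record_submission(user_id, answers)
  ActiveRecord::Base.transaction do
    # Every submission is appended to the history table...
    QuestionnaireResponseHistory.create!(
      user_id: user_id, answers: answers, submitted_at: Time.current
    )
    # ...while the main table keeps exactly one (the latest) row per user
    response = QuestionnaireResponse.find_or_initialize_by(user_id: user_id)
    response.update!(answers: answers, submitted_at: Time.current)
  end
end

SCD Type 2, by contrast, would keep every version in a single table and mark each row with valid_from/valid_to dates; that keeps writes simpler but lets the main table grow with every submission.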

Database Design: Difference between using boolean fields and duplicate tables

I have to design a database schema for an application I'm building. I will be using MySQL. In this application, users enter data and it gets saved in the database, obviously. However, this data is not accessible to the public until the user publishes it. Currently, I have one column for storing all the data. I was wondering if a boolean field in this table that indicates whether the data has been published is a good idea. Or is it much better design to create one table for saved data and one table for published data, and move the saved data to the published table when the user presses Publish?
What are the advantages and disadvantages of using each one and is one of them considered better design than the other?
Case: Binary
They are about equal. Use this as a learning exercise -- Implement it one way; watch it for a while, then switch to the other way.
(same) Space: Since a row exists exactly once, neither option is 'better'.
(favor 1 table) When "publishing" it takes a transaction to atomically delete from one table and insert into the other.
(favor 2 tables) Certain SELECTs will spend time filtering out records with the other value for published. (This applies to deleted, embargoed, approved, and a host of other possible boolean flags.)
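A minimal sketch of the single-table option, assuming a hypothetical Post model; the index keeps the cost of filtering on the flag small, and publishing becomes a one-column update rather than a cross-table move:

# Hypothetical migration and model names
class AddPublishedToPosts < ActiveRecord::Migration[5.1]
  def change
    add_column :posts, :published, :boolean, null: false, default: false
    add_index  :posts, :published
  end
end

class Post < ApplicationRecord
  scope :published, -> { where(published: true) }   # what the public sees
  scope :drafts,    -> { where(published: false) }  # only visible to the owner
end

# Publishing flips the flag in place; no transaction across two tables needed
post.update!(published: true)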
Case: Revision history
If there are many revisions of a record, then two tables, Current data and History, are better. That is because the 'important' queries involve fetching only the Current data.
(PARTITIONs are unlikely to help in either case.)

Best approach to exchanging data dumps between organizations

I am working on a project where I will receive student data dumps once a month. The data will be imported into my system. The initial import will be around 7k records. After that, I don't anticipate more than a few hundred a month. However, there will also be existing records that will be updated as the student changes grades, etc.
I am trying to determine the best way to keep track of what has been received, imported, and updated over time.
I was thinking of setting up a hosted MySQL database with a script that imports the SFTP dump into a table that includes a creation_date and a modification_date field. My thought was that the person performing the extraction could connect to the MySQL DB and run a query on the imported table each month to get the differences before the next extraction.
Another thought I had was to create a new received table every month for each data dump. Then I would run the query for the differences.
Note: the importing system is legacy and will accept imports using a utility and unique CSV-type files, so that probably rules out options like XML.
Thank you in advance for any advice.
I'm going to assume you're tracking students' grades in a course over time.
I would recommend a two table approach:
Table 1: transaction level data. Add-only. New information is simply appended on. Sammy got a 75 on this week's quiz, Beth did 5 points extra credit, etc. Each row is a single transaction. Presumably it has the student's name/id, the value being added, maybe the max possible value or some weighting factor, and of course the timestamp added.
All of this just keeps adding to a never-ending (in theory) table.
Table 2: summary table, rebuilt at some interval. This table does a simple aggregation on the first table, processing the transactional scores into a global one. Maybe it's a simple sum, maybe it's a weighted average, maybe you have something more complex in mind.
This table has one row per student (per course?). You want this to be rebuilt nightly. If you're lazy, you just DROP/CREATE/INSERT. If you're worried about data loss, you just INSERT and add a timestamp so you can have snapshots going back.
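A sketch of that nightly rebuild, with hypothetical score_transactions / score_summaries table and column names (this is the lazy truncate-and-refill variant):

# lib/tasks/rebuild_summary.rake (hypothetical task name)
namespace :scores do
  desc "Rebuild the per-student summary from the append-only transaction table"
  task rebuild_summary: :environment do
    conn = ActiveRecord::Base.connection
    conn.execute("TRUNCATE TABLE score_summaries")
    # One summary row per student, aggregated from the transaction log
    conn.execute(<<~SQL)
      INSERT INTO score_summaries (student_id, total_score, entries, rebuilt_at)
      SELECT student_id, SUM(score), COUNT(*), NOW()
      FROM score_transactions
      GROUP BY student_id
    SQL
  end
end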

Create MySQL tables dynamically based on a criterion in Ruby on Rails 5

I am new to Rails and I am trying to create a web app where you scrape some HTML from a page and store it in a database in order to compare it to a later version, e.g. the price of a product changed. The way I want to make it work is to create a new table every time you scrape something from a domain that's new.
So basically every domain has its own table for changes. I know how to create tables with migrations, but how do you dynamically create a table when a new domain is added?
The recommended "relational database" way to do this is to have a single table and relate each row in it to its source. For page snapshots you can often hash the content to test for duplicated data, and a UNIQUE index on your content hash can automatically prevent those sorts of inserts.
If portions of the page update but you're not interested in them, like advertising blocks, you can use a tool like Nokogiri to pre-process and strip out that content before hashing and saving.
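For example, the pre-processing plus hashing could look like this (the Snapshot model name and the CSS selectors are just placeholders):

require "digest"
require "nokogiri"

# Strip blocks we don't care about (ads, scripts) before hashing, so cosmetic
# changes inside them don't register as a "new" version of the page
def content_fingerprint(html)
  doc = Nokogiri::HTML(html)
  doc.css("script, .advert").remove   # hypothetical selectors for unwanted blocks
  Digest::SHA256.hexdigest(doc.to_html)
end

begin
  # snapshots table has a UNIQUE index on content_hash
  Snapshot.create!(domain: domain,
                   raw_html: html,
                   content_hash: content_fingerprint(html))
rescue ActiveRecord::RecordNotUnique
  # content unchanged since the last scrape; nothing new to store
end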
Now if this is just part of a pipeline where you're capturing pages with the express intent of extracting price information later, you may not need a database at all for that part of the process. You could funnel the raw page data into a queue like RabbitMQ and have workers process it, boiling it down to the price data, which is all you insert in the database.
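A rough sketch of that pipeline with the Bunny gem (queue name, payload shape, and the extract_price helper are all placeholders):

require "bunny"
require "json"

conn = Bunny.new   # assumes a reachable RabbitMQ broker
conn.start
channel = conn.create_channel
queue   = channel.queue("scraped_pages", durable: true)

# Producer side: hand the raw page to the queue instead of the database
channel.default_exchange.publish(
  { "domain" => domain, "html" => html }.to_json,
  routing_key: queue.name, persistent: true
)

# Worker side: boil each page down to the price and store only that
queue.subscribe(manual_ack: true, block: true) do |delivery_info, _props, payload|
  page  = JSON.parse(payload)
  price = extract_price(page["html"])   # hypothetical parsing helper
  Price.create!(domain: page["domain"], amount: price)
  channel.ack(delivery_info.delivery_tag)
end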
If you need to preserve the page snapshots for diagnostic or historical reasons then a table will work. To save on size you can explore using an ARCHIVE type table. These are append-only, you can't edit them, but they are compact and perform well.
You could periodically TRUNCATE a table of that sort to clear out old data so you're not keeping junk around forever.
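If you do keep snapshots in MySQL, the ARCHIVE engine can be requested straight from a migration. One caveat: ARCHIVE tables only allow an index on the auto-increment column, so the UNIQUE content-hash index mentioned above would have to live on a regular InnoDB table instead. A sketch with a hypothetical page_snapshots table:

class CreatePageSnapshots < ActiveRecord::Migration[5.1]
  def change
    # ARCHIVE storage: compressed and append-only (rows can't be updated or
    # deleted individually, but TRUNCATE still works for periodic clean-outs)
    create_table :page_snapshots, options: "ENGINE=ARCHIVE" do |t|
      t.string   :domain
      t.text     :raw_html
      t.datetime :scraped_at
    end
  end
end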

Using Redis as a Key/Value store for an activity stream

I am in the process of creating a simple activity stream for my app.
The current technology layer and logic is as follows:
All data relating to an activity is stored in MySQL, and a list of all activity IDs is kept in Redis for every user.
A user performs an action, the activity is stored directly in an 'activities' table in MySQL, and a unique 'activity_id' is returned.
An array of this user's 'followers' is retrieved from the database, and for each follower I push this new activity_id onto their list in Redis.
When a user views their stream, I retrieve the array of activity IDs from Redis based on their user ID. I then perform a simple MySQL WHERE IN($ids) query to get the actual activity data for all of these activity IDs.
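In code, what I'm doing is roughly this (key names and list length are illustrative):

redis = Redis.new

# On write: push the new activity id onto every follower's stream list
follower_ids.each do |follower_id|
  redis.lpush("stream:#{follower_id}", activity.id)
  redis.ltrim("stream:#{follower_id}", 0, 999)   # cap each stream at 1000 ids
end

# On read: grab the newest ids for this user, then hydrate them from MySQL
ids = redis.lrange("stream:#{user.id}", 0, 49)
activities = Activity.where(id: ids).order(created_at: :desc)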
This kind of setup should, I believe, be quite scalable, as the queries will always be very simple IN queries. However, it presents several problems.
Removing a Follower - If a user stops following someone, we need to remove all activity_ids that correspond to that user from their Redis list. This requires looping through all IDs in the Redis list and removing the ones that correspond to the removed user. This strikes me as quite inelegant; is there a better way of managing this?
'Archiving' - I would like to cap the Redis lists at a maximum length of, say, 1000 activity_ids, and also frequently prune old data from the MySQL activities table to prevent it from growing to an unmanageable size. Obviously the cap can be achieved by removing old IDs from the user's stream list when we add a new one. However, I am unsure how to go about archiving this data so that users can view very old activity data should they choose to. What would be the best way to do this? Or am I simply better off enforcing this limit completely and preventing users from viewing very old activity data?
To summarise: what I would really like to know is whether my current setup/logic is a good or bad idea. Do I need a total rethink? If so, what are your recommended models? If you feel all is okay, how should I go about addressing the two issues above? I realise this question is quite broad and all answers will be opinion-based, but that is exactly what I am looking for. Well-formed opinions.
Many thanks in advance.
Point 1 doesn't seem so difficult to perform (no looping):
DELETE Redis FROM Redis
JOIN activities ON Redis.activity_id = activities.id
  AND activities.user_id = 2
  AND Redis.user_id = 1;
Point 2: I'm not really sure about archiving. You could create archive tables every period and move old activities from the main table to an archive table periodically. It seems like a single, properly normalized activity table ought to be able to get pretty big, though. (Make sure any 'large' activity stores its bulky data in a separate table; the main activity table should be 'narrow', since it's expected to have a lot of entries.)
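A sketch of that periodic move, assuming an activities_archive table with the same columns as activities; doing the copy and the delete in one transaction means rows are never lost in between:

cutoff = 6.months.ago   # retention window is just an example
ActiveRecord::Base.transaction do
  # Copy old rows into the archive table...
  ActiveRecord::Base.connection.execute(<<~SQL)
    INSERT INTO activities_archive
    SELECT * FROM activities
    WHERE created_at < '#{cutoff.to_s(:db)}'
  SQL
  # ...then remove them from the main, "narrow" activities table
  Activity.where("created_at < ?", cutoff).delete_all
end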