I am working on a project where I will receive student data dumps once a month. The data will be imported into my system. The initial import will be around 7k records. After that, I don't anticipate more than a few hundred a month. However, existing records will also be updated as students change grades, etc.
I am trying to determine the best way to keep track of what has been received, imported, and updated over time.
I was thinking of setting up a hosted MySQL database with a script that imports the SFTP dump into a table that includes creation_date and modification_date fields. My thought was that the person performing the extraction could connect to the MySQL db and run a query on the imported table each month to get the differences before the next extraction.
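Roughly, I'm picturing something like this (table and column names are just placeholders, and I'm assuming MySQL 5.6+ so both date columns can default to CURRENT_TIMESTAMP):

-- Hypothetical staging table for the monthly dump
CREATE TABLE student_import (
    student_id        INT NOT NULL,
    grade_level       VARCHAR(10),
    creation_date     DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP,
    modification_date DATETIME NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
    PRIMARY KEY (student_id)
);

-- Everything new or changed since the last extraction (date is a placeholder)
SELECT *
FROM student_import
WHERE creation_date >= '2024-05-01'
   OR modification_date >= '2024-05-01';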
Another thought I had was to create a new received table every month for each data dump and then run the difference query against those.
Note: the importing system is legacy and will only accept imports via a utility and its own CSV-type files. So that probably rules out options like XML.
Thank you in advance for any advice.
I'm going to assume you're tracking students' grades in a course over time.
I would recommend a two table approach:
Table 1: transaction-level data. Add-only. New information is simply appended. Sammy got a 75 on this week's quiz, Beth did 5 points of extra credit, etc. Each row is a single transaction. Presumably it has the student's name/id, the value being added, maybe the max possible value or some weighting factor, and of course the timestamp it was added.
All of this just keeps adding to a never-ending (in theory) table.
Table 2: summary table, rebuilt at some interval. This table does a simple aggregation on the first table, processing the transactional scores into a global one. Maybe it's a simple sum, maybe it's a weighted average, maybe you have something more complex in mind.
This table has one row per student (per course?). You want this to be rebuilt nightly. If you're lazy, you just DROP/CREATE/INSERT. If you're worried about data-loss, you just INSERT and add a timestamp so you can have snapshots going back.
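A minimal sketch of the two tables and the nightly rebuild, assuming a simple sum per student per course (column names and the aggregation are placeholders to adapt):

-- Table 1: add-only transaction log
CREATE TABLE grade_transactions (
    id         BIGINT AUTO_INCREMENT PRIMARY KEY,
    student_id INT NOT NULL,
    course_id  INT NOT NULL,
    points     DECIMAL(6,2) NOT NULL,
    max_points DECIMAL(6,2),
    added_at   TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

-- Table 2: summary with a snapshot timestamp so old builds are kept
CREATE TABLE grade_summary (
    student_id   INT NOT NULL,
    course_id    INT NOT NULL,
    total_points DECIMAL(10,2),
    snapshot_at  TIMESTAMP NOT NULL
);

-- Nightly rebuild: append a fresh aggregation of the transaction log
INSERT INTO grade_summary (student_id, course_id, total_points, snapshot_at)
SELECT student_id, course_id, SUM(points), NOW()
FROM grade_transactions
GROUP BY student_id, course_id;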
Related
I am working on a MariaDB database in which there is a table called event that can easily accumulate 6-8 million records a month. The rows themselves don't hold much information, since they are simple records of events that occur, composed of 3 columns.
So I would like to prevent possible corruption of the table. My initial idea is, at the end of each month, to create a new historical table named in the format event_month_year and dump the corresponding month's data into it. Example:
event_jan_2022
event_feb_2022
event_mar_2022
That way, if I want the current month's information, I just query the event table, but if I want data from previous months, I query the corresponding table by its name in the format I just explained (event_month_year).
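For example, the end-of-month step could be something like this (created_at is just a placeholder for whatever the event's timestamp column is called):

-- Archive January 2022 into its own table, then remove it from the live table
CREATE TABLE event_jan_2022 LIKE event;

INSERT INTO event_jan_2022
SELECT * FROM event
WHERE created_at >= '2022-01-01' AND created_at < '2022-02-01';

DELETE FROM event
WHERE created_at >= '2022-01-01' AND created_at < '2022-02-01';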
I don't know whether this approach is correct, optimal, or elegant. Surely there are other, better ways to do it, so I'd like to hear opinions.
Thanks in advance.
I'm getting daily dumps for a table (let's say a students table) from an external source. In order to reduce downtime while the table is being truncated and reloaded with the new data, I'm planning to maintain two copies of this table (students_1 and students_2).
Both of these need to be mapped to the Student model on an alternating daily basis. So if today I am using data from students_1, then tomorrow, once data has been loaded into students_2, I'll need to switch seamlessly to that one.
So my questions are
1) Is this approach good enough, or is there a better one?
2) For hot-swapping tables, is it fine to just maintain a file indicating the current table in use and then call set_table_name via a method that reads this file? Is there a more elegant solution?
You can do it as part of your data loading strategy; I wouldn't mess with storing table names or using non-standard table names. After the data is done loading, execute a table rename instead: it is done atomically and should not interrupt your app.
RENAME TABLE students TO students_secondary_temp,
             students_secondary TO students,
             students_secondary_temp TO students_secondary;
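The full cycle is then just: refresh the offline copy, then run the rename. Roughly (file path and CSV layout are placeholders):

-- 1. Refresh the copy that is currently offline
TRUNCATE TABLE students_secondary;

LOAD DATA INFILE '/path/to/daily_dump.csv'
INTO TABLE students_secondary
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

-- 2. Swap it into place with the atomic RENAME TABLE above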
I am new to MySQL partitioning, so any example will be appreciated.
I am trying to create a sort of ageing mechanism for data that is distributed across several MyISAM tables.
My question will actually include several sub-questions.
The relevant tables are:
The first table contains raw data arriving at a high input rate (each record has an auto-incremented id).
The second table contains processed results; there is one result record for every raw data record (the result record stores the auto-incremented id of its source raw data record).
Questions:
I need to partition the raw data table and the result data table in the same way, so that each partition in both holds only 10 weeks of data (each raw data record contains a unix timestamp field). How do I do that? Can someone write a small example case for two such tables?
I want to be able to change the 10 weeks constraint on the fly.
I want that whenever the current partition fills up or a new partition is created, the previous partition (from 10 weeks before) is deleted automatically.
I don't want the auto-increment id integer to overflow. As far as I understand, the ids are unique per partition only, so if I'm not wrong the auto-increment id will start from zero in the next partition? But what if the previous partition still exists: will I have duplicate ids, and how do I know to reference only the latest id when I present a result record?
I want to load raw data using LOAD DATA INFILE instead of multiple inserts. Is MySQL's partitioning functionality affected?
And the last question: would you suggest some other approach to implement the ageing mechanism? (I am writing a Java product that processes around 1 GB of raw data per day and stores the results in MySQL.)
It's hard to give a real answer on this question since it depends on your data. But let me give you some things to think about.
I assume we're talking about some kind of logs with recent data (so not spanning multiple years). You can partition by range. You could add one field to your table with the year/week number (i.e. 201201, 201202, etc.). If this question is related to your question about importing into multiple tables, you can easily do this in that import script.
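To make question 1 concrete, here is a rough example of two such tables partitioned by range on that year/week column (column names, partition boundaries, and the MyISAM engine are assumptions based on your description; note that the partitioning column has to be part of the primary key):

-- Raw data, partitioned into ~10-week slices by a yearweek column (e.g. 201201)
CREATE TABLE raw_data (
    id       BIGINT NOT NULL AUTO_INCREMENT,
    yearweek INT NOT NULL,
    ts       INT NOT NULL,              -- unix timestamp
    payload  VARCHAR(255),
    PRIMARY KEY (id, yearweek)          -- partition column must be in the key
) ENGINE=MyISAM
PARTITION BY RANGE (yearweek) (
    PARTITION p201210 VALUES LESS THAN (201211),
    PARTITION p201220 VALUES LESS THAN (201221),
    PARTITION p201230 VALUES LESS THAN (201231)
);

-- Results mirror the same scheme, keyed by the raw record's id
CREATE TABLE result_data (
    raw_id   BIGINT NOT NULL,
    yearweek INT NOT NULL,
    result   VARCHAR(255),
    PRIMARY KEY (raw_id, yearweek)
) ENGINE=MyISAM
PARTITION BY RANGE (yearweek) (
    PARTITION p201210 VALUES LESS THAN (201211),
    PARTITION p201220 VALUES LESS THAN (201221),
    PARTITION p201230 VALUES LESS THAN (201231)
);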
On the fly as in repartitioning your data on the fly (70 GB?)? I would not recommend it. But you could do it if you had the week number in there. If you later want to change it to something like 12 days, you could add a column for the date and partition by that.
Well, it won't be deleted automatically, but a cron job can handle that, right? Just check how many partitions there are, and if there are 3(?), delete the first one.
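Using the sketch above, that cron job boils down to something like:

-- Drop the oldest ~10-week slice and make room for the next one
ALTER TABLE raw_data DROP PARTITION p201210;
ALTER TABLE raw_data ADD PARTITION (PARTITION p201240 VALUES LESS THAN (201241));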
The field you partition on needs to be part of the primary key (if you want to use auto-increment). Therefore you can never fully rely on the auto-increment id alone. I don't see a way around this.
I'm not sure what you mean.
If your data is just logs in chronological order, then you might just use separate tables for each period. Then, before you start the new period (at 00:00), check the last id of the last table, create a new table, and set its auto-increment to that value + 1. That way your import decides when a new period begins, so it can easily be changed; the import script can use a small table in which it stores the next period.
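That roll-over might look roughly like this (table names are placeholders):

SELECT MAX(id) FROM raw_2012_week20;                  -- say this returns 123456
CREATE TABLE raw_2012_week21 LIKE raw_2012_week20;
ALTER TABLE raw_2012_week21 AUTO_INCREMENT = 123457;  -- continue the sequence at 123456 + 1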
LOAD DATA is really quite fast. I would just have two steps (in no particular order): LOAD DATA, and then 'DELETE ... WHERE date < 10 weeks ago'. Auto-increment will keep going for as long as the datatype you're using allows. If you wanted to be super careful you could push it back to zero periodically.
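Assuming the unix timestamp column is called ts, the age-out step is a one-liner:

DELETE FROM raw_data
WHERE ts < UNIX_TIMESTAMP(NOW() - INTERVAL 10 WEEK);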
Once the data is in the 'raw' table, run your routine to create the 'processed' table. We use a very similar process where I work. We keep a separate table that has 'write' and 'parse' pointers to all of our 'raw' tables. As new data comes in and gets parsed, the appropriate row pointers get set. If the 'raw' table gets truncated you can reset the 'write' pointer but leave the 'parse' pointer. (We store the offset in another table when this happens, just to be sure.)
And if I may recommend: creating an index column for each of the related columns can also enhance performance when deleting old data from multiple related tables, since you then compare index numbers rather than strings.
I wonder if your tables are being sorted or not.
We don't have an existing data warehouse, but we have customers (in OLTP) who have been with us for many years and made purchases. How can I populate a customer dimension and then "replay" all the age updates that have occurred over the years, so that the type 2 dimension will have all the updates for those customers?
I want to populate the fact table with sales and refer to the DimCustomerFK, and when our clients query the data I want those customers to have the correct age. If I don't make any changes, a customer will have the same age now as 10 years back when he placed his first order.
Any ideas how this can be made?
Interesting problem Patrik.
Some options:-
1) Design SQL to parse through your customer/transaction OLTP data to create a daily flat file of customer updates. You will end up with many thousands of fairly small files (obviously depending on the number of customers you have and the date range). Name them Customeryyyymmdd.csv. Then create an ETL suite to read in the flat files in forward date order and apply the type 2 changes, in order, to the DWH.
2) Build a very complex SQL query (I'm waving my hands around here as I don't know your data structures, so couldn't suggest how complex this would be) that creates an ordered customer change list that you can pass through an ETL SCD component record by record.
Either seems logically feasible given what you have said earlier; hopefully that gives you some ideas to consider towards a more concrete solution.
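To make option 2 a bit more concrete, here is a minimal sketch (MySQL-flavoured, and every name is an assumption about your model) of the type 2 change each replayed record would apply to the dimension:

-- Hypothetical type 2 customer dimension
CREATE TABLE DimCustomer (
    DimCustomerFK INT AUTO_INCREMENT PRIMARY KEY,
    CustomerID    INT NOT NULL,            -- business key from the OLTP system
    Age           INT,
    ValidFrom     DATE NOT NULL,
    ValidTo       DATE NOT NULL DEFAULT '9999-12-31'
);

-- For each change, replayed in date order (@change_date, @customer_id, @new_age):
UPDATE DimCustomer
SET ValidTo = @change_date
WHERE CustomerID = @customer_id AND ValidTo = '9999-12-31';

INSERT INTO DimCustomer (CustomerID, Age, ValidFrom)
VALUES (@customer_id, @new_age, @change_date);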
g/l
Mark.
I want to know the most optimized way to work with MySQL here.
I have a quite large database for orders. I have another table for previous orders; whenever an order is completed, it's removed from orders and moved to previous orders.
Should I keep using this method, or put them all in the same table and add a column that flags whether it's a current or a previous order?
kfir
In general, moving data between tables is a sensitive process - data can easily get lost or corrupted. The question is how you query this table: if you search and filter through it often, you want to keep the table relatively small. If you only access it to read specific rows (via a direct primary key), the size of the table is less crucial, and then I would advise keeping a flag.
A third option you might want to consider, is having 3 tables - one for ongoing orders, one for historic orders, and one with the order details. That third table can be long and static, while you query the first two tables.
Lastly - at some point, you might want to move the historic data out of the table altogether. Maybe you would keep a cron job that runs once a month and moves data older than 6 months out to a different, remote database.
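That monthly job might boil down to something like the following (column name, cutoff, and archive table are assumptions, and shipping the rows to a truly remote database would need an export step on top of this):

-- Move orders completed more than 6 months ago into an archive table
INSERT INTO orders_archive
SELECT * FROM orders
WHERE completed_at < NOW() - INTERVAL 6 MONTH;

DELETE FROM orders
WHERE completed_at < NOW() - INTERVAL 6 MONTH;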