Data pipeline proposal - MySQL

Our product has been growing steadily over the last few years, and we are now at a turning point as far as data size goes for some of our tables: we expect those tables to double or triple in size over the next few months, and even more so over the next few years. We are talking in the range of 1.4M rows now, so over 3M by the end of the summer and (since we expect growth to be exponential) around 10M by the end of the year. (M being million, not mega/1000.)
The table we are talking about is essentially a logging table. The application receives data files (CSV/XLS) on a daily basis and the data is transferred into said table. It is then used in the application for a specific amount of time - a couple of weeks or months - after which it becomes rather redundant. That is, if all goes well: if there is some problem down the road, the data in those rows can be useful to inspect for troubleshooting.
What we would like to do is periodically clean up the table, removing any number of rows based on certain criteria, but instead of actually deleting the rows, move them 'somewhere else'.
We currently use MySQL as our database, and the 'somewhere else' could be MySQL as well, but it could be anything. For other projects we have a Master/Slave setup where the whole database is replicated, but that's not what we want or need here. It's just a few tables, where the Master table would only ever get shorter and the Slave only bigger - not a one-to-one sync.
The main requirement for the secondary store is that the data should be easy to inspect and query when needed, either via SQL or another DSL, or just visual tooling. So we are not interested in backing up the data to one or more CSV files or another plain-text format, since that is not as easy to inspect: the logs would end up somewhere on S3 and we would need to download them and grep/sed/awk through them... We'd much rather have something database-like that we can consult.
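For concreteness, the sort of thing we have in mind is roughly the following sketch; the table, column and schema names are made up for illustration, and it assumes the archive is just a plain MySQL table with the same columns living in a separate schema:

SET @cutoff = NOW() - INTERVAL 3 MONTH;

START TRANSACTION;
-- Copy rows older than the cutoff into the archive table, then remove them
-- from the hot table. Could be run from cron or MySQL's event scheduler.
INSERT INTO archive_db.import_log
  SELECT * FROM import_log WHERE created_at < @cutoff;
DELETE FROM import_log WHERE created_at < @cutoff;
COMMIT;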
I hope the problem is clear?
For the record: while the solution can be anything, we prefer the simplest solution possible. It's not that we don't want Apache Kafka (for example), but then we'd have to learn it, install it and maintain it. Every new piece of technology adds to our stack, and the lighter the stack stays, the more we like it ;).
Thanks!
PS: we are not just being lazy here; we have done some research, but we thought it would be a good idea to get some more insight into the problem.

Related

Using a schedule in SQL without making the table huge

So I've designed a basic SQL database that imports data outputted by machines through SSIS into SQL, does a few transforms, and ends up with how many things we are producing every 15 minutes.
Now we want to be able to report on this per-operator. So I have another table with operators and operator numbers, and am trying to figure out how to track this, with the eventual goal of giving my boss charts and graphs of how his employees are doing.
Now the question:
I was going to format a table with the date, machine number, operator number and then a column for each 15-minute segment in a day, but that ended up being a million-plus data points a year, which will clearly get out of control.
Then I was thinking of date, machine number, user number, and start and stop times, but I couldn't figure out how to get it to roll over into the next day if a shift goes past midnight, or how to query against a time between the start/stop times - simple stuff I'm sure, but I'm new here. I need to use times instead of just a "shift" since that may change; people go home early, etc. Stuff happens.
So the question is: What would be best practice on how to format a table for a work schedule, and how can I query off of it as above?
First, a million rows a year isn't a lot. SQL databases regularly get into the billions of rows. The storage requirements compared to modern drive sizes are nothing. Properly indexed, performance won't be a problem.
In fact, I'd say to consider not even bothering with the time periods. Record each data point with a timestamp instead. Use SQL operators such as BETWEEN to get whatever periods you like. It's simpler. It's more flexible. It takes more space, but space isn't really an issue. And with proper indexing it won't be a performance issue. Use the money saved on developer time to buy better hardware for your database, like more RAM or an SSD. Or move to a cloud database.
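As a rough sketch (the table and column names here are invented), the timestamp-per-event approach could look like this:

-- One row per item produced, stamped with when it happened.
CREATE TABLE production_event (
    machine_id  INT      NOT NULL,
    operator_id INT      NOT NULL,
    produced_at DATETIME NOT NULL
);
CREATE INDEX ix_production_event_produced_at ON production_event (produced_at);

-- Any 15-minute (or other) window can then be summed on demand.
SELECT machine_id, operator_id, COUNT(*) AS items_produced
FROM   production_event
WHERE  produced_at BETWEEN '2024-03-01 08:00:00' AND '2024-03-01 08:14:59'
GROUP  BY machine_id, operator_id;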
Just make sure you architect your system to encapsulate the details of the schema, probably by using a model, and ensure that you have a way to safely change your schema, like by using migrations. Then if you need to re-architect your schema later you can do so without having to hunt down every piece of code that might use that table.
That said, there are a few simple things you could do to reduce the number of rows.
There's probably going to be a lot of periods when a thing doesn't produce anything. If nothing is produced during that period, don't store a row. If you just store the timestamp for each thing produced, these gaps appear normally.
You could save a small amount of space and performance by putting the periods in their own table and referencing them. So instead of every table having redundant start and end datetime columns, they'd have a single period column which referenced a period table that had start and end columns. While this would reduce some duplication, I'm not so sure this is worth the complexity.
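If you did want that lookup, a minimal sketch (names invented) would be something like the following; as noted, it may not be worth the extra join:

CREATE TABLE period (
    period_id  INT      NOT NULL PRIMARY KEY,
    start_time DATETIME NOT NULL,
    end_time   DATETIME NOT NULL
);

CREATE TABLE production_count (
    machine_id  INT NOT NULL,
    operator_id INT NOT NULL,
    period_id   INT NOT NULL,
    items       INT NOT NULL,
    FOREIGN KEY (period_id) REFERENCES period (period_id)
);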
In the end, before you add a bunch of complexity over hypothetical performance issues, do the simplest thing and benchmark it. Load up your database with a bunch of test data, see how it performs, and optimize from there.

MySQL - Database design for a large-scale real-world deployment

I would love to hear some opinions or thoughts on a MySQL database design.
Basically, I have a Tomcat server which receives different types of data from about 1000 systems out in the field. Each of these systems is unique and will be reporting unique data.
The data sent can be categorized as frequent and infrequent data. The infrequent data is only sent about once a day and doesn't change much - it is basically just configuration-based data.
Frequent data is sent every 2-3 minutes while the system is turned on, and represents the current state of the system.
This data needs to be stored in the database for each system and be accessible at any given time from a PHP page. Essentially, for any system in the field, a PHP page needs to be able to access all the data on that client system and display it. In other words, the database needs to reflect the state of the system.
The information itself is all text-based, and there is a lot of it. The config data (which doesn't change much) consists of key-value pairs, and there are currently about 100 of them.
My idea for the design was to have 100+ columns and one row for each system to hold the config data. But I am worried about having that many columns, mainly because it isn't very future-proof if I need to add columns later. I am also worried about insert speed if I do it that way. This might blow out to a 2000-row x 200-column table that gets accessed about 100 times a second, so I need to cater for this in my initial design.
I am also wondering if there are any design philosophies out there that cater for frequently changing and seldom-changing data based on the storage engine. This would make sense, as I want to keep INSERT/UPDATE time low, and I don't care too much about the SELECT time from PHP.
I would also love to know how to split up the data. I.e., if frequently changing data can be categorised in a few different ways, should I have a bunch of tables representing the data and join them on selects? I am worried about this because I will probably have to build a report to show common properties between all systems (i.e. show all systems with a certain condition).
I hope I have provided enough information here for someone to point me in the right direction; any help on the matter would be great. Or if someone has done something similar and can offer advice, I would be very appreciative. Thanks heaps :)
~ Dan
I've posted some questions in a comment. It's hard to give you advice about your rapidly changing data without knowing more about what you're trying to do.
For your configuration data, don't use a 100-column table. Wide tables are notoriously hard to handle in production. Instead, use a four-column table containing these columns:
SYSTEM_ID VARCHAR System identifier
POSTTIME DATETIME The time the information was posted
NAME VARCHAR The name of the parameter
VALUE VARCHAR The value of the parameter
The first three of these columns are your composite primary key.
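In MySQL DDL that could look like the following (the table name and column sizes here are guesses):

CREATE TABLE system_config (
    SYSTEM_ID VARCHAR(64)  NOT NULL,
    POSTTIME  DATETIME     NOT NULL,
    NAME      VARCHAR(128) NOT NULL,
    VALUE     VARCHAR(255) NOT NULL,
    PRIMARY KEY (SYSTEM_ID, POSTTIME, NAME)
);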
This design has the advantage that it grows (or shrinks) as you add to (or subtract from) your configuration parameter set. It also allows for the storing of historical data. That means new data points can be INSERTed rather than UPDATEd, which is faster. You can run a daily or weekly job to delete history you're no longer interested in keeping.
(Edit: if you really don't need history, get rid of the POSTTIME column and use MySQL's nice extension feature INSERT ... ON DUPLICATE KEY UPDATE when you post stuff. See http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html)
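A minimal sketch of that variant, reusing the invented names above with the key reduced to (SYSTEM_ID, NAME); the example values are made up:

INSERT INTO system_config (SYSTEM_ID, NAME, VALUE)
VALUES ('unit-0042', 'firmware_version', '2.1.7')
ON DUPLICATE KEY UPDATE VALUE = VALUES(VALUE);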
If your rapidly changing data is similar in form (name/value pairs) to your configuration data, you can use a similar schema to store it.
You may want to create a "current data" table using the MEMORY access method for this stuff. MEMORY tables are very fast to read and write because the data is all in RAM in your MySQL server. The downside is that a MySQL crash and restart will give you an empty table, with the previous contents lost. (MySQL servers crash very infrequently, but when they do they lose MEMORY table contents.)
You can run an occasional job (every few minutes or hours) to copy the contents of your MEMORY table to an on-disk table if you need to save history.
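Roughly, with invented names:

-- In-RAM table holding only the latest value per system/parameter.
CREATE TABLE system_state_current (
    SYSTEM_ID VARCHAR(64)  NOT NULL,
    NAME      VARCHAR(128) NOT NULL,
    VALUE     VARCHAR(255) NOT NULL,
    POSTTIME  DATETIME     NOT NULL,
    PRIMARY KEY (SYSTEM_ID, NAME)
) ENGINE = MEMORY;

-- Periodic job (cron or the MySQL event scheduler) copying the contents
-- into an on-disk history table with the same columns.
INSERT INTO system_state_history (SYSTEM_ID, NAME, VALUE, POSTTIME)
SELECT SYSTEM_ID, NAME, VALUE, POSTTIME
FROM   system_state_current;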
(Edit: You might consider adding memcached http://memcached.org/ to your web application system in the future to handle a high read rate, rather than constructing a database design for version 1 that handles a high read rate. That way you can see which parts of your overall app design have trouble scaling. I wish somebody had convinced me to do this in the past, rather than overdesigning for early versions. )

Medium-term temporary tables - creating tables on the fly to last 15-30 days?

Context
I'm currently developing a tool for managing orders and communicating between technicians and services. The industrial context is broadcast and TV. Multiple clients expecting media files each made to their own specs imply widely varying workflows even within the restricted scope of a single client's orders.
One client can ask one day for a single SD file and the next for a full-blown HD package containing up to fourteen files... In a MySQL db I am trying to store accurate information about all the small tasks composing the workflow, in multiple forms:
DATETIME values every time a task is accomplished, for accurate tracking
paths to the newly created files in the company's file system in VARCHARs
archiving background info in TEXT values (info such as user comments, e.g. when an incident happens and prevents moving forward, they can comment about it in this feed)
Multiply that by 30 different file types and this is way too much for a single table. So I thought I'd break it up by client: one table per client, so that any order only ever requires the use of that one table, which doesn't manipulate more than 15 fields. Still, this is a pretty rigid solution when a client has 9 different transcoding specs and a particular order only requires one. I figure I'd need to add flag fields for each transcoding field to indicate which ones are required for that particular order.
Concept
I then had this crazy idea that maybe I could create a temporary table to last while the order is running (which can range from about 1 day to 1 month). We rarely have more than 25 orders running simultaneously, so it wouldn't get too crowded.
The idea is to make a table tailored for each order, eliminating the need for flags and unnecessary, forever-empty fields. Once the order is complete, the table would get flushed, JSON-encoded, into a TEXT or BLOB so it can be restored later if changes need to be made.
Do you have experience with DBMSs (MySQL in particular) struggling under such practices, where this has been tried? Does this sound like a viable option? I am happy to try (which I have already started), and I am seeking advice on whether to keep going or stop right here.
Thanks for your input!
Well, of course that is possible to do. However, you cannot use MySQL temporary tables for such long-term storage; you will have to use "normal" tables and have some clean-up routine...
However, I do not see why that amount of data would be too much for a single table. If your queries start to run slowly because of the amount of data, then you should add some indexes to your database. I also think there is another con: it will be much harder to build reports later on. When you have 25 tables with the same kind of data, you will have to run 25 queries and merge the data.
I do not see the point, really. The same kinds of data should be in the same table.
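For example, a single task table keyed by order could look something like this (all names are invented); tasks that do not apply to an order simply have no row, so there is no need for flag columns or forever-empty fields:

CREATE TABLE order_task (
    task_id      INT AUTO_INCREMENT PRIMARY KEY,
    order_id     INT          NOT NULL,
    client_id    INT          NOT NULL,
    task_type    VARCHAR(64)  NOT NULL,  -- e.g. 'transcode_sd', 'qc_check'
    completed_at DATETIME     NULL,      -- set when the task is accomplished
    file_path    VARCHAR(500) NULL,      -- path to the newly created file
    comments     TEXT         NULL,      -- incident notes / user comments
    INDEX ix_order_task_order (order_id)
);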

Migrating and comparing a SQL Server database

Today we downloaded RedGate's Toolbelt, in order to automate some tasks that take a long time in our company when it comes to databases.
The first task came up with a 15 GB database we have, with a lot of indexes, constraints and also several triggers. We want this database to be migrated exactly - the schema, all the data, triggers, etc. - to a new DB, with the idea of reducing the size and also getting better performance by hiding all the mistakes committed in the past. Unfortunately this was the first customer-release DB of one of our products, and we used it to test a lot of things that did not always work very well. We are sure that if we do something like this, we will get more than 50% of the size back on our disk.
Can one of the Toolbelt tools, or a combination of them, be used to do this? If not, is there another tool available that would be useful for this task?
One common way this can happen is if you are not selecting all your tables to be included in the compare. For example, you may have selected a child table and not the parent table. This could lead to a FK error like you describe.

Best Update Method for MySQL DB

I have read through the solutions to similar problems, but they all seem to involve scripts and extra tools. I'm hoping my problem is simple enough to avoid that.
So the user uploads a csv of next week's data. It gets inserted into the DB, no problem.
BUT
an hour later he gets feedback from everyone, and must make updates accordingly. He updates the csv and goes to upload it to the DB.
Right now, the system I'm using checks to see if the data for that week is already there, and if it is, pulls all of that data from the DB; a script finds the differences and sends them out, and after all of this, the old data is deleted and replaced with the new data.
Obviously, it is a lot easier to just wipe it clean and re-enter the data, but that is not the best method, especially if there are lots of changes or tons of data. But I have to know WHAT changes have been made in order to send out alerts. However, I don't want a transaction log, as the alerts only need to be sent out once, and after that the old data is useless.
So!
Is there a smart way to compare the new data to the already existing data, get only the rows that are changed/deleted/added, and make those changes? Right now it seems like I could do an update, but then I won't get any response on what has changed...
Thanks!
Quick Edit:
No foreign keys are currently in use. This will soon change, but it shouldn't make a difference, because the foreign keys will only point to who the data affects and thus won't need to be changed. As far as primary keys go, that does present a bit of a dilemma:
The data in question is everyone's work schedule. So it would be nice (for specific applications of this schedule beyond simple output) for each shift to have a key. But the problem is, let's say that user1 was late on Monday. The tardiness is recorded in a separate table and is tied to the shift using the shift key. But if on Tuesday there is some need to make changes to the week already in progress, my fear is that it will become too difficult to ensure that entries in the DB that have already happened (and thus may have associations that shouldn't be broken) won't get re-keyed in the process. Unfortunately, it is not as simple as only updating all events occurring AFTER the current time, as this would add work (and thus make it less marketable) for the people who do the uploading. Basically, they make the schedule in one program, export it to a CSV, and then upload it on a web page for all of the webapps that need that data. So it is simply much easier for them (and less stressful for everyone involved) to do the same routine every time: export the entire week and upload it.
So my biggest concern is to make the upload script as smart as possible on both ends: it shouldn't get bloated trying to find the changes, it should be able to find the changes no matter the input, AND none of the data that is unchanged should risk getting re-keyed.
Here's a related question:
Suppose Joe User was scheduled to wash dishes from 7:00 PM to 8:00 PM, but the new data has him working 6:45 PM to 8:30 PM. Has the shift been changed? Or has the old one been deleted and a new one added?
And another one:
Say Jane was scheduled to work 1:00 PM to 3:00 PM, but now everyone has a mandatory staff meeting from 2:00 to 3:00. Has she lost one shift and gained two? Or has one shift changed and she gained one?
I'm really interested in knowing how this kind of data is typically handled/approached, more than specific answers to the above.
Again, thank you.
Right now, the system I'm using checks to see if the data for that week is already there, and if it is, pulls all of that data from the DB; a script finds the differences and sends them out, and after all of this, the old data is deleted and replaced with the new data.
So your script knows the differences, right? And you don't want to use any extra tools apart from your script and MySQL, right?
I'm quite convinced that MySQL doesn't offer any 'diff' tool by itself, so the best you can achieve is making a new CSV file for updates only. I mean, it should contain only the changed rows. Updating would be quicker, and all the changed data would be easily available.
If you have a unique key on one of the fields, you can use:
LOAD DATA LOCAL INFILE '/path/to/data.csv' REPLACE INTO TABLE table_name
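With REPLACE, incoming rows whose unique key already exists overwrite the old rows instead of being rejected. A slightly fuller form of the statement might look like this (the delimiters and the skipped header line are assumptions about the CSV):

LOAD DATA LOCAL INFILE '/path/to/data.csv'
REPLACE INTO TABLE table_name
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;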