Scalable Database Plan and Server Selection for Big Amount of Data - mysql

I started to work on a startup, where I am goign to write backend. It is my first time that I need to work with this big project and I wanted to ask a few questions to see what is the best practice. I will explain the workflow and informations first than ask for your valuable ideas.
The project is aiming to connect multiple pharmaceutical companies(at most 10) with many (at most 20.000) pharmacies. Pharmacies are supposed to upload screenshots or pdf files, and I need to gather information from these files. Every pharmacy might upload at most 100 screenshots and maybe a few pdf files, but they might do this process for different pharmaceutical companies. Lets say a pharmacy uploads 100 ss and 2 pdfs for company a, company b and company c, so 300 images and 6 pdfs total. Also, reading from pdf or using ocr system for images takes time. Every pdf will have information(Transaction data) about 50 Drugs. I will have Drugs Table and also Transactions table. Every Drug has on avarage 7 transaction. I feel like Transactions table will be huge after sometime and it would be costly to run queries on that big table.
Here are my questions
1)I am planning to use MySQL, would it be enough to serve my purpose?
2)Should I have a seperated database for each company or it is better idea to keep everything in one company.
3)What is the best practice to implement Drugs and Transactions table. Easist way is just to use a foreignkey, but as I told, after some time transactions table will be so big, so maybe there is a better way to plan it.
4)Should I go with a dedicated server or choose a service like AWS.As I said reading from pdf or using ocr takes time.
5)Which storege option would be reasonable for this project. Again, dedicated server or services like AWS storeage.
6)When I read a drug information, it will have drug data and also around 7 transactions. So i need to write to database 8 times per drug. Could there be less costly option?
Thank you very much for your answers :)

Thinking of MySQL:
Thousands of rows is "tiny"; millions is "medium-sized".
A table with a billion rows has some challenges, but it is do-able.
My best advice is to plan on rewriting the entire schema and code in about 4 months. With that advice, you can hastily build something, then learn why it is not optimal; then rebuild it.
Use as many tables as you need. (It sounds like you might need a few dozen. A thousand tables would probably indicate that you are doing something wrong.)
Consider storing images, pdfs, etc in files, not in tables. Then put the file path in a table.
Reading a PDF should be done only once, then capture the OCR'd result into a text file (or MEDIUMTEXT column). After that, you can write code (with or without SQL) to parse it and copy the important data into the database. And you can re-do that when you find that the parsing was not adequate.

Related

Data pipeline proposal

Our product has been growing steadily over the last few years and we are now on a turning point as far as data size for some of our tables is, where we expect that the growth of said tables will probably double or triple in the next few months, and even more so in the next few years. We are talking in the range of 1.4M now, so over 3M by the end of the summer and (since we expect growth to be exponential) we assume around 10M at the end of the year. (M being million, not mega/1000).
The table we are talking about is sort of a logging table. The application receives data files (csv/xls) on a daily basis and the data is transfered into said table. Then it is used in the application for a specific amount of time - a couple of weeks/months - after which it becomes rather redundant. That is: if all goes well. If there is some problem down the road, the data in the rows can be useful to inspect for problem solving.
What we would like to do is periodically clean up the table, removing any number of rows based on certain requirements, but instead of actually deleting the rows move them 'somewhere else'.
We currently use MySQL as a database and the 'somewhere else' could be the same, but can be anything. For other projects we have a Master/Slave setup where the whole database is involved, but that's not what we want or need here. It's just some tables where the Master table would need to become shorter and the Slave only bigger, not a one-on-one sync.
The main requirement for the secondary store would be that the data should be easy to inspect/query when need to, either by SQL or another DSL, or just visual tooling. So we are not interested in backing up the data to one or more CSV files or another plain text format, since that is not as easy to inspect. The logs will then be somewhere on S3 so we would need to download it, grep/sed/awk on it... We'd much rather have something database like that we can consult.
I hope the problem is clear?
For the record: while the solution can be anything we prefer to have the simplest solution possible. It's not that we don't want Apache Kafka (example), but then we'd have to learn it, install it, maintain it. Every new piece of technology adds onto our stack, the lighter it remains the more we like it ;).
Thanks!
PS: we are not just being lazy here, we have done some research but we just thought it'd be a good idea to get some more insight in the problem.

Creating a MySQL Database Schema for large data set

I'm struggling to find the best way to build out a structure that will work for my project. The answer may be simple but I'm struggling due to the massive number of columns or tables, depending on how it's set up.
We have several tools, each that can be run for many customers. Each tool has a series of questions that populate a database of answers. After the tool is run, we populate another series of data that is the output of the tool. We have roughly 10 tools, all populating a spreadsheet of 1500 data points. Here's where I struggle... each tool can be run multiple times, and many tools share the same data point. My next project is to build an application that can begin data entry for a tool, but allow import of data that shares the same datapoint for a tool that has already been run.
A simple example:
Tool 1 - company, numberofusers, numberoflocations, cost
Tool 2 - company, numberofusers, totalstorage, employeepayrate
So if the same company completed tool 1, I need to be able to populate "numberofusers" (or offer to populate) when they complete tool 2 since it already exists.
I think what it boils down to is, would it be better to create a structure that has 1500 tables, 1 for each data element with additional data around each data element, or to create a single massive table - something like...
customerID(FK), EventID(fk), ToolID(fk), numberofusers, numberoflocations, cost, total storage, employee pay,.....(1500)
If I go this route and have one large table I'm not sure how that will impact performance. Likewise - how difficult it will be to maintain 1500 tables.
Another dimension is that it would be nice to have a description of each field:
numberofusers,title,description,active(bool). I assume this is only possible if each element is in its own table?
Thoughts? Suggestions? Sorry for the lengthy question, new here.
Build a main table with all the common data: company, # users, .. other stuff. Give each row a unique id.
Build a table for each unique tool with the company id from above and any data unique to that implementation. Give each table a primary (unique key) for 'tool use' and 'company'.
This covers the common data in one place, identifies each 'customer' and provides for multiple uses of a given tool for each customer. Every use and customer is trackable and distinct.
More about normalization here.
I agree with etherbubunny on normalization but with larger datasets there are performance considerations that quickly become important. Joins which are often required in normalized databases to display human readable information can be performance killers on even medium sized tables which is why a lot of data warehouse models use de-normalized datasets for reporting. This is essentially pre-building the joined reporting data into new tables with heavy use of indexing, archiving and partitioning.
In many cases smart use of partitioning on its own can also effectively help reduce the size of the datasets being queried. This usually takes quite a bit of maintenance unless certain parameters remain fixed though.
Ultimately in your case (and most others) I highly recommend building it the way you are able to maintain and understand what is going on and then performing regular performance checks via slow query logs, explain, and performance monitoring tools like percona's tool set. This will give you insight into what is really happening and give you some data to come back here or the MySQL forums with. We can always speculate here but ultimately the real data and your setup will be the driving force behind what is right for you.

MYSQL - Database Design Large-scale real world deployment

I would love to hear some opinions or thoughts on a mysql database design.
Basically, I have a tomcat server which recieves different types of data from about 1000 systems out in the field. Each of these systems are unique, and will be reporting unique data.
The data sent can be categorized as frequent, and unfrequent data. The unfrequent data is only sent about once a day and doesn't change much - it is basically just configuration based data.
Frequent data, is sent every 2-3 minutes while the system is turned on. And represents the current state of the system.
This data needs to be databased for each system, and be accessible at any given time from a php page. Essentially for any system in the field, a PHP page needs to be able to access all the data on that client system and display it. In other words, the database needs to show the state of the system.
The information itself is all text-based, and there is a lot of it. The config data (that doesn't change much) is key-value pairs and there is currently about 100 of them.
My idea for the design was to have 100+ columns, and 1 row for each system to hold the config data. But I am worried about having that many columns, mainly because it isn't too future proof if I need to add columns in the future. I am also worried about insert speed if I do it that way. This might blow out to a 2000row x 200column table that gets accessed about 100 times a second so I need to cater for this in my initial design.
I am also wondering, if there is any design philosophies out there that cater for frequently changing, and seldomly changing data based on the engine. This would make sense as I want to keep INSERT/UPDATE time low, and I don't care too much about the SELECT time from php.
I would also love to know how to split up data. I.e. if frequently changing data can be categorised in a few different ways should I have a bunch of tables, representing the data and join them on selects? I am worried about this because I will probably have to make a report to show common properties between all systems (i.e. show all systems with a certain condition).
I hope I have provided enough information here for someone to point me in the right direction, any help on the matter would be great. Or if someone has done something similar and can offer advise I would be very appreciative. Thanks heaps :)
~ Dan
I've posted some questions in a comment. It's hard to give you advice about your rapidly changing data without knowing more about what you're trying to do.
For your configuration data, don't use a 100-column table. Wide tables are notoriously hard to handle in production. Instead, use a four-column table containing these columns:
SYSTEM_ID VARCHAR System identifier
POSTTIME DATETIME The time the information was posted
NAME VARCHAR The name of the parameter
VALUE VARCHAR The value of the parameter
The first three of these columns are your composite primary key.
This design has the advantage that it grows (or shrinks) as you add to (or subtract from) your configuration parameter set. It also allows for the storing of historical data. That means new data points can be INSERTed rather than UPDATEd, which is faster. You can run a daily or weekly job to delete history you're no longer interested in keeping.
(Edit if you really don't need history, get rid of the POSTTIME column and use MySQL's nice extension feature INSERT ON DUPLICATE KEY UPDATE when you post stuff. See http://dev.mysql.com/doc/refman/5.0/en/insert-on-duplicate.html)
If your rapidly changing data is similar in form (name/value pairs) to your configuration data, you can use a similar schema to store it.
You may want to create a "current data" table using the MEMORY access method for this stuff. MEMORY tables are very fast to read and write because the data is all in RAM in your MySQL server. The downside is that a MySQL crash and restart will give you an empty table, with the previous contents lost. (MySQL servers crash very infrequently, but when they do they lose MEMORY table contents.)
You can run an occasional job (every few minutes or hours) to copy the contents of your MEMORY table to an on-disk table if you need to save history.
(Edit: You might consider adding memcached http://memcached.org/ to your web application system in the future to handle a high read rate, rather than constructing a database design for version 1 that handles a high read rate. That way you can see which parts of your overall app design have trouble scaling. I wish somebody had convinced me to do this in the past, rather than overdesigning for early versions. )

Medium-term temporary tables - creating tables on the fly to last 15-30 days?

Context
I'm currently developing a tool for managing orders and communicating between technicians and services. The industrial context is broadcast and TV. Multiple clients expecting media files each made to their own specs imply widely varying workflows even within the restricted scope of a single client's orders.
One client can ask one day for a single SD file and the next for a full-blown HD package containing up to fourteen files... In a MySQL db I am trying to store accurate information about all the small tasks composing the workflow, in multiple forms:
DATETIME values every time a task is accomplished, for accurate tracking
paths to the newly created files in the company's file system in VARCHARs
archiving background info in TEXT values (info such as user comments, e.g. when an incident happens and prevents moving forward, they can comment about it in this feed)
Multiply that by 30 different file types and this is way too much for a single table. So I thought I'd break it up by client: one table per client so that any order only ever requires the use of that one table that doesn't manipulate more than 15 fields. Still, this a pretty rigid solution when a client has 9 different transcoding specs and that a particular order only requires one. I figure I'd need to add flags fields for each transcoding field to indicate which ones are required for that particular order.
Concept
I then had this crazy idea that maybe I could create a temporary table to last while the order is running (that can range from about 1 day to 1 month). We rarely have more than 25 orders running simultaneously so it wouldn't get too crowded.
The idea is to make a table tailored for each order, eliminating the need for flags and unnecessary forever empty fields. Once the order is complete the table would get flushed, JSON-encoded, into a TEXT or BLOB so it can be restored later if changes need made.
Do you have experience with DBMS's (MySQL in particular) struggling from such practices if it has ever existed? Does this sound like a viable option? I am happy to try (which I already started) and I am seeking advice so as to keep going or stop right here.
Thanks for your input!
Well, of course that is possible to do. However, you can not use the MySQL temporary tables for such long-term storage, you will have to use "normal" tables, and have some clean-up routine...
However, I do not see why that amount of data would be too much for a single table. If your queries start to run slow due to much data, then you should add some indexes to your database. I also think there is another con: It will be much harder to build reports later on, when you have 25 tables with the same kind of data, you will have to run 25 queries and merge the data.
I do not see the point, really. The same kinds of data should be in the same table.

How to reduce growing size of an access file?

So, at my workplace, they have a huge access file (used with MS Access 2003 and 2007). The file size is about 1.2GB, so it takes a while to open the file. We cannot delete any of the records, and we have about 100+ tables (each month we create 4 more tables, don't ask!). How do I improve this, i.e. downsizing the file?
You can do two things:
use linked tables
"compact" the database(s) every once in a while
The linked tables will not in of themselves limit the overall size of the database, but it will "package" it in smaller, more manageable files. To look in to this:
'File' menu + 'Get External data' + 'Linked tables'
Linked tables also have many advantages such as allowing one to keep multiple versions of data subset, and selecting a particular set by way of the linked table manager.
Compacting databases reclaims space otherwise lost as various CRUD operations (Insert, Delete, Update...) fragment the storage. It also regroup tables and indexes, making search more efficient. This is done with
'Tools' menu + 'Database Utilities' + 'Compact and Repair Database...'
You're really pushing up against the limits of MS Access there — are you aware that the file can't grow any larger than 2GB?
I presume you've already examined the data for possible space saving through additional normalization? You can "archive" some of the tables for previous months into separate MDB files and then link them (permanently or as needed) to your "current" database (in which case you'd actually be benefiting from what was probably an otherwise bad decision to start new tables for each month).
But, with that amount of data, it's probably time to start planning for moving to a more capacious platform.
You should really think about your db architecture. If there aren't any links between the tables you could try to move some of them to another database (One db per year :) as a short-term solution..
A couple of “Grasping at straws” ideas
Look at the data types for each column, you might be able to store some numbers as bytes saving a small amount per record
Look at the indexes and get rid of the ones you don’t use. On big tables unnecessary indexes can add a large amount of overhead.
I would + 2^64 the suggestions about the database design being a bit odd but nothing that hasn’t already been said so I wont labour the point
well .. Listen to #Larry, and keep in mind that, on the long term, you'll have to find another database to hold your data!
But on the short term, I am quite disturbed by this "4 new tables per month" thing. 4 tables per month is 50 per year ... That surely sounds strange to every "database manager" here. So please tell us: how many rows, how are they built, what are they for, and why do you have to build tables every month?
Depending on what you are doing with your data, you could also think about archiving some tables as XML files (or even XLS?). This could make sense for "historic" data, that do not have to be accessed through relations, views, etc. One good example would be the phone calls list collected from a PABX. Data can be saved as/loaded from XML/XLS files through ADODB recordsets or the transferDatabase method
Adding more tables every month: that is already a questionable attitude, and seems suspicious regarding data normalisation.
If you do that, I suspect that your database structure is also sub-optimal regarding field sizes, data types and indexes. I would really start by double checking those.
If you really have a justification for monthly tables (which I cannot imagine, again), why not having 1 back-end per month ?
You could also have on main back-end, with, let's say, 3 month of data online, and then an archive db, where you transfer your older records.
I use that for transactions, with the main table having about 650.000 records, and Access is very responsive.