MySQL: Complex queries or tracking/counter fields - mysql

I'm just thinking about MySQL database design and there are often situations where:
A particular action is or is not carried out and consequently data is or is not stored in the database
Whether or not a user undertook a particular action is displayed statistically
An example of this would be:
A user does or does not fill out a survey. If they do fill out a survey, the data they provide is stored in the database. The total number of users who filled out the survey is displayed.
Now, in order to get the number of users who filled out the survey, we could either
create a field of type BOOL which is set to TRUE on survey completion; we then calculate the number of users who completed the survey using a simple COUNT(*) WHERE field=TRUE
calculate the number of users who filled out the survey using the data they provided by joining the users and survey results tables and grouping on the user
This isn't a particularly complex example, but there are cases where, without the BOOL flag, queries can become very complex and expensive. But the flag is an almost unnecessary addition to the database tables - we use it only for convenience. It also means we have to ensure that we UPDATE all user flags at the relevant time, as well as storing user data.
What would be your approach to this kind of problem? For smaller applications, I'll usually just write complex queries and cache their results (occasionally using views to make things more manageable). But in larger applications, with potentially many joins, I might be tempted to flag the users with an action field so that reads are simpler and cheaper.

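The best solution is an indexed view (SQL Server terminology), a materialized view (Oracle terminology), or a materialized query table (DB2 terminology). With these, the database itself keeps the stored results in step with the base tables (immediately or on a refresh schedule, depending on the product and on how the object is defined), so you don't have to write the maintenance code yourself.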
When your platform doesn't support those kinds of database objects, you have to resort to using a table, along with all the other things necessary to keep the data right. You can keep the data right with
triggers
cron jobs
If you use triggers, you should probably also run a periodic cron job to make sure the data stored matches the data calculated.
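The trigger-plus-periodic-check idea can be sketched roughly as follows. This is only an illustration: it uses Python's built-in sqlite3 in place of MySQL (the trigger syntax differs slightly between the two), and the names survey_results, survey_stats and reconcile are made up for the example.

```python
import sqlite3

# SQLite stands in for MySQL; table, column and trigger names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE survey_results (
    user_id INTEGER PRIMARY KEY,   -- assumes one row per user who completed the survey
    answers TEXT NOT NULL
);
CREATE TABLE survey_stats (
    id INTEGER PRIMARY KEY CHECK (id = 1),   -- single-row summary table
    completed_count INTEGER NOT NULL
);
INSERT INTO survey_stats (id, completed_count) VALUES (1, 0);

-- Triggers keep the stored count in step with the detail rows.
CREATE TRIGGER survey_results_ai AFTER INSERT ON survey_results
BEGIN
    UPDATE survey_stats SET completed_count = completed_count + 1 WHERE id = 1;
END;
CREATE TRIGGER survey_results_ad AFTER DELETE ON survey_results
BEGIN
    UPDATE survey_stats SET completed_count = completed_count - 1 WHERE id = 1;
END;
""")


def stored_count(conn):
    """Count maintained by the triggers."""
    return conn.execute(
        "SELECT completed_count FROM survey_stats WHERE id = 1").fetchone()[0]


def computed_count(conn):
    """Count recalculated from the detail rows."""
    return conn.execute("SELECT COUNT(*) FROM survey_results").fetchone()[0]


def reconcile(conn):
    """Periodic (cron-style) check: fix the stored count if it has drifted.

    Returns True if a correction was needed.
    """
    actual = computed_count(conn)
    drifted = stored_count(conn) != actual
    if drifted:
        conn.execute(
            "UPDATE survey_stats SET completed_count = ? WHERE id = 1", (actual,))
    return drifted


conn.executemany("INSERT INTO survey_results (user_id, answers) VALUES (?, ?)",
                 [(1, "a"), (2, "b"), (3, "c")])
conn.execute("DELETE FROM survey_results WHERE user_id = 2")
print(stored_count(conn), computed_count(conn))  # prints "2 2"
```

In practice reconcile would be run from cron, and a True result is worth logging, since it means something bypassed the triggers.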
It helps that, in the real world, most of these kinds of requirements really don't have to be up to date in real time. These kinds of numbers usually support management decisions; a lag of even a day is often acceptable. (In other words, it sometimes helps to think of it as a data warehouse problem or as a report rather than as an OLTP problem.) I've had to negotiate these kinds of requirements many times. I've never had anyone refuse to accept a two-hour update cycle. (But that's certainly application-dependent.)

calculate the number of users . . . by joining the users and survey results tables and grouping on the user
If you can join the users and the survey results tables, then the survey results table must have a user identifier, right? If that's right, you don't need to join those two tables to determine the number of users who completed a survey.
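As a small sketch of that point, assuming a hypothetical survey_results table with one row per answered question and a user_id column (SQLite via Python's sqlite3 is used here only so the example runs on its own; the SELECT is the same in MySQL):

```python
import sqlite3

# Hypothetical schema: one row per (user, question) answer.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE survey_results (
        user_id     INTEGER NOT NULL,
        question_id INTEGER NOT NULL,
        answer      TEXT
    )
""")
conn.executemany(
    "INSERT INTO survey_results (user_id, question_id, answer) VALUES (?, ?, ?)",
    [(1, 1, "a"), (1, 2, "b"), (2, 1, "c"), (4, 1, "d")],
)

# No join against users and no GROUP BY: count the distinct user ids directly.
(completed,) = conn.execute(
    "SELECT COUNT(DISTINCT user_id) FROM survey_results").fetchone()
print(completed)  # prints 3
```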

What you are describing is called a "denormalized view", i.e. a table that contains results which can be computed from other data already in the database. The reason to do this is indeed performance. Whether to do it or not depends on the cost of (re-)generating the data, the effort required in your code to keep it coherent, and the extra database space needed to store the computed values.

Related

One database or multiple databases for statistical architecture

I currently already have a website running using CodeIgniter and MySQL. The MySQL database is around 110 tables big and contains mainly website specific data, like user data, vacancy data, etc.
Now I want to extend this website to include a complete statistical module as well. We would capture a lot of user actions and other aggregations from the data gathered on our own website, and would also pull in some data from the Google Analytics API to use in our statistics (we will generate a report in Excel but also show statistical graphs and numbers on a page (using chart.js)).
We are not thinking (in the foreseeable future) of using this data in other programs, but we need to be able to open some data to the public using an API.
We expect to start with about 300.000-350.000 data points gathered per day, but this amount will keep on growing every day of course, the more users we get.
Using multiple databases in CodeIgniter seems to not be an issue, so the main problem I am left with is how I should create the architecture for this statistical module.
I have a couple of ideas on how to start doing this, but I don't know whether one solution performs differently from the other, or whether there are other things to take into consideration.
My main idea boils down to having a table containing all "events", which just insert in that table every time an action is performed, eg "user is registered", "user put account on private", "user clicked on X", ...
Then once a day (probably at around midnight), a CRON job would run over that table for the past day and aggregate all the values into a format usable for our statistical metrics. Those aggregated values would be stored in a new table. This way we can clean up the "event" table quite regularly since that will become very big very fast.
Idea 1: Extend the current MySQL database architecture with new tables to incorporate the statistics. I would keep on using the current database architecture and add 2 new tables for the events and the aggregated values.
Idea 2: Create a new database, separate from the current existing one, and use this to insert all the events in a table there and the aggregated values in a new table there.
Note: we already have quite a few CRON jobs running on our current database, updating statuses and dates, sending emails, ...
Note2: sync issues between databases are not a concern since we will never be storing statistics on a per-user level.
MySQL does not care whether tables are in the same database or separate databases. It is just a convenience for the user. Some things:
You might need db1.tbla JOIN db2.tblb to talk across dbs.
It is convenient to have different GRANTs for different databases, but clumsy to have different GRANTs for 110 tables.
I can't think of any performance differences.
Nightly aggregation is a middle-of-the-road approach. Using IODKU (INSERT ... ON DUPLICATE KEY UPDATE) gives you 'immediate' aggregation, but is probably more burden on the system.
My blog on Summary Tables.
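A rough sketch of 'immediate' aggregation with an upsert follows. It runs on Python's built-in sqlite3 (version 3.24 or later), whose ON CONFLICT ... DO UPDATE clause plays the role of MySQL's ON DUPLICATE KEY UPDATE; the MySQL form is shown in a comment. The table daily_event_counts and the function record_event are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE daily_event_counts (
        day        TEXT    NOT NULL,
        event_type TEXT    NOT NULL,
        cnt        INTEGER NOT NULL,
        PRIMARY KEY (day, event_type)
    )
""")


def record_event(conn, day, event_type):
    """Bump the counter for (day, event_type), creating the row on first use.

    MySQL equivalent of the statement below:
        INSERT INTO daily_event_counts (day, event_type, cnt)
        VALUES (%s, %s, 1)
        ON DUPLICATE KEY UPDATE cnt = cnt + 1
    """
    conn.execute(
        """
        INSERT INTO daily_event_counts (day, event_type, cnt)
        VALUES (?, ?, 1)
        ON CONFLICT (day, event_type) DO UPDATE SET cnt = cnt + 1
        """,
        (day, event_type),
    )


record_event(conn, "2024-05-01", "user_registered")
record_event(conn, "2024-05-01", "user_registered")
record_event(conn, "2024-05-01", "clicked_x")
rows = conn.execute(
    "SELECT event_type, cnt FROM daily_event_counts ORDER BY event_type"
).fetchall()
print(rows)  # [('clicked_x', 1), ('user_registered', 2)]
```

The extra burden mentioned above comes from the fact that every single event now touches the summary row, instead of one batch touching it once a night.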
350K rows inserted per day is about 4/second (350,000 / 86,400 ≈ 4.05), which is comfortably low, so I don't think we need to discuss performance issues there.
"Summarize and toss" (for events) -- Yes. I like that approach. (Most people fail to think of this option.)
Do the math. Which table is the largest after a year? How many GB will it be? Then think about whether you can shrink any of the columns in it: SMALLINT instead of INT, normalization of long, oft-repeated, strings, etc.
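For the "summarize and toss" idea mentioned above, a nightly job might look something like this. It is a sketch under assumptions: sqlite3 stands in for MySQL, the tables events and daily_summary and the function summarize_and_toss are hypothetical, and created_at is stored as an ISO-formatted string so that a plain range comparison selects one day.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (
    id         INTEGER PRIMARY KEY,
    event_type TEXT NOT NULL,
    created_at TEXT NOT NULL          -- 'YYYY-MM-DD HH:MM:SS'
);
CREATE TABLE daily_summary (
    day        TEXT    NOT NULL,
    event_type TEXT    NOT NULL,
    cnt        INTEGER NOT NULL,
    PRIMARY KEY (day, event_type)
);
""")


def summarize_and_toss(conn, day):
    """Aggregate one day's raw events, then delete them, in one transaction."""
    start = day.isoformat()
    end = (day + timedelta(days=1)).isoformat()
    with conn:  # commits on success, rolls back if either statement fails
        conn.execute(
            """
            INSERT INTO daily_summary (day, event_type, cnt)
            SELECT ?, event_type, COUNT(*)
            FROM events
            WHERE created_at >= ? AND created_at < ?
            GROUP BY event_type
            """,
            (start, start, end),
        )
        conn.execute(
            "DELETE FROM events WHERE created_at >= ? AND created_at < ?",
            (start, end),
        )


conn.executemany(
    "INSERT INTO events (event_type, created_at) VALUES (?, ?)",
    [
        ("user_registered", "2024-05-01 09:00:00"),
        ("user_registered", "2024-05-01 17:30:00"),
        ("clicked_x", "2024-05-01 23:59:59"),
        ("clicked_x", "2024-05-02 00:00:01"),
    ],
)
summarize_and_toss(conn, date(2024, 5, 1))
summary = conn.execute(
    "SELECT day, event_type, cnt FROM daily_summary ORDER BY event_type"
).fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(summary, remaining)
```

Doing the insert and the delete in the same transaction means a failed run leaves the raw events in place, so the job can simply be rerun.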

Best database design for storing a high number of columns?

Situation: We are working on a project that reads datafeeds into the database at our company. These datafeeds can contain a high number of fields. We match those fields with certain columns.
At this moment we have about 120 types of fields. Each of those needs a column. We need to be able to filter and sort on all columns.
The problem is that I'm unsure what database design would be best for this. I'm using MySQL for the job but I'm open to suggestions. At this moment I'm planning to make a table with all 120 columns since that is the most natural way to do things.
Options: My other options are a meta table that stores key and values. Or using a document based database so I have access to a variable schema and scale it when needed.
Question:
What is the best way to store all this data? The row count could go up to 100k rows and I need a storage that can select, sort and filter really fast.
Update:
Some more information about usage. XML feeds will be generated live from this table. We are talking about 100 - 500 requests per hour, but this will be growing. The fields will not change regularly, but it could be once every 6 months. We will also be updating the datafeeds daily, so checking if items are updated, deleting old ones and adding new ones.
120 columns at 100k rows is not enough information; that only really gives one of the metrics: size. The other is transactions. How many transactions per second are you talking about here?
Is it a nightly update with a manager running a report once a week, or a million page-requests an hour?
I don't generally need to start looking at 'clever' solutions until hitting a 10m record table, or hundreds of queries per second.
Oh, and do not use a Key-Value pair table. They are not great in a relational database, so stick to proper typed fields.
I personally would recommend sticking to a conventional one-column-per-field approach and only deviate from this if testing shows it really isn't right.
With regards to retrieval, if the INSERTS/UPDATES are only happening daily, then I think some careful indexing on the server side, and good caching wherever the XML is generated, should reduce the server hit a good amount.
For example, you say 'we will be updating the datafeeds daily', then there shouldn't be any need to query the database every time. Although, 1000 per hour is only 17 per minute. That probably rounds down to nothing.
I'm working on a similar project right now, downloading dumps from the net and loading them into the database, merging changes into the main table and properly adjusting the dictionary tables.
First, you know the data you'll be working with. So it is necessary to analyze it in advance and pick the best table/column layout. If you have all your 120 columns containing textual data, then a single row will take several kilobytes of disk space. In such a situation you will want to make all queries highly selective, so that indexes are used to minimize IO. Full scans might take significant time with such a design. You've said nothing about how big your 500/h requests will be: will each request extract a single row, a small bunch of rows, or a big portion (up to the whole table)?
Second, looking at the data, you might outline a number of columns that will have a limited set of values. I prefer to do the following transformation for such columns:
setup a dictionary table, making an integer PK for it;
replace the actual value in a master table's column with PK from the dictionary.
The transformation is done by triggers written in C, so although it gives me upload penalty, I do have some benefits:
decreased total size of the database and master table;
better options for the database and OS to cache frequently accessed data blocks;
better query performance.
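The dictionary transformation described above might look roughly like this. Note the swap: the answer does the lookup in database triggers written in C, whereas this sketch does it in application code, and it uses sqlite3 instead of MySQL. The tables brand and item and the helper functions are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dictionary table: each distinct value is stored once, under an integer PK.
CREATE TABLE brand (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL UNIQUE
);
-- Master table stores the small integer key instead of the repeated string.
CREATE TABLE item (
    id       INTEGER PRIMARY KEY,
    title    TEXT NOT NULL,
    brand_id INTEGER NOT NULL REFERENCES brand (id)
);
""")


def brand_id(conn, name):
    """Return the dictionary key for name, inserting it if it is new."""
    conn.execute("INSERT OR IGNORE INTO brand (name) VALUES (?)", (name,))
    return conn.execute(
        "SELECT id FROM brand WHERE name = ?", (name,)).fetchone()[0]


def add_item(conn, title, brand_name):
    """Insert a feed row, replacing the brand string with its dictionary key."""
    conn.execute(
        "INSERT INTO item (title, brand_id) VALUES (?, ?)",
        (title, brand_id(conn, brand_name)),
    )


add_item(conn, "Road bike", "Acme")
add_item(conn, "Helmet", "Acme")
add_item(conn, "Gloves", "Globex")

brand_count = conn.execute("SELECT COUNT(*) FROM brand").fetchone()[0]
listing = conn.execute("""
    SELECT i.title, b.name
    FROM item AS i
    JOIN brand AS b ON b.id = i.brand_id
    ORDER BY i.id
""").fetchall()
print(brand_count, listing)
```

Filtering on a brand then becomes a comparison on a small integer column, which is where the size and caching benefits listed above come from.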
Third, try to split data according to the extracts you'll be doing. Quite often it turns out that only 30-40% of the fields in the table are typically used by all the queries, while the remaining 60-70% are evenly distributed among them and used only partially. In this case I would recommend splitting the main table accordingly: extract the fields that are always used into a single "master" table, and create another one for the rest of the fields. In fact, you can have several "other ones", logically grouping data in separate tables.
In my practice we've had a table that contained detailed customer information: name details, address details, status details, banking details, billing details, financial details and a set of custom comments. All queries on such a table were expensive ones, as it was used in the majority of our reports (reports typically perform full scans). By splitting this table into a set of smaller ones and building a view with rules on top of them (to keep the external application happy), we managed to gain a pleasant performance boost (sorry, I don't have the numbers any longer).
To summarize: you know the data you'll be working with and you know the queries that will be used to access your database, analyze and design accordingly.

Efficient MySQL Database Structure for Dynamic Form Creation

I'm creating an application in Codeigniter which will allow anyone, without signing in, to create a form to be filled out using different input types (using text boxes, dropdowns, checkboxes, etc.). This form could be 1-100 questions and when completed it will be emailed to someone else who will then fill it out on the site.
I first set up my MySQL database similar to this post, with quite a few different tables all with only a few columns. I then indexed and used foreign keys to link the information.
Since then, I have changed and set up my database like this so I'm making fewer queries:
Document
id, name, email, recipientname, recipientemail, document name
Document Questions
document_id, question_id, question, type, comments
Is having more tables with fewer columns but more queries more efficient than how I'm doing it now? I understand that normalization plays a role, but to what extent are you hindering performance by making your tables so specifically small?
From a normalization point of view there are things you could do to further normalize your data (recipients could have their own entity and types could also), but it's not always the most optimal way of accessing your data.
For example, if you split your problem into 4 different entities (Types could just as easily be an ENUM):
Documents
Document Questions
Recipients
Types
Then to fetch a single form for your application you would be executing a query with multiple joins. If you're using MyISAM then all four of your tables become locked until the query finishes. Queries with bad joins and bad indexes can become very slow.
A better alternative would be to execute four separate queries on the database (add indexes relative to the most common queries you're running) to retrieve your data; this way tables will stay locked for a shorter period of time.
I know this is an extreme example, but I would concentrate more on your index optimization and strike a good balance between normalization and performance.
To sum up, sometimes fully normalized data means lower performance.

What's a better strategy for storing log data in a database?

I'm building an application that requires extensive logging of the actions of users, payments, etc.
Am I better off with a monolithic logs table, and just log EVERYTHING into that.... or is it better to have separate log tables for each type of action I'm logging (log_payment, log_logins, log_acc_changes)?
For example, currently I'm logging users' interactions with a payment gateway: when they sign up for a trial, when the trial becomes a subscription, when it gets rebilled or refunded, whether there was a failure or not, etc.
I'd also like to start logging actions or events that don't interact with the payment gateway (renewal cancellations, bans, payment failures that were intercepted before the data is even sent to the gateway for verification, logins, etc).
EDIT:
The data will be regularly examined to verify its integrity since, based on it, people will need to be paid, so accurate data is critical. Read queries will be done by myself and 2 other admins, so 99% of the time it's going to be write/update.
I just figured that having multiple tables creates more points of failure during the critical MySQL transactions that deal with inserting and updating the payment data, etc.
All other things being equal, smaller disjoint tables can have a performance advantage, especially when they're write-heavy (as tables related to logs are liable to be) -- most DB mechanisms are better tuned for mostly-read, rarely-written tables. In terms of writing (and updating any indices you may have to maintain), small disjoint tables are a clear win, especially if there's any concurrency (depending on what engine you're using for your tables, of course -- that's a pretty important consideration in mysql!-).
In terms of reading, it all depends on your pattern of queries -- what queries will you need, and how often. In certain cases for a usage pattern such as you mention there might be some performance advantage in duplicating certain information -- e.g. if you often need an up-to-the-instant running total of a user's credits or debits, as well as detailed auditable logs of how the running total came to be, keeping a (logically redundant) table of running totals by users may be warranted (as well as the nicely-separated "log tables" about the various sources of credits and debits).
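One way to keep such a redundant running-totals table honest is to write the log row and the total in the same transaction. The sketch below assumes hypothetical tables credit_log and user_balance, stores amounts as integer cents, and uses sqlite3 in place of MySQL (where this would need a transactional engine such as InnoDB).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Detailed, auditable log: one row per credit or debit.
CREATE TABLE credit_log (
    id           INTEGER PRIMARY KEY,
    user_id      INTEGER NOT NULL,
    amount_cents INTEGER NOT NULL,     -- negative for debits
    source       TEXT    NOT NULL
);
-- Logically redundant running total, kept for cheap reads.
CREATE TABLE user_balance (
    user_id       INTEGER PRIMARY KEY,
    balance_cents INTEGER NOT NULL
);
""")


def record_credit(conn, user_id, amount_cents, source):
    """Append to the log and adjust the running total atomically."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "INSERT INTO credit_log (user_id, amount_cents, source) VALUES (?, ?, ?)",
            (user_id, amount_cents, source),
        )
        cur = conn.execute(
            "UPDATE user_balance SET balance_cents = balance_cents + ? WHERE user_id = ?",
            (amount_cents, user_id),
        )
        if cur.rowcount == 0:  # first entry for this user
            conn.execute(
                "INSERT INTO user_balance (user_id, balance_cents) VALUES (?, ?)",
                (user_id, amount_cents),
            )


def balance(conn, user_id):
    """Read the running total without touching the log."""
    row = conn.execute(
        "SELECT balance_cents FROM user_balance WHERE user_id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else 0


record_credit(conn, 7, 500, "rebill")
record_credit(conn, 7, -200, "refund")
audited = conn.execute(
    "SELECT SUM(amount_cents) FROM credit_log WHERE user_id = 7").fetchone()[0]
print(balance(conn, 7), audited)  # prints "300 300"
```

The SUM over the log is the slow but authoritative figure; comparing it with the stored balance is a natural integrity check.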
Transactional tables should never change, not be editable, and can serve as log files for that type of information. Design your "billing" tables to have timestamps, and that will be sufficient.
However, where data records are editable, you need to track who-changed-what-when. To do that, you have a couple of choices.
--
For a given table, you can have a table_history table that has a near-identical structure, with NULLable fields, and a two-part primary key (the primary key of the original table + a sequence). If for every insert or update operation, you write a record to this table, you have a complete log of everything that happened to table.
The advantage of this method is you get to keep the same column types for all logged data, plus it is more efficient to query.
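A minimal sketch of the table_history approach, assuming a hypothetical account table. It uses triggers to write the history rows, which is one option among several (the answer does not say how the rows get written), and sqlite3 stands in for MySQL, so the trigger syntax would need adapting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE account (
    id     INTEGER PRIMARY KEY,
    email  TEXT NOT NULL,
    status TEXT NOT NULL
);
-- Near-identical structure, NULLable data columns, two-part primary key.
CREATE TABLE account_history (
    id     INTEGER NOT NULL,   -- primary key of the original row
    seq    INTEGER NOT NULL,   -- 1, 2, 3, ... per original row
    email  TEXT,
    status TEXT,
    PRIMARY KEY (id, seq)
);

CREATE TRIGGER account_ai AFTER INSERT ON account
BEGIN
    INSERT INTO account_history (id, seq, email, status)
    SELECT NEW.id, COALESCE(MAX(seq), 0) + 1, NEW.email, NEW.status
    FROM account_history WHERE id = NEW.id;
END;
CREATE TRIGGER account_au AFTER UPDATE ON account
BEGIN
    INSERT INTO account_history (id, seq, email, status)
    SELECT NEW.id, COALESCE(MAX(seq), 0) + 1, NEW.email, NEW.status
    FROM account_history WHERE id = NEW.id;
END;
""")

conn.execute(
    "INSERT INTO account (id, email, status) VALUES (1, 'a@example.com', 'active')")
conn.execute("UPDATE account SET status = 'suspended' WHERE id = 1")
conn.execute("UPDATE account SET status = 'closed' WHERE id = 1")

history = conn.execute(
    "SELECT seq, status FROM account_history WHERE id = 1 ORDER BY seq"
).fetchall()
print(history)  # [(1, 'active'), (2, 'suspended'), (3, 'closed')]
```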
--
Alternatively, you can have a single log table that has fields like "table", "key", "date", "who", and a related table that stores the changed fields and values.
The advantage of this method is that you get to write one logging routine and use it everywhere.
--
I suggest you evaluate the number of tables, performance needs, change volume, and then pick one and go with it.
It depends on the purpose of logging. For debugging and general monitoring purpose, a single log table with dynamic log level would be helpful so you can chronologically look at what the system is going through.
On the other hand, for audit trail purpose, there's nothing like having duplicate table for all tables with every CRUD action. This way, every information captured in the payment table or whatever would be captured in your audit table.
So, the answer is both.

Which is more efficient: Multiple MySQL tables or one large table?

I store various user details in my MySQL database. Originally it was set up in various tables meaning data is linked with UserIds and outputting via sometimes complicated calls to display and manipulate the data as required. Setting up a new system, it almost makes sense to combine all of these tables into one big table of related content.
Is this going to be a help or hindrance?
Speed considerations in calling, updating or searching/manipulating?
Here's an example of some of my table structure(s):
users - UserId, username, email, encrypted password, registration date, ip
user_details - cookie data, name, address, contact details, affiliation, demographic data
user_activity - contributions, last online, last viewing
user_settings - profile display settings
user_interests - advertising targetable variables
user_levels - access rights
user_stats - hits, tallies
Edit: I've upvoted all answers so far, they all have elements that essentially answer my question.
Most of the tables have a 1:1 relationship which was the main reason for denormalising them.
Are there going to be issues if the table spans across 100+ columns when a large portion of these cells are likely to remain empty?
Multiple tables help in the following ways / cases:
(a) if different people are going to be developing applications involving different tables, it makes sense to split them.
(b) If you want to give different kinds of authority to different people for different parts of the data collection, it may be more convenient to split them. (Of course, you can look at defining views and giving authorization on them appropriately.)
(c) For moving data to different places, especially during development, it may make sense to use separate tables, resulting in smaller file sizes.
(d) A smaller footprint may give comfort while you develop applications on a specific data collection of a single entity.
(e) It is a possibility: what you thought of as single-value data may turn out to be really multiple values in future, e.g. credit limit is a single-value field as of now, but tomorrow you may decide to change the values to (date from, date to, credit value). Split tables might come in handy then.
My vote would be for multiple tables - with data appropriately split.
Good luck.
Combining the tables is called denormalizing.
It may (or may not) help to make some queries (which make lots of JOINs) to run faster at the expense of creating a maintenance hell.
MySQL is capable of using only one JOIN method, namely NESTED LOOPS (hash joins were added later, in MySQL 8.0.18).
This means that for each record in the driving table, MySQL locates a matching record in the driven table in a loop.
Locating a record is quite a costly operation which may take dozens of times as long as pure record scanning.
Moving all your records into one table will help you to get rid of this operation, but the table itself grows larger, and the table scan takes longer.
If you have lots of records in other tables, then the increase in table scan time can outweigh the benefits of the records being scanned sequentially.
Maintenance hell, on the other hand, is guaranteed.
Are all of them 1:1 relationships? I mean, if a user could belong to, say, different user levels, or if the users interests are represented as several records in the user interests table, then merging those tables would be out of the question immediately.
Regarding previous answers about normalization, it must be said that the database normalization rules completely disregard performance and look only at what makes a neat database design. That is often what you want to achieve, but there are times when it makes sense to actively denormalize in pursuit of performance.
All in all, I'd say the question comes down to how many fields there are in the tables, and how often they are accessed. If user activity is often not very interesting, then it might just be a nuisance to always have it on the same record, for performance and maintenance reasons. If some data, like settings, say, is accessed very often, but simply contains too many fields, it might also not be convenient to merge the tables. If you're only interested in the performance gain, you might consider other approaches, such as keeping the settings separate, but saving them in a session variable of their own so that you don't have to query the database for them very often.
Do all of those tables have a 1-to-1 relationship? For example, will each user row only have one corresponding row in user_stats or user_levels? If so, it might make sense to combine them into one table. If the relationship is not 1 to 1 though, it probably wouldn't make sense to combine (denormalize) them.
Having them in separate tables vs. one table is probably going to have little effect on performance though unless you have hundreds of thousands or millions of user records. The only real gain you'll get is from simplifying your queries by combining them.
ETA:
If your concern is about having too many columns, then think about what stuff you typically use together and combine those, leaving the rest in a separate table (or several separate tables if needed).
If you look at the way you use the data, my guess is that you'll find that something like 80% of your queries use 20% of that data with the remaining 80% of the data being used only occasionally. Combine that frequently used 20% into one table, and leave the 80% that you don't often use in separate tables and you'll probably have a good compromise.
Creating one massive table goes against relational database principles. I wouldn't combine them all into one table. You're going to get multiple instances of repeated data. If your user has three interests, for example, you will have 3 rows with the same user data in them, just to store the three different interests. Definitely go for the multiple 'normalized' table approach. See this Wiki page for database normalization.
Edit:
I have updated my answer, as you have updated your question... I agree with my initial answer even more now since...
a large portion of these cells are likely to remain empty
If, for example, a user didn't have any interests, then if you normalize you simply won't have a row in the interest table for that user. If you have everything in one massive table, then you will have columns (and apparently a lot of them) that contain just NULLs.
I have worked for a telephony company where there were tons of tables, and getting data could require many joins. When the performance of reading from these tables was critical, procedures were created that could generate a flat table (i.e. a denormalized table) that would require no joins, calculations etc., and that reports could point to. These were then used in conjunction with a SQL Server Agent job to run at certain intervals (i.e. a weekly view of some stats would run once a week and so on).
Why not use the same approach WordPress does, by having a users table with the basic user information that everyone has and then adding a "user_meta" table that can hold basically any key/value pair associated with the user id? So if you need to find all the meta information for the user, you could just add that to your query. You would also not always have to add the extra query if it's not needed, for things like logging in. This approach also leaves your table open to adding new features to your users, such as storing their Twitter handle or each individual interest. You also won't have to deal with a maze of associated IDs, because you have one table that rules all metadata and you limit it to only one association instead of 50.
WordPress specifically does this to allow features to be added via plugins, which makes your project more scalable: it will not require a complete database overhaul if you need to add a new feature.
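A rough sketch of that layout, with invented table and function names (this is not WordPress's actual schema, just the same idea), run on sqlite3 rather than MySQL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id       INTEGER PRIMARY KEY,
    username TEXT NOT NULL UNIQUE
);
-- One generic table for everything that is not core user data.
CREATE TABLE user_meta (
    id         INTEGER PRIMARY KEY,
    user_id    INTEGER NOT NULL REFERENCES users (id),
    meta_key   TEXT NOT NULL,
    meta_value TEXT
);
CREATE INDEX user_meta_lookup ON user_meta (user_id, meta_key);
""")


def add_meta(conn, user_id, key, value):
    """Attach one key/value pair to a user; a key may repeat (e.g. interests)."""
    conn.execute(
        "INSERT INTO user_meta (user_id, meta_key, meta_value) VALUES (?, ?, ?)",
        (user_id, key, value),
    )


def get_meta(conn, user_id, key):
    """Return all values stored under key for this user, oldest first."""
    rows = conn.execute(
        "SELECT meta_value FROM user_meta "
        "WHERE user_id = ? AND meta_key = ? ORDER BY id",
        (user_id, key),
    )
    return [value for (value,) in rows]


conn.execute("INSERT INTO users (id, username) VALUES (1, 'alice')")
add_meta(conn, 1, "twitter", "@alice")
add_meta(conn, 1, "interest", "cycling")
add_meta(conn, 1, "interest", "chess")
print(get_meta(conn, 1, "interest"))  # ['cycling', 'chess']
```

The trade-off is that every value is stored as text, so typed filtering and sorting on these attributes is harder than with real columns.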
I think this is one of those "it depends" situation. Having multiple tables is cleaner and probably theoretically better. But when you have to join 6-7 tables to get information about a single user, you might start to rethink that approach.
I would say it depends on what the other tables really mean.
Does user_details contain more than one row per user, and so on?
Which level of normalization is best suited to your needs depends on your demands.
If you have one table with a good index, that would probably be faster, but on the other hand probably more difficult to maintain.
To me it looks like you could skip User_Details, as it probably has a 1-to-1 relation with Users.
But the rest probably have a lot of rows per user?
Performance considerations on big tables
"Likes" and "views" (etc) are one of the very few valid cases for a 1:1 relationship for performance. This keeps the very frequent UPDATE ... +1 from interfering with other activity and vice versa.
Bottom line: separate frequent counters in very big and busy tables.
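Separating the counters might look like this. The tables posts and post_counters and the helper functions are hypothetical, and sqlite3 is used only to make the sketch runnable; the locking benefit described above is a property of a busy MySQL server and is not something this small example can demonstrate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Wide, mostly-read table.
CREATE TABLE posts (
    id    INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    body  TEXT NOT NULL
);
-- Narrow 1:1 table that absorbs the frequent "+1" updates.
CREATE TABLE post_counters (
    post_id INTEGER PRIMARY KEY REFERENCES posts (id),
    views   INTEGER NOT NULL DEFAULT 0,
    likes   INTEGER NOT NULL DEFAULT 0
);
""")


def create_post(conn, title, body):
    """Insert the post and its counter row together; returns the new post id."""
    with conn:
        cur = conn.execute(
            "INSERT INTO posts (title, body) VALUES (?, ?)", (title, body))
        post_id = cur.lastrowid
        conn.execute("INSERT INTO post_counters (post_id) VALUES (?)", (post_id,))
    return post_id


def record_view(conn, post_id):
    """The hot path: touches only the small counter row."""
    conn.execute(
        "UPDATE post_counters SET views = views + 1 WHERE post_id = ?", (post_id,))


post_id = create_post(conn, "Hello", "First post")
for _ in range(3):
    record_view(conn, post_id)
views = conn.execute(
    "SELECT views FROM post_counters WHERE post_id = ?", (post_id,)).fetchone()[0]
print(views)  # prints 3
```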
Another possible case is where you have a group of columns that are rarely present. Rather than having a bunch of nulls, have a separate table that is related 1:1, or more aptly phrased "1:rarely". Then use LEFT JOIN only when you need those columns. And use COALESCE() when you need to turn NULL into 0.
Bottom Line: It depends.
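The "1:rarely" case with LEFT JOIN and COALESCE() can be illustrated like this, assuming hypothetical users and user_billing tables where only a few users have a billing row (sqlite3 stands in for MySQL; the query itself is the same in both):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- Rarely-present columns live here; most users have no row at all.
CREATE TABLE user_billing (
    user_id      INTEGER PRIMARY KEY REFERENCES users (id),
    credit_limit INTEGER NOT NULL
);
INSERT INTO users (id, name) VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO user_billing (user_id, credit_limit) VALUES (2, 500);
""")

# LEFT JOIN keeps users without a billing row; COALESCE turns the NULL into 0.
rows = conn.execute("""
    SELECT u.name, COALESCE(b.credit_limit, 0) AS credit_limit
    FROM users AS u
    LEFT JOIN user_billing AS b ON b.user_id = u.id
    ORDER BY u.id
""").fetchall()
print(rows)  # [('Ann', 0), ('Bob', 500), ('Cy', 0)]
```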
Limit search conditions to one table. An INDEX cannot reference columns in different tables, so a WHERE clause that filters on multiple columns might use an index on one table, but then have to work harder to continue filtering on columns in other tables. This issue is especially bad if "ranges" are involved.
Bottom line: Don't move such columns into a separate table.
TEXT and BLOB columns can be bulky, and this can cause performance issues, especially if you unnecessarily say SELECT *. Such columns are stored "off-record" (in InnoDB). This means that the extra cost of fetching them may involve an extra disk hit(s).
Bottom line: InnoDB is already taking care of this performance 'problem'.