I have to structure a MySQL database for work and haven't done that in years. I'd love to get some ideas from you. So here's the task:
I have a couple of "shops" that have different opening hours depending on the day of the week and the time of year, and these could change further down the line. Each shop has space for a given number of people (which could change later as well).
A few times a day we count the number of people in the shop.
We want to compare the utilized capacity between shops. I myself would like to use dc.js to get as many statistics as possible out of the data.
We also have two different methods of counting our users:
- By hand. Reliable, but time consuming.
- Light barrier. Automatic, but very inaccurate.
I'd like to get a better approximation of the user count using the light barrier data and some machine learning algorithm.
Anyway, do you have any tips on how to design the DB as efficiently as possible for my tasks? I was thinking:
SHOP
    Id
    Name

OPENINGHOURS
    Id
    ShopId
    MaxUsers
    Date
    Open
    Close

MANUALUSERCOUNT
    Id
    ShopId
    Time
    Count

AUTOUSERCOUNT
    Id
    ShopId
    Time
    Count
Does this structure make sense (at all and for my tasks)?
Thank you!
For an application of this size, I see no problem with this at all. Except: what does the "Time" column in the user count tables refer to?
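For reference, the proposed structure could be written as MySQL DDL roughly like this. The data types, the snake_case names and the counted_at DATETIME column (standing in for that ambiguous Time column) are assumptions for illustration, not something taken from the question:

    CREATE TABLE shop (
        id   INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(100) NOT NULL
    );

    CREATE TABLE openinghours (
        id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        shop_id    INT UNSIGNED NOT NULL,
        max_users  INT UNSIGNED NOT NULL,   -- capacity valid on that date
        open_date  DATE NOT NULL,           -- the date these hours apply to
        open_time  TIME NOT NULL,
        close_time TIME NOT NULL,
        FOREIGN KEY (shop_id) REFERENCES shop (id)
    );

    CREATE TABLE manualusercount (
        id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        shop_id    INT UNSIGNED NOT NULL,
        counted_at DATETIME NOT NULL,       -- full date and time of the count
        user_count INT UNSIGNED NOT NULL,
        FOREIGN KEY (shop_id) REFERENCES shop (id)
    );

    CREATE TABLE autousercount (
        id         INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        shop_id    INT UNSIGNED NOT NULL,
        counted_at DATETIME NOT NULL,
        user_count INT UNSIGNED NOT NULL,
        FOREIGN KEY (shop_id) REFERENCES shop (id)
    );

    -- Utilization per manual count, comparable across shops:
    SELECT s.name, m.counted_at, m.user_count / o.max_users AS utilization
    FROM manualusercount m
    JOIN shop s         ON s.id = m.shop_id
    JOIN openinghours o ON o.shop_id = m.shop_id
                       AND o.open_date = DATE(m.counted_at);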
I'm a junior doctor and I'm creating a database system for my senior doctor.
Basically, my senior doctor wants to be able to store a whole lot of information on each of his patients in a relational database so that later he can very easily and quickly analyse / audit the data (e.g. based on certain demographics, find which treatments result in better outcomes, or which ethnicities respond better to certain treatments, etc.).
The information he wants to store for each patient is huge.
Each patient is to complete 7 surveys (each only takes 1-2 minutes) a number of times (immediately before their operation, immediately postop, 3 months postop, 6 months postop, 2 years postop and 5 years postop) - the final scores of each of these surveys at these various times will be stored in the database.
Additionally, he wants to store their relevant details (name, ethnicity, gender, age etc etc).
Finally, he also intends to store A LOT of relevant past medical history, current symptoms, examination findings, the various treatment options they try and then outcome measures.
Basically, there's A LOT of info for each patient, and all of this info will be unique to each patient. Because of this, I've created one HUGE patient table (~400 columns) to contain all of it. The reason I've done this is that most of the columns in the table will NOT be redundant for each patient.
Additionally, this entire php / mysql database system is only going to live locally on his computer, it'll never be on the internet.
The tables won't have too many patients, maybe around 200 - 300 by the end of the year.
Given all of this, is it ok to have such a huge table?
Or should I be splitting it into smaller tables,
e.g.:
- Patient demographics
- Survey results
- Symptoms
- Treatments
etc. etc, with a unique "patient_id" being the link between each of these tables?
What would be the difference between the 2 approaches and which would be better? Why?
About the 400 columns...
Which, if any, of the columns will be important to search or sort on? Probably very few of them. Keep those few columns as columns.
What will you do with the rest? Probably you simply display them somewhere, using some app code to pretty-print them? So these may as well be in a big JSON string.
This avoids the EAV nightmares, yet stores the data in the database in a format that is actually reasonably easy (and fast) to use.
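A minimal sketch of that idea, assuming MySQL 5.7+ for the native JSON type (the column names here are invented placeholders, not the real schema; on older versions a TEXT column holding a JSON string serves the same display-only purpose):

    CREATE TABLE patient (
        patient_id INT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        name       VARCHAR(100) NOT NULL,
        ethnicity  VARCHAR(50),
        gender     CHAR(1),
        birth_date DATE,
        extra_info JSON            -- everything the app only displays / pretty-prints
    );

    -- Searching and sorting keep using the real columns:
    SELECT name, birth_date
    FROM patient
    WHERE ethnicity = 'some_ethnicity'
    ORDER BY birth_date;

    -- The blob can still be dipped into when needed:
    SELECT name,
           JSON_EXTRACT(extra_info, '$.surveys.preop.score') AS preop_score
    FROM patient;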
I am building a little app for users to create collections. I want to have a rating system in there. And now, since I want to cover all my bases, let's pretend that I have a lot of visitors. Performance comes into play, especially with ratings.
Let's suppose that I have a rates table with id, game_id, user_id and rate. The data is simple: for every user there is one entry. Let's suppose again that 1000 users will rate one game, and I want to print out the average rating on that game's subpage (and somewhere else, like on the games list). For now, I have two scenarios to go with:
1. Getting the AVG each time the game is displayed.
2. Creating another column in games, called temprate, and storing the game's rating there. It would be updated every time someone votes.
Those two scenarios have obvious flaws. The first is more stressful on my host, since it will definitely consume more of the machine's power. The second is more work while rating (getting all the game data, submitting the rate, getting the new AVG).
Please advise me: which scenario should I go with? Or maybe you have some other ideas?
I work with PDO and no framework.
So I've finally managed to solve this issue. I used file caching based on dumping arrays into files. I just go with something like if (cache) { $var = cache } else { $var = db }. I am using JG Cache for now, but I'll probably write myself something similar soon; for the moment it's a great solution.
I'd have gone with a variation of your "number 2" solution (update a separate rating column), maybe in a separate table just for this.
If the number of writes becomes a problem, that will happen well after select avg(foo) from ... becomes one, and there are lots of ways to mitigate it, such as updating the average rating periodically or only processing new votes every so often.
Likely you'll eventually not be able to just do an avg() anyway, because you'll have to check each vote for fraud, calculate a sort score, and who knows what else.
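A rough sketch of the "update the average periodically" idea, layered on the rates table from the question; the game_rating summary table, its columns and the refresh schedule are assumptions:

    -- Summary table: one pre-computed row per game.
    CREATE TABLE game_rating (
        game_id  INT UNSIGNED PRIMARY KEY,
        avg_rate DECIMAL(4,2) NOT NULL,
        votes    INT UNSIGNED NOT NULL
    );

    -- Refresh from the raw votes every few minutes (cron job, MySQL event, ...):
    REPLACE INTO game_rating (game_id, avg_rate, votes)
    SELECT game_id, AVG(rate), COUNT(*)
    FROM rates
    GROUP BY game_id;

    -- Rendering a game page then reads one tiny row instead of scanning all votes:
    SELECT avg_rate, votes FROM game_rating WHERE game_id = 123;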
I'm using MongoDB and MySQL for different aspects of an e-commerce site.
One of the features is 'bidding'. The price goes up with each bid.
There are several ways I could do this, such as having a single 'price' column that gets updated, or a column where I simply add prices and get the latest price based on the date, which requires an order by. Also, each new price will be based on the current high price, so I'll need to know the current high price.
I'd like to keep this in the MongoDB portion, but not sure what best way to handle this.
Any suggestions would be great!
Thank you!
You can atomically update documents in MongoDB: there's an $inc operator, so you can atomically update a document's "max price" while also $push-ing the last bidder, the date, and the price increase onto an array, for example. This way you'll never be in danger of having an inconsistent auction document. Using safe mode for writes is necessary too.
Splitting bids into separate documents which you then assemble to find the current price is another solution. It really depends on how much state you're tracking with the bids.
I have a table for news articles, containing, amongst others, the author, the time posted and the word count for each article. The table is rather large, containing more than one million entries and growing by about 10,000 entries each day.
Based on this data, a statistical analysis is done, to determine the total number of words a specific author has published in a specific time-window (i.e. one for each hour of each day, one for each day, one for each month) combined with an average for a time-span. Here are two examples:
- Author A published 3298 words on 2011-11-04 and an average of 943.2 words per day over the two months prior (from 2011-09-04 to 2011-11-03).
- Author B published 435 words on 2012-01-21 between 1pm and 2pm and an average of 163.94 words per day between 1pm and 2pm over the 30 days before.
Current practice is to start a script via cron job at the end of each defined time-window, which calculates the counts and averages and stores them in a separate table for each kind of time-window (i.e. one for the hourly windows, one for the daily, one for the monthly, etc.).
The calculation of sums and averages can easily be done in SQL, so I think Views might be a more elegant solution to this, but I don't know about the implications on performance.
Are Views an appropriate solution to the problem described above?
I think you can use materialized views for this. They're not really implemented in MySQL, but you can emulate them with ordinary tables.
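As a rough sketch of that summary-table approach, with guessed table and column names (articles, author_id, posted_at, word_count) standing in for the real schema:

    -- One row per author per day; refreshed by the existing cron job.
    CREATE TABLE daily_word_count (
        author_id INT UNSIGNED NOT NULL,
        stat_date DATE NOT NULL,
        words     INT UNSIGNED NOT NULL,
        PRIMARY KEY (author_id, stat_date)
    );

    -- Recompute yesterday's numbers once the day is over:
    REPLACE INTO daily_word_count (author_id, stat_date, words)
    SELECT author_id, DATE(posted_at), SUM(word_count)
    FROM articles
    WHERE posted_at >= CURRENT_DATE - INTERVAL 1 DAY
      AND posted_at <  CURRENT_DATE
    GROUP BY author_id, DATE(posted_at);

    -- The two-month average then only touches the small summary table:
    SELECT AVG(words)
    FROM daily_word_count
    WHERE author_id = 42
      AND stat_date BETWEEN '2011-09-04' AND '2011-11-03';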
Views will not be equivalent to your denormalization.
If you are moving aggregate numbers somewhere else, then that has a certain cost, which you are paying in order to keep the data correct, and a certain benefit, which is much less data to look through when querying.
A view will save you from having to think too hard about the query each time you run it, but it will still need to look through the larger amount of data in the original tables.
While I'm not a fan of denormalization, since you already did it, I think the view will not help.
Hey, does anyone know the proper way to set up a MySQL database to gather pageviews? I want to gather these pageviews to display in a graph later. I have a couple ways mapped out below.
Option A:
Would it be better to count pageviews each time someone visits a site and create a new row for every pageview with a timestamp? So, 50,000 views = 50,000 rows of data.
Option B:
Count the pageviews per day and have one row that counts the pageviews: every time someone visits the site, the count goes up. So, 50,000 views = 1 row of data per day. Every day a new row will be created.
Are any of the options above the correct way of doing what I want? Or is there a better, more efficient way?
Thanks.
Option C would be to parse the access logs from the web server. No extra storage is needed, all sorts of extra information is captured, and even requests to images and JavaScript files are included.
However, if you just want to track visits to pages where you run your own code, I'd definitely go for Option A, unless you're expecting extreme amounts of traffic on your site.
That way you can create overviews per hour of the day, and store more information than just the timestamp (like the visited page, the user's browser, etc.). You might not need that now, but later on you might thank yourself for not losing that information.
If at some point the table grows too large, you can always think of ways on how to deal with that.
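A minimal sketch of what Option A could look like; the table name, the columns and the per-hour query are purely illustrative:

    CREATE TABLE pageview (
        id         BIGINT UNSIGNED AUTO_INCREMENT PRIMARY KEY,
        viewed_at  DATETIME NOT NULL,
        page       VARCHAR(255) NOT NULL,
        user_agent VARCHAR(255),
        INDEX (viewed_at)
    );

    -- One insert per visit:
    INSERT INTO pageview (viewed_at, page, user_agent)
    VALUES (NOW(), '/some/page', 'Mozilla/5.0 ...');

    -- Per-hour overview for the graph:
    SELECT DATE_FORMAT(viewed_at, '%Y-%m-%d %H:00') AS hour_bucket, COUNT(*) AS views
    FROM pageview
    GROUP BY hour_bucket
    ORDER BY hour_bucket;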
If you care about how your pageviews vary with time in a day, option A keeps that info (though you might still do some bucketing, say per-hour, to reduce overall data size -- but you might do that "later, off-line" while archiving all details). Option B takes much less space because it throws away a lot of info... which you might or might not care about. If you don't know whether you care, I think that, in doubt, you should keep more data rather than less -- it's reasonably easy to "summarize and archive" overabundant data, but it's NOT at all easy to recover data you've aggregated away;-). So, aggregating is riskier...
If you do decide to keep abundant per-day data, one strategy is to use multiple tables, say one per day; this will make it easiest to work with old data (summarize it, archive it, remove it from the live DB) without slowing down current "logging". So, say, pageviews for May 29 would be in PV20090529 -- a different table than the ones for the previous and next days (this does require dynamic generation of the table name, or creative uses of ALTER VIEW e.g. in cron-jobs, etc -- no big deal!). I've often found such "sharding approaches" to have excellent (and sometimes unexpected) returns on investment, as a DB scales up beyond initial assumptions, compared to monolithic ones...
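A quick sketch of that per-day idea, reusing the illustrative pageview table from above and the PV20090529 naming from this answer; the daily tables and the UNION ALL view are just one possible way to set it up:

    -- Each day gets its own table, created from a template:
    CREATE TABLE pv20090529 LIKE pageview;

    -- The app (or a cron job) picks "today's" table name dynamically; older tables
    -- can be summarized, archived or dropped without touching the live one.
    INSERT INTO pv20090529 (viewed_at, page, user_agent)
    VALUES (NOW(), '/some/page', 'Mozilla/5.0 ...');

    -- Reporting across several days can go through a view that is re-pointed each day:
    CREATE OR REPLACE VIEW pv_recent AS
        SELECT * FROM pv20090528
        UNION ALL
        SELECT * FROM pv20090529;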