I am building a less-than-traditional stock system to power a browser/mobile game. The basic principle is that each building holds stock of certain resources. Every hour, each building runs a production cycle that consumes its imports from stock and adds its exports to stock. Production is determined by the building's structure (its type), its level, and its capacity.
My dilemma is how to lay out these stock tables in a scalable way. One option is to build tables so that each column is a resource. Example:
building_id | structure_id | energy | food | water
--------------------------------------------------
1 | 1 | 459 | 19 | 0
The benefit of this method is that I can write a few handy views and events and drive this logic entirely from MySQL. I can fire one big UPDATE statement every hour to apply the production transactions.
The downside of this method is that I have to add each resource as a column in my tables. I am projecting only 150 or so resources.
The other option I have been playing with is building this like a basic inventory system. So, having a stock table that looks like this:
stock_id | building_id | resource_id | qty
-------------------------------------------
1 | 1 | 3 | 19
4 | 1 | 2 | 0
5 | 1 | 1 | 459
The benefit of this method is scalability: new resources can be added easily in code to enhance gameplay.
The downside of this method is that one building's production now takes multiple SELECT and UPDATE statements, and that repeats for every building. I plan to have a server limit of 250k buildings, so this can become taxing.
All in all, I am looking for the optimal way of doing this. I will have a finite set of resources, and I could use query-building code to generate upgrade classes that handle adding a resource, but that becomes a large amount of code just to build the database.
Anyone have any thoughts on this?
Edit:
I am adding how the production sequence works for clarity.
The building has to check what it needs to import from stock and how much capacity that will free up.
The building has to check what it needs to export into stock and how much capacity that will take up.
A building's imports and exports come from the structures table and are multiplied by the building's level.
If capacity is not exceeded and all needed resources are available, the building transforms the stock.
Right now this all runs correctly from one single UPDATE statement over all buildings, and quite quickly (not tested on sets larger than 100 yet). But that relies on the design with each resource as a column. I could achieve the same behaviour with proper inventory-style tables, but I would need 150 LEFT JOINs (there are 150 resources).
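For illustration, here is a rough sketch of how a single-statement production pass could still work against the inventory-style layout without 150 LEFT JOINs. It assumes a hypothetical structure_production(structure_id, resource_id, delta) table (delta negative for imports, positive for exports); the capacity and "all imports available" checks are left out, and that is exactly where this layout gets harder:

-- Rough sketch only; structure_production and its delta column are assumptions.
UPDATE stock s
JOIN buildings b            ON b.building_id  = s.building_id
JOIN structure_production p ON p.structure_id = b.structure_id
                           AND p.resource_id  = s.resource_id
SET s.qty = s.qty + p.delta * b.level;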
Ditch the 150-resource-columns notion. Force the joins to behave with index hints after an ANALYZE TABLE xxxx call.
Verify the plan with the EXPLAIN command. Make the calls through stored procedures.
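A minimal sketch of that workflow; the table and index names are placeholders (idx_building_resource is made up):

ANALYZE TABLE stock;     -- refresh the optimizer's index statistics

EXPLAIN
SELECT s.building_id, s.resource_id, s.qty
FROM stock s USE INDEX (idx_building_resource)   -- hint: force the intended index
JOIN buildings b ON b.building_id = s.building_id
WHERE b.structure_id = 1;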
I realize this is a game you are constructing. I worked on a large-map MMOG with similar structures, items, and states. The data layer was highly optimized, otherwise it would have wrecked the user experience. Lots of memcache.
Data is only important as needed. You do not approach a building and fetch every attribute about it. Why is that?
1) It's not needed now. Who cares that the antenna is blown? It's irrelevant; you are 90 feet from water, how would you use it anyway?
2) It's slow.
3) It becomes stale.
That is all pull technology: the client manually pulls it.
As for push from the server (we had the benefit of an open socket), these items are critical and need to be near real-time (<80 ms):
1) Player positions and how they are equipped.
2) Base status. This is important: what is where in the base and what state it is in. Users constantly grab this from mini-maps.
3) Your own player, stats in particular, partly to prevent hacks.
These pushed items resided in memcache 90% of the time, in the structure most friendly to the client side; hitting the database directly cannot get anywhere near that performance.
Push also covers things that are not in memcache but are happening right in front of the player's face, or behind it, like getting shot in the head.
Naturally the player isn't pulling that; it occurs independently of walking or zooming.
Obviously a single row with all the info and no joins is nice. It wasn't suitable for us.
Imagine the following BIG-DATA situation:
There are 1 million persons stored in a SQL database.
Each of them follows exactly 50 other persons.
So there is a table like this (with 50 million entries):
person1 | person2
0 | 1
0 | 2.341
0 | 212.881
.. | ..
999.999 | 421.111
999.999 | 891.129
999.999 | 920.917
Is it possible to use Oracle's connect by or MySQL's WITH RECURSIVE to find out if there is a connection (maybe over intermediaries) from one person to another?
Would those queries literally run forever? (the data are highly connected)
Or is there a way to limit the depth of the queries? (in this case: only < 3 intermediaries)
Context: this example will be used to explain why a graph database can be better in some cases, and I want to show whether this is even solvable in SQL.
Is it possible to use Oracle's connect by or MySQL's WITH RECURSIVE to find out if there is a connection (maybe over intermediaries) from one person to another?
Yes. That's the purpose of those features.
Would those queries literally run forever? (the data are highly connected)
As with all SQL queries, appropriate indexes are vital for good performance.
As for "forever" Oracle detects loops in hierarchies (that is, when the data breaks the assumption that it is a directed acyclic graph.)
Recursive common table expressions (in most non-Oracle table servers) can have their recursion limited by level. See this https://dba.stackexchange.com/questions/16111/cte-running-in-infinite-loop.
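A hedged sketch of a depth-limited reachability check in MySQL 8+; the follows(person1, person2) table name is assumed from the example data, and @start/@target stand for the two person ids:

WITH RECURSIVE reachable (person, depth) AS (
  SELECT person2, 1
  FROM follows
  WHERE person1 = @start
  UNION ALL
  SELECT f.person2, r.depth + 1
  FROM follows f
  JOIN reachable r ON f.person1 = r.person
  WHERE r.depth < 3        -- extend only while depth < 3: at most 3 hops, i.e. at most 2 intermediaries
)
SELECT COUNT(*) > 0 AS connected
FROM reachable
WHERE person = @target;

The depth cap also bounds the recursion even when the data contains cycles, at the cost of possibly visiting the same person more than once.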
Is it better to do this kind of work with a graph database? That's a matter of opinion.
You still need loop detection.
In production, moving data from one database to another, or keeping copies in multiple places, is costly. So your pragmatic design choice will be influenced by where your system stores the data.
I'm familiar with normalized databases and I'm able to produce all kinds of queries. But since I'm starting on a green-field project now, one question has kept me busy this week:
It's the typical "webshop problem", I'd say (even though I'm not building a webshop): how do you model the product information?
There are some approaches, each with its own advantages or disadvantages:
One Table to rule them all
Putting every "product" into a single table, generating every column possible and working with this monster-table.
Pro:
Easy queries
Easy layout
Con:
Lots of NULL values
The application code becomes sensitive to the query (different product types require different columns)
EAV-Pattern
Obviously the EAV pattern can provide a nicer solution for this. However, I've worked with EAV in the past, and when it comes down to performance it can become a problem with a huge number of entries.
Searching is easy, but listing the data as a "normalized table" requires one join per actual column, which is slow (see the sketch after this list).
Pro:
Clean
Flexible
Con:
Performance
Not Normalized
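To make the one-join-per-column point concrete, a hedged sketch with hypothetical product(id, name) and product_attribute(product_id, attribute, value) tables: each column you want back in the listing costs one extra join.

SELECT p.id, p.name,
       colour.value AS colour,
       weight.value AS weight
FROM product p
LEFT JOIN product_attribute colour
       ON colour.product_id = p.id AND colour.attribute = 'colour'
LEFT JOIN product_attribute weight
       ON weight.product_id = p.id AND weight.attribute = 'weight';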
Single Table per category
Basically the opposite of the EAV pattern: create one table per product type, e.g. "cats", "dogs", "cars", ...
While this might be possible for a manageable number of categories, it becomes a nightmare for a steadily growing number of categories if you have to maintain them.
Pro:
Clean
Performance
Con:
Maintenance
Query-Management
Best of both worlds
So, on my journey through the internet I found recommendations to mix both approaches: use a single table for the common information, while grouping other attributes into "attribute groups" organized in the EAV fashion.
However, I think this basically imports the drawbacks of EACH approach: you need to work with regular tables (basic information) and still do a huge number of joins to get ALL the information.
Storing enhanced information in JSON/XML
Another approach is to store extended information as JSON/XML entries (within a column of the "root table").
However, I don't really like this, as it seems harder to query and work with than a regular database layout.
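For reference, a hedged sketch of what this looks like on MySQL 5.7+ with a JSON column; the table and attribute names are made up:

CREATE TABLE products (
  id         INT AUTO_INCREMENT PRIMARY KEY,
  category   VARCHAR(50)  NOT NULL,
  name       VARCHAR(255) NOT NULL,
  attributes JSON
);

-- Querying into the JSON works, but it is clumsier than a real column:
SELECT id, name
FROM products
WHERE category = 'cars'
  AND JSON_EXTRACT(attributes, '$.wheels') = 4;

MySQL 5.7+ can also index a generated column extracted from the JSON, which takes some of the sting out of the "harder to query" concern.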
Automating single tables
Another idea was to automate the creation of one table per category (and therefore automate the queries on them), while maintaining a "master table" containing just the id and the category information, in order to get the best performance for an undetermined number of tables.
i.e.:
Products
id | category | actualId
1 | cat | 1
2 | car | 1
cats
id | color | mew
1 | white | true
cars
id | wheels | bhp
1 | 4 | 123
The (abstract) Products table would allow querying for everything, while details are available via an easy join on "actualId" with the responsible table.
However, this leads to problems if you want to run a "show all" query, because that is not solvable with SQL alone: the table name in the join needs to be explicit in the query.
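The "easy join" for one known category would look something like this; the table name has to be hard-coded, which is exactly the limitation described above:

SELECT p.id, p.category, c.color, c.mew
FROM Products p
JOIN cats c ON c.id = p.actualId
WHERE p.category = 'cat';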
What other options are available? There are a lot of webshops out there, each dealing with this problem in one way or another; how do they solve it efficiently?
I strongly disagree with your opinion that the "monster" table approach leads to "easy queries", and with the claim that the EAV approach will cause performance issues (premature optimization?). EAV doesn't have to require complex queries:
SELECT base.id, base.other_attributes,
       GROUP_CONCAT(CONCAT(ext.key, '[', ext.type, ']', ext.value)) AS ext_attributes
FROM base_attributes base
LEFT JOIN extended_attributes ext
       ON base.id = ext.id
WHERE base.id = ?
GROUP BY base.id, base.other_attributes;
You would need to do some parsing on the above, but a wee bit of polishing would give you something parseable as JSON or XML without putting your data inside anonymous blobs.
If you don't care about data integrity and are happy to solve performance via replication, then NoSQL is the way to go (this is really the same thing as using JSON or XML to store your data).
First off, I am new to database design, so apologies for any incorrect terminology.
For a university assignment I have been tasked with creating the database schema for a website. In part of the website, a user selects their availability for hosting an event, but the event can be at any time: for example from 12/12/2015 - 15/12/2015 and from 16/01/2016 - 22/12/2016, as well as single dates such as 05/01/2016. They also have the option of having the event available all the time.
So I am unsure how to store all these kinds of values in a database table without using a lot of rows. The example below is a basic one that stores each date of availability, but that is a lot of records, and that is just for one event. Is there a better method of storing these values, or should they be stored elsewhere, outside of a database?
calendar_id | event_id | available_date
---------------------------------------
492 | 602 | 12/12/2015
493 | 602 | 13/12/2015
494 | 602 | 14/12/2015
495 | 602 | 15/12/2015
496 | 602 | 05/01/2016
497 | 602 | 16/01/2016
498 | 602 | 17/01/2016
etc...
This definitely requires a database. I don't think you should be concerned about the number of records in a database... that is what databases do best. However, from a university perspective there is something called Normalization. In simple terms normalization is about minimizing data repetition.
Steps to design a schema
Identify entities
As the first step of designing a database schema I tend to identify all the entities in the system. Looking at your example I see (1) Events and (2) EventTimes (event occurrences/bookings) with a one-to-many relation since one Event might have multiple EventTimes. I would suggest that you keep these two entities separate in the database. That way an Event can be extended with more attributes/fields without affecting its EventTimes. Most importantly you can add many EventTimes on an Event without repeating all the event's fields (which would be the case if you use a single table).
Identify attributes
The second step for me is to identify all the attributes/fields of each entity. Additionally, I always suggest an auto-increment id in every table to uniquely identify a row.
Identify constraints
This might be a bit more advanced, but most of the time you have constraints on what counts as acceptable data values or what uniquely identifies a row in real life. For example, Event.id might identify the row in the database, but you might also require that each event has a unique title.
Example schema
This has to be adjusted to the assignment or, in a real application, to the system's requirements (a CREATE TABLE sketch follows the field lists below).
Events table
id int auto-increment
title varchar unique: Event's title
always_on boolean/enum: If 'Y' then the event is on all the time
... more fields here ... (category, tags, notes, description, venue,...)
EventTimes
id int auto-increment
event_id foreign key pointing to Event.id
start_datetime datetime or int (int if you go for a unix timestamp)
end_datetime : as above
... more fields again... (recursion below is a hard one! avoid it if you can)
recursion enum/int: Is the event repeated? Weekly, monthly, etc.
recursion_interval int: Every x days, months, years, etc
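A hedged CREATE TABLE sketch of the two tables described above; column types and lengths are assumptions, and the recursion fields are left out:

CREATE TABLE events (
  id        INT AUTO_INCREMENT PRIMARY KEY,
  title     VARCHAR(255) NOT NULL UNIQUE,
  always_on BOOLEAN NOT NULL DEFAULT FALSE
);

CREATE TABLE event_times (
  id             INT AUTO_INCREMENT PRIMARY KEY,
  event_id       INT NOT NULL,
  start_datetime DATETIME NOT NULL,
  end_datetime   DATETIME NOT NULL,
  FOREIGN KEY (event_id) REFERENCES events (id)
);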
A note on dates/times: as a rule of thumb, whenever you deal with dates and times in a database, always store them in UTC. You probably don't want/need to mess with timezones in an assignment... but keep it in mind.
Possible extensions to the example
Designing a complete system, one might add tables such as Venues, Organizers, Locations, etc. This can go on forever! I do try to think of future requirements when designing, but don't overdo it, because you end up with a lot of fields you never use and with increased complexity.
Conclusion
Normalization is something you have to keep in mind when designing a database; however, you can see that the more you normalize your schema, the more complex your selects and joins become. There is a trade-off between data efficiency and query efficiency... That is the reason I said "from a university perspective" earlier. In a real-life system with complex data structures (for example, graphs!) you might need to under-normalize the tables to make your queries more efficient, faster, or simpler. There are other approaches to deal with such issues (functions in the database, temporary/staging tables, views, etc.), but it always depends on the specific case.
Another really useful thing to keep in mind: requirements always change! Design your databases taking for granted that fields will be added and removed, more tables will be added, new constraints will appear, etc., and make them as extensible and easy to modify as possible (now we are scratching a bit at "Agile" methodologies).
I hope this helps and does not confuse things more. I am not a DBA per se, but I have designed a few schemas. All of the above comes from experience rather than a book, and it may not be 100% accurate. It is definitely not the only way to design a database... this job is kind of an art :)
I am using MySQL, InnoDB, and running it on Ubuntu 13.04.
My general question is: If I don't know how my database is going to evolve or what my needs will eventually be, should I not worry about redundancy and relationships now?
Here is my situation:
I'm currently building a baseball database from scratch, but I am unsure how I should proceed. Right now, I'm approaching the design in a modular fashion. For example, I am currently writing a Python script to parse the XML feed of a sports betting website, which tells me the money line and the over/under. Since I need to start recording the information, I am wondering if I should just go ahead and populate the tables and worry about keys and such later.
So, for example, my Python sports-odds scraping script would populate three tables (Game, Money Line, Over/Under) like so:
DateTime = Date and time of observation
Game
+-----------+-----------+--------------+
| Home Team | Away Team | Date of Game |
+-----------+-----------+--------------+
Money Line
+-----------+-----------+--------------+-----------+-----------+----------+
| Home Team | Away Team | Date of Game | Home Line | Away Line | DateTime |
+-----------+-----------+--------------+-----------+-----------+----------+
Over/Under
+-----------+-----------+--------------+-----------+-----------+----------+----------+
| Home Team | Away Team | Date of Game | Total | Over | Under | DateTime |
+-----------+-----------+--------------+-----------+-----------+----------+----------+
I feel like I should be doing something with the redundant (home team, away team, date of game) columns of information, but I don't really know how my database is going to expand, and in what ways I will be linking everything together. I'm basically building a database so I can answer complicated questions such as:
How does weather in Detroit affect the betting lines when Justin Verlander is pitching against teams who have averaged 5 or fewer runs per game for 20 games prior to the appearance against Verlander? (As you can see, complex questions create complex relationships and queries.)
So is it alright if I go ahead and start collecting data as shown above, or is this going to create a big headache for me down the road?
The topic of future proofing a database is a large one. In general, the more successful a database is, the more likely it is to be subjected to mission creep, and therefore to have new requirements.
One very basic question is this: who will be providing the new requirements? From the way you wrote the question, it sounds like you have built the database to fit your own requirements, and you will also be inventing or discovering the new requirements down the road. If this is not true, then you need to study the evolving pattern of your client(s) needs, so as to at least guess where mission creep is likely to lead you.
Normalization is part of the answer, and this aspect has been dealt with in a prior answer. In general, a partially denormalized database is less future-proof than a fully normalized database. A denormalized database has been adapted to present needs, and the more adapted something is, the less adaptable it is. But normalization is far from the whole answer. There are other aspects of future proofing as well.
Here's what I would do. Learn the difference between analysis and design, especially with regard to databases. Learn how to use ER modeling to capture the present requirements WITHOUT including the present design. Warning: not all experts in ER modeling use it to express requirements analysis. In particular, you omit foreign keys from an analysis model because foreign keys are a feature of the solution, not a feature of the problem.
In parallel, maintain a relational model that conforms to the requirements of your ER model and also conforms to rules of normalization, and other rules of simple sound design.
When a change comes along, first see if your ER model needs to be updated. Sometimes the answer is no. If the answer is yes, first update your ER model, then update your relational model, then update your database definitions.
This is a lot of work. But it can save you a lot of work, if the new requirements are truly crucial.
Try normalizing your data (so that you do not have redundant info) like:
Game
+---+-----------+-----------+--------------+
|ID | Home Team | Away Team | Date of Game |
+---+-----------+-----------+--------------+
Money Line
+-----------+-----------+--------------+-----------+
| Game_ID | Home Line | Away Line | DateTime |
+-----------+-----------+--------------+-----------+
Over/Under
+-----------+-----------+--------------+-----------+-----------+
| Game_ID | Total | Over | Under | DateTime |
+-----------+-----------+--------------+-----------+-----------+
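For concreteness, a hedged DDL/query sketch of this normalized layout; names and types are assumptions (the Over/Under table would follow the same pattern):

CREATE TABLE game (
  id        INT AUTO_INCREMENT PRIMARY KEY,
  home_team VARCHAR(50) NOT NULL,
  away_team VARCHAR(50) NOT NULL,
  game_date DATE NOT NULL,
  UNIQUE (home_team, away_team, game_date)
);

CREATE TABLE money_line (
  game_id     INT NOT NULL,
  home_line   DECIMAL(7,2),
  away_line   DECIMAL(7,2),
  observed_at DATETIME NOT NULL,
  FOREIGN KEY (game_id) REFERENCES game (id)
);

-- The original flat shape is recovered with a simple join:
SELECT g.home_team, g.away_team, g.game_date,
       m.home_line, m.away_line, m.observed_at
FROM game g
JOIN money_line m ON m.game_id = g.id;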
You can read more on NORMALIZATION here
We have a MySQL database table for products. We are using a cache layer to reduce database load, but we think it's a good idea to minimize the actual amount of data stored in the cache layer to speed up the application further.
All the products in the database that are visible to visitors have a price attached to them.
The prices are stored in a different table, called prices. There are multiple price categories, depending on which discount level applies to each visitor (customer). From time to time there are campaigns, which means a special price for each product is available. The special prices are stored in a table called specials.
Is it bad to make a temp table that binds the tables together?
It would only have the necessary information and would of course be cached.
productId | hasPrice | hasSpecial
---------------------------------
1         | 1        | 0
2         | 1        | 1
This way it would be super easy to know whether a specific product really has a price, without having to scan the complete prices or specials table each time a product is listed or presented.
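For illustration, a rough sketch of building such a table in one pass (the prices/specials column names are assumptions):

-- The whole flag table can be (re)built in a single statement, no per-product iteration.
CREATE TEMPORARY TABLE product_price_flags AS
SELECT p.id AS productId,
       EXISTS (SELECT 1 FROM prices   pr WHERE pr.product_id = p.id) AS hasPrice,
       EXISTS (SELECT 1 FROM specials sp WHERE sp.product_id = p.id) AS hasSpecial
FROM products p;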
Are temp tables a common thing for web applications or is it just bad design?
If you're going to cache this data anyways, does it really need to be in a temp table? You would only incur the overhead of the query when you needed to rebuild the cache, so the temp table might not even be necessary.
You should approach it like any other performance problem: decide how much performance is necessary, then iterate, testing on production-grade hardware in your lab. Do not make needless optimisations.
You should profile your app and discover whether it's doing too many queries or whether the queries themselves are slow; in my experience, most cases of web-app slowness are caused by doing too many queries, even when the individual queries are very cheap.
Normally the best engineering solution is to restructure the database, in some cases denormalising, so that the common read use cases require fewer queries. Caching may help as well, but refactoring so you need fewer queries is often best.
Essentially you can increase the amount of work on the write-path to reduce the amount on the read-path, if you are planning to do a lot more reading than writing.
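A hedged sketch of what that trade can look like here: keep a denormalized flag on a hypothetical products table and refresh it at write time, so reads never have to touch specials (column names are assumptions):

ALTER TABLE products ADD COLUMN has_special TINYINT(1) NOT NULL DEFAULT 0;

-- Refresh the flag whenever specials change:
UPDATE products p
SET p.has_special = EXISTS (SELECT 1 FROM specials s WHERE s.product_id = p.id);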