I recently inherited a table similar to the one in the image below. I don't have the resources to do what should be done in the allotted time, which is obviously to normalize the data into separate tables to eliminate redundancy, etc.
My current idea for a short-term solution is to create a query for each product type and store the results in a new table based on ParentSKU. In the image below, a different query would be needed for each of the 3 example ParentSKUs. This will work okay, but if new attributes are added to a SKU, the query has to be adjusted manually. What would be ideal in the short term (but probably not very likely) is a single query that only includes and displays attributes that have no NULL values. The desired results for each of the three ParentSKUs would be the same as in the examples below. If there were only 3 queries total, that would be easy enough, but there are dozens of combinations based on the products and categories of each product.
I'm certainly not the man for the job, but there are scores of people way smarter than I am who frequent this site every day and may be able to steer me in a better direction. I realize I'm probably asking for the impossible here, but as the saying goes, "There are no stupid questions, only ill-advised questions that deservedly and/or inadvertently draw the ire of StackOverflow users for various reasons." Okay, I embellished a tad, but you get my point...
I should probably add that this is currently a MySQL database.
Thanks in advance to anyone that attempts to help!
First, create SKUTypes with the result of:
SELECT ParentSKU, COUNT(Attr1) AS Attr1, ...
FROM tbl_attr
GROUP BY ParentSKU;
Then create a script that generates a SQL query for every row of SKUTypes, selecting every AttrN column whose value is > 0.
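For illustration, here is a minimal sketch of that generation step done inside MySQL itself with a prepared statement. It assumes only three attribute columns, a SKU column alongside them, and a hypothetical ParentSKU value; adjust for the real number of AttrN columns.

-- Per ParentSKU, count how many non-NULL values each attribute has.
CREATE TABLE SKUTypes AS
SELECT ParentSKU,
       COUNT(Attr1) AS Attr1,
       COUNT(Attr2) AS Attr2,
       COUNT(Attr3) AS Attr3
FROM tbl_attr
GROUP BY ParentSKU;

-- Generate and run the SELECT for one ParentSKU, keeping only the attributes
-- that are actually populated for that parent.
SET @sku = 'PARENT-001';

SELECT CONCAT('SELECT SKU',
              IF(Attr1 > 0, ', Attr1', ''),
              IF(Attr2 > 0, ', Attr2', ''),
              IF(Attr3 > 0, ', Attr3', ''),
              ' FROM tbl_attr WHERE ParentSKU = ', QUOTE(ParentSKU))
  INTO @stmt
FROM SKUTypes
WHERE ParentSKU = @sku;

PREPARE q FROM @stmt;
EXECUTE q;
DEALLOCATE PREPARE q;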
Okay, so first of all let me tell you a little about what I'm trying to do. Basically, during my studies I wrote a little web service in PHP that calculates how similar movies are to each other based on some measurable attributes like length, actors, directors, writers, genres, etc. The data I used for this was basically a collection of data acquired from omdbapi.com.
I still have that database, but it is technically just a SINGLE table that contains all the information for each movie. This means that for each movie, all the above-mentioned parameters are stored as comma-separated strings. So far I have therefore used a query that handles all of this with LIKE statements. The query can become quite large, as I pretty much query for every parameter within the table, sometimes with 5 different LIKE statements for different actors, and the same for directors and writers. Back when I last used this, it took about 30 to 60 seconds to enter a single movie and receive a list of 15 similar ones.
Now I have started my first job, and to teach myself in my free time, I want to work on my own website. Because I have no real concept of what I want to do with it, I thought I'd get out my old "movie finder" again and use it differently this time.
Now, to challenge myself, I want the whole thing to be faster. Understand that the data is NEVER changed, only read. It is also not "really" relational, as actor names and such are just strings and have no real entry anywhere else, which essentially means that two entries with the same name will be treated as the same actor.
Now here comes my actual question:
Assuming I want my SELECT queries to run faster, would it make sense to run a script that splits the comma-separated strings into extra tables (these are n-to-m relations, see the attempt below) and then JOIN all these tables (there will be 8 or more), or will using LIKE as I currently do be about the same speed? The ONLY thing I am trying to achieve is faster SELECT queries, as there is nothing else to really do with the data.
This is what I currently have. Keep in mind, I would still have to create tables for the relation between movies and each of these things. After doing that, I could remove the columns from the movie table and would end up having to join a lot of tables with EACH query. The only real advantage I can see here is that it would be easier to create an index on the individual tables, rather than one (or a few) covering the one big movie table.
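To make this more concrete, here is roughly what I mean for one of the n-to-m relations (actors); the table and column names are just placeholders, and the other relations (directors, writers, genres, ...) would follow the same pattern:

CREATE TABLE movies (
  id    INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
  title VARCHAR(255) NOT NULL
);

CREATE TABLE actors (
  id   INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
  name VARCHAR(255) NOT NULL,
  UNIQUE KEY uq_actor_name (name)   -- same name = same actor, as described above
);

CREATE TABLE movie_actors (
  movie_id INT UNSIGNED NOT NULL,
  actor_id INT UNSIGNED NOT NULL,
  PRIMARY KEY (movie_id, actor_id),
  KEY idx_actor (actor_id),
  FOREIGN KEY (movie_id) REFERENCES movies (id),
  FOREIGN KEY (actor_id) REFERENCES actors (id)
);

-- "Movies sharing at least one actor with movie 42" then becomes an indexed
-- join instead of several LIKE '%...%' scans over the one big table.
SELECT m.id, m.title, COUNT(*) AS shared_actors
FROM movie_actors ma1
JOIN movie_actors ma2 ON ma2.actor_id = ma1.actor_id
                     AND ma2.movie_id <> ma1.movie_id
JOIN movies m         ON m.id = ma2.movie_id
WHERE ma1.movie_id = 42
GROUP BY m.id, m.title
ORDER BY shared_actors DESC
LIMIT 15;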
I hope all of this even makes sense to you. I appreciate any answer, short or long; like I said, this is mostly for self-study, and as such I don't have/need a real business model.
I don't understand what you currently have. It seems that you only showed the size of the tables but not their internal structure. You need to separate the data into separate tables using normalization rules and then put correct indexes on them; indexes will make your queries very fast. What does the sizing above your query mean? Have you ever run EXPLAIN ANALYZE for your queries? Please post the query; I cannot guess your query from the result alone. There are also a lot of optimization videos on YouTube.
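As a starting point, on MySQL 8.0.18+ you can prefix one of your existing LIKE queries with EXPLAIN ANALYZE; the query below is only a hypothetical example of the shape you described:

EXPLAIN ANALYZE
SELECT id, title
FROM movies
WHERE actors LIKE '%Tom Hanks%'
  AND genres LIKE '%Drama%';
-- The output shows the chosen plan and the actual time and rows per step;
-- leading-wildcard LIKEs will show up as full table scans that no index can help.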
Given a series of complex websites that all use the same user-tracking MySQL database. (This is not our exact situation, but a simplification of it to keep this post as brief/efficient as possible.)
We don't always know where a user is when he starts using a site. In fact, there are about 50 points in the code where the country field might get updated. We might collect it from his IP address when he uses the site. We might get it when he uses his credit card. We might get it when he fills out a form. Heck, we might get it when we talk to him on the phone.
Assume a simple structure like:
CREATE TABLE `Users` (
  `ID` INT NOT NULL AUTO_INCREMENT,
  `Country` VARCHAR(45) NULL,
  PRIMARY KEY (`ID`)
);
What I'm wondering is: what is the best way to keep track of one more scrap of information about this person:
`Number_of_Users_in_My_Country`.
I know I could run a simple query to get it with each record. But I constantly need two other bits of information (sketched as queries below). Keep in mind that I'm not really dealing with countries but with other groups that number in the 100,000s; again, countries are just to keep this post simple.
User count by Country and
Selection of countries with less than x users.
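Written out against the table above, those two queries would be something like this (the threshold of 100 is just a placeholder for x):

-- User count by country
SELECT Country, COUNT(*) AS user_count
FROM Users
GROUP BY Country;

-- Countries with fewer than x users
SELECT Country
FROM Users
GROUP BY Country
HAVING COUNT(*) < 100;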
I'm wondering if I should create a trigger that fires when the country value changes and updates the Number_of_Users_in_My_Country field?
As I'm new to MySQL, I would love to hear thoughts on this or any other approach.
Lots of people will tell you not to do that, because it's not normalized. However, if it's trivial to keep an aggregate value (to save complex joins in certain queries), I'd say go for it. Keep in mind with your triggers that you can't update the same table as the trigger's definition, so be careful in defining how certain events propagate updates to other tables, lest you get in a loop.
An additional recommendation: I would keep a table for countries, and use a foreign key reference from Users to Countries. Then in countries, have a column for total users in that country. Users_in_my_country seems to have a very specific use, and it would be easier to maintain from the countries' perspective.
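A rough sketch of that suggestion (the names, types, and trigger body are assumptions, not your actual schema):

CREATE TABLE `Countries` (
  `ID`         INT NOT NULL AUTO_INCREMENT,
  `Name`       VARCHAR(45) NOT NULL,
  `TotalUsers` INT NOT NULL DEFAULT 0,
  PRIMARY KEY (`ID`)
);

ALTER TABLE `Users`
  ADD COLUMN `CountryID` INT NULL,
  ADD CONSTRAINT `fk_users_country`
      FOREIGN KEY (`CountryID`) REFERENCES `Countries` (`ID`);

-- Keep the aggregate current when a user's country changes; the trigger sits on
-- Users but only updates Countries, so it stays clear of the same-table rule.
-- A similar AFTER INSERT trigger would cover brand-new users.
DELIMITER //
CREATE TRIGGER `trg_users_country` AFTER UPDATE ON `Users`
FOR EACH ROW
BEGIN
  IF NOT (OLD.CountryID <=> NEW.CountryID) THEN
    UPDATE Countries SET TotalUsers = TotalUsers - 1 WHERE ID = OLD.CountryID;
    UPDATE Countries SET TotalUsers = TotalUsers + 1 WHERE ID = NEW.CountryID;
  END IF;
END//
DELIMITER ;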
Given that you've simplified the question somewhat, it's hard to be totally precise.
In general, if at all possible, I prefer to calculate these derived values on the fly. And to find out if that's workable, I prefer to try it out; 100,000x records is not a particularly scary number, and I'd much rather spend time tuning a query/indexing scheme once than deal with maintenance craziness for the life of the application.
If you've tried that and still can't get it to work, my next consideration would be to work with stale/cached data. It all depends on your business, but if it's okay for the "number of users in my country" value to be slightly out of date, then calculating these values and caching them in the application layer would be much better. There are lots of pre-existing caching libraries you can use, caching is well understood by most developers, and on high-traffic web sites, caching for even a few seconds can have a dramatic effect on performance and scalability. Alternatively, have a script that populates a table "country_usercount" and run it every minute or so.
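For example, the periodic-script variant could be as simple as rebuilding a small summary table every minute (the table and column names here are hypothetical):

CREATE TABLE country_usercount (
  country    VARCHAR(45) NOT NULL PRIMARY KEY,
  user_count INT NOT NULL
);

-- Run from cron or the MySQL event scheduler every minute or so.
DELETE FROM country_usercount;
INSERT INTO country_usercount (country, user_count)
SELECT Country, COUNT(*)
FROM Users
WHERE Country IS NOT NULL
GROUP BY Country;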
If the data must, absolutely, be fresh, I'd include the logic to update the counts in the application layer code - it's a bit ugly, but it's easy to debug, and behaves predictably. So, every time the event fires that tells you which country the user is from, you update the country_usercount table from the application code.
The reason I dislike triggers is that they can lead to horrible, hard to replicate bugs and performance issues - if you have several of those aggregated pre-calculated fields, and you write a trigger for each, you could easily end up with lots of unexpected database activity.
I'm creating an order system to keep track of orders. There are about 60 products or so, each with its own price. The system isn't very complicated, though; I just need to be able to submit how many of each product the person orders.
My question is: is it more efficient to have an 'orders' table with columns representing each product and numeric values representing how many of each they ordered? Example:
orders
  id
  product_a
  product_b
  product_c
  etc...
OR
should I break it into different tables, with a many-to-many table to join them? Something like this, maybe:
customers
  id
  name
  email

orders
  id
  customer_id

products
  id
  product

orders_products
  order_id
  product_id
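For concreteness, the second option might look roughly like this (the column types, the price column, and the quantity column are my own additions, since I need to record how many of each product was ordered):

CREATE TABLE customers (
  id    INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
  name  VARCHAR(100) NOT NULL,
  email VARCHAR(255) NOT NULL
);

CREATE TABLE orders (
  id          INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
  customer_id INT UNSIGNED NOT NULL,
  FOREIGN KEY (customer_id) REFERENCES customers (id)
);

CREATE TABLE products (
  id      INT UNSIGNED PRIMARY KEY AUTO_INCREMENT,
  product VARCHAR(100) NOT NULL,
  price   DECIMAL(10,2) NOT NULL
);

CREATE TABLE orders_products (
  order_id   INT UNSIGNED NOT NULL,
  product_id INT UNSIGNED NOT NULL,
  quantity   INT UNSIGNED NOT NULL DEFAULT 1,
  PRIMARY KEY (order_id, product_id),
  FOREIGN KEY (order_id)   REFERENCES orders (id),
  FOREIGN KEY (product_id) REFERENCES products (id)
);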
I would break it apart like you show in your second sample. This will make your application much more scalable and will still be quite efficient.
Always build with future features and expansion in mind. A shortcut here or there always seems to bite you later, when you have to re-architect and refactor the whole thing. Look up normalization and why you want to separate every independent element in a relational DB.
I am often asked, "Why make it a separate table when this way is simpler?" Then I remind them of their "oh, there are no other things of this type we will use," followed later by them asking for a feature that necessitates many-to-many, not realizing they painted you into a corner by not considering future features. People who do not understand data structures tend not to realize this and are pretty bad at specifying system requirements. This usually happens when the DB starts getting big and they realize they want to be able to look at only a subset of the data. A flat DB means adding columns to handle a ton of different desires, while a many-to-many join table can do it with a few lines of code.
I'd also use the second way. If the DB is as simple as you say, the difference might not be much in terms of speed and such. But the second way is more efficient and easier to reuse/enhance in case you get new ideas and add to your application.
Should you go for the first option, how will you keep track of the prices and discounts you gave to your customers for each product? Even if you have no plans to track that now, it is quite a common requirement, so you might get a request for such a change.
With a normalized schema, all you have to do is add a couple of fields.
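For instance, building on the orders_products table sketched in the question (these column names are hypothetical):

ALTER TABLE orders_products
  ADD COLUMN unit_price DECIMAL(10,2) NOT NULL,           -- price actually charged
  ADD COLUMN discount   DECIMAL(5,2)  NOT NULL DEFAULT 0; -- per-line discount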
This is a follow-up to my last question: MySQL - Best method to saving and loading items
Anyway, I've looked at some other examples and sources, and most of them save items the same way: first they delete all the rows already in the database that reference the character, then they insert new rows for the items the character currently has.
I just wanted to ask if this is a good way to do it, and whether it would cause a performance hit if I were to save 500 items per character or so. If you have a better solution, please tell me!
Thanks in advance, AJ Ravindiran.
It would help if you talked about your game so we could get a better idea of your data requirements.
I'd say it depends. :)
Are the slot/bank updates happening constantly as the person plays, or just when the person saves their game and leaves? Also, does the order of the slots really matter for the bank slots? Constantly deleting and inserting 500 records certainly can have a performance hit, but there may be a better way to do it; possibly you could just update the 500 records without deleting them. Possibly your first idea of 0=4151:54;1=995:5000;2=521:1; wasn't so bad, if the database is only being used for storing that information and the game itself manages it once it's loaded. But if you might want to use it for other things, like "Which players have item X?" or "What is the total value of the items in player Y's bank?", then storing it like that won't let you ask the database; it would have to be computed by the game.
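To illustrate the "update without deleting" idea, here's a minimal sketch. It assumes a table character_items(character_id, slot, item_id, amount) keyed on (character_id, slot), mirroring the 0=4151:54;... format above:

INSERT INTO character_items (character_id, slot, item_id, amount)
VALUES (7, 0, 4151, 54),
       (7, 1, 995, 5000),
       (7, 2, 521, 1)
ON DUPLICATE KEY UPDATE
  item_id = VALUES(item_id),   -- overwrite whatever was in the slot before
  amount  = VALUES(amount);

-- Slots the character emptied can then be cleared with one targeted DELETE
-- (WHERE character_id = 7 AND slot NOT IN (...)) instead of wiping all 500 rows.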
I find that when trying to construct complex MySQL joins and groups between many tables I usually run into strife and have to spend a lot of 'trial and error' time to get the result I want.
I was wondering how other people approach the problems. Do you isolate the smaller blocks of data at the end of the branches and get these working first? Or do you start with what you want to return and just start linking tables on as you need them?
Also wondering if there are any good books or sites about approaching the problem.
I don't work in MySQL, but I do frequently write extremely complex SQL, and here's how I approach it.
First, there is no substitute whatsoever for thoroughly understanding your database structure.
Next I try to break up the task into chunks.
For instance, suppose I'm writing a report concerning the details of a meeting (the company I work for does meeting planning). I will need to know the meeting name and sales rep, the meeting venue and dates, the people who attended, and the speaker information.
First, I determine which of the tables will have the information for each field in the report. Now I know what I will have to join together, but not exactly how yet.
So first I write a query to get the meetings I want. This is the basis for all the rest of the report, so I start there. Now the rest of the report can probably be done in any order, although I prefer to work through the parts that should have one-to-one relationships first, so next I'll add the joins and the fields that will get me all the sales-rep-related information.
Suppose I only want one rep per meeting (if there are multiple reps, I only want the main one), so I check to make sure that I'm still returning the same number of records as when I just had the meeting information. If not, I look at my joins and decide which one is giving me more records than I need. In this case it might be the address table, as we store multiple addresses per rep. I then adjust the query to get only one. This may be easy (you may have a field that indicates the specific unique address you want, so you only need to add a WHERE condition) or you may need to do some grouping and aggregate functions to get what you want.
Then I go on to the next chunk (working first through all the chunks that should have a one-to-one relationship to the central data, in this case the meeting). Run the query and check the data after each addition.
Finally, I move to the records which might have a one-to-many relationship and add them. Again I run the query and check the data. For instance, I might check the raw data for a particular meeting and make sure what my query is returning is exactly what I expect to see.
Suppose in one of these additions of a join I find the number of distinct meetings has dropped. Oops, then there is no data in one of the tables I just added, and I need to change that join to a LEFT JOIN.
Another time I may find too many records returned. Then I look to see if my WHERE clause needs more filtering or if I need to use an aggregate function to get the data I need. Sometimes I will temporarily add other fields to the report to see if I can spot what is causing the duplicated data. This helps me know what needs to be adjusted.
The real key is to work slowly, understand your data model, and check the data after every new chunk is added to make sure it is returning the results the way you think it should.
Sometimes, if I'm returning a lot of data, I will temporarily put an additional WHERE clause on the query to restrict it to a few items I can easily check. I also strongly suggest using ORDER BY, because it will help you see if you are getting duplicated records.
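As a sketch of that chunk-by-chunk process, using made-up meeting-planning tables (none of these names come from a real schema):

-- Step 1: the base set of meetings; note the row count for later comparison.
SELECT m.meeting_id, m.meeting_name
FROM meetings m
WHERE m.start_date >= '2024-01-01';

-- Step 2: add the main sales rep (expected one-to-one). If the row count
-- grows, a join is multiplying rows and needs tightening.
SELECT m.meeting_id, m.meeting_name, r.rep_name
FROM meetings m
JOIN reps r ON r.rep_id = m.main_rep_id
WHERE m.start_date >= '2024-01-01';

-- Step 3: add attendees (one-to-many), using LEFT JOIN so meetings without
-- attendees don't disappear, and an aggregate to keep one row per meeting.
SELECT m.meeting_id, m.meeting_name, r.rep_name, COUNT(a.attendee_id) AS attendees
FROM meetings m
JOIN reps r           ON r.rep_id = m.main_rep_id
LEFT JOIN attendees a ON a.meeting_id = m.meeting_id
WHERE m.start_date >= '2024-01-01'
GROUP BY m.meeting_id, m.meeting_name, r.rep_name
ORDER BY m.meeting_name;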
Well, the best way to break down your MySQL query is to run the EXPLAIN command, and to read the MySQL documentation on optimization with EXPLAIN.
MySQL provides some great free GUI tools as well; the MySQL Query Browser is what you need to use.
Running the EXPLAIN command breaks down how MySQL interprets your query and displays its complexity. It might take some time to decode the output, but that's another question in itself.
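For example (a hypothetical query, just to show the usage):

EXPLAIN
SELECT m.meeting_name, r.rep_name
FROM meetings m
JOIN reps r ON r.rep_id = m.main_rep_id
WHERE m.start_date >= '2024-01-01';
-- Each output row describes one table in the join: the access type (ALL,
-- ref, const, ...), the index chosen, and the estimated rows examined.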
As for a good book I would recommend: High Performance MySQL: Optimization, Backups, Replication, and More
I haven't used them myself, so I can't comment on their effectiveness, but perhaps a GUI-based query builder such as dbForge or Code Factory might help?
And while the use of Venn diagrams to think about MySQL joins doesn't necessarily help with the SQL, they can help visualise the data you are trying to pull back (see Jeff Atwood's post).