After searching many forums, I think my problem is that I don't know how to phrase the question properly, because I can't find an answer remotely close to what I need, yet by the looks of it this is Excel-to-MySQL 101.
I have an Excel sheet with dozens of types of blinds (for windows). The top row is the width and the left column is the height. Where a width and a height cross (say 24 x 36), the cell holds the price.
         |  24 |  30 |  32 |  36   (width)
----------------------------
      24 | $50 | $55 | etc
      30 | $60 | etc | etc   (price)
      32 | $70
(height)
I can't for the life of me figure out where or how I am to import this into mysql when my database looks like this..
itemname_id <<(my primary) | width | height | price
-------------------------------------------------------------------
Am I doomed to manually typing thousands of combinations, or is this a common problem? What are the correct terms to search for? Evidently I'm not speaking the right lingo.
Thank you so much for any guidance. I've looked forever and I keep hitting a wall.
It probably would have helped to know that the layout of your Excel data is commonly referred to as a pivot table. It is possible to "unpivot" the data in Excel to get the data in the format that you want to import to your database.
This brief article shows how to create a pivot table and then unpivot it. Basically, that entails creating a "sum of values" pivot table and then double-clicking on the single value that is the result. It's counter-intuitive, but pretty simple to do.
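If you would rather unpivot outside Excel, the same reshaping is a short loop. Below is a minimal sketch, assuming the sheet has been read into a list of rows (header row of widths first, then one row per height). The `unpivot` helper and the sample grid are illustrative only; just the $50, $55, $60 and $70 cells come from the question.

```python
def unpivot(grid):
    """Turn a width-by-height price grid into (width, height, price) rows.

    grid[0] is the header row: a corner cell followed by the widths.
    Every later row is a height followed by one price per width.
    Empty cells (None or "") are skipped.
    """
    widths = grid[0][1:]
    rows = []
    for line in grid[1:]:
        height, prices = line[0], line[1:]
        for width, price in zip(widths, prices):
            if price is None or price == "":
                continue
            rows.append((width, height, price))
    return rows


# Hypothetical sample grid; None marks cells the question left as "etc".
grid = [
    [None, 24, 30, 32, 36],
    [24, 50, 55, None, None],
    [30, 60, None, None, None],
    [32, 70, None, None, None],
]

long_rows = unpivot(grid)
for width, height, price in long_rows:
    print(width, height, price)
```

Each resulting tuple maps directly onto one row of the width | height | price table, so the list can be written out as CSV or inserted with a parameterized INSERT.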
I have a 'users' table that holds a bunch of concrete, "sure" properties of my users, all of which must be present and whose veracity is certain. I also have a separate table, 'users_derived', in which every value is a derived property of a user, guessed by machine learning models. For example, 'age' might be a certain property since the user supplied it to me, while 'height' or 'hair color' might be a derived property since an ML model guessed it from a picture. The main difference is that all properties in the 'users' table were given to me by the users themselves and are completely certain, whereas all properties in the 'users_derived' table were guessed by my system and carry both a value and a certainty. The other difference is that every property in the 'users' table is present for every user, while any property in the 'users_derived' table may or may not be present. From time to time I also add new ML models that guess more properties of users.
My question is how to do the schema for the 'users_derived' table. I could do it like this:
userid | prop1 | certainty1 | prop2 | certainty2 | prop3  | etc ...
123    | 7     | 0.57       | 5'8'' | 0.82       | red
124    | 12    | 0.6        | NULL  | NULL       | black
125    | NULL  | NULL       | 6'1'' | 0.88       | blonde
or I could do it like this with slightly different indexing:
userid | property | value | certainty
123    | 1        | 7     | 0.57
123    | 2        | 5'8'' | 0.82
124    | 1        | 12    | 0.60
123    | 3        | red   | 0.67
124    | 3        | black | 0.61
125    | 2        | 6'1'' | 0.88
etc ....
So the tradeoffs seem like in the second way it isn't as normalized and might be slightly harder to query but you don't have to know all the properties you care about in advance -- that is if I want to add a new property there is no schema change. Also there don't have to be any NULL spots since if we don't have that property yet we just don't have a row for it. What am I missing? What are the benefits of the first way? Are there queries I can do against the first schema that are hard or impossible in the second schema? Does the second way somehow need more space for indexing to make it fast?
The second way is more normalized. Both the table and the indexes are likely to be more compact, especially if the first form is relatively sparsely populated. Although the two forms have different tradeoffs for different queries, in general the second form is more flexible and better suited to a wide variety of queries. If you want to transform data from the normalized form to the crosstabbed form, there is a crosstab function in Postgres' tablefunc extension that can be used for this purpose. Normalizing crosstabbed data will be more difficult, especially if the number of columns is indeterminate--yet you may need to do that for some types of queries.
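To illustrate going from the normalized form to the crosstabbed form without the tablefunc extension, here is a minimal sketch that uses plain conditional aggregation instead of crosstab. It runs against an in-memory SQLite database purely so it is self-contained; the table and column names follow the question, and the sample rows are the ones listed there.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users_derived ("
    " userid INTEGER, property INTEGER, value TEXT, certainty REAL,"
    " PRIMARY KEY (userid, property))"
)
conn.executemany(
    "INSERT INTO users_derived VALUES (?, ?, ?, ?)",
    [
        (123, 1, "7", 0.57),
        (123, 2, "5'8''", 0.82),
        (124, 1, "12", 0.60),
        (123, 3, "red", 0.67),
        (124, 3, "black", 0.61),
        (125, 2, "6'1''", 0.88),
    ],
)

# One CASE per property: each picks out that property's value for the user,
# and MAX collapses the group to the single non-NULL value (or NULL if absent).
crosstab = conn.execute(
    """
    SELECT userid,
           MAX(CASE WHEN property = 1 THEN value END) AS prop1,
           MAX(CASE WHEN property = 2 THEN value END) AS prop2,
           MAX(CASE WHEN property = 3 THEN value END) AS prop3
    FROM users_derived
    GROUP BY userid
    ORDER BY userid
    """
).fetchall()

for row in crosstab:
    print(row)
```

The catch is the one mentioned above: the list of output columns is fixed in the query text, so every new property means editing (or generating) the SELECT list.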
Let's say I want to create a table like this:
id | some_foreign_id | attribute | value
_________________________________________
1  | 1               | Weight    | 100
2  | 1               | Reps      | 5
3  | 2               | Reps      | 40
4  | 3               | Time      | 10
5  | 4               | Weight    | 50
6  | 4               | Reps      | 60
Versus the same data represented this way
id | some_foreign_id | weight | reps | time
____________________________________________
1  | 1               | 100    | 5    | NULL
2  | 2               | NULL   | 40   | NULL
3  | 3               | NULL   | NULL | 10
4  | 4               | 50     | 60   | NULL
And since in this case the id = foreign_id I think we can just append these columns to whatever table foreign_id is referring to.
I would assume most people would overwhelmingly say the latter approach is the accepted practice.
Is the former approach considered a bad idea, even though it doesn't result in any NULLs? What are the tradeoffs between these two approaches exactly? It seems like the former might be more versatile, at the expense of not really having a clear defined structure, but I don't know if this would actually result in other ramifications. I can imagine a situation where you have tons of columns in the latter example, most of which are NULL, and maybe only like three distinct values filled in.
Your first example is the EAV (entity-attribute-value) model. It has a few advantages, but you are on MySQL, and MySQL doesn't handle it especially well. As pointed out in the thread "Crosstab View in mySQL?", MySQL lacks functions that other databases have. Postgres and other databases have some more fun functions (see "PostgreSQL Crosstab Query") that make this significantly easier. In the MSSQL world, this gets referred to as sparsely populated columns. I find columnar structures actually lend themselves quite well to this (Vertica, or high-end Oracle).
Advantages:
Adding a new attribute is significantly easier than altering a table schema. If you are unsure what future column names will be, this is the way to go.
It avoids sparsely populated columns, which result in tables full of NULLs and redundant data. You can set up logic to supply a 'default' value for an attribute, i.e. if no value is specified for this attribute, then use this value.
Downsides:
A bit harder to program with, in MySQL in particular, as per the comments above. Not all SQL devs are familiar with the model, and you might accidentally impose a steeper learning curve on new team members.
Not the most scalable. Indexing is a challenge and you need workarounds (Strawberry's input in the comments points this way: your value column is basically forced to VARCHAR, which does not index well and is not easy to search... welcome to table scan hell). You can get around this with a third table (say you query a lot on dates such as create date and close date: create a third 'control' table that contains those frequently queried columns, index that, and refer to the EAV tables from there), or by creating multiple EAV tables, one for each data type.
The first one is the right one.
If you later want to change the number of properties, you don't have to change your DB structure.
Changing the DB structure can cause your app to break.
If the number of NULLs is large, you are wasting a lot of storage.
My take on this
I would probably use the first one if I had a lot of different attributes and values that I wanted to add in a more dynamic way, like user tags or user-specific information.
I would probably use the second one if I just had the three attributes from your example (weight, reps, time) and no need for anything dynamic or for more attributes (and if I did need one more, I would just add another column).
I would say both work; as you say yourself, "the former might be more versatile". Both ways need their own structure around them to extract, process and store data :)
Edit: for the first one to achieve the structure of the second one, you would have to add a join for each attribute you would want to include in the data extract.
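To make the "one join per attribute" point concrete, here is a rough sketch using an in-memory SQLite database so it can be run as-is. The table name `measurements` is made up for the example (the question does not name the table); the rows are the ones from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE measurements ("
    " id INTEGER PRIMARY KEY, some_foreign_id INTEGER,"
    " attribute TEXT, value INTEGER)"
)
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?, ?, ?)",
    [
        (1, 1, "Weight", 100),
        (2, 1, "Reps", 5),
        (3, 2, "Reps", 40),
        (4, 3, "Time", 10),
        (5, 4, "Weight", 50),
        (6, 4, "Reps", 60),
    ],
)

# One LEFT JOIN per attribute; a missing attribute simply comes back as NULL.
wide = conn.execute(
    """
    SELECT ids.some_foreign_id, w.value AS weight, r.value AS reps, t.value AS time
    FROM (SELECT DISTINCT some_foreign_id FROM measurements) AS ids
    LEFT JOIN measurements AS w
           ON w.some_foreign_id = ids.some_foreign_id AND w.attribute = 'Weight'
    LEFT JOIN measurements AS r
           ON r.some_foreign_id = ids.some_foreign_id AND r.attribute = 'Reps'
    LEFT JOIN measurements AS t
           ON t.some_foreign_id = ids.some_foreign_id AND t.attribute = 'Time'
    ORDER BY ids.some_foreign_id
    """
).fetchall()

for row in wide:
    print(row)
```

The output has the same shape as the second table in the question, which shows that the two layouts hold the same information; the cost is that the query grows by one join for every attribute you want as a column.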
I think the first way contributes better towards normalization. You could even create a new table with attributes:
id attribute
______________
1 reps
2 weight
3 time
Then turn the second-to-last column (attribute) into a foreign key. This will save space and spare you the risk of mistyping the attribute names. Like this:
id | some_foreign_id | attribute | value
_________________________________________
1  | 1               | 2         | 100
2  | 1               | 1         | 5
3  | 2               | 1         | 40
4  | 3               | 3         | 10
5  | 4               | 2         | 50
6  | 4               | 1         | 60
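A small sketch of what the lookup table buys you, assuming a foreign key constraint is actually declared and enforced. It uses SQLite in memory so it is runnable on its own (in MySQL you would need InnoDB tables for the constraint to be enforced); the table names `attributes` and `measurements` are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE attributes ("
    " id INTEGER PRIMARY KEY, attribute TEXT NOT NULL UNIQUE)"
)
conn.execute(
    "CREATE TABLE measurements ("
    " id INTEGER PRIMARY KEY,"
    " some_foreign_id INTEGER NOT NULL,"
    " attribute_id INTEGER NOT NULL REFERENCES attributes (id),"
    " value INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO attributes VALUES (?, ?)",
    [(1, "reps"), (2, "weight"), (3, "time")],
)

# A row pointing at a known attribute is accepted.
conn.execute("INSERT INTO measurements VALUES (1, 1, 2, 100)")

# A row pointing at an attribute id that does not exist is rejected.
try:
    conn.execute("INSERT INTO measurements VALUES (2, 1, 99, 5)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print("bad attribute rejected:", rejected)
```

With a free-text attribute column, a typo such as 'Wieght' would silently create a new attribute; with the foreign key, the database refuses the row.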
As others have stated, the first way is the better way. Why? Well, it normalizes the structure. Reference: https://en.wikipedia.org/wiki/Database_normalization
As that article states, normalization reduces database size & allows for easy expansion.
I'd like to build some PHP graphs from MySQL data, but first I have to think about the best design for my database. I'd rather ask for advice, because I'm a newbie and I think there is a better way than the one I have in mind.
I have a main table called "companies", with an auto-increment id column and a region column; more companies and more columns could be added later.
On the PHP graphs, I'd like to select companies by region, and maybe by other filters later, which is why "companies" is the central table.
I'd like to fetch new data (stocks, commands, ROI, etc.) for every company once per day and use that data in the graphs. The old data won't be overwritten but kept for the graph history: the more data I have, the better the graphs will show the behavior of the company.
So every day there will be one more row per company (more than 200 companies in total), and this is where my issue lies.
I was thinking of creating one table per company and adding a new row every day to each of those tables, but that feels dirty and I suspect there is a better, cleaner way to do it. Can anybody show me the best way?
Thanks for reading, I hope you can help me.
From what I understand, you have at least those entities in your domain:
Companies
Company indicators (stock, ROI, commands, ...)
Indicator types (stock, ROI, commands, ...)
Regions
So you should probably have something similar to the schema below:
             ,---------.
,-------.    |Indicator|
|Company|    |---------|
|-------|    |date     |
|name   |*---|value    |
|-------|    |---------|
`-------'    `---------'
    o             o
    |             |
,------.   ,-------------.
|Region|   |IndicatorType|
|------|   |-------------|
|name  |   |-------------|
|------|   `-------------'
`------'
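One possible way to turn that diagram into tables is sketched below, using SQLite in memory so it can be run as-is. The column names, the `name` column on the indicator type, the composite primary key and all sample rows are assumptions made for illustration, not part of the diagram.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript(
    """
    CREATE TABLE region (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE company (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        region_id INTEGER NOT NULL REFERENCES region (id)
    );
    CREATE TABLE indicator_type (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    -- One row per company, indicator type and day: history is never overwritten.
    CREATE TABLE indicator (
        company_id INTEGER NOT NULL REFERENCES company (id),
        indicator_type_id INTEGER NOT NULL REFERENCES indicator_type (id),
        date TEXT NOT NULL,
        value REAL NOT NULL,
        PRIMARY KEY (company_id, indicator_type_id, date)
    );
    -- Made-up sample data.
    INSERT INTO region VALUES (1, 'North');
    INSERT INTO company VALUES (1, 'Acme', 1), (2, 'Globex', 1);
    INSERT INTO indicator_type VALUES (1, 'stock'), (2, 'roi');
    INSERT INTO indicator VALUES
        (1, 1, '2024-01-01', 10.0),
        (1, 1, '2024-01-02', 11.5),
        (2, 1, '2024-01-01', 7.0);
    """
)

# Data for one graph: a given indicator for all companies of a given region.
history = conn.execute(
    """
    SELECT c.name, i.date, i.value
    FROM indicator AS i
    JOIN company AS c ON c.id = i.company_id
    JOIN indicator_type AS t ON t.id = i.indicator_type_id
    JOIN region AS r ON r.id = c.region_id
    WHERE r.name = ? AND t.name = ?
    ORDER BY c.name, i.date
    """,
    ("North", "stock"),
).fetchall()

for row in history:
    print(row)
```

With this layout a new company or a new kind of indicator is just a new row, so there is no need for one table per company.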
I have a database for a device and the columns are like this:
DeviceID | DeviceParameter1 | DeviceParameter2
At this stage I need only these parameters, but maybe a few months down the line, I may need a few more devices which have more parameters, so I'll have to add DeviceParameter3 etc as columns.
A friend suggested that I keep the parameters as rows in another table (ParamCol) like this:
Column | ColumnNumber
---------------------------------
DeviceParameter1 | 1
DeviceParameter2 | 2
DeviceParameter3 | 3
and then refer to the columns like this:
DeviceID | ColumnNumber <- this is from the ParamCol table
---------------------------------------------------
switchA | 1
switchA | 2
routerB | 1
routerB | 2
routerC | 3
He says that for 3NF, when we expect a table whose columns may increase dynamically, it's better to keep the columns as rows. I don't believe him.
In your opinion, is this really the best way to handle a situation where the columns may increase or is there a better way to design a database for such a situation?
This is a "generic data model" question - if you google the term you'll find quite a bit of material on the net.
Here is my view: if and only if the parameters are NOT qualitatively different from the application perspective, then go with the dynamic row solution (i.e. a generic data model). What does qualitatively mean - it means that within your application you don't treat Parameter3 any different to Parameter17.
You should never ever generate new columns on-the-fly, that's a very bad idea. If the columns are qualitatively different and you want to be able to cater for new ones, then you could have a different Device Parameter table for each different category of parameters. The idea is to avoid dynamic SQL as much as possible as it brings a set of its own problems.
Adding columns dynamically is a bad idea; it's actually bad design. I would go with your second option: adding rows is OK.
If you grow the columns dynamically, you have to provide a default value for each of them, you will not be able to use them as UNIQUE values, and you will find it really hard to update the tables. So it's better to stick with the 'rows' plan.
I'm working on a URL shortener project with PHP & MySQL which tracks the visits of each URL. I have a visits table which mainly consists of these properties:
time_in_second | country | referrer | os | browser | device | url_id
#####################################################################
1348128639 | US | direct | win | chrome | mobile | 3404
1348128654 | US | google | linux | chrome | desktop| 3404
1348124567 | UK | twitter| mac | mozila | desktop| 3404
1348127653 | IND | direct | win | IE | desktop| 3465
Now I want to make a query on this table. for example I want to get visits data for the url with url_id=3404. Because I should provide statistics and draw graphs, for this url, I need these data:
Number of each kind of OS for this URL , for example 20 windows, 15 linux , ...
Number of visits in each desired period of time , for example each 10 minutes in past 24 hour
Number of visits for each country
...
As you see, some data like country may accept lots of different values.
One idea I can imagine is a query that outputs the count of each unique value in each column. For example, in the country case for the data given above: one column for num_US, one for num_UK, and one for num_IND.
Now the question is how to implement such a high-performance query in SQL (MySQL)?
Also if you think this is not an efficient query for performance, what's your suggestion?
Any help will be appreciated deeply.
UPDATE: look at this question : SQL; Only count the values specified in each column . I think this question is similar to mine , but the difference is in variety of values possible (as lots of values are possible for country property) for each column which makes the query more complex.
It looks like you need more than one query. You probably could write one query with different parameters, but that would make it complex and hard to maintain. I would approach it as multiple small queries: one query per requirement, each called separately. For example, if you want the country counts you mentioned, you could do the following:
SELECT country, COUNT(*) FROM <TABLE_NAME> WHERE url_id = 3404 GROUP BY country
By the way, I have not tested this query, so it may be inaccurate, but this is just to give you an idea. I hope this helps.
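For what it's worth, here is a runnable sketch of the "one small query per requirement" idea, run against an in-memory SQLite database loaded with the four sample rows from the question. The table name `visits` and the `count_by` helper are made up for the example; the SQL itself is plain GROUP BY and should carry over to MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits ("
    " time_in_second INTEGER, country TEXT, referrer TEXT,"
    " os TEXT, browser TEXT, device TEXT, url_id INTEGER)"
)
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        (1348128639, "US", "direct", "win", "chrome", "mobile", 3404),
        (1348128654, "US", "google", "linux", "chrome", "desktop", 3404),
        (1348124567, "UK", "twitter", "mac", "mozila", "desktop", 3404),
        (1348127653, "IND", "direct", "win", "IE", "desktop", 3465),
    ],
)

# Column names cannot be bound as parameters, so check them against a whitelist.
COUNTABLE = {"country", "referrer", "os", "browser", "device"}


def count_by(column, url_id):
    """Return (value, number_of_visits) pairs for one column of one URL."""
    if column not in COUNTABLE:
        raise ValueError(f"unsupported column: {column}")
    sql = (
        f"SELECT {column}, COUNT(*) FROM visits "
        f"WHERE url_id = ? GROUP BY {column} ORDER BY {column}"
    )
    return conn.execute(sql, (url_id,)).fetchall()


by_country = count_by("country", 3404)
by_os = count_by("os", 3404)
print(by_country)
print(by_os)
```

Because each value comes back as its own row rather than its own column (num_US, num_UK, ...), the query does not need to know the list of countries in advance.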
Also, another suggestion is to use Google Analytics, look into it, they do have a lot of what you already are implementing, maybe that helps as well.
Cheers.
Each of these graphs you want to draw represents a separate relation, so my off-the-cuff response is that you can't build a single query that gives you exactly the data you need for every graph you want to draw.
From this point, your choices are:
Use different queries for different graphs
Send a bunch of data to the client and let it do the required post-processing to create the exact sets of data it needs for different graphs
Farm it all out to Google Analytics (a la #wahab-mirjan)
If you go with option 2, you can minimize the amount of data you send by counting hits per (10-minute, os, browser, device, url_id) tuple. This essentially removes all duplicate rows and gives you a count. The client software would take these numbers and further reduce them by country (or whatever) to get the numbers it needs for a graph. To be honest though, I think you're buying yourself extra complexity for not very much gain.
If you insist on doing this yourself (instead of using a service) then go with a different query for each kind of graph. Start with a couple of reasonable indexes (url_id and time_in_second are obvious starting points). Use the explain statement (or whatever your database provides) to understand how each query is executed.
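As a rough illustration of that last paragraph, the sketch below builds a cut-down visits table in SQLite, adds a composite index on (url_id, time_in_second), counts visits per 10-minute bucket, and asks the database for its query plan. The table and index names are made up, and the integer-division trick is SQLite behaviour: in MySQL you would write `time_in_second DIV 600 * 600` and use EXPLAIN rather than EXPLAIN QUERY PLAN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE visits (time_in_second INTEGER, country TEXT, url_id INTEGER)"
)
# Composite index matching the WHERE clause: equality on url_id, range on time.
conn.execute("CREATE INDEX idx_visits_url_time ON visits (url_id, time_in_second)")
conn.executemany(
    "INSERT INTO visits VALUES (?, ?, ?)",
    [
        (1348128639, "US", 3404),
        (1348128654, "US", 3404),
        (1348124567, "UK", 3404),
        (1348127653, "IND", 3465),
    ],
)

BUCKET = 600  # 10 minutes, in seconds
newest = 1348128654
since = newest - 24 * 60 * 60  # look back 24 hours from the newest sample row

# Integer division truncates, so (t / 600) * 600 is the start of t's bucket.
sql = """
    SELECT (time_in_second / ?) * ? AS bucket_start, COUNT(*) AS visits
    FROM visits
    WHERE url_id = ? AND time_in_second >= ?
    GROUP BY bucket_start
    ORDER BY bucket_start
"""
params = (BUCKET, BUCKET, 3404, since)
per_bucket = conn.execute(sql, params).fetchall()
print(per_bucket)

# Ask the database how it plans to run the query; the exact text varies by version.
plan = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
for step in plan:
    print(step)
```

Buckets with no visits do not appear in the result, so the graphing code has to fill those gaps with zeros itself.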
Sorry, I am new to Stack Overflow and having a problem with comment formatting. Here is my answer again, hopefully it works now:
Not sure how it is poor in performance. The way I am thinking is you will end up with a table that looks like this:
country | count
#################
US | 304
UK | 123
UK | 23
So when you group by country and count, it will be one query. I think this will get you going in the right direction. In any case, it is just an opinion, so if you find another approach, I am interested in knowing it as well.
Apologies for the comment mess-up up there.
Cheers