I have an Excel table (can be converted to XML or CSV for manipulating) of this structure:
| License-plate | Parking | Fuel | Cleaning |
---------------------------------------------
| 1111AAA | 234 | 21 | 1244 |
| 2222AAA | 22 | 12 | 644 |
| 3333BBB | 523 | 123 | 123 |
It shows the monthly spending on parking, fuel, etc. per car.
The license plate is a unique value in the table.
I need to convert this table into the following form so that I can import it into MySQL, but I don't know how to do that or which tool is good for it:
| License-plate | Concept | Amount |
-------------------------------------
| 1111AAA | Parking | 234 |
| 1111AAA | Fuel | 21 |
| 1111AAA | Cleaning | 1244 |
| 2222AAA | Parking | 22 |
| 2222AAA | Fuel | 12 |
| 2222AAA | Cleaning | 644 |
| ....... | ........ | ..... |
In the resulting table, License-plate is no longer unique; it is repeated once for each concept it has.
UPD: Just discovered that it can be called denormalized data (maybe not exactly).
I would do it with MySQL, the following way:
Import the table (after converting it to CSV) into MySQL. Let's call it source
CREATE TABLE source (
License_Plate char(7) primary key,
Parking int(8) unsigned,
Fuel int(8) unsigned,
Cleaning int(8) unsigned
);
LOAD DATA INFILE 'path/to/file' INTO TABLE source FIELDS TERMINATED BY ',';
Create another table with the desired final structure, let's call it destination
CREATE TABLE destination (
License_Plate char(7),
Concept varchar(10),
Amount int(8) unsigned
);
Perform the following queries:
INSERT INTO destination
SELECT License_Plate, 'Parking' as Concept, Parking as Amount
FROM source;
INSERT INTO destination
SELECT License_Plate, 'Fuel' as Concept, Fuel as Amount
FROM source;
INSERT INTO destination
SELECT License_Plate, 'Cleaning' as Concept, Cleaning as Amount
FROM source;
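The three INSERT ... SELECT statements above amount to an unpivot. If you would rather reshape the file before it reaches MySQL, the same step can be sketched in Python. This is only a sketch: the inline CSV text and the function name unpivot are made up for illustration, and it assumes the export has a header row and integer amounts.

```python
import csv
import io

# Hypothetical CSV export of the wide table from the question.
WIDE_CSV = """License-plate,Parking,Fuel,Cleaning
1111AAA,234,21,1244
2222AAA,22,12,644
3333BBB,523,123,123
"""


def unpivot(rows, key="License-plate"):
    """Yield one (key, concept, amount) tuple per non-key column of each row."""
    for row in rows:
        for concept, amount in row.items():
            if concept != key:
                yield row[key], concept, int(amount)


long_rows = list(unpivot(csv.DictReader(io.StringIO(WIDE_CSV))))

# Write the 3-column result as CSV text that could then be loaded into MySQL.
out = io.StringIO()
writer = csv.writer(out)
writer.writerow(["License-plate", "Concept", "Amount"])
writer.writerows(long_rows)
print(out.getvalue())
```

With a real file you would pass an open file object to csv.DictReader instead of the StringIO. New expense columns are picked up automatically, because every column other than the key becomes a Concept.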
Things to consider:
I declared License_Plate as a primary key, just based on your example. This might not be the case if it repeats in real data. Also, if you have more than one row on the source table for the same license plate, you probably need to adjust my 3 queries to aggregate values.
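Also, the datatypes are fitted to the sample data; you might need to change them, for instance if a license plate can have more than 7 characters. Note that the 8 in int(8) is only a display width, not a limit on the number of digits: an unsigned INT holds values up to 4294967295.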
LOAD DATA is one way in which you can upload your CSV. It has many options, you should check them out. You can also do it with some tools, so as not to write that statement.
Last, those table names were chosen as an example, you should come up with better ones, that represent your problem domain.
Hope this helps you.
The comment of #pnuts helped me. It is a very easy solution and can be done in Excel. Thank you!
The solution is: Convert matrix to 3-column table ('reverse pivot', 'unpivot', 'flatten', 'normalize')
The Problem
I landed a small gig to develop an online quoting system for an electronic distributor. He has roughly a half million parts - one little screw is considered a part, one little led, etc. So there are a LOT of parts.
One Important Note: This is only an RFQ (Request for Quote). There are no prices client-side, or totals, or anything to do with money. Just collecting a list of part numbers to send to my client.
I had to collect the part data from multiple sources (vendor website, scanned paper catalog, Excel spreadsheets, CSV files, and even a few JSON files). It was exhausting, but I got it done.
Results
Confusing at first. I had dozens of product categories, and some products had so many attributes that were not common to any other products. I could see this project getting very complicated, and given the fact I bid this job at $900 even, I had to simplify this somehow.
This is what I came up with, and received client approval.
Current Columns
+--------------------------+--------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+--------------------------+--------------+------+-----+---------+-------+
| Datasheets | varchar(128) | YES | | NULL | |
| Image | varchar(85) | YES | | NULL | |
| DigiKey_Part_Number | varchar(46) | YES | | NULL | |
| Manufacturer_Part_Number | varchar(47) | YES | | NULL | |
| Manufacturer | varchar(49) | YES | | NULL | |
| Description | varchar(34) | YES | | NULL | |
| Quantity_Available | int(11) | YES | | NULL | |
| Minimum_Quantity | int(11) | YES | | NULL | |
+--------------------------+--------------+------+-----+---------+-------+
so all products will fit this page template (menu on bottom is error in screenshot):
Autocomplete Off The Table?
Early on in the design, I implemented a nice autocomplete feature:
BUT... given the number of products in the table, is this even practical anymore?
FINAL PRODUCT COUNT: 223,347
What changes do I need to make to PRODUCTS table so that querying the table will not take forever?
These are the only queries the app will be making ( not sure if this info will help in your solution advice )...
Get all products by category:
Select * from products where category = 'semiconductors'
Get single product:
Select * from products where Manufacturer_Part_Number = '12345'
Get product count by category
I think those three actually cover everything I need to do. Maybe a couple more, but not many.
In closing...
Is there a way to "index" this table with 223000 records where searching by one or more columns can be done efficiently?
I am very new to database design, and know I do need to index SOMETHING, but ... WHAT???
Thank you for taking the time to look at this post.
Regards,
John
Listing the queries is mandatory to answering your question. Thanks for including them.
INDEX(category)
INDEX(Manufacturer_Part_Number)
But I suggest your second query should include Manufacturer, too. Then this would be better:
INDEX(Manufacturer, Manufacturer_Part_Number)
Everything NULL? Seems unlikely.
(I've done jobs like yours; I can't imagine bidding only $900 for all that scraping.)
What will you do when there are a thousand items in a single category or manufacturer? A UI with a thousand-item list sucks.
For how to handle "so many attributes", I recommend http://mysql.rjweb.org/doc.php/eav (I should charge you $899 for the research that went into that document. Just kidding.)
Don't they need other lookups, like "Flash drive", which need to match "FLASH DRV"?
223K rows -- no problem. The VARCHARs seem to be too short; were they based on the data?
And the table needs a PRIMARY KEY.
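To see whether a query can actually use one of the suggested indexes, you can ask the database for its query plan. The sketch below uses Python's built-in sqlite3 module purely as a stand-in, so it shows the idea rather than MySQL's own behavior (in MySQL you would run EXPLAIN on the query instead). The index names, the id column, and the manufacturer 'Acme' are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Cut-down products table: only the columns the queries filter on,
# plus a surrogate primary key (hypothetical column name).
conn.execute(
    "CREATE TABLE products ("
    " id INTEGER PRIMARY KEY,"
    " category TEXT,"
    " Manufacturer TEXT,"
    " Manufacturer_Part_Number TEXT)"
)
conn.execute("CREATE INDEX idx_category ON products (category)")
conn.execute(
    "CREATE INDEX idx_mfr_part"
    " ON products (Manufacturer, Manufacturer_Part_Number)"
)


def plan(sql, params=()):
    """Return SQLite's query plan for a statement as one line of text."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


category_plan = plan(
    "SELECT * FROM products WHERE category = ?", ("semiconductors",)
)
part_plan = plan(
    "SELECT * FROM products"
    " WHERE Manufacturer = ? AND Manufacturer_Part_Number = ?",
    ("Acme", "12345"),
)
print(category_plan)
print(part_plan)
```

Each plan should name the index it searches; a plan that mentions only a scan of the table means no index matched the WHERE clause.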
I am studying databases and I have encountered this question. Suppose I have, for example, the table product_supply, which contains Invoice_Id (pk), Product_Id (pk), Date_Of_Supply, Quantity and Value_Of_Product.
| Invoice_ID | Product_ID | Date_Of_Supply | Quantity | Value_Of_Product |
-------------------------------------------------------------------------
| AA111111111| 5001 | 08-07-2013 | 50 | 200$ |
| AA111111111| 5002 | 08-07-2013 | 20 | 300$ |
| BB222222222| 5003 | 10-09-2013 | 70 | 50$ |
| CC333333333| 5004 | 15-10-2013 | 100 | 40$ |
| CC333333333| 5005 | 15-10-2013 | 70 | 25$ |
| CC333333333| 5006 | 15-10-2013 | 100 | 30$ |
As we can see, the table is already in 1NF. My question is: in terms of normalization, is it wise to normalize this table to 2NF and have another table, for example supply_date, with Invoice_ID (pk) and Date_Of_Supply, or is it OK to keep the table above as it is?
| Invoice_ID | Date_Of_Supply |
-------------------------------
|AA111111111 | 08-07-2013 |
|BB222222222 | 10-09-2013 |
|CC333333333 | 15-10-2013 |
It's definitely worth normalizing. If you need to modify a supply date, with 1NF, you need to update several records; with 2NF, you only need to update one record. Also, note the redundancy of data in 1NF, where the supply date is stored multiple times for each invoice id. Not only does it waste space, it makes it harder to process a query like "list all invoices that were supplied between dates X and Y".
EDIT
As Robert Harvey points out in his comments (which it took me a while to understand because I was being thick for some reason), if you already have a table that has a single row for each Invoice_ID (say, an "invoice table"), then you should probably add a column for Date_Of_Supply to that table rather than create a new table.
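The redundancy can be checked mechanically. The sketch below, in Python, tests whether a column is determined by Invoice_ID alone (a partial dependency on the composite key) and then splits the sample rows accordingly. The function name and the tuple layout are made up for illustration; the data is the sample from the question.

```python
# Sample rows: (Invoice_ID, Product_ID, Date_Of_Supply, Quantity, Value_Of_Product)
ROWS = [
    ("AA111111111", 5001, "08-07-2013", 50, "200$"),
    ("AA111111111", 5002, "08-07-2013", 20, "300$"),
    ("BB222222222", 5003, "10-09-2013", 70, "50$"),
    ("CC333333333", 5004, "15-10-2013", 100, "40$"),
    ("CC333333333", 5005, "15-10-2013", 70, "25$"),
    ("CC333333333", 5006, "15-10-2013", 100, "30$"),
]


def depends_only_on_invoice(rows, col):
    """True if every Invoice_ID maps to exactly one value in column `col`."""
    seen = {}
    for row in rows:
        # setdefault stores the first value seen for this invoice
        # and returns the stored value on later rows.
        if seen.setdefault(row[0], row[col]) != row[col]:
            return False
    return True


date_is_partial = depends_only_on_invoice(ROWS, 2)      # Date_Of_Supply
quantity_is_partial = depends_only_on_invoice(ROWS, 3)  # Quantity

# Decomposition: one row per invoice, one row per (invoice, product).
invoices = sorted({(r[0], r[2]) for r in ROWS})
supply_lines = [(r[0], r[1], r[3], r[4]) for r in ROWS]

print(date_is_partial, quantity_is_partial)
print(invoices)
```

On the sample data the date depends on the invoice alone while the quantity does not, which is why only Date_Of_Supply moves to the per-invoice table. A check like this can only disprove a dependency from the data; whether it really holds is a question about the business rules.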
Changing the table to second normal form involves removing redundancies in the first normal form table. The first question is to determine whether there are even any redundancies.
If a redundancy exists, then we should be able to create a second table which does NOT involve the primary key (Invoice_ID) of the first one. Based on the non PK columns in the first table (namely Product_ID, Date_Of_Supply, Quantity, and Value_Of_Product), it is not clear that any of these are dependent on each other.
As a general rule of thumb, if every non-PK column of a table depends on the whole primary key of that table, and not on just one part of a composite key, the table is already in 2NF.
I have cumulative input values that start life as smallints.
I read these values from an Access database, and aggregate them into a MySQL database.
Now I'm faced with input values of type smallint that are cumulative, thus always increasing.
Input Required output
---------------------------------
0 0
10000 10000
32000 32000
-31536 34000 //overflow in the input
-11536 54000
8464 74000
I process these values by inserting the raw data into a blackhole table and in the trigger to the blackhole I upgrade the data before inserting it into the actual table.
I know how to store the previous input and output, or if there is none, how to select the latest (and highest) inserted value.
But what's the easiest/fastest way to deal with the overflow, so that I get the correct output?
Assuming you have a table named test with a primary key called id and a column named value, just do this:
SELECT
id,
test.value,
(SELECT SUM(value) FROM test AS a WHERE a.id <= test.id) as output
FROM test;
This would be the output:
------------------------
| id | value | output |
------------------------
| 1 | 10000 | 10000 |
| 2 | 32000 | 42000 |
| 3 | -31536 | 10464 |
| 4 | -11536 | -1072 |
| 5 | 8464 | 7392 |
------------------------
Hope this helps.
If it doesn't work, just convert your data to INT (or BIGINT for lots of data). It does not hurt, and memory is cheap these days.
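The running sum above treats every input as an increment. If the inputs are instead cumulative readings that wrap around at 16 bits, as in the question's table, another option is to add up the differences between consecutive readings modulo 65536. A minimal sketch in Python, assuming the true counter never grows by 65536 or more between two readings and that the first reading has not wrapped yet (the function name is made up):

```python
def unwrap_int16(readings):
    """Turn wrapped signed 16-bit cumulative readings into a growing total."""
    total = None
    previous = None
    output = []
    for raw in readings:
        if previous is None:
            # Assumption: the first reading is taken at face value.
            total = raw
        else:
            # The true increase, recovered modulo 2**16. Python's % always
            # returns a non-negative result for a positive modulus.
            total += (raw - previous) % 65536
        previous = raw
        output.append(total)
    return output


result = unwrap_int16([0, 10000, 32000, -31536, -11536, 8464])
print(result)
```

Only the previous raw input and the previous output are needed for each step, which matches what the trigger already has available. In SQL the step could be written along the lines of MOD(new_raw - old_raw + 65536, 65536), where new_raw and old_raw are placeholder names; adding 65536 first keeps the dividend positive, so the result does not depend on how MOD treats negative numbers.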
I'm currently developing a fairly big application that will manipulate a lot of data.
I'm designing the data model and I wonder how to tune this model for a large amount of data. (My DBMS is MySQL.)
I have a table that will contain objects called "values". There are 6 columns, called:
id
type_bool
type_float
type_date
type_text
type_int
Depending on the type of the value (which is recorded elsewhere), one of these columns holds data and the others are NULL.
This table is meant to contain millions of rows (growing very fast). It is also going to be read very often.
My design is going to produce a lot of rows with little data in each. I wonder if it would be better to make 5 different tables, each containing only one type of data. With that solution there would be many more joins.
Can you give me some advice?
Thank you very much !
EDIT : Description of my tables
TABLE ELEMENT In the application there are elements that contain attributes.
There will be a LOT of rows.
There are a lot of reads/writes, few updates/deletes.
TABLE ATTRIBUTEDEFINITION Each attribute is described (at design time) in the table attributeDefinition, which tells the type of the attribute.
There will not be a lot of rows.
There are few writes at the beginning but a LOT of reads.
TABLE ATTRIBUTEVALUE After that, another table "attributeValue" contains the actual data of each attributeDefinition for each element.
There will be a LOT of rows ([nb of Element] x [nb of attribute]).
There are a LOT of reads/writes/UPDATEs.
TABLE LISTVALUE Some types are complex, like the list_type. The set of values available for this type is in another table called LISTVALUE. The attribute value table then contains an id that is a key of the ListValue table.
Here are the create statements
CREATE TABLE `element` (
`id` int(11),
`group` int(11), ...
CREATE TABLE `attributeDefinition` (
`id` int(11) ,
`name` varchar(100) ,
`typeChamps` varchar(45)
CREATE TABLE `attributeValue` (
`id` int(11) ,
`elementId` int(11) , ===> table element
`attributeDefinitionId` int(11) , ===> table attributeDefinition
`type_bool` tinyint(1) ,
`type_float` decimal(9,8) ,
`type_int` int(11) ,
`type_text` varchar(1000) ,
`type_date` date,
`type_list` int(11) , ===> table listValue
CREATE TABLE `listValue` (
`id` int(11) ,
`name` varchar(100), ...
And here is a SELECT example that retrieves all elements of the group whose id is 66:
SELECT elementId,
attributeValue.id as idAttribute,
attributeDefinition.name as attributeName,
attributeDefinition.typeChamps as attributeType,
listValue.name as valeurDeListe,
attributeValue.type_bool,
attributeValue.type_int,
DATE_FORMAT(attributeValue.type_date, '%d/%m/%Y') as type_date,
attributeValue.type_float,
attributeValue.type_text
FROM element
JOIN attributeValue ON attributeValue.elementId = element.id
JOIN attributeDefinition ON attributeValue.attributeDefinitionId = attributeDefinition.id
LEFT JOIN listValue ON attributeValue.type_list = listValue.id
WHERE `element`.`group` = 66
In my application, for each row, I print the value that corresponds to the type of the attribute.
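The per-row step just described can be sketched as follows, in Python. The sketch assumes that attributeType holds the name of the typed column (for example 'type_int') and that list attributes are read from the joined valeurDeListe column; both are assumptions, as are the function name and the sample rows.

```python
# Maps an attribute type to the result column that holds its value.
# Assumption: typeChamps stores these exact strings.
VALUE_COLUMN = {
    "type_bool": "type_bool",
    "type_int": "type_int",
    "type_float": "type_float",
    "type_text": "type_text",
    "type_date": "type_date",
    "type_list": "valeurDeListe",  # list values come from the listValue join
}


def typed_value(row):
    """Return the single meaningful value of one result row of the SELECT."""
    try:
        column = VALUE_COLUMN[row["attributeType"]]
    except KeyError:
        raise ValueError("unknown attribute type: %r" % row["attributeType"])
    return row[column]


# Hypothetical rows shaped like the SELECT's output (unused columns omitted).
rows = [
    {"attributeName": "size", "attributeType": "type_int",
     "type_int": 5421, "type_bool": None, "valeurDeListe": None},
    {"attributeName": "color", "attributeType": "type_list",
     "type_int": None, "type_bool": None, "valeurDeListe": "red"},
]
values = [typed_value(r) for r in rows]
print(values)
```

Keeping the mapping in one place means a new type only needs one new dictionary entry, whichever table layout ends up being used.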
As you are only inserting into a single column each time, create a different table for each data type - if you are inserting large quantities of data you will be wasting a lot of space with this design.
Having fewer rows in each table will increase index lookup speed.
Your column names should describe the data in them, not the column type.
Read up on Database Normalisation.
Writing will not be an issue here. Reading will.
You have to ask yourself:
how often are you going to query this?
is old data modified, or is it append-only?
==> if the answers are "frequently" and "append-only" (or only minor modification of old data), a cache may solve your read issues, as you won't query the database so often.
There will be a lot of NULL fields in each row. That is OK if the table is not big, but as you said there will be millions of rows, so you will be wasting space and the queries will take longer to execute. Do something like this:
table1
id | type
table2
type | other fields
Advice I have, although it might not be the kind you want :-)
This looks like an entity-attribute-value schema; using this kind of schema leads to all kinds of maintenance and performance nightmares:
complicated queries to get all values for a master record (essentially, you'll have to left join your result table N times with itself to obtain N attributes for a master record)
no referential integrity (I'm assuming you'll have lookup values with separate master data tables; you cannot use foreign key constraints for this)
waste of disk space (since your table will be sparsely filled)
For a more complete list of reasons to avoid this kind of schema, I'd recommend getting a copy of SQL Antipatterns
In the end I implemented both solutions and benchmarked them.
For both solutions, there was a table element and a table attributeDefinition, as follows:
[attributeDefinition]
| id | group | name | type |
| 12 | 51 | 'The Bool attribute' | type_bool |
| 13 | 51 | 'The Int attribute' | type_int |
| 14 | 51 | 'The first Float attribute' | type_float |
| 15 | 51 | 'The second Float attribute'| type_float |
[element]
| id | group | name
| 42 | 51 | 'An element in the group 51'
First Solution (Best one)
One big table with one column per type and many empty cells. It holds each value of each attribute of each element.
[attributeValue]
| id | element | attributeDefinition | type_int | type_bool | type_float | ...
| 1 | 42 | 12 | NULL | TRUE | NULL | NULL...
| 2 | 42 | 13 | 5421 | NULL | NULL | NULL...
| 3 | 42 | 14 | NULL | NULL | 23.5 | NULL...
| 4 | 42 | 15 | NULL | NULL | 56.8 | NULL...
One table for attributeDefinition that describes each attribute of every element in a group.
Second Solution (Worse one)
8 tables, one for each type :
[type_float]
| id | group | element | value |
| 3 | 51 | 42 | 23.5 |
| 4 | 51 | 42 | 56.8 |
[type_bool]
| id | group | element | value |
| 1 | 51 | 42 | TRUE |
[type_int]
| id | group | element | value |
| 2 | 51 | 42 | 5421 |
Conclusion
My first benchmark looked at the database size. I had 1 500 000 rows in the big table, which means approximately 150 000 rows in each small table if there are 10 datatypes.
Looking in phpMyAdmin, the sizes are almost exactly the same.
First Conclusion: Empty cells don't take up space.
My second benchmark tested performance: getting all values of all attributes of all elements in one group. There are 15 groups in the database. Each group has:
400 elements
30 attributes per element
So that is 12 000 rows in [attributeValue] or 1200 rows in each table [type_*].
The first SELECT only does one join between [attributeValue] and [element] to put a WHERE on the group.
The second SELECT uses a UNION of 10 SELECTs, one on each table [type_*].
That second SELECT is 10 times slower!
Second Conclusion: One table is better than many.
I'm attempting to design a small database for a customer. My customer has an organization that works with public and private schools; for every school that's involved, there's an implementation (a chapter) at each school.
To design this, I've put together two tables; one for schools and one for chapters. I'm not sure, however, if I should merge the two together. The tables are as follows:
mysql> describe chapters;
+--------------------+------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------------+------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| school_id | int(10) unsigned | NO | MUL | | |
| is_active | tinyint(1) | NO | | 1 | |
| registration_date | date | YES | | NULL | |
| state_registration | varchar(10) | YES | | NULL | |
| renewal_date | date | YES | | NULL | |
| population | int(10) unsigned | YES | | NULL | |
+--------------------+------------------+------+-----+---------+----------------+
7 rows in set (0.01 sec)
mysql> describe schools;
+----------------------+------------------------------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+----------------------+------------------------------------+------+-----+---------+----------------+
| id | int(10) unsigned | NO | PRI | NULL | auto_increment |
| full_name | varchar(255) | NO | MUL | | |
| classification | enum('high','middle','elementary') | NO | | | |
| address | varchar(255) | NO | | | |
| city | varchar(40) | NO | | | |
| state | char(2) | NO | | | |
| zip | int(5) unsigned | NO | | | |
| principal_first_name | varchar(20) | YES | | NULL | |
| principal_last_name | varchar(20) | YES | | NULL | |
| principal_email | varchar(20) | YES | | NULL | |
| website | varchar(20) | YES | | NULL | |
| population | int(10) unsigned | YES | | NULL | |
+----------------------+------------------------------------+------+-----+---------+----------------+
12 rows in set (0.01 sec)
(Note that these tables are incomplete - I haven't implemented foreign keys yet. Also, please ignore the varchar sizes for some of the fields, they'll be changing.)
So far, the pros of keeping them separate are:
Separate queries of schools and
chapters are easier. I don't know if
it's necessary at the moment, but
it's nice to be able to do.
I can make a chapter inactive
without directly affecting the
school information.
General separation of data - the fields in
"chapters" are directly related to
the chapter itself, not the school
in which it exists. (I like the
organization - it makes more sense
to me. Also follows the "nothing but the key" mantra.)
If possible, we can collect school
data without having a chapter
associated with it, which may make
sense if we eventually want people
to select a school and autopopulate
the data.
And the cons:
Separate IDs for schools and
chapters. As far as I know, there
will only ever be a one-to-one
relationship between the two, so
doing this might introduce more
complexity that could lead to errors
down the line (like importing data
from a spreadsheet, which is unfortunately
something I'll be doing a lot of).
If there's a one-to-one ratio, and
the IDs are auto_increment fields,
I'm guessing that the chapter_id and
school_id will end up being the same - so why not just put them in a single table?
From what I understand, the chapters
aren't really identifiable on their
own - they're bound to a school, and
as such should be a subset of a
school. Should they really be
separate objects in a table?
Right now, I'm leaning towards keeping them as two separate tables; it seems as though the pros outweigh the cons, but I want to make sure that I'm not creating a situation that could cause problems down the line. I've been in touch with my customer and I'm trying to get more details about the data they store and what they want to do with it, which I think will really help. However, I'd like some opinions from the well-informed folks on here; is there anything I haven't thought of? The bottom line here is just that I want to do things right the first time around.
I think they should be kept separate. But, you can make the chapter a subtype of a school (and the school the supertype) and use the same ID. Elsewhere in the database where you use SchoolID you mean the school and where you use ChapterID you mean the chapter.
CREATE TABLE School (
SchoolID int unsigned NOT NULL AUTO_INCREMENT,
CONSTRAINT PK_School PRIMARY KEY (SchoolID)
);
CREATE TABLE Chapter (
ChapterID int unsigned NOT NULL,
CONSTRAINT PK_Chapter PRIMARY KEY (ChapterID),
CONSTRAINT FK_Chapter_School FOREIGN KEY (ChapterID) REFERENCES School (SchoolID)
);
Now you can't have a chapter unless there's a school first. If such a time occurred that you had to allow multiple chapters per school, you would recreate the Chapter table with ChapterID as identity/auto-increment, add a SchoolID column populated with the same value and put the FK on this one to School, and continue as before, only inserting the ID to SchoolID instead of ChapterID. If MySQL supports inserting explicit values to an autoincrement column, then making it SchoolID autoincrement ahead of time could save you trouble later (unless switching a regular column to autoincrement is supported in which case no issues there).
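The shared-ID idea can be tried out quickly with Python's built-in sqlite3 module. This is a sketch that uses SQLite as a stand-in, so the column types and the PRAGMA differ from what you would write in MySQL; the full_name and is_active columns are borrowed from the question's tables and the school name is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on.
conn.execute("PRAGMA foreign_keys = ON")

conn.execute(
    "CREATE TABLE School ("
    " SchoolID INTEGER PRIMARY KEY AUTOINCREMENT,"
    " full_name TEXT NOT NULL)"
)
# ChapterID is both the primary key and a foreign key to School.
conn.execute(
    "CREATE TABLE Chapter ("
    " ChapterID INTEGER PRIMARY KEY,"
    " is_active INTEGER NOT NULL DEFAULT 1,"
    " FOREIGN KEY (ChapterID) REFERENCES School (SchoolID))"
)

# A school first, then its chapter, reusing the same ID.
school_id = conn.execute(
    "INSERT INTO School (full_name) VALUES (?)", ("Example High",)
).lastrowid
conn.execute("INSERT INTO Chapter (ChapterID) VALUES (?)", (school_id,))

# A chapter whose ID matches no school is rejected.
try:
    conn.execute(
        "INSERT INTO Chapter (ChapterID) VALUES (?)", (school_id + 1,)
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

joined = conn.execute(
    "SELECT s.full_name, c.is_active"
    " FROM School s JOIN Chapter c ON c.ChapterID = s.SchoolID"
).fetchall()
print(orphan_rejected, joined)
```

Because the primary key of Chapter is also the foreign key, a school can have at most one chapter, and the join between the two tables is on primary keys on both sides.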
Additional benefits of keeping them separate:
You can make foreign key relationships directly with SchoolID or ChapterID so that the data you're storing is always correct (for example, if no chapter exists yet you can't store related data for such a thing until it is created).
Querying each table separately will perform better as the rows don't contain extraneous information.
A school can be created with certain required columns, but the chapter left uncreated (temporarily). Then, when it is created, you can have some NOT NULL columns in it as well.
keep them separate.
they may be 1-1 currently... however these are clearly separate concepts.
will they eventually want to input schools which do not have chapters? perhaps as part of a sales lead system?
can there really only be one chapter per school or just one active chapter ? what about across time? is it possible they will request a report with all chapters in the past 10 years at x school ?
You said the links will always be 1 to 1, but does a school always have a chapter, and can it change chapters? If so, then keeping chapters separate is a good idea.
Another reason to keep them separate is if the amount of information about the two entities combined would make the records longer than the database backend can handle. One-to-one tables are often built to keep the amount of data that needs to be stored in a record down to an appropriate size.
Further, is the requirement a firm 1-1, or does it have the potential to become 1-many? If the latter, make it a separate table now. Is there the potential to have schools without chapters? Again, I'd keep them separate.
And how are you intending to query this data? If you will generally need the data about both the chapter and the school in the same queries, then you might put them in one table, provided you are sure there is no possibility of it turning into a 1-many relationship. However, a proper join with the join fields indexed should be fast anyway.
I tend to see these as separate entities and would keep them separate unless there was a critical performance problem that would justify putting them together. I think that having separate entities in separate tables from the start tends to be less risky than putting them together. And performance would normally be perfectly acceptable as long as the indexing is correct, and may even be better if you don't normally need to query data from both tables all the time.