How to denormalize repeating MySQL data? - mysql

Hi, I need to do some denormalizing on a MySQL table with repeating data.
My "Publications" table is currently in this format:
Publications Source Table
| title | author
--------------------------------------------
| my paper | michael
| my paper | bill
| my paper | jill
| other paper | tom
| other paper | amy
| third paper | ben
| third paper | sophie
I need to change it to this format:
Publications Destination Table
| title | author | author2 | author3
|-----------------------------------------------------------------
| my paper | michael | bill | jill
| other paper | tom | amy |
| third paper | ben | sophie |
Now, just for your information I need to do this so I can eventually get a CSV file so the data can be exported from an old system into a new system that requires a CSV file in this format.
Also there are many other fields in the table and about 60,000 rows in the source table, but only about 15,000 unique titles. In the source table there is one row per author. In the destination, title will be a unique identifier. I need one row per unique publication title. Also I can calculate in advance what the most number of authors is on any one publication, if that makes the problem easier.
How can I do this in MySQL? Thanks

If you don't actually want to alter the structure of the table, and instead just want to get the data out so you can import it into a new system, you could try the GROUP_CONCAT() function in mysql:
SELECT title, GROUP_CONCAT(author SEPARATOR "|") AS authors FROM publications GROUP BY title;
I've used the pipe as a separator as there's a good chance your titles will contain commas. If you want this to end up as a csv file, you could do a find-and-replace on the pipe character to turn it into whatever it needs to be (e.g., ", ").
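If you do need the separate author, author2, author3 columns of your destination layout rather than one concatenated field, conditional aggregation over a numbered author list can produce that shape. This is only a sketch: it assumes MySQL 8.0+ (for ROW_NUMBER()), the same publications table, and that you already know the maximum number of authors (three here). The ORDER BY inside the window is arbitrary, so if the original author order matters you would need a column that records it.
SELECT title,
       MAX(CASE WHEN rn = 1 THEN author END) AS author,
       MAX(CASE WHEN rn = 2 THEN author END) AS author2,
       MAX(CASE WHEN rn = 3 THEN author END) AS author3
FROM (
    SELECT title, author,
           ROW_NUMBER() OVER (PARTITION BY title ORDER BY author) AS rn
    FROM publications
) AS numbered
GROUP BY title;
Add one more MAX(CASE ...) line per extra author column if your maximum author count is higher.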

My recommendation is that you actually normalize the table instead of adding new columns for supplemental authors. So your new table structure would look something like this:
Publications (titles) table
| title_id | title
--------------------------------------------
| 1 | my paper
| 2 | other paper
| 3 | third paper
Authors table
| title_id | author
--------------------------------------------
| 1 | michael
| 1 | bill
| 1 | jill
| 2 | tom
| 2 | amy
| 3 | ben
| 3 | sophie
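If you do normalize, the split could be created and populated from the existing data along these lines. A sketch only: the table and column names here (titles, title_authors) are my assumptions, and the source table is again assumed to be publications.
CREATE TABLE titles (
    title_id INT AUTO_INCREMENT PRIMARY KEY,
    title    VARCHAR(255) NOT NULL UNIQUE
);
CREATE TABLE title_authors (
    title_id INT NOT NULL,
    author   VARCHAR(255) NOT NULL,
    FOREIGN KEY (title_id) REFERENCES titles (title_id)
);
-- one row per distinct title
INSERT INTO titles (title)
SELECT DISTINCT title FROM publications;
-- one row per (title, author) pair, re-keyed to the new title_id
INSERT INTO title_authors (title_id, author)
SELECT t.title_id, p.author
FROM publications p
JOIN titles t ON t.title = p.title;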

Related

How should I build up my database when I want to store this kind of data?

I want to build a page like the one shown below, and all data should be retrieved from a database. The terms, subjects and sentences are all retrieved from the database: three levels of data. And under each term (e.g. Spring 2017) I can pick and choose between all of these sentences.
Spring 2017
Subject1
Sentence 1
Sentence 2
Sentence 3
Subject2
Sentence 13
Sentence 12
Sentence 17
Subject3
Sentence 11
Sentence 14
Sentence 19
Autumn 2017
...
I want to present similar info from the database to the user, and let the user choose between all these sentences. How should I build up my database to achieve this in the best and most efficient way?
One way is:
Table 'subject'
| id | subjects |
| 3  | Subject1 |
| 4  | Subject2 |
Table 'sentences'
| id | subjectid | name        |
| 1  | 3         | Sentence 2  |
| 2  | 4         | Sentence 13 |
Table 'term'
| id | term      | sentenceid |
| 1  | Spring 17 | 1,2,28     |
Another way is maybe using pivot-tables, something like this:
Table 'sentences'
| id | parentid | name |
| 1 | 0 | Subject2 |
| 2 | 3 | Sentence 2 |
| 3 | 0 | Subject1 |
| 4 | 1 | Sentence 13 |
Table 'term'
| id | term | sentenceid |
| 1 | Spring 17 | 2,4,28 |
Notice: Number of terms can be many more than just two in a year.
Do you recommend either of these structures, or is there another way you think I should build my database? Is one of these more efficient? Less demanding? Easier to adjust?
You are doing relational analysis/design:
Find all substantives/nouns of your domain. These are candidates for tables.
Find any relationships/associations between those substantives. "Has", "consists of", "belongs to", "depends on" and so on. Divide them into 1:1, 1:n, n:m associations.
Look hard at the 1:1 ones and check whether you can merge two of your original tables into one.
The 1:n ones lead to foreign keys in one of the tables.
The n:m ones give you additional association tables, possibly with their own attributes.
That's about it. I would strongly advise against optimizing for speed or space at this point. Any modern RDBMS will be totally indifferent to the number of rows you are likely to encounter in your example. All database related software (ORMs etc.) expects such a clean model. Packing ids into comma separated fields is an absolute no-no as it defeats all mechanisms your RDBMS has to deal with such data; it makes the application harder to program; it confuses GUIs and so on.
Making weird choices in your table setup so they deviate from a clean model of your domain is the #1 cause of trouble along the way. You can optimize for performance later, if and when you actually get into trouble. Except for extreme cases (huge data sets or throughput), such optimisation primarily takes place inside the RDBMS (indexes, storage parameters, buffer management etc.) or by optimizing your queries, not by changing the tables.
If the data is hierarchical, consider representing it with a single table, with one column referencing a simple lookup for the "entry type".
Table AcademicEntry
================================
| ID | EntryTypeID | ParentAcademicEntryID | Description |
==========================================================
| 1 | 3 | 3 | Sentence 1 |
| 2 | 1 | <null> | Spring 2017 |
| 3 | 2 | 2 | Subject1 |
Table EntryType
================================
| ID | Description |
====================
| 1 | Semester |
| 2 | Subject |
| 3 | Sentence |
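To read one term's content back out of this single-table design you walk the parent links. A sketch against the AcademicEntry/EntryType tables above; since the depth is fixed at term -> subject -> sentence, two self-joins are enough:
SELECT subj.Description AS subject,
       sent.Description AS sentence
FROM AcademicEntry term
JOIN AcademicEntry subj ON subj.ParentAcademicEntryID = term.ID
JOIN AcademicEntry sent ON sent.ParentAcademicEntryID = subj.ID
WHERE term.Description = 'Spring 2017'
  AND term.EntryTypeID = 1   -- Semester
  AND subj.EntryTypeID = 2   -- Subject
  AND sent.EntryTypeID = 3;  -- Sentence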
Start with the terms. Every term has subjects. Every subject has sentences. Then you may need the position of a subject within a term and probably the position of a sentence in a subject.
Table 'term'
id | term
---+------------
1 | Spring 2017
Table 'subject'
id | title | termid | pos
---+----------+--------+----
3 | Subject1 | 1 | 1
4 | Subject2 | 1 | 2
5 | Subject3 | 1 | 3
Table 'sentence'
id | name | subjectid | pos
---+-------------+-----------+-----
1 | Sentence 2 | 3 | 2
2 | Sentence 13 | 4 | 1
3 | Sentence 1 | 3 | 1
4 | Sentence 3 | 3 | 3
5 | Sentence 17 | 4 | 3
...
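Reading a whole term back out of this design is then a simple pair of joins, ordered by the pos columns. A sketch, assuming the term/subject/sentence tables sketched above:
SELECT t.term, su.title AS subject, se.name AS sentence
FROM term t
JOIN subject su ON su.termid = t.id
JOIN sentence se ON se.subjectid = su.id
WHERE t.id = 1
ORDER BY su.pos, se.pos;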
This table design should resolve your need.
TblSeason
(
SeasonId int,
SeasonName varchar(30)
)
TblSubject
(
SubjectId int,
SeasonId int (FK to TblSeason),
SubjectData varchar(max)
)
TblSentences
(
SentenceId int,
SubjectId int (FK to TblSubject),
SentenceData varchar(max)
)

Multiple Data Sources in Microsoft Excel SQL Query

I have a lot of spreadsheets that pull transactional information from our ERP software into Excel using the Microsoft Query that we then perform other calculations on automatically. Recently we upgraded our ERP system, but management made the decision to leave the transactional history in the old databases to have a clean one going forward in the new system. I still need to have some "rolling 12 months" graphs, but if I use only the old database, I'm missing new data and if I use only the new, I'm missing the last 11 months data.
Is there a way that I can write a query in Excel to pull data from the old database PartTran table and merge it with the new database PartTran table without user intervention each time? For instance, I don't want my users (if possible) to have to maintain two queries whose results they copy and paste into one Excel table. The schemas of the tables (at least the columns I need) are identically named and defined.
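For what it's worth, if the old and new databases happen to live on the same SQL Server instance, the merge can sometimes be written directly in the Microsoft Query SQL with a UNION ALL. This is only a sketch: the database names and the column list are pure assumptions and would have to match your ERP's actual schema.
SELECT PartNum, TranDate, TranQty
FROM OldERP.dbo.PartTran
UNION ALL
SELECT PartNum, TranDate, TranQty
FROM NewERP.dbo.PartTran;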
If you want to take a bit of a fun, hacky Excel approach, you could do the "copy-paste" bit FOR your users behind the scenes. Given two similar tables OLD and NEW with structures
+-----+------+-------+------------+
| id | foo | bar | date |
+-----+------+-------+------------+
| 95 | blah | $25 | 2015-06-01 |
| 96 | bork | $12 | 2015-07-01 |
| 97 | bump | $200 | 2015-08-01 |
| 98 | fizz | | 2015-09-01 |
| 99 | buzz | $50 | 2015-10-01 |
| 100 | char | ($1) | 2015-11-01 |
| 101 | mope | | 2015-12-01 |
+-----+------+-------+------------+
and
+----+-----+-------+------------+------+---------+
| id | foo | bar | date | fizz | buzz |
+----+-----+-------+------------+------+---------+
| 1 | cat | ($10) | 2016-01-01 | 285B | 1110111 |
| 2 | dog | $25 | 2016-02-01 | 27F5 | 1110100 |
| 3 | ant | $100 | 2016-03-01 | 1F91 | 1001111 |
+----+-----+-------+------------+------+---------+
... you can union together the data for these two datasets with some prudent excel wizardry as below:
Your UNION table ( named using alt+j+t+a ) should have the following items:
New natural ID
DataSet pointer ( name of old or new table )
Derived ID from original dataset
Columns of data you want from Old & New DataSets
example:
+---------+------------+------------+----+------+-----+------------+------+------+
| UnionId | SourceName | SourceRank | id | foo | bar | date | fizz | buzz |
+---------+------------+------------+----+------+-----+------------+------+------+
| 1 | OLD | | | | | | | |
| 2 | NEW | | | | | | | |
+---------+------------+------------+----+------+-----+------------+------+------+
You will then make judicious use of Indirect() and VlookUp() to derive the lookup id and column targets. Sample code below
SourceRank - helper column
=COUNTIFS([SourceName],[#SourceName],[UnionId],"<="&[#UnionId])
id - the id from the original DataSet
=SMALL(INDIRECT([#SourceName]&"[id]"),[#SourceRank])
Everything else is just VlookUp madness!! Although I've taken the liberty of copying the sample code below for reference
foo =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[foo]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
bar =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[bar]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
date =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[date]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
fizz =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[fizz]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
buzz =VLOOKUP([#id],INDIRECT([#SourceName]),MATCH(UNION[[#Headers],[buzz]],INDIRECT([#SourceName]&"[#Headers]"),0),0)
Output
You'll likely want to make prudent use of If() and/or IfError() to help your users ignore the new column references to the old table and those rows that do not yet have data. Without that, however, you'll end up with something like the below.
This is both ready to accept & read new inputs to both OLD and NEW DataSets and is sortable to get rid of those pesky placeholder rows...
Hope this helps! Happy coding!

MySQL - At what point should more than one table be used?

Edit for future viewers: Aside from the accepted answer which helped me, I found some really good info here.
I've got a database with a single table for displaying inventory on a website (RVs). It stores the typical info: year, make, model, etc. I originally made it with 6 extra columns for storing "special features", but I don't like having such a hard limit on what options can be listed. Since I've never messed with more than a single table my gut instinct was to just add 24 or so more columns to cover everything, but something in my head told me that there might be a better way. So when do I decide N columns is too many? The data in these columns will commonly not be unique.
(Sorry for crappy diagram)
Current table design:
-----------------------------------------------------------------------
| id | year | make | model | price | ft_1 | ft_2 | ft_3 | ft_4 | ft_5 |
-----------------------------------------------------------------------
| | | | | | | | | | |
-----------------------------------------------------------------------
Possible better design:
table #1
------------------------------------
| id | year | make | model | price |
------------------------------------
| | | | | |
------------------------------------
table #2
---------------------------------------------
| unique_id(?) | feature | unit_ref |
---------------------------------------------
| 0 | "Diesel Pusher" | 2,6,14 |
---------------------------------------------
I feel like a bonus of the second table might be that I could more easily populate a dropdown containing all the previously entered features to speed up adding new units to inventory.
Is this the right way to go about it, or should I just add more columns and be content?
Thanks.
Believe it or not, your best option would likely be to add a third table.
Since each record in your rvs table can be linked to multiple rows in the features table, and each feature can correspond to multiple rvs, you have a many-to-many relationship which is inherently difficult to maintain in a relational dbms. By adding a third "intersection" table you convert it to a one-to-many-to-one relationship which can be enforced declaratively by the dbms.
Your table structure would then become something like
rvs
------------------------------------
| id | year | make | model | price |
------------------------------------
| | | | | |
------------------------------------
features
--------------------------
| id | feature |
--------------------------
| 1192 | "Diesel Pusher" |
--------------------------
rv_features
----------------------
| rv_id | feature_id |
----------------------
| | |
----------------------
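For reference, the declarative enforcement mentioned above could look something like this in MySQL DDL. A sketch only; the column types are assumptions:
CREATE TABLE rvs (
    id    INT PRIMARY KEY,
    year  SMALLINT,
    make  VARCHAR(50),
    model VARCHAR(50),
    price DECIMAL(10,2)
);
CREATE TABLE features (
    id      INT PRIMARY KEY,
    feature VARCHAR(100) NOT NULL
);
CREATE TABLE rv_features (
    rv_id      INT NOT NULL,
    feature_id INT NOT NULL,
    PRIMARY KEY (rv_id, feature_id),
    FOREIGN KEY (rv_id)      REFERENCES rvs (id),
    FOREIGN KEY (feature_id) REFERENCES features (id)
);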
How do you make use of this? Suppose you want to record the fact that the 2016 Travelmore CampMaster has a 25kW diesel generator. You would first add a record to rvs like
--------------------------------------------------
| id | year | make | model | price |
--------------------------------------------------
| 0231 | 2016 | Travelmore | CampMaster | 750000 |
| 2101 | 2016 | Travelmore | Domestant | 650000 |
--------------------------------------------------
(Note the value in the id column is entirely arbitrary; its sole purpose is to serve as the primary key which uniquely identifies the record. It can encode meaningful information, but it must be something that will not change throughout the life of the record it identifies.)
You then add (or already have) the generator in the features table:
--------------------------------
| id | feature |
--------------------------------
| 1192 | Diesel Pusher 450hp |
| 3209 | diesel generator 25kW |
--------------------------------
Finally, you associate the rv to the feature with a record in rv_features:
----------------------
| rv_id | feature_id |
----------------------
| 0231 | 3209 |
| 0231 | 1192 |
| 2101 | 3209 |
----------------------
(I've added a few other records to each table for context.)
Now, to retrieve the features of the 2016 CampMaster, you use the following SQL query:
SELECT r.year, r.make, r.model, f.feature
FROM rvs r, features f, rv_features rf
WHERE r.id = rf.rv_id
AND rf.feature_id = f.id
AND r.id = '0231';
to get
----------------------------------------------------------
| year | make | model | feature |
----------------------------------------------------------
| 2016 | Travelmore | CampMaster | diesel generator 25kW |
| 2016 | Travelmore | CampMaster | Diesel Pusher 450hp |
----------------------------------------------------------
To see the rvs with a 25kW generator, change the query to
SELECT r.year, r.make, r.model, f.feature
FROM rvs r, features f, rv_features rf
WHERE r.id = rf.rv_id
AND rf.feature_id = f.id
AND f.id = '3209';
Sherantha's link to A Quick-Start Tutorial on Relational Database Design actually looks like a good intro to table design and normalization; you might find it useful.
There is a thing called "third normal form"; roughly, it says that everything apart from the unique ids should not be stored redundantly. This means you would make a table for year, a table for make, a table for model, etc., and a table where you combine all these ids into one connected dataset.
But this is not always practical. I think the best way is something in between, like tables for entries that repeat very often; but there doesn't need to be an extra table for price with unique ids, that would be overkill I think.
Based on your scenario, if you believe the number of feature columns will remain the same, then there is no need for a second table. But if there is any possibility that features may be added at any time in the future, then you should break your table into two (RVS & Features). Then create a third table that links RVS & Features, since it seems there is a many-to-many relationship. So I suggest you use three tables.
I think it is better for you to become more familiar with relational database design. This is a short but great article I found earlier.

How to save language skill levels correctly in a database

I think I am facing a problem that many of you have faced before. I have a registration form where a user can pick any language of the planet and then pick his skill level for the respective language from a selectbox.
So, for example:
Language1: German
Skill: Fluent
Language2: English
Skill: Basic
I'm thinking what's the best way to store these values in a MySQL database.
I thought of two ways.
First way: creating a column for each language and assigning a skill value to it.
--------------------------------------------------
| UserID | language_en | language_ge |
--------------------------------------------------
| 22 | 1 | 4 |
--------------------------------------------------
| 23 | 3 | 4 |
--------------------------------------------------
So the language is always the column's name and the number represents the skill level (1. Basic, 2. Average ... )
I believe this is a nice way to work with these things and it is also pretty fast. The problem starts when there are 50 languages or more. It doesn't sound like a good idea to make 50 columns where the script always has to check them all to see if a user has any skill in that language.
Second way: inserting an array in one of the table's column. The table will look like this:
----------------------------------
| UserID | languages |
----------------------------------
| 22 | "ge"=>"4", "en"=>"1" |
----------------------------------
This way the user with ID 22 has skill level 4 for German and skill level 1 for English. This is fine because we don't need to check 50 additional columns (or even more), but it's not the right way in my eyes anyway.
We would have to parse a lot of results to find a user who, for example, has level 1 for German and level 2 for Spanish, without looking at the English skill level. It will take the server longer, and when bigger data comes we are in trouble.
I bet many of you have experienced this kind of issue. Please, can someone advise me how to sort this out?
Thanks a lot.
I'd advise you to have a separate table with all the languages:
Table: Language
+------------+-------------------+--------------+
| LanguageID | LanguageNameShort | LanguageName |
+------------+-------------------+--------------+
| 1 | en | English |
| 2 | de | German |
+------------+-------------------+--------------+
And another table to link the users to the languages:
Table: LanguageLink
+--------+------------+--------------+
| UserID | LanguageID | SkillLevelID |
+--------+------------+--------------+
| 22 | 1 | 1 |
| 22 | 2 | 4 |
| 23 | 1 | 3 |
| 23 | 2 | 4 |
+--------+------------+--------------+
This is the normalised way to represent that kind of relations in a DB. All data is easily searchable and you don't have to change the DB scheme if you add a language.
To render a user's languages you could use a query like this. It will give you one row per language the user speaks:
SELECT
LanguageLink.UserID,
LanguageLink.SkillLevelID,
Language.LanguageNameShort
FROM
LanguageLink,
Language
WHERE
LanguageLink.UserID = 22
AND LanguageLink.LanguageID = Language.LanguageID
If you want to go further, you could create another table for the skill level:
Table: Skill
+--------------+-----------+
| SkillLevelID | SkillName |
+--------------+-----------+
| 1 | bad |
| 2 | mediocre |
| 3 | good |
| 4 | perfect |
+--------------+-----------+
What I've done here is called Database normalization. I'd recommend reading about it, it may help you design further databases.
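As a usage example, a search such as "every user who speaks German at level 3 or better" stays simple with this layout. A sketch against the Language / LanguageLink / Skill tables above:
SELECT LanguageLink.UserID, Skill.SkillName
FROM LanguageLink
JOIN Language ON Language.LanguageID = LanguageLink.LanguageID
JOIN Skill ON Skill.SkillLevelID = LanguageLink.SkillLevelID
WHERE Language.LanguageNameShort = 'de'
  AND LanguageLink.SkillLevelID >= 3;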

How can I save semantic information in a MySQL table?

I wish to save some semantic information about data in a table. How can I save this information in MySQL, such that I can access the data and also search for articles using the semantic data?
For example, I have an article about Apple and Microsoft. The semantic data will be like
Person : Steve Jobs
Person : Steve Ballmer
Company : Apple
Company : Microsoft
I want to save the information without losing the info that Steve Jobs and Steve Ballmer are persons and Apple and Microsoft are companies. I also want to search for articles about Steve Jobs / Apple.
Person and Company are not the only possible types, hence adding new fields is not viable. Since the type of the data is to be saved, I cannot use FullText field type directly.
Update - These are two options that I am considering.
Save the data in a full text column as serialized php array.
Create another table with 3 columns
--------------------------------
| id | subject | object |
--------------------------------
| 1 | Person | Steve Ballmer |
| 1 | Person | Steve Jobs |
| 1 | Company | Microsoft |
| 1 | Company | Apple |
| 2 | Person | Obama |
| 2 | Country | US |
--------------------------------
You're working on a hard and interesting problem! You may get some interesting ideas from looking at the Dublin Core Metadata Initiative.
http://dublincore.org/metadata-basics/
To make it simple, think of your metadata items as all fitting in one table.
e.g.
Ballmer employed-by Microsoft
Ballmer is-a Person
Microsoft is-a Organization
Microsoft run-by Ballmer
SoftImage acquired-by Microsoft
SoftImage is-a Organization
Joel Spolsky is-a Person
Joel Spolsky formerly-employed-by Microsoft
Spolsky, Joel dreamed-up StackOverflow
StackOverflow is-a Website
Socrates is-a Person
Socrates died-on (some date)
The trick here is that some, but not all, of your first and third column values need to BOTH be arbitrary text AND serve as indexes into the first and third columns. Then, if you're trying to figure out what your database has on Spolsky, you can full-text search your first and third columns for his name. You'll get out a bunch of triplets. The values you find will tell you a lot. If you want to know more, you can search again.
To pull this off you'll probably need to have five columns, as follows:
Full text subject (whatever your user puts in)
Canonical subject (what your user puts in, massaged into a standard form)
Relation (is-a etc)
Full text object
Canonical object
The point of the canonical forms of your subject and object is to allow queries like this to work even if your user puts in "Joel Spolsky" and "Spolsky, Joel" in two different places, even though they mean the same person.
SELECT *
FROM relationships a
JOIN relationships b ON (a.canonical_object = b.canonical_subject)
WHERE MATCH (subject,object) AGAINST ('Spolsky')
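A possible shape for that five-column table, so the MATCH ... AGAINST above has something to run against. This is a sketch; the types are assumptions, and MySQL needs a FULLTEXT index on (subject, object) for the query to work (InnoDB supports FULLTEXT from 5.6 on):
CREATE TABLE relationships (
    subject           TEXT,          -- full-text subject, exactly as entered
    canonical_subject VARCHAR(255),  -- subject massaged into a standard form
    relation          VARCHAR(100),  -- 'is-a', 'employed-by', ...
    object            TEXT,          -- full-text object, exactly as entered
    canonical_object  VARCHAR(255),  -- object massaged into a standard form
    FULLTEXT KEY ft_subject_object (subject, object)
);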
You might want to normalize your data table by making 2 tables.
----------------
| id | subject |
----------------
| 1 | Person |
| 2 | Company |
| 3 | Country |
----------------
-----------------------------------
| id | subject-id | object |
-----------------------------------
| 1 | 1 | Steve Ballmer |
| 2 | 1 | Steve Jobs |
| 3 | 2 | Microsoft |
| 4 | 2 | Apple |
| 5 | 1 | Obama |
| 6 | 3 | US |
-----------------------------------
This allows you to more easily see all the different subject types you have defined.