Merging two datasets into a new entity - mysql

I am new to all this and struggling to get my head around some of the conundrums it throws up. My area of interest is census data. I am currently taking the data from the 1901 and 1911 censuses and merging them into a new database. I then establish whether a particular person appears in both censuses. Once I am certain that 1901 Jack Thelad (aged 31) with ID 55 is the same person as 1911 Jack Thelad (aged 41) with ID 777, what is the best way to deal with the primary key issue?
1901 Jack Thelad ID55
1911 Jack Thelad ID777
MergedCensus Jack Thelad ID???
Should I treat the primary key like a social security number: allocate Jack Thelad a number in my MergedCensus and then copy that number back into the 1901 and 1911 data, effectively overwriting ID55 and ID777?

In this new database, which I assume you are designing, could you have a table like this:
newId | name        | 1901id | 1911id
------|-------------|--------|-------
1234  | Jack Thelad | 55     | 777
and then you could search it with a join:
SELECT n.name, c01.*, c11.*
FROM newtable n
JOIN `1901table` c01 ON c01.id = n.`1901id`
JOIN `1911table` c11 ON c11.id = n.`1911id`
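As a rough illustration of the mapping-table idea, here is a minimal sketch in Python using the standard sqlite3 module as a stand-in for MySQL. The names census_1901, census_1911, merged_person, id_1901, id_1911 and census_history are hypothetical, chosen only for this example. The point it shows is that the original census IDs can stay untouched while the merged table assigns its own key and references both.

```python
import sqlite3

# Hypothetical table and column names; SQLite stands in for MySQL here.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE census_1901 (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
CREATE TABLE census_1911 (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
-- One row per real person; the original census IDs are kept untouched
-- and referenced as foreign keys instead of being overwritten.
CREATE TABLE merged_person (
    person_id INTEGER PRIMARY KEY,
    name TEXT,
    id_1901 INTEGER REFERENCES census_1901(id),
    id_1911 INTEGER REFERENCES census_1911(id)
);
INSERT INTO census_1901 VALUES (55, 'Jack Thelad', 31);
INSERT INTO census_1911 VALUES (777, 'Jack Thelad', 41);
INSERT INTO merged_person VALUES (1234, 'Jack Thelad', 55, 777);
""")


def census_history(person_id):
    """Return (name, age in 1901, age in 1911) for one merged person."""
    return conn.execute(
        """
        SELECT m.name, c01.age, c11.age
        FROM merged_person AS m
        LEFT JOIN census_1901 AS c01 ON c01.id = m.id_1901
        LEFT JOIN census_1911 AS c11 ON c11.id = m.id_1911
        WHERE m.person_id = ?
        """,
        (person_id,),
    ).fetchone()


print(census_history(1234))
```

The LEFT JOINs are a design choice for this sketch: a person found in only one census would still get a row, with NULL for the missing year.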

Multiple taxonomy term repeating results in views. Remove repetition but keep the term

NOTE: I don't want to remove the repeating node; I want to merge the repeated rows.
I have a view that pulls a seminar content type along with the taxonomy terms attached to it. In the content type, the term reference field that holds the taxonomy term is a multi-value field, so whenever more than one taxonomy term is attached to a node, the node is repeated in the view result. I would like to fix this using Views and its API.
What I have now when the view pulls the result is:
Nid | Speaker name  | Location | Time
----+---------------+----------+-----
12  | Sanjok Gurung | London   | 1900
11  | John          | London   | 1900
10  | Sally         | London   | 1900
10  | Molly         | London   | 1900
In the table above, Sally and Molly are term references selected in the same node.
What I want is:
Nid | Speaker name  | Location | Time
----+---------------+----------+-----
12  | Sanjok Gurung | London   | 1900
11  | John          | London   | 1900
10  | Sally,Molly   | London   | 1900
I tried manipulating the results in views_pre_render, but that approach feels dirty. There should be a cleaner solution.
You can use the views_aggregator contrib module for this.
URL: https://www.drupal.org/project/views_aggregator
The documentation is in the README: http://cgit.drupalcode.org/views_aggregator/plain/README.txt?id=refs/heads/7.x-1.x
This is not actually a Views issue.
If you open Manage Display for the seminar content type and edit the display settings of the entity reference field, the Format drop-down offers a separator option; in its settings you can choose which separator to use (for example a comma or a dash).
Note: Make sure to edit the exact display mode (Teaser, Full content, or Default) that the view uses.
Maybe this will resolve the issue.
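For reference, the transformation being asked for (one row per node, with the term names joined by a separator) can be sketched outside Drupal. The Python below is only an illustration of the grouping logic, not Drupal or Views API code; merge_rows and the tuple layout are assumptions made for this example.

```python
def merge_rows(rows, separator=","):
    """Collapse rows that share a nid into one row, joining the speaker names."""
    merged = {}  # dicts keep insertion order, so the first-seen nid order is preserved
    for nid, speaker, location, time in rows:
        if nid not in merged:
            merged[nid] = [nid, [speaker], location, time]
        else:
            merged[nid][1].append(speaker)
    return [
        (nid, separator.join(names), location, time)
        for nid, names, location, time in merged.values()
    ]


view_rows = [
    (12, "Sanjok Gurung", "London", "1900"),
    (11, "John", "London", "1900"),
    (10, "Sally", "London", "1900"),
    (10, "Molly", "London", "1900"),
]

for row in merge_rows(view_rows):
    print(row)
```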

Fragmentation of related records in a table

If I have a table which contains previous product purchases of a user like so:
ID | ProductName | UserID
1 | Grapes | 3455
2 | Water | 1944
3 | Bread | 3455
4 | Milk | 3455
...
As you can see in the example above, user 3455 has bought grapes, bread and milk. If I wanted to retrieve all the products a user has bought, I would have to find each of the records which has user 3455.
Would storing all of user 3455's products together speed up the search for these records, like defragmenting a hard drive? And if so, would deleting the old records and re-adding them at the end of the table be a waste of processing power?
The fastest way to get the information you want is to have an index on t(UserId, ProductName). This is a covering index for the following query:
select ProductName
from t
where UserId = 3455;
This means that all the columns needed by the query are in the index, and in the proper order. So, the query optimizer can resolve the query using only the index. This should be quite fast.
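To see the covering-index idea in action, here is a small sketch using Python's sqlite3 module. SQLite is only a stand-in for MySQL, and the index name idx_user_product is made up for this example. SQLite's EXPLAIN QUERY PLAN reports when a query is answered from the index alone; in MySQL the comparable check would be EXPLAIN, whose output format differs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (ID INTEGER PRIMARY KEY, ProductName TEXT, UserID INTEGER);
INSERT INTO t VALUES (1, 'Grapes', 3455), (2, 'Water', 1944),
                     (3, 'Bread', 3455), (4, 'Milk', 3455);
-- Hypothetical index name; the column order matches the answer above.
CREATE INDEX idx_user_product ON t (UserID, ProductName);
""")

query = "SELECT ProductName FROM t WHERE UserID = ?"

# The products bought by user 3455.
products = [row[0] for row in conn.execute(query, (3455,))]

# The fourth column of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(
    row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + query, (3455,))
)

print(products)
print(plan)
```

If the plan text mentions a covering index, the engine never touched the table rows themselves, which is why physically regrouping the records is unnecessary.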

How to design my database to accommodate this data

I am developing a database for a payroll application, and one of the features I'll need is a table that stores the list of employees that work at each store, each day of the week.
Each employee has an ID, so my table looks like this:
Store   | Mon     | Tue     | Wed     | Thu      | Fri      | Sat      | Sun
Store 1 | 3,4,5   | 3,4,5   | 3,4,5   | 4,5,7    | 4,5,7    | 4,5,6,7  | 4,5,6,7
Store 2 | 1,8,9   | 1,8,9   | 1,8,9   | 1,8,9    | 1,8,9    | 1,8,9    | 1,8,9
Store 3 | 10,12   | 10,12   | 10,12   | 10,12    | 10,12    | 10,12    | 10,12
Store 4 | 15      | 15      | 15      | 16       | 16       | 16       | 16
Store 5 | 6,11,13 | 6,11,13 | 6,11,13 | 14,18,19 | 14,18,19 | 14,18,19 | 14,18,19
My question is, how do I represent that on my database? I came up with the following ideas:
Idea 1: Pretty much replicate the design above, creating a table with the following columns: [Store_id | Mon | Tue ... | Sat | Sun] and then store the list of employee IDs of each day as a string, with IDs separated by commas. I know that comma-separated lists are not good database design, but sometimes they do look tempting, as in this case.
Store_id | Mon     | Tue     | Wed     | Thu     | Fri     | Sat
---------+---------+---------+---------+---------+---------+----------
1        | '3,4,5' | '3,4,5' | '3,4,5' | '4,5,7' | '4,5,7' | '4,5,6,7'
2        | '1,8,9' | '1,8,9' | '1,8,9' | '1,8,9' | '1,8,9' | '1,8,9'
Idea 2: Create a table with the following columns: [Store_id | Day | Employee_id]. That way each employee working at a specific store at a specific day would be an entry in this table. The problem I see is that this table would grow quite fast, and it would be harder to visualize the data at the database level.
Store_id | Day | Employee_id
---------+-----+-------------
1 | mon | 3
1 | mon | 4
1 | mon | 5
1 | tue | 3
1 | tue | 4
Any of these ideas sound viable? Any better way of storing the data?
If I were you, I would store the employee data and the store data in separate tables but still keep the design of your main table, so do something like this:
CREATE TABLE stores (
    id INT AUTO_INCREMENT PRIMARY KEY,
    store_name VARCHAR(255)
    -- any other data for your store here
);
CREATE TABLE schedule (
    id INT AUTO_INCREMENT PRIMARY KEY,
    store_id INT, -- FK to the stores table id
    day VARCHAR(20),
    emp_id INT -- FK to the employees table id
);
CREATE TABLE employees (
    id INT AUTO_INCREMENT PRIMARY KEY,
    employee_name VARCHAR(255)
    -- whatever other employee data you need to store
);
I would have a table for stores and one for employees; that way you can keep specific data for each store or employee.
BONUS:
If you want a query that shows the store name with the employee names and their schedule, all you have to do is join the three tables:
SELECT s.store_name, sh.day, e.employee_name
FROM schedule sh
JOIN stores s ON s.id = sh.store_id
JOIN employees e ON e.id = sh.emp_id
This query has a limitation, though: the day is stored as text, so you cannot sort the rows in weekday order and the days may come back in any order. In practice you also need a days table with data about each day, so that you can order the results from the beginning or the end of the week.
If you do want a days table, it follows the same pattern:
CREATE TABLE days (
    id INT,
    day_name VARCHAR(20),
    day_type VARCHAR(55)
    -- any more data you want here
);
where day_name would be Mon, Tue, ... and day_type would be Weekday or Weekend.
Then all you would have to do for your query is:
SELECT s.store_name, d.day_name, e.employee_name
FROM schedule sh
JOIN stores s ON s.id = sh.store_id
JOIN employees e ON e.id = sh.emp_id
JOIN days d ON d.id = sh.day_id
ORDER BY d.id
Notice that the day column in the schedule table is replaced with a day_id column linked to the days table.
Hope that's helpful!
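To check that the four-table layout really returns rows in weekday order, here is a minimal sketch using Python's sqlite3 module in place of MySQL. The employee names are placeholders invented for the example, and only three of the seven days are loaded to keep it short; the store 4 assignments follow the question's grid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stores (id INTEGER PRIMARY KEY, store_name TEXT);
CREATE TABLE employees (id INTEGER PRIMARY KEY, employee_name TEXT);
CREATE TABLE days (id INTEGER PRIMARY KEY, day_name TEXT, day_type TEXT);
CREATE TABLE schedule (
    id INTEGER PRIMARY KEY,
    store_id INTEGER REFERENCES stores(id),
    day_id INTEGER REFERENCES days(id),
    emp_id INTEGER REFERENCES employees(id)
);
INSERT INTO stores VALUES (4, 'Store 4');
-- Placeholder employee names, not taken from the question.
INSERT INTO employees VALUES (15, 'Employee 15'), (16, 'Employee 16');
INSERT INTO days VALUES (1, 'Mon', 'Weekday'), (4, 'Thu', 'Weekday'),
                        (6, 'Sat', 'Weekend');
-- Rows are inserted out of weekday order on purpose.
INSERT INTO schedule (store_id, day_id, emp_id)
VALUES (4, 6, 16), (4, 1, 15), (4, 4, 16);
""")

rows = conn.execute("""
    SELECT s.store_name, d.day_name, e.employee_name
    FROM schedule AS sh
    JOIN stores AS s ON s.id = sh.store_id
    JOIN employees AS e ON e.id = sh.emp_id
    JOIN days AS d ON d.id = sh.day_id
    ORDER BY d.id
""").fetchall()

for row in rows:
    print(row)
```

Ordering by the numeric days.id is what makes Mon sort before Thu and Sat, which a plain text day column cannot do.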
The second design is correct for a relational database. One employee_id per row, even if it results in multiple rows per store per day.
The number of rows is not likely to get larger than the RDBMS can handle, if your example is accurate. You have no more than 4 employees per store per day, and 5 stores, and up to 366 days per year. So no more than 4 x 5 x 366 = 7320 rows per year, and perhaps fewer.
I regularly see databases in MySQL that have hundreds of millions or even billions of rows in a given table. So you can continue to run those stores for many years before running into scalability problems.
I upvoted John Ruddell's answer, which is basically your option #2 with the addition of tables to hold data about the store and the employee. I won't repeat what he said, but let me just add a couple of thoughts that are too long for a comment:
Never ever ever put comma-separated values in a database record. This makes the data way harder to work with.
Sure, either #1 or #2 makes it easy to query to find which employees are working at store 1 on Friday:
Method 1:
select Friday_employees from schedule where store_id='store 1'
Method 2:
select employee_id from schedule where store_id=1 and day='fri'
But suppose you want to know what days employee #7 is working.
With method 2, it's easy:
select day from schedule where employee_id=7
But how would you do that with method 1? You'd have to break the field up into its individual pieces and check each piece. At best that's a pain, and I've seen people screw it up regularly, like writing
where Friday_employees like '%7%'
Umm, except what if there's an employee number 17 or 27? You'll get them too. You could say
where Friday_employees like '%,7,%'
But then if the 7 is the first or the last on the list, it doesn't work.
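The pitfalls above are easy to reproduce. The sketch below uses Python's sqlite3 module as a stand-in for MySQL; the table names schedule_csv and schedule and the helper stores are hypothetical, and the row for store 3 ('17,27') is made up purely to trigger the false match.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Method 1: one comma-separated string per store (only Friday shown).
CREATE TABLE schedule_csv (store_id INTEGER, Friday_employees TEXT);
INSERT INTO schedule_csv VALUES (1, '4,5,7'), (2, '1,8,9'), (3, '17,27');
-- Method 2: one row per store, day and employee.
CREATE TABLE schedule (store_id INTEGER, day TEXT, employee_id INTEGER);
INSERT INTO schedule VALUES (1, 'fri', 4), (1, 'fri', 5), (1, 'fri', 7),
                            (2, 'fri', 1), (2, 'fri', 8), (2, 'fri', 9),
                            (3, 'fri', 17), (3, 'fri', 27);
""")


def stores(sql):
    """Run a query and return the store ids it yields."""
    return [row[0] for row in conn.execute(sql)]


# Matches employee 7 but also 17 and 27.
loose = stores(
    "SELECT store_id FROM schedule_csv "
    "WHERE Friday_employees LIKE '%7%' ORDER BY store_id"
)
# Misses employee 7 when it is first or last in the list.
strict = stores(
    "SELECT store_id FROM schedule_csv "
    "WHERE Friday_employees LIKE '%,7,%' ORDER BY store_id"
)
# One row per employee: a plain equality test is enough.
exact = stores(
    "SELECT store_id FROM schedule "
    "WHERE day = 'fri' AND employee_id = 7 ORDER BY store_id"
)

print(loose, strict, exact)
```

Only the one-row-per-employee table returns store 1 and nothing else.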
What if you want the user to be able to select a day and then give them the list of employees working on that day?
With method 2, easy:
select employee_id from schedule where day=#day
Then you use a parameterized query to fill in the value.
With method 1 ...
select case when #day='mon' then Monday_employees when #day='tue' then Tuesday_employees when #day='wed' then Wednesday_employees when #day='thu' then Thursday_employees when #day='fri' then Friday_employees when #day='sat' then Saturday_employees when #day='sun' then Sunday_employees end as day_employees from schedule
That's a beast, and if you do it a lot, sooner or later you're going to make a mistake and leave a day out or accidentally type "when day='thu' then Friday_employees" or some such. I've seen that happen often enough.
Even if you write those long, complex queries, performance will suck. If you have a field for employee_id, you can index it, so access by employee will be fast. If you have a comma-separated list of employees, then a query of the "like '%,7,%'" variety requires a sequential scan of every record in the table.

finding duplicates in MySQL with a null field when some SHOULD have a null field

Need some help from a MySQL expert here. I have a terrible database to work with and I'm trying to fix the structure a bit but this one has me baffled. The table initially had an id, name and 4 sell columns. I converted that to an id, name and single sell column as basically a pivot table. That was fine, next issue was to get rid of duplicates since not every entry had 4 sell entries.
So after the first operation I ended up with something like this:
id name sellid
1 bob 111
1 bob
1 bob
2 mary 112
2 mary 113
2 mary 114
2 mary 115
3 fred
3 fred
3 fred
3 fred
So by doing group by I managed to get it to the point where it looks like this:
id name sellid
1 bob 111
1 bob
2 mary 112
2 mary 113
2 mary 114
2 mary 115
3 fred
Now here is where I hit a wall. Fred is fine, he is supposed to have an entry but no sellid, Mary is also fine she has all 4 sellids full. Bob is the issue. How do I remove the empty sellid for him without affecting Fred?
I'd say what I tried but I am just at a complete loss here so I really haven't tried anything yet.
You are looking for an outer join between your names and other data:
SELECT * FROM
(SELECT DISTINCT id, name FROM my_table) t1 NATURAL LEFT JOIN
(SELECT * FROM my_table WHERE sellid IS NOT NULL) t2
See it on sqlfiddle.
But really, you should normalise your schema further so that you have a table of (personid, name) and a table of (personid, sellid) pairs (from which you essentially perform the above outer join as & when required to obtain the necessary records including NULLs).
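As a quick check of the outer-join approach on the sample data, here is a sketch using Python's sqlite3 module, which also supports NATURAL LEFT JOIN. It assumes the empty sellid cells are stored as NULL, as the answer's IS NOT NULL test does; if they are empty strings, the filter would need adjusting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, name TEXT, sellid INTEGER)")
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?, ?)",
    [
        (1, "bob", 111), (1, "bob", None),  # bob: one real sellid, one empty row
        (2, "mary", 112), (2, "mary", 113), (2, "mary", 114), (2, "mary", 115),
        (3, "fred", None),                  # fred: legitimately has no sellid
    ],
)

rows = conn.execute("""
    SELECT * FROM
      (SELECT DISTINCT id, name FROM my_table) AS t1 NATURAL LEFT JOIN
      (SELECT * FROM my_table WHERE sellid IS NOT NULL) AS t2
""").fetchall()

# Sort in Python so the printed order does not depend on the query planner.
rows.sort(key=lambda r: (r[0], r[2] or 0))

for row in rows:
    print(row)
```

Bob's empty row disappears because he has a non-NULL match, while Fred keeps a single row with a NULL sellid because the left join preserves every (id, name) pair.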

How to store a "location" for an "event" in MySQL

I have two entities: Event and Location. The relations are:
1 Event can have 1 Location.
1 Location can have many Events.
Basically I want to store events. And each event is hosted in a specific location. When I say specific location I mean:
Street, Number, City, Zip Code, State, Country
I basically have a design question, that I would like some help with:
1 - Right now I am thinking of doing the following:
The events table will have a location_id that points to a specific row in the locations table. The problem with this is:
I will have many repeated values across rows. For example, if one event is happening at 356 Matilda Street in San Francisco and another at 890 Matilda Street in San Francisco, the values Matilda Street and San Francisco will be duplicated many times in the locations table. How can I redesign this to normalize it?
So, basically, I would love to hear a good approach to this problem in terms of a relational database like MySQL.
If you want a strictly normalized database, you could have a table for street names, another for cities, another for states, and so on. You might even have an additional location table that holds unique combinations of street, city, and state; you'd add rows to this table each time an event occurs at a previously unknown location. Then each of your events would reference the appropriate row in the location table.
In practice, though, it's sometimes better simply to store the location data directly within the events table and tolerate the extra memory usage; there's always a trade-off between speed and memory use.
Another consideration: what happens if a street is renamed? Do you want old events to be associated with the old name or the new name?
Each location in the locations table should be uniquely identifiable by its PRIMARY KEY. Records in the events table then reference their associated location with column(s) that contain the value of that PRIMARY KEY.
For example, your locations table might contain:
location_id | Street | Number | City | Zip Code | State | Country
------------+----------------+--------+---------------+----------+-------+---------
1 | Matilda Street | 356 | San Francisco | 12345 | CA | USA
2 | Matilda Street | 890 | San Francisco | 12345 | CA | USA
Then your events table might contain:
event_id | location_id | Date | Description
---------+-------------+------------+----------------
1 | 1 | 2012-04-28 | Birthday party
2 | 1 | 2012-04-29 | Hangover party
3 | 2 | 2012-04-29 | Funeral
4 | 1 | 2012-05-01 | May day!
In this example, location_id is the PRIMARY KEY in the locations table and a FOREIGN KEY in the events table.
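The two tables above can be tried out with the sketch below, which uses Python's sqlite3 module as a stand-in for MySQL. The lower-case column names and the helper events_at are assumptions made for this example. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON; in MySQL, enforcement depends on the storage engine (InnoDB enforces them).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
CREATE TABLE locations (
    location_id INTEGER PRIMARY KEY,
    street TEXT, number TEXT, city TEXT,
    zip_code TEXT, state TEXT, country TEXT
);
CREATE TABLE events (
    event_id INTEGER PRIMARY KEY,
    location_id INTEGER NOT NULL REFERENCES locations(location_id),
    date TEXT,
    description TEXT
);
INSERT INTO locations VALUES
    (1, 'Matilda Street', '356', 'San Francisco', '12345', 'CA', 'USA'),
    (2, 'Matilda Street', '890', 'San Francisco', '12345', 'CA', 'USA');
INSERT INTO events VALUES
    (1, 1, '2012-04-28', 'Birthday party'),
    (2, 1, '2012-04-29', 'Hangover party'),
    (3, 2, '2012-04-29', 'Funeral'),
    (4, 1, '2012-05-01', 'May day!');
""")


def events_at(number, street):
    """Return the descriptions of all events held at one address, oldest first."""
    cursor = conn.execute(
        """
        SELECT e.description
        FROM events AS e
        JOIN locations AS l ON l.location_id = e.location_id
        WHERE l.number = ? AND l.street = ?
        ORDER BY e.date, e.event_id
        """,
        (number, street),
    )
    return [row[0] for row in cursor]


print(events_at("356", "Matilda Street"))
```

With the foreign key in place, inserting an event whose location_id has no matching row in locations is rejected.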
Your locations table should have a unique ID for each location, such as 1 = Matilda Street, 2 = Market Street: one record for each possible location, with no duplicates. Your events table should then have a location ID column that uses one of those IDs; again, one record per event, with no duplicates.
You can then join them like this:
SELECT events.event_name, locations.location_name
FROM events
JOIN locations on locations.location_id = events.location_id
The duplication is perfectly normal, because each row is a unique location. Beyond that, the design you have in mind works well when you want to filter, for example, all the places in San Francisco.