How to identify relationship between 3 dimensions? - many-to-many

I am trying to identify relationships between three dimensions in a large dataset. The dimensions are:
Debitor ID
CVR ID
KF ID
These are all listed in one long list where each row (Debitor, CVR and KF) is unique. However, a Debitor ID can appear more than once with two different CVR IDs, but possibly with the same KF ID. The same can happen with the other two dimensions, which means there are many-to-many relationships among the three.
Is it possible to write code that takes the dataset as input, loops through all rows, finds the relationships between them, and assigns each set of related rows a unique Client-Group?
I attempted to do a sketch of how the data is, and how the relationships are - and what I want to get in the end.
Screenshot with explanation of data relations and how it should be grouped in the end.
So how do I make a unique ID that groups related rows?

With pandas in Python:
df['unique_id'] = df.groupby(['DEBITOR', 'CVR', 'KF']).ngroup()
https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html#enumerate-groups
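Note that `ngroup()` over all three columns numbers each distinct (DEBITOR, CVR, KF) combination; it does not merge rows that share only one of the IDs. If the goal is a Client-Group that follows shared IDs transitively, one possible approach is a union-find (connected components) pass. The following is a minimal sketch, assuming the column names used above; the helper `assign_client_groups` and the sample data are made up for illustration.

```python
import pandas as pd


def assign_client_groups(df, cols=("DEBITOR", "CVR", "KF")):
    """Label rows so that rows sharing any ID, directly or transitively, get the same group."""
    parent = {}

    def find(x):
        # Walk up to the root of x's set, shortening the path on the way.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Every row links its three IDs together. Each value is tagged with its
    # column name so that equal values in different columns stay distinct.
    for row in df[list(cols)].itertuples(index=False):
        nodes = [(col, value) for col, value in zip(cols, row)]
        for other in nodes[1:]:
            union(nodes[0], other)

    # Number the roots consecutively in order of first appearance.
    labels = {}
    group_ids = []
    for value in df[cols[0]]:
        root = find((cols[0], value))
        group_ids.append(labels.setdefault(root, len(labels)))
    return pd.Series(group_ids, index=df.index)


# Hypothetical sample data: rows 0-2 are chained through DEBITOR 1 and CVR 11,
# rows 3-4 are linked through KF 103.
df = pd.DataFrame({
    "DEBITOR": [1, 1, 2, 3, 4],
    "CVR": [10, 11, 11, 12, 13],
    "KF": [100, 101, 102, 103, 103],
})
df["client_group"] = assign_client_groups(df)
```

With this sample, the first three rows end up in one group and the last two in another, even though no single ID is common to all rows of a group.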

Related

Junction table in one-to-many relationship?

I have a question regarding junction tables and I really hope you can help me out as it is confusing me. I know that junction tables are usually implemented to create two one-to-many relationships instead of one many-to-many, but see the example below:
Consider a hypothetical situation where a user can have multiple photos (like a portfolio), but there are also groups that can have multiple photos.
The situation would look something like this I believe (please correct me if I'm already wrong here):
Image 1
But isn't it preferable to use junction tables, as in the image below:
Image 2
In this way you prevent the Photo table from getting a lot of NULL values, assuming you set the two foreign keys, User_ID and Group_ID, to NOT NULL.
Thank you for your time and I hope someone could guide me with this.
Here is what I have done: I create a relator table that has three columns:
leftsideID, relatortype, RightSideID
In your case, you have two relationship types that connect photos to things:
users and groups.
So I would have two relatortypes, USERPHOTO and GROUPPHOTO.
All the RightSideIDs point to photo keys in the photo table.
All the leftsideIDs carry either the userid or the groupid, depending on the relator type. An example could be:
leftsideid  relatortype   rightsideid
==========  ============  ===========
1           'userphoto'   1
2           'groupphoto'  1
1           'userphoto'   2
This says (in three rows) that user 1 has two photos (1, 2) and group 2 has photo (1).
No nulls. I have used this design pattern a lot, with a high degree of success.
It's basically a generic junction table, given your example.
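The relator table described above could look roughly like this. This is only a sketch run through Python's built-in `sqlite3` module; the `photo` table, its `title` column, the sample rows, and the helper `photos_for` are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE photo (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
    CREATE TABLE relator (
        leftsideid  INTEGER NOT NULL,  -- user id or group id, depending on relatortype
        relatortype TEXT    NOT NULL CHECK (relatortype IN ('userphoto', 'groupphoto')),
        rightsideid INTEGER NOT NULL REFERENCES photo(id),
        PRIMARY KEY (leftsideid, relatortype, rightsideid)
    );
    INSERT INTO photo VALUES (1, 'sunset'), (2, 'harbour');
    -- The three example rows from the answer.
    INSERT INTO relator VALUES (1, 'userphoto', 1), (2, 'groupphoto', 1), (1, 'userphoto', 2);
""")


def photos_for(owner_id, relatortype):
    """Return the photo ids linked to one user or group through the relator table."""
    rows = conn.execute(
        "SELECT p.id FROM relator AS r JOIN photo AS p ON p.id = r.rightsideid "
        "WHERE r.leftsideid = ? AND r.relatortype = ? ORDER BY p.id",
        (owner_id, relatortype),
    )
    return [row[0] for row in rows]
```

One trade-off worth noting: because `leftsideid` can point at either a user or a group, it cannot be declared as a foreign key to a single table, so that side of the relationship has to be checked by the application.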

MySQL: Data structure for transitive relations

I tried to design a data structure for easy and fast querying (delete, insert and update speed does not really matter to me).
The problem: transitive relations. One entry can be related to others through intermediate entries, and I don't want to store every such relation separately.
That is, I know that Entry-A is related to Entry-B and that Entry-B is related to Entry-C. Even though I don't know explicitly that Entry-A is related to Entry-C, I want to be able to query it.
What I think the solution is:
Eliminating the transitive part when inserting, deleting or updating.
Entry:
id
representative_id
I would store them as sets, i.e. groups of entries (not the MySQL SET type, but the mathematical set; sorry if my English is wrong). Every set would have a representative entry, and all of the set's elements would be related to the representative element.
A new insert would insert the entry and set its representative to itself.
If the newly inserted entry should be connected to another entry, I simply set the representative id of the newly inserted entry to the referred entry's rep. id.
Attach B to A
It doesn't matter if I need to connect it to something that is not a representative entry. The result would be the same, because every entry in the set has the same rep. id.
Attach C to B
Detach B-C: The detached item would become a representative entry, meaning it would relate to itself.
Detach B-C and attach C to X
Deletion:
If I delete a non-representative entry, it is self-explanatory. But deleting a rep. entry is a bit harder. I need to choose a new rep. entry for the set and set every set member's rep. id to the new rep. entry's id.
So, delete A in this:
Would result in this:
What do you think about this? Is it a correct approach? Am I missing something? What should I improve?
Edit:
Querying:
So, if I want to query every entry that is related to a certain entry whose id I know:
SELECT *
FROM entries a
LEFT JOIN entries b ON (a.rep_id = b.rep_id)
WHERE a.id = :id
SELECT * FROM AlkReferencia
WHERE rep_id=(SELECT rep_id FROM AlkReferencia
WHERE id=:id);
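To make the scheme concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `entries` table with a `rep_id` column follows the queries above; the `name` column and the helpers `insert_entry`, `attach`, and `related` are hypothetical. As an assumption that goes slightly beyond the description, `attach` moves the whole set of the attached entry, not just the single entry, so that two existing sets can be merged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (id INTEGER PRIMARY KEY, name TEXT NOT NULL, rep_id INTEGER)"
)


def insert_entry(name):
    """Insert a new entry that is its own representative and return its id."""
    cur = conn.execute("INSERT INTO entries (name) VALUES (?)", (name,))
    conn.execute("UPDATE entries SET rep_id = id WHERE id = ?", (cur.lastrowid,))
    return cur.lastrowid


def rep_of(entry_id):
    """Return the representative id of the set that entry_id belongs to."""
    return conn.execute(
        "SELECT rep_id FROM entries WHERE id = ?", (entry_id,)
    ).fetchone()[0]


def attach(entry_id, target_id):
    """Move every member of entry_id's set into target_id's set."""
    conn.execute(
        "UPDATE entries SET rep_id = ? WHERE rep_id = ?",
        (rep_of(target_id), rep_of(entry_id)),
    )


def related(entry_id):
    """Return the names of all entries in the same set, including the entry itself."""
    rows = conn.execute(
        "SELECT name FROM entries "
        "WHERE rep_id = (SELECT rep_id FROM entries WHERE id = ?) ORDER BY id",
        (entry_id,),
    )
    return [row[0] for row in rows]


a, b, c, x = (insert_entry(name) for name in "ABCX")
attach(b, a)  # attach B to A
attach(c, b)  # attach C to B; C ends up with A as its representative
```

After these two calls, querying from C returns A, B and C, although the A-C relation was never stored explicitly, while X remains alone in its own set.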
About the application that requires this:
Basically, I am storing vehicle part numbers (references). One manufacturer can make multiple parts that can replace one another, and another manufacturer can make parts that replace other manufacturers' parts.
Reference: One manufacturer's OEM number for a certain product.
Cross-reference: A manufacturer can make products whose purpose is to replace a product from another manufacturer.
I must connect these references in such a way that when a customer searches for a number (no matter what kind of number he has), I can list the exact result and the alternative products.
To use the example above (last picture): B, D and E are different products we may have in store. Each one has a manufacturer and a string name/reference (I called it a number before, but it can be almost any character string). If I search for B's reference number, I should return B as the exact result and D and E as alternatives.
So far so good. BUT I need to upload these reference numbers. I can't just migrate them from an ALL-IN-ONE database. Most of the time, when I upload references I got from a manufacturer (usually entered manually, but I can use catalogs too), I only get a list where the manufacturer tells which other reference numbers point to his numbers.
Example.:
Asas, a filter manufacturer: the "AS 1" filter has these cross-references (meaning it replaces these):
GOLDEN SUPER --> 1
ALFA ROMEO --> 101000603000
ALFA ROMEO --> 105000603007
ALFA ROMEO --> 1050006040
RENAULT TRUCKS (RVI) --> 122577600
RENAULT TRUCKS (RVI) --> 1225961
ALFA ROMEO --> 131559401
FRAD --> 19.36.03/10
LANDINI --> 1896000
MASSEY FERGUSON --> 1851815M1
...
It would take ages to write all of the AS 1 references down, but there are many (~1500?). And that is ONE filter. There are more than 4000 filters and I need to store their references (and these are only the filters). I think you can see that I can't connect everything explicitly, but I must know that Alfa Romeo 101000603000 and 105000603007 are the same, even when I only know (AS 1 --> Alfa Romeo 101000603000) and (AS 1 --> Alfa Romeo 105000603007).
That is why I want to organize them as sets. Each set member would only connect to one other set member through a rep_id, which would point to the representative member. And when someone (e.g. an admin uploading these references) wants to attach a new reference to a set member, I simply INSERT INTO References (rep_id,attached_to_originally_id,refnumber) VALUES([rep_id of the entry I am trying to attach to],[id of the entry I am trying to attach to], "16548752324551..");
Another thing: I don't need to worry much about insert, delete and update speed, because these are admin tasks in our system and will be done rarely.
It is not clear what you are trying to do, and it is not clear that you understand how to think & design relationally. But you seem to want rows satisfying "[id] is a member of the set named by member [rep_id]".
Stop thinking in terms of representations and pointers. Just find fill-in-the-(named-)blank statements ("predicates") that say what you know about your application situations and that you can combine to ask about your application situations. Every statement gets a table ("relation"). The columns of the table are the names of the blanks. The rows of the table are the ones that make its statement true. A query has a statement built from its table's statements. The rows of its result are the ones that make its statement true. (When a query has JOIN of table names its statement ANDs the tables' statements. UNION ORs them. EXCEPT puts in AND NOT. WHERE ANDs a condition. Dropping a column by SELECT corresponds to logical EXISTS.)
Maybe your application situations are a bunch of cells with values and pointers. But I suspect that your cells and pointers and connections and attaching and inserting are just your way of explaining & justifying your table design. Your application seems to have something to do with sets or partitions. If you really are trying to represent relations then you should understand that a relational table represents (is) a relation. Regardless, you should determine what your table statements are. If you want design help or criticism tell us more about your application situations, not about representation of them. All relational representation is by tables of rows satisfying statements.
Do you really need to name sets by representative elements? If we don't care what the name is then we typically use a "surrogate" name that is chosen by the DBMS, typically via some integer auto-increment facility. A benefit of using such a membership-independent name for a set is that we don't have to rename, in particular by choosing an element.
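As an illustration of the surrogate-name idea, the sketch below gives each set an auto-generated id instead of a representative element. It uses Python's built-in `sqlite3` module; the table names `ref_set` and `reference`, the helper functions, and the spelling "ASAS" for the manufacturer are assumptions, while the part numbers are taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ref_set (set_id INTEGER PRIMARY KEY AUTOINCREMENT);
    CREATE TABLE reference (
        manufacturer TEXT    NOT NULL,
        refnumber    TEXT    NOT NULL,
        set_id       INTEGER NOT NULL REFERENCES ref_set(set_id),
        PRIMARY KEY (manufacturer, refnumber)
    );
""")


def set_of(manufacturer, refnumber):
    """Return the surrogate set id of a reference, or None if it is not stored yet."""
    row = conn.execute(
        "SELECT set_id FROM reference WHERE manufacturer = ? AND refnumber = ?",
        (manufacturer, refnumber),
    ).fetchone()
    return row[0] if row else None


def add_cross_reference(known, new):
    """Record that two (manufacturer, refnumber) pairs are interchangeable."""
    known_set, new_set = set_of(*known), set_of(*new)
    if known_set is None:
        # Let the DBMS choose the name of a fresh set.
        known_set = conn.execute("INSERT INTO ref_set DEFAULT VALUES").lastrowid
        conn.execute("INSERT INTO reference VALUES (?, ?, ?)", (*known, known_set))
    if new_set is None:
        conn.execute("INSERT INTO reference VALUES (?, ?, ?)", (*new, known_set))
    elif new_set != known_set:
        # Both references already belong to sets: merge them under one id.
        conn.execute("UPDATE reference SET set_id = ? WHERE set_id = ?", (known_set, new_set))
        conn.execute("DELETE FROM ref_set WHERE set_id = ?", (new_set,))


def alternatives(manufacturer, refnumber):
    """Return every other reference in the same set."""
    rows = conn.execute(
        "SELECT manufacturer, refnumber FROM reference "
        "WHERE set_id = ? AND NOT (manufacturer = ? AND refnumber = ?) "
        "ORDER BY manufacturer, refnumber",
        (set_of(manufacturer, refnumber), manufacturer, refnumber),
    )
    return rows.fetchall()


as1 = ("ASAS", "AS 1")
add_cross_reference(as1, ("ALFA ROMEO", "101000603000"))
add_cross_reference(as1, ("ALFA ROMEO", "105000603007"))
```

Because the set id does not depend on any member, deleting a reference never forces a new representative to be chosen; only a merge of two sets rewrites `set_id` values.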

fast query for too many tables in a database

For very large tables, indexing may help quite a lot. But what is the solution for too many small tables in a database?
What if I have a large DB that has too many tables in it? How can I make queries fast, the way indexes speed up queries on a single table?
Let's talk with a real example.
In stackoverflow.com, there is a table, say "questions", having id, date, votes. Then there exists a table for each id in the questions table (named after the numeric id, e.g. "q-45588"). Now it is easy to index the "questions" table, but what about the many child tables, one per question id (which may contain ids, answer 1, answer 2, answer 3, comment 1, comment 2, ..., votes, down votes, dates, flags, and so on)?
This is what happens in typical accounting software, i.e. a debtors account table holding the ids of all debtors, and a separate table for each of those ids (holding further details of the debtor).
Or is it a design problem?
*update* -----------------
Some people might say to do it all in 3 or 4 tables (which may have trillions of rows),
e.g. questions table, answers table, comments table, users table.
Here's an example of a modified stack:
Category of thread:-----info----
Question
Discussion
Category of Thread Response:----info-----
A Answer
c comment
Threads:----A table-----
Id (key)
Thread Id number (Long data type)
status (active, normal, closed (visible but not editable), deleted, flagged, etc.)
type (Ques / Dis)
votes Up
votes Down
count of views
tag 1
tag 2
tag 3
Subject
body
maker ID
date time stamp of creation
date time stamp of last activity
A Answer count
c comment count
Thread (the table name is the thread id number (long data type) from the Threads table):----A table-----
id (key)
response text
response type ( A Answer / c comment)
vote up
vote down
abuse count
Typically, indexes are meant to make searching faster by providing an ordered structure to search within. In a very small table, since searching should be fast to begin with, an index might not make much sense. Your best bet would be to try with and without indexes, and measure accordingly.
That being said, if your small tables have the exact same structure, it might make more sense (from an RDBMS point of view anyway) to merge them into a single entity.
What you have there is a design problem. Having multiple tables with the same columns should set off alarm bells immediately -- having multiple tables with the same unique key should as well.
In the example you give you should have a single child table.
Now, in some cases you might have a table with one or more distinct values that represent a large proportion of the table rows. For example, let's say that you have sales for 50 customers but one of them is responsible for 40% of the total sales records with the others distributed evenly between the other customers. Accessing the smaller customers' data through an index on customer_id makes sense, but it does not for the large customer. In that case you might look at partitioning the table to place the large customer's records in one child table and the other records in another, both being related to a master table http://www.postgresql.org/docs/9.2/static/ddl-partitioning.html .
However in general, and for your initial design, you should be using a single non-partitioned table for these child records.
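A single child table for the question example might look like the following sketch, run through Python's built-in `sqlite3` module. The table and column names (`responses`, `question_id`, and so on), the index name, and the sample rows are assumptions for illustration; only the question id 45588 comes from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE questions (id INTEGER PRIMARY KEY, subject TEXT NOT NULL);
    -- One child table for all questions, instead of one table per question id.
    CREATE TABLE responses (
        id            INTEGER PRIMARY KEY,
        question_id   INTEGER NOT NULL REFERENCES questions(id),
        response_type TEXT    NOT NULL CHECK (response_type IN ('A', 'c')),
        response_text TEXT    NOT NULL
    );
    -- The index plays the role that the per-question tables were meant to play.
    CREATE INDEX idx_responses_question ON responses (question_id);
    INSERT INTO questions VALUES (45588, 'first question'), (45589, 'second question');
    INSERT INTO responses (question_id, response_type, response_text) VALUES
        (45588, 'A', 'an answer'),
        (45588, 'c', 'a comment'),
        (45589, 'A', 'another answer');
""")


def responses_for(question_id):
    """Return (type, text) for every response that belongs to one question."""
    rows = conn.execute(
        "SELECT response_type, response_text FROM responses "
        "WHERE question_id = ? ORDER BY id",
        (question_id,),
    )
    return rows.fetchall()
```

Fetching the responses to one question is then a lookup on `question_id`, and the table name no longer has to be built from data.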
Maybe this document can help you:
http://dev.mysql.com/doc/refman/5.0/en/table-cache.html
Actually, MySQL and other RDBMSs are focused on handling big tables, not many tables, right? If you want to handle an extremely large number of tables, you should consider NoSQL solutions.

Should I reference a field id or have the actual data in the same row?

I have two tables, meetings_table and items_table.
items_table has three columns:
id
item
description
meetings_table has two columns:
first_party
second_party
I want to show the names of the two parties and the item they're meeting about. Should I add another column to meetings_table referencing the id of the item in items_table, then pull the item's name and description from there? Or should I add two columns to meetings_table, item and description, and just run one query that way? Which is better, in principle, for efficiency?
You should just reference the id of the item. This way, if you need to update anything about the item, you just do it in one location and it will propagate across all queries using that data. You almost never want to store data twice in a database. Every once in a while you will come across a situation where unnormalized data is acceptable, but this is the exception, not the rule.
The best thing to do is to have an id for each item, add that id to meetings_table as item_id, and query from there like
SELECT * FROM meetings_table a, items_table b WHERE a.item_id=b.id
This keeps items and meetings independent of one another, and also saves space.
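For a concrete picture, here is a small sketch using Python's built-in `sqlite3` module. It assumes the new column in `meetings_table` is called `item_id` and references `items_table.id`; the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items_table (id INTEGER PRIMARY KEY, item TEXT NOT NULL, description TEXT);
    CREATE TABLE meetings_table (
        first_party  TEXT    NOT NULL,
        second_party TEXT    NOT NULL,
        item_id      INTEGER NOT NULL REFERENCES items_table(id)
    );
    INSERT INTO items_table VALUES (1, 'bicycle', 'red road bike');
    INSERT INTO meetings_table VALUES ('Alice', 'Bob', 1), ('Carol', 'Dave', 1);
""")

# One query still returns the parties together with the item details.
QUERY = """
    SELECT m.first_party, m.second_party, i.item, i.description
    FROM meetings_table AS m
    JOIN items_table AS i ON i.id = m.item_id
    ORDER BY m.first_party
"""

# The item is changed in one place; every meeting that references it sees the change.
conn.execute("UPDATE items_table SET description = 'blue road bike' WHERE id = 1")
meetings = conn.execute(QUERY).fetchall()
```

Both meetings come back with the updated description, even though only one row was changed.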

Table join--multiple rows to/ from one column (/ cell)

I have searched for a solution for this problem, but haven't found it (yet), probably because I don't quite know how to explain it properly myself. If it is posted somewhere already, please let me know.
What I have is three databases that are related to each other: main, pieces & groups. Basically, the main database contains the most elementary/most used information from a post, and the pieces database contains data that is associated with that post. The groups database contains all of the (long) names of the groups a post in the main database can be 'posted in'. A post can be posted in multiple groups simultaneously. When a new post is added to my site, I check the pieces to see if there are any duplicates (i.e. check whether the post has been posted already). In order to make the search for duplicates more effective, I only check the pieces that are posted in the same group(s).
Hopefully you're still with me, because here's where it starts to get really confusing, I think (let me know if I need to specify things more clearly): right now, both the main and the pieces database contain the full name of the group(s) (basically I'm not using the groups database at all). What I want to do is replace the names of those groups with their associated IDs from the groups database. For example, I want to change this:
from:
MAIN_table:
id  |  group_posted_in
--------|---------------------------
1   | group_1, group_5
2   | group_15, group_75
3   | group_1, group_215
GROUPS_table:
id  |  group_name
--------|---------------------------
1   | group_1
2   | group_2
3   | group_3
etc...
into:
MAIN_table:
id  |  group_posted_in
--------|---------------------------
1   | 1,5
2   | 15,75
3   | 1,215
Or something similar to this. However, this format specifically causes issues, as the following query will return all of the rows (from the example) instead of just the one I need:
SELECT * FROM main_table WHERE group LIKE '%5%'
I either have to change the query to something like this:
...WHERE group = '5' OR group LIKE '5,%' OR group LIKE '%,5,%' OR group LIKE '%,5'
Or I have to change the database structure from Comma Separated Values to something like this: [15][75]. The accompanying query would be simpler, but it somehow seems like a cumbersome solution to me. Additionally, (simple) joins will not be easy/ possible at all. It will always require me to run a separate query to fetch the names of the groups--whether a user searches for posts in a specific group (in which case, I first have to run a query to fetch the id's, then to search for the associated posts), or whether it is to display them (first the posts, then another query to match the groups).
So, in conclusion: I suppose I know there is a solution to this problem, but my gut tells me that it is not the right/ best way to do it. So, I suppose the question that ties this post together is:
What is the correct method to connect the group database to the others?
For a many-to-many relationship, you need to create a joining table. Rather than storing a list of groups in a single column, you should split that column out into multiple rows in a separate table. This will allow you to perform set based functions on them and will significantly speed up the database, as well as making it more robust and error proof.
Main
MainID ...
Group
GroupID GroupName
GroupsInMain
GroupsInMainID MainID(FK) GroupID(FK)
So, for MainID 1, you would have GroupsInMain records:
1,1,1
2,1,5
This associates groups 1 and 5 with MainID 1
FK in this case means a Foreign Key (i.e. a reference to a primary key in another table). You'd probably also want to add a unique constraint to GroupsInMain on MainID and GroupID, since you'd never want the same values for the pairing to show up more than once.
Your query would then be:
select GroupsInMain.MainID, Group.GroupName
from Group, GroupsInMain
where Group.GroupID=GroupsInMain.GroupID
and Group.GroupID=5
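The joining-table layout can be tried out with the sketch below, which uses Python's built-in `sqlite3` module and the example data from the question. Two things are assumptions: the groups table is named `Groups` here because `GROUP` is a reserved word in SQL and would otherwise need quoting, and the `Title` column and the helper `posts_in_group` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Main (MainID INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE Groups (GroupID INTEGER PRIMARY KEY, GroupName TEXT NOT NULL);
    CREATE TABLE GroupsInMain (
        GroupsInMainID INTEGER PRIMARY KEY,
        MainID  INTEGER NOT NULL REFERENCES Main(MainID),
        GroupID INTEGER NOT NULL REFERENCES Groups(GroupID),
        UNIQUE (MainID, GroupID)  -- the same pairing may appear only once
    );
    INSERT INTO Main VALUES (1, 'post one'), (2, 'post two'), (3, 'post three');
    INSERT INTO Groups VALUES (1, 'group_1'), (5, 'group_5'), (15, 'group_15'),
                              (75, 'group_75'), (215, 'group_215');
    -- One row per (post, group) pair instead of a comma-separated list.
    INSERT INTO GroupsInMain (MainID, GroupID) VALUES
        (1, 1), (1, 5), (2, 15), (2, 75), (3, 1), (3, 215);
""")


def posts_in_group(group_id):
    """Return (MainID, GroupName) for every post in the given group."""
    rows = conn.execute(
        "SELECT gim.MainID, g.GroupName FROM GroupsInMain AS gim "
        "JOIN Groups AS g ON g.GroupID = gim.GroupID "
        "WHERE g.GroupID = ? ORDER BY gim.MainID",
        (group_id,),
    )
    return rows.fetchall()
```

Searching for group 5 now matches only the post that is really in group 5, without any string matching against 15, 75 or 215.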