Data Mart star schema development solution


I have to translate a database (DB) into a data mart (DM), but I have some doubts. This is the DB schema:
http://i.stack.imgur.com/PHha1.png
This is a simple DB that stores authors, books and various other things (the foreign keys of the author table are wrong, and the table "book" has another field called year). I should build a DM to analyse how authors work through the years (co-authors and books). I would also like to add a way to see the citations of an author. The DM I'm building is something like this:
http://i.stack.imgur.com/MPCTL.png
Now my doubt is: how could I add citations to this data mart?
PS: by citation I mean a book that cites an author. I'm working with Kettle and Pentaho.

Citations and book authorship have different granularities, so they belong in different fact tables.
How I would do it:
Citations fact table: the grain is 1 citation of 1 person in 1 book. The foreign keys point to the time dimension, the cited author dimension, the author dimension, the book dimension and whatever else you may need. This fact table directly gives you counts of citations of person X, broken down by time, book author, etc.
Authorship fact table: one may think that the grain is 1 book, but in fact it's not. The grain is 1 author of 1 book; that's the most atomic level of data. To get a book count, you have to decide how much a co-authored book counts for each author: as 1 book, as 0.5, as 1/N (where N is the number of co-authors), or any other useful metric. If you also want to count books, you should use the 1/N metric (its weights sum to 1 per book), together with any other you find useful.
Co-authorship relationships (trying to determine which authors publish the most together): this is trickier. Here the fact granularity is also authorship, but with 1 entry for each pair of co-authors. So, if a book is written by Albert, Bill and Charles, you'd get 1 entry with Albert as author and Bill as co-author, one with Albert as author and Charles as co-author, etc. (all 6 ordered combinations). This allows you to get a full list of authors and their co-authors and count how many times they appear together, but everything will be double counted: Albert+Bill and Bill+Albert both show up. The best way to filter out the duplicates is either to store pairs with the authors in alphabetical order, so that Albert+Bill, Albert+Charles and Bill+Charles are stored but not the others, or to remove the duplicates on the client side as a query post-processing step.
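To make the 1/N weighting and the alphabetical-order pair rule concrete, here is a minimal sketch in plain Python. The function names and the sample books are made up for illustration; in a real ETL flow (Kettle, for instance) the same logic would live in a transformation step rather than in a script.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical sample data: book title -> list of authors.
books = {
    "Book A": ["Charles", "Albert", "Bill"],
    "Book B": ["Bill", "Albert"],
}

def authorship_rows(books):
    """One fact row per (book, author), weighted 1/N so weights sum to 1 per book."""
    rows = []
    for title, authors in books.items():
        weight = Fraction(1, len(authors))
        for author in authors:
            rows.append((title, author, weight))
    return rows

def coauthor_pairs(books):
    """One fact row per pair of co-authors, stored in alphabetical order
    so that Albert+Bill is kept and Bill+Albert is never generated."""
    rows = []
    for title, authors in books.items():
        for first, second in combinations(sorted(authors), 2):
            rows.append((title, first, second))
    return rows

rows = authorship_rows(books)
pairs = coauthor_pairs(books)

# Summing the 1/N weights recovers the number of distinct books.
book_count = sum(weight for _, _, weight in rows)

# How many books Albert and Bill wrote together.
albert_bill = sum(1 for _, a, b in pairs if (a, b) == ("Albert", "Bill"))
```

Storing only the sorted pairs halves the size of the co-authorship fact table, at the cost that a query for "all co-authors of Bill" has to look at both the author and the co-author column.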
To combine metrics coming from multiple data marts, you should add a post-processing layer to your visualisation tool to cross-reference all these results.
Finally, one comment: this problem doesn't seem to be best treated with a data mart. Book metadata doesn't have a fixed schema, and a schemaless structure may be better for all those searches (look into Elasticsearch and MongoDB; they are perhaps better suited to this specific problem).

Related

Split similar data into two tables?

I have two sets of data that are near identical, one set for books, the other for movies.
So we have things such as:
Title
Price
Image
Release Date
Published
etc.
The only difference between the two sets of data is that Books have an ISBN field and Movies have a Budget field.
My question is: even though the data is similar, should both be combined into one table, or should they be two separate tables?
I've looked on SO at similar questions but am asking because most of the time my application will need to get a single list of both books and movies. It would be rare to get either books or movies. So I would need to lookup two tables for most queries if the data is split into two tables.

Doing this -- cataloging books and movies -- perfectly is the work of several lifetimes. Don't strive for perfection, because you'll likely never get there. Take a look at Worldcat.org for excellent cataloging examples. Just two:
https://www.worldcat.org/title/coco/oclc/1149151811
https://www.worldcat.org/title/designing-data-intensive-applications-the-big-ideas-behind-reliable-scalable-and-maintainable-systems/oclc/1042165662
My suggestion: add a table called metadata. Your titles table should have a one-to-many relationship with your metadata table.
Then, for example, titles might contain
title_id  title                                  price  release
103       Designing Data-Intensive Applications  34.96  2017
104       Coco                                   34.12  2017
Then metadata might contain
metadata_id  title_id  key             value
1            103       ISBN-13         978-1449373320
2            103       ISBN-10         1449373320
3            104       budget          USD175000000
4            104       EIDR            10.5240/EB14-C407-C74B-C870-B5B6-C
5            104       Sound Designer  Barney Jones
Then, if you want to get items with their ISBN-13 values (I'm not familiar with IBAN, but I guess that's the same sort of thing) you do this
-- The joined table is aliased as isbn13, so the ON clause must use that alias.
SELECT titles.*, isbn13.value AS isbn13
FROM titles
LEFT JOIN metadata AS isbn13 ON titles.title_id = isbn13.title_id
AND isbn13.key = 'ISBN-13'
This is a good way to go because it's future-proof. If somebody turns up tomorrow and wants, let's say, the name of the most important character in the book or movie, you can add it easily.
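As a rough, runnable illustration of the titles/metadata layout, the sketch below builds both tables in an in-memory SQLite database through Python's sqlite3 module (used here only as a stand-in for MySQL). The column names meta_key, meta_value and release_year are my own renames, chosen to stay clear of reserved words; they are not part of the suggestion above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE titles (
    title_id     INTEGER PRIMARY KEY,
    title        TEXT NOT NULL,
    price        REAL,
    release_year INTEGER
);
-- One row per extra attribute: one-to-many from titles to metadata.
CREATE TABLE metadata (
    metadata_id INTEGER PRIMARY KEY,
    title_id    INTEGER NOT NULL REFERENCES titles(title_id),
    meta_key    TEXT NOT NULL,
    meta_value  TEXT NOT NULL
);
INSERT INTO titles VALUES
    (103, 'Designing Data-Intensive Applications', 34.96, 2017),
    (104, 'Coco', 34.12, 2017);
INSERT INTO metadata VALUES
    (1, 103, 'ISBN-13', '978-1449373320'),
    (2, 103, 'ISBN-10', '1449373320'),
    (3, 104, 'budget',  'USD175000000');
""")

# LEFT JOIN keeps titles without an ISBN-13 (the movie) with a NULL value.
rows = conn.execute("""
    SELECT t.title, isbn13.meta_value
    FROM titles AS t
    LEFT JOIN metadata AS isbn13
           ON isbn13.title_id = t.title_id
          AND isbn13.meta_key = 'ISBN-13'
    ORDER BY t.title_id
""").fetchall()
```

Each additional attribute you want as a column costs one more LEFT JOIN on metadata, which is the price paid for being able to add new keys without altering the schema.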

The only difference between the two sets of data is that Books have an
IBAN field and Movies has a Budget field.
Are you sure that this difference that you have now will not be
extended to other differences that you may have to take into account
in the future?
Are you sure that you will not have to deal with any other type of
entities (other than books and movies) in the future which will
complicate things?
If the answer to both questions is "Yes", then you could use 1 table.
But if I had to design this, I would keep a separate table for each entity.
If needed, it's easy to combine their data in a View.
What is not easy is to add or modify columns in a table, or even to name them, just to match the requirements of 2 or more entities.
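A minimal sketch of the "separate tables plus a View" idea, run against an in-memory SQLite database from Python as a stand-in for whatever engine you use. The table, column and view names (books, movies, catalog, kind) and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE books  (id INTEGER PRIMARY KEY, title TEXT, price REAL, isbn TEXT);
CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT, price REAL, budget INTEGER);

INSERT INTO books  VALUES (1, 'Dune', 9.99, 'ISBN-PLACEHOLDER');
INSERT INTO movies VALUES (1, 'Coco', 14.99, 175000000);

-- The view exposes only the shared columns, plus a tag saying where a row came from.
CREATE VIEW catalog AS
    SELECT 'book'  AS kind, id, title, price FROM books
    UNION ALL
    SELECT 'movie' AS kind, id, title, price FROM movies;
""")

# The application reads the combined list from the view with a single query.
catalog = conn.execute(
    "SELECT kind, title, price FROM catalog ORDER BY title"
).fetchall()
```

The entity-specific columns (isbn, budget) stay in their own tables, so adding a third kind of item later only means one more table and one more branch in the view.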

You must be very sure about future requests and features for your application.
I can't imagine what kind of books linked with movies you store, since a lot of movies have different titles from the books they are based on. Example: 25 films that changed the name.
If you are sure that your data will be stable and always the same for books and movies, then you can create a new table, for example Productions, and store the attributes Title, Price, Image, Release Date and Published there. Then you can store foreign keys to the Production entity in your Books and Movies tables.
If something unexpected happens in the future, you will need to rebuild the structure or change your assumptions, but it will still be easier with the Production entity: you just create a new row with the modified values and assign it to the selected Book or Movie.
The solution with one table for both books and movies is the worst, because if one of the attributes diverges you will have to add a new row, and you will end up with data for a first set (real book and non-existent movie) and a second set (non-existent book and real movie).
Of course, all of this only matters if there may be changes in the future. If you are 100% sure there won't be, then 1 table is a sufficient solution, though not a correct one from the database normalization perspective.
I would personally create separate tables for books and movies.

Database structure for simple waiting times project with CSV data and MySql

Suppose I have some sample data like that shown below (with a lot more entries), and my main use case is to look up a specific ailment and provide a list of waiting times for the different hospitals which offer that treatment.
Not being very experienced with DB design, I don't know whether in this example there is an advantage to using separate tables with links between them, or whether a simple import of the CSV into a single table will suffice.
If I used separate tables, I'm guessing they would be for hospital and ailment, perhaps?
I would be very grateful if someone could tell me the best approach for this.
ID,Main Department,Specific Complaint,Hospital ,Waiting time
1,Cardiology,general,Hospital 1,7
2,Cardiology,general,Hospital 2,7
3,Cardiology,general,Hospital 3,7
4,Cardiology,general,Hospital 4,21
5,Cardiology,traumatology,Hospital 1,8
6,Cardiology,traumatology,Hospital 2,7
7,Dermatology,general,Hospital 1,21
8,Dermatology,general,Hospital 2,14
9,Dermatology,general,Hospital 3,21
10,Dermatology,erysipelas,Hospital 1,7
11,Dermatology,erysipelas,Hospital 3,7
...

One detail you must understand: SO is not a teaching site; tutorials abound for that. It is more for addressing specific problems that arise when developing solutions. That being said, I like this type of question, so here goes.
The type of solution to implement (simple CSV or complete database) depends on the volume of data and the type of reports you require.
CSV is quick to implement.
Database takes more time, but will allow you to produce more complex reports than CSV, through the use of queries.
CSV is often used as a medium to load or extract data, but as for queries it is not as powerful.
A database can be expanded. Ex. today you only consider the name of the hospital. You could expand your table to include the address, phone number, ... You could also expand your model to add insurance company links, doctors, ...
Basic modeling:
Identify your objects. Ex. here I would consider ailment, hospital, complaint.
Identify relations between objects, and their type. Ex. ailment and hospital are linked, and that link is n-n, meaning 1 ailment can be treated in many hospitals, and 1 hospital can treat many ailments.
I am not certain what to do with complaint. In your question you do not specify if all hospitals treat all (ailment - complaint) duos or not. More on that later.
As you define your structure, make sure you apply the normal forms. In most cases, forms 1-3 are enough.
1NF: atomic values and no repeating groups. Ex. you would not create a table with a hospital column and an ailments column holding a comma-separated list. 1 line == 1 hospital <-> 1 ailment.
2NF: 1NF is achieved and all the non-key attributes are dependent on the whole primary key. Ex. you should not create a table linking ailment and wait time. The wait time is not dependent on the ailment alone; it is dependent on the combination of ailment and hospital.
3NF: 2NF is achieved and there are no transitive functional dependencies. That is, if A is dependent on B and B is dependent on C, then A is transitively dependent on C, and that must not occur among non-key attributes.
Some critical questions must be answered before you can model your data:
A hospital can treat a certain ailment. In all cases?
Can you have: hospital 1 can treat ailment 1 when the complaint is A and B, but not C?
Ex. all hospitals can provide primary care for cardiac patients, but cardiac surgery can only be performed at some hospitals.
In that case, you cannot link ailment and hospital together directly. A combination of (ailment,complaint) can. And this will impact wait time.
Based on reality, I will link (ailment and complaint) and link this duo to hospital.
Here is my first model, "for fun", which might need to be modified for your needs:
Wait time is in table Hospital_Treads_Ailment_has_Complaint. In my model, a hospital can only estimate the wait time once it knows which ailment and which complaint the patient has.
A final exercise I do to test my model is try the main queries I need. If one query cannot be done with the model, it needs to be changed.
Which hospital treats cardiac problems? Ok, select hospital where ailment == cardiology, complaint == *.
Which hospital can accept patients who have trauma. Ok, select hospital where ailment == *, complaint == trauma.
and so on...
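The model and the test queries above can be tried out with a small sketch like the following. It uses Python's sqlite3 module with an in-memory database rather than MySQL, and the table and column names (ailment_complaint, hospital_treats, wait_time, the get_id helper) are my own guesses at a reasonable naming, not the names from the diagram. The sample rows are a few lines from the question's CSV.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE hospital  (hospital_id  INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE ailment   (ailment_id   INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE complaint (complaint_id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
-- Which (ailment, complaint) duos exist at all.
CREATE TABLE ailment_complaint (
    ailment_complaint_id INTEGER PRIMARY KEY,
    ailment_id   INTEGER NOT NULL REFERENCES ailment(ailment_id),
    complaint_id INTEGER NOT NULL REFERENCES complaint(complaint_id),
    UNIQUE (ailment_id, complaint_id)
);
-- Wait time depends on the hospital AND the (ailment, complaint) duo.
CREATE TABLE hospital_treats (
    hospital_id          INTEGER NOT NULL REFERENCES hospital(hospital_id),
    ailment_complaint_id INTEGER NOT NULL
        REFERENCES ailment_complaint(ailment_complaint_id),
    wait_time            INTEGER NOT NULL,
    PRIMARY KEY (hospital_id, ailment_complaint_id)
);
""")

sample = [  # (Main Department, Specific Complaint, Hospital, Waiting time)
    ("Cardiology", "general", "Hospital 1", 7),
    ("Cardiology", "general", "Hospital 4", 21),
    ("Cardiology", "traumatology", "Hospital 1", 8),
    ("Dermatology", "general", "Hospital 1", 21),
]

def get_id(table, name):
    """Insert the name if it is new, then return its id (table is a fixed internal name)."""
    conn.execute(f"INSERT OR IGNORE INTO {table} (name) VALUES (?)", (name,))
    return conn.execute(
        f"SELECT {table}_id FROM {table} WHERE name = ?", (name,)
    ).fetchone()[0]

for ailment, complaint, hospital, wait in sample:
    a, c, h = get_id("ailment", ailment), get_id("complaint", complaint), get_id("hospital", hospital)
    conn.execute(
        "INSERT OR IGNORE INTO ailment_complaint (ailment_id, complaint_id) VALUES (?, ?)",
        (a, c),
    )
    ac = conn.execute(
        "SELECT ailment_complaint_id FROM ailment_complaint"
        " WHERE ailment_id = ? AND complaint_id = ?",
        (a, c),
    ).fetchone()[0]
    conn.execute("INSERT INTO hospital_treats VALUES (?, ?, ?)", (h, ac, wait))

# "Which hospital treats cardiac problems?" (any complaint), shortest wait first.
cardiology = conn.execute("""
    SELECT h.name, c.name, ht.wait_time
    FROM hospital_treats AS ht
    JOIN hospital          AS h  ON h.hospital_id = ht.hospital_id
    JOIN ailment_complaint AS ac ON ac.ailment_complaint_id = ht.ailment_complaint_id
    JOIN ailment           AS a  ON a.ailment_id = ac.ailment_id
    JOIN complaint         AS c  ON c.complaint_id = ac.complaint_id
    WHERE a.name = 'Cardiology'
    ORDER BY ht.wait_time, h.name
""").fetchall()
```

Filtering on c.name instead of a.name gives the second test query (all hospitals that accept a given complaint, whatever the ailment).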

How to manage entity duplication in database table [closed]

I am working on a simple database design of an application.
I have Book, Author, Illustrator and Editor tables.
Modelling 1 Relation between
With this model, I think the name column is duplicated in each of the author, editor and illustrator tables.
What if a book's author, illustrator and editor are the same person? In that case, the data gets duplicated across 3 tables.
But searching will be faster, I guess, as the number of items per table will be smaller.
Modelling 2
With this modelling, all the author, illustrator and editor info gets saved in a single table, and I am confused about what the name of this table should be.
With this approach, the data won't get duplicated, but searching will take about twice as long compared to model 1.
Can anyone suggest which model I should choose? I feel modelling 2 is better.

It is purely up to your taste which model you should use. The second one has the advantage that you won't get duplicates. With both models you can get the results with one query:
select * from books
left join names auth ON (auth.id = author_id)
left join names ill ON (ill.id = illustrator_id)
left join names ed ON (ed.id = editor_id)
where books.id = 1;
SQLFiddle gives an example of model 2. If you want to obtain the data from model 1, just change the 3 joins to point to the right tables.
If you want to display a list of authors, I would not recommend adding it as a new field in the names table; just use a join query:
select auth.* from books
left join names auth ON (auth.id = author_id)
As long as you set the indexes on the id, author_id, illustrator_id and editor_id, you are fine.
Edit: my preference would go to model 2. I think it might also be a bit faster:
The database only needs to open one file (not 3)
There are fewer records in the table (compared to the 3 tables combined) because you don't have duplicates.
The database only needs to search through one index set (not 3) and might do some optimised stuff behind the scenes because it is looking for 3 keys in the same set (instead of 3 keys in 3 index sets). That's my gut feeling; I'm not sure it is exactly correct...

You can make one amendment to the 2nd design you have proposed: also keep a user type column, which describes whether the person is an author, an illustrator, an editor, or any combination of these. Treat each role as one bit and store the decimal value of the resulting bit pattern, so the value ranges from 0 to 7. For example, if a person is an Editor and an Author:
1(Editor) 0(Illustrator) 1(Author) => 5
Then, whenever you perform a select/search on that table, you can filter on the user type in the query.
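A small sketch of that bit-flag encoding, using Python's enum.IntFlag for the role values and an in-memory SQLite table for the filter. The Role class, the user_type column name and the two sample people are assumptions made for illustration; the bit positions follow the Editor/Illustrator/Author order of the example above.

```python
import sqlite3
from enum import IntFlag

class Role(IntFlag):
    AUTHOR = 1       # bit 0
    ILLUSTRATOR = 2  # bit 1
    EDITOR = 4       # bit 2

# Editor + Author -> binary 101 -> decimal 5
editor_and_author = Role.EDITOR | Role.AUTHOR
is_editor = bool(editor_and_author & Role.EDITOR)
is_illustrator = bool(editor_and_author & Role.ILLUSTRATOR)

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT, user_type INTEGER)"
)
conn.executemany(
    "INSERT INTO names (name, user_type) VALUES (?, ?)",
    [
        ("Alice", int(Role.EDITOR | Role.AUTHOR)),  # stored as 5
        ("Bob", int(Role.ILLUSTRATOR)),             # stored as 2
    ],
)

# A bitwise AND in the WHERE clause selects everyone who has the editor bit set.
editors = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM names WHERE (user_type & ?) != 0 ORDER BY name",
        (int(Role.EDITOR),),
    )
]
```

One trade-off worth knowing: a bitwise filter like this generally cannot use an ordinary index on user_type, so on a large table separate boolean columns or a role link table may search faster.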

Do you need to validate, for example, that the author is defined as author in "Author" before you link to a book as author?
Do you care to do a query to know who are all authors/editors/illustrators defined in your database?
You have created an N-N link between the entities; however, you have "authorId", "editorId" and "illustratorId" in the "Book" entity!
The proper way would be to resolve the many-to-many relationship with another table, ending up with something like this:
1. BOOK, has ID, TITLE, DESC, etc.
2. PARTICIPANT (suggested name for all people), has ID, NAME, BIO, etc.
3. AUTHOR, has BOOK_ID, PARTICIPANT_ID
4. EDITOR, has BOOK_ID, PARTICIPANT_ID
5. ILLUSTRATORS, has BOOK_ID, PARTICIPANT_ID
OR, instead of (3, 4, 5), BOOK_PARTICIPANT, has BOOK_ID, PARTICIPANT_ID, PARTICIPATION_TYPE (code for author, editor, illustrator), or even flags (IS_AUTHOR, IS_EDITOR, IS_ILLUSTRATOR, where at least one is required to be set)
If you need to validate the participant as an author, editor or illustrator before being able to link them to a book, you need to add three flags to PARTICIPANT: IS_AUTHOR, IS_EDITOR, IS_ILLUSTRATOR
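For the single link-table variant (BOOK_PARTICIPANT with a PARTICIPATION_TYPE code), a minimal sketch could look like this. It runs on an in-memory SQLite database through Python's sqlite3 module; the lower-case names, the text codes for the participation type and the sample rows are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE book        (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE participant (id INTEGER PRIMARY KEY, name  TEXT NOT NULL);

-- Resolves the many-to-many link; the same person may hold several roles on one book.
CREATE TABLE book_participant (
    book_id            INTEGER NOT NULL REFERENCES book(id),
    participant_id     INTEGER NOT NULL REFERENCES participant(id),
    participation_type TEXT NOT NULL
        CHECK (participation_type IN ('author', 'editor', 'illustrator')),
    PRIMARY KEY (book_id, participant_id, participation_type)
);

INSERT INTO book VALUES (1, 'Sample Book');
INSERT INTO participant VALUES (1, 'Ann'), (2, 'Ben');
INSERT INTO book_participant VALUES
    (1, 1, 'author'),
    (1, 1, 'illustrator'),
    (1, 2, 'editor');
""")

# Everyone credited on book 1, with their role; Ann appears once per role
# but is stored only once in participant.
credits = conn.execute("""
    SELECT p.name, bp.participation_type
    FROM book_participant AS bp
    JOIN participant AS p ON p.id = bp.participant_id
    WHERE bp.book_id = 1
    ORDER BY p.name, bp.participation_type
""").fetchall()
```

Listing all authors in the database is then a SELECT DISTINCT over book_participant filtered on participation_type = 'author', with no duplicated person data.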
Hope this helps

What would be the cardinality between Artist vs ArtWork vs Group?

You set up a database company, ArtBase, that builds a product for art galleries. The core of this product is a database with a schema
that captures all the information that galleries need to maintain.
Galleries keep information about artists, their names (which are
unique), birthplaces, age, and style of art.
For each piece of artwork, the artist, the year it was made, its
unique title, its type of art (e.g., painting, lithograph, sculpture,
photograph), and its price must be stored.
Pieces of artwork are also classified into groups of various kinds,
for example, portraits, still lifes, works by Picasso, or works of the
19th century; a given piece may belong to more than one group. Each
group is identified by a name (like those just given) that describes
the group.
Finally, galleries keep information about customers. For each
customer, galleries keep that person’s unique name, address, total
amount of dollars spent in the gallery (very important!), and the
artists and groups of art that the customer tends to like.
Draw the ER diagram for the database.
Is the following ERD correct?
Is it possible that a group has zero Artworks?
Is it possible that the Artist didn't produce any artwork but still sits in the database?

1) You used ID as a PK in Artist and Artwork. This is a good thing, as the use of a unique name (as requested in the business model) is wrong: after all, two pieces of art or two artists may bear the same name. However, you did respect the business model for the Customer entity, whose PK is Name.
You can choose to make a good ERD and use ID as a surrogate PK for Artwork, Artist, and Customer; or respect the business model you were given and use Name as a PK for these three entities. Personally, I'd go with the former.
The following two questions can't be answered given the business model only; the answers below reflect the cardinality in the specific ERD you designed.
2) Yes, because according to the ERD a Group includes from 0 to N Artworks;
3) Yes, because according to the ERD although an Artist makes from 1 to N Artworks (and therefore there wouldn't be the need to insert an Artist in the database if he didn't do any Artwork) there is still a relationship between Customer and Artist in the sense that a Customer likes from 1 to N Artists.
Therefore an Artist can be in the database even if he didn't produce any Artwork (yet), provided that he is liked by at least one Customer. If an Artist didn't do any Artwork and is not liked by any Customer, he won't be in the database.

Some context information is missing here, especially some cardinality information. Notice that you yourself are asking questions about the context:
Is it possible that a group has zero Artworks?
Is it possible that the Artist didn't produce any artwork but still
sits in the database?
This information should be given by you (or by the problem statement). If this is coursework, your instructor needs to explain the context better. If you are already working as a DBA or data modeler, please look for more information about this problem. The importance of context in developing an ER-Diagram can hardly be overstated. Keep this in mind: without a well-defined context, the problem (the situation) is uncertain, and information needed to faithfully reflect a real-world situation is missing. In short:
No complete context, no diagram (without a diagram, there is no system!).
I will make this diagram with you step by step, but I'll make some assumptions due to the lack of information (context) here. I will give my opinion on certain resources used in ER-Diagrams, but that does not mean that I'm saying you're a layman. I am just showing my thinking, which reflects how I learned this here in my country. I believe that you are as capable as I am, ok? Well, let's begin...
Entities in ER-Diagram are defined when we have attributes / properties. According to your description, we can see immediately 3 entities here:
Customers
Artists
Artworks
Relationships exist to express links between entities. The most obvious relationship here is between Artists and Artworks, don't you agree?
For each piece of artwork, the artist...
In accordance with the context revealed, all artwork has a unique artist (always), but it is uncertain if an artist always has one, multiple, or zero artworks. I SUPPOSE that an artist can have many or no artwork. That being said, we see that artists to artworks have a cardinality 0 to N, because, again, an artist may have made several or no artwork at all.
So far we have defined three entities, and linked two of them. Let's continue...
...its type of art (e.g., painting, lithograph, sculpture, photograph)...
If an artwork has only a single type of art, and an art type is defined only by its name, then we have here what is called a Functional Redundancy (translated from the Portuguese term "Redundância Funcional"). In short, Functional Redundancies are like relationships between two entities, and serve to save you the trouble of repeating the same field in multiple columns in a table (which would be susceptible to errors). In a Conceptual Model, they are represented as a field in an entity with the suffix "(R)" (without the double quotes).
If an entity has a field (column) like a Functional Redundancy, but with different values (multiple), then we have what is called Multivalued Field (also translated from the Portuguese term "Campo Multivalorado"). These are fields in entities that have the suffix "*" (also without the double-quotes).
This is not the case of the type of artwork, but it would until now for the groups of each artwork:
Pieces of artwork are also classified into groups of various kinds,
for example, portraits, still lifes, works by Picasso, or works of the
19th century; a given piece may belong to more than one group.
This would be true if groups only possess names, and no other entity relate to them. But then you said:
and groups of art that the customer tends to like.
This changes things a bit. Groups is no longer a Multivalued Field in the Artworks entity; it becomes an entity with two relationships, one to Customers and one to Artworks. The relationship between Groups and Customers reveals the customers' preferred art groups. The relationship between Groups and Artworks shows which art groups an artwork belongs to. Now let's talk about the cardinalities of these relationships.
...a given piece may belong to more than one group. [...]
...and groups of art that the customer tends to like. [...]
Concerning Groups and Artworks, the word "may" says a lot to me. It says that something may or may not happen. Still, it is uncertain whether an artwork can exist without at least one related group. Because of this, I assume a 1 to N relationship from Artworks to Groups.
Conversely, the opposite direction is not clear. I believe that there may be groups unrelated to any artwork, perhaps because they are new groups created at a given time. So I see a relationship of 0 to N from Groups to Artworks.
Let's talk about Groups and Customers. It seems to me that a customer likes at least one group of art. So I see a 1 to N relationship from Customers to Groups. On the opposite side, as already said, it would be possible to add new groups without automatically tying at least one customer to them. I think there may be new groups unrelated to customers. So guess what? We have a relationship of 0 to N from Groups to Customers.
So far we have identified another entity, a Functional Redundancy, and two relationships with their respective cardinalities. Let's keep going...
and the artists ... that the customer tends to like.
There is a close connection here between two entities, Customers, and Artists. This relationship tells us what artists the customers like. If a customer must like at least one artist, then we have a 1 to N relationship from Customers to Artists. If a customer may or may not like an artist, then we have a relationship 0 to N.
If an artist has zero or more customers who appreciate them, then we have a 0 to N relationship from Artists to Customers. If an artist has at least one customer who appreciates their work, then we have a 1 to N relationship from Artists to Customers.
Lastly...
Galleries keep information about artists, [...] and style of art.
If multiple artists can share a single same art style, then we have a Functional Redundancy here. If several artists have various art styles, then we have a Multivalued Field.
After much talk, I came up with an ER-Diagram presented by your context and assumptions made by me:
NOTE: The green points highlight major assumptions.
Is this right? Is this the correct diagram? The correct answer would be (from me to you):
I do not know...
Without a concrete context, we cannot finalize a diagram correctly. My tip is that you complete your context. Only then will you have a correct diagram.
Oh, one more thing. What would be this "money spent" attribute? If customers can buy artworks, it would represent a new relationship between Artworks and Customers. This relationship would represent the purchase of artworks from customers (called "ORDERS", for instance). If not so, skip this paragraph.
If I have forgotten something, please say so. If you have questions feel free to ask, I'm here to help you.

Access 2007 one-to-many relationship counting

My set-up: I have two tables, tblAuthors and tblBooks. tblAuthors includes a list of authors: Kurt Vonnegut, Frank Herbert, J. K. Rowling, John Nichols, etc. tblBooks includes a list of books: Slaughterhouse-Five, Cat's Cradle, Monkey House, Dune, Harry Potter, Milagro Beanfield War, etc.
There is a one-to-many relationship between tblBooks and tblAuthors; Authors in tblAuthors is used as the primary key for this relationship. tblAuthors has a Number of Books column, which tells the user of the table how many books by each author are included in the table. Currently the user (sadly me) must input this information by hand, updating it after every book is entered and given an author. Although this is not particularly difficult, because I can simply see how many books are related to the author in tblAuthors thanks to the relationship, it is sometimes difficult to remember to update it (not to mention a colossal pain in the butt).
I want Number of Books to update automatically as I add more books. If there is a code out there please let me know!!
I am not really familiar with VBA and could use an explanation that is geared towards someone who may not understand all of the facets of the code.
Thank you in advance for any help that you give me!

You might like to read about relational database design. It is not usual to store calculated fields, because the information can easily be obtained from a query.
SELECT AuthorID,
Count(BookID)
FROM Books
GROUP BY AuthorID
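To see the query-instead-of-stored-count idea end to end, here is a small sketch that runs on an in-memory SQLite database via Python rather than in Access. The AuthorID and BookID columns are assumptions (the question keys authors by name), and the LEFT JOIN is my addition so that authors with no books yet still show up with a count of 0. In Access you would save the equivalent SELECT as a query instead of keeping a Number of Books column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE tblAuthors (AuthorID INTEGER PRIMARY KEY, Author TEXT NOT NULL);
CREATE TABLE tblBooks (
    BookID   INTEGER PRIMARY KEY,
    Title    TEXT NOT NULL,
    AuthorID INTEGER NOT NULL REFERENCES tblAuthors(AuthorID)
);
INSERT INTO tblAuthors VALUES
    (1, 'Kurt Vonnegut'), (2, 'Frank Herbert'), (3, 'John Nichols');
INSERT INTO tblBooks VALUES
    (1, 'Slaughterhouse-Five', 1),
    (2, 'Cat''s Cradle', 1),
    (3, 'Dune', 2);
""")

# The count is computed when the query runs, so it is always up to date
# and never has to be maintained by hand.
counts = conn.execute("""
    SELECT a.Author, COUNT(b.BookID) AS NumberOfBooks
    FROM tblAuthors AS a
    LEFT JOIN tblBooks AS b ON b.AuthorID = a.AuthorID
    GROUP BY a.AuthorID, a.Author
    ORDER BY a.Author
""").fetchall()
```

Adding a book is then a single insert into tblBooks; the next run of the query picks up the new count without any VBA.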
