R novice here who needs help.
I have an exam for which we have merged two datasets and now we want to run a PanelMatch on the dataset to estimate the ATT for staggered DiDs.
Problem: There are duplicates in the data set.
This is because R uses the firm code and year to identify duplicates, but some firms hired two politicians in the same year, so there are two rows with the same firm code and year.
How can I tell R that I want to keep these rows, or that the politician's name should also be considered when checking for duplicates?
Thanks in advance!
Using a basic star schema, I have been told that a fact table would have at least as many rows as the product of the number of rows in each dimension.
For example, 3 products, 5 promotions, and 10 stores would mean that the fact table should have at least 150 records, regardless of whether or not a product actually had every promotion or exists in every store. Specifically, null values would exist where, for example, a product does not have a specific promotion.
Can someone please provide an academic source that supports this idea, or at least confirm it?
The reason why I am asking this is that my understanding tells me this would create a MASSIVE amount of useless data in the fact table.
Thanks!
Hi thanks for the replies. I consulted my lecturer and he actually found a page reference for me: "...Take a very simplistic example of 3 products, 5 customers, 30 days, and 10 sales representatives represented as row in the dimension tables. Even in this example, the number of fact table rows will be 4500, very large in comparison with the dimension table rows..." (Ponniah, P., 2009. Data warehousing: Fundamentals for IT professionals, 2nd Edition. John Wiley & Sons, Inc., New Jersey. p. 237)
However, the author goes on to say that: "We have said that a single row in the fact table relates to a particular product, a specific calendar date, a specific customer, and an individual sales representative. In other words, for a particular product, a specific calendar date, a specific customer, and an individual sales representative, there is a corresponding row in the fact table. What happens when the date represents a closed holiday and no orders are received and processed? The fact table rows for such dates will not have values for the measures. Also there could be other combinations of dimension table attributes, values for which the fact table rows will have null measures. Do we need to keep such rows with nulls measures in the fact table? There is no need for this. Therefore it is important to realize this type of sparse data and understand that the fact table could have gaps."
In short, you guys seem to be correct, thanks!
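To make the arithmetic in the quoted example concrete, here is a minimal sketch in Python. The dimension sizes come from the quote; the three sales rows are made up purely for illustration. It shows that 4500 is only the upper bound on fact rows, and that a sparse fact table stores just the combinations that actually occurred.

```python
from itertools import product
from math import prod

# Dimension sizes taken from the quoted example.
dimensions = {"product": 3, "customer": 5, "day": 30, "sales_rep": 10}

# Upper bound: one row for every possible combination of dimension keys.
max_rows = prod(dimensions.values())

# Hypothetical sales (made-up data): only combinations that actually
# happened get a row. Key = (product, customer, day, sales_rep).
fact_rows = {
    (1, 2, 5, 7): 120.0,
    (3, 1, 5, 2): 45.5,
    (2, 4, 12, 9): 300.0,
}

# Count the combinations that would be all-null rows in a dense fact table.
all_keys = product(*(range(1, n + 1) for n in dimensions.values()))
missing = sum(1 for key in all_keys if key not in fact_rows)

print(max_rows)        # 4500
print(len(fact_rows))  # 3
print(missing)         # 4497
```

The 4497 "missing" combinations are exactly the rows with null measures that the author says do not need to be kept.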
Of course not. I suggest you ask your source to clarify this claim; it sounds as if there is a misunderstanding somewhere here.
And what if you add a time dimension..?
Also, it is not even possible to have null values as keys where, e.g., promotions are missing, because the purpose of the key is to point to a dimension value, which a null value does not do.
The dimension values are there to support whatever facts you have, not the other way around.
This may relate to a specific kind of fact table: the pattern that Ralph Kimball terms a Periodic Snapshot Fact Table. That is where the fact table repeats an entire population of rows for each point in time. IMO the usefulness of that approach is extremely limited.
A Snapshot Fact Table does not implicitly require that the fact table is the product of its dimensions but it does pose the potential problem of what the correct population of each snapshot should be. The cross product of dimensions is one way to do it I suppose.
I'm building a small cinema booking system as a PHP web application.
The database has a Film and a Showing table (amongst others, but they are not important here).
A Showing has a date and a time, and each Showing consists of one Film.
A Film can have many Showings.
I'm trying to build a query that gets the film_name, showing_date and showing_time for every showing, but I want to group the results so that each film appears only once, as a film can have more than one showing on the same date.
I have this SQL:
SELECT f.film_name, s.showing_date, s.showing_time
FROM film f, showing s
WHERE f.film_id = s.film_id
GROUP BY s.film_id
However, it's not showing all the times for each film, just the first one. I guess there is a lot I'm missing, and maybe I should split the showing times into a separate table, but any help would be greatly appreciated. I will post more information and diagrams if necessary.
Thanks
Assuming you want one row per film, with all showings in the same row, try:
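SELECT f.film_name,
       -- one "date time" entry per showing, comma-separated and in chronological order
       GROUP_CONCAT(CONCAT(s.showing_date, ' ', s.showing_time) ORDER BY s.showing_date, s.showing_time) AS showings
FROM film f
JOIN showing s ON f.film_id = s.film_id
GROUP BY f.film_id, f.film_name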
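If you want to try the GROUP_CONCAT idea on sample data before touching the real database, here is a small sketch using Python's built-in sqlite3 module. Note the assumptions: the table layout and sample rows are made up, and SQLite's syntax differs slightly from MySQL's (string concatenation uses `||` instead of CONCAT, and the separator is the second argument of GROUP_CONCAT).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema and sample rows, loosely based on the question.
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, film_name TEXT);
    CREATE TABLE showing (
        showing_id INTEGER PRIMARY KEY,
        film_id INTEGER REFERENCES film(film_id),
        showing_date TEXT,
        showing_time TEXT
    );
    INSERT INTO film VALUES (1, 'Film A'), (2, 'Film B');
    INSERT INTO showing (film_id, showing_date, showing_time) VALUES
        (1, '2024-05-01', '18:00'),
        (1, '2024-05-01', '21:00'),
        (2, '2024-05-01', '19:30');
""")

# One row per film; all of its showings are collapsed into a single string.
rows = conn.execute("""
    SELECT f.film_name,
           GROUP_CONCAT(s.showing_date || ' ' || s.showing_time, ', ') AS showings
    FROM film f
    JOIN showing s ON f.film_id = s.film_id
    GROUP BY f.film_id, f.film_name
    ORDER BY f.film_name
""").fetchall()

# Split the string again and sort, since SQLite does not promise an order
# for the concatenated values.
showings_by_film = {name: sorted(showings.split(", ")) for name, showings in rows}
for name, showings in showings_by_film.items():
    print(name, showings)
```

Each film comes back exactly once, with every showing still visible in the second column.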
You cannot do what you are asking to do.
Each row in your result set can only show one film name and one show time. If film A is showing 5 times, then you can either get a result set of five rows, all listing film A with the different show times, or, if you group by film A, you will get only one row, and it will list a single show time (MySQL picks it arbitrarily; in practice it is usually the first one).
Based upon what you have told us, I believe what you are looking for is some way to condense each film into one row that still lists the showing dates and times properly. In order to do this, you will need to somehow collapse these rows into one row in a way that is not often used. Normally you would use some sort of function on these rows (SUM, COUNT, etc.) to give aggregate data. However, it sounds like you want to see the actual data.
To do this, there is a really helpful SO question here:
Concatenate many rows into a single text string?
The second-highest rated response talks about using XML PATH, which would probably be the cleanest way of doing it if your database supports that feature. If not, look at the accepted answer (COALESCE). I would suggest putting this type of code into a scalar function that returned one field with comma-separated showtimes for you. Then you could list a film and have a list of showtimes next to the film.
Sorry for the confusion and for possibly wasting your time. I think I have found the solution by splitting the showing times into a separate table.
I find all of the films being shown on a certain date, then loop through and select all the showing times for those films based on the showing id returned from the first query, as there will only be one showing of a film per day. I add this information to the first result on each loop cycle and pass the whole data back.
There are probably better ways of doing it, but this will do for now.
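For what it's worth, the same "films for a date, each with its times" result can also be built from a single joined query and grouped in application code, which avoids one extra query per film. Below is a rough sketch in Python with sqlite3 rather than PHP; it keeps the original Film/Showing layout instead of the split-out times table, and the function name `showings_on` and the sample rows are made up for illustration.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema and sample rows.
    CREATE TABLE film (film_id INTEGER PRIMARY KEY, film_name TEXT);
    CREATE TABLE showing (
        showing_id INTEGER PRIMARY KEY,
        film_id INTEGER REFERENCES film(film_id),
        showing_date TEXT,
        showing_time TEXT
    );
    INSERT INTO film VALUES (1, 'Film A'), (2, 'Film B');
    INSERT INTO showing (film_id, showing_date, showing_time) VALUES
        (1, '2024-05-01', '18:00'),
        (1, '2024-05-01', '21:00'),
        (2, '2024-05-01', '19:30'),
        (2, '2024-05-02', '20:00');
""")


def showings_on(conn, date):
    """Return {film_name: [showing_time, ...]} for one date, using one query."""
    query = """
        SELECT f.film_name, s.showing_time
        FROM film f
        JOIN showing s ON f.film_id = s.film_id
        WHERE s.showing_date = ?
        ORDER BY f.film_name, s.showing_time
    """
    grouped = defaultdict(list)
    for film_name, showing_time in conn.execute(query, (date,)):
        grouped[film_name].append(showing_time)
    return dict(grouped)


programme = showings_on(conn, "2024-05-01")
print(programme)
```

The grouping happens in the loop over the result set, so the database is only asked once per date.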
Thanks
MODIFIED TO ADD INFORMATION:
I realize that there have been many "get two highest" or "get second-highest" SQL questions and answers posted, so I apologize in advance if this question is redundant, but I want to do something a bit different from the other situations, and I need some help getting from A to B. I am a MySQL hobbyist at best, so I'm sure the answer is obvious to some of you.
I have a bunch of rows of baseball player single-season statistics. I want to compare their season with the highest value with their season with the second highest value. I also want to be able to compare the two seasons by subtracting the second-highest from the highest.
I can easily get the highest value using MAX, of course, but this is a bit more difficult for a novice like myself.
Thanks for your help so far.
I will simplify the table structure to the relevant columns:
playerid, Year, Value
Each Player-season is separated by year.
What I want returned from my query is:
Player id,
Year [of Highest Value],
Value [Highest],
Year [of Second Highest Value],
Value [Second-Highest]
I hope that is simple enough and clear. Thanks for any help.
Without knowing your table structure, you could essentially do:
SELECT score
FROM statstable
WHERE playerID=???
ORDER BY score DESC
LIMIT 2
which would retrieve the two rows with the highest scores, from which you can then pull out the scores and subtract them in your client.
If you need this highest-minus-next-highest value for use in another query, then it gets a bit more complicated.
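As a rough illustration of the "sort in SQL, subtract in the client" approach, extended to all players at once, here is a sketch in Python with sqlite3. The table name `stats` and the sample values are made up; the column names playerid, Year and Value follow the simplified structure from the question.

```python
import sqlite3
from itertools import groupby

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table name and made-up sample values.
    CREATE TABLE stats (playerid TEXT, Year INTEGER, Value INTEGER);
    INSERT INTO stats VALUES
        ('p1', 2001, 30), ('p1', 2002, 45), ('p1', 2003, 41),
        ('p2', 2001, 12), ('p2', 2002, 9);
""")

# Sort so that each player's best seasons come first within that player.
ordered = conn.execute(
    "SELECT playerid, Year, Value FROM stats ORDER BY playerid, Value DESC, Year"
).fetchall()

top_two = []
for playerid, seasons in groupby(ordered, key=lambda row: row[0]):
    best_two = list(seasons)[:2]
    if len(best_two) < 2:
        continue  # a player with only one season has no second-highest value
    (_, year1, value1), (_, year2, value2) = best_two
    # (player, best year, best value, second year, second value, difference)
    top_two.append((playerid, year1, value1, year2, value2, value1 - value2))

for row in top_two:
    print(row)
# ('p1', 2002, 45, 2003, 41, 4)
# ('p2', 2001, 12, 2002, 9, 3)
```

Ties on Value are broken by the earlier year here; that is an arbitrary choice for the sketch, not something the question specifies.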
My question is more about understanding what I need to do and how I can get it done. Here's the thing:
I got a job to build an application for a school to manage student bio data, work out and handle student information, and do basic finance management.
Based on requirements I got from meetings with my client, I have an ERD of a proposed MySQL database with 23 different tables. The one part I would like to understand quickly is displaying data based on school terms. There are 3 terms in a year, each with its own summaries at the end of the term. At the end of 3 terms, a year has gone by and a student is promoted or demoted.
So my question is, how can I render my data to show 3 different terms, and also create a new year while working out whether to promote a student or make the student repeat the class it's in?
23 different tables? I'd like to see that model.
I don't think you should have one table per term. You'll have to keep adding tables every term, every year.
Sounds like a transcript table should have term and year columns that are incremented or decremented as a student progresses through. It should also have a foreign key relationship with its student: it's a 1:1 between a student and their transcript.
I would have a separate transcript table because I'd prefer keeping it separate from basic personal information about a student. A transcript would refer to the courses taken each term, the grade received for each, and calculate overall progress. If I queried for the transcript for an individual student, I should be able to see every year, every term, every course, every grade in reverse chronological order.
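Roughly what I have in mind, as a sketch only: one transcript table with year and term columns instead of a table per term. All table and column names and the sample grades below are my own invention, and I'm using Python's sqlite3 here just so it can be run as-is; the same DDL idea carries over to MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: names are illustrative, not from the real ERD.
    CREATE TABLE student (student_id INTEGER PRIMARY KEY, full_name TEXT);
    CREATE TABLE transcript_entry (
        student_id  INTEGER NOT NULL REFERENCES student(student_id),
        school_year INTEGER NOT NULL,
        term        INTEGER NOT NULL CHECK (term BETWEEN 1 AND 3),
        course      TEXT NOT NULL,
        grade       REAL,
        PRIMARY KEY (student_id, school_year, term, course)
    );
    INSERT INTO student VALUES (1, 'Example Student');
    INSERT INTO transcript_entry VALUES
        (1, 2023, 1, 'Maths', 71.0),
        (1, 2023, 2, 'Maths', 64.5),
        (1, 2023, 3, 'Maths', 80.0),
        (1, 2024, 1, 'Maths', 77.0);
""")

# Full history for one student, most recent year and term first.
history = conn.execute("""
    SELECT school_year, term, course, grade
    FROM transcript_entry
    WHERE student_id = ?
    ORDER BY school_year DESC, term DESC, course
""", (1,)).fetchall()

for entry in history:
    print(entry)
```

A new term or year is then just new rows with different term and school_year values; no new tables are needed.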
I'm having an issue with a certain requirement of one of my homework assignments. I am required to take a list of students and print out all of the students with credit hours of 12 or more. The credit hours are stored in a separate table, and referenced through a third table.
Basically, there is a students table, a classes table with hours, and an enrolled table matching student ID to course ID.
I used a SUM aggregate grouped by first name from the tables and that all works great, but I don't quite understand how to filter out the people with fewer than 12 hours, since the SQL doesn't know how many hours each person is taking until it's done with the query.
My query string looks like this:
'SELECT Students.Fname, SUM(Classes.Crhrs) AS Credits
FROM Students, Classes, Enrolled
WHERE Students.ID = Enrolled.StudentID AND Classes.ID = Enrolled.CourseID
GROUP BY Students.Fname;'
It works fine and shows the grid in the Delphi Project, but I don't know where to go from here to filter the results, since each query run deletes the previous.
Since it's a homework exercise, I'm going to give a very short answer: look up the documentation for HAVING.
Besides getting the desired result directly from SQL as Martijn suggested, Delphi datasets also have ways to filter data on the "client side". Check the Filter property and the OnFilterRecord event.
Anyway, remember it is usually better to apply the best "filter" on the database side using the proper SQL, and then use client side "filters" only to allow for different views on an already obtained data set, without re-querying the same data, thus saving some database resources and bandwidth (as long as the data on the server didn't change meanwhile...)
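To illustrate the difference between the two kinds of filtering, here is a small sketch in Python with sqlite3 (not Delphi). The table and column names follow the question; the sample rows are made up. The database-side variant uses HAVING, which filters groups after the aggregate has been computed; the client-side variant fetches all totals once and filters them locally.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Table and column names follow the question; the rows are made up.
    CREATE TABLE Students (ID INTEGER PRIMARY KEY, Fname TEXT);
    CREATE TABLE Classes (ID INTEGER PRIMARY KEY, Crhrs INTEGER);
    CREATE TABLE Enrolled (StudentID INTEGER, CourseID INTEGER);
    INSERT INTO Students VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO Classes VALUES (10, 4), (11, 4), (12, 5), (13, 3);
    INSERT INTO Enrolled VALUES (1, 10), (1, 11), (1, 12), (2, 12), (2, 13);
""")

base_query = """
    SELECT Students.Fname, SUM(Classes.Crhrs) AS Credits
    FROM Students
    JOIN Enrolled ON Students.ID = Enrolled.StudentID
    JOIN Classes ON Classes.ID = Enrolled.CourseID
    GROUP BY Students.ID, Students.Fname
"""

# Database side: HAVING is applied to each group after SUM is computed.
server_side = conn.execute(base_query + " HAVING SUM(Classes.Crhrs) >= 12").fetchall()

# Client side: fetch every total once, then build different views locally
# without re-querying the database.
all_totals = conn.execute(base_query).fetchall()
client_side = [row for row in all_totals if row[1] >= 12]

print(server_side)  # [('Ann', 13)]
print(client_side)  # [('Ann', 13)]
```

Grouping by the student ID as well as the first name is a deliberate change in this sketch, so that two students who share a first name are not added together.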