Understanding OUTER JOIN for DateTimes, Tableau - csv

I am trying to understand the uses and limitations of outer joins in Tableau (Tableau Online in this case). The behaviour of Tableau has not been what I expected.
I have provided as detailed a description of my problem below as I can, to avoid any ambiguity and because I no longer know where to start. I hope I have not gone overboard (edits welcome).
Specifics of my use case
I am creating a join between two .csv files that have logged natural data at specific time intervals. One set is logged at hourly intervals, the other at minute-level intervals (which vary due to various factors).
'Rain' data set(1):
Date and Time | Rain
01/01/2018 00:00 | 0
01/01/2018 01:00 | 0.4
01/01/2018 02:00 | 1.4
01/01/2018 03:00 | 0.4
'Fill' data set (2):
Date and Time | Fill
24/04/2018 06:04 | 78
24/04/2018 12:44 | 104
24/04/2018 18:51 | 96
25/04/2018 00:20 | 84
Unsurprisingly, I have many nulls in the data (which is not a problem to me) as:
'Rain' has a longer time series
In either data set, the majority of date times do not have an exact equivalent in the other
[screenshot of the data join]
What I am trying to achieve
I am trying to graph the two data sets in such a way that I can compare the full data sets against each other, in all of the following ways:
Monthly or Yearly aggregation (average)
Hourly aggregation (average)
Exact times
Problems (and my limited assumptions)
Once graphed in Tableau, some values had 'null' DateTime values*.
Once graphed in Tableau, it appears as if many points are simply missing**
Graphing using 'Fill' time series
Graphing using 'Rain' time series
I had assumed (given the full outer join on 'Date and Time') that Tableau would join the data sets in chronological order with a common date-time series.
* I had assumed it was impossible for the join conditions to have 'null' values without throwing an error. Also, the data is clean and uniform.
** And this is when aggregating monthly, which I assumed would not be affected by any hourly/minute mismatches.
So, finally, the question
In my reading of the online help documentation I am struggling to find functionality native to Tableau that can help me achieve these specific goals. I am reaching the worrying conclusion that Tableau was not built for this type of 'visual analytics'.
Is there functionality native to Tableau that will allow me to combine the data in the way I described above?
Approaches I have considered
Since I have two .csv files, I could combine both sets so that I have the full, granular 'Date and Time' fields in one tall list.
However, I would like to find a method that is natural to Tableau (Online) because, in future, at least some of the data will come from a database (Postgres) connection, while other data will likely have to remain as uploaded .csv or Excel files.
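For illustration, once the data does sit in Postgres the 'one tall list' idea could be sketched like this (table and column names are assumptions, not my actual schema):
-- Stack both sources into one tall table with a single timestamp column,
-- which can then be aggregated at any level (month, hour, exact time).
SELECT date_and_time, 'Rain' AS measure, rain AS value FROM rain
UNION ALL
SELECT date_and_time, 'Fill' AS measure, fill AS value FROM fill;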
Again I ask
What am I overlooking with regard to how (and why) to use Tableau?
I am not looking for a complete solution, but what tools could I use to achieve this?
Many thanks for any help

Your databases, or more specifically your data sources, are at different levels of granularity: one is at hour level (a higher level of aggregation) and the other is at minute level (a lower level of aggregation), but your requirements span several levels:
Year/Month -- high aggregation
Hourly -- medium aggregation
Exact -- low aggregation
When you join two data sources on dates and times (which will almost never match exactly), you will get these kinds of weird results.
Possible Solution:
There is a Tableau Prep tool; use it to bring both data sources to the same level of aggregation. In your case, data set 2 would be aggregated to hour level and the two tables then joined. You would need to reconsider your last requirement (exact times), as I assume you want those charts at minute level.
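For reference, the same pre-aggregation can be sketched in SQL once the data is in a database (Postgres syntax; table and column names are assumptions):
-- Average the minute-level fill readings up to hour level, then outer join
-- them to the hourly rain readings on the shared hourly timestamp.
SELECT COALESCE(r.date_and_time, f.hour) AS date_and_time,
       r.rain,
       f.avg_fill
FROM rain r
FULL OUTER JOIN (
    SELECT date_trunc('hour', date_and_time) AS hour,
           AVG(fill) AS avg_fill
    FROM fill
    GROUP BY 1
) f ON f.hour = r.date_and_time;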
Another solution is to use blending, where the primary data source is data set 1 and the secondary data source is data set 2. In that case Tableau manages the aggregation and granularity, and you will get the required data.
Let me know how it goes

So it appears as if various solutions are available.
I want to post this now but will re-edit when I get a bit more time
Option 1
One work-around/solution I found was to create a calculated field, as mentioned here, and then graph everything against this time series.
This worked well for me even after having created 20+ sheets and numerous dashboards.
As mentioned below, other use cases may not offer this flexibility.
Calculation:
IFNULL([Date and Time (Fill.csv)], [Date and Time (Rain.csv)])
Option 2
As mentioned by matt_black, a join of the data performs the job quite well. It seems less hacky and is perfect when starting from a clean slate.
I had difficulty creating a join on data sources already in use (I will do more poking around on this).
Option 3 ?
As in the answer provided by Siva, blending may be an option.
I have not confirmed this as of yet.

Related

AnyLogic: How to create an objective function using the values of two datasets (for an optimization experiment)?

In my AnyLogic model I have a population of agents (4 terminals) where trucks arrive, are served and depart. The terminals have two parameters (numberOfGates and servicetime) which influence the number of truck departures per hour from the terminals. Now I want to tune these two parameters so that the number of departures per hour is closest to reality (I know the actual departures per hour). I already have two datasets within each terminal agent: one with the number of departures per hour that I simulate, and one with the observedDepartures from the data.
I already compare these two datasets in plots for every terminal:
Now I want to create an optimization experiment to tune the numberOfGates and servicetime of the terminals so that the departures dataset is as close as possible to the observedDepartures dataset. Does anyone know the easiest way to create an objective function for this optimization experiment?
When I add a variable diff that is updated every hour by abs(departures - observedDepartures) and put root.diff in the optimization experiment, it gives me the error 'eq(null) is not allowed. Use isNull() instead' in a line that reads the database for the observedDepartures (see last picture). It works when I run the simulation normally; it only gives this error when running the optimization experiment (I don't know why).
You can use the sum of the absolute values of the differences for each replication. That is, create a variable that logs the |difference| for each hour; call it diff. Then, in the optimization experiment, minimize the sum of that variable. In fact this is close to a typical regression model's objective, where a more complex objective function is used: minimizing the sum of the squares of the differences.
A Calibration experiment already does (in a more mathematically correct way) what you are trying to do, using the in-built difference function to calculate the 'area between two curves' (which is what the optimisation is trying to minimise). You don't need to calculate differences or anything yourself. (There are two variants of the function to compare either two Data Sets (your case) or a Data Set and a Table Function (useful if your empirical data is not at the same time points as your synthetic simulated data).)
In your case it (the objective function) will need to be a sum of the differences between the empirical and simulated datasets for the 4 terminals (or possibly a weighted sum if the fit for some terminals is considered more important than for others).
So your objective is something like
difference(root.terminals(0).departures, root.terminals(0).observedDepartures)
+ difference(root.terminals(1).departures, root.terminals(1).observedDepartures)
+ difference(root.terminals(2).departures, root.terminals(2).observedDepartures)
+ difference(root.terminals(3).departures, root.terminals(3).observedDepartures)
(It would be better to calculate this for an arbitrary population of terminals in a function but this is the 'raw shape' of the code.)
A Calibration experiment is actually just a wizard which creates an Optimization experiment set up in a particular way (with a UI and all settings/code already created for you), so you can just use that objective in your existing Optimization experiment (but it won't have a built-in useful UI like a Calibration experiment). This also means you can still set this up in the Personal Learning Edition too (which doesn't have the Calibration experiment).

Would it be faster to make a Python script to aggregate a table, or would a built-in SQL aggregate combined with polling be faster?

Currently, I have a little problem where I'm expected to build a table that shows the energy generated on the respective days.
I have solved this problem using Python with SQL data polling, combined with a for loop that looks at the energy generated at the beginning of the day and at the end of the day; the difference between the two is the total energy generated for that particular day. But unfortunately, due to the amount of data coming out of the SQL database, the Python function is too slow.
I was wondering if this can be done within an SQL query that just spits out a table after it has done the aggregation. I have shown an example below for a better understanding of the tables.
SQL TABLE
date/time | value
24/01/2022 2:00 | 2001
24/01/2022 4:00 | 2094
24/01/2022 14:00 | 3024
24/01/2022 17:00 | 4056
25/01/2022 2:00 | 4056
25/01/2022 4:00 | 4392
25/01/2022 17:00 | 5219
Final Table
From the above table, we can work out that the energy generated for 24/01/2022 is 4056 (max) - 2001 (min) = 2055.
date | value
24/01/2022 | 2055
25/01/2022 | 1163
Usually, the time spent sending more stuff across the network makes the app-solution slower.
The GROUP BY may cost an extra sort, or it may be "free" if the data is sorted that way. (OK, you say unindexed.)
Show us the query and SHOW CREATE TABLE; we can help with indexing.
Generally, there is much less coding for the user if the work is done in SQL.
MySQL, in particular, picks between
Case 1: Sort the data O(N*log N), then make a linear pass through the data; this may or may not involve I/O which would add overhead
Case 2: Build a lookup table in RAM for collecting the grouped info, then making a linear pass over the data (no index needed); but then you need something like O(N*log n) for counting/summing/whatever the grouped value.
Notes:
I used N for the number of rows in the table and n for the number of rows in the output.
I do not know the conditions that would cause the Optimizer to pick one method versus the other.
If you drag all the data into the client, you would probably pick one of those algorithms. If you happen to know that you are grouping on a simple integer, the lookup (for the second algorithm) could be a simple array lookup -- O(N). But, as I said, the network cost is likely to kill the performance.
It is simple enough to write in SQL:
SELECT DATE(`date`) AS `day`,
       MAX(value) - MIN(value) AS `range`
FROM tbl
GROUP BY DATE(`date`);
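As a starting point for the indexing discussion (using the assumed table and column names from the query above), a covering index on the two columns involved is often worth testing:
ALTER TABLE tbl ADD INDEX idx_date_value (`date`, value);
-- A covering index on (`date`, value) lets MySQL read just those two columns
-- from the index instead of the full rows; check the chosen plan with EXPLAIN.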

Store overlapping date ranges to be filtered using a custom date range

I need help regarding how to structure overlapping date ranges in my data warehouse. My objective is to model the data in a way that allows date-level filtering on the reports.
I have dimensions — DimEmployee, DimDate and a fact called FactAttendance. The records in this fact are stored as follows —
To represent this graphically —
A report needs to be created from this data that allows the end user to filter it by selecting a date range. Let's assume the user selects the date range D1 to D20. On making this selection, the user should see how many days at least one of the employees was on leave. In this particular example, that is the sum of the light-blue segments at the bottom, i.e. 11 days.
An approach I am considering is to store one row per employee per date for each of the leaves. The only problem with this approach is that it will greatly increase the number of records in the fact table. Besides, there are other columns in the fact that will hold redundant data.
How are such overlapping date/time problems usually handled in a warehouse? Is there a better way that does not involve inserting numerous rows?
Consider modelling your fact like this:
fact_attendance (date_id,employee_id,hours,...)
This will enable you to answer your original question by simply filtering on the Date dimension, but you will also be able to handle issues like leave credits, and fractional day leave usage.
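As a rough illustration of that filtering (a sketch only; the leave_hours and full_date columns, and the dim_date table name, are assumptions, not part of the proposed model):
-- Assumes one row per employee per date in the fact, a leave_hours column
-- that is > 0 on leave days, and a full_date column in the date dimension.
SELECT COUNT(DISTINCT f.date_id) AS days_with_at_least_one_leave
FROM fact_attendance f
JOIN dim_date d ON d.date_id = f.date_id
WHERE f.leave_hours > 0
  AND d.full_date BETWEEN '2024-01-01' AND '2024-01-20';  -- the user's D1..D20 selection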
Yes, it might use a little more storage than your first proposal, but it is a better dimensional representation, and will satisfy more (potential) requirements.
If you are really worried about storage - probably not a real worry - use a DBMS with columnar compression, and you'll see large savings in disk.
The reason I say "not a real worry" about storage is that your savings are meaningless in today's world of storage. 1,000 employees with 20 days leave each per year, over five years would mean a total of 100,000 rows. Your DBMS would probably execute the entire star join in RAM. Even one million employees would require less than one terabyte before compression.

Databases that Handle Many Columns

I'm considering converting some Excel files I regularly update into a database. The files have a large number of columns. Unfortunately, many of the databases I am looking at, such as Access and PostgreSQL, have very low column limits. MySQL's is higher, but I'm worried that as my dataset expands I might break that limit as well.
Basically, I'm wondering what (open source) databases are effective at dealing with this type of problem.
For a description of the data: I have a number of Excel files (fewer than 10), each containing a particular piece of information on some firms over time. The files total about 100 MB. The firms are in the columns (about 3,500 currently) and the dates are in the rows (about 270 currently, but switching to a higher frequency for some of the files could easily cause this to balloon).
The most important queries will likely be to get the data for each of the firms on a particular date and put it in a matrix. However, I may also run queries to get all the data for a particular firm for a particular piece of data over every date.
Changing the dates to a higher frequency is also the reason I'm not really interested in transposing the data (the 270 columns already beat Access's limit, and increasing the frequency would far exceed MySQL's column limits). Another alternative might be to give each firm its own Excel file (that way I limit the columns to fewer than 10), but that is quite unwieldy for the purposes of updating the data.
This seems to be begging to be split up!
How about using a schema like:
Firms
  id
  name
Dates
  id
  date
Data_Points
  id
  firm_id
  date_id
  value
This sort of de-composed schema will make reporting quite a bit easier.
For reporting, you can easily get a stream of all values with a query like:
SELECT firms.name, dates.date, data_points.value
FROM data_points
LEFT JOIN firms ON firms.id = data_points.firm_id
LEFT JOIN dates ON dates.id = data_points.date_id
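The two access patterns from the question could then be sketched like this (the specific date and firm name are placeholders):
-- All values for one date (one "row" of the original matrix):
SELECT firms.name, data_points.value
FROM data_points
JOIN firms ON firms.id = data_points.firm_id
JOIN dates ON dates.id = data_points.date_id
WHERE dates.date = '2024-01-31';

-- Full history for one firm:
SELECT dates.date, data_points.value
FROM data_points
JOIN firms ON firms.id = data_points.firm_id
JOIN dates ON dates.id = data_points.date_id
WHERE firms.name = 'Example Firm'
ORDER BY dates.date;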

MySQL structure for rental availability

I have to import an availability calendar of 30,000 places into MySQL, and I am stuck on structure design. I need something which will allow me to easily subquery and join availability of checkIn for a given date.
Actually, each day has several options
Can checkIn and CheckOut
Not Available
CanCheckIn only
CanCheckOut
OnRequest
Now, what would be an optimal solution for a table?
PlaceId Day AvailabilityCode ???
Then I would have 366 * 30,000 rows? I am afraid of that.
Is there any better way to do?
The XML data I have to parse looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<vacancies>
<vacancy>
<code>AT1010.200.1</code>
<startday>2010-07-01</startday>
<availability>YYYNNNQQ</availability>
<changeover>CCIIOOX</changeover>
<minstay>GGGGGGGG</minstay>
<flexbooking>YYYYY</flexbooking>
</vacancy>
</vacancies>
Where
Crucial additional information: the availability calendar is given as an XML feed, and I have to import it and repopulate my database every 10-20 minutes.
I think your problem is the XML feed, not the table structure. The easiest solution would be to ask the feed provider to deliver just a delta rather than a whole dump. But presumably there's a good reason why that is not possible.
So you will have to do it. You should store the XML feeds somehow, and compare the new file with the previous one. This will give you the delta, which you can then apply to your database table. There are several approaches you could take, and which you choose will largely depend on your programming prowess, and the capabilities of your database product.
For instance, MySQL has only had XML functionality since 5.1 and it is still pretty limited. So if you want to preprocess the XML file you will probably have to do it outside the database. An alternative approach would be to load the latest file into a staging table and use SQL to find and apply the differences.
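A rough sketch of that staging-table approach (all table and column names below are assumptions): the parsed feed is loaded into availability_staging, which has the same columns as a target table availability(place_id, day, availability_code) with a primary key on (place_id, day).
-- Upsert everything from the freshly loaded staging table:
INSERT INTO availability (place_id, day, availability_code)
SELECT s.place_id, s.day, s.availability_code
FROM availability_staging s
ON DUPLICATE KEY UPDATE availability_code = VALUES(availability_code);

-- Remove rows that are no longer present in the feed:
DELETE a FROM availability a
LEFT JOIN availability_staging s
       ON s.place_id = a.place_id AND s.day = a.day
WHERE s.place_id IS NULL;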
You only need to add rows when something is not available. A missing row for a date and room can be implicitly interpreted as availability.
365 * 30,000 is a little over 10 million records in a table with only small fields (an int id, a date or day, and a code, which is probably an int as well, or maybe a char(1)). This is very doable in MySQL and will only become a problem if you have many reads and frequent updates to this table. If it is only updated now and then, it will not be much of a problem to have tables with 10 or 20 million records.
But maybe there's a better solution, although it may be more complex.
It sounds to me like some sort of booking programme. If so, each place will probably have seasons in which it can be booked. You can give each place a default value, or maybe even a default value per season. For instance, a place is available from March to August and unavailable in the other months. Then, when a place is booked during the summer and becomes unavailable, you can put that value in the table you suggested.
That way, you can check whether a record exists for a given day for the requested place. If it does not exist, you check the default value in the 'places' table (30,000 records), or the 'seasons' table where you have a record per season per place (maybe 2 to 4 records per place). That way you can cut the number of records down by a lot.
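A sketch of that lookup for one place and one day, assuming tables places(place_id, default_code), seasons(place_id, start_day, end_day, code) with non-overlapping seasons, and an exceptions table holding only the booked/unavailable days (all names are assumptions):
-- The per-day exception wins over the season default, which wins over the place default.
SELECT COALESCE(e.code, s.code, p.default_code) AS availability_code
FROM places p
LEFT JOIN seasons s
       ON s.place_id = p.place_id
      AND '2010-07-15' BETWEEN s.start_day AND s.end_day
LEFT JOIN availability_exceptions e
       ON e.place_id = p.place_id
      AND e.day = '2010-07-15'
WHERE p.place_id = 42;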
But remember this will not work if you have bookings for almost every day for each place. In that case you will hardly ever need the defaults, and there will still be millions of records in the state-per-day table. As I said before, this may not be a problem at all, but either way you should consider whether the more complex solution will actually help you decrease the data or not. It depends on your situation.