Tableau can't sum (10^10 numbers) - csv

I've been doing some harmless operations, basically combining 2 ids to create a unique id, and as the ids are numbers, I decided to use math operations to combine then (rather then string concatenation). So, as my second id is alwas < 10000, what I did was id1*10000 + id2
Problem is, Tableau doesn't seem to know how to add those numbers. To illustrate better, I created Calculation 1 (that is Id1*10000), Calculation2 (that is Id2) and Calculation3 (that is Calculation1 + Calculation2).
Check the file. http://www.speedyshare.com/Wc5zP/Tableau-can-t-sum.twbx
Original datasource is a csv file, but it's extracted (to a tde).
One thing that might be happening is that Tableau has some limitation on the size of int it can store. Anyone knows how I can change this? (Int64 would do the trick, if that is actually the problem)
Here a snapshot

Related

Is there a way to store database modifications with a versioning feature (for eventual versions comparaison)?

I'm working on a project where users could upload excel files into a MySQL database. Those files are the main source of our data as they come directly from the contractors working with the company. They contain a large number of rows (23000 on average for each file) and 100 columns for each row!
The problem I am facing currently is that the same file could be changed by someone (either the contractor or the company) and when re-uploading it, my system should detect changes, update the actual data, and save the action (The fact that the cell went from a value to another value :: oldValue -> newValue) so we can go back and run a versions comparison (e.g 3 re-uploads === 3 versions). (oldValue Version1 VS newValue Version5)
I developed a tiny mechanism for saving the changes => I have a table to save Imports data (each time a user import a file a new row will be inserted in this table) and another table for saving the actual changes
Versioning data
I save the id of the row that have some changes, as well as the id and the table where the actual data was modified (Uploading a file results in a insertion in multiple tables, so whenever a change occurs, I need to know in which table that happened). I also save the new value and the old value which is gonna help me with restoring the "archives data".
To restore a version : SELECT * FROM 'Archive' WHERE idImport = ${versionNumber}
To restore a version for one row : SELECT * FROM 'Archive' WHERE idImport = ${versionNumber} and rowId = ${rowId}
To restore all version for one row : SELECT * FROM 'Archive' WHERE rowId = ${rowId}
To restore version for one table : SELECT * FROM 'Archine' WHERE tableName = ${table}
Etc.
Now with this structure, I'm struggling to restore a version or to run a comparaison between two versions, which makes think that I've came up with a wrong approach since it makes it hard to do the job! I am trying to know if anyone had done this before or what a good approach would look like?
Cases when things get really messy :
The rows that have changed in a version might not have changed in the other version (I am working on a time machine to search in other versions when this happens)
The rows have changed in both versions but not the same fields. (Say we have a user table, the data of the user with id 15 have changed in 2nd and 5th upload, great! Now for the second version only the name was changed, but for the fifth version his address was changed! When comparing these two versions, we will run into a problem constrcuting our data array. name went from "some"-> NULL (Name was never null. No name changes in 5th version) and address went from NULL -> "some' is which obviously wrong).
My actual approach (php)
<?php
//Join records sets and Compare them
foreach ($firstRecord as $frecord) {
//Retrieve first record fields that have changed
$fFields = $frecord->fieldName;
//Check if the same record have changed in the second version as well
$sId = array_search($frecord->idRecord, $secondRecord);
if($sId) {
$srecord = $secondRecord[$sId];
//Retrieve straversee fields that have changed
$sFields = $srecord->fieldName;
//Compare the two records fields
foreach ($fFields as $fField) {
$sfId = array_search($fField, $sFields);
//The same field for the same record was changed in both version (perfect case)
if($sfId) {
$sField = $sFields[$sfId];
$deltaRow[$fField]["oldValue"] = $frecord->deltaValue;
$deltaRow[$fField]["newValue"] = $srecord->deltaValue;
//Delete the checked field from the second version traversee to avoid re-checking
unset($sField[$sfId]);
}
//The changed field in V1 was not found in V2 -> Lookup for a value
else {
$deltaRow[$fField]["oldValue"] = $frecord->deltaValue;
$deltaRow[$fField]["newValue"] = $this->valueLookUp();
}
}
$dataArray[] = $deltaRow;
//Delete the checked record from the second version set to avoid re-checking
unset($secondRecord[$srecord]);
}
I don't know how to deal with that, as I said I m working on a value lookup algorithm so when no data found in a version I will try to find it in the versions between theses two so I can construct my data array. I would be very happy if anyone could give some hints, ideas, improvements so I can go futher with that.
Thank you!
Is there a way to store database modifications with a versioning feature (for eventual versions comparaison [sic!])?
What constitutes versioning depends on the database itself and how you make use of it.
As far as a relational database is concerned (e.g. MariaDB), this boils down to the so called Normal Form which is in numbers.
On Database Normalization: 5th Normal Form and Beyond you can find the following guidance:
Beyond 5th normal form you enter the heady realms of domain key normal form, a kind of theoretical ideal. Its practical use to a database designer os [sic!] similar to that of infinity to a bookkeeper - i.e. it exists in theory but is not going to be used in practice. Even the most demanding owner is not going to expect that of the bookkeeper!
One strategy to step into these realms is to reach the 5th normal form first (do this just in theory, by going through all the normal forms, and study database normalization).
Additionally you can construe versioning outside and additional to the database itself, e.g. by creating your own versioning system. Reading about what you can do with normalization will help you to find better ways to decide on how to structure and handle the database data for your versioning needs.
However, as written it depends on what you want and need. So no straight forward "code" answer can be given to such a general question.

select from table where two parameters satisfy in mysql

I am totally clueless how to get around to get the following kinda result from the same table in MySQL.
Required Result:
The raw data as shown in below image.
Mc_id and op_id can be different. For example, if mc_id is 4 and op_id is 10 then it has to loop through each vouid and extract done_on_date, again it has to loop through for the same mc_id 4 and op_id 10 and extract done_on_date where done_on_date is after first extracted done_on_date. Here second extracted done_on_date, we refer to, as next_done_on_date, just to distinguish it differently. Accordingly continue till end of the table. I hope I am clear enough now.
The idea is basically to see when was particular operation_id carried out for the said machine having mc_id. First time operation done is refered to as done_on_date and when the same operation carried out for the same machine next time, we refer to as next_done_on_date but actually inside the database table it is done_on_date.
Though let me know if anything yet to be clarified

Best way to parse a big and intricated Json file with OpenRefine (or R)

I know how to parse json cells in Open refine, but this one is too tricky for me.
I've used an API to extract the calendar of 4730 AirBNB's rooms, identified by their IDs.
Here is an example of one Json file : https://fr.airbnb.com/api/v2/calendar_months?key=d306zoyjsyarp7ifhu67rjxn52tv0t20&currency=EUR&locale=fr&listing_id=4212133&month=11&year=2016&count=12&_format=with_conditions
For each ID and each day of the year from now until november 2017, i would like to extract the availability of this rooms (true or false) and its price at this day.
I can't figure out how to parse out these informations. I guess that it implies a series of nested forEach, but i can't find the right way to do this with Open Refine.
I've tried, of course,
forEach(value.parseJson().calendar_months, e, e.days)
The result is an array of arrays of dictionnaries that disrupts me.
Any help would be appreciate. If the operation is too difficult in Open Refine, a solution with R (or Python) would also be fine for me.
Rather than just creating your Project as text, and working with GREL to parse out...
The best way is just select the JSON record part that you want to work with using our visual importer wizard for JSON files and XML files (you can even use a URL pointing to a JSON file as in your example). (A video tutorial shows how here: https://www.youtube.com/watch?v=vUxdB-nl0Bw )
Select the JSON part that contains your records that you want to parse and work with (this can be any repeating part, just select one of them and OpenRefine will extract all the rest)
Limit the amount of data rows that you want to load in during creation, or leave default of all rows.
Click Create Project and now your in Rows mode. However if you think that Records mode might be better suited for context, just import the project again as JSON and then select the next outside area of the content, perhaps a larger array that contains a key field, etc. In the example, the key field would probably be the Date, and why I highlight the whole record for a given date. This way OpenRefine will have Keys for each record and Records mode lets you work with them better than Row mode.
Feel free to take this example and make it better and even more helpful for all , add it to our Wiki section on How to Use
I think you are on the right track. The output of:
forEach(value.parseJson().calendar_months, e, e.days)
is hard to read because OpenRefine and JSON both use square brackets to indicate arrays. What you are getting from this expression is an OR array containing twelve items (one for each month of the year). The items in the OR array are JSON - each one an array of days in the month.
To keep the steps manageable I'd suggest tackling it like this:
First use
forEach(value.parseJson().calendar_months,m,m.days).join("|")
You have to use 'join' because OR can't store OR arrays directly in a cell - it has to be a string.
Then use "Edit Cells->Split multi-valued cells" - this will get you 12 rows per ID, each containing a JSON expression. Now for each ID you have 12 rows in OR
Then use:
forEach(value.parseJson(),d,d).join("|")
This splits the JSON down into the individual days
Then use "Edit Cells->Split multi-valued cells" again to split the details for each day into its own cell.
Using the JSON from example URL above - this gives me 441 rows for the single ID - each contains the JSON describing the availability & price for a single day. At this point you can use the 'fill down' function on the ID column to fill in the ID for each of the rows.
You've now got some pretty easy JSON in each cell - so you can extract availability using
value.parseJson().available
etc.

SAP BusinessObjects - Merging dimensions with no directly related attributes

Given the following 3 queries
Query 1
SELECT
COMPONENTINFO__SOFTWARE.SOFTWARENAME,
COMPONENTINFO__SOFTWARE.SOFTWAREVERSION,
COMPONENTINFO__SOFTWARE.PARENTOID,
COMPONENTINFO__SOFTWARE.OID,
COMPONENT_VERSION_INFO.OID,
COMPONENT_VERSION_INFO.HWSERIAL,
COMPONENT_VERSION_INFO.COMPONENTID
FROM
COMPONENTINFO__SOFTWARE,
COMPONENT_VERSION_INFO
WHERE
( COMPONENTINFO__SOFTWARE.PARENTOID=COMPONENT_VERSION_INFO.OID )
Query 2
SELECT
V_MACH.OID,
V_MACH.NAME,
V_MACH.IPADDR
FROM
V_MACH
Query 3
SELECT
V_VERSIONINFO.MACHINEOID,
VM_VERSIONINFO_VERSIONINFOINFO.HWSERIAL,
VM_VERSIONINFO_VERSIONINFOINFO.OSVERSION,
VM_VERSIONINFO_VERSIONINFOINFO.PARENTOID,
VM_VERSIONINFO_VERSIONINFOINFO.OID,
COMPONENT_VERSION_INFO.PARENTOID,
V_VERSIONINFO.OID
FROM
V_VERSIONINFO,
VM_VERSIONINFO_VERSIONINFOINFO,
COMPONENT_VERSION_INFO
WHERE
( VM_VERSIONINFO_VERSIONINFOINFO.PARENTOID=V_VERSIONINFO.OID )
I'm trying to produce a report (Webi, using the rich client) that shows in 1 table:
V_MACH.NAME, COMPONENTINFO__SOFTWARE.SOFTWARENAME, COMPONENTINFO__SOFTWARE.SOFTWAREVERSION
But no matter what dimensions I merge, it won't let me put the NAME field alongside the software version fields.
I've tried to merge on:
VM_VERSIONINFO_VERSIONINFOINFO.HWSERIAL + COMPONENT_VERSION_INFO.HWSERIAL.
VM_VERSIONINFO_VERSIONINFOINFO.OID + COMPONENT_VERSION_INFO.OID (I found these represent the same values for each machine)
But nothing works.
Is the only way to do a join at the SQL level? I was hoping to avoid that but if it's the only way then that's ok.
I think what you need to do is this:
1) Create a merged dimension between V_MACH.OID in Query 2 and
V_VERSIONINFO.MACHINEOID in Query 3. Call the merged dim
"machineoid".
Create a merged dimension between
VM_VERSIONINFO_VERSIONINFOINFO.OID in Query 3 and
COMPONENT_VERSION_INFO.OID in Query 1. Call the merged dim "oid".
Create a new variable as a detail type, defined as
=[V_MACH.NAME], and its associated dimension as the merged
machineoid dimension. Call it name_detail.
Use the two merged dims in place of the
underlying dims in your report block, then add in the name_detail variable.
The reason you're having trouble is that BO can't recognize what Query 2.NAME should be associated with. By creating a detail variable, you are explicitly telling it that it is an attribute of the now-merged OID dimension.

"Diffing" objects from a relational database

Our win32 application assembles objects from the data in a number of tables in a MySQL relational database. Of such an object, multiple revisions are stored in the database.
When storing multiple revisions of something, sooner or later you'll ask yourself the question if you can visualize the differences between two revisions :) So my question is: what would be a good way to "diff" two such database objects?
Would you do the comparison at the database level? (Doesn't sound like a good idea: too low-level, and too sensitive to the schema).
Would you compare the objects?
Would you write a function that "manually" compares the properties and fields of two objects?
How would you store the diff? In a separate, generic "TDiff" object?
Any general recommendations on how to visualize such things in a user interface?
Advice, or stories about your own experiences with this, are very welcome; thanks a bunch!
Extra info on use case (20090515)
In reply to Antony's comment: this specific application is used to schedule training courses, run by teams of teachers. The schedule of a teacher is stored in various tables in the database, and contains info such as "where does she have to go on which day", "who are her colleagues in the team", etc. This information is spread out over multiple tables.
Once in a while, we "publish" the schedule, so the teachers can see it on a webpage. Each "publication" is a revision, and we'd like to be able to show the users (and later also the teachers) what's changed between two publications --- if anything.
Hope that makes the scenario a bit more tangible :)
Some final remarks
Well, the bounty has come to an end, so I've accepted an answer. If it'd somehow be possible to slice a couple of extra 100's off of my rep and give it to some of the other answers, I would do so without hesitation. All your guys' help has been great, and I am very grateful! ~ Onno 20090519
Just an idea, but would it be worthwhile for you to convert the two object versions being compared to some text format and then comparing these text objects using an existing diff program - like diff for example? There are lots of nice diff programs out there that can offer nice visual representations, etc.
So for example
Text version of Object 1:
first_name: Harry
last_name: Lime
address: Wien
version: 0.1
Text version of Object 2:
first_name: Harry
last_name: Lime
address: Vienna
version: 0.2
The diff would be something like:
3,4c3,4
< address: Wien
< version: 0.1
---
> address: Vienna
> version: 0.2
Assume that a class has 5 known properties - date, time, subject, outline, location. When I look at my schedule, I'm most interested in the most recent (ie current/accurate) version of these properties. It would also be useful for me to know what, if anything, has changed. (As a side note, if the date, time or location changed, I'd also expect to get an email/sms advising me in case I don't check for an updated schedule :-))
I would suggest that the 'diff' is performed at the time the schedule is amended. So, when version 2 of the class is created, record which values have changed, and store this in two 'changelog' fields on the version 2 object (there must already be one parent table that sits atop all your tables - use that one!). One changelog field is 'human readable text' eg 'Date changed from Mon 1 May to Tues 2 May, Time changed from 10:00am to 10:30am'. The second changelog field is a delimted list of changed fields eg 'date,time' To do this, before saving you would loop over the values submitted by the user, compare to current database values, and concatenate 2 strings, one human readable, one a list of field names. Then, update the data and set your concatenated strings as the 'changelog' values.
When displaying the schedule load the current version by default. Loop through the fields in the changelog field list, and annotate the display to show that the value has changed (a * or a highlight, etc). Then, in a separate panel display the human readable change log.
If a schedule is amended more than once, you would probably want to combine the changelogs between version 1 & 2, and 2 & 3. Say in version 3 only the course outline changed - if that was the only changelog you had when displaying the schedule, the change to date and time wouldn't be displayed.
Note that this denormalised approach won't be great for analysis - eg working out which specific location always has classes changed out of it - but you could extend it using an E-A-V model to store the change log.
Doing a comparison at the database level would be good if what you cared about was changes to the database. That makes the most sense if you're trying to design a layer of generic functionality on top of the database itself.
Doing a comparison at the object level would be good if you care about changes to the data. For example, if the data was the input to a program and you were interested in looking at changes in the input to verify that changes to the output were correct.
Your use case doesn't appear to be either of these. You appear to care about the output and want differences from that perspective. If that's the case, I would do differences on the output report (or a pure-text version of it) instead of on the underlying data. You can do that with any off-the-shelf diff tool. To make things easier for your end-users you could parse the diff results and render them as HTML. There are lots of options here: side-by-side with color coding to indicate changes, one document with markup for changes (e.g. red strikethrough for deletions and green for additions), maybe just highlight areas that have changed and use balloons to show the previous/current values on demand.
I've thought about doing database comparisons but never tried to implement it. As you noted, any such attempts are intimately intertwined with the schema.
I have done object-level comparisons. The general algorithm was this:
Do a set comparison on the lists of object IDs. This creates three result groupings: added objects, deleted objects, and objects that live in both sets.
Report the deletions.
Report the additions.
For the things in both sets, do an attribute-by-attribute comparison.
If any differences are found, report the object ID, the attributes that differ, and the respective values. If appropriate, highlight the portion of the attribute value that has changed.
In my case, the comparison algorithms were hand-written to match the object attributes. This gave me control over which attributes were compared and how. A generic comparator might be possible for some cases but would depend on the situation and at least partially on the implementation language.
I've looked into MysQL Diffing a number of times. Unfortunately, there aren't any really good solutions available.
One tool I've tried was mysqldiff (www.mysqldiff.org). mysqldiff is a tool written in PHP which is capable of diffing mysql schemas. Unfortunately, it doesn't do a great job a lot of the time.
MySQL Workbench, MySQLs own SQL IDE provides the option to generate an alter script and I would imagine it does this by performing some kind of diff operation internally.
Aqua Data Studio is another tool that is capable of comparing schemas and outputing a diff of the two. While the ADS diff is quite nice, it does not provide a tool to create an alter script.
If I were writing my own I guess I would write code capable of comparing structure of two tables. Such code could be tuned to be highly sensitive (Ig if column order differs from from version to the next, it's a difference) or more moderately sensitive (Eg Column order is not a major issue, datatypes and lengths are important, as are indices and constraints).
Storage, I'm not to sure. I would look into how a version control system such as Mercurial stores its diff information for revisions and use that to elaborate a method appropriate for the DB.
Finally, for visual output I recommend you take a look at the Aqua Data Stduio compare feature (You can use the Trial version to test this...). Its diff output is pretty good.
My application dbscript compares hierarchical data (database schemas) in a stored procedure, which of course has to compare each field/property of every object with its counterpart. I guess you won't get around that step (unless you have a generic object description model)
As for the UI part of your question, have a look at screenshots to view and select differences.
I would think about some sort of common text representation of the objects and let the texts compare with an existing diffing tool like WinMerge.
I see no need to invent diffing by myself since there are already plenty of nice tools I can use.
In your situation in PostgreSQL I used a difference tables with the schema:
history_columns (
column_id smallint primary key,
column_name text not null,
table_name text not null,
unique (table_name, column_name)
);
create temporary sequence column_id_seq;
insert into history_columns
select nextval('column_id_seq'), column_name, table_name
from information_schema.columns
where
table_name in ('table1','table2','table3')
and table_schema=current_schema() and table_catalog=current_database();
create table history (
column_id smallint not null references history_columns,
id int not null,
change_time timestamp with time zone not null
constraint change_time_full_second -- only one change allowed per second
check (date_trunc('second',change_time)=change_time),
primary key (column_id,id,change_time),
value text
);
And on the tables I used a trigger like this:
create or replace function save_history() returns trigger as
$$
if (tg_op = 'DELETE') then
insert into historia values (
find_column_id('id',tg_relname), OLD.id,
date_trunc('second',current_timestamp),
OLD.id );
[for each column_name] {
if (char_length(OLD.column_name)>0) then
insert into history values (
find_column_id(column_name,tg_relname), OLD.id,
OLD.change_time, OLD.column_name
)
}
elsif (tg_op = 'UPDATE') then
[for each column_name] {
if (OLD.column_name is distinct from NEW.column_name) then
insert into history values (
find_column_id(column_name,tg_relname), OLD.id,
OLD.change_time, OLD.column_name
);
end if;
}
end if;
$$ language plpgsql volatile;
create trigger save_history_table1
before update or delete on table1
for each row execute procedure save_history();
This isn't really an answer to the question you asked rather an attempt to re-imagine the problem. Would you consider altering your database and object model to store the aggregate root and a series of deltas? That is, model and store RevisionSets that are collections of Revisions; a Revision is an entity property paired with a value. In a sense this is internalizing the revision structure into your architecture that the other posters are suggesting that you bolt-on to what you already have via "logs".
It's trivial to display the aggregate from the deltas, and even easier to display the deltas as a change history. The fact that you are using a rich client with state and local memory makes this even more compelling. You could very easily display "all the changes since date xxxx" without revisiting the database.
Credit for the basic idea goes to Greg Young and his work with financial data streams, but it is imminently applicable to your problem.
I'm riffing off of what Harry Lime suggested: Output your properties to text format, then hash the results. That way you can compare the hash values and easily flag the data that has been altered. This way you get the best of both worlds as you can visually see differences but programmatically identify differences. With the has you'll have a good source for an index should you want to store and retrieve the deltas.
Given you want to create a UI for this and need to indicate where the differences are, it seems to me you can either go custom or create a generic object comparer - the latter being dependent on the language you are using.
For the custom method, you need to create a class that takes to two instances of the classes to be comparied. It then returns differences;
public class Person
{
public string name;
}
public class PersonComparer
{
public PersonComparer(Person old, Person new)
{
....
}
public bool NameIsDifferent() { return old.Name != new.Name; }
public string NameDifferentText() { return NameIsDifferent() ? "Name changed from " + old.Name + " to " + new.Name : ""; }
}
This way you can use the NameComparer object to create your GUI.
The gereric approach would be much the same, just that you generalize the calls, and use object insepection (getObjectProperty call below) to find differences;
public class ObjectComparer()
{
public ObjectComparer(object old, object new)
{
...
}
public bool PropertyIsDifferent(string propertyName) { return getObjectProperty(old, propertyName) != getObjectProperty(new, propertyName) };
public string PropertyDifferentText(string propertyName) { return PropertyIsDifferent(propertyName) ? propertyName + " " + changed from " + getObjectProperty(old, propertyName) + " to " + getObjectProperty(new, propertyName): ""; }
}
}
I would go for the second, as it makes things really easy to change GUI on needs. The GUI I would try 'yellowing' the differences to make them easy to see - but that depends on how you want to show the differences.
Getting the object to compare would be loading your object with the initial revision and latest revision.
My 2 cents... Not as techy as the database compare stuff already here.
Have you looked at Open Source DiffKit?
www.diffkit.org
I think it does what you want.
Example with Oracle.
Export ordered objects to text with dbms_metadata
Export ordered tables data into CSV or query format
Make big text file
Diff