Vega Lite - Scaling to Large Datasets

I have used the density transform in Vega Lite for smaller datasets. However, I have a larger dataset with millions of observations that is represented more compactly, and for which I'd like to do a weighted density transform. My attempt is as follows:
`
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
// My data set is represented more compactly as follows
// "data": {
// "values": [
// {"size": 1, "observations": 1},
// {"size": 2, "observations": 2},
// {"size": 3, "observations": 4},
// {"size": 4, "observations": 6},
// {"size": 5, "observations": 3},
// ]
// },
// Expanding the dataset produces the right plot but is impractical
// given data volumes (in the millions of observations)
"data": {
"values": [
{"size": 1, "observation": "observation 1 of 1"},
{"size": 2, "observation": "observation 1 of 2"},
{"size": 2, "observation": "observation 2 of 2"},
{"size": 3, "observation": "observation 1 of 4"},
{"size": 3, "observation": "observation 2 of 4"},
{"size": 3, "observation": "observation 3 of 4"},
{"size": 3, "observation": "observation 4 of 4"},
{"size": 4, "observation": "observation 1 of 6"},
{"size": 4, "observation": "observation 2 of 6"},
{"size": 4, "observation": "observation 3 of 6"},
{"size": 4, "observation": "observation 4 of 6"},
{"size": 4, "observation": "observation 5 of 6"},
{"size": 4, "observation": "observation 6 of 6"},
{"size": 5, "observation": "observation 1 of 1"},
{"size": 5, "observation": "observation 2 of 2"}
]
},
"mark": "area",
"transform": [
{
// I believe Vega has a weight parameter in the density transform
// Is there an equivalent in Vega Lite?
//"weight": "observations",
"density": "size"
}
],
"encoding": {
"x": {"field": "value", "type": "quantitative"},
"y": {"field": "density", "type": "quantitative"}
}
}
`
The dataset I have available to me is commented out above. Expanding out the dataset produces the correct plot. However, given the number of observations, I suspect this is impractical unless there's a performant way to do this inside Vega Lite.
I believe Vega has a weight parameter in the density transform, but in the environment I'm working in, I only have access to Vega Lite. Is there another way to think about producing a weighted density transform in Vega Lite?

That weight parameter in Vega isn't what you're looking for: it weights the different probability distributions when you combine multiple distribution types into a mixture. Out of the box, neither Vega nor Vega-Lite is suited to huge datasets, but there are several projects that build on Vega to scale to large data:
https://github.com/vega/scalable-vega
https://vega.github.io/scalable-vega/
https://vegafusion.io/
If you can't use one of those projects, your only option is to precompute the distributions and have Vega-Lite display the result.
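For example, here is a minimal sketch of that precompute-and-display approach. It assumes the weighted density has already been evaluated outside Vega-Lite (for instance a weighted KDE computed server-side) and that the resulting (value, density) pairs are passed in as ordinary data; the numbers below are purely illustrative:
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"data": {
"values": [
{"value": 1, "density": 0.06},
{"value": 2, "density": 0.15},
{"value": 3, "density": 0.24},
{"value": 4, "density": 0.33},
{"value": 5, "density": 0.17}
]
},
"mark": "area",
"encoding": {
"x": {"field": "value", "type": "quantitative"},
"y": {"field": "density", "type": "quantitative"}
}
}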

Related

Add field of similar data with percentage in database

I am working on a scraping task where I have to collect product titles and prices from two websites. After scraping I have a large dataset, with the following table structure:
[
{
"id": 1,
"title": "ProductA",
"price": 10,
"matches": []
},
{
"id": 2,
"title": "ProductB",
"price": 20,
"matches": []
},
{
"id": 3,
"title": "Another One",
"price": 30,
"matches": []
}
]
I am using MongoDB right now as the database. I have to run a script that finds matching products with a similarity score and stores them in the matches field across the large dataset. Example:
[
{
"id": 1,
"title": "ProductA",
"price": 10,
"matches": [{score: 0.75, productId: 2}]
},
{
"id": 2,
"title": "ProductB",
"price": 20,
"matches": [{score: 0.75, productId: 1}]
},
{
"id": 3,
"title": "Another One",
"price": 30,
"matches": []
}
]
I tried LIKE in SQL and $text in MongoDB, but both only work when finding similarity against a specific text.
Is there any built-in database operation that can go through all the documents, find similarities by title, generate a percentage of how closely they match, and then add it to the matches field?
NOTE: MongoDB is not mandatory, any database can be mentioned.
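Since any database may be suggested, one possible direction (not a MongoDB feature, and only a sketch) is PostgreSQL's pg_trgm extension, which ships with a built-in similarity() function returning a 0-1 score for two strings. Assuming the scraped rows were loaded into a hypothetical products(id, title, price) table:
-- enable trigram similarity (contrib extension bundled with PostgreSQL)
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- score every pair of products by how similar their titles are,
-- keeping only pairs above an arbitrary 0.5 threshold
SELECT a.id AS product_id,
       b.id AS matched_product_id,
       similarity(a.title, b.title) AS score
FROM products a
JOIN products b ON b.id <> a.id
WHERE similarity(a.title, b.title) > 0.5
ORDER BY a.id, score DESC;
The resulting scores could then be written back into the matches field, or kept in a separate table.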

How do I use factorial (!) in a vega formula transform

I am trying to create a histogram of a binomial distribution PMF using a vega js specification.
How is this usually done? The Vega expression language does not include functions for choose or factorial, nor does it include a binomial distribution among the statistical functions.
I also cannot seem to reference other functions within the Vega specification (e.g. for yval below).
"data":[
{"name": "dataset",
"transform": [
{"type":"sequence", "start": 1, "stop": 50, "step": 1, "as": "seq" },
{"type": "formula", "as": "xval", "expr": "if(datum.seq<nval,datum.seq,NaN)"},
{"type": "formula", "as": "yval", "expr": "math.factorial(datum.xval)
" }
]}],
Thanks.
There is no factorial operation available, but one suitable option might be to approximate it with Stirling's approximation, or perhaps a Stirling series if more accuracy is required.
For example, in Vega-Lite (view in editor):
{
"data": {
"values": [
{"n": 0, "factorial": 1},
{"n": 1, "factorial": 1},
{"n": 2, "factorial": 2},
{"n": 3, "factorial": 6},
{"n": 4, "factorial": 24},
{"n": 5, "factorial": 120},
{"n": 6, "factorial": 720},
{"n": 7, "factorial": 5040},
{"n": 8, "factorial": 40320},
{"n": 9, "factorial": 362880},
{"n": 10, "factorial": 3628800}
]
},
"transform": [
{
"calculate": "datum.n == 0 ? 1 : sqrt(2 * PI * datum.n) * pow(datum.n / E, datum.n)",
"as": "stirling"
},
{"fold": ["factorial", "stirling"]}
],
"mark": "point",
"encoding": {
"x": {"field": "n", "type": "quantitative"},
"y": {"field": "value", "type": "quantitative", "scale": {"type": "log"}},
"color": {"field": "key", "type": "nominal"}
}
}
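If the goal is the binomial PMF itself rather than the bare factorial, the same idea can be pushed one step further by approximating the log-factorials with Stirling's formula and combining them. The following is only a sketch, assuming n = 20 and p = 0.3 (both chosen purely for illustration), and it inherits Stirling's error for small k:
{
"$schema": "https://vega.github.io/schema/vega-lite/v5.json",
"data": {"sequence": {"start": 0, "stop": 21, "step": 1, "as": "k"}},
"transform": [
{"calculate": "datum.k == 0 ? 0 : datum.k * log(datum.k) - datum.k + 0.5 * log(2 * PI * datum.k)", "as": "lnFactK"},
{"calculate": "datum.k == 20 ? 0 : (20 - datum.k) * log(20 - datum.k) - (20 - datum.k) + 0.5 * log(2 * PI * (20 - datum.k))", "as": "lnFactNMinusK"},
{"calculate": "20 * log(20) - 20 + 0.5 * log(2 * PI * 20)", "as": "lnFactN"},
{"calculate": "exp(datum.lnFactN - datum.lnFactK - datum.lnFactNMinusK + datum.k * log(0.3) + (20 - datum.k) * log(0.7))", "as": "pmf"}
],
"mark": "bar",
"encoding": {
"x": {"field": "k", "type": "ordinal"},
"y": {"field": "pmf", "type": "quantitative"}
}
}
Here lnFactN - lnFactK - lnFactNMinusK approximates log C(20, k), and the remaining terms add k*log(p) + (n-k)*log(1-p) before exponentiating.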

Using Vega Lite to display already-aggregated data

I'm trying to show a stacked bar chart of sums over time. The data looks something like this:
[
{
"date": 12345,
"sumA": 100,
"sumB": 150
},
...
]
I'm encoding the x axis to the field "date". I need the bar at date 12345 to be stacked with one part being 100 high, and the other, shown in another color, being 150 high.
Vega Lite seems to expect raw data, but that would be too slow. I do this aggregation on the server side to save time. Can I spoon-feed Vega Lite the aggregates, as in my example above?
You can use the fold transform to fold your two columns into one, and then the channel encodings take care of the rest. For example (vega editor):
{
"data": {
"values": [
{"date": 1, "sumA": 100, "sumB": 150},
{"date": 2, "sumA": 200, "sumB": 50},
{"date": 3, "sumA": 80, "sumB": 120},
{"date": 4, "sumA": 120, "sumB": 30},
{"date": 5, "sumA": 150, "sumB": 110}
]
},
"transform": [
{"fold": ["sumA", "sumB"], "as": ["column", "value"]}
],
"mark": {"type": "bar"},
"encoding": {
"x": {"type": "ordinal", "field": "date"},
"y": {"type": "quantitative", "field": "value"},
"color": {"type": "nominal", "field": "column"}
}
}

Querying JSONB using Postgres

I am attempting to get an element in my JSON with a query.
I am using Groovy, Postgres 9.4 and JSONB.
Here is my JSON
{
"id": "${ID}",
"team": {
"id": "123",
"name": "Shire Soldiers"
},
"playersContainer": {
"series": [
{
"id": "1",
"name": "Nick",
"teamName": "Shire Soldiers",
"ratings": [
1,
5,
6,
9
],
"assists": 17,
"manOfTheMatches": 20,
"cleanSheets": 1,
"data": [
3,
2,
3,
5,
6
],
"totalGoals": 19
},
{
"id": "2",
"name": "Pasty",
"teamName": "Shire Soldiers",
"ratings": [
6,
8,
9,
10
],
"assists": 25,
"manOfTheMatches": 32,
"cleanSheets": 2,
"data": [
3,
5,
7,
9,
10
],
"totalGoals": 24
}
]
}
}
I want to fetch individual elements in the series array by their id. I am currently using the query below:
select content->'playersContainer'->'series' from site_content
where content->'playersContainer'->'series' @> '[{"id":"1"}]';
However, this brings me back both the element with an id of 1 and the one with an id of 2.
Below is what I get back:
"[{"id": "1", "data": [3, 2, 3, 5, 6], "name": "Nick", "assists": 17, "ratings": [1, 5, 6, 9], "teamName": "Shire Soldiers", "totalGoals": 19, "cleanSheets": 1, "manOfTheMatches": 20}, {"id": "2", "data": [3, 5, 7, 9, 10], "name": "Pasty", "assists": 25, "r (...)"
Can anyone see where I am going wrong? I have seen some other questions on here but they don't help with this.
content->'playersContainer'->'series' is an array. Use jsonb_array_elements() if you want to find a specific element in an array.
select elem
from site_content,
lateral jsonb_array_elements(content->'playersContainer'->'series') elem
where elem @> '{"id": "1"}';
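If you need specific fields rather than the whole element, you can keep drilling into elem with the ->> text-extraction operator, for example:
select elem->>'name' as name,
       (elem->>'totalGoals')::int as total_goals
from site_content,
lateral jsonb_array_elements(content->'playersContainer'->'series') elem
where elem @> '{"id": "1"}';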

Query JSON in Postgres

I have some JSON in my Postgres DB. It's in a table called site_content; the table has two columns, id and content, and content is where I store my JSON. I want to be able to find a player given his id. My players are stored under the key series, as this is the key needed to create my charts from the JSON.
Here is the query I am currently using:
Blocking.get {
sql.firstRow("""SELECT * from site_content where content -> 'playersContainer' -> 'series' -> 'id' = ${id} """)
}.map { row ->
log.info("row is: ${row}")
if (row) {
objectMapper.readValue(row.getAt(0).toString(), Player)
}
}
}
However I get back this error:
org.postgresql.util.PSQLException: ERROR: operator does not exist:
json = character varying Hint: No operator matches the given name
and argument type(s). You might need to add explicit type casts.
Here is an example of my JSON:
"id": "${ID}",
"team": {
"id": "123",
"name": "Shire Soldiers"
},
"playersContainer": {
"series": [
{
"id": "1",
"name": "Nick",
"teamName": "Shire Soldiers",
"ratings": [
1,
5,
6,
9
],
"assists": 17,
"manOfTheMatches": 20,
"cleanSheets": 1,
"data": [
3,
2,
3,
5,
6
],
"totalGoals": 19
},
{
"id": "2",
"name": "Pasty",
"teamName": "Shire Soldiers",
"ratings": [
6,
8,
9,
10
],
"assists": 25,
"manOfTheMatches": 32,
"cleanSheets": 2,
"data": [
3,
5,
7,
9,
10
],
"totalGoals": 24
}
]
}
}
I am using Groovy for this project, but I guess it's just the general Postgres JSON syntax I am having problems with.
You're right, it's a problem with the SQL syntax. Correct your query:
select * from json_test where content->'playersContainer'->'series' @> '[{"id":"1"}]';
Full example:
CREATE TABLE json_test (
content jsonb
);
insert into json_test(content) VALUES ('{"id": "1",
"team": {
"id": "123",
"name": "Shire Soldiers"
},
"playersContainer": {
"series": [
{
"id": "1",
"name": "Nick",
"teamName": "Shire Soldiers",
"ratings": [
1,
5,
6,
9
],
"assists": 17,
"manOfTheMatches": 20,
"cleanSheets": 1,
"data": [
3,
2,
3,
5,
6
],
"totalGoals": 19
},
{
"id": "2",
"name": "Pasty",
"teamName": "Shire Soldiers",
"ratings": [
6,
8,
9,
10
],
"assists": 25,
"manOfTheMatches": 32,
"cleanSheets": 2,
"data": [
3,
5,
7,
9,
10
],
"totalGoals": 24
}
]
}}');
select * from json_test where content->'playersContainer'->'series' @> '[{"id":"1"}]';
For more on the @> containment operator, see the PostgreSQL documentation on JSON functions and operators.
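For reference, a minimal illustration of what the @> containment check does:
select '[{"id": "1"}, {"id": "2"}]'::jsonb @> '[{"id": "1"}]';   -- true
select '[{"id": "1"}, {"id": "2"}]'::jsonb @> '[{"id": "3"}]';   -- false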
Maybe this could help: in the SQL statement, I added this cast where I have the JSON field:
INSERT INTO map_file(type, data)
VALUES (?, CAST(? AS json))
RETURNING id
The datatype of the data column in the map_file table is json.