How to get a value from nested JSON by index rather than by name in MySQL 8 - mysql

So I'm currently using MySQL's JSON field to store some data.
So the 'reports' table looks like this:
id | stock_id | type             | doc
 1 |        5 | Income_Statement | https://pastebin.com/bj1hdK0S
The pastebin link contains the content of the JSON field.
What I want to do is get a number (ebit) from the first object under yearly (2018-12-31) in the JSON and then use it in a WHERE clause, so that the query only returns rows where, for example, ebit > 50000000. The issue is that the dates under yearly are not consistent (i.e. one might be 2018-12-31, another might be 2018-12-15). So essentially I want a way to get the data using integer indexes rather than the actual object names, something like yearly.[0].ebit.
How would I do this in MySQL? Alternatively, if it's not possible in MySQL, would it be possible in either PostgreSQL or Mongo? If so, could you give me an example? Most of the data fits well into MySQL; only this table has a JSON column, which is why I started with MySQL.

I can't speak for MySQL or MongoDB, but here's a simple version for PostgreSQL's JSONB type:
SELECT (doc->'yearly'-> max(years) -> 'ebit')::numeric AS ebit
FROM reports, jsonb_object_keys(doc->'yearly') AS years
GROUP BY reports.doc;
...with simplistic test data:
WITH reports(doc) AS (
SELECT '{"yearly":{"2018-12-31":{"ebit":123},"2017-12-31":{"ebit":1.23}}}'::jsonb
)
SELECT (doc->'yearly'-> max(years) -> 'ebit')::numeric AS ebit
FROM reports, jsonb_object_keys(doc->'yearly') AS years
GROUP BY reports.doc;
...gives:
ebit
------
123
(1 row)
So I've basically selected the latest entry under "yearly" without knowing the actual keys, but assuming that the key date format allows sorting (in this case it appears to follow ISO 8601).
Using data type JSON instead of JSONB would preserve object key order but is not as efficient in PostgreSQL further down the road and wouldn't help here either.
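A quick way to see the difference (nothing beyond stock PostgreSQL assumed):
-- json keeps the input text verbatim; jsonb normalizes it (keys re-ordered, duplicates removed)
SELECT '{"b": 1, "a": 2}'::json  AS as_json,   -- {"b": 1, "a": 2}
       '{"b": 1, "a": 2}'::jsonb AS as_jsonb;  -- {"a": 2, "b": 1}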
If you then want to select only those reports entries whose latest ebit is greater than a certain value, just wrap it in a sub-select or a CTE. I usually prefer CTEs because they are easier to read, so here we go:
WITH
reports (id, doc) AS (
VALUES
(1, '{"yearly":{"2018-12-31":{"ebit":123},"2017-12-31":{"ebit":1.23}}}'::jsonb),
(2, '{"yearly":{"2018-12-23":{"ebit":50},"2017-12-22":{"ebit":"1200.00"}}}'::jsonb)
),
r_ebit (id, ebit) AS (
SELECT reports.id, (reports.doc->'yearly'-> max(years) -> 'ebit')::numeric AS ebit
FROM reports, jsonb_object_keys(doc->'yearly') AS years
GROUP BY reports.id, reports.doc
)
SELECT id, ebit
FROM r_ebit
WHERE ebit > 100;
However, as you can see, this strategy cannot filter the original rows directly; the aggregation has to run over all of them first. A pre-processing step would make sense here, so that the JSON format actually is filter-friendly.
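One possible pre-processing step, just as a sketch (the latest_ebit column name is invented here), is to materialize the latest ebit into a plain numeric column that can be indexed and filtered directly:
-- add a plain column and backfill it from the JSON document
ALTER TABLE reports ADD COLUMN latest_ebit numeric;
UPDATE reports AS r
SET latest_ebit = sub.ebit
FROM (
    SELECT reports.id,
           (doc->'yearly'-> max(years) ->> 'ebit')::numeric AS ebit
    FROM reports, jsonb_object_keys(doc->'yearly') AS years
    GROUP BY reports.id, reports.doc
) AS sub
WHERE sub.id = r.id;
CREATE INDEX ON reports (latest_ebit);
-- the filter is now trivial and can use the index
SELECT id, latest_ebit FROM reports WHERE latest_ebit > 50000000;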
ADDENDUM
To add the possibility of selecting the values for the n-th completed fiscal year, we need to resort to window functions, and we also need to reduce the resulting set to a single row per actual group (in the demonstration case: reports.id):
WITH reports(id, doc) AS (VALUES
(1, '{"yearly":{"2018-12-31":{"ebit":123},"2017-12-31":{"ebit":1.23},"2016-12-31":{"ebit":"23.42"}}}'::jsonb),
(2, '{"yearly":{"2018-12-23":{"ebit":50},"2017-12-22":{"ebit":"1200.00"}}}'::jsonb)
)
SELECT DISTINCT ON (1) reports.id,
       (reports.doc->'yearly'
          -> (lead(years, 0) over (partition by reports.doc order by years desc nulls last))
          ->> 'ebit')::numeric AS ebit
FROM reports, jsonb_object_keys(doc->'yearly') AS years
GROUP BY 1, reports.doc, years.years
ORDER BY 1;
...will behave exactly like the max aggregate function used previously. Increasing the offset parameter in the lead(years, <offset>) function call selects the n-th year backwards (because of the descending order within the window partition).
The DISTINCT ON (1) clause is the magic that reduces the result to a single row per distinct column value (first column = reports.id). This is why the NULLS LAST is very important inside the window OVER clause.
Here are results for different offsets (I've added a third historic entry for the first id but not for the second to also show how it deals with absent entries):
N = 0:
id | ebit
----+------
1 | 123
2 | 50
N = 1
id | ebit
----+---------
1 | 1.23
2 | 1200.00
N = 2
id | ebit
----+-------
1 | 23.42
2 |
...which means absent entries will just result in a NULL value.
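For anyone who needs to stay on MySQL 8: I haven't verified this, but a rough equivalent might combine JSON_KEYS() with JSON_TABLE() to find the latest key under yearly per row and then extract its ebit (a sketch, untested, assuming the reports(id, doc) layout from the question):
-- pick the lexicographically greatest (= latest ISO date) key under $.yearly, then read its ebit
SELECT id, ebit
FROM (
    SELECT r.id,
           CAST(JSON_UNQUOTE(JSON_EXTRACT(r.doc,
                CONCAT('$.yearly."', k.latest_year, '".ebit'))) AS DECIMAL(20, 2)) AS ebit
    FROM reports AS r
    JOIN (
        SELECT r2.id, MAX(jt.yr) AS latest_year
        FROM reports AS r2,
             JSON_TABLE(JSON_KEYS(r2.doc, '$.yearly'),
                        '$[*]' COLUMNS (yr VARCHAR(10) PATH '$')) AS jt
        GROUP BY r2.id
    ) AS k ON k.id = r.id
) AS latest
WHERE ebit > 50000000;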

Related

When comparing current row with previous row the query is too slow

When subtracting the previous row from the current row the query is too slow, is there a more efficient way to do this?
I am trying to create a data filter which has the capacity to highlight events which occur sequentially, as opposed to those that do not. I have a table of machine operational data 'source' which is ordered chronologically. Using a WHERE clause I filter out the data which is of less relevance to this particular analysis. The remaining data is inserted into a new table 'filtered'. Using the inserted ID numbers from 'source' I compare each row with its preceding row to find the difference in value; if the difference is 1 then the events have occurred in sequence, and if the difference is null then they have not.
My problem is the length of time it takes to compare a row with the previous row. I have reduced my data volume to just 2.5% (275000 rows) of what its full volume will be, and the query takes 3012 seconds according to the MySQL Workbench action output. I have experimented with structuring the query differently but ultimately have reached dead ends.
So my question is: is there a more efficient way to compare a row with its previous row?
OK – here are some more details.
/* First I create the table for the filtered data */
drop table if exists filtered_dta;
create table filtered_dta
(
ID int (11) not null auto_increment,
IDx1 int (11),
primary key (ID)
);
/* Then I insert the filtered data */
insert into filtered_dta (IDx1)
select seq from source
WHERE range_value < -1.75
and range_value > -5 ;
/* Then I compare each row with its previous */
select t1.ID, t1.IDx1,(t1.IDx1-t2.IDx1)
as seq_value
from filtered_dta t1
left outer join filtered_dta t2
on t1.IDx1 = t2.IDx1+1
order by IDx1
;
Here are sample tables.
Table - filtered_dta          Results
| ID | IDx1 |                 | ID | IDx1 | seq_value |
|  1 |    3 |                 |  1 |    3 | null      |
|  2 |    4 |                 |  2 |    4 | 1         |
|  3 |    7 |                 |  3 |    7 | null      |
|  4 |   12 |                 |  4 |   12 | null      |
|  5 |   13 |                 |  5 |   13 | 1         |
|  6 |   14 |                 |  6 |   14 | 1         |
A full data set from the source table is expected to be between 3 and 10 million rows. The database will create and use about 50 tables. This database is being used as a back end engine for simulation software which does not have the capacity to process this amount of data and give an appropriate analysis of the system which the data represents.
I have spent some time on the issue and have come across the following:
It may be that the find_seq table is created with MyISAM and requires converting to an InnoDB table. I tried setting the default engine to InnoDB but saw no noticeable difference.
This question was similar in its problem of a slow query (MySQL query painfully slow on large data), but its issue lay in having a function in a WHERE clause; from my action output I can see the WHERE clause is not what is slow here.
I would appreciate any input anyone may have on this. Also I am not a proficient user of MySQL so if possible give details.
Kind regards.
You can use something like this template to identify sequential "islands" without a self-join:
SELECT @island := @island + IF(seqId <> @lastSeqId + 1, 1, 0) AS island
, orderingQ.[fieldsYouWant]
, @lastSeqId := seqId
FROM (
SELECT [fieldsYouWant], [sequentialIdentifier] AS seqId
FROM [theTable] AS t
, (SELECT @island := 0, @lastSeqId := [somethingItCannotBe]) AS init_dnr -- Initializes variables, do not reference
WHERE [filteringConditionsMet]
ORDER BY [orderingCriteria]
) AS orderingQ
;
I tried keeping it as generic as possible, but you'll note I had to resort to the assumption that seqId is numeric and expected to increment by one. Conditions in the island calculation can be much more complicated if needed (for cases such as where (A, 1), (A, 2), (B, 3) should be two islands because the sequence is not defined by a single value).
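For instance, applied to the filtered_dta table from the question (only a sketch; it assumes IDx1 is the sequential identifier and that -1 is a value it cannot take):
SELECT @island := @island + IF(IDx1 <> @lastSeqId + 1, 1, 0) AS island
     , orderingQ.ID
     , orderingQ.IDx1
     , @lastSeqId := IDx1 AS lastSeqId
FROM (
    SELECT f.ID, f.IDx1
    FROM filtered_dta AS f
       , (SELECT @island := 0, @lastSeqId := -1) AS init_dnr -- initializes variables, do not reference
    ORDER BY f.IDx1
) AS orderingQ
;
Note that reading and writing user variables in the same SELECT relies on left-to-right evaluation, which MySQL does not strictly guarantee, so treat this as a sketch rather than a definitive implementation.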
You can take this template further to identify "island" boundaries and sizes by simply using the above query as a subquery in something like:
SELECT island, MIN(seqId), MAX(seqId), COUNT(seqId)
FROM ([above query]) AS islandQ
GROUP BY island
;
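Putting the two together against the question's table names (again only a sketch) would look something like:
SELECT island, MIN(IDx1) AS first_id, MAX(IDx1) AS last_id, COUNT(IDx1) AS island_size
FROM (
    SELECT @island := @island + IF(IDx1 <> @lastSeqId + 1, 1, 0) AS island
         , IDx1
         , @lastSeqId := IDx1 AS lastSeqId
    FROM (
        SELECT f.IDx1
        FROM filtered_dta AS f
           , (SELECT @island := 0, @lastSeqId := -1) AS init_dnr
        ORDER BY f.IDx1
    ) AS orderingQ
) AS islandQ
GROUP BY island;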

MySQL SELECT by splitting multiple values separated by ||

Below is the result set from a SELECT query:
mysql> select * from mytable where userid =242 ;
+--------+-----------------------------+------------+---------------------+---------------------+
| UserId | ActiveLinks | ModifiedBy | DateCreated | DateModified |
+--------+-----------------------------+------------+---------------------+---------------------+
| 242 | 1|2|4|6|9|15|22|33|43|57|58 | 66 | 2013-11-28 16:17:25 | 2013-11-28 16:17:25 |
+--------+-----------------------------+------------+---------------------+---------------------+
What I want is to SELECT the records by splitting the ActiveLinks column and associating each value with the UserId, in the below format,
e.g.,
UserId ActiveLinks
242 1
242 2
242 4
242 6
Can anyone help me with this query? As of now nothing comes to mind. Thanks.
Dealing with lists stored in data is a pain. In MySQL, you can use substring_index(). The following should do what you want:
SELECT userid,
       substring_index(substring_index(l.ActiveLinks, '|', n.n), '|', -1) as link
FROM (select 1 as n union all select 2 union all select 3 union all select 4) n join
     mytable l
     on length(l.ActiveLinks) - length(replace(l.ActiveLinks, '|', '')) + 1 >= n.n
WHERE userid = 242;
The first subquery generates a bunch of numbers, which you need. You may have to increase the size of this list.
The on clause limits the numbers to the number of elements in the list.
As you can probably tell, this is rather complicated. It is much easier to use a junction table, which is the relational way to store this type of information.
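As a rough illustration of that junction-table approach (the user_active_links table name here is invented):
-- one row per (user, link) instead of one delimited string per user
CREATE TABLE user_active_links (
    UserId INT NOT NULL,
    LinkId INT NOT NULL,
    PRIMARY KEY (UserId, LinkId)
);
-- the original question then becomes a trivial query
SELECT UserId, LinkId
FROM user_active_links
WHERE UserId = 242;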
I would create a routine which takes the delimiter as an argument.
Another IN parameter would be the corresponding UserId.
Every time you call it, it returns the set of values for the UserId passed in.
It basically uses a loop based on the count of '|' (the pipe character).
This way you can implement the solution proposed by @Gordon Linoff without needing to know how many active links you have.
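A sketch of what such a routine might look like (untested; the table and column names follow the question, and the temporary table is an invented detail):
DELIMITER //
CREATE PROCEDURE split_active_links(IN p_userid INT, IN p_delim VARCHAR(5))
BEGIN
    DECLARE v_links TEXT;
    DECLARE v_total INT;
    DECLARE v_i INT DEFAULT 1;
    -- grab the delimited string for the requested user
    SELECT ActiveLinks INTO v_links FROM mytable WHERE UserId = p_userid;
    -- number of items = number of delimiters + 1
    SET v_total = (LENGTH(v_links) - LENGTH(REPLACE(v_links, p_delim, ''))) / LENGTH(p_delim) + 1;
    DROP TEMPORARY TABLE IF EXISTS tmp_links;
    CREATE TEMPORARY TABLE tmp_links (UserId INT, ActiveLink VARCHAR(64));
    -- cut out one value per loop iteration
    WHILE v_i <= v_total DO
        INSERT INTO tmp_links
        VALUES (p_userid,
                SUBSTRING_INDEX(SUBSTRING_INDEX(v_links, p_delim, v_i), p_delim, -1));
        SET v_i = v_i + 1;
    END WHILE;
    SELECT UserId, ActiveLink FROM tmp_links;
END //
DELIMITER ;
CALL split_active_links(242, '|');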
If this is just a list of values that do not relate to anything in another table, I would do it the same way as Gordon (if need be you can cross join the subquery that generates the numbers to easily produce far larger ranges). One minor issue is that if the range of numbers is bigger than the number of delimited values on a row, then the last value will be repeated (easily removed using DISTINCT in this case, more complicated when there are duplicate values in there that you want to keep).
However, if the list of delimited values relates to another table (such as being the id field of another table), then you could do it this way:
SELECT a.UserId, b.link_id
FROM mytable a
LEFT OUTER JOIN my_link_table b
ON FIND_IN_SET(b.link_id, replace(a.ActiveLinks, '|', ','))
I.e., use FIND_IN_SET to join your table to the related table, in this case converting the | delimiters to commas so that FIND_IN_SET can work.

Complicated joining on multiple id's

I have a table like this
id | user_id | code | type | time
---+---------+------+------+------------
 2 |       2 | fdsa | r    | 1358300000
 3 |       2 | barf | r    | 1358311000
 4 |       2 | yack | r    | 1358311220
 5 |       3 | surf | r    | 1358311000
 6 |       3 | yooo | r    | 1358300000
 7 |       4 | poot | r    | 1358311220
I want to get the concatenated 'code' column for user 2 and user 3 for each matching time.
I want to receive a result set like this:
code | time
-------------------------------
fdsayooo 1358300000
barfsurf 1358311000
Please note that there is no yackpoot code because the query was not looking for user 4.
You can use the GROUP_CONCAT function. Try this:
SELECT GROUP_CONCAT(code SEPARATOR '') code, time
FROM tbl
WHERE user_id in (2, 3)
GROUP BY time
HAVING COUNT(time) = 2;
What you are looking for is GROUP_CONCAT, but your question is missing a lot of details needed to provide a good example. This should get you started:
SELECT GROUP_CONCAT(code), time
FROM myTable
WHERE user_id in (2, 3)
GROUP BY time;
Missing details are:
Is there an order required? Not sure how ordering would be done using grouping; would need to test if critical.
Need other fields? If so you will likely end up needing to do a sub-select or secondary query.
Do you only want results with multiple times?
Do you really want no separator between values in the result column? (Specify the delimiter with SEPARATOR '' in the GROUP_CONCAT.)
Notes:
You can add more fields to the GROUP BY if you want to do it by something else (like user_id and time).
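For example, if ordering matters and you want no separator, GROUP_CONCAT accepts both an ORDER BY and a SEPARATOR (a sketch against the myTable name used above):
SELECT GROUP_CONCAT(code ORDER BY user_id SEPARATOR '') AS code, time
FROM myTable
WHERE user_id IN (2, 3)
GROUP BY time
HAVING COUNT(DISTINCT user_id) = 2; -- keep only times where both users have a row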

count rows where date is equal but separated by name

I think it will be easiest to start with the table I have and the result I am aiming for.
Name | Date
A | 03/01/2012
A | 03/01/2012
B | 02/01/2012
A | 02/01/2012
B | 02/01/2012
A | 02/01/2012
B | 01/01/2012
B | 01/01/2012
A | 01/01/2012
I want the result of my query to be:
Name | 01/01/2012 | 02/01/2012 | 03/01/2012
A | 1 | 2 | 2
B | 2 | 2 | 0
So basically I want to count the number of rows that have the same date, but for each individual name. So a simple group by of dates won't do because it would merge the names together. And then I want to output a table that shows the counts for each individual date using php.
I've seen answers suggest something like this:
SELECT
NAME,
SUM(CASE WHEN GRADE = 1 THEN 1 ELSE 0 END) AS GRADE1,
SUM(CASE WHEN GRADE = 2 THEN 1 ELSE 0 END) AS GRADE2,
SUM(CASE WHEN GRADE = 3 THEN 1 ELSE 0 END) AS GRADE3
FROM Rodzaj
GROUP BY NAME
so I imagine there would be a way for me to tweak that but I was wondering if there is another way, or is that the most efficient?
I was perhaps thinking that if the while loop were to output just one specific name and date each time along with the count (so the first result would be A,01/01/2012,1, then A,02/01/2012,2, A,03/01/2012,3, B,01/01/2012,2, etc.), then perhaps that would be doable through a different technique, but I'm not sure if something like that is possible and whether it would be efficient.
So I'm basically looking to see if anyone has any ideas that are a bit outside the box for this and how they would compare.
I hope I explained everything well enough and thanks in advance for any help.
You have to include two columns in your GROUP BY:
SELECT name, COUNT(*) AS count
FROM your_table
GROUP BY name, date
This will get the counts of each name -> date combination in row-format. Since you also wanted to include a 0 count if the name didn't have any rows on a certain date, you can use:
SELECT a.name,
b.date,
COUNT(c.name) AS date_count
FROM (SELECT DISTINCT name FROM your_table) a
CROSS JOIN (SELECT DISTINCT date FROM your_table) b
LEFT JOIN your_table c ON a.name = c.name AND
b.date = c.date
GROUP BY a.name,
b.date
You're asking for a "pivot". Basically, it is what it is. The real problem with a pivot is that the column names must adapt to the data, which is impossible to do with SQL alone.
Here's how you do it:
SELECT
Name,
SUM(`Date` = '01/01/2012') AS `01/01/2012`,
SUM(`Date` = '02/01/2012') AS `02/01/2012`,
SUM(`Date` = '03/01/2012') AS `03/01/2012`
FROM mytable
GROUP BY Name
Note the cool way you can SUM() a condition in MySQL: because in MySQL true is 1 and false is 0, summing a condition is equivalent to counting the number of times it's true.
It is not more efficient to use an inner group by first.
Just in case anyone is interested in what was the best method:
Zane's second suggestion was the slowest; I loaded in only a third of the data I used for the other two and it still took quite a while. Perhaps on smaller tables it would be more efficient; although I am not working with a huge table, roughly 28,000 rows was enough to create significant lag, with the BETWEEN clause dropping the result to about 4,000 rows.
Bohemian's answer required the least code; I threw in a loop to create all the case statements and it worked with relative ease. The benefit of this method is its simplicity: besides creating the loop for the cases, the results come in without the need for any PHP tricks, just a simple foreach to get all the columns. Recommended for those not confident with PHP.
However, I found Zane's first suggestion the quickest, and despite the need for extra PHP coding it seems I will be sticking with this method. The disadvantage is that it only returns the dates that actually have data, so building a table with all the dates becomes a bit more complicated. What I did was keep a variable that tracks which date the current table column is supposed to be, reset on each table row; when the query result matches that date it echoes the value, otherwise it loops, echoing table cells with 0 until the dates do match. It also has to check whether the 'Name' value is still the same and, if not, switch to the next row after filling any missing cells to the end of that row with 0. If anyone is interested in seeing the code you can message me.
Results of the two methods over 3 months of data (a column for each day, so roughly 90 case statements), ~12,000 rows out of 28,000:
Bohemian's Pivot - ~0.158s (highest seen ~0.36s)
Zane's Double Group By - ~0.086s (highest seen ~0.15s)

Formatting a MySQL Query result

I've currently got a table as follows,
Column Type
time datetime
ticket int(20)
agentid int(20)
ExitStatus varchar(50)
Queue varchar(50)
I want to write a query which will break this down by week, providing a column with a count for each ExitStatus. So far I have this,
SELECT ExitStatus,COUNT(ExitStatus) AS ExitStatusCount, DAY(time) AS TimePeriod
FROM `table`
GROUP BY TimePeriod, ExitStatus
Output:
ExitStatus ExitStatusCount TimePeriod
NoAgentID 1 4
Success 3 4
NoAgentID 1 5
Success 5 5
I want to change this so it returns results in this format:
week | COUNT(NoAgentID) | COUNT(Success) |
Ideally, I'd like the columns to be dynamic as other ExitStatus values may be possible.
This information will be formatted and presented to end user in a table on a page. Can this be done in SQL or should I reformat it in PHP?
There is no "general" solution to your problem (called cross tabulation) that can be achieved with a single query. There are four possible solutions:
1. Hardcode all possible ExitStatus values in your query and keep the list updated as you see the need for more of them. For example:
SELECT
Day(Time) AS TimePeriod,
SUM(IF(ExitStatus = 'NoAgentID', 1, 0)) AS NoAgentID,
SUM(IF(ExitStatus = 'Success', 1, 0)) AS Success
-- #TODO: Add others here when/if needed
FROM `table`
WHERE ...
GROUP BY TimePeriod
2. Do a first query to get all possible ExitStatus values and then build your final query in your high-level programming language based on those results.
3. Use a special cross-tabulation module in your high-level programming language. For Perl there is the SQLCrossTab module, but I couldn't find one for PHP.
4. Add another layer to your application using OLAP (multi-dimensional views of your data), such as Pentaho, and query that layer instead of your original data.
You can read a lot more about these solutions, and find an overall discussion of the subject, elsewhere.
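A related variant of option 2 keeps both steps inside MySQL by building the pivot columns with a prepared statement (only a sketch, untested; watch out for group_concat_max_len on long status lists):
-- build one SUM(IF(...)) column per distinct ExitStatus
SELECT GROUP_CONCAT(DISTINCT
         CONCAT('SUM(IF(ExitStatus = ''', ExitStatus, ''', 1, 0)) AS `', ExitStatus, '`'))
INTO @cols
FROM `table`;
SET @sql = CONCAT('SELECT DAY(time) AS TimePeriod, ', @cols,
                  ' FROM `table` GROUP BY TimePeriod');
PREPARE stmt FROM @sql;
EXECUTE stmt;
DEALLOCATE PREPARE stmt;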
This is one way; you can use SUM() to count the number of rows for which a particular condition is true. At the end you just group by the time as normal.
SELECT DAY(time) AS TimePeriod,
SUM('NoAgentID' = exitStatus) AS NoAgentID,
SUM('Success' = exitStatus) AS Success, ...
FROM `table`
GROUP BY TimePeriod
Output:
TimePeriod | NoAgentID | Success
4          | 1         | 3
5          | 1         | 5
The columns here are not dynamic though, which means you have to add conditions as you go along.
SELECT week(time) AS week,
SUM(ExitStatus = 'NoAgentID') AS 'COUNT(NoAgentID)',
SUM(ExitStatus = 'Success') AS 'COUNT(Success)'
FROM `table`
GROUP BY week
I'm making some guesses about how the ExitStatus column works. Also, there are many ways of interpreting "week", such as week of the year, of the month, or of the quarter; you will need to put the appropriate function there.
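For instance, to group by ISO calendar week rather than by day (a sketch; YEARWEEK with mode 3 treats weeks as starting on Monday):
SELECT YEARWEEK(`time`, 3) AS week,
       SUM(ExitStatus = 'NoAgentID') AS `COUNT(NoAgentID)`,
       SUM(ExitStatus = 'Success') AS `COUNT(Success)`
FROM `table`
GROUP BY YEARWEEK(`time`, 3);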