Azure Stream Analytics Querying JSON

I have a problem writing a query to extract a table from the arrays in a JSON file. The problem is how to get the information from the "clientPerformance" array and turn it all into a normal SQL table.
The file looks like this:
"clientPerformance":[
{
"name":"opportunity",
"clientProcess":{
"value":3620000.0,
"count":1.0,
"min":3620000.0,
"max":3620000.0,
"stdDev":0.0,
"sampledValue":3620000.0
},
"networkConnection":{
"value":10000.0,
"count":1.0,
"min":10000.0,
"max":10000.0,
"stdDev":0.0,
"sampledValue":10000.0
},
"receiveRequest":{
"value":9470000.0,
"count":1.0,
"min":9470000.0,
"max":9470000.0,
"stdDev":0.0,
"sampledValue":9470000.0
},
"sendRequest":{
"value":1400000.0,
"count":1.0,
"min":1400000.0,
"max":1400000.0,
"stdDev":0.0,
"sampledValue":1400000.0
},
"total":{
"value":14500000.0,
"count":1.0,
"min":14500000.0,
"max":14500000.0,
"stdDev":0.0,
"sampledValue":14500000.0
},
"url":"https://xxxx",
"urlData":{
"base":"/main.aspx",
"host":"xxxx",
"hashTag":"",
"protocol":"https"
}
}
]
I tried to use the GetArrayElements method and other approaches, but I am never able to access the clientProcess, networkConnection, etc. elements.
For example, I tried this one:
Select
GetRecordPropertyValue(GetArrayElement(Input.clientPerformance, 0), 'name') AS Name,
GetRecordPropertyValue(GetArrayElement(Input.clientPerformance, 1), 'clientProcess.count') AS clientProcessCount,
FROM [app-insights-blob-dev] Input
I would appreciate any help :)

I slightly edited your query to return the clientProcessCount.
(I changed the array index from 1 to 0). Also make sure your JSON object starts with { and ends with }.
Select
GetRecordPropertyValue(GetArrayElement(Input.clientPerformance, 0), 'name') AS Name,
GetRecordPropertyValue(GetArrayElement(Input.clientPerformance, 0), 'clientProcess.count') AS clientProcessCount
FROM [app-insights-blob-dev] Input
For more examples of querying complex objects with ASA, don't hesitate to look at this blog post.
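If the array can contain more than one element and you want one output row per element (the "normal SQL table" you describe), CROSS APPLY with GetArrayElements is the usual pattern. A minimal sketch, assuming the same input alias and field names as above:
SELECT
    flat.ArrayValue.name AS Name,                                              -- record field via dot notation
    GetRecordPropertyValue(flat.ArrayValue, 'clientProcess.count') AS clientProcessCount,
    GetRecordPropertyValue(flat.ArrayValue, 'total.value') AS totalValue,
    flat.ArrayValue.url AS Url
FROM [app-insights-blob-dev] AS Input
CROSS APPLY GetArrayElements(Input.clientPerformance) AS flat                  -- one output row per array element
Each element of clientPerformance becomes its own row, with ArrayValue exposing that element's record fields.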

What JSON format does STRIP_OUTER_ARRAY support?

I have a file composed of a single array containing multiple records.
{
"Client": [
{
"ClientNo": 1,
"ClientName": "Alpha",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "12345"
},
{
"BusinessNo": 2,
"IndustryCode": "23456"
}
]
},
{
"ClientNo": 2,
"ClientName": "Bravo",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "34567"
},
{
"BusinessNo": 2,
"IndustryCode": "45678"
}
]
}
]
}
I load it with the following code:
create or replace stage stage.test
url='azure://xxx/xxx'
credentials=(azure_sas_token='xxx');
create table if not exists stage.client (json_data variant not null);
copy into stage.client_test
from @stage.test/client_test.json
file_format = (type = 'JSON' strip_outer_array = true);
Snowflake imports the entire file as one row.
I would like the COPY INTO command to remove the outer array structure and load the records into separate table rows.
When I load larger files, I hit the size limit for variant and get the error Error parsing JSON: document is too large, max size 16777216 bytes.
If you can import the file into Snowflake, into a single row, then you can use LATERAL FLATTEN on the Client field to generate one row per element in the array.
Here's a blog post on LATERAL and FLATTEN (or you could look them up in the snowflake docs):
https://support.snowflake.net/s/article/How-To-Lateral-Join-Tutorial
If the format of the file is, as specified, a single object with a single property that contains an array with 500 MB worth of elements in it, then importing it may still work; if it does, LATERAL FLATTEN is exactly what you want. That shape is not particularly good for data processing, though, so you might want to massage the data with a text-processing script first if needed.
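As a rough sketch of that approach, assuming the whole file does load as a single row into the stage.client table from your COPY (with its json_data variant column):
SELECT
    c.value:ClientNo::integer    AS client_no,
    c.value:ClientName::string   AS client_name,
    b.value:BusinessNo::integer  AS business_no,
    b.value:IndustryCode::string AS industry_code
FROM stage.client f,
     LATERAL FLATTEN(input => f.json_data:Client) c,          -- one row per element of the Client array
     LATERAL FLATTEN(input => c.value:ClientBusiness) b;      -- one row per nested ClientBusiness element
This is essentially the same pattern as the parse_json example further down, just run against the loaded table instead of a literal.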
RECOMMENDATION #1:
The problem with your JSON is that it doesn't have an outer array. It has a single outer object containing a property with an inner array.
If you can fix the JSON, that would be the best solution, and then STRIP_OUTER_ARRAY will work as you expected.
You could also try to recompose the JSON (an ugly business) after reading it line by line with:
CREATE OR REPLACE TABLE X (CLIENT VARCHAR);
COPY INTO X FROM (SELECT $1 CLIENT FROM @My_Stage/Client.json);
User Response to Recommendation #1:
Thank you. So from what I gather, COPY with STRIP_OUTER_ARRAY can handle a file starting and ending with square brackets, and parse the file as if they were not there.
The real files don't have line breaks, so I can't read the file line by line. I will see if the source system can change the export.
RECOMMENDATION #2:
Also, if you would like to see what the JSON parser does, you can experiment using the code below; I have parsed JSON in the COPY command using similar code. Working with your JSON data in a small project can help you shape the COPY command to work as intended.
CREATE OR REPLACE TABLE SAMPLE_JSON
(ID INTEGER,
DATA VARIANT
);
INSERT INTO SAMPLE_JSON(ID,DATA)
SELECT
1,parse_json('{
"Client": [
{
"ClientNo": 1,
"ClientName": "Alpha",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "12345"
},
{
"BusinessNo": 2,
"IndustryCode": "23456"
}
]
},
{
"ClientNo": 2,
"ClientName": "Bravo",
"ClientBusiness": [
{
"BusinessNo": 1,
"IndustryCode": "34567"
},
{
"BusinessNo": 2,
"IndustryCode": "45678"
}
]
}
]
}');
SELECT
C.value:ClientNo AS ClientNo
,C.value:ClientName::STRING AS ClientName
,ClientBusiness.value:BusinessNo::Integer AS BusinessNo
,ClientBusiness.value:IndustryCode::Integer AS IndustryCode
from SAMPLE_JSON f
,table(flatten( f.DATA,'Client' )) C
,table(flatten(c.value:ClientBusiness,'')) ClientBusiness;
User Response to Recommendation #2:
Thank you for the parse_json example!
Trouble is, the real files are sometimes 500 MB, so the parse_json function chokes.
Follow-up on Recommendation #2:
The JSON needs to be in the NDJSON (http://ndjson.org/) format, i.e. one JSON document per line. Otherwise, with files this large, the JSON becomes impossible to parse.
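For reference, once the export is newline-delimited (one client object per line), a plain JSON COPY loads one row per line and STRIP_OUTER_ARRAY is no longer needed. A sketch reusing the stage from the question (the file name is hypothetical):
copy into stage.client
from @stage.test/client_ndjson.json              -- hypothetical NDJSON export of the same data
file_format = (type = 'JSON');                   -- each line becomes its own row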
Hope the above helps others running into similar questions!

How to parse in Google Sheets a nested JSON structure with fallback when the data isn't available?

I'm getting Yahoo Finance data as a JSON file (via the YahooFinancials python API) and I would like to be able to parse the data in a smart way to feed my Google Sheet.
For this example, I'm interested in getting the "cash" variable under the "date" nested structure. But as you'll see, sometimes there is no "cash" variable under the first date, so I would like the script/formula to go and get the "cash" variable that's under the second date structure.
Here is sample 1 of JSON code:
{ "balanceSheetHistoryQuarterly": {
"ABBV": [
{
"2018-12-31": {
"totalStockholderEquity": -2921000000,
"netTangibleAssets": -45264000000
}
},
{
"2018-09-30": {
"intangibleAssets": 26625000000,
"capitalSurplus": 14680000000,
"totalLiab": 69085000000,
"totalStockholderEquity": -2921000000,
"otherCurrentLiab": 378000000,
"totalAssets": 66164000000,
"commonStock": 18000000,
"otherCurrentAssets": 112000000,
"retainedEarnings": 6789000000,
"otherLiab": 16511000000,
"goodWill": 15718000000,
"treasuryStock": -24408000000,
"otherAssets": 943000000,
"cash": 8015000000,
"totalCurrentLiabilities": 15387000000,
"shortLongTermDebt": 1026000000,
"otherStockholderEquity": -2559000000,
"propertyPlantEquipment": 2950000000,
"totalCurrentAssets": 18465000000,
"longTermInvestments": 1463000000,
"netTangibleAssets": -45264000000,
"shortTermInvestments": 770000000,
"netReceivables": 5780000000,
"longTermDebt": 37187000000,
"inventory": 1786000000,
"accountsPayable": 10981000000
}
},
{
"2018-06-30": {
"intangibleAssets": 26903000000,
"capitalSurplus": 14596000000,
"totalLiab": 65016000000,
"totalStockholderEquity": -3375000000,
"otherCurrentLiab": 350000000,
"totalAssets": 61641000000,
"commonStock": 18000000,
"otherCurrentAssets": 128000000,
"retainedEarnings": 5495000000,
"otherLiab": 16576000000,
"goodWill": 15692000000,
"treasuryStock": -23484000000,
"otherAssets": 909000000,
"cash": 3547000000,
"totalCurrentLiabilities": 17224000000,
"shortLongTermDebt": 3026000000,
"otherStockholderEquity": -2639000000,
"propertyPlantEquipment": 2787000000,
"totalCurrentAssets": 13845000000,
"longTermInvestments": 1505000000,
"netTangibleAssets": -45970000000,
"shortTermInvestments": 196000000,
"netReceivables": 5793000000,
"longTermDebt": 31216000000,
"inventory": 1580000000,
"accountsPayable": 10337000000
}
},
{
"2018-03-31": {
"intangibleAssets": 27230000000,
"capitalSurplus": 14519000000,
"totalLiab": 65789000000,
"totalStockholderEquity": 3553000000,
"otherCurrentLiab": 125000000,
"totalAssets": 69342000000,
"commonStock": 18000000,
"otherCurrentAssets": 17000000,
"retainedEarnings": 4977000000,
"otherLiab": 17250000000,
"goodWill": 15880000000,
"treasuryStock": -15961000000,
"otherAssets": 903000000,
"cash": 9007000000,
"totalCurrentLiabilities": 17058000000,
"shortLongTermDebt": 6024000000,
"otherStockholderEquity": -2630000000,
"propertyPlantEquipment": 2828000000,
"totalCurrentAssets": 20444000000,
"longTermInvestments": 2057000000,
"netTangibleAssets": -39557000000,
"shortTermInvestments": 467000000,
"netReceivables": 5841000000,
"longTermDebt": 31481000000,
"inventory": 1738000000,
"accountsPayable": 10542000000
}
}
]
}
}
The first date structure (under 2018-12-31) doesn't contain the cash variable. So I would like the Google sheet to go and search for the same data in 2018-09-30 and if not available go and search in 2018-06-30.
Or it could just scan the nested date structures and fetch the first "cash" occurrence found.
Basically, I would like to know how to skip the name of the date variable (i.e. 2018-12-31), since it doesn't really matter, and just make the formula look for the first available "cash" variable.
Main questions recap:
How to skip mentioning an exact nested level name and scan what's inside?
How to keep scanning until you find the desired variable with a value that is not "null" (this can happen)?
What would be the entire formula to achieve the following logic: scan the JSON file until you find the value; if no value is found, fall back to this IMPORTXML function that calls an external API.
Let me know if you need more context about the issue and thanks in advance for your help :)
EDIT: this is the IMPORTJSON formula I use in the cell of the spreadsheet right now.
=ImportJSON("https://api.myjson.com/bins/8mxvi", "/financial/balanceSheetHistoryQuarterly/ABBV/2018-31-12/cash", "noHeaders")
Obviously, this one returns an error as there is nothing under that date. The link above is also the actual JSON I'm using right now.
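One suggested approach (assuming the raw JSON text has been imported into cell A1) splits the text into lines, keeps the lines containing "cash", and extracts the value after the colon; the second formula below does the same directly from IMPORTDATA and returns only the first match: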
=REGEXEXTRACT(FILTER(
TRANSPOSE(SPLIT(SUBSTITUTE(A1, ","&CHAR(10), "×"), "×")),
ISNUMBER(SEARCH("*"&"cash"&"*",
TRANSPOSE(SPLIT(SUBSTITUTE(A1, ","&CHAR(10), "×"), "×"))))), ": (.+)")
=INDEX(ARRAYFORMULA(SUBSTITUTE(REGEXEXTRACT(FILTER(TRANSPOSE(SPLIT(SUBSTITUTE(
TRANSPOSE(IMPORTDATA("https://api.myjson.com/bins/8mxvi")), ","&CHAR(10), "×"), "×")),
ISNUMBER(SEARCH("*"&"cash"&"*", TRANSPOSE(SPLIT(SUBSTITUTE(
TRANSPOSE(IMPORTDATA("https://api.myjson.com/bins/8mxvi")), ","&CHAR(10), "×"), "×"))))),
":(.+)"), ",", "")), 1, 1)

Google books api returns missing parameters

I am making a React app that searches for a book by title and returns the results.
It's mostly working fine, but for some searched titles (such as "hello") it can't render the results because parameters are missing.
Specifically, the "amount" value is missing, and it returns e-books that are not for sale even if I add the filter=paid-ebooks param when fetching the API. Using projection=full doesn't help either.
For example, when I call the API with
https://www.googleapis.com/books/v1/volumes?printType=books&filter=paid-ebooks&key=${APIKEY}
and use the fetched data inside the books array in React:
this.props.books.map((book, index) => {
return (
<CardItem
key={index}
title={book.volumeInfo.title}
authors={book.volumeInfo.authors ?
book.volumeInfo.authors.join(', ') :
"Not provided"}
price={book.saleInfo.listPrice.amount}
publisher={book.volumeInfo.publisher}
addToCart={() =>
this.props.addItem(this.props.books[index])}
/>
)
})
One of the results it gets is like this:
"saleInfo": {
"country": "TR",
"saleability": "NOT_FOR_SALE",
"isEbook": false
}
While what's expected is:
"saleInfo": {
"country": "TR",
"saleability": "FOR_SALE",
"isEbook": true,
"listPrice": {
"amount": 17.23,
"currencyCode": "TRY"
}
}
And trying to use such an API response throws the error:
TypeError: Cannot read property 'amount' of undefined
price={book.saleInfo.listPrice.amount}
As you can see in the React code, this issue comes up with the authors parameter too, which I've worked around as shown in the code. But I cannot do the same with amount. Is this a known error in the Google Books API, or is there a way to prevent it? I don't understand why it still returns e-books that are not for sale even with the filter=paid-ebooks param.
I have not dug into the API documentation. An ideal solution would be a query param that only sends back books with a list price (like you tried with filter=paid-ebooks). Since that's not working, a simple fix is to filter your results once you get them.
Assuming the response contains an array of book objects, it would look something like this:
const paidBooks = apiResponse.data.filter(book => book.saleInfo && book.saleInfo.listPrice)
This will take the response from the API and keep only the books that have a truthy saleInfo.listPrice value.
That's right. I've actually never used React, but following the same logic you could also wrap the access in try { } catch (error) { } to handle the missing data.

How to ignore duplicates when inserting JSON data into MongoDB based on multiple conditions

[
{
"RollNo":1,
"name":"John",
"age":20,
"Hobby":"Music",
"Date":"9/05/2018"
"sal":5000
},
{
"RollNo":2,
"name":"Ravi",
"age":25,
"Hobby":"TV",
"Date":"9/05/2018"
"sal":5000
},
{
"RollNo":3,
"name":"Devi",
"age":30,
"Hobby":"cooking",
"Date":"9/04/2018"
"sal":5000
}
]
Above is the JSON file I need to insert into MongoDB. Similar JSON data is already in my MongoDB collection named 'Tests'. I have to ignore the records that are already in MongoDB, based on this condition:
[RollNo in MongoDB == RollNo in the JSON to insert && Hobby in MongoDB == Hobby in the JSON to insert && Date in MongoDB == Date in the JSON to insert].
If this condition matches, I need to ignore the insertion; otherwise I need to insert the data into the DB.
I am using Node.js. Can anyone please help me do this?
If you are using Mongoose, then use upsert.
db.people.update(
{ RollNo: 1 },
{
"RollNo":1,
"name":"John",
"age":20,
"Hobby":"Music",
"Date":"9/05/2018"
"sal":5000
},
{ upsert: true }
)
But to avoid inserting the same document more than once, only use upsert: true if the query field is uniquely indexed.
The easiest and safest way to do this is by using a compound index.
You can create a compound index like this:
db.people.createIndex( { "RollNo": 1, "Hobby": 1, "Date" : 1 }, { unique: true } )
Then duplicate inserts will produce an error, which you need to handle in your code.

Possible to chain results in N1QL?

I'm currently trying to do a bit of complex N1QL for a project I'm working on. Theoretically I could do all of this processing with multiple N1QL calls, parsing the results each time, but if possible I'd like this to be contained in one call.
What I would like to do is:
filter all documents that contain a "dataSync.test.id" field with more than 1 id
Read back all other ids in that list
Use that list to get other documents containing those ids
Get the "dataSync.test._channels" field for those documents (optionally a filter by docType might help parsing)
This would probably return a list of "dataSync.test._channels"
Is this possible in N1QL? It seems like it might be, but I can't get the syntax right.
My data structures look a little like this:
{
"dataSync": {
"test": {
"_channels": [
"RP"
],
"id": [
"dataSync_user_1015",
"dataSync_user_1010",
"dataSync_user_1005"
],
"_lastUpdatedBy": "TEST"
}
},
...
}
{
"dataSync": {
"test": {
"_channels": [
"RSD"
],
"id": [
"dataSync_user_1010"
],
"_lastUpdatedBy": "TEST"
}
},
...
}
Yes, I think you can do all of this.
The initial set of IDs, with filtering, can be retrieved as a subquery, and then you can get the subsequent documents with joins.
SELECT fulldoc
FROM (select meta().id as dockey from doc where a=1) as mydoc
INNER JOIN doc fulldoc ON KEYS mydoc.dockey;
There are optimizations that can be done here. Try the sequencing first to make sure it gets the job done.
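Mapped onto the structures in the question, the chain might look like the sketch below. It assumes the bucket is called mybucket (hypothetical name) and that the values in dataSync.test.id are document keys, as the dataSync_user_... values and the ON KEYS join above suggest:
SELECT DISTINCT other.dataSync.test._channels
FROM mybucket AS d
UNNEST d.dataSync.test.id AS relatedId        /* one row per id in the source doc's list */
JOIN mybucket AS other ON KEYS relatedId      /* fetch the documents those ids point to */
WHERE ARRAY_LENGTH(d.dataSync.test.id) > 1;
The WHERE clause covers the "more than 1 id" filter, and the projection returns the _channels arrays of the joined documents. An extra AND other.docType = '...' predicate (hypothetical field) would cover the optional docType filter.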