Handling big JSONs in Azure Data Factory - json

I'm trying to use ADF for the following scenario:
a JSON file containing an array of similar objects is uploaded to an Azure Storage Blob
this JSON is read by ADF with a Lookup Activity and uploaded via a Web Activity to an external sink
I cannot use the Copy Activity, because I need to create a JSON payload for the Web Activity, so I have to look up the array and paste it in like this (payload of the Web Activity):
{
    "some field": "value",
    "some more fields": "value",
    ...
    "items": @{activity('GetJsonLookupActivity').output.value}
}
The Lookup Activity has a known limitation: an upper limit of 5,000 rows at a time. If the JSON is larger, only the top 5,000 rows will be read and everything else will be ignored.
I know this, so I have a system that chops payloads into chunks of 5000 rows before uploading to storage. But I'm not the only user, so there's a valid concern that someone else will try uploading bigger files and the pipeline will silently pass with a partial upload, while the user would obviously expect all rows to be uploaded.
I've come up with two concepts for a workaround, but I don't see how to implement either:
Is there any way for me to check if the JSON file is too large and fail the pipeline if so? The Lookup Activity doesn't seem to allow row counting, and the Get Metadata Activity only returns the size in bytes.
Alternatively, the MSDN docs propose a workaround of copying data in a foreach loop. But I cannot figure out how I'd use Lookup to first get rows 1-5000 and then 5001-10000 etc. from a JSON. It's easy enough with SQL using OFFSET N FETCH NEXT 5000 ROWS ONLY, but how to do it with a JSON?

You can't set an index range (1-5,000, 5,001-10,000) when you use the Lookup Activity. In my opinion, the workaround mentioned in the doc doesn't mean you can use the Lookup Activity with pagination.
My workaround is to write an Azure Function that gets the total length of the JSON array before the data transfer. Inside the Azure Function, divide the data into separate temporary sub-files, e.g. sub1.json, sub2.json, and so on, then output an array containing the file names.
Grab that array with a ForEach Activity and execute the Lookup Activity inside the loop; the file path can be set as a dynamic value. Then run the subsequent Web Activity.
Surely my idea could be improved. For example, if the total length of the JSON array is under the 5,000-row limit, the function could just return {"NeedIterate":false}. Evaluate that response with an If Condition Activity to decide which path to take next: if the value is false, execute the Lookup Activity directly. Everything else can be split between the two branches.
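For illustration, here is a minimal sketch of such an Azure Function in Python, assuming the uploaded blob is a top-level JSON array. The container name, chunk size, query parameter and connection-string setting are assumptions, not details from the original answer:

import json
import os

import azure.functions as func
from azure.storage.blob import BlobServiceClient

CHUNK_SIZE = 5000      # Lookup Activity row limit
CONTAINER = "uploads"  # assumed container name

def main(req: func.HttpRequest) -> func.HttpResponse:
    # Blob to split, passed by the pipeline, e.g. ?blob=input/big.json (assumed contract)
    blob_name = req.params.get("blob")
    service = BlobServiceClient.from_connection_string(os.environ["AzureWebJobsStorage"])
    container = service.get_container_client(CONTAINER)

    # Assumes the blob holds a top-level JSON array of row objects
    items = json.loads(container.download_blob(blob_name).readall())

    if len(items) <= CHUNK_SIZE:
        # Under the limit: no iteration needed, the Lookup can read the original file
        body = {"NeedIterate": False, "files": [blob_name]}
        return func.HttpResponse(json.dumps(body), mimetype="application/json")

    # Split into sub1.json, sub2.json, ... and return the chunk file names
    chunk_names = []
    for i in range(0, len(items), CHUNK_SIZE):
        chunk_name = "{}.sub{}.json".format(blob_name, i // CHUNK_SIZE + 1)
        container.upload_blob(chunk_name, json.dumps(items[i:i + CHUNK_SIZE]), overwrite=True)
        chunk_names.append(chunk_name)

    body = {"NeedIterate": True, "files": chunk_names}
    return func.HttpResponse(json.dumps(body), mimetype="application/json")

The If Condition Activity can then branch on NeedIterate, and in the true branch the ForEach Activity iterates over files, pointing each Lookup at one chunk via a dynamic dataset path.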

Using Flow to return a SharePoint list to Powerapps

I've used this Flow to return the count of any Sharepoint list to Powerapps.
https://masteroffice365.com/get-sharepoint-library-or-list-total-items-from-power-apps/
How would I modify it to return the contents of a list to Powerapps, so that I can use Powerapps to put it into a collection?
Would this mean I don't have to worry about Delegation if the list has more than 2000 items?
This is what I've tried so far.
There is a variable TotalItemsCount which I have renamed to ListItems. Instead of using an Integer, I set ListItems to an Array.
In the Get Library List Contents step I use this for the URI:
concat( '_api/web/lists/GetbyTitle(''', first( body('Filter_Library_List_Being_Queried') )?['displayName'], ''')/Items' )
I'm not sure what to put in as the last step, given that I want it to be able to return the contents of any list. I think this rules out a Parse JSON step, as that requires a definite schema.
I've added an Apply to each step.
I'm getting this error message when it runs:
ExpressionEvaluationFailed. The execution of template action
'Apply_to_each' failed: the result of the evaluation of 'foreach'
expression '#body('Get_Library_List_Contents')' is of type 'Object'.
The result must be a valid array.
I don't think you can return an array back to PowerApps. You would have to return the response as a JSON string, then have your PowerApp do the logic to convert the JSON string into a collection.
Likely your PowerApp would have to include something like this to convert the JSON string that's returned from the flow:
ClearCollect(
    collectionName,
    MatchAll(
        JSON_String,
        "\{""date"":""(?<date>[^""]*)"",""message"":""(?<message>[^""]*)"",""user"":""(?<user>[^""]*)""\}"
));
Flow returning response body to PowerApp
If I understand you correctly, you want to retrieve 14000 records from the Sharepoint list, and not just the total count.
Would this mean I don't have to worry about Delegation if the list has more than 2000 items?
Yes, when you use a cloud flow rather than accessing the SharePoint list directly from Power Apps, you basically avoid the 2,000-record delegation limit.
Now, coming back to your main topic of retrieving records: you would have to test-run your flow and check what the HTTP call below returns. I believe it returns a JSON array.
concat( '_api/web/lists/GetbyTitle(''', first( body('Filter_Library_List_Being_Queried') )?['displayName'], ''')/Items' )
You would have to apply a ForEach or clean up your JSON output so that it returns a string array or JSON array containing all 14K records.
In addition, if you are using SharePoint Online, why not use the SharePoint connector for Flow mentioned here?

Neo4J Cypher - Is it quicker to load from 100k Json Files or 1 file with 100k entries?

I am performing a daily load of 100k+ json files into a neo4j database which is taking approximately 2 to 3 hours each day.
I would like to know whether neo4j would run quicker if the files were all rolled into one large file and then iterated through by the database?
I will need to learn how to do this in Python if so, but I would just like to know this before embarking on the work.
This is the current code snippet I use to load the files; the range can change each day, based on generated filenames which are derived from IDs in the JSON records.
UNWIND range(215300000,215457000) as id
WITH DISTINCT id+"_20220103.json" as file
CALL apoc.load.json("file:///output/"+file,null, {failOnError:false})
YIELD value
Thank you!
The JSON construction in Python was updated to include all 150k+ JSON objects in one file, and the Cypher was updated to iterate over that file and run the code against each JSON object. I initially tried a batch size of 1000 and then 100, but both resulted in many lock exceptions, presumably because parallel batches were attempting to update the same nodes at the same time. I have therefore reduced the batch size down to 1, and it loads about 99% of the JSON objects on a first pass in 7 minutes... much better than the initial 2 to 3 hours :-)
Code I am now using:
CALL apoc.periodic.iterate(
'CALL apoc.load.json("file:///20220107.json") YIELD value',
'UNWIND value as item.... perform other actions...
',{ batchSize:1, parallel:true})
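Since the question mentions learning how to do the consolidation in Python, here is a minimal sketch of that merge step. The source directory, filename pattern and output path are assumptions rather than details from the post:

import json
from pathlib import Path

SOURCE_DIR = Path("output")                 # assumed directory holding the per-record files
OUTPUT_FILE = Path("output/20220107.json")  # assumed name of the combined file

merged = []
for path in sorted(SOURCE_DIR.glob("*_20220103.json")):
    with path.open() as f:
        merged.append(json.load(f))         # one JSON object per source file

with OUTPUT_FILE.open("w") as f:
    json.dump(merged, f)                    # a single array that apoc.load.json can UNWIND

print("Wrote", len(merged), "objects to", OUTPUT_FILE)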

Losing data from Source to Sink in Copy Data

In my MS Azure Data Factory, I have a REST API connection to a nested JSON dataset.
The Source "Preview data" shows all data. (7 orders from the online store)
In the "Activity Copy Data", is the menu tab "Mapping" where I map JSON fields with the sink SQL table columns. If I under "Collection Reference" I select None, all 7 orders are copied over.
But if I want the nested metadata, I select the meta field in "Collection Reference", then I get my nested data, in multiple order lines, each with a one metadata point, but I only get data from 1 order, not 7
I think I have a reason for my problem. One of the fields in the nested meta data, is both a string and array. But I still don't have a solution
[screenshot of the metadata]
Your sense is right; it is caused by your nested metadata structure. Based on the description of the Collection Reference property:
If you want to iterate and extract data from the objects inside an array field with the same pattern and convert to per row per object, specify the JSON path of that array to do cross-apply. This property is supported only when hierarchical data is source.
"Same pattern" is the key point here, I think. However, judging by your screenshot, the data inside your metadata array does not follow the same pattern.
My workaround is to use Azure Blob Storage as a transition: REST API ---> Azure Blob Storage ---> your sink dataset. Inside the Blob Storage dataset, you can flatten the incoming JSON data with the Cross-apply nested JSON array setting:
You could refer to this blog to learn about this feature. Then you can copy the flattened data into your destination.
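To make the cross-apply behaviour concrete, here is a small Python illustration of what flattening one order with a nested metadata array into per-row-per-object records means; the field names are invented for the example:

# One order with a nested metadata array (field names are made up)
order = {
    "order_id": 1001,
    "metadata": [
        {"key": "colour", "value": "red"},
        {"key": "size", "value": "XL"},
    ],
}

# Cross-apply: one output row per metadata entry, with the order fields repeated
rows = [
    {"order_id": order["order_id"], "meta_key": m["key"], "meta_value": m["value"]}
    for m in order["metadata"]
]

for row in rows:
    print(row)
# {'order_id': 1001, 'meta_key': 'colour', 'meta_value': 'red'}
# {'order_id': 1001, 'meta_key': 'size', 'meta_value': 'XL'}

This only works cleanly when every element of the array has the same shape, which is exactly the "same pattern" requirement quoted above.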

Read Excel data .csv file sequentially in Jmeter

I have a scenario where I need to use data from each row for validation via an HTTP Request. I have tried the CSV Data Set Config, but it reads only the first row for the iteration.
I have a single iteration, and all my samplers are in a single thread group. The data from the CSV file is retrieved sequentially only when I set the number of iterations to a value such as 3 (one row is taken per iteration).
How can I read the CSV file rows sequentially within a single iteration, where the thread group contains many HTTP Requests and I need the value from a different row for each request?
Kindly suggest a solution.
As per CSV Data Set Config documentation:
By default, the file is only opened once, and each thread will use a different line from the file. However the order in which lines are passed to threads depends on the order in which they execute, which may vary between iterations. Lines are read at the start of each test iteration. The file name and mode are resolved in the first iteration.
So implementing your scenario using the CSV Data Set Config doesn't seem to be possible. I would recommend considering JMeter functions instead, such as:
__StringFromFile()
__CSVRead()
These functions read the next line from the file each time they're called, so you can use them instead of the CSV Data Set Config in each of the HTTP Request samplers. Check out Apache JMeter Functions - An Introduction to get familiar with the JMeter Functions concept.

Random selection from CSV file in Jmeter

I have a very large CSV file (8000+ items) of URLs that I'm reading with a CSV Data Set Config element. It is populating the path of an HTTP Request sampler and iterating through with a while controller.
This is fine, except that I want each user (thread) to pick a random URL from the CSV URL list. What I don't want is each thread using CSV items sequentially.
I was able to achieve this with a Random Order Controller containing multiple HTTP Request samplers; however, 8000+ HTTP Samplers really bogged JMeter down to an unusable state. That is why I put the HTTP Sampler URLs in the CSV file. It doesn't appear that I can use the Random Order Controller with the CSV file data, however. So how can I achieve random CSV data item selection per thread?
There is another way to achieve this:
create a separate thread group
depending on what you want to achieve:
add a (random) loop count -> this will set a start offset for the thread group that does the work
add a loop count or forever and a timer and let it loop while the other thread group is running. This thread group will read a 'pseudo' random line
It's not really random (the file is still read sequentially), but your worker thread makes jumps in the file. It worked for me ;-)
There's no random selection function when reading CSV data. The reason is that you would need to read the whole file into memory first to do this, and that's a bad idea with a load-test tool (any load-test tool).
Other commercial tools solve this problem by automatically re-processing the data. In JMeter you can achieve the same manually by simply sorting the data on an arbitrary field: if you sort by, say, Surname, the result is an effectively random distribution.
Note: if you keep the default "All threads" sharing mode for the CSV Data Set Config, the data will be unique within the scope of the JMeter process.
The new Random CSV Data Set Config from the BlazeMeter plugin should fit your needs perfectly.
As other answers have stated, the reason you're not able to select a line at random is that you would have to read the whole file into memory, which is inefficient.
Rather than trying to get JMeter to handle this on the fly, why not just randomise the file order itself before you start the test?
A scripting language such as perl makes short work of this:
cat unrandom.csv | perl -MList::Util=shuffle -e 'print shuffle<STDIN>' > random.csv
For my case:
single column
small dataset
Non-changing CSV
I just discarded the CSV approach and, following https://stackoverflow.com/a/22042337/6463291, used a BeanShell PreProcessor instead, something like this:
// Candidate values that would otherwise live in the CSV file
String[] query = new String[]{"csv_element1", "csv_element2", "csv_element3"};
// Pick one at random and expose it to the samplers as ${randomOption}
Random random = new Random();
int i = random.nextInt(query.length);
vars.put("randomOption", query[i]);
Performance seems OK; if you have the same issue, you can try this out.
I am not sure if this will work, but I will suggest it anyway.
Why not divide your URLs into 100 different CSV files? Then, in each thread, generate a random number and use it to identify which CSV file to read via the __CSVRead function.
http://jmeter.apache.org/usermanual/functions.html#__CSVRead
The only part I am not sure about is whether the __CSVRead function reopens the file every time or shares the same file handle across the threads.
You may want to try it. Please share your findings.
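If you go the multiple-files route, a small Python sketch for splitting the master CSV into numbered files could look like this; the count of 100 comes from the suggestion above, while the file names are assumptions:

import csv
from pathlib import Path

SOURCE = Path("urls.csv")   # assumed master file, one URL per row
PARTS = 100                 # number of smaller files suggested above

with SOURCE.open(newline="") as f:
    rows = list(csv.reader(f))

for part in range(PARTS):
    # Round-robin the rows into urls_0.csv ... urls_99.csv
    with Path("urls_{}.csv".format(part)).open("w", newline="") as out:
        csv.writer(out).writerows(rows[part::PARTS])

Each JMeter thread can then pick a random number between 0 and 99 and read from the matching file.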
A much more straightforward solution:
In the CSV file, add another column (say B).
Apply the =RAND() function in the first cell of column B (say B1). This will create a random float number.
Drag the corner of that cell (B1) down to apply the formula to all the corresponding URLs.
Sort by column B.
Your URLs will now be sorted randomly.
Delete column B.