Parsing website search history JSON data in Power BI

I am doing some analytics on a certain website, and I'm trying to get people's search terms into an easier format. They're stored as JSON, and I've already parsed the column several times to get other relevant data. The problem is that when I try to parse the keywords/search data, each search word goes into its own column, and with tens of thousands of searches that becomes unmanageable. It's hard to explain, so I've added a picture that illustrates the problem.
The "Ai O" and "JAHAHASHAS" are searches that I did on the website, rest are hidden, but the list goes on forever
https://i.imgur.com/5IXTXn1.png
Here is an example of the JSON for a person searching for "business model":
"{""sort"": ""lastUpdated"", ""limit"": 5, ""fields"": {""keywords"": {""business model"": true}}, ""offset"": 0}"
I tried to parse the keywords column anyway, but creating thousands of new columns doesn't really work.

I added some extra records to your JSON file; I hope it matches what you have to work with in reality. In the future, you'll get faster and better results if you're willing to share a bit more concrete information about your project. In most cases, it's very difficult to provide meaningful advice if we can't reasonably recreate your situation.
Anyway, I wanted the solution to work against multiple search terms, so I tweaked the JSON you provided to look like this:
{ "Searches": [
{"sort": "lastUpdated", "limit": 5, "fields": {"keywords": {"business model": true}}, "offset": 0},
{"sort": "lastUpdated", "limit": 5, "fields": {"keywords": {"Alpha Beta": true}}, "offset": 0},
{"sort": "lastUpdated", "limit": 5, "fields": {"keywords": {"Gamma Delta Epsilon": true}}, "offset": 0}
]}
and then used the following Power Query (taken from the Advanced Editor):
let
    Source = Json.Document(File.Contents("C:\Users\XXXXXX\Desktop\foo.json")),
    #"Converted to Table" = Table.FromList(Source[Searches], Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table", "Column1", {"fields"}, {"Fields"}),
    #"Expanded Fields" = Table.ExpandRecordColumn(#"Expanded Column1", "Fields", {"keywords"}, {"keywords"}),
    #"Added Custom" = Table.AddColumn(#"Expanded Fields", "keywords_text", each Record.FieldNames([keywords])),
    #"Extracted Values" = Table.TransformColumns(#"Added Custom", {"keywords_text", each Text.Combine(List.Transform(_, Text.From), " "), type text}),
    #"Added Custom1" = Table.AddColumn(#"Extracted Values", "keywords_text_list", each Text.Split([keywords_text], " ")),
    Custom1 = List.Combine(#"Added Custom1"[keywords_text_list])
in
    Custom1
After loading the data from the file, I extracted the records (the searches, that is) from the JSON.
Next, having converted the data to a table, I expanded twice to drill down from the records to fields, and from fields to keywords. Then I added a custom column and used the M Record.FieldNames function to generate a list of the field names in each keywords record. Keep in mind that each list here has only one element, even though that element's text may contain spaces.
Then I extracted the values from those lists, and followed that up by splitting those strings back into lists -- this time using space as the delimiter.
As a final step, I combined the multiple lists into a single list of all our keywords as single elements.
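One optional extra step, in case you want the result loaded as a query rather than a bare list (this is my assumption about the goal, not part of the original question): Table.FromList can wrap the combined list into a single-column table.
    // optional: replace the final "in Custom1" above with this step
    KeywordTable = Table.FromList(Custom1, Splitter.SplitByNothing(), {"Keyword"})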
So, that's my approach. Hope it helps.

Related

Handling Pagination from API (JSON) with Excel Query

I have searched on a few sites, and the suggestions I'm finding don't fit my situation. In many cases, such as how to export an HTML table to Excel, there are only limited responses on pagination. I am able to pull data from a website using an API key, but the results are paginated. I have been able to adjust my query to pull 100 records per page (the default is 25) and can input a page number to pull a selected page, but I have been unsuccessful in pulling everything as one table. Currently one of the data sets is over 800 records, so my workaround is 8 queries, each pulling down a separate page, and then using the query amalgamate function to group them into one table. I have a new project that will likely return several thousand line items, and I would prefer a simpler way to handle this.
This is the current code:
let
    Source = Json.Document(Web.Contents("https://api.keeptruckin.com/v1/vehicles?access_token=xxxxxxxxxxxxxx&per_page=100&page_no=1", [Headers=[#"X-Api-Key"="f4f1f1f0-005b-4fbb-a525-3144ba89e1f2", #"Content-Type"="application/x-www-form-urlencoded"]])),
    vehicles = Source[vehicles],
    #"Converted to Table" = Table.FromList(vehicles, Splitter.SplitByNothing(), null, null, ExtraValues.Error),
    #"Expanded Column1" = Table.ExpandRecordColumn(#"Converted to Table", "Column1", {"vehicle"}, {"Column1.vehicle"}),
    #"Expanded Column1.vehicle" = Table.ExpandRecordColumn(#"Expanded Column1", "Column1.vehicle", {"id", "company_id", "number", "status", "ifta", "vin", "make", "model", "year", "license_plate_state", "license_plate_number", "metric_units", "fuel_type", "prevent_auto_odometer_entry", "eld_device", "current_driver"}, {"Column1.vehicle.id", "Column1.vehicle.company_id", "Column1.vehicle.number", "Column1.vehicle.status", "Column1.vehicle.ifta", "Column1.vehicle.vin", "Column1.vehicle.make", "Column1.vehicle.model", "Column1.vehicle.year", "Column1.vehicle.license_plate_state", "Column1.vehicle.license_plate_number", "Column1.vehicle.metric_units", "Column1.vehicle.fuel_type", "Column1.vehicle.prevent_auto_odometer_entry", "Column1.vehicle.eld_device", "Column1.vehicle.current_driver"})
in
    #"Expanded Column1.vehicle"

How can I use RegEx to extract data within a JSON document

I am no RegEx expert. I am trying to understand whether I can use RegEx to find a block of data in a JSON file.
My Scenario:
I am using an AWS RDS instance with enhanced monitoring. The monitoring data is sent to a CloudWatch log stream, and I am trying to make the data posted to CloudWatch visible in the log management solution Loggly.
The ingestion is no problem; I can see the data in Loggly. However, the whole message is contained in one big blob field, and the field content is a JSON document. I am trying to figure out whether I can use RegEx to extract only certain parts of that JSON document.
Here is a sample extract from the JSON payload I am using:
{
    "engine": "MySQL",
    "instanceID": "rds-mysql-test",
    "instanceResourceID": "db-XXXXXXXXXXXXXXXXXXXXXXXXX",
    "timestamp": "2017-02-13T09:49:50Z",
    "version": 1,
    "uptime": "0:05:36",
    "numVCPUs": 1,
    "cpuUtilization": {
        "guest": 0,
        "irq": 0.02,
        "system": 1.02,
        "wait": 7.52,
        "idle": 87.04,
        "user": 1.91,
        "total": 12.96,
        "steal": 2.42,
        "nice": 0.07
    },
    "loadAverageMinute": {
        "fifteen": 0.12,
        "five": 0.26,
        "one": 0.27
    },
    "memory": {
        "writeback": 0,
        "hugePagesFree": 0,
        "hugePagesRsvd": 0,
        "hugePagesSurp": 0,
        "cached": 505160,
        "hugePagesSize": 2048,
        "free": 2830972,
        "hugePagesTotal": 0,
        "inactive": 363904,
        "pageTables": 3652,
        "dirty": 64,
        "mapped": 26572,
        "active": 539432,
        "total": 3842628,
        "slab": 34020,
        "buffers": 16512
    },
My Question
My question is: can I use RegEx to extract, say, a subset of the document? For example, the cpuUtilization or memory blocks? If that is possible, how do I write the RegEx? If it works, I could then use it to drill down into the extracted document to get individual data elements as well.
Many thanks for your help.
First, I agree with Sebastian: a proper JSON parser is better.
Anyway, sometimes the dirty approach must be used. If your text layout will not change, then a regexp is simple:
E.g. "total": (\d+\.\d+) gets the CPU usage, and "total": (\d\d\d+) gets the total memory usage (matching at least 3 digits so as not to match the first "total" text; memory will probably never be less than 100 :-).
If changes are to be expected, make it a bit more stable: ["']total["']\s*:\s*(\d+\.\d+).
It may also be possible to match against newline characters like this: "cpuUtilization"\s*:\s*\{\s*\n.*\n\s*"irq"\s*:\s*(\d+\.\d+), making it a bit more stable (this time for the irq value).
And so on and so on.
You can see how quickly you get into very complex expressions. This approach is very fragile!
P.S. Depending on the exact details of Loggly's regex flavor, the specifics may change. The examples above are based on Perl.
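For illustration, a minimal Perl sketch of those patterns; the heredoc is a stand-in for your real payload, truncated to just the two "total" keys:
use strict;
use warnings;

# stand-in for the real CloudWatch payload, truncated to the two "total" keys
my $log = <<'END';
"cpuUtilization": { "irq": 0.02, "total": 12.96 },
"memory": { "cached": 505160, "total": 3842628 }
END

# (\d+\.\d+) requires a decimal point, so this matches the CPU value first
print "cpu total: $1\n" if $log =~ /["']total["']\s*:\s*(\d+\.\d+)/;
# (\d\d\d+) requires at least three consecutive digits, so it skips 12.96
print "memory total: $1\n" if $log =~ /["']total["']\s*:\s*(\d\d\d+)/;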

Check if JSON string in Orbeon repeating grid contains a specific value

I am working with repeating grids through the form builder.
I have a custom control that holds a string value represented as JSON:
{
    "data": {
        "type": "File",
        "itemID": "12345",
        "name": "Annual Summary",
        "parentFolderID": "fileID",
        "owner": "Owner",
        "lastModifiedDate": "2016-10-17 22:48:05Z"
    }
}
In the controls outside of the repeating grid, I need to check whether name = "Annual Summary".
Previously, I had a drop-down control, and using the Calculated Value $dropdownControl = "Annual Summary" it was able to return true if any of the repeated rows contained the value. My understanding is that the = operator validates against all rows.
Now with the json output of the control, I am attempting to use
contains($jsonStringValue, 'Annual Summary')
However, this only works with one entry and will be null if there are multiple rows.
Two questions:
How would I validate whether "Annual Summary" (or any other text) is present within any of the repeated rows?
Is there any way to navigate the JSON, or to parse it to XML and navigate that?
Constraints:
working within the Calculated Value or Visibility fields in the form builder
manipulating the source that is generated by the form builder
You probably want to parse the JSON string first. See also this other Stack Overflow question.
Until Orbeon Forms 2016.3 is released, you would write:
(
    for $v in $jsonStringValue
    return converter:jsonStringToXml($v)
)//name = 'Annual Summary'
With the above, you also need to scope the namespace:
xmlns:converter="org.orbeon.oxf.json.Converter"
Once Orbeon Forms 2016.3 is released you can switch to:
$jsonStringValue/xxf:json-to-xml()//name = 'Annual Summary'

How to enter multiple table data in MongoDB using JSON

I am trying to learn MongoDB. Suppose there are two related tables, for example like this:
The 1st table has:
First name: Fred, last name: Zhang, age: 20, id: s1234
The 2nd table has:
id: s1234, course: COSC2406, semester: 1
id: s1234, course: COSC1127, semester: 1
id: s1234, course: COSC2110, semester: 1
How do I insert this data into MongoDB? I wrote it like this, but I'm not sure whether it's correct:
db.users.insert({
    given_name: 'Fred',
    family_name: 'Zhang',
    Age: 20,
    student_number: 's1234',
    Course: ['COSC2406', 'COSC1127', 'COSC2110'],
    Semester: 1
});
Thank you in advance
This assumes that what you want to model has the "student_number" and the "Semester" as what is basically a unique identifier for the entries. But there is a way to do this without accumulating the array contents in code.
You can make use of the upsert functionality of the .update() method, with the help of a few other operators in the statement.
I am going to assume you are doing this inside a loop of sorts, so every value on the right-hand side is actually a variable:
db.users.update(
    {
        "student_number": student_number,
        "Semester": semester
    },
    {
        "$setOnInsert": {
            "given_name": given_name,
            "family_name": family_name,
            "Age": age
        },
        "$addToSet": { "courses": course }
    },
    { "upsert": true }
)
What an "upsert" operation does is first look for a document in your collection that matches the query criteria given, in this case a "student_number" with the current "Semester" value.
When a match is found, the document is merely "updated". What is being done here is using the $addToSet operator to "update" only unique values into the "courses" array element. Unique courses would seem to make sense, but if that is not your case then you can simply use the $push operator instead. Either way, that is the operation you want to happen every time, whether the document was "matched" or not.
In the case where no "matching" document is found, a new document is inserted into the collection. This is where the $setOnInsert operator comes in.
The point of that section is that it is only applied when a new document is created, as there is no need to update those fields with the same information every time. In addition, the fields you specified in the query criteria have explicit values, so the behavior of the "upsert" is to automatically create those fields, with those values, in the newly created document.
After a new document is created, the next "upsert" statement that uses the same criteria will of course only "update" the now-existing document, so only your new course information is added.
Overall, working like this allows you to "pre-join" the two tables from your source with an appropriate query. Then you just loop over the results without needing to write code that groups the correct entries together, and simply let MongoDB do the accumulation work for you.
Of course you can always just write the code to do this yourself and it would result in fewer "trips" to the database in order to insert your already accumulated records if that would suit your needs.
As a final note, though it does require some additional complexity, you can get better performance out of the operation shown above by using the newly introduced "batch updates" functionality. For this, your MongoDB server version needs to be 2.6 or higher. It is one way of keeping the logic simple while still reducing the actual "over the wire" writes to the database.
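For reference, a rough sketch of that batch approach using the Bulk API in the 2.6+ shell, reusing the same upsert logic as above; the records variable, holding your pre-joined rows, is my assumption:
// records is assumed: an array of pre-joined rows from your source tables
var bulk = db.users.initializeOrderedBulkOp();
records.forEach(function(r) {
    bulk.find({ "student_number": r.student_number, "Semester": r.semester })
        .upsert()
        .updateOne({
            "$setOnInsert": {
                "given_name": r.given_name,
                "family_name": r.family_name,
                "Age": r.age
            },
            "$addToSet": { "courses": r.course }
        });
});
bulk.execute(); // sends the queued operations to the server in batches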
You can either have two separate collections, one with the student details and the other with the courses, and link them with "id".
Or you can have a single document with the courses embedded as an array of subdocuments, as below:
{
    "FirstName": "Fred",
    "LastName": "Zhang",
    "age": 20,
    "id": "s1234",
    "Courses": [
        { "courseId": "COSC2406", "semester": 1 },
        { "courseId": "COSC1127", "semester": 1 },
        { "courseId": "COSC2110", "semester": 1 },
        { "courseId": "COSC2110", "semester": 2 }
    ]
}
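One convenience of the embedded form, for what it's worth: dot notation reaches into the array, so a single query can find students by course. A sketch using the field names above:
// find the student(s) enrolled in COSC2406
db.users.find({ "Courses.courseId": "COSC2406" })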

How to add a nested JSON object to a Lucene index

I need a little help regarding Lucene index files; I thought maybe some of you can help me out.
I have JSON like this:
[
    {
        "Id": 4476,
        "UrlName": null,
        "PhoneData": [
            { "PhoneType": "O", "PhoneNumber": "0065898" },
            { "PhoneType": "F", "PhoneNumber": "0065898" }
        ],
        "Contact": [],
        "Services": [
            { "ServiceId": 10, "ServiceGroup": 2 },
            { "ServiceId": 20, "ServiceGroup": 1 }
        ]
    }
]
Adding the first two fields is relatively easy:
// add lucene fields mapped to db fields
doc.Add(new Field("Id", sampleData.Id.Value.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
doc.Add(new Field("UrlName", sampleData.UrlName.Value ?? "null" , Field.Store.YES, Field.Index.ANALYZED));
But how can I add PhoneData and Services to the index so that they stay connected to the unique Id?
For indexing JSON objects I would go this way:
Store the whole value under a payload field, named for example $json. This field would be stored but not indexed.
For each (indexable) property (possibly nested), create an indexable field whose name is an XPath-like expression identifying the property, for example PhoneData.PhoneType.
If it is OK for all nested properties to be indexed, then it's simple: just iterate over all of them, generating these indexable fields (see the sketch after this list).
But if you don't want to index all of them (a more realistic case), knowing which property is indexable is another problem; in this case you could:
Accept from the client the path expressions of the index fields to be created when storing the document, or
Put JSON Schema into play to describe your data (assuming your JSON records have a common schema), and extend it with a custom property that lets you tag which properties are indexable.
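For the simple index-everything case, a rough C# sketch (assuming Lucene.Net 3.x and Json.NET; the $json field name and the ToDocument helper are my inventions, and it handles one JSON record at a time, not the whole array):
using System.Linq;
using Lucene.Net.Documents;
using Newtonsoft.Json.Linq;

// hypothetical helper: one JSON record in, one Lucene document out
static Document ToDocument(string json)
{
    var doc = new Document();
    // the payload field: stored so the original record can be retrieved, but not indexed
    doc.Add(new Field("$json", json, Field.Store.YES, Field.Index.NO));
    // one indexable field per leaf value, named by its path,
    // e.g. "PhoneData[0].PhoneNumber" or "Services[1].ServiceId"
    foreach (var leaf in JObject.Parse(json).Descendants().OfType<JValue>())
    {
        doc.Add(new Field(leaf.Path, leaf.ToString(), Field.Store.NO, Field.Index.ANALYZED));
    }
    return doc;
}
Because every nested value lands in the same Document as the Id field, the connection to the unique Id comes for free.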
I have created a library that does this (and much more); maybe it can help you.
You can check it at https://github.com/brutusin/flea-db