EvaluateJsonPathAttributeCustom - Nifi - json

I have to extract attributes from a json file that I receive from an api call using InvokeHTTPCustom. JSON FILE has the following sample data :
[
{
"input_index": 0,
"candidate_index": 0,
"delivery_line_1": "1 Santa Claus Ln",
"last_line": "North Pole AK 99705-9901",
"delivery_point_barcode": "997059901010",
"components": {
"primary_number": "1",
"street_name": "Santa Claus",
"street_suffix": "Ln",
"city_name": "North Pole",
"state_abbreviation": "AK",
"zipcode": "99705",
"plus4_code": "9901",
"delivery_point": "01",
"delivery_point_check_digit": "0"
},
"metadata": {
"record_type": "S",
"zip_type": "Standard",
"county_fips": "02090",
"county_name": "Fairbanks North Star",
"carrier_route": "C004",
"congressional_district": "AL",
"rdi": "Commercial",
"elot_sequence": "0001",
"elot_sort": "A",
"latitude": 64.75233,
"longitude": -147.35297,
"coordinate_license": 1,
"precision": "Rooftop",
"time_zone": "Alaska",
"utc_offset": -9,
"dst": true
},
"analysis": {
"dpv_match_code": "Y",
"dpv_footnotes": "AABB",
"dpv_cmra": "N",
"dpv_vacant": "N",
"dpv_no_stat": "Y",
"active": "Y",
"footnotes": "L#"
}
},
{
"input_index": 1,
"candidate_index": 0,
"addressee": "Apple Inc",
"delivery_line_1": "1 Infinite Loop",
// truncated for brevity
}
]
I have extracted all the required data such as address, state, city, primary_number, etc.
However, when I try to extract latitude,longitude from metadata, it leads to failure in EvaluateJsonPathAttributeCustom processor. Other attributes, which are in string format, get extracted correctly. However, this being not a string, might be issue, is my diagnosis.
How do I get this working?
I need to extract longitudes and latitudes.
Please give detail explanation as I am new to nifi.
Configuration in nifi for EvaluateJsonPathAttributeCustom:
Attribute Name Input : x**.json
Attribute Name Output : latitude
JsonPathExpresssion : $[0].metadata.latitude
Splitif.. : False

One way to do this is by using the JOLT https://jolt-demo.appspot.com/.
I would recommend using the JoltTransformJSON NiFi Processor as it can really help make things easy to pull out only the data that you want. I have tried your specific request and it will work to pull out those data. You can configure JOLT to pull any data you require and it might be easier once you get the hang of it.
[{
"operation": "shift",
"spec": {
"*": {
"metadata": {
"latitude": "latitude",
"longitude": "longitude"
}
}
}
}]

Related

JMESPath how to write a query with multi-level filter?

I have been studying official documentation of JMESPath and a few other resources. However I was not successful with the following task:
my data structure is a json from vimeo api (video list):
data array contains lots of objects, each object is the uploaded file that has many attributes and various options.
"data": [
{
"uri": "/videos/00001",
"name": "Video will be added.mp4",
"description": null,
"type": "video",
"link": "https://vimeo.com/00001",
"duration": 9,
"files":[
{
"quality": "hd",
"type": "video/mp4",
"width": 1440,
"height": 1440,
"link": "https://player.vimeo.com/external/4443333.sd.mp4",
"created_time": "2020-09-01T19:10:01+00:00",
"fps": 30,
"size": 10807854,
"md5": "643d9f18e0a63e0630da4ad85eecc7cb",
"public_name": "UHD 1440p",
"size_short": "10.31MB"
},
{
"quality": "sd",
"type": "video/mp4",
"width": 540,
"height": 540,
"link": "https://player.vimeo.com/external/44444444.sd.mp4",
"created_time": "2020-09-01T19:10:01+00:00",
"fps": 30,
"size": 1345793,
"md5": "cb568939bb7b276eb468d9474c1f63f6",
"public_name": "SD 540p",
"size_short": "1.28MB"
},
... other data
]
},
... other uploaded files
]
Filter I need to apply is that duration needs to be less than 10 and width of file needs to be 540 and the result needs to contain a link (url) from files
I have managed to get only one of structure-levels working:
data[].files[?width == '540'].link
I need to extract this kind of list
[
{
"uri": "/videos/111111",
"link": "https://player.vimeo.com/external/4123112312.sd.mp4"
},
{
"uri": "/videos/22222",
"link": "https://player.vimeo.com/external/1231231231.sd.mp4"
},
...other data
]
Since the duration is in your data array, you will have to add this filter at that level.
You will also have to use what is described under the section filtering and selecting nested data because you only care of one specific type of file under the files array, so, you can use the same type of query structure | [0] in order to pull only the first element of the filtered files array.
So on your reduced exemple, the query:
data[?duration < `10`].{ uri: uri, link: files[?width == `540`].link | [0] }
Would yield the expected:
[
{
"uri": "/videos/00001",
"link": "https://player.vimeo.com/external/44444444.sd.mp4"
}
]

Google json style guide: How to send single item response?

There is an items node in the specifications which says it is for an array of items, like paging items, youtube video list
What if I have GET request on a single item, how should the response be formatted ?
Just to one item in the array?
items:[item]
https://google.github.io/styleguide/jsoncstyleguide.xml
I don't think #tanmay_vijay's answer is correct or nuanced enough as it seems that single item responses are in arrays in the YouTube example in the docs.
{
"apiVersion": "2.0",
"data": {
"updated": "2010-02-04T19:29:54.001Z",
"totalItems": 6741,
"startIndex": 1,
"itemsPerPage": 1,
"items": [
{
"id": "BGODurRfVv4",
"uploaded": "2009-11-17T20:10:06.000Z",
"updated": "2010-02-04T06:25:57.000Z",
"uploader": "docchat",
"category": "Animals",
"title": "From service dog to SURFice dog",
"description": "Surf dog Ricochets inspirational video ...",
"tags": [
"Surf dog",
"dog surfing",
"dog",
"golden retriever",
],
"thumbnail": {
"default": "https://i.ytimg.com/vi/BGODurRfVv4/default.jpg",
"hqDefault": "https://i.ytimg.com/vi/BGODurRfVv4/hqdefault.jpg"
},
"player": {
"default": "https://www.youtube.com/watch?v=BGODurRfVv4&feature=youtube_gdata",
"mobile": "https://m.youtube.com/details?v=BGODurRfVv4"
},
"content": {
"1": "rtsp://v5.cache6.c.youtube.com/CiILENy73wIaGQn-Vl-0uoNjBBMYDSANFEgGUgZ2aWRlb3MM/0/0/0/video.3gp",
"5": "https://www.youtube.com/v/BGODurRfVv4?f=videos&app=youtube_gdata",
"6": "rtsp://v7.cache7.c.youtube.com/CiILENy73wIaGQn-Vl-0uoNjBBMYESARFEgGUgZ2aWRlb3MM/0/0/0/video.3gp"
},
"duration": 315,
"rating": 4.96,
"ratingCount": 2043,
"viewCount": 1781691,
"favoriteCount": 3363,
"commentCount": 1007,
"commentsAllowed": true
}
]
}
}
It could however be that it depends on the resource being targeted from the request. This is the way it is in the competing JSONAPI standard.
From JSONAPI standard:
A logical collection of resources MUST be represented as an array, even if it only contains one item or is empty.
You don't need to have items field for showing single item. If you're sure your API is always going to return single object, you can return it as data itself.
{
"data": {
"kind": "user",
"fields": "author,id",
"id": "bart",
"author": "Bart"
}
}
Fields such as data.kind data.fields data.etag data.id data.lang data.updated data.deleted can still be used here.
Source for snippet docs

Retrieve a JSON object from JSON array using Cloudant

I am doing an API call every 40 mins to retrieve the current status information of every car in a car fleet. And each call adds one new JSON document to a Cloudant database. Each JSON document defines the current availability status for every car across many locations in many cities. There are currently around 2200 JSON documents in the database. All JSON documents have one field called payload that contains all information; it is a large array of objects. Instead of retrieving the whole payload array of objects I would like to retrieve only the needed info with a query (so, only one or several objects of that array). However, I have difficulty drafting a query that results only in the needed data.
Below, I'll explain my problem in more detail:
When saving the JSON document to Cloudant, a timestamp is defined in the document. The _id parameter is defined to be equal to this timestamp. Below, I show a simplified version of these JSON documents:
{
"_id": "1540914946026",
"_rev": "3-c1834c8a230cf772e41bbcb9cf6b682e",
"timestamp": 1540914946026,
"datetime": "2018-10-30 15:55:46",
"payload": [
{
"cityName": "Abcoude",
"locations": [
{
"address": "asterlaan 28",
"geoPoint": {
"latitude": 52.27312,
"longitude": 4.96768
},
"cars": [
{
"mod": "BMW",
"state": "FREE"
}
]
}
],
"availableCars": 1,
"occupiedCars": 0
},
{
"cityName": "Alkmaar",
"locations": [
{
"address": "Aert de Gelderlaan 14",
"geoPoint": {
"latitude": 52.63131,
"longitude": 4.72329
},
"cars": [
{
"model": "Volswagen",
"state": "FREE"
}
]
},
{
"address": "Ardennenstraat 49",
"geoPoint": {
"latitude": 52.66721,
"longitude": 4.76046
},
"cars": [
{
"mod": "BMW",
"state": "FREE"
}
]
},
{
"address": "Beneluxplein 7",
"geoPoint": {
"latitude": 52.65356,
"longitude": 4.75817
},
"cars": [
{
"mod": "BMW",
"state": "FREE"
}
]
},
{
"address": "Dr. Schaepmankade 1",
"geoPoint": {
"latitude": 52.62595,
"longitude": 4.75122
},
"cars": [
{
"mod": "BMW",
"state": "OCCUPIED"
}
]
},
{
"address": "Kennemerstraatweg",
"geoPoint": {
"latitude": 52.62909,
"longitude": 4.74226
},
"cars": [
{
"model": "Mercedes",
"state": "FREE"
}
]
},
{
"address": "NS Station Alkmaar Noord/Parkeerterrein Noord",
"geoPoint": {
"latitude": 52.64366,
"longitude": 4.7627
},
"cars": [
{
"model": "Tesla",
"state": "FREE"
}
]
},
{
"address": "NS Station Alkmaar/Stationsweg 56",
"geoPoint": {
"latitude": 52.6371,
"longitude": 4.73935
},
"cars": [
{
"model": "Tesla",
"state": "FREE"
}
]
},
{
"address": "Oude Hoeverweg",
"geoPoint": {
"latitude": 52.63943,
"longitude": 4.72928
},
"cars": [
{
"model": "Tesla",
"state": "FREE"
}
]
},
{
"address": "Parkeerterrein Wortelsteeg",
"geoPoint": {
"latitude": 52.63048,
"longitude": 4.75487
},
"cars": [
{
"model": "Tesla",
"state": "OCCUPIED"
}
]
},
{
"address": "Schoklandstraat 38",
"geoPoint": {
"latitude": 52.65812,
"longitude": 4.75359
},
"cars": [
{
"model": "Volkswagen",
"state": "FREE"
}
]
}
],
"availableCars": 8,
"occupiedCars": 2
}
]
}
As you can see, the payload field is an array that has several objects (FYI: every object in this array represents one specific city: there are 1600 cities, so 1600 nested objects inside the payload array). Furthermore, inside each of the 1600 objects mentioned, other arrays and objects are again nested inside. For all objects in the payload array, the first field is cityName.
Furthermore, there is a nested array locations (inside each of the 1600 objects of the payload array) representing all addresses in a specific city. The locations array can be of size 1 to 600, meaning 1 to 600 nested objects / addresses per city. The last two fields in all objects of the payload array are availableCars and occupiedCars.
I want query documents to see how many cars are available and occupied for a specific city during a specific time interval. To do this:
I have to specify a start timestamp (or id) and an end timestamp, resulting in only the JSON documents within this interval.
Furthermore, I will need to specify inside the JSON documents only one or more specific cities by cityName (there are 1600 cities) and then get the number of available cars availableCars and the number of occupiedCars for those cities.
For example, in this simplified example, I would like to query for the status information (availableCars & `occupiedCars) for the city of Alkmaar from 1540914946026 (epoch time) until now. I would like to get the following result:
{
"id":"1540914946026",
"cityName":"Alkmaar",
"availableCars":8,
"occupiedCars":2
}
This is just an example, in reality, I want to be able to query for other cities as well, or query for several cities together and then get for each of those cities the number of available cars availableCars and the number of occupied cars occupiedCars.
Could anyone help me to define a query and index to be able to get the above result? Can I do this with cloudant query?
Your data model does not play to Cloudant's strengths. Let each document group data that changes and is accessed together. Your items in your payload array would be much better stored as discrete documents.
If you find yourself reaching into growing arrays inside documents for subsets of data, this is a warning sign that your data model is not ideal: the document is now mutable and growing (with potential update conflicts as a result), and access becomes more cumbersome over time as Cloudant has no mechanism to only retrieve parts of a document. Moreover, Cloudant has a limit (1M) on document size, so by using your proposed model, you will likely hit that limit, too, and your application would stop working.
With that said, it is possible to create a view index that lets you emit each component of your payload, which would let you look up data per city -- but that solution is still subject to all the limitations above (document model is mutable, documents grow large etc).
Rule of thumb: small documents. Immutable model, where possible. Documents group data that either change, or are accessed as a unit.

PostgreSQL, get JSON object field based on a the value of a parallel attribute

Suppose we are dealing with a JSON object where there can be multiple child nodes with the same structure, and we want to get the value of attribute B,C,D,etc. where attribute A equals a specific value. Below is an example.
{
"addresses": [{
"type": "home",
"address": "123 fake street",
"zip": "24301"
}, {
"type": "work",
"address": "346 Main street",
"zip": "24352"
}, {
"type": "PO Box",
"address": "PO BOX 132, New York, NY",
"zip": "10001"
}, {
"type": "second",
"address": "1600 Pennsylvania Ave.",
"zip": "90210"
}]}
Is there any JSON operator in PostgreSQL where I can get the zip code, where the address type is "work" or "home"? I am looking at https://www.postgresql.org/docs/current/static/functions-json.html and not finding what I'm looking for.
You need to "unnest" (i.e. normalize) the data, then you can apply a WHERE condition on it:
select t.adr ->> 'zip', t.adr ->> 'address'
from the_table
cross join lateral jsonb_array_elements(the_column -> 'addresses') as t(adr)
where t.adr ->> 'type' in ('work', 'home');
Online example: http://rextester.com/TDB99535

NYC Open data DOB Missing Information

I am facing some issue in NYC department of building API.
help me if you know any other API giving the same information
I have used this API but didn't work for me
https://data.cityofnewyork.us/resource/83x8-shf7.json
Missing fields
permitee detailed address
https://data.cityofnewyork.us/resource/83x8-shf7.json?$where=filing_date BETWEEN '2018-05-01T06:00:00' AND '2018-05-30T10:00:00'
Also i am not able get expected data using filters for "filing_date" from same api
expected data should return all data between 2018-05-01 and 2018-05-30 for this API But i am getting only few results.
I am getting this data
[
{
"bin__": "3118313",
"bldg_type": "1",
"block": "05143",
"borough": "BROOKLYN",
"city": "BROOKLYN",
"community_board": "314",
"dobrundate": "2018-05-03T00:00:00.000",
"expiration_date": "2018-06-11T00:00:00.000",
"filing_date": "2018-05-02T00:00:00.000",
"filing_status": "INITIAL",
"gis_census_tract": "1522",
"gis_council_district": "40",
"gis_latitude": "40.641731",
"gis_longitude": "-73.966432",
"gis_nta_name": "Flatbush",
"house__": "328",
"issuance_date": "2018-05-02T00:00:00.000",
"job__": "321679046",
"job_doc___": "01",
"job_start_date": "2018-05-02T00:00:00.000",
"job_type": "A2",
"lot": "00068",
"non_profit": "N",
"owner_s_business_name": "N/A",
"owner_s_business_type": "INDIVIDUAL",
"owner_s_first_name": "MATTHEW",
"owner_s_house__": "328",
"owner_s_house_street_name": "ARGYLE ROAD",
"owner_s_last_name": "LIMA",
"owner_s_phone__": "3475968096",
"owner_s_zip_code": "11218",
"permit_sequence__": "01",
"permit_si_no": "3452932",
"permit_status": "ISSUED",
"permit_subtype": "OT",
"permit_type": "EW",
"permittee_s_business_name": "BMB BUILDER INC",
"permittee_s_first_name": "YUAN HANG",
"permittee_s_last_name": "XIAO",
"permittee_s_license__": "0612790",
"permittee_s_license_type": "GC",
"permittee_s_phone__": "9175776544",
"residential": "YES",
"self_cert": "N",
"site_fill": "NOT APPLICABLE",
"state": "NY",
"street_name": "ARGYLE ROAD",
"superintendent_business_name": "BMB BUILDER INC",
"superintendent_first___last_name": "YUAN HANG XIAO",
"work_type": "OT",
"zip_code": "11218"
}]
Expected Data should be
[{
"bin__": "1090379",
"bldg_type": "2",
"block": "00760",
"borough": "MANHATTAN",
"city": "GREAT NECK",
"community_board": "104",
"dobrundate": "2018-05-02T00:00:00.000",
"expiration_date": "2018-10-28T00:00:00.000",
"filing_date": "2018-05-01T00:00:00.000",
"filing_status": "RENEWAL",
"gis_census_tract": "111",
"gis_council_district": "3",
"gis_latitude": "40.753978",
"gis_longitude": "-73.993673",
"gis_nta_name": "Hudson Yards-Chelsea-Flatiron-Union Square",
"house__": "337",
"issuance_date": "2018-05-01T00:00:00.000",
"job__": "121187606",
"job_doc___": "01",
"job_start_date": "2016-02-19T00:00:00.000",
"job_type": "NB",
"lot": "00020",
"non_profit": "N",
"owner_s_business_name": "HKONY WEST 36 LLC",
"owner_s_business_type": "PARTNERSHIP",
"owner_s_first_name": "SAM",
"owner_s_house__": "420",
"owner_s_house_street_name": "GREAT NECK ROAD",
"owner_s_last_name": "CHANG",
"owner_s_phone__": "9178380886",
"owner_s_zip_code": "11021",
"permit_sequence__": "07",
"permit_si_no": "3451790",
"permit_status": "ISSUED",
"permit_type": "NB",
"permittee_s_business_name": "OMNIBUILD CONSTRUCTION IN",
"permittee_s_first_name": "PETER",
"permittee_s_last_name": "SERPICO",
"permittee_s_license__": "0608390",
"permittee_s_license_type": "GC",
"permittee_s_phone__": "2124191930",
"self_cert": "N",
"site_fill": "ON-SITE",
"site_safety_mgr_s_first_name": "ROBERT",
"site_safety_mgr_s_last_name": "FILIPPONE",
"special_district_1": "GC",
"state": "NY",
"street_name": "W 36 ST",
"zip_code": "10018"
}]
Combing through the JSON, it appears that these columns are not matching: permit_subtype, superintendent_business_name, superintendent_first___last_name, site_safety_mgr_s_first_name, site_safety_mgr_s_last_name, and special_district_1.
Looking at the original data sources, the columns that do not match are instances where the field is blank for that field. That is, bin__ = 1090379 does not have a permit_subtype, so it is omitted in the JSON (which is standard practice).
It will, however, be included in the CSV output since that format must include all columns: https://data.cityofnewyork.us/resource/83x8-shf7.csv?$where=filing_date%20BETWEEN%20%272018-05-01T06:00:00%27%20AND%20%272018-05-30T10:00:00%27.
This answer took a bit of digging because it wasn't immediately obvious which columns were different between the two examples. It's always helpful to over-explain to make it easier to track-down the issue.
Likewise, per the filing_date question, please include the query you're attempting to use.