I make an API call every 40 minutes to retrieve the current status of every car in a car fleet, and each call adds one new JSON document to a Cloudant database. Each JSON document describes the current availability status of every car across many locations in many cities. There are currently around 2200 JSON documents in the database. Every JSON document has one field called payload that contains all the information; it is a large array of objects. Instead of retrieving the whole payload array, I would like to retrieve only the needed information with a query (that is, only one or several objects of that array). However, I have difficulty drafting a query that returns only the needed data.
Below, I'll explain my problem in more detail:
When saving the JSON document to Cloudant, a timestamp is defined in the document. The _id parameter is defined to be equal to this timestamp. Below, I show a simplified version of these JSON documents:
{
"_id": "1540914946026",
"_rev": "3-c1834c8a230cf772e41bbcb9cf6b682e",
"timestamp": 1540914946026,
"datetime": "2018-10-30 15:55:46",
"payload": [
{
"cityName": "Abcoude",
"locations": [
{
"address": "asterlaan 28",
"geoPoint": {
"latitude": 52.27312,
"longitude": 4.96768
},
"cars": [
{
"mod": "BMW",
"state": "FREE"
}
]
}
],
"availableCars": 1,
"occupiedCars": 0
},
{
"cityName": "Alkmaar",
"locations": [
{
"address": "Aert de Gelderlaan 14",
"geoPoint": {
"latitude": 52.63131,
"longitude": 4.72329
},
"cars": [
{
"model": "Volkswagen",
"state": "FREE"
}
]
},
{
"address": "Ardennenstraat 49",
"geoPoint": {
"latitude": 52.66721,
"longitude": 4.76046
},
"cars": [
{
"mod": "BMW",
"state": "FREE"
}
]
},
{
"address": "Beneluxplein 7",
"geoPoint": {
"latitude": 52.65356,
"longitude": 4.75817
},
"cars": [
{
"mod": "BMW",
"state": "FREE"
}
]
},
{
"address": "Dr. Schaepmankade 1",
"geoPoint": {
"latitude": 52.62595,
"longitude": 4.75122
},
"cars": [
{
"mod": "BMW",
"state": "OCCUPIED"
}
]
},
{
"address": "Kennemerstraatweg",
"geoPoint": {
"latitude": 52.62909,
"longitude": 4.74226
},
"cars": [
{
"model": "Mercedes",
"state": "FREE"
}
]
},
{
"address": "NS Station Alkmaar Noord/Parkeerterrein Noord",
"geoPoint": {
"latitude": 52.64366,
"longitude": 4.7627
},
"cars": [
{
"model": "Tesla",
"state": "FREE"
}
]
},
{
"address": "NS Station Alkmaar/Stationsweg 56",
"geoPoint": {
"latitude": 52.6371,
"longitude": 4.73935
},
"cars": [
{
"model": "Tesla",
"state": "FREE"
}
]
},
{
"address": "Oude Hoeverweg",
"geoPoint": {
"latitude": 52.63943,
"longitude": 4.72928
},
"cars": [
{
"model": "Tesla",
"state": "FREE"
}
]
},
{
"address": "Parkeerterrein Wortelsteeg",
"geoPoint": {
"latitude": 52.63048,
"longitude": 4.75487
},
"cars": [
{
"model": "Tesla",
"state": "OCCUPIED"
}
]
},
{
"address": "Schoklandstraat 38",
"geoPoint": {
"latitude": 52.65812,
"longitude": 4.75359
},
"cars": [
{
"model": "Volkswagen",
"state": "FREE"
}
]
}
],
"availableCars": 8,
"occupiedCars": 2
}
]
}
As you can see, the payload field is an array that has several objects (FYI: every object in this array represents one specific city: there are 1600 cities, so 1600 nested objects inside the payload array). Furthermore, inside each of the 1600 objects mentioned, other arrays and objects are again nested inside. For all objects in the payload array, the first field is cityName.
Furthermore, there is a nested array locations (inside each of the 1600 objects of the payload array) representing all addresses in a specific city. The locations array can be of size 1 to 600, meaning 1 to 600 nested objects / addresses per city. The last two fields in all objects of the payload array are availableCars and occupiedCars.
I want to query documents to see how many cars are available and occupied for a specific city during a specific time interval. To do this:
I have to specify a start timestamp (or id) and an end timestamp, so that only the JSON documents within this interval are returned.
Furthermore, within those JSON documents I need to select one or more specific cities by cityName (there are 1600 cities) and then get the number of available cars (availableCars) and the number of occupied cars (occupiedCars) for those cities.
For example, in this simplified example, I would like to query for the status information (availableCars and occupiedCars) for the city of Alkmaar from 1540914946026 (epoch time) until now. I would like to get the following result:
{
"id":"1540914946026",
"cityName":"Alkmaar",
"availableCars":8,
"occupiedCars":2
}
This is just an example; in reality, I want to be able to query for other cities as well, or query for several cities together and then get, for each of those cities, the number of available cars (availableCars) and the number of occupied cars (occupiedCars).
Could anyone help me define a query and an index to get the above result? Can I do this with Cloudant Query?
Your data model does not play to Cloudant's strengths. Let each document group data that changes and is accessed together. Your items in your payload array would be much better stored as discrete documents.
If you find yourself reaching into growing arrays inside documents for subsets of data, this is a warning sign that your data model is not ideal: the document is now mutable and growing (with potential update conflicts as a result), and access becomes more cumbersome over time, as Cloudant has no mechanism to retrieve only parts of a document. Moreover, Cloudant has a limit (1 MB) on document size, so with your proposed model you will likely hit that limit too, and your application would stop working.
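As a rough illustration of storing the payload items as discrete documents, here is a minimal Python sketch that splits one snapshot into one small document per city. The function name split_snapshot and the "cityName:timestamp" _id scheme are assumptions made for this example, not something Cloudant prescribes.

```python
def split_snapshot(snapshot):
    """Turn one large snapshot document into one small document per city."""
    docs = []
    for city in snapshot["payload"]:
        docs.append({
            # Hypothetical _id scheme: city name plus snapshot timestamp.
            "_id": f'{city["cityName"]}:{snapshot["timestamp"]}',
            "timestamp": snapshot["timestamp"],
            "cityName": city["cityName"],
            "availableCars": city["availableCars"],
            "occupiedCars": city["occupiedCars"],
            "locations": city["locations"],
        })
    return docs


# Trimmed-down version of the snapshot from the question.
snapshot = {
    "_id": "1540914946026",
    "timestamp": 1540914946026,
    "payload": [
        {"cityName": "Abcoude", "locations": [], "availableCars": 1, "occupiedCars": 0},
        {"cityName": "Alkmaar", "locations": [], "availableCars": 8, "occupiedCars": 2},
    ],
}
city_docs = split_snapshot(snapshot)
```

Each resulting document is small, is written once, and can be looked up by city and time without touching the other 1599 cities.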
With that said, it is possible to create a view index that lets you emit each component of your payload, which would let you look up data per city -- but that solution is still subject to all the limitations above (document model is mutable, documents grow large etc).
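To make the view idea concrete: in Cloudant the map function itself is written in JavaScript, so the Python below is only an in-memory emulation of what such a map step would emit, assuming a compound key of [cityName, timestamp]. The names map_snapshot and query_city are hypothetical.

```python
def map_snapshot(doc):
    """Yield one (key, value) row per city, as a view's map step might emit."""
    for city in doc.get("payload", []):
        # Assumed key layout: city first, then time, so that one city's
        # rows sort together in timestamp order.
        key = [city["cityName"], doc["timestamp"]]
        value = {
            "id": doc["_id"],
            "cityName": city["cityName"],
            "availableCars": city["availableCars"],
            "occupiedCars": city["occupiedCars"],
        }
        yield key, value


def query_city(docs, city_name, start_ts, end_ts):
    """Emulate a key-range lookup over the sorted rows."""
    rows = [row for doc in docs for row in map_snapshot(doc)]
    rows.sort(key=lambda row: row[0])
    return [value for key, value in rows
            if key[0] == city_name and start_ts <= key[1] <= end_ts]


snapshots = [{
    "_id": "1540914946026",
    "timestamp": 1540914946026,
    "payload": [
        {"cityName": "Abcoude", "availableCars": 1, "occupiedCars": 0},
        {"cityName": "Alkmaar", "availableCars": 8, "occupiedCars": 2},
    ],
}]
result = query_city(snapshots, "Alkmaar", 1540914946026, 1540914946027)
```

With this key layout, a single city over a time interval corresponds to one contiguous range of keys, which is the kind of lookup a view index handles well.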
Rule of thumb: small documents. Immutable model, where possible. Documents group data that either change, or are accessed as a unit.
I have to extract attributes from a JSON file that I receive from an API call made with InvokeHTTPCustom. The JSON file has the following sample data:
[
{
"input_index": 0,
"candidate_index": 0,
"delivery_line_1": "1 Santa Claus Ln",
"last_line": "North Pole AK 99705-9901",
"delivery_point_barcode": "997059901010",
"components": {
"primary_number": "1",
"street_name": "Santa Claus",
"street_suffix": "Ln",
"city_name": "North Pole",
"state_abbreviation": "AK",
"zipcode": "99705",
"plus4_code": "9901",
"delivery_point": "01",
"delivery_point_check_digit": "0"
},
"metadata": {
"record_type": "S",
"zip_type": "Standard",
"county_fips": "02090",
"county_name": "Fairbanks North Star",
"carrier_route": "C004",
"congressional_district": "AL",
"rdi": "Commercial",
"elot_sequence": "0001",
"elot_sort": "A",
"latitude": 64.75233,
"longitude": -147.35297,
"coordinate_license": 1,
"precision": "Rooftop",
"time_zone": "Alaska",
"utc_offset": -9,
"dst": true
},
"analysis": {
"dpv_match_code": "Y",
"dpv_footnotes": "AABB",
"dpv_cmra": "N",
"dpv_vacant": "N",
"dpv_no_stat": "Y",
"active": "Y",
"footnotes": "L#"
}
},
{
"input_index": 1,
"candidate_index": 0,
"addressee": "Apple Inc",
"delivery_line_1": "1 Infinite Loop",
// truncated for brevity
}
]
I have extracted all the required data such as address, state, city, primary_number, etc.
However, when I try to extract latitude and longitude from metadata, the EvaluateJsonPathAttributeCustom processor fails. Other attributes, which are strings, are extracted correctly. My diagnosis is that the issue might be that these values are not strings.
How do I get this working?
I need to extract longitudes and latitudes.
Please give a detailed explanation, as I am new to NiFi.
Configuration in nifi for EvaluateJsonPathAttributeCustom:
Attribute Name Input : x**.json
Attribute Name Output : latitude
JsonPathExpresssion : $[0].metadata.latitude
Splitif.. : False
One way to do this is by using JOLT (https://jolt-demo.appspot.com/).
I would recommend using the JoltTransformJSON NiFi processor, as it makes it easy to pull out only the data that you want. I have tried your specific request, and the following spec pulls out those values. You can configure JOLT to extract any data you require, and it gets easier once you get the hang of it.
[{
"operation": "shift",
"spec": {
"*": {
"metadata": {
"latitude": "latitude",
"longitude": "longitude"
}
}
}
}]
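To sanity-check outside NiFi which values the spec above targets, a plain Python sketch of the same extraction may help. It is not JOLT and does not claim to reproduce JOLT's exact output shape; the variable names and the shortened sample string are assumptions for illustration.

```python
import json

# Shortened sample in the shape shown in the question; the second
# candidate has no metadata, like the truncated entry there.
raw = """
[
  {"input_index": 0,
   "metadata": {"latitude": 64.75233, "longitude": -147.35297}},
  {"input_index": 1, "delivery_line_1": "1 Infinite Loop"}
]
"""

candidates = json.loads(raw)

# Collect latitude/longitude for every candidate that carries metadata.
coords = [
    {
        "latitude": c["metadata"]["latitude"],
        "longitude": c["metadata"]["longitude"],
    }
    for c in candidates
    if "metadata" in c
]
```

Note that both values come back as JSON numbers (Python floats), not strings, which matches the suspicion in the question that the type is what differs from the other attributes.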
I use the GCP metadata API (http://metadata.google.internal/computeMetadata/v1/) to get information about the instance that a process is running on, including machine type (e.g. "projects/818238156224/machineTypes/n1-standard-4" -- presumably the important part is the "n1-standard-4"), region, zone, and whether the instance is preemptible.
I would like to be able to retrieve information programmatically about how much GCP is charging (e.g. per hour) for usage of the instance.
I can query the GCP billing API (https://cloudbilling.googleapis.com/v1/services/6F81-5844-456A/skus), but that returns JSON like
{
"name": "services/6F81-5844-456A/skus/0048-21CE-74C3",
"skuId": "0048-21CE-74C3",
"description": "Preemptible N2 Custom Instance Core running in Sao Paulo",
"category": {
"serviceDisplayName": "Compute Engine",
"resourceFamily": "Compute",
"resourceGroup": "CPU",
"usageType": "Preemptible"
},
"serviceRegions": [
"southamerica-east1"
],
"pricingInfo": [
{
"summary": "",
"pricingExpression": {
"usageUnit": "h",
"usageUnitDescription": "hour",
"baseUnit": "s",
"baseUnitDescription": "second",
"baseUnitConversionFactor": 3600,
"displayQuantity": 1,
"tieredRates": [
{
"startUsageAmount": 0,
"unitPrice": {
"currencyCode": "USD",
"units": "0",
"nanos": 11538000
}
}
]
},
"currencyConversionRate": 1,
"effectiveTime": "2021-05-26T08:47:05.220Z"
}
],
"serviceProviderName": "Google",
"geoTaxonomy": {
"type": "REGIONAL",
"regions": [
"southamerica-east1"
]
}
}
And it's very unclear how to retrieve an object in one API given an object in the other.
Do I need to parse the description somehow? Does that even work? Is there a better way?
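For reference, here is a minimal sketch of the two pieces that can be read directly off the data shown above: the machine type name is the last path segment of the metadata value, and the unit price is units plus nanos billionths of the currency unit. It does not address how to match a machine type to the right SKUs, and the function names are hypothetical.

```python
from decimal import Decimal


def machine_type_name(resource_path):
    """Return the last path segment, e.g. 'n1-standard-4'."""
    return resource_path.rsplit("/", 1)[-1]


def unit_price(money):
    """Combine 'units' (whole currency units, sent as a string) with
    'nanos' (billionths of a unit) into one exact decimal amount."""
    return Decimal(money["units"]) + Decimal(money["nanos"]) / Decimal(10**9)


name = machine_type_name("projects/818238156224/machineTypes/n1-standard-4")

# unitPrice object copied from the SKU shown above.
price = unit_price({"currencyCode": "USD", "units": "0", "nanos": 11538000})
```

For the SKU above this works out to 0.011538 USD per usage unit, which according to usageUnit is one hour of a single custom core, not of a whole instance.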
When returning a list of objects in a JSON response, say for a GET request to a /movies endpoint, is it more common to return a JSON array or an object that wraps a JSON array? I've seen both formats in APIs, and I was wondering if there is a standard. If there isn't, which way is preferable?
i.e.
[
{
"name": "Harry Potter",
"year": 2000
}
]
vs.
{
"movies": [
{
"name": "Harry Potter",
"year": 2000
}
]
}
In general, if you have a service that only returns a list, the first option is perfectly fine:
[
{
"name": "Harry Potter",
"year": 2000
}
]
But if you are designing a general convention, it is better to add more context data, such as a total item count, pagination variables, or status values. So although the first one is perfectly fine, I always prefer the second one, but without the collection/array/table name as the key and with more context info, for example:
{
"items": [
{
"name": "Harry Potter",
"year": 2000
}
],
"total": 1,
"page": 1,
"pages": 1,
"status": 1,
"timestamp": 121344
}
Nesting the array under a movies key is a bit redundant. This is only a practical convention, but in my experience it is more readable, and it is used in all the projects I work on.
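A minimal sketch of how such an envelope could be built on the server side, assuming simple page-number pagination; the function name list_response and the default page size are choices made for this example only.

```python
import math


def list_response(all_items, page=1, per_page=20):
    """Wrap one page of results in an object carrying pagination context."""
    total = len(all_items)
    # Report at least one page, even for an empty collection.
    pages = max(1, math.ceil(total / per_page))
    start = (page - 1) * per_page
    return {
        "items": all_items[start:start + per_page],
        "total": total,
        "page": page,
        "pages": pages,
    }


movies = [{"name": "Harry Potter", "year": 2000}]
response = list_response(movies)
```

Because the top level is an object, fields such as status or timestamp can be added later without breaking clients that only read items.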
I am pushing data to another system, but before pushing I have to change the keys throughout the JSON. The JSON may contain 200, 10000, or 250000 records.
sample JSON:
{
"insert": "table",
"contacts": [
{
"testName": "testname",
"ContactID": 212121
},
{
"testName": "testname",
"ContactID": 2146354564
},
{
"testName": "testname",
"ContactID": 12312
},
{
"testName": "testname",
"ContactID": 211221
},
{
"testName": "testname",
"ContactID": 10218550
}
]
}
I need to change the keys in the contacts array. The contacts may come in bulk, so I need to do this efficiently, with minimal complexity.
The above JSON should be converted as below:
{
"insert": "table",
"contacts": [
{
"name": "testname",
"phone": 212121
},
{
"name": "testname",
"phone": 2146354564
},
{
"name": "testname",
"phone": 12312
},
{
"name": "testname",
"phone": 211221
},
{
"name": "testname",
"phone": 10218550
}
]
}
Here is my code, trying with a loop:
ini_dict = request.data  # parsed request body
contact_data = ini_dict['contacts']
for i in contact_data:
    # rename each key in place; dict.pop removes the old key and returns its value
    i['name'] = i.pop('testName')
    i['phone'] = i.pop('ContactID')
print(contact_data)
Please suggest how I can change the key names efficiently for bulk data, that is, for 50000 entries in contacts. I am concerned that a for loop will lead to a performance issue, so please let me know an efficient way to achieve this.
I don't know how fast you need it to be, nor how you are choosing to store your JSON. One simple solution is to keep it as a string and then replace all the instances of your keys.
# Something like this, using the raw JSON string.
# str.replace returns a new string, so the result must be assigned back.
jsonstring = jsonstring.replace('"testName":', '"name":')
jsonstring = jsonstring.replace('"ContactID":', '"phone":')
If you want to do this in bulk, you may need to create some batch process to be able to fetch multiple existing records and make changes at once. I have done this before with the Java equivalent of https://pypi.org/project/JayDeBeApi/, but that was more for modifying existing records in a database.
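One caveat with string replacement is that it can also rewrite matching text inside values. As an alternative sketch, assuming the data is already parsed into Python dicts, the keys can be renamed with a mapping and a comprehension; KEY_MAP and rename_contact_keys are names invented for this example. This is still one pass over the list, which is hard to avoid since every contact has to be touched.

```python
KEY_MAP = {"testName": "name", "ContactID": "phone"}


def rename_contact_keys(payload, key_map=KEY_MAP):
    """Return a copy of payload whose contacts use the renamed keys."""
    renamed = dict(payload)
    renamed["contacts"] = [
        # Keys missing from key_map are kept unchanged.
        {key_map.get(key, key): value for key, value in contact.items()}
        for contact in payload["contacts"]
    ]
    return renamed


data = {
    "insert": "table",
    "contacts": [
        {"testName": "testname", "ContactID": 212121},
        {"testName": "testname", "ContactID": 2146354564},
    ],
}
converted = rename_contact_keys(data)
```

The original payload is left untouched, which makes it easy to compare input and output while testing.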
There is an items node in the specification, which says it is for an array of items, such as paging items or a YouTube video list.
What if I have a GET request for a single item? How should the response be formatted?
As just one item in the array?
items:[item]
https://google.github.io/styleguide/jsoncstyleguide.xml
I don't think @tanmay_vijay's answer is correct or nuanced enough, as single-item responses are in arrays in the YouTube example in the docs.
{
"apiVersion": "2.0",
"data": {
"updated": "2010-02-04T19:29:54.001Z",
"totalItems": 6741,
"startIndex": 1,
"itemsPerPage": 1,
"items": [
{
"id": "BGODurRfVv4",
"uploaded": "2009-11-17T20:10:06.000Z",
"updated": "2010-02-04T06:25:57.000Z",
"uploader": "docchat",
"category": "Animals",
"title": "From service dog to SURFice dog",
"description": "Surf dog Ricochets inspirational video ...",
"tags": [
"Surf dog",
"dog surfing",
"dog",
"golden retriever",
],
"thumbnail": {
"default": "https://i.ytimg.com/vi/BGODurRfVv4/default.jpg",
"hqDefault": "https://i.ytimg.com/vi/BGODurRfVv4/hqdefault.jpg"
},
"player": {
"default": "https://www.youtube.com/watch?v=BGODurRfVv4&feature=youtube_gdata",
"mobile": "https://m.youtube.com/details?v=BGODurRfVv4"
},
"content": {
"1": "rtsp://v5.cache6.c.youtube.com/CiILENy73wIaGQn-Vl-0uoNjBBMYDSANFEgGUgZ2aWRlb3MM/0/0/0/video.3gp",
"5": "https://www.youtube.com/v/BGODurRfVv4?f=videos&app=youtube_gdata",
"6": "rtsp://v7.cache7.c.youtube.com/CiILENy73wIaGQn-Vl-0uoNjBBMYESARFEgGUgZ2aWRlb3MM/0/0/0/video.3gp"
},
"duration": 315,
"rating": 4.96,
"ratingCount": 2043,
"viewCount": 1781691,
"favoriteCount": 3363,
"commentCount": 1007,
"commentsAllowed": true
}
]
}
}
It could however be that it depends on the resource being targeted from the request. This is the way it is in the competing JSONAPI standard.
From JSONAPI standard:
A logical collection of resources MUST be represented as an array, even if it only contains one item or is empty.
You don't need an items field to show a single item. If you're sure your API is always going to return a single object, you can return it as data itself.
{
"data": {
"kind": "user",
"fields": "author,id",
"id": "bart",
"author": "Bart"
}
}
Fields such as data.kind data.fields data.etag data.id data.lang data.updated data.deleted can still be used here.
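The two shapes discussed here can be sketched in a few lines of Python. This is only an illustration of the convention described above, with a hypothetical helper named wrap: a collection goes under data.items, while a single object is returned as data itself.

```python
def wrap(result):
    """Build a response envelope for a collection or a single object."""
    if isinstance(result, list):
        # Collection: keep the array under data.items, even for one element.
        return {"data": {"totalItems": len(result), "items": result}}
    # Single object: the object itself is the value of data.
    return {"data": result}


single = wrap({"kind": "user", "id": "bart", "author": "Bart"})
collection = wrap([{"kind": "user", "id": "bart", "author": "Bart"}])
```

Which of the two a given endpoint returns then depends on whether the resource it targets is a collection or a single item.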
Source for snippet docs