Couchbase group by name and collect field values - couchbase

Coming from MongoDB after 5 years, I am now studying/trying Couchbase. I have some test data that I want to group by city, collecting/pushing all main.temp values into a field. In MongoDB this can be done in an aggregation with $push or $addToSet.
test data
{
"city": "a",
"main": {
temp: 1
}
},
{
"city": "a",
"main": {
temp: 2
}
},
{
"city": "b",
"main": {
temp: 3
}
},
{
"city": "b",
"main": {
temp: 4
}
}
I want the result to look like this; if possible, I'd also like to sort the temp array descending/ascending:
{
"city": "a",
"temp": [1, 2],
},
{
"city": "b",
"temp": [4, 3],
},
I tried something using the query editor in the admin GUI, but it doesn't give me the exact result I want:
select city, max(main.temp) as temp
from weather103
group by city

OK, I think I got the answer, using ARRAY_AGG:
select city, array_agg(main.temp) as temp
from weather103
group by city
order by temp desc
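For reference, the group-and-collect step that ARRAY_AGG performs can be sketched client-side in Python against the test data above. Note that ARRAY_AGG itself does not guarantee element order, so the sketch sorts each array explicitly; depending on your Couchbase version you may also be able to sort server-side with an array-sort function.

```python
from collections import defaultdict

# The test data from the question.
docs = [
    {"city": "a", "main": {"temp": 1}},
    {"city": "a", "main": {"temp": 2}},
    {"city": "b", "main": {"temp": 3}},
    {"city": "b", "main": {"temp": 4}},
]

# GROUP BY city, collecting the main.temp values (what ARRAY_AGG does).
groups = defaultdict(list)
for doc in docs:
    groups[doc["city"]].append(doc["main"]["temp"])

# Sort each collected array descending, as requested in the question.
result = [{"city": city, "temp": sorted(temps, reverse=True)}
          for city, temps in sorted(groups.items())]
```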

@Hokutosei, welcome to Couchbase!
Check out these resources and tutorials to get started with queries quickly:
http://query.pub.couchbase.com/tutorial/#1
http://developer.couchbase.com/documentation/server/4.5/getting-started/first-n1ql-query.html
We also have a pretty active forum for questions:
https://forums.couchbase.com/

Related

Bigquery Get json key name

I have a BigQuery table containing a column that holds a JSON string. Within the JSON there may be a key called "person", "corp", or "sme". I want to run a query that returns which of the possible keys exists in the JSON and store it in a new column.
Below is the data from a column 'class', each row being one long JSON string in BQ. The first-level key name can be 'corp', 'sme', or 'person' (see examples below).
Example 1
{
"corp": {
"address": {
"city": "London",
"countryCode": "gb",
"streetAddress": [
"Fairlop road"
],
"zip": "e111bn"
},
"cin": 1234567420,
"title": "Demo Corp"
}
}
Example 2
{
"person": {
"address": {
"city": "Madrid",
"countryCode": "es",
"streetAddress": [
"Some street 1"
],
"zip": "z1123ab"
},
"cin": 1234567411,
"title": "Demo Person"
}
}
I've tried using the json_xxx functions, but they require specifying the json_path. I'm interested in fetching the json_path name itself, to create a new column (cust_type) that lists corp, sme, or person for each row.
example
cust_type
1
corp
2
person
This is my first question, so please bear with me! Thanks!
Also, you can use a JavaScript UDF to extract the first-level keys, whatever they are.
CREATE TEMP FUNCTION json_keys(input STRING) RETURNS ARRAY<STRING> LANGUAGE js AS """
return Object.keys(JSON.parse(input))
""";
SELECT json_keys(json_text) AS cust_type
FROM UNNEST([
'{"corp": {"address": {"city": "London","countryCode": "gb","streetAddress": ["Fairlop road"],"zip": "e111bn"},"cin": 1234567420,"title": "Demo Corp"}}',
'{"person": {"address": {"city": "Madrid","countryCode": "es","streetAddress": ["Some street 1"],"zip": "z1123ab"},"cin": 1234567411,"title": "Demo Person"}}'
]) AS json_text;
The output is one array per row: ["corp"] for the first string and ["person"] for the second.
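Outside BigQuery, the UDF's logic is easy to prototype; a minimal Python sketch of the same first-level-key extraction (the shortened sample strings are illustrative):

```python
import json

def json_keys(input_str: str) -> list:
    # Mirrors the BigQuery JS UDF: parse the string and return the
    # top-level key names.
    return list(json.loads(input_str).keys())

corp_row = '{"corp": {"cin": 1234567420, "title": "Demo Corp"}}'
person_row = '{"person": {"cin": 1234567411, "title": "Demo Person"}}'
```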
Alternatively, we can use the JSON_EXTRACT function and check whether each field exists (is not null). An example test might be:
SELECT CASE
WHEN JSON_EXTRACT(json_text, '$.corp') is not null then 'corp'
WHEN JSON_EXTRACT(json_text, '$.person') is not null then 'person'
WHEN JSON_EXTRACT(json_text, '$.sme') is not null then 'sme'
END AS cust_type
FROM UNNEST([
'{"corp": {"address": {"city": "London","countryCode": "gb","streetAddress": ["Fairlop road"],"zip": "e111bn"},"cin": 1234567420,"title": "Demo Corp"}}',
'{"person": {"address": {"city": "Madrid","countryCode": "es","streetAddress": ["Some street 1"],"zip": "z1123ab"},"cin": 1234567411,"title": "Demo Person"}}'
]) AS json_text;

Select part of Json array based on a value from another column in PostgreSQL

I have a table where I store country information in a column and json data in another column.
I'd like to select only part of the JSON data: basically, find the country value inside the JSON and return the animal values from the "animals" key that is closest to (and on the left side of) the matched country.
This is the table "myanimals":
Country
Metadata
US
{ "a": 1, "b": 2, "animals": ["dog","cat","mouse"], "region": {"country": "china"}, "animals": ["horse","bear","eagle"], "region": { "country": "us" } }
India
{ "a": 20, "b": 40, "animals": ["fish","cat","rat","hamster"], "region": {"country": "india"}, "animals": ["dog","rabbit","fox","fish"], "region": { "country": "poland" } }
Metadata is in json and NOT jsonb.
Using Postgres, I want to query so I end up with a new column, something like "animals_in_country", where the only information shown is the values from the "animals" key closest to (and located to the left of) the matched country, as follows:
Country
Metadata
animals_in_country
US
{ "a": 1, "b": 2, "animals": ["dog","cat","mouse"], "region": {"country": "china"}, "animals": ["horse","bear","eagle"], "region": { "country": "us" } }
["horse","bear","eagle"]
India
{ "a": 20, "b": 40, "animals": ["fish","cat","rat","hamster"], "region": {"country": "india"}, "animals": ["dog","rabbit","fox","fish"], "region": { "country": "poland" } }
["fish","cat","rat","hamster"]
Here's some pseudo code of what I am trying to achieve (please refer to the table shown above)
- Take the value in "Country", "US", and find the location of the same value in the JSON column
- location found, now search before this key 'country' for the key 'animals'
- Return whole array of values from 'animals'
- should be ["horse","bear","eagle"]
- shouldn't be ["dog","cat","mouse"] (as this one is part of "china" country in the JSON)
NOTE: Although this is dummy data, this is more or less the issue I am solving. And yes, the JSON is showing the same key more than once.
In case you are only looking for the first animal in the array, this is your answer (note the column in your table is called Metadata, not data):
select a.metadata->'animals'->0 as first_animal
from myanimals a
where a.metadata->'region'->>'country' = 'us'
or, without the filter (for all countries):
select a.metadata->'region'->>'country' as country,
       a.metadata->'animals'->0 as first_animal
from myanimals a
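That sidesteps the harder part of the question, though: the duplicated keys. Both Python's json.loads and Postgres jsonb keep only one value per repeated key, so one way to prototype the "nearest animals array to the left of the matched country" rule is a raw-text scan. A minimal Python sketch (the function name and the no-nested-brackets assumption are mine):

```python
import json

def animals_for_country(metadata: str, country: str):
    # json.loads would keep only one value per duplicated key, so scan
    # the raw text instead: locate the country, then take the nearest
    # "animals" key to its left. Assumes the animals arrays contain no
    # nested brackets.
    target = metadata.find('"{}"'.format(country.lower()))
    if target == -1:
        return None
    start = metadata.rfind('"animals"', 0, target)
    if start == -1:
        return None
    bracket = metadata.index('[', start)
    end = metadata.index(']', bracket)
    return json.loads(metadata[bracket:end + 1])

# The two rows from the question.
us_metadata = ('{ "a": 1, "b": 2, "animals": ["dog","cat","mouse"], '
               '"region": {"country": "china"}, '
               '"animals": ["horse","bear","eagle"], '
               '"region": { "country": "us" } }')
india_metadata = ('{ "a": 20, "b": 40, "animals": ["fish","cat","rat","hamster"], '
                  '"region": {"country": "india"}, '
                  '"animals": ["dog","rabbit","fox","fish"], '
                  '"region": { "country": "poland" } }')
```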

reshape jq nested file and make csv

I've been struggling with this one for the whole day: a JSON file which I want to turn into a CSV.
It represents the officers attached to the company whose number is "OC418979" in the UK Companies House API.
I've already truncated the JSON to contain just 2 objects inside "items".
What I would like to get is a csv like this
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
...
There are 2 extra complications: there are two types of "officers" (some are people, some are companies), so not all keys present in one type are present in the other and vice versa; I'd like those entries to be null. The second complication is nested objects like "name", which contains a comma in it, or "address", which contains several sub-objects (which I guess I could flatten in pandas, though).
{
"total_results": 13,
"resigned_count": 9,
"links": {
"self": "/company/OC418979/officers"
},
"items_per_page": 35,
"etag": "bc7955679916b089445c9dfb4bc597aa0daaf17d",
"kind": "officer-list",
"active_count": 4,
"inactive_count": 0,
"start_index": 0,
"items": [
{
"officer_role": "llp-designated-member",
"name": "BARRICK, David James",
"date_of_birth": {
"year": 1984,
"month": 1
},
"appointed_on": "2017-09-15",
"country_of_residence": "England",
"address": {
"country": "United Kingdom",
"address_line_1": "Old Gloucester Street",
"locality": "London",
"premises": "27",
"postal_code": "WC1N 3AX"
},
"links": {
"officer": {
"appointments": "/officers/d_PT9xVxze6rpzYwkN_6b7og9-k/appointments"
}
}
},
{
"links": {
"officer": {
"appointments": "/officers/M2Ndc7ZjpyrjzCXdFZyFsykJn-U/appointments"
}
},
"address": {
"locality": "Tadcaster",
"country": "United Kingdom",
"address_line_1": "Westgate",
"postal_code": "LS24 9AB",
"premises": "5a"
},
"identification": {
"legal_authority": "UK",
"identification_type": "non-eea",
"legal_form": "UK"
},
"name": "PREMIER DRIVER LIMITED",
"officer_role": "corporate-llp-designated-member",
"appointed_on": "2017-09-15"
}
]
}
What I've been doing is creating new JSON objects, extracting the fields I needed, like this:
{officer_address:.items[]?.address, appointed_on:.items[]?.appointed_on, country_of_residence:.items[]?.country_of_residence, officer_role:.items[]?.officer_role, officer_dob:.items[]?.date_of_birth, officer_nationality:.items[]?.nationality, officer_occupation:.items[]?.occupation}
But the query runs for hours, and I am sure there is a quicker way.
Right now I am trying this new approach - creating a json whose root is the company number and as argument a list of its officers.
{(.links.self | split("/")[2]): .items[]}
Using jq, it's easiest to extract the shared values from the top-level object once and then generate the desired rows. Each .items[] in your object constructor iterates the array independently, producing a cartesian product, so you want to go through the items at most once:
$ jq -r '(.links.self | split("/")[2]) as $companyCode
| .items[]
| [ $companyCode, .country_of_residence, .officer_role, .appointed_on ]
| @csv
' input.json
OK, you want to scan the list of officers, extract some fields from each one if they are present, and write that out in CSV format.
The first part is extracting the data from the JSON. Assuming you have loaded it into a Python object data, you have:
print(data['items'][0]['officer_role'], data['items'][0]['appointed_on'],
data['items'][0]['country_of_residence'])
gives:
llp-designated-member 2017-09-15 England
Time to put everything together with the csv module:
import csv
...
with open('output.csv', 'w', newline='') as fd:
    wr = csv.writer(fd)
    for officer in data['items']:
        _ = wr.writerow(('OC418979',
                         officer.get('country_of_residence', ''),
                         officer.get('officer_role', ''),
                         officer.get('appointed_on', '')))
The get method on a dictionary lets you supply a default value (here the empty string) when the key is not present, and the csv module ensures that any field containing a comma is enclosed in quotation marks.
With your example input, it gives:
OC418979,England,llp-designated-member,2017-09-15
OC418979,,corporate-llp-designated-member,2017-09-15

Retrieve a JSON object from JSON array using Cloudant

I am doing an API call every 40 mins to retrieve the current status information of every car in a car fleet. And each call adds one new JSON document to a Cloudant database. Each JSON document defines the current availability status for every car across many locations in many cities. There are currently around 2200 JSON documents in the database. All JSON documents have one field called payload that contains all information; it is a large array of objects. Instead of retrieving the whole payload array of objects I would like to retrieve only the needed info with a query (so, only one or several objects of that array). However, I have difficulty drafting a query that results only in the needed data.
Below, I'll explain my problem in more detail:
When saving the JSON document to Cloudant, a timestamp is defined in the document. The _id parameter is defined to be equal to this timestamp. Below, I show a simplified version of these JSON documents:
{
"_id": "1540914946026",
"_rev": "3-c1834c8a230cf772e41bbcb9cf6b682e",
"timestamp": 1540914946026,
"datetime": "2018-10-30 15:55:46",
"payload": [
{
"cityName": "Abcoude",
"locations": [
{
"address": "asterlaan 28",
"geoPoint": {
"latitude": 52.27312,
"longitude": 4.96768
},
"cars": [
{
"mod": "BMW",
"state": "FREE"
}
]
}
],
"availableCars": 1,
"occupiedCars": 0
},
{
"cityName": "Alkmaar",
"locations": [
{
"address": "Aert de Gelderlaan 14",
"geoPoint": {
"latitude": 52.63131,
"longitude": 4.72329
},
"cars": [
{
"model": "Volswagen",
"state": "FREE"
}
]
},
{
"address": "Ardennenstraat 49",
"geoPoint": {
"latitude": 52.66721,
"longitude": 4.76046
},
"cars": [
{
"mod": "BMW",
"state": "FREE"
}
]
},
{
"address": "Beneluxplein 7",
"geoPoint": {
"latitude": 52.65356,
"longitude": 4.75817
},
"cars": [
{
"mod": "BMW",
"state": "FREE"
}
]
},
{
"address": "Dr. Schaepmankade 1",
"geoPoint": {
"latitude": 52.62595,
"longitude": 4.75122
},
"cars": [
{
"mod": "BMW",
"state": "OCCUPIED"
}
]
},
{
"address": "Kennemerstraatweg",
"geoPoint": {
"latitude": 52.62909,
"longitude": 4.74226
},
"cars": [
{
"model": "Mercedes",
"state": "FREE"
}
]
},
{
"address": "NS Station Alkmaar Noord/Parkeerterrein Noord",
"geoPoint": {
"latitude": 52.64366,
"longitude": 4.7627
},
"cars": [
{
"model": "Tesla",
"state": "FREE"
}
]
},
{
"address": "NS Station Alkmaar/Stationsweg 56",
"geoPoint": {
"latitude": 52.6371,
"longitude": 4.73935
},
"cars": [
{
"model": "Tesla",
"state": "FREE"
}
]
},
{
"address": "Oude Hoeverweg",
"geoPoint": {
"latitude": 52.63943,
"longitude": 4.72928
},
"cars": [
{
"model": "Tesla",
"state": "FREE"
}
]
},
{
"address": "Parkeerterrein Wortelsteeg",
"geoPoint": {
"latitude": 52.63048,
"longitude": 4.75487
},
"cars": [
{
"model": "Tesla",
"state": "OCCUPIED"
}
]
},
{
"address": "Schoklandstraat 38",
"geoPoint": {
"latitude": 52.65812,
"longitude": 4.75359
},
"cars": [
{
"model": "Volkswagen",
"state": "FREE"
}
]
}
],
"availableCars": 8,
"occupiedCars": 2
}
]
}
As you can see, the payload field is an array that has several objects (FYI: every object in this array represents one specific city: there are 1600 cities, so 1600 nested objects inside the payload array). Furthermore, inside each of the 1600 objects mentioned, other arrays and objects are again nested inside. For all objects in the payload array, the first field is cityName.
Furthermore, there is a nested array locations (inside each of the 1600 objects of the payload array) representing all addresses in a specific city. The locations array can be of size 1 to 600, meaning 1 to 600 nested objects / addresses per city. The last two fields in all objects of the payload array are availableCars and occupiedCars.
I want query documents to see how many cars are available and occupied for a specific city during a specific time interval. To do this:
I have to specify a start timestamp (or id) and an end timestamp, resulting in only the JSON documents within this interval.
Furthermore, I will need to specify inside the JSON documents only one or more specific cities by cityName (there are 1600 cities) and then get the number of available cars availableCars and the number of occupiedCars for those cities.
For example, in this simplified example, I would like to query the status information (availableCars and occupiedCars) for the city of Alkmaar from 1540914946026 (epoch time) until now, and get the following result:
{
"id":"1540914946026",
"cityName":"Alkmaar",
"availableCars":8,
"occupiedCars":2
}
This is just an example, in reality, I want to be able to query for other cities as well, or query for several cities together and then get for each of those cities the number of available cars availableCars and the number of occupied cars occupiedCars.
Could anyone help me to define a query and index to be able to get the above result? Can I do this with cloudant query?
Your data model does not play to Cloudant's strengths. Let each document group data that changes and is accessed together. Your items in your payload array would be much better stored as discrete documents.
If you find yourself reaching into growing arrays inside documents for subsets of data, this is a warning sign that your data model is not ideal: the document is now mutable and growing (with potential update conflicts as a result), and access becomes more cumbersome over time as Cloudant has no mechanism to only retrieve parts of a document. Moreover, Cloudant has a limit (1M) on document size, so by using your proposed model, you will likely hit that limit, too, and your application would stop working.
With that said, it is possible to create a view index that lets you emit each component of your payload, which would let you look up data per city -- but that solution is still subject to all the limitations above (document model is mutable, documents grow large etc).
Rule of thumb: small documents; an immutable model where possible; documents should group data that either changes together or is accessed as a unit.
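To make the suggested remodel concrete, here is a sketch in Python of an ingest step that splits each snapshot into small per-city documents before writing them to Cloudant (the _id scheme and function name are my own illustration, not a Cloudant API):

```python
def split_payload(doc):
    # Emit one small, immutable document per (snapshot, city) pair
    # instead of one monolithic payload array. Querying availableCars/
    # occupiedCars by city and time range then becomes a simple index
    # lookup over small documents.
    for city in doc["payload"]:
        yield {
            "_id": "{}:{}".format(doc["timestamp"], city["cityName"]),
            "timestamp": doc["timestamp"],
            "cityName": city["cityName"],
            "availableCars": city["availableCars"],
            "occupiedCars": city["occupiedCars"],
        }

# A minimal snapshot shaped like the sample document above.
snapshot = {
    "timestamp": 1540914946026,
    "payload": [
        {"cityName": "Abcoude", "availableCars": 1, "occupiedCars": 0},
        {"cityName": "Alkmaar", "availableCars": 8, "occupiedCars": 2},
    ],
}
docs = list(split_payload(snapshot))
```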

BQ: How to UNNEST into new table

I'm exporting billing data from Google Cloud Platform to BigQuery (BQ).
The task at hand is to build a query that UNNESTs the relevant data into a new 'flat' table.
The structure of the data in BQ is this:
[{
"billing_account_id": "01234-1778EC-123456",
"service": {
"id": "2062-016F-44A2",
"description": "Maps"
},
"sku": {
"id": "5D8F-0D17-AAA2",
"description": "Google Maps"
},
"usage_start_time": "2018-11-05 14:45:00 UTC",
"usage_end_time": "2018-11-05 15:00:00 UTC",
"project": {
"id": null,
"name": null,
"labels": []
},
"labels": [],
"system_labels": [],
"location": null,
"export_time": "2018-11-05 21:54:09.779 UTC",
"cost": "5.0",
"currency": "EUR",
"currency_conversion_rate": "0.87860000000017424",
"usage": {
"amount": "900.0",
"unit": "seconds",
"amount_in_pricing_units": "0.00034674063800277393",
"pricing_unit": "month"
},
"credits": "-1.25",
"invoice": {
"month": "201811"
}
},
I wish to schedule a job that builds a new table every day with just this schema
billing_account_id, usage_start_time, usage_end_time, cost, credit_amount
So far I'm at this:
select billing_account_id, usage_start_time, usage_end_time, cost, credits AS CREDITS from clientBilling.gcp_billing_export_v1_XXXX , UNNEST(credits);
But in the results credits are still nested and not 'flat' as I need. Any input is welcome, thanks! :)
credits is an array of structs (each struct holding a name and an amount), a "repeated" record in BigQuery, so you have to first unnest the array and then reference the struct member you want.
Thus:
UNNEST the credits record
Alias the credits.amount struct member as credit_amount
SELECT
billing_account_id,
usage_start_time,
usage_end_time,
cost,
credit.amount as credit_amount
FROM
`optimum-rock-145719.billing_export.gcp_billing_export_v1*`,
UNNEST(credits) as credit
This will return a result table with the credits.amount column as credit_amount. You were doing step 1 but not step 2, and were ignoring the unnested field in your SELECT clause.
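For intuition, the implicit cross join with UNNEST behaves like the nested loop below, sketched in Python (the sample row and credit values are illustrative, not real billing data):

```python
# One billing row with a repeated `credits` field (array of structs).
rows = [
    {"billing_account_id": "01234-1778EC-123456",
     "usage_start_time": "2018-11-05 14:45:00 UTC",
     "usage_end_time": "2018-11-05 15:00:00 UTC",
     "cost": 5.0,
     "credits": [{"name": "promo", "amount": -1.25}]},
]

# FROM table, UNNEST(credits) AS credit: pair each row with each
# element of its repeated field, then project credit.amount.
flat = [
    {"billing_account_id": r["billing_account_id"],
     "usage_start_time": r["usage_start_time"],
     "usage_end_time": r["usage_end_time"],
     "cost": r["cost"],
     "credit_amount": c["amount"]}
    for r in rows
    for c in r["credits"]
]
```

A row with an empty credits array would simply produce no output rows here, which matches the inner-join behavior of a bare UNNEST cross join.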