I would like to build a table showing the changes in stock prices over 1M, 3M, 6M, etc. The API call for a specific day in the past returns the following JSON:
{
  "date": "2018-01-02",
  "data": {
    "AAPL": {
      "open": "170.16",
      "close": "172.26",
      "high": "172.30",
      "low": "169.26",
      "volume": "25555934"
    },
    "MSFT": {
      "open": "86.13",
      "close": "85.95",
      "high": "86.31",
      "low": "85.50",
      "volume": "22483797"
    }
  }
}
I have built a for loop in Java that goes over the required dates using Calendar and makes the API calls for those dates and a list of stock symbols. I am able to deserialize the JSON using the following code:
JsonParser jsonParser = new JsonParser();
JsonObject jsonObject = (JsonObject) jsonParser.parse(result);
JsonElement jsonElement = jsonObject.get("data");
Set<Map.Entry<String, JsonElement>> entrySet = jsonElement.getAsJsonObject().entrySet();
entrySet.parallelStream().forEach(entry -> {
    Stock stk = new Stock();
    stk.setSymbol(entry.getKey());
    stk.setClose(entry.getValue().getAsJsonObject().get("close").getAsFloat());
    stk.setDate(date.getKey()); // date comes from the enclosing Calendar loop
    // ... persisting stk omitted ...
});
The problem is that with this code I can only save data in my database that is unique by date. I wish to save it so that it is unique by stock symbol. Presumably I need to collect the dates in a map object as a property of the POJO. Unfortunately, I have not been able to make it work. Any suggestions would be very much appreciated.
Create a flat table in your database, with columns for the date, the symbol (such as AAPL), and the five numbers. So seven columns altogether.
This strategy assumes you have no additional data to store per date or per stock. (If you did have additional data per date or stock, you would need multiple tables, with parent tables for the date or the stock.)
Personally, I would slap on another column with a generated UUID as a primary key; I believe in always using a surrogate key. But strictly speaking that is not necessary if your database supports a multi-column combination as a natural key: you could combine the date and the symbol together as a primary key to uniquely identify each row.
CREATE TABLE daily_high_low_ (
date_of_closing_ DATE ,
symbol_ VARCHAR( 4 ) ,
open_ NUMERIC( 12 , 2 ) ,
close_ NUMERIC( 12 , 2 ) ,
high_ NUMERIC( 12 , 2 ) ,
low_ NUMERIC( 12 , 2 ) ,
volume_ INTEGER ,
PRIMARY KEY ( date_of_closing_ , symbol_ )
)
I am purposely keeping this overly simplistic. In real life you would have to worry about stock ticker symbols not being unique across stock exchanges. I have also hard-coded the numeric precision and scale for US dollar-and-cent amounts, which may not be appropriate for all markets. And we are ignoring the fact that the date varies across time zones around the globe.
As you process your JSON, flatten the data to fit this flat table.
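Here is a minimal sketch of that flattening step, reusing the Gson parsing from the question. The class and method names are mine, and an open JDBC Connection is assumed:

import com.google.gson.JsonElement;
import com.google.gson.JsonObject;
import com.google.gson.JsonParser;

import java.math.BigDecimal;
import java.sql.Connection;
import java.sql.Date;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Map;

public class DailyQuoteLoader {

    // Flattens one day's JSON payload into rows of daily_high_low_.
    // `result` is the raw JSON shown in the question; `date` is the day queried.
    public static void insertDay ( Connection conn , Date date , String result )
            throws SQLException {
        String sql = "INSERT INTO daily_high_low_ "
                + "( date_of_closing_ , symbol_ , open_ , close_ , high_ , low_ , volume_ ) "
                + "VALUES ( ? , ? , ? , ? , ? , ? , ? )";
        JsonObject data = new JsonParser().parse( result )
                .getAsJsonObject()
                .getAsJsonObject( "data" );
        try ( PreparedStatement ps = conn.prepareStatement( sql ) ) {
            for ( Map.Entry<String, JsonElement> entry : data.entrySet() ) {
                JsonObject quote = entry.getValue().getAsJsonObject();
                ps.setDate( 1 , date );                 // one row per date + symbol
                ps.setString( 2 , entry.getKey() );     // the symbol, e.g. "AAPL"
                ps.setBigDecimal( 3 , new BigDecimal( quote.get( "open" ).getAsString() ) );
                ps.setBigDecimal( 4 , new BigDecimal( quote.get( "close" ).getAsString() ) );
                ps.setBigDecimal( 5 , new BigDecimal( quote.get( "high" ).getAsString() ) );
                ps.setBigDecimal( 6 , new BigDecimal( quote.get( "low" ).getAsString() ) );
                ps.setInt( 7 , quote.get( "volume" ).getAsInt() );
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}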
For retrieval, make a class to match this database table.
public class HighLow {
public LocalDate dateOfClosing ;
public String symbol ;
public BigDecimal open, close, high, low ;
public Integer volume ;
}
Load from the database into a data structure that meets the needs of your data analysis. Perhaps hard-coded maps for each time period: months1, months3, months6, and so on, each mapping the symbol to a list of HighLow objects.
To save memory, share the same objects across the lists. For example, the list in the months3 map has the same HighLow objects as the list in the months1 map, plus two more months' worth of HighLow objects.
Map< String , List< HighLow > > months1 = new TreeMap<>() ;
Map< String , List< HighLow > > months3 = new TreeMap<>() ;
Map< String , List< HighLow > > months6 = new TreeMap<>() ;
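A rough sketch of how that sharing might be populated for one symbol (queryRange is a hypothetical stand-in for your actual database read, returning rows for the last six months):

// Hypothetical helper: loads all HighLow rows for `symbol` since the given date.
List< HighLow > rows = queryRange( symbol , LocalDate.now().minusMonths( 6 ) ) ;

List< HighLow > recent1 = new ArrayList<>() ;
List< HighLow > recent3 = new ArrayList<>() ;
List< HighLow > recent6 = new ArrayList<>() ;
LocalDate cutoff1 = LocalDate.now().minusMonths( 1 ) ;
LocalDate cutoff3 = LocalDate.now().minusMonths( 3 ) ;
for ( HighLow hl : rows ) {
    recent6.add( hl ) ;  // every row belongs to the six-month list
    if ( hl.dateOfClosing.isAfter( cutoff3 ) ) { recent3.add( hl ) ; }
    if ( hl.dateOfClosing.isAfter( cutoff1 ) ) { recent1.add( hl ) ; }
}
months1.put( symbol , recent1 ) ;  // the three lists share the same HighLow objects
months3.put( symbol , recent3 ) ;
months6.put( symbol , recent6 ) ;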
I need to find duplicate IPs in a JSON array by comparing their date fields and remove the entries with the older dates.
Ex:
[
  {
    "IP": "10.0.0.20",
    "Date": "2019-09-14T20:00:11.543-03:00"
  },
  {
    "IP": "10.0.0.10",
    "Date": "2019-09-17T15:45:16.943-03:00"
  },
  {
    "IP": "10.0.0.10",
    "Date": "2019-09-18T15:45:16.943-03:00"
  }
]
The output of the operation needs to look like this:
[
  {
    "IP": "10.0.0.20",
    "Date": "2019-09-14T20:00:11.543-03:00"
  },
  {
    "IP": "10.0.0.10",
    "Date": "2019-09-18T15:45:16.943-03:00"
  }
]
For simplicity's sake, I'll assume the order of the data doesn't matter.
First, if your data isn't already in Python, you can use json.load or json.loads to convert it into a Python object, following the straightforward type mappings.
Then your problem has three parts: comparing date strings as dates, finding the maximum element of a list by that date, and performing this process for each distinct IP address. For these purposes, you can use two of Python's built-in functions and two from the standard library.
Python's built-in max and sorted functions (as well as list.sort) support a (keyword-only) key argument, which uses a function to determine the value to compare by. For example, max(d1, d2, key=lambda x: x[0]) compares the data by the first element of each (like d1[0] < d2[0]) and returns whichever of d1 and d2 produced the larger key.
To allow that type of comparison between dates, you can use the datetime.datetime class. If your dates are all in the format specified by datetime.datetime.fromisoformat, you can use that function to turn your date strings into datetimes, which can then be compared to each other. Using that in a function that extracts the dates from the dictionaries gives you the key function you need.
import datetime

def extract_date(item):
    return datetime.datetime.fromisoformat(item['Date'])
Those functions allow you to choose the object from the list with the largest date, but not to keep separate values for different IP addresses. To do that, you can use itertools.groupby, which takes a key function and puts the elements of the input into separate outputs based on that key. However, there are two things you might need to watch out for with groupby:
It only groups elements that are next to each other. For example, if you give it [3, 3, 2, 2, 3], it will group two 3s, then two 2s, then one 3, rather than grouping all three 3s together. Sorting the input by the same key function first avoids this.
It returns an iterator of key, iterator pairs, so you have to collect the results yourself. The best way to do that may depend on your application, but a basic approach is nested iterations:
for key, values in groupby(data, key_function):
    for value in values:
        print(key, value)
With the functions I've mentioned above, it should be relatively straightforward to assemble an answer to your problem.
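For instance, here is one minimal way the pieces might fit together, as a sketch; the sample data is inlined, and sorting by IP first satisfies groupby's adjacency requirement:

import datetime
import json
from itertools import groupby

raw = '''[
  {"IP": "10.0.0.20", "Date": "2019-09-14T20:00:11.543-03:00"},
  {"IP": "10.0.0.10", "Date": "2019-09-17T15:45:16.943-03:00"},
  {"IP": "10.0.0.10", "Date": "2019-09-18T15:45:16.943-03:00"}
]'''

def extract_ip(item):
    return item['IP']

def extract_date(item):
    return datetime.datetime.fromisoformat(item['Date'])

data = json.loads(raw)
data.sort(key=extract_ip)  # groupby only groups adjacent elements
# For each IP, keep only the entry with the latest date.
latest = [max(group, key=extract_date) for _, group in groupby(data, extract_ip)]
print(json.dumps(latest, indent=2))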
I have a bunch of documents with a structure like this:
{
  "month": 11,
  "year": 2017,
  //other fields
  "Cars": [
    {
      "CarId": 123,
      // other fields
    },
    {
      "CarId": 456,
      // other fields
    }
    // other cars
  ]
}
I am searching for a concrete car instance with id = 456. So far I have:
SELECT Cars
FROM DevBucket
WHERE year = 2017
AND month = 11
AND [CarId=456]
Couchbase returns the correct document (which contains the target car). However, the output includes an array of all Car nodes within the document; I'd like to get a single car instead (as if I had used SELECT Cars[1] in the example above).
Searching through Couchbase tutorials didn't give me an answer. Is there a better way?
Using the UNNEST clause you can perform "a join of the nested array with its parent object." This will produce an object for each nested element that includes the nested element as a top-level field, along with the rest of the original document (nested elements, and all).
This query will retrieve the car with an id of 456 when its parent object has a month and year of 11/2017.
SELECT car
FROM DevBucket db
UNNEST Cars car
WHERE car.CarId = 456
AND db.year = 2017
AND db.month = 11;
Create this index for a quicker lookup than what you'll get with a Primary Index:
CREATE INDEX cars_index
ON DevBucket(DISTINCT ARRAY car.CarId FOR car IN Cars END);
For more information on UNNEST, see NEST and UNNEST: Normalizing and Denormalizing JSON on the Fly.
I am trying to build a table in Hive for the following JSON:
{
  "business_id": "vcNAWiLM4dR7D2nwwJ7nCA",
  "hours": {
    "Tuesday": {
      "close": "17:00",
      "open": "08:00"
    },
    "Friday": {
      "close": "17:00",
      "open": "08:00"
    }
  },
  "open": true,
  "categories": [
    "Doctors",
    "Health & Medical"
  ],
  "review_count": 9,
  "name": "Eric Goldberg, MD",
  "neighborhoods": [],
  "attributes": {
    "By Appointment Only": true,
    "Accepts Credit Cards": true,
    "Good For Groups": 1
  },
  "type": "business"
}
I can create a table using the following DDL; however, I get an exception while querying that table.
CREATE TABLE IF NOT EXISTS business (
  business_id string,
  hours map<string,string>,
  open boolean,
  categories array<string>,
  review_count int,
  name string,
  neighborhoods array<string>,
  attributes map<string,string>,
  type string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.JsonSerde';
The exception while retrieving data is "ClassCast: can't cast JsonArray to JsonObject". What is the correct schema for this JSON? Is there any tool which can help me generate the correct schema for a given JSON to be used with the JSON SerDe?
It looks to me like the problem is hours, which you defined as map<string,string> but which should be a map<string,map<string,string>> instead.
There's a tool you can use to generate the hive table definition automatically from your JSON data: https://github.com/quux00/hive-json-schema
but you may want to adjust its output, because when encountering a JSON object (anything between {}) the tool can't know whether to translate it to a Hive map or to a struct.
On your data, the tool gives me this:
CREATE TABLE x (
  attributes struct<accepts credit cards:boolean,
                    by appointment only:boolean,
                    good for groups:int>,
  business_id string,
  categories array<string>,
  hours map<string,struct<close:string, open:string>>,
  name string,
  neighborhoods array<string>,
  open boolean,
  review_count int,
  type string
)
but it looks like you want something like this:
CREATE TABLE x (
  attributes map<string,string>,
  business_id string,
  categories array<string>,
  hours map<string,struct<close:string, open:string>>,
  name string,
  neighborhoods array<string>,
  open boolean,
  review_count int,
  type string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
STORED AS TEXTFILE;
hive> load data local inpath 'json.data' overwrite into table x;
hive> Table default.x stats: [numFiles=1, numRows=0, totalSize=416,rawDataSize=0]
OK
hive> select * from x;
OK
{"accepts credit cards":"true","by appointment only":"true",
"good for groups":"1"}
vcNAWiLM4dR7D2nwwJ7nCA
["Doctors","Health & Medical"]
{"tuesday":{"close":"17:00","open":"08:00"},
"friday":{"close":"17:00","open":"08:00"}}
Eric Goldberg, MD ["HELLO"] true 9 business
Time taken: 0.335 seconds, Fetched: 1 row(s)
hive>
A few notes though:
Notice I used a different JSON SerDe, because I don't have the one you used on my system. I used this one; I like it better because, well, I wrote it. But the CREATE statement should work just as well with the other SerDe.
You may want to convert some of those maps to structs, as they may be more convenient to query. For instance, attributes could be a struct, but you'd need to map the names with a space in them, like accepts credit cards. My SerDe allows mapping a JSON attribute to a different Hive column name. That is also needed when the JSON uses an attribute that is a Hive keyword, like 'timestamp' or 'create'.
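As a rough sketch of that remapping, using the mapping.* serde property from this SerDe's documentation (the table and column names here are illustrative):

-- Illustrative table: the Hive column ts receives the JSON attribute
-- "timestamp", a reserved word that could not be used as a column name directly.
CREATE TABLE event (
  ts string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES ( "mapping.ts" = "timestamp" )
STORED AS TEXTFILE;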
Let's say I have the following documents
Document 1
{
  "companyId": "1",
  "salesDate": "1425254400000" // this is UTC time as a long
}
Document 2
{
  "companyId": "1",
  "salesDate": "1425340800000" // this is UTC time as a long
}
Document 3
{
  "companyId": "2",
  "salesDate": "1425254400000" // this is UTC time as a long
}
I currently have my view set up as
function(doc, meta) { emit([doc.salesDate, doc.companyId], doc); }
Which is pulling back all 3 documents when using
?startkey=[1425254400000,"1"]&endkey=[1425340800000,"1"]
I'm not sure how to make it only pull back the sales for that date range by company id.
The sql version would be SELECT * FROM sales WHERE companyId = :companyId AND salesDate BETWEEN :rangeStart AND :rangeEnd
EDIT: I'm using the rest API.
When designing views for range queries with multiple query fields, the fixed query field (companyId) should be a prefix of the compound index, and the range query field should come at the end. With the current view, Couchbase will emit every document whose salesDate is within the range, without considering companyId.
Reversing the order of keys will work:
function(doc, meta) {
emit([doc.companyId, doc.salesDate], doc);
}
Query:
?startkey=["1", 1425254400000]&endkey=["1", 1425340800000]
N.B. if salesDate is a string and not a numeric value, Couchbase will use lexicographic ordering to perform the range query.
Given a JSON document on Couchbase, for example from a milestone collection, similar to this:
{
  "milestoneDate": /Date(1335191824495+0100)/,
  "companyId": 43,
  "ownerUserId": 475,
  "participants": [
    {
      "userId": 2,
      "docId": "132546"
    },
    {
      "userId": 67,
      "docId": "153"
    }
  ]
}
If I were to select all the milestones of company 43, ordered latest first, my view on Couchbase would be something similar to this:
function (doc, meta) {
    if (doc.companyId && doc.milestoneDate) {
        // key made up of date particles + company id
        var eventKey = dateToArray(new Date(parseInt(doc.milestoneDate.substr(6))));
        eventKey.push(doc.companyId);
        emit(eventKey, null);
    }
}
I do get both the dates and the company id in the REST URLs. However, being quite new to Couchbase, I am unable to work out how to restrict the view to return only the milestones of company 43.
The return key is similar to this:
"key":[2013,6,19,16,11,25,14]
where the last element (14) is the company id.. which is quite obviously wrong.
The query parameters that I have tried are:
&descending=true&startkey=[{},43]
&descending=true&startkey=[{},43]&endKey=[{},43]
tried adding companyId to value but couldn't restrict return results by value.
And according to the Couchbase documentation I need the date parts at the beginning to sort by them. How do I restrict by company id now, please?
thanks.
Put the company id at the start of the array. Because you'll be limiting by company id, Couchbase sorts by company id first and then by the date array, so you will only ever get the one company's milestone documents.
I'd modify the view to emit the company id first (dropping the eventKey.push(doc.companyId) line, so that eventKey is just the date array):
emit([doc.companyId, eventKey], null);
and then you can query the view with
&descending=true&startkey=[43,{}]
This was what worked for me previously.
I went back and tried it with an endkey, and this seems to work (it restricts and orders as required):
&descending=true&startkey=[43,{}]&endkey=[42,{}]
or
&descending=true&startkey=[43,{}]&endkey=[43,{}]&inclusive_end=true
Either specify the next incremented/decremented value (based on the descending flag) as the endkey, or use the same endkey as the startkey and set inclusive_end to true.
Both of these options should work fine. (I only tested the one with endkey=42 but they should both work)