I am trying to build up a dictionary / JSON object of sensor values in ESPHome. I have a sensor that sends me key / value pairs (e.g. one reading could be { "temperature": 25.1 }, another could be { "speed": 50.1 }, and so forth) at very high frequency (milliseconds). What I would like to do is collect the data for these key / value pairs over a certain time span, for simplicity say ten seconds, and only then post the dictionary to a web service. It should also somehow combine readings for the same key if it is sent multiple times within that ten-second span, for example by averaging them, using a filter, or whatever. The final dictionary posted to the web service would then look like
{
  "temperature": 26.3,
  "speed": 52.5,
  ...
}
How could I achieve this? Any ideas / proposals?
Thanks and best regards
Dear stackoverflow community,
I found a solution to this issue. I am now using a global variable in ESPHome, which can be defined as follows:
globals:
  - id: "my_dict"
    type: std::map<std::string, std::string>
With this, I have a global map which I can use to store the key / value pairs. Adding a new key / value pair via a lambda is as simple as the following (where, in this example, the key is stored in the variable key and the value in the variable value):
lambda: |-
  id(my_dict)[key] = value;
Every ten seconds, I post the dictionary content to the web service and then clear the dictionary again:
interval:
  - interval: 10s
    then:
      - http_request.post:
          url: "https://<URL>"
          json: |-
            for ( auto item : id(my_dict) ) {
              root[item.first] = item.second;
            }
      - lambda: |-
          id(my_dict).clear();
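Note that the map above keeps only the most recent value per key, so each POST contains the latest reading rather than the average mentioned in the question. One way to get averages is to keep a running sum and count per key and compute the means when posting. Below is a minimal sketch of that aggregation logic in plain Python (it only illustrates the algorithm; the names record and flush are made up for this example, and in ESPHome you would express the same idea with two std::map globals, one for sums and one for counts):

from collections import defaultdict

# running aggregates per key
sums = defaultdict(float)
counts = defaultdict(int)

def record(key, value):
    # called for every incoming reading (arrives every few milliseconds)
    sums[key] += value
    counts[key] += 1

def flush():
    # called every ten seconds: build the averaged dict and reset the state
    averaged = {k: sums[k] / counts[k] for k in sums}
    sums.clear()
    counts.clear()
    return averaged  # this is the dictionary to POST to the web service

record("temperature", 25.1)
record("temperature", 27.5)
record("speed", 50.1)
print(flush())  # {'temperature': 26.3, 'speed': 50.1}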
I am trying to read out OBD-2 data from a Hyundai Ioniq Electric (28 kWh version), using a Raspberry Pi and a Bluetooth ELM327 interface. Connection and data transfer work fine.
For example: sending 2105<cr><lf> gives a response (<cr> is value 0x0d = 13):
7F2112<cr>7F2112<cr>7F2112<cr>02D<cr>0:6105FFFFFFFF<cr>7F2112<cr>1:00000000001616<cr>2:161616161621FA<cr>3:26480001501616<cr>4:03E82403E80FC0<cr>5:003A0000000000<cr>6:00000000000000<cr><cr>>
The value C0 in 4:03E82403E80FC0 seems to be the State of charge (SOC) display value:
C0 -> 192 -> 192/2 % = 96%
There are some tables available for decoding (see https://github.com/JejuSoul/OBD-PIDs-for-HKMC-EVs/tree/master/Ioniq%20EV%20-%2028kWh), but how do I use these tables?
For example sending 2101<cr><lf> gives the response:
02C<cr>
0:6101FFFFF800<cr>
01E<cr>
0:6101000003FF<cr>
03D<cr>
0:6101FFFFFFFF<cr>
016<cr>
0:6101FFE00000<cr>
1:0002D402CD03F0<cr>
1:0838010A015C2F<cr>
7F2112<cr>
1:B4256026480000<cr>
1:0921921A061B03<cr>
2:000582003401BD<cr>
2:0000000A002702<cr>
2:000F4816161616<cr>
2:00000000276234<cr>
3:04B84100000000<cr>
3:5B04692F180018<cr>
3:01200000000000<cr>
3:1616160016CB3F<cr>
4:00220000600000<cr>
4:00D0FF00000000<cr>
4:CB0100007A0002<cr>
5:000001F3026A02<cr>
5:5D4000025D4600<cr>
6:D2000000000000<cr>
6:00DECA0000D8E6<cr>
7:008A2FEB090002<cr>
8:0000000003E800<cr>
<cr>
>
Please note that a line feed was added after every carriage return (<cr>) for better readability; it is not part of the original response data.
How can I decode temperature, currents, ... from these data?
I found the mistake myself. The ELM327 datasheet (http://elmelectronics.com/DSheets/ELM327DS.pdf) explains the AT commands in detail.
The problem was the mixing of CAN responses from multiple ECUs, caused by the AT H0 command (headers off) in the initialization phase (not described in the question). See also ELM327DS.pdf page 44 (Multiple Responses).
When using AT H1 on startup, the responses can be decoded without problems.
Initialization (with AT H1 = headers on)
AT D\r\n
AT Z\r\n
AT L0\r\n
AT E0\r\n
AT S0\r\n
AT H1\r\n
AT SP 0\r\n
Afterwards, communication with the ECUs:
Response on first command 0100\r\n:
SEARCHING...\r7EB06410080000001\r7EC06410080000001\r\r>
Response on second command 2101\r\n:
7EE037F2112\r7ED102C6101FFFFF800\r7EA10166101FFE00000\r7EC103D6101FFFFFFFF\r7EB101E6101000003FF\r7EA2109211024062703\r7EC214626482648A3FF\r7ED2100907D87E15592\r7EB210838011D88B132\r7ED2202A1A7024C0134\r7EA2200000000546900\r7EC22C00D9E1C1B1B1B\r7EB220000000A000802\r7EA2307200000000000\r7ED23050343102000C8\r7EC231B1B1C001BB50F\r7EB233C04B8320000D0\r7EC24B5010000810002\r7ED24047400C8760017\r7EB24FF300000000000\r7ED25001401F387F46A\r7EC256AC100026CB100\r7EC2600E3C50000DE69\r7ED263F001300000000\r7EC27008CC38209015C\r7EC280000000003E800\r\r>
Response on third command 2105\r\n:
7EE037F2112\r7ED037F2112\r7EA037F2112\r7EC102D6105FFFFFFFF\r7EB037F2112\r7EC2100000000001B1C\r7EC221C1B1B1B1B2648\r7EC2326480001641A1B\r7EC2403E80803E80147\r7EC25003A0000000000\r7EC2600000000000000\r\r>
Now every response starts with the ID of the ECU it came from. Pay attention only to responses starting with 7EC.
Example:
Looking for the battery current in amps: in the document Spreadsheet_IoniqEV_BMS_2101_2105.xls you find the battery current at:
response 21 for 2101: last byte = high byte of battery current
response 22 for 2101: first byte = low byte of battery current
So look at the response to 2101\r\n and search for 7EC21 and 7EC22. You will find:
7EC214626482648A3FF: take the last byte as the battery current high byte -> FF
7EC22C00D9E1C1B1B1B: take the first byte after 7EC22 as the battery current low byte -> C0
The battery current value is: FFC0
This value is two's-complement encoded:
0xFFC0 = 65472 -> 65472 - 65536 = -64 -> -6.4 A
Result: the battery is being charged at 6.4 A.
For a coding example see:
https://github.com/greenenergyprojects/obd2-gateway, file src/obd2/obd2.ts
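For a quick illustration of the decoding above in Python (a sketch; the byte offsets assume exactly the 7EC21/7EC22 line layout shown, and the 0.1 A per bit scaling follows from the -64 -> -6.4 A step above):

def decode_battery_current(line21, line22):
    # line21 = '7EC21...': the last byte is the high byte of the current
    # line22 = '7EC22...': the first byte after '7EC22' is the low byte
    high = line21[-2:]            # 'FF'
    low = line22[5:7]             # 'C0'
    raw = int(high + low, 16)     # 0xFFC0 = 65472
    if raw >= 0x8000:             # interpret as 16-bit two's complement
        raw -= 0x10000            # 65472 - 65536 = -64
    return raw / 10.0             # scale to amps

print(decode_battery_current('7EC214626482648A3FF', '7EC22C00D9E1C1B1B1B'))
# -6.4 -> the battery is being charged at 6.4 A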
This script defines a dummy model using a small nested model:
from keras.layers import Input, Dense
from keras.models import Model
import keras
input_inner = Input(shape=(4,), name='input_inner')
output_inner = Dense(3, name='inner_dense')(input_inner)
inner_model = Model(inputs=input_inner, outputs=output_inner)
input = Input(shape=(5,), name='input')
x = Dense(4, name='dense_1')(input)
x = inner_model(x)
x = Dense(2, name='dense_2')(x)
output = keras.layers.concatenate([x, x], name='concat_1')
model = Model(inputs=input, outputs=output)
print(model.summary())
yields the following output
Layer (type) Output Shape Param # Connected to
====================================================================================================
input (InputLayer) (None, 5) 0
____________________________________________________________________________________________________
dense_1 (Dense) (None, 4) 24 input[0][0]
____________________________________________________________________________________________________
model_1 (Model) (None, 3) 15 dense_1[0][0]
____________________________________________________________________________________________________
dense_2 (Dense) (None, 2) 8 model_1[1][0]
____________________________________________________________________________________________________
concat_1 (Concatenate) (None, 4) 0 dense_2[0][0]
dense_2[0][0]
My question concerns the content of the Connected to column.
I understand that a layer can have multiple nodes.
The notation of this column is layer_name[node_index][tensor_index].
If we regard inner_model as a layer I would expect it to have only one node, so I would expect dense_2 to be connected to model_1[0][0]. But in reality it is connected to model_1[1][0]. Why is this the case?
1. Background
When you say:
If we regard inner_model as a layer I would expect it to have only one node
This is true in the sense that it has only one node which is part of the network.
Consider the GitHub source of the model.summary function. The function that prints the connections is print_layer_summary_with_connections (line 76), and it considers only the nodes from the relevant_nodes array. All nodes that are not in this array are considered not part of the network, so the function skips them. The relevant lines are 88-90:
if relevant_nodes and node not in relevant_nodes:
    # node is not part of the current network
    continue
2. Your model
Now let's see what happens with your particular model. First let us define relevant_nodes:
relevant_nodes = []
for v in model.nodes_by_depth.values():
    relevant_nodes += v
The array relevant_nodes looks like:
[<keras.engine.topology.Node at 0x9dfa518>,
<keras.engine.topology.Node at 0x9dfa278>,
<keras.engine.topology.Node at 0x9d8bac8>,
<keras.engine.topology.Node at 0x9d8ba58>,
<keras.engine.topology.Node at 0x9d74518>]
However, when we print the inbound nodes at every layer, we will get:
for i in model.layers:
    print(i.inbound_nodes)
[<keras.engine.topology.Node object at 0x0000000009D74518>]
[<keras.engine.topology.Node object at 0x0000000009D8BA58>]
[<keras.engine.topology.Node object at 0x0000000009D743C8>, <keras.engine.topology.Node object at 0x0000000009D8BAC8>]
[<keras.engine.topology.Node object at 0x0000000009DFA278>]
[<keras.engine.topology.Node object at 0x0000000009DFA518>]
You can see that there is exactly one node in the list above that does not appear in relevant_nodes. This is the node in position 0 in the third array:
<keras.engine.topology.Node object at 0x0000000009D743C8>
It was not considered a part of the model, and hence did not appear in relevant_nodes. The node in position 1 in this array does appear in relevant_nodes, and this is why you see it as model_1[1][0].
3. The reason
The reason for that is basically the line x = inner_model(input). Even if you run a much smaller model, like the one below:
input_inner = Input(shape=(4,), name='input_inner')
output_inner = Dense(3, name='inner_dense')(input_inner)
inner_model = Model(inputs=input_inner, outputs=output_inner)
input = Input(shape=(5,), name='input')
output = inner_model(input)
model = Model(inputs=input, outputs=output)
You will see that relevant_nodes contains two elements, while via
for i in model.layers:
    print(i.inbound_nodes)
you'll get three nodes.
This is because layer 1 (of the smaller model above) has two nodes, but only the second one is considered part of the model. In particular, if you print the input at each one of the nodes at layer 1 with layer.get_input_at(node_index), you'll get:
print(model.layers[1].get_input_at(0))
print(model.layers[1].get_input_at(1))
#prints
/input_inner
/input
4. Answers to the questions in the comment
1) Do you also know what this non-relevant node is good for / where it comes from?
This node seems to be an "internal node" created during the application of inner_model. In particular, if you print the input and output shape at each one of the three nodes (in the small model above), you get:
nodes = [model.layers[0].inbound_nodes[0],
         model.layers[1].inbound_nodes[0],
         model.layers[1].inbound_nodes[1]]
for i in nodes:
    print(i.input_shapes)
    print(i.output_shapes)
    print(" ")
#prints
[(None, 5)]
[(None, 5)]
[(None, 4)]
[(None, 3)]
[(None, 5)]
[(None, 3)]
You can see that the shapes of the middle node (the one that does not appear in the list of relevant nodes) correspond to the shapes in inner_model.
2) Will an inner model with n output nodes always present them with node indices 1 to n instead of 0 to n-1?
I am not sure if always, as I guess there are various possibilities for having several output nodes, but if I consider the following quite natural generalization of the small model above, this is indeed the case:
input_inner = Input(shape=(4,), name='input_inner')
output_inner = Dense(3, name='inner_dense')(input_inner)
inner_model = Model(inputs=input_inner, outputs=output_inner)
input = Input(shape=(5,), name='input')
output = inner_model(input)
output = inner_model(output)
model = Model(inputs=input, outputs=output)
print(model.summary())
Here I just added output = inner_model(output) to the small model. The list of relevant nodes is
[<keras.engine.topology.Node at 0xd10c390>,
<keras.engine.topology.Node at 0xd10c9b0>,
<keras.engine.topology.Node at 0xd10ca20>]
and the list of all inbound nodes is
[<keras.engine.topology.Node object at 0x000000000D10CA20>]
[<keras.engine.topology.Node object at 0x000000000D10C588>, <keras.engine.topology.Node object at 0x000000000D10C9B0>, <keras.engine.topology.Node object at 0x000000000D10C390>]
Indeed the node indices are 1 and 2, as you mentioned in the comment. It will continue similarly if I add another output = inner_model(output), with node indices being 1,2,3 and so on.
Updated in Sep 2020. The selected answer was a bit outdated (the link does not point to the right place), and it did not exactly answer the question: why is the node index in model_1[1][0] equal to 1? Here's what I found.
The code I played with is below (I added names for the layers for better readability). You can copy and run it to see the output info.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
input_inner = layers.Input(shape=(4,), name='inn_input')
output_inner = layers.Dense(3, name='inn_dense')(input_inner)
inner_model = keras.Model(inputs=input_inner, outputs=output_inner,name='inn_model')
inn_allLayers = inner_model.layers
# print(type(inn_allLayers))
print(inner_model.name,': total layer number:',len(inn_allLayers))
for i in inn_allLayers:
    print(i.name, i)
    print(len(i._inbound_nodes))
    for n in i._inbound_nodes:
        print(n.get_config())
        print(n)
    print('===================')
print('************************************************')
nest_input = layers.Input(shape=(5,), name='nest_input')
nest_d1_out = layers.Dense(4, name='nest_dense_1')(nest_input)
nest_m_out = inner_model(nest_d1_out)
nest_d2_out = layers.Dense(2, name='nest_dense_2')(nest_m_out)
nest_add_out = layers.concatenate([nest_d2_out, nest_d2_out], name='nest_concat')
model = keras.Model(inputs=nest_input, outputs=nest_add_out,name='nest_model')
inn_allLayers = inner_model.layers
# print(type(inn_allLayers))
print(inner_model.name,': total layer number:',len(inn_allLayers))
for i in inn_allLayers:
    print(i.name, i)
    print(len(i._inbound_nodes))
    for n in i._inbound_nodes:
        print(n.get_config())
        print(n)
    print('===================')
print('************************************************')
allLayers = model.layers
# print(type(allLayers))
print(model.name,': total layer number:',len(allLayers))
for i in allLayers:
    print(i.name, i)
    print(len(i._inbound_nodes))
    for n in i._inbound_nodes:
        print(n.get_config())
        print(n)
    print('===================')
# note: tf.get_default_graph() is the TF 1.x graph API
# (use tf.compat.v1.get_default_graph() in TF 2.x)
for op in tf.get_default_graph().get_operations():
    print(str(op.name))
1. [1][0] represents [node_index][tensor_index]
2. What is node_index?
Under tensorflow/python/keras/engine/base_layer.py, it's described in this class:
class KerasHistory(
        collections.namedtuple('KerasHistory',
                               ['layer', 'node_index', 'tensor_index'])):
    """Tracks the Layer call that created a Tensor, for Keras Graph Networks.

    During construction of Keras Graph Networks, this metadata is added to
    each Tensor produced as the output of a Layer, starting with an
    `InputLayer`. This allows Keras to track how each Tensor was produced, and
    this information is later retraced by the `keras.engine.Network` class to
    reconstruct the Keras Graph Network.

    Attributes:
      layer: The Layer that produced the Tensor.
      node_index: The specific call to the Layer that produced this Tensor.
        Layers can be called multiple times in order to share weights. A new
        node is created every time a Tensor is called.
      tensor_index: The output index for this Tensor. Always zero if the Layer
        that produced this Tensor only has one output. Nested structures of
        Tensors are deterministically assigned an index via `nest.flatten`.
    """
    # Added to maintain memory and performance characteristics of `namedtuple`
    # while subclassing.
It says a Node is created each time a Tensor is called. To me, this is a bit vague. My understanding is that when a layer is called, it produces a Tensor, and different ways of calling this layer will create multiple nodes (I will show some print results later).
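As a minimal sketch of this (assuming a Keras version where layers expose the _inbound_nodes attribute, as in the code above): calling the same layer on two different inputs creates two nodes, one per call.

from tensorflow.keras import layers, Input

shared = layers.Dense(3, name='shared_dense')
a = Input(shape=(4,), name='a')
b = Input(shape=(4,), name='b')
out_a = shared(a)  # first call  -> node index 0
out_b = shared(b)  # second call -> node index 1

# one Node object is recorded per call to the layer
print(len(shared._inbound_nodes))  # 2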
3. How to print each node?
Under the same py file, there is this snippet:
# Create node, add it to inbound nodes.
Node(
    self,
    inbound_layers=inbound_layers,
    node_indices=node_indices,
    tensor_indices=tensor_indices,
    input_tensors=input_tensors,
    output_tensors=output_tensors,
    arguments=arguments)
# Update tensor history metadata.
# The metadata attribute consists of
#   1) a layer instance
#   2) a node index for the layer
#   3) a tensor index for the node.
# This allows layer reuse (multiple nodes per layer) and multi-output
# or multi-input layers (e.g. a layer can return multiple tensors,
# and each can be sent to a different layer).
for i, tensor in enumerate(nest.flatten(output_tensors)):
    tensor._keras_history = KerasHistory(self,
                                         len(self._inbound_nodes) - 1, i)
The self here refers to the Layer object. The info is recorded in each tensor's _keras_history and in the layer's _inbound_nodes attribute. Hence we can print a specific node with print(layer._inbound_nodes[index_of_node].get_config()). The runnable code for this is already included at the beginning.
(What are inbound and outbound nodes? It's quite confusing at first glance, but if you imagine each node as an arrow pointing from one layer to another layer, it might be easier. The code description is below.)
class Node(object):
    """A `Node` describes the connectivity between two layers.

    Each time a layer is connected to some new input,
    a node is added to `layer._inbound_nodes`.
    Each time the output of a layer is used by another layer,
    a node is added to `layer._outbound_nodes`.

    Arguments:
        outbound_layer: the layer that takes
            `input_tensors` and turns them into `output_tensors`
            (the node gets created when the `call`
            method of the layer was called).
        inbound_layers: a list of layers, the same length as `input_tensors`,
            the layers from where `input_tensors` originate.
        node_indices: a list of integers, the same length as `inbound_layers`.
            `node_indices[i]` is the origin node of `input_tensors[i]`
            (necessary since each inbound layer might have several nodes,
            e.g. if the layer is being shared with a different data stream).
        tensor_indices: a list of integers,
            the same length as `inbound_layers`.
            `tensor_indices[i]` is the index of `input_tensors[i]` within the
            output of the inbound layer
            (necessary since each inbound layer might
            have multiple tensor outputs, with each one being
            independently manipulable).
        input_tensors: list of input tensors.
        output_tensors: list of output tensors.
        arguments: dictionary of keyword arguments that were passed to the
            `call` method of the layer at the call that created the node.

    `node_indices` and `tensor_indices` are basically fine-grained coordinates
    describing the origin of the `input_tensors`.

    A node from layer A to layer B is added to:
    - A._outbound_nodes
    - B._inbound_nodes
    """
4. Observe node creation.
You might notice that there are two identical print blocks for inner_model in the code: one before the nested model is built, one after.
The output is as below:
inn_model : total layer number: 2
inn_input <tensorflow.python.keras.engine.input_layer.InputLayer object at 0x7fd1c6755780>
1
{'outbound_layer': 'inn_input', 'inbound_layers': [], 'node_indices': [], 'tensor_indices': []}
<tensorflow.python.keras.engine.base_layer.Node object at 0x7fd1d2e75e10>
===================
inn_dense <tensorflow.python.keras.layers.core.Dense object at 0x7fd1d2e75e80>
1
{'outbound_layer': 'inn_dense', 'inbound_layers': 'inn_input', 'node_indices': 0, 'tensor_indices': 0}
<tensorflow.python.keras.engine.base_layer.Node object at 0x7fd1d2e92550>
===================
************************************************
inn_model : total layer number: 2
inn_input <tensorflow.python.keras.engine.input_layer.InputLayer object at 0x7fd1c6755780>
1
{'outbound_layer': 'inn_input', 'inbound_layers': [], 'node_indices': [], 'tensor_indices': []}
<tensorflow.python.keras.engine.base_layer.Node object at 0x7fd1d2e75e10>
===================
inn_dense <tensorflow.python.keras.layers.core.Dense object at 0x7fd1d2e75e80>
2
{'outbound_layer': 'inn_dense', 'inbound_layers': 'inn_input', 'node_indices': 0, 'tensor_indices': 0}
<tensorflow.python.keras.engine.base_layer.Node object at 0x7fd1d2e92550>
{'outbound_layer': 'inn_dense', 'inbound_layers': 'nest_dense_1', 'node_indices': 0, 'tensor_indices': 0}
<tensorflow.python.keras.engine.base_layer.Node object at 0x7fd1d2ac4358>
===================
************************************************
You will notice immediately that after the nested model is built, one extra (inbound) node (or arrow) is created, pointing to inn_dense. One arrow points from inn_input to inn_dense; the new one points from nest_dense_1 to inn_dense. This is what was said earlier: each time a layer is called, a new node (an arrow) is created.
5. Question answered
So far, I think this already explains the original question: why the node index is 1 in [1][0]. It is because reusing inner_model causes the inn_dense layer to be used to create a Tensor a second time.
The rest of the code snippet has a bit of extra information; you can check it out and get a better idea of what happens under the hood.
It seems it is now _nodes_by_depth instead of nodes_by_depth, and the same for inbound_nodes etc. Maybe the answer has to be updated.
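If the only change is the leading underscore, a hedged compatibility sketch (an assumption, not verified against every Keras version) would be:

# fall back to the old attribute name on older Keras versions
nodes_by_depth = getattr(model, '_nodes_by_depth', None) or model.nodes_by_depth
relevant_nodes = []
for v in nodes_by_depth.values():
    relevant_nodes += v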
I have a JSON file (post responses from an API). I need to sort the dictionaries by a certain key in order to parse the JSON file in chronological order. After studying the data, I can sort either by the date format in metadata or by the number sequences of the S5CV[0156]P0.xml.
One text example that you can load as JSON is here: http://pastebin.com/0NS5BiDk
I have written two pieces of code to sort the list of objects by a certain key. The first one sorts by the 'text' of the XML. The second one by [metadata][0][value].
The first one works, but a few of the XMLs, even though they are higher in number, actually contain documents older than I expected.
For the second code the date format is not consistent and sometimes the value is not present at all. I am struggling to extract the datetime in a consistent way. The second one also gives me an error which I cannot figure out: string indices must be integers.
# 1st code (it works but not ideal)
# load post response r1 in json (python 3.5)
j = r1.json()
# iterate through dictionaries and sort by the 4 num of xml (ex. 0156)
list = []
for row in j["tree"]["children"][0]["children"]:
    list.append(row)
newlist = sorted(list, key=lambda k: k['text'][-9:])
print(newlist)
# 2nd code. I need something to make a consistent datetime,
# handle missing values and solve the list index error
list = []
for row in j["tree"]["children"][0]["children"]:
    list.append(row)

# extract the last 3 blocks of characters from [metadata][0][value]
# (usually like "7th april, 1922.") and transform them into datetime format
# using dparser.parse
def date(key):
    return dparser.parse(' '.join(key.split(' ')[-3:]), fuzzy=True)

def order(slist):
    try:
        return sorted(slist, key=lambda k: k[date(["metadata"][0]["value"])])
    except ValueError:
        return 0

print(order(list))
# update
orig_list = j["tree"]["children"][0]["children"]
cleaned_list = sorted((x for x in orig_list if extract_date(x) != DEFAULT_DATE),
                      key=extract_date)

first_date = extract_date(cleaned_list[0])
if first_date != DEFAULT_DATE:  # valid date found?
    cleaned_list[0]['date'] = first_date
    print(first_date)

middle_date = extract_date(cleaned_list[len(cleaned_list)//2])
if middle_date != DEFAULT_DATE:  # valid date found?
    cleaned_list[0]['date'] = middle_date
    print(middle_date)

last_date = extract_date(cleaned_list[-1])
if last_date != DEFAULT_DATE:  # valid date found?
    cleaned_list[0]['date'] = last_date
    print(last_date)
Clearly you can't use the .xml filenames to sort the data if it's unreliable, so the most promising strategy seems to be what you're attempting to do in the 2nd code.
When I mentioned needing a datetime to sort the items in my comments to your other question, I literally meant something like datetime.date instances, not strings like "28th july, 1933", which wouldn't provide the proper ordering needed since they would be compared lexicographically with one another, not numerically like datetime.dates.
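A quick demonstration of the difference:

import datetime

# lexicographic string comparison: '2' < '5', so 1933 sorts before 1922
print("28th july, 1933" < "5th april, 1922")  # True (wrong order)

# datetime.date comparison is chronological
print(datetime.date(1933, 7, 28) < datetime.date(1922, 4, 5))  # False (correct)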
Here's something that seems to work. It uses the re module to search for the date pattern in the strings that usually contain them (those with a "name" associated with the value "Comprising period from"). If there's more than one date match in the string, it uses the last one. This is then converted into a date instance and returned as the value to key on.
Since some of the items don't have valid date strings, a default one is substituted for sorting purposes. In the code below, the earliest valid date is used as the default, which makes all items with date problems appear at the beginning of the sorted list. Any items following them should be in the proper order.
Not sure what you should do about items lacking date information: if it isn't there, your only options are to guess a value, ignore them, or consider it an error.
# v3.2.1
import datetime
import json
import re
# default date when one isn't found
DEFAULT_DATE = datetime.date(datetime.MINYEAR, 1, 1)  # 01/01/0001

MONTHS = ('january february march april may june july august september october'
          ' november december'.split())
# dictionary to map month names to numeric values 1-12
MONTH_TO_ORDINAL = dict(zip(MONTHS, range(1, 13)))

DMY_DATE_REGEX = (r'(3[01]|[12][0-9]|[1-9])\s*(?:st|nd|rd|th)?\s*'
                  + r'(' + '|'.join(MONTHS) + r')(?:[,.])*\s*'
                  + r'([0-9]{4})')
MDY_DATE_REGEX = (r'(' + '|'.join(MONTHS) + r')\s+'
                  + r'(3[01]|[12][0-9]|[1-9])\s*(?:st|nd|rd|th)?,\s*'
                  + r'([0-9]{4})')
DMY_DATE = re.compile(DMY_DATE_REGEX, re.IGNORECASE)
MDY_DATE = re.compile(MDY_DATE_REGEX, re.IGNORECASE)
def extract_date(item):
    metadata0 = item["metadata"][0]  # check only first item in metadata list
    if metadata0.get("name") != "Comprising period from":
        return DEFAULT_DATE
    else:
        value = metadata0.get("value", "")
        matches = DMY_DATE.findall(value)  # try dmy pattern (most common)
        if matches:
            day, month, year = matches[-1]  # use last match if more than one
        else:
            matches = MDY_DATE.findall(value)  # try mdy pattern...
            if matches:
                month, day, year = matches[-1]  # use last match if more than one
            else:
                print('warning: date patterns not found in "{}"'.format(value))
                return DEFAULT_DATE
        # convert strings found into numerical values
        year, month, day = int(year), MONTH_TO_ORDINAL[month.lower()], int(day)
        return datetime.date(year, month, day)
# test files: 'json_sample.txt', 'india_congress.txt', 'olympic_games.txt'
with open('json_sample.txt', 'r') as f:
    j = json.load(f)

orig_list = j["tree"]["children"][0]["children"]
sorted_list = sorted(orig_list, key=extract_date)
for item in sorted_list:
    print(json.dumps(item, indent=4))
To answer your latest follow-on questions, you could leave out all the items in the list that don't have recognizable dates by using extract_date() to filter them out beforehand in a generator expression with something like this:
# to obtain a list containing only entries with a parsable date
cleaned_list = sorted((x for x in orig_list if extract_date(x) != DEFAULT_DATE),
                      key=extract_date)
Once you have a sorted list of items that all have a valid date, you can do things like the following, again reusing the extract_date() function:
# extract and display dates of items in cleaned list
print('first date: {}'.format(extract_date(cleaned_list[0])))
print('middle date: {}'.format(extract_date(cleaned_list[len(cleaned_list)//2])))
print('last date: {}'.format(extract_date(cleaned_list[-1])))
Calling extract_date() on the same item multiple times is somewhat inefficient. To avoid that you could easily add the datetime.date value it returns to the object on-the-fly since it's a dictionary, and then just refer to it as often as needed with very little additional overhead:
# add an extracted datetime.date entry to list item i if a valid one was found
date = extract_date(some_list[i])
if date != DEFAULT_DATE:  # valid date found?
    some_list[i]['date'] = date  # save by adding it to the object
This effectively caches the extracted date by storing it in the item itself. Afterwards, the datetime.date value can simply be referenced with some_list[i]['date'].
As a concrete example, consider this revised example of displaying the dates of the first, middle, and last objects:
# display dates of items in cleaned list
print('first date: {}'.format(cleaned_list[0]['date']))
middle = len(cleaned_list)//2
print('middle date: {}'.format(cleaned_list[middle]['date']))
print('last date: {}'.format(cleaned_list[-1]['date']))
I'm trying to process the following with a JSON Input step:
{"address":[
{"AddressId":"1_1","Street":"A Street"},
{"AddressId":"1_101","Street":"Another Street"},
{"AddressId":"1_102","Street":"One more street", "Locality":"Buenos Aires"},
{"AddressId":"1_102","Locality":"New York"}
]}
However, this seems not to be possible:
Json Input.0 - ERROR (version 4.2.1-stable, build 15952 from 2011-10-25 15.27.10 by buildguy) :
The data structure is not the same inside the resource!
We found 1 values for json path [$..Locality], which is different that the number retourned for path [$..Street] (3509 values).
We MUST have the same number of values for all paths.
The step provides an Ignore Missing Path flag, but it only works if all the rows miss the same path. In that case the step acts as expected and fills the missing values with null.
This limits the power of this step to read uneven data, which was really one of my priorities.
My step Fields are defined as follows (screenshot omitted):
Am I missing something? Is this the correct behavior?
What I have done is use a JSON Input step with $.address[*] to read the full map of each element into a jsonRow field, e.g.:
{"address":[
{"AddressId":"1_1","Street":"A Street"},
{"AddressId":"1_101","Street":"Another Street"},
{"AddressId":"1_102","Street":"One more street", "Locality":"Buenos Aires"},
{"AddressId":"1_102","Locality":"New York"}
]}
This results in four jsonRows, one for each element, e.g. jsonRow = {"AddressId":"1_101","Street":"Another Street"}. Then, using a JavaScript step, I map my values like this:
var AddressId = getFromMap('AddressId', jsonRow);
var Street = getFromMap('Street', jsonRow);
var Locality = getFromMap('Locality', jsonRow);
In a second script tab I inserted the minified JSON parse code from https://github.com/douglascrockford/JSON-js and the getFromMap function:
function getFromMap(key, jsonRow) {
    try {
        var map = JSON.parse(jsonRow);
    }
    catch (e) {
        var message = "Unparsable JSON: " + jsonRow + " Desc: " + e.message;
        var nr_errors = 1;
        var field = "jsonRow";
        var errcode = "JSON_PARSE";
        _step_.putError(getInputRowMeta(), row, nr_errors, message, field, errcode);
        trans_Status = SKIP_TRANSFORMATION;
        return null;
    }
    if (map[key] == undefined) {
        return null;
    }
    trans_Status = CONTINUE_TRANSFORMATION;
    return map[key];
}
You can solve this by changing the JSONPath and splitting the work into two JSON Input steps. The following website explains a lot about JSONPath: http://goessner.net/articles/JsonPath/
$..AddressId
does in fact return all the AddressIds in the address array. But since Pentaho uses grid rows for input and output [4 rows x 3 columns], it can't handle missing (null) values when the results should be all the Streets (3 rows) and all the Localities (2 rows): there are no null placeholders in the array itself, much as you can't drive out of your garage with 3 wheels on your car instead of the usual 4.
I guess your script returns null values (shown as X) like:
A S X
A S X
A S L
A X L
The scripting step can be avoided by changing the Fields path of the first JSON Input step to:
$.address[*]
This retrieves all four address lines. Then create a second JSON Input step based on the new source field, which contains the address line(s), to retrieve the address details per line:
$.AddressId
$.Street
$.Locality
This yields null values on the four address lines whenever an address detail is not available in a line.
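For reference, here is the same two-stage idea in plain Python (a sketch of the logic only, not PDI code; dict.get plays the role of the per-line lookup that returns null for missing keys):

import json

doc = json.loads('''{"address":[
    {"AddressId":"1_1","Street":"A Street"},
    {"AddressId":"1_101","Street":"Another Street"},
    {"AddressId":"1_102","Street":"One more street", "Locality":"Buenos Aires"},
    {"AddressId":"1_102","Locality":"New York"}
]}''')

# stage 1: one row per array element (the $.address[*] step)
for row in doc["address"]:
    # stage 2: per-row field extraction; missing keys become None (null)
    print(row.get("AddressId"), row.get("Street"), row.get("Locality"))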