Comments in textual serialized protobuf? (not the schema definition) - configuration

I'm using textual protobuf files for system configuration.
One problem I have with this is that the serialized protobuf format does not support comments.
Is there any way around this?
I'm talking about the textual serialized data format, not the schema definition.
Was this problem solved somewhere by someone?

The textual protobuf format (serialized protobuf messages in text format) supports comments using the # syntax. I could not find a reference for this in any online documentation, but I have used it in projects in the past, so I put together a small example that one can test with:
Sample message description - [SampleProtoSchema.proto]
message SampleProtoSchema {
  optional int32 first_val = 1;   // Note: This supports C/C++ style comments
  optional int32 second_val = 2;
}
Sample text message - [SampleTextualProto.prototxt]
# This is how textual protobuf format supports comments
first_val: 12 # can also be inline comments
# This is another comment
second_val: 23
Note, though, that these comments cannot be generated automatically at serialization time; they can only be added manually afterwards.
Compile and test:
> protoc --python_out=. SampleProtoSchema.proto
>
> ipython
[1]: import SampleProtoSchema_pb2
[2]: sps = SampleProtoSchema_pb2.SampleProtoSchema()
[3]: from google.protobuf import text_format
[4]: with open('SampleTextualProto.prototxt', 'r') as f:
         text_format.Merge(f.read(), sps)
[5]: sps.first_val
[5]> 12
[6]: sps.second_val
[6]> 23
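As a quick illustration of the note above, here is a minimal sketch (reusing the generated SampleProtoSchema_pb2 module from this example) showing that the # comments are not preserved: serializing the parsed message back to text yields only the field values.
from google.protobuf import text_format
import SampleProtoSchema_pb2

sps = SampleProtoSchema_pb2.SampleProtoSchema()
with open('SampleTextualProto.prototxt', 'r') as f:
    text_format.Merge(f.read(), sps)
# the round-tripped text contains no comments, only the values:
print(text_format.MessageToString(sps))
# first_val: 12
# second_val: 23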

You may want to take a look at the Piqi project. It addresses this problem by introducing a new human-readable "Piq" data format and a command-line tool for converting data between Protobuf, Piq, JSON and XML formats.
The Piq data format was specially designed for human interaction. It supports comments, binary literals and verbatim text literals.

Related

Reading the JSON string which is put together in one field

I have a JSON pattern string in a text file. I have to parse the string below and write the result to an external file.
Please let me know how this can be handled with Informatica PowerCenter, Unix, or Python.
{"CONTACTID":"3b2a25b2","ANI":"+16146748702","DNIS":"+18006081123","START_TIME":"01/22/2023 03:31:42","MODULE":[{"Name":"MainIVR","Time":"01/22/2023 03:31:42",Dialog":[{"name":"offer_Spanish","dialogeresult":"(|raw:7|R|7|1.0|nm=0|ni=0|2023/22/21 03:02:01)"}],"backend":[{"Time":"01/22/2023)"}],"END_STATE":"XC"}
In the above sample string, the special characters should be removed and the values assigned to the corresponding columns, as in the two output formats below.
Output:
CONTACTID, ANI, DNIS, START_TIME, MODULE, Time,Dialog,dialogeresult,END_STATE
3b2a25b2,+16146748702 +18006081123 01/22/2023 03:31:42,Name:MainIVR,
or
Output:
CONTACTID : 3b2a25b2
ANI:16146748702
DNI :+18006081123
I tried reading this through Informatica PowerCenter using Expression transformations, but nothing worked, and I tried with Python too.
For a start, your JSON is invalid. The opening double quote for Dialog is missing, and the structure is not properly closed: the MODULE array is not closed and neither is the root object. Here's the fixed JSON:
{"CONTACTID":"3b2a25b2","ANI":"+16146748702","DNIS":"+18006081123","START_TIME":"01/22/2023 03:31:42","MODULE":[{"Name":"MainIVR","Time":"01/22/2023 03:31:42","Dialog":[{"name":"offer_Spanish","dialogeresult":"(|raw:7|R|7|1.0|nm=0|ni=0|2023/22/21 03:02:01)"}],"backend":[{"Time":"01/22/2023)"}],"END_STATE":"XC"}]}
Use a JSON validation tool - it helps a lot.
Next, here's some starter code you may use to achieve the required result:
import json
# some JSON:
x = '{"CONTACTID":"3b2a25b2","ANI":"+16146748702","DNIS":"+18006081123","START_TIME":"01/22/2023 03:31:42","MODULE":[{"Name":"MainIVR","Time":"01/22/2023 03:31:42","Dialog":[{"name":"offer_Spanish","dialogeresult":"(|raw:7|R|7|1.0|nm=0|ni=0|2023/22/21 03:02:01)"}],"backend":[{"Time":"01/22/2023)"}],"END_STATE":"XC"}]}'
# parse x:
y = json.loads(x)
# the result is a Python dictionary:
print(y.keys())
You may test it on Replit
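Building on that, here is a rough sketch of how one might produce the second output format from the question (one "KEY : value" line per field); it only handles the top-level string fields and leaves the nested MODULE list out:
import json
x = '{"CONTACTID":"3b2a25b2","ANI":"+16146748702","DNIS":"+18006081123","START_TIME":"01/22/2023 03:31:42","MODULE":[{"Name":"MainIVR","Time":"01/22/2023 03:31:42","Dialog":[{"name":"offer_Spanish","dialogeresult":"(|raw:7|R|7|1.0|nm=0|ni=0|2023/22/21 03:02:01)"}],"backend":[{"Time":"01/22/2023)"}],"END_STATE":"XC"}]}'
y = json.loads(x)
# print each top-level scalar field as a "KEY : value" line
for key, value in y.items():
    if isinstance(value, str):
        print(key, ":", value)
# CONTACTID : 3b2a25b2
# ANI : +16146748702
# DNIS : +18006081123
# START_TIME : 01/22/2023 03:31:42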
Finally, regarding Informatica PowerCenter: it is a terrible choice for complex string processing. You would need a Hierarchy Parser transformation. Long story short: it's very tedious, but possible. I would highly recommend picking a different approach if this is not a regular data loading process you will need to build.

Reading JSON in Azure Synapse

I'm trying to understand the code for reading a JSON file in Synapse Analytics. Here's the code provided by the Microsoft documentation:
Query JSON files using serverless SQL pool in Azure Synapse Analytics
select top 10 *
from openrowset(
    bulk 'https://pandemicdatalake.blob.core.windows.net/public/curated/covid-19/ecdc_cases/latest/ecdc_cases.jsonl',
    format = 'csv',
    fieldterminator = '0x0b',
    fieldquote = '0x0b'
) with (doc nvarchar(max)) as rows
go
I wonder why the format = 'csv'. Is it trying to convert JSON to CSV to flatten the file?
Why they didn't just read the file as a SINGLE_CLOB I don't know
When you use SINGLE_CLOB, the entire file is imported as one value, and the content of the file in doc is not well formed as a single JSON document. Using SINGLE_CLOB would force us to do more work after the openrowset before we could use the content as JSON (since it is not one valid JSON document, we would need to parse the value ourselves). It can be done, but it would probably require more work.
The file actually contains multiple JSON-like strings, one per line: "line-delimited JSON", as the document calls it.
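To see the difference outside of Synapse, here is a small plain-Python sketch with a made-up two-line sample: parsing the whole content as one JSON document fails, while parsing it line by line works.
import json
jsonl = '{"id": 1}\n{"id": 2}'  # hypothetical line-delimited content
# json.loads(jsonl) would raise json.JSONDecodeError ("Extra data"),
# because the content is not a single JSON document
docs = [json.loads(line) for line in jsonl.splitlines()]
print(docs)  # [{'id': 1}, {'id': 2}]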
By the way, if you check the history of the document on GitHub, you will find that this was not originally the case. As far as I remember, the file originally contained a single JSON document with an array of objects (it was wrapped with [] once loaded). Someone named "Ronen Ariely" in fact found this issue in the document, which is why you can see my name in the list of authors of the document :-)
I wonder why the format = 'csv'. Is it trying to convert json to csv to flatten the hierarchy?
(1) JSON is not a data type in SQL Server. There is no data type named JSON. What we have in SQL Server are tools, such as functions, that work on text and provide support for strings in JSON format. Therefore, we do not CONVERT to JSON or from JSON.
(2) The format parameter has nothing to do with JSON. It specifies that the content of the file is comma-separated values. You can (and should) use it whenever your file is well formed as comma-separated values (commonly known as a CSV file).
In this specific sample in the document, the values in the CSV file are strings, each of which is a valid JSON document. Only after reading the file using openrowset do we start to parse the content of the text as JSON.
Notice that only after the title "Parse JSON documents" does the document start to speak about parsing the text as JSON.

NiFi non-Avro JSON Reader/Writer

It appears that the standard Apache NiFi readers/writers can only parse JSON input based on an Avro schema.
Avro schema is limiting for JSON; e.g., it does not allow valid JSON properties starting with digits.
The JoltTransformJSON processor can help here (it doesn't impose Avro limitations on what the input JSON may look like), but it seems that this processor does not support batch FlowFiles. It is also not based on the readers and writers (maybe because of that).
Is there a way to read arbitrary valid batch JSON input, e.g. in multi-line form
{"myprop":"myval","12345":"12345",...}
{"myprop":"myval2","12345":"67890",...}
and transform it to another JSON structure, e.g. one defined by a JSON schema, using e.g. a JSON Patch transformation, without writing my own processor?
Update
I am using Apache NiFi 1.7.1
Update 2
Unfortunately, @Shu's suggestion did not work; I am getting the same error.
I reduced the case to a single UpdateRecord processor that reads JSON with numeric properties and writes JSON without such properties, using the
myprop : /data/5836c846e4b0f28d05b40202
mapping. Still the same error :(
it does not allow valid JSON properties starting with digits?
This bug, NIFI-4612, was fixed in NiFi 1.5. You can use an AvroSchemaRegistry to define your schema and set its
Validate Field Names
property to false.
Then you can have Avro schema field names starting with digits.
Is there a way to read arbitrary valid batch JSON input, e.g. in multi-line form?
This bug, NIFI-4456, was fixed in NiFi 1.7. If you are not using that version of NiFi, you can work around it by creating an array of the JSON messages, joined with a comma delimiter, using the following flow:
1. SplitText     // split the flowfile into chunks of 1 line each
2. MergeRecord   // merge the flowfiles into one
3. ConvertRecord
For more details on this particular issue, refer to this link (I have explained it with the flow).
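For what it's worth, the effect of that workaround can be illustrated outside of NiFi with a small plain-Python sketch (the messages are made up): the individual JSON messages are joined with a comma delimiter and wrapped in brackets, producing one valid JSON array.
import json
messages = ['{"myprop":"myval","12345":"12345"}',
            '{"myprop":"myval2","12345":"67890"}']
# join the messages with a comma delimiter and wrap them in []
merged = "[" + ",".join(messages) + "]"
print(json.loads(merged))  # one valid JSON array of two objects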

Spark - load numbers from a CSV file with non-US number format

I have a CSV file which I want to convert to Parquet for further processing. Using
sqlContext.read()
.format("com.databricks.spark.csv")
.schema(schema)
.option("delimiter",";")
.(other options...)
.load(...)
.write()
.parquet(...)
works fine when my schema contains only Strings. However, some of the fields are numbers that I'd like to be able to store as numbers.
The problem is that the file arrives not as an actual "csv" but as a semicolon-delimited file, and the numbers are formatted in German notation, i.e. a comma is used as the decimal delimiter.
For example, what in the US would be 123.01 is stored in this file as 123,01.
Is there a way to force reading the numbers in a different Locale, or some other workaround that would allow me to convert this file without first converting the CSV file to a different format? I looked at the Spark code, and one nasty thing that seems to be causing the issue is in CSVInferSchema.scala line 268 (Spark 2.1.0): the parser enforces US formatting rather than, e.g., relying on the Locale set for the JVM, or allowing this to be configured somehow.
I thought of using a UDT but got nowhere with that - I can't work out how to get it to let me handle the parsing myself (I couldn't really find a good example of using UDTs...).
Any suggestions on a way of achieving this directly, i.e. on parsing step, or will I be forced to do intermediate conversion and only then convert it into parquet?
For anybody else who might be looking for an answer - the workaround I went with (in Java) for now is:
JavaRDD<Row> convertedRDD = sqlContext.read()
.format("com.databricks.spark.csv")
.schema(stringOnlySchema)
.option("delimiter",";")
.(other options...)
.load(...)
.javaRDD()
.map ( this::conversionFunction );
sqlContext.createDataFrame(convertedRDD, schemaWithNumbers).write().parquet(...);
The conversion function takes a Row and needs to return a new Row with fields converted to numerical values as appropriate (in fact, it could perform any conversion). Rows in Java can be created with RowFactory.create(newFields).
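The heart of such a conversion function is just locale-aware number parsing. Here is a minimal Python sketch of the same idea (parse_german_number is a hypothetical helper; in the Java code one could use java.text.NumberFormat with Locale.GERMANY instead):
def parse_german_number(s):
    # "1.234,56" -> 1234.56: drop the thousands separators (".")
    # and turn the decimal comma into a decimal point
    return float(s.replace('.', '').replace(',', '.'))

print(parse_german_number('123,01'))    # 123.01
print(parse_german_number('1.234,56'))  # 1234.56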
I'd be happy to hear any other suggestions on how to approach this, but for now this works. :)

JSON - Opening the Yelp Data Challenge's data set

I am interested in data mining and I am writing my thesis about it. For my thesis I want to use Yelp's Dataset Challenge data set; however, I cannot open it, since it is in JSON format and almost 2 GB. On its website it is said that the data set can be opened in Python using mrjob, but I am also not very good with programming. I searched online and looked at some of the code Yelp provided on GitHub, but I couldn't find an article or anything that clearly explains how to open the data set.
Can you please tell me step by step how to open this file and maybe how to convert it to CSV?
https://www.yelp.com.tr/dataset_challenge
https://github.com/Yelp/dataset-examples
The data is in .tar format. When you extract it, it contains another file; rename that one to .tar and extract it again. You will get all the JSON files.
Yes, you can use pandas. Take a look:
import pandas as pd
# read the entire file into a list of lines
with open('yelp_academic_dataset_review.json', 'r') as f:
    data = f.readlines()
# remove the trailing "\n" from each line
data = [x.rstrip() for x in data]
# wrap the lines in a single JSON array string
data_json_str = "[" + ','.join(data) + "]"
# now, load it into pandas
data_df = pd.read_json(data_json_str)
Now 'data_df' contains the Yelp data ;)
In case you want to convert it directly to CSV, you can use this script:
https://github.com/Yelp/dataset-examples/blob/master/json_to_csv_converter.py
I hope it helps.
To process huge JSON files, use a streaming parser.
Many of these files aren't a single JSON document, but a stream of JSON documents (line-delimited JSON). A regular JSON parser will then consider everything but the first entry to be junk.
With a streaming parser, you can start reading the file, process parts, and write them to the desired output, then continue reading.
There is no single JSON-to-CSV conversion.
Thus, you will not find a general conversion utility; you have to customize the conversion for your needs.
The reason is that JSON is a tree but CSV is not. There is no universal and efficient conversion from trees to table rows. I'd stick with JSON unless you are always extracting only the same x attributes from the tree, as in the sketch below.
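If you do always extract the same few attributes, a streaming, line-by-line extraction to CSV is short to write. Here is a minimal sketch (the field names are examples; adjust them to the actual schema of the file you are processing):
import csv
import json
fields = ['business_id', 'stars', 'text']  # example attribute names
with open('yelp_academic_dataset_review.json') as src, \
     open('reviews.csv', 'w', newline='') as dst:
    writer = csv.writer(dst)
    writer.writerow(fields)
    for line in src:  # one JSON document per line
        doc = json.loads(line)
        writer.writerow([doc.get(f) for f in fields])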
Start coding: to succeed with such amounts of data, you need to become a better programmer.