Couchbase SDK cannot decode binary document - couchbase

I am trying to fetch BinaryDocuments uploaded by cbworkloadgen from Couchbase 4.0.0-4051 Community Edition. Couchbase Java client version is 2.4.1.
The exception given by the decoder is:
WARNING: Decoding of document with BinaryTranscoder failed. exception: Flags (0x0) indicate non-binary document for id pymc0, could not decode., id: "pymc0", cas: 1486468016723525632, expiry: 0, flags: 0x0, status: SUCCESS, content size: 2048 bytes, content: "".
com.couchbase.client.java.error.TranscodingException: Flags (0x0) indicate non-binary document for id pymc0, could not decode.
at com.couchbase.client.java.transcoder.BinaryTranscoder.doDecode(BinaryTranscoder.java:32)
at com.couchbase.client.java.transcoder.BinaryTranscoder.doDecode(BinaryTranscoder.java:26)
at com.couchbase.client.java.transcoder.AbstractTranscoder.decode(AbstractTranscoder.java:42)
at com.couchbase.client.java.CouchbaseAsyncBucket$1.call(CouchbaseAsyncBucket.java:274)
at com.couchbase.client.java.CouchbaseAsyncBucket$1.call(CouchbaseAsyncBucket.java:270)
at rx.internal.operators.OnSubscribeMap$MapSubscriber.onNext(OnSubscribeMap.java:69)
I use the following to get a document:
AbstractDocument<?> doc = destinationBucket.get((String) row.key(), isJson ? JsonDocument.class : BinaryDocument.class);
For JsonDocument things work fine; row is an AsyncViewRow.
What am I doing wrong? Or is this a bug related to an incorrect value in the flags field?

Due to lack of time, I changed my approach, since I was also getting out-of-memory errors when asynchronously iterating a view over a bucket with a million documents.
As for this issue, it may be that the flags field set by cbworkloadgen (without the -j option) is 0 for each document, and the BinaryTranscoder refuses to treat the document as binary because of that value. I got around the problem by using N1QL instead of get(). However, I am not sure whether this is an issue with cbworkloadgen not setting the correct flags.
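Another option I considered but did not test (it assumes the SDK's LegacyDocument, whose 1.x-compatible transcoder may be more lenient about the flags value than BinaryTranscoder) would be to fetch such documents as LegacyDocument instead of BinaryDocument:
// Hedged sketch, not verified against cbworkloadgen data: LegacyDocument uses the
// 1.x-compatible transcoder, which may accept flag values that BinaryTranscoder rejects.
LegacyDocument doc = destinationBucket.get((String) row.key(), LegacyDocument.class);
Object content = doc.content(); // actual type depends on the stored flags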

You cannot decode binary documents like this on your own. If you save something that implements Serializable, it will be serialized and saved to Couchbase, and you can retrieve it easily. But if you run a N1QL query and try to fetch the binary data, you won't be able to decode it; this is something Couchbase doesn't support yet. You can do that with JSON documents, though.
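For example, a minimal sketch of that Serializable round trip (assuming a Bucket reference named bucket; the id user::1 is made up):
// The SDK sets the appropriate flags when storing a SerializableDocument,
// so the same document can be read back and decoded without a TranscodingException.
SerializableDocument stored = bucket.upsert(SerializableDocument.create("user::1", new java.util.Date()));
SerializableDocument loaded = bucket.get("user::1", SerializableDocument.class);
java.util.Date created = (java.util.Date) loaded.content();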

Related

Filtering with regex vs json

When filtering logs, Logstash may use grok to parse the received log file (let's say it is Nginx logs). Parsing with grok requires you to properly set the field type - e.g., %{HTTPDATE:timestamp}.
However, if Nginx starts logging in JSON format, then Logstash does very little processing. It simply creates the index and outputs to Elasticsearch. This leads me to believe that only Elasticsearch benefits from the "way" it receives the index.
Is there any advantage for Elasticsearch in having index data that was processed with regex vs. JSON? E.g., does it impact query time?
For Elasticsearch it doesn't matter how you parse the messages; it has no information about that. You only need to send a JSON document with the fields that you want to store and search on, according to your index mapping.
However, how you parse the message does matter for Logstash, since it directly impacts performance.
For example, consider the following message:
2020-04-17 08:10:50,123 [26] INFO ApplicationName - LogMessage From The Application
If you want to be able to search and apply filters on each part of this message, you will need to parse it into fields.
timestamp: 2020-04-17 08:10:50,123
thread: 26
loglevel: INFO
application: ApplicationName
logmessage: LogMessage From The Application
To parse this message you can use different filters. One of them is grok, which uses regex; but if your message always has the same format, you can use another filter, such as dissect. Both will achieve the same result, but while grok uses regex to match the fields, dissect is purely positional, and this makes a huge difference in CPU usage when you have a high number of events per second.
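For illustration, a rough sketch of both filters for the example line above (the field names are the ones listed; the patterns are untested and may need tuning for your real logs):
filter {
  # Option 1: grok, regex-based, more flexible but more CPU per event
  grok {
    match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} \[%{NUMBER:thread}\] %{LOGLEVEL:loglevel} %{WORD:application} - %{GREEDYDATA:logmessage}" }
  }
}
# Option 2: dissect, purely positional, much cheaper when the layout never changes
filter {
  dissect {
    mapping => { "message" => "%{timestamp} [%{thread}] %{loglevel} %{application} - %{logmessage}" }
  }
}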
Consider now that you have the same message, but in a JSON format.
{ "timestamp":"2020-04-17 08:10:50,123", "thread":26, "loglevel":"INFO", "application":"ApplicationName","logmessage":"LogMessage From The Application" }
It is easier and faster for Logstash to parse this message: you can do it in your input using the json codec, or you can use the json filter in your filter block.
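For example (a hedged sketch; the tcp input and port 5000 are placeholders, adjust to your own pipeline):
# json codec applied at the input, so each event arrives already parsed
input {
  tcp {
    port => 5000
    codec => json
  }
}
# or, if the JSON arrives in the message field, parse it in the filter block
filter {
  json {
    source => "message"
  }
}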
If you have control over how your log messages are created, choose a format that lets you avoid grok.

Knime MongoDBReader read an Object with an array in it

I tried to read an Object with the MongoDB reader and I always get the following error message:
"ERROR MongoDB Reader 0:19 Execute failed: Invalid type 19 for field value".
The type of the field value is an array.
I want to read the object and get the array inside it.
Here you can see the MongoDB document with the object I want to read.
“Type 19” indicates a Decimal128 type, see this link.
I’d assume that this type is just not supported by the MongoDB nodes (note how the above link says “New in version 3.4.”).
[personal opinion] The KNIME MongoDB nodes are not very well maintained. Besides some basic “hello world” attempts, I wasn’t able to use them for any of my real-world usage scenarios.

JSON escape quotes on value before deserializing

I have a server written in Rust. The server gets requests in JSON; the JSON arrives as a string, and sometimes users write quotes inside a value, for example when making a new forum thread.
The only thing I really need to do is escape the quotes inside the value.
So this:
"{"name":""test"", "username":"tomdrc1", "date_created":"07/12/2019", "category":"Developer", "content":"awdawdasdwd"}"
Needs to be turned into this:
"{"name":"\"test\"", "username":"tomdrc1", "date_created":"07/12/2019", "category":"Developer", "content":"awdawdasdwd"}"
I tried to replace:
let data = "{"name":""test"", "username":"tomdrc1", "date_created":"07/12/2019", "category":"Developer", "content":"awdawdasdwd"}".to_string().replace("\"", "\\\"");
let res: serde_json::Value = serde_json::from_str(&data).unwrap();
But it results in the following error:
thread '' panicked at 'called Result::unwrap() on an Err value: Error("key must be a string", line: 1, column: 2)
I suspect because it transforms the string to the following:
let data = "{\"name\":\"\"test\"\", \"username\":\"tomdrc1\", \"date_created\":\"07/12/2019\", \"category\":\"Developer\", \"content\":\"awdawdasdwd\"}"
If I understand your question right, the issue is that you are receiving strings which should be JSON but are in fact malformed (perhaps generated by concatenating strings).
If you are unable to fix the source of those non-JSON strings, the only solution I can think of involves a lot of heavy lifting, with caveats:
Writing a custom "malformed-JSON" parser
Careful inspection/testing/analysis of how the broken client is broken
Using the brokenness information to fix the "malformed-JSON"
Using the fixed JSON to do normal request processing
I would recommend not doing that, except maybe as a training exercise. Fixing the client will be done in minutes, but implementing this perfectly on the server will take days or weeks, and the next time this one problematic client changes you'll have to redo all the hard work.
The real answer:
Return "400 Bad Request" with some additional "malformed json" hint
Fix the client if you have access to it
Additional notes:
Avoid unwrapping in a server
Look for ways to propagate the Result::Err to caller and use it to trigger a "400 Bad Request" response
Check out the error handling chapter in the Rust book for more.
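A minimal sketch of that pattern, assuming serde_json is available; the web-framework wiring is left as comments since it depends on which framework you use:
use serde_json::Value;

// Parse the request body without unwrapping; the error is handed back to the caller.
fn parse_body(body: &str) -> Result<Value, serde_json::Error> {
    serde_json::from_str(body)
}

// In the request handler (framework-specific details omitted):
// match parse_body(&body) {
//     Ok(json) => { /* normal request processing */ }
//     Err(e) => { /* respond with 400 Bad Request, using e.to_string() as the hint */ }
// }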

Gemfire pdxInstance datatype

I am writing pdxInstances to GemFire using the sequence: rabbitmq => springxd => gemfire.
If I put this JSON into rabbitmq {'ID':11,'value':5}, value appears as a byte value in GemFire. If I put {'ID':11,'value':500}, value appears as a word and if I put {'ID':11,'value':50000} it appears as an Integer.
A problem arises when I query data from GemFire and order them. For example, if I use a query such as select * from /my_region order by value it fails, saying it cannot compare a byte with a word (or byte with an integer).
Is there any way to declare the data type in JSON? Or any other method to get rid of this problem?
To add a bit of insight into this problem... in reviewing GemFire/Geode source code, it would seem it is not possible to configure the desired value type and override GemFire/Geode's default behavior, which can be seen in JSONFormatter.setNumberField(..).
I will not explain how GemFire/Geode involves the JSONFormatter during a Region.put(key, value) operation as it is rather involved and beyond the scope of this discussion.
However, one could argue that the problem is not necessarily with the JSONFormatter class, since storing a numeric value in a byte is more efficient than storing the value in an integer, especially when the value would indeed fit into a byte. Therefore, the problem is really that the Comparator used in the Query processor should be able to compare numeric values in the same type family (byte, short, int, long), upcasting where appropriate.
If you feel so inclined, feel free to file a JIRA ticket in the Apache Geode JIRA repository at https://issues.apache.org/jira/browse/GEODE-72?jql=project%20%3D%20GEODE
Note, Apache Geode is the open source "core" of Pivotal GemFire now. See the Apache Geode website for more details.
Cheers!
Your best bet would be to take care of this with a custom module or a Groovy script. You can either write a custom module in Java to do the conversion, upload it into Spring XD, and then reference it like any other processor, or you can write a script in Groovy and pass the incoming data through a transform processor.
http://docs.spring.io/spring-xd/docs/current/reference/html/#processors
The actual conversion probably won't be too tricky, but will vary depending on which method you use. The stream creation would look something like this when you're done.
stream create --name myRabbitStream --definition "rabbit | my-custom-module | gemfire-json-server etc....."
stream create --name myRabbitStream --definition "rabbit | transform --script=file:/transform.groovy | gemfire-json-server etc...."
It seems like you have your source and sink modules set up just fine, so all you need to do is get your processor module set up to do the conversion and you should be all set.
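For what it's worth, the conversion step itself could be a few lines of Jackson inside that custom module (or the Groovy equivalent in a transform script). This is only a sketch under assumptions: the field is named value as in the question, and it relies on the idea that emitting the field as a decimal makes the JSONFormatter pick one numeric type for every document, which you would want to verify against your GemFire version:
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.node.ObjectNode;

public class ValueTypeNormalizer {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    // Rewrites the incoming payload so "value" is always a decimal literal (e.g. 5.0),
    // the intent being that every document then stores the same numeric type.
    public static String normalize(String json) throws Exception {
        ObjectNode node = (ObjectNode) MAPPER.readTree(json);
        if (node.has("value")) {
            node.put("value", node.get("value").asDouble());
        }
        return MAPPER.writeValueAsString(node);
    }
}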

parsing a huge json file without increasing heap size

I have a problem parsing a huge JSON file (200 MB). At first I tried to use Jackson to parse the JSON as a tree; however, I ran into heap size problems. For some reason, increasing the heap size is not an option.
JSON format :
{
"a1":{ "b1":{"c1":"somevalue", "c2":"somevalue"}, ... },
"a2":{ "b1":{"c1":"somevalue"},"c3":"somevalue"}, ... },
....
}
What I want to do is produce strings like
str1 = "{ "b1":{"c1":"somevalue", "c2":"somevalue"}, ... }"
str2 = "{ "b1":{"c3":"somevalue"},"c4":"somevalue"}, ... }"
Is there any way to do this without heap problems?
In Python, there is a simple way to do this with no heap problem (no JVM):
import json
data = json.loads(xxx)
for key, val in data.iteritems():  # Python 2; use data.items() on Python 3
    print val
Some thoughts:
I might not need the Jackson tree approach since I only want strings.
Streaming Jackson might be an option, but I have difficulty writing it because our JSON format is quite flexible. Any suggestions will be appreciated!
Thanks
Using object-based data binding is a bit more memory-efficient, so if you can define Java classes to match the structure, that is a much better way: faster and using less memory.
But sometimes the tree model is needed, when the structure is not known in advance.
Streaming API can help, and you can also mix approaches: iterate over JSON tokens, and then use JsonParser.readValueAs(MyType.class) or JsonParser.readValueAsTree().
This lets you build an in-memory tree or object for only a subset of the JSON input.
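For the original goal (turning each top-level value of a huge object into a string), a minimal sketch of that mixed approach could look like the following; the file name huge.json is made up, and only one top-level value is held in memory at a time:
import com.fasterxml.jackson.core.JsonFactory;
import com.fasterxml.jackson.core.JsonParser;
import com.fasterxml.jackson.core.JsonToken;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.io.File;

public class TopLevelSplitter {
    public static void main(String[] args) throws Exception {
        ObjectMapper mapper = new ObjectMapper();
        JsonFactory factory = mapper.getFactory(); // factory carries the mapper as codec
        try (JsonParser parser = factory.createParser(new File("huge.json"))) {
            if (parser.nextToken() != JsonToken.START_OBJECT) {
                throw new IllegalStateException("Expected a top-level JSON object");
            }
            // Walk the top-level fields one at a time.
            while (parser.nextToken() == JsonToken.FIELD_NAME) {
                String key = parser.getCurrentName();
                parser.nextToken();                        // move to the start of the value
                JsonNode value = parser.readValueAsTree(); // builds a tree for this value only
                System.out.println(key + " -> " + value.toString());
            }
        }
    }
}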
Finally, I used a streaming approach: I open a stream over HTTP and each time read a fixed number of bytes into a buffer. Once I detect that the buffer contains a complete, valid string, I emit the string and truncate the buffer. This way I use very little memory. Thanks!