jmeter Invalid UTF-8 middle byte - json

I'm using jMeter to shoot json through post requests to my test server.
the following request always fail:
{
"location": {
"latitude": "37.390737",
"longitude": "-121.973864"
},
"category": "Café & Bakeries"
}
the error message in the response data is:
Invalid UTF-8 middle byte 0x20
at [Source: org.apache.catalina.connector.CoyoteInputStream#6073ddf0; line: 6, column: 20]
The request is not sent to the server at all.
other requests (e.g. replacing the value in category with other valid category like "Delis") works perfectly.
I guess it's an encoding issue related to "Café" but I don't know how to resolve it.
any idea?
Thanks!

in the HTTP request itself, it is possible to set "content encoding". I set there "utf-8" and it solved the problem

You'll probably need an HTTP header to post that JSON:
Content-Type: application/json; charset=utf-8
Without this, it's likely that the string isn't UTF-8–encoded. JSON should be in UTF-8, so the hex bytes for é should be 0xc3 0xa9.
Without that header, the byte sequence is probably 0xe9, which is in ISO-8859-1 encoding. That would explain the error, as UTF-8 sequences starting 0xe_ are 3-byte sequences, so it sees 0xe9 0x20 (where 0x20 is the space after the é) and complains about an "invalid middle byte".
Source: Posting a JSON request with JMeter

None of the above solutions worked for me in jmeter 5.2.1
What I tried?
Set the jmeter properties file: sampleresult.default.encoding=UTF-8 (Didn't work)
Set the header Content-Type=application/json;UTF-8 ( Didn't work)
Tried preProcessor sampler.getArguments().getArgument(0).setValue(new String(utf8Bytes)) ( Didn't work)
Fortunately I noticed a field "Content encoding" ( As mentioned by Ofir Kolar) which seem to worked finally

Related

CICS TS(DFHJS2LS): Chinese characters are getting corrupted when received into MAINFRAME from POSTMAN tool

We have developed a webservice having CICS as the HTTP SERVER (service provider). This Webservice takes the input JSON (which has both English and Chinese characters) from any client/POSTMAN tool and will be processed in Mainframe (CICS).
DFHJS2LS: JSON schema to high-level language conversion for request-response services
We are using this proc - DFHJS2LS to enable webservices in Mainframe. ThisI BM provided procedure does the conversion of JSON to MF copybook and vice-versa. Also it converts the UTF-8 code unit into UTF-16 when it reaches mainframe copybook.
Issue:
The issue what we face now is on the Chinese characters. The Chinese characters which we pass in JSON are not getting converted properly and they are getting corrupted when it is received inside mainframe. The conversion from UTF-8 to UTF-16 is not happening (this is my suspect).
市 - this is the chinese character passed in JSON (POSTMAN).
Expected value in Mainframe copybook is 5E02(UTF-16 - hex value)
but we got 00E5 00B8 0082(UTF-8 hex value)
we have tried all header values and still no luck.....
content type = application/json
charset=UTF-8 / UTF-16
Your inputs are much appreciated in addressing this DBCS/unicode/chinese character issue.
In the COBOL are you declaring the filed that will receive the Chinese characters as Pic G :
01 China-Test-Message.
03 Msg-using-pic-x Pic X(10).
03 Msg-using-pic-g Pic G(4) Usage Display-1.
Try "USAGE NATIONAL" which shoul dmap to UTF-16 which is probably the code page for the chinese character.
RTFM here:-
https://www.ibm.com/support/knowledgecenter/SS6SG3_6.3.0/pg/concepts/cpuni01.html
The chinese conversion is resolved once we changed our HTTP header to this -
Content-Type = application/json;charset=UTF-8
thanks everyone for the support.

How to decode an HTTP request with utf-8 and treat the surrogate keys (Emojis)

I'm having a hard time dealing with some parsing issues related to Emojis.
I have a json requested through the brandwatch site using urllib.(1) Then, I must decode it in utf-8, however, when I do so, I'm getting surrogate keys and the json.loader cannot deal with them. (2)
I've tried to use BeautifulSoup4, which works great, however, when there's a &quot on the site result, it is transformed to ", and then, the json.loader cannot deal with it for it says that a , is missing. After tons of search, I gave up trying to escape the " which would be the ideal.(3)
So now, I'm stuck with both "solutions/problems". Any ideas on how to proceed?
Obs: This is a program that fetchs data from the brandwatch and put it inside an MySQL database. So performance is an issue here.
Obs2: PyJQ is a JQ for Python with does the request and I can change the opener.
(1) - Dealing with the first approach using urllib, these are the relevants parts of the code used for it:
def downloader(url):
return json.loads(urllib.request.urlopen(url).read().decode('utf8'))
...
parsed = pyjq.all(jqparser,url=url, vars={"today" : start_date}, opener=downloader)
Error Casted:
Exception ignored in: '_pyjq.pyobj_to_jv'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 339: surrogates not allowed
*** Error in `python': munmap_chunk(): invalid pointer: 0x00007f5f806303f0 ***
If I print the result of urllib.request.urlopen(url).read().decode('utf8') instead of sending it to json.loader, that's what appears. These keys seems to be Emojis.
"fullname":"Botinhas\uD83D\uDC62"
(2) Dealing with the second approach using BeautifulSoup4, here's the relevant part of the code. (Same as above, just changed the downloader function)
def downloader(url):
return json.loads(BeautifulSoup(urllib.request.urlopen(url), 'lxml').get_text())
...
parsed = pyjq.all(jqparser,url=url, vars={"today" : start_date}, opener=downloader)
And this is the error casted:
Expecting ',' delimiter: line 1 column 4814765 (char 4814764)
Doing the print, the " before Diretas Já should be escaped.
"title":"Por "Diretas Já", manifestações pelo país ocorrem em preparação ao "Ocupa Brasília" - Sindicato dos Engenheiros no Estado do Rio de Janeiro"
I've thought of running a regex, however, I'm not sure whether this would be the most appropriate solution to this case as performance is an issue.
(3) - Part of Brandwatch result with the &quot problem mentioned above
UPDATE:
As Martin stated in the comments, I ran a replace swapping &quot for nothing. Then, it raised the former problem, of the emoji.
Exception ignored in: '_pyjq.pyobj_to_jv'
UnicodeEncodeError: 'utf-8' codec can't encode character '\ud83d' in position 339: surrogates not allowed
*** Error in `python': munmap_chunk(): invalid pointer: 0x00007f5f806303f0 ***
UPDATE2:
I've added this to the downloader function:
re.sub(r'\\u(d|D)([a-z|A-Z|0-9]{3})', "", urllib.request.urlopen(url).read().decode('utf-8','ignore'))
It solved the issue, however, I don't think it's the best way to solve it. If anybody knows a better option.

JSON validation fails (RFC 4627)

I have an Api method which returns json data. When I try to validate the json data using the online json validator: http://pro.jsonlint.com/, with the compare option, giving url in one section and the output of the url in another section, the url section shows an error and the section with data copied and pasted is validated.
What could be the issue here?
UPDATE:
I copied the 2 outputs into notepad and did a file compare, there is a non-printable character at the begining of the output from url.
D:\>fc j1.js j2.js
Comparing files j1.js and J2.JS
***** j1.js
{
"responseStatus": null,
***** J2.JS
{
"responseStatus": null,
*****
The content-type of the api response is "application/json; charset=utf-8".
Without knowledge of your locale and character set, this is speculation; but the placement of the spurious text suggests that it may be a Unicode BOM. (Hmmm, six bytes? Two UTF-8 BOMs?)

What is the boundary in multipart/form-data?

I want to ask a question about the multipart/form-data. In the HTTP header, I find that the Content-Type: multipart/form-data; boundary=???.
Is the ??? free to be defined by the user? Or is it generated from the HTML? Is it possible for me to define the ??? = abcdefg?
Is the ??? free to be defined by the user?
Yes.
or is it supplied by the HTML?
No. HTML has nothing to do with that. Read below.
Is it possible for me to define the ??? as abcdefg?
Yes.
If you want to send the following data to the web server:
name = John
age = 12
using application/x-www-form-urlencoded would be like this:
name=John&age=12
As you can see, the server knows that parameters are separated by an ampersand &. If & is required for a parameter value then it must be encoded.
So how does the server know where a parameter value starts and ends when it receives an HTTP request using multipart/form-data?
Using the boundary, similar to &.
For example:
--XXX
Content-Disposition: form-data; name="name"
John
--XXX
Content-Disposition: form-data; name="age"
12
--XXX--
In that case, the boundary value is XXX. You specify it in the Content-Type header so that the server knows how to split the data it receives.
So you need to:
Use a value that won't appear in the HTTP data sent to the server.
Be consistent and use the same value everywhere in the request message.
The answer to substance of the question is yes. You can use an arbitrary value for the boundary parameter as long as it is less than 70 bytes long and only contains 7-bit US-ASCII (printable) characters.
If you use one of multipart/* content types, you are actually required to specify the boundary parameter in the Content-Type header. Otherwise, in the case of an HTTP request, the server will be unable to parse the payload.
Unless you are absolutely certain that only the US-ASCII character set will be used in its payload, you may want to add a Content-Type header to each part, with the charset parameter set to UTF-8.
A few relevant excerpts from the RFC2046:
4.1. Text Media Type
A "charset" parameter may be used to indicate the character set of the body text for "text" subtypes, notably including the subtype "text/plain", which is a generic subtype for plain text.
4.1.2. Charset Parameter
A critical parameter that may be specified in the Content-Type field
for "text/plain" data is the character set.
Unlike some other parameter values, the values of the charset parameter are NOT case sensitive. The default character set, which must be assumed in the absence of a charset parameter, is US-ASCII.
5.1. Multipart Media Type
As stated in the definition of the Content-Transfer-Encoding field [RFC 2045], no encoding other than "7bit", "8bit", or "binary" is permitted for entities of type "multipart". The "multipart" boundary delimiters and header fields are always represented as 7bit US-ASCII in any case (though the header fields may encode non-US-ASCII header text as per RFC 2047) and data within the body parts can be encoded on a part-by-part basis, with Content-Transfer-Encoding fields for each appropriate body part.
The Content-Type field for multipart entities requires one parameter, "boundary". The boundary delimiter line is then defined as a line consisting entirely of two hyphen characters ("-", decimal value 45) followed by the boundary parameter value from the Content-Type header field, optional linear whitespace, and a terminating CRLF.
Boundary delimiters must not appear within the encapsulated material, and must be no longer than 70 characters, not counting the two leading hyphens.
The boundary delimiter line following the last body part is a distinguished delimiter that indicates that no further body parts will follow. Such a delimiter line is identical to the previous delimiter lines, with the addition of two more hyphens after the boundary parameter value.
Here is an example using an arbitrary boundary:
Content-Type: multipart/form-data; boundary="yet another boundary"
--yet another boundary
Content-Disposition: form-data; name="foo"
bar
--yet another boundary
Content-Disposition: form-data; name="baz"
quux
--yet another boundary
Content-Disposition: form-data; name="feels"
Content-Type: text/plain; charset=utf-8
🤷
--yet another boundary--
multipart/form-data contains boundary to separate name/value pairs. The boundary acts like a marker of each chunk of name/value pairs passed when a form gets submitted. The boundary is automatically added to a content-type of a request header.
The form with enctype="multipart/form-data" attribute will have a request header Content-Type : multipart/form-data; boundary --- WebKit193844043-h (browser generated vaue).
The payload passed looks something like this:
Content-Type: multipart/form-data; boundary=---WebKitFormBoundary7MA4YWxkTrZu0gW
-----WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name=”file”; filename=”captcha”
Content-Type:
-----WebKitFormBoundary7MA4YWxkTrZu0gW
Content-Disposition: form-data; name=”action”
submit
-----WebKitFormBoundary7MA4YWxkTrZu0gW--
On the webservice side, it's consumed in #Consumes("multipart/form-data") form.
Beware, when testing your webservice using chrome postman, you need to check the form data option(radio button) and File menu from the dropdown box to send attachment. Explicit provision of content-type as multipart/form-data throws an error. Because boundary is missing as it overrides the curl request of post man to server with content-type by appending the boundary which works fine.
See RFC1341 sec7.2 The Multipart Content-Type
we have to split our data. So, the server understands what we send.
1 Example: We split data
$email = $_POST['email'];
$p_id = $_POST['pid'];
2.Example: if We send JSON data ( With ) content type Multipart/form-data, we get a warning related to boundary
$json = file_get_contents("php://input");
use this
headers: {
'content-type': 'application/x-www-form-urlencoded'
}
for boundary error

json character encoding problem

When I encode an array to JSON I get "u00e1" instead of á.
How could I solve the character encoding?
Thanks
Your input data is not Unicode. 0xE1 is legacy latin1/ISO-8859-*/Windows-1252 for á. \u00e1 is the JSON/JavaScript to encode that. JSON must use a Unicode encoding.
Solve it by either fixing your input or converting it using something like iconv.
The browser's default encoding is probably Unicode UTF-8. Try
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">.
One problem can be if you check the response only (the response is only a text but the JSON must be an object).
You have to parse the response text to be a javascript object first (JSON.parse in javascript) and after that the characters will become the same as on the server side.
Example:
On the server in the php code:
$myString = "árvízrtűrő tükörfúrógép";
echo json_encode($myString); //this sends the encoded string via a protocol that maybe can handle only ascii characters, so the result on the client side is:
On the client side
alert(response); //check the text sent by the php
output: "\u00e1rv\u00edzrt\u0171r\u0151 t\u00fck\u00f6rf\u00far\u00f3g\u00e9p"
Make a js object from the respopnse
parsedResponse = JSON.parse(response);
alert(parsedResponse);
output: "árvízrtűrő tükörfúrógép"