Sphinx 3 Search engine: Having problems reading JSON from CSV source

When I try to read JSON content from a field I get:
WARNING: document 1, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'a:foo'
Here are the details:
This is the (super simplified) CSV file I'm trying to read:
1,hello world, document number one,a:foo
22,hello again, document number two,foo:bar
23,hello now, This is some stuff,foo:{bar:baz}
24,hello cow, more test stuff and things,{foo:bar}
55,hello suess, box and sox and goats and moats,[a]
56,hello raven, nevermore said the thing,foo:bar
When I run the indexer this is the result I get:
../bin/indexer --config /home/ec2-user/sphinx/etc/sphinx.conf --all --rotate
Sphinx 3.3.1 (commit b72d67b)
Copyright (c) 2001-2020, Andrew Aksyonoff
Copyright (c) 2008-2016, Sphinx Technologies Inc (http://sphinxsearch.com)
using config file '/home/ec2-user/sphinx/etc/sphinx.conf'...
indexing index 'csvtest'...
WARNING: document 1, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'a:foo'
WARNING: document 22, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'foo:bar'
WARNING: document 23, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'foo:{bar:baz}'
WARNING: document 24, attribute assorted: JSON error: syntax error, unexpected '}', expecting '[' near '}'
WARNING: document 55, attribute assorted: JSON error: syntax error, unexpected ']', expecting '[' near ']'
WARNING: document 56, attribute assorted: JSON error: syntax error, unexpected TOK_IDENT, expecting $end near 'foo:bar'
collected 6 docs, 0.0 MB
sorted 0.0 Mhits, 100.0% done
total 6 docs, 0.1 Kb
total 0.0 sec, 17.7 Kb/sec, 1709 docs/sec
rotating indices: successfully sent SIGHUP to searchd (pid=14393).
This is the entire config file:
source csvsrc
{
    type = csvpipe
    csvpipe_delimiter = ,
    csvpipe_command = cat /home/ec2-user/sphinx/etc/example.csv
    csvpipe_field_string = t
    csvpipe_attr_string = c
    csvpipe_attr_json = assorted
}

index csvtest
{
    source = csvsrc
    path = /var/data/test7
    morphology = stem_en
    rt_field = t
    rt_field = c
    rt_field = assorted
}

indexer
{
    mem_limit = 128M
}

searchd
{
    listen = 9312
    listen = 9306:mysql41
    log = /var/log/searchd.log
    query_log = /var/log/query.log
    pid_file = /var/log/searchd.pid
    binlog_path = /var/data
}
And if I log in and query, it's pretty obvious that the JSON was not, in fact, indexed (as expected, given the warnings):
select * from csvtest;
+------+-------------+----------------------------------+----------+
| id | t | c | assorted |
+------+-------------+----------------------------------+----------+
| 1 | hello world | document number one | NULL |
| 22 | hello again | document number two | NULL |
| 23 | hello now | This is some stuff | NULL |
| 24 | hello cow | more test stuff and things | NULL |
| 55 | hello suess | box and sox and goats and moats | NULL |
| 56 | hello raven | nevermore said the thing | NULL |
+------+-------------+----------------------------------+----------+
6 rows in set (0.00 sec)
I have tried a few things, but I'm just groping in the dark.
Some things I have tried:
Alternate formats of JSON: I have tried {foo:bar}, {[foo:bar]}, and [{foo,bar}], based on some experience with other JSON consumers that want either an array or a dict at the top level. These actually generate slightly different errors:
WARNING: document 24, attribute assorted: JSON error: syntax error, unexpected '}', expecting '[' near '}'
WARNING: document 55, attribute assorted: JSON error: syntax error, unexpected ']', expecting '[' near ']'
I have tried adding a trailing comma, thinking that might be the $end token the parser is looking for. This generates an actual error, ERROR: index 'csvtest': source 'csvsrc': not all columns found (found=5, total=4, line=1), which prevents index generation. That makes sense to me.
2a) I tried adding a whole extra column after the JSON, so I could have the trailing comma without the error that prevents the index from generating. This did generate the index, but did not provide the $end token that the JSON parser was looking for.
I'm totally stumped.

Well, as such, a:foo isn't a valid JSON value AFAIK. It looks like it's meant to be an object, so it would need {...} around it.
But even {foo:bar} isn't valid. At the very least the value should be quoted, {foo:"bar"}, but really the keys need quoting too: {"foo":"bar"}.
JavaScript object literals technically allow unquoted key names, but JSON requires the quotes.
...but also remember it's CSV. Quotes are typically used for field quoting (e.g. when columns contain commas), so the quotes inside the JSON need double encoding. It ends up a bit messy...
24,hello cow, more test stuff and things,"{""foo"":""bar""}"
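If the CSV is generated programmatically, a minimal sketch of that double encoding might look like this (in Scala; toCsvField is a made-up helper name):

// Wrap a JSON document as a CSV field: double each quote inside it, then
// wrap the whole field in quotes (standard CSV escaping).
def toCsvField(json: String): String =
  "\"" + json.replace("\"", "\"\"") + "\""

val json = """{"foo":"bar"}"""
println(s"24,hello cow, more test stuff and things,${toCsvField(json)}")
// prints: 24,hello cow, more test stuff and things,"{""foo"":""bar""}"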

Related

Invalid JSON Expression

I am making use of a non-static import:
JsonPath jp = response.jsonPath();
System.out.println(jp.get("data?(@.id>14).employee_name").toString());
For a JSON as shown below:
{"status":"success","data":[{"id":"1","employee_name":"Tiger Nixon","employee_salary":"320800","employee_age":"61","profile_image":""},{"id":"2","employee_name":"Garrett Winters","employee_salary":"170750","employee_age":"63","profile_image":""}]}
When I try to run it, I get the error below:
java.lang.IllegalArgumentException: Invalid JSON expression:
Script1.groovy: 1: expecting EOF, found '[' @ line 1, column 31.
data[?(@.id>14)].employee_name
^
1 error
Can someone explain why this error is being thrown?
I doubt that syntax is right; in any case, you should use the expression below. Also note that id is a string in your response, so you will have to put the value in quotes:
jp.get("data.find {it.id > '14'}.employee_name").toString();
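For intuition, here is a rough plain-Scala model of what that find does (the data model is hypothetical, for illustration only; it is not the REST Assured API):

// Model the "data" array as a list of maps and replicate the GPath find.
val data = List(
  Map("id" -> "1", "employee_name" -> "Tiger Nixon"),
  Map("id" -> "2", "employee_name" -> "Garrett Winters")
)
// Comparing ids numerically; note that a string comparison like it.id > '14'
// is lexicographic, so converting to Int is more predictable.
val name = data.find(_("id").toInt > 14).map(_("employee_name"))
println(name) // None -- no employee in this sample has an id above 14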

Scala create multi-line JSON String

I'm trying to create a multi-line String in Scala as below.
val errorReport: String =
  """
    |{
    |"errorName":"blah",
    |"moreError":"blah2",
    |"errorMessage":{
    | "status": "bad",
    | "message": "Unrecognized token 'noformatting': was expecting 'null', 'true', 'false' or NaN
at [Source: (ByteArrayInputStream); line: 1, column: 25]"
    | }
    |}
  """.stripMargin
It's nested JSON, and it doesn't display properly when I print it. The message field inside errorMessage (which is the output of calling getMessage on an instance of a Throwable) is causing the issue, because there is a newline right before
at [Source: ....
If I get rid of that line the JSON displays properly. Any ideas on how to properly format this are appreciated.
EDIT: The issue is with the newline character. So, more concisely, the question is: how do I handle the newline within the triple quotes so that the result is still recognized as JSON?
EDIT 2: message is being set by a variable like so:
"message": "${ex.getMessage}"
where ex is a Throwable. An example of the contents of that getMessage call is provided above.
I assume that your question has nothing to do with JSON, and that you're simply asking how to create very wide strings without violating the horizontal 80-character limit in your Scala code. Fortunately, Scala's string literals have at least the following properties:
You can go from ordinary code to string-literal mode using quotes "..." and triple quotes """...""".
You can go from string-literal mode to ordinary code mode using ${...}
The free monoid over characters is reified as a method: the + operation concatenates strings.
The whole construction can be made robust to whitespace and indentation using | and stripMargin.
Altogether, this allows you to write down arbitrary string literals without ever violating horizontal character limits, in a way that is robust with respect to indentation.
In this particular case, you want to make a line break in the ambient scala code without introducing a line break in your text. For this, you simply
exit the string-literal mode by closing """
insert concatenation operator + in code mode
make a line-break
indent however you want
re-enter the string-literal mode again by opening """
That is,
"""blah-""" +
"""blah"""
will create the string "blah-blah", without line break in the produced string.
Applied to your concrete problem:
val errorReport: String = (
  """{
    | "errorName": "blah",
    | "moreError": "blah2",
    | "errorMessage": {
    | "status": "bad",
    | "message": "Unrecognized token 'noformatting'""" +
    """: was expecting 'null', 'true', 'false' or NaN at """ +
    """[Source: (ByteArrayInputStream); line: 1, column: 25]"
    | }
    |}
  """
).stripMargin
Maybe a more readable option would be to construct the lengthy message separately from the neatly indented JSON, and then use string interpolation to combine the two components:
val errorReport: String = {
  val msg =
    """Unrecognized token 'noformatting': """ +
    """was expecting 'null', 'true', 'false' or NaN at """ +
    """[Source: (ByteArrayInputStream); line: 1, column: 25]"""
  s"""{
     | "errorName": "blah",
     | "moreError": "blah2",
     | "errorMessage": {
     | "status": "bad",
     | "message": "${msg}"
     | }
     |}
   """
}.stripMargin
If the message itself contains line breaks
Since JSON does not allow multiline string literals, you have to do something else:
To remove line breaks, use .replaceAll("\\n", "") or rather .replaceAll("\\n", " ")
To encode line breaks with the escape sequence \n, use .replaceAll("\\n", "\\\\n") (yes, backslashes; see the sketch below)
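For example, a minimal sketch of the second option (the sample string here stands in for ex.getMessage):

// Turn the real newline in the message into the two-character JSON escape \n
// before interpolating the message into the JSON text.
val rawMsg = "was expecting 'null', 'true', 'false' or NaN\nat [Source: ...]"
val jsonSafeMsg = rawMsg.replaceAll("\\n", "\\\\n")
println(s"""{ "message": "$jsonSafeMsg" }""")
// { "message": "was expecting 'null', 'true', 'false' or NaN\nat [Source: ...]" }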

Assign puppet Hash to hieradata yaml

I want to assign a hash variable from Puppet to a Hiera data structure, but I only get a string.
Here is an example to illustrate what I want; ultimately I don't want to access a fact.
---
filesystems:
  - partitions: "%{::partitions}"
And here is my debug code:
1  $filesystemsarray = lookup('filesystems',Array,'deep',[])
2  $filesystems = $filesystemsarray.map | $fs | {
3    notice("fs: ${fs['partitions']}")
4  }
5
6  notice("sda1: ${filesystemsarray[0]['partitions']['/dev/sda1']}")
The map leads to the following output:
Notice: Scope(Class[Profile::App::Kms]): fs: {"/dev/mapper/localhost--vg-root"=>{"filesystem"=>"ext4", "mount"=>"/", "size"=>"19.02 GiB", "size_bytes"=>20422066176, "uuid"=>"02e2ba2c-2ee4-411d-ac63-fc963c8026b4"}, "/dev/mapper/localhost--vg-swap_1"=>{"filesystem"=>"swap", "size"=>"512.00 MiB", "size_bytes"=>536870912, "uuid"=>"95ba4b2a-7434-48fd-9331-66443c752a9e"}, "/dev/sda1"=>{"filesystem"=>"ext2", "mount"=>"/boot", "partuuid"=>"de90a5ed-01", "size"=>"487.00 MiB", "size_bytes"=>510656512, "uuid"=>"398f2ab6-a7e8-4983-bd81-db03984fbd0e"}, "/dev/sda2"=>{"size"=>"1.00 KiB", "size_bytes"=>1024}, "/dev/sda5"=>{"filesystem"=>"LVM2_member", "partuuid"=>"de90a5ed-05", "size"=>"19.52 GiB", "size_bytes"=>20961034240, "uuid"=>"wLKRQm-9bdn-mHA8-M8bE-NL76-Gmas-L7Gp0J"}}
This seems to be a Hash, as expected, but the notice in line 6 leads to:
Error: Evaluation Error: A substring operation does not accept a String as a character index. Expected an Integer at ...
What am I doing wrong?

Trying to understand the number of ParseErrors in html5lib-tests

I was looking at the following test case in html5lib-tests:
{"description":"<!DOCTYPE\\u0008",
"input":"<!DOCTYPE\u0008",
"output":["ParseError", "ParseError", "ParseError",
["DOCTYPE", "\u0008", null, null, false]]},
source
State                      | Input char | Actions
---------------------------+------------+--------------------------------------------------------------
Data State                 | "<"        | -> TagOpenState
TagOpenState               | "!"        | -> MarkupDeclarationOpenState
MarkupDeclarationOpenState | "DOCTYPE"  | -> DOCTYPE state
DOCTYPE state              | "\u0008"   | Parse error; -> before DOCTYPE name state (reconsume)
before DOCTYPE name state  | "\u0008"   | DOCTYPE(name = "\u0008"); -> DOCTYPE name state
DOCTYPE name state         | EOF        | Parse error. Set force quirks on. Emit DOCTYPE -> Data state.
Data state                 | EOF        | Emit EOF.
I'm wondering where those three errors come from. I can only track down two, but I assume I'm making an error in my logic somewhere.
The one you're missing is the one from the "Preprocessing the input stream" section:
Any occurrences of any characters in the ranges U+0001 to U+0008, U+000E to U+001F, U+007F to U+009F, U+FDD0 to U+FDEF, and characters U+000B, U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, U+2FFFE, U+2FFFF, U+3FFFE, U+3FFFF, U+4FFFE, U+4FFFF, U+5FFFE, U+5FFFF, U+6FFFE, U+6FFFF, U+7FFFE, U+7FFFF, U+8FFFE, U+8FFFF, U+9FFFE, U+9FFFF, U+AFFFE, U+AFFFF, U+BFFFE, U+BFFFF, U+CFFFE, U+CFFFF, U+DFFFE, U+DFFFF, U+EFFFE, U+EFFFF, U+FFFFE, U+FFFFF, U+10FFFE, and U+10FFFF are parse errors. These are all control characters or permanently undefined Unicode characters (noncharacters).
This causes a parse error before the U+0008 character ever reaches the tokenizer. Since the tokenizer is defined as reading from the input stream, the tokenizer tests assume the input stream has had its normal preprocessing applied to it.
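As a quick illustration of that rule, here is a hand-written sketch of the preprocessing check (in Scala, transcribed from the spec text quoted above; it is not html5lib's own code):

// True if the code point is one the input-stream preprocessor flags as a
// parse error: the listed control characters plus Unicode noncharacters.
def isPreprocessingParseError(cp: Int): Boolean =
  (cp >= 0x0001 && cp <= 0x0008) ||
  (cp >= 0x000E && cp <= 0x001F) ||
  (cp >= 0x007F && cp <= 0x009F) ||
  (cp >= 0xFDD0 && cp <= 0xFDEF) ||
  cp == 0x000B ||
  ((cp & 0xFFFE) == 0xFFFE && cp <= 0x10FFFF) // U+xFFFE / U+xFFFF noncharacters

println(isPreprocessingParseError(0x0008)) // true: the third ParseError in the test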

Json Type Provider: Parsing Valid Json Fails

I have the following code block in my REPL
#r "../packages/FSharp.Data.2.2.1/lib/net40/FSharp.Data.dll"
open FSharp.Data
[<Literal>]
let uri = "http://www.google.com/finance/option_chain?q=AAPL&output=json"
type OptionChain = JsonProvider<uri>
When I run it, FSI returns:
Error 1 The type provider 'ProviderImplementation.JsonProvider'
reported an error: Cannot read sample JSON from
'http://www.google.com/finance/option_chain?q=AAPL&output=json':
Invalid JSON starting at character 1, snippet =
---- {expiry:{y:2
----- json =
------ {expiry:{y:2015,m:5,d:8},expirations: [{y:2015,m:5,d:8},{y:2015,m:5,d:15},
This JSON is valid according to two other sites. Is it a bug in the type provider?
The output isn't valid JSON, because the keys are not quoted:
{expiry:{y:2015,m:5,d:8},expirations:[{y:2015,m:5,d:8},{y:2015,m:5,d:15},{y:2015,m:5,d:22},{y:2015,m:5,d:29},{y:2015,m:6,d:5},{y:2015,m:6,d:12},{y:2015,m:6,d:19},{y:2015,m:6,d:26},{y:2015,m:7,d:17},{y:2015,m:8,d:21},{y:2015,m:10,d:16},{y:2016,m:1,d:15},{y:2017,m:1,d:20}],
puts:[{cid:"43623726334021",s:"AAPL150508P00085000",e:"OPRA",p:"-",c:"-",b:"-",a:"-",oi:"-",vol:"-",strike:"85.00",expiry:"May 8, 2015"},
...
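Until the feed itself is fixed, one rough workaround is to quote the bare keys before parsing. Here is a deliberately naive, regex-based sketch (in Scala for illustration; it would also rewrite matching text inside string values, so treat it as a quick hack rather than a real fix):

// Quote any bare identifier that directly follows { [ or , and precedes a colon.
val raw = """{expiry:{y:2015,m:5,d:8},expirations:[{y:2015,m:5,d:8}]}"""
val quoted = raw.replaceAll("([,{\\[])(\\w+):", "$1\"$2\":")
println(quoted)
// {"expiry":{"y":2015,"m":5,"d":8},"expirations":[{"y":2015,"m":5,"d":8}]}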