pdf2json: how to customize the output json file? - json

Can I customize the output of pdf2json command line utility so that the output json file has a specific structure?
I'm trying to extract data from a pdf (see figure below) and store it as a json file.
I tried pdf2json -f [input directory or pdf file]. The command does output a json file that contains the information I need, but it also contains a lot of information I don't need:
{"formImage":{"Transcoder":"pdf2json#0.6.6","Agency":"","Id":{"AgencyId":"","Name":"","MC":false,"Max":1,"Parent":""},"Pages":[{"Height":49.5,"HLines":[{"x":13.111828125000002,"y":4.678418750000001,"w":0.44775000000000004,"l":78.96384375000001},{"x":13.111828125000002,"y":44.074375,"w":0.44775000000000004,"l":78.96384375000001}],"VLines":[],"Fills":[{"x":0,"y":0,"w":0,"h":0,"clr":1}],"Texts":[{"x":13.632429687500002,"y":4.382312499999998,"w":4.163000000000001,"clr":0,"A":"left","R":[{"T":"abundant","S":-1,"TS":[0,13.9091,0,0]}]},{"x":25.021517303398443,"y":4.382312499999998,"w":4.139000000000001,"clr":0,"A":"left","R":[{"T":"positive%3A1","S":-1,"TS":[0,13.9091,0,0]}]},{"x":32.38324218816407,"y":4.382312499999998,"w":4.412000000000001,"clr":0,"A":"left","R":[{"T":"negative%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":40.12887364285157,"y":4.382312499999998,"w":3.1670000000000003,"clr":0,"A":"left","R":[{"T":"anger%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":46.1237223885547,"y":4.382312499999998,"w":5.993,"clr":0,"A":"left","R":[{"T":"anticipation%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":56.09123069480469,"y":4.382312499999998,"w":3.8400000000000003,"clr":0,"A":"left","R":[{"T":"disgust%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":63.0324864791797,"y":4.382312499999998,"w":2.4170000000000003,"clr":0,"A":"left","R":[{"T":"fear%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":67.97264684597657,"y":4.382312499999998,"w":2.109,"clr":0,"A":"left","R":[{"T":"joy%3A1","S":-1,"TS":[0,13.9091,0,0]}]},{"x":72.47968185183595,"y":4.382312499999998,"w":4.013,"clr":0,"A":"left","R":[{"T":"sadness%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":79.66421908894532,"y":4.382312499999998,"w":4.178000000000001,"clr":0,"A":"left","R":[{"T":"surprise%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":87.08078776941407,"y":4.382312499999998,"w":2.8930000000000002,"clr":0,"A":"left","R":[{"T":"trust%3A0","S":-1,"TS":[0,13.9091,0,0]}]},{"x":13.632429687500002,"y":5.017468750000002,"w":2.4480000000000004,"clr":0,"A":"left","R"
:
I only need the text from the pdf file. I don't need any information about the format. So I need something like this:
{"data":
{
"abundant": {
"positive":1,
"negative":0,
"anger":0,
...
},
"abuse": {...},
"abutment": {...},
...
}
}

I've built a Node.js module that uses pdf2json and some simple math to extract the table data from the PDF. The output is an array of rows.
https://www.npmjs.com/package/pdf2table
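The pdf2json output shown in the question can also be reshaped with a short post-processing step. The Python sketch below is only an illustration, and it rests on assumptions read off the sample output: text runs live under formImage, Pages, Texts; each run's string is the URL-encoded R[0].T value; and all cells of one table row share the same y coordinate. The function name table_from_pdf2json and the rounding tolerance are made up for this example.

```python
from collections import defaultdict
from urllib.parse import unquote


def table_from_pdf2json(doc):
    """Build {"data": {word: {label: value}}} from parsed pdf2json output."""
    data = {}
    for page in doc["formImage"]["Pages"]:
        rows = defaultdict(list)
        for text in page["Texts"]:
            # Assumption: cells of one row share y; round to absorb float noise.
            row_key = round(text["y"], 2)
            # "T" is URL-encoded in the sample (e.g. "positive%3A1").
            rows[row_key].append((text["x"], unquote(text["R"][0]["T"])))
        for row_key in sorted(rows):
            # Order the cells of the row from left to right.
            cells = [cell for _, cell in sorted(rows[row_key])]
            word, scores = cells[0], {}
            for cell in cells[1:]:
                label, sep, value = cell.partition(":")
                if sep and value.isdigit():
                    scores[label] = int(value)
            data[word] = scores
    return {"data": data}
```

Loading the file with json.load, passing the result to this function, and writing the return value with json.dump should give a file in the {"data": {...}} shape asked for, provided the assumptions above hold for the whole PDF.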

Related

For Jmeter post request, how can I send multiple requests in body from CSV(already have JSON inside) file?

I have a csv file and it has hundred record of JSON. I want to send JSON in body of JMETER post request one by one from CSV.
I tried this and I am getting the desired results, but it is adding extra " " around every variable or value. For example, when sending this as the body:
[
{
"id":"1232435",
"ref":"88f000",
"data":"5a344f",
"number":"896751245"
}
]
JMeter is processing this body as
"[
{
""id"":""1232435"",
""ref"":""88f000"",
""data"":""5a344f"",
""number"":""896751245""
}
]"
I want it to process same as in csv file.
I cannot reproduce your issue, double check your JSON file contents.
If I put the following line into CSV file:
[{"id":"1232435","ref":"88f000","data":"5a344f","number":"896751245"}]
it's being sent as it is without any extra quotation marks.
Also, you have dataa in the CSV Data Set Config and data in the HTTP Request, so it might be the case that you're not actually reading the file at all: given your current CSV Data Set Config setup, only the first line of the file would end up in the data variable, that is, your request would look like [
So use the Debug Sampler and View Results Tree listener combination to check whether the variable is correct and what it looks like. If there are still extra quotation marks, they can be removed using the __strReplace() function.
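The doubled quotes in the question look like standard CSV escaping, where a field containing quotes is wrapped in quotes and every inner quote is doubled. Whether the CSV file was actually saved that way (for example by a spreadsheet export) is an assumption; the question does not confirm it. The Python sketch below is outside JMeter and only illustrates the escaping rule, using a shortened version of the body from the question.

```python
import csv
import io

body = '[{"id":"1232435","ref":"88f000"}]'

# Writing the body as a single CSV field applies the standard escaping.
buffer = io.StringIO()
csv.writer(buffer).writerow([body])
escaped = buffer.getvalue().strip()
print(escaped)  # "[{""id"":""1232435"",""ref"":""88f000""}]"

# A CSV-aware reader undoes the escaping and restores the original text.
restored = next(csv.reader(io.StringIO(escaped)))[0]
print(restored == body)  # True
```

If the file on disk already contains the escaped form, whatever reads it has to parse it as quoted CSV data, otherwise the extra quotes are passed through unchanged.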

Creating individual JSON files from a CSV file that is already in JSON format

I have JSON data in a CSV file that I need to break apart into separate JSON files. The data looks like this: {"EventMode":"","CalculateTax":"Y",.... There are multiple rows of this and I want each row to be a separate JSON file. I have used code provided by Jatin Grover that parses the CSV into JSON:
lcount = 0
out = json.dumps(row)
jsonoutput = open( 'json_file_path/parsedJSONfile'+str(lcount)+'.json', 'w')
jsonoutput.write(out)
lcount+=1
This does an excellent job; the problem is that it adds "R": " before the {"EventMode... and adds an extra \ between each element as well as item at the end.
Each row of the CSV file is already a valid JSON object. I just need to write each row to a separate file with the .json extension.
I hope that makes sense. I am very new to this all.
It's not clear from your picture what your CSV actually looks like.
I mocked up a really small CSV with JSON lines that looks like this:
Request
"{""id"":""1"", ""name"":""alice""}"
"{""id"":""2"", ""name"":""bob""}"
(all the double-quotes are for escaping the quotes that are part of the JSON)
When I run this little script:
import csv
with open('input.csv', newline='') as input_file:
    reader = csv.reader(input_file)
    next(reader)  # discard/skip the first line ("header")
    for i, row in enumerate(reader):
        with open(f'json_file_path/parsedJSONfile{i}.json', 'w') as output_file:
            output_file.write(row[0])
I get two files, json_file_path/parsedJSONfile0.json and json_file_path/parsedJSONfile1.json, that look like this:
{"id":"1", "name":"alice"}
and
{"id":"2", "name":"bob"}
Note that I'm not using json.dumps(...); that only makes sense if you are starting with data inside Python and want to save it as JSON. Your file just has text that is already complete JSON, so basically copy each line as-is to a new file.
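To see why calling json.dumps on text that is already JSON produces the extra quotes and backslashes described in the question, here is a small sketch. The sample line is taken from the mock CSV above; the variable names are only for illustration.

```python
import json

line = '{"id":"1", "name":"alice"}'  # text that is already valid JSON

# Serializing the string itself wraps it in quotes and escapes the inner quotes.
double_encoded = json.dumps(line)
print(double_encoded)  # "{\"id\":\"1\", \"name\":\"alice\"}"

# Parsing first gives a dict; serializing that dict yields plain JSON again.
parsed = json.loads(line)
print(json.dumps(parsed))  # {"id": "1", "name": "alice"}
```

Parsing each row with json.loads before writing it is optional, but it is a cheap way to check that every row really is valid JSON.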

DataFrame write to CSV not supporting some characters

I am trying to parse an XML file and write the resulting DataFrame to a CSV file.
My problem is that some characters are not supported when I write the output to the CSV. For example, there is a field Nectarine tree named ‘Polar Zee’ that is written like Nectarine tree named ‘Polar Zee’.
Are there any settings that need to be changed, or any properties that need to be added?
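One common cause of this kind of symptom is an encoding mismatch: the CSV is written as UTF-8 but opened by a program that assumes a legacy code page. That this is what happens here is an assumption, since the question does not show the garbled output or say which DataFrame library is used. The standard-library Python sketch below reproduces the effect with the curly quotes from the example.

```python
text = "Nectarine tree named ‘Polar Zee’"

# UTF-8 bytes read back as Windows-1252 turn each curly quote into three characters.
garbled = text.encode("utf-8").decode("cp1252")
print(garbled)

# Reversing the wrong decoding recovers the original text.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == text)  # True

# "utf-8-sig" prepends a byte order mark, which some spreadsheet programs
# use to detect that a CSV file is UTF-8.
with_bom = text.encode("utf-8-sig")
print(with_bom[:3])  # b'\xef\xbb\xbf'
```

If the mismatch is the cause, writing the file with an explicit encoding that the consuming program recognizes is usually enough.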

Concatenated JSON objects in one text file - how to import to R?

I have a txt file which is built from many JSON objects. The structure of the file looks something like this:
[{"id": 333, "key press:1 ....},{"id":321 ...}, ][{"id": 333, "key press:1 ....},{"id":321 ...}] etc.
Trying to read it using either jsonlite (error), rjson or RJSONIO - I get only part of the file - I can import the data until the first closure (the ]). Is there any way to parse it, or to read these objects to R?
I attach a full file (I have many): https://pastebin.com/Ee2cvdEi
Thanks
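The question is about R, but the general approach is language-independent: decode one JSON value, note where it ended, and continue from that position until the text is exhausted. The Python sketch below illustrates this with json.JSONDecoder.raw_decode. It assumes each concatenated piece is itself valid JSON, so a trailing comma such as the one in the sample line above would still raise an error; the function name parse_concatenated is made up for this example.

```python
import json


def parse_concatenated(text):
    """Return the list of JSON values stored back to back in text."""
    decoder = json.JSONDecoder()
    values, pos = [], 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos == len(text):
            break
        # Decode one value and get the index just past its end.
        value, pos = decoder.raw_decode(text, pos)
        values.append(value)
    return values


print(parse_concatenated('[{"id": 333}, {"id": 321}][{"id": 5}]'))
# [[{'id': 333}, {'id': 321}], [{'id': 5}]]
```

Each element of the returned list is one of the original arrays, which can then be combined or processed separately.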

Defining schema in JsonLoader in PIG

I was trying to enter the schema of a dataset while using Pig from a JSON file using the JsonLoader.
The format of the data is as:
{
'cat_a':'some_text',
'cat_b':{(attribute_name):(attribute_value)}
}
I am trying to describe the schema as:
LOAD 'filename' USING JsonLoader('cat_a:chararray, cat_b:(attribute_name:chararray,attribute_value:int)');
I feel that I'm describing the schema incorrectly for cat_b.
Can someone help out in that?
Thanks in advance.
If your JSON has this format:
{"recipe":"Tacos","ingredients":[{"name":"Beef"},{"name":"Lettuce"},{"name":"Cheese"}]}
store it in test.json and run the command below:
a = LOAD '/home/abhijit/Desktop/test.json' USING JsonLoader('recipe:chararray,ingredients: {(name:chararray)}');
dump a;
You will get this output:
(Tacos,{(Beef),(Lettuce),(Cheese)},)
If your JSON instead has the format below, with a nested object:
{"recipe":"Tacos","ingredients":[{"name":"Beef"},{"name":"Lettuce"},{"name":"Cheese"}],"inventor":{"name":"Alex","age":25}}
a = LOAD '/home/abhijit/Desktop/test.json' USING JsonLoader('recipe:chararray,ingredients: {(name:chararray)},inventor: (name:chararray, age:int)');
dump a;
the output would be
(Tacos,{(Beef),(Lettuce),(Cheese)},(Alex,25))