How to convert a dynamic JSON-like file to a CSV file

I have a file which looks exactly as below.
{"eventid" : "12345" ,"name":"test1","age":"18"}
{"eventid" : "12346" ,"age":"65"}
{"eventid" : "12336" ,"name":"test3","age":"22","gender":"Male"}
Think of the above file as event.json
The number of data objects may vary per line.
I would like the following CSV output, written to output.csv:
eventid,name,age,gender
12345,test1,18
12346,,65
12336,test3,22,Male
Could someone kindly help me? I would accept an answer in any scripting language (JavaScript, Python, etc.).

This code will collect all the headers dynamically and write the file to CSV.
Read comments in code for details:
import json

# Load data from file
data = '''{"eventid" : "12345" ,"name":"test1","age":"18"}
{"eventid" : "12346" ,"age":"65"}
{"eventid" : "12336" ,"name":"test3","age":"22","gender":"Male"}'''

# Store records for later use
records = []
# Keep track of headers in a set
headers = set()
for line in data.split("\n"):
    line = line.strip()
    # Parse each line as JSON
    parsedJson = json.loads(line)
    records.append(parsedJson)
    # Make sure all found headers are kept in the headers set
    for header in parsedJson.keys():
        headers.add(header)
# You only know what headers were there once you have read all the JSON once.
# Now we have all the information we need, like what all the possible headers are.
outfile = open('output_json_to_csv.csv', 'w')
# write headers to the file in order
outfile.write(",".join(sorted(headers)) + '\n')
for record in records:
    # write each record based on available fields
    curLine = []
    # For each header in alphabetical order
    for header in sorted(headers):
        # If that record has the field
        if header in record:  # record.has_key(header) is Python 2 only
            # Then write that value to the line
            curLine.append(record[header])
        else:
            # Otherwise put an empty value as a placeholder
            curLine.append('')
    # Write the line to file
    outfile.write(",".join(curLine) + '\n')
outfile.close()
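One caveat with joining fields by hand as above: if a value itself contains a comma, the joined line is no longer valid CSV. The csv module's writer quotes such values automatically; a small sketch with a hypothetical row:

```python
import csv
import io

# Hypothetical row whose name field contains a comma
row = ["12345", "test1, jr", "18"]

buf = io.StringIO()
csv.writer(buf).writerow(row)  # the writer quotes the field containing the comma
print(buf.getvalue().strip())  # 12345,"test1, jr",18
```

Swapping `outfile.write(",".join(curLine) + '\n')` for `csv.writer(outfile).writerow(curLine)` keeps the rest of the answer unchanged.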

Here is a solution using jq.
If filter.jq contains the following filter
(reduce (.[]|keys_unsorted[]) as $k ({};.[$k]="")) as $o # object with all keys
| ($o | keys_unsorted), (.[] | $o * . | [.[]]) # generate header and data
| join(",") # convert to csv
and data.json contains the sample data then
$ jq -Mrs -f filter.jq data.json
produces
eventid,name,age,gender
12345,test1,18,
12346,,65,
12336,test3,22,Male

Here's a Python solution (should work in both Python 2 & 3).
I'm not proud of the code, as there's probably a better way to do this (using the csv module) but this gives you the desired output.
I've taken the liberty of naming your JSON data data.json and I'm naming the output csv file output.csv.
import json

header = ['eventid', 'name', 'age', 'gender']

with open('data.json', 'r') as infile, \
     open('output.csv', 'w+') as outfile:
    # Writes header row
    outfile.write(','.join(header))
    outfile.write('\n')
    for row in infile:
        line = ['', '', '', '']  # I'm sure there's a better way
        datarow = json.loads(row)
        for key in datarow:
            line[header.index(key)] = datarow[key]
        outfile.write(','.join(line))
        outfile.write('\n')
Hope this helps.

Using AngularJS with the ngCsv plugin, we can generate a CSV file from the desired JSON with dynamic headers.
// Code goes here
var myapp = angular.module('myapp', ["ngSanitize", "ngCsv"]);

myapp.controller('myctrl', function($scope) {
  $scope.filename = "test";
  $scope.getArray = [{
    label: 'Apple',
    value: 2,
    x: 1
  }, {
    label: 'Pear',
    value: 4,
    x: 38
  }, {
    label: 'Watermelon',
    value: 4,
    x: 38
  }];
  $scope.getHeader = function() {
    // Collect the keys of the first object as the header row
    var vals = [];
    for (var key in $scope.getArray) {
      for (var k in $scope.getArray[key]) {
        vals.push(k);
      }
      break;
    }
    return vals;
  };
});
<!DOCTYPE html>
<html>
  <head>
    <link href="https://netdna.bootstrapcdn.com/bootstrap/3.0.0/css/bootstrap.min.css" rel="stylesheet">
    <script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.4.7/angular.min.js"></script>
    <script src="https://ajax.googleapis.com/ajax/libs/angularjs/1.4.7/angular-sanitize.min.js"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/ng-csv/0.3.6/ng-csv.min.js"></script>
  </head>
  <body>
    <div ng-app="myapp">
      <div class="container" ng-controller="myctrl">
        <div class="page-header">
          <h1>ngCsv <small>example</small></h1>
        </div>
        <button class="btn btn-default" ng-csv="getArray" csv-header="getHeader()" filename="{{ filename }}.csv" field-separator="," decimal-separator=".">Export to CSV with header</button>
      </div>
    </div>
  </body>
</html>

// Note: ActiveXObject works only in Internet Explorer / Windows Script Host.
var arr = $.map(obj, function(el) { return el; });
var content = "";
for (var i = 0; i < arr.length; i++) {  // for...in would append the indices, not the values
    content += arr[i] + ",";
}
var filePath = "someFile.csv";
var fso = new ActiveXObject("Scripting.FileSystemObject");
var fh = fso.OpenTextFile(filePath, 8, false, 0);  // 8 = ForAppending
fh.WriteLine(content);
fh.Close();

Related

Seeding rails project with Json file

I'm at a loss and my searches have gotten me nowhere.
In my seeds.rb file I have the following code
require 'json'
jsonfile = File.open 'db/search_result2.json'
jsondata = JSON.load jsonfile
#jsondata = JSON.parse(jsonfile)

jsondata[].each do |data|
  Jobpost.create!(post: data['title'],
                  link: data['link'],
                  image: data['pagemap']['cse_image']['src'])
end
Snippet of the json file looks like this:
{
  "kind": "customsearch#result",
  "title": "Careers Open Positions - Databricks",
  "link": "https://databricks.com/company/careers/open-positions",
  "pagemap": {
    "cse_image": [
      {
        "src": "https://databricks.com/wp-content/uploads/2020/08/careeers-new-og-image-sept20.jpg"
      }
    ]
  }
},
Fixed jsondata[].each to jsondata.each. Now I'm getting the following error:
TypeError: no implicit conversion of String into Integer
jsondata[] says to call the [] method with no arguments on the object in the jsondata variable. Normally [] would take an index like jsondata[0] to get the first element or a start and length like jsondata[0, 5] to get the first five elements.
You want to call the each method on jsondata, so jsondata.each. (The follow-up TypeError comes from data['pagemap']['cse_image']['src']: cse_image is an array, so it needs an integer index first, e.g. data['pagemap']['cse_image'][0]['src'].)
So this is very specific to what you have posted:
require 'json'

file = File.open('path_to_file.json').read
json_data = JSON.parse file
p json_data['kind'] #=> "customsearch#result"
# etc for all the other keys
Now maybe the JSON you posted is just the first element in an array:
[
  {}, // where each {} is the json you posted
  {},
  {},
  // etc
]
in which case you will indeed have to iterate:
require 'json'

file = File.open('path_to_file.json').read
json_data = JSON.parse file
json_data.each do |data|
  p data['kind'] #=> "customsearch#result"
end

creating a nested json document

I have the below document in MongoDB.
I am using the below Python code to save it in a .json file.
file = 'employee'
json_cur = find_document(file)
count_document = emp_collection.count_documents({})
with open(file_path, 'w') as f:
    f.write('[')
    for i, document in enumerate(json_cur, 1):
        print("document : ", document)
        f.write(dumps(document))
        if i != count_document:
            f.write(',')
    f.write(']')
the output is -
{
  "_id":{
    "$oid":"611288c262c5c14df84f649b"
  },
  "Lname":"Borg",
  "Fname":"James",
  "Dname":"Headquarters",
  "Projects":"[{"HOURS": 5.0, "PNAME": "Reorganization", "PNUMBER": 20}]"
}
But I need it like this (Projects value without quotes):
{
  "_id":{
    "$oid":"611288c262c5c14df84f649b"
  },
  "Lname":"Borg",
  "Fname":"James",
  "Dname":"Headquarters",
  "Projects":[{"HOURS": 5.0, "PNAME": "Reorganization", "PNUMBER": 20}]
}
Could anyone please help me to resolve this?
Thanks,
Jay
You should parse the JSON from the Projects field, like this:
from json import loads
document['Projects'] = loads(document['Projects'])
So,
file = 'employee'
json_cur = find_document(file)
count_document = emp_collection.count_documents({})
with open(file_path, 'w') as f:
    f.write('[')
    for i, document in enumerate(json_cur, 1):
        document['Projects'] = loads(document['Projects'])
        print("document : ", document)
        f.write(dumps(document))
        if i != count_document:
            f.write(',')
    f.write(']')
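A minimal, self-contained illustration of the fix (the field holds a JSON string, so it must be decoded once more before re-serializing):

```python
import json

# A document as it comes back from the database: "Projects" is a JSON *string*
document = {
    "Lname": "Borg",
    "Projects": '[{"HOURS": 5.0, "PNAME": "Reorganization", "PNUMBER": 20}]',
}

# Decode the embedded JSON so it becomes a real list of dicts
document["Projects"] = json.loads(document["Projects"])

# Now dumps emits it as a nested array instead of a quoted string
print(json.dumps(document))
```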

Issue with output while using python and jinja2 to generate grafana json panels for dashboards

I am working on some code to generate dashboard JSON files specific to SolarWinds.
I need to generate a set of panels to monitor all critical links in SolarWinds.
The issue I am facing is in the Grafana text panel that should present the description of each link. There are no queries; I am using a CSV file I generated with all the links' relevant information.
Python code to load the csv file
# Python source to read the csv file
## load list with csv data
with open('crit_links.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader)
Sample line of the csv file (crit_links.csv):
businesshq;17;SWLINKS-bus;121;Unit: 1 Slot: 0 Port: 2 Gbit - Level · L2L_HLD-area_comp;111.111.111.110;103;OSPF;FALSE;02/03/20 05:19;;;;
Python code that generates the panels list for the dashboard
grid_Xa = [0,6,8,10]
grid_Xb = [12,18,20,22]
for i, row in enumerate(data):
    rowlist = str(row).split(';')
    if i == 0:
        titlerow = rowlist  # row of titles from the list columns
        grid_y = initial_y
        continue
    if i % 2 == 1:
        for x in grid_Xa:
            if x == 0:
                panelsList.append(createTextPanel(rowlist[0], rowlist[2], rowlist[4], x, grid_y, i + 1))
                continue
createTextPanel python function
def createTextPanel(siteName, nodeName, interfaceName, grid_X, grid_Y, g_id):
    template = jenv.get_or_select_template('p-text.json.jinja')
    return template.render(site=siteName, node=nodeName, interface=interfaceName, grid_x=grid_X, grid_y=grid_Y, id=g_id)
jinja template:
{
  "content": "Site: " + {{site}} + "Node: " + {{node}} + "Interface: " + {{interface}},
  "gridPos": {
    "h": 3,
    "w": 6,
    "x": {{ grid_x }},
    "y": {{ grid_y }}
  },...}
Problem:
The {{site}} string in the output JSON is appearing with a leading [' and this is breaking the quoting.
output json:
{
  "content": "Site: " + ['businessh1 + "Node: " + SWLINKS-bus + "Interface: " + Unit: 1 Slot: 0 Port: 2 Gbit - Level · L2L_HLD-area_comp,
  "gridPos": {
    "h": 3,
    "w": 6,
    "x": 0,
    "y": 3
  },
...}
My intention was for the content parameter of the output to look like this:
"output": "Site: businessh1 Node: SWLINKS-bus Interface: Unit: 1 Slot: 0 Port: 2 Gbit - Level · L2L_HLD-area_comp..."
Thanks!
I was able to solve this issue by changing the following.
From:
# Python source to read the csv file
## load list with csv data
with open('crit_links.csv', newline='') as f:
    reader = csv.reader(f)
    data = list(reader)
To:
# load list with csv data
with open('crit_links.csv') as f:
    data = list(csv.reader(f, delimiter=','))

Is there a way to take a list of strings and create a JSON file, where both the key and value are list items?

I am creating a Python script that can read scanned and tabular PDFs and extract some important data and insert it into a JSON file, to later be implemented into a SQL database (I will also be developing the DB as a project for learning MongoDB).
Basically, my issue is that I have never worked with JSON files before, but that was the format I was recommended to output to. The scraping script works; the pre-processing could be a lot cleaner, but for now it works. The issue I run into is that the keys and values are in the same list, and some of the values, because they had a decimal point, are two different list items. Not really sure where to even start.
I don't really know where to start. I suppose since I know what the indexes of the list are I can easily assign keys and values, but then it may not be applicable to any PDF; that is, the script cannot be coded explicitly.
import PyPDF2 as pdf2
import textract
import re  # needed by cleanText below

filename = "TestSpec.pdf"  # was: with "TestSpec.pdf" as filename:, which is not valid Python
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)  # was pdf2.pdfFileReader (wrong case)
num_pages = pdfReader.numPages  # was pdfReader.numpages (wrong case)
count = 0
text = ""
while count < num_pages:
    pageObj = pdfReader.getPage(count)  # was getPage(0), which re-read the first page
    count += 1
    text += pageObj.extractText()
if text != "":
    text = text
else:
    text = textract.process(filename, method="tesseract", language="eng")
def cleanText(x):
    '''
    This function takes the byte data extracted from scanned PDFs, and cleans it of all
    unnecessary data.
    Requires re
    '''
    stringedText = str(x)
    cleanText = stringedText.replace('\n', '')
    splitText = re.split(r'\W+', cleanText)
    caseingText = [word.lower() for word in splitText]
    cleanOne = [word for word in caseingText if word != 'n']
    dexStop = cleanOne.index("od260")
    dexStart = cleanOne.index("sheet")
    clean = cleanOne[dexStart + 1:dexStop]
    return clean

cleanText = cleanText(text)
This is the current output
['n21', 'feb', '2019', 'nsequence', 'lacz', 'rp', 'n5', 'gat', 'ctc', 'tac', 'cat', 'ggc', 'gca', 'cat', 'ttc', 'ccc', 'gaa', 'aag', 'tgc', '3', 'norder', 'no', '15775199', 'nref', 'no', '207335463', 'n25', 'nmole', 'dna', 'oligo', '36', 'bases', 'nproperties', 'amount', 'of', 'oligo', 'shipped', 'to', 'ntm', '50mm', 'nacl', '66', '8', 'xc2', 'xb0c', '11', '0', '32', '6', 'david', 'cook', 'ngc', 'content', '52', '8', 'd260', 'mmoles', 'kansas', 'state', 'university', 'biotechno', 'nmolecular', 'weight', '10', '965', '1', 'nnmoles']
and we want the output as a JSON setup like
{"Date | 21feb2019", "Sequence ID: | lacz-rp", "Sequence 5'-3' | gat..."}
and so on. Just not sure how to do that.
here is a screenshot of the data from my sample pdf
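Since the keys and values end up interleaved in one flat list, one generic approach is to pair alternating elements into a dict and dump that; a sketch with a hypothetical flat list (real field extraction will still need slicing like the asker's own [0::2] / [1::2] below):

```python
import json

# Hypothetical flat list where keys and values alternate
flat = ['date', '21feb2019', 'sequence id', 'lacz-rp', 'bases', '36']

# Pair even-indexed items (keys) with odd-indexed items (values)
record = dict(zip(flat[0::2], flat[1::2]))

with open('record.json', 'w') as f:
    json.dump(record, f)
```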
So, I have figured out some of this. I am still having issues with grabbing the last third of the data I need without explicitly programming it in, but here is what I have so far. Once I have everything working I will worry about optimizing and condensing it.
# for PDF reading
import PyPDF2 as pdf2
import textract
# for data preprocessing
import re
from dateutil.parser import parse
# For generating the JSON file array
import json

# This finds and opens the pdf file, reads the data, and extracts the data.
filename = "*.pdf"
pdfFileObj = open(filename, 'rb')
pdfReader = pdf2.PdfFileReader(pdfFileObj)
text = ""
pageObj = pdfReader.getPage(0)
text += pageObj.extractText()

# checks if extracted data is in string form or picture, if picture textract reads data.
# it then closes the pdf file
if text != "":
    text = text
else:
    text = textract.process(filename, method="tesseract", language="eng")
pdfFileObj.close()

# Converts text to string from byte data for preprocessing
stringedText = str(text)

# Removed escaped lines and replaced them with actual new lines.
formattedText = stringedText.replace('\\n', '\n').lower()

# Slices the long string into a workable piece (only contains useful data)
slice1 = formattedText[(formattedText.index("sheet") + 10): (formattedText.index("secondary") - 2)]
clean = re.sub('\n', " ", slice1)
clean2 = re.sub(' +', ' ', clean)

# Creating the PrimerData dictionary
with open("PrimerData.json", 'w') as file:
    primerDataSlice = clean[clean.index("molecular"): -1]
    primerData = re.split(": |\n", primerDataSlice)
    primerKeys = primerData[0::2]
    primerValues = primerData[1::2]
    primerDict = {"Primer Data": dict(zip(primerKeys, primerValues))}
    # Generating the JSON array "Primer Data"
    primerJSON = json.dumps(primerDict, ensure_ascii=False)
    file.write(primerJSON)

# Grabbing the date (this has just the date, so json will have to add date.)
date = re.findall(r'(\d{2}[/\- ](\d{2}|january|jan|february|feb|march|mar|april|apr|may|june|jun|july|jul|august|aug|september|sep|october|oct|november|nov|december|dec)[/\- ]\d{2,4})', clean2)
Without input data it is difficult to give you working code; a minimal working example with input would help. As for JSON handling, Python dictionaries can be dumped to JSON easily. See the examples here:
https://docs.python-guide.org/scenarios/json/
Get a JSON string from a dictionary and write it to a file. Then figure out how to parse the text into a dictionary.
import json
d = {"Date" : "21feb2019", "Sequence ID" : "lacz-rp", "Sequence 5'-3'" : "gat"}
json_data = json.dumps(d)
print(json_data)
# Write that data to a file
So, I did figure this out. The problem was that pulling all the data into a single list during pre-processing wasn't a great idea, considering that the keys for the dictionary never changed.
Here is the semi-finished result for making the Dictionary and JSON file.
# Collect the sequence name
name = clean2[clean2.index("Sequence") + 11: clean2.index("Sequence") + 19]

# Collecting Shipment info
ordered = input("Who placed this order? ")
received = input("Who is receiving this order? ")
dateOrder = re.findall(
    r"(\d{2}[/\- ](\d{2}|January|Jan|February|Feb|March|Mar|April|Apr|May|June|Jun|July|Jul|August|Aug|September|Sep|October|Oct|November|Nov|December|Dec)[/\- ]\d{2,4})",
    clean2)
dateReceived = date.today()
refNo = clean2[clean2.index("ref.No. ") + 8: clean2.index("ref.No.") + 17]
orderNo = clean2[clean2.index("Order No.") + 10: clean2.index("Order No.") + 18]

# Finding and grabbing the sequence data. Storing it and then finding the
# GC content and melting temp or TM
bases = int(clean2[clean2.index("bases") - 3:clean2.index("bases") - 1])
seqList = [line for line in clean2 if re.match(r'^[AGCT]+$', line)]
sequence = "".join(i for i in seqList[:bases])

def gc_content(x):
    count = 0
    for i in x:
        if i == 'G' or i == 'C':
            count += 1
    return round((count / bases) * 100, 1)

gc = gc_content(sequence)
tm = mt.Tm_GC(sequence, Na=50)
moleWeight = round(mw(Seq(sequence, generic_dna)), 2)
dilWeight = float(clean2[clean2.index("ug/OD260:") + 10: clean2.index("ug/OD260:") + 14])
dilution = dilWeight * 10

primerDict = {"Primer Data": {
    "Sequence": sequence,
    "Bases": bases,
    "TM (50mM NaCl)": tm,
    "% GC content": gc,
    "Molecular weight": moleWeight,
    "ug/0D260": dilWeight,
    "Dilution volume (uL)": dilution
},
    "Shipment Info": {
        "Ref. No.": refNo,
        "Order No.": orderNo,
        "Ordered by": ordered,
        "Date of Order": dateOrder,
        "Received By": received,
        "Date Received": str(dateReceived.strftime("%d-%b-%Y"))
    }}

# Generating the JSON array "Primer Data"
with open("".join(name) + ".json", 'w') as file:
    primerJSON = json.dumps(primerDict, ensure_ascii=False)
    file.write(primerJSON)

pug/jade & gulp-merge-json line break

I'm using pug/jade with gulp-data & gulp-merge-json. I can't find a way to insert a line break in my JSON.
I have tried to insert in the JSON:
\n
\"
\\
\/
\b
\f
\r
\u
None of them works. I managed to get a space using \t, but I really need to be able to use line breaks.
var data = require('gulp-data'),
    fs = require('fs'),
    path = require('path'),
    merge = require('gulp-merge-json');
var pug = require('gulp-pug');

gulp.task('pug:data', function() {
  return gulp.src('assets/json/**/*.json')
    .pipe(merge({
      fileName: 'data.json',
      edit: (json, file) => {
        // Extract the filename and strip the extension
        var filename = path.basename(file.path),
            primaryKey = filename.replace(path.extname(filename), '');
        // Set the filename as the primary key for our JSON data
        var data = {};
        data[primaryKey.toUpperCase()] = json;
        return data;
      }
    }))
    .pipe(gulp.dest('assets/temp'));
});
Here is the pug code :
#abouttxt.hide.block.big.centered.layer_top
  | #{ABOUT[langage]}
Here is the JSON, named About.json:
{
  "en": "Peace, Love, Unity and Having fun* !",
  "fr": "Ecriture et Réalisation\nVina Hiridjee et David Boisseaux-Chical\n\nDirection artistique et Graphisme\nNicolas Dali\n\nConception et Développement\nRomain Malauzat\n\nProduction\nKomet Prod ...... etc"
}
I didn't read the pug documentation well enough, my bad.
You can do unescaped string interpolation like this:
!{ABOUT[langage]}