Convert Json String in Text file to Json file with Python 3 - json

I have gathered around 5000 tweets using Twitter API but mistakenly wrote them into a .txt. I need them to be in .json format instead. I tried several solutions from simply changing the file format from txt to json to 3 hours worth of stackoverflow answers such as this:
import json
f = open("tweets.txt", "r")
content = f.read()
#print(content)
text_file = open("tweets4.json", "w")
text_file.write(content)
text_file.close()
But even this opened like a text string. I also tried to use json.dump() but since the string is already in Json format, it did double encoding and it was full of "\
In short, it looks like this:
Json string in text format
But should look like this (I re ran the code and saved new tweets to a json):
Json format

Related

Creating individual JSON files from a CSV file that is already in JSON format

I have JSON data in a CVS file that I need to break apart into seperate JSON files. The data looks like this: {"EventMode":"","CalculateTax":"Y",.... There are multiple rows of this and I want each row to be a separate JSON file. I have used code provided by Jatin Grover that parses the CVS into JSON:
lcount = 0
out = json.dumps(row)
jsonoutput = open( 'json_file_path/parsedJSONfile'+str(lcount)+'.json', 'w')
jsonoutput.write(out)
lcount+=1
This does an excellent job the problem is it adds "R": " before the {"EventMode... and adds extra \ between each element as well as item at the end.
Each row of the CVS file is already valid JSON objects. I just need to break each row into a separate file with the .json extension.
I hope that makes sense. I am very new to this all.
It's not clear from your picture what your CSV actually looks like.
I mocked up a really small CSV with JSON lines that looks like this:
Request
"{""id"":""1"", ""name"":""alice""}"
"{""id"":""2"", ""name"":""bob""}"
(all the double-quotes are for escaping the quotes that are part of the JSON)
When I run this little script:
import csv
with open('input.csv', newline='') as input_file:
reader = csv.reader(input_file)
next(reader) # discard/skip the fist line ("header")
for i, row in enumerate(reader):
with open(f'json_file_path/parsedJSONfile{i}.json', 'w') as output_file:
output_file.write(row[0])
I get two files, json_file_path/parsedJSONfile0.json and json_file_path/parsedJSONfile1.json, that look like this:
{"id":"1", "name":"Alice"}
and
{"id":"2", "name":"bob"}
Note that I'm not using json.dumps(...), that only makes sense if you are starting with data inside Python and want to save it as JSON. Your file just has text that is complete JSON, so basically copy-paste each line as-is to a new file.

I'm getting UnicodeDecodeError when trying to load a JSON file into a dataframe

So, I'm using the following code to get pandas to read my JSON text file-
f = open('C:/Users/stans/WFH Project/data.json')
data = json.load(f)
df = pd.DataFrame(data, index=[0])
f.close()
Once I execute the cell, I get
UnicodeDecodeError: 'charmap' codec can't decode byte 0x8f in position
1535: character maps to
I used the above coding for a smaller sample of JSON data and it worked. But, since I updated the file to include a much larger sample, I get that error.
I verified that the JSON format is correct, and I also tried in the open statement-
encoding='utf-8'
and
errors='ignore'
Both produced value errors. Any ideas? Thanks in advance for your help!

How to read a csv in pyspark using error_bad_line = False as we use in pandas

I am trying to read a csv into pyspark but the problem is that it has a text column due to which there are some bad line in the data
This text column also contains the new line characters due to which the data in further columns is getting corrupted
I have tried using pandas and use some extra parameters to load my csv
a = pd.read_csv("Mycsvname.csv",sep = '~',quoting=csv.QUOTE_NONE, dtype = str,error_bad_lines=False, quotechar='~', lineterminator='\n' )
It is working fine in pandas but I want to load the csv in pyspark
So, is there any similar way to load a csv in pyspark with all the above parameters?
In the current version of spark (I think it is even there from spark 2.2 onwards), you can also read multi-line from csv.
If the newline is your only problem with the text column you can use a read command like this:
spark.read.csv("YOUR_FILE_NAME", header="true", escape="\"", quote="\"", multiLine=True)
Note: in our case the escape and quotation characters where both " so you might want to edit those options with your ~ and include sep = '~'.
You can also look at the documentation (http://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html?highlight=csv#pyspark.sql.DataFrameReader.csv) for more details

Why is a JSON file so much smaller than a .txt file with the same content?

I am converting a JSON file (from an ajax call) into CSV.
When the JSON file is sent to me, it is 80kb.
When I save the contents of the JSON file into a .txt file, it becomes 291kb!
After converting the .txt file into a .csv file, it's 240kb.
How is the JSON file I received from an ajax call so much smaller a .txt file I created with identical content? Is there some way to reduce the size of the end product?
EDIT:
This is how I am getting the file size.
I find the AJAX request, and check its file size. Link. As you can see, it is about 80kb.
I copy the source of the request. Link.
Then I copy and paste the source into a blank .txt file. The result is a .txt file that is 291 kb in size.
EDIT:
I don't think the .txt to .csv conversion problem is the issue, but here is my code:
import json
import csv
import re
with open('jjj.txt') as f:
f = f.read()
parsed = json.loads(f)
unix_time = re.compile(r'(\d\d\d\d\d\d\d\d\d\d\d\d\d)')
data = parsed['d']['tables'][0]['rows']
for i in data:
for a in range(len(i)):
if a > 39 and a < 46:
if i[a] != None:
mo = unix_time.search(i[a])
i[a] = mo.group(1)
file = open('json.csv', 'w', newline='')
csvwriter = csv.writer(file)
csvwriter.writerows(data)
JSON is a string format used mostly for communication. If we want to save the JSON string in a file, it will be a text file. In this case, there will be no difference between JSON or any other content of the text file.
You are receiving a JSON string from your Ajax call, not a JSON file. You are receiving it over HTTP, and it is compressed (g-zipped). So you are comparing the size of the compressed text with the flat one which you are creating. Zip the file you are creating and you will get it down to almost the same size (depending on the compression tool and settings).

Unescaping Characters in a JSON response string

I made a JSON request that gives me a string that uses Unicode character codes that looks like:
s = "\u003Cp\u003E"
And I want to convert it to:
s = "<p>"
What's the best way to do this in Python?
Note, this is the same question as this one, only in Python except Ruby. I am also using the Posterous API.
>>> "\\u003Cp\\u003E".decode('unicode-escape')
u'<p>'
If the data came from JSON, the json module should already have decoded these escapes for you:
>>> import json
>>> json.loads('"\u003Cp\u003E"')
u'<p>'
EDIT: The original question "Unescaping Characters in a String with Python" did not clarify if the string was to be written or to be read (later on, the "JSON response" words were added, to clarify the intention was to read).
So I answered the opposite question: how to write JSON serialized data dumping them to a unescaped string (rather than loading data from the string).
My use case was producing a JSON file from my own data dictionary, but the file contained scaped non-ASCII characters. So I did it like this:
with open(filename,'w') as jsonfile:
jsonstr = json.dumps(myDictionary, ensure_ascii=False)
print(jsonstr) # to screen
jsonfile.write(jsonstr) # to file
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
Taken from here: https://docs.python.org/3/library/json.html