Python 2.7: writing string elements (characters) to a binary file - json

I am using Python 2.7 to access an API that returns JSON with a single key, "ringtone_file", whose value is an mp3 file encoded for transport via HTTP. I created a bogus mp3 file consisting of 256 bytes in order from 0x00 through 0xff; the returned JSON appears below.
{"ringtone_file":"\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007\b\t\n\u000b\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"}
I accessed the API using the following code (exception-handling code omitted):
import requests

response = requests.get(url)
data = response.json()  # avoid shadowing the built-in name dict
print data
This yields the following output:
{u'ringtone_file': u'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'}
What I want is to write each character of this string to a file in binary form, so that the result is a 256-byte file whose first byte has value 0 and whose last byte has value 255. I can't change the API. Can someone suggest a reasonable way of accomplishing this with Python 2.7?
I attempted the obvious: open a file for writing in binary mode and write the unicode string to it. The resulting codec error indicates I can't write values from 128 through 255 inclusive.

Since the value is a unicode string, you have to encode it before writing it to a file. The latin1 codec maps directly onto the first 256 Unicode code points, so use .encode('latin1') on the string.
Example:
>>> s=u'\x00\x01\x02\xfd\xfe\xff'
>>> s
u'\x00\x01\x02\xfd\xfe\xff' # Unicode string
>>> s.encode('latin1')
'\x00\x01\x02\xfd\xfe\xff' # Now a byte string.
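Putting it together, a minimal end-to-end sketch for Python 2.7 (the url and output filename are placeholders, not from the original post):

import requests

response = requests.get(url)  # url: placeholder for the API endpoint
ringtone = response.json()['ringtone_file']  # unicode string of code points 0-255
with open('ringtone.mp3', 'wb') as f:  # hypothetical output filename
    f.write(ringtone.encode('latin1'))  # one byte per code point

The resulting file is 256 bytes, 0x00 through 0xff.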

Related

option for \u instead of Unicode replacement

If I run this Go code:
package main

import (
    "encoding/json"
    "os"
)

func main() {
    json.NewEncoder(os.Stdout).Encode("\xa1") // "\ufffd"
}
I lose data, since once the Unicode replacement is done, I can no longer get
back the original value. Compare with this Python code:
import json
a = '\xa1'
b = json.dumps(a) # "\u00a1"
print(json.loads(b) == a) # True
no replacement is done, so no data is lost, and the resulting JSON is still valid. Does Go have some way to encode a JSON string with escaping instead of replacement?
This example is a false equivalence. In Python, '\xa1' is a valid Unicode string; it's just one possible spelling of it, like '\u00a1', '\U000000a1', chr(0xa1), '\N{INVERTED EXCLAMATION MARK}', '¡', and so on.
The equivalent in Python code would be:
>>> print(json.dumps(b'\xa1'.decode(errors='replace')))
"\ufffd"
This also prints an ASCII representation of the coerced REPLACEMENT CHARACTER on stdout, the same as in Go.
This is because in Go "\xa1" is not a valid Unicode string: it contains the bare byte 0xa1, which is not valid UTF-8 on its own. The invalid byte gets replaced with U+FFFD, the "replacement character", which is used when the input is invalid.
If you want to encode the Unicode character U+00A1, write it as "\u00a1". If you want arbitrary data to round-trip through JSON, you will have to represent it another way, for example by base64 encoding it.
Python simply works differently: in Python, the \xa1 escape sequence denotes U+00A1. In Go, \xa1 is the raw byte 0xa1, which is not a valid Unicode string by itself and cannot be encoded as a JSON string.
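For illustration, a sketch of the base64 round trip in Python (the same pattern applies in Go via the encoding/base64 package):

import base64, json

raw = b'\xa1'  # arbitrary bytes, not valid UTF-8 on their own
encoded = json.dumps(base64.b64encode(raw).decode('ascii'))  # valid JSON string
decoded = base64.b64decode(json.loads(encoded))
print(decoded == raw)  # True: the bytes survive the round trip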

How does Pyarrow read_csv handle different file encodings?

I have a .dat file that I had been reading with pd.read_csv, always needing encoding="latin" for it to read properly / without error. When I use pyarrow.csv.read_csv I don't see a parameter to select the file's encoding, yet it still works without issue (which is great! but I don't understand why, or whether it only auto-handles certain encodings). The only options I'm using are delimiter="|" (with ParseOptions) and auto_dict_encode=True (with ConvertOptions).
How is pyarrow handling different encoding types?
pyarrow currently has no functionality to deal with different encodings, and assumes UTF-8 for string/text data.
The reason it doesn't raise an error, though, is that pyarrow reads any non-UTF-8 strings as a "binary" type column instead of a "string" type column.
A small example:
# write a small file with latin encoding
with open("test.csv", "w", encoding="latin") as f:
    f.writelines(["col1,col2\n", "u,ù"])
Reading it with pyarrow gives string for the first column (which contains only ASCII characters and is therefore also valid UTF-8), but reads the second column as binary:
>>> from pyarrow import csv
>>> csv.read_csv("test.csv")
pyarrow.Table
col1: string
col2: binary
With pandas you indeed get an error by default (pandas has no binary data type and tries to read all text columns as Python strings, i.e. UTF-8):
>>> pd.read_csv("test.csv")
...
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf9 in position 0: invalid start byte
>>> pd.read_csv("test.csv", encoding="latin")
col1 col2
0 u ù
It's now possible to specify the encoding with pyarrow.csv.read_csv.
According to the pyarrow docs for read_csv:
The encoding can be changed using the ReadOptions class.
A minimal example follows:
from pyarrow import csv
options = csv.ReadOptions(encoding='latin1')
table = csv.read_csv('path/to/file', options)
From what I can tell, the functionality was added in this PR, so it should work starting with pyarrow 1.0.

Decoding string of bytes back to bytes

I need to save a byte string in a JSON file and get it back as a byte string.
To dump it into the JSON, I had to convert the bytes to a regular string. The problem is that once I read the JSON back and try to encode the converted string, the backslashes are doubled, so the strings aren't the same. How can this be done properly?
Input:
salt = b'\xd5KS\xe4\x1b\xbd'
Output:
b'\xd5KS\xe4\x1b\xbd'
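One reliable approach (a sketch, not from the original thread): decode the bytes with latin1, which maps every byte value 0-255 to the code point of the same value, then encode with latin1 after reading the JSON back. base64, as in the Go discussion above, is another common choice.

import json

salt = b'\xd5KS\xe4\x1b\xbd'
stored = json.dumps({'salt': salt.decode('latin1')})    # bytes -> text, 1:1
restored = json.loads(stored)['salt'].encode('latin1')  # text -> bytes
print(restored == salt)  # True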

Error converting string to dictionary object

I am converting a JSON string into a Python dictionary object, and I get the following error for the code below:
import json
path = 'data2012-03-16.txt'
records = [json.loads(line) for line in open(path)]
Error:
UnicodeDecodeError: 'utf8' codec can't decode byte 0x92 in position 6: invalid start byte
A few suggestions:
1. Maybe the file's encoding isn't valid UTF-8? Try opening it in Notepad++ and changing the encoding.
2. Are you sure your JSON file is well formed? Try opening it in a JSON parser and checking it.
3. Why do you get the error at byte 0x92 in position 6? Look at what sits at that index in your file; 0x92 is the Windows-1252 "smart" apostrophe, a common culprit in files that aren't really UTF-8. Try replacing the offending characters and checking whether the file then parses. You can also narrow it down by elimination: run the same code on another file, and once that works, on a trimmed-down version of this file, and so on (see the sketch after this list).
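If the file is in fact Windows-1252 rather than UTF-8 (an assumption to verify), decoding each line explicitly avoids the error; a minimal sketch:

import io
import json

path = 'data2012-03-16.txt'
# assumes cp1252 (Windows-1252); io.open behaves the same on Python 2 and 3
with io.open(path, encoding='cp1252') as f:
    records = [json.loads(line) for line in f]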

Convert io.BytesIO to io.StringIO to parse HTML page

I'm trying to parse an HTML page I retrieved through pycurl, but pycurl's WRITEFUNCTION returns the page as bytes rather than str, so I'm unable to parse it with BeautifulSoup.
Is there any way to convert io.BytesIO to io.StringIO?
Or is there any other way to parse the HTML page?
I'm using Python 3.3.2.
The code in the accepted answer reads the whole stream into memory before decoding. Below is a better way: wrap one stream in another, so the data can be read and decoded chunk by chunk.
import io

# Initialize a read buffer containing UTF-8 encoded bytes
input = io.BytesIO(
    b'Initial value for read buffer with unicode characters ' +
    'ÁÇÊ'.encode('utf-8')
)
wrapper = io.TextIOWrapper(input, encoding='utf-8')

# Read decoded text from the buffer
print(wrapper.read())
A naive approach:
# assume bytes_io is a `BytesIO` object
byte_str = bytes_io.read()
# Convert to a "unicode" object
text_obj = byte_str.decode('UTF-8') # Or use the encoding you expect
# Use text_obj how you see fit!
# io.StringIO(text_obj) will get you to a StringIO object if that's what you need
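Applied to the original use case, a sketch (assuming bs4 is installed and pycurl's WRITEFUNCTION wrote the response body into an io.BytesIO):

import io
from bs4 import BeautifulSoup

# body stands in for the BytesIO filled by pycurl's WRITEFUNCTION
body = io.BytesIO(b'<html><body><p>hello</p></body></html>')
wrapper = io.TextIOWrapper(body, encoding='utf-8')
soup = BeautifulSoup(wrapper.read(), 'html.parser')
print(soup.p.text)  # hello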