Convert io.BytesIO to io.StringIO to parse HTML page - html

I'm trying to parse a HTML page I retrieved through pyCurl but the pyCurl WRITEFUNCTION is returning the page as BYTES and not string, so I'm unable to Parse it using BeautifulSoup.
Is there any way to convert io.BytesIO to io.StringIO?
Or Is there any other way to parse the HTML page?
I'm using Python 3.3.2.

the code in the accepted answer actually reads from the stream completely for decoding. Below is the right way, converting one stream to another, where the data can be read chunk by chunk.
# Initialize a read buffer
input = io.BytesIO(
b'Inital value for read buffer with unicode characters ' +
'ÁÇÊ'.encode('utf-8')
)
wrapper = io.TextIOWrapper(input, encoding='utf-8')
# Read from the buffer
print(wrapper.read())

A naive approach:
# assume bytes_io is a `BytesIO` object
byte_str = bytes_io.read()
# Convert to a "unicode" object
text_obj = byte_str.decode('UTF-8') # Or use the encoding you expect
# Use text_obj how you see fit!
# io.StringIO(text_obj) will get you to a StringIO object if that's what you need

Related

How to write a list of string as double quotes, so that json can load?

Consider the following code, where the json can't load back because after the manipulation, the single quotes become double quotes, how can write to file as double quote list so that json can load back?
import configparser
import json
config = configparser.ConfigParser()
config.read("config.ini")
l = json.loads(config.get('Basic', 'simple_list'))
new_config = configparser.ConfigParser()
new_config.add_section("Basic")
new_config.set('Basic', 'simple_list', str(l))
with open("config1.ini", 'w') as f:
new_config.write(f)
config = configparser.ConfigParser()
config.read("config1.ini")
l = json.loads(config.get('Basic', 'simple_list'))
The settings.ini file content is like this:
[Basic]
simple_list = ["a", "b"]
As already mentionned by #L3viathan, the purely technical answer is "use json.dumps() instead of str()" (and yes, it works for dicts too).
BUT: storing json in an ini file is very bad idea. "ini" is a file format on it's own (even if not as strictly specified as json or yaml) and it has been designed to be user-editable with just any text editor. FWIW, the simple canonical way to store "lists" in an ini file is simply to store them as comma separated values, ie:
[Basic]
simple_list = a,b
and parse this back when reading the config as
values = config.get('Basic', 'simple_list')).split(",")
wrt/ "storing dicts", an ini file IS already a (kind of) dict since it's based on key:value pairs. It's restricted to two levels (sections and keys), but here again that's by design - it's a format designed for end-users, not for programmers.
Now if the ini forma doesn't suits your needs, nothing prevents you from just using a json (or yaml) file instead for the whole config

Read and parse a >400MB .json file in Julia without crashing kernel

The following is crashing my Julia kernel. Is there a better way to read and parse a large (>400 MB) JSON file?
using JSON
data = JSON.parsefile("file.json")
Unless some effort is invested into making a smarter JSON parser, the following might work: There is a good chance file.json has many lines. In this case, reading the file and parsing a big repetitive JSON section line-by-line or chunk-by-chuck (for the right chunk length) could do the trick. A possible way to code this, would be:
using JSON
f = open("file.json","r")
discard_lines = 12 # lines up to repetitive part
important_chunks = 1000 # number of data items
chunk_length = 2 # each data item has a 2-line JSON chunk
for i=1:discard_lines
l = readline(f)
end
for i=1:important_chunks
chunk = join([readline(f) for j=1:chunk_length])
push!(thedata,JSON.parse(chunk))
end
close(f)
# use thedata
There is a good chance this could be a temporary stopgap solution for your problem. Inspect file.json to find out.

How to read .csv file that contains utf-8 values by pandas dataframe

I'm trying to read .csv file that contains utf-8 data in some of its columns. The method of reading is by using pandas dataframe. The code is as following:
df = pd.read_csv('Cancer_training.csv', encoding='utf-8')
Then I got the following examples of errors with different files:
(1) 'utf-8' codec can't decode byte 0xcf in position 14:invalid continuation byte
(2) 'utf-8' codec can't decode byte 0xc9 in position 3:invalid continuation byte
Could you please share your ideas and experience with such problem? Thank you.
[python: 3.4.1.final.0,
pandas: 0.14.1]
sample of the raw data, I cannot put full record because of the legal restrictions of the medical data:
I had this problem for no apparent reason, I managed to get it work using this:
df = pd.read_csv('file', encoding = "ISO-8859-1")
not sure why though
I've also done as Irh09 proposed but the second file it read it was wrongly decoded and couldn't find a column with tildes (á, é, í, ó, ú).
So I recomend encapsulating the error like this:
try:
df = pd.read_csv('file', encoding = "utf-8")
except:
df = pd.read_csv('file', encoding= "ISO-8859-1")

Python 2.7 writing strings elements (character) to a binary file

I am using Python 2.7 to access an API that returns JSON with a single key="ringtone_file" and an associated value that is an mp3 file encoded for transport via HTTP. I created a bogus mp3 file consisting of 256 bytes in order from 0x00 through 0xff and the returned file appears below.
{"ringtone_file":"\u0000\u0001\u0002\u0003\u0004\u0005\u0006\u0007\b\t\n\u000b\f\r\u000e\u000f\u0010\u0011\u0012\u0013\u0014\u0015\u0016\u0017\u0018\u0019\u001a\u001b\u001c\u001d\u001e\u001f !\"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ"}
I accessed the API using the following code minus exception handing code
import requests
response = requests.get(url)
dict = response.json()
print dict
This yields the following output
{u'ringtone_file': u'\x00\x01\x02\x03\x04\x05\x06\x07\x08\t\n\x0b\x0c\r\x0e\x0f\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1a\x1b\x1c\x1d\x1e\x1f !"#$%&\'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\\]^_`abcdefghijklmnopqrstuvwxyz{|}~\x7f\x80\x81\x82\x83\x84\x85\x86\x87\x88\x89\x8a\x8b\x8c\x8d\x8e\x8f\x90\x91\x92\x93\x94\x95\x96\x97\x98\x99\x9a\x9b\x9c\x9d\x9e\x9f\xa0\xa1\xa2\xa3\xa4\xa5\xa6\xa7\xa8\xa9\xaa\xab\xac\xad\xae\xaf\xb0\xb1\xb2\xb3\xb4\xb5\xb6\xb7\xb8\xb9\xba\xbb\xbc\xbd\xbe\xbf\xc0\xc1\xc2\xc3\xc4\xc5\xc6\xc7\xc8\xc9\xca\xcb\xcc\xcd\xce\xcf\xd0\xd1\xd2\xd3\xd4\xd5\xd6\xd7\xd8\xd9\xda\xdb\xdc\xdd\xde\xdf\xe0\xe1\xe2\xe3\xe4\xe5\xe6\xe7\xe8\xe9\xea\xeb\xec\xed\xee\xef\xf0\xf1\xf2\xf3\xf4\xf5\xf6\xf7\xf8\xf9\xfa\xfb\xfc\xfd\xfe\xff'}
What I desire to do is write each character or hex value of this string to a file in binary format. I desire the result to be a file of size 256 bytes where the first byte in the file has value 0 and the last byte has value 255. I can't change the API. Can someone suggest a reasonable way of accomplishing this with Python 2.7.
I attempted to do what was obvious to me which was to open a file for writing in binary mode and then writing the unicode string to the file. The error message from the codec indicates I can't write values between and including 128 and 255.
Since the string value is Unicode, you have to encode the string to write it to a file. The latin1 codec directly maps to the first 256 Unicode characters, so use .encode('latin1') on the string.
Example:
>>> s=u'\x00\x01\x02\xfd\xfe\xff'
>>> s
u'\x00\x01\x02\xfd\xfe\xff' # Unicode string
>>> s.encode('latin1')
'\x00\x01\x02\xfd\xfe\xff' # Now a byte string.

Unescaping Characters in a JSON response string

I made a JSON request that gives me a string that uses Unicode character codes that looks like:
s = "\u003Cp\u003E"
And I want to convert it to:
s = "<p>"
What's the best way to do this in Python?
Note, this is the same question as this one, only in Python except Ruby. I am also using the Posterous API.
>>> "\\u003Cp\\u003E".decode('unicode-escape')
u'<p>'
If the data came from JSON, the json module should already have decoded these escapes for you:
>>> import json
>>> json.loads('"\u003Cp\u003E"')
u'<p>'
EDIT: The original question "Unescaping Characters in a String with Python" did not clarify if the string was to be written or to be read (later on, the "JSON response" words were added, to clarify the intention was to read).
So I answered the opposite question: how to write JSON serialized data dumping them to a unescaped string (rather than loading data from the string).
My use case was producing a JSON file from my own data dictionary, but the file contained scaped non-ASCII characters. So I did it like this:
with open(filename,'w') as jsonfile:
jsonstr = json.dumps(myDictionary, ensure_ascii=False)
print(jsonstr) # to screen
jsonfile.write(jsonstr) # to file
If ensure_ascii is true (the default), the output is guaranteed to have all incoming non-ASCII characters escaped. If ensure_ascii is false, these characters will be output as-is.
Taken from here: https://docs.python.org/3/library/json.html