How to label encode this in python? - data-analysis

How to label encode this in python?
array(['Video/Internet/Voice', 'Video/Internet', 'Internet Only',
'Internet/Voice', nan, 'Video Only', 'Homesecurity Only',
'Video/Internet/Voice/Homesecurity', 'Internet/Homesecurity',
'Video/Internet/Homesecurity', 'Video/Voice',
'Internet/Voice/Homesecurity', 'Voice Only', 'Video/Homesecurity',
'Video/Voice/Homesecurity'], dtype=object)
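One possible approach, sketched here with only the standard library: give each distinct non-missing category an integer code and reserve a separate code for nan. The sample list is a shortened stand-in for the array above, and the names `label_encode`, `is_missing`, and `missing_code` are hypothetical helpers made up for this illustration. Libraries such as pandas (`pandas.factorize`) and scikit-learn (`sklearn.preprocessing.LabelEncoder`) offer the same idea ready-made.

```python
import math

# Shortened stand-in for the array in the question (assumption: nan is a float NaN)
services = ['Video/Internet/Voice', 'Video/Internet', 'Internet Only',
            float('nan'), 'Video Only', 'Video/Internet']

def is_missing(value):
    """Return True for float NaN values, which mark missing entries."""
    return isinstance(value, float) and math.isnan(value)

def label_encode(values, missing_code=-1):
    """Map each distinct non-missing label to an integer, in sorted order."""
    classes = sorted({v for v in values if not is_missing(v)})
    mapping = {label: code for code, label in enumerate(classes)}
    codes = [missing_code if is_missing(v) else mapping[v] for v in values]
    return codes, mapping

codes, mapping = label_encode(services)
print(codes)    # one integer per input value, -1 for nan
print(mapping)  # label -> integer code
```

Keeping the mapping around makes it possible to decode the integers back to the original labels later.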

Related

can't convert text data to json

I am trying to convert the following (JSON) string into a Python data type:
data = "{'id': 26, 'photo': '/media/f082b5af-ad0.png', 'first_name': 'Islam', 'last_name': 'Mansour', 'email': 'islammansour06+8#gmail.com', 'city': 'Giza', 'cv': '/media/fbb61609-442.pdf', 'reference': 'Facebook', 'campaign': OrderedDict([('id', 2), ('name', 'javascript')]), 'status': 'Invitation Sent', 'user': None, 'at': '2020-01-20', 'time': '23:02:58.359179', 'technologies': [OrderedDict([('id', 46), ('name', 'Django'), ('category', OrderedDict([('id', 24), ('name', 'Framework'), ('_type', 'skill')]))])]}"
I am trying to convert it to JSON using
json.loads(data.replace("\'", "\""))
but I get the following error:
json.decoder.JSONDecodeError: Expecting value: line 1 column 219 (char 218)
The issue is that your data is not valid JSON.
The main problem starts here: [OrderedDict([('id', 46), ('name', 'Django'), ('category', OrderedDict([('id', 24), ('name', 'Framework'), ('_type', 'skill')]))])]}. This looks like a string representation of some Python objects.
Below is a friendlier representation of your data.
I have marked the problematic parts with ** (basically everywhere there is an OrderedDict).
{
"id":26,
"photo":"/media/f082b5af-ad0.png",
"first_name":"Islam",
"last_name":"Mansour",
"email":"islammansour06+8#gmail.com",
"city":"Giza",
"cv":"/media/fbb61609-442.pdf",
"reference":"Facebook",
"campaign":**OrderedDict**([("id", 2), ("name", "javascript")]),
"status":"Invitation Sent",
"user":None,
"at":"2020-01-20",
"time":"23:02:58.359179",
"technologies":[
**OrderedDict**([("id", 46), ("name", "Django"), ("category", OrderedDict([("id", 24), ("name", "Framework"), ("_type", "skill")]))])]
}
You could try making use of an [online json parser][1] which might give you some friendlier output.
[1]: http://json.parser.online.fr/
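As previously said, OrderedDict is not valid JSON, but it is valid Python.
To fix it (note that eval executes arbitrary code, so only use it on strings you trust):
from collections import OrderedDict  # imported so eval can resolve the OrderedDict(...) calls in your string
import json
jsonCorrect = json.dumps(eval(data))  # evaluate the Python expression, then serialize it as real JSON
json.loads(jsonCorrect)  # it works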
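A slightly more defensive variant of the same idea, as a sketch: pass eval an empty builtins table and expose only OrderedDict, so the string cannot call arbitrary functions by name. This narrows what the string can do but does not make eval safe for untrusted input. The `data` string below is a shortened stand-in for the one in the question.

```python
import json
from collections import OrderedDict

# Shortened stand-in for the string in the question
data = ("{'id': 26, 'campaign': OrderedDict([('id', 2), ('name', 'javascript')]), "
        "'user': None, 'technologies': [OrderedDict([('id', 46), ('name', 'Django')])]}")

# Evaluate with no builtins and only OrderedDict visible to the expression
parsed = eval(data, {"__builtins__": {}}, {"OrderedDict": OrderedDict})

# OrderedDict is a dict subclass, so json.dumps serializes it as a JSON object
json_text = json.dumps(parsed)
round_tripped = json.loads(json_text)

print(json_text)
print(round_tripped["campaign"]["name"])
```

If you control the code that produced the string, serializing with json.dumps at the source avoids the need for eval entirely.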
Not sure why you are adding the replace call. If the string is already valid JSON (double-quoted keys and strings, null instead of None), it should work with just the following:
json.loads(data)
You can read about it here.

How to extract certain information from a string and create a json object in python

I made a GET request to a website and parsed the response with BS4 using 'html.parser'. I want to extract the ID, size and availability from the string. I have parsed it down to this final string:
'{"id":706816278547,"parent_id":81935859731,"available":false,
"sku":"665570057894","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["S"],
"option1":"s","option2":"","option3":"","option4":""},
{"id":707316252691,"parent_id":81935859731,"available":true,
"sku":"665570057900","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["M"],
"option1":"m","option2":"","option3":"", "option4":""},
{"id":707316285459,"parent_id":81935859731,"available":true,
"sku":"665570057917","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["L"],
"option1":"l","option2":"","option3":"","option4":""},
{"id":707316318227,"parent_id":81935859731,"available":true,
"sku":"665570057924","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["XL"],
"option1":"xl","option2":"","option3":"","option4":""}'
I also tried using the split() method, but I get lost and I'm unable to extract the needed information without creating a cluttered list.
I tried using json.loads() so I could just extract the information needed by looking up the keys, but I get the following error:
final_id =
'{"id":706816278547,"parent_id":81935859731,"available":false,
"sku":"665570057894","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["S"],
"option1":"s","option2":"","option3":"","option4":""},
{"id":707316252691,"parent_id":81935859731,"available":true,
"sku":"665570057900","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["M"],
"option1":"m","option2":"","option3":"", "option4":""},
{"id":707316285459,"parent_id":81935859731,"available":true,
"sku":"665570057917","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["L"],
"option1":"l","option2":"","option3":"","option4":""},
{"id":707316318227,"parent_id":81935859731,"available":true,
"sku":"665570057924","featured_image":null,"public_title":null,
"requires_shipping":true,"price":40000,"options":["XL"],
"option1":"xl","option2":"","option3":"","option4":""}'
find_id = json.loads(final_id)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/anaconda3/lib/python3.7/json/__init__.py", line 348, in loads
return _default_decoder.decode(s)
File "/anaconda3/lib/python3.7/json/decoder.py", line 340, in decode
raise JSONDecodeError("Extra data", s, end)
json.decoder.JSONDecodeError: Extra data: line 1 column 233 (char 232)
I want to create a json object for each ID and Size and if that size is available or not.
Any help is welcomed. Thank you.
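First, that is not valid JSON: it is several objects separated by commas with nothing enclosing them.
Second, json.load reads from a file (json.loads is the one that takes a string) and translates the JSON into Python objects, so for example null in JSON becomes None in Python. If you save the data to a file, wrapped in [ ] so that it forms an array, then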
import json
with open('sof.json', 'r') as stackof:
    final_id = json.load(stackof)
print(final_id)
will output
[{'id': 706816278547, 'parent_id': 81935859731, 'available': False, 'sku': '665570057894', 'featured_image': None, 'public_title': None, 'requires_shipping': True, 'price': 40000, 'options': ['S'], 'option1': 's', 'option2': '', 'option3': '', 'option4': ''}, {'id': 707316252691, 'parent_id': 81935859731, 'available': True, 'sku': '665570057900', 'featured_image': None, 'public_title': None, 'requires_shipping': True, 'price': 40000, 'options': ['M'], 'option1': 'm', 'option2': '', 'option3': '', 'option4': ''}, {'id': 707316285459, 'parent_id': 81935859731, 'available': True, 'sku': '665570057917', 'featured_image': None, 'public_title': None, 'requires_shipping': True, 'price': 40000, 'options': ['L'], 'option1': 'l', 'option2': '', 'option3': '', 'option4': ''}, {'id': 707316318227, 'parent_id': 81935859731, 'available': True, 'sku': '665570057924', 'featured_image': None, 'public_title': None, 'requires_shipping': True, 'price': 40000, 'options': ['XL'], 'option1': 'xl', 'option2': '', 'option3': '', 'option4': ''}]
I wrapped all of the objects in an array, so to print the first id you would write
print(final_id[0]['id'])
output:
706816278547
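The same result can be had without a file, as a sketch: wrap the string in square brackets so it becomes one JSON array, parse it with json.loads, and keep only the fields asked for. The string below is a shortened stand-in for final_id, and treating the first entry of "options" as the size is an assumption based on the sample data.

```python
import json

# Shortened stand-in for final_id from the question: objects separated by commas
final_id = (
    '{"id":706816278547,"available":false,"options":["S"]},'
    '{"id":707316252691,"available":true,"options":["M"]}'
)

# Wrapping in [ ] turns the comma-separated objects into a valid JSON array
variants = json.loads("[" + final_id + "]")

# Keep only id, size and availability (assumes the size is options[0])
summary = [
    {"id": v["id"], "size": v["options"][0], "available": v["available"]}
    for v in variants
]

print(json.dumps(summary, indent=2))
```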
Tell me in the comments if that helped you.

In my fetched json data, how can I separate out the balance?

So, I have been testing block.io api, and so far I have this:
knee = block_io.get_address_balance(labels='shibe1')
s1 = json.dumps(knee)
d2 = json.loads(s1)
print (d2)
It returns me with this batch of text:
{'status': 'success', 'data': {'network': 'DOGE', 'available_balance': '0.0', 'pending_received_balance': '0.0', 'balances': [{'user_id': 1, 'label': 'shibe1', 'address': 'A9Bda9UMBcb1183PtsBxnbj5QgP6jwkCFG', 'available_balance': '0.00000000', 'pending_received_balance': '0.00000000'}]}}
How would I get it so that I could grab only the available_balance part of it, and print it out instead of all of the json data?
EDIT: Please help! Can't find a solution.
Try using some regex.
import re
# Sample data as a string; adjacent string literals inside parentheses are joined into one
data = ("{'status': 'success', 'data': {'network': 'DOGE', 'available_balance': '0.129', "
        "'pending_received_balance': '0.0', 'balances': [{'user_id': 1, 'label': 'shibe1', "
        "'address': 'A9Bda9UMBcb1183PtsBxnbj5QgP6jwkCFG', 'available_balance': '0.00000000', "
        "'pending_received_balance': '0.00000000'}]}}")
pattern = re.compile(r"(?<=available_balance': ').*?(?=')")
matches = pattern.finditer(data)
for match in matches:
    print(match.group())
Breakdown :
import re imports the regex library built into python
data="{'status': 'success', 'data': {'network': 'DOGE', 'available_balance': '0.129',
'pending_received_balance': '0.0', 'balances': [{'user_id': 1, 'label': 'shibe1',
'address': 'A9Bda9UMBcb1183PtsBxnbj5QgP6jwkCFG', 'available_balance': '0.00000000',
'pending_received_balance': '0.00000000'}]}}" is a string containing the data to match. You can replace this with the json data.
pattern = re.compile("(?<=available_balance': ').*?(?=')") compiles the regex for finding the data for available balance.
Regex breakdown
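(?<=available_balance': ') is a lookbehind: the match must be immediately preceded by available_balance': ' but that text is not included in the match.
.*? lazily matches as few characters as possible, so it stops at the first place where the rest of the pattern can match.
(?=') is a lookahead: the match must be immediately followed by a single quote, which is not consumed, so the match ends right before the closing quote of the value.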
pattern.finditer(data) matches the regex against data
for match in matches:
print(match.group()) prints the matches from the regex.
If you run this code, you will get the following results:
0.129
0.00000000
If you want the code using your variables, here you go (d2 is a dict, so it has to be converted to a string before the regex can search it):
import re
pattern = re.compile(r"(?<=available_balance': ').*?(?=')")
matches = pattern.finditer(str(d2))
for match in matches:
    print(match.group())
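Since d2 is already a Python dict after json.loads, the value can also be read by key, with no regex. A minimal sketch, assuming the response has exactly the shape printed in the question (the dict is pasted in here rather than fetched from block.io):

```python
# Response as printed in the question, used here as a literal dict
d2 = {'status': 'success',
      'data': {'network': 'DOGE',
               'available_balance': '0.0',
               'pending_received_balance': '0.0',
               'balances': [{'user_id': 1,
                             'label': 'shibe1',
                             'address': 'A9Bda9UMBcb1183PtsBxnbj5QgP6jwkCFG',
                             'available_balance': '0.00000000',
                             'pending_received_balance': '0.00000000'}]}}

# Top-level balance for the whole request
available = d2['data']['available_balance']
print(available)

# Per-address balances, paired with their labels
per_address = [(b['label'], b['available_balance']) for b in d2['data']['balances']]
print(per_address)
```

The balances are strings in this response, so wrap them in float() or decimal.Decimal() if you need to do arithmetic on them.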

How to define special "untokenizable" words for nltk.word_tokenize

I'm using nltk.word_tokenize for tokenizing some sentences which contain programming languages, frameworks, etc., which get incorrectly tokenized.
For example:
>>> tokenize.word_tokenize("I work with C#.")
['I', 'work', 'with', 'C', '#', '.']
Is there a way to enter a list of "exceptions" like this to the tokenizer? I already have compiled a list of all the things (languages, etc.) that I don't want to split.
The Multi Word Expression Tokenizer should be what you need.
You add the list of exceptions as tuples and pass it the already tokenized sentences:
import nltk
tokenizer = nltk.tokenize.MWETokenizer()
tokenizer.add_mwe(('C', '#'))
tokenizer.add_mwe(('F', '#'))
tokenizer.tokenize(['I', 'work', 'with', 'C', '#', '.'])
['I', 'work', 'with', 'C_#', '.']
tokenizer.tokenize(['I', 'work', 'with', 'F', '#', '.'])
['I', 'work', 'with', 'F_#', '.']
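If the underscore in 'C_#' is unwanted, MWETokenizer accepts a separator argument (for example separator=''). To show what the merging step does, here is a plain-Python sketch of the same idea without NLTK; `merge_expressions` is a hypothetical helper written for this illustration, not part of any library.

```python
def merge_expressions(tokens, expressions, separator=""):
    """Greedily merge known multi-token expressions back into single tokens."""
    # Skip empty entries and try longer expressions first,
    # so overlapping entries prefer the longest match
    ordered = sorted((tuple(e) for e in expressions if e), key=len, reverse=True)
    merged = []
    i = 0
    while i < len(tokens):
        for expr in ordered:
            if tuple(tokens[i:i + len(expr)]) == expr:
                merged.append(separator.join(expr))
                i += len(expr)
                break
        else:
            # No expression starts at this position: keep the token as it is
            merged.append(tokens[i])
            i += 1
    return merged

exceptions = [("C", "#"), ("F", "#")]
result = merge_expressions(['I', 'work', 'with', 'C', '#', '.'], exceptions)
print(result)
```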

Simple Json decoding with SimpleJSON - Python

I've just started learning Python and I'm having a go at using a Google API. But I hit a brick wall trying to parse the JSON with simplejson.
How do I go about pulling single values (ie product or brand fields) out of this mess below
{'currentItemCount': 25, 'etag': '"izYJutfqR9tRDg1H4X3fGx1UiCI/hqqZ6pMwV1-CEu5NSqfJO0Ix-gs"', 'id': 'tag:google.com,2010:shopping/products', 'items': [{'id': 'tag:google.com,2010:shopping/products/1196682/8186421160532506003',
'kind': 'shopping#product',
'product': {'author': {'accountId': '1196682',
'name': "Dillard's"},
'brand': 'Merrell',
'condition': 'new',
'country': 'US',
'creationTime': '2011-03-10T08:11:08.000Z',
'description': u'Merrell\'s "Trail Glove" barefoot running shoe lets your feet follow their natural i$
'googleId': '8186421160532506003',
'gtin': '00797240569847',
'images': [{'link': 'http://dimg.dillards.com/is/image/DillardsZoom/03528718_zi_amazon?$product$'}],
'inventories': [{'availability': 'inStock',
'channel': 'online',
'currency': 'USD',
'price': 110.0}],
'language': 'en',
'link': 'http://www.dillards.com/product/Merrell-Mens-Trail-Glove-Barefoot-Running-Shoes_301_-1_301_5$
'modificationTime': '2011-05-25T07:42:51.000Z',
'title': 'Merrell Men\'s "Trail Glove" Barefoot Running Shoes'},
'selfLink': 'https://www.googleapis.com/shopping/search/v1/public/products/1196682/gid/8186421160532506003?alt=js$
The JSON you've pasted in the question is not valid. Once you have fixed that, here's how to use simplejson:
import simplejson as json
your_response_body = '["foo", {"bar":["baz", null, 1.0, 2]}]'
obj = json.loads(your_response_body)
print(obj[1]['bar'])
And a link to the documentation.
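To pull out single fields such as brand, index into the parsed result step by step: "items" is a list, each item has a "product" dict, and the fields sit inside that. A minimal sketch, assuming a valid response shaped like the one in the question; the body below is a trimmed, made-up sample built from the values shown there, and the standard library json module is used because it exposes the same loads call as simplejson.

```python
import json

# Trimmed sample shaped like the response in the question (valid JSON this time)
response_body = (
    '{"currentItemCount": 1, "items": [{"kind": "shopping#product", '
    '"product": {"brand": "Merrell", "condition": "new", '
    '"inventories": [{"availability": "inStock", "currency": "USD", "price": 110.0}]}}]}'
)

result = json.loads(response_body)

# One brand per item in the result list
brands = [item["product"]["brand"] for item in result["items"]]
print(brands)

# Nested lists are indexed the same way
first_price = result["items"][0]["product"]["inventories"][0]["price"]
print(first_price)
```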