I imported some JSON data using the rjson library. The problem is that some of the data appears to be misaligned, which I suspect is due to missing values.
How can I detect and realign the data that ends up in the wrong columns, and fill empty values with NULL? I cannot share the data; I hope the image will be enough.
Code used to import the data:
library(rjson)
json_data <- do.call(rbind, lapply(readLines(training.file$filepaths[ind]), rjson::fromJSON))
json_data <- as.data.frame(json_data)
I have also tried using jsonlite::fromJSON instead of rjson::fromJSON, but I get the following error:
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
parse error: trailing garbage
d_str": null, "place": null} {"truncated": false, "text": "R
(right here) ------^
JSON file format (the values have been altered, but all properties are present in this example):
{
"truncated": false, "text": "abc abc", "in_reply_to_status_id": null,
"id": 123, "favorite_count": 0, "retweeted": false, "entities": {
"symbols": [], "user_mentions": [], "hashtags": [], "urls": []
},
"in_reply_to_screen_name": null, "id_str": "123", "retweet_count": 0,
"in_reply_to_user_id": null, "screen_name_statistics": {
"has_underscore": true, "contains_swear": false, "has_digits": false,
"contains_condition": false, "has_chars": true
},
"user": {
"verified": false, "geo_enabled": false, "followers_count": 0,
"utc_offset": -14400, "statuses_count": 17600, "friends_count": 4425,
"lang": "en", "favourites_count": 1900, "screen_name": "1name1",
"url": null, "created_at": "Sat Jun 00 03:36:27 +0000 2012",
"time_zone": "Atlantic Time (Canada)", "listed_count": 2
},
"geo": null, "in_reply_to_user_id_str": null, "lang": "en",
"created_at": "Mon Nov 55 05:18:49 +0000 2013",
"in_reply_to_status_id_str": null, "place": null
}
Further information:
obj1 and obj2 contain a different number of properties: obj1 contains 19 properties while obj2 contains 20.
The misalignment occurs when the list is converted to a data frame using as.data.frame. A custom function may be required to take property names into consideration.
I used the rjson::fromJSON function to import the data. It was imported as a list, which I then converted to a data frame with as.data.frame for further analysis.
At first I did not notice that the JSON objects had different numbers of properties, which caused the misalignment in the data frame: values were not matched to column names.
To fix this, I wrote a custom mapping function that looks at the individual values in the list and maps them into a predefined data frame.
Code specific to my example is available here. Specifically, the "importJSON" function handles the mapping from list to data frame.
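The name-based mapping described above can be sketched as follows. This is a minimal illustration in Python rather than R, and it is not the "importJSON" function itself: align_records and the sample lines are hypothetical, and the input is assumed to hold one JSON object per line.

```python
import json


def align_records(lines):
    """Parse one JSON object per line and align values by property name."""
    records = [json.loads(line) for line in lines if line.strip()]

    # Union of all property names, in first-seen order.
    columns = []
    for record in records:
        for key in record:
            if key not in columns:
                columns.append(key)

    # A property missing from a record becomes None (the analogue of NULL).
    rows = [[record.get(key) for key in columns] for record in records]
    return columns, rows


# Hypothetical sample: the second object has one extra property.
sample = [
    '{"id": 1, "text": "abc", "lang": "en"}',
    '{"id": 2, "text": "def", "favorite_count": 0, "lang": "en"}',
]
columns, rows = align_records(sample)
print(columns)
for row in rows:
    print(row)
```

Because each value is looked up by its property name, an object with fewer properties leaves a gap in its own row instead of shifting the remaining values into the wrong columns.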
Say I have a JSON entry as follows (the JSON file was generated by fetching data from a Firebase DB):
[{"goal_savings": 0.0, "social_id": "", "score": 0, "country": "BR", "photo": "http://graph.facebook", "id": "", "plates": 3, "rcu": null, "name": "", "email": ".", "provider": "facebook", "phone": "", "savings": [], "privacyPolicyAccepted": true, "currentRole": "RoleType.PERSONAL", "empty_lives_date": null, "userId": "", "authentication_token": "-------", "onboard_status": "ONBOARDING_WIZARD", "fcmToken": ----------", "level": 1, "dni": "", "social_token": "", "lives": 10, "bills": [{"date": "2020-12-10", "role": "RoleType.PERSONAL", "name": "Supermercado", "category": "feeding", "periodicity": "PeriodicityType.NONE", "value": 100.0"}], "payments": [], "goals": [], "goalTransactions": [], "incomes": [], "achievements": [{"created_at":", "name": ""}]}]
How do I extract the content corresponding to 'value', which is inside the 'bills' column? Is there any way to do this?
My Python code is as follows. With it I was only able to get the data within the bills column, but I need only the entry corresponding to 'value' inside bills.
import json

with open('firebase-dataset.json', 'r') as filedata:
    data = json.load(filedata)

listoffields = []  # To produce it into a list with fields
for dic in data:
    try:
        listoffields.append(dic['bills'])  # keep the 'bills' list of each entry
    except KeyError:
        pass  # entry without a 'bills' key
print(listoffields)
The JSON you posted contains misplaced quotes.
I think you are trying to extract the 'value' field within bills.
Try this:
print(listoffields[0][0]['value'])
This prints 100.0. If the value is stored as a string, wrap it in float() before using it in calculations.
---edit---
Say your JSON contains many objects separated by commas, like this:
[{ first-entry },{ second-entry },{ third.. }, ....and so on]
and you want to find the value of each bill in each JSON object.
The code below may work:
bill_value_list = []  # to store 'value' of each bill
for bill_list in listoffields:
    for bill in bill_list:  # each bill is a complete bill dictionary; empty lists are skipped
        bill_value_list.append(float(bill['value']))
print(bill_value_list)
print(sum(bill_value_list))  # do something useful
Paste it after the code you posted (no changes to your code are needed, since it already works :-) ).
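For reference, the same extraction can be written as a self-contained function. This is only a sketch: bill_values and sample_entries are hypothetical names with made-up data, and it assumes every entry is a dictionary whose optional 'bills' key holds a list of bill dictionaries.

```python
def bill_values(entries):
    """Collect the 'value' of every bill across all entries as floats."""
    values = []
    for entry in entries:
        # Entries without a 'bills' key contribute nothing.
        for bill in entry.get("bills", []):
            if "value" in bill:
                # float() also accepts values stored as strings, e.g. "12.5".
                values.append(float(bill["value"]))
    return values


# Hypothetical entries covering the awkward cases.
sample_entries = [
    {"name": "a", "bills": [{"name": "Supermercado", "value": 100.0}]},
    {"name": "b", "bills": []},   # empty bills list
    {"name": "c"},                # no bills key at all
    {"name": "d", "bills": [{"value": "12.5"}, {"value": 7}]},
]
values = bill_values(sample_entries)
print(values)
print(sum(values))
```

Using .get with a default avoids the try/except, and looping over every bill avoids an IndexError on entries whose bills list is empty.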
This is just a sample of the data:
{
"created_at": "Fri Jan 31 05:51:59 +0000 2014",
"favorited": false,
"lang": "en",
"place": {
"country_code": "US",
"url": "https://api.twitter.com/1.1/geo/id/cf44347a08102884.json"
},
"retweeted": false,
"source": "Tweetbot for Mac",
"text": "Active crime scene on I-59/20 near Jeff/Tusc Co line. One dead, one injured; shooting involved. Police search in the area; traffic stopped",
"truncated": false
}
How do I parse this in Python so that I can get the information in text or lang?
I'm assuming this fragment is incomplete, as it looks like JSON but is currently invalid. Assuming a valid JSON document, you can use the json module:
>>> import json
>>> s = """{"lang": "en", "favorited": false, "truncated": false, ... }"""
>>> data = json.loads(s)
>>> data['lang']
'en'
>>> data['text']
'Active crime scene on I-59/20 near Jeff/Tusc Co line. One dead, one injured; shooting involved. Police search in the area; traffic stopped'
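As a slightly fuller sketch of the same approach, the snippet below parses a shortened copy of the sample (the text is truncated here for brevity) and also reads a nested field. The variable names raw and tweet are only illustrative.

```python
import json

# Shortened copy of the sample tweet; only a few fields are kept.
raw = """{
  "created_at": "Fri Jan 31 05:51:59 +0000 2014",
  "lang": "en",
  "place": {"country_code": "US"},
  "text": "Active crime scene on I-59/20 near Jeff/Tusc Co line.",
  "truncated": false
}"""

tweet = json.loads(raw)

print(tweet["lang"])                   # top-level field
print(tweet["text"])
print(tweet["place"]["country_code"])  # nested objects become nested dicts
print(tweet.get("coordinates"))        # .get returns None for a missing key
```

JSON false, true and null are converted to Python False, True and None, so tweet["truncated"] is a real boolean rather than a string.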
Could someone point out the error I am making while grouping a JSON data block to generate a list of (key, value) tuples? I am able to collect all key/value pairs except the last pair, which ends with }. I don't understand why the | alternation is not working for the last pair.
\"(.*?)\"[:]+\"?(.*?)\"?[,[^\s](?=\")|[^\s]\}$]
I am using Python's re.findall function to generate the groups.
Example data block :
{
"author_flair_text": null,
"author": "joeinfro",
"id": "d2nvjik",
"link_id": "t3_4h4boa",
"gilded": 0,
"created_utc": 1462060800,
"author_flair_css_class": null,
"parent_id": "t3_4h4boa",
"ups": 1,
"body": "thats 1 case per 5000 people. nice!",
"subreddit_id": "t5_2qh13",
"stickied": false,
"edited": false,
"subreddit": "worldnews",
"distinguished": null,
"score": 1,
"retrieved_on": 1465550534,
"controversiality": 0
}
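EDIT: I finally found the solution by using a non-capturing group. Square brackets denote a character class rather than a group, so they cannot scope the | alternation; a non-capturing group (?:...) can.
Solution: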
\"(.*?)\"[:]+\"?(.*?)\"?(?:,(?=\")|\}$)
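A short sketch of how the corrected pattern behaves with re.findall. It assumes the block is compact, single-line JSON (no whitespace after colons or commas), since the lookahead expects a quote immediately after each comma; the names PAIR and block are illustrative and the block is a shortened copy of the example data.

```python
import json
import re

# The corrected pattern from above, with the alternation in a non-capturing group.
PAIR = re.compile(r'\"(.*?)\"[:]+\"?(.*?)\"?(?:,(?=\")|\}$)')

# Compact single-line version of part of the example block.
block = '{"author_flair_text":null,"author":"joeinfro","gilded":0,"controversiality":0}'

pairs = PAIR.findall(block)
print(pairs)

# For comparison: a real JSON parser yields the same keys with typed values.
parsed = json.loads(block)
print(list(parsed.items()))
```

Note that the regex returns every value as a string ('null', '0'), whereas json.loads returns None and integers; for anything beyond flat, compact objects the json module is generally the more robust choice.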
Consider a social network. It has posts. For feed, you request /feed and get the list of posts.
In the UI, there are things to show for a post, like if the user liked the post or not, if the user starred it or not, etc. These things don't look like they belong inside the post object.
Another case is when you fetch the likes. The frontend needs to know if the user in each 'like' object is being followed or not.
Where to put this info in the response JSON?
It depends on your application and on which data you want to show to the user. For example, consider listing a user's feed. In that feed, you want to show:
Message
Whether the current user liked it or not (I don't know the difference between liked and starred)
Number of likes
List of users who liked it
Whether the current user shared it or not
Share count
List of users who shared it
etc.
In the above list:
Some data needs a second API call to get the complete information and some does not. For example, "List of users who liked it" and "List of users who shared it" are generally dynamic data. You should fetch those details through a separate API call, for better server performance and for data integrity.
In some cases, an app needs a sneak peek of the users who liked or shared a post on the listing page. In that case, you can include details for a small, fixed number of users in the /feeds response itself and offer a "See more" option (like Facebook) in the UI.
Static, singular data (single-column data) can be returned in the initial GET /feeds response itself.
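The "sneak peek" idea can be sketched as a small server-side helper. The function summarize_likes, the field names likeCount, likedByPreview and hasMore, and the preview size are all hypothetical choices for illustration, not part of any existing API.

```python
def summarize_likes(liked_users, preview_size=3):
    """Build the like summary embedded in each feed item.

    The full list stays behind a separate endpoint; the feed carries only
    the count, a small fixed-size preview, and a flag for "See more".
    """
    return {
        "likeCount": len(liked_users),
        "likedByPreview": liked_users[:preview_size],
        "hasMore": len(liked_users) > preview_size,
    }


# Hypothetical list of users who liked one post.
summary = summarize_likes(["John", "James", "Jessica", "Mary", "Jack"])
print(summary)
```

Keeping the preview size fixed bounds the size of the feed response no matter how many users liked a post.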
I wonder why you don't follow the same style Twitter uses for listing tweets:
https://dev.twitter.com/rest/reference/get/search/tweets
{
"coordinates": null,
"favorited": false,
"truncated": false,
"created_at": "Fri Sep 21 23:40:54 +0000 2012",
"id_str": "249292149810667520",
"entities": {
"urls": [
],
"hashtags": [
{
"text": "FreeBandNames",
"indices": [
20,
34
]
}
],
"user_mentions": [
]
},
"in_reply_to_user_id_str": null,
"contributors": null,
"text": "Thee Namaste Nerdz. #FreeBandNames",
"metadata": {
"iso_language_code": "pl",
"result_type": "recent"
},
"retweet_count": 0,
"in_reply_to_status_id_str": null,
"id": 249292149810667520,
"geo": null,
"retweeted": false,
"in_reply_to_user_id": null,
"place": null,
"user":
{
"profile_sidebar_fill_color": "DDFFCC",
"profile_sidebar_border_color": "BDDCAD",
"profile_background_tile": true,
"name": "Chaz Martenstein",
"profile_image_url": "http://a0.twimg.com/profile_images/447958234/Lichtenstein_normal.jpg",
"created_at": "Tue Apr 07 19:05:07 +0000 2009",
"location": "Durham, NC",
"follow_request_sent": null,
"profile_link_color": "0084B4",
"is_translator": false,
"id_str": "29516238",
"entities": {
"url": {
"urls": [
{
"expanded_url": null,
"url": "http://bullcityrecords.com/wnng/",
"indices": [
0,
32
]
}
]
},
"description": {
"urls": [
]
}
},
"default_profile": false,
"contributors_enabled": false,
"favourites_count": 8,
"url": "http://bullcityrecords.com/wnng/",
"profile_image_url_https": "https://si0.twimg.com/profile_images/447958234/Lichtenstein_normal.jpg",
"utc_offset": -18000,
"id": 29516238,
"profile_use_background_image": true,
"listed_count": 118,
"profile_text_color": "333333",
"lang": "en",
"followers_count": 2052,
"protected": false,
"notifications": null,
"profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/9423277/background_tile.bmp",
"profile_background_color": "9AE4E8",
"verified": false,
"geo_enabled": false,
"time_zone": "Eastern Time (US & Canada)",
"description": "You will come to Durham, North Carolina. I will sell you some records then, here in Durham, North Carolina. Fun will happen.",
"default_profile_image": false,
"profile_background_image_url": "http://a0.twimg.com/profile_background_images/9423277/background_tile.bmp",
"statuses_count": 7579,
"friends_count": 348,
"following": null,
"show_all_inline_media": true,
"screen_name": "bullcityrecords"
},
"in_reply_to_screen_name": null,
"source": "web",
"in_reply_to_status_id": null
}
You have two options:
Make a separate API method for getting the user-context data, e.g. /api/users/1/feeds/1. Note that this option forces you to send one request per feed item: with 1000 feed items you will have 1000 + 1 requests (the so-called N+1 problem).
In my opinion, that is not a good idea.
You can store the user data in the JSON, for example this way:
{
"feedName": "feed1",
...
"currentUser": {
"liked": true,
"starred": true
}
}
By using this option, you avoid the N+1 request problem in your RESTful service.
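A minimal sketch of the second option on the server side, assuming the likes and stars are available as sets of user ids keyed by post id. The function add_user_context, the "id" field, and both lookup tables are hypothetical.

```python
def add_user_context(post, user_id, likes_by_post, stars_by_post):
    """Return a copy of the post with a currentUser block for this user."""
    enriched = dict(post)  # shallow copy so the shared post object is not mutated
    enriched["currentUser"] = {
        "liked": user_id in likes_by_post.get(post["id"], set()),
        "starred": user_id in stars_by_post.get(post["id"], set()),
    }
    return enriched


# Hypothetical data: post 1 was liked by users 7 and 9, starred by user 9.
posts = [{"id": 1, "feedName": "feed1"}, {"id": 2, "feedName": "feed2"}]
likes_by_post = {1: {7, 9}}
stars_by_post = {1: {9}}

feed = [add_user_context(p, 7, likes_by_post, stars_by_post) for p in posts]
print(feed)
```

The whole feed is enriched in one pass, so the client gets the per-user flags in the same response and no per-item request is needed.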
For all the users, the post resource should be the same. Adding specific user context info inside it seems like polluting it.
I can see where you're coming from and I quite agree.
Ivan's first solution should not be used, as he already mentioned. His second is better, but if you GET the posts JSON, which should contain only post objects, there is also this currentUser that doesn't really belong there.
My suggestion is that for each post you keep track of which users have liked and/or starred it, etc. Then you keep a clean structure while still having the info you need available in the same request/response.
Example
GET /feed HTTP/1.1
[
{
"text": "hello world, im a post!",
"author": "Jack",
"likes": 3,
"likedBy": [
"John",
"James",
"Jessica"
],
"stars": 2,
"starredBy": [
"John",
"Mary"
]
},
{
"text": "hello world, im also a post! :D",
"author": "Mary",
"likes": 1,
"likedBy": [
"James"
],
"stars": 0,
"starredBy": [
]
}
]
Where each {} object represents a post object.
On the client side, you could then check if the likedBy list contains the currently logged in user and proceed with the result as you see fit. Same for stars and any other of these properties a post might have.
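The client-side check could look like the sketch below. It is written in Python for illustration only (a real frontend would do the same in its own language), and viewer_flags is a hypothetical helper; the feed mirrors the example response above.

```python
def viewer_flags(post, current_user):
    """Derive per-user flags from the likedBy / starredBy lists of a post."""
    return {
        "liked": current_user in post.get("likedBy", []),
        "starred": current_user in post.get("starredBy", []),
    }


# The two posts from the example response.
feed = [
    {"text": "hello world, im a post!", "author": "Jack",
     "likedBy": ["John", "James", "Jessica"], "starredBy": ["John", "Mary"]},
    {"text": "hello world, im also a post! :D", "author": "Mary",
     "likedBy": ["James"], "starredBy": []},
]

flags = [viewer_flags(post, "John") for post in feed]
print(flags)
```

With this layout the post objects are identical for every user, and the per-user state is computed only where it is displayed.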
First, I apologize if my description is not accurate enough for you, I am a total newbie and I don't know a thing about programming, so don't hesitate to tell me if you need more detailed info, but I will try to be as precise as possible.
So I have downloaded a bunch of tweets using Twitter's API and the Terminal (through Twurl). All the tweets are in a .json file (which I open with TextWrangler; I'm on a Mac). When I export the .json file to a .csv file, in order to process and analyze the data more easily in Excel (or at least LibreOffice's equivalent), I don't get all the parameters I need for my study: the "bio" part of each tweet's info, which is present in the .json file, is missing. In other words, my final table has a column for the tweet ID, one for the tweet author, one for the text of the tweet itself and so on, but no column for the bio of the tweet author, even though this information is in the .json file. So my question is: is there some code or other way to add one more column to my final .csv table, showing this extra info from the original .json file?
Again, this may not be clear, so don't hesitate to tell me if you need me to highlight a specific point.
Thanks in advance for any insight, I really need help on this one, this is for a research project I need to carry on for my PhD, so any help would be more than welcome!
EDIT: As an example, here is a sample of the data I have for one tweet in my original .json file:
{
"created_at": "Mon Apr 28 09:00:40 +0000 2014",
"id": 460705144846712800,
"id_str": "460705144846712832",
"text": "Work can suck a dick today",
"source": "Twitter for iPhone",
"truncated": false,
"in_reply_to_status_id": null,
"in_reply_to_status_id_str": null,
"in_reply_to_user_id": null,
"in_reply_to_user_id_str": null,
"in_reply_to_screen_name": null,
"user": {
"id": 253350311,
"id_str": "253350311",
"name": "JEEEZUS",
"screen_name": "Maxi_Flex",
"location": "Southchestershire",
"url": "http://www.soundcloud.com/maxi_flex",
"description": "Jazz Personality.G Mentality.",
"protected": false,
"followers_count": 457,
"friends_count": 400,
"listed_count": 1,
"created_at": "Thu Feb 17 02:08:57 +0000 2011",
"favourites_count": 1229,
"utc_offset": null,
"time_zone": null,
"geo_enabled": true,
"verified": false,
"statuses_count": 13661,
"lang": "en",
"contributors_enabled": false,
"is_translator": false,
"is_translation_enabled": false,
"profile_background_color": "08ABFC",
"profile_background_image_url": "http://pbs.twimg.com/profile_background_images/444297891977244672/Z1BkfCFB.jpeg",
"profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/444297891977244672/Z1BkfCFB.jpeg",
"profile_background_tile": true,
"profile_image_url": "http://pbs.twimg.com/profile_images/454073282778902529/gCGicDBH_normal.jpeg",
"profile_image_url_https": "https://pbs.twimg.com/profile_images/454073282778902529/gCGicDBH_normal.jpeg",
"profile_banner_url": "https://pbs.twimg.com/profile_banners/253350311/1392339276",
"profile_link_color": "FA05F2",
"profile_sidebar_border_color": "FFFFFF",
"profile_sidebar_fill_color": "DDEEF6",
"profile_text_color": "333333",
"profile_use_background_image": true,
"default_profile": false,
"default_profile_image": false,
"following": null,
"follow_request_sent": null,
"notifications": null
},
"geo": null,
"coordinates": null,
"place": null,
"contributors": null,
"retweet_count": 0,
"favorite_count": 0,
"entities": {
"hashtags": [],
"symbols": [],
"urls": [],
"user_mentions": []
},
"favorited": false,
"retweeted": false,
"filter_level": "medium",
"lang": "en"
}
So in the final CSV file I have some of the info I mentioned above, but what I need to add is the "description" field (inside "user") of each tweet. Any help would be appreciated!
The problem is probably that JSON is hierarchical and CSV is not. I'm guessing that you are only getting the top level JSON elements and not the nested objects. For example if your JSON is:
{
    "name": "test",
    "author": {
        "id": 123,
        "created": ""
    }
}
you are only getting "name" and not "author.id"? If this is the case, check out other questions on SO about flattening JSON for CSV, e.g. "flattening json to csv format".
Any good JSON to CSV converter will work, try this one. If there is something funky in the JSON, we need an example of the input JSON and of what is getting spit out.
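For illustration, flattening can be sketched in a few lines of Python. This is one possible approach, not the converter referred to above: flatten is a hypothetical helper, the dotted column names are an arbitrary convention, and the single inline tweet stands in for the real file.

```python
import csv
import io
import json


def flatten(obj, prefix=""):
    """Turn nested dicts into one flat dict with dotted keys (user.description)."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, name + "."))
        else:
            flat[name] = value  # lists and scalars are kept as-is
    return flat


# Stand-in for the tweets loaded from the .json file.
tweets = [json.loads(
    '{"id_str": "460705144846712832", "text": "hello", '
    '"user": {"screen_name": "Maxi_Flex", '
    '"description": "Jazz Personality.G Mentality."}}'
)]

rows = [flatten(tweet) for tweet in tweets]
columns = ["id_str", "user.screen_name", "user.description", "text"]

buffer = io.StringIO()  # swap for open("result.csv", "w", newline="") to write a file
writer = csv.DictWriter(buffer, fieldnames=columns, extrasaction="ignore")
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

The columns list controls which fields end up in the CSV, so adding the author's bio is a matter of listing "user.description"; the csv module also takes care of quoting descriptions that contain commas or line breaks.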
If you just need that one field, enter the following command on the command line:
cat test.json | sed -n 's/.*description\":\"\([^"]*\)\".*/Description, \1/p' > result.csv
Where test.json is the file with all the JSON entries in it.
Here is the output from an example I ran:
cat test.json | sed -n 's/.*description\":\"\([^"]*\)\".*/\1/p'
Jazz Personality.G Mentality.
Jazz Personality.G Mentality.
Jazz Personality.G Mentality.
Jazz Personality.G Mentality.
If the file is very large, you may need to split it into parts:
split -l N test.json part
Where N is the number of lines per part.