how do I parse info from a txt file? Python 3.0 - json

This is just a sample of code
{
"created_at": "Fri Jan 31 05:51:59 +0000 2014",
"favorited": false,
"lang": "en",
"place": {
"country_code": "US",
"url": "https://api.twitter.com/1.1/geo/id/cf44347a08102884.json"
},
"retweeted": false,
"source": "Tweetbot for Mac",
"text": "Active crime scene on I-59/20 near Jeff/Tusc Co line. One dead, one injured; shooting involved. Police search in the area; traffic stopped",
"truncated": false
}
How do I parse this in python so that I can get the information in text or lang?

I'm assuming this fragment is incomplete, as it looks like JSON but is currently invalid. Assuming a valid JSON document, you can use the json module:
>>> import json
>>> s = """{"lang": "en", "favorited": false, "truncated": false, ... }"""
>>> data = json.loads(s)
>>> data['lang']
'en'
>>> data['text']
'Active crime scene on I-59/20 near Jeff/Tusc Co line. One dead, one injured; shooting involved. Police search in the area; traffic stopped'
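Since the question mentions a .txt file: if the JSON lives in a file rather than a string, a minimal sketch (assuming the file, here hypothetically named tweet.txt, contains one complete JSON object) would be:
import json

# tweet.txt is a placeholder filename; the file is assumed to hold one JSON object
with open('tweet.txt', 'r', encoding='utf-8') as f:
    data = json.load(f)

print(data['lang'])   # 'en'
print(data['text'])
json.load reads straight from a file object, while json.loads takes a string you already have in memory.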

Related

json.load loads a string instead of json

I have a list of dictionaries written to a data.txt file. I was expecting to be able to read the list of dictionaries back in the normal way when I load it, but instead I seem to load a string.
For example, when I print(data[0]), I was expecting the first dictionary in the list, but instead I got "[".
Attached below are my code and txt file:
read_json.py
import json
with open('./data.txt', 'r') as json_file:
    data = json.load(json_file)
print(data[0])
data.txt
"[
{
"name": "Disney's Mulan (Mandarin) PG13 *",
"cast": [
"Jet Li",
"Donnie Yen",
"Yifei Liu"
],
"genre": [
"Action",
"Adventure",
"Drama"
],
"language": "Mandarin with no subtitles",
"rating": "PG13 - Some Violence",
"runtime": "115",
"open_date": "18 Sep 2020",
"description": "\u201cMulan\u201d is the epic adventure of a fearless young woman who masquerades as a man in order to fight Northern Invaders attacking China. The eldest daughter of an honored warrior, Hua Mulan is spirited, determined and quick on her feet. When the Emperor issues a decree that one man per family must serve in the Imperial Army, she steps in to take the place of her ailing father as Hua Jun, becoming one of China\u2019s greatest warriors ever."
},
{
"name": "The New Mutants M18",
"cast": [
"Maisie Williams",
"Henry Zaga",
"Anya Taylor-Joy",
"Charlie Heaton",
"Alice Braga",
"Blu Hunt"
],
"genre": [
"Action",
"Sci-Fi"
],
"language": "English",
"rating": "M18 - Some Mature Content",
"runtime": "94",
"open_date": "27 Aug 2020",
"description": "Five young mutants, just discovering their abilities while held in a secret facility against their will, fight to escape their past sins and save themselves."
}
]"
The above list is formatted properly for easy reading but the actual file is a single line and the different lines are denoted with "\n". Thanks for any help.
Removing the double quotes in data.txt worked for me, e.g. modify
"[{...},{...}]"
to
[{...},{...}]
Hope it helps!
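If you would rather not edit data.txt by hand, another sketch (assuming the file really holds a JSON-encoded string that itself wraps the list, which is what the "[" output suggests) is to decode twice:
import json

with open('./data.txt', 'r') as json_file:
    text = json.load(json_file)   # first pass returns the embedded string
data = json.loads(text)           # second pass parses that string as JSON
print(data[0]['name'])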

Retrieving a specific object from JSON

This is a snippet of a JSON file that was returned by TMDB, and I'm trying to access the title of every object. I've tried methods like the ones from this post: How to access specific value from a nested array within an object array.
"results": [
{
"vote_count": 2358,
"id": 283366,
"video": false,
"vote_average": 6.5,
"title": "Miss Peregrine's Home for Peculiar Children",
"popularity": 20.662756,
"poster_path": "/AvekzUdI8HZnImdQulmTTmAZXrC.jpg",
"original_language": "en",
"original_title": "Miss Peregrine's Home for Peculiar Children",
"genre_ids": [
18,
14,
12
],
"backdrop_path": "/9BVHn78oQcFCRd4M3u3NT7OrhTk.jpg",
"adult": false,
"overview": "A teenager finds himself transported to an island where he must help protect a group of orphans with special powers from creatures intent on destroying them.",
"release_date": "2016-09-28"
},
{
"vote_count": 3073,
"id": 381288,
"video": false,
"vote_average": 6.8,
"title": "Split",
"popularity": 17.488396,
"poster_path": "/rXMWOZiCt6eMX22jWuTOSdQ98bY.jpg",
"original_language": "en",
"original_title": "Split",
"genre_ids": [
27,
53
],
"backdrop_path": "/4G6FNNLSIVrwSRZyFs91hQ3lZtD.jpg",
"adult": false,
"overview": "Though Kevin has evidenced 23 personalities to his trusted psychiatrist, Dr. Fletcher, there remains one still submerged who is set to materialize and dominate all the others. Compelled to abduct three teenage girls led by the willful, observant Casey, Kevin reaches a war for survival among all of those contained within him β€” as well as everyone around him β€” as the walls between his compartments shatter apart.",
"release_date": "2016-11-15"
},
var titles = results.map(function extract(item){return item.title})
The map function iterates through the array and builds the resulting array by applying the extract function on each item.
for (var i = 0; i < obj.results.length; i++) {
console.log(obj.results[i].title);
}
First, we get the results key, and then, iterate over it since it is an array.
results is nothing more than an array of objects; there's nothing nested.
You could use .forEach()
results.forEach(function(item){
console.log(item.title);
});

R: JSON data imported into wrong columns

I imported some JSON data using the rjson library. The problem I'm facing is that some of the data appears to be misaligned. I suspect this is due to missing values.
How can I detect and re-align the data that is in incorrect columns and fill empty values with NULL? I cannot share the data; I hope the image will be enough.
code used to import data:
library(rjson)
json_data <- do.call(rbind, lapply(readLines(training.file$filepaths[ind]), rjson::fromJSON))
json_data <- as.data.frame(json_data)
I have also tried using the jsonlite::fromJSON function instead of rjson::fromJSON, but I get the following error:
Error in feed_push_parser(readBin(con, raw(), n), reset = TRUE) :
parse error: trailing garbage
d_str": null, "place": null} {"truncated": false, "text": "R
(right here) ------^
json file format (data is manipulated but all properties are present in this example):
{
"truncated": false, "text": "abc abc", "in_reply_to_status_id": null,
"id": 123, "favorite_count": 0, "retweeted": false, "entities": {
"symbols": [], "user_mentions": [], "hashtags": [], "urls": []
},
"in_reply_to_screen_name": null, "id_str": "123", "retweet_count": 0,
"in_reply_to_user_id": null, "screen_name_statistics": {
"has_underscore": true, "contains_swear": false, "has_digits": false,
"contains_condition": false, "has_chars": true
},
"user": {
"verified": false, "geo_enabled": false, "followers_count": 0,
"utc_offset": -14400, "statuses_count": 17600, "friends_count": 4425,
"lang": "en", "favourites_count": 1900, "screen_name": "1name1",
"url": null, "created_at": "Sat Jun 00 03:36:27 +0000 2012",
"time_zone": "Atlantic Time (Canada)", "listed_count": 2
},
"geo": null, "in_reply_to_user_id_str": null, "lang": "en",
"created_at": "Mon Nov 55 05:18:49 +0000 2013",
"in_reply_to_status_id_str": null, "place": null
}
Further information:
obj1 and obj2 contain a different number of properties: obj1 contains 19 properties while obj2 contains 20.
The misalignment occurs when the list is converted to a dataframe using as.data.frame. A custom function may be required to take property names into consideration.
I used the rjson::fromJSON function to import the data. It was imported as a list, which I then converted to a dataframe for further analysis using as.data.frame.
At first I did not notice that the JSON objects had a different number of properties, which was causing the misalignment of data in the dataframe; column names were not matched.
To fix this, I wrote a custom mapping function which looks at individual values from the list and maps them into a pre-defined dataframe.
Code specific to my example is available here. Specifically, the "importJSON" function tackles the mapping of the list to a dataframe.

Where to put current user context data in response JSON?

Consider a social network. It has posts. For feed, you request /feed and get the list of posts.
In the UI, there are things to show for a post, like if the user liked the post or not, if the user starred it or not, etc. These things don't look like they belong inside the post object.
Another case is when you fetch the likes. The frontend needs to know if the user in each 'like' object is being followed or not.
Where to put this info in the response JSON?
It depends on your application and which data you want to show to the user. For example, consider you are listing a user's feed. In that feed, you want to show:
Message
Liked by the current user or not (I don't know the difference between liked and starred)
Number of likes
List of liked users
Shared by the current user or not
Shared count
List of shared users
etc.
In the above list, some data needs two API fetches to get complete info and some does not. For example, "List of liked users" and "List of shared users" are generally dynamic data; you should fetch those details from a separate API for better server performance and data integrity.
In some cases, an app needs a sneak peek of the liked/shared users' info on the listing page. In that case, you can include a small fixed number of user details in the same /feeds response itself and add a "See More" option (like Facebook) in the UI.
Some static singular data (single-column data) can be listed in the initial GET /feeds itself.
I wonder why you don't follow the style of Twitter's tweet listing:
https://dev.twitter.com/rest/reference/get/search/tweets
{
"coordinates": null,
"favorited": false,
"truncated": false,
"created_at": "Fri Sep 21 23:40:54 +0000 2012",
"id_str": "249292149810667520",
"entities": {
"urls": [
],
"hashtags": [
{
"text": "FreeBandNames",
"indices": [
20,
34
]
}
],
"user_mentions": [
]
},
"in_reply_to_user_id_str": null,
"contributors": null,
"text": "Thee Namaste Nerdz. #FreeBandNames",
"metadata": {
"iso_language_code": "pl",
"result_type": "recent"
},
"retweet_count": 0,
"in_reply_to_status_id_str": null,
"id": 249292149810667520,
"geo": null,
"retweeted": false,
"in_reply_to_user_id": null,
"place": null,
"user":
{
"profile_sidebar_fill_color": "DDFFCC",
"profile_sidebar_border_color": "BDDCAD",
"profile_background_tile": true,
"name": "Chaz Martenstein",
"profile_image_url": "http://a0.twimg.com/profile_images/447958234/Lichtenstein_normal.jpg",
"created_at": "Tue Apr 07 19:05:07 +0000 2009",
"location": "Durham, NC",
"follow_request_sent": null,
"profile_link_color": "0084B4",
"is_translator": false,
"id_str": "29516238",
"entities": {
"url": {
"urls": [
{
"expanded_url": null,
"url": "http://bullcityrecords.com/wnng/",
"indices": [
0,
32
]
}
]
},
"description": {
"urls": [
]
}
},
"default_profile": false,
"contributors_enabled": false,
"favourites_count": 8,
"url": "http://bullcityrecords.com/wnng/",
"profile_image_url_https": "https://si0.twimg.com/profile_images/447958234/Lichtenstein_normal.jpg",
"utc_offset": -18000,
"id": 29516238,
"profile_use_background_image": true,
"listed_count": 118,
"profile_text_color": "333333",
"lang": "en",
"followers_count": 2052,
"protected": false,
"notifications": null,
"profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/9423277/background_tile.bmp",
"profile_background_color": "9AE4E8",
"verified": false,
"geo_enabled": false,
"time_zone": "Eastern Time (US & Canada)",
"description": "You will come to Durham, North Carolina. I will sell you some records then, here in Durham, North Carolina. Fun will happen.",
"default_profile_image": false,
"profile_background_image_url": "http://a0.twimg.com/profile_background_images/9423277/background_tile.bmp",
"statuses_count": 7579,
"friends_count": 348,
"following": null,
"show_all_inline_media": true,
"screen_name": "bullcityrecords"
},
"in_reply_to_screen_name": null,
"source": "web",
"in_reply_to_status_id": null
}
You have two options:
Make a separate API method for getting information about user context data, e.g. /api/users/1/feeds/1. Note that this option will force you to send a request per feed, so if you have 1000 feeds you will make 1000 + 1 requests (the so-called N+1 problem).
In my opinion, it's not a good idea.
You can store the user context data in the JSON, for example like this:
{
"feedName": "feed1",
...
"currentUser": {
"liked": true,
"starred": true
}
}
By using this option you avoid the N+1 requests problem in your RESTful service.
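As a rough illustration of this second option, here is a server-side sketch in plain Python (the feed data and the liked/starred id sets are hypothetical, not tied to any framework) showing how per-user context could be attached to each post before serialization:
import json

def build_feed_response(posts, liked_post_ids, starred_post_ids):
    # Attach per-user context to each post before serializing the feed.
    feed = []
    for post in posts:
        item = dict(post)  # copy the shared post resource
        item['currentUser'] = {
            'liked': post['id'] in liked_post_ids,
            'starred': post['id'] in starred_post_ids,
        }
        feed.append(item)
    return json.dumps(feed)

# Hypothetical usage
posts = [{'id': 1, 'feedName': 'feed1', 'text': 'hello world'}]
print(build_feed_response(posts, liked_post_ids={1}, starred_post_ids=set()))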
For all the users, the post resource should be the same. Adding specific user context info inside it seems like polluting it.
I can see where you're coming from and I quite agree.
Ivan's first solution should not be used, as he already mentioned. His second is better, but then if you GET the posts JSON, which should contain only post objects, there is also this currentUser that doesn't really belong there.
My suggestion is that for each post you keep track of which users have liked and/or starred it, etc. Then you keep a clean structure while still having the info you need available in the same request/response.
Example
GET /feed HTTP/1.1
[
{
"text": "hello world, im a post!",
"author": "Jack",
"likes": 3,
"likedBy": [
"John",
"James",
"Jessica"
],
"stars": 2,
"starredBy": [
"John",
"Mary"
]
},
{
"text": "hello world, im also a post! :D",
"author": "Mary",
"likes": 1,
"likedBy": [
"James"
],
"stars": 0,
"starredBy": [
]
}
]
Where each {} object represents a post object.
On the client side, you could then check if the likedBy list contains the currently logged in user and proceed with the result as you see fit. Same for stars and any other of these properties a post might have.

JSON to CSV: How to add filters (columns) in the final Excel table?

First, I apologize if my description is not accurate enough for you, I am a total newbie and I don't know a thing about programming, so don't hesitate to tell me if you need more detailed info, but I will try to be as precise as possible.
So I have downloaded a bunch of tweets thanks to Twitter's API and the Terminal (through Twurl). All the tweets are in a .json file (that I open with TextWrangler, I'm on a Mac) and the thing is that when I export my .json file to a .csv file in order to process and analyze the data more easily thanks to Excel (or at least the Excel version of LibreOffice), I don't have all the parameters I would require for my study, I lack the "bio" part of each Tweet info present in the .json file. In other words, in my final table I have a column for the tweet ID, one for the tweet author, one for the text of the tweet itself and so on... But I don't have a column for the bio of the tweet author, whereas this information is displayed in the .json file itself. So my question is: is there a code or anything which would enable me to have one more column displaying some more info present in the basic .json file in my final .csv table?
Again, this may not be clear, so don't hesitate to tell me if you need me to highlight a specific point.
Thanks in advance for any insight, I really need help on this one, this is for a research project I need to carry on for my PhD, so any help would be more than welcome!
EDIT: As an example, here is a sample of the data I have for one tweet in my original .json file:
{
"created_at": "Mon Apr 28 09:00:40 +0000 2014",
"id": 460705144846712800,
"id_str": "460705144846712832",
"text": "Work can suck a dick today",
"source": "Twitter for iPhone",
"truncated": false,
"in_reply_to_status_id": null,
"in_reply_to_status_id_str": null,
"in_reply_to_user_id": null,
"in_reply_to_user_id_str": null,
"in_reply_to_screen_name": null,
"user": {
"id": 253350311,
"id_str": "253350311",
"name": "JEEEZUS",
"screen_name": "Maxi_Flex",
"location": "Southchestershire",
"url": "http://www.soundcloud.com/maxi_flex",
"description": "Jazz Personality.G Mentality.",
"protected": false,
"followers_count": 457,
"friends_count": 400,
"listed_count": 1,
"created_at": "Thu Feb 17 02:08:57 +0000 2011",
"favourites_count": 1229,
"utc_offset": null,
"time_zone": null,
"geo_enabled": true,
"verified": false,
"statuses_count": 13661,
"lang": "en",
"contributors_enabled": false,
"is_translator": false,
"is_translation_enabled": false,
"profile_background_color": "08ABFC",
"profile_background_image_url": "http://pbs.twimg.com/profile_background_images/444297891977244672/Z1BkfCFB.jpeg",
"profile_background_image_url_https": "https://pbs.twimg.com/profile_background_images/444297891977244672/Z1BkfCFB.jpeg",
"profile_background_tile": true,
"profile_image_url": "http://pbs.twimg.com/profile_images/454073282778902529/gCGicDBH_normal.jpeg",
"profile_image_url_https": "https://pbs.twimg.com/profile_images/454073282778902529/gCGicDBH_normal.jpeg",
"profile_banner_url": "https://pbs.twimg.com/profile_banners/253350311/1392339276",
"profile_link_color": "FA05F2",
"profile_sidebar_border_color": "FFFFFF",
"profile_sidebar_fill_color": "DDEEF6",
"profile_text_color": "333333",
"profile_use_background_image": true,
"default_profile": false,
"default_profile_image": false,
"following": null,
"follow_request_sent": null,
"notifications": null
},
"geo": null,
"coordinates": null,
"place": null,
"contributors": null,
"retweet_count": 0,
"favorite_count": 0,
"entities": {
"hashtags": [],
"symbols": [],
"urls": [],
"user_mentions": []
},
"favorited": false,
"retweeted": false,
"filter_level": "medium",
"lang": "en"
}
So in the final csv file, I have some of the info I mentioned above, but what I would need to add in the csv file is the "description" part of each tweet. Any help would be appreciated!
The problem is probably that JSON is hierarchical and CSV is not. I'm guessing that you are only getting the top level JSON elements and not the nested objects. For example if your JSON is:
{
"name": "test",
"author": {
"id": 123,
"created": ""
}
}
you are only getting 'name' and not 'author.id'? If this is the case, check out other questions on SO related to flattening JSON out for CSV e.g. flattening json to csv format
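To make the flattening idea concrete for the tweet data above, here is a minimal Python sketch (assuming the .json file holds one tweet object per line, as streaming downloads usually do; the filenames tweets.json and tweets.csv are placeholders, and the columns are picked from the sample tweet):
import csv
import json

with open('tweets.json', 'r', encoding='utf-8') as infile, \
     open('tweets.csv', 'w', newline='', encoding='utf-8') as outfile:
    writer = csv.writer(outfile)
    writer.writerow(['id_str', 'created_at', 'screen_name', 'text', 'bio'])
    for line in infile:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        user = tweet.get('user', {})
        writer.writerow([
            tweet.get('id_str', ''),
            tweet.get('created_at', ''),
            user.get('screen_name', ''),
            tweet.get('text', ''),
            user.get('description', ''),  # the "bio" field, nested under "user"
        ])
If your file is instead one big JSON array, replace the line-by-line loop with a single json.load and iterate over the resulting list.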
Any good JSON-to-CSV converter will work; try this one. If there is something funky in the JSON, we need an example of the input JSON and what is getting spit out.
If you just need that one field, enter the following command on the command line:
cat test.json | sed -n 's/.*description\":\"\([^"]*\)\".*/Description, \1/p' > result.csv
Where test.json is the file with all the JSON entries in it.
Here is the output from an example I ran:
cat test.json | sed -n 's/.*description\":\"\([^"]*\)\".*/\1/p'
Jazz Personality.G Mentality.
Jazz Personality.G Mentality.
Jazz Personality.G Mentality.
Jazz Personality.G Mentality.
If the file is very large you may need to split it into parts:
split -l N test.json part
Where N is the number of lines per part.