I'm working with Twitter data which was fetched in JSONL form. I've converted it to JSON and am trying to convert it to CSV (to import into a program which accepts either CSV or MySQL). However, some people put forced new lines into their tweets or bios. This causes the CSV file to have multiple lines per entry, often breaking in the middle of a tweet. I've tried a few of the Python JSON-to-CSV scripts floating around on GitHub.
The latest attempt I tried:
jq -s "." tiny00subset.jsonl > tiny00subset.json
json2csv -i tiny00subset.json -o tiny00subset.csv
Partial example tweet (json format):
{
"created_at": "Mon Aug 13 10:40:34 +0000 2018",
"id": 1028954459110555600,
"id_str": "1028954459110555649",
"full_text": "Oh well, they deal with it quite well. Like they add numbers and facts and such crazy stuff.\nhttps://REPLACED/DuBGmHCnG8\n#climatechange https://REPLACED/d5IBchM3Uk",
"truncated": false,
"display_text_range": [
0,
131
],
"entities": {
"hashtags": [
{
"text": "climatechange",
"indices": [
117,
131
]
}
],
"symbols": [],
"user_mentions": [],
"urls": [
{
"url": "https://REPLACED/DuBGmHCnG8",
"expanded_url": "https://tamino.wordpress.com/2018/08/08/usa-temperature-can-i-sucker-you/",
"display_url": "tamino.wordpress.com/2018/08/08/usa…",
"indices": [
93,
116
]
},
{
"url": "https://REPLACED/d5IBchM3Uk",
"expanded_url": "https://twitter.com/Tony__Heller/status/1028672939753758720",
"display_url": "twitter.com/Tony__Heller/s…",
"indices": [
132,
155
]
}
]
},
}
CSV Output:
"Mon Aug 13 10:40:34 +0000 2018",1028954459110555600,"1028954459110555649","Oh well, they deal with it quite well. Like they add numbers and facts and such crazy stuff.
https://REPLACED/DuBGmHCnG8
#climatechange https://REPLACED/d5IBchM3Uk",false,"[0,131]","{""hashtags"":[{""text"":""climatechange"",""indices"":[117,131]}],""symbols"":[],""user_mentions"":[],""urls"":[{""url"":""https://REPLACED/DuBGmHCnG8"",""expanded_url"":""https://tamino.wordpress.com/2018/08/08/usa-temperature-can-i-sucker-you/"",""display_url"":""tamino.wordpress.com/2018/08/08/usa…"",""indices"":[93,116]},{""url"":""https://REPLACED/d5IBchM3Uk"",""expanded_url"":""https://twitter.com/Tony__Heller/status/1028672939753758720"",""display_url"":""twitter.com/Tony__Heller/s…"",""indices"":[132,155]}]}","TweetDeck",,,,,,"{""id"":59806323,""id_str"":""59806323"",""name"":""Daniel"",""screen_name"":""sleeksorrow"",""location"":""Karlsruhe, Germany"",""description"":""Politik, IT, Blödsinn und deren Schnittmenge. Ebenfalls: Hochmittelalter Darstellung, Falknerei, Greifvogelschutz - profile picture by #herrkausk"",""url"":""https://REPLACED/E8aNHIhCtg"",""entities"":{""url"":{""urls"":[{""url"":""https://REPLACED/E8aNHIhCtg"",""expanded_url"":""http://sleeksorrow.blogspot.com/"",""display_url"":""sleeksorrow.blogspot.com"",""indices"":[0,23]}]},""description"":{""urls"":[]}},""protected"":false,""followers_count"":572,""friends_count"":392,""listed_count"":47,""created_at"":""Fri Jul 24 15:15:25 +0000 
2009"",""favourites_count"":13259,""utc_offset"":null,""time_zone"":null,""geo_enabled"":false,""verified"":false,""statuses_count"":48861,""lang"":null,""contributors_enabled"":false,""is_translator"":false,""is_translation_enabled"":false,""profile_background_color"":""1A1B1F"",""profile_background_image_url"":""http://abs.twimg.com/images/themes/theme9/bg.gif"",""profile_background_image_url_https"":""https://abs.twimg.com/images/themes/theme9/bg.gif"",""profile_background_tile"":false,""profile_image_url"":""http://pbs.twimg.com/profile_images/877219681513480192/1rj4xqpK_normal.jpg"",""profile_image_url_https"":""https://pbs.twimg.com/profile_images/877219681513480192/1rj4xqpK_normal.jpg"",""profile_banner_url"":""https://pbs.twimg.com/profile_banners/59806323/1397029131"",""profile_image_extensions_alt_text"":null,""profile_banner_extensions_alt_text"":null,""profile_link_color"":""2FC2EF"",""profile_sidebar_border_color"":""181A1E"",""profile_sidebar_fill_color"":""252429"",""profile_text_color"":""666666"",""profile_use_background_image"":true,""has_extended_profile"":false,""default_profile"":false,""default_profile_image"":false,""can_media_tag"":true,""followed_by"":false,""following"":false,""follow_request_sent"":false,""notifications"":false,""translator_type"":""none""}",,,,,true,1028672939753758700,"1028672939753758720","{""url"":""https://REPLACED/d5IBchM3Uk"",""expanded"":""https://twitter.com/Tony__Heller/status/1028672939753758720"",""display"":""twitter.com/Tony__Heller/s…""}","{""created_at"":""Sun Aug 12 16:01:55 +0000 2018"",""id"":1028672939753758700,""id_str"":""1028672939753758720"",""full_text"":""#DeanFieldingF1 It is very difficult or impossible for climate alarmists to deal with reality. 
https://REPLACED/wOJTptxIqH"",""truncated"":false,""display_text_range"":[16,94],""entities"":{""hashtags"":[],""symbols"":[],""user_mentions"":[{""screen_name"":""DeanFieldingF1"",""name"":""Dean Fielding"",""id"":797295219825897500,""id_str"":""797295219825897472"",""indices"":[0,15]}],""urls"":[],""media"":[{""id"":1028672868849090600,""id_str"":""1028672868849090560"",""indices"":[95,118],""media_url"":""http://pbs.twimg.com/media/DkaUhinVAAARrIY.jpg"",""media_url_https"":""https://pbs.twimg.com/media/DkaUhinVAAARrIY.jpg"",""url"":""https://REPLACED/wOJTptxIqH"",""display_url"":""pic.twitter.com/wOJTptxIqH"",""expanded_url"":""https://twitter.com/SteveSGoddard/status/1028672939753758720/photo/1"",""type"":""photo"",""sizes"":{""thumb"":{""w"":150,""h"":150,""resize"":""crop""},""medium"":{""w"":1070,""h"":983,""resize"":""fit""},""large"":{""w"":1070,""h"":983,""resize"":""fit""},""small"":{""w"":680,""h"":625,""resize"":""fit""}},""features"":{""orig"":{""faces"":[]},""medium"":{""faces"":[]},""large"":{""faces"":[]},""small"":{""faces"":[]}}}]},""extended_entities"":{""media"":[{""id"":1028672868849090600,""id_str"":""1028672868849090560"",""indices"":[95,118],""media_url"":""http://pbs.twimg.com/media/DkaUhinVAAARrIY.jpg"",""media_url_https"":""https://pbs.twimg.com/media/DkaUhinVAAARrIY.jpg"",""url"":""https://REPLACED/wOJTptxIqH"",""display_url"":""pic.twitter.com/wOJTptxIqH"",""expanded_url"":""https://twitter.com/SteveSGoddard/status/1028672939753758720/photo/1"",""type"":""photo"",""sizes"":{""thumb"":{""w"":150,""h"":150,""resize"":""crop""},""medium"":{""w"":1070,""h"":983,""resize"":""fit""},""large"":{""w"":1070,""h"":983,""resize"":""fit""},""small"":{""w"":680,""h"":625,""resize"":""fit""}},""features"":{""orig"":{""faces"":[]},""medium"":{""faces"":[]},""large"":{""faces"":[]},""small"":{""faces"":[]}},""ext_alt_text"":null},{""id"":1028672883986333700,""id_str"":""1028672883986333697"",""indices"":[95,118],""media_url"":""http://pbs.twimg.com/med
ia/DkaUibAVAAEaQt0.jpg"",""media_url_https"":""https://pbs.twimg.com/media/DkaUibAVAAEaQt0.jpg"",""url"":""https://REPLACED/wOJTptxIqH"",""display_url"":""pic.twitter.com/wOJTptxIqH"",""expanded_url"":""https://twitter.com/SteveSGoddard/status/1028672939753758720/photo/1"",""type"":""photo"",""sizes"":{""thumb"":{""w"":150,""h"":150,""resize"":""crop""},""medium"":{""w"":1070,""h"":983,""resize"":""fit""},""large"":{""w"":1070,""h"":983,""resize"":""fit""},""small"":{""w"":680,""h"":625,""resize"":""fit""}},""features"":{""orig"":{""faces"":[]},""medium"":{""faces"":[]},""large"":{""faces"":[]},""small"":{""faces"":[]}},""ext_alt_text"":null}]},""source"":""Twitter Web Client"",""in_reply_to_status_id"":1028671170802081800,""in_reply_to_status_id_str"":""1028671170802081793"",""in_reply_to_user_id"":797295219825897500,""in_reply_to_user_id_str"":""797295219825897472"",""in_reply_to_screen_name"":""DeanFieldingF1"",""user"":{""id"":435704007,""id_str"":""435704007"",""name"":""Tony Heller"",""screen_name"":""Tony__Heller"",""location"":""Colorado"",""description"":""https://REPLACED/j5CaDNyIqE"",""url"":""https://REPLACED/Pyn117xXna"",""entities"":{""url"":{""urls"":[{""url"":""https://REPLACED/Pyn117xXna"",""expanded_url"":""http://realclimatescience.com"",""display_url"":""realclimatescience.com"",""indices"":[0,23]}]},""description"":{""urls"":[{""url"":""https://REPLACED/j5CaDNyIqE"",""expanded_url"":""https://realclimatescience.com/who-is-tony-heller/"",""display_url"":""realclimatescience.com/who-is-tony-he…"",""indices"":[0,23]}]}},""protected"":false,""followers_count"":44955,""friends_count"":374,""listed_count"":886,""created_at"":""Tue Dec 13 10:44:34 +0000 
2011"",""favourites_count"":3740,""utc_offset"":null,""time_zone"":null,""geo_enabled"":true,""verified"":false,""statuses_count"":165165,""lang"":null,""contributors_enabled"":false,""is_translator"":false,""is_translation_enabled"":false,""profile_background_color"":""185370"",""profile_background_image_url"":""http://abs.twimg.com/images/themes/theme1/bg.png"",""profile_background_image_url_https"":""https://abs.twimg.com/images/themes/theme1/bg.png"",""profile_background_tile"":false,""profile_image_url"":""http://pbs.twimg.com/profile_images/1175541923508916225/0qEi4yIj_normal.jpg"",""profile_image_url_https"":""https://pbs.twimg.com/profile_images/1175541923508916225/0qEi4yIj_normal.jpg"",""profile_banner_url"":""https://pbs.twimg.com/profile_banners/435704007/1469798959"",""profile_image_extensions_alt_text"":null,""profile_banner_extensions_alt_text"":null,""profile_link_color"":""0084B4"",""profile_sidebar_border_color"":""FFFFFF"",""profile_sidebar_fill_color"":""DDEEF6"",""profile_text_color"":""333333"",""profile_use_background_image"":true,""has_extended_profile"":false,""default_profile"":false,""default_profile_image"":false,""can_media_tag"":false,""followed_by"":false,""following"":false,""follow_request_sent"":false,""notifications"":false,""translator_type"":""none""},""geo"":null,""coordinates"":null,""place"":null,""contributors"":null,""is_quote_status"":false,""retweet_count"":16,""favorite_count"":27,""favorited"":false,""retweeted"":false,""possibly_sensitive"":false,""lang"":""en""}",0,0,false,false,false,"en"
Starting from
{
"created_at": "Mon Aug 13 10:40:34 +0000 2018",
"id": 1028954459110555600,
"id_str": "1028954459110555649",
"full_text": "Oh well, they deal with it quite well. Like they add numbers and facts and such crazy stuff.\nhttps://REPLACED/DuBGmHCnG8\n#climatechange https://REPLACED/d5IBchM3Uk",
"truncated": false,
"display_text_range": [
0,
131
],
"entities": {
"hashtags": [
{
"text": "climatechange",
"indices": [
117,
131
]
}
],
"symbols": [],
"user_mentions": [],
"urls": [
{
"url": "https://REPLACED/DuBGmHCnG8",
"expanded_url": "https://tamino.wordpress.com/2018/08/08/usa-temperature-can-i-sucker-you/",
"display_url": "tamino.wordpress.com/2018/08/08/usa…",
"indices": [
93,
116
]
},
{
"url": "https://REPLACED/d5IBchM3Uk",
"expanded_url": "https://twitter.com/Tony__Heller/status/1028672939753758720",
"display_url": "twitter.com/Tony__Heller/s…",
"indices": [
132,
155
]
}
]
}
}
and running Miller (https://github.com/johnkerl/miller)
mlr --j2c unsparsify input.json >input.csv
you get this kind of output: https://gist.github.com/aborruso/6e0361923a3c45b9fe55ebf7590953de#file-output-csv
If you open it as raw text, you will see the line breaks inside the quoted field, and a spreadsheet reads it properly.
So as long as the import process you use handles quoted fields correctly, the \n is not a problem.
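The same behaviour can be sketched in Python. This is only an illustration of why the line breaks are harmless, not what Miller does internally: the standard csv module quotes any field that contains a line break, and a CSV-aware reader puts the field back together. The function name jsonl_to_csv and the two sample records are made up for this example.

```python
import csv
import io
import json

def jsonl_to_csv(jsonl_text, fields):
    """Convert JSONL text to CSV text, keeping only the given top-level fields."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    for line in jsonl_text.split("\n"):
        if line.strip():
            writer.writerow(json.loads(line))
    return out.getvalue()

# Two made-up records; the first has forced new lines in full_text.
sample = "\n".join([
    json.dumps({"id_str": "1", "full_text": "first line\nsecond line\n#tag"}),
    json.dumps({"id_str": "2", "full_text": "single line"}),
])

csv_text = jsonl_to_csv(sample, ["id_str", "full_text"])

# A CSV-aware reader sees one row per tweet, even though the raw text
# spans more physical lines than there are tweets.
rows = list(csv.DictReader(io.StringIO(csv_text, newline="")))
```

The raw CSV text still contains the extra physical lines, but they sit inside a quoted field, so any importer that follows the usual quoting rules treats them as part of the value.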
I have no experience with regex and need help parsing the value for the key "access_token" from the output below. The value will later be passed as a variable to another function.
So basically the regex should only fetch
eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJodHRwOi8vc2NoZW1hcy54bWxzb2FwLm9yZy93cy8yMDA1LzA1L2lkZW50aXR5L2NsYWltcy9uYW1lIjoiaXRwcm9kbW9uaXRvciIsImh0dHA6Ly9zY2hlbWFzLnhtbHNvYXAub3JnL3dzLzIwMDUvMDUvaWRlbnRpdHkvY2xhaW1zL2dpdmVubmFtZSI6IkxPS0VTSCBEVVJBSVJBSiIsImh0dHA6Ly9zY2hlbWFzLm1pY3Jvc29mdC5jb20vd3MvMjAwOC8wNi9pZGVudGl0eS9jbGFpbXMvcm9sZSI6IlNEQ1NfSEVMUERFU0siLCJQZXJtaXNzaW9ucyI6IjkwMzAsOTAwMCw5MDI4LDkwMjcsOTAyNiw5MDI1LDkwMjQsOTAyMyw5MDIyLDkwMjEsOTAyMCw5MDE5LDkwMTgsOTAxMyw5MDEyLDkwMTEsOTAxMCw5MDA5LDkwMDgsOTAwNyw5MDA2LDkwMDUsOTAwNCw5MDAzLDkwMDIsOTAwMSw5MDI5IiwiYWlycG9ydHMiOiJTWVoiLCJjbGllbnRJUCI6IjEwLjExMS4xLjEiLCJlbnYiOiJQUk9EIiwicmVzQ2hhbm5lbElEIjoiMTkiLCJpc0FwdENsbnQiOiJUcnVlIiwiY2hrTGNuIjoiQWlycG9ydCIsIm5iZiI6MTUzODkxNDY0OSwiZXhwIjoxNTM4OTE3NjQ5LCJpc3MiOiJmbHlkdWJhaS5jb20iLCJhdWQiOiIxNDEyMDAxIn0.qYID1b5lMjFhn7fTcSX5v6K6z2YpGJwAvE4gQfVrhxo
Here is the output of my POST request:
{
"access_token": "eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJodHRwOi8vc2NoZW1hcy54bWxzb2FwLm9yZy93cy8yMDA1LzA1L2lkZW50aXR5L2NsYWltcy9uYW1lIjoiaXRwcm9kbW9uaXRvciIsImh0dHA6Ly9zY2hlbWFzLnhtbHNvYXAub3JnL3dzLzIwMDUvMDUvaWRlbnRpdHkvY2xhaW1zL2dpdmVubmFtZSI6IkxPS0VTSCBEVVJBSVJBSiIsImh0dHA6Ly9zY2hlbWFzLm1pY3Jvc29mdC5jb20vd3MvMjAwOC8wNi9pZGVudGl0eS9jbGFpbXMvcm9sZSI6IlNEQ1NfSEVMUERFU0siLCJQZXJtaXNzaW9ucyI6IjkwMzAsOTAwMCw5MDI4LDkwMjcsOTAyNiw5MDI1LDkwMjQsOTAyMyw5MDIyLDkwMjEsOTAyMCw5MDE5LDkwMTgsOTAxMyw5MDEyLDkwMTEsOTAxMCw5MDA5LDkwMDgsOTAwNyw5MDA2LDkwMDUsOTAwNCw5MDAzLDkwMDIsOTAwMSw5MDI5IiwiYWlycG9ydHMiOiJTWVoiLCJjbGllbnRJUCI6IjEwLjExMS4xLjEiLCJlbnYiOiJQUk9EIiwicmVzQ2hhbm5lbElEIjoiMTkiLCJpc0FwdENsbnQiOiJUcnVlIiwiY2hrTGNuIjoiQWlycG9ydCIsIm5iZiI6MTUzODkxNDY0OSwiZXhwIjoxNTM4OTE3NjQ5LCJpc3MiOiJmbHlkdWJhaS5jb20iLCJhdWQiOiIxNDEyMDAxIn0.qYID1b5lMjFhn7fTcSX5v6K6z2YpGJwAvE4gQfVrhxo",
"token_type": "bearer",
"expires_in": 2999,
"refresh_token": "65be41084c0b4adeaeec9725cb2e6240",
"audience": "1412001",
"displayName": "Lokesh",
"userId": "testuser78",
"rolesandpermission": "HELPDESK:9030,9000,9028,9027,9026,9025,9024,9023,9022,9021,9020,9019,9018,9013,9012,9011,9010,9009,9008,9007,9006,9005,9004,9003,9002,9001,9029",
"resChannelID": "19",
"Client": "FYC",
"isAptClnt": "True",
"scope": "apt:FYC env:PROD role:HELPDESK",
".issued": "Sun, 07 Oct 2018 12:17:29 GMT",
".expires": "Sun, 07 Oct 2018 13:07:29 GMT"
}
A simple PCRE pattern looks like this:
\"access_token\":.\"(.+?)\"
You will get your token in the first captured group.
You can practice with this regex on this website:
https://regex101.com/
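A minimal sketch of applying this pattern in Python with the standard re module. The response body here is a shortened, made-up stand-in for the real output, and \s* is used in place of the single . so that any amount of whitespace after the colon is tolerated (that change is my own adjustment). Since the response is valid JSON, json.loads is shown alongside as an alternative that needs no regex at all.

```python
import json
import re

# Shortened, made-up response body standing in for the real POST output.
response_text = """{
  "access_token": "aaa.bbb.ccc",
  "token_type": "bearer",
  "expires_in": 2999
}"""

# Group 1 captures everything between the quotes that follow the key.
pattern = re.compile(r'"access_token":\s*"(.+?)"')
match = pattern.search(response_text)
token_from_regex = match.group(1) if match else None

# Alternative: parse the JSON and read the key directly.
token_from_json = json.loads(response_text)["access_token"]
```

Both variables end up holding the same string, which can then be passed on to the next function.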
Consider a social network. It has posts. For the feed, you request /feed and get the list of posts.
In the UI, there are things to show for a post, like if the user liked the post or not, if the user starred it or not, etc. These things don't look like they belong inside the post object.
Another case is when you fetch the likes. The frontend needs to know if the user in each 'like' object is being followed or not.
Where to put this info in the response JSON?
It depends on your application and which data you want to show to the user. For example, consider listing a user's feed. In that feed, you want to show:
Message
Liked by the current user or not (I don't know the difference between liked and starred)
Number of likes
List of users who liked it
Shared by the user or not
Share count
List of users who shared it
etc.
In the above list,
Some data needs a second API call to get the complete info and some does not. Examples are "List of liked users" and "List of shared users". This is generally dynamic data. You should serve those details from a separate API, for better server performance and also data integrity.
In some cases, an app needs a sneak peek of the users who liked or shared a post on the listing page. In that case, you can include details for a small fixed number of users in the /feeds response itself and offer a "See More" option (like Facebook) in the UI.
Static singular data (single-column data) can be listed in the initial GET /feeds response itself.
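A rough sketch of the "sneak peek" idea: the feed item carries the counts, a flag for the current user, and a small fixed number of user names, while the full list stays behind a separate request. The field names (likedByCurrentUser, likedPreview, hasMoreLikes and so on) and PREVIEW_SIZE are hypothetical, chosen only for illustration.

```python
PREVIEW_SIZE = 3  # hypothetical cap on users embedded in the feed item

def build_feed_item(message, liked_users, shared_users, current_user):
    """Build one feed entry with counts, flags, and a short user preview."""
    return {
        "message": message,
        "likedByCurrentUser": current_user in liked_users,
        "likeCount": len(liked_users),
        # Only the first few users are embedded; the rest need a second call.
        "likedPreview": liked_users[:PREVIEW_SIZE],
        "hasMoreLikes": len(liked_users) > PREVIEW_SIZE,
        "sharedByCurrentUser": current_user in shared_users,
        "shareCount": len(shared_users),
    }

item = build_feed_item(
    "hello",
    liked_users=["John", "James", "Jessica", "Mary"],
    shared_users=["Mary"],
    current_user="Mary",
)
```

The UI can show the preview immediately and use hasMoreLikes to decide whether to render the "See More" link.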
I wonder why you don't follow the same style as Twitter's search tweets response:
https://dev.twitter.com/rest/reference/get/search/tweets
{
"coordinates": null,
"favorited": false,
"truncated": false,
"created_at": "Fri Sep 21 23:40:54 +0000 2012",
"id_str": "249292149810667520",
"entities": {
"urls": [
],
"hashtags": [
{
"text": "FreeBandNames",
"indices": [
20,
34
]
}
],
"user_mentions": [
]
},
"in_reply_to_user_id_str": null,
"contributors": null,
"text": "Thee Namaste Nerdz. #FreeBandNames",
"metadata": {
"iso_language_code": "pl",
"result_type": "recent"
},
"retweet_count": 0,
"in_reply_to_status_id_str": null,
"id": 249292149810667520,
"geo": null,
"retweeted": false,
"in_reply_to_user_id": null,
"place": null,
"user":
{
"profile_sidebar_fill_color": "DDFFCC",
"profile_sidebar_border_color": "BDDCAD",
"profile_background_tile": true,
"name": "Chaz Martenstein",
"profile_image_url": "http://a0.twimg.com/profile_images/447958234/Lichtenstein_normal.jpg",
"created_at": "Tue Apr 07 19:05:07 +0000 2009",
"location": "Durham, NC",
"follow_request_sent": null,
"profile_link_color": "0084B4",
"is_translator": false,
"id_str": "29516238",
"entities": {
"url": {
"urls": [
{
"expanded_url": null,
"url": "http://bullcityrecords.com/wnng/",
"indices": [
0,
32
]
}
]
},
"description": {
"urls": [
]
}
},
"default_profile": false,
"contributors_enabled": false,
"favourites_count": 8,
"url": "http://bullcityrecords.com/wnng/",
"profile_image_url_https": "https://si0.twimg.com/profile_images/447958234/Lichtenstein_normal.jpg",
"utc_offset": -18000,
"id": 29516238,
"profile_use_background_image": true,
"listed_count": 118,
"profile_text_color": "333333",
"lang": "en",
"followers_count": 2052,
"protected": false,
"notifications": null,
"profile_background_image_url_https": "https://si0.twimg.com/profile_background_images/9423277/background_tile.bmp",
"profile_background_color": "9AE4E8",
"verified": false,
"geo_enabled": false,
"time_zone": "Eastern Time (US & Canada)",
"description": "You will come to Durham, North Carolina. I will sell you some records then, here in Durham, North Carolina. Fun will happen.",
"default_profile_image": false,
"profile_background_image_url": "http://a0.twimg.com/profile_background_images/9423277/background_tile.bmp",
"statuses_count": 7579,
"friends_count": 348,
"following": null,
"show_all_inline_media": true,
"screen_name": "bullcityrecords"
},
"in_reply_to_screen_name": null,
"source": "web",
"in_reply_to_status_id": null
}
You have two options:
Make a separate API method for getting user context data - /api/users/1/feeds/1. Note that this option forces you to send one request per feed. So if you have 1000 feeds, you will make 1000 + 1 requests (the so-called N+1 problem).
As for me, it's not a good idea.
You can store the user data in the JSON, for example this way:
{
"feedName": "feed1",
...
"currentUser": {
"liked": true,
"starred": true
}
}
By using this option you will avoid the N+1 requests problem in your RESTful service.
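A minimal server-side sketch of this second option, assuming the current user's liked and starred feed ids can each be fetched in one bulk lookup. The function name attach_user_context and the id field are hypothetical; only feedName and currentUser come from the JSON above.

```python
def attach_user_context(feeds, liked_ids, starred_ids):
    """Return copies of the feeds with a per-user "currentUser" object.

    liked_ids and starred_ids are sets of feed ids for the current user,
    assumed to come from one bulk lookup each rather than one per feed.
    """
    return [
        {
            **feed,
            "currentUser": {
                "liked": feed["id"] in liked_ids,
                "starred": feed["id"] in starred_ids,
            },
        }
        for feed in feeds
    ]

feeds = [{"id": 1, "feedName": "feed1"}, {"id": 2, "feedName": "feed2"}]
response = attach_user_context(feeds, liked_ids={1}, starred_ids={1, 2})
```

The client receives everything in the single /feed response, and the stored feed objects themselves are left unmodified.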
For all users, the post resource should be the same. Adding user-specific context info inside it seems like polluting it.
I can see where you're coming from and I quite agree.
Ivan's first solution should not be used, as he already mentioned. His second is better, but if you GET the posts JSON, which should contain only post objects, there is also this currentUser that doesn't really belong there.
My suggestion is that for each post you keep track of which users have liked and/or starred it, etc. Then you keep a clean structure while still having the info you need available in the same request/response.
Example
GET /feed HTTP/1.1
[
{
"text": "hello world, im a post!",
"author": "Jack",
"likes": 3,
"likedBy": [
"John",
"James",
"Jessica"
],
"stars": 2,
"starredBy": [
"John",
"Mary"
]
},
{
"text": "hello world, im also a post! :D",
"author": "Mary",
"likes": 1,
"likedBy": [
"James"
],
"stars": 0,
"starredBy": [
]
}
]
Where each {} object represents a post object.
On the client side, you could then check if the likedBy list contains the currently logged in user and proceed with the result as you see fit. Same for stars and any other of these properties a post might have.
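The client-side check could be sketched like this. The function name annotate_for_user and the likedByMe / starredByMe flags are made up for illustration; the feed data is the example response above.

```python
def annotate_for_user(posts, username):
    """Add client-side flags telling whether username liked or starred each post."""
    return [
        {
            **post,
            "likedByMe": username in post["likedBy"],
            "starredByMe": username in post["starredBy"],
        }
        for post in posts
    ]

# The example /feed response from above, as already-parsed JSON.
feed = [
    {
        "text": "hello world, im a post!",
        "author": "Jack",
        "likes": 3,
        "likedBy": ["John", "James", "Jessica"],
        "stars": 2,
        "starredBy": ["John", "Mary"],
    },
    {
        "text": "hello world, im also a post! :D",
        "author": "Mary",
        "likes": 1,
        "likedBy": ["James"],
        "stars": 0,
        "starredBy": [],
    },
]

view = annotate_for_user(feed, "James")
```

The flags exist only in the client's view model, so the post resource returned by the server stays the same for every user.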