I have two files of JSON objects in two different locations on my HDFS, and I need to join them on a common field.
The first file consists of tweet data and has 34 fields (I literally counted). It looks like:
{"contributors": null, "truncated": false, "text": "US Bank Loans And credit card capitol one business", "avl_brand_all": ["US Bank"], "is_quote_status": false , "in_reply_to_status_id": null, "id": 770150015968825344, "favorite_count": 0, "avl_num_sentences": 1, "source": "<a href=\"http://twitter.com\" rel=\"nofollow\">Twitter Web Client</ a>", "retweeted": false, "coordinates": null, "entities": {"symbols": [], "user_mentions": [], "hashtags": [], "urls": [{"url": "<link>": [51, 74], "expand ed_url": "http://usbanklogins.com/bank/", "display_url": "usbanklogins.com/bank/"}]}, "in_reply_to_screen_name": null, "in_reply_to_user_id": null, "avl_word_tags": [{"distance": 1, " word": "u", "pos": "OTHER"}, {"distance": 1, "word": "bank", "pos": "NOUN"}, {"distance": 1, "word": "loan", "pos": "NOUN"}, {"distance": 1, "word": "credit", "pos": "NOUN"}, {"distan ce": 1, "word": "card", "pos": "NOUN"}, {"distance": 1, "word": "capitol", "pos": "VERB"}, {"distance": 1, "word": "one", "pos": "OTHER"}, {"distance": 1, "word": "business", "pos": " NOUN"}], "avl_brand_1": "US Bank", "retweet_count": 0, "avl_lexicon_text": "us bank loans and credit card capitol one business", "id_str": "770150015968825344", "favorited": false, "a vl_sentences": ["us bank loans and credit card capitol one business"], "user": {"follow_request_sent": false, "has_extended_profile": false, "profile_use_background_image": true, "id" : 485610502, "verified": false, "profile_text_color": "0C3E53", "profile_image_url_https": "<link>", "profile _sidebar_fill_color": "FFF7CC", "geo_enabled": false, "entities": {"url": {"urls": [{"url": "link", "indices": [0, 22], "expanded_url": "http://www.seowithme.com", " display_url": "seowithme.com"}]}, "description": {"urls": []}}, "followers_count": 347, "profile_sidebar_border_color": "F2E195", "location": "", "default_profile_image": false, "id_s tr": "485610502", "is_translation_enabled": false, "utc_offset": null, "statuses_count": 117, "description": "seowithme", "friends_count": 959, "profile_link_color": "FF0000", "profil e_image_url": "http://pbs.twimg.com/profile_images/2334489262/qyznw08zjrgv3vlxtdvt_normal.jpeg", "notifications": false, "profile_background_image_url_https": "https://abs.twimg.com/i mages/themes/theme12/bg.gif", "profile_background_color": "BADFCD", "profile_background_image_url": "http://abs.twimg.com/images/themes/theme12/bg.gif", "screen_name": "sajanshrestha2 2", "lang": "en", "following": false, "profile_background_tile": false, "favourites_count": 2, "name": "sajan shrestha", "url": "<link>", "created_at": "Tue Feb 07 11: 40:39 +0000 2012", "contributors_enabled": false, "time_zone": null, "protected": false, "default_profile": false, "is_translator": false, "listed_count": 0}, "avl_num_paragraphs": 1, "geo": null, "in_reply_to_user_id_str": null, "possibly_sensitive": false, "lang": "en", "created_at": "Mon Aug 29 06:44:07 +0000 2016", "avl_source": "individual", "in_reply_to_stat us_id_str": null, "place": null, "metadata": {"iso_language_code": "en", "result_type": "recent"}, "avl_num_words": 8}
The second file has JSON objects with only two fields each. It looks like:
{"avl_syntaxnet_tags": [{"pos_tag": "PRP", "position": "1", "dep_rel": "dep", "parent": "3", "word": "us"}, {"pos_tag": "NN", "position": "2", "dep_rel": "nn", "parent": "3", "word": "bank"}, {"pos_tag": "NNS", "position": "3", "dep_rel": "nsubj", "parent": "7", "word": "loans"}, {"pos_tag": "CC", "position": "4", "dep_rel": "cc", "parent": "3", "word": "and"}, {" pos_tag": "NN", "position": "5", "dep_rel": "nn", "parent": "6", "word": "credit"}, {"pos_tag": "NN", "position": "6", "dep_rel": "conj", "parent": "3", "word": "card"}, {"pos_tag": " VBP", "position": "7", "dep_rel": "ROOT", "parent": "0", "word": "capitol"}, {"pos_tag": "CD", "position": "8", "dep_rel": "num", "parent": "9", "word": "one"}, {"pos_tag": "NN", "pos ition": "9", "dep_rel": "dobj", "parent": "7", "word": "business"}], "avl_lexicon_text": "us bank loans and credit card capitol one business"}
Now, there is a common field in both JSON objects named avl_lexicon_text, and I want to join the two objects on that common field.
I wrote the following Pig script for the join:
a = LOAD file1 as (a1, a2);
b = LOAD file2 as (b1, b2, b3, b4, b5, b6, b7, b8, b9, b10, b11, b12, b13, b14, b15, b16, b17, b18, b19, b20, b21, b22, b23, b24, b25, b26, b27, b28, b29, b30, b31, b32, b33, b34);
x = JOIN b BY b19 FULL, a BY a2;
STORE x INTO '$SYNTAXNET_OUTPUT';
I checked that b19 is the avl_lexicon_text field in b and that a2 is the same field in a. The results I get are really weird: when I dump x, I do not get new JSON objects containing all the fields from a and b; I just get all the objects in b followed by all the objects in a.
Can someone suggest the right way to do this?
EDIT: Also, is there a way I can do this without specifying the schema? If the format of either file changes in the future (a new field gets added or an existing field gets deleted), I do not want to have to change the Pig script. Is there a way I can do the JOIN by referencing the field name rather than the field position? Thanks!
The behavior is expected, since you have specified a FULL outer join.
Remove FULL to get only the matching records. See here for details on FULL outer joins.
x = JOIN b BY b19, a BY a2;
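Regarding the EDIT: one common approach, assuming you can use the elephant-bird JSON loader (a third-party library, not part of core Pig), is to load each record as a single map and join on the key name instead of a position, so the script does not depend on field order. A minimal sketch with placeholder file paths:
a = LOAD 'file1' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json: map[]);
b = LOAD 'file2' USING com.twitter.elephantbird.pig.load.JsonLoader('-nestedLoad') AS (json: map[]);
x = JOIN b BY json#'avl_lexicon_text', a BY json#'avl_lexicon_text';
Because each record is a map, adding or removing other fields in the input should not require changing the script.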
Hoping for some pointers here, as I'm striking out with my attempts. I'm working with Python 3.8.5.
I'm querying a car park booking system that returns a JSON list of availability. The result has lots of nested dictionaries, and I'm struggling to extract just the values I want.
This is what I've been doing:
import requests
import json
enquiry = requests.post(url.....) #queries api, this works fine
results = enquiry.text #extracts response data
dictionary = json.loads(results) #convert response to python dict
If there is one slot available, I get this output (apologies for the length). If there are multiple slots available, I get the same output repeated:
{
"data": {
"services": [{
"id": null,
"name": null,
"services": [{
"id": null,
"name": "Car Park",
"met": true,
"filterCount": 1,
"primary": true,
"options": [{
"id": "9",
"images": [],
"available": true,
"calendarId": "AAAA",
"templateId": "BBBB",
"capacity": 0,
"name": "Car Park Slot 3",
"sessionId": "CCCC",
"functions": null,
"startDate": "2020-08-18T13:30:00Z", <--this is what I want to extract
"endDate": "2020-08-18T14:30:00Z",
"geo": {
"lat": 0.0,
"lng": 0.0
},
"selected": false,
"linkedServices": [],
"tiers": null
}]
}],
"currentBookingId": null,
"startDate": {
"ms": 1597757400000,
"year": 2020,
"month": 8,
"day": 18,
"dayOfWeek": 2,
"time": {
"seconds": 0,
"minutes": 30,
"hours": 14,
"days": 0
}
},
"endDate": {
"ms": 1597761000000,
"year": 2020,
"month": 8,
"day": 18,
"dayOfWeek": 2,
"time": {
"seconds": 0,
"minutes": 30,
"hours": 15,
"days": 0
}
},
"sessionId": "2222222",
"chargeType": 1,
"hasPrimaryBookable": false,
"hasBookable": false,
"hasDiscounts": false,
"hasMultipleTiers": false,
"isPreferred": false,
"primaryServiceAvailable": true,
"primaryServiceId": null,
"primaryServiceType": "undefined",
"unavailableAttendees": []
}],
"bookingLimit": null
},
"success": true,
"suppress": false,
"version": "2.3.293",
"message": null,
"result": null,
"errors": null,
"code": null,
"flags": 0,
"redirect": null
}
I want to extract:
"startDate": "2020-08-18T13:30:00Z"
from each of the returned slots. However, I can't work it out.
Extracting the whole of this nested dictionary would also give me the same data, but would then involve more work to tidy it up afterwards:
"startDate": {
"ms": 1597757400000,
"year": 2020,
"month": 8,
"day": 18,
"dayOfWeek": 2,
"time": {
"seconds": 0,
"minutes": 30,
"hours": 14,
"days": 0
}
I've tried loads of dictionary.get and dictionary.items variations, but can't seem to get anywhere.
I tried something like
key = ('startDate')
availability = dictionary.get(key)
print(availability)
This just returns None, so I think I'm way off.
Any pointers?
Thanks in advance!
Thanks for the full data. It makes testing easier :)
To paste it into Python as a literal, I had to replace null -> None, true -> True, false -> False:
slot = {
"data": {
"services": [{
"id": None,
"name": None,
"services": [{
"id": None,
"name": "Car Park",
"met": True,
"filterCount": 1,
"primary": True,
"options": [{
"id": "9",
"images": [],
"available": True,
"calendarId": "AAAA",
"templateId": "BBBB",
"capacity": 0,
"name": "Car Park Slot 3",
"sessionId": "CCCC",
"functions": None,
"startDate": "2020-08-18T13:30:00Z", # <--this is what I want to extract
................
print("Start Date:", slot['data']['services'][0]['services'][0]['options'][0]['startDate'])
Output
Start Date: 2020-08-18T13:30:00Z
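If several slots come back, the same path can be walked with loops instead of hard-coded indexes. A minimal sketch, assuming the parsed response is in dictionary as in the question:
# collect the startDate of every available option in every returned slot
start_dates = []
for service in dictionary['data']['services']:
    for inner in service['services']:
        for option in inner['options']:
            if option.get('available'):  # keep only slots flagged as available
                start_dates.append(option['startDate'])
print(start_dates)  # e.g. ['2020-08-18T13:30:00Z']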
I'm trying to extract a specific value from a JSON file.
The key/value pair is: "info": "this is an example" (the key is unique).
I want to extract only the value: "this is an example"
My code:
cat 9.json | jq '.info'
result:
null
JSON file example:
{
"Event": {
"id": "13",
"orgc_id": "1",
"org_id": "1",
"date": "2019-01-09",
"threat_level_id": "3",
"info": "test9",
"published": false,
"uuid": "5c35d180",
"attribute_count": "2",
"analysis": "0",
"timestamp": "1547044733",
"distribution": "1",
"proposal_email_lock": false,
"locked": false,
"publish_timestamp": "1547034089",
"sharing_group_id": "0",
"disable_correlation": false,
"extends_uuid": "",
"event_creator_email": "o#cyhgfnt.com",
"Org": {
"id": "1",
"name": "Cygfdgfdnt",
"uuid": "5b9f938d-e3a0-4ecb-83b3-0bdeac1b41bc"
},
"Orgc": {
"id": "1",
"name": "Cyhgfgft",
"uuid": "5b9f938d-e3a0-4ecb-83b3-0bdeac1b41bc"
},
"Attribute": [{
"id": "292630",
"type": "domain",
"category": "Network activity",
"to_ids": true,
"uuid": "5c35dd94-cccc-4086-b386-682823717aa5",
"event_id": "1357",
"distribution": "5",
"timestamp": "1547034584",
"comment": "This is a comment",
"sharing_group_id": "0",
"deleted": false,
"disable_correlation": false,
"object_id": "0",
"object_relation": null,
"value": "dodskj.com",
"Galaxy": [],
"ShadowAttribute": [],
"Tag": [{
"id": "223",
"name": "kill-chain:Exploitation",
"colour": "#a80079",
"exportable": true,
"user_id": "0",
"hide_tag": false,
"numerical_value": null
}]
}, {
"id": "292631",
"type": "ip-dst",
"category": "Network activity",
"to_ids": true,
"uuid": "5c35dd94-fe90-4ef6-b3a9-682823717aa5",
"event_id": "1357",
"distribution": "5",
"timestamp": "1547044733",
"comment": "comment example",
"sharing_group_id": "0",
"deleted": false,
"disable_correlation": false,
"object_id": "0",
"object_relation": null,
"value": "8.8.6.6",
"Galaxy": [],
"ShadowAttribute": [],
"Tag": [{
"id": "247",
"name": "maec-malware-capabilities:maec-malware-capability=\"anti-removal\"",
"colour": "#3f0004",
"exportable": true,
"user_id": "0",
"hide_tag": false,
"numerical_value": null
}, {
"id": "465",
"name": "osint:lifetime=\"perpetual\"",
"colour": "#006ebe",
"exportable": true,
"user_id": "0",
"hide_tag": false,
"numerical_value": null
}]
}],
"ShadowAttribute": [],
"RelatedEvent": [],
"Galaxy": [{
"id": "3",
"uuid": "698774c7-8022-42c4-917f-8d6e4f06ada3",
"name": "Threat Actor",
"type": "threat-actor",
"description": "Threat actors are characteristics of malicious actors (or adversaries) representing a cyber attack threat including presumed intent and historically observed behaviour.",
"version": "3",
"icon": "user-secret",
"namespace": "misp",
"GalaxyCluster": [{
"id": "6397",
"collection_uuid": "7cdff317-a673-4474-84ec-4f1754947823",
"type": "threat-actor",
"value": "Sofacy",
"tag_name": "misp-galaxy:threat-actor=\"Sofacy\"",
"description": "The Sofacy Group (also known as APT28, Pawn Storm, Fancy Bear and Sednit) is a cyber espionage group believed to have ties to the Russian government. Likely operating since 2007, the group is known to target government, military, and security organizations. It has been characterized as an advanced persistent threat.",
"galaxy_id": "3",
"source": "MISP Project",
"authors": ["Alexandre Dulaunoy", "Florian Roth", "Thomas Schreck", "Timo Steffens", "Various"],
"version": "82",
"uuid": "5b4ee3ea-eee3-4c8e-8323-85ae32658754",
"tag_id": "608",
"meta": {
"cfr-suspected-state-sponsor": ["Russian Federation"],
"cfr-suspected-victims": ["Georgia", "France", "Jordan", "United States", "Hungary", "World Anti-Doping Agency", "Armenia", "Tajikistan", "Japan", "NATO", "Ukraine", "Belgium", "Pakistan", "Asia Pacific Economic Cooperation", "International Association of Athletics Federations", "Turkey", "Mongolia", "OSCE", "United Kingdom", "Germany", "Poland", "European Commission", "Afghanistan", "Kazakhstan", "China"],
"cfr-target-category": ["Government", "Military"],
"cfr-type-of-incident": ["Espionage"],
"country": ["RU"],
"refs": ["https:\/\/en.wikipedia.org\/wiki\/Sofacy_Group", "https:\/\/aptnotes.malwareconfig.com\/web\/viewer.html?file=..\/APTnotes\/2014\/apt28.pdf", "http:\/\/www.trendmicro.com\/cloud-content\/us\/pdfs\/security-intelligence\/white-papers\/wp-operation-pawn-storm.pdf", "https:\/\/www2.fireeye.com\/rs\/848-DID-242\/images\/wp-mandiant-matryoshka-mining.pdf", "https:\/\/www.crowdstrike.com\/blog\/bears-midst-intrusion-democratic-national-committee\/", "http:\/\/researchcenter.paloaltonetworks.com\/2016\/06\/unit42-new-sofacy-attacks-against-us-government-agency\/", "https:\/\/www.cfr.org\/interactive\/cyber-operations\/apt-28", "https:\/\/blogs.microsoft.com\/on-the-issues\/2018\/08\/20\/we-are-taking-new-steps-against-broadening-threats-to-democracy\/", "https:\/\/www.bleepingcomputer.com\/news\/security\/microsoft-disrupts-apt28-hacking-campaign-aimed-at-us-midterm-elections\/", "https:\/\/www.bleepingcomputer.com\/news\/security\/apt28-uses-lojax-first-uefi-rootkit-seen-in-the-wild\/"],
"synonyms": ["APT 28", "APT28", "Pawn Storm", "PawnStorm", "Fancy Bear", "Sednit", "TsarTeam", "Tsar Team", "TG-4127", "Group-4127", "STRONTIUM", "TAG_0700", "Swallowtail", "IRON TWILIGHT", "Group 74"]
}
}]
}],
"Object": [],
"Tag": [{
"id": "608",
"name": "misp-galaxy:threat-actor=\"Sofacy\"",
"colour": "#12e000",
"exportable": true,
"user_id": "0",
"hide_tag": false,
"numerical_value": null
}, {
"id": "118",
"name": "gdpr:special-categories=\"health\"",
"colour": "#3ce600",
"exportable": true,
"user_id": "0",
"hide_tag": false,
"numerical_value": null
}]
}
}
I suppose you are trying to get the .info field, which is nested inside .Event, so the filter should be written as below. Use the -r flag if you want the raw string without quotes.
jq '.Event.info'
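For example, against the sample file above:
jq -r '.Event.info' 9.json
prints
test9
without the surrounding quotes.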
I use jq 1.5 in a Windows environment to extract a single array ("Offers") from a given large JSON file:
'.Offers[] | ({Price: .AdultPriceEUR, Currency: .Currency, Link: .Deeplink, Tickettyp: .TicketClassIndex, Flightindex: .FlightIndex })'
After that I get an "unnamed" array, but for the later processing it is necessary that the array keeps its old "name". I checked the documentation and found the setpath function, but is there no easy way to keep the name during extraction?
Shortened example of the JSON file:
{"Airports": [
{
"Aliases": null,
"ContinentCode": "EU",
"ContinentGroup": 1,
"CountryCode": "DE",
"CountryName": "Germany",
"DST": "",
"DisplayName": "Hamburg (HAM) Germany",
"Iata": "HAM",
"IataLink": false,
"Icao": "EDDH",
"Latitude": 53.63215,
"Longitude": 10.0041609,
"MainCityCode": "HAM",
"MainCityDisplayName": "Hamburg (HAM) Germany",
"MainCityName": "Hamburg",
"Name": "Hamburg",
"Priority": 142,
"StateCode": null,
"StateName": null,
"TimeZone": -798214753
},
{
"Aliases": null,
"ContinentCode": "AS",
"ContinentGroup": 4,
"CountryCode": "TH",
"CountryName": "Thailand",
"DST": "",
"DisplayName": "Suvarnabhumi, Bangkok (BKK) Thailand",
"Iata": "BKK",
"IataLink": false,
"Icao": "VTBS",
"Latitude": 13.6922979,
"Longitude": 100.750694,
"MainCityCode": "BKK",
"MainCityDisplayName": "Bangkok (BKK) Thailand",
"MainCityName": "Bangkok",
"Name": "Suvarnabhumi",
"Priority": 1462,
"StateCode": null,
"StateName": null,
"TimeZone": -640089798
}], "Offers": [
{
"AdultPrice": 2977.6,
"AdultPriceEUR": 2977.6,
"AdultPriceExclTax": 0.0,
"Currency": "EUR",
"FeeIndexes": [
0,
1,
2,
3,
4,
5,
6
],
"FlightIndex": 0,
"IsPaymentIncluded": true,
"MobileDeepLink": null,
"PaymentMethods": [
"American Express",
"Diners Club",
"MasterCard Credit",
"MasterCard Debit",
"Paypal",
"Visa Credit",
"Visa Debit"
],
"Score": 2501.3,
"SegmentFares": null,
"SegmentKey": -1,
"TicketClassIndex": 1,
"TotalIsCalculated": false,
"TotalPrice": 2977.6,
"TotalPriceEUR": 2977.6,
"TotalPriceExclTax": 0.0
},
{
"AdultPrice": 4697.27,
"AdultPriceEUR": 4697.27,
"AdultPriceExclTax": 0.0,
"Currency": "EUR",
"FeeIndexes": [
0,
1,
2,
3,
4,
7,
8,
5,
6
],
"FlightIndex": 1,
"IsPaymentIncluded": true,
"MobileDeepLink": null,
"PaymentMethods": [
"American Express",
"Diners Club",
"MasterCard Credit",
"MasterCard Debit",
"Paypal",
"Sofortüberweisung",
"Überweisung",
"Visa Credit",
"Visa Debit"
],
"Score": 3438.64,
"SegmentFares": null,
"SegmentKey": -1,
"TicketClassIndex": 1,
"TotalIsCalculated": false,
"TotalPrice": 4697.27,
"TotalPriceEUR": 4697.27,
"TotalPriceExclTax": 0.0
}]
}
thanks
BR
Timo
Looks like you're looking for this:
jq '{Offers:[.Offers[] | {Price: .AdultPriceEUR, Currency: .Currency, Link: .Deeplink, Tickettyp: .TicketClassIndex, Flightindex: .FlightIndex }]}' file
It just creates a new object containing an Offers array with the content you want in it.
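With the shortened sample above, the result should look roughly like this (Link comes out as null because the sample offers have no Deeplink field):
{
  "Offers": [
    {
      "Price": 2977.6,
      "Currency": "EUR",
      "Link": null,
      "Tickettyp": 1,
      "Flightindex": 0
    },
    {
      "Price": 4697.27,
      "Currency": "EUR",
      "Link": null,
      "Tickettyp": 1,
      "Flightindex": 1
    }
  ]
}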
I have an extract of tweets in JSON format. I have attached a sample of the data. I need to convert this JSON into a data frame.
So far I managed to convert it using the "jsonlite" package:
json_data <- jsonlite::stream_in(file("myjsonfile.txt"))
But it does not load all the information contained in the tweets. For example, I only see the user who retweeted, but not who posted the original tweet.
You can view the JSON file more easily by pasting it into this website and selecting Format: http://jsonviewer.stack.hu/
The data comes from the Twitter API (more information on this data is available here: https://dev.twitter.com/overview/api/tweets).
Thank you in advance for your time and help.
ML_Enthousiast
{"favorited": false, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "truncated": false, "in_reply_to_user_id_str": null, "coordinates": null, "retweeted": false, "text": "RT #Antoniotalks: Revenue streams for #OpenData companies!\n#Cloud #StartUp #SMM #AI #IoT #Fintech #BigData #deeplearning #Mpgvip\u2026 ", "retweet_count": 0, "filter_level": "low", "created_at": "Thu Jun 29 18:47:18 +0000 2017", "favorite_count": 0, "retweeted_status": {"favorited": false, "in_reply_to_status_id_str": null, "in_reply_to_user_id": null, "display_text_range": [0, 140], "truncated": true, "in_reply_to_user_id_str": null, "coordinates": null, "retweeted": false, "text": "Revenue streams for #OpenData companies!\n#Cloud #StartUp #SMM #AI #IoT #Fintech #BigData #deeplearning #Mpgvip\u2026 ", "retweet_count": 38, "filter_level": "low", "created_at": "Wed Jun 28 12:45:08 +0000 2017", "favorite_count": 48, "in_reply_to_screen_name": null, "extended_tweet": {"extended_entities": {"media": [{"media_url_https": "", "sizes": {"thumb": {"w": 150, "h": 150, "resize": "crop"}, "large": {"w": 1200, "h": 927, "resize": "fit"}, "medium": {"w": 1200, "h": 927, "resize": "fit"}, "small": {"w": 680, "h": 525, "resize": "fit"}}, "type": "photo", "expanded_url": "", "id": 880044388679901184, "media_url": "http://pbs.twimg.com/media/DDaLtXXXYAAI2eM.jpg", "id_str": "880044388679901184", "display_url": "pic.twitter.com/aw9HeukUYv", "indices": [139, 162], "url": ""}]}, "full_text": "Revenue streams for #OpenData companies!\n#Cloud #StartUp #SMM #AI #IoT #Fintech #BigData #deeplearning #Mpgvip #defstar5 #DataScience #CIO ", "entities": {"user_mentions": [], "hashtags": [{"text": "OpenData", "indices": [20, 29]}, {"text": "Cloud", "indices": [41, 47]}, {"text": "StartUp", "indices": [48, 56]}, {"text": "SMM", "indices": [57, 61]}, {"text": "AI", "indices": [62, 65]}, {"text": "IoT", "indices": [66, 70]}, {"text": "Fintech", "indices": [71, 79]}, {"text": "BigData", "indices": [80, 88]}, {"text": "deeplearning", "indices": [89, 102]}, {"text": "Mpgvip", "indices": [103, 110]}, {"text": "defstar5", "indices": [111, 120]}, {"text": "DataScience", "indices": [121, 133]}, {"text": "CIO", "indices": [134, 138]}], "media": [{"media_url_https": "", "sizes": {"thumb": {"w": 150, "h": 150, "resize": "crop"}, "large": {"w": 1200, "h": 927, "resize": "fit"}, "medium": {"w": 1200, "h": 927, "resize": "fit"}, "small": {"w": 680, "h": 525, "resize": "fit"}}, "type": "photo", "expanded_url": "", "id": 880044388679901184, "media_url": "", "id_str": "880044388679901184", "display_url": "pic.twitter.com/aw9HeukUYv", "indices": [139, 162], "url": ""}], "symbols": [], "urls": []}, "display_text_range": [0, 138]}, "in_reply_to_status_id": null, "source": "Buffer", "id_str": "880044392110796800", "entities": {"user_mentions": [], "hashtags": [{"text": "OpenData", "indices": [20, 29]}, {"text": "Cloud", "indices": [41, 47]}, {"text": "StartUp", "indices": [48, 56]}, {"text": "SMM", "indices": [57, 61]}, {"text": "AI", "indices": [62, 65]}, {"text": "IoT", "indices": [66, 70]}, {"text": "Fintech", "indices": [71, 79]}, {"text": "BigData", "indices": [80, 88]}, {"text": "deeplearning", "indices": [89, 102]}, {"text": "Mpgvip", "indices": [103, 110]}], "symbols": [], "urls": [{"display_url": "twitter.com/i/web/status/8\u2026", "indices": [112, 135], "expanded_url": "", "url": "8H"}]}, "lang": "en", "id": 880044392110796800, "is_quote_status": false, "geo": null, "user": {"screen_name": "Antoniotalks", "profile_background_image_url": "", 
"profile_image_url": "jpg", "follow_request_sent": null, "profile_background_tile": false, "id": 2445890839, "is_translator": false, "description": "A father & CEO of Recruitd (#imrecruitd). Helping companies magnify their #employer and #recruitment #brand and #jobseekers with the #skillstosucceed.", "listed_count": 198, "favourites_count": 398, "created_at": "Tue Apr 15 19:13:52 +0000 2014", "notifications": null, "profile_background_image_url_https": "https://abs.twimg.com/images/themes/theme1/bg.png", "contributors_enabled": false, "profile_background_color": "C0DEED", "following": null, "friends_count": 6792, "protected": false, "default_profile": true, "profile_use_background_image": true, "name": "Antonio Giugno", "location": "London, England", "geo_enabled": true, "id_str": "2445890839", "utc_offset": -25200, "profile_banner_url": "0", "profile_text_color": "333333", "lang": "en-gb", "statuses_count": 4058, "profile_sidebar_fill_color": "DDEEF6", "default_profile_image": false, "profile_image_url_https": "4433/dVeGYfTX_normal.jpg", "profile_link_color": "1DA1F2", "url": "rnontubein", "verified": false, "profile_sidebar_border_color": "C0DEED", "followers_count": 6323, "time_zone": "Pacific Time (US & Canada)"}, "contributors": null, "possibly_sensitive": false, "place": null}, "in_reply_to_screen_name": null, "timestamp_ms": "1498762038396", "in_reply_to_status_id": null, "source": "Mobile Web (M2)", "id_str": "880497923150286848", "entities": {"user_mentions": [{"screen_name": "Antoniotalks", "id": 2445890839, "id_str": "2445890839", "name": "Antonio Giugno", "indices": [3, 16]}], "hashtags": [{"text": "OpenData", "indices": [38, 47]}, {"text": "Cloud", "indices": [59, 65]}, {"text": "StartUp", "indices": [66, 74]}, {"text": "SMM", "indices": [75, 79]}, {"text": "AI", "indices": [80, 83]}, {"text": "IoT", "indices": [84, 88]}, {"text": "Fintech", "indices": [89, 97]}, {"text": "BigData", "indices": [98, 106]}, {"text": "deeplearning", "indices": [107, 120]}, {"text": "Mpgvip", "indices": [121, 128]}], "symbols": [], "urls": [{"indices": [130, 130], "expanded_url": null, "url": ""}]}, "lang": "en", "id": 880497923150286848, "is_quote_status": false, "geo": null, "user": {"screen_name": "henrymbuguak", "profile_background_image_url": "://abs.twimg.com/images/themes/theme3/bg.gif", "profile_image_url": "://pbs.twimg.com/profile_images/822772556818239489/0yTbHCGj_normal.jpg", "follow_request_sent": null, "profile_background_tile": false, "id": 310697279, "is_translator": false, "description": "I enjoy coding. 
Visit my github project: :// ://github.com/henrymbuguak", "listed_count": 62, "favourites_count": 978, "created_at": "Sat Jun 04 05:55:09 +0000 2011", "notifications": null, "profile_background_image_url_https": "://abs.twimg.com/images/themes/theme3/bg.gif", "contributors_enabled": false, "profile_background_color": "EDECE9", "following": null, "friends_count": 2540, "protected": false, "default_profile": false, "profile_use_background_image": true, "name": "kiarie henry mbugua", "location": "Njoro, Kenya.", "geo_enabled": false, "id_str": "310697279", "utc_offset": 10800, "profile_banner_url": "://pbs.twimg.com/profile_banners/310697279/1484999353", "profile_text_color": "634047", "lang": "en", "statuses_count": 3775, "profile_sidebar_fill_color": "E3E2DE", "default_profile_image": false, "profile_image_url_https": "//pbs.twimg.com/profile_images/822772556818239489/0yTbHCGj_normal.jpg", "profile_link_color": "088253", "url": null, "verified": false, "profile_sidebar_border_color": "D3D2CF", "followers_count": 2141, "time_zone": "Nairobi"}, "contributors": null, "place": null}
If I read in your data using
indata <- jsonlite::read_json("myjsonfile.json")
then I get all the information contained in the JSON file. It is a nested list, so you may need to extract the information you want from one of the elements in the list:
> names(indata)
[1] "favorited" "in_reply_to_status_id_str"
[3] "in_reply_to_user_id" "truncated"
[5] "in_reply_to_user_id_str" "coordinates"
[7] "retweeted" "text"
[9] "retweet_count" "filter_level"
[11] "created_at" "favorite_count"
[13] "retweeted_status" "in_reply_to_screen_name"
[15] "timestamp_ms" "in_reply_to_status_id"
[17] "source" "id_str"
[19] "entities" "lang"
[21] "id" "is_quote_status"
[23] "geo" "user"
[25] "contributors" "place"
The information about the user, for example, looks like this (only a part is shown):
> indata$user
$screen_name
[1] "henrymbuguak"
$profile_background_image_url
[1] "://abs.twimg.com/images/themes/theme3/bg.gif"
$profile_image_url
[1] "://pbs.twimg.com/profile_images/822772556818239489/0yTbHCGj_normal.jpg"
$follow_request_sent
NULL
$profile_background_tile
[1] FALSE
$id
[1] 310697279
So you can get the user with indata$user$screen_name.
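If you then want a flat data frame, a minimal sketch along these lines should work (field names taken from your sample; retweeted_status holds the original tweet, so its user is the original poster):
df <- data.frame(
  retweeter        = indata$user$screen_name,
  original_poster  = indata$retweeted_status$user$screen_name,
  text             = indata$text,
  created_at       = indata$created_at,
  stringsAsFactors = FALSE
)
For a file with many tweets you would build one such row per parsed tweet and bind the rows together.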
I have an employers array as below; how do I get the employers id and the featuredReview id using a JSON expression?
"employers": [
{
"id": 194,
"name": "Target",
"website": "www.target.com",
"isEEP": false,
"exactMatch": false,
"industry": "Department, Clothing, & Shoe Stores",
"numberOfRatings": 11531,
"squareLogo": "http://media.glassdoor.com/sqll/194/target-squarelogo.png",
"overallRating": 3.2,
"ratingDescription": "OK",
"cultureAndValuesRating": "3.3",
"seniorLeadershipRating": "2.8",
"compensationAndBenefitsRating": "3.0",
"careerOpportunitiesRating": "3.0",
"workLifeBalanceRating": "3.0",
"recommendToFriendRating": "0.6",
"featuredReview": {
"id": 6613365,
"currentJob": false,
"reviewDateTime": "2015-05-15 16:32:06.997",
"jobTitle": "Executive Team Leader",
"location": "Buena Park, CA",
"jobTitleFromDb": "Executive Team Leader",
"headline": "Unrealistic expectations for leadership",
"overall": 4,
"overallNumeric": 4
},
"ceo": {
"name": "Brian Cornell",
"title": "CEO",
"numberOfRatings": 1127,
"pctApprove": 66,
"pctDisapprove": 34
}
}]
employers[0].id
employers[0].featuredReview.id
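If you need the ids for every employer rather than just the first one, most JSONPath-style implementations also accept a wildcard, e.g.:
employers[*].id
employers[*].featuredReview.id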