How to get the text of a specific section via the Wikipedia API (MediaWiki)

I would like to extract only a specific section from a Wikipedia page.
Example:
I would like to extract the text of the section "Parts" from the Wikipedia article "House".
https://en.wikipedia.org/wiki/House
The resulting text would be:
Many houses have several large rooms ..... sections of the home (including in more recent eras a garage).
We can get the whole text of an article like this:
https://en.wikipedia.org/w/api.php?action=query&prop=revisions&titles=house&rvprop=content&format=json
But how do I get the text of a specific section?

Do you need the plain wikitext or the HTML produced by the parser?
The examples below give you the section "Layout" (the third section of the House article; you can use any other section index as well).
When you want to retrieve the parsed HTML of a specific section, you should use the parse API:
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=house&prop=text&section=3&disabletoc=1
or, as an API request outside of the sandbox:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=text&section=3&disabletoc=1
If you want to have the wikitext of a specific section, just use the wikitext prop instead of the text prop:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=wikitext&section=3&disabletoc=1
To find out which section has which index, query the "sections" prop without passing a section index:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=sections&disabletoc=1
So, as a full example, to retrieve the text of the Layout section using only the API, you would:
Retrieve the sections of the article:
https://en.wikipedia.org/w/api.php?action=parse&format=json&page=house&prop=sections&disabletoc=1
Response:
{
"parse": {
"title": "House",
"pageid": 13590,
"sections": [
{
"toclevel": 1,
"level": "2",
"line": "Etymology",
"number": "1",
"index": "1",
"fromtitle": "House",
"byteoffset": 3549,
"anchor": "Etymology"
},
{
"toclevel": 1,
"level": "2",
"line": "Elements",
"number": "2",
"index": "2",
"fromtitle": "House",
"byteoffset": 4960,
"anchor": "Elements"
},
{
"toclevel": 2,
"level": "3",
"line": "Layout",
"number": "2.1",
"index": "3",
"fromtitle": "House",
"byteoffset": 4976,
"anchor": "Layout"
},
{
"toclevel": 2,
"level": "3",
"line": "Parts",
"number": "2.2",
"index": "4",
"fromtitle": "House",
"byteoffset": 6432,
"anchor": "Parts"
},
{
"toclevel": 2,
"level": "3",
"line": "History of the interior",
"number": "2.3",
"index": "5",
"fromtitle": "House",
"byteoffset": 7539,
"anchor": "History_of_the_interior"
},
{
"toclevel": 3,
"level": "4",
"line": "Communal rooms",
"number": "2.3.1",
"index": "6",
"fromtitle": "House",
"byteoffset": 8786,
"anchor": "Communal_rooms"
},
{
"toclevel": 3,
"level": "4",
"line": "Interconnecting rooms",
"number": "2.3.2",
"index": "7",
"fromtitle": "House",
"byteoffset": 9736,
"anchor": "Interconnecting_rooms"
},
{
"toclevel": 3,
"level": "4",
"line": "Corridor",
"number": "2.3.3",
"index": "8",
"fromtitle": "House",
"byteoffset": 11126,
"anchor": "Corridor"
},
{
"toclevel": 3,
"level": "4",
"line": "Employment-free house",
"number": "2.3.4",
"index": "9",
"fromtitle": "House",
"byteoffset": 13092,
"anchor": "Employment-free_house"
},
{
"toclevel": 2,
"level": "3",
"line": "Work location, technology and doctors",
"number": "2.4",
"index": "10",
"fromtitle": "House",
"byteoffset": 15969,
"anchor": "Work_location,_technology_and_doctors"
},
{
"toclevel": 3,
"level": "4",
"line": "Technology and privacy",
"number": "2.4.1",
"index": "11",
"fromtitle": "House",
"byteoffset": 17291,
"anchor": "Technology_and_privacy"
},
{
"toclevel": 1,
"level": "2",
"line": "Construction",
"number": "3",
"index": "12",
"fromtitle": "House",
"byteoffset": 18782,
"anchor": "Construction"
},
{
"toclevel": 2,
"level": "3",
"line": "Energy efficiency",
"number": "3.1",
"index": "13",
"fromtitle": "House",
"byteoffset": 21899,
"anchor": "Energy_efficiency"
},
{
"toclevel": 2,
"level": "3",
"line": "Earthquake protection",
"number": "3.2",
"index": "14",
"fromtitle": "House",
"byteoffset": 23057,
"anchor": "Earthquake_protection"
},
{
"toclevel": 1,
"level": "2",
"line": "Found materials",
"number": "4",
"index": "15",
"fromtitle": "House",
"byteoffset": 25172,
"anchor": "Found_materials"
},
{
"toclevel": 1,
"level": "2",
"line": "Legal issues",
"number": "5",
"index": "16",
"fromtitle": "House",
"byteoffset": 26235,
"anchor": "Legal_issues"
},
{
"toclevel": 2,
"level": "3",
"line": "United Kingdom",
"number": "5.1",
"index": "17",
"fromtitle": "House",
"byteoffset": 26644,
"anchor": "United_Kingdom"
},
{
"toclevel": 1,
"level": "2",
"line": "Identifying houses",
"number": "6",
"index": "18",
"fromtitle": "House",
"byteoffset": 26922,
"anchor": "Identifying_houses"
},
{
"toclevel": 1,
"level": "2",
"line": "Animal houses",
"number": "7",
"index": "19",
"fromtitle": "House",
"byteoffset": 27397,
"anchor": "Animal_houses"
},
{
"toclevel": 1,
"level": "2",
"line": "Houses and symbolism",
"number": "8",
"index": "20",
"fromtitle": "House",
"byteoffset": 27826,
"anchor": "Houses_and_symbolism"
},
{
"toclevel": 1,
"level": "2",
"line": "See also",
"number": "9",
"index": "21",
"fromtitle": "House",
"byteoffset": 28620,
"anchor": "See_also"
},
{
"toclevel": 1,
"level": "2",
"line": "References",
"number": "10",
"index": "22",
"fromtitle": "House",
"byteoffset": 29690,
"anchor": "References"
},
{
"toclevel": 1,
"level": "2",
"line": "External links",
"number": "11",
"index": "23",
"fromtitle": "House",
"byteoffset": 29720,
"anchor": "External_links"
}
]
}
}
Iterate over the result, find the section you want, and take its index.
Use that index in the next API request to get the section content:
https://en.wikipedia.org/wiki/Special:ApiSandbox#action=parse&format=json&page=house&prop=wikitext&section=3&disabletoc=1
Response:
{
"parse": {
"title": "House",
"pageid": 13590,
"wikitext": {
"*": "=== Layout ===\n[[File:Gingerbread House Essex CT.jpg|thumb|Example of an early [[Victorian architecture|Victorian]] \"Gingerbread House\" in [[Connecticut]], United States, built in 1855]]\n\nIdeally, [[architect]]s of houses design [[room]]s to meet the needs of the people who will live in the house. [[Feng shui]], originally a [[China|Chinese]] method of moving houses according to such factors as rain and micro-climates, has recently expanded its scope to address the design of interior spaces, with a view to promoting harmonious effects on the people living inside the house, although no actual effect has ever been demonstrated. Feng shui can also mean the \"aura\" in or around a dwelling, making it comparable to the [[real estate|real-estate]] sales concept of \"indoor-outdoor flow\".\n\nThe [[square footage]] of a house in the United States reports the area of \"living space\", excluding the garage and other non-living spaces. The \"square metres\" figure of a house in Europe <!-- including Malta ? --> reports the area of the walls enclosing the home, and thus includes any attached garage and non-living spaces.<ref>{{Cite book|title=Land Management: Challenges and Strategies (First Edition)|last=Iyyer|first=Chaitanya|publisher=Global India Publications Pvt Ltd|year=2009|isbn=978-9380228488|location=|pages=}}</ref>{{Citation needed|date=February 2007}} The number of floors or levels making up the house can affect the square footage of a home."
}
}
}
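The two steps above could be sketched in Python roughly as follows. This is only a minimal illustration, not an official client: the function names (build_parse_url, find_section_index, fetch_json, get_section_wikitext) and the User-Agent string are made up for this example, and the code assumes the response shapes shown above.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

API_URL = "https://en.wikipedia.org/w/api.php"


def build_parse_url(page, prop, section=None):
    """Build an action=parse request URL for the given page and prop."""
    params = {"action": "parse", "format": "json", "page": page, "prop": prop}
    if section is not None:
        params["section"] = section
    return API_URL + "?" + urlencode(params)


def find_section_index(sections_response, heading):
    """Return the "index" of the section whose "line" equals heading, or None."""
    for section in sections_response["parse"]["sections"]:
        if section["line"] == heading:
            return section["index"]
    return None


def fetch_json(url):
    """Perform the HTTP GET and decode the JSON body (needs network access)."""
    # The User-Agent value is a placeholder chosen for this sketch.
    request = Request(url, headers={"User-Agent": "section-example/0.1"})
    with urlopen(request) as response:
        return json.load(response)


def get_section_wikitext(page, heading):
    """Step 1: list the sections. Step 2: request the matching section."""
    sections = fetch_json(build_parse_url(page, "sections"))
    index = find_section_index(sections, heading)
    if index is None:
        raise KeyError(heading)
    result = fetch_json(build_parse_url(page, "wikitext", index))
    return result["parse"]["wikitext"]["*"]
```

Calling get_section_wikitext("house", "Parts") would then look up the index for "Parts" and request that section. Looking the index up by heading on every call avoids hard-coding a number that changes whenever the article is restructured.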
Background:
Sections are not (yet) a concept at the revision level. A revision is "just" the content of the whole page plus additional metadata (possibly in multiple other slots), and the sections are part of that content, which lives in a single slot of the revision. That is why the revisions query API can only return the whole text. Sections are a wikitext concept, so the page has to be parsed to find out what they are.

Related

Error parsing a specific JSON file in Snowflake with File Format

I have created a stage and a file format in Snowflake which work with all my other JSON files except this one, which throws an error:
Error parsing JSON: misplaced { File 'rooms.json.gz', line 1, character 2 Row 0, column $1
I am using the same query that I use for the other files.
SELECT $1
FROM @MySchema.MY_STAGE/rooms.json.gz
;
What is wrong with the structure of this specific JSON file?
{
"rooms": [
{
"area": 131.49,
"longDescription": "",
"dateCreated": 1589908063390,
"reservable": false,
"name": "E249",
"remoteInfo": "",
"description": "",
"id": 2,
"type": {
"hexColor": "c16058",
"contentFlag": 1,
"cost": 0.0,
"dateCreated": 1308610520717,
"color": {},
"name": "BREAK ROOM",
"occupiable": false,
"id": 120,
"parkingSpace": false,
"dateUpdated": 1591818585913,
"typeCode": ""
},
"floor": {
"area": 25312.9878,
"dateCreated": 1589907703870,
"drawingAvailable": true,
"interiorGross": 0.0,
"name": "2",
"leaseArea": 0.0,
"id": 12,
"building": {
"address": {
"country": {
"defaultSelected": true,
"subdivisionCategoryName": "state",
"alpha2Code": "US",
"isoCode": "US",
"name": "United States of America (the)",
"id": 223
},
"city": "Some City",
"street": "Some Drive",
"postalCode": "00000",
"state": {
"country": {
"defaultSelected": true,
"subdivisionCategoryName": "state",
"alpha2Code": "US",
"isoCode": "US",
"name": "United States of America (the)",
"id": 223
},
"defaultSelected": false,
"code": "XX",
"name": "Some State",
"id": 66,
"categoryName": "state"
}
},
"code": "B2",
"dateCreated": 1589907508020,
"metric": false,
"name": "Some name",
"location": {},
"revitLink": "",
"id": 45,
"dateUpdated": 1601315841453,
"costCenters": []
},
"dateUpdated": 1600441936663
},
"capacity": 0,
"dateUpdated": 1600441936960
}
]
}
Edit: Screenshot from Notepad++ with all characters enabled
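No fix is recorded for this question, so the following is only a diagnostic sketch. One possible cause (an assumption, not something the post confirms) is that the file starts with a byte order mark or is not UTF-8, which would put unexpected bytes in front of the first {. The helper names detect_bom and leading_bytes are invented for this example, and leading_bytes expects a local copy of the gzip file.

```python
import codecs
import gzip

# Checked in this order because the UTF-32 LE mark begins with the UTF-16 LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def detect_bom(data):
    """Return the name of the byte order mark at the start of data, or None."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def leading_bytes(path, count=8):
    """Read the first few decompressed bytes of a gzip file on local disk."""
    with gzip.open(path, "rb") as handle:
        return handle.read(count)
```

For a local copy of the file, detect_bom(leading_bytes("rooms.json.gz")) returning anything other than None would point at the encoding rather than at the JSON structure itself.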

How to extract a specific value from JSON file?

I'm trying to extract a specific value from a JSON file.
The key/value pair is: "info": "this is an example" (the key is unique).
I want to extract only the value: "this is an example"
My code:
cat 9.json | jq '.info'
result:
null
JSON file example:
{
"Event": {
"id": "13",
"orgc_id": "1",
"org_id": "1",
"date": "2019-01-09",
"threat_level_id": "3",
"info": "test9",
"published": false,
"uuid": "5c35d180",
"attribute_count": "2",
"analysis": "0",
"timestamp": "1547044733",
"distribution": "1",
"proposal_email_lock": false,
"locked": false,
"publish_timestamp": "1547034089",
"sharing_group_id": "0",
"disable_correlation": false,
"extends_uuid": "",
"event_creator_email": "o#cyhgfnt.com",
"Org": {
"id": "1",
"name": "Cygfdgfdnt",
"uuid": "5b9f938d-e3a0-4ecb-83b3-0bdeac1b41bc"
},
"Orgc": {
"id": "1",
"name": "Cyhgfgft",
"uuid": "5b9f938d-e3a0-4ecb-83b3-0bdeac1b41bc"
},
"Attribute": [{
"id": "292630",
"type": "domain",
"category": "Network activity",
"to_ids": true,
"uuid": "5c35dd94-cccc-4086-b386-682823717aa5",
"event_id": "1357",
"distribution": "5",
"timestamp": "1547034584",
"comment": "This is a comment",
"sharing_group_id": "0",
"deleted": false,
"disable_correlation": false,
"object_id": "0",
"object_relation": null,
"value": "dodskj.com",
"Galaxy": [],
"ShadowAttribute": [],
"Tag": [{
"id": "223",
"name": "kill-chain:Exploitation",
"colour": "#a80079",
"exportable": true,
"user_id": "0",
"hide_tag": false,
"numerical_value": null
}]
}, {
"id": "292631",
"type": "ip-dst",
"category": "Network activity",
"to_ids": true,
"uuid": "5c35dd94-fe90-4ef6-b3a9-682823717aa5",
"event_id": "1357",
"distribution": "5",
"timestamp": "1547044733",
"comment": "comment example",
"sharing_group_id": "0",
"deleted": false,
"disable_correlation": false,
"object_id": "0",
"object_relation": null,
"value": "8.8.6.6",
"Galaxy": [],
"ShadowAttribute": [],
"Tag": [{
"id": "247",
"name": "maec-malware-capabilities:maec-malware-capability=\"anti-removal\"",
"colour": "#3f0004",
"exportable": true,
"user_id": "0",
"hide_tag": false,
"numerical_value": null
}, {
"id": "465",
"name": "osint:lifetime=\"perpetual\"",
"colour": "#006ebe",
"exportable": true,
"user_id": "0",
"hide_tag": false,
"numerical_value": null
}]
}],
"ShadowAttribute": [],
"RelatedEvent": [],
"Galaxy": [{
"id": "3",
"uuid": "698774c7-8022-42c4-917f-8d6e4f06ada3",
"name": "Threat Actor",
"type": "threat-actor",
"description": "Threat actors are characteristics of malicious actors (or adversaries) representing a cyber attack threat including presumed intent and historically observed behaviour.",
"version": "3",
"icon": "user-secret",
"namespace": "misp",
"GalaxyCluster": [{
"id": "6397",
"collection_uuid": "7cdff317-a673-4474-84ec-4f1754947823",
"type": "threat-actor",
"value": "Sofacy",
"tag_name": "misp-galaxy:threat-actor=\"Sofacy\"",
"description": "The Sofacy Group (also known as APT28, Pawn Storm, Fancy Bear and Sednit) is a cyber espionage group believed to have ties to the Russian government. Likely operating since 2007, the group is known to target government, military, and security organizations. It has been characterized as an advanced persistent threat.",
"galaxy_id": "3",
"source": "MISP Project",
"authors": ["Alexandre Dulaunoy", "Florian Roth", "Thomas Schreck", "Timo Steffens", "Various"],
"version": "82",
"uuid": "5b4ee3ea-eee3-4c8e-8323-85ae32658754",
"tag_id": "608",
"meta": {
"cfr-suspected-state-sponsor": ["Russian Federation"],
"cfr-suspected-victims": ["Georgia", "France", "Jordan", "United States", "Hungary", "World Anti-Doping Agency", "Armenia", "Tajikistan", "Japan", "NATO", "Ukraine", "Belgium", "Pakistan", "Asia Pacific Economic Cooperation", "International Association of Athletics Federations", "Turkey", "Mongolia", "OSCE", "United Kingdom", "Germany", "Poland", "European Commission", "Afghanistan", "Kazakhstan", "China"],
"cfr-target-category": ["Government", "Military"],
"cfr-type-of-incident": ["Espionage"],
"country": ["RU"],
"refs": ["https:\/\/en.wikipedia.org\/wiki\/Sofacy_Group", "https:\/\/aptnotes.malwareconfig.com\/web\/viewer.html?file=..\/APTnotes\/2014\/apt28.pdf", "http:\/\/www.trendmicro.com\/cloud-content\/us\/pdfs\/security-intelligence\/white-papers\/wp-operation-pawn-storm.pdf", "https:\/\/www2.fireeye.com\/rs\/848-DID-242\/images\/wp-mandiant-matryoshka-mining.pdf", "https:\/\/www.crowdstrike.com\/blog\/bears-midst-intrusion-democratic-national-committee\/", "http:\/\/researchcenter.paloaltonetworks.com\/2016\/06\/unit42-new-sofacy-attacks-against-us-government-agency\/", "https:\/\/www.cfr.org\/interactive\/cyber-operations\/apt-28", "https:\/\/blogs.microsoft.com\/on-the-issues\/2018\/08\/20\/we-are-taking-new-steps-against-broadening-threats-to-democracy\/", "https:\/\/www.bleepingcomputer.com\/news\/security\/microsoft-disrupts-apt28-hacking-campaign-aimed-at-us-midterm-elections\/", "https:\/\/www.bleepingcomputer.com\/news\/security\/apt28-uses-lojax-first-uefi-rootkit-seen-in-the-wild\/"],
"synonyms": ["APT 28", "APT28", "Pawn Storm", "PawnStorm", "Fancy Bear", "Sednit", "TsarTeam", "Tsar Team", "TG-4127", "Group-4127", "STRONTIUM", "TAG_0700", "Swallowtail", "IRON TWILIGHT", "Group 74"]
}
}]
}],
"Object": [],
"Tag": [{
"id": "608",
"name": "misp-galaxy:threat-actor=\"Sofacy\"",
"colour": "#12e000",
"exportable": true,
"user_id": "0",
"hide_tag": false,
"numerical_value": null
}, {
"id": "118",
"name": "gdpr:special-categories=\"health\"",
"colour": "#3ce600",
"exportable": true,
"user_id": "0",
"hide_tag": false,
"numerical_value": null
}]
}
}
I suppose you are trying to get the info field inside Event, so the filter should be written as below. Use -r to print the value without quotes.
jq '.Event.info'
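For comparison, here is roughly the same lookup in Python, as a minimal sketch on a trimmed-down copy of the document (only three of the original keys are kept). It shows why the top-level lookup comes back empty: the outermost object has a single key, "Event", and "info" lives one level below it.

```python
import json

# A shortened version of the file from the question.
document = json.loads("""
{
  "Event": {
    "id": "13",
    "info": "test9",
    "published": false
  }
}
""")

# Counterpart of jq '.info': the top-level object has no "info" key.
top_level = document.get("info")

# Counterpart of jq -r '.Event.info': descend into "Event" first.
nested = document["Event"]["info"]

print(top_level)  # None
print(nested)     # test9
```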

I get an Ext JS warning that the JSON data is incorrect

My data looks like this.
{
"total": "5",
"data": [
{
"Id": "5141",
"Qty": "1",
"Year": "175",
"Country": "GREAT BRITAN",
"Denomination": "HALF PENNY",
"Grade": "Uncirculated",
"Mint": "",
"Value": "5.00",
"Obversa": "",
"Reverse": "",
"Comments": "",
"PurchaseDate": "2010-11-05 00:00:00",
"CoinTotal": "5.00"
},
{
"Id": "5141",
"Qty": "1",
"Year": "175",
"Country": "GREAT BRITAN",
"Denomination": "HALF PENNY",
"Grade": "Fair",
"Mint": "",
"Value": "5.00",
"Obversa": "",
"Reverse": "",
"Comments": "",
"PurchaseDate": "2010-11-05 00:00:00",
"CoinTotal": "5.00"
},
{
"Id": "5141",
"Qty": "1",
"Year": "175",
"Country": "GREAT BRITAN",
"Denomination": "HALF PENNY",
"Grade": "Very Fine",
"Mint": "",
"Value": "5.00",
"Obversa": "",
"Reverse": "",
"Comments": "",
"PurchaseDate": "2010-11-05 00:00:00",
"CoinTotal": "5.00"
},
{
"Id": "5141",
"Qty": "1",
"Year": "175",
"Country": "GREAT BRITAN",
"Denomination": "HALF PENNY",
"Grade": "PROOF",
"Mint": "",
"Value": "5.00",
"Obversa": "",
"Reverse": "",
"Comments": "",
"PurchaseDate": "2010-11-05 00:00:00",
"CoinTotal": "5.00"
},
{
"Id": "5141",
"Qty": "1",
"Year": "175",
"Country": "GREAT BRITAN",
"Denomination": "HALF PENNY",
"Grade": "Good",
"Mint": "",
"Value": "5.00",
"Obversa": "",
"Reverse": "",
"Comments": "",
"PurchaseDate": "2010-11-05 00:00:00",
"CoinTotal": "5.00"
}
]
}
However, I still get this error:
[WARN] Unable to parse the JSON returned by the server.
Can someone identify the error and correction?

PayPal Plus MALFORMED_REQUEST JSON with shipping address

To show the PayPal Plus iFrame (REST API) I make a request with this JSON:
{
"intent": "sale",
"experience_profile_id": "XP-XXXX-XXXX-XXX-XXX",
"redirect_urls": {
"return_url": "https://www.XXXXXXX.de/bestellen.php",
"cancel_url": "https://www.XXXXXXX.de/zahlungabbruch.php"
},
"payer": {
"payment_method": "paypal"
},
"transactions": [{
"amount": {
"total": "53.45",
"currency": "EUR",
"details": {
"subtotal": "49.5",
"shipping": "3.95"
}
},
"description": "Tollewolle",
"invoice_number": "",
"item_list": {
"items": [
{
"quantity": "4",
"name": "Fine Kid - 50",
"price": "8.25",
"currency": "EUR",
"sku": "8-50"
},
{
"quantity": "2",
"name": "Fine Kid - 208",
"price": "8.25",
"currency": "EUR",
"sku": "8-208"
}
]
},
"shipping_address": {
"line1": "Rechnungs Str. 41",
"city": "Flensburg",
"postal_code": "24939",
"country_code": "DE"
}
}]
}
Without the shipping_address it works fine.
With the address I get the error 'MALFORMED_REQUEST'.
I believe shipping_address is a child of item_list, so you need to move it inside item_list, next to items. E.g.
"transactions": [{
"amount": {"total": "53.45", "currency": "EUR", "details": {"subtotal": "49.5", "shipping": "3.95"}},
"description": "Tollewolle",
"invoice_number": "",
"item_list": {
"items": [
{"quantity": "4", "name": "Fine Kid - 50", "price": "8.25", "currency": "EUR", "sku": "8-50"},
{"quantity": "2", "name": "Fine Kid - 208", "price": "8.25", "currency": "EUR", "sku": "8-208"}
],
"shipping_address": {"line1": "Rechnungs Str. 41", "city": "Flensburg", "postal_code": "24939", "country_code": "DE"}
}
}]

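If the request body is assembled in code, the nesting from the answer above could be sketched like this in Python. The helper build_transaction is hypothetical and written only for this illustration. It builds the transactions part of the payload with shipping_address placed inside item_list, and does not send anything to PayPal.

```python
import json


def build_transaction(items, shipping_address, total, subtotal, shipping, currency="EUR"):
    """Return one transaction dict with the address nested under item_list."""
    return {
        "amount": {
            "total": total,
            "currency": currency,
            "details": {"subtotal": subtotal, "shipping": shipping},
        },
        "description": "Tollewolle",
        "invoice_number": "",
        "item_list": {
            "items": items,
            # shipping_address sits next to "items", not next to "item_list".
            "shipping_address": shipping_address,
        },
    }


items = [
    {"quantity": "4", "name": "Fine Kid - 50", "price": "8.25", "currency": "EUR", "sku": "8-50"},
    {"quantity": "2", "name": "Fine Kid - 208", "price": "8.25", "currency": "EUR", "sku": "8-208"},
]
address = {
    "line1": "Rechnungs Str. 41",
    "city": "Flensburg",
    "postal_code": "24939",
    "country_code": "DE",
}

transaction = build_transaction(items, address, "53.45", "49.5", "3.95")
payload_fragment = json.dumps({"transactions": [transaction]}, indent=2)
print(payload_fragment)
```
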
How to form JSON path

I have an employers array as shown below. How do I get the employer's id and the featuredReview id using a JSON path expression?
"employers": [
{
"id": 194,
"name": "Target",
"website": "www.target.com",
"isEEP": false,
"exactMatch": false,
"industry": "Department, Clothing, & Shoe Stores",
"numberOfRatings": 11531,
"squareLogo": "http://media.glassdoor.com/sqll/194/target-squarelogo.png",
"overallRating": 3.2,
"ratingDescription": "OK",
"cultureAndValuesRating": "3.3",
"seniorLeadershipRating": "2.8",
"compensationAndBenefitsRating": "3.0",
"careerOpportunitiesRating": "3.0",
"workLifeBalanceRating": "3.0",
"recommendToFriendRating": "0.6",
"featuredReview": {
"id": 6613365,
"currentJob": false,
"reviewDateTime": "2015-05-15 16:32:06.997",
"jobTitle": "Executive Team Leader",
"location": "Buena Park, CA",
"jobTitleFromDb": "Executive Team Leader",
"headline": "Unrealistic expectations for leadership",
"overall": 4,
"overallNumeric": 4
},
"ceo": {
"name": "Brian Cornell",
"title": "CEO",
"numberOfRatings": 1127,
"pctApprove": 66,
"pctDisapprove": 34
}
}]
employers[0].id
employers[0].featuredReview.id
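The same two paths can be followed step by step in Python. This is a small sketch that assumes the "employers" array is wrapped in an enclosing object (the fragment in the question omits it) and keeps only a few of the original fields.

```python
import json

# Trimmed copy of the data, wrapped in an object so that it parses on its own.
response = json.loads("""
{
  "employers": [
    {
      "id": 194,
      "name": "Target",
      "featuredReview": {
        "id": 6613365,
        "headline": "Unrealistic expectations for leadership"
      }
    }
  ]
}
""")

# employers[0].id
first_employer = response["employers"][0]
employer_id = first_employer["id"]

# employers[0].featuredReview.id
review_id = first_employer["featuredReview"]["id"]

# The same pair of ids for every entry in the array, not just the first one.
pairs = [(e["id"], e["featuredReview"]["id"]) for e in response["employers"]]

print(employer_id, review_id)  # 194 6613365
```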