Finding a particular string in HTML - html

I need to extract data from a website and display it to the user. I'm receiving HTML, and I need to find a particular number inside it.
For example, the string would be: "Canada = 50, USA = 60, France = 70". I need to search for "Canada" and find only the number 50.
I've been searching online for how to actually search the returned string of HTML and can't seem to get anything to work.

I don't know how this could be done in an app, since you want the app to look for specific words in a text file.
However, I know this can be done using data analysis tools like R, which can filter large amounts of text to create word clouds.
http://georeferenced.wordpress.com/2013/01/15/rwordcloud/
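For the example string in the question, a regular expression is one way to pull out the number that follows a given name. The sketch below is only an illustration: findNumberFor is a hypothetical helper name, and it assumes the text looks like "Name = number" with no markup between the name and the value. If the real HTML puts tags in between, the text would need to be extracted from the markup first.

```javascript
// Escape characters that have a special meaning in a regular expression,
// so the label is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return the number that follows "<label> = " in the text, or null if absent.
// Assumes the "Name = 50" layout shown in the question.
function findNumberFor(text, label) {
  const pattern = new RegExp(escapeRegExp(label) + "\\s*=\\s*(-?\\d+(?:\\.\\d+)?)");
  const match = pattern.exec(text);
  return match ? Number(match[1]) : null;
}

const sampleText = "Canada = 50, USA = 60, France = 70";
console.log(findNumberFor(sampleText, "Canada"));
```

The label is escaped before it goes into the pattern so that names containing characters such as "." or "(" are still matched literally.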

Related

Unreadable Text in JSON

I've been working a bit with some files from Minecraft Dungeons, which were extracted using QuickBMS and made available here: https://minecraft.fandom.com/wiki/Minecraft_Wiki:Minecraft_Dungeons_game_files
In the "data" folder, there are a bunch of JSON files, which I believe contain a list of textures associated with any given stage of the game. There is, however, a problem. When opened, each file reads like any JSON file, with a bunch of names and values, but some of the values are not human-readable; they instead show up as a string of seemingly unrelated characters. Here is an example:
"walkable-plane" : "eNpjYSEOMIMAOp+ZmQmND1fEjF2AiQldAJsWDEPRXUKkowkDAM/qA6o=",
Now, given that these are exclusively ordinary characters, and not error symbols or something of the sort, I'm assuming this is an encoding issue. Of course, I don't know for sure, or I wouldn't be asking this in the first place, but the file itself appears to be UTF-8 text, and reading it that way obviously doesn't produce a usable result. So, if anyone knows what exactly this is, and how I could extract information from it, I'd be really thankful.
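One possible explanation: the value uses only base64 characters, and the prefix "eNp" decodes to the bytes 0x78 0xDA, which is a common zlib header. So the value may be zlib-compressed binary data that was base64-encoded to fit into JSON. The Node.js sketch below assumes exactly that; decodeBlob and encodeBlob are hypothetical helper names, and what the decompressed bytes mean for the game is not known from the question.

```javascript
const zlib = require("zlib");

// Decode a base64 string and inflate the zlib stream inside it.
// Returns a Buffer of raw bytes (probably binary data, not text).
function decodeBlob(encoded) {
  const compressed = Buffer.from(encoded, "base64");
  return zlib.inflateSync(compressed);
}

// Inverse operation, included only so the round trip can be checked.
function encodeBlob(bytes) {
  return zlib.deflateSync(Buffer.from(bytes)).toString("base64");
}

// Round trip with made-up bytes to show the two helpers agree.
const roundTrip = decodeBlob(encodeBlob([0, 1, 1, 0, 255]));
console.log(Array.from(roundTrip));
```

To try it on the file, pass the "walkable-plane" value to decodeBlob and inspect the resulting bytes. If inflateSync throws, the assumption about the format is wrong for that value.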

Wikipedia-API fallback if extract is empty

I am requesting data from Wikipedia API.
My request-URL looks like this:
https://en.wikipedia.org/w/api.php?format=json&action=query&prop=extracts%7Cpageimages%7Cinfo&inprop=url&piprop=thumbnail&pithumbsize=144&pilimit=50&exintro&explaintext&redirects=1n&generator=geosearch&ggscoord=47.073733%7C15.3918631&ggsradius=10000&ggslimit=50&origin=*
It returns pages located close to the given coordinates, along with their extracts. But for some articles the extract is empty. Can you help me extend my request to have either a fallback for an empty extract, or to get the extract AND the first 10 sentences?
I don't always want to request the first 10 sentences, as the extract (if available) makes more sense.
Thank you
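One way to approach the fallback is on the client side: keep the original request, then send a second request only for pages whose extract came back empty. The sketch below assumes the usual format=json response shape (query.pages keyed by page ID) and uses the exsentences parameter of the extracts module. The function names are hypothetical, and it builds one URL per page because, as far as I know, the module only returns extracts for several pages at once when exintro is set.

```javascript
const API = "https://en.wikipedia.org/w/api.php";

// Collect the IDs of pages whose "extract" is missing or blank.
function pageIdsMissingExtract(response) {
  const pages = (response.query && response.query.pages) || {};
  return Object.values(pages)
    .filter((page) => !page.extract || page.extract.trim() === "")
    .map((page) => page.pageid);
}

// Build a follow-up URL asking for the first 10 sentences of one page.
function buildFallbackUrl(pageId) {
  const params = new URLSearchParams({
    format: "json",
    action: "query",
    prop: "extracts",
    explaintext: "1",
    exsentences: "10",
    pageids: String(pageId),
    origin: "*",
  });
  return `${API}?${params}`;
}

// Made-up response fragment: page 2 has an extract, pages 1 and 3 do not.
const sampleResponse = {
  query: { pages: { 1: { pageid: 1, extract: "" }, 2: { pageid: 2, extract: "Some text." }, 3: { pageid: 3 } } },
};
console.log(pageIdsMissingExtract(sampleResponse).map(buildFallbackUrl));
```

Each returned URL can then be fetched, and its extract used in place of the empty one.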

Editing JSON - Add Attribute

I have a slew of JSON files I'm getting dumps of, with data from the day/period it was pulled. Most of the JSON files I'm dealing with are a lot larger than this, but I figured a smaller one would be easier to work with.
{"playlists":[{"uri":"spotify:user:11130196075:playlist:1Ov4b3NkyzIMwfY9E8ixpE","listeners":366,"streams":386,"dateAdded":"2016-02-24","newListeners":327,"title":"#Covers","owner":"Saga Prommeedet"},{"uri":"spotify:user:mickeyrose30:playlist:2Ov4b3NkyzIMwfY9E8ixpE","listeners":229,"streams":263,"dateAdded":"removed","newListeners":154,"title":"bestcovers2016","owner":"Mickey Rose"}],"top":2,"total":53820}
What I'm essentially trying to do is add a date attribute to each line of data, so that when I combine multiple JSON files to put through an analytical tool, the right row of data is associated with the correct date. My first thought was to write it as such:
{"playlists":[{"uri":"spotify:user:11130196075:playlist:1Ov4b3NkyzIMwfY9E8ixpE","listeners":366,"streams":386,"dateAdded":"2016-02-24","newListeners":327,"title":"#Covers","owner":"Saga Prommeedet"},{"uri":"spotify:user:mickeyrose30:playlist:2Ov4b3NkyzIMwfY9E8ixpE","listeners":229,"streams":263,"dateAdded":"removed","newListeners":154,"title":"bestcovers2016","owner":"Mickey Rose"}],"top":2,"total":53820,"date":072617}
since the "top" and "total" attributes are showing up on each row of data (with the associated values also showing up on each row) when I put it through an analytical tool like Tableau.
Also, I have been editing and saving files through Brackets, and testing things through this converter (https://konklone.io/json/).
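In JavaScript:
var m = JSON.parse(json_string);  // parse the JSON text into an object
m["date"] = "20170804";           // add the date as a top-level attribute
var updated = JSON.stringify(m);  // serialize the object back to a JSON string
This should work for you, and it is very simple.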
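If the date should appear on every row rather than once at the top level, it can be copied into each entry of the "playlists" array instead. This is a minimal sketch under assumptions: addDateToRows is a hypothetical helper, the sample is a shortened version of the file from the question, and the date is written as a string such as "2017-07-26". A bare number like 072617 is not valid JSON because of the leading zero.

```javascript
// Parse the JSON text, add the same date to every playlist entry,
// and return the updated JSON text.
function addDateToRows(jsonString, date) {
  const data = JSON.parse(jsonString);
  for (const playlist of data.playlists) {
    playlist.date = date;
  }
  return JSON.stringify(data);
}

// Shortened sample based on the file in the question.
const sampleJson =
  '{"playlists":[{"title":"#Covers","listeners":366},' +
  '{"title":"bestcovers2016","listeners":229}],"top":2,"total":53820}';

console.log(addDateToRows(sampleJson, "2017-07-26"));
```

Because the date lives on each row, it stays attached to the right data when several files are combined.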

Convert Table to JSON

I am still taking baby steps with Yahoo Pipes and struggling with what I believe should be a simple task.
I have a table on a page that is being updated in realtime (every 1-2 minutes).
I want to extract the rows, push into a Pipe, and then spit out a JSON in the following format:
"sites": [
  {
    "Site": "210001-Singleton",
    "LastSampleTime": "29/04/2014 11:51:00",
    "RiverLevel": "0.744",
    "FlowRate": "501.6",
    "FlowRate": "0.744",
    "Rainfall": "",
    "WaterTemp": "",
    "Conductivity": ""
  },
etc.
I believe that once I have pulled the relevant table components with an XPath fetcher, I would use a Loop with an Item Builder inside it to output the data in the above format. However, I am struggling to pull in the simple table.
Here is a simplified version of my yahoo pipe.
I have tried multiple variations of the XPath string to try and get just the rows I need.
From inspecting the table with Firebug, I know that the TRs I want all seem to share the same height: tr style="height:18px"
However, I'm not sure if this is the best way to extract them.
Can someone provide some pointers on how to pull the table into my desired format? I'm not too sure where I am going wrong with XPath.
Import.io can do what you want. Even though the HTML on that site is a bit messy, you can still use a custom xpath override within the tool.
I built the first row of data for you, so all you need to do is go in and edit the existing extractor adding in more columns using the following extractor as a start point https://import.io/data/set/?mode=loadSource&source=f867a123-091e-4596-bbea-871df2d5ceb7
Just open it up, edit the extractor and add the cols you need. Here is the xPath code I used:
/html/body/table/tbody/tr[7]/td[5]
Row 7 in the table is the first row with data, and td[2] is the first cell in it. Just increase the number in tr[x] to reach the next row.
Once you have the data structured, hit Integrate and follow the instructions. Use import.io support for help too; that's what they are there for.
If the table will be expanded with more rows, you may want to change the XPath to work off the values of the tr's child elements.
Disclaimer: I work at import.io, other tools exist.
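Whichever tool extracts the rows, the final step of turning each row's cells into the desired JSON can be sketched in a few lines of JavaScript. This is only an illustration under assumptions: the column order in FIELDS is a guess based on the field names in the question, rowsToSites is a hypothetical helper, and the input is assumed to be an array of rows, each an array of cell strings. The question's sample lists "FlowRate" twice; a JSON object would normally keep only one value for that name, so it appears once here.

```javascript
// Assumed column order, taken from the field names in the question.
const FIELDS = [
  "Site",
  "LastSampleTime",
  "RiverLevel",
  "FlowRate",
  "Rainfall",
  "WaterTemp",
  "Conductivity",
];

// Map each row (an array of cell strings) to an object keyed by field name.
// Missing cells become empty strings, as in the question's sample.
function rowsToSites(rows) {
  const sites = rows.map((cells) => {
    const site = {};
    FIELDS.forEach((name, index) => {
      site[name] = (cells[index] || "").trim();
    });
    return site;
  });
  return { sites };
}

const sampleRows = [["210001-Singleton", "29/04/2014 11:51:00", "0.744", "501.6"]];
console.log(JSON.stringify(rowsToSites(sampleRows), null, 2));
```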

regex to extract html value

I'm trying to write a small scraper script for Google search. I've written the program, but I have a small problem: I need a regex to extract the data-href value from the Google search results. Please help me.
Example HTML code from Google search:
data-href="www.buxmob.net/index.php?id=577">
data-href="www.webopedia.com/TERM/K/keyword.html">
data-href="moz.com/beginners-guide-to-seo/keyword-research">
I need only the URL present in this value, like this:
hxxp://www.webopedia.com/TERM/K/keyword.html
hxxp://moz.com/beginners-guide-to-seo/keyword-research
hxxp://www.buxmob.net/index.php?id=577
Thank you
All the examples you gave can be matched with
(?:data-href=")(.*?)(?:">)
See demo at http://regex101.com/r/rB4nS1
That does NOT mean it's a good idea to try to parse (general) HTML with regex - but sometimes, when the response is well formed and well known, you get away with it.
Note that you mentioned you wanted hxxp:// in front of the string. That is not the job of the regular expression; it belongs in the language you use to apply the expression. The above is a non-greedy match starting after the string data-href=" and ending at the next ">.
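As an illustration of applying the expression and adding the prefix in code, here is a minimal JavaScript sketch. extractDataHrefs is a hypothetical helper, and the "http://" default prefix is an assumption based on the hxxp:// examples in the question.

```javascript
// Find every data-href value in the HTML and put a scheme prefix in front of it.
// Uses the non-greedy pattern from the answer; capture group 1 is the value.
function extractDataHrefs(html, prefix = "http://") {
  const pattern = /(?:data-href=")(.*?)(?:">)/g;
  return Array.from(html.matchAll(pattern), (match) => prefix + match[1]);
}

const sampleHtml = [
  'data-href="www.buxmob.net/index.php?id=577">',
  'data-href="www.webopedia.com/TERM/K/keyword.html">',
  'data-href="moz.com/beginners-guide-to-seo/keyword-research">',
].join("\n");

console.log(extractDataHrefs(sampleHtml));
```

The g flag is required for matchAll, and it lets a single call collect every match in the page.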