Regex to extract an HTML attribute value

I'm trying to write a small scraper script for Google search. I've written the program, but I have one small problem: I need a regex to extract the data-href value from the Google search results. Please help me.
Example HTML code from Google search:
data-href="www.buxmob.net/index.php?id=577">
data-href="www.webopedia.com/TERM/K/keyword.html">
data-href="moz.com/beginners-guide-to-seo/keyword-research">
I need only the URL present in this value, like this:
hxxp://www.webopedia.com/TERM/K/keyword.html
hxxp://moz.com/beginners-guide-to-seo/keyword-research
hxxp://www.buxmob.net/index.php?id=577
Thank you.

All the examples you gave can be matched with
(?:data-href=")(.*?)(?:">)
See demo at http://regex101.com/r/rB4nS1
That does NOT mean it's a good idea to try to parse (general) html with regex - but sometimes, when the response is well formed and well known, you get away with it.
Note that you mentioned you wanted hxxp:// in front of the string. That is not the job of the regular expression; it belongs in the language you use to apply the expression. The pattern above is a non-greedy match that starts right after the string data-href=" and ends at the next ">
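As a rough sketch of how this might look in a host language (Python is my choice here, since the question doesn't name one), the pattern can be applied with re.findall and the prefix added afterwards. The http:// scheme is an assumption; the question writes it as hxxp://.

```python
import re

# Sample markup copied from the question.
html = '''data-href="www.buxmob.net/index.php?id=577">
data-href="www.webopedia.com/TERM/K/keyword.html">
data-href="moz.com/beginners-guide-to-seo/keyword-research">'''

# Same pattern as above; only the middle group captures.
pattern = re.compile(r'(?:data-href=")(.*?)(?:">)')

# The scheme prefix is added by the host language, not by the regex.
# "http://" is an assumption; the question writes it as hxxp://.
urls = ["http://" + value for value in pattern.findall(html)]

for url in urls:
    print(url)
```

With a single capturing group, re.findall returns just the captured values, so no extra unpacking is needed.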

Extract JSON data with Screaming Frog

I'm using Screaming Frog to extract data from JSON generated by a URL.
The generated JSON has this form:
{"ville":[{"codePostal":"13009","ville":"VAUFREGE","popin":"ouverturePopin","zoneLivraison":"1300913982","url":""},{"codePostal":"13009","ville":"LES BAUMETTES","popin":"ouverturePopin","zoneLivraison":"1300913989","url":""},{"codePostal":"13009","ville":"MARSEILLE 9EME ARRON","popin":"ouverturePopin","zoneLivraison":"1300913209","url":""}]}
I'm using this regex in Custom > Extraction in Screaming Frog as a way to extract the values of "codePostal".
"codePostal":".*?"
The problem is that it doesn't extract anything.
When I test my regex in regex101, it seems correct.
Do you have any clue about what is wrong?
Thanks.
Regards.
Have you tried to save the output to understand what ScreamingFrog sees? It doesn't matter - not at the beginning - whether your RegEx works.
That said, don't forget that SF is a Java-based tool, so your regex runs on the Java regex engine. Make sure you test your regular expressions against the correct dialect.
You need to specify the part to extract as a group enclosed in parentheses. For instance, in your example, you need ("codePostal":".*?") as the extractor.
In addition, if you simply want to extract the value, you could use the following instead.
"codePostal":"(.*?)"
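To see the difference the capturing group makes, here is a small Python sketch. Python's re module is only a stand-in here: Screaming Frog uses Java's regex engine, though these two patterns use syntax common to both. The sample body is a trimmed copy of the JSON in the question.

```python
import json
import re

# Trimmed copy of the response shown in the question (two entries only).
body = ('{"ville":[{"codePostal":"13009","ville":"VAUFREGE","popin":"ouverturePopin",'
        '"zoneLivraison":"1300913982","url":""},'
        '{"codePostal":"13009","ville":"LES BAUMETTES","popin":"ouverturePopin",'
        '"zoneLivraison":"1300913989","url":""}]}')

# Without a group, the whole match (key included) comes back.
whole = re.findall(r'"codePostal":".*?"', body)

# With a group, only the text inside the parentheses comes back.
values = re.findall(r'"codePostal":"(.*?)"', body)

# Outside Screaming Frog, a JSON parser yields the same values more robustly.
parsed = [entry["codePostal"] for entry in json.loads(body)["ville"]]

print(whole)
print(values)
print(parsed)
```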
It's not a problem with your regular expression. The problem seems to be the content type: Screaming Frog isn't properly reading application/json content for scraping. Hopefully they will fix this bug.

Regular expression to extract a value from a given string

I need a regular expression to extract a value from a given key/value pair. It's not for a specific language. A working example in https://regex101.com/ would be great.
Here's what I get:
{"task_id":"12323232-323-23-321"}
and here's what I expect:
12323232-323-23-321
I know, it looks easy, but drives me crazy.
The perfect solution would be: "return the value for task_id".
Thanks in advance
Adam
I don't know why you would want to use regex in this case, since you're dealing with JSON. It's unlikely that the language you're using lacks JSON support or a JSON library, which would let you extract the task_id directly.
Getting back to regex, you could try a capturing group.
:"(.*?)"
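Here is a minimal Python sketch of both routes (the question is language-agnostic, so Python is just an example). The key-specific pattern is my own variant, not part of the answer above; it is meant to avoid matching the value of some other key if the object grows.

```python
import json
import re

text = '{"task_id":"12323232-323-23-321"}'

# The generic pattern from above: the first quoted string after a colon.
generic = re.search(r':"(.*?)"', text).group(1)

# A key-specific variant (an assumption, not from the answer) that also
# tolerates optional whitespace around the colon.
specific = re.search(r'"task_id"\s*:\s*"([^"]*)"', text).group(1)

# The JSON-library route recommended above.
parsed = json.loads(text)["task_id"]

print(generic, specific, parsed)
```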

Replace characters in JSON Keys

I have a very odd question about the JSON returned to me by the D&B Matching API.
The problem I am facing is that it contains some very oddly formatted keys:
"ReliabilityText":{
"#DNBCodeValue":9092,
"$":"Actual"
}
I want to replace the '#' and '$' characters with simple text.
I thought of a solution using a regular expression, but I haven't been able to find one so far.
One thing I forgot to mention is that I am using Salesforce Apex to run the code.
Thank you in advance!

Docusign Prefilling fields with Java and XML does not appear to work

I'm trying to send an envelope from a template using the REST API. I'm using Java with XML since the Java example is given with XML only. Here:
http://iodocs.docusign.com/APIWalkthrough/requestSignatureFromTemplate
My template is very simple. It has:
1. A data field called Material1
2. A data field called Quantity1
3. A Full Name field
4. Signature field
5. Date Signed field
The Java code I'm using is exactly as it appears in the API Walkthrough link that I provided above. The XML that I supply is:
<envelopeDefinition xmlns="http://www.docusign.com/restapi">
  <accountId>ZZZZZZZZZZ</accountId>
  <status>sent</status>
  <emailSubject>DocuSign API Call - Signature request from template</emailSubject>
  <templateId>1886EC14-153E-4E05-AFF8-04F508098E60</templateId>
  <templateRoles>
    <templateRole>
      <name>Michael</name>
      <email>michael#company.com</email>
      <roleName>Signer1</roleName>
      <tabs>
        <textTabs>
          <text>
            <tabLabel>\*Material1</tabLabel>
            <value>MTX80HD</value>
          </text>
          <text>
            <tabLabel>\*Quantity1</tabLabel>
            <value>11</value>
          </text>
        </textTabs>
      </tabs>
    </templateRole>
  </templateRoles>
</envelopeDefinition>
However, the value MTX80HD is not being prefilled in the Material1 field, nor do I see 11 in Quantity1.
I've read multiple posts here and followed every suggestion I found, but I still can't get the prepopulation to work. The Full Name and Date Signed fields are filled in, however.
TIA
Edit 1:
OK. I converted the XML to JSON, as @ergin suggested below, and the fields are still not prepopulated. So the issue must lie elsewhere and not with the XML.
Here's the JSON I'm sending:
{"account":"MyAccountId","status":"sent","emailSubject":"DocuSign API Call - Test signature request from template","templateId":"1886EC14-153E-4E05-AFF8-04F508098E60","templateRoles":[{"name":"Michael","email":"michael#company.com","roleName":"Signer1","tabs":{"textTabs":[{"tabLabel":"Material1","value":"MTX80HD"}]}}]}
The URL I'm sending the above JSON to is:
https://demo.docusign.net/restapi/v2/accounts/MyAccountId/envelopes
The fields I'm trying to populate with the tabs were created on the template as Data Field widgets.
I must be missing something obvious, as I see people posting that they got it to work, eventually.
Hope to hear from anyone who has any ideas.
Thanks.
Do you have multiple tabs that use the same tabLabel, or just one? You only need the escape sequence when you have multiple; however, even then you need an extra backslash (\). You have this:
<tabLabel>\*Material1</tabLabel>
But it should be this instead:
<tabLabel>\\*Material1</tabLabel>
However if you only have one tab with the name Material1 then don't use \\* at all.
Also, if you want to use JSON instead you can look at some of the other API walkthroughs in other languages that show JSON, or you can use the auto-generated REST API help page:
https://www.docusign.net/restapi/help
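For the JSON route, here is a rough Python sketch that only builds the request body, using the field names from the JSON shown in the question. The helper name build_envelope_body and the example.com address are hypothetical, and the sketch assumes one tab per label, so the labels are plain with no wildcard prefix. It does not send anything to the API.

```python
import json


def build_envelope_body(template_id, name, email, role_name, text_values):
    """Build the JSON request body; text_values maps tabLabel -> value.

    Hypothetical helper: field names are taken from the question's JSON.
    """
    text_tabs = [{"tabLabel": label, "value": value}
                 for label, value in text_values.items()]
    return {
        "status": "sent",
        "emailSubject": "DocuSign API Call - Signature request from template",
        "templateId": template_id,
        "templateRoles": [{
            "name": name,
            "email": email,
            "roleName": role_name,
            "tabs": {"textTabs": text_tabs},
        }],
    }


body = build_envelope_body(
    "1886EC14-153E-4E05-AFF8-04F508098E60",
    "Michael",
    "signer@example.com",  # placeholder address, not from the question
    "Signer1",
    {"Material1": "MTX80HD", "Quantity1": "11"},
)
print(json.dumps(body, indent=2))
```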

Split a string in semantic MediaWiki

I want to add a link using an existing string in my wiki page.
This string will be appended to a url to form a complete URL.
This string consists of many words, for example "Crisis Management in International Computing"
I want to split it on spaces and then construct this string: "Crisis+Management+in+International+Computing"
Here is the String variable I have in my wiki page:
{{SUBJECTPAGENAME}}
Note: I have to check first whether the string consists of more than one word. If the string is just one word, like "Crisis", I won't perform the split.
I searched the web and did not find clear syntax to use for this.
Has anyone dealt with this before?
If I understand correctly from the comments, you want to replace all occurrences of a space in your string with +. That can be done with the string functions of the ParserFunctions extension.
If you are running a fairly recent version of MediaWiki (>1.18, check by going to Special:Version), the ParserFunctions extension is bundled with the software. You just need to enable it by adding the following to LocalSettings.php:
require_once "$IP/extensions/ParserFunctions/ParserFunctions.php";
$wgPFEnableStringFunctions = true;
Then you will be able to write e.g.
{{#replace: {{SUBJECTPAGENAME}} |<nowiki> </nowiki>|+}}
Note however that if all you really want is a url version of a page name, you can just use {{SUBJECTPAGENAMEE}} instead of {{SUBJECTPAGENAME}}.
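To illustrate what the replacement produces, here is a small Python sketch, purely as a model of the transformation (in the wiki itself you would use the parser function). The base URL is a made-up placeholder. It also shows why the single-word check from the question isn't needed: replacing spaces in a string without spaces changes nothing.

```python
from urllib.parse import quote_plus

# Hypothetical base URL, only for illustration.
BASE_URL = "https://example.org/search?q="


def plus_join(page_name: str) -> str:
    # Replacing every space is a no-op for a single-word name,
    # so no separate word-count check is needed.
    return page_name.replace(" ", "+")


title = "Crisis Management in International Computing"
link = BASE_URL + plus_join(title)

print(plus_join(title))
print(plus_join("Crisis"))
print(link)

# quote_plus does the same for spaces and also escapes other
# characters that are unsafe in a query string.
print(quote_plus(title))
```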
I would recommend writing a custom parser function.
Or, as a hack, try splitting the string with the arraymaptemplate parser function that comes with Semantic Forms.
URL : arraymaptemplate parser function.
You can use an intro template to create the link, and use the array template to split the string and add the words to the intro template.
I have not tried it with a space as the delimiter character, but from the documentation it seems it should work if you use the HTML encoding for a space.