My goal is to find a predefined string in an HTML source of a specific site that I have extracted using c++, but I'm getting some errors. Here is my source code so far:
So after I connect to the internet and the site and all I have this...
addr = InternetOpenUrl...
dmbp = char dmbp[5000]
dba = DWORD dba = 0
while (InternetReadFile(addr, dmbp, 80000, &dba) && dba)
{
string str2 = dmbp;
size_t sf1 = str2.find(string1);
if (sf1!=string::npos)
{printf("found");
// manipulate it...
}else{printf("not found");}
}
My problem is that it never actually confirms that it found the value that I need, it always says that the value is not found, but I even statically insert the page and look at myself and i can see the value that i need, it just doesnt show up. Does anyone with experience in html extraction with c++ know what I'm missing or how I can get this to work?
There is nothing wrong with the string search code as far as I can see, the problem is that we don't know exactly what you are searching for.
As pure HTML can be full of special characters (such as " or ", the string you might be looking for should deal with those characters. Also, strings can contain newlines and html tags (such as <b></b> within a single word), and they should be specified in the search string as string::find looks for an exact match (including any newline).
Also, I suggest debugging your code and see if the website's text/code is actually loaded into str2.
Looking at the information given that's currently the only issue I can think of why your code doesn't work.
Related
With a lot of help from people in this site, I managed to get some Json data from an amazon page. The data, for example, looks like this.
https://jsoneditoronline.org/?id=9ea92643044f4ac88bcc3e76d98425fc
First I have a list of strings which is converted to a string.
script = response.xpath('//script/text()').extract()
#For example, I need the variationValues data
variationValues = re.findall(r'variationValues\" : ({.*?})', ' '.join(script))[0]
Then, in my code, I have this (not a great name, will be changed later)
variationValuesJson = json.loads(variationValues)
variationValuesJson is in fact a dictionary, so doing something like this
variationValues["size_name"][3]
Should return "5.5 M US"
My issue is that, when running the program, I get the string indices must be integers error. Anyone knows whats wrong?
Note: I have tried using 'size_name' instead of "size_name", same error
variationValues["size_name"][3] #this is the raw string which you have converted to variationValuesjson
I think this is not what you actually want.
Your code should be this.
variationValuesJson['size_name'][3] #use variationValuesjson ;)
I was under the impression that UTF-8 was the answer to everything :0
Problem: Using Play's idiomatic form handling to go from a web page (basic HTML Text Area Input field) to a MySQL database through the Anorm abstraction layer (so all properly escaped) and then reading the database to gather that data and create an email using the JavaMail API's to send HTML email with alternate characters (accented characters like é for example. (I'd post more but I suspect we might get strange artifacts here as well -- I'll try that in a comment below perhaps)
I can use a moderate set of characters and create a TEXT email (edited via Atom and placed into the stream directly at the code level) and it comes through as an email with all the characters I've chosen in tact.
I have not yet systematically worked through the characters I was just using a relatively random sampling as an initial test.
I place the same set of characters into a text field and try to save them to the database and I can only save about 1 in 5 or less of them.
The errors look like this:
SQLException: Incorrect string value: '\xC4\x93\x0D\x0A\x0D\x0A...' for column 'content' at row 1
I suspect I'm about to learn a ton of new information about either Play and/or UTF-8 or HTML or some part of the chain where this is going off the rails.
My question then is this: Is there an idiomatic Play example of how to handle UTF-8 end to end through Anorm and into Java Mail?
(I think I kinda expected it to be "built-in" but then I expected a LOT more to be baked into the core product as well...)
I want/need both a TEXT and and HTML path for the email portion. (I can write BOTH and they work fine -- the problem is moving alternate characters though the channels as indicated above).
I'm currently seeing if this might be an answer:
https://objectpartners.com/2013/04/24/html-encoding-utf-8-characters/
However presently hitting this roadblock...
How to turn off specific Implicit's in Scala that prevent code from compiling due to overloaded methods?
This appears to be a hopeful candidate -- I am researching it now end to end.
import org.apache.commons.lang3._
def htmlEncode(input: String) = htmlEncode_sb(input).toString
def htmlEncode_sb(input: String, stringBuilder: StringBuilder = new StringBuilder()) = {
stringBuilder.synchronized {
for ((c, i) <- input.zipWithIndex) {
if (CharUtils.isAscii(c)) {
// Encode common HTML equivalent characters
stringBuilder.append(StringEscapeUtils.escapeHtml4(c.toString()))
} else {
// Why isn't this done in escapeHtml4()?
stringBuilder.append(s"""&#${Character.codePointAt(input, i)};""")
}
}
stringBuilder
}
}
In order to get it to work inside Play you'll need this in your build.sbt file
"org.apache.commons" % "commons-lang3" % "3.4",
This blog post lead me to write that code: https://objectpartners.com/2013/04/24/html-encoding-utf-8-characters/
Update: Confirmed that it does work end to end.
Web Page Input as TextArea inside a Form saved to MySQL database escaped by Anorm, reread from database and displayed inside a TextArea on a web page with extended characters (visually) appearing precisely as input.
You'll need to call #Html(htmlContentString) inside the Twirl template to re-render this as the original HTML but the browser (Safari 8.0.7) displayed exactly what I gave it after a round trip to and from the database.
One caveat -- it creates machine readable HTML not human readable HTML. It would be nice if it didn't encode angle brackets and such so it looks more like HTML that we expect. I'm sure a pattern match block will be added next to exclude just that :)
[This may not be precisely a programming question, but it's a puzzle that may best be answered by programmers. I tried it first on the Pro Webmasters site, to overwhelming silence]
We have an email address verification process on our website. The site first generates an appropriate key as a string
mykey
It then encodes that key as a bunch of bytes
&$dac~ʌ����!
It then base64 encodes that bunch of bytes
JiRkYWN+yoyIhIQ==
Since this key is going to be given as a querystring value of a URL that is to be placed in an HTML email, we need to first URLEncode it then HTMLEncode the result, giving us (there's no effect of HTMLEncoding in the example case, but I can't be bothered to rework the example)
JiRkYWN%2ByoyIhIQ%3D%3D
This is then embedded in HTML that is sent as part of an email, something like:
click here.
Or paste <b>http://myapp/verify?key=JiRkYWN%2ByoyIhIQ%3D%3D</b> into your browser.
When the receiving user clicks on the link, the site receives the request, extracts the value of the querystring 'key' parameter, base64 decodes it, decrypts it, and does the appropriate thing in terms of the site logic.
However on occasion we have users who report that their clicking is ineffective. One such user forwarded us the email he had been sent, and on inspection the HTML had been transformed into (to put it in terms of the example above)
click here
Or paste <b>http://myapp/verify?key=JiRkYWN+yoyIhIQ%3D%3D</b> into your browser.
That is, the %2B string - but none of the other percentage encoded strings - had been converted into a plus. (It's definitely leaving us with the right values - I've looked at the appropriate SMTP logs).
key=JiRkYWN%2ByoyIhIQ%3D%3D
key=JiRkYWN+yoyIhIQ%3D%3D
So I think that there are a couple of possibilities:
There's something I'm doing that's stupid, that I can't see, or
Some mail clients convert %2b strings to plus signs, perhaps to try to cope with the problem of people mistakenly URLEncoding plus signs
In case of 1 - what is it? In case of 2 - is there a standard, known way of dealing with this kind of scenario?
Many thanks for any help
The problem lies at this step
on inspection the HTML had been transformed into (to put it in terms of the example above)
click here
Or paste <b>http://myapp/verify?key=JiRkYWN+yoyIhIQ%3D%3D</b> into
your browser.
That is, the %2B string - but none of the other percentage encoded
strings - had been converted into a plus
Your application at "the other end" must be missing a step of unescaping. Regardless of if there is a %2B or a + a function like perls uri_unescape returns consistent answers
DB<9> use URI::Escape;
DB<10> x uri_unescape("JiRkYWN+yoyIhIQ%3D%3D")
0 'JiRkYWN+yoyIhIQ=='
DB<11> x uri_unescape("JiRkYWN%2ByoyIhIQ%3D%3D")
0 'JiRkYWN+yoyIhIQ=='
Here is what should be happening. All I'm showing are the steps. I'm using perl in a debugger. Step 54 encodes the string to base64. Step 55 shows how the base64 encoded string could be made into a uri escaped parameter. Steps 56 and 57 are what the client end should be doing to decode.
One possible work around is to ensure that your base64 "key" does not contain any plus signs!
DB<53> $key="AB~"
DB<54> x encode_base64($key)
0 'QUJ+
'
DB<55> x uri_escape('QUJ+')
0 'QUJ%2B'
DB<56> x uri_unescape('QUJ%2B')
0 'QUJ+'
DB<57> $result=decode_base64('QUJ+')
DB<58> x $result
0 'AB~'
What may be happening here is that the URLDecode is turning the %2b into a +, which is being interpreted as a space character in the URL. I was able to overcome a similar problem by first urldecoding the string, then using a replace function to replace spaces in the decoded string with + characters, and then decrypting the "fixed" string.
I have three dropdown boxes on a Main_Form. I will add the chosen content into three fields on the form, Form_Applications.
These three lines are added :
Form_Applications.Classification = Form_Main_Form.Combo43.Value
Form_Applications.Countryname_Cluster = Form_Main_Form.Combo56.Value
Form_Applications.Application = Form_Main_Form.Combo64.Value
The first two work perfectly but the last one gives error code 438!
I can enter in the immediate window :
Form_Applications.Classification = "what ever"
Form_Applications.Countryname_Cluster = "what ever"
but not for the third line. Then, after enter, the Object doesn't support this property or method error appears.
I didn't expect this error as I do exactly the same as in the first two lines.
Can you please help or do you need more info ?
In VBA Application is a special word and should not be used to address fields.
FormName.Application will return an object that points to the application instance that is running that form as opposed to an object within that form.
From the Application object you can do all sorts of other things such as executing external programs and other application level stuff like saving files/
Rename your Application field to something else, perhaps ApplicationCombo and change your line of code to match the new name. After doing this the code should execute as you expect.
Form_Applications.Application is referring to the application itself. It is not a field, so therefore it is not assignable (at least with a string).
You really haven't provided enough code to draw any real conclusions though. But looking at what you have posted, you definitely need to rethink your approach.
It's to say definitely but you are not doing the same. It looks like you are reading a ComboBox value the same (I will assume Combo64 is the same as 43 and 56) but my guess is that what you are assigning that value to is the problem:
Form_Applications.Application =
Application is not assignable. Is there another field you meant to use there?
How can I have a clean html ouput for search result pages? Each time I try to include special characters like "&" as part of the search term, I usually get results with "&" highlighted yet includes the HTML entity. Thus, the results has &, " etc...Here's a screenshot sample - http://min.us/mt3rOV5zVtOh6
Meanwhile, when I do my searches with "&" included in the search term, the result yields to having a clean output.
The piece of code in search-result.tpl.php
http://pastebin.com/zCmMJLNh
I've already tried several decoding functions but no success. Been trying to fix this for days already. The site is using Drupal 6 and the search module has been overridden.
You say "...the search module has been overridden" this could be the cause of why the search snippet remains htmlentityencoded on output ( e.g check_plain'd escaped html )
A better fix would be to find the cause in the modification, e.g a preprocess function that modifies the search snippet ( if any )
Alternatively, you could probably run the $snippet through decode_entities
i.e print decode_entities($snippet)
Assuming, the html is already escaped, as if not, can be a security risk.
See also: http://php.net/manual/en/function.html-entity-decode.php
and: http://www.php.net/manual/en/function.htmlspecialchars-decode.php
Well, you could try drupal_html_to_text to convert the snippet into plain text.
The right way is probably to figure out why those results aren't getting converted. Based on your comments it looks like the problem is only when you search specifically for "&". More specifically, it's the regex in the search.module (/modules/search/search.module - line 1188 in 6):
preg_match_all('/ ("([^"]+)"|(?!OR)([^" ]+))/', ' '. $keys, $matches);
It only matches spaces before the keyword (not after). You could modify the $keys here like:
if ($keys == '&') $keys = '&'
Or something like that (of course that means hacking core - meh).
You could also possibly add a form_alter via a module and modify the search form (see this link on how to add the form_alter). Then you could add a custom submit handler which would alter the search term in the form before it is submitted.