Is there a name for font families (such as fangchan-secret) that are used to prevent web scraping? - html

While trying to scrape some data from the website of a housing agency in China (the agency is called Anjuke) for a small personal project, I realized that all of the numbers on the website are visually displayed as numbers, but are read programmatically as obscure Chinese characters.
Is there a name for this kind of a font or this kind of a technique more specific than "anti-scraping measures"?
Additional information about this specific case: To see this in action, click on any of the listings on the Anjuke website and attempt to copy and paste the price (or any HTML element with the "strongbox" class). Instead of the number, it pastes an obscure Chinese character (such as 驋, 齤, 麣, 龤, or 龒).
Looking at the CSS revealed that these numbers use a font called "fangchan-secret", and a bit of quick googling led to a blog post in Chinese by zhyuzh3d. I read some Chinese, although not loads. The post appears to explain how fangchan-secret is a method to prevent web scraping, and also how to get around this preventative measure.
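For what it's worth, the workaround the blog post describes boils down to downloading the custom font file and inspecting its character map. Below is a minimal sketch using the fontTools and requests libraries; the font URL is a placeholder, and the final mapping from glyph names back to digits is site-specific, so treat this as an outline rather than a working scraper.

from io import BytesIO

import requests
from fontTools.ttLib import TTFont

# Placeholder URL -- the real one is embedded in the page's @font-face rule
woff = requests.get("https://example.com/fangchan-secret.woff").content
font = TTFont(BytesIO(woff))

# The cmap maps each obscure codepoint (such as the ones for 驋 or 齤) to an
# internal glyph name; pairing those glyph names with the digit shapes they
# actually draw gives you a table for decoding the displayed numbers.
for codepoint, glyph_name in font.getBestCmap().items():
    print(f"U+{codepoint:04X} -> {glyph_name}")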

Related

Maintaining font style/formatting into a form that doesn't support html/markdown

I have looked into previous postings in this area but haven't found any relevant answers, so perhaps I am asking the wrong question.
On the popular design site Dribbble, there seem to be interesting formatting changes in profile names that break from the conventions of the site's styling.
A lot of people have been adding special characters (ΔδΓ etc.), which can be achieved by pasting them into the profile form and saving changes, yet some users have somehow managed to enter formatted versions of their name despite the profile form not supporting HTML or Markdown. You can see an example in the images below.
An example of copying the font to Google with maintained formatting
When opening in inspector, it also shows the formatted type
How could this be done in a simple text input form that doesn't support HTML/Markdown?
These are almost certainly Unicode characters, just like these characters that you reference in your question: ΔδΓ.
For example, Unicode's mathematical alphanumeric symbols section includes symbols that look like the ones in your screenshot. Since these are separate Unicode characters there is no need for additional formatting.
Users will need to have a font that supports those characters installed locally to view them.
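If you want to verify this yourself, a short Python sketch shows that a styled-looking letter is simply a different code point, with no markup involved:

import unicodedata

# A plain Latin "A" versus U+1D400 MATHEMATICAL BOLD CAPITAL A: they may
# render similarly, but they are entirely distinct characters to software.
for ch in ("A", "\U0001D400"):
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}  {ch}")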

What are these strange characters in HTML source?

My friend runs a website and had an e-mail from Google Safesearch informing him he was hosting a phishing page. It turns out his cPanel was brute-forced (weak password) and the attackers uploaded some of their pages onto his server. He told me about it, and I wanted to take a look at how sophisticated they are.
In many of the files, certain words and portions of text are strange. They display perfectly in a web browser but are jumbled inside the HTML. I was wondering if anyone can tell me what this is?
Examples:
<title>WеlÑоmе tо еВаy: Sign in</title>
<span class="txtbox_title">Раsswоrd</span>
<a class="three" href="#">Fоrgоt yоur
It's also worth noting that there is normal text throughout the page that displays perfectly also.
I assume this is to stop the detection of certain words in the page, but I'm not sure. Any information would be great.
Edit: This was originally tagged as PHP. I realised that it probably shouldn't be, so I removed the tag. Be nice, kids.
Edit edit: For clarity, it's a phishing page targeting eBay users.
The examples I posted in the original post are (in order):
eBay: Sign In
Your Password
Forgot your [password]
As such, I don't believe it to be any sort of malware, but rather a method of obfuscating text to evade detection in browsers such as Chrome (which I assume detect 'hot' words in their algorithms).
These are UTF-8 encoded Cyrillic letters, and possibly other characters, chosen for their visual similarity to common Latin letters. You are viewing the page in an editor that interprets the data not as UTF-8 but as Latin-1.
For example, what you see as “о” is actually two bytes, 0xD0 0xBE. When interpreted as UTF-8 data (which is what browsers do here), they represent “о”, U+043E CYRILLIC SMALL LETTER O. It is identical in visual appearance to the common Latin letter “o” (in any font that contains both letters), but it is coded as a separate character because it belongs to a different writing system. To any program, they are quite distinct characters, unless the program has been separately coded to handle “confusables”.
Such confusion is often created intentionally, for various reasons. You are probably right in assuming that here the purpose was “to stop the detection of certain words in the page”. When, for example, “Forgot” is written using Cyrillic o’s (Fоrgоt), normal Find operations will not find it when searching for “Forgot”.
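A quick way to see both effects at once is to decode the same two bytes under each encoding and to compare a homoglyph string against its Latin twin; a small Python sketch:

data = b"\xd0\xbe"            # the two bytes discussed above
print(data.decode("utf-8"))   # о  (U+043E CYRILLIC SMALL LETTER O)
print(data.decode("latin-1")) # Ð¾ (what a Latin-1 viewer displays)

# The homoglyph defeats naive string matching:
print("Fоrgоt" == "Forgot")   # False -- those o's are Cyrillic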
My best guess is that it is a custom type of keylogger. The “WеlÑоmе tо еВаy” would be parsed by the keylogger to output some data into a database that can be mined later for important information.
My second guess is that it is a means to scare or mess with the person who owns the site.
My third guess is that the virus was written in Chinese or some other language, and when the code was translated back into UTF-8 some of the unused characters produced the strange content.
EDIT
My fourth guess is that the phishing website was programmatically getting the source code of the eBay site and parsing it into its own HTML file, and that eBay has its own countermeasures against this type of attack, scrambling the letters in the source code.
If so, there must be some type of JavaScript that undoes the scrambling in the original source code.

E-Book creation using MSWord (in HTML), how to include references

I'm converting a previously published book to an e-book, which will be self-published on Amazon. Things are going well, I'm using several online guides to correct stylistic errors and the like. One issue has arisen: The book is on nutritional science, and the print version contains over 200 references (included in the text as superscript). What are my options as far as hyperlinking them to the ebook text? I would prefer the references to appear as hyperlinked superscript, which could pull up a popup box when selected (much like the dictionary function on most e-books). Is this possible? Or should I use the bookmark/hyperlink functions within word and put references in their own end section?
Popup functionality for footnotes is not a built-in part of the EPUB spec. Apple has implemented it for iBooks as long as you mark the code up properly with epub:type declarations. You'll want the superscripted number set up like <a epub:type="noteref" href="#note24">24</a> and the corresponding footnote to be <p id="note24" epub:type="footnote">Content goes here</p>
The important bits are the epub:type declarations. Note that using epub:type implies that you're using EPUB 3.0 rather than EPUB 2.0.1. If you're just using MS Word, then you're out of luck on that score.
Further details can be found on Liz Castro's blog: http://www.pigsgourdsandwikis.com/2012/05/creating-pop-up-footnotes-in-epub-3-and.html

Recognizing superscript characters using OCR

I've started a simple project in which I must take an image containing text with superscripts and then, using OCR (currently Tesseract), recognize the superscript characters as well as the normal ones.
For example, given a chemical formula such as Cl², when I use Tesseract to recognize it, it gives me Cl2 (all on one line).
So, what is the solution for this problem? Is there any other OCR API that has the ability to read superscripts?
Very good question that touches on the more advanced features of an OCR system.
First of all, make sure you are not overlooking functionality that may already be there in the OCR system. Look at your test result not in plain TXT format but in some kind of rich-text-capable viewer. TXT viewers, such as Notepad on Windows, often do not support superscript/subscript characters, so even if the OCR gave you the correct characters, your viewer may have converted them in order to display them. If you are accessing the text result programmatically, that is less of an issue, because you should get the proper subscript character value when accessing it directly; just note that viewers must support it for you to actually see it. If you have eliminated this possible post-processing conversion and made sure that no subscript is returned from the OCR, then it probably does not support superscripts/subscripts.
Just like in this text box: in your original question you tried to give us a superscript character example, but the text box did not accept it even though you could copy and paste it from elsewhere.
Many OCR engines will see a subscript as any other normal character, if they can see it at all. The OCR you use needs to have the technical capability to actually produce superscripts/subscripts, and many do, but unsurprisingly those tend to be commercial OCR systems.
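If you are stuck with Tesseract, one crude workaround is to use its per-character bounding boxes and flag characters that are noticeably smaller than, and raised above, the rest of the line. A rough sketch using pytesseract, assuming a single-line input image (the filename is a placeholder); multi-line images would need line grouping first:

import pytesseract
from PIL import Image

img = Image.open("formula.png")  # placeholder input image

# image_to_boxes yields one line per character: "char left bottom right top page",
# with coordinates measured from the bottom-left corner of the image
boxes = [line.split() for line in pytesseract.image_to_boxes(img).splitlines()]

heights = sorted(int(t) - int(b) for _, l, b, r, t, _ in boxes)
bottoms = sorted(int(b) for _, l, b, r, t, _ in boxes)
typical_height = heights[len(heights) // 2]  # median character height
baseline = bottoms[len(bottoms) // 2]        # median character bottom

for ch, l, b, r, t, _ in boxes:
    height, bottom = int(t) - int(b), int(b)
    # Heuristic: clearly smaller than the line and sitting above the baseline
    if height < 0.75 * typical_height and bottom > baseline + 0.25 * typical_height:
        print(f"possible superscript: {ch}")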
I made a small test case before answering. I generated an image with a few superscript/subscript examples for my testing (of course E=mc² was the first example that came to mind :).
You can find my test image here:
www.ocr-it.com/documents/superscript_subscript_test_page.tif
I processed this image through the OCR-IT OCR Cloud 2.0 API using all default settings, but exported to a rich text format, such as MS Word .DOC.
You can find the resulting file here:
www.ocr-it.com/documents/superscript_subscript_test_page_result.doc
Also note: when you are interested in extracting superscript/subscript characters, pay more attention to image quality than you would with typical text. Those characters are tiny, and you need sufficient detail and resolution to achieve decent OCR quality. Even images scanned at 300 dpi sometimes have issues with tiny characters due to too few pixels. If you are considering mobile and digital cameras, that becomes even more important.
DISCLOSURE: My specialty is implementing internal OCR solutions for companies of different sizes. My company is WiseTREND. Contact me directly if I can assist with anything further.

Best practices for internationalizing web applications?

Internationalizing web apps always seems to be a chore. No matter how much you plan ahead for pluggable languages, there's always issues with encoding, funky phrasing that doesn't fit your templates, and other problems.
I think it would be useful to get the SO community's input for a set of things that programmers should look out for when deciding to internationalize their web apps.
Internationalization is hard. Here are a few things I've learned from working on two websites that were available in over 20 different languages:
Use UTF-8 everywhere. No exceptions. HTML, server-side language (watch out for PHP especially), database, etc.
No text in images unless you want a ton of work. Use CSS to put text over images if necessary.
Separate configuration from localization. That way localizers can translate the text and you can deal with different configurations per locale (features, layout, etc). You don't want localizers to have the ability to mess with your app.
Make sure your layouts can deal with text that is 2-3 times longer than the English, and also with text that is 50% shorter (Japanese and Chinese are often shorter).
Some languages need larger font sizes (Japanese, Chinese)
Colors are locale-specific also. Red and green don't mean the same thing everywhere!
Add a classname that is the locale name to the body tag of your documents. That way you can specify a specific locale's layout in your CSS file easily.
Watch out for variable substitution. Don't split your strings; leave them whole, like "You have X new messages", and replace the 'X' with the number.
Different languages have different pluralization rules (0, 1, 2-4, 5-7, 7-infinity), which are hard to deal with by hand; see the gettext sketch after the resource links.
Context is difficult. Sometimes localizers need to know where/how a string is used to make sure it's translated correctly.
Resources:
http://interglacial.com/~sburke/tpj/as_html/tpj13.html
http://www.ryandoherty.net/2008/05/26/quick-tips-for-localizing-web-apps/
http://ed.agadak.net/2007/12/one-potato-two-potato-three-potato-four
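To illustrate the variable-substitution and pluralization points above: the usual approach is to keep the whole string in a message catalog and let a library such as gettext pick the plural form. A minimal Python sketch, assuming compiled .mo catalogs exist under ./locale (with fallback=True it degrades to English if they don't):

import gettext

t = gettext.translation("messages", localedir="locale",
                        languages=["cs"], fallback=True)

n = 3
# The string stays whole; the catalog decides which plural form applies
# (Czech, for instance, uses separate forms for 1, 2-4, and 5 or more).
print(t.ngettext("You have %d new message",
                 "You have %d new messages", n) % n)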
In my company all our strings are stored in *.properties files. Our build tools build a "test language" copy of the properties files, which replaces a string like this:
Click here
with something like this:
[~~ Çļïčк н∑ѓё ~~ タウ ~~]
Now, when we set the language to "test" in our config files, these properties files are used (and of course we don't ship the test-language files). A toy version of this transformation is sketched after the list below.
This allows us to:
Make sure that Unicode characters are displayed correctly, including Japanese/Chinese/Korean.
Make sure that the layout scales appropriately for languages with longer words (German in particular has longer words on average than English).
Spot any hard-coded strings (as they will appear in plain English).
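A toy version of that build step might look like the following in Python; the look-alike character mapping is made up, and the wrapper simply mimics the example above:

# Made-up look-alike substitutions; a real tool would cover the whole alphabet
FAKE = str.maketrans({"C": "Ç", "l": "ļ", "i": "ï", "c": "č",
                      "k": "к", "h": "н", "e": "∑", "r": "ѓ"})

def pseudolocalize(value: str) -> str:
    # Wrap and pad so truncation and hard-coded strings stand out on screen
    return "[~~ " + value.translate(FAKE) + " ~~ タウ ~~]"

print(pseudolocalize("Click here"))  # [~~ Çļïčк н∑ѓ∑ ~~ タウ ~~]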
As for the actual translation, this is done by professional translators, not developers.
As an English person living abroad, I have become frustrated by many web applications' approach to internationalization, and I have blogged about my frustrations.
My tips would be:
think about how you show an international version of a page
using geolocation might work for many users, but as my examples show for many it will not
why not use the Accept-Language header to determine which language to serve (see the sketch below)
if a user accesses a page via a search engine then don't redirect them somewhere else e.g. to a homepage in a different language
it's extremely annoying to change language and have a different page reload - either serve the same page or warn the user that the current content is not available in a different language before redirecting them
English is a very common language, so perhaps default to that
But make sure the change language option is clear on the GUI (I like what Google Maps are doing, as shown in the post)
All I see on the Web is companies getting internationalization wrong. Getting it right from a user's perspective is tricky indeed.
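As a sketch of the Accept-Language suggestion above, here is a minimal parser that honours the q-values the browser sends (the supported-language list is made up):

def pick_language(accept_language: str, supported=("en", "de", "fr")) -> str:
    # Parse e.g. "da, en-gb;q=0.8, en;q=0.7" into (language, quality) pairs
    ranges = []
    for part in accept_language.split(","):
        lang, _, q = part.strip().partition(";q=")
        ranges.append((lang.lower(), float(q) if q else 1.0))
    # Walk the user's preferences from most to least preferred
    for lang, _ in sorted(ranges, key=lambda r: r[1], reverse=True):
        if lang.split("-")[0] in supported:
            return lang.split("-")[0]
    return "en"  # a common default, as suggested above

print(pick_language("da, en-gb;q=0.8, en;q=0.7"))  # -> en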
I have a couple of apps that are "bilingual".
I used resource files in ASP.NET 1.1.
There is also something called the String Resource Tool.
Basically you put all your strings in a .RES file for both languages and then determine which file to read from based on the culture, or on whether someone clicked a link for the language.
The biggest gotcha is making sure the translations are done correctly.