I'm using xpdf in an AIR app to convert PDFs to PNGs on the fly. Before conversion I want to get a page count and am using xdf's pdfinfo utility to print to stdout and then parsing that string to get the page count.
My first pass solution: split the string by line breaks, test the resulting array for the "Pages:" string, etc.
My solution works but it feels clunky and fragile. I thought about replacing all the double spaces, doing a split on ":" and building a hash table – but there are timestamps with colons in the string which would screw that up.
Is there a better or smarter way to do this?
protected function processPDFinfo(data:String):void
{
var pageCount:Number = 0;
var tmp:Array = data.split("\n");
for (var i:int = 0; i < tmp.length; i++){
var tmpStr:String = tmp[i];
if (tmpStr.indexOf("Pages:") != -1){
var tmpSub:Array = tmpStr.split(":");
if (tmpSub.length){
pageCount = Number(tmpSub[tmpSub.length - 1]);
}
break;
}
}
trace("pageCount", pageCount);
}
Title: Developing Native Extensions
Subject: Adobe Flash Platform
Author: Adobe Systems Incorporated
Creator: FrameMaker 8.0
Producer: Acrobat Distiller Server 8.1.0
CreationDate: Mon Dec 7 05:45:39 2015
ModDate: Mon Dec 7 05:45:39 2015
Tagged: yes
Form: none
Pages: 140
Encrypted: no
Page size: 612 x 783 pts (rotated 0 degrees)
File size: 2505564 bytes
Optimized: yes
PDF version: 1.4
Use regular expressions like this one for example:
/Pages:\s*(\d+)/g
The first (and only) capturing group is the string of digits you are looking for.
var pattern:RegExp = /Pages:\s*(\d+)/g;
var pageCount:int = parseInt(patern.exec(data)[1]);
I understand about 2% of that (/Pages: /g). It is looking for the string literal Pages: and and then something with spaces wildcard and escaping d+??
I know, regex can be hard. What really helps creating them is if your IDE supports them. There are also online tools like regexr (me first time using version 2 here and it's even better than version 1, very nice!) In general, you want to have a tool that gives you immediate visual feedback of what's being matched.
Below is a screenshot with your text and my pattern in regexr.
You can hover over things and get all kinds of information.
The sidebar to the left is a full fledged documentation on regex.
The optional explain tab goes through the given pattern step by step.
\s* is any amount of whitespace characters and \d+ is at least one numeric digit character.
and returning an array??
This is the As3 part of the story. Once you create a RegExp object with he pattern, you can use exec() to execute it on some String. (not sure why they picked the retarded abbreviation for the method name)
The return value is a little funky:
Returns
Object — If there is no match, null; otherwise, an object with the following properties:
An array, in which element 0 contains the complete matching substring, and other elements of the array (1 through n) contain substrings that match parenthetical groups in the regular expression
index — The character position of the matched substring within the string
input — The string (str)
You have to check the documentation of exec() to really understand this. It's kind of JS style, returning a bunch of variables held together in a generic object that also acts as an array.
This is where the [1] in my example code comes from.
Related
I'm currently writing my own DrawTextEx() function that supports emoticons. Using this function, a callback is called every time an emoticon is found in the text, giving the opportunity to caller to replace the text segment containing the emoticon by an image. For example, the Unicode chars 0x3DD8 0x00DE found in a text will be replaced by a smiling face image while the text is drawn. Actually this function works fine.
Now I want to implement an image library on the caller side. I receive a text segment like 0x3DD8 0x00DE in my callback function, and my idea is to use this code as key in a map containing all the Unicode combinations, every one linked with a structure containing the image to draw. I found a good package on the http://emojione.com/developers/ website. All the packages available on this site contain several file names, that is an hexadecimal code. So I can iterate through the files contained in the package, and create my map in an automatic way.
However I found that these codes are part of another standard, and are in fact a set of items named "HTML entity", apparently used in the web development, as it can be seen on the http://graphemica.com/%F0%9F%98%80 website. So, to be able to use these files, I need a solution to convert the HTML entity values contained in their names into an UTF16 code. For example, in the case of the above mentioned smiling face, I need to convert the 0x1f600 HTML entity code to the 0x3DD8 0x00DE UTF16 code.
A brute force approach may consist to write a map that converts these codes, by adding each of them in my code, one by one. But as the Unicode standard contains, in the most optimist scenario, more than 1800 combinations for the emoticons, I want to know it there is an existing solution, such as a known API or function, that I may use to do the job. Or is there a known trick to do that? (like e.g. "character + ('a' - 'A')" to convert an uppercase char to lower)
Regards
For example, the Unicode chars 0x3DD8 0x00DE found in a text will be replaced by a smiling face image
The character U+1F600 Grinning Face 😀 is represented by the UTF-16 code unit sequence 0xD83D, 0xDE00.
(Graphemica swapping the order of the bytes for each code unit is super misleading; ignore that.)
I found that these codes are part of another standard, and are in fact a set of items named "HTML entity", apparently used in the web development
HTML has nothing to do with it. They're plain Unicode characters—just ones outside the Basic Multilingual Plane, above U+FFFF, which is why it takes more than one UTF-16 code unit to represent them.
HTML numeric character references like 😀 (often incorrectly referred to as entities) are a way of referring to characters by code point number, but the escape string is only effective in an HTML (or XML) document, and we're not in one of those.
So:
I need to convert the 0x1f600 HTML entity code to the 0x3DD8 0x00DE UTF16 code.
sounds more like:
I need to convert representations of U+1F600 Grinning Face: from the code point number 0x1F600 to the UTF-16 code unit sequence 0xD83D, 0xDE00
Which in C# would be:
string face = Char.ConvertFromUtf32(0x1F619); // "😀" aka "\uD83D\uDE00"
or in the other direction:
int codepoint = Char.ConvertToUtf32("\uD83D\uDE00", 0); // 0x1F619
(the name ‘UTF-32’ is poorly-chosen here; we are talking about an integer code point number, not a sequence of four-bytes-per-character.)
Or is there a known trick to do that? (like e.g. "character + ('a' - 'A')" to convert an uppercase char to lower)
In C++ things are more annoying; there's not (that I can think of) anything that directly converts between code points and UTF-16 code units. You could use various encoding functions/libraries to convert between UTF-32-encoded byte sequences and UTF-16 code units, but that can end up more faff than just writing the conversion logic yourself. eg in most basic form for a single character:
std::wstring fromCodePoint(int codePoint) {
if (codePoint < 0x10000) {
return std::wstring(1, (wchar_t)codePoint);
}
wchar_t codeUnits[2] = {
0xD800 + ((codePoint - 0x10000) >> 10),
0xDC00 + ((codePoint - 0x10000) & 0x3FF)
};
return std::wstring(codeUnits, 2);
}
This is assuming the wchar_t type is based on UTF-16 code units, same as C#'s string type is. On Windows this is probably true. Elsewhere it is probably not, but on platforms where wchar_t is based on code points, you can just pull each code point out of the string as a character with no further processing.
(Optimisation and error handling left as an exercise for the reader.)
I'm using the RAD Studio compiler, and fortunately it provides an implementation for the ConvertFromUtf32 and ConvertToUtf32 functions mentioned by bobince. I tested them and they do exactly what I needed.
For those that doesn't use the Embarcadero products, the fromCodePoint() implementation provided by bobince works also well. For information, here is also the ConvertFromUtf32() function as implemented in RAD Studio, and translated into C++
std::wstring ConvertFromUtf32(unsigned c)
{
const unsigned unicodeLastChar = 1114111;
const wchar_t minHighSurrogate = 0xD800;
const wchar_t minLowSurrogate = 0xDC00;
const wchar_t maxLowSurrogate = 0xDFFF;
// is UTF32 value out of bounds?
if (c > unicodeLastChar || (c >= minHighSurrogate && c <= maxLowSurrogate))
throw "Argument out of range - invalid UTF32 value";
std::wstring result;
// is UTF32 value a 16 bit value that can fit inside a wchar_t?
if (c < 0x10000)
result = wchar_t(c);
else
{
// do divide in 2 chars
c -= 0x10000;
// convert code point value to UTF16 string
result = wchar_t((c / 0x400) + minHighSurrogate);
result += wchar_t((c % 0x400) + minLowSurrogate);
}
return result;
}
Thanks to bobince for his response, which pointed me in the right direction and helped me to solve this problem.
Regards
I have created a Google Form that has a field where a numeric value is entered by the user (with numeric validation on the form) and is then submitted. When the number (e.g., 34.00) gets submitted, it appears as 34 in the Google spreadsheet, which is annoying but understandable. I already have a script that runs when the form is submitted to generate a nicely-formatted version of the information that was submitted on the form, but I'm having trouble formatting that value as a monetary value (i.e., 34 --> $34.00) using the Utilities.formatString function. Can anyone help? Thanks in advance.
The values property of a form submission event as documented in Event Objects is an array of Strings, like this:
['2015/05/04 15:00', 'amin#example.com', 'Bob', '27', 'Bill', '28', 'Susan', '25']
As a result, a script that wishes to use any of these values as anything but a String will need to do explicit type conversion, or coerce the string to number using the unary operator (+).
var numericSalary = +e.values[9];
Alternatively, you could take advantage of the built-in type determination of Sheets, by reading the submitted value from the range property also included in the event. Just as it does when you type in values at the keyboard, Sheets does its best to interpret the form values - in this case, the value in column J will have been interpreted as a Number, so you could get it like this:
var numericSalary = e.range.getValues()[9];
That will be slower than using the values array, and it will still provide an unformatted value.
Formatting
Utilities.formatString uses "sprintf-like" formatting values. If you search the interwebs, you'll find lots of references for sprint variables, some of which are helpful. Here's a format that will turn a floating-point number into a dollar-formatted string:
'$%.2f'
$ - nothing magic, just a dollar sign
% - magic begins here, the start of a format
.2 - defines a number with two decimal places, but unspecified digits before the radix
f - expect a floating point number
So this is your simplest line of code that will do the conversion you're looking for:
var currentSalary = Utilities.formatString( '$%.2f', +e.values[9] );
The correct format to get your string into a proper numeric format is as follows:
var myString = myString.toLocaleString('en-US', { style: 'currency', currency: 'USD' });
If you only care about the $ sign and don't need commas the code below (also shown in one of the answers above will suffice.
var myString = Utilities.formatString( '$%.2f', myString );
In my experience toLocaleString performs sometimes performs strangely in Apps Script as opposed to JavaScript
I know this was years ago but might help someone else.
You should be able to add the commas in between with this
.toLocaleString(); after you format the string decimal then add the '$' sign in the beginning by just concating them together.
Ex:
myString = Utilities.formatString( '%.2f', myNumber );
myString = '$' + myString.toLocaleString();
I faced difficulties to replace a string.
var expression:String = '2X3';
var inString:String = expression;
inString=inString.replace("÷","/");
inString=inString.replace("X","*");
trace('Result.....',inString);
Output:-
Result.....2*3
Its alright.
But the problem was when i tried to give input as
var expression:String = '2X3X3X4X5X6';
output:-
Result.....2*3X3X4X5X6
But i need it in form of
Result.....2*3*3*4*5*6
and same for division.
Thanks & Regards
I use this for replacing all
var result:String=inString.split("X").join("*");
I know you've already selected a answer, but it lacked an explanation and a proper solution. The reason you see this happening is that String.replace(), when the pattern is a String, only replaces the first result. The solution is to use RegEx:
var expression:String = '2x3X3x4X5X6';
var result:String = expression.replace(/x/ig, "*");
trace(result); // output 2*3*3*4*5*6
The pattern uses two flags, global and case-insensitivity. That will grab all instances of the letter X, regardless of case, and search the entire string. The benefit with RegEx is that is extremely low level. There is little-to-no overhead when using a regular expression meaning they are incredibly fast. String.split and String.join use loops to function, I believe, which are considerably slower. Additionally, you have to store an additional array in memory.
Granted, these are negligible in most cases (difference in the 10's of microseconds, maybe), but not all. I had a project the required files to be scrambled. Unfortunately, the files were too large (200MB minimum) and the doing the replace().join() method was 4-5 slower than the RegEx method. With RegEx, I managed to reduce the lag while scrambling from a few seconds to 2-3 frames.
did you try inString=inString.replaceAll("X","*");? notice the "All" suffix!
Search:
Scripting+Language Web+Pages Applications
Results:
...scripting language originally...producing dynamic web pages. It has...graphical applications....purpose scripting language that is...d creating web pages as output...
Suppose I want a value that represents the amount of characters to allow as padding on either side of the matched terms, and another value that represents how many matches will be shown in the result (ie, I want to see only the first 5 matches, nothing more).
How exactly would you go about doing this?
This is pretty language-agnostic, but I will be implementing the solution in a PHP environment, so please restrict answers to options that do not require a specific language or framework.
Here's my thought process: create an array from the search words. Determine which search word has the lowest index regarding where it's found in the article-body. Gather that portion of the body into another variable, and then remove that section from the article-body. Return to step 1. You might even add a counter to each word, skipping it when the counter reaches 3 or so.
Important:
The solution must match all search terms in a non-linear fashion. Meaning, term one should be found after term two if it exists after term two. Likewise, it should be found after term 3 as well. Term 3 should be found before term 1 and 2, if it happens to exist before them.
The solution should allow me to declare "Only allow up to three matches for each term, then terminate the summary."
Extra Credit:
Get the padding-variable to optionally pad words, rather than chars.
My thought process:
Create a results array that supports non-unique name/value pairs (PHP supports this in its standard array object)
Loop through each search term and find its character starting position in the search text
Add an item to the results array that stores this character position it has just found with the actual search term as the key
When you've found all the search terms, sort the array ascending by value (the character position of the search term)
Now, the search results will be in order that they were found in the search text
Loop through the results array and use the specified word padding to get words on each side of the search term while also keeping track of the word count in a separate name/value pair
Pseudocode, or my best attempt at it:
function string GetSearchExcerpt(searchText, searchTerms, wordPadding = 0, searchLimit = 3)
{
results = new array()
startIndex = 0
foreach (searchTerm in searchTerms)
{
charIndex = searchText.FindByIndex(searchTerms, startIndex) // finds 1st position of searchTerm starting at startIndex
results.Add(searchTerm, charIndex)
startIndex = charIndex + 1
}
results = results.SortByValue()
lastSearchTerm = ""
searchTermCount = new array()
outputText = ""
foreach (searchTerm => charIndex in results)
{
searchTermCount[searchTerm]++
if (searchTermCount[searchTerm] <= searchLimit)
{
// WordPadding is a simple function that moves left or right a given number of words starting at a specified character index and returns those words
outputText += "..." + WordPadding(-wordPadding, charIndex) + "<strong>" + searchTerm + "</strong>" + WordPadding(wordPadding, charIndex)
}
}
return outputText
}
Personally I would convert the search terms into Regular Expressions and then use a Regex Find-Replace to wrap the matches in strong tags for the formatting.
Most likely the RegEx route would be you best bet. So in your example, you would end up getting three separate RegEx values.
Since you want a non-language dependent solution I will not put the actual expressions here as the exact syntax varies by language.
Seeking a method to:
Take whitespace separated tokens in a String; return a suggested Word
ie:
Google Search can take "fonetic wrd nterpreterr",
and atop of the result page it shows "Did you mean: phonetic word interpreter"
A solution in any of the C* languages or Java would be preferred.
Are there any existing Open Libraries which perform such functionality?
Or is there a way to Utilise a Google API to request a suggested word?
In his article How to Write a Spelling Corrector, Peter Norvig discusses how a Google-like spellchecker could be implemented. The article contains a 20-line implementation in Python, as well as links to several reimplementations in C, C++, C# and Java. Here is an excerpt:
The full details of an
industrial-strength spell corrector
like Google's would be more confusing
than enlightening, but I figured that
on the plane flight home, in less than
a page of code, I could write a toy
spelling corrector that achieves 80 or
90% accuracy at a processing speed of
at least 10 words per second.
Using Norvig's code and this text as training set, i get the following results:
>>> import spellch
>>> [spellch.correct(w) for w in 'fonetic wrd nterpreterr'.split()]
['phonetic', 'word', 'interpreters']
You can use the yahoo web service here:
http://developer.yahoo.com/search/web/V1/spellingSuggestion.html
However it's only a web service... (i.e. there are no APIs for other language etc..) but it outputs JSON or XML, so... pretty easy to adapt to any language...
You can also use the Google API's to spell check. There is an ASP implementation here (I'm not to credit for this, though).
First off:
Java
C++
C#
Use the one of your choice. I suspect it runs the query against a spell-checking engine with a word limit of exactly one, it then does nothing if the entire query is valid, otherwise it replaces each word with that word's best match. In other words, the following algorithm (an empty return string means that the query had no problems):
startup()
{
set the spelling engines word suggestion limit to 1
}
option 1()
{
int currentPosition = engine.NextWord(start the search at word 0, querystring);
if(currentPosition == -1)
return empty string; // Query is a-ok.
while(currentPosition != -1)
{
queryString = engine.ReplaceWord(engine.CurrentWord, queryString, the suggestion with index 0);
currentPosition = engine.NextWord(currentPosition, querystring);
}
return queryString;
}
Since no one has yet mentioned it, I'll give one more phrase to search for: "edit distance" (for example, link text).
That can be used to find closest matches, assuming it's typos where letters are transposed, missing or added.
But usually this is also coupled with some sort of relevancy information; either by simple popularity (to assume most commonly used close-enough match is most likely correct word), or by contextual likelihood (words that follow preceding correct word, or come before one). This gets into information retrieval; one way to start is to look at bigram and trigrams (sequences of words seen together). Google has very extensive freely available data sets for these.
For simple initial solution though a dictionary couple with Levenshtein-based matchers works surprisingly well.
You could plug Lucene, which has a dictionary facility implementing the Levenshtein distance method.
Here's an example from the Wiki, where 2 is the distance.
String[] l=spellChecker.suggestSimilar("sevanty", 2);
//l[0] = "seventy"
http://wiki.apache.org/lucene-java/SpellChecker
An older link http://today.java.net/pub/a/today/2005/08/09/didyoumean.html
The Google SOAP Search APIs do that.
If you have a dictionary stored as a trie, there is a fairly straightforward way to find best-matching entries, where characters can be inserted, deleted, or replaced.
void match(trie t, char* w, string s, int budget){
if (budget < 0) return;
if (*w=='\0') print s;
foreach (char c, subtrie t1 in t){
/* try matching or replacing c */
match(t1, w+1, s+c, (*w==c ? budget : budget-1));
/* try deleting c */
match(t1, w, s, budget-1);
}
/* try inserting *w */
match(t, w+1, s + *w, budget-1);
}
The idea is that first you call it with a budget of zero, and see if it prints anything out. Then try a budget of 1, and so on, until it prints out some matches. The bigger the budget the longer it takes. You might want to only go up to a budget of 2.
Added: It's not too hard to extend this to handle common prefixes and suffixes. For example, English prefixes like "un", "anti" and "dis" can be in the dictionary, and can then link back to the top of the dictionary. For suffixes like "ism", "'s", and "ed" there can be a separate trie containing just the suffixes, and most words can link to that suffix trie. Then it can handle strange words like "antinationalizationalization".