Define an STL structure for implementing a glossary? - stl

In an interview I got a question like:
"define a structure using STL like list, vector, map etc. to implement a Glossary that you find at the end of any book."
Any ideas on how I could define a structure for a glossary?
Thanks...

STL's Map container would be fine for this. There's no need to create your own.
Are you sure the question wasn't "Which container would be best?"

A glossary would normally contain a word and a definition of that word. For this, an std::map<std::string, std::string> should do the job quite nicely.
Based on your comment, you may be looking for something closer to an index, with a word and a number of pages in the book where the word is used. In this case, you could use either an std::multimap<std::string, int>, or something like an std::map<std::string, std::vector<int> >. Given the way an index is usually printed (the word shown only once, followed by all the page numbers where that word is used), it's probably easier to use the latter for this case.
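As a rough sketch of both shapes (the terms and page numbers here are made up):

#include <iostream>
#include <map>
#include <string>
#include <vector>

int main() {
    // Glossary: term -> definition, kept sorted by term automatically.
    std::map<std::string, std::string> glossary;
    glossary["iterator"] = "An object that traverses a container.";

    // Index: term -> all pages on which the term appears.
    std::map<std::string, std::vector<int> > index;
    index["iterator"].push_back(12);
    index["iterator"].push_back(87);

    // Print the way a book index does: the word once, then its pages.
    std::map<std::string, std::vector<int> >::const_iterator it;
    for (it = index.begin(); it != index.end(); ++it) {
        std::cout << it->first << ": ";
        for (std::size_t i = 0; i != it->second.size(); ++i)
            std::cout << it->second[i]
                      << (i + 1 == it->second.size() ? "\n" : ", ");
    }
    return 0;
}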

Related

l20n with HTML markup?

How would I use l20n if I wanted to create something like this:
About <strong>Firefox</strong>
I want to translate the phrase as a whole but I also want the markup. I don't want to have to do this:
<aboutBrowser "About {{ browserBrandShortName }}">
<aboutBrowserStrong "About <strong>{{ browserBrandShortName }}</strong>">
...as the translation itself is now duplicated.
I understand that this might not be in the scope of l20n, but it is probably a common enough case in the real world. Is there some kind of established way to go about this?
Sometimes duplicating the translation is the best thing you can do. Redundancy is good in localization: it allows you to make fewer assumptions about translations into other languages. One of the core principles of L20n is that only the localizer will know what they really need.
Your solution is actually okay
The solution that you proposed is actually quite good. It's entirely possible that the emphasis that you're trying to express with <strong> will have some unknown implications in some language that we might not be aware of. For instance, some languages might use declensions or postpositions to mean "about something", in which case you—as a developer—shouldn't make too many assumptions about the exact position of the <strong> element. It might be that the entire translation will be a single word surrounded by <strong>.
Here's your code again, formatted using L20n's multiline string literals:
<aboutBrowser "About {{ browserBrandShortName }}">
<aboutBrowserEmphasized """
About <strong>{{ browserBrandShortName }}</strong>
""">
Note that for this to work as expected, you'll need to add a data-l10n-overlay attribute to the DOM node with data-l10n-id=aboutBrowserEmphasized. Otherwise, < and > will be escaped.
Making few assumptions matters
Let me digress quickly and bring up Bug 859035 — Do not use the same "unknown" entity for Size & Author when installing a WebApp from Firefox OS. The English-speaking developer assumed that they could use the "unknown" adjective to qualify both the size and the author in the installation dialog. However, in certain languages, like French or Polish, the adjective must agree with the noun in gender and number. So even though in English we can have only one string:
<unknown "Unknown">
…other languages might require two separate strings for each of the contexts they're used in. In French, you'd say "auteur inconnu" (unknown author, masculine) but "taille inconnue" (unknown size, feminine):
<unknownSize "inconnue">
<unknownAuthor "inconnu">
In English, this means some redundancy:
<unknownSize "Unknown">
<unknownAuthor "Unknown">
…but that's OK, because in the end the quality of localization is improved. It is generally a good practice to use unique strings everywhere and reuse sparingly. Ideally, you'd allow different translations for all strings. Even something as simple and common as "Yes" and "No" can be tricky if you consider languages like Welsh:
Welsh doesn't have a single word to use every time for yes and no questions. The word used depends on the form of the question. You must generally answer using the relevant form of the verb used in the question, or, in questions where the verb is not the first element, you use either 'ie' or 'nage'.

What's a good Lucene analyzer for text and source code?

What would be a good Lucene analyzer to use for documents that are a mix of text and diverse source code?
For example, I want "C" and "C++" to be considered different words, and I want Charset.forName("utf-8") to be split between the class name and method name, and for the parameter to be considered either one or two words.
A good example dataset for what I'd like to look at is StackOverflow itself. I believe that StackOverflow uses Lucene.NET for search; does it use a stock analyzer, or has it been heavily customized?
You're probably best off using the WhitespaceTokenizer and customizing it to strip off punctuation. For example, we strip all punctuation except '+' and '-', so that words such as C++ are kept while opening and closing quotes, brackets, etc. are removed. In reality, for something like this you might have to add the document twice using different tokenizers to catch the different parts of the document: once with the StandardTokenizer and once with a WhitespaceTokenizer. The StandardTokenizer will split all your code, e.g. between class and method names, while the whitespace one will pick up words such as C++. Obviously it depends on the language, though, as e.g. Scala allows some punctuation characters in method names.
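As a rough sketch of that stripping rule in plain C++ (not the actual Lucene analyzer API; keep() and the sample input are illustrative):

#include <cctype>
#include <iostream>
#include <sstream>
#include <string>

// Characters worth keeping at token edges: alphanumerics plus '+' and '-',
// so "C++" survives while surrounding quotes and brackets are stripped.
static bool keep(char c) {
    return std::isalnum(static_cast<unsigned char>(c)) != 0
        || c == '+' || c == '-';
}

int main() {
    std::istringstream in("compare \"C\" and (C++) or Charset.forName(\"utf-8\")");
    std::string tok;
    while (in >> tok) {  // split on whitespace, like a WhitespaceTokenizer
        std::string::size_type b = 0, e = tok.size();
        while (b < e && !keep(tok[b])) ++b;      // trim leading punctuation
        while (e > b && !keep(tok[e - 1])) --e;  // trim trailing punctuation
        if (b < e) std::cout << tok.substr(b, e - b) << '\n';
    }
    return 0;
}

Note how the whitespace pass keeps C++ intact but leaves Charset.forName(...) glued together as one token; that is exactly the gap the second pass with the StandardTokenizer is meant to cover.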

Regular Expressions vs XPath when parsing HTML text

I want to parse an HTML document and find specific parts. For example, the text in the 3rd div of the 1st row and 2nd column of a table. I have two options for parsing: regular expressions and XPath. What are the advantages and disadvantages of each?
thanks
It somewhat depends on whether you have a complete HTML file of unknown but well-formed content versus having merely a snippet or an expanse of HTML of completely known content which may or may not be well-formed.
There is a difference between editing and parsing, you see.
It is one thing to be editing your own HTML file that you wrote yourself or are otherwise staring right in the face, and you issue the editor command
:100,200s!<br */>!!g
to remove the breaks from lines 100–200.
It is quite another to suck down whatever HTML happens to be at the other end of a URL and then try to make some sense out of it, sight unseen.
The first calls for a regex solution — the very one shown above, in fact. To go off writing some massively overengineered behemoth to do a full parse to set up the entire parse tree just to do the simple edit shown above is quite simply wrong. It’s also its own punishment.
On the other hand, using patterns to parse out (as opposed to lex out) an entire HTML document that can contain all kinds of whacky things you aren’t planning for just cries out for leveraging someone else’s hard work instead of recreating the wheel for yourself, and badly at that.
However, there’s something else nobody likes to mention, and that’s that most people just aren’t competent at regexes. They don’t really understand them. They don’t know how to test them or to craft them. They don’t know how to make them readable and maintainable.
The truth of the matter is that the overwhelming majority of regex users cannot even manage as simple and basic a thing as matching an arbitrary HTML tag using a regex, even when gotchas like alternate encodings and CDATA sections and redefined entities and <script> contents and archaic never-seen forms are all safely dispensed with.
It’s not because it’s hard to do; it isn’t, actually. It’s just that the people trying to do it understand neither regexes nor HTML particularly well, and they don’t know they don’t know, and so they get themselves in way over their heads more quickly than they realize. And then they have a complete disaster on their hands.
Plus it’s been done before, and correctly. Might as well learn from someone else’s mistakes for a change, eh? It would probably help to have a few canned regexes at your disposal to go at frequently manipulated things. This is especially useful for editing.
But for a full parse, you really shouldn’t try to embed a full HTML grammar inside your pattern. Honest, you really shouldn’t. Speaking as someone who actually can and has done this, I, unlike 99.9999% of the responders here, have the credibility of actual experience in this area when I advise against it. Sure, I can do it, but I almost never want to, and I certainly don’t want you to try it at home unsupervised. I can’t be held responsible for any damage that might ensue. :)
Sure, this may sound like “Do as I say, not as I do,” but if your level of regex mastery were at a level that allowed you to contemplate such a thing, you would not be asking this question. As I mentioned, almost no one who uses regexes can actually match an arbitrary HTML tag, simple as that is. Given that you need that sort of building block before writing your recursive descent grammar, and given that next to nobody can even manage that simple building block, well...
Given that sad state of affairs, it’s probably best to use regexes for simple edit jobs only, and leave their use for more complete solutions to real regex wizards, for they are subtle and quick to anger. Meaning of course the regexes, not (just) the wizards.
But sure, keep some canned regexes handy for doing simple editing rather than full parsing. That way you won’t be forced to redevise them each time from first principles. I do keep a few of these around, but then I also keep simple frameworks that allow me to edit a particular structural element of the HTML, like the plain text or the tag contents or the link references, etc, and those all use a full parser, letting me then surgically target just the parts I want in complete confidence I haven’t forgotten something.
More as a testament to what is possible than what is advisable, you can see some answers with more, um, “heroic” pattern matching, including recursion, here, here, here, here, here, and here.
Understand that some of those were actually written for the express purpose of showing people why they should not use regexes, because some of them are really quite sophisticated, much more so than you can expect from nonwizards. That difficulty may chase you away, which is OK, because it was sort of meant to.
But don’t let that stop you from using vi on your HTML files, nor should it scare you away from using its search or substitute commands. Don’t let the perfect be the enemy of the good. Sometimes good enough is exactly what you need, because the perfect would take more investment than it could ever be worth.
Understanding which out of several possible approaches will give you the most bang for your buck is something that takes time to learn, and no one can tell you the answer that works for you. They don’t know your dataset, your requirements, your skillset, your priorities. Therefore any categorical answer is automatically wrong. You have to evaluate these things for yourself.
I think XPath is the primary option for traversing XML-like documents. With RegExp, it will be up to you to handle the different forms of writing a tag (with multiple spaces, double quotes, single quotes, no quotes, in one line, in multiple lines, with inner data, without inner data, etc.). With XPath, this is all transparent to you, and it has many features (like accessing a node by index, selecting by attribute values, selecting siblings, and MANY others).
See how powerful it can be at http://www.w3schools.com/xpath/.
EDIT: See also How do HTML parsers work if they're not using regexp?
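As a rough sketch of the question's example ("the text in the 3rd div of the 1st row and 2nd column of a table"), here using libxml2's HTML parser purely for illustration; any XPath-capable HTML parser would do, and the sample markup is made up:

// build: g++ example.cpp $(xml2-config --cflags --libs)
#include <cstring>
#include <iostream>
#include <libxml/HTMLparser.h>
#include <libxml/xpath.h>

int main() {
    const char* html =
        "<table><tr><td>a</td>"
        "<td><div>1</div><div>2</div><div>target</div></td></tr></table>";
    // The HTML parser tolerates real-world markup an XML parser would reject.
    htmlDocPtr doc = htmlReadMemory(html, (int)std::strlen(html), NULL, NULL,
                                    HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    xmlXPathContextPtr ctx = xmlXPathNewContext(doc);
    // 1st row, 2nd column, 3rd div (quoting style in the source is irrelevant).
    xmlXPathObjectPtr res = xmlXPathEvalExpression(
        BAD_CAST "//table//tr[1]/td[2]/div[3]", ctx);
    if (res && res->nodesetval && res->nodesetval->nodeNr > 0) {
        xmlChar* text = xmlNodeGetContent(res->nodesetval->nodeTab[0]);
        std::cout << text << '\n';  // prints "target"
        xmlFree(text);
    }
    xmlXPathFreeObject(res);
    xmlXPathFreeContext(ctx);
    xmlFreeDoc(doc);
    return 0;
}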
XPath is less likely to break if the web developer does any minor changes. That would be my choice.
Here is the canonical Stack Overflow explanation for why you should not parse HTML with regex:
RegEx match open tags except XHTML self-contained tags
In general, you cannot parse HTML with regex because regex is not made to parse HTML. Just use XPath.

Is it actually possible to parse freeform HTML with a regular expression?

Now, before you prepare to write a speech about the perils of parsing HTML with regex: I already know it. This is more of a curiosity question than actually wanting to know the answer for practical usage.
Basically, given a file of HTML in some random but perfectly valid format, can you parse out the content of <p> tags using a half-sane number of regular expressions? (Also pretending that <p> tags cannot be nested, or some other minor limitation.)
It's certainly possible to extract all the text between {insert character sequence 1 here} and {insert character sequence 2 here} with regular expressions, so long as those sequences aren't overlapping. For example:
/(?<={insert character sequence 1 here}).*?(?={insert character sequence 2 here})/
Of course, it's terribly brittle and will break horribly if what you're running it on is even slightly malformed, or contains either character sequence outside the context where it's meaningful, or any number of other ways. If you oversimplify the problem, then yes you can get away with an oversimplified solution.
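As a rough sketch with C++'s std::regex (whose ECMAScript grammar lacks lookbehind, so a capture group stands in for the lookaround pair above; the sample input is made up):

#include <iostream>
#include <regex>
#include <string>

int main() {
    std::string html = "<p>first</p><div>skip me</div><p>second</p>";
    // Non-greedy capture of everything between <p> and </p>;
    // [\s\S] matches across newlines, which '.' does not by default.
    std::regex p_tag("<p>([\\s\\S]*?)</p>");
    std::sregex_iterator it(html.begin(), html.end(), p_tag), end;
    for (; it != end; ++it)
        std::cout << (*it)[1] << '\n';  // prints "first", then "second"
    return 0;
}

It works on this well-behaved input and breaks as soon as something like <p class="..."> appears, which is exactly the brittleness described above.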
Yes, under restrictions like valid HTML and non-nesting, you can use regular expressions for certain uses.
It depends on which limitations you'd consider minor. XHTML, for one obvious example, is somewhat more amenable to simple parsing. A great deal depends on whether you're thinking in terms of parsing existing HTML, or generating new HTML that could be parsed relatively easily. For the former case, I'd say the restrictions were major -- i.e., you'd need to know a great deal about the specific HTML in question to parse it. For the latter case, I'd say the restrictions were fairly trivial -- i.e., they would only involve how you write the HTML, but would not affect what you could express in HTML.

Which is better to join two names: (-) or (_)?

Hi, when I write CSS or HTML I find that I want to join two names, like this:
web-development
web_development
Which one is better according to SEO when writing a style name, file name, or image name?
The first one is better. Also see this post by Google employee Matt Cutts: http://www.mattcutts.com/blog/dashes-vs-underscores/
Use the dash. Google's engines don't really parse underscores. This is maybe for programmers' sanity, so that when they search for query_function, they get the results they are looking for?
If you have a URL like "http://example.com/web-site", Google will return results for 'web', 'site' and '"web site"'. This is not the case for underscores: web_site will only return results for web_site.
P.S.
I also think that dashes are better than underscores for usability purposes: a dash is a single key on the keyboard, while an underscore requires two keys to be pressed. This has nothing to do with the technical side of SEO, but everything to do with usability, which is more important than SEO, IMO.
For CSS I don't think there are any issues with the naming methodology, but for naming HTML pages the dash is preferred, as search engines treat - as a space. Even so, a good page name is not enough for good SEO; you also need proper meta tags and keywords.
And make sure all your images have a proper title attribute; this is really essential.
Isn't it common practice to use the - to connect two words, and the _ to replace a space in situations where you can't use a space/+ sign, like CSS class names?
The first one is better in terms of SEO, because the priority of a hyphen is greater than that of an underscore.
Please list two (2) words in the English language that use underscores ("_") within them.
Now list fifty (50) words that use dashes/hyphens ("-").
My opinion is that the hyphens would be a better solution for SEO.
IMO, when it comes down to SEO, everything makes a difference!
You are dealing with two different problems: URLs and CSS.
For URLs, hyphens would be the better choice because of SEO.
However, depending on your editing program, underscores might work better for multi-word class names. In TextMate, for instance, I can hit Esc to finish (auto-complete) a class I previously entered. It stops completing when it encounters a hyphen, but will fill in the whole class name when you use an underscore. If this is not the case for your editor, then it is really up to your preference.