POS tagging using the Brown tag set in NLTK

Is it possible to assign tags using the Brown tag set in NLTK? I'm not using the Brown corpus, which is already tagged.

Yes, but not out of the box: you can train your own tagger on the Brown corpus. Performance will depend on the kind of text you need to tag and on how much work you put into trying out different kinds of taggers. Chapter 5 of the NLTK book walks you step by step through the process of making a pretty decent tagger (look at the section on N-Gram Tagging in particular), and it even uses the Brown corpus as its example, so you won't need to change a thing.
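For reference, here is a minimal sketch of the kind of backoff chain that chapter builds (default tagger, then unigram, then bigram). It assumes NLTK is installed and that the Brown corpus has been fetched once with nltk.download("brown"). The helper names train_backoff_tagger and brown_tagger are mine, not part of NLTK, and the "NN" default tag and the "news" category are just illustrative choices:

```python
import nltk
from nltk.corpus import brown


def train_backoff_tagger(train_sents, default_tag="NN"):
    """Train a bigram tagger that backs off to a unigram tagger,
    which in turn backs off to a constant default tag."""
    t0 = nltk.DefaultTagger(default_tag)
    t1 = nltk.UnigramTagger(train_sents, backoff=t0)
    t2 = nltk.BigramTagger(train_sents, backoff=t1)
    return t2


def brown_tagger(category="news"):
    """Hypothetical convenience wrapper: train on one Brown category.
    Needs the corpus on disk (one-time nltk.download("brown"))."""
    return train_backoff_tagger(brown.tagged_sents(categories=category))
```

Something like brown_tagger().tag("The jury praised the report".split()) should then give you (word, tag) pairs with Brown tags. How good they are depends on how close your text is to the category you trained on, so it is worth holding out some sentences and measuring before relying on it.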

Related

Random phrase over html tags in Firefox Developer tools?

Hovering over any pre tag gives me 'The quick brown fox jumps over the lazy dog'. This is nowhere in the document or in my CSS file. I opened the same page in Brave and nothing was there. I looked at another HTML file with Firefox and it came up in the body tag. I even ran ripgrep across all my repos to see if that phrase came up anywhere, and it did not. Where does it get this from when hovering?
From Wikipedia:
"The quick brown fox jumps over the lazy dog" is an English-language pangram—a sentence that contains all of the letters of the English alphabet. Owing to its brevity and coherence, it has become widely known. The phrase is commonly used for touch-typing practice, testing typewriters and computer keyboards, displaying examples of fonts, and other applications involving text where the use of all letters in the alphabet is desired.
The Firefox dev tools are just showing you what the hovered font rule looks like.
It triggers when you hover over a font file in the CSS, not when you hover over a pre element.

Lilypond: Accidentals #3 and b2 for Turkish music

I'm transcribing some music for bağlama, a stringed instrument with frets that can produce notes that are not part of traditional Western music.
I'd like to transcribe some notes using accidentals ♭2 and ♯3. Is there a way to do so in Lilypond?
LilyPond is indeed capable of notating non-Western music and is already set for notating Turkish classical music. Please refer to the two pages below:
http://lilypond.org/doc/v2.18/Documentation/notation/common-notation-for-non_002dwestern-music
and
http://lilypond.org/doc/v2.18/Documentation/notation/turkish-classical-music

How to effectively export math equations from Microsoft Word to HTML

I blog about computer vision, which involves a good amount of math. I find it comfortable to use Microsoft Word to prepare the write-up before posting.
I haven't figured out an effective way to move the math equations in the Word document to the blog. I want them to be rendered as text. What options do I have?
I did come across 'MathJax' here in the Math section. Is it possible to use it as a plugin in Blogger?
Use ASCII codes (HTML character entities): search for the code equivalent to each math symbol. For example, typing 10 &lt; 11 in the HTML will render as 10 < 11.

Google webfont subset sample strings

I have written a WordPress plugin that allows any Google Webfont to be used in the CMS. It includes a font previewer that shows "The quick brown fox..." in each selected font. So far it has only been used to request the latin (i.e. default) subset from Google.
I have now extended the plugin to allow subsets to be requested for the selected fonts. There are a dozen Google Webfont subsets including, for example, latin-ext, greek and cyrillic.
Now the question: in the font preview page, I would like to show what these subsets look like. Are there any well-known or common Unicode strings that will do this? I guess I am looking for the equivalent of "The quick brown fox" but for each of the Google subsets.
How the Google subsets map onto the named Unicode subsets is not clear, so I may need to find something specific to Google.
Edit:
This is where the sample text is going to go. Maybe this is less exactly a programming problem and more about a source of data.
var settings = jQuery.extend({
    ...
    preview_text: {
        'latin': 'The quick brown fox jumps over the lazy dog',
        'latin-ext': '?',
        'greek': '?',
        'greek-ext': '?',
        'cyrillic': '?',
        'cyrillic-ext': '?'
    }
}, options);
If there is a way to programmatically get a selection of characters with glyphs unique to each subset, then I would be happy with that.
Just noticed that Google already provides some sample strings to use in its own font preview pages. The sample text is the same for the extended version as for the non-extended version of each subset that has an extended subset. I'll find a few additional characters for the extended subsets to complete my font preview.
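On the "programmatic" part of the question, here is a rough offline sketch in Python that pulls a handful of letters per subset from Unicode code point ranges, which could then be pasted into preview_text. The range table is an assumption on my part (as noted above, how Google's subsets map onto Unicode blocks is not clear), and ASSUMED_RANGES, sample_letters and build_preview_text are hypothetical names. A given font may also lack glyphs for some of these characters:

```python
import unicodedata

# ASSUMPTION: these ranges only approximate Google's subsets; they are
# ordinary Unicode ranges picked for illustration, not Google's definitions.
ASSUMED_RANGES = {
    "latin-ext": (0x0100, 0x017F),     # Latin Extended-A block
    "greek": (0x0391, 0x03C9),         # main Greek letters
    "greek-ext": (0x1F00, 0x1FFF),     # Greek Extended block
    "cyrillic": (0x0410, 0x044F),      # basic Cyrillic alphabet
    "cyrillic-ext": (0x0500, 0x052F),  # Cyrillic Supplement block
}


def sample_letters(subset, count=12):
    """Return the first `count` upper/lowercase letters in the assumed range."""
    start, end = ASSUMED_RANGES[subset]
    letters = [
        chr(cp)
        for cp in range(start, end + 1)
        # Skip unassigned code points, marks and punctuation.
        if unicodedata.category(chr(cp)) in ("Lu", "Ll")
    ]
    return "".join(letters[:count])


def build_preview_text(count=12):
    """Build a subset -> sample string mapping shaped like preview_text."""
    return {name: sample_letters(name, count) for name in ASSUMED_RANGES}
```

This only yields runs of letters, not readable sentences, so it is more of a fallback for the extended subsets than a replacement for proper sample text.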

OCR and Distinguishing Between 2 or 3 Fonts

Let's say that I have a black and white image of a document with only 2 or 3 fonts being used. One of the 3 is used for the title and another is a small font (or at least, very plain). For example, one of the little bits of text might be:
Fancy/Bolded/Italicized/Script font: The Best Soup In The World
Plain/small: Made with tap water, salt, and sugar.
Fancy/Bolded/Italicized/Script font: The Best Soup and 1/2 Sandwich In The World
Plain/small: Made with flour, tap water, salt, and sugar.
I don't need a big fancy OCR system that can tell me that "Best Soup" uses a particular fancy font with italics/etc. I just need a system that can tell me "Best Soup" is formatted rather differently from "tap water", that "Best Soup" and "Sandwich" are probably using the same formatting, and "Sandwich" is bigger/fancier than "tap water."
I'll be using Tesseract to do the actual OCR and bounding box detection (http://www.mail-archive.com/tesseract-ocr@googlegroups.com/msg02157.html), if that's relevant.
Is there anything out there that I can use to do this simple formatting classification?
Edit:
Is there anything out there that will do this without costing me an arm and a leg?
I’m not sure whether Tesseract can solve the task you describe, but I believe a good OCR engine should detect font styles. For example, ABBYY OCR SDK can not only identify bold/italic font styles, but can also determine the proper font face to use in the output.
Based on what you describe, I guess you are trying to determine the document's style hierarchy, such as header levels. ABBYY FineReader Engine provides this functionality, so you don’t have to write your own routine that infers the purpose of text from its font size and style. Besides, it provides the best OCR quality and it’s free to try. Consider trying it out if you plan commercial software. I work at ABBYY and can provide you with more info on our OCR SDK if necessary.
Best regards.