Where can I obtain an English dictionary with structured data? [closed] - open-source

Closed. This question does not meet Stack Overflow guidelines. It is not currently accepting answers.
We don’t allow questions seeking recommendations for books, tools, software libraries, and more. You can edit the question so it can be answered with facts and citations.
Closed 8 years ago.
I would like to download an English dictionary -- not just a word list -- in a structured format such as TXT, XML, or SQL.
Specifically, I need phonetic pronunciation and parts of speech (definition is not required).
Surprisingly, I can't find this online anywhere. Wiktionary is available for download, but it is only the MediaWiki articles themselves. Crawling all articles and extracting the phonetics and parts of speech would be a huge exercise.
Is this available anywhere? I don't mind paying.
Edit: a few people have asked what I would like to do. My immediate need is just curiosity, for example "what are the most common two-syllable verbs?". Eventually my hope would be a tool that helps you find available domain names, and does so by pairing the correct parts of speech, with bonus points for phonetic matches.
Note: cross-posted on English Language and Usage.

Go to http://www.speech.cs.cmu.edu/cgi-bin/cmudict and you will find the download page for the pronunciation dictionary at https://cmusphinx.svn.sourceforge.net/svnroot/cmusphinx/trunk/cmudict/
The latest version is currently cmudict.0.7a.
This is what I am currently using to implement the syllable counter for http://www.haikuvillage.com. It's in Ruby and I'd be happy to open source it for you if that helps.
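For anyone who wants to try the syllable-counting idea before I publish the Ruby code, here is a rough Python sketch (not my actual implementation). It assumes the usual cmudict layout: a word followed by ARPAbet phonemes, vowel phonemes ending in a stress digit, alternate pronunciations marked like WORD(1), and comment lines starting with ";;;". The sample entries are illustrative and may not match the real file exactly.

```python
import re

# Illustrative lines in the assumed cmudict layout; load the real file for actual use.
SAMPLE = """\
;;; comment lines start with three semicolons
HAIKU  HH AY1 K UW0
VILLAGE  V IH1 L AH0 JH
READ  R EH1 D
READ(1)  R IY1 D
"""

# Alternate pronunciations are assumed to be marked WORD(1), WORD(2), ...
VARIANT = re.compile(r"\(\d+\)$")


def parse_cmudict(lines):
    """Map each word to a list of pronunciations (each a list of phonemes)."""
    entries = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;;"):
            continue
        word, *phones = line.split()
        word = VARIANT.sub("", word).lower()
        entries.setdefault(word, []).append(phones)
    return entries


def syllable_count(phones):
    """Vowel phonemes carry a stress digit (0, 1 or 2); count those."""
    return sum(1 for phone in phones if phone[-1].isdigit())


entries = parse_cmudict(SAMPLE.splitlines())
counts = {
    word: [syllable_count(phones) for phones in prons]
    for word, prons in entries.items()
}
print(counts)  # {'haiku': [2], 'village': [2], 'read': [1, 1]}
```

Each word keeps one count per listed pronunciation, so words with variants can be handled however suits your use case.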

Parts of Speech Dictionary in the public domain with highly structured format: http://icon.shef.ac.uk/Moby/mpos.html
Each line is an entry, separated by ×, with the word value on the left and the part-of-speech value (verb, etc.) on the right. Simple text file.
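A minimal Python sketch of reading that layout might look like the following. The sample entries are made up, and the meanings I attach to the one-letter codes (V, t and i as verb codes) are my assumption, so check the documentation that ships with the file. The file's text encoding is also something to verify before opening it.

```python
# Assumption: the delimiter is the multiplication sign described above.
DELIMITER = "\u00d7"

# Assumed verb-related codes; verify against the file's own documentation.
VERB_CODES = {"V", "t", "i"}


def parse_mpos_line(line, delimiter=DELIMITER):
    """Split one entry into (word, list of one-character POS codes)."""
    word, sep, codes = line.rstrip("\r\n").partition(delimiter)
    if not sep:
        raise ValueError(f"no delimiter in line: {line!r}")
    return word, list(codes)


def is_verb(codes):
    """True if any of the codes is one of the assumed verb codes."""
    return any(code in VERB_CODES for code in codes)


# Made-up sample entries in the described layout.
sample = ["run\u00d7NVti", "quickly\u00d7v", "domain\u00d7N"]
parsed = [parse_mpos_line(line) for line in sample]
verbs = [word for word, codes in parsed if is_verb(codes)]
print(verbs)  # ['run']
```

Joining the resulting word list against a pronunciation dictionary would be one way to approach questions like the two-syllable-verb example.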

WordNet is one of the best dictionaries I know. Perhaps you will find something there:
http://wordnet.princeton.edu/wordnet/related-projects/

Portman, when I used the SpellChecker tool from DevExpress I learned about the OpenOffice dictionaries; I'm pretty sure they have a well-defined data structure. I recommend using them in combination with any free or paid text-to-speech tool.
Hope that helps.

This is not a direct answer to your question, but the Double Metaphone algorithm is very good at finding word or phrase matches for search engine application servers (such as Solr and others).
I cannot tell what your intended use of this is, so I can't tell if my suggestion is useful or not. If it is close to your intended use, the Wikipedia page about Double Metaphone has a listing of about a dozen implementations of it which may be worth exploring.
http://en.wikipedia.org/wiki/Double_Metaphone

Related

Best open-source spell-checker for OCR? [closed]

Closed 4 years ago.
I have a large number of English OCRed documents from the 19th century and want to clean up some of the OCR errors by using a contextual spell-checker such as the one proposed by Peter Norvig at http://norvig.com/spell-correct.html. My main goal is to use a probabilistic model (together with the OCRed text data and an appropriate, large dictionary) to correct misspelled words.
I am happy using the code that Norvig gives on his website and improving it, but before I do so, I would like to ask if there is an open-source solution for this. Norvig himself suggests looking at aspell, but I don't think that aspell is a contextual spell-checker, and I'm worried it might not work so well on OCR error correction.
So, you're looking for a spell checker that will substitute the most probable choice whenever there is a phrase or word it doesn't understand? That seems like it would be a bad idea on 19c texts unless you have a large corpus of such texts that have already been spell checked by hand. Words that were commonplace then but rare now will be replaced without your knowledge. I daresay, you may find a contextual spell-checker trained on modern locution to be tetotaciously exflunctified by your 19c phraseology. ☺
If you have such a corpus, or you're up for creating one, there is a powerful Python-based tool for OCR and analysis called OCRopus. It uses natural language processing, neural networks and many other buzzwords; I think I saw "deep learning" on the to-do list. It does not appear easy to use, though I admit I've never tried it myself. It seems to require skill at the command line and programming in Python. If you're still not daunted, it may be exactly what you're looking for.
On the other hand, if you are looking for something simpler, consider using a program with a standard spell checker. For example, gImageReader which can read in your PDF files, OCR them, and let you correct & add the words it doesn't know. I suggest at least trying a simple spell checker before searching for something more complicated.
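To make the earlier point about period vocabulary concrete, here is a small sketch in the style of Norvig's essay (a simplified word-frequency corrector, not his exact code and not contextual). The two toy "corpora" are invented for illustration; the only thing that changes between the calls is which word counts the corrector was trained on.

```python
import re
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def words(text):
    """Lowercase the text and pull out alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def edits1(word):
    """All strings one delete, transpose, replace or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Keep known words; otherwise pick the most frequent one-edit neighbour."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word


# Invented toy corpora standing in for hand-checked training text.
period = Counter(words("the physician did shew the gaol to the magistrate"))
modern = Counter(words("the doctor did show the goal to the judge"))

print(correct("shew", period))        # shew
print(correct("shew", modern))        # show
print(correct("gaol", modern))        # goal
print(correct("magistrafe", period))  # magistrate
```

With modern counts, the legitimate period spellings "shew" and "gaol" are silently rewritten, while the period counts leave them alone and still repair a genuine OCR slip.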

What's the best way to open source data (rather than code)? [closed]

Closed 7 years ago.
As part of a recent programming project I compiled a database, the contents of which may conceivably be of use to someone else one day. I'm looking for the best way to 'open source' the data.
I could (and probably will) upload the SQL onto GitHub, but was wondering if anyone had found a more 'data-centric' way of sharing - maybe a website that makes it easy for users to browse/query/visualise/improve data sets, rather than just giving them a big lump of SQL.
To clarify, I'm looking for a place where I can share the data, rather than a format in which to share it - ideally a data-set equivalent of GitHub/Sourceforge.
The data is relatively small (a few thousand lines of SQL) so the volume should not be an obstacle.
I'm a big fan of Amazon's S3 for stuff like this. And if your data set is interesting enough, maybe you could publish it with InfoChimps.
I have worked with a lot of data from different companies. Most often this data has been in a delimited text format, the most popular of course being comma-separated or tab-separated. Using commas is often a good choice because MySQL can also export and import CSV. Here is an example:
id, first_name, last_name, address
1, John, Smith, 11222 Street Name
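If you go the CSV route, a rough Python sketch of writing and re-reading such a file with the standard csv module could look like this. The second row is made up to show why a real CSV writer is worth using: a field containing a comma gets quoted automatically. An in-memory buffer stands in for an actual file here.

```python
import csv
import io

FIELDS = ["id", "first_name", "last_name", "address"]

# The second row is hypothetical; its address contains a comma on purpose.
rows = [
    {"id": 1, "first_name": "John", "last_name": "Smith", "address": "11222 Street Name"},
    {"id": 2, "first_name": "Ann", "last_name": "Lee", "address": "5 Main St, Apt 2"},
]

# Write the rows as CSV into an in-memory buffer (a real file works the same way).
buffer = io.StringIO(newline="")
writer = csv.DictWriter(buffer, fieldnames=FIELDS)
writer.writeheader()
writer.writerows(rows)
csv_text = buffer.getvalue()

# Read the text back; every value comes back as a string.
restored = list(csv.DictReader(io.StringIO(csv_text, newline="")))
print(restored[1]["address"])  # 5 Main St, Apt 2
```

Note that the values come back as strings, so anyone importing the data needs to know the intended column types.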
Google Fusion Tables ticks some of these boxes, although the emphasis seems to be on visualisation (I haven't used it, so this may be unfair). I am also reluctant to commit too heavily to any second-tier Google products these days, since they have a habit of disappearing.
You could export it to XML, which is probably the most compatible data format, although it is rather verbose. Another solution is OData, but this implies hosting both the data and the platform that serves it, which may not be desirable.
Sparkfun is another possibility. It seems to be mainly targeted at real-time data sources, but they offer free storage and the platform is open-source, so you can host your own server.

Writing documentation - open source solutions for displaying docs online? [closed]

Closed 5 years ago.
I've been working on a framework in AS3 that I want to release, but first I obviously need to prepare some documentation for it.
I've noticed that quite a few sites have the exact same layout, functionality, etc. as Adobe Livedocs, which has led me to believe that there's something open source out there for creating online documentation.
Here are some examples:
http://livedocs.adobe.com/flash/9.0/ActionScriptLangRefV3/
http://papervision3d.googlecode.com/svn/trunk/as3/trunk/docs/index.html
http://www.fisixengine.com/api/
Would anyone be able to point me in the right direction for tools that I can use to prepare online documentation?
Ideally the system would be specifically suited for documentation in ActionScript 3. I don't have a requirement in terms of the documentation being automatically generated either - if there's something out there that looks/works nice I'm happy to manually create the documentation (provided it comes with tools for easily adding classes, arguments, etc).
Adobe has a free tool called ASDoc. It generates documentation which follows the official Adobe pattern. Frankly, it isn't worth it though. The ASDoc tool is buggy and unreliable. If it has difficulty finding an import, if an import isn't used, if a comment is not correctly formatted, or if you have your source code spread out in any sort of unexpected way, it simply breaks.
My company has lost over 50 developer hours trying to get around these limitations (a few people tried to get a couple of different projects to work and failed). Our solution? We used NaturalDocs (a JavaDoc compiler). Is it perfect? No. Is it comparable to ASDoc in output? Sort of. It isn't as neat, and it would be nice if it treated things a little differently, but it works to display the documentation.

Dictionary: Open Source Project [closed]

Closed 7 years ago.
I'm looking for an open source project for a dictionary for a language (you have probably never heard of it) which has not been "digitized". The dictionary will go from one language to several others, and from several others to THE language. Since the language has not been "digitized", I need the following features along with searching for a word:
1 - Add your own translation to existing words/phrases
2 - Add a new word/phrase and add translation
3 - Request a word/phrase to be translated
4 - Rate (like/dislike or rate within a range) the translation (depending on the rating, "correctness" gets points)
5 - Possibly relate words (especially nouns) with pictures
6 - Easier to implement mobile version of it
I guess it's more of a "collaboration site" than a dictionary, so the project I'm looking for may not be called a "dictionary".
I know it's possible to design and write it from scratch, but it would be good to begin with something in hand, especially when you are spending your time and effort on non-profit work.
I have been looking around for such a project but haven't found anything useful. In the meantime I'm designing the architecture in my mind.
If you could share some open source projects, it would be really great.
Thanks.
I am unsure what exactly you need, but would Wiktionary be of any help? There are a lot of localized variations to support different languages and there will probably be a way to ask them to support your language of interest, if it is not already there.

Google Code Search-like source code indexer and visualizer [closed]

Closed 6 years ago.
I'm looking for a way to search through our subversion repository or just packaged source code.
Are there any downloadable servers/tools like Google Code Search to index source code (preferably with support for version control systems like svn) and allow us to search in it?
Is there any tool that will index documents too?
FishEye or OpenGrok possibly.
There are many tools that will index documents.
I believe the source code for Google Code Search is available here. It's implemented in Go.
https://code.google.com/p/codesearch/
Google made their internal Kythe source code analyser toolset available on GitHub, see http://www.kythe.io/.
It does a lot more than a simple text-level indexer. At the core it builds an AST graph from the source code and provides tools that operate on it and query it.
I use glimpse for code search. I use the free command line tool, and not the paid web interface. It's very quick, and can be combined with other tools to quickly find what you're looking for. I find it's easy to set up multiple repositories for different branches of the code. Additionally, I've created a few scripts to help query, format, and colorize the results.
A language-sensitive source code search engine can be found at SD Source Code Search Engine. It can handle many languages at the same time. Searches can be performed for patterns in a specific language, or patterns across languages (such as "find identifiers involving TAX"). By being sensitive to language tokens, the number of false positives is reduced, saving time for the user. It understands C, C++, C#, COBOL, Java, ECMAScript, XML, Verilog, VHDL, and a number of other languages.
[I'm a principal at the company]
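To illustrate what token-sensitive search buys you in general (this is only a sketch of the idea for Python source using the standard tokenize module, not how the product above works), the following looks for identifiers matching a pattern and ignores hits inside comments and string literals. The sample source and the find_identifiers helper are made up for the example.

```python
import io
import keyword
import re
import tokenize


def find_identifiers(source, pattern):
    """Return (line number, identifier) pairs whose identifier matches pattern."""
    regex = re.compile(pattern, re.IGNORECASE)
    hits = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        # NAME tokens cover identifiers and keywords; skip the keywords.
        if tok.type == tokenize.NAME and not keyword.iskeyword(tok.string):
            if regex.search(tok.string):
                hits.append((tok.start[0], tok.string))
    return hits


# Made-up sample: "tax" also appears in a comment and in a string literal.
SOURCE = '''\
# TODO: revisit tax handling
def total(price, tax_rate):
    label = "tax included"
    return price * (1 + tax_rate)
'''

print(find_identifiers(SOURCE, "tax"))  # [(2, 'tax_rate'), (4, 'tax_rate')]
```

A plain text search for "tax" would also report lines 1 and 3, which are exactly the kind of false positives a token-aware search avoids.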
Hound - code search tool with Web UI