I have a large text file, which is an Italian-English dictionary. A typical line is:
Mazzapícchio, a long pole that fishers vse to bob vp and down for Eeles, and also to make fish to stirre. Also a kind of meate or custard in some parts of Italie made with milke and egges.
(Yes, it's a 17th-century dictionary.)
I'm looking for the best/easiest way to turn this into a searchable database.
The search would need to ignore diacritics, with everything up to the first comma treated as the 'entry'. There are some cross-references, e.g.: Mefíte, as Mephíte.
My first thought is simply to turn it into HTML, with anchor tags for the word/phrase up to the first comma. That should be easy enough with a bit of Grep. I could also add links to the crossrefs in the same way (using BBEdit to confirm each change). It would then be easy to query just using a browser's search field.
However, ideally, I'd like something that returns all of the matching results, and only those. XML/HTML tagging is the easy bit: the problem is the front-end to access/query it.
I'm on macOS. (I'm also investigating Apple's Dictionary format...)
Any ideas on how to proceed would be welcome. Thanks.
This is a huge question, with many choices to make in many areas.
A small start:
For a searchable database, look at https://solr.apache.org/
Use PHP to handle the front-end interaction with Solr and to serve your HTML search form and results.
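For a single-user setup, something much smaller than Solr may also be enough. The following is only a minimal sketch in Python using the standard library's sqlite3 and unicodedata modules, not the approach suggested above. It assumes one entry per line, with everything before the first comma as the headword. The function names (fold, parse_line, build_db, search) and the table layout are made up for illustration.

```python
import sqlite3
import unicodedata


def fold(text):
    """Strip diacritics and case so 'Mazzapícchio' matches 'mazzapicchio'."""
    decomposed = unicodedata.normalize("NFD", text)
    without_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    return without_marks.casefold()


def parse_line(line):
    """Split a dictionary line at the first comma into (headword, definition)."""
    headword, _, definition = line.partition(",")
    return headword.strip(), definition.strip()


def build_db(lines, path=":memory:"):
    """Load the dictionary lines into a SQLite table (hypothetical schema)."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE entries (headword TEXT, folded TEXT, definition TEXT)"
    )
    conn.execute("CREATE INDEX idx_folded ON entries (folded)")
    rows = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        headword, definition = parse_line(line)
        rows.append((headword, fold(headword), definition))
    conn.executemany("INSERT INTO entries VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def search(conn, query):
    """Return every entry whose folded headword starts with the folded query."""
    pattern = fold(query) + "%"
    cursor = conn.execute(
        "SELECT headword, definition FROM entries "
        "WHERE folded LIKE ? ORDER BY folded",
        (pattern,),
    )
    return cursor.fetchall()
```

Calling build_db(open("dictionary.txt", encoding="utf-8")) and then search(conn, "mefite") would return only the matching entries (the file name here is a placeholder). A cross-reference such as "as Mephíte" could be followed by running a second search on the referenced headword.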
I'm trying to store some text containing HTML tags in properties, which doesn't work. I created a form for a property with the data type 'text' and a template. Saving the form writes the text into the template, but it can't be displayed, presumably because it contains illegal characters.
What I'm trying to do:
I need a form to enter data containing HTML tags and special characters.
I'd like to be able to use a query to find all those pages and show that text using a template I provide to the ask query.
I also tried to use the free text option, but then I can't retrieve it using the ask query.
What would be the best, or at least a working solution to this?
Thanks a lot
Storing text with HTML tags is a bit tricky in Semantic MediaWiki (SMW).
The reason is the strip markers (UNIQ/QINU) introduced by the MediaWiki developers.
When the content of a page with HTML tags in it is parsed, the parsing of those tags is, in effect, postponed. This technical detail unfortunately makes it hard for extension developers such as the SMW developers to handle such content. It also makes it hard for lay people to follow the discussion on how to solve the problem.
Here are two examples of SMW issues that are marked as "closed". This means that following the configuration hints in the issues should solve your problem. If not, please ask a question on the SMW issue list or ask for the issues to be reopened.
https://github.com/SemanticMediaWiki/SemanticMediaWiki/pull/794
https://github.com/SemanticMediaWiki/SemanticMediaWiki/issues/3707
On my wiki we ran into this and resolved it by replacing special characters (we had issues with [ ] =, but the same problem happens with < > tags too) with alternate Unicode characters, using the regex extension and a template, before setting the property with {{#set:}}. If you want to display the formatted text on the wiki directly, call that parameter separately, without the Unicode replacement.
When you want to display the property, you can then run the reverse replacement with regex before displaying your now intact code (using the template result format, which lets you perform the operation on the output of the query).
To switch to the replacement characters, you can create this template:
{{#regex:{{#regex:{{#regex:{{#regex:{{#regex:{{{1|}}}|/=/|꞊}}|/\[/|[}}|/\]/|]}}|/>/|≽}}|/</|≼}}
And to switch back, you can use this as a template:
{{#regex:{{#regex:{{#regex:{{#regex:{{#regex:{{{1|}}}|/꞊/|=}}|/[/|[}}|/]/|]}}|/≽/|>}}|/≼/|<}}
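The round trip these two templates perform can be illustrated outside the wiki. This is only a sketch of the same idea in Python, not SMW or parser-function code. The fullwidth brackets ［ and ］ used as stand-ins for [ and ] are an assumption, as are the names TO_SAFE, encode and decode.

```python
# Map each troublesome character to a look-alike that the wiki parser ignores.
# The substitutes for "=", ">" and "<" are the ones used in the templates;
# the fullwidth brackets are an assumed choice for illustration.
TO_SAFE = {
    "=": "\ua78a",  # ꞊ MODIFIER LETTER SHORT EQUALS SIGN
    "[": "\uff3b",  # ［ FULLWIDTH LEFT SQUARE BRACKET (assumption)
    "]": "\uff3d",  # ］ FULLWIDTH RIGHT SQUARE BRACKET (assumption)
    ">": "\u227d",  # ≽ SUCCEEDS OR EQUAL TO
    "<": "\u227c",  # ≼ PRECEDES OR EQUAL TO
}

ENCODE_TABLE = str.maketrans(TO_SAFE)
DECODE_TABLE = str.maketrans({safe: raw for raw, safe in TO_SAFE.items()})


def encode(text):
    """Replace special characters before the value is stored in a property."""
    return text.translate(ENCODE_TABLE)


def decode(text):
    """Restore the original characters when the stored value is displayed."""
    return text.translate(DECODE_TABLE)
```

Note that the round trip is only lossless if the original text never contains the substitute characters themselves.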
I'm working on an app where users will be able to make small modules, much like PowerPoint slides. So I need to create an editor that lets the user style the text: bullet lists, bold font, color, adding an image.
The only way I've found to let the user do that is to have them add the tags themselves.
I'm not sure if it's a good idea to have tags in my database.
It is fine. It is just data. In a database, HTML tags are meaningless. The problem arises when outputting the information. Some view engines escape output by default. In this case you don't want escaping. However, outputting the stored HTML unescaped makes the page vulnerable to cross-site scripting (XSS).
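One common way to reduce that risk is to pass the stored HTML through an allow-list before output. The sketch below uses only Python's standard html.parser and is a simplified illustration, not a vetted sanitizer. The tag list and the names ALLOWED_TAGS, AllowListSanitizer and sanitize are assumptions, and all attributes are dropped (so colors and image sources would need extra, carefully validated handling). For production use, a well-tested sanitizing library is the safer choice.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allow-list: only simple formatting tags survive.
ALLOWED_TAGS = {"b", "i", "u", "p", "ul", "ol", "li", "br"}


class AllowListSanitizer(HTMLParser):
    """Rebuild a fragment, keeping allowed tags and escaping all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Attributes (onclick, style, href, ...) are deliberately dropped.
        if tag in ALLOWED_TAGS:
            self.parts.append(f"<{tag}>")

    def handle_endtag(self, tag):
        if tag in ALLOWED_TAGS and tag != "br":
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always escaped, including text found inside removed tags.
        self.parts.append(escape(data))


def sanitize(fragment):
    """Return the fragment with disallowed tags and all attributes removed."""
    parser = AllowListSanitizer()
    parser.feed(fragment)
    parser.close()
    return "".join(parser.parts)
```

With this in place, the view can output sanitize(stored_html) unescaped, while every other user-supplied value keeps the view engine's default escaping.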
I have a requirement to produce letters to send to customers which will contain a report within the letter text. The idea is that the user can create letter paragraphs which can be saved in a database for later use, can be sequenced and can appear either before or after a report. The report will be in table form.
I've looked at using PDF::Table and PDF::API2 (both of which are good at what they do); however, both place 'items' on the page at fixed positions and do not create a free-flowing document.
Unless I've missed something, there is no way to add a table immediately after a paragraph of text, or vice versa, as page positions are required.
I have thought about using HTML::Template to create the basic letter, then HTML::HTMLDoc to convert it to PDF, but I would need the ability to insert a page break on change of customer.
What is my best option to achieve the above result please?
Many Thanks
There are only two ways that I've had any success with.
The first is the Apache XML-FOP project. This is a huge, sprawling Java library and specification for turning XML documents into nicely formatted PDFs. I was never good enough with XML stylesheets and transformations to get to grips with this.
The second is to generate OpenOffice/LibreOffice documents and then use a copy of LibreOffice in headless mode to convert them to PDFs. This is what I generally end up doing. You may want a minimal X11 installation for fonts etc., with Xvfb as a fake display.
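The headless conversion step can be scripted. The following is a minimal sketch in Python (the thread itself is about Perl, where the same command could be run with system). It assumes the LibreOffice launcher is on the PATH as soffice, and the helper names build_convert_command and convert_to_pdf are hypothetical.

```python
import subprocess
from pathlib import Path


def build_convert_command(source, outdir, soffice="soffice"):
    """Build the LibreOffice command line that converts one document to PDF."""
    return [
        soffice,
        "--headless",            # run without a GUI
        "--convert-to", "pdf",   # target format
        "--outdir", str(outdir), # directory that receives the PDF
        str(source),
    ]


def convert_to_pdf(source, outdir, soffice="soffice"):
    """Run the conversion and return the expected path of the resulting PDF."""
    subprocess.run(build_convert_command(source, outdir, soffice), check=True)
    return Path(outdir) / (Path(source).stem + ".pdf")
```

For example, convert_to_pdf("letter.odt", "out") would be expected to leave the result at out/letter.pdf, provided LibreOffice is installed.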
For editing the documents I've had success with the OpenOffice-OODoc distribution. HTH.
I have looked over the answers to what is the best to convert Word to HTML for free. What if I am willing to pay? The big issue is that these documents have several tables that need to be kept exact. The background colors and cell alignment have to match the original.
You're willing to pay? Try http://word-to-html.com/ or the even more expensive http://www.solutionsoft.com/convert-word-to-html.htm
The main thing the other answers miss is that Word does a horrid job of producing HTML, and otherwise reasonable tools like OpenOffice do an even worse job. The results are so incredibly bad that the usual approach has two steps:
Step 1: Export HTML from Word
Step 2: Post process the result to make it usable
An example (free) cleaner is http://word2cleanhtml.com/.
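To give an idea of what Step 2 involves, here is a rough sketch of a post-processing pass in Python. It is not how any of the tools mentioned here work internally. It assumes typical Word export markup (conditional comments, Office-namespaced tags such as o:p, Mso* classes, mso-* style declarations), and the function name clean_word_html is made up. Other style declarations, such as background colors and cell alignment, are left untouched.

```python
import re


def clean_word_html(html):
    """Strip common Word-specific clutter while keeping ordinary styles."""
    # Drop conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    html = re.sub(
        r"<!--\[if.*?<!\[endif\]-->", "", html,
        flags=re.DOTALL | re.IGNORECASE,
    )
    # Drop Office-namespaced tags like <o:p> and </o:p> (their text is kept)
    html = re.sub(r"</?[a-z]+:[^>]*>", "", html, flags=re.IGNORECASE)
    # Drop class attributes that name Word's built-in Mso* styles
    html = re.sub(
        r"\s+class=(\"Mso[^\"]*\"|'Mso[^']*'|Mso\w+)", "", html,
        flags=re.IGNORECASE,
    )

    def strip_mso(match):
        # Keep every declaration except the Word-only mso-* ones.
        quote = match.group(1)
        declarations = [d.strip() for d in match.group(2).split(";")]
        kept = [d for d in declarations if d and not d.lower().startswith("mso-")]
        if not kept:
            return ""
        return " style=" + quote + "; ".join(kept) + quote

    return re.sub(
        r"\s+style=([\"'])(.*?)\1", strip_mso, html,
        flags=re.DOTALL | re.IGNORECASE,
    )
```

Regular expressions are a blunt tool for HTML, so the output should be checked against the original tables before relying on it.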
If you have the choice, use Microsoft's "Web Page, Filtered" rather than the full HTML (you'll be much happier). Also consider a dark horse candidate: email the document to yourself via Gmail, then "view as HTML".
Word has an export (or save as) to HTML. Will that work?
It's Save As -- Other Formats -- Web Page, Filtered
What version of Word are you using?
Word has an option "Save as HTML". Isn't this enough?
You would just do File > Save As > change file type to HTML.
I've been grappling with the fraught area of escaping user (text) input for web pages. The ultimate goal is to have user input displayed and stored exactly as typed in, without breaking anything.
To that end I have been using the following test string :
'"_$%^&*()+=-£{}[]/n/<>\#~;|,.?#:!&``"'
It seems to work well (even Stack Overflow and Twitter are not immune, hence the backticks). My question is, will this string capture most escaping problems, for example going from a web page via Ajax to a database and back again?
In fact how do I display this string in Stack Overflow without the back ticks?
Is there a better one, e.g. say one that will highlight encoding problems too?
When I'm testing, I'm using something like this
a’b<’>",!"/%$?$&?%(()%/"!"/&?%$/"&$/"?%&?-f¯Ñ112üêù
This is generally sufficient to highlight encoding issues, at least from what I can see.
Including a mathematical symbol such as Unicode U+2202 (∂) might be useful too.
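A test string like these is most useful inside an automated round-trip check. Below is a minimal sketch in Python. It assumes JSON as the Ajax payload format and SQLite as the database, neither of which is stated in the question, and the helper names are made up. The string is the one from the question with U+2202 appended.

```python
import html
import json
import sqlite3

# The question's test string (raw, so the backslash stays literal) plus U+2202.
TEST_STRING = r"""'"_$%^&*()+=-£{}[]/n/<>\#~;|,.?#:!&``"'""" + "\u2202"


def round_trip_json(text):
    """Serialize and parse as JSON, as an Ajax request body might be."""
    return json.loads(json.dumps(text))


def round_trip_sqlite(text):
    """Store and read back through a parameterized query (no manual quoting)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE messages (body TEXT)")
        conn.execute("INSERT INTO messages (body) VALUES (?)", (text,))
        return conn.execute("SELECT body FROM messages").fetchone()[0]
    finally:
        conn.close()


def round_trip_html(text):
    """Escape for display in a page, then decode as a browser would."""
    return html.unescape(html.escape(text))


def survives_all(text):
    """True if the text comes back unchanged from every stage."""
    stages = (round_trip_json, round_trip_sqlite, round_trip_html)
    return all(stage(text) == text for stage in stages)
```

The key design choice is that each layer uses its own encoding mechanism (JSON serialization, query parameters, HTML escaping on output) and the stored value itself is never modified.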
That seems like it should be all of them. The smartest thing to do would be to (depending on the language you're using) use a library that has been well tested, that can sanitize user input. Just ask around what other websites use.
See here: http://gendoh.com/2511063
The post itself is written in Korean, but you can see what makes the difference between the several given patterns. (V1 to V3 are for generic web apps, while V4 and V5 are for JavaScript.)