I'm developing a companies catalog where each description is stored as HTML.
Should I store both the HTML and a plain-text version of the description?
Will it affect full-text search, which will be implemented later?
Of course, I could just strip the HTML tags at rendering time.
What is the best practice for this?
You could, of course, consider doing it the other way around: store the fields in the database, then use an HTML template to insert the fields in the required places. That way your data is not duplicated, and you can potentially have multiple HTML templates for the same underlying data.
Alternatively, you could store your fields in a single database field in some structured format (e.g. XML) and then transform that into HTML (e.g. with XSLT). Note: some databases understand XML natively; if yours doesn't, you can store individual fields, generate XML from them, and then apply XSLT to get your HTML.
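A minimal sketch of the first approach, assuming plain-text fields in the database and using only Python's standard library (the field names and template are illustrative, not from the original question):

```python
import html
from string import Template

# Hypothetical row fetched from the database: plain-text fields only.
company = {"name": "Acme Corp", "tagline": "Widgets & more"}

# One of possibly several HTML templates for the same underlying data.
PROFILE_TEMPLATE = Template(
    "<div class='company'><h1>$name</h1><p>$tagline</p></div>"
)

def render(fields: dict) -> str:
    # Escape the stored text before substitution so it stays HTML-safe.
    safe = {k: html.escape(v) for k, v in fields.items()}
    return PROFILE_TEMPLATE.substitute(safe)

print(render(company))
# → <div class='company'><h1>Acme Corp</h1><p>Widgets &amp; more</p></div>
```

Because the database only ever holds plain text, the full-text indexer can read the fields directly and no tag stripping is needed.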
If I have understood this correctly, one advantage of a document-oriented database like MongoDB is that it stores unstructured data well. For example, if I have different HTML files, is storing each one completely as a string in its own document an advantage? Or is the intent that the individual contents of the HTML files are stored separately as many key-value pairs, and the HTML file is only assembled from them afterwards?
If it's a matter of storing the HTML content only, you can save it directly as a string. Text searching can also be done easily by going this route.
On the other hand, if you need to process different elements, dynamically assign some values, replace some placeholders, etc., then you can think about saving the associated tags/elements separately.
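The two shapes can be sketched with plain dicts standing in for MongoDB documents (no driver involved; the field names are illustrative):

```python
# Shape 1: the whole HTML page stored as one string -- simple, and the
# string is directly usable for text search.
doc_whole = {"_id": 1, "html": "<html><body><h1>Hello</h1></body></html>"}

# Shape 2: the variable pieces stored as key-value pairs, with the HTML
# assembled on the way out -- better when placeholders change dynamically.
doc_fields = {"_id": 2, "title": "Hello", "body": "Welcome!"}

def assemble(doc: dict) -> str:
    # Build the page from the stored fields at read time.
    return (
        f"<html><body><h1>{doc['title']}</h1>"
        f"<p>{doc['body']}</p></body></html>"
    )

print(assemble(doc_fields))
```

With shape 2 you can also query or update a single field (e.g. just the title) without touching the rest of the page.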
I have implemented RDFa on a shopping website.
Now, how do I create a triple store from that structured data?
There are thousands of products on the website, so manually visiting every page and extracting the RDF is not a good solution. Are there any automatic tools for this?
The answer depends on how you "implemented RDFa". It is unlikely that the majority of your content is expressed as static information, so it is also unlikely that the majority of your content requires scraping.
There are tools, such as D2R Server, that give you facilities for exposing your underlying datastore as a read-only SPARQL endpoint. The only trick will be if you do have static content and wish to expose that as automatically generated RDF as well. That would require some finessing.
The data which is in RDFa format on your website probably comes from a database, where it is in relational form, since you probably didn't add the RDF triples to the HTML manually. So the easiest way to get the data into the triple store would not be from the HTML, but through some kind of transformation of the original data in the database. In the end, RDF triples can be seen as a ternary relation that can be stored well in any relational database.
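That database-side transformation can be sketched as a small generator that turns relational product rows into N-Triples lines (the URIs and the schema.org predicates are illustrative choices, not something from the original site):

```python
# Hypothetical product rows as they might come out of the shop database.
rows = [
    (101, "Red Mug", "7.99"),
    (102, "Blue Mug", "8.49"),
]

def to_ntriples(rows):
    """Turn (id, name, price) rows into N-Triples lines."""
    base = "http://shop.example/product/"
    lines = []
    for pid, name, price in rows:
        subject = f"<{base}{pid}>"
        lines.append(f'{subject} <http://schema.org/name> "{name}" .')
        lines.append(f'{subject} <http://schema.org/price> "{price}" .')
    return "\n".join(lines)

print(to_ntriples(rows))
```

The resulting file can then be bulk-loaded into most triple stores, which is far cheaper than scraping thousands of rendered pages.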
GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a way of using XSLT to extract the RDF triples from the HTML, in case you do not have access to a relational database that stores the data. Hope this helps.
I want to store metadata about the files in a directory (i.e. file name, file size, etc.) in a file so that I can reduce search time. My problem now is finding an efficient way to do it. I have considered JSON and XML but can't decide between the two. Also, if there is a better way, let me know.
I'd say it depends on what kind of data you prefer to work with and on the structure of your data (very simple, like a list of words; less simple, like a list of words with the number of times each word was searched; ...).
For a list of words you can use a simple text file with one word per line, or comma-separated values (CSV); for a less simple structure, JSON or XML will work fine.
I like working with JSON, as it is lighter and less verbose than XML. If you don't plan to share this data and/or it isn't complex, you don't need the validation (XSD, etc.) offered by XML.
And even if you do plan to share the data, JSON still works.
You'll need some server-side code to write the data to a file: PHP, Java, Python, Ruby, ...
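A minimal sketch of the JSON route in Python, collecting name and size per file with the standard library (the file names here are just for the demo):

```python
import json
import os
import tempfile

def index_directory(path: str) -> list:
    """Collect name and size for each regular file in `path`."""
    return [
        {"name": entry.name, "size": entry.stat().st_size}
        for entry in os.scandir(path)
        if entry.is_file()
    ]

# Demo against a throwaway directory.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "a.txt"), "w") as f:
        f.write("hello")
    entries = index_directory(d)           # scan before writing the index
    with open(os.path.join(d, "index.json"), "w") as f:
        json.dump(entries, f)              # the cached metadata file

print(entries)
```

Later lookups then read `index.json` instead of re-scanning the directory.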
I would recommend a JSON file if you use it almost like a properties file.
If you plan to load the data from the file into a database, then you can go for XML, where you have the option to use JAXB/JPA in a Java environment.
I have access to a web service that returns XML or JSON. Since web service calls are expensive, I want to store the XML/JSON in my database so that I can access the data faster. The question is whether I should just store the entire XML/JSON in a field, or design a database model that represents the XML/JSON in a normalized way.
If I just needed to have the XML/JSON data available to me, then saving it as a string in a field would be OK.
However, I know that I'll need to extract only certain XML/JSON documents, so I kind of need to be able to query this. For simple queries, maybe I can use something like LIKE '%<title>hello world</title>%' if I were searching for "hello world" between the title tags of an XML document. But I think some of my queries might go beyond string matching (e.g. greater than a certain number or date, etc.). So I feel like I need to model this data properly in my database and populate it with the values from the XML/JSON. That will become a painful exercise, though.
Any tips on what I should do? Maybe there is an option I didn't consider.
It really sounds like you have the requirement to translate the XML/JSON document into a distinct set of fields.
There are DBs out there that can return native JSON, but you normally would use the application layer.
I normally store pure XML/JSON when there's really no obvious need to access individual fields. For example, I'll store control data in XML format in a BLOB field, since I don't generally need to search for one particular control string.
One thing you might want to think about is using a NoSQL solution like MongoDB or CouchDB to store the JSON. You can put the JSON string in and pull it out directly, as well as access the individual fields.
A document database like MongoDB or CouchDB gives you the flexibility to natively store and retrieve the JSON document, as well as access and search the individual fields inside it, without needing to know in advance how the document is going to be used. It really just eliminates a couple of steps in converting relational data into a structured document and converting a structured document back into relational data.
When I do store XML directly in a BLOB, any searchable data goes in a field outside the BLOB. Disk space is relatively cheap, and this is a minor denormalization. Sure, you'll have to make sure to keep those fields updated whenever the JSON/XML document is updated, but that's easy to do at the application layer.
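A sketch of that pattern using SQLite: the full document is kept as a string, while the queryable fields are lifted into real columns so comparisons work (the table name and fields are illustrative):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT, price REAL, raw TEXT)"
)

def save(doc: dict) -> None:
    # Keep the full document, but lift the queryable fields into columns.
    conn.execute(
        "INSERT INTO docs (title, price, raw) VALUES (?, ?, ?)",
        (doc["title"], doc["price"], json.dumps(doc)),
    )

save({"title": "hello world", "price": 9.99, "extra": {"sku": "A1"}})
save({"title": "goodbye", "price": 25.00})

# Range queries run against the lifted columns -- no string matching needed.
rows = conn.execute("SELECT title FROM docs WHERE price > 10").fetchall()
print(rows)  # → [('goodbye',)]
```

The application layer just has to repeat the `save` step whenever a document changes, so the columns never drift from the blob.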
From Sphinx reference manual: «The data to be indexed can generally come from very different sources: SQL databases, plain text files, HTML files, mailboxes, and so on»
But I can't find out how to add text files and HTML files to the index. The quick Sphinx usage tour shows the setup for a MySQL database only.
How can I do this?
You should look at the xmlpipe2 data source.
From the manual:
xmlpipe2 lets you pass arbitrary full-text and attribute data to Sphinx in yet another custom XML format. It also allows you to specify the schema (i.e. the set of fields and attributes) either in the XML stream itself or in the source settings.
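A small script wired in as the xmlpipe2 command could emit that stream from plain text or HTML files. A sketch of the generator (a minimal schema with one field; check the Sphinx manual for the full set of options):

```python
from xml.sax.saxutils import escape

def xmlpipe2(docs: dict) -> str:
    """Emit a minimal xmlpipe2 stream for {id: text} documents."""
    out = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="content"/>',
        "</sphinx:schema>",
    ]
    for doc_id, text in docs.items():
        out.append(f'<sphinx:document id="{doc_id}">')
        # Escape so markup in HTML source files stays well-formed XML.
        out.append(f"<content>{escape(text)}</content>")
        out.append("</sphinx:document>")
    out.append("</sphinx:docset>")
    return "\n".join(out)

print(xmlpipe2({1: "plain text file contents", 2: "<p>html contents</p>"}))
```

The indexer config would then point its source at this script, which reads the files and prints the stream to stdout.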
I would suggest inserting the texts into a database instead. That way you can retrieve them, and probably highlight your search results, much more easily and quickly.