Here's the general question that I'm asking:
How do you optimise your website so that searches using common misspellings of your name find their way to you?
And my specific situation:
At my company, we sell online education courses. These are given a code of two letters followed by two numbers, eg: BE01, BE02, IH01.
These courses have been around for some time now (9 years, which is like 63 internet years or something), and since our target market is fairly niche, most of our marketing comes from word-of-mouth from the small community.
I was looking at our statistics to see the search keywords used to get to our site, and the highest ranked one which wasn't just our company name was "BE10", which is one of our least popular courses. This made me think that people are typing in how they hear other people refer to the courses verbally, that is: "bee-ee-oh-one" - BEO1 (not BE01).
Looking at some other questions, the consensus is that the keywords meta tag is virtually useless and that this information should go into the content of the page. I obviously don't want to perpetuate the misconception that our courses are called BEO1 by putting that into the content, so what should I do?
I'd recommend making separate informational pages on the misspellings (http://example.com/beo1.html or what-have-you) that include a brief explanation about the confusion and refer to the correct course page. Get these indexed by including them in your sitemap (presumably you have one already), and if you like, improve their likelihood of indexing and their ranking by linking them in an inconspicuous "common misspellings" section in the real course pages.
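If it helps, here is a minimal sketch of the sitemap side of that, assuming pages like the hypothetical beo1.html above (the URLs are placeholders, not your real ones):

    # Minimal sketch: emit sitemap <url> entries for the misspelling pages.
    # The page list is hypothetical -- substitute your real misspelling pages.
    misspelling_pages = [
        "http://example.com/beo1.html",  # explains the BEO1 / BE01 confusion, links to the real course page
    ]

    entries = "\n".join(
        "  <url><loc>{}</loc></url>".format(url) for url in misspelling_pages
    )
    sitemap = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n'
        "{}\n</urlset>".format(entries)
    )
    print(sitemap)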
I am doing a project for my degree and I have an actual client from another college. They want me to do all this stuff with topic modeling on an SQL file of paper abstracts he's given me. I have zero experience with topic modeling, but I've been using Gensim and NLTK in a Jupyter notebook for this.
What he wants right now is for me to generate 10 or more topics and record the 10 most common words overall from the LDA's results. Then, if a word is very frequent in every topic, remove it from the resulting word clouds; if its frequency varies more, remove it only from the topics where it is infrequent and keep it in the more relevant topics.
He also wants me to compare the frequency of each topic across the SQL files from other years. And he wants the computer to generate sensible names for these topics.
I have topic models per year and overall, but of course they do not come out exactly the same for each year. My biggest concern is the first thing he wants, the removal process. Is any of this possible? I need help figuring out where to look, as Google is not giving me what I want; I am probably searching for it wrong.
Thank you!
Show some of the code you use so we can give you more useful tips. Also use the nlp tag; the tags you used are quite specific and not followed by many people, so your question might be hard for the relevant users to find.
By the whole word-removal thing do you mean stop words too? Or did you already remove those? Stop words are very common words ("the", "it", "me" etc.) which often appear high in most frequent word lists but do not really have any meaning for finding topics.
First you remove the stop words to make the most common words list more useful.
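With NLTK's stop word list and Gensim's tokenizer that could look roughly like this; the abstracts variable is assumed to be your list of abstract texts:

    import nltk
    from nltk.corpus import stopwords
    from gensim.utils import simple_preprocess

    nltk.download("stopwords")                      # one-off download of NLTK's stop word list
    stop_words = set(stopwords.words("english"))

    # 'abstracts' is assumed to be the list of abstract strings pulled from your SQL file
    tokenized = [
        [word for word in simple_preprocess(text) if word not in stop_words]
        for text in abstracts
    ]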
Then, as he requested, you look at which (more common) words are common in ALL the topics (for abstracts I can imagine this is stuff like hypothesis, research, paper, results etc., i.e. words that are abstract-specific but not useful for distinguishing topics between different abstracts) and remove those. For this kind of analysis, as well as for the initial LDA, it probably makes sense to use the data from all years so the model has a large amount of data in which to recognize patterns. But you should try the variations and see whether the per-year or the overall version gets you nicer results.
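A rough sketch of that removal step, assuming an already-trained Gensim LdaModel called lda (the variable names are placeholders):

    # lda is assumed to be an already-trained gensim.models.LdaModel
    top_n = 10

    # top words per topic, as sets
    topic_words = [
        {word for word, _ in lda.show_topic(topic_id, topn=top_n)}
        for topic_id in range(lda.num_topics)
    ]

    # words that rank highly in EVERY topic -> drop them from all word clouds
    common_everywhere = set.intersection(*topic_words)

    # per-topic word lists with the globally common words removed
    filtered_topic_words = [words - common_everywhere for words in topic_words]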
After you have your global word lists per topic, you go back to the original data (split up by year) to count how often the combined words from a topic occur in each year. If you view this over the years, you can probably see trends, e.g. topics that are popular in the last few years but weren't relevant if you go back far enough.
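Counting that per year could look something like this, continuing from the sketch above; docs_by_year is an assumed dict of tokenized abstracts keyed by year:

    from collections import Counter

    # filtered_topic_words: the per-topic word sets from the sketch above
    # docs_by_year: assumed dict mapping year -> list of tokenized abstracts for that year
    topic_counts_by_year = {}
    for year, docs in docs_by_year.items():
        word_counts = Counter(word for doc in docs for word in doc)
        topic_counts_by_year[year] = [
            sum(word_counts[word] for word in words) for words in filtered_topic_words
        ]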
The last thing you mentioned (automatically assigning labels to topics) is actually something quite tricky, depending on how you go about it.
The "easy" way would be e.g. just use the most frequent word in each topic as label but the results will probably be underwhelming.
A more advanced approach is Topic Labeling. Or you can try an approach like modified text summarization using more powerful models.
The scene:
I have indexed many websites using Nutch and Solr. I've implemented result grouping by site. My results output includes the page title, highlight snippets and URL. My issue is with the page navigation/copyright/company info bits that appear on many company sites.
A query for "solder", for example, may return 200+ results for a particular site -- but only a handful of the results are actually appropriate; perhaps the company's site structure includes "solder" on every page as part of their core business description, site navigation, etc. There are relevant results to see, but they're flooded by the irrelevant, repetitive matches from the other pages on the site.
The problem:
I've seen other postings asking how to prevent Nutch and Solr from indexing site headers, footers, navigation and other boilerplate, but with such a diverse group of sites this approach just isn't feasible. What I'm observing, however, is that although the content for each result is significantly different, the highlighted snippets returned are 90-100% identical for the results I don't want. Observe:
Products | Alloy Information || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Products Support Site Map Lead-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry
http://www.--------.com/Products/AlloyInformation.aspx
Products | Chemicals & Cleaners || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Products Industrial Division Products Services News Support Site Map Lead-Free Solutions Halogen-Free Products Sales
http://www.--------.com/Products/ChemicalsCleaners.aspx
Products | Rosin Based || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Products Products Services News Support Site Map Lead-Free Solutions Halogen-Free Products Sales Contacts Technical
http://www.--------.com/Products/RosinBased.aspx
Support | Engineering Guide || --------
-Free Solutions Halogen-Free Products Sales Contacts Technical Articles Industry Links Terms & Conditions Support Products Services News Support Site Map Lead-Free Solutions Halogen-Free Products Sales Contacts Technical
http://www.--------.com/Support/EngineeringGuide.aspx
The Big Idea:
This leads me to the question of whether I can filter or group results based on the highlighted snippets that are returned. I can't just group on the content field because 1) the field is huge; and 2) the content is very different from page to page. If I could group, exclude or deduplicate results whose snippets were >85% identical, that would probably solve the problem. Perhaps some sort of post-processing step, or some kind of tokenizer factory? Or a sort of IDF computed over the search results rather than the entire document set?
This seems like it would be a fairly common problem, and perhaps I've just missed how to do it. Essentially this is Google's "To blah blah your search, we have hidden xxx similar results. Click here to show them" feature.
Thoughts?
I don't think there is any way of doing exactly what you are asking, except post-processing, which would be up to you and not very efficient for larger result sets.
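For illustration, a minimal sketch of what that post-processing could look like, assuming you have already pulled the title/snippet/URL results out of Solr's highlighting response into a list of dicts (the field names are placeholders):

    from difflib import SequenceMatcher

    def dedupe_by_snippet(results, threshold=0.85):
        """Keep only results whose snippet is less than `threshold` similar to any kept so far."""
        kept = []
        for result in results:
            if all(
                SequenceMatcher(None, result["snippet"], seen["snippet"]).ratio() < threshold
                for seen in kept
            ):
                kept.append(result)
        return kept

    # toy example; in practice 'results' comes from Solr's highlighting section
    results = [
        {"url": "/Products/AlloyInformation.aspx", "snippet": "Lead-Free Solutions Halogen-Free Products Sales Contacts"},
        {"url": "/Products/RosinBased.aspx",       "snippet": "Lead-Free Solutions Halogen-Free Products Sales Contact"},
    ]
    print(dedupe_by_snippet(results))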
Maybe you should ask a different question: if the documents being returned are actually quite different even though the snippets are identical, presumably there is value in showing them all, rather than de-duplicating.
You could try enhancing the search result display to show more information about the documents so that the user can discriminate amongst them - maybe not relying on highlighting, but showing some other parts of the document as well?
I really do think, though, that at the heart of the problem is the need to make matches found in site boilerplate less relevant than matches found elsewhere. Usually relevance ranking does a good job of this because the common terms are much less important for relevance ranking, but if you are mixing documents from a wide range of different sites you might find the effect less pronounced - since oft-repeated terms on one site could be quite rare on another. If your results are truly segmented by site, you might consider creating separate indexes (cores) for each site - this would have the effect of performing the relevance calculations in a site-specific way, and might help with this problem.
The base Nutch distribution (not Solr) ships with a clustering mechanism. I don't really know how it works, but it does something like this which I had to remove. Have you looked at that?
Another idea that comes to mind would be to index the real content separately from the navigational snippets, and at search time apply a higher query weight to the 'real content' field.
That would pull forward pages with 'solder' in the real content, as opposed to pages with 'solder' only in the navigation, and yet you keep all pages just in case.
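A rough sketch of what that could look like at query time against Solr's HTTP API; the core name and field names below are made up, so adjust them to whatever your schema ends up using:

    import requests

    # Hypothetical fields: 'body_content' holds the real page text, 'nav_content' the boilerplate.
    params = {
        "q": "solder",
        "defType": "edismax",
        "qf": "body_content^5 nav_content^0.1",   # matches in real content count far more than navigation
        "hl": "true",
        "hl.fl": "body_content",                  # only highlight the real content field
        "wt": "json",
    }
    response = requests.get("http://localhost:8983/solr/mycore/select", params=params)
    for doc in response.json()["response"]["docs"]:
        print(doc.get("title"), doc.get("url"))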
Hope I understood your problem correctly.
I'm making a simple searchable list which will end up containing about 100,000 links on various medical topics- mostly medical conditions/diseases.
Now on the surface of things this sounds easy... in fact I've set my tables up in the following way:
Links: id, url, name, topic
Topics (eg cardiology, paediatrics etc): id, name
Conditions (eg asthma, influenza etc): id, name, aliases
And possibly another table:
Link & condition (since 1 link can pertain to multiple conditions): link id, condition id
So basically since doctors (including myself) are super fussy, I want to make it so that if you're searching for a condition- whether it be an abbreviation, british or american english, or an alternative ancient name- you get relevant results (eg "angiooedema", "angioedema", "Quincke's edema" etc would give you the same results; similarly with "gastroesophageal reflux" "gastro-oesophageal reflux disease", GERD, GORD, GOR). Additionally, at the top of the results it would be good to group together links for a diagnosis that matches the search string, then have matches to link name, then finally matches to the topic.
My main problem is that there are thousands, if not tens of thousands, of conditions, each with up to 20 synonyms/spellings etc. One option is to get data from MeSH, which happens to be a sort of medical thesaurus (but in American English only, so there would have to be a way of converting from British English). The trouble is that the XML they provide is INSANE and about 250 MB. To help, they provide a guide to what the data elements are.
Honestly, I am at a loss as to how to tackle this most effectively as I've just started programming and working with databases and most of the possibilities of what to do seem difficult/suboptimal.
Was wondering if anyone could give me a hand? Happy to clarify anything that is unclear.
Your problem is well suited to a document-oriented store such as Lucene. For example you can design a schema such as
Link
Topic
Conditions
Then you can write a Lucene query such as Topic:edema and you should get all results.
You can do wildcard search for more.
To match British spellings (or even misspellings) you can use the ~ fuzzy query, which finds terms within a certain edit distance. For example edema~0.5 matches oedema, oedoema and so on...
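For example, a rough sketch of running such a query from Python over Solr's HTTP API; the core name, field names and the exact query string for your schema are assumptions here:

    import requests

    term = "edema"
    # exact match on the Conditions field, plus a fuzzy match to catch oedema-style variants
    query = 'Conditions:{0} OR Conditions:{0}~'.format(term)

    response = requests.get(
        "http://localhost:8983/solr/medlinks/select",    # 'medlinks' core name is made up
        params={"q": query, "wt": "json"},
    )
    for doc in response.json()["response"]["docs"]:
        print(doc.get("name"), doc.get("url"))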
Apache Lucene is a Java library with ports available for most major languages. Apache Solr is a full-fledged search server built using the Lucene library and is easily integrable into your platform of choice because it has a RESTful API.
Summary: my recommendation is to use Apache Solr as an adjunct to your MySQL db.
It's hard. Your best bet is to use MeSH and then perhaps soundex to match on British English terms.
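If you go the MeSH route, the 250 MB file is manageable if you stream it rather than load it whole. A rough sketch with ElementTree's iterparse; the element names (DescriptorRecord, DescriptorName, Term, String) are from memory, so check them against the data-element guide, and the file name is a placeholder:

    import xml.etree.ElementTree as ET

    def iter_mesh_terms(path):
        """Stream MeSH descriptor records, yielding (preferred heading, list of synonym strings)."""
        for _, record in ET.iterparse(path, events=("end",)):
            if record.tag == "DescriptorRecord":
                heading = record.findtext("DescriptorName/String")
                synonyms = [term.findtext("String") for term in record.iter("Term")]
                yield heading, synonyms
                record.clear()   # release the record so memory stays flat

    # 'desc.xml' is a placeholder for whatever the MeSH descriptor file is called
    for heading, synonyms in iter_mesh_terms("desc.xml"):
        print(heading, synonyms[:5])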
I'm creating a music player, where the user can search for artists, albums, or songs.
I have created a script that reads all the tags from the mp3s in the music library, and updates a database of songs, in a single table, containing artist names, albums, track titles, etc.
Currently, this works well, because it can scan for any changes in the music library, and add/delete rows for corresponding songs in the database.
This scan routine is therefore a fairly short and easy-to-understand piece of code, because it maintains only a single table.
I understand the database would be more powerful if artists, albums, and tracks have their own table, and are all linked to each other. I haven't done anything about the search part yet -- how screwed am I, if I keep everything in one table?
Thanks.
Your database is not normalized. You say it's all in one table, but you haven't given any information about the schema.
The kinds of problems non-normalized databases have include consistency problems related to storing redundant information. If you have something like:
Album, Track, Artist
then to change the Album name, you have to change it on every track associated with the Album.
Of course, there are all kinds of "database" systems out there which are not normalized, but these usually have mechanisms to handle these kinds of things which are appropriate to their paradigms.
In regards to the Pink/P!nk situation, if that's a big deal to you, then yes, normalization would be useful.
Your songs table would reference an artist_id.
You'd also have a table of artist aliases, which would map the various names that a particular artist has gone by to that artist_id.
But this can get pretty complex, and technically, it may not even be correct in your situation, as if an artist chooses to release projects under different names, they may not want them all lumped together.
In general, normalized databases are a safe place to start, but there are plenty of good reasons to denormalize, and it is more important to understand those reasons than to blindly always do things one way.
Pretty screwed, indeed. It's hardly normalized. Go for separate tables.
If you've never heard of normalization or don't understand why it's important, perhaps you should read this. It's a succinct, simple explanation without a lot of jargon.
Or you could go straight to the source, since you're already using MySQL:
http://dev.mysql.com/tech-resources/articles/intro-to-normalization.html
Think about the cardinalities and relationships in your model:
an album will have zero or more tracks; a track will belong to only one album (album-to-track is one-to-many)
an artist can create zero or more albums; an album can be created by one or more artists (artist-to-album is many-to-many)
You'll want to think carefully about indexes and about primary and foreign keys. Add indexes to the non-key columns or groups of columns that you'll want to search on.
This design would have four tables: album, track, artist, and an artist_to_album many-to-many join table.
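A rough sketch of that four-table design, using SQLite from Python just for illustration (the MySQL DDL is the same shape):

    import sqlite3

    schema = """
    CREATE TABLE artist (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    );
    CREATE TABLE album (
        id    INTEGER PRIMARY KEY,
        title TEXT NOT NULL
    );
    CREATE TABLE track (
        id       INTEGER PRIMARY KEY,
        album_id INTEGER NOT NULL REFERENCES album(id),
        title    TEXT NOT NULL
    );
    CREATE TABLE artist_to_album (
        artist_id INTEGER NOT NULL REFERENCES artist(id),
        album_id  INTEGER NOT NULL REFERENCES album(id),
        PRIMARY KEY (artist_id, album_id)
    );
    -- index a non-key column you will search on
    CREATE INDEX idx_track_title ON track(title);
    """

    conn = sqlite3.connect(":memory:")
    conn.executescript(schema)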
So the subject you're asking about is called "Normalization" and while this is useful in many circumstances, it can't always be applied.
Consider the artist Pink. Some of her albums have her name as Pink and others as P!nk, which we recognize as the same visually because we know it's her. But a database would necessarily see these as two separate artists (which also makes searching for her songs harder, but that's another story). Also consider Prince, "The artist formerly known as Prince", etc.
So it might be possible to have an artist ID that matches to both Pink and P!nk but that also matches to her albums Funhouse etc. (I'm really gonna stop with the examples now, as any more examples will need to be tabular).
So I think the question becomes: how complex do you want your searching to be? As is, you're able to maintain a 1:1 correlation between tag and database info. It just depends how fancy you want things to be. Also, for the lookup I mentioned above, consider that most of the time this information is coming from the user; you really can't supply a lookup from P!nk to Pink any more than you would from Elephant to Pachyderm, because you don't know what people are going to want to enter.
I think in this case, the naive approach is just as well.
I collect news for certain topics and then run a Bayesian classifier on them to mark them as interesting or non-interesting.
I see that there are different articles which are essentially the same news item, e.g.:
- Ben Kingsley visits Taj Mahal with wife
- Kingsley romances wife in Taj's lawns
How do I teach the system to mark all these as duplicates?
Thanks
Sanjay
Interesting idea. I would guess this has been studied before; a look in some comp-sci journals should turn up a few good pointers. That said, here are a few ideas I have:
Method
You could find the most-unique key phrases and see how well they match the key phrases of the other articles. I would imagine the data published by Google on the frequency of phrases on the web would give you a baseline.
You somehow need to pick up on the fact that "in the" is a very common phrase but "Kingsley visits" is important. Once you have filtered all the text down to just the key phrases, you can see how many of them match.
key phrases:
set of all verbs, nouns, names, and novel (new/mis-spelt) words
you could grab phrases that are, say, between one and five words long
remove all phrases that are very common (you could have a classifier for common phrases)
see how many of them match between articles.
have a controllable slider to set the matching threshold (see the sketch after this list)
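Putting those pieces together by hand, a rough sketch (the stoplist and the threshold you compare the overlap against are toy values):

    import re

    # toy stoplist; in practice this would come from phrase-frequency data or a classifier
    COMMON_WORDS = {"in", "the", "with", "a", "of", "to"}

    def key_phrases(text, max_len=5):
        """All 1-5 word phrases from the text, minus those made entirely of common words."""
        words = re.findall(r"[a-z']+", text.lower())
        phrases = set()
        for n in range(1, max_len + 1):
            for i in range(len(words) - n + 1):
                phrase = words[i:i + n]
                if not all(word in COMMON_WORDS for word in phrase):
                    phrases.add(" ".join(phrase))
        return phrases

    def phrase_overlap(a, b):
        """Share of key phrases the two texts have in common (0..1); compare against a slider threshold."""
        pa, pb = key_phrases(a), key_phrases(b)
        return len(pa & pb) / float(len(pa | pb))

    print(phrase_overlap("Ben Kingsley visits Taj Mahal with wife",
                         "Kingsley romances wife in Taj's lawns"))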
It's not going to be easy if you write this yourself but I would say it's a very interesting problem area.
Example
If we just use the titles and follow the method through by hand:
"Ben Kingsley visits Taj Mahal with wife" will create the following key phrases:
Ben Kingsley
Kingsley
Kingsley visits
wife
Mahal
... etc ...
but these next ones should be removed as they are too common (hence they don't help to uniquely identify the content):
Ben
with wife
Once the same is done with the other title, "Kingsley romances wife in Taj's lawns", you can compare and find that quite a few key phrases match each other. Hence they are on the same subject.
Even though this is already a large undertaking, there are many things you could do to further improve the matching.
Extensions
These are all ways to trim the keyword set down once it is created.
WordNet would be a great place to start looking into getting a match between, say, "longer" and "extend". This would be useful as articles won't use the same lexicon for their writing.
You could run a Bayesian classifier on what counts as a key phrase. It could be trained on a set of matching/non-matching articles and their key phrases. You would have to be careful about how you deal with unseen phrases, as these are likely to be the most important things you come across. It might even be better to run it on what isn't a key phrase.
It might even be an idea to calculate the Levenshtein distance between some of the key phrases if nothing else finds a match. I'm guessing it is likely that there will always be some matches found.
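In case it helps, a small sketch of that; plain dynamic programming, nothing fancy:

    def levenshtein(a, b):
        """Edit distance between two phrases (insertions, deletions, substitutions)."""
        previous = list(range(len(b) + 1))
        for i, char_a in enumerate(a, start=1):
            current = [i]
            for j, char_b in enumerate(b, start=1):
                current.append(min(
                    previous[j] + 1,                        # deletion
                    current[j - 1] + 1,                     # insertion
                    previous[j - 1] + (char_a != char_b),   # substitution (0 if chars match)
                ))
            previous = current
        return previous[-1]

    print(levenshtein("Kingsley visits", "Kingsly visits"))   # 1 -- close enough to call a match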
I have a feeling that this is one of those things where a very good answer will get you a PhD. Then again, I suppose it has already been done before (Google must have some automatic way to scrape all those news sites and fit them into categories and group similar articles).
Good luck with it.
This is a classification problem, but a harder one given the number of distinct classes you will have. One option might be to reduce the size of each document using feature selection (more info). Feature selection involves selecting the top n terms (excluding stop words, and possibly applying stemming to each word as well). Do this by calculating, for each document, the mutual information (more info) of each term, ordering the terms by that number and selecting the top n terms for each document. This reduced feature set of the top n terms per document can then form the basis for your duplicate detection (for example, if two documents share more than x% of their terms, treat them as duplicates, with x again determined through backtesting).
Most of this is covered in this free book on information retrieval.
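As a rough illustration of the top-n-terms idea, here is a sketch that uses tf-idf as a simpler stand-in for the mutual-information ranking described above (scikit-learn and the parameter values are my own choices, not part of the original suggestion):

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer

    def top_terms_per_doc(docs, n=20):
        """Top-n terms for each document by tf-idf weight (stop words excluded)."""
        vectoriser = TfidfVectorizer(stop_words="english")
        tfidf = vectoriser.fit_transform(docs)
        vocab = np.array(vectoriser.get_feature_names_out())
        tops = []
        for row in tfidf:
            weights = row.toarray().ravel()
            best = weights.argsort()[-n:]
            tops.append({vocab[i] for i in best if weights[i] > 0})
        return tops

    def near_duplicates(docs, n=20, min_shared=0.5):
        """Pairs of documents sharing more than min_shared of their top-n terms (the x% threshold)."""
        tops = top_terms_per_doc(docs, n)
        pairs = []
        for i in range(len(docs)):
            for j in range(i + 1, len(docs)):
                if len(tops[i] & tops[j]) / float(n) > min_shared:
                    pairs.append((i, j))
        return pairs

    docs = [
        "Ben Kingsley visits Taj Mahal with wife",
        "Kingsley romances wife in Taj's lawns",
        "Stock markets fall on weak earnings",
    ]
    print(near_duplicates(docs, n=5, min_shared=0.3))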