How to create an indeed.com-like search? - mysql

If you have used indeed.com before, you may know that for the keywords you search for, it returns traditional search results along with multiple search-refinement options on the left side of the screen.
For example, searching for the keyword "designer", the refinement options are:
Salary Estimate
$40,000+ (45982)
$60,000+ (29795)
$80,000+ (15966)
$100,000+ (6896)
$120,000+ (2828)
Title
Floral Design Specialist (945)
Hair Stylist (817)
GRAPHIC DESIGNER (630)
Hourly Associates/Co-managers (589)
Web designer (584)
Company
Kelly Services (1862)
Unlisted Company (1133)
CyberCoders Engineering (1058)
Michaels Arts & Crafts (947)
ULTA (818)
Elance (767)
Location
New York, NY (2960)
San Francisco, CA (1633)
Chicago, IL (1184)
Houston, TX (1057)
Seattle, WA (1025)
Job Type
Full-time (45687)
Part-time (2196)
Contract (8204)
Internship (720)
Temporary (1093)
How does it gather statistics so quickly (e.g. the number of job offers in each salary range)? The refinement options appear to be created in real time, since even uncommon keywords load fast.
Is there a specific SQL technique for building such a feature? Or is there a guide on the web explaining the tech behind this?

The technology used in Indeed.com and other search engines is known as an inverted index, which is at the core of how search engines (e.g. Google) work. The filters you refer to ("refinement options") are known as facets.
You can use Apache Solr, a full-fledged search server built on Lucene that integrates easily into your application through its RESTful API. It comes out of the box with features such as faceting, caching, scaling, and spell-checking, and it is used by sites such as Netflix, CNET, and AOL - hence stable, scalable, and battle-tested.
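To make that concrete, here is a minimal sketch of a faceted Solr query over HTTP, in Python. The core name ("jobs") and the field names (title, company, location, salary, job_type) are hypothetical placeholders for whatever your schema defines:

    import requests

    # Ask Solr for the search results AND the sidebar counts in one request.
    params = {
        "q": "designer",         # the user's keyword query
        "rows": 10,              # how many ordinary results to return
        "wt": "json",
        "facet": "true",
        "facet.field": ["title", "company", "location", "job_type"],
        # Range faceting buckets salaries the way Indeed's sidebar does:
        "facet.range": "salary",
        "f.salary.facet.range.start": 40000,
        "f.salary.facet.range.end": 140000,
        "f.salary.facet.range.gap": 20000,
    }
    resp = requests.get("http://localhost:8983/solr/jobs/select", params=params)
    data = resp.json()

    # facet_counts maps each field to value/document-count pairs -
    # exactly what you render as the refinement sidebar.
    print(data["facet_counts"]["facet_fields"]["company"])

Solr computes the facet counts against the same result set as the query itself, which is why the sidebar numbers always agree with the results.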
If you want to dig into how facet-based filtering works, look up bitsets/bitarrays; the technique is described in this article.
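For intuition, here is a toy sketch of bitset-based facet counting using plain Python integers as bitsets; the document numbering and posting bitsets are made up:

    # Bit i set => document i is in the set. Each facet value keeps a
    # bitset of the documents that carry it; ANDing it with the query's
    # result bitset and counting set bits yields the sidebar numbers.
    query_hits = 0b10110111   # docs matching the keyword "designer"
    full_time  = 0b10010101   # docs with job_type = Full-time
    part_time  = 0b01100010   # docs with job_type = Part-time

    def facet_count(query_bits: int, facet_bits: int) -> int:
        # Intersect, then popcount (int.bit_count needs Python 3.10+).
        return (query_bits & facet_bits).bit_count()

    print("Full-time:", facet_count(query_hits, full_time))   # -> 4
    print("Part-time:", facet_count(query_hits, part_time))   # -> 2

Real engines do the same thing with compressed bitmaps over millions of documents, which is why the counts come back so quickly.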

Why do you think they load "too fast"? They certainly have a nicely scaled architecture, they use caching for sure, and they might be using a denormalized datastore to accelerate some computations and queries.
Take a look at Google and the number of web pages worldwide - do you also think Google works too fast?

In addition to what Mios said, and as Daimon mentioned, Indeed does use a denormalized doc store. Here is a link to Indeed's tech talk about its docstore:
http://engineering.indeed.com/blog/2013/03/indeedeng-from-1-to-1-billion-video/
And another related article on their engineering blog:
http://engineering.indeed.com/blog/2013/10/serving-over-1-billion-documents-per-day-with-docstore-v2/

Related

How to Improve TEXT_DETECTION of Google Vision API for non-language-specific text such as registration plates

I rely on Google Cloud Vision API textAnnotations to recognise Italian registration plates.
The problem is that ALL G's are reported as C's, which makes it completely useless: the user has to check each plate one by one, which is more error-prone than just hand-typing them.
How can I get C and G properly recognised? The Italian government is quite good at finding the right plate when you go over the speed limit, so I guess the font is not the problem...
E.g. here I get EY454WC, DN862CC, DM843CW - no exceptions.
I ended up using the OpenALPR API, which offers a free plan of up to 1,000 calls per month.
It really does the job!
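For reference, a hedged sketch of calling OpenALPR's v2 cloud API with Python; the endpoint and parameter names follow their docs as I remember them and may have changed, and SECRET_KEY is a placeholder for your own key:

    import requests

    SECRET_KEY = "sk_your_key_here"  # hypothetical placeholder

    with open("plate.jpg", "rb") as f:
        resp = requests.post(
            "https://api.openalpr.com/v2/recognize",
            params={"secret_key": SECRET_KEY,
                    "country": "eu",  # European plate patterns, incl. Italy
                    "topn": 5},       # return the 5 best candidate readings
            files={"image": f},
        )

    for result in resp.json().get("results", []):
        # Each result carries the best plate string plus ranked candidates,
        # so a misread character still tends to show up in the alternatives.
        print(result["plate"], result["confidence"])

Asking for several candidates (topn) helps with exactly the C/G confusion described above: even when the top guess is wrong, the correct plate is usually among the alternatives.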

Alternatives to Deprecated Google News API

What are some alternatives to the comprehensive and now deprecated Google News API?
I'm trying to load some JSON data from two government web services, parse it into a query for a news API, and get back a hefty set of relevant news.
The queries will be over economic and regional data - so it could be, say, the lowest-paying job in Dallas County.
Are there any news APIs with as much functionality?
This is a relatively old question, but there have been lots of recent developments in the News API space and I thought I could shed some light.
I’m going to immediately disregard the more expensive choices like Bing ($7/1,000 requests), since most people won’t find it easy to afford them, and the results are actually often inferior to more specialized solutions.
That means you’re basically left with two options: News API and newsapi.org. I really do enjoy newsapi.org, they do a fantastic job and provide high quality search results. They also provide developers with 500 free requests per day, which is pretty great.
ContextualWeb News API does have a couple of advantages: they offer 10,000 requests per month free, and after that our API is significantly cheaper and more flexible (the baseline is $0.5/1000 requests, compared with a slightly higher $1.8/1000 requests with newsapi.org’s basic plan). ContextualWeb also offers result keywords, allowing you to do all kinds of nifty machine learning stuff with the search results.
Having said that, newsapi.org’s documentation is currently more straightforward, and their results are mostly always spot on.
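As an illustration, a minimal newsapi.org query in Python; the /v2/everything endpoint and its parameters follow their public docs, and API_KEY is a placeholder:

    import requests

    API_KEY = "your_api_key"  # hypothetical placeholder

    resp = requests.get(
        "https://newsapi.org/v2/everything",
        params={
            "q": "lowest paying jobs Dallas County",  # keyword query
            "language": "en",
            "sortBy": "relevancy",  # or "publishedAt" / "popularity"
            "apiKey": API_KEY,
        },
    )
    for article in resp.json().get("articles", []):
        print(article["publishedAt"], article["title"])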
Check out the Social Animal NEWS API.
The news database has 450 million-plus articles, and 1 million articles are added daily.
Notable features:
It is fast and provides breaking news across the web in 39 languages.
Social engagement data is available for all news URLs, which enables surfacing top-quality articles.
Offers sentiment analysis to filter articles by sentiment.
For more details, please visit the documentation page.
Disclaimer: I work for Social Animal.
There's a free alternative that provides good data, but the number of queries is limited.
If you want a higher number of queries per day, you must pay for a subscription.
The site is called gnews.io; it's really simple to authenticate and get a key.
You could use a general news API like newsapi.org, or if you want more specific niche news (e.g. stocks), you can look for niche APIs like IEX, Alpha Vantage, or stocknewsapi.com.
Aylien provides a News API that gives you access to NLP-enriched news articles from 80,000+ news sources: https://aylien.com/product/news-api/demo
The Newsdata.io API is a simple news API with which you can search over 50,000 news sources worldwide. You can use it to get live breaking news from any country in the world in your preferred language; it provides news data in 22 languages, across 7 categories, from 88 countries.
Newsdata.io lets you search published articles by keywords or phrases, language, publication, and time. You can sort the results by time, popularity of the publication source, or location.
Newsdata.io also provides news data analysis, including topic labeling, intent detection, sentiment analysis, emotion analysis, entity extraction, and semantic similarity.
The Newsdata.io API is free for non-commercial use, with 500 API requests per day, 10 articles per request, and 99.99% SLA uptime.
Data format: the API is queried with simple GET HTTP requests and returns JSON results.
Ease of use: The API is simple to use. Furthermore, you can use its documentation to get started implementing the API in a matter of minutes.
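As a rough sketch (the endpoint and parameter names are taken from Newsdata.io's public docs as I recall them and may differ in current versions; API_KEY is a placeholder):

    import requests

    API_KEY = "your_api_key"  # hypothetical placeholder

    resp = requests.get(
        "https://newsdata.io/api/1/news",
        params={
            "apikey": API_KEY,
            "q": "regional economy",  # keyword or phrase search
            "language": "en",         # one of the 22 supported languages
            "country": "us",
        },
    )
    for item in resp.json().get("results", []):
        print(item["pubDate"], item["title"])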
Maybe you could try Google BigQuery

How do I explain APIs to a non-technical audience?

A little background: I have the opportunity to present the idea of a public API to the management of a large car sharing company in my country. Currently, the only ways to book a car are a very slow web interface and a hard-to-reach call center. So I'm excited about the possibility of writing my own search interface, integrating this functionality into other products and applications, etc.
The problem: Due to the special nature of this company, I'll first have to get my proposal through a commission, which is made up entirely of non-technical and rather conservative people. How do I explain the concept of an API to such an audience?
Don't explain technical details like an API. State the business problem and your solution to it - and how it would impact their bottom line.
For years, sales people have based pitches on two things: features and benefits. Each feature should have an associated benefit (to somebody, and preferably everybody). In this case, you're apparently planning to break what's basically a monolithic application into (at least) two pieces: a front end and a back end. The obvious benefits are that 1) each piece works independently, so development of each is easier; 2) different people can develop the different pieces; and 3) it's easier to increase capacity by simply buying more hardware.
Though you haven't said it explicitly, I'd guess one intent is to publicly document the API. This allows outside developers to take over (at least some) development of the front-end code (often for free, no less) while you retain control over the parts that are crucial to your business process. You can more easily [allow others to] add new front-end code to address new market segments while retaining security/certainty that the underlying business process won't be disturbed in the process.
HardCode's answer is correct in that you really should concentrate on the business issues and benefits.
However, if you really feel you need to explain something, you could use the medical receptionist analogy.
A medical practice has its own patient database and appointment scheduling system used by its admin and medical staff. This might be pretty complex internally.
However, when you want to book an appointment as a patient, you talk to the receptionist with a simple set of commands - 'I want an appointment', 'I want to see doctor X', 'I feel sick' - and they interface with their systems, based on your medical history, the symptoms presented, and resource availability, to give you an appointment in simple language - '4:30pm tomorrow'.
So, roughly speaking, using the receptionist is analogous to an external program using an API. It allows you to interact with a complex system to get the information you need without having to deal with its internal complexities.
They'll be able to understand the benefit of having a mobile phone app that can interact with the booking system, and an API is a necessary component of that. The second benefit of the API being public is that you won't necessarily have to write that app, someone else will be able to (whether or not they actually do is another question, of course).
You should explain which use cases will be improved by your project proposal, and what benefits they can expect, such as better customer satisfaction.

Does anyone bother with Dublin Core anymore?

As the question states, is there any point adding Dublin Core meta tags to your HTML head? Or has sitemaps.org removed the need for most of this (though it only replaces some of the tags)?
I ask this because most sites I visit don't seem to use DC meta tags in their source.
I'm interested in whether I need them for a site that will be used mostly by developers, though the discussion can be broader than that category.
To quote Google (from 2002):
"Currently we don't trust metadata because we are afraid of it being manipulated"
I would rather say that the time of rich metadata hasn't come yet. In fact, technologies like RDF are just on the way up. Tim Berners-Lee - you know, the guy who invented the web - quite recently spoke at TED about the next Web of open, linked data. So Dublin Core and other metadata formats are anything but out.
Dublin Core is still very important in certain industry sectors. Here in the UK, government organisations use DC to provide standardised access to tags.
META tags are not the only place you can put DC metadata. You can integrate it more with HTML using RDFa.
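For anyone who hasn't seen them, here is a small Python sketch that renders the classic DC META tags for a page head; the element names (DC.title, DC.creator, ...) come from the DCMI element set, while the page values are made up:

    from html import escape

    dc_fields = {
        "DC.title": "Widget API Reference",
        "DC.creator": "Example Dev Team",
        "DC.date": "2009-06-01",
        "DC.format": "text/html",
        "DC.language": "en",
    }

    # The schema link tells consumers which vocabulary the names refer to.
    lines = ['<link rel="schema.DC" href="http://purl.org/dc/elements/1.1/">']
    lines += [f'<meta name="{name}" content="{escape(value)}">'
              for name, value in dc_fields.items()]
    print("\n".join(lines))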
Now, as for proliferation - well, the only incentive it currently gives webmasters is the satisfaction of a job well done, since it does not yet affect SEO. As soon as this changes, you'll see an outburst of sites tagged with RDF and microformats. And it will come. Yahoo has already started working on that: http://ysearchblog.com/2008/03/13/the-yahoo-search-open-ecosystem/
I was looking on the web for information about Dublin Core and whether search engines used it, and I came across the academic paper "Search Engines and Resource Discovery on the Web: Is Dublin Core an Impact Factor?" by Mehdi Safari:
http://www.webology.ir/2005/v2n2/a13.html
To quote his conclusions section: "it was found that using Dublin Core elements did not improve the retrieval rank of the web pages" and that "Dublin Core metadata, as a well-known metadata schema, is not widely accepted and used by search engine designers and the spiders do not consider its elements while ranking the web pages".
This was back in 2005, but I am assuming this is still true.
Semantic web efforts are still sputtering along. I've run across a couple of research efforts to use RDF triples including the Dublin Core... but nothing close to commercialization.
However, as a general organizing principle for the world wide web? Don't bother. My guess is that folksonomies will deal with some metadata management, but that site tagging will need to be handled through ontological deduction of some sort. I get the same feeling around DC and RDF that I get around general-purpose, globally open UDDI registries: nice idea, but that's not the way the world works.
It would be kinda interesting to know whether DC tags increase your Google PageRank (and how reliably): that could be a strong incentive for many!

How to get started with speech-to-text?

I'm really interested in speech-to-text algorithms, but I'm not sure where to start studying up on them. A bunch of searching around led me to this, but it's from 1996 and I'm fairly certain that there have been improvements since then.
Does anyone who has any experience with this sort of stuff have any recommendations for reading / source code to examine? Or just general advice on what I should be trying to learn about if I want to get into the world of writing speech recognition programs (sometimes it's hard to know what to search for if you don't have much knowledge about the domain).
Edit: I'd like to do something cross-platform, but for the moment I'd be targeting linux.
Edit 2: Thanks csmba for the well-thought-out reply. At this point in time, I'm mainly interested in creating applications that allow automation, or execution of different commands, through voice. So, a limited set of recognizable commands that can be strung together. An example would be a music player that took commands like "Play the album Hello Everything by Squarepusher", or an application launcher that let the user create voice shortcuts to launch specific apps.
I realize that it's a pretty giant problem, and that I have nowhere near the level of knowledge required right now to tackle implementing an entire recognition engine, although the techniques involved with doing so fascinate me, and it is something I'd like to work myself up to doing. In all likelihood, I'll probably end up picking up a book or two on the subject and studying up / playing with "simple" implementations in my free time.
This is a HUGE question; I wouldn't know how to begin... So let me just try giving you the right "terms" so you can refine your quest:
First, understand that speech recognition is a diverse and complicated subject, and it has many different applications. People tend to map this domain to the first thing that comes to their head (usually, computers understanding what you are saying, as in IVR systems). So first let's separate the concept into the main categories:
Human-to-Machine: applications that deal with understanding what a human is saying, where the human knows he is talking to a machine and the grammar is very limited. Examples are:
Computer automation
Specialized: pilots automating some controls, for example (noise is a huge problem)
IVR (Interactive Voice Response) systems like Google-411 or when you call the bank and the computer on the other side says "say 'service' to get customer service"
Human-to-Human (spontaneous speech): This is a bigger, more complex problem. Here we can also break it down into different applications:
Call Center: conversation between Agent-Customer, phone quality, compressed
Intelligence: radio/phone/live conversations between 2 or more individuals
Now, Speech-To-Text is not what you should say you care about. What you care about is solving a problem. Different technologies are used to solve different problems. See an overview here of some of them. To summarize, other approaches are phonetic transcription, LVCSR, and direct-based recognition.
Also, are you interested in being the PhD behind the technology? You would need a Master's equivalent involving signal processing, and probably a PhD, to be cutting edge. In which case, you would work for a company that develops the actual speech engine. Companies like Nuance and IBM are the big ones, but Philips and other startups also exist.
On the other hand, if you want to be the one implementing applications, you will not be working on the engine but on building applications that USE the engine. A good analogy, I think, is from the gaming industry:
Are you developing the graphics engine (like the CryEngine), or working on one of the several hundred games that all use the same graphics engine?
Don't get me wrong, there is plenty of work on quality outside the IBMs/Nuances of the world. The engine is usually very open, and there is a lot of algorithmic tweaking to be done that can dramatically affect performance. Each business application has different constraints and cost/benefit functions, so you can spend many years experimenting to build better voice-recognition-based applications.
One more thing: in general, the lower in the stack you want to work, the stronger your statistics background needs to be.
At this point in time, I'm mainly interested in being able to create applications that allow automation
Good, we are converging here... Then you have no interest in "speech-to-text". That buzzword takes you to the world of full transcription, a place you do not need to go. You should be focusing on some of the more Human-to-Machine technologies like VoiceXML and the ones used in IVR systems (Nuance is the biggest player there).
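If you just want to experiment with a limited command grammar on Linux before reaching for VoiceXML, here is a rough sketch using the Python SpeechRecognition package with the offline CMU Sphinx backend (pip install SpeechRecognition pocketsphinx); this library is my suggestion, not something from the answer above:

    import speech_recognition as sr

    # The entire "grammar": only these exact phrases trigger an action.
    COMMANDS = {
        "play music": lambda: print("starting player..."),
        "open browser": lambda: print("launching browser..."),
        "stop": lambda: print("stopping..."),
    }

    recognizer = sr.Recognizer()
    with sr.Microphone() as source:                   # needs PyAudio installed
        recognizer.adjust_for_ambient_noise(source)   # calibrate to room noise
        print("Say a command...")
        audio = recognizer.listen(source)

    try:
        heard = recognizer.recognize_sphinx(audio).lower()
        action = COMMANDS.get(heard)
        action() if action else print("Unknown command:", heard)
    except sr.UnknownValueError:
        print("Could not understand the audio")

Keeping the accepted phrase set tiny is what makes this tractable: the smaller the vocabulary, the better the recognition accuracy, which is exactly the Human-to-Machine trade-off described above.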
I would definitely recommend picking up a book or two if you are new to the field. I've got no experience in the field, so I can't make a recommendation. If you are still in college (or still have close ties), you should find out if any of your professors can make a recommendation.
The survey you linked is probably an excellent resource, too. I'm sure there have been advancements since 1996, but the basics are unlikely to have fundamentally changed. If the survey is well-written, then it would be well worth your time to read it.
For OS X check out this: OS X Speech Technologies
For Windows check out this: Microsoft Speech API
I have worked with IBM's ViaVoice product. It has a good ASR (automatic speech recognition) engine and a nice text-to-speech engine.
The website's not very good, but here is a link for the Embedded version: http://www-01.ibm.com/software/voice/support/
It is platform agnostic, though, and everything works through an MVC architecture using VoiceXML, a variant of XML for voice applications.
What platform are you targeting? There are the Microsoft Speech APIs that you can use if it's for Windows.
There is also the Speech Recognition Service for Android.