Best method to scrape a large number of Wikipedia tables into a MySQL database

What would be the best programmatic way to grab all the HTML tables of Wikipedia main article pages where the pages' titles match certain keywords? Then I would like to take the column names and table data and put them into a database.
I would also grab the URL and page name for attribution.
I don't need specifics, just some recommended methods or perhaps links to some tutorials.

The easy approach to this is not to scrape the Wikipedia website at all. All of the data, metadata, and associated media that form Wikipedia are available in structured formats, which precludes any need to scrape their web pages.
To get the data from Wikipedia into your database (which you may then search, slice, and dice to your heart's content):
Download the data files.
Run the SQLize tool of your choice.
Run mysqlimport.
Drink a coffee.
The URL of the original article should be easy to reconstruct from the page title, as in the sketch below.
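As a minimal sketch of that last point (assuming the English Wikipedia, and a page title stored as plain text, e.g. read from the dump's page table):

```php
<?php
// Hypothetical helper: rebuild an English Wikipedia article URL from a
// page title. Wikipedia replaces spaces with underscores in its URLs.
function wikipediaUrl(string $title): string
{
    $path = str_replace(' ', '_', $title);
    // Percent-encode the rest; rawurlencode() leaves underscores alone.
    return 'https://en.wikipedia.org/wiki/' . rawurlencode($path);
}

echo wikipediaUrl('List of sovereign states');
// https://en.wikipedia.org/wiki/List_of_sovereign_states
```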

Related

Create searchable archives from a static HTML/CSS onepage site with daily archives

We are looking to hire a developer to build a custom solution for us, but before that we basically need to know what questions we should be asking, since none of us have any experience with programming. We have a website that is a daily listing of coffee news that is then archived, with each HTML file representing an entire day of news. What we're looking for is some sort of search functionality that would allow specific results to be displayed, rather than the entire page the results came from.
Here is the website in question: http://dailydose.coffeetalk.com/
Typically you will have each news clip stored in a database. Then you want the developer to write functionality to query that database and find news clippings of interest to the users.
The HTML only serves as a template to display the data retrieved from the database.
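As a rough sketch of what that functionality could look like (the "clippings" table and its columns are hypothetical; the real schema is up to the developer):

```php
<?php
// Search the archived news clippings for a keyword and show only the
// matching clips, not the whole day's page. Connection details are
// placeholders.
$mysqli = new mysqli('localhost', 'user', 'password', 'dailydose');

$term = '%' . ($_GET['q'] ?? '') . '%';
$stmt = $mysqli->prepare(
    'SELECT title, body, published_on
       FROM clippings
      WHERE title LIKE ? OR body LIKE ?
   ORDER BY published_on DESC'
);
$stmt->bind_param('ss', $term, $term);
$stmt->execute();

foreach ($stmt->get_result()->fetch_all(MYSQLI_ASSOC) as $row) {
    echo '<h3>' . htmlspecialchars($row['title']) . '</h3>';
    echo '<p>' . htmlspecialchars($row['body']) . '</p>';
}
```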

Building a website - Displaying product information

I am building an e-commerce website for a friend, I have some knowledge of HTML and CSS but I wouldn't class myself as advanced on the subject. I said I would do it as a favor/experience.
I just have a question about displaying information for multiple products. My page currently has 12 items on it. Do I need to create a separate page for each of those products with some information on it, like so:
www.shoes.com/trainers/shoe1.html or www.shoes.com/trainers/shoe2.html
etc., or is there a more efficient way of doing it?
I only ask because after looking around, the end urls do not contain pages like the above but look more like the following:
www.shoes.com/index.php?id_product=1025&controller=product
If anyone could help me out or point me somewhere I'd appreciate it.
Store your products in a database, if at all possible. This way you can use queries to sort and filter your products easily.
What you are further looking for is a dynamic website (http://php.net/manual/en/tutorial.php), using, for example, a PHP script that fetches the desired data via MySQLi (http://www.w3schools.com/php/php_ref_mysqli.asp).
With this, you can create any lists and links using, for example, the product ID to refer to a product, as in your example ("?id_product=1025").
Your PHP script would look for id_product ($_GET["id_product"]) and use this to query your database and get the desired data.
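A minimal sketch of such a script, assuming a hypothetical products table with id, name, price, and description columns:

```php
<?php
// product.php - one template serves every product; only the data changes.
// Requested as index.php?id_product=1025 in the question's example.
$mysqli = new mysqli('localhost', 'user', 'password', 'shop');

$id = (int) ($_GET['id_product'] ?? 0);
$stmt = $mysqli->prepare('SELECT name, price, description FROM products WHERE id = ?');
$stmt->bind_param('i', $id);
$stmt->execute();
$product = $stmt->get_result()->fetch_assoc();

if ($product === null) {
    http_response_code(404);
    exit('Product not found');
}

printf(
    '<h1>%s</h1><p>%s</p><p>Price: %.2f</p>',
    htmlspecialchars($product['name']),
    htmlspecialchars($product['description']),
    $product['price']
);
```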
What you want is a dynamic website. You build a database with all the products, and create a "template" HTML page for how you want the product page to look. For that, you will need to know a server-side scripting language, like PHP or ASP.
If you are only familiar with HTML and CSS, your only option is building a "static" website by creating an HTML page for each product. If there are a lot of products, that will be tedious and inefficient.
I would suggest a ready-made CMS, like WordPress for example. It has many "store" plugins you can download; one of them is WooCommerce. It's free to download but has paid plugins. I use it and I am happy with it.
You have to paginate your data. To do this, create a database first, then use any server-side scripting language. For example, this article guides you through paginating your data with PHP:
http://code.tutsplus.com/tutorials/how-to-paginate-data-with-php--net-2928
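A minimal sketch of the LIMIT/OFFSET technique that tutorial covers (the products table is an assumption):

```php
<?php
// Show 12 products per page, with simple previous/next links.
$mysqli = new mysqli('localhost', 'user', 'password', 'shop');

$perPage = 12;
$page    = max(1, (int) ($_GET['page'] ?? 1));
$offset  = ($page - 1) * $perPage;

$stmt = $mysqli->prepare('SELECT id, name FROM products ORDER BY id LIMIT ? OFFSET ?');
$stmt->bind_param('ii', $perPage, $offset);
$stmt->execute();

foreach ($stmt->get_result()->fetch_all(MYSQLI_ASSOC) as $row) {
    echo '<a href="index.php?id_product=' . $row['id'] . '">'
       . htmlspecialchars($row['name']) . '</a><br>';
}

$total = $mysqli->query('SELECT COUNT(*) FROM products')->fetch_row()[0];
if ($page > 1) {
    echo '<a href="?page=' . ($page - 1) . '">Previous</a> ';
}
if ($offset + $perPage < $total) {
    echo '<a href="?page=' . ($page + 1) . '">Next</a>';
}
```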
I will just explain the two types of URL:
www.shoes.com/trainers/shoe1.html
www.shoes.com/trainers/shoe2.html
The above method is good from an SEO point of view; search engines work efficiently with static URLs.
www.shoes.com/index.php?id_product=1025&controller=product
The second means you are building the website with PHP and passing the product ID in the URL as ?id_product=1025. If you are creating an e-commerce website, designing a static page for every product is bad practice.
My suggestion is that you try Magento, which has most of the features an e-commerce site needs.

Using a database to store and get html pages for website

There is something I don't understand at the moment. I'm developing a website with many articles, and instead of creating a .html page for every article, I thought about storing the text in a database and getting it from there (somehow) again.
I'm totally unsure whether storing text in a database is the common way. How do all of the "big" websites handle the mass of articles they publish? I guess they don't create single pages either, but instead use a database.
But how can I achieve this? How can I store whole html files with divs and jquery and stuff into a database and get them when clicking on a link? Might XML be a keyword?
First of all, you need to clearly understand how things should work.
Clearly the approach of creating a page per article cannot work for multiple reasons:
if you have a huge number of articles you'll need to have a huge number of pages
if you need to change something small in design, you'll need to make that change for every single stored article
What you need to do is to create a more generic page, which has all the common stuff for all articles in it (a place for title, a place for content). The articles themselves can be stored in a database. When opening a page for a specific article, your application should place the title and content in the right place in that page.
This approach is universal: it will work for any number of articles.
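As a hedged illustration of that generic page (assuming PHP and a hypothetical articles table with title and content columns):

```php
<?php
// article.php - a single page that renders any article from the database.
$mysqli = new mysqli('localhost', 'user', 'password', 'mysite');

$id = (int) ($_GET['id'] ?? 0);
$stmt = $mysqli->prepare('SELECT title, content FROM articles WHERE id = ?');
$stmt->bind_param('i', $id);
$stmt->execute();
$article = $stmt->get_result()->fetch_assoc() ?: ['title' => 'Not found', 'content' => ''];
?>
<!DOCTYPE html>
<html>
<head>
  <!-- The common layout, CSS and jQuery includes live here exactly once. -->
  <title><?= htmlspecialchars($article['title']) ?></title>
</head>
<body>
  <h1><?= htmlspecialchars($article['title']) ?></h1>
  <div class="article-body"><?= $article['content'] /* stored HTML fragment */ ?></div>
</body>
</html>
```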
The keywords you are looking for are: Dynamic, Content Management.
In order to achieve this, you should learn a scripting language, PHP for example.
You will find a lot of tutorials to get started and how to make your website a bit more dynamic.
But you were right about the database part, most blogging systems and other content providers use databases to store all of this in data tables. PHP (and some other languages) would allow you to interface the database and the content you provide to your users.
You should look into using a web development framework like Ruby on Rails. Rails has templating that essentially lets you define variables inside of your HTML (e.g. "text of article").
As for storing the text of the article, the way I do things like that is to store it in a file on my server, then fetch that file using AJAX and insert it into the HTML page.
Most sites accomplish this by having templates, in which the common-to-every-page html is stored in a file. Page-specific data (article text, etc.) is stored in the database and "inserted" into the relevant parts of the template before returning to the client.
Download WordPress and check how it works! It will help you:
http://wordpress.org/download/

What does it mean to "Index a page"?

For example, in the sentence "This tells Google how to index the page", what does indexing the page mean in the grand scheme of things? Why would a page have an 'index'? What is it useful for?
Google servers are constantly visiting pages on the Internet (crawling) and reading their contents. Based on the contents Google builds an internal index, which is basically a data structure mapping from keywords to pages containing them (very simplified). Also when the crawler discovers hyperlinks, it will follow them and repeat the process on linked pages. This process happens all the time on thousands of servers.
In general, the term indexing means analyzing large amounts of data and building some sort of index to access the data in a more efficient way based on some search criteria. Compare it with database indexes.
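As a toy illustration of the idea (nothing like Google's real implementation), here is a tiny inverted index in PHP:

```php
<?php
// Build a map from each word to the pages containing it, so that a
// keyword lookup becomes a direct array access instead of a full scan.
$pages = [
    'page1.html' => 'coffee news daily',
    'page2.html' => 'daily shoe deals',
];

$index = [];
foreach ($pages as $url => $text) {
    foreach (array_unique(str_word_count(strtolower($text), 1)) as $word) {
        $index[$word][] = $url;
    }
}

print_r($index['daily']); // both pages contain "daily"
```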
I guess you are asking what the need is for indexing with Google. Here is why:
Say you have created a website that is very beautiful and has all the good features. But the web is all about connecting webpages, and so far you have built a site that only you can look at. If you want the world to know about your site, the next step is hosting. After that, you have to get your webpage indexed by a search engine, say Google. Your site will then be indexed by the Google bot (I can't explain here how the bot works). When a person then searches for your site in that engine, the engine can use its index to retrieve your page as a result :) This is how you connect to the WEB!
This simply means Google is reading your page, figuring out what content is on it (via the page structure, links, etc.), assigning a PageRank to it, among other things, and adding it to their database.
There is no specific terminology here.
See Web Crawler: http://en.wikipedia.org/wiki/Web_crawler
In short, an index is like a book's table of contents: it helps you search for material so you can easily access the data or information within a given collection, whether that is a book or a web page.

Opening MySql database to search engines

Most of the content in my web application is stored in a MySQL database. I want to open this content up for search engines to index.
What is the best solution to do this?
"Best" could mean either performance-oriented or easy to implement.
Thanks in advance!
You can also create a Sitemaps XML file that could sit at example.com/sitemaps.xml and contain a dump of all blog posts, products, user profiles, etc. in a format Google can understand (more so than a normal webpage).
You can also ping a URL to tell Google to come and check your sitemap whenever you add or edit content.
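A hedged sketch of generating such a file from the database (the articles table and URL scheme are assumptions; the XML format itself is the standard one from sitemaps.org):

```php
<?php
// sitemap.php - dump every article's URL into Sitemaps XML.
$mysqli = new mysqli('localhost', 'user', 'password', 'mysite');

header('Content-Type: application/xml');
echo '<?xml version="1.0" encoding="UTF-8"?>' . "\n";
echo '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' . "\n";

$result = $mysqli->query('SELECT id, updated_at FROM articles');
while ($row = $result->fetch_assoc()) {
    echo "  <url>\n";
    echo '    <loc>http://example.com/article.php?id=' . $row['id'] . "</loc>\n";
    echo '    <lastmod>' . date('Y-m-d', strtotime($row['updated_at'])) . "</lastmod>\n";
    echo "  </url>\n";
}
echo '</urlset>';
```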
Assuming you are talking about web based search engines (such as Google), then they index webpages.
Make webpages for all entries in the database and link to them.
Like David said, a webpage should be available for each resource: not only to force indexing, but also as a "landing page" to which the search result will direct the user. This can then of course be a redirect to another page.
The pages can be dynamic, of course, but make sure that they are referenced somewhere on your site so the spiders can reach them.