Trying to find connectivity in the Wikipedia Graph: Should I use the API or the relevant SQL Dump? - mysql

I apologize for the slightly obnoxious nature of my question.
A few years ago, I came across a game that can be played on Wikipedia. The goal is to start from a random page and try to get to the 'Adolf Hitler' page by following internal wikilinks within 5 clicks (6 degrees of separation). I found this to be an interesting take on the Small-world experiment and tried it a few times, and I was able to reach the target article within 5-6 clicks almost every time (however, there was no way for me to know whether that was the shortest path or not).
As a project, I want to find the degree of separation between a number of random Wikipedia pages (maybe hundreds or thousands, if feasible) and Adolf Hitler's page, in order to create a histogram of sorts. My intention is to do an exhaustive search in a DFS manner from the root page (restricting the 'depth' of the search to 10, in order to ensure that the search terminates in case it has selected a bad path or is running in cycles). So the program would visit every article reachable within 10 clicks of the root article and find the shortest way in which the target article is reachable.
I realise that the algorithm described above would certainly take too long to run. I have ideas for optimizations which I will play around with.
Perhaps I will use a single-source shortest-path BFS-based approach, which seems more feasible considering that the degree of the graph would be quite high (mentioned later).
I will not list all of my algorithm ideas here, as they are not relevant to the question: in any possible implementation, I will have to query (either directly, by downloading the relevant tables to my machine, or through the API) the following:
the pagelinks table, which contains information about all internal links on all pages, in the form of the 'id' of the page containing the wikilink and the 'title' of the page being linked to
the page table, which contains the information needed to map a page 'title' to its 'id'. This mapping is needed because of the way data is stored in the pagelinks table.
Before I knew Wikipedia's schema, I naturally started exploring the Wikipedia API and quickly found out that the following API query returns the list of all internal links on a given page, 500 at a time:
https://en.wikipedia.org/w/api.php?action=query&format=jsonfm&prop=links&titles=Mahatma%20Gandhi&pllimit=max
Running this on MediaWiki's API sandbox a few times gives a request time of about 30 ms for 500 link results returned. This is not ideal, as even 100 queries of this nature would end up taking 3 seconds, which means I would certainly have to reduce the scope of my search somehow.
On the other hand, I could download the relevant SQL dumps for the two tables here and use them with MySQL. English Wikipedia has around 5.5 million articles (which can be considered the vertices of the graph). The compressed sizes of the tables are around 5 GB for the pagelinks table and 1 GB for the page table. A single tuple of the page table is a lot bigger than that of the pagelinks table, however (max size of around 630 and 270 bytes respectively, by my estimate). Sorry, I could not provide the number of tuples in each table, as I haven't yet downloaded and imported the database.
Downloading the database seems appealing because, since I would have the entire list of pages in the page table, I could resort to a single-source shortest-path BFS from Adolf Hitler's page by tracking all the internal backlinks. This would end up finding the degree of separation of every page in the database. I would also imagine that eliminating the bottleneck (the internet connection) would speed up the process.
Also, I would not be overusing the API.
However, I'm not sure that my desktop would be able to perform even on par with the API, considering the size of the database.
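For what it's worth, here is a rough sketch of what one level of that backlink BFS could look like once the dumps are imported into MySQL. This is only an illustration under the schema described above (pl_from, pl_namespace, pl_title on pagelinks; page_id, page_namespace, page_title on page), restricted to the article namespace 0; it is not meant as the definitive implementation, and indexes on pl_title and page_title would be needed to make it usable.
-- Seed: the target article at depth 0.
CREATE TABLE visited AS
SELECT page_id, page_title, 0 AS depth
FROM page
WHERE page_namespace = 0 AND page_title = 'Adolf_Hitler';

-- One BFS step: every not-yet-visited article that links to an article
-- already at depth 0 gets depth 1. A small driver script would re-run
-- this with 0/1 replaced by 1/2, 2/3, ... until no rows are inserted
-- or the depth limit is reached.
INSERT INTO visited (page_id, page_title, depth)
SELECT DISTINCT p.page_id, p.page_title, 1
FROM pagelinks pl
JOIN visited v ON v.page_title = pl.pl_title AND v.depth = 0
JOIN page p ON p.page_id = pl.pl_from AND p.page_namespace = 0
LEFT JOIN visited seen ON seen.page_id = p.page_id
WHERE pl.pl_namespace = 0 AND seen.page_id IS NULL;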
What would be a better approach to pursue?
More generally, what kind of workload requires the use of an offline copy of the database rather than just the API?
While we are on the topic, in your experience, is such a problem even feasible without the use of supercomputing, or perhaps a server to run the SQL queries?

Related

GoogleBetterAds - violatingSites.list - google-apis-explorer

I can get a list of summaries of violating sites, using the following link:
https://developers.google.com/ad-experience-report/[...]/violatingSites/list
My questions:
Is this list exhaustive?
If not, is it possible to get an exhaustive list (or not) and how?
Is it possible to know how these websites are pulled (the share of websites analysed, etc)?
- Is this list exhaustive?
What's the size of your actual API return?
If the API response keeps getting longer, with new data added on each new request, you can assume you have the exhaustive list (with a possible update latency).
If the API response always has the same size but different data (for example, old data no longer appears and is replaced by new data), it is not exhaustive.
- If not, is it possible to get an exhaustive list (or not) and how?
I have no idea at the moment; the total number of websites could be in the billions...
- Is it possible to know how these websites are pulled (the share of websites analysed, etc)?
I have no idea at the moment either; I think it is either a confidential process, or it is described in the general conditions and subtly in the documentation...

Counting views of any element on website

I am using the following MySQL query for measuring view counts:
UPDATE content SET views=views+1 WHERE id='$id'
For example, if I want to check how many times a single page has been viewed, I just put it at the top of the page code. Unfortunately, I always get counts about 5-10x bigger than the results in Google Analytics.
If I am correct, one refresh should increase the value in my database by +1. Doesn't "Views" in Google Analytics work the same way?
If, for example, Google Analytics reports that a single page has been viewed 100 times while my database says it was 450 times, how could such a simple query generate an additional 350 views? And I don't mean visits or unique visits, just regular views.
Is it possible that Google Analytics interprets such data in a slightly different way and my database result is correct?
There are quite a few reasons why this could be occurring. The most usual culprit is bots and spiders. As soon as you use a third-party API like Google Analytics, or Facebook's API, you'll get their bots making hits to your page.
You need to examine each request in more detail. The user agent is a good place to start, although I do recommend researching this area further - discriminating between human and bot traffic is quite a deep subject.
In Google Analytics the data is provided by the user. For example:
A user views a page on your domain; the user's browser is then in charge of communicating the pageview to Google. If something fails along the way, the data will not be included in the reports.
In the other case, the SQL system that you have is log-based analytics: the data is collected by your own system, which reduces data collection failures.
Seen that way, it means that some data can be missed because of slow connections, users that don't execute JavaScript (ad blockers or bots), or an HTML page that is not properly rendered***.
Now, 5x more is a huge discrepancy; in my experience the discrepancy is usually around 8-25%. (Tested at the transaction level; for pageviews it can be more.)
What I recommend is:
Save the device, the browser information, the IP, and any other metadata that can be useful, and don't forget the timestamp. That way you can isolate the problem: maybe it's robots or users with an ad blocker, or in the worst case your code is not properly implemented (located in the footer, for example).
*** I added this because one time I had a huge discrepancy, but it was a server error: the HTML code was not properly rendered, showing the user an empty HTTP response. MySQL was not fast enough to save the information and process the HTML code. I noticed it when a load test (via Screaming Frog) showed a lot of 5xx errors. (WordPress blog with no cache.)
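To make that recommendation concrete, a minimal per-hit log could look something like the sketch below. The table and column names are purely illustrative, and the values would be bound from the request by the page code; it is only one way to capture the metadata mentioned above.
-- One row per hit, instead of (or alongside) the blind counter.
CREATE TABLE view_log (
  id INT AUTO_INCREMENT PRIMARY KEY,
  content_id INT NOT NULL,
  ip VARCHAR(45),           -- long enough for IPv6
  user_agent VARCHAR(512),
  referer VARCHAR(512),
  viewed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Run on every page view, with the values bound from the request.
INSERT INTO view_log (content_id, ip, user_agent, referer)
VALUES (?, ?, ?, ?);

-- Later: count views while filtering out obvious bots by user agent.
SELECT content_id, COUNT(*) AS views
FROM view_log
WHERE user_agent NOT LIKE '%bot%'
  AND user_agent NOT LIKE '%spider%'
  AND user_agent NOT LIKE '%crawl%'
GROUP BY content_id;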

Couchbase - Large documents or lots of smaller ones?

I am building an application which allows users to post content. This content can then be commented on.
Assume the following:
A document for the content is between 200KB and 3MB in size, depending on the text content.
Each comment is between 10KB and 100KB in size.
There could be 1 comment, or 1000. There are no limits.
My question is, when storing the content, should the individual comments be stored inside the same document or should they be broken up?
I'd certainly keep the post content and the comments separate, assuming there will be parts of the application where the posts would be previewed/used without comments.
For the comments themselves (assuming they are separate), I'd say many smaller documents would typically be better, if only because, for a post with hundreds or thousands of comments, you're not going to want to use them all immediately; it makes sense to fetch the documents only as they are required (to be displayed), rather than loading MBs of comments to use less than 100KB of what was fetched at the time.
You don't have to store every comment individually, but provided your keys are logical and follow a predictable scheme, I see no reason not to. If you do want some grouping, however, I'd avoid grouping more comments per document than would be typical for your usage.
If you are using views there perhaps may be some benefit to grouping more comments into single documents, but even then that would depend on your use case.
There would be no software bottlenecks in Couchbase for storing lots of documents individually, and the only constraints would be hardware constraints, which would be much the same whether you had lots of small documents or fewer large documents.
I would go with a document for each comment, as Couchbase is brilliant at organizing what to keep in memory and what not. Usually an application doesn't display all comments when there are more than 25 (I think that's the number most applications use: display them in chunks of 25 with the newest at the top, for example). So if the latest 25 comments are kept in memory while older ones are automatically written to disk, you keep an ideal balance between how you use your memory (always critical with Couchbase) and the access time for your application. A perfect balance, I'd say; I had a similar decision to make in my application and it worked perfectly.

What is the most efficient way to display lots of data on a website?

I have an optimization question.
Let's say that I'm making a website that has a JSON file with 5,000 pairs (about 582 KB), and through the combination of 3 sliders and some select tags it is possible to display every value, so the time between displaying one pair and the next is in microseconds.
My question is: if the website is also meant to run on mobile browsers, where is it more efficient to have the 5,000 pairs of data - in a JSON file or in a database? And why?
I am building a photo site with similar requirements, and I can say after months of investigation and experimenting that there is no easy answer to that question. But I will try to give you some hints:
Try to divide the data into chunks. For example, if your sliders select values between 1 and 100, instead of delivering exactly what the client selected, round it a bit, maybe ±10 or more; that way you can continue filtering on the client side without a server round trip. Save all the data in client memory before querying.
Don't render more than what is visible on the screen; JSON storage and filtering are fast, but the DOM is very slow, so minimize the visible elements.
Use 304 caching, meaning: whenever the client requests the same data twice, send a proper 304 response with an ETag. A good rule of thumb here is to use something you can compute very cheaply, like the max ID in the database, to see whether any new data has arrived since the last call. If not, just send a 304 and the client will use whatever it had last time (see the sketch after these hints).
Use absolute positioning. Don't even try to use CSS float or something like that; it will not work. Just calculate the position of each element. This will also help you achieve hint no. 2 (by filtering out all elements that are outside the visible screen). You can still use CSS transitions, which give nice animations when the sliders change.
You could experiment with IndexedDB to help with the client-side querying, but unfortunately support across browsers is still not good enough, plus you hit a ceiling on storage; it is better to use the ordinary cache with proper headers.
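As a rough illustration of the 304/ETag hint above, the fingerprint can be as cheap as a single query if the data lives in a database; the pairs table and its columns here are made up, and any other "what changed last" marker would do just as well.
-- If neither the highest id nor the latest modification time has changed,
-- the client's cached copy is still current and a 304 can be returned.
SELECT CONCAT(MAX(id), '-', COALESCE(MAX(updated_at), 0)) AS etag_value
FROM pairs;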
Good Luck!
A database like MongoDB would be good for this. It still uses the JSON syntax for storage so you can use the values from the JSON file. The querying is very fast too and you wouldn't have to parse the JSON file and store it in an object before using it.
Given the size of the data (just 582 KB), I would opt for the JSON file.
The drawback is that you will pay a penalty when starting the app and loading the data into memory, but the advantage is that all queries will then run very fast in memory.
You need to weigh how many accesses (queries) your app will make to the database against loading the file just once. And think about whether your main target is mobile browsers or PCs.
For this volume of data I wouldn't try a database (another process consuming resources); just measure how many resources (time, memory) are needed to load the JSON file.
If the data is going to grow, then you will need to rethink this, or maybe split your JSON file according to some criteria.

Web displays: Paging vs. long tables

It seems that the trend in web design is to provide paged output, where long tables are displayed a page at a time. My customers don't like that, and have requested that the web sites I design for them show all entries in long tables. The arguments for paging seem to be mostly based on the performance hit of displaying long tables, and this is less of a concern in a high-bandwidth corporate intranet. Arguments against paging include the ability to print the entire table, do string searches against the entire table, select arbitrary ranges from the entire table for copying, etc. I've pointed out that these features can easily be added to paged web designs (e.g. a print button that prints the entire table, or a button that creates a CSV file of the table), but the paged output still seems inconvenient to them. Our typical table is about 100 to 600 items. Obviously tables that would be significantly larger would probably have to be paged.
Questions:
What is your experience with personal or customer preferences for paged vs. full output in long tables?
Web design tools seem to be pushing the paging paradigm. Are they out of touch, or are my customers unusual?
If you're thinking "It depends on the length of the table", what threshold would you use?
I love long one-page listings.
One of the few reasons I can see for paged listing is the one you point out about performance.
I think your customers are very usual and in-touch.
The threshold would be about page loading times: when the server can't produce the full lists fast enough, or when the lists get so long that the browser slows down. (The latter can happen for quite short lists if you have non-a-tag hover stuff in your CSS and the browser is IE.)
Give the users a powerful search function and they'll narrow down their page lists themselves.
Why not simply make it a user-configurable option? It sounds like you plan to essentially implement both anyway.
To be honest, I think that no matter which you choose, someone will complain. At least with it being user-configurable you have the ability to put it back on the user.
Provide a default page length, and a configurable parameter (e.g. in the query string for programmatic use, and/or a form on the webpage for interactive use) to control how many listings are in a page.
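On the back end, that parameter typically just feeds straight into the paging query. A minimal sketch in MySQL follows; the items table and its columns are placeholders, and the page size would be validated and capped server-side before being used.
-- page_size comes from the query string or form (with a sane default);
-- for page N, OFFSET = (N - 1) * page_size. Here: page_size = 50, page 1.
SELECT id, name, created_at
FROM items
ORDER BY created_at DESC
LIMIT 50 OFFSET 0;

-- "Show all" is then simply the same query without LIMIT/OFFSET.
SELECT id, name, created_at
FROM items
ORDER BY created_at DESC;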
User flexibility is good. Texas Instruments has a parametric search tool for electrical engineers to find ICs that meet certain technical characteristics, and they include a link both to "show all" in a webpage and "download all" as a .csv file. That's a good model, kudos to TI. Ditto to flickr; their API lets you control (to a large extent) how many results show up on a web service call.
I personally HATE websites that default to 10 listings per page with no way to increase it. It takes FOREVER to browse them, & I'm willing to wait longer if I can get all the stuff at once.
If it's an interactive webpage, I would consider going to an AJAX solution that downloads 100 at a time so there's an indication of progress (and the user can stop it if there are 20000 results).
I agree with PEZ, it's all about responsiveness.
Best solution: Don't provide lists with more than 100 items.
Usually your user doesn't want to read more than 100 or even 600 items. They just don't care. They are searching for one (or possibly a few). Make sure that there's a way for them to get to those items without visual-grep-ing through the list.
And if your client insists on displaying all items, then provide paging with a configurable page size and let him enter "100000 items per page" if he wants to.
One of the seminal books on web design (sorry, I forget which one) used to say not to count on your users scrolling down, because most of them don't know how or can't be bothered. I think a more recent update says that while this is true for the general public, certain sectors of more technical users can be expected to scroll down, and you can make pages that require scrolling IFF (if and only if) you know your users can handle it.
I can understand your situation extremely well; I have been in a similar situation. I moved a business workflow from being manually managed to an automated one. Initially it was carried out using Excel spreadsheets. The stakeholders for my software were in the 55+ age group; they don't like anything AJAX-y or any of the UI patterns you are talking about. In such cases the data retrieval logic can be optimized. Any table that touches the 1K mark, or has items like image blobs, should be shown in parts from a performance point of view.
Long outputs slow rendering and will be a performance leech.
Customers don't want changes most of the time, and the customer is always right unless you can convince them.
I have put forth my threshold, but it also depends on the content of the rows.
Happy Coding!