How can I summarize the updates to a table on an page I browse?

How can I summarize the updates to a table on an page I browse? - html

I am a student at a University. With the placement process going on, we have an internal placement website that shows updates and status about various companies I have applied to. Since the number of companies is too large it becomes cumbersome to scroll through the complete list to find information. Sometimes, I just miss some things. Now, to tackle this problem, here is what I want to do:
The data is in an HTML table. Each row shows information about one company: Some dates, Status(Not/Shortlisted/Applied), Some yes/no options etc. each in a different column. Once I open the page I want to be able to extract information about which companies I got shortlisted in, and in which ones I didn't make it.
What is the right technology to do this ? I am thinking of writing a Greasemonkey user script (I have never actually written any, but how hard could it be ?). What other options do I have?
Edit: I don't quite understand why this question has voted to be closed?
I just displayed a use case for something general: On opening a web page, automatically extracting information from the page and display it to the user. What is the easiest and sufficiently powerful way to achieve this?

Since you can't get access to the website's database, Greasemonkey would be your best automation approach. However, this task is likely to be over before you can get a decent script up from scratch.
Your best practical approach is to save the pages and/or copy and summarize the data in MS Excel, or equivalent.
~~~~~~~~~
Here at SO, We will not develop any but the simplest Greasemonkey scripts for you from scratch (unless they are fun somehow ;) ). But, you can sometimes get such help in the "Script requests forum" at userscripts.org.
In order for someone to help you, they will need:
A clear idea of exactly what data gets manipulated, and how.
Access to the target site. Or access to saved snapshots of the target pages. GM scripts are extremely dependent on the details of the target page.

"other option":
ctrl + F
enter shortlisted
enter
ctrl + G <--repeat last search

Related

Search HTML Tables on Multiple Pages

Hello Stack Overflow Community!
I am making a directory of many thousand custom mods for a game using HTML tables. When I started this project, I thought one HTML page would be slow, but adequate for the ~4k files I was expecting. As I progressed, I realized there are tens of thousands of files I need to have in these tables, and let the user search though to find what they are missing to load up a new scenario. Each entry has about 20 text entries and a small image (~3KB). I only need to be able to search through one column.
I'm thinking of dividing the tables across several pages on my website to help loading speeds and improve overall organization. But then a user would have to navigate to each page, and perform a search there. This could take a while and be very cumbersome.
I'm not great at website programming. Can someone advise a way to allow the user to search through several web pages and tables from one location? Ideally this would jump to the location in the table on the new webpage, or maybe highlight the entry like the browser's search function does.
You can see my current setup here : https://www.loco-dat-directory.site/
Hopefully someone can point me in the right direction, as I'm quite confused now :-)

This would be my steps,
Copy all my info into an excel spredsheet, then convert that to json, then make that an array for javascript (myarray), then can make an input field, and on click an if statement if input == myarray[0].propertyName
if you want something more than an exact match, you'd need https://lodash.com/
in your project.

Hacky Solution
There is a browser tool, called TableCapture, to capture data from html tables and load into excel/spreadsheets - where you are basically deferring to spreadsheet software to manage the searching.
You would have to see if:
This type of tool would solve your problem - maybe you can pull each HTML page's contents manually, then merge these pages into a document with multiple "sheets", and then let people download the "spreadsheet" from your website.
If you do not take on the labor above and just tell other people to do it, then you'd have to see if you can teach the people how to perform the search and do this method on their own. eg. "download this plugin, use it on these pages, search"
Why your question is difficult to answer
The reason why it will be hard for people to answer you in stackoverflow.com (usually code solutions) is that you need a more complicated solution (in my opinion) than hard coded tables and html/css/javascript.
This type of situation is exactly why people use databases and APIs to accept requests ("term": "something") for information and deliver responses ( "results": [...] ).

Thank you everyone for your great advice. I wasn't aware most of these potential solutions existed, and it was good to see how other people were tackling problems of similar scope.
I've decided to go with DataTables for their built-in sorting and filtering : https://datatables.net/
I'm also going to use a javascript array with an input field on the main page to allow users to search for which pack their mod is in. This will lead them to separate pages on my site, each with a unique datatable for a mod pack. Separate pages will load up much quicker than one gigantic page trying to show everything.

If I have a collection of random websites, how do I get specific information from each?

Say I have a collection of websites for accountants, like this:
http://www.johnvanderlyn.com
http://www.rubinassociatespa.com
http://www.taxestaxestaxes.com
http://janus-curran.com
http://ricksarassociates.com
http://www.condoaudits.com
http://www.krco-cpa.com
http://ci.boca-raton.fl.us
What I want to do is crawl each and get the names & emails of the partners. How should I approach this problem, at a high-level?
Assume I know how to actually crawl each site (and all subpages) & parse the HTML elements -- I am using Oga.
What I am struggling with is how to make sense of data that is presented in a wide variety of ways. For instance, the email address for the firm (and or partner) can be found in one of these ways:
On the About Us page, under the name of the partner.
On the About Us page, as a generic catch-all email.
On the Team page, under the name of the partner.
On the Contact Us page, as a generic catch-all email.
On a Partner's page, under the name of the partner.
Or it could be any other way.
One way I was thinking about approaching the email, is just to search for all mailto a tags and filter from there.
The obvious downside for this is that there is no guarantee that the email will be for the partner and not some other employee.
Another issue that is more obvious is detecting the partner(s) names just from the markup. I was initially thinking I could just pull all the header tags and text in them, but I have stumbled across a few sites that have the partner names in span tags.
I know SO is usually for specific programming questions, but I am not sure how to approach this and where to ask this. Is there another StackExchange site that this question is more appropriate for?
Any advice on specific direction you can give me would be great.

I looked at the http://ricksarassociates.com/ website and I cant find any partners at all so in my opinion you better stand to gain from this if not you better look for some other invention.
I have done similar datascraping from time to time, and in norway we have laws - or should I say "laws" - that you are not allowed to email people however you are allowed to email the company - so in a way the same problem from another angle.
I wish I knew maths and algorythms by heart because I am sure there is a fascinating sollution hidden in AI and machine learning, but in my mind the only sollution I can see is building a rule set that over time probably gets quite complex. Maby you could apply some bayesian filtering - it works very well for email.
But - to be a little more productive here. One thing i know is inmportant, you could start by creating the crawler environment and building the dataset. Have the database for URLS so you can add more at any time, and start the crawling on what you have already so that you do your testing querying your own data with a 100% copy. This will save you enormous time instead of live scraping while tweaking.
I did my own search engine some years ago, scraping all NO domains however I needed only the index file that time. Took over a week alone just to scrape it down and I think it was 8GB of data just for that single file, and I had to use several proxyservers aswell to make it work due to problems with to much DNS traffik. Lots of problems that needed being taken care of. I guess I am only saying - if you are crawling a large scale you might aswell start getting the data down if you want to work efficient with the parsing later.
Good luck, and do post if you get a sollution. I do not think it is posible without an algorythm or AI though - people design websites the way they like and they pull templates out of their arse so there are no rules to follow. You will end up with bad data.
Do you have funding for this? If so its simpler. Then you could just crawl each site, and make a profile for each site. You could employ someone cheap to manual go through the parsed data and remove all the errors. This is probably how most people does it, unless someone already have done it and the database is for sale / available from webservice so it can be scraped.

The links you provide are mainly US site, so I guess you are focusing on English names. In that case, instead of parsing from html tags, I would just search the whole webpage for name. (There are free database of first name and last name) This may also work if you are donig this for some other Europe company, but it would be a problem for company from some countries. Take Chinese as an example, while there is a fix set of last name, one may use basically any combination of Chinese character as first name, so this solution won't work for Chinese site.
It is easy to find email from a webpage as there is a fixed format of (username)#(domain name) with no space in between. Again I won't treat it as html tags but just as normal string so that the email can be found no matter it is in mailto tag or in plain text. Then, to determine what email is it:
Only one email in page?
Yes -> catch-all email.
No -> Is name found in that page as well?
No -> catch-all email (can have more than one catch-all email, maybe for different purpose like info + employment)
Yes -> Email should be attached to the name found right before it. It is normal that the name should appear before the email.
Then, it should be safe to assume the name appear first belongs to more important member, e.g. Chairman or partner.

I have done similar scraping for these types of pages, and it varies wildly from site to site. If you are trying to make one crawler to sort of auto find the information, it will be difficult. However, the high level looks something like this.
For each site you check, look for element patterns. Divs will often have labels, ID's, and classes which will easily let you grab information. Perhaps you find that many divs will have a particular class name. Check for this first.
It is often better to grab too much data from a particular page, and boil it down on your side afterwards. You could, perhaps, look for information which comes up on a screen by utilizing type (is link) or regex (is email) to look for formatted text. Names and occupation will be harder to find by this method, but might be related positionally on many pages to other well formatted items.
Names will often be affixed with honorifics (Mrs., Mr., Dr., JD, MD, etc.) You could come up with a bank of those, and check against them for any page you end up on.
Finally, if you really wanted to make this process general purpose, you could do some heuristics to improve your methods based off of expected information; names, for example, are most often within a particular list. If it was worth your time, you could check certain text for whether it matches a list of more common names.
What you mentioned in your initial question seems that you would have a lot of benefit with a general purpose Regular Expressions crawler, and you could make improvements on it as you know more about the sites which you interact with.

There are excellent posts on this topic with a lot of useful links throughout these webpages:
https://www.quora.com/What-is-a-good-web-scraper-for-pulling-emails-names-etc-even-if-the-contact-info-is-another-page-deep-a-browser-add-on-is-a-plus
http://www.hongkiat.com/blog/web-scraping-tools/
http://www.garethjames.net/a-guide-to-web-scraping-tools/
http://www.butleranalytics.com/15-web-scraping-tools/
Some of the examined applications are working in macOS.

Show a one time alert for certain pages?

What ways are used to show a user help the first time they use a page - to showcase certain features they might not realize are there.
For instance, say a search form is introduced that has a hidden "advanced search" option:
I would think most people would see the chevron and click it, but..you never know. I know that I could add a cookie to say "Hey - this user has seen it" or create a table in the database.
The problem I see with adding a cookie, is if the user deletes cookies and logs back in - they will have to always dismiss the alert/error/whatever. Unless after a period of time, I go in and manually delete it (which then new users wouldn't see the alert.)
Alternatively, adding a table to the database seems too much for such a simple task. It's what I'm leaning towards, but I hate it...there has to be a better way.
Are there any other ways to show a one time alert for certain pages?
Edit - I used a pretty trivial example on purpose.

I guess both your options are right. The cookie option is bit better cause it will be lighter on the server, again in case you have many users then the database options will be not great.
You may also lookup the new HTML5 feature of storing data on client side. Its a better local storage method.
It goes like localStorage.uid="1234" or something like clickcount.. Refer the html5 docs its a great feature as well.
Heres the link..
http://www.w3schools.com/html/html5_webstorage.asp
have fun..

What is the best way to display spreadsheet data in Ruby on Rails?

I am looking for a way to edit data and have values dynamically calculated (i.e. totals, averages, etc.) My application is a web based gradebook system for teachers and one of the big challenges is allowing them to enter/update grades. The most natural solution for this type of data is a table or spreadsheet grid and my first thought was to write something myself, but I quickly got over that idea. :)
The chief problem I'm having is being able to calculate things in real time. When a teacher changes a grade I need the table to update that students AVG % and possibly their letter grade. It doesn't have to feed these calculations back to the server (they are just for show) but the cell changes do have to be saved (via AJAX).
I know this should probably be a FAQ and I found these two answers (1, 2) but my requirements are a bit different (I think). First of all I'm looking for something that integrates with RoR fairly well; this means using Prototype. It should also be pretty lightweight and clean; I don't need fancy things like pictures, sub-groups, etc. Lastly, since my project is under the GPL, it must be open source.
Any hints? Right now I'm looking at TableKit & Rico LiveGrid but I'm not sure they can do the row & column calculations that I need.

I think ExtJS has something like this. It'd be worth while checking it out: http://extjs.com/

Saving it in the database might be easiest. Calculate and save the things you need, then update the view.
I'm not sure how your UI works, but you can attach an AJAX event to the UI where they enter the information, saving the data. The controller can respond to javascript, dropping into an RJS template that would update the values you need on the page.

After searching for something lightweight and easy to use I have given up and am writing my own little bits of JavaScript to do my bidding for me. Its not perfect but it seems to work pretty well, and it satisfies my needs (for now).

Web displays: Paging vs. long tables

It seems that the trend in web design is to provide paged output, where long tables are displayed a page at a time. My customers don't like that, and have requested that the web sites I design for them show all entries in long tables. The arguments for paging seem to be mostly based on the performance hit of displaying long tables, and this is less of a concern in a high-bandwidth corporate intranet. Arguments against paging include the ability to print the entire table, do string searches against the entire table, select arbitrary ranges from the entire table for copying, etc. I've pointed out that these features can easily be added to paged web designs (e.g. a print button that prints the entire table, or a button that creates a CSV file of the the table), but the paged output still seems inconvenient to them. Our typical table is about 100 to 600 items. Obviously tables that would be significantly larger would probably have to be paged.
Questions:
What is your experience with personal or customer preferences for paged vs. full output in long tables?
Web design tools seem to be pushing the paging paradigm. Are they out of touch, or are my customers unusual?
If you're thinking "It depends on the length of the table", what threshold would you use?

I love long one-page listings.
One of the few reasons I can see for paged
listing is the ones you point out about performance.
I think your customers are very usual and in-touch.
The threshold would be about page loading times. When the server can't produce the full lists fast enough or when the lists gets so long that the browser slows down. (The latter can happen for quite short lists if you have non-a-tag hover stuff in your CSS and the browser is IE.)
Give the users a powerful search function and they'll narrow down their page lists themselves.

Why not simply have it be a user configurable option. It sounds like you plan to essentially implement both anyway.
To be honest I think that no matter which you choose someone will complain. At least with it being user configurable you have the ability to put it back on the user.

Provide a default page length, and a configurable parameter (e.g. in the query string for programmatic use, and/or a form on the webpage for interactive use) to control how many listings are in a page.
User flexibility is good. Texas Instruments has a parametric search tool for electrical engineers to find ICs that meet certain technical characteristics, and they include a link both to "show all" in a webpage and "download all" as a .csv file. That's a good model, kudos to TI. Ditto to flickr; their API lets you control (to a large extent) how many results show up on a web service call.
I personally HATE websites that default to 10 listings per page with no way to increase it. It takes FOREVER to browse them, & I'm willing to wait longer if I can get all the stuff at once.
If it's an interactive webpage, I would consider going to an AJAX solution that downloads 100 at a time so there's an indication of progress (and the user can stop it if there are 20000 results).
I agree with PEZ, it's all about responsiveness.

Best solution: Don't provide lists with more than 100 items.
Usually your user doesn't want to read more than 100 or even 600 items. They just don't care. They are searching for one (or possibly a few). Make sure that there's a way for them to get to those items without visual-grep-ing through the list.
And if your client insists on displaying all items, then provide paging with a configurable page size and let him enter "100000 items per page" if he wants to.

One of the seminal books on web design (sorry, I forget which one) used to say not to count on your users scrolling down because most of them don't know how or can't be bothered. I think a more recent update says that while is is true for the general public, certain sectors of more technical users can be expected to scroll down and you can make pages that require scrolling IFF (if and only iff) you know your users can handle it.

I can understand your situation extremely well. I have been in similar situation. I moved a business workflow from being man managed to an automated one. Initially it was carried out using excel spreadsheets. The stakeholders for my software were in the age group of 55+ they dont like anything ajaxy or any of the UI patterns you are talking about. It such cases data retreival logic can be optimized. Any table that touches the 1K mark or has item like image blobs or things like that should be shown in parts from a performance point of view.
long outputs slow rendering and will be performance leech
Customers dont want to changes most times and customer is always right unless u can convince them.
I have put forth my threshhold but it also depends on the content of the rows.
Happy Coding!

We Keep Coding

html mysql json google-apps-script actionscript-3 ms-access google-chrome google-maps reporting-services sql-server-2008