Search HTML Tables on Multiple Pages - html

Hello Stack Overflow Community!
I am making a directory of many thousands of custom mods for a game using HTML tables. When I started this project, I thought one HTML page would be slow, but adequate for the ~4k files I was expecting. As I progressed, I realized there are tens of thousands of files I need to have in these tables, and I want to let users search through them to find what they are missing to load up a new scenario. Each entry has about 20 text fields and a small image (~3KB). I only need to be able to search through one column.
I'm thinking of dividing the tables across several pages on my website to help loading speeds and improve overall organization. But then a user would have to navigate to each page, and perform a search there. This could take a while and be very cumbersome.
I'm not great at website programming. Can someone advise a way to allow the user to search through several web pages and tables from one location? Ideally this would jump to the location in the table on the new webpage, or maybe highlight the entry like the browser's search function does.
You can see my current setup here : https://www.loco-dat-directory.site/
Hopefully someone can point me in the right direction, as I'm quite confused now :-)

These would be my steps:
Copy all my info into an Excel spreadsheet, then convert that to JSON, then make that an array for JavaScript (myarray). Then you can make an input field and, on click, an if statement checking input == myarray[0].propertyName.
If you want something more than an exact match, you'd need https://lodash.com/ in your project.
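A minimal sketch of that approach (the JSON shape, property names, element IDs, and page URLs below are just placeholders for whatever your spreadsheet export actually contains):

```javascript
// myarray would be the JSON exported from the spreadsheet, e.g.:
const myarray = [
  { name: "Example Mod A", pack: "Pack 1", url: "pack1.html#example-mod-a" },
  { name: "Example Mod B", pack: "Pack 2", url: "pack2.html#example-mod-b" }
];

document.getElementById("searchButton").addEventListener("click", function () {
  const input = document.getElementById("searchField").value.trim().toLowerCase();
  // exact match on one property; swap in a fuzzy match (e.g. lodash) if needed
  const hit = myarray.find(function (entry) {
    return entry.name.toLowerCase() === input;
  });
  if (hit) {
    // jump to the matching row on the other page via its URL fragment
    window.location.href = hit.url;
  } else {
    alert("No exact match found.");
  }
});
```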

Hacky Solution
There is a browser tool called TableCapture that captures data from HTML tables and loads it into Excel/spreadsheets, so you are basically deferring to spreadsheet software to manage the searching.
You would have to see if:
This type of tool would solve your problem - maybe you can pull each HTML page's contents manually, merge those pages into a document with multiple "sheets", and then let people download the "spreadsheet" from your website.
If you do not take on the labor above and just tell other people to do it, then you'd have to see if you can teach people how to perform the search this way on their own, e.g. "download this plugin, use it on these pages, search".
Why your question is difficult to answer
The reason it will be hard for people to answer you on stackoverflow.com (which usually expects code solutions) is that, in my opinion, you need something more complicated than hard-coded tables and HTML/CSS/JavaScript.
This type of situation is exactly why people use databases and APIs, which accept requests ("term": "something") for information and deliver responses ("results": [...]).
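Just to illustrate what that request/response cycle looks like from the browser side (the /api/search endpoint and the response shape are invented for this example; the real backend would be whatever database/API you put behind it):

```javascript
// hypothetical search endpoint; a database-backed API would answer this query
fetch("/api/search?term=" + encodeURIComponent("something"))
  .then(function (response) { return response.json(); })
  .then(function (data) {
    // assumed response shape: { "results": [ ... ] }
    console.log(data.results);
  });
```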

Thank you everyone for your great advice. I wasn't aware most of these potential solutions existed, and it was good to see how other people were tackling problems of similar scope.
I've decided to go with DataTables for their built-in sorting and filtering : https://datatables.net/
I'm also going to use a JavaScript array with an input field on the main page to allow users to search for which pack their mod is in. This will lead them to separate pages on my site, each with a unique DataTable for a mod pack. Separate pages will load much quicker than one gigantic page trying to show everything.
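For anyone finding this later, a basic DataTables setup looks roughly like the following (the table ID is a placeholder, and the page needs jQuery plus the DataTables script and CSS loaded):

```javascript
// requires jQuery and the DataTables assets to be included on the page
$(document).ready(function () {
  $("#modTable").DataTable({
    paging: true,        // keep the page light by paginating long tables
    searching: true,     // built-in filter box
    order: [[0, "asc"]]  // sort by the first column initially
  });
});
```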

Related

If I have a collection of random websites, how do I get specific information from each?

Say I have a collection of websites for accountants, like this:
http://www.johnvanderlyn.com
http://www.rubinassociatespa.com
http://www.taxestaxestaxes.com
http://janus-curran.com
http://ricksarassociates.com
http://www.condoaudits.com
http://www.krco-cpa.com
http://ci.boca-raton.fl.us
What I want to do is crawl each and get the names & emails of the partners. How should I approach this problem, at a high-level?
Assume I know how to actually crawl each site (and all subpages) & parse the HTML elements -- I am using Oga.
What I am struggling with is how to make sense of data that is presented in a wide variety of ways. For instance, the email address for the firm (and or partner) can be found in one of these ways:
On the About Us page, under the name of the partner.
On the About Us page, as a generic catch-all email.
On the Team page, under the name of the partner.
On the Contact Us page, as a generic catch-all email.
On a Partner's page, under the name of the partner.
Or it could be any other way.
One way I was thinking of approaching the email is just to search for all mailto a tags and filter from there.
The obvious downside for this is that there is no guarantee that the email will be for the partner and not some other employee.
Another issue that is more obvious is detecting the partner(s) names just from the markup. I was initially thinking I could just pull all the header tags and text in them, but I have stumbled across a few sites that have the partner names in span tags.
I know SO is usually for specific programming questions, but I am not sure how to approach this and where to ask this. Is there another StackExchange site that this question is more appropriate for?
Any advice on specific direction you can give me would be great.
I looked at the http://ricksarassociates.com/ website and I can't find any partners at all, so in my opinion you should make sure you actually stand to gain from this; if not, you'd better look for some other approach.
I have done similar data scraping from time to time, and in Norway we have laws - or should I say "laws" - saying that you are not allowed to email people, although you are allowed to email the company - so in a way it's the same problem from another angle.
I wish I knew maths and algorithms by heart, because I am sure there is a fascinating solution hidden in AI and machine learning, but in my mind the only solution I can see is building a rule set that over time probably gets quite complex. Maybe you could apply some Bayesian filtering - it works very well for email.
But - to be a little more productive here. One thing I know is important: you could start by creating the crawler environment and building the dataset. Have a database for URLs so you can add more at any time, and start the crawling on what you have already, so that you do your testing by querying your own data against a 100% copy. This will save you an enormous amount of time compared to live scraping while tweaking.
I did my own search engine some years ago, scraping all NO domains, though I only needed the index file that time. It took over a week just to scrape it down, and I think it was 8GB of data just for that single file, and I had to use several proxy servers as well to make it work due to problems with too much DNS traffic. Lots of problems that needed to be taken care of. I guess I am only saying: if you are crawling at a large scale, you might as well start getting the data down if you want to work efficiently with the parsing later.
Good luck, and do post if you get a solution. I do not think it is possible without an algorithm or AI though - people design websites the way they like and they pull templates out of their arse, so there are no rules to follow. You will end up with bad data.
Do you have funding for this? If so, it's simpler. Then you could just crawl each site and make a profile for each site. You could employ someone cheap to manually go through the parsed data and remove all the errors. This is probably how most people do it, unless someone has already done it and the database is for sale / available from a web service so it can be scraped.
The links you provide are mainly US sites, so I guess you are focusing on English names. In that case, instead of parsing HTML tags, I would just search the whole webpage for names. (There are free databases of first names and last names.) This may also work if you are doing this for some other European companies, but it would be a problem for companies from some countries. Take Chinese as an example: while there is a fixed set of last names, one may use basically any combination of Chinese characters as a first name, so this solution won't work for Chinese sites.
It is easy to find an email on a webpage, as there is a fixed format of (username)@(domain name) with no space in between. Again, I wouldn't treat it as HTML tags but just as a normal string, so that the email can be found whether it is in a mailto tag or in plain text. Then, to determine what kind of email it is (a rough sketch follows the list below):
Only one email in page?
Yes -> catch-all email.
No -> Is name found in that page as well?
No -> catch-all email (there can be more than one catch-all email, maybe for different purposes like info + employment)
Yes -> The email should be attached to the name found right before it. It is normal that the name appears before the email.
Then, it should be safe to assume that the name appearing first belongs to a more important member, e.g. a chairman or partner.
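A rough sketch of that decision logic (the regex is deliberately simple, and isPersonName stands in for whatever name lookup you end up using, e.g. the free first/last name databases mentioned above):

```javascript
// naive email pattern: (username)@(domain), no spaces in between
const EMAIL_RE = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

function classifyEmails(pageText, isPersonName) {
  const emails = pageText.match(EMAIL_RE) || [];
  // only one email on the page -> treat it as a catch-all address
  if (emails.length === 1) {
    return [{ email: emails[0], type: "catch-all" }];
  }
  return emails.map(function (email) {
    // look at the few words immediately before the email for a person's name
    const before = pageText.slice(0, pageText.indexOf(email)).trim();
    const precedingWords = before.split(/\s+/).slice(-4).join(" ");
    return isPersonName(precedingWords)
      ? { email: email, type: "personal", owner: precedingWords }
      : { email: email, type: "catch-all" };
  });
}
```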
I have done similar scraping for these types of pages, and it varies wildly from site to site. If you are trying to make one crawler to sort of auto find the information, it will be difficult. However, the high level looks something like this.
For each site you check, look for element patterns. Divs will often have labels, IDs, and classes which will easily let you grab information. Perhaps you find that many divs have a particular class name. Check for this first.
It is often better to grab too much data from a particular page, and boil it down on your side afterwards. You could, perhaps, look for information which comes up on a screen by utilizing type (is link) or regex (is email) to look for formatted text. Names and occupation will be harder to find by this method, but might be related positionally on many pages to other well formatted items.
Names will often be affixed with honorifics (Mrs., Mr., Dr., JD, MD, etc.). You could come up with a bank of those, and check against them for any page you end up on.
Finally, if you really wanted to make this process general purpose, you could do some heuristics to improve your methods based off of expected information; names, for example, are most often within a particular list. If it was worth your time, you could check certain text for whether it matches a list of more common names.
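As a sketch of the honorifics idea (the banks and patterns below are only a starting point, not an exhaustive or reliable matcher):

```javascript
// small bank of honorifics and professional suffixes to anchor name detection on
const HONORIFICS = ["Mr\\.", "Mrs\\.", "Ms\\.", "Dr\\.", "Prof\\."];
const SUFFIXES = ["JD", "MD", "CPA"];

// "Dr. Jane Smith" style: honorific followed by capitalized words
const WITH_HONORIFIC = new RegExp(
  "(?:" + HONORIFICS.join("|") + ")\\s+[A-Z][a-z]+(?:\\s+[A-Z][a-z]+)+", "g");
// "Jane Smith, CPA" style: capitalized words followed by a suffix
const WITH_SUFFIX = new RegExp(
  "[A-Z][a-z]+(?:\\s+[A-Z][a-z]+)+,\\s*(?:" + SUFFIXES.join("|") + ")\\b", "g");

function findCandidateNames(pageText) {
  // expect false positives; this only narrows the text down to likely names
  return (pageText.match(WITH_HONORIFIC) || [])
    .concat(pageText.match(WITH_SUFFIX) || []);
}
```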
What you mentioned in your initial question suggests you would benefit a lot from a general-purpose regular-expression crawler, and you could improve it as you learn more about the sites you interact with.
There are excellent posts on this topic with a lot of useful links throughout these webpages:
https://www.quora.com/What-is-a-good-web-scraper-for-pulling-emails-names-etc-even-if-the-contact-info-is-another-page-deep-a-browser-add-on-is-a-plus
http://www.hongkiat.com/blog/web-scraping-tools/
http://www.garethjames.net/a-guide-to-web-scraping-tools/
http://www.butleranalytics.com/15-web-scraping-tools/
Some of the applications reviewed there also work on macOS.

Random Article button

I'd like to create a button on a menu bar that can generate a link to a random article from my blog posts (much like Wikipedia has). It's for a client, and they'd like to have this functionality on the site. I'm not familiar with PHP, so I'd like to find a way around that, especially since I don't have access to the root user on my server host's MySQL installation (if this is relevant).
I had a theoretical solution: have a .txt or .xml file containing a list of all the URLs to each of the posts, with a "key" assigned to each of them. Then, when the user clicks the random article button, the current time (e.g. 1:45) is hashed and mapped to a specific URL. I am fairly new to Drupal, however, and I was wondering if there was some way to have the random article button use a .c file to execute these steps. The site is being hosted on a server that uses Apache 2, and I looked through some modules that were implemented in C code. I'm pretty new to all of this (although proficient in C), and have spent many fruitless hours searching for solutions.
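A rough client-side sketch of that idea, assuming a hand-maintained posts.json list of URLs and a button with a made-up ID (no PHP, C, or database access needed):

```javascript
// posts.json is assumed to be a hand-maintained array of post URLs, e.g.
// ["/blog/first-post", "/blog/second-post", "/blog/third-post"]
document.getElementById("randomArticle").addEventListener("click", function () {
  fetch("/posts.json")
    .then(function (response) { return response.json(); })
    .then(function (urls) {
      // "hash the current time" key: timestamp modulo list length
      // (Math.random() would work just as well here)
      const index = Date.now() % urls.length;
      window.location.href = urls[index];
    });
});
```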
In a pure Drupal fashion (I don't know if you are interested in this kind of solution), you could create a view (create a block) which retrieves blog posts, uses a random sort criterion, and limits results to 1 item. Then configure this view to display fields, and add only one field: post title, and check "link to content" in this field's parameters window. You'll get one random blog post title which will be rendered as a link to that blog post.
Finally, in Structure -> Block, assign your new block to a region to see it.
It's a pure Drupal / Views / no-code-just-clicks :) way, but it will be far more maintainable and easier to set up than introducing C for such a simple feature.
Views module
Let me know if you try this and have problems configuring your view or anything else.
Good luck

XML content management

Hi there everyone, this is my first time posting here. I am a student working for a really small company, and I am in charge of developing the company's website.
Since I am only here for two more months, the boss wants to be able to change some content of the website without having to do any code.
He mentioned the idea of having an XML file he can update online, that will update the content of parts of the website. He does not want anything to do with third party websites.
So I was just wondering if this is even possible. I have no experience with XML, and really have no idea where to start. All suggestions are welcome. Thanks in advance.
Try a server-side include (SSI) with something like the following tag:
<!--#include virtual="sitecontents/content_name_1.txt" -->
You can update the individual files (they don't necessarily have to include HTML tags) instead of trying to process an XML document. Much less development time and, frankly, much easier to maintain.
I did something similar to this for my website but used a database-driven concept. I have a secure area where I can enter data via forms into the various database tables, and then the information is automatically updated on the site via PHP. It takes a while to get it all in place and working, but it has been worth the effort.
Now that I think about it, you can do this with XML as well. I played around with it a bit before developing the database-driven site. If you're not familiar with XML, though, it might be a stretch to get it done in the short time you have. You will also need to know XSLT, which can be confusing at times. It is needed to connect to the XML file (data source) and then to parse the data and transform the information into something that looks good.
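If the XML route is chosen, a small client-side sketch like the following could read the file and drop its text into the page without PHP (the file name, element names, and IDs are made up; for anything beyond simple text swaps, XSLT as described above is the fuller approach):

```javascript
// content.xml is assumed to look like:
// <content><section id="intro">Some text the boss can edit</section></content>
fetch("/content.xml")
  .then(function (response) { return response.text(); })
  .then(function (xmlText) {
    const doc = new DOMParser().parseFromString(xmlText, "application/xml");
    doc.querySelectorAll("section").forEach(function (section) {
      // copy each section's text into the page element with the matching id
      const target = document.getElementById(section.getAttribute("id"));
      if (target) target.textContent = section.textContent;
    });
  });
```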

How can I summarize the updates to a table on a page I browse?

I am a student at a university. With the placement process going on, we have an internal placement website that shows updates and status about various companies I have applied to. Since the number of companies is so large, it becomes cumbersome to scroll through the complete list to find information. Sometimes I just miss some things. Now, to tackle this problem, here is what I want to do:
The data is in an HTML table. Each row shows information about one company: some dates, status (Not/Shortlisted/Applied), some yes/no options, etc., each in a different column. Once I open the page, I want to be able to extract information about which companies I got shortlisted for, and which ones I didn't make it into.
What is the right technology to do this? I am thinking of writing a Greasemonkey user script (I have never actually written one, but how hard could it be?). What other options do I have?
Edit: I don't quite understand why this question has been voted to be closed.
I just described a use case for something general: on opening a web page, automatically extract information from the page and display it to the user. What is the easiest and sufficiently powerful way to achieve this?
Since you can't get access to the website's database, Greasemonkey would be your best automation approach. However, this task is likely to be over before you can get a decent script up from scratch.
Your best practical approach is to save the pages and/or copy and summarize the data in MS Excel, or equivalent.
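That said, to make the Greasemonkey idea concrete, a script for this kind of page could look roughly like the following; the @match URL, table selector, and status-column index are guesses that would have to be adapted to the actual placement site:

```javascript
// ==UserScript==
// @name     Placement status summary
// @match    https://placement.example.edu/*
// @grant    none
// ==/UserScript==

(function () {
  const STATUS_COLUMN = 2; // guessed index of the "Status" column
  const shortlisted = [];
  document.querySelectorAll("table tr").forEach(function (row) {
    const cells = row.querySelectorAll("td");
    if (cells.length > STATUS_COLUMN &&
        cells[STATUS_COLUMN].textContent.trim() === "Shortlisted") {
      shortlisted.push(cells[0].textContent.trim()); // company name assumed in first column
    }
  });
  // show a simple summary box at the top of the page
  const box = document.createElement("div");
  box.style.cssText = "background:#ffffcc;border:1px solid #999;padding:8px;";
  box.textContent = "Shortlisted (" + shortlisted.length + "): " + shortlisted.join(", ");
  document.body.insertBefore(box, document.body.firstChild);
})();
```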
~~~~~~~~~
Here at SO, we will not develop any but the simplest Greasemonkey scripts for you from scratch (unless they are fun somehow ;) ). But you can sometimes get such help in the "Script requests forum" at userscripts.org.
In order for someone to help you, they will need:
A clear idea of exactly what data gets manipulated, and how.
Access to the target site. Or access to saved snapshots of the target pages. GM scripts are extremely dependent on the details of the target page.
"other option":
ctrl + F
enter shortlisted
enter
ctrl + G <--repeat last search

End-user documentation in MS Access [closed]

How do you implement user documentation in Access? I've never bothered with formal user documentation in the past; I tend to rely on good interface design to guide users (or so I tell myself). But I'd really like to know what people smarter than me are doing...
Here are things I think I would consider important (in order):
Simplicity: it needs to be simple enough that it can be updated easily as the code changes, otherwise the documentation will end up out of sync
Screen shots: a picture's worth a thousand words; screen shots must be easily integrated into the documentation
Integration: the user can get to the relevant part of the documentation with as little effort as possible; ie, pressing F1 on a form brings up help for that form vs. opening a help file and having to navigate a table of contents
Searchable: full-text search capabilities would be nice
Other considerations:
Online vs. local: local would be faster/more reliable, but online would be always available plus search engine indexable (allowing use of google site: searches and providing some SEO benefit as well)
User Editable: how much do you allow users to make changes to the documentation: full access (ie, wiki), no access, moderated forums, etc.
Version control: text-based formats are more conducive to versioning than say, an Access table with help text inside the mdb
Exportable to PDF: seems like a nice-to-have
In Access I've never created end user documentation. No wait, I did once about 12 years ago. And I paid someone to write the manual along with screen shots. I did also have the hlp files, etc, etc. But I don't recall the details now.
Now for the Auto FE Updater, where appropriate, I have a text control which is underlined and blue, which the user can then click on. The code then opens up their web browser to the appropriate page on my website using the ShellExecute API. Much simpler for me than trying to figure out some kind of help system that would work both offline and online. I also update the ToolTip control to put in the exact URL so they can see where they are going to go if they click on the text control. That's a VB 6 program, but it's close enough for your requirements.
You may find HTML Help suitable.
I don't produce documentation for my client projects unless the client pays me big $$$ for it, as it's extremely difficult. I often guide users in producing in-house materials that document procedures and standards, but in general, I design my apps for EASE OF USE.
That is in contrast to EASE OF LEARNING.
EASE OF USE and EASE OF LEARNING often conflict with each other, as a UI design that makes it really easy to perform a task the first time often gets in the way once the user is accustomed to how things work.
However, it's important to design the UI with two things in mind:
things that are done on a daily basis don't need to be easy to learn -- they need to be really fast and friendly for the person who already knows how to use the app. I have a 10+20 rule -- 10 minutes of training and 20 minutes of use and the user will never forget how to use it.
things that are done only very seldom should be designed with a UI that is transparent and easy and doesn't require the user to remember anything at all. These kinds of tasks are great candidates for wizard-style interfaces that step the user through the process and provide hints and tips as text along the way.
I also have a number of UI design conventions that I implement throughout an app. The example that springs to mind is that any subform that is a datasheet or continuous form has a doubleclick event that when activated opens a popup form with the full details for the selected record. Once users grasp this convention, they will assume that any subform is doubleclickable in order to navigate to the detail.
There are other such conventions, but that's the basic idea, i.e., to implement similar behaviors in similar contexts so that if a user learns to do something in one context, when she finds herself in a different place with a similar UI, the things learned in the original context are transferable in terms of basic UI behavior.
You will need to do two things:
Create a help file with topic IDs for all of the topics
Link this help file to your access database, and link the topics
We have had very good results with http://www.helpandmanual.com/. From one single source, you can create any sort of help file that you want: pdf, online, chm, hlp, xml, ... It has a screenshot tool integrated.
Every topic can have its own ID, and you can just link your Access forms / controls to this ID.
I have done a very similar thing to Tony. It's kind of a user-generated content type of thing; let me explain.
The database contains a table with a list of the form names and then the path of a help file (word doc) that corresponds with that form.
Certain users have access to a form that allows them to say which help file corresponds to each form.
Each form then has a help button so when the user clicks on it they open up the correct help file.
This way it is totally flexible: if they just want one big help file, then all the links point to that, but if the users want to put the effort in, then they can make a file for each form. As the help files are separate from the DB, storage is not a problem, and the help files can also be changed without having to recompile the application.
You could merge this idea with Tony's and have the help files online if you wanted. I just find this a nice design pattern.
I recently stumbled upon TiddlyWiki and have been thinking about using that as a backend to the systems Kevin and Tony described.