Extract all URLs from HTML in C

How can I extract all URLs from an HTML document using only the C standard library?
I am trying to do it with sscanf(), but valgrind reports an error (and I am not even sure the code would meet my requirement once debugged, so if there are other ways, please tell me). I store the HTML content in a string; it contains multiple URLs, both absolute and relative (e.g. http://www.google.com, //www.google.com, /a.html, a.html and so on). I want to extract them one by one and store each of them in a separate string.
I have also thought about using strstr(), but then I have no idea how to get to the second URL.
My code using sscanf() (asserts omitted here):
int
main(int argc, char* argv[]) {
char *remain_html = (char *)malloc(sizeof(char) * 1001);
char *url = (char *)malloc(sizeof(char) * 101);
char *html = "navigation"
"search";
printf("html: %s\n\n", html);
sscanf(html, "<a href=\"%s", remain_html);
printf("after first href tag: %s\n\n", remain_html);
sscanf(remain_html, "%s\">", url);
printf("first web: %s\n\n", url);
sscanf(remain_html, "<a href=\"%s", remain_html);
printf("after second href tag: %s\n\n", remain_html);
free(remain_html);
free(url);
}
Valgrind reports: Conditional jump or move depends on uninitialised value(s).
If anybody could help, thank you so much!

Valgrind warns you about uninitialized data being used in a test. Since your program only calls sscanf and printf, the problem very probably lies in your sscanf calls.
If I change your program slightly to print the return value of each sscanf, showing how many elements it read:
int
main(int argc, char* argv[]) {
char *remain_html = (char *)malloc(sizeof(char) * 1001);
char *url = (char *)malloc(sizeof(char) * 101);
char *html = "<A class=\"mw-jump-link\" HREF=\"#mw-head\">Jump to navigation</a>"
"<a class=\"mw-jump-link\" href=\"#p-search\">Jump to search</a>";
printf("html: %s\n\n", html);
printf("%d\n", sscanf(html, "<a href=\"%s", remain_html));
printf("after first href tag: %s\n\n", remain_html);
printf("%d\n", sscanf(remain_html, "%s\">", url));
printf("first web: %s\n\n", url);
printf("%d\n", sscanf(remain_html, "<a href=\"%s", remain_html));
printf("after second href tag: %s\n\n", remain_html);
free(remain_html);
free(url);
}
The execution is:
pi@raspberrypi:/tmp $ ./a.out
html: <A class="mw-jump-link" HREF="#mw-head">Jump to navigation</a><a class="mw-jump-link" href="#p-search">Jump to search</a>
0
after first href tag:
-1
first web:
-1
after second href tag:
pi@raspberrypi:/tmp $
So the first sscanf read nothing (0 elements). That means it does not set remain_html, which is still uninitialized when the following printf and sscanf calls use it, and that is undefined behavior.
Because of the format
"<a href=\"%s"
the first sscanf expects a string starting with
<a href="
but html starts with
<A class=
which is different, so matching stops at the second character and remain_html is never set.
sscanf is not the right tool for this. Instead, search for the prefix <a href=" (which may be in uppercase), for instance using strcasestr, then extract the URL up to the closing "
Example:
#include <stdio.h>
#include <string.h>
#include <ctype.h>
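/* in case you do not have that function */
char * strcasestr(char * haystack, char *needle)
{
  if (!*needle)
    return haystack; /* an empty needle matches at the start */
  while (*haystack) {
    char * ha = haystack;
    char * ne = needle;
    /* cast to unsigned char: passing a negative char to tolower is undefined */
    while (tolower((unsigned char) *ha) == tolower((unsigned char) *ne)) {
      if (!*++ne)
        return haystack; /* reached the end of needle: full match */
      ha += 1;
    }
    haystack += 1;
  }
  return NULL;
}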
int main(int argc, char* argv[]) {
char *html = "<a href=\"http://www.google.com\">navigation</a>"
"<a href=\"/a.html\">search</a>";
char * begin = html;
char * end;
printf("html: %s\n", html);
while ((begin = strcasestr(begin, "<a href=\"")) != NULL) {
begin += 9; /* bypass the header */
end = strchr(begin, '"');
if (end != NULL) {
printf("found '%.*s'\n", (int) (end - begin), begin);
begin = end + 1;
}
else {
puts("invalid url");
return -1;
}
}
}
Compilation and execution:
pi@raspberrypi:/tmp $ gcc -Wall a.c
pi@raspberrypi:/tmp $ ./a.out
html: <a href="http://www.google.com">navigation</a><a href="/a.html">search</a>
found 'http://www.google.com'
found '/a.html'
pi@raspberrypi:/tmp $
Note: I know the second argument of strcasestr is lowercase here, so tolower(*ne) is unnecessary and *ne would be enough, but I gave a general definition of the function, independent of the current context.

Trying to get an image to show up on an HTML file in a C web server

I'm trying to get more familiar with C by writing a web server. Yesterday I managed to write a function that creates HTTP headers for HTML files, but I have been unable to get images within that HTML file to load.
Right now I generate my header by opening the HTML file and creating a memory stream, to which I write the start of the header and the file size; then I loop through the file and write each character to the stream. I then return that stream's buffer as a char pointer to the main function, which sends it as the response over the socket.
I imagine there is more work I need to do here, but I haven't been able to find a good solution or anything that points me in the right direction for getting the image to display. I appreciate any responses/insight.
test.html
<!DOCTYPE html>
<html>
<head>
<title>Nick's test website</title>
</head>
<body>
<h1>Welcome to my website programmed from scratch in C</h1>
<p>I'm doing this project to practice C</p>
<table>
<tr>
<td>test1</td>
<td>test2</td>
<td>test2</td>
</tr>
</table>
<img src="pic.jpg"/>
</body>
</html>
headermaker.c
char * headermaker(char * file_name){
char ch;
FILE * fp;
fp = fopen(file_name, "r");
if(fp == NULL){
perror("Error while opening file.\n");
exit(-1);
}
//print/save size of file
fseek(fp, 0, SEEK_END);
int size = ftell(fp);
fseek(fp, 0, SEEK_SET);
printf("File size is %d\n", size);
//create filestream
FILE * stream;
char * buf;
size_t len;
stream = open_memstream(&buf, &len);
//start header
fprintf(stream, "HTTP/1.1 200 OK\\nDate: Sun, 28 Aug 2022 01:07:00 GMT\\nServer: Apache/2.2.14 (Win32)\\nLast-Modified: Sun, 28 Aug 2022 19:15:56 GMT\\nContent-Length: ");
fprintf(stream, "%d\n", size);
fprintf(stream, "Content-Type: text/html\n\n");
//loop through each character
while((ch = fgetc(fp)) != EOF)
fprintf(stream, "%c", ch);
fclose(stream);
fclose(fp);
return buf;
}

I used a modified version of the provided code to set up a web server using netcat, and had to look up how to send a JPEG with netcat:
./one | nc -l 45231 ; { ./two && cat pic.jpeg; } | nc -l 45231;
The Chrome browser can open http://localhost:45231 and will show the web page with the image. You can also observe the network request/response sequence using "View->Developer->Developer Tools".
The code was built like this:
gcc -DONE -o one main.c && gcc -DTWO -o two main.c
The modified code:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
char * headermaker(char * file_name, char * content_type ){
int ch; /* int rather than char, so that EOF can be told apart from a valid byte */
FILE * fp;
fp = fopen(file_name, "r");
if(fp == NULL){
perror("Error while opening file.\n");
exit(-1);
}
//print/save size of file
fseek(fp, 0, SEEK_END);
int size = ftell(fp);
fseek(fp, 0, SEEK_SET);
//printf("File size is %d\n", size);
//create filestream
FILE * stream;
char * buf;
size_t len;
stream = open_memstream(&buf, &len);
//start header
fprintf(stream, "HTTP/1.1 200 OK\n");
fprintf(stream, "Server: netcat!\n");
fprintf(stream, "Content-Type: %s\n",content_type);
fprintf(stream, "Content-Length: %d\n", size);
fprintf(stream, "\n");
//loop through each character
#ifdef ONE
while((ch = fgetc(fp)) != EOF)
fprintf(stream, "%c", ch);
#endif //#ifdef ONE
fclose(stream);
fclose(fp);
return buf;
}
int main( )
{
#ifdef ONE
char *buf = headermaker( "test.html", "text/html" );
printf( "%s", buf );
free(buf);
#endif //#ifdef ONE
#ifdef TWO
char *buf = headermaker( "pic.jpeg", "image/jpeg" );
printf( "%s", buf );
free(buf);
#endif //#ifdef TWO
return 0;
}
Another helpful debug tool was curl:
curl -vvv localhost:45231 > page.html
curl -vvv localhost:45231 > image.jpeg

texinfo include HTML header from file

I am writing a Texinfo manual, and for its HTML I need to include the contents of another file into the <head> ... </head> section of the HTML output. To be more specific, I want to add mathjax capability to the HTML version of the output to show equations nicely. But I can't seem to find how I can add its <script>...</script> to the header!

Since I couldn't find an answer and doing the job myself didn't seem too hard, I wrote a tiny C program to do it for me. It worked perfectly in my case!
Of course, if there is an option in Texinfo that does this, that would be the proper answer; this is just a workaround to keep things going for now.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define ADDTOHEADER " \n\
<script type=\"text/javascript\" \n\
src=\"http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML\">\n\
</script>\n\
</head>"
void
addtexttohtml(char *filename)
{
char toadd[]=ADDTOHEADER;
size_t len=0;
ssize_t read;
FILE *in, *out;
char tmpname[]="tmp457204598345.html", *line=NULL;
in=fopen(filename, "r");
out=fopen(tmpname, "w");
if (in == NULL) exit(EXIT_FAILURE);
if (out == NULL) exit(EXIT_FAILURE);
while ((read = getline(&line, &len, in)) != -1)
{
if(strcmp(line, "</head>\n")==0) break;
fprintf(out, "%s", line);
}
fprintf(out, "%s", toadd);
while ((read = getline(&line, &len, in)) != -1)
fprintf(out, "%s", line);
if(line)
free(line);
fclose(in);
fclose(out);
rename(tmpname, filename);
}
int
main(int argc, char *argv[])
{
int i;
for(i=1;i<argc;i++)
addtexttohtml(argv[i]);
return 0;
}
This program can be compiled with $ gcc addtoheader.c.
Then put the compiled program (called a.out by default) in the same directory as the HTML files and run:
$ ./a.out *.html
You can change the macro to insert any text you want.

Downloaded page source is different than the rendered page source

I'm planning to get data from this website:
http://www.gpw.pl/akcje_i_pda_notowania_ciagle
(it's the site of the main stock exchange in Poland).
I've got a program written in C++ that downloads the source of the page to a file.
The problem is that it doesn't contain the thing I'm interested in
(the stock values, of course).
If you compare the page source with what the "View element" option shows (RMB -> View element),
you can see that "View element" does contain the stock values:
<td>75.6</td>
<tr class="even red">
etc.
The downloaded source of the page doesn't have this information.
So I have two questions:
1) Why is the source of the page different from what "View element" shows?
2) How can I change my program so that it downloads the right code?
#include <string>
#include <iostream>
#include "curl/curl.h"
#include <cstdlib>
using namespace std;
// Write any errors in here
static char errorBuffer[CURL_ERROR_SIZE];
// Write all expected data in here
static string buffer;
// This is the writer call back function used by curl
static size_t writer(char *data, size_t size, size_t nmemb,
string *buffer)
{
// What we will return (libcurl expects a size_t byte count)
size_t result = 0;
// Is there anything in the buffer?
if (buffer != NULL)
{
// Append the data to the buffer
buffer->append(data, size * nmemb);
// How much did we write?
result = size * nmemb;
}
return result;
}
// You know what this does..
void usage()
{
cout <<"curltest: \n" << endl;
cout << "Usage: curltest url\n" << endl;
}
/*
* The old favorite
*/
int main(int argc, char* argv[])
{
if (argc > 1)
{
string url(argv[1]);
cout<<"Retrieving "<< url << endl;
// Our curl objects
CURL *curl;
CURLcode result;
// Create our curl handle
curl = curl_easy_init();
if (curl)
{
// Now set up all of the curl options
curl_easy_setopt(curl, CURLOPT_ERRORBUFFER, errorBuffer);
curl_easy_setopt(curl, CURLOPT_URL, argv[1]);
curl_easy_setopt(curl, CURLOPT_HEADER, 0L); // these options take a long
curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, writer);
curl_easy_setopt(curl, CURLOPT_WRITEDATA, &buffer);
// Attempt to retrieve the remote page
result = curl_easy_perform(curl);
// Always cleanup
curl_easy_cleanup(curl);
// Did we succeed?
if (result == CURLE_OK)
{
cout << buffer << "\n";
exit(0);
}
else
{
cout << "Error: [" << result << "] - " << errorBuffer;
exit(-1);
}
}
}
return 0;
}

Because the values are filled in using JavaScript.
"View source" shows you the raw source for the page, while "View Element" shows you the state the document tree is in at the moment.
There's no simple way to fix it, because you need to either execute the JavaScript or port it to C++ (and it would probably make you unpopular at the exchange).

When I save the page as an HTML file (File/Save as), I get a file containing all the data displayed in the browser, including what was not found in the page source (I use Chrome).
So I suggest that you add one step to your code:
1) Download the page using a JavaScript-enabled browser that supports a command line or some sort of API (if curl can't do it, maybe wget or lynx/links/links2/elinks on Linux can help you?).
2) Parse the data.

How to use tcl apis in a c code

I want to use some of the functionality (APIs) of my Tcl code in a separate C source file, but I don't understand how to do that, especially how to link them. I have taken a very simple Tcl script containing one procedure that adds two numbers and prints the sum. Can anybody tell me how I can call this Tcl code from C to get the sum? How can I write a C wrapper that calls this Tcl code? Below is the sample Tcl program I am using:
#!/usr/bin/env tclsh8.5
proc add_two_nos { } {
set a 10
set b 20
set c [expr { $a + $b } ]
puts " c is $c ......."
}

To evaluate a script from C code, use Tcl_Eval() or one of its close relatives. In order to use that API, you need to link in the Tcl library, initialize the Tcl library and create an interpreter to hold the execution context. Plus you really ought to do some work to retrieve the result and print it out (printing script errors out is particularly important, as that helps a lot with debugging!)
Thus, you get something like this:
#include <tcl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h> /* for strlen */
int main(int argc, char **argv) {
Tcl_Interp *interp;
int code;
char *result;
Tcl_FindExecutable(argv[0]);
interp = Tcl_CreateInterp();
code = Tcl_Eval(interp, "source myscript.tcl; add_two_nos");
/* Retrieve the result... */
result = Tcl_GetString(Tcl_GetObjResult(interp));
/* Check for error! If an error, message is result. */
if (code == TCL_ERROR) {
fprintf(stderr, "ERROR in script: %s\n", result);
exit(1);
}
/* Print (normal) result if non-empty; we'll skip handling encodings for now */
if (strlen(result)) {
printf("%s\n", result);
}
/* Clean up */
Tcl_DeleteInterp(interp);
exit(0);
}

I think I have solved it. You were correct: the problem was with the way I was including the headers. My C code includes tcl.h, tclDecls.h and tclPlatDecls.h, but these files did not exist in /usr/include, so I had been copying them into that directory, which was probably not the proper way to do it. In the end I did not copy the files to /usr/include and instead gave the include path when compiling. I have created the executable and it gives the correct result on the terminal. Thanks for your help.
Here is the exact C code I am using:
#include <tcl.h>
#include <tclDecls.h>
#include <tclPlatDecls.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
int main (int argc, char **argv) {
Tcl_Interp *interp;
int code;
char *result;
printf("inside main function \n");
// Tcl_InitStubs(interp, "8.5", 0);
Tcl_FindExecutable(argv[0]);
interp = Tcl_CreateInterp();
code = Tcl_Eval(interp, "source simple_addition.tcl; add_two_nos");
/* Retrieve the result... */
result = Tcl_GetString(Tcl_GetObjResult(interp));
/* Check for error! If an error, message is result. */
if (code == TCL_ERROR) {
fprintf(stderr, "ERROR in script: %s\n", result);
exit(1);
}
/* Print (normal) result if non-empty; we'll skip handling encodings for now */
if (strlen(result)) {
printf("%s\n", result);
}
/* Clean up */
Tcl_DeleteInterp(interp);
exit(0);
}
To compile this code and generate the executable I use the command below:
gcc simple_addition_wrapper_new.c -I/usr/include/tcl8.5/ -ltcl8.5 -o simple_addition_op
I executed simple_addition_op and got the result below, which is correct:
inside main function
c is 30 .......
My special thanks to Donal Fellows and Johannes.

std::find with type T** vs T*[N]

I prefer to work with std::string, but I would like to figure out what is going wrong here.
I am unable to understand why std::find doesn't work properly for type T**, even though pointer arithmetic works on it correctly, as in:
std::cout << *(argv+1) << "\t" <<*(argv+2) << std::endl;
But it works fine for the type T*[N].
#include <iostream>
#include <algorithm>
int main( int argc, const char ** argv )
{
std::cout << *(argv+1) << "\t" <<*(argv+2) << std::endl;
const char ** cmdPtr = std::find(argv+1, argv+argc, "Hello") ;
const char * testAr[] = { "Hello", "World" };
const char ** testPtr = std::find(testAr, testAr+2, "Hello");
if( cmdPtr == argv+argc )
std::cout << "String not found" << std::endl;
if( testPtr != testAr+2 )
std::cout << "String found: " << *testPtr << std::endl;
return 0;
}
Arguments passed: Hello World
Output:
Hello World
String not found
String found: Hello
Thanks.

Comparing values of type char const* amounts to comparing the addresses they hold. The address of "Hello" is guaranteed to be different unless you compare it to the address of another string literal "Hello" (in which case the pointers may compare equal). Your compare() function compares the characters being pointed to.

In the first case, you're comparing the pointer values themselves and not what they're pointing to. And the constant "Hello" doesn't have the same address as the first element of argv.
Try using:
const char ** cmdPtr = std::find(argv+1, argv+argc, std::string("Hello")) ;
std::string knows to compare contents and not addresses.
For the array version, the compiler may fold identical literals into a single one, so that every occurrence of "Hello" in the code is really the same pointer. When it does (this is permitted, not guaranteed), comparing for equality in
const char * testAr[] = { "Hello", "World" };
const char ** testPtr = std::find(testAr, testAr+2, "Hello");
yields the expected result.
