Extract text from column in select of MySql query - mysql

I have a table named sentEmails where the body column contains the body text of an email.
In the body text, there is a substring like:
some link: <a href="https://somelink#somesite.com/somePage.php?someVar=someVal&sentby=agent">Random link text
Using MySql, I need to extract the url from this column like https://somelink#somesite.com/somePage.php?someVar=someVal&sentby=agent
I was thinking something like the query below would work by finding the starting location and returning the next 150 chars, but of course it actually just returns the leading characters of the body (everything from the start up to 150 chars past the match).
SELECT LEFT(body, LOCATE('some link: <a href="', body)+150) AS link
FROM sentEmails
WHERE sent between date_sub(now(),INTERVAL 1 WEEK) and now()
AND body like '%some link:%'
AND toEmail = 'email@gmail.com'
Additional info:
the link will always be preceded by the text some link:
Random link text at the end will change
I can live with getting a bit more of the text than needed if I have to; for example, getting https://somelink#somesite.com/somePage.php">Random link text would be acceptable
the text shown above is a substring of the full body column which contains much more text
This isn't something I'm going to be doing often. I'm researching an issue and I need the links from 40-50 of these rows; I'm just hoping to avoid having to pull the link manually from each row.
I can only use MySQL Query Browser to access this DB. If I could connect with PHP, this would be trivial.
The url in question, can have 6-25 parameters in it
The url in question will always end with this parameter &sentby=agent

If you had two unique delimiters around the URL, then you could just use SUBSTRING_INDEX() to isolate it. One approach would be to replace the text on either side of the URL in the anchor tag with a delimiter:
SELECT SUBSTRING_INDEX(SUBSTRING_INDEX(
REPLACE(REPLACE(body, '<a href="', '~'), '&sentby=agent">', '&sentby=agent~'), '~', -2),
'~', 1)
FROM sentEmails
WHERE sent BETWEEN DATE_SUB(NOW(), INTERVAL 1 WEEK) AND NOW() AND
body LIKE '%some link:%' AND
toEmail = 'email@gmail.com'
I replaced <a href=" and the "> that follows &sentby=agent with ~. If ~ does not occur anywhere else in the body column, and if you only have one HTML tag in the body, then this should work.
If the body column is just a big chunk of HTML, then you should consider using xpath and handling this in your app layer.

If you're just trying to extract the link, you can use the INSTR() and MID() functions. Something like this:
select mid(body, instr(body,'="') + 2, instr(body,'">') - instr(body,'="') - 2) from email...
instr(body,'="') is the position of the =" that precedes the link, and instr(body,'">') is the position of the "> that follows it. The + 2 skips past the =" itself.
MID takes (str, pos, len), where len = end position - starting position. Note that INSTR returns the first occurrence, so this assumes the link is the first tag in the body.
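The position arithmetic can be checked outside MySQL. Below is a minimal Python sketch, assuming a made-up body string and hypothetical helper names (instr, mid, mid_between); it mimics INSTR's 1-based positions and MID's (str, pos, len) arguments, and skips the two characters of the =" marker so the result starts at the URL itself.

```python
def instr(s, sub):
    """Mimic MySQL INSTR: 1-based position of the first match, 0 if absent."""
    return s.find(sub) + 1


def mid(s, pos, length):
    """Mimic MySQL MID(str, pos, len) for a 1-based, positive pos."""
    if pos < 1 or length < 1:
        return ""
    return s[pos - 1:pos - 1 + length]


def mid_between(body, start_marker='="', end_marker='">'):
    """Return the text between the first start_marker and the first end_marker."""
    start = instr(body, start_marker) + len(start_marker)  # first char after ="
    end = instr(body, end_marker)                           # position of the closing "
    return mid(body, start, end - start)


# Made-up sample row, not taken from the real table.
sample = 'some link: <a href="https://example.com/p.php?a=1&sentby=agent">Random link text'
link = mid_between(sample)
print(link)
```

Because both markers are looked up from the start of the string, this only works when no earlier tag appears before the link.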

Thanks to Tim's help, I was able to get this working with the below query:
SELECT SUBSTRING_INDEX( SUBSTRING_INDEX(body, 'some link: <a href="', -1) , 'sentby=agent">', 1) AS link
FROM sentEmails
where sent between date_sub(now(),INTERVAL 1 WEEK) and now()
AND body like '%some link:%'
AND toEmail = 'email@gmail.com'
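To see what the nested SUBSTRING_INDEX calls return, here is a rough Python sketch. The substring_index helper is a hypothetical approximation of the MySQL function, and the body text is made up. One thing it makes visible: the second delimiter, sentby=agent">, is consumed, so the result stops at the & just before it.

```python
def substring_index(s, delim, count):
    """Approximate MySQL SUBSTRING_INDEX(str, delim, count)."""
    if count == 0 or delim == "":
        return ""
    parts = s.split(delim)
    if count > 0:
        # Everything to the left of the count-th delimiter.
        return delim.join(parts[:count])
    # Everything to the right of the count-th delimiter, counted from the end.
    return delim.join(parts[count:])


# Made-up sample body; the real column holds much more text.
body = ('Hello, some link: <a href="https://example.com/somePage.php'
        '?someVar=someVal&sentby=agent">Random link text</a> Regards')

after_prefix = substring_index(body, 'some link: <a href="', -1)
link = substring_index(after_prefix, 'sentby=agent">', 1)
print(link)
```

Since every URL ends with that parameter, the missing sentby=agent could be appended back in the query, for example by wrapping the expression in CONCAT(..., 'sentby=agent').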

Doing this kind of search is not convenient. As the table with emails grows in size, the query will be less and less performant.
If this is a new application you're building, you're better off keeping a separate table with the list of URLs used in each sent email. You'd write the URLs to the DB as you send the emails.
The reasoning behind this is that the app will search the DB more often than it sends emails. Therefore, by doing a little extra work when sending emails, you help a lot with the most expensive use of the feature, which is the search.
If you still decide to keep the current approach, you'll want to have an index containing the columns (toEmail, sent) in this order.
Other than that, your approach makes sense and will work. Did you actually try it? Does it work for you?

Related

How can I create a date field by HTML5 Pattern (DD.MM.YYYY)

I have that code:
(?:19|20)[0-9]{2}-(?:(?:0[1-9]|1[0-2])-(?:0[1-9]|1[0-9]|2[0-9])|(?:(?!02)(?:0[1-9]|1[0-2])-(?:30))|(?:(?:0[13578]|1[02])-31))
Checks that
1) the year is numeric and starts with 19 or 20,
2) the month is numeric and between 01-12, and
3) the day is a) numeric between 01-29, or
b) 30 if the month value is anything other than 02, or
c) 31 if the month value is one of 01,03,05,07,08,10, or 12
It's from page http://html5pattern.com/Dates
I tried to move some parts of the pattern around, but then it doesn't work. I also tried to find instructions on how to do it, but I can't get it working.
How can I get a result like with above code but in format:
DD.MM.YYYY
Also, is there any way to put the dots in the field permanently, so that the user only has to type the digits?
(I mean that the dots will be there every time)
Thank you for help.
Sorry for my English.
I think this is something that you are looking for:
(?:(?:0[1-9]|1[0-9]|2[0-9])\.(?:0[1-9]|1[0-2])|(?:30)\.(?:(?!02)(?:0[1-9]|1[0-2]))|(?:31)\.(?:0[13578]|1[02]))\.(?:19|20)[0-9]{2}
I also tried some possible test cases and it worked fine.
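A quick way to try the pattern against a few values is a small Python script. This is only a sketch: the sample dates are made up, and re.fullmatch is used because the HTML pattern attribute has to match the whole field value.

```python
import re

# Pattern copied from the answer above (DD.MM.YYYY, years 1900-2099).
PATTERN = re.compile(
    r"(?:(?:0[1-9]|1[0-9]|2[0-9])\.(?:0[1-9]|1[0-2])"
    r"|(?:30)\.(?:(?!02)(?:0[1-9]|1[0-2]))"
    r"|(?:31)\.(?:0[13578]|1[02]))"
    r"\.(?:19|20)[0-9]{2}"
)


def is_valid(value):
    """Return True when the whole value matches the DD.MM.YYYY pattern."""
    return PATTERN.fullmatch(value) is not None


# Made-up test values.
samples = ["29.02.2020", "31.12.1999", "30.02.2020", "31.04.2020", "01.01.2100", "1.1.2020"]
results = {s: is_valid(s) for s in samples}
for value, ok in results.items():
    print(value, ok)
```

Note that the pattern has no leap-year logic, so 29.02 is accepted for every year, including 29.02.2019.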
You can use an input mask plugin for some of what you are asking for instead, I suppose.
A popular one that comes to mind is Robin Herbots's Inputmask: https://github.com/RobinHerbots/Inputmask
You can find demos off his page here: https://robinherbots.github.io/Inputmask/index.html
Once you add the plugin to your page, it's just a matter of setting up the right input tags and the jQuery for them. For example, a phone number script would be something along the lines of:
<script type="text/javascript">
$("#Phone").inputmask({mask: "999.999.9999"});
</script>
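You should look up the documentation for it. For a DD.MM.YYYY field, a mask like "99.99.9999" should be the analogous choice, since 9 stands for a digit in Inputmask masks.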

xpath scraping data from the second page

I am trying to scrape data from this webpage: http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33, and I specifically need data for fund number 26.
I have no problem getting data from the first page with this address (funds number 1-25), but for the life of me I can't scrape anything from the second page. Can someone help?
Thanks!
Here is the formula I use in Google Sheets:
=IMPORTXML("http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33","/html/body/form[@id='MainForm']/table/tr/td/div[@id='main']/div[@id='tabResult']/div[@id='Prices']/table/thead/tr[26]/td[@class='Center'][1]")
You can do 2 things - one is to append the PgIndex=2 onto the end of your URL, and then you can also significantly simplify your xpath to this:
//*[@id='Prices']//tr[2]/td[2]
This specifically grabs the second row on the table (tr which means table-row), in order to bypass the header row, then grabs the second field which is the table-data cell.
=IMPORTXML("http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33&PgIndex=2","//*[@id='Prices']//tr[2]/td[2]")
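The row and cell selection can be tried offline with Python's standard library. The sketch below uses a made-up, well-formed stand-in for the price table (the real page is larger and is not fetched here), and ElementTree supports only a subset of XPath, so the expression needs a leading dot.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed stand-in for the price table; the real page is larger.
PAGE = """
<html><body>
  <div id="Prices">
    <table>
      <tr><th>No.</th><th>Fund</th></tr>
      <tr><td>26</td><td>Example Fund</td></tr>
      <tr><td>27</td><td>Another Fund</td></tr>
    </table>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE)

# ".//" searches below the root element; tr[2] skips the header row and
# td[2] picks the second data cell of that row.
cells = root.findall(".//*[@id='Prices']//tr[2]/td[2]")
values = [cell.text for cell in cells]
print(values)
```

Changing tr[2] to tr[3] would select the next fund's row in the same way.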
To get the second page, add &PgIndex=2 to your url. Then adjust the /table/thead/tr[26] to /table/thead/tr[2]. The result is:
=IMPORTXML("http://webfund6.financialexpress.net/clients/zurichcp/PortfolioPriceTable.aspx?SchemeID=33&PgIndex=2","/html/body/form[@id='MainForm']/table/tr/td/div[@id='main']/div[@id='tabResult']/div[@id='Prices']/table/thead/tr[2]/td[@class='Center'][1]")

Get tabledata from html, JSOUP

What is the best way to extract data from a table from an url?
In short I need to get the actual data from the these 2 tables at: http://www.oddsportal.com/sure-bets/
In this example the data would be "Paddy power" and "3.50"
See this image:
(Sorry for posting image like this, but I still need reputation, i will edit later)
http://img837.imageshack.us/img837/3219/odds2.png
I have tried with Jsoup, but I don't know if this is the best way.
And I can't seem to navigate correctly down the tables, I have tried things like this:
tables = doc.getElementsByAttributeValueStarting("class", "center");
link = doc.select("div#col-content > title").first();
String text1 = doc.select("div.odd").text();
The tables call seems to get some data, but it doesn't include the text in the table.
Sorry, man. The second field you want to retrieve is filled by JavaScript. Jsoup does not execute JavaScript.
To select title of first row you can use:
Document doc = Jsoup.connect("http://www.oddsportal.com/sure-bets/").get();
Elements tables = doc.select("table.table-main").select("tr:eq(2)").select("td:eq(2)");
System.out.println(tables.select("a").attr("title"));
The selects are chained only to make each step visible.

Working Around SQL Replace Wildcards

I know that I cannot use a wildcard in a MySQL replace query through phpMyAdmin. But, I need some kind of workaround. I'm very open to ideas. Here's the skinny:
I have about 2,000 pages in a MySQL database that need to have image URL's updated. Some are local, some are hotlinked. Each one is different, the URL lengths vary, the image on the page and the new image are unique per page id number, and each one occurs at a different spot in the page.
I basically need to do the following:
UPDATE pages SET body = replace(body, 'src=\"%\"', 'src=\"http://newdomain/newimage.jpg\"') WHERE id="{page_number}"
But I know that the 'src=\"%\"' component doesn't jive.
So I fall at the feet of your collective knowledge to come up with some way to take the src="%" and replace it with a set URL for a set page id number. Thanks in advance.
If there's only one image per page, a quick solution would be like this:
UPDATE pages
SET
body = CONCAT(
SUBSTRING_INDEX(body, 'src="', 1),
'src=\"http://newdomain/newimage.jpg\"',
SUBSTRING(
SUBSTRING_INDEX(body, 'src="', -1)
FROM LOCATE('"', SUBSTRING_INDEX(body, 'src="', -1))+1)
)
WHERE
id="{page_number}" AND
body NOT LIKE '%<img%<img%';
The first SUBSTRING_INDEX extracts the part of the body to the left of src=", and the SUBSTRING over the nested SUBSTRING_INDEX calls extracts the part of the body to the right of the first " that follows src=".
The last condition is a very crude check to make sure that only one image is present in the string. It could fail under some circumstances, but it might help.
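The same three-piece idea (text before src=", the new attribute, text after the closing quote) can be sketched in Python to preview the result on a sample row. The function name and the sample page are made up for illustration. Unlike the SQL, which takes the tail after the last src=", this takes the first one; the two agree when there is exactly one image.

```python
def replace_single_src(body, new_url):
    """Swap the value of the first src="..." attribute, keeping the rest of the text."""
    before, marker, rest = body.partition('src="')
    if not marker:
        return body                      # no src attribute: leave the text alone
    _old_url, quote, after = rest.partition('"')
    if not quote:
        return body                      # unterminated attribute: leave the text alone
    return before + 'src="' + new_url + '"' + after


# Made-up page body and replacement URL.
page = '<p>Intro</p><img src="http://old.example/a.jpg" alt="x"><p>Outro</p>'
updated = replace_single_src(page, "http://newdomain/newimage.jpg")
print(updated)
```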
My suggestion would be to build a table with your replace strings that would look like this:
page_id replace
1 src="..."
Then you can update across a JOIN like this
UPDATE pages AS p
INNER JOIN `replace` AS r
ON p.page_id = r.page_id
SET p.body = REPLACE(p.body, CONCAT('src="', SUBSTRING_INDEX(SUBSTRING_INDEX(p.body, 'src="', -1), '"', 1), '"'), r.`replace`);
This would replace the last occurrence of anything in the format src="..." with the new value in the same format, so it works for all records with a single src value. REPLACE is a reserved word in MySQL, so the table and column names are quoted with backticks.

SQL query - Replace/Move some parts of content

I need to update about 2000 records in MySQL
I have a column 'my_content' from table 'my_table' with the following value
Title: some title here<br />Author: John Smith<br />Size: 2MB<br />
I have created 3 new columns (my_title, my_author and my_size) and now I need to separate the content of 'my_content' like this
'my_title'
some title here
'my_author'
John Smith
'my_size'
2MB
As you can imagine the title, author and size are always different for each row.
What I'm thinking is to query the following, but I'm not great at SQL queries and I'm not sure what the actual query would look like.
This is what I'm trying to do:
Within 'my_content' find everything that starts with "title:..." and ends with "...<br />au" and move it to 'my_title'
Within 'my_content' find everything that starts with "thor:..." and ends with "...<br />s" and move it to 'my_author'
Within 'my_content' find everything that starts with "ize:..." and ends with "...<br />" and move it to 'my_size'
I just don't know how to write a query to do this.
Once all the content is in the new columns, I can just find and delete the content that's not needed any more, for example 'thor:' , etc.
You can use INSTR to find the index of your delimiters and SUBSTRING to select out the part you want. So, for instance, the author would be
SUBSTR(my_content,
INSTR(my_content, "Author: ") + 8,
INSTR(my_content, "Size: ") - INSTR(my_content, "Author: ") - 8)
You'd need a bit more work to trim the <br /> and any surrounding whitespace.
Please try the below:
SELECT SUBSTRING(SUBSTRING_INDEX(my_content,'<br />',1),LOCATE('Title: ',my_content)+7) as my_title,
SUBSTRING(SUBSTRING_INDEX(my_content,'<br />',2),LOCATE('Author: ',my_content)+8) as my_author,
SUBSTRING(SUBSTRING_INDEX(my_content,'<br />',3),LOCATE('Size: ',my_content)+6) as my_size
FROM my_table;
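To sanity-check the values the query should return, the sample row can be split outside the database. This is a minimal Python sketch with a hypothetical split_content helper; it assumes every field has the form "Label: value" followed by <br />, as in the example from the question.

```python
def split_content(my_content):
    """Split 'Label: value<br />' segments into a dict keyed by lower-cased label."""
    fields = {}
    for segment in my_content.split("<br />"):
        label, sep, value = segment.partition(": ")
        if sep:  # skip the empty piece after the last <br />
            fields[label.strip().lower()] = value.strip()
    return fields


# Sample value from the question.
row = "Title: some title here<br />Author: John Smith<br />Size: 2MB<br />"
parsed = split_content(row)
print(parsed)
```

Comparing a few rows this way before running an UPDATE helps catch rows whose content does not follow the expected layout.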