I'm currently scraping info from a website which uses icon fonts to identify information. When I find the element that contains the icon I get the "" character as expected. I want to identify the utf8 code of the character and as such be able to identify which symbol was used.
I'm looking to do something along these lines:
For Each HTMLElement in HTMLDocument.getElementsbyClassName("icon-class")
utf8code = HTMLElement.innerText
If utf8code = U+00AE Then
'do things
End If
Next
OK, whilst I wasn't able to fully achieve the goal of identifying the UTF-8 code of any character, I did manage to find a way to identify the characters for my use case.
As it turned out, in my case there are around 30 characters and they appear more or less sequentially in the UTF-8 code space. The task then was to understand how the UTF-8 code is formed, and user @RemyLebeau helped point me in the right direction. This video was very helpful for that: https://youtu.be/MijmeoH9LT4
My own summation is as follows:
1st byte: remove the first n+1 bits, where n = the total number of bytes in the sequence (this applies to multi-byte sequences)
2nd - nth byte: remove the first two bits
The remaining bits should be combined starting from the rightmost bit and moving left; any positions left over to make a multiple of 8 should be filled with 0s.
So, as in my example with 4 bytes:
243, 178, 129, 139
11110011, 10110010, 10000001, 10001011
11110-011, 10-110010, 10-000001, 10-001011
000(011)(11, 0010)(0000, 01)(001011)
00001111, 00100000, 01001011
0F, 20, 4B (i.e. code point U+F204B)
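The bit-stripping steps above can be sketched in Python as follows. This is a minimal illustration, not the code I actually used; the function name decode_utf8_sequence is hypothetical, and it assumes the input holds exactly one encoded character.

```python
def decode_utf8_sequence(data: bytes) -> int:
    """Return the code point of a single UTF-8 encoded character (1 to 4 bytes)."""
    first = data[0]
    # The lead byte tells us the sequence length and carries the top payload bits.
    if first < 0x80:
        length, value = 1, first
    elif first >> 5 == 0b110:
        length, value = 2, first & 0x1F
    elif first >> 4 == 0b1110:
        length, value = 3, first & 0x0F
    elif first >> 3 == 0b11110:
        length, value = 4, first & 0x07
    else:
        raise ValueError("invalid lead byte")
    if len(data) != length:
        raise ValueError("wrong number of bytes for this lead byte")
    # Each continuation byte looks like 10xxxxxx: drop the two marker bits
    # and append the remaining six bits to the value.
    for byte in data[1:]:
        if byte >> 6 != 0b10:
            raise ValueError("invalid continuation byte")
        value = (value << 6) | (byte & 0x3F)
    return value


code_point = decode_utf8_sequence(bytes([243, 178, 129, 139]))
print(f"U+{code_point:04X}")
```

In Python the same number is also available directly via ord() on the decoded character, which makes a convenient cross-check for the manual steps.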
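Now the code I used to help identify which character I was finding:
' Encode the element's text as UTF-8 and inspect the raw bytes
Dim utf8Encoding As New System.Text.UTF8Encoding(True)
Dim encodedString() As Byte
encodedString = utf8Encoding.GetBytes(HTML_Element.innerText)
' In my case the 4th byte (index 3) is enough to tell the icons apart
Select Case encodedString(3)
    Case 147
        ' handle the first icon
    Case 155
        ' handle the second icon
End Select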
In my particular case I was able to use a hashtable to relate the value of the 4th byte to a separate value that I needed.
Is this a good solution? No. It only works in specific cases, and being able to simply obtain the UTF-8 code would give a solution that is more effective and elegant for all use cases. But as this is a project for personal use only, and through a combination of lack of personal understanding and lack of people willing to help me understand, this solution works for me, so I figured I would include it in case anybody finds themselves in a similar situation where the above shortcut might help.
Related
This seemed like it should be very simple to do yet I've not been able to find an answer after weeks of looking.
I'm trying to remove strings that are no longer needed. Regex_replace sounds perfect but is not available in MySQL.
In MySQL how would I accomplish changing this:
[quote=ABC;xxxxxx]
to this:
[quote=ABC]
The issues are:
- this can appear anywhere in a text blob
- the xxxxxx can only be numeric but may be 6, 7 or 8 characters long
- not adding/removing any rows, just rewriting the contents of one column on one row at a time.
Thanks.
I don't think you really need REGEX_Replace (though it would make things easier of course).
Assuming that the example you presented is a real reflection of what you have:
Your starting point is the string [quote=<something>;, meaning that you can start by searching for [quote=,
Once you have found it, you need to search for ; and after that for ],
Once you have found them both, you know what to extract and where to start the next search (if the pattern you mentioned can appear more than once within a single blob).
Did I get you correctly?
EDIT
This paradigm is aimed to convert all instances of [quote=ABC;xxxxxx] to [quote=ABC] under the following assumptions:
The pattern can appear any number of times within the input string,
The length of xxxxxx is not fixed,
The resulting string (after removing all the appearances of ;xxxxxx) should replace the value in the table,
Performance is not an issue since either this is going to be a one-time job (through the whole table) or it will run every time on a single string (e.g. before INSERTing a new record).
Some MySQL functions that will be used:
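INSTR: Searches within a string for the first appearance of a sub-string and returns the position (offset) where the sub-string was found; it has no start-position argument, so use LOCATE(substr, str, pos) when the search must begin at a given offset,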
SUBSTR: Returns a substring from a string (several ways to use it),
CONCAT: Concatenates two or more strings.
The guidelines presented here apply for the manipulation of a single INPUT string. If this needs to be used over, say, a whole table, simply get the strings into a CURSOR and loop.
Here are the steps:
Declare five INT local variables to serve as indices and total input string length, say L_Start, L_UpTo, l_Total_Length, l_temp_1 and l_temp_2, setting the initial values L_Start = 1 and l_Total_Length = LENGTH(INPUT_String),
Declare a string variable into which you will copy the "cleaned" result and initialize it as '', say l_Output_str; also declare a temporary string to hold the value of 'ABC', say l_Quote,
Start an infinite loop (you will set the exit condition within it; see below),
Exit loop if L_Start > l_Total_Length (here is one of the two exit points from the loop),
Find the first location of '[quote=' within the input string starting from L_Start (LOCATE accepts a start position) and save it in L_UpTo,
If the returned value is 0 (i.e. substring not found), concatenate the current contents of l_Output_str with whatever remains of the input string from position L_Start (e.g. SET l_Output_str = CONCAT(l_Output_str, SUBSTR(INPUT_String, L_Start)) ;) and exit loop (second exit point),
Search the input string for the ; symbol starting from L_UpTo + 7 (i.e. just past [quote=) and save the value in l_temp_1,
Search the input string for the ] symbol starting from l_temp_1 + 1 and save the value in l_temp_2,
Add the text preceding the tag and the cleaned tag to the output string as SET l_Output_str = CONCAT(l_Output_str, SUBSTR(INPUT_String, L_Start, L_UpTo - L_Start), '[quote=', SUBSTR(INPUT_String, L_UpTo + 7, l_temp_1 - L_UpTo - 7), ']') ;,
Set L_Start = l_temp_2 + 1 ;
End of loop.
Notes:
As I neither wrote the code nor tested it, it is possible that I'm not setting the indices correctly; you will need to perform detailed tests to get it working as needed;
The above IS the method I suggested;
If the input string is very long (many MBs), you might observe poor performance (i.e. it might take a few seconds to complete) because of the concatenations. There are some steps that can be taken to improve performance, but let's have this working first and then, if needed, tackle the performance issues.
Hope that the above is clear and comprehensive.
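To make the index handling easier to check, here is the same search-and-copy loop sketched in Python rather than as a MySQL stored routine. The function name strip_quote_ids is hypothetical, and the sketch assumes that anything between the ; and the closing ] should be dropped; it does not verify that the dropped part is numeric.

```python
MARKER = "[quote="


def strip_quote_ids(text: str) -> str:
    """Rewrite every [quote=NAME;digits] as [quote=NAME], leaving other text as is."""
    pieces = []
    start = 0  # position from which the next search begins
    while start < len(text):
        tag = text.find(MARKER, start)
        if tag == -1:
            break  # no further tags; the remainder is copied after the loop
        close = text.find("]", tag + len(MARKER))
        if close == -1:
            break  # unterminated tag: keep the rest unchanged
        # Look for ';' only inside this tag; a tag without one is copied as is.
        semi = text.find(";", tag + len(MARKER), close)
        name_end = semi if semi != -1 else close
        pieces.append(text[start:tag])  # text before the tag
        pieces.append(MARKER + text[tag + len(MARKER):name_end] + "]")
        start = close + 1
    pieces.append(text[start:])  # whatever remains after the last tag
    return "".join(pieces)


print(strip_quote_ids("see [quote=ABC;1234567] and [quote=XY;123456]."))
```

Collecting the pieces in a list and joining once at the end sidesteps the repeated-concatenation cost mentioned in the notes.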
How can I go about querying a database for items that are not only exact matches for a sample, but also near matches? Much like search engines work, but only for a small project, preferably in Java. For example:
String sample = "Sample";
I would like to retrieve all the following whenever I query sample:
String exactMatch = "Sample";
String nonExactMatch = "S amp le";
String nonExactMatch_2 = "ampls";
You need to define what similar means in terms that your database can understand.
Some possibilities include Levenshtein distance, for example.
In your example, sample matches...
..."Sample", if you search without case sensitivity.
..."S amp le", if you remove a set of ignored characters (here space only) from both the query string and the target string. You can store the new value in the database:
ActualValue      SearchFor
John Q. Smith    johnqsmith%
When someone searches for "John Q. Smith, Esq." you can boil it down to johnqsmithesq and run
WHERE 'johnqsmithesq' LIKE SearchFor
"ampls" is more tricky. Why is it that 'ampls' is matched by 'sample'? A common substring? A number of shared letters? Does their order count (i.e. are anagrams valid)? Many approaches are possible, but it is you who must decide. You might use Levenshtein distance, or maybe store a string such as "100020010003..." where every digit encodes the number of letters you have, up to 9 (so 3 C's and 2 B's but no A's would give "023...") and then run the Levenshtein distance between this syndrome and the one from each term in the DB:
ActualValue      Search1        Rhymes   abcdefghij_Contains   anagramOf
John Q. Smith    johnqsmith%    ith      00000002110011...     hhijmnoqst
...and so on.
One approach is to ask oneself, how must I transform both searched value and value searched for, so that they match?, and then proceed and implement that in code.
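As a rough illustration of those transformations, the sketch below (in Python for brevity, although the question mentions Java) builds the normalized search key, the letter-count string and the sorted-letters key, and computes a Levenshtein distance. All function names are hypothetical, and the normalization rule (lowercase, keep only letters and digits) is an assumption about what should be ignored.

```python
import string


def normalize(text: str) -> str:
    """Lowercase and drop everything that is not a letter or digit."""
    return "".join(ch for ch in text.lower() if ch.isalnum())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def letter_counts(text: str) -> str:
    """One digit per letter a..z, each capped at 9."""
    norm = normalize(text)
    return "".join(str(min(norm.count(c), 9)) for c in string.ascii_lowercase)


def sorted_letters(text: str) -> str:
    """Key that is identical for all anagrams of the same letters."""
    return "".join(sorted(normalize(text)))


print(normalize("S amp le"), levenshtein("sample", "ampls"))
print(letter_counts("John Q. Smith"), sorted_letters("John Q. Smith"))
```

The keys would be computed once when a row is stored and again for each query, so the database only has to compare precomputed columns.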
You can use MATCH ... AGAINST on columns that have a MyISAM full-text index.
I have a file with 13 columns and 41 lines consisting of the coefficients for the Joback Method for 41 different groups. Some of the values do not exist, though, and the table lists them as "X". I saved the table as a .csv and in my code read the file into an array. An excerpt of two lines from the .csv (the second one contains non-existing coefficients) looks like this:
48.84,11.74,0.0169,0.0074,9.0,123.34,163.16,453.0,1124.0,-31.1,0.227,-0.00032,0.000000146
X,74.6,0.0255,-0.0099,X,23.61,X,797.0,X,X,X,X,X
What I've tried doing was to read and define an array to hold each IOSTAT value so I can know if an "X" was read (that is, IOSTAT would be positive):
DO I = 1, 41
(READ(25,*,IOSTAT=ReadStatus(I,J)) JobackCoeff, J = 1, 13)
END DO
The problem, I've found, is that if the first value of the line to be read is "X", producing a positive value of ReadStatus, then the rest of the values on that line are not read correctly.
My intent was to use the ReadStatus array to produce an error message if JobackCoeff(I,J) caused a read error, therefore pinpointing the "X"s.
Can I force the program to keep reading a line after there is a reading error? Or is there a better way of doing this?
As soon as an error occurs during the input execution then processing of the input list terminates. Further, all variables specified in the input list become undefined. The short answer to your first question is: no, there is no way to keep reading a line after a reading error.
We come, then, to the usual answer when more complicated input processing is required: read the line into a character variable and process that. I won't write complete code for you (mostly because it isn't clear exactly what is required), but when you have a character variable you may find the index intrinsic useful. With this you can locate Xs (with repeated calls on substrings to find all of them on a line).
Alternatively, if you provide an explicit format (rather than relying on list-directed (fmt=*) input) you may be able to do something with non-advancing input (advance='no' in the read statement). However, as soon as an error condition comes about then the position of the file becomes indeterminate: you'll also have to handle this. It's probably much simpler to process the line-as-a-character-variable.
An outline of the concept (without declarations, robustness) is given below.
read(iunit, '(A)') line
idx = 1
do i=1, 13
  read(line(idx:), *, iostat=iostat) x(i)
  if (iostat.gt.0) then
    print '("Column ",I0," has an X")', i
    x(i) = -HUGE(0.)  ! Recall x(i) was left undefined
  end if
  idx = idx + INDEX(line(idx:), ',')
end do
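For comparison, the same read-the-line-then-process-each-field idea can be sketched in Python. This is only an illustration of the concept, not Fortran you can paste in; the function name parse_coefficient_row is hypothetical, and using NaN as the placeholder (instead of the -HUGE sentinel above) is an assumption.

```python
import math


def parse_coefficient_row(line: str, missing_marker: str = "X"):
    """Split one CSV line; return (values, 1-based columns holding the marker)."""
    values = []
    missing = []
    for column, field in enumerate(line.split(","), start=1):
        field = field.strip()
        if field == missing_marker:
            values.append(math.nan)  # placeholder for a non-existing coefficient
            missing.append(column)
        else:
            values.append(float(field))
    return values, missing


row = "X,74.6,0.0255,-0.0099,X,23.61,X,797.0,X,X,X,X,X"
values, missing = parse_coefficient_row(row)
print(missing)
```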
An alternative, long used by many many Fortran programmers, and programmers in other languages, would be to use an editor of some sort (I like sed) and modify the file by changing all the Xs to NANs. Your compiler has to provide support for IEEE NaNs for this to work (most of the current crop in widespread use do) and they will correctly interpret NAN in the input file to a real number with value NaN.
This approach has the benefit, compared with the already accepted (and perfectly good) answer, of not requiring clever programming in Fortran to parse input lines containing mixed entries. Use an editor for string processing, use Fortran for reading numbers.
I want to replace the characters '<' and '>' by &lt; and &gt; with COBOL. I was wondering about the INSPECT statement, but it looks like this statement can only be used to translate one char into another. My intention is to replace all HTML special characters by their HTML entities.
Can anyone figure out some way to do it? Maybe looping over the string and testing each char is the only way?
GnuCOBOL or IBM COBOL examples are welcome.
My best code is something like this: (http://ideone.com/MKiAc6)
       IDENTIFICATION DIVISION.
       PROGRAM-ID. HTMLSECURE.
       ENVIRONMENT DIVISION.
       DATA DIVISION.
       WORKING-STORAGE SECTION.
       77 INPTXT PIC X(50).
       77 OUTTXT PIC X(500).
       77 I PIC 9(4) COMP VALUE 1.
       77 P PIC 9(4) COMP VALUE 1.
       PROCEDURE DIVISION.
           MOVE 1 TO P
           MOVE '<SCRIPT> TEST TEST </SCRIPT>' TO INPTXT
      *> Copy each character, expanding < and > to their entities
           PERFORM VARYING I FROM 1 BY 1
                   UNTIL I > LENGTH OF INPTXT
               EVALUATE INPTXT(I:1)
                   WHEN '<'
                       MOVE "&lt;" TO OUTTXT(P:4)
                       ADD 4 TO P
                   WHEN '>'
                       MOVE "&gt;" TO OUTTXT(P:4)
                       ADD 4 TO P
                   WHEN OTHER
                       MOVE INPTXT(I:1) TO OUTTXT(P:1)
                       ADD 1 TO P
               END-EVALUATE
           END-PERFORM
           DISPLAY OUTTXT
           STOP RUN
           .
GnuCOBOL (yes, another name branding change) has an intrinsic function extension, FUNCTION SUBSTITUTE.
move function substitute(inptxt, ">", "&gt;", "<", "&lt;") to where-ever-including-inptxt
Takes a subject string, and pairs of patterns and replacements. (This is not regex patterns, straight up text matching). See http://opencobol.add1tocobol.com/gnucobol/#function-substitute for some more details. The patterns and replacements can all be different lengths.
As intrinsic functions return anonymous COBOL fields, the result of the function can be used to overwrite the subject field, without worry of sliding overlap or other "change while reading" problems.
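The pattern/replacement pairing can be illustrated outside COBOL with a small Python sketch. It is not how FUNCTION SUBSTITUTE is implemented; the names ENTITY_FOR and escape_angle_brackets are hypothetical, and only < and > are handled because those are the two characters in the question.

```python
# Each special character maps to the entity that should replace it.
ENTITY_FOR = {
    "<": "&lt;",
    ">": "&gt;",
}


def escape_angle_brackets(text: str) -> str:
    """Build a new string, swapping each listed character for its entity."""
    # A new string is produced from the input, so the result can be assigned
    # back to the original variable without any overlap concerns.
    return "".join(ENTITY_FOR.get(ch, ch) for ch in text)


print(escape_angle_brackets("<SCRIPT> TEST TEST </SCRIPT>"))
```

Working one input character at a time means a replacement is never rescanned, which would matter if & were later added to the table.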
COBOL is a language of fixed-length fields. So no, INSPECT is not going to be able to do what you want.
If you need this for an IBM Mainframe, your SORT product (assuming sufficiently up-to-date) can do this using FINDREP.
If you look at the XML processing possibilities in Enterprise COBOL, you will see that they do exactly what you want (I'd guess). GnuCOBOL can also readily interface with lots of other things. If you are writing GnuCOBOL for running on a non-Mainframe, I'd suggest you ask on the GnuCOBOL part of SourceForge.
Otherwise, yes, it would come down to looping through the data. Once you clarify what you want a bit more, you may get examples of that if you still need them.
I'm setting up a database to do some linguistic analysis, and Japanese Kana are giving me just a bit of trouble.
Unlike other questions on this so far, I don't know that it's an encoding issue, per se. I've set the collation to utf8_unicode_ci, and on the surface it's saving and recalling most things all right.
The problem, however, is when I get into related kana, such as キ (ki) and ギ (gi). For sorting purposes, Japanese doesn't distinguish between the two unless they are in direct conflict. So for example:
ぎ (gi) comes before きかい (kikai)
きる (kiru) comes before ぎわく (giwaku)
き (ki) comes before ぎ (gi)
It's this behavior that I think is at the root of my problem. When loading my data set from an external file, I had it do a SELECT call to verify that specific readings in Japanese had not already been logged. If it was already there, it would fetch the ID so it could be paired to a headword; otherwise a new entry was added and paired thereafter.
What I noticed after I put everything in is that wherever two such similar readings occurred, the first one encountered would be logged and would then show up as a false positive for the other if it showed up. For example:
キョウ (kyou) appeared first, so characters with ギョウ (gyou) got paired with kyou instead
ズ (zu) appeared before ス (su), so likewise even more characters got incorrectly matched.
I can go through and manually sort it out if need be, but what I would really like to do is set the database up to take a stricter view regarding differentiating between characters (e.g. if the characters have two different UTF-8 code points, treat them as different characters). Is there any way to get this behavior?
You can use utf8_bin to get a collation that compares characters by their Unicode code points.
The utf8_general_ci collation also distinguishes キョウ and ギョウ.
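A short Python sketch may help show what "comparing by code point" means here. It is an illustration only, not MySQL's implementation, and the helper names are hypothetical. The voiced kana are separate code points, but each decomposes into the unvoiced kana plus a combining voiced sound mark, which is a plausible reason why a collation that ignores such marks reports them as equal.

```python
import unicodedata


def code_points(text: str) -> list:
    """List each character's code point in U+XXXX form."""
    return [f"U+{ord(ch):04X}" for ch in text]


def binary_equal(a: str, b: str) -> bool:
    """Byte-for-byte comparison, similar in spirit to a _bin collation."""
    return a.encode("utf-8") == b.encode("utf-8")


def base_letters_equal(a: str, b: str) -> bool:
    """Compare after decomposing and dropping combining marks (e.g. dakuten)."""
    def strip_marks(text: str) -> str:
        decomposed = unicodedata.normalize("NFD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return strip_marks(a) == strip_marks(b)


kyou, gyou = "\u30ad\u30e7\u30a6", "\u30ae\u30e7\u30a6"  # キョウ and ギョウ
print(code_points(kyou), code_points(gyou))
print(binary_equal(kyou, gyou), base_letters_equal(kyou, gyou))
```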
When saving to the database, store the value as binary, and when reading it back convert it to Japanese text again.
I had the same problem with the Arabic language.