Csv hebrew text not in good order - mysql

I am trying to import a CSV file so that I can use its data in my PHP project and insert it into a MySQL database. The problem is that my CSV file contains one column with Hebrew characters. The CSV was converted from an XLS file.
When I open the file in Excel, the text is displayed correctly.
But when I use the CSV file, the order is wrong:
פרקט תלת שכבתי אלון 189x15/4 גרי ישן מעושן גימור שמן UV
Does anybody know how to resolve this problem? Thanks!
The problem is not in my PHP script, which works correctly. The problem is that the Excel cell format is not carried over when I use the data as CSV.
The problem is when an English word or number is mixed in with the text:
Example:
English:
“Can we improve the health of patients by giving them Aspirin?”
Hebrew:
“[Hebrew translated text] Aspirin?”
This is displayed as:
Aspirin [Hebrew translated text]?
Hopefully I explained the issue well enough. It is a little confusing, so if I need to clarify more, please let me know.
Any help or experience is appreciated.

As an RTL language speaker, I think I can be of help.
It all depends on the text direction the UI is using. Most applications use LTR (left-to-right) text direction by default. If you are using MySQL Workbench to see the values stored in the column, MySQL Workbench uses LTR direction as well. That's why you see the wrong order when the text is bidirectional (Hebrew mixed with numbers or Latin letters).
Keep in mind that a CSV file is merely UTF-8 plain text, which means the text carries no style and no direction. You only need to set your HTML direction to RTL. See the example below:
<h3>Wrong LTR Direction</h3>
<p dir="ltr">פרקט תלת שכבתי אלון 189x15/4 גרי ישן מעושן גימור שמן UV</p>
<h3>Correct RTL Direction</h3>
<p dir="rtl">פרקט תלת שכבתי אלון 189x15/4 גרי ישן מעושן גימור שמן UV</p>
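To see why the mixed string gets reordered, it can help to look at the bidirectional class that Unicode assigns to each character. The following is only a minimal Python sketch using the standard unicodedata module; the helper name bidi_classes is made up for illustration, and the sample is a shortened piece of the string from the question.

```python
import unicodedata


def bidi_classes(text):
    """Return (character, Unicode bidirectional class) pairs for text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]


# Shortened sample from the question: Hebrew first, then digits and Latin letters.
sample = "אלון 189x15/4 UV"

for ch, cls in bidi_classes(sample):
    # R = right-to-left letter, L = left-to-right letter,
    # EN = European number, CS = common separator, WS = whitespace
    print(repr(ch), cls)
```

The characters are stored in the same logical order in Excel, in the CSV and in MySQL. Only the base direction of the container (dir="ltr" or dir="rtl") changes how the R, L and EN runs are arranged on screen.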
Salam :)

Related

Word html format: insert a custom TOC via field code

I am generating Word docs from HTML. Basically, I build a file with HTML and save it as a .doc. Then I open it in Word and apply a template. All good so far.
I would like to automatically generate a custom TOC via the HTML, i.e. when I am building the document. I need to insert a field code to do that, in the same way I do to add page numbering via the HTML, e.g.:
<span style="mso-field-code: PAGE " class="page-field"></span>
If I save my HTML doc as docx and apply a template, I can make a TOC based on the styles in the way one would normally create a TOC in Word. I customised the TOC so the Title style is the top level, followed by H1, H2 and then H3. If I then toggle the field code on the TOC, the field code looks like this:
{ TOC \t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1" }
Now, I can add HTML like this to insert the TOC:
<div style="mso-field-code: TOC " class="toc-field">TOC goes HERE</div>
When I do that, if I right click the text "TOC goes HERE" I get the option to "Update field" and if I do that a TOC is generated using the default H1,H2,H3 tags.
But, what I can't work out is how to include the
\t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"
part so my custom style sequence is applied. I have tried all sorts of combinations and it seems that adding anything after TOC causes Word to not make a field code.
Does anyone have any suggestions?
Update:
Based on the essential help from @slightlysnarky below, I thought I would summarise the outcome here because the information I needed was in a Microsoft chm file that was taken down many years ago. If you read the following extract from that help manual and compare it to the solution below you will see how this all works.
Word marks and stores information for simple fields by means of the Span element with the mso-field-code style. The mso-field-code value represents the string value of the field code. Formatting in the original field code might be lost when saving as HTML if only the string value of the code is necessary for its calculation.
Word has a different way of storing field information to HTML for more complex fields, such as ones that have formatted text or long values. Word marks these fields with <!--[if supportFields]> so the data is not displayed in the browser. Word uses the Span element with the mso-element: field-begin, mso-element: field-separator, and mso-element: field-end attributes to contain the three respective parts of the field code: the field start, the separator between field code and field results, and the field end. Whenever possible, Word will save the field to HTML in the method that uses the least file space.
So, basically, add tags as shown below to your HTML at the point you wish the TOC to appear.
:-)
Word recognises a "complex field format" in HTML, along the same lines as it does in the Office Open XML format. So you can use
<span style='mso-element:field-begin'></span>TOC \t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"
<span style='mso-element:field-separator'></span>This text will show but the user will need to update the field
<span style='mso-element:field-end'></span>
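If the HTML is produced by a script, the three-part construct above can be generated by a small helper. This is only a sketch in Python: the function name word_field_html and the placeholder wording are invented here, and it simply reproduces the span layout shown above rather than anything Word documents beyond that.

```python
import html


def word_field_html(field_code, placeholder):
    """Build the begin / separator / end spans around a Word field code.

    field_code  -- the text Word shows when field codes are toggled on
    placeholder -- text displayed until the user updates the field
    """
    return "\n".join([
        "<span style='mso-element:field-begin'></span>"
        + html.escape(field_code, quote=False),
        "<span style='mso-element:field-separator'></span>"
        + html.escape(placeholder, quote=False),
        "<span style='mso-element:field-end'></span>",
    ])


# Raw string so that \t stays a backslash followed by "t", not a tab character.
toc_code = r'TOC \t "Heading 1,2,Heading 2,3,Heading 3,4,Title,1"'
print(word_field_html(toc_code, "Right-click and choose Update field"))
```

The raw string matters: in an ordinary Python string literal, \t would be turned into a tab and the switch would be lost before Word ever sees it.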
This construct is outlined in a Microsoft document called "Microsoft Office HTML and XML Reference". It's a Windows .exe that unpacks to a .chm Help file. You can get it here
The info. on encoding fields is in Getting Started with Microsoft Office 2000 HTML and XML->Microsoft Word->Fields
There may be a later version but that's the only one I could find.

ruby tags for Sphinx/rst

I create HTML documents from rst-formatted text with the help of Sphinx. I need to display some Japanese words with furigana (small characters above the words).
I'd like to produce HTML that displays the furigana using the <ruby> tag.
I can't figure out how to get this result. I tried to:
insert raw HTML code with the .. raw:: html directive but it breaks my line into several paragraphs.
use the :superscript: directive but the text in furigana is written beside the text, not above.
use the :role: directive to create a link between the text and a CSS class of my own. But the :role: directive can only be applied to a segment of text, not to TWO segments as required by the furiganas (=text + text above it).
Any ideas?
As far as I know, there's no simple way to get the expected result.
For a specific project, I chose not to generate the furigana with the help of Sphinx but to modify the .html files afterwards. See the add_ons/add_furiganas.py script and the result here. Yes, it's a quick-and-dirty trick :(
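The post-processing idea could look roughly like the sketch below. It is not the add_furiganas.py script mentioned above, which is not shown here. It assumes a made-up marker syntax, {base|reading}, typed directly in the rst source and passed through unchanged into the generated HTML, and the function name add_furigana is likewise hypothetical.

```python
import re

# Hypothetical marker syntax, chosen for this sketch only: {base|reading}
FURIGANA = re.compile(r"\{([^{}|]+)\|([^{}|]+)\}")


def add_furigana(html_text):
    """Replace {base|reading} markers with <ruby> markup."""
    return FURIGANA.sub(r"<ruby>\1<rt>\2</rt></ruby>", html_text)


print(add_furigana("<p>{漢字|かんじ} are used in Japanese.</p>"))
```

Because braces and pipes can also occur in inline scripts or styles, a real script would probably want to restrict the substitution to the body text of each page.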

Regex extract html source with multiple elements

Before you tell me not to use regex to parse HTML: I'm aware of this, but my company uses Iconico Data Extractor to extract data from its website. It allows you to create custom scripts, but they have to be regular expressions in JavaScript, so I am stuck with using regex to achieve my goal.
What I need is to take the following example html and extract each line
<b>Item 1</b> Text <br>
<b>Item 2</b> Text <br>
<b>Item 3</b> Text <br>
<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>
<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>
What I need is to break down each item into an expression that retrieves the whole line complete with tags, exactly as it appears in the HTML. I have tried /<b>*details(.|\s)*?\/a>/gi, which gets me Item 4. But I cannot work out how to get Items 1 - 3, as what I require is just the line from <b> to <br>. /<b>*Item 1(.|\s)*?\br>/gi simply does not work, and after hours of playing around with it I'm no further forward. I also need to get rid of the font tags, if that's possible. I think it's complicated by the fact that there is a closing </b> in the middle.
Can anyone offer some advice on how to set up the expression? I already know that the general consensus is no to regex, so no need to go down that route again :)
This is all quite new to me, so I hope I've explained what I'm trying to do.
Thanks in advance
I've used regex to parse HTML before and it worked just fine. I used something like the following. As you can see, there are a lot of ".*?" groups, which means a non-greedy match of any character. Very useful.
What language are you using? You may have to set options to allow matching across newlines, otherwise it could be treating each line as a separate input.
In Python, add the re.DOTALL option. In PHP, the s modifier after the closing slash does the same.
<b>(.*?)<br>.*?<b>(.*?)<br><b>(.*?)<br><p.*?sans-serif"><b>(.*?)</p>.*?serif">(.*?)</p>
For the purposes of using this with the data extractor, I've done some research on getting data between two keywords and (Item 1:.*?<br>)/gi works brilliantly.
Unfortunately, I've now been told that the tags have to be stripped off from now on, so I need to scratch my head over that one. I'll post a new question if I need help with it.
Thanks so much for responding and trying to help
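For reference, the non-greedy approach from this thread, plus the tag stripping mentioned at the end, can be tried out as follows. Python is used here only to demonstrate the patterns; how the extractor tool handles flags is not something this sketch can confirm. The names ITEM_LINE, TAG, item_lines and strip_tags are invented for the example, and the sample is the HTML from the question.

```python
import re

page = """<b>Item 1</b> Text <br>
<b>Item 2</b> Text <br>
<b>Item 3</b> Text <br>
<p><font color="#000000" face="Arial, Helvetica, sans-serif"><b>Item 4:</b></font></p>
<p><font color="#000000" face="Arial, Helvetica, sans-serif">Detailed Description</font></p>"""

# Non-greedy: from "<b>Item N" up to the first <br> that follows it.
ITEM_LINE = re.compile(r"<b>Item \d+.*?<br>", re.IGNORECASE | re.DOTALL)
# Any single tag such as <b>, </b>, <br> or <font ...>.
TAG = re.compile(r"<[^>]+>")


def item_lines(html_text):
    """Return each '<b>Item N ... <br>' fragment, tags included."""
    return ITEM_LINE.findall(html_text)


def strip_tags(fragment):
    """Remove tags and surrounding whitespace from a fragment."""
    return TAG.sub("", fragment).strip()


for line in item_lines(page):
    print(line, "->", strip_tags(line))
```

Item 4 is not returned by ITEM_LINE because no <br> follows it in the sample, so it would still need its own expression.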

Good way to store formatted text in DB to output later

I write news for my website and format it like this:
[h1]News[/h1]
[red]Happy New Year[/red]
[white]Happy New Year[/white]
The news is stored as-is in the MySQL DB.
Then when it's called by my website, a function converts every code into HTML format.
[h1][/h1] = <h1></h1>
[red][/red] = <font color=red></font>
I haven't been happy with this method for a long time, and now tags like <font> are obsolete in HTML5.
Instead of using <font>, I should move the styling to CSS.
I'm very beginner with PHP, MySQL, CSS, HTML...really, but I'm trying and learning.
So, what I need is the best solution for this matter.
I was thinking to create a CSS rule like:
span.news-red { color: red; }
span.news-white { color: white; }
And then convert the codes into <span class="news-red"></span> for red text, etc.
Is this an effective solution or just a palliative?
Thank you.
EDIT
I have these two functions to convert the format of my text so it can be output for the visitor.
1st - Converts [white-text][/white-text] into <font color=white></font>
$string = preg_replace("/\[white-text\](\S+?)\[\/white-text\]/si","<font color=white>\\1</font>", $string);
2nd - Converts [url][/url] into a link
$string = preg_replace("/\[url\](\S+?)\[\/url\]/si","\\1", $string);
Problems:
WHITE-TEXT - It only changes the color of one word phrases.
URL - It works fine, but I would like to be able to write anything in the readable part of the URL.
In general, you want a small set of text styles that are shared across the site. Name each one for why you are applying it, not for how it looks, and store that name in the DB. Then, if you decide that red is just a horrible choice of color, you can switch to a different one very easily, just by editing the CSS.
Not knowing why you chose to make something red, I can't give you much more of an answer, other than to use a CSS name that relates to why you chose red rather than to the color itself.
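The conversion step from the question could be sketched as below. The question uses PHP; Python is used here only to illustrate the idea. The class names come from the question's CSS, while the mapping table and the function name render_news are assumptions made for this example. Using (.+?) instead of (\S+?) is what lets a code wrap more than one word, since \S never matches a space.

```python
import html
import re

# Assumed mapping from bracket code to CSS class; the names are examples only.
CLASS_FOR_CODE = {"red": "news-red", "white": "news-white"}

# (.+?) with DOTALL matches several words and line breaks, unlike (\S+?).
CODE = re.compile(r"\[(red|white)\](.+?)\[/\1\]", re.DOTALL)


def render_news(text):
    """Escape the stored text, then turn colour codes into classed spans."""
    escaped = html.escape(text, quote=False)

    def to_span(match):
        css_class = CLASS_FOR_CODE[match.group(1)]
        return '<span class="{}">{}</span>'.format(css_class, match.group(2))

    return CODE.sub(to_span, escaped)


print(render_news("[red]Happy New Year[/red]"))
```

Renaming a class to describe its purpose, as suggested above, would then only mean changing the right-hand side of the mapping and the CSS rule.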

What is practical purpose for bidirectional override "bdo"?

Before coming here, I tried myself by googling. After I read these two links
http://www.w3schools.com/tags/tag_bdo.asp
http://www.w3schools.com/tags/tryit.asp?filename=tryhtml_bdo
I still don't clearly understand what the practical purpose is.
Thanks in advance for those who shed some light on this.
Pretty straightforward. If you're writing a web page in a default language, such as English, that is rendered left-to-right, and you want to include an island of text in another language, such as a quote in Hebrew, that is rendered right-to-left, you can use this tag to override the direction in which the text is written onto the page in case the bidirectional algorithm is getting it wrong. You need to make sure that the font you're using supports the appropriate character set too, of course.
http://www.w3.org/TR/html40/struct/dirlang.html
I tried the code below and noticed that the tag is apparently unnecessary, at least for Hebrew:
<!DOCTYPE html>
<html>
<body>
<p>If your browser supports bi-directional override (bdo), the next line will be written from right to left (rtl):</p>
<p>חדשות, ידיעות מהארץ והעולם - עיתון הארץ</p>
<bdo dir="rtl">חדשות, ידיעות מהארץ והעולם - עיתון הארץ</bdo>
</body>
</html>
Both seemed to output the same line, which confused me, but prompted a search that led me to the following article:
The bidirectional ordering of text in AbiWord is done automatically,
closely following the Unicode Bidirectional Algorithm (UBA; see the
Unicode Consortium website). The Unicode character set assigns each
character certain directional properties which are then used by the
UBA to order text. Thus, Hebrew or Arabic characters will
automatically be treated as right-to-left, and English characters as
left-to-right. There are some characters that are directionally
ambiguous, and how they are treated by the UBA depends on what
characters are found in their vicinity (this includes all white space
and punctuation characters).
http://fantasai.tripod.com/qref/HTML4/structure/bdo.html
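The two Hebrew lines in the test above look identical because Hebrew letters are already strongly right-to-left, so the override has nothing left to change. The effect only shows on characters the algorithm would otherwise lay out left-to-right, such as Latin letters. In plain text the same override is expressed with the Unicode control characters U+202E and U+202C. A minimal Python sketch of that counterpart follows; the helper name force_rtl is made up for illustration.

```python
import unicodedata

RLO = "\u202e"  # RIGHT-TO-LEFT OVERRIDE, like <bdo dir="rtl">
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING, like </bdo>


def force_rtl(text):
    """Wrap text so a bidi-aware renderer lays out every character RTL."""
    return RLO + text + PDF


wrapped = force_rtl("ABC 123")
# Show the names of the two control characters that now surround the text.
print([unicodedata.name(ch) for ch in (wrapped[0], wrapped[-1])])
```

A renderer that honours these controls would be expected to show the wrapped Latin text reversed, which is the kind of case where an explicit override does something the automatic ordering does not.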
Hope it helps