Removing duplicate phrases from a bunch of text files?


INPUT:
The text files contain text from a news website but without the html tags.
Some of the sentences don't have full stops. Some sentences are made up of phrases scraped from navigation links and joined on one line.
DESIRED OUTPUT:
Same text files but without the duplicate phrases.
Possible approach:
First reduce the text file sizes by removing the stop words, then remove duplicate text files (if any), then apply the magic from here
Thanks in advance
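One way to sketch the deduplication step in Python, under some loud assumptions: a "phrase" is taken to be one line of a file, two phrases count as duplicates when they match after collapsing whitespace and ignoring case, and the first occurrence across all files is the one kept. The names `normalize`, `dedupe_texts` and `dedupe_files` are made up for this illustration, and cleaned copies are written to a separate directory rather than over the originals.

```python
from pathlib import Path


def normalize(phrase):
    """Collapse runs of whitespace and ignore case when comparing phrases."""
    return " ".join(phrase.split()).casefold()


def dedupe_texts(texts):
    """Return the texts with every repeated line dropped.

    The first occurrence of a line (across all texts, in order) is kept.
    Blank lines are always kept so paragraph spacing survives.
    """
    seen = set()
    cleaned = []
    for text in texts:
        kept = []
        for line in text.splitlines():
            key = normalize(line)
            if not key:
                kept.append(line)
                continue
            if key in seen:
                continue
            seen.add(key)
            kept.append(line)
        cleaned.append("\n".join(kept))
    return cleaned


def dedupe_files(paths, out_dir):
    """Write deduplicated copies of the given files into out_dir."""
    paths = [Path(p) for p in paths]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    texts = [p.read_text(encoding="utf-8") for p in paths]
    for path, text in zip(paths, dedupe_texts(texts)):
        (out_dir / path.name).write_text(text, encoding="utf-8")
```

If the navigation phrases are joined on one line with a separator, the same idea should work after splitting each line on that separator first.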

Related

Highlighting words and strings in an HTML span programmatically

I am writing a web application. In it, we display a large block of text in a span control. The text may contain several sentences (strings of words). I want to highlight a few words and a few sentences in a separate color.
The large text displayed in the span is the output of a service.
Service2 will return a few words as an array; those words should be highlighted in bold.
Service3 will return a few sentences; if those sentences are present in the large text, they should be highlighted with some color.
Hope I am clear on my question. Experts, please help.
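A minimal server-side sketch of one possible approach, in Python. It assumes the text, the word array and the sentence list are already available as plain strings, that words should match on word boundaries, and that sentences match verbatim. The function names and the `hl` CSS class are hypothetical. The text is escaped piece by piece so the inserted tags are the only markup in the result.

```python
import html
import re


def bold_words(text, words):
    """Escape text and wrap whole-word matches of `words` in <b> tags."""
    if not words:
        return html.escape(text)
    alternatives = "|".join(
        re.escape(w) for w in sorted(words, key=len, reverse=True)
    )
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    out, pos = [], 0
    for match in pattern.finditer(text):
        out.append(html.escape(text[pos:match.start()]))
        out.append("<b>" + html.escape(match.group(0)) + "</b>")
        pos = match.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)


def highlight(text, words, sentences, css_class="hl"):
    """Wrap sentences found in text in a <span>, bolding words throughout."""
    present = [s for s in sentences if s and s in text]
    if not present:
        return bold_words(text, words)
    pattern = re.compile(
        "|".join(re.escape(s) for s in sorted(present, key=len, reverse=True))
    )
    out, pos = [], 0
    for match in pattern.finditer(text):
        out.append(bold_words(text[pos:match.start()], words))
        out.append(
            f'<span class="{css_class}">'
            + bold_words(match.group(0), words)
            + "</span>"
        )
        pos = match.end()
    out.append(bold_words(text[pos:], words))
    return "".join(out)
```

The returned string would go inside the existing span, with the color defined by a CSS rule for the chosen class.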

Close tags dropping below highlighted line

I have minimal experience with HTML script so this may all go horribly wrong here.
Alright so I have a very simple yet very time consuming task of taking complete papers and converting them into HTML script. I'm using Sublime Text 3 with Emmet plugin.
Basically,
This is the first header
This is the first paragraph that needs to be tagged
This is the second header
This is the second paragraph that needs to be tagged
So super simple I need to put header tags on the headers and paragraph tags on the paragraphs.
What I have been doing is holding Ctrl and manually highlighting the desired text as it is all rather random. Problem is that takes forever to manually highlight the text like that.
I am aware of other ways to highlight such as Ctrl + L for the line. Problem is my close tags end up under the highlighted line.
Example:
<h2>This is the first header
</h2><p>This is the first paragraph that needs to be tagged
</p>
It's not a big deal but it makes the code harder to go through later and really chaotic.
The same problem persists if I click the corresponding number of the line.
Seeing as I have hundreds of pages to enter and even more headers, paragraphs, and pictures to properly tag, I'm looking for a solution to the tag dropping below the line or a faster method for entering text.
So, is there a fast method for entering text from a Word document into Sublime Text and quickly getting the corresponding tags? e.g. <h2>, <h3>, <p>, <ul>, <li> and so on.
Any help will save my sanity, thanks.

When you select a line with Ctrl+L, it automatically selects the entire line, and moves the cursor down to the first position on the following line. There are two ways around this. The first is to place the cursor in the first position on the line you want to select, then just hit Shift+End and the line will be selected, with the cursor now sitting in the last position on that same line. Alternatively, use Ctrl+L, then hit Shift+← (left arrow) to move the cursor from the first position on the next line to the last position on the selected line. Either way, you can now hit the key combo in Emmet for inserting a tag pair, and you're all set.

Adding Vertical Space in Sphinx Documents

I am using Sphinx to build LaTeX and HTML documents with a lot of figures and enumerated lists. When I use figures in the middle of text outside of enumerated lists, the spacing is fine in both LaTeX and HTML, with and without captions. There is about a line of space above and below, which is acceptable. However, when I try to use a figure within enumerated lists, as in the example below, the spacing is bad in HTML.
#. Here is an item in the list, above the figure

   .. figure:: _images/myimage.png
      :align: center
      :width: 80 %

#. Here is another item below the figure.
The result of the above code is the bottom of the figure is right up against the next item in the list. There is no spacing between them, and this looks bad. This can be fixed in HTML by using the | character at the end of the figure to add a little space, but in the LaTeX output, this causes a DUlineblock environment that adds way too much space in the pdf.
Is there a way to simply add a single blank line after the figure in both HTML and LaTeX?

You can enter empty lines with:
text
|
text

I found that the substitution:
.. |br| raw:: html

   <br />

works well for adding a blank line after a figure in enumerated lists. Since it's a raw HTML substitution, it only affects the HTML output, and the figure spacing in LaTeX is fine without modification.
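If the |br| substitution is wanted in every source file, one option is to define it once in the project's conf.py through Sphinx's rst_prolog setting, which is prepended to each file that is read. This is only a sketch and assumes no other rst_prolog is already set in the project:

```python
# conf.py (fragment)
# rst_prolog is inserted at the top of every reStructuredText source file,
# so the |br| substitution becomes available everywhere without repeating it.
rst_prolog = """
.. |br| raw:: html

   <br />
"""
```

With that in place, writing |br| on its own indented line after the figure directive should add the extra break in HTML only.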

How to have two different types of footnote references in the same HTML document?

I need to have two different types of footnote references on the same HTML text, one with numbers (1,2,3...), the other with letters (a,b,c...).
Ideally, I would have a paragraph in the left column of a two-column table -- this is the main paragraph, i.e. the master text.
The right column would contain numbered footnotes to certain words of the paragraph in the left column. Just below this, I would have a unified row (with one column) containing a,b,c footnotes to certain words in the main paragraph above.
Is this even possible in HTML? How to implement this?
In case HTML cannot support this, what alternatives do you suggest?
Thanks

Two links will be of help to you. First, read this previous question on SO, which explains how to create hyperlinked footnotes:
How do I create a link to a footnote in HTML?
Further, if you want to make the reference character superscript or subscript, here is an excellent example:
http://www.w3schools.com/tags/tryit.asp?filename=tryhtml_sup
So, combined you can do something like this:
<p>Some text that has a reference<a id="ref-1" href="#ref-1"><sup>1</sup></a></p>
which appears like so:
Some text that has a reference¹
I hope this is the sort of thing you're looking for.
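To keep a numbered series and a lettered series apart, each series can get its own id prefix. Below is a rough Python sketch that generates such markup; the `ref-`/`fn-` id scheme, the series names `num` and `alpha`, and the function names are all invented for illustration, and the lettered series only covers a to z.

```python
import html
import string


def marker(series, n):
    """Return the visible marker: 1, 2, 3... for "num"; a, b, c... for "alpha"."""
    if series == "num":
        return str(n)
    return string.ascii_lowercase[n - 1]  # supports n from 1 to 26 only


def reference(series, n):
    """Superscript link placed in the main text, pointing at the footnote."""
    m = marker(series, n)
    return (
        f'<a id="ref-{series}-{m}" href="#fn-{series}-{m}">'
        f"<sup>{m}</sup></a>"
    )


def footnote(series, n, text):
    """Footnote body with a back-link to the reference in the main text."""
    m = marker(series, n)
    return (
        f'<p id="fn-{series}-{m}">'
        f'<a href="#ref-{series}-{m}">{m}</a>. {html.escape(text)}</p>'
    )
```

The numbered footnotes could then be emitted into the right-hand column and the lettered ones into the row below, since the ids never collide.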

Best practices: displaying text that was input via multi-line text box

I have a multi-line text box. When users simply type away, the text box wraps the text, and it's saved as a single line. It's also possible that users may enter line breaks, for example when entering a "bulleted" list like:
Here are some suggestions:
- fix this
- remove that
- and another thing
Now, the problem occurs when I try to display the value of this field. In order to preserve the formatting, I currently wrap the presentation in <pre>. This works to preserve user-supplied breaks, but when there's a lot of text saved as a single line, it displays the whole text block as a single line, so horizontal scrolling is needed to see everything.
Is there a graceful way to handle both of these cases?

The easiest way of dealing with this is turning all line breaks \n into <br> line breaks. In PHP for example, this is done using the nl2br() function.
If you want something a bit more fancy - like the list you quote getting converted into an actual HTML <ul> for example - you could consider a simple "language" like Markdown that SO uses. It comes with natural, simple rules like
# Heading 1
## Heading 2
### Heading 3
* Unordered List item
* Unordered List item
1. Numbered List item
2. Numbered List item
etc....
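As a rough idea of what the "fancier" route involves, here is a tiny hand-rolled Python sketch that handles only one rule: lines starting with "- " or "* " become items of a <ul>, and every other non-empty line becomes a paragraph. It is not a Markdown implementation, and the function name `render` is made up; a real Markdown library covers far more cases.

```python
import html


def render(raw):
    """Convert plain text to HTML, turning "- "/"* " lines into a <ul>."""
    out = []
    in_list = False
    for line in raw.replace("\r\n", "\n").split("\n"):
        if line.startswith(("- ", "* ")):
            if not in_list:
                out.append("<ul>")
                in_list = True
            out.append(f"<li>{html.escape(line[2:])}</li>")
            continue
        if in_list:
            # A non-list line ends the current list.
            out.append("</ul>")
            in_list = False
        if line:
            out.append(f"<p>{html.escape(line)}</p>")
    if in_list:
        out.append("</ul>")
    return "\n".join(out)
```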

You can use the PHP function nl2br(). It transforms line breaks into <br /> elements.

Convert newline characters to <br /> tags explicitly, and let the browser word-wrap the text normally. That preserves the breaks the visitor entered, without harming other paragraphs.

You could replace line breaks with HTML line breaks.
Replace "\r\n" or "\n" (depending on the browser and platform; check for the longer one first) with <br/>.

I would normally replace all CR/LF with LF, and then replace all LF with <br />. You can then render this text inside any HTML container you want and let it flow naturally.
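The normalize-then-replace idea looks roughly like this in Python. The function name `text_to_html` is hypothetical, and the HTML escaping step is an addition not mentioned above; it is there so user-typed angle brackets are shown as text rather than interpreted as tags.

```python
import html


def text_to_html(raw):
    """Normalize line endings, escape the text, then insert <br /> tags."""
    # Handle CR/LF first so a Windows line ending is not counted twice.
    normalized = raw.replace("\r\n", "\n").replace("\r", "\n")
    return html.escape(normalized).replace("\n", "<br />\n")
```

The result can be placed in an ordinary block element such as a <div> or <p>, so long lines wrap normally while the entered breaks are kept.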
