How to remove strings that do not belong to an HTML tag in an HTML file

Translated from a post by BlanketSniffer, 2017/07/21. Tags: regex, beautifulsoup, python-2.7

I have an HTML file that contains the following:

<html>
<head></head>
<body><p>thanks god its Friday</p></body>
</html>a&amp; ca-79069608498"
<div class="cont" id="aka"></div>
<footer>
<div class="tent"><div class="cont"></div>
<h2><img alt="dscdsc" height="18" src="dsc.png" srcset="" width="116"/></h2>


</div>
</footer>

 ipt> (window.NORLQ=window.NORLQ||[]).push(function(){var 
ns,i,p,img;ns=document.getElementsByTagName('noscript');for(i=0;i<ns.len)>-1){img=document.createEleight'));img.setAttribute('alt',p.getAttribute('data-alt'));p.parentNode.replaceChild(img,p);}}});/*]]>*/</script><script>(window.RLQ=window.RLQ||[]).push(function(

The file is named a.html.

I want to remove everything after </html> in the file using Python 2.7. The strings after the closing tag do not belong to any tag, and some of them are just noise, so I could not do it with BeautifulSoup, and I don't know whether it is wise to use a regex on an HTML file.

How can I remove the strings after </html> and write the result back to the HTML file?

3 Answers

#1

With a regex:

import re
...
# [\s\S]+ matches everything after </html>, including newlines
newhtml = re.sub(r'</html>[\s\S]+', '</html>', oldhtml)
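For reference, here is a minimal self-contained sketch of the same regex idea. The helper name `strip_after_html` and the sample string are made up for illustration, and the case-insensitive match with optional whitespace before `>` is an added assumption, not part of the answer above.

```python
import re

# Matches the first closing </html> tag (any case, optional whitespace
# before ">") together with everything that follows it.
_AFTER_HTML = re.compile(r'(</html\s*>)[\s\S]*', re.IGNORECASE)


def strip_after_html(text):
    """Return text truncated right after the first closing </html> tag.

    Hypothetical helper; text without a closing tag is returned unchanged.
    """
    return _AFTER_HTML.sub(r'\1', text, count=1)


# Illustrative input: valid markup followed by trailing noise.
sample = '<html>\n<body><p>hi</p></body>\n</HTML>a&amp; junk"\n<div></div>'
print(strip_after_html(sample))
```

Capturing the tag and substituting it back keeps the tag exactly as it was written in the file.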

#2

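with open(path, "r") as f:
    a = f.read()
# partition keeps the separator, so the closing </html> tag is preserved
b = "".join(a.partition('</html>')[:2])
with open(path, "w") as f:
    f.write(b)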
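A slightly fuller sketch of the read, cut, write-back approach, in case it helps. The function name `truncate_file_after_html`, the UTF-8 encoding, and the temporary demo file are all assumptions for illustration; adjust them to your actual file.

```python
import io
import os
import tempfile


def truncate_file_after_html(path, marker="</html>"):
    """Rewrite the file so it ends right after the first marker.

    Hypothetical helper. Returns False and leaves the file untouched
    when the marker is not found. UTF-8 is assumed here.
    """
    with io.open(path, "r", encoding="utf-8") as f:
        content = f.read()
    head, sep, _tail = content.partition(marker)
    if not sep:
        return False
    with io.open(path, "w", encoding="utf-8") as f:
        f.write(head + sep)
    return True


# Demo on a throwaway copy, so no real file is modified.
demo_path = os.path.join(tempfile.mkdtemp(), "a.html")
with io.open(demo_path, "w", encoding="utf-8") as f:
    f.write(u"<html>\n<body></body>\n</html>a&amp; junk\n<div></div>")

changed = truncate_file_after_html(demo_path)
with io.open(demo_path, "r", encoding="utf-8") as f:
    result = f.read()
print(result)
```

Checking for the marker before writing avoids emptying or rewriting a file that never contained a closing tag.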

#3

Python has a module called HTMLParser for dealing with this sort of problem.

While the proposed regex seems to handle your problem for now, it can be hard to debug when it fails on edge-case HTML.

Therefore I am proposing an HTMLParser solution, which gives you more control over the parsing behaviour.

Example:

from HTMLParser import HTMLParser  # Python 2.7; in Python 3 this lives in html.parser


class MyHTMLParser(HTMLParser):
    buffer = ""
    end_of_html = False

    def get_html(self):
        return self.buffer

    def handle_starttag(self, tag, attrs):
        if not self.end_of_html:
            value = "<" + tag
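            for name, val in attrs:
                # Re-emit each attribute as name="value"; val is None for bare attributes
                value += (' ' + name) if val is None else (' %s="%s"' % (name, val))
            self.buffer += value + ">"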

    def handle_data(self, data):
        if not self.end_of_html:
            self.buffer += data

    def handle_endtag(self, tag):
        if not self.end_of_html:
            self.buffer += "</" + tag + ">"
        if tag == "html":
            self.end_of_html = True


parser = MyHTMLParser()
parser.feed("""<html>
</div>
<head></head>
<body><p>thanks god its Friday</p></body>
</html>a&amp; ca-79069608498"
<div class="cont" id="aka"></div>
<footer>
<div class="tent"><div class="cont"></div>
<h2><img alt="dscdsc" height="18" src="dsc.png" srcset="" width="116"/></h2>


</div>
</footer>

 ipt> (window.NORLQ=window.NORLQ||[]).push(function(){var
ns,i,p,img;ns=document.getElementsByTagName('noscript');for(i=0;i<ns.len)>-1){img=document.createEleight'));img.setAttribute('alt',p.getAttribute('data-alt'));p.parentNode.replaceChild(img,p);}}});/*]]>*/</script><script>(window.RLQ=window.RLQ||[]).push(function(
        """)

print parser.get_html()

Output:

<html>
</div>
<head></head>
<body><p>thanks god its Friday</p></body>
</html>
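
A variation on the same idea, offered as a sketch rather than a drop-in answer: use the parser only to locate the first closing html tag, then slice the original text at that point, so nothing before the cut has to be rebuilt. The names `HtmlEndLocator` and `cut_after_html` are hypothetical. The import fallback is meant to cover both Python 3 and 2.7, though Python 2.7's parser may be stricter about the malformed trailing text.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class HtmlEndLocator(HTMLParser):
    """Records where the first closing </html> tag starts (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.end_pos = None  # (line, column) of the first </html>, if any

    def handle_endtag(self, tag):
        # The parser passes tag names in lower case.
        if tag == "html" and self.end_pos is None:
            self.end_pos = self.getpos()


def cut_after_html(text):
    """Return text up to and including the first </html>; unchanged if absent."""
    locator = HtmlEndLocator()
    locator.feed(text)
    if locator.end_pos is None:
        return text
    line, col = locator.end_pos  # line is 1-based, col is 0-based
    lines = text.split("\n")
    # Convert (line, column) into an absolute index into text.
    start = sum(len(l) + 1 for l in lines[:line - 1]) + col
    end = text.index(">", start) + 1
    return text[:end]


# Illustrative input: valid markup followed by trailing noise.
sample = '<html>\n<body><p class="x">hi</p></body>\n</html>a&amp; junk"\n<div id="aka"></div>'
print(cut_after_html(sample))
```

Because the result is a slice of the input, attributes, quoting, and self-closing tags before the cut stay exactly as they were in the file.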
