How to remove strings which do not belong to an HTML tag in an HTML file


I have an HTML file which contains:

<html>
<head></head>
<body><p>thanks god its Friday</p></body>
</html>a&amp; ca-79069608498"
<div class="cont" id="aka"></div>
<footer>
<div class="tent"><div class="cont"></div>
<h2><img alt="dscdsc" height="18" src="dsc.png" srcset="" width="116"/></h2>


</div>
</footer>

 ipt> (window.NORLQ=window.NORLQ||[]).push(function(){var 
ns,i,p,img;ns=document.getElementsByTagName('noscript');for(i=0;i<ns.len)>-1){img=document.createEleight'));img.setAttribute('alt',p.getAttribute('data-alt'));p.parentNode.replaceChild(img,p);}}});/*]]>*/</script><script>(window.RLQ=window.RLQ||[]).push(function(

The name of the file is a.html.

I want to remove everything after </html> in the HTML file using Python 2.7, but the strings after the closing </html> tag do not belong to a tag, and some of them are just noise. Because of that I could not do it with BeautifulSoup, and I don't know whether it is wise to use a regex on an HTML file.

How can I remove the strings after </html> and write the result back to the HTML file?

3 answers

#1

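With a regex:

import re
...
# [\s\S] matches any character, including newlines, so everything
# after the first </html> is dropped and the tag itself is kept.
newhtml = re.sub(r'</html>[\s\S]*', '</html>', oldhtml)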
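
As a rough sketch of the same idea wrapped in a reusable function: the helper name `strip_after_html` is made up for illustration, and the pattern additionally assumes the closing tag might appear in another letter case or with whitespace before the `>`.

```python
import re

# Matches the first closing html tag (any letter case, optional whitespace
# before ">") together with everything that follows it.
_AFTER_HTML = re.compile(r'(</html\s*>).*', re.IGNORECASE | re.DOTALL)


def strip_after_html(text):
    """Return text cut off right after its first </html> tag.

    Text without a closing html tag is returned unchanged.
    """
    # \1 puts the tag back exactly as it was written in the input.
    return _AFTER_HTML.sub(r'\1', text, count=1)
```

With this, `newhtml = strip_after_html(oldhtml)` would replace the one-liner above. Whether the extra tolerance is needed depends on how the files were produced.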

#2

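# Keep everything before the first </html>, then put the tag itself back.
with open(path, "r") as f:
    a = f.read()
b = a.split('</html>', 1)[0] + '</html>'
with open(path, 'w') as f:
    f.write(b)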
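
A possible variant of this approach, sketched with `str.partition` so that a file with no `</html>` at all is left untouched. The function name `truncate_file_after_html` and the UTF-8 default are assumptions for illustration, not something stated in the question.

```python
import io


def truncate_file_after_html(path, encoding="utf-8"):
    """Rewrite the file at `path` so it ends with its first </html> tag.

    Returns True if the file was rewritten, False if it was left as-is
    (no </html> found, or nothing follows it).
    """
    # newline="" keeps the file's original line endings unchanged.
    with io.open(path, "r", encoding=encoding, newline="") as f:
        text = f.read()
    head, tag, tail = text.partition(u"</html>")
    if not tag or not tail:
        return False
    with io.open(path, "w", encoding=encoding, newline="") as f:
        f.write(head + tag)
    return True
```

For the file in the question this would be called as `truncate_file_after_html("a.html")`. `io.open` is used because it behaves the same on Python 2.7 and Python 3.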

#3

Python has a module called HTMLParser for dealing with this sort of problem.

While the proposed regex seems to handle your problem well for now, it can be hard to debug once it meets edge-case HTML that it cannot handle.

Therefore I am proposing an HTMLParser solution, which gives you more control over the parsing behaviour.

Example:

from HTMLParser import HTMLParser


class MyHTMLParser(HTMLParser):
    buffer = ""
    end_of_html = False

    def get_html(self):
        return self.buffer

    def handle_starttag(self, tag, attrs):
        if not self.end_of_html:
            value = "<" + tag
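            for name, val in attrs:
                # Put a space before each attribute and quote its value;
                # val is None for attributes written without a value.
                value += " " + name
                if val is not None:
                    value += '="' + val + '"'
            self.buffer += value + ">"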

    def handle_data(self, data):
        if not self.end_of_html:
            self.buffer += data

    def handle_endtag(self, tag):
        if not self.end_of_html:
            self.buffer += "</" + tag + ">"
        if tag == "html":
            self.end_of_html = True


parser = MyHTMLParser()
parser.feed("""<html>
</div>
<head></head>
<body><p>thanks god its Friday</p></body>
</html>a&amp; ca-79069608498"
<div class="cont" id="aka"></div>
<footer>
<div class="tent"><div class="cont"></div>
<h2><img alt="dscdsc" height="18" src="dsc.png" srcset="" width="116"/></h2>


</div>
</footer>

 ipt> (window.NORLQ=window.NORLQ||[]).push(function(){var
ns,i,p,img;ns=document.getElementsByTagName('noscript');for(i=0;i<ns.len)>-1){img=document.createEleight'));img.setAttribute('alt',p.getAttribute('data-alt'));p.parentNode.replaceChild(img,p);}}});/*]]>*/</script><script>(window.RLQ=window.RLQ||[]).push(function(
        """)

print parser.get_html()

Output:

<html>
</div>
<head></head>
<body><p>thanks god its Friday</p></body>
</html>
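
The parser above rebuilds the markup from parsed pieces, so attribute quoting and entities may come out differently from the input. One alternative, sketched here under the assumption that `getpos()` reports the position where the tag currently being handled starts, is to use the parser only to locate the first `</html>` and then slice the original text. The names `HtmlEndFinder` and `cut_after_html` are hypothetical.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class HtmlEndFinder(HTMLParser):
    """Records where the first </html> end tag starts in the fed text."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.end_pos = None  # (line, column) of the first </html>, if any

    def handle_endtag(self, tag):
        if tag == "html" and self.end_pos is None:
            self.end_pos = self.getpos()


def cut_after_html(text):
    """Return text up to and including its first </html> end tag."""
    finder = HtmlEndFinder()
    finder.feed(text)
    if finder.end_pos is None:
        return text  # no closing tag found; leave the input alone
    line, col = finder.end_pos
    # Convert the 1-based line and 0-based column into an absolute index.
    lines = text.split("\n")
    start = sum(len(l) + 1 for l in lines[:line - 1]) + col
    end = text.find(">", start) + 1
    return text[:end]
```

Because the result is a slice of the input, everything before the cut is preserved exactly as it appeared in the file.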