I have an HTML file that contains:
<html>
<head></head>
<body><p>thanks god its Friday</p></body>
</html>a& ca-79069608498"
<div class="cont" id="aka"></div>
<footer>
<div class="tent"><div class="cont"></div>
<h2><img alt="dscdsc" height="18" src="dsc.png" srcset="" width="116"/></h2>
</div>
</footer>
ipt> (window.NORLQ=window.NORLQ||[]).push(function(){var
ns,i,p,img;ns=document.getElementsByTagName('noscript');for(i=0;i<ns.len)>-1){img=document.createEleight'));img.setAttribute('alt',p.getAttribute('data-alt'));p.parentNode.replaceChild(img,p);}}});/*]]>*/</script><script>(window.RLQ=window.RLQ||[]).push(function(
The file is named a.html.
I want to remove everything after </html> in the HTML file using Python 2.7. The text after the closing </html> tag is not all inside tags, and some of it is just noise, so I could not do this with BeautifulSoup, and I don't know whether it is wise to use a regex on an HTML file.
How can I remove the text after </html> and write the result back to the HTML file?
0
With a regex:
import re
...
# [\s\S]* matches everything after the first </html>, newlines included.
newhtml = re.sub(r'</html>[\s\S]*', '</html>', oldhtml)
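As a rough sketch of how that substitution might be wrapped up for reuse, the snippet below compiles the pattern once and applies it to a sample string. The names `TRAILING` and `strip_after_html` and the sample text are made up for illustration, and matching the tag case-insensitively (with optional whitespace before `>`) is an added assumption rather than part of the answer above.

```python
import re

# Match the first closing </html> tag (any capitalisation) and everything after it.
TRAILING = re.compile(r'(</html\s*>)[\s\S]*', re.IGNORECASE)


def strip_after_html(oldhtml):
    """Return oldhtml truncated just after the first </html> tag."""
    # \1 puts the tag back exactly as it was written in the input.
    return TRAILING.sub(r'\1', oldhtml, count=1)


oldhtml = '<html>\n<body><p>hi</p></body>\n</HTML>a& ca-79069608498"\n<div></div>'
newhtml = strip_after_html(oldhtml)
print(newhtml)
```

Input without any closing tag is returned unchanged, because the pattern simply does not match.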
0
# Keep everything before the first </html>, then restore the tag that split() removes.
a = open(path, "r").read()
b = a.split('</html>', 1)[0] + '</html>'
open(path, 'w').write(b)
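A possible variant of the same idea uses `str.partition`, which returns the separator along with the text before it, so the closing tag is kept when present and nothing is added when it is absent. This is only a sketch: `truncate_file_after_html` is a hypothetical helper name, and the temporary file stands in for a.html.

```python
import os
import tempfile


def truncate_file_after_html(path):
    """Rewrite the file at path, dropping everything after the first </html>."""
    with open(path, "r") as f:
        content = f.read()
    head, tag, _tail = content.partition("</html>")
    # tag is "" when </html> is absent, so such files are rewritten unchanged.
    with open(path, "w") as f:
        f.write(head + tag)


# Demonstration on a temporary file standing in for a.html.
fd, demo_path = tempfile.mkstemp(suffix=".html")
with os.fdopen(fd, "w") as f:
    f.write("<html><body><p>hi</p></body></html>noise <div></div>")
truncate_file_after_html(demo_path)
with open(demo_path) as f:
    result = f.read()
os.remove(demo_path)
```

Using `with` blocks also ensures the file is closed before it is reopened for writing.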
0
Python has a module called HTMLParser for dealing with this sort of problem.
While the proposed regex seems to handle your problem well for now, it can be hard to debug when it fails on edge-case HTML.
Therefore I am proposing an HTMLParser solution, which gives you more control over the parsing behaviour.
Example:
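from HTMLParser import HTMLParser

class MyHTMLParser(HTMLParser):
    # Rebuilds the markup seen so far and stops recording once </html> is closed.
    # Comments and entity references are not re-emitted by this example.
    buffer = ""
    end_of_html = False

    def get_html(self):
        return self.buffer

    def handle_starttag(self, tag, attrs):
        if not self.end_of_html:
            value = "<" + tag
            for name, val in attrs:
                # Re-emit each attribute as name="value"; val is None for valueless attributes.
                if val is None:
                    value += " " + name
                else:
                    value += ' %s="%s"' % (name, val)
            self.buffer += value + ">"

    def handle_data(self, data):
        if not self.end_of_html:
            self.buffer += data

    def handle_endtag(self, tag):
        if not self.end_of_html:
            self.buffer += "</" + tag + ">"
            if tag == "html":
                self.end_of_html = True

parser = MyHTMLParser()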
parser.feed("""<html>
</div>
<head></head>
<body><p>thanks god its Friday</p></body>
</html>a& ca-79069608498"
<div class="cont" id="aka"></div>
<footer>
<div class="tent"><div class="cont"></div>
<h2><img alt="dscdsc" height="18" src="dsc.png" srcset="" width="116"/></h2>
</div>
</footer>
ipt> (window.NORLQ=window.NORLQ||[]).push(function(){var
ns,i,p,img;ns=document.getElementsByTagName('noscript');for(i=0;i<ns.len)>-1){img=document.createEleight'));img.setAttribute('alt',p.getAttribute('data-alt'));p.parentNode.replaceChild(img,p);}}});/*]]>*/</script><script>(window.RLQ=window.RLQ||[]).push(function(
""")
print parser.get_html()
Output:
<html>
</div>
<head></head>
<body><p>thanks god its Friday</p></body>
</html>
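Rebuilding the markup from parser events can alter the kept portion (attribute quoting, entities, comments). One alternative, sketched here under assumptions, is to use the parser only to locate the first closing html tag via `getpos()` and then slice the original string, so the retained text stays byte-for-byte identical. `HtmlEndFinder` and `truncate_after_html` are hypothetical names, the whole text is assumed to be passed in a single `feed()` call, and the import fallback is only there so the same code runs on Python 3 as well as 2.7.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7


class HtmlEndFinder(HTMLParser):
    """Records where the first </html> end tag starts (hypothetical helper)."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.end_pos = None  # (line, column) of the first </html>

    def handle_endtag(self, tag):
        if tag == "html" and self.end_pos is None:
            # getpos() gives (1-based line, 0-based column) of the tag being handled.
            self.end_pos = self.getpos()


def truncate_after_html(text):
    """Return text cut just after the first </html>, or unchanged if none is found."""
    finder = HtmlEndFinder()
    finder.feed(text)
    if finder.end_pos is None:
        return text
    line, col = finder.end_pos
    # Convert (line, column) into an absolute index into text.
    start = sum(len(ln) + 1 for ln in text.split("\n")[:line - 1]) + col
    end = text.index(">", start) + 1
    return text[:end]


sample = "<html>\n<body><p>hi</p></body>\n</HTML>tail <div>x</div>"
print(truncate_after_html(sample))
```

Because the parser lowercases tag names before calling `handle_endtag`, this also finds `</HTML>`, which a plain string search for `</html>` would miss.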