I'm trying to pull the parsable-cite info from this webpage using python. For example, for the page listed I would pull pl/111/148 and pl/111/152. My current regex is listed below, but it seems to return everything after parsable cite. It's probably something simple, but I'm relatively new to regexes. Thanks in advance.
re.findall(r'^parsable-cite=.*>$',page)
2
I highly recommend using this regex, which captures exactly what you want:
re.findall(r'parsable-cite=\\\"(.*?)\\\"\>',page)
Explanation:
parsable-cite= matches the characters parsable-cite= literally (case sensitive)
\\ matches the character \ literally
\" matches the character " literally
1st Capturing group (.*?)
.*? matches any character (except newline)
Quantifier: Between zero and unlimited times, as few times as possible,
expanding as needed
\\ matches the character \ literally
\" matches the character " literally
\> matches the character > literally
Using ? after .* is the key ;) It makes the quantifier lazy, so the group stops at the first closing \">.
Hope this helps.
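As a quick check of the lazy group described above, here is a minimal sketch. The sample text is hypothetical: it assumes, as this answer's regex does, that the HTML is embedded in a JSON string so every quote appears as \". The greedy variant is included only for comparison.

```python
import re

# Hypothetical sample: HTML embedded in a JSON string, so each quote appears as \"
page = (
    r'<external-xref legal-doc=\"public-law\" parsable-cite=\"pl/111/148\">'
    r'Public Law 111-148</external-xref> and '
    r'<external-xref legal-doc=\"public-law\" parsable-cite=\"pl/111/152\">'
    r'Public Law 111-152</external-xref>'
)

# Lazy group: stops at the first \" that is followed by >
lazy = re.findall(r'parsable-cite=\\\"(.*?)\\\"\>', page)

# Greedy group, for comparison: runs on to the last \"> in the text
greedy = re.findall(r'parsable-cite=\\\"(.*)\\\"\>', page)

print(lazy)         # two short matches, one per tag
print(len(greedy))  # a single match that swallows both tags
```

On this sample the lazy pattern yields one cite per tag, while the greedy one returns a single long string running from the first cite to the second.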
1
Make your regex lazy:
re.findall(r'^parsable-cite=.*?>$',page)
The ? added after .* is what makes the quantifier lazy.
Or use a negated class (preferable):
re.findall(r'^parsable-cite=[^>]*>$',page)
.* is greedy by default and will try to match as much as possible before concluding a match.
regex101 demo
If you want only the part you need, use a capture group:
re.findall(r'^parsable-cite=([^>]*)>$',page)
regex101 demo
Though, from the layout of your webpage, it doesn't seem that you need the anchors (^ and $), unless the newlines were somehow removed on the site.
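To illustrate the point about anchors, here is a small sketch. The sample line is hypothetical and assumes the attribute sits in the middle of a longer line with JSON-escaped quotes. Note that the negated-class capture group keeps the surrounding \" delimiters, so they are stripped in a second step.

```python
import re

# Hypothetical sample: the attribute appears mid-line, quotes escaped as \"
page = r'see <external-xref parsable-cite=\"pl/111/148\">Public Law 111-148</external-xref> for details'

# ^ and $ pin the match to the start and end of the string, so nothing is found
anchored = re.findall(r'^parsable-cite=([^>]*)>$', page)

# Without anchors the negated class finds the attribute wherever it occurs
unanchored = re.findall(r'parsable-cite=([^>]*)>', page)

# The group still includes the \" delimiters; strip backslashes and quotes
cleaned = [value.strip('\\"') for value in unanchored]

print(anchored)  # []
print(cleaned)
```

If the text really does have one attribute per line, passing re.MULTILINE would let ^ and $ match at line boundaries instead of only at the ends of the whole string.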
1
The .* you have there is "greedy", meaning it will match as much as it can, including any number of > characters and whatever comes after them.
If what you really want is "everything up to the next >", then you should say [^>]*> instead, meaning "any number of non-> characters, then a >".
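The difference can be seen by comparing what each pattern consumes on one line. This is only a sketch on a hypothetical sample string that contains more than one > character.

```python
import re

# Hypothetical sample containing two > characters
text = r'parsable-cite=\"pl/111/148\">Public Law 111-148</external-xref>'

# Greedy .* runs to the LAST > on the line
greedy = re.search(r'parsable-cite=.*>', text).group()

# [^>]* cannot cross a >, so the match ends at the FIRST one
negated = re.search(r'parsable-cite=[^>]*>', text).group()

print(greedy)
print(negated)
```

Here the greedy match covers the entire sample, while the negated-class match ends right after the attribute value.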
1
Maybe something like this:
(?<=parsable-cite=\\\")\w{2}\/\d{3}\/\d{3}
http://regex101.com/r/kE9uE3
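A possible way to use this lookbehind pattern from Python is sketched below. The sample text is hypothetical and assumes JSON-escaped quotes. Because the lookbehind is not part of the match, findall returns the cite directly without a capture group. The fixed \w{2} and \d{3} counts mean cites of a different shape are skipped, as the third tag in the sample shows.

```python
import re

# Hypothetical sample with JSON-escaped quotes; the third cite has a different shape
page = (
    r'<external-xref parsable-cite=\"pl/111/148\">A</external-xref> '
    r'<external-xref parsable-cite=\"pl/111/152\">B</external-xref> '
    r'<external-xref parsable-cite=\"usc/42/18001\">C</external-xref>'
)

# The lookbehind requires parsable-cite=\" just before the match without consuming it
pattern = re.compile(r'(?<=parsable-cite=\\\")\w{2}/\d{3}/\d{3}')

cites = pattern.findall(page)
print(cites)
```

Python's re module accepts this lookbehind because it has a fixed width.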
1
Although this is a JSON string with HTML embedded inside, you can still use BeautifulSoup for this purpose:
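import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(htmls, "html.parser")  # htmls holds the HTML text taken from the JSON
# re.compile("") matches any value, so this selects every external-xref that has the attribute
tags = soup.find_all("external-xref", {"parsable-cite": re.compile("")})
for t in tags:
    print(t["parsable-cite"])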
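If installing BeautifulSoup is not an option, a similar result may be possible with the standard library alone. The sketch below swaps in json.loads plus html.parser.HTMLParser in place of BeautifulSoup. The payload and its "html" key are assumptions made for illustration, and CiteCollector is a hypothetical helper name; the real page's JSON structure is not shown in the question.

```python
import json
from html.parser import HTMLParser


class CiteCollector(HTMLParser):
    """Collect the parsable-cite attribute of every external-xref start tag."""

    def __init__(self):
        super().__init__()
        self.cites = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this
        if tag == "external-xref":
            cite = dict(attrs).get("parsable-cite")
            if cite is not None:
                self.cites.append(cite)


# Hypothetical payload: the key name "html" is an assumption
raw = (
    r'{"html": "<p><external-xref parsable-cite=\"pl/111/148\">A</external-xref>'
    r' and <external-xref parsable-cite=\"pl/111/152\">B</external-xref></p>"}'
)

# Decoding the JSON first turns every \" back into a plain quote
html = json.loads(raw)["html"]

collector = CiteCollector()
collector.feed(html)
print(collector.cites)
```

Decoding the JSON before parsing means the parser sees ordinary HTML attributes, so no backslash handling is needed.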
1
This might work if the value is between \" delimiters:
# \bparsable-cite\s*=\s*\"((?s:(?!\").)*)\"
\b
parsable-cite
\s* = \s*
\"
( # (1 start)
(?s:
(?! \" )
.
)*
) # (1 end)
\"
Or just:
# (?s)\bparsable-cite\s*=\s*\"(.*?)\"
(?s)
\b
parsable-cite
\s* = \s*
\"
( .*? ) # (1)
\"
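The expanded layout above maps naturally onto Python's re.VERBOSE flag, which ignores whitespace and # comments inside the pattern. The sketch below is one way to write the second form; re.DOTALL stands in for the inline (?s). In this pattern \" matches a plain quote character, so the hypothetical sample assumes the HTML has already been decoded from JSON.

```python
import re

cite_re = re.compile(r'''
    \b parsable-cite     # attribute name on a word boundary
    \s* = \s*            # equals sign with optional whitespace around it
    \"                   # opening quote
    ( .*? )              # (1) shortest run of characters, newlines included
    \"                   # closing quote
''', re.VERBOSE | re.DOTALL)

# Hypothetical sample with plain quotes and varying whitespace around =
html = (
    '<external-xref parsable-cite = "pl/111/148">A</external-xref>\n'
    '<external-xref parsable-cite="pl/111/152">B</external-xref>'
)

cites = cite_re.findall(html)
print(cites)
```

On text where the quotes are still escaped as \", this pattern would find nothing, because a backslash sits between = and the quote.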
1
If you think it will be very similar each time:
re.findall(r"pl/\d+/\d+", page)