Match beginning and end of string with regex in Python

Original question by johns4ta, 2014/03/27. Tags: python, regex

I'm trying to pull the parsable-cite info from this webpage using Python. For example, for the page listed I would pull pl/111/148 and pl/111/152. My current regex is listed below, but it seems to return everything after parsable-cite. It's probably something simple, but I'm relatively new to regexes. Thanks in advance.

re.findall(r'^parsable-cite=.*>$',page)

7 answers

#1

I highly recommend using this regex, which captures what you want:

re.findall(r'parsable-cite=\\\"(.*?)\\\"\>',page)

Explanation:

parsable-cite= matches the characters parsable-cite= literally (case sensitive)
  \\ matches the character \ literally
  \" matches the character " literally
  1st Capturing group (.*?)
  .*? matches any character (except newline)
      Quantifier: Between zero and unlimited times, as few times as possible,
           expanding as needed
  \\ matches the character \ literally
  \" matches the character " literally
  \> matches the character > literally
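
The breakdown above can be checked against a small sample. This is only a sketch: the `page` string is a made-up fragment that assumes the page source is JSON with HTML embedded in it, so every quote appears as `\"`. The real page may differ.

```python
import re

# Hypothetical sample: HTML embedded in JSON, so every quote is written as \"
page = r'<external-xref parsable-cite=\"pl/111/148\">Public Law 111-148</external-xref> and <external-xref parsable-cite=\"pl/111/152\">Public Law 111-152</external-xref>'

# \\ matches a literal backslash, \" a literal quote; (.*?) lazily captures the value
cites = re.findall(r'parsable-cite=\\\"(.*?)\\\"\>', page)
print(cites)  # ['pl/111/148', 'pl/111/152']
```

If the text has already been decoded so that the quotes are no longer backslash-escaped, the `\\` parts of the pattern would have to be dropped.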

Using ? (which makes .* lazy) is the key ;)

Hope this helps.

#2

Make your regex lazy:

re.findall(r'^parsable-cite=.*?>$',page)
                              ^

Or use a negated class (preferable):

re.findall(r'^parsable-cite=[^>]*>$',page)

.* is greedy by default and will try to match as much as possible before concluding a match.
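
A minimal sketch of the difference, using a made-up one-line sample (the `line` string is an assumption, not taken from the actual page). The anchors are left out here so that the effect of greediness is visible on its own:

```python
import re

# Hypothetical one-line sample containing several '>' characters
line = 'parsable-cite="pl/111/148">Public Law 111-148</external-xref>'

greedy = re.findall(r'parsable-cite=.*>', line)      # runs to the last '>'
lazy = re.findall(r'parsable-cite=.*?>', line)       # stops at the first '>'
negated = re.findall(r'parsable-cite=[^>]*>', line)  # cannot cross a '>'

print(greedy)   # ['parsable-cite="pl/111/148">Public Law 111-148</external-xref>']
print(lazy)     # ['parsable-cite="pl/111/148">']
print(negated)  # ['parsable-cite="pl/111/148">']
```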

regex101 demo

If you only want the parts you need, you can use capture groups:

re.findall(r'^parsable-cite=([^>]*)>$',page)

regex101 demo

Though, from the layout of your webpage, it doesn't seem that you need the anchors (^ and $) (unless the newlines were somehow removed on the site...)
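
To illustrate the point about anchors: in Python, ^ and $ match only at the start and end of the whole string unless re.MULTILINE is passed. The sketch below uses a made-up `page` in which each attribute happens to sit on its own line, which is an assumption about the layout rather than something taken from the real page:

```python
import re

# Hypothetical sample where each attribute sits on its own line
page = 'intro\nparsable-cite="pl/111/148">\nparsable-cite="pl/111/152">\noutro'

pattern = r'^parsable-cite=([^>]*)>$'
no_flag = re.findall(pattern, page)                 # ^ and $ anchor only at the string ends
per_line = re.findall(pattern, page, re.MULTILINE)  # ^ and $ anchor at every line

print(no_flag)   # []
print(per_line)  # ['"pl/111/148"', '"pl/111/152"']
```

Note that the captured values still include the surrounding quotes, because the group covers everything between = and >.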

#3

The .* you have there is "greedy", meaning it will match as much as it can, including any number of > characters and whatever comes after them.

If what you really want is "everything up to the next >" then you should say [^>]*> instead, meaning "any number of non-> characters, then a >".

#4

Maybe something like this:

(?<=parsable-cite=\\\")\w{2}\/\d{3}\/\d{3}

http://regex101.com/r/kE9uE3
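
A sketch of how the lookbehind pattern could be used from Python. The `page` sample is hypothetical and assumes JSON-escaped quotes (`\"`), as the pattern itself does:

```python
import re

# Hypothetical sample with JSON-escaped quotes (\")
page = r'<external-xref parsable-cite=\"pl/111/148\">A</external-xref><external-xref parsable-cite=\"pl/111/152\">B</external-xref>'

# (?<=...) requires the prefix to be present but keeps it out of the match
cites = re.findall(r'(?<=parsable-cite=\\\")\w{2}\/\d{3}\/\d{3}', page)
print(cites)  # ['pl/111/148', 'pl/111/152']
```

Because the lookbehind is fixed-width, Python's re module accepts it. The pattern only matches cites shaped like two word characters followed by two three-digit groups.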

#5

Although this is a JSON string with HTML embedded inside, you can still use BeautifulSoup for this purpose:

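import re
from bs4 import BeautifulSoup  # assumes the bs4 package is installed

soup = BeautifulSoup(htmls, "html.parser")  # htmls: the HTML taken from the JSON string
# select every external-xref tag that has a parsable-cite attribute
tags = soup.find_all("external-xref", {"parsable-cite": re.compile("")})
for t in tags:
    print(t["parsable-cite"])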

#6

This might work if the value is between \" delimiters:

 #  \bparsable-cite\s*=\s*\"((?s:(?!\").)*)\"

 \b 
 parsable-cite
 \s* = \s* 
 \"
 (                             # (1 start)
      (?s:
           (?! \" )
           . 
      )*
 )                             # (1 end)
 \"

Or, just

 #  (?s)\bparsable-cite\s*=\s*\"(.*?)\"

 (?s)
 \b 
 parsable-cite
 \s* = \s* 
 \"
 ( .*? )                 # (1)
 \"

#7

If you think it will be very similar each time:

re.findall(r"pl/\d+/\d+", page)
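
One trade-off worth noting: this pattern does not look at the attribute name at all, so it also picks up cites that appear elsewhere in the text. A small sketch with a made-up `page` string:

```python
import re

# Hypothetical sample: one cite inside the attribute, one in running text
page = 'parsable-cite="pl/111/148"> see also pl/111/152 in the notes'

cites = re.findall(r"pl/\d+/\d+", page)
print(cites)  # ['pl/111/148', 'pl/111/152']
```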
