Iterate over the lines of a string

I have a multi-line string defined like this:

foo = """
this is 
a multi-line string.
"""

This string is used as test input for a parser I am writing. The parser function receives a file object as input and iterates over it. It also calls the next() method directly to skip lines, so I really need an iterator as input, not merely an iterable. I need an iterator that iterates over the individual lines of that string, like a file object would over the lines of a text file. I could of course do it like this:

lineiterator = iter(foo.splitlines())

Is there a more direct way of doing this? In this scenario the string has to be traversed once for the splitting, and then again by the parser. It doesn't matter in my test case, since the string is very short there; I am just asking out of curiosity. Python has so many useful and efficient built-ins for such stuff, but I could find nothing that suits this need.

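For illustration, a minimal sketch (Python 3 syntax; the sample string here is an assumption based on the one above) of what iter() buys: a plain list supports iteration, but only a true iterator supports next() for skipping lines:

```python
foo = """
this is
a multi-line string.
"""

# iter() turns the list from splitlines() into a real iterator,
# so next() can be called on it directly to skip lines
lineiterator = iter(foo.splitlines())

print(next(lineiterator))   # consumes the leading empty line
print(list(lineiterator))   # the remaining lines
```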
5 Answers

#1


111  

Here are three possibilities:

foo = """
this is 
a multi-line string.
"""

def f1(foo=foo): return iter(foo.splitlines())

def f2(foo=foo):
    retval = ''
    for char in foo:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

def f3(foo=foo):
    prevnl = -1
    while True:
        nextnl = foo.find('\n', prevnl + 1)
        if nextnl < 0:
            break
        yield foo[prevnl + 1:nextnl]
        prevnl = nextnl

if __name__ == '__main__':
    for f in f1, f2, f3:
        print list(f())

Running this as the main script confirms the three functions are equivalent. With timeit (and a * 100 for foo to get substantial strings for more precise measurement):

$ python -mtimeit -s'import asp' 'list(asp.f3())'
1000 loops, best of 3: 370 usec per loop
$ python -mtimeit -s'import asp' 'list(asp.f2())'
1000 loops, best of 3: 1.36 msec per loop
$ python -mtimeit -s'import asp' 'list(asp.f1())'
10000 loops, best of 3: 61.5 usec per loop

Note we need the list() call to ensure the iterators are traversed, not just built.

IOW, the naive implementation is so much faster it isn't even funny: 6 times faster than my attempt with find calls, which in turn is 4 times faster than a lower-level approach.

Lessons to retain: measurement is always a good thing (but must be accurate); string methods like splitlines are implemented in very fast ways; putting strings together by programming at a very low level (esp. by loops of += of very small pieces) can be quite slow.

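For anyone wanting to reproduce such measurements without a separate module file, the timeit module can also be driven in-process (a sketch in Python 3 syntax; absolute numbers will of course differ by machine, and the sample string is an assumption):

```python
import timeit

# * 100 to get a substantial string, as in the command-line runs above
foo = "\nthis is\na multi-line string.\n" * 100

def f1(foo=foo):
    return iter(foo.splitlines())

# time list(f1()) over 1000 runs, like the command-line timeit invocations
elapsed = timeit.timeit('list(f1())', globals=globals(), number=1000)
print('%.1f usec per loop' % (elapsed / 1000 * 1e6))
```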
Edit: added @Jacob's proposal, slightly modified to give the same results as the others (trailing blanks on a line are kept), i.e.:

from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip('\n')
        else:
            raise StopIteration

Measuring gives:

$ python -mtimeit -s'import asp' 'list(asp.f4())'
1000 loops, best of 3: 406 usec per loop

not quite as good as the .find-based approach -- still, worth keeping in mind, because it might be less prone to small off-by-one bugs (any loop where you see occurrences of +1 and -1, like my f3 above, should automatically trigger off-by-one suspicions -- and so should many loops which lack such tweaks and should have them -- though I believe my code is also right, since I was able to check its output against the other functions).

But the split-based approach still rules.

An aside: possibly better style for f4 would be:

from cStringIO import StringIO

def f4(foo=foo):
    stri = StringIO(foo)
    while True:
        nl = stri.readline()
        if nl == '': break
        yield nl.strip('\n')

at least, it's a bit less verbose. The need to strip trailing \ns unfortunately prohibits the clearer and faster replacement of the while loop with return iter(stri) (the iter part whereof is redundant in modern versions of Python, I believe since 2.3 or 2.4, but it's also innocuous). Maybe worth trying, also:

    return itertools.imap(lambda s: s.strip('\n'), stri)

or variations thereof -- but I'm stopping here, since it's pretty much a theoretical exercise wrt the split-based one, which is the simplest and fastest.

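For readers on Python 3, where cStringIO and itertools.imap no longer exist, roughly the same idea can be sketched with io.StringIO and the built-in (already lazy) map; the StringIO object is itself an iterator over its lines (the sample string is an assumption):

```python
import io

foo = "\nthis is\na multi-line string.\n"

def f4_py3(foo=foo):
    # StringIO yields lines like a file object would;
    # map lazily strips the trailing newlines
    return map(lambda s: s.strip('\n'), io.StringIO(foo))

print(list(f4_py3()))
```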
#2


38  

I'm not sure what you mean by "then again by the parser". After the splitting has been done, there's no further traversal of the string, only a traversal of the list of split strings. This will probably actually be the fastest way to accomplish this, so long as the size of your string isn't absolutely huge. The fact that Python uses immutable strings means that you must always create a new string, so this has to be done at some point anyway.

If your string is very large, the disadvantage is in memory usage: you'll have the original string and a list of split strings in memory at the same time, doubling the memory required. An iterator approach can save you this, building a string as needed, though it still pays the "splitting" penalty. However, if your string is that large, you generally want to avoid even the unsplit string being in memory. It would be better just to read the string from a file, which already allows you to iterate through it as lines.

However if you do have a huge string in memory already, one approach would be to use StringIO, which presents a file-like interface to a string, including allowing iterating by line (internally using .find to find the next newline). You then get:

import StringIO
s = StringIO.StringIO(myString)
for line in s:
    do_something_with(line)
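For readers on Python 3, the StringIO module was folded into io; module name aside, the behavior is the same (the sample string and loop body here are assumptions for illustration):

```python
import io

my_string = "first line\nsecond line\n"

# io.StringIO gives a file-like, line-iterable view of the string;
# as with a real file, each line keeps its trailing '\n'
for line in io.StringIO(my_string):
    print(repr(line))
```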

#3


3  

If I read Modules/cStringIO.c correctly, this should be quite efficient (although somewhat verbose):

from cStringIO import StringIO

def iterbuf(buf):
    stri = StringIO(buf)
    while True:
        nl = stri.readline()
        if nl != '':
            yield nl.strip()
        else:
            raise StopIteration

#4


2  

Regex-based searching is sometimes faster than the generator approach:

import re

RRR = re.compile(r'(.*)\n')
def f4(arg):
    return (i.group(1) for i in RRR.finditer(arg))
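One caveat worth flagging here: r'(.*)\n' only matches newline-terminated lines, so a final line without a trailing '\n' is silently dropped, unlike with splitlines(). A self-contained sketch (Python 3 syntax, repeating the definition with its import so it runs on its own):

```python
import re

RRR = re.compile(r'(.*)\n')

def f4(arg):
    # one match per newline-terminated line; group(1) is the line body
    return (i.group(1) for i in RRR.finditer(arg))

print(list(f4('a\nb\n')))   # ['a', 'b']
print(list(f4('a\nb')))     # ['a'] -- the unterminated final line is lost
```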

#5


1  

I suppose you could roll your own:

def parse(string):
    retval = ''
    for char in string:
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    if retval:
        yield retval

I'm not sure how efficient this implementation is, but that will only iterate over your string once.

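A quick self-contained check (Python 3 syntax, repeating the generator so the snippet runs on its own): the yielded lines have the newline removed, and a final line that lacks a trailing '\n' is still produced.

```python
def parse(string):
    retval = ''
    for char in string:
        # accumulate everything except the newline itself
        retval += char if not char == '\n' else ''
        if char == '\n':
            yield retval
            retval = ''
    # emit a final line that has no trailing newline
    if retval:
        yield retval

print(list(parse('this is\na multi-line string.')))
```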
Mmm, generators.

Edit:

Of course you'll also want to add in whatever type of parsing actions you want to take, but that's pretty simple.
