"RecursionError: maximum recursion depth exceeded in comparison" when calling str(code)

+2 Arjun Naidu · September 15, 2015
While trying to download all the source files using a web crawler, I'm hitting the following exception:


Traceback (most recent call last):
  File "C:/Users/arjunn/PycharmProjects/learn/test1.py", line 59, in <module>
    source_spider(474)
  File "C:/Users/arjunn/PycharmProjects/learn/test1.py", line 54, in source_spider
    fw.write(get_code(href))
  File "C:/Users/arjunn/PycharmProjects/learn/test1.py", line 11, in get_code
    code = str(code)
  File "C:\Users\arjunn\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\element.py", line 1035, in __unicode__
    return self.decode()
  File "C:\Users\arjunn\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\element.py", line 1122, in decode
    indent_contents, eventual_encoding, formatter)
  File "C:\Users\arjunn\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\element.py", line 1191, in decode_contents
    formatter))
.
.
.
.
.

  File "C:\Users\arjunn\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\element.py", line 1122, in decode
    indent_contents, eventual_encoding, formatter)
  File "C:\Users\arjunn\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\element.py", line 1188, in decode_contents
    text = c.output_ready(formatter)

  File "C:\Users\arjunn\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\element.py", line 712, in output_ready
    output = self.format_string(self, formatter)
  File "C:\Users\arjunn\AppData\Local\Programs\Python\Python35-32\lib\site-packages\bs4\element.py", line 156, in format_string
    if not isinstance(formatter, collections.Callable):
  File "C:\Users\arjunn\AppData\Local\Programs\Python\Python35-32\lib\abc.py", line 182, in __instancecheck__
    if subclass in cls._abc_cache:
  File "C:\Users\arjunn\AppData\Local\Programs\Python\Python35-32\lib\_weakrefset.py", line 75, in __contains__
    return wr in self.data
RecursionError: maximum recursion depth exceeded in comparison



Code is as follows:

import requests
import re
from bs4 import BeautifulSoup


def get_code(url):
    source = requests.get(url)
    text = source.text
    soup = BeautifulSoup(text, "html.parser")
    code = soup.find('code')
    code = str(code)
    line = re.sub('', '\n', code)
    # line = replace_br_newln(code)

    soup2 = BeautifulSoup(line, "html.parser")
    script = soup2.get_text()
    script = script.replace('?', ' ')
    return script

'''
def replace_br_newln(in_str):
    new_string = ''
    while 1:
        index = in_str.find(r'')
        if index == -1:
            new_string += in_str
            break
        else:
            new_string += in_str[:index] + '\n'
            in_str = in_str[index+4:]
    return new_string
'''



def source_spider(max_pages):
    page = 13
    compare_str = r'[source code]'
    while page <= max_pages:
        print('#####  ', page, '  #####')
        url = "https://thenewboston.com/forum/category.php?id=15&page=" + str(page)
        source_code = requests.get(url)
        # just get the code, no headers or anything
        plain_text = source_code.text
        # BeautifulSoup objects can be searched through easily
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll('a', {'class': 'post-title'}):
            href = "https://thenewboston.com" + link.get('href')
            title = link.string  # just the text, not the HTML
            print(title)
            if compare_str in title:
                title = 'source\\' + title[14:] + '.txt'
                fw = open(title, 'w')
                fw.write(get_code(href))
                fw.close()
        page += 1


source_spider(474)





Why is recursion used for converting to a string?
What is the workaround to the issue?

Replies

0 Halcyon Abraham Ramirez · September 15, 2015
tried out your code. no recursion error for me

btw

your get_code function returns individual letters

like these:



b
r
>
f
o
r
 
k
,
 
v
 
i
n
 
w
e
i
g
h
t
s
.
i
t
e
m
s
(
)
:
<
b
r

is that what you wanted? also put some time between each request or you'll overload bucky's server :/

I know because I've overloaded servers before too!
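To illustrate the point about spacing out requests, here is a minimal sketch. The helper name throttled and the one-second delay are assumptions made for illustration and are not part of the original script. The sleep parameter is only there so the pause can be replaced when testing.

```python
import time


def throttled(items, delay_seconds=1.0, sleep=time.sleep):
    """Yield items one by one, pausing delay_seconds between consecutive items.

    Hypothetical helper: no pause happens before the first item.
    """
    first = True
    for item in items:
        if not first:
            sleep(delay_seconds)
        first = False
        yield item


# Possible use inside source_spider (sketch only, not run here):
# for page in throttled(range(13, max_pages + 1), delay_seconds=1.0):
#     url = "https://thenewboston.com/forum/category.php?id=15&page=" + str(page)
#     source_code = requests.get(url)
```

The same wrapper could be put around the loop over post links, since each call to get_code() makes another request.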
0 Arjun Naidu · September 18, 2015
Hi Halcyon

get_code() is still giving me the recursion error for url2.
Also, it is not returning individual characters. I've printed the result for url1.


import requests
import re
from bs4 import BeautifulSoup


def get_code(url):
    source = requests.get(url)
    text = source.text
    soup = BeautifulSoup(text, "html.parser")
    code = soup.find('code')
    code = str(code)
    line = re.sub('', '\n', code)
    soup2 = BeautifulSoup(line, "html.parser")
    script = soup2.get_text()
    script = script.replace('?', ' ')
    return script


print(get_code(r'https://thenewboston.com/forum/topic.php?id=2342'))  #url1
get_code(r'https://thenewboston.com/forum/topic.php?id=2653')   #url2
0 sfolje 0 · September 18, 2015
http://stackoverflow.com/questions/3323001/maximum-recursion-depth
says:
"You can change the recursion limit with sys.setrecursionlimit, but doing so is dangerous -- the standard limit is a little conservative".
You can see the current recursion limit (1000 by default) with
sys.getrecursionlimit()

The second URL needs sys.setrecursionlimit(1194)
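If you do raise the limit, one option is to raise it only around the call that needs it and restore it afterwards. This is a rough sketch, not a recommendation: the context manager name recursion_limit is made up for illustration, and 1194 is just the value mentioned above. Setting the limit too high can crash the interpreter instead of raising RecursionError.

```python
import sys
from contextlib import contextmanager


@contextmanager
def recursion_limit(limit):
    """Temporarily raise the recursion limit, then restore the previous value.

    Hypothetical helper: the limit is never lowered below its current value.
    """
    old = sys.getrecursionlimit()
    sys.setrecursionlimit(max(old, limit))
    try:
        yield
    finally:
        sys.setrecursionlimit(old)


# Possible use inside get_code (sketch only, not run here):
# with recursion_limit(1194):
#     code = str(code)
```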
0 Halcyon Abraham Ramirez · September 19, 2015
yes it does give you individual letters still

it's because of this line


line = re.sub('', '\n', code)

try this out:

a = "hello"

print(re.sub("","\n",a))

I fixed your get_code function:



def get_code(url):
    soup = BeautifulSoup(requests.get(url).content)
    return re.sub(r"\?"," ",soup.find('code').text)

what is url2? can you show me the exact url?
try running this:

def get_code(url):
    soup = BeautifulSoup(requests.get(url).content)
    return re.sub(r"\?"," ",soup.find('code').text)

print(get_code(r'https://thenewboston.com/forum/topic.php?id=2342'))
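A small sketch of why an empty pattern in re.sub puts every character on its own line, next to a pattern that only matches line break tags. The name br_to_newline and the exact regular expression are assumptions for illustration. It treats <br>, <br/> and <br /> alike, which may or may not cover what the forum pages actually contain.

```python
import re

# An empty pattern matches at every position, including both ends of the string,
# so the replacement is inserted around every single character.
print(repr(re.sub("", "\n", "hello")))  # '\nh\ne\nl\nl\no\n'

# Hypothetical pattern covering the common spellings of the line break tag.
BR_TAG = re.compile(r"<br\s*/?>", re.IGNORECASE)


def br_to_newline(markup):
    """Replace HTML line break tags in a string with newline characters."""
    return BR_TAG.sub("\n", markup)


print(repr(br_to_newline("for k, v in weights.items():<br>    print(k, v)")))
```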
0 sfolje 0 · September 19, 2015
I think url2 means 'https://thenewboston.com/forum/topic.php?id=2653', no error with your solution though.
0 Arjun Naidu · September 19, 2015
Hi Halcyon.
I am sorry. The '<br>' inside re.sub is being removed when I post. I am trying to replace the HTML line break tag with a newline.
The actual code is     line = re.sub('<br>', '\n', code)

I am not sure if this is being printed properly even now. (First 5 posts are moderated and I cannot see what I posted until a moderator approves it.)
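For the goal described above (text of the code element with line breaks kept as newlines), here is one possible sketch that avoids turning the tag back into a string at all. Note that it swaps in the standard library's html.parser in place of BeautifulSoup, so it is a different technique from the code in this thread. The names CodeTextExtractor and code_text are made up for illustration. Because the parser works through the markup event by event, deep nesting should not run into the recursion limit.

```python
from html.parser import HTMLParser


class CodeTextExtractor(HTMLParser):
    """Collect the text of the first <code> element, turning <br> into newlines."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.depth = 0      # number of <code> elements currently open
        self.done = False   # True once the first <code> element has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if tag == "code":
            self.depth += 1
        elif tag == "br" and self.depth:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag == "code" and self.depth:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth and not self.done:
            self.parts.append(data)


def code_text(html):
    """Return the text inside the first <code> element of an HTML string."""
    parser = CodeTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


# Possible use in get_code (sketch only, not run here):
# script = code_text(requests.get(url).text)
```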
0 Halcyon Abraham Ramirez · September 20, 2015
have you run this?


def get_code(url):
    soup = BeautifulSoup(requests.get(url).content)
    return re.sub(r"\?"," ",soup.find('code').text)

print(get_code(r'https://thenewboston.com/forum/topic.php?id=2342'))


was it your desired output?
0 Arjun Naidu · September 21, 2015
I've tried your code, but the entire result is in a single line.
I've increased the recursion depth limit as @sfolje 0 suggested, and my code is now working without any error.
0 Halcyon Abraham Ramirez · September 21, 2015
this is interesting.

this is the first time I've encountered this error. apparently this is a bug in beautifulsoup.

no wonder why we're getting the error.

well I guess the work around is what @sfolje0 said
0 sfolje 0 · September 24, 2015
Please like my post if you think it deserves to be liked ;).