Python Web Image Crawler (GUI)

Programming 2016. 2. 3. 17:16

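A long time ago, while I was still learning, I wrote a version of this program whose source was extremely messy but did work.

This time I set out to build a program with more features that rarely throws errors.

The earlier version took me three days, but rewriting it took only about two hours to finish the functionality and most of the GUI.

Crawlers are by nature tightly coupled to the site they target, so this one can only fetch images from zerochan.net.

Here is the source.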
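
To make that dependency concrete, here is a minimal sketch of the two site-specific string manipulations the crawler relies on: building the result-page URL for a tag, and rewriting a thumbnail address into a full-size one. The function names `page_url` and `full_image_url` and the sample thumbnail address are hypothetical, chosen only for illustration. The URL patterns are copied from crawler.py and may no longer match the live site.

```python
def page_url(keywords, page):
    # Result page `page` (1-based) for a tag, as built in Crawler.url_crawl.
    return "http://www.zerochan.net/" + keywords + "?p=" + str(page)


def full_image_url(thumbnail_url):
    # Keep only the last two dot-separated parts (image id and extension)
    # and attach them to the full-size prefix, as in Crawler.url_setting.
    parts = thumbnail_url.split('.')
    return "http://static.zerochan.net/.full." + parts[-2] + "." + parts[-1]


# Hypothetical thumbnail address, for illustration only.
thumb = "http://s1.zerochan.net/Example.240.123456.jpg"
print(page_url("Kousaka Kirino", 2))
print(full_image_url(thumb))
```

If the site changes either pattern, only these two string rules need updating, which is the whole extent of the coupling.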

<crawler.py>

import os
import threading
import urllib

from bs4 import BeautifulSoup


class Crawler:
    def __init__(self, keywords, page):
        self.keywords = keywords
        self.url = "http://www.zerochan.net/"+keywords
        self.page = page
        self.img_url = []
        self.num = 0

    def url_crawl(self):
        # Visit result pages 1..page and collect the src of every <img> tag.
        for i in range(1, self.page+1):
            tmp_url = self.url+"?p="+str(i)
            source = urllib.urlopen(tmp_url).read()
            if "No such tag. Back to" in source:
                return -1  # the site does not know this tag
            source = BeautifulSoup(source, "html5lib")
            for img_tag in source('img'):
                try:
                    self.img_url.append(img_tag['src'])
                except KeyError:
                    return -2  # <img> tag without a src attribute

    def url_setting(self):
        # Rewrite each thumbnail URL into the full-size image URL.
        for i in range(len(self.img_url)):
            tmp = self.img_url[i].split('.')
            self.img_url[i] = "http://static.zerochan.net/.full."+tmp[-2]+"."+tmp[-1]

    def crawl(self, url):
        # Download one image into the keyword directory as "<id>.<ext>".
        img = urllib.urlopen(url).read()
        with open(self.keywords+"/"+('.'.join(url.split('.')[-2:])), 'wb') as f:
            f.write(img)

    def findImg(self):
        # Returns the number of images found, or -1 / -2 on error.
        ret = self.url_crawl()
        if ret in (-1, -2):
            return ret
        self.url_setting()
        self.num = len(self.img_url)
        return self.num

    def start(self, num):
        # Strip characters that are not allowed in directory names.
        for ch in ':\\/?!"<>|':
            self.keywords = self.keywords.replace(ch, "")
        try:
            os.makedirs(self.keywords)
        except OSError:
            pass  # e.g. the directory already exists
        # Four worker threads; worker k downloads images k-1, k+3, k+7, ...
        self.threads = []
        for k in range(1, 5):
            thread = threading.Thread(target=self.crawls, args=(k, num))
            thread.start()
            self.threads.append(thread)

    def crawls(self, num, maxNum):
        # Worker num (1-4) handles indices num-1, num+3, ... below maxNum.
        for i in range(0, (maxNum+4-num)//4):
            self.crawl(self.img_url[(i*4)+num-1])


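The crawling logic lives in a Crawler class.

It crawls with four threads in total: it first walks through the result pages to collect the image addresses, and then divides those addresses among the threads, which download them.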
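
The way the addresses are divided can be sketched in isolation. The helper below is a hypothetical name (`worker_indices`), not part of crawler.py, but it uses the same arithmetic as `Crawler.crawls`: worker k (1-based) takes every fourth index, starting at k-1.

```python
def worker_indices(worker, total, n_workers=4):
    # worker is 1-based; it takes indices worker-1, worker-1+n_workers, ...
    count = (total + n_workers - worker) // n_workers
    return [i * n_workers + worker - 1 for i in range(count)]


total = 10
for w in range(1, 5):
    print("worker %d -> %s" % (w, worker_indices(w, total)))
```

This is a static split: each image belongs to exactly one worker and no locking is needed, but a worker that happens to draw slow downloads will finish later than the others.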

Next is the GUI.

<gui.py>

from Tkinter import *
import threading
import tkMessageBox

from crawler import Crawler  # the Crawler class from crawler.py above


class Interface:
    def __init__(self, Master):
        self.master = Master
        self.master.geometry('600x130')
 
        # MainFrame
 
        self.mainFrame = Frame(self.master)
        self.mainFrame.pack(fill=X)
 
        # UrlFrame
 
        self.urlFrame = Frame(self.mainFrame)
        self.urlFrame.pack(side=TOP, fill=X)
 
        self.urlLabel = Label(self.urlFrame)
        self.urlLabel.configure(text='Keywords :')
        self.urlLabel.pack(side=LEFT, padx=5, pady=10)
 
        self.urlEntry = Entry(self.urlFrame)
        self.urlEntry.configure(width=18)
        self.urlEntry.pack(side=LEFT, padx=5, pady=10)
 
        self.pageLabel = Label(self.urlFrame)
        self.pageLabel.configure(text='Page Number : ')
        self.pageLabel.pack(side=LEFT, padx=5, pady=10)
 
        self.pageEntry = Entry(self.urlFrame)
        self.pageEntry.configure(width=10)
        self.pageEntry.pack(side=LEFT, padx=5, pady=10)
 
        self.countLabel = Label(self.urlFrame)
        self.countLabel.configure(text='Image Number :')
        self.countLabel.pack(side=LEFT, padx=5, pady=10)
 
        self.countEntry = Entry(self.urlFrame)
        self.countEntry.configure(width=10)
        self.countEntry.pack(side=LEFT, padx=5, pady=10)
 
        # ButtonFrame
 
        self.buttonFrame = Frame(self.mainFrame)
        self.buttonFrame.pack(side=TOP, fill=X)
 
        self.findButton = Button(self.buttonFrame, command=self.findThreadingStart)
        self.findButton.configure(text='Find', width=25)
        self.findButton.pack(side=LEFT, padx=7, pady=5)
 
        self.startButton = Button(self.buttonFrame, command=self.startThreadingStart)
        self.startButton.configure(text='Start', width=25)
        self.startButton.pack(side=LEFT, padx=7, pady=5)
 
        self.stopButton = Button(self.buttonFrame, command=self.stopCrawling)
        self.stopButton.configure(text='Stop', width=25)
        self.stopButton.pack(side=LEFT, padx=7, pady=5)
 
        # Notification Frame
 
        self.notificationFrame = Frame(self.mainFrame)
        self.notificationFrame.pack(side=TOP, fill=X)
 
        self.numberLabel = Label(self.notificationFrame)
        self.numberLabel.configure(text='Images found : ')
        self.numberLabel.pack(side=LEFT, padx=10, pady=5)
 
        self.numberviewLabel = Label(self.notificationFrame)
        self.numberviewLabel.configure(text='N/A')
        self.numberviewLabel.pack(side=LEFT, padx=10, pady=5)
 
        self.notificationButton = Button(self.notificationFrame, command=self.help)
        self.notificationButton.configure(text = 'Help')
        self.notificationButton.pack(side=RIGHT, padx=10, pady=5)
 
        # Warning Frame
 
        self.warningFrame = Frame(self.mainFrame)
        self.warningFrame.pack(side=TOP, fill=X)
 
        self.warningLabel = Label(self.warningFrame)
        self.warningLabel.pack(side=LEFT, padx=10)
 
    def findThreadingStart(self):
        self.findThread = threading.Thread(target=self.findImage)
        self.findThread.start()
 
    def findImage(self):
        self.warningLabel.configure(text='[*] Finding Images...')
 
        url = self.urlEntry.get()
        if url=="":
            self.warningLabel.configure(text='[*] Please input keywords!')
            return
        page = self.pageEntry.get()
        try:
            page = int(page)
        except:
            self.warningLabel.configure(text='[*] Please input only INTEGER in page form!')
            return
 
        self.crawler = Crawler(url, page)
        self.number = self.crawler.findImg()
 
        if self.number == -1:
            self.warningLabel.configure(text='[*] No such Tag, please use other Tag')
            return
        elif self.number == -2:
            self.warningLabel.configure(text='[*] Exception occurred. Please feedback to developer.')
            return
 
        self.numberviewLabel.configure(text=str(self.number))
        self.warningLabel.configure(text='[*] Finding Images finished')
 
    def startThreadingStart(self):
        self.startThread = threading.Thread(target=self.startCrawling)
        self.startThread.start()
 
    def startCrawling(self):
        self.warningLabel.configure(text='[*] Crawling Started')
        try:
            num = self.countEntry.get()
            if num=="MAX":
                num = self.number
            else:
                try:
                    num = int(num)
                except:
                    self.warningLabel.configure(text='[*] Please input only INTEGER in number form!')
                    return
 
                if num > self.number:
                    num = self.number
            self.crawler.start(num)
        except:
            self.warningLabel.configure(text="[*] Please do 'Find' before 'Start'!")
            return
    def stopCrawling(self):
        self.warningLabel.configure(text="[*] It can't be used!!")
 
    def help(self):
        helpString = """        [Usage]
1. Input the keywords in the box. (ex: Kousaka Kirino)
  It can be character's name, emotions(ex: crying, smile),
  objects(ex: wings, sword).
  Please use official name.
2. Input the page's number you want to crawl.
  About 5~15 images in one page.
  !! Too many pages (like above 100) can take a long time.
  !! So please be careful.
3. Click the Find button, and wait for finishing.
4. After finished, input the number you want to crawl.
  If the number is greater than the count found in (3),
  it's automatically set to that maximum.
  If you want to crawl all of them, input 'MAX'.
5. Click the Start Button, and wait for finishing.
=============================================
This Program was made by 5kyc1ad(skyclad1975).
It's made up of Python 2.7.10, with Tkinter, BeautifulSoup.
Please Feedback : skyclad0x7b7@gmail.com
Blog : http://5kyc1ad.tistory.com"""
        tkMessageBox.showinfo("Zerochan_Crawler::Help",helpString)
 
        
root = Tk()
myApp = Interface(root)
root.mainloop()

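Writing this part actually took the longest.

Nothing in it was particularly hard; it was just grunt work. As always, the answer in coding was to keep hammering away at the keyboard.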
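
Most of the GUI code is widget layout, but the "Image Number" handling in `startCrawling` carries a little logic that is easier to see on its own. The sketch below pulls it into a hypothetical helper, `resolve_count`, which is not in gui.py. It assumes the same rules as the original: `MAX` means everything that Find reported, a non-integer is rejected, and a request larger than the number found is capped.

```python
def resolve_count(text, found):
    # 'MAX' means "download everything that Find reported".
    if text == "MAX":
        return found
    try:
        requested = int(text)
    except ValueError:
        return None  # not an integer; the GUI shows a warning instead
    # Never ask for more images than were found.
    return min(requested, found)


for text in ("MAX", "5", "100", "abc"):
    print("%r -> %r" % (text, resolve_count(text, 37)))
```

Keeping this as a plain function, separate from the widgets, would also make it testable without opening a window.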

Here is a demo video.

(If it turns out to be a problem, I will take it down.)
