Starting the course! Step 1. Getting post titles and company names from Indeed pages

2021/02/09

nomadcoders.co/

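I'm going to take "Building a Web Scraper with Python", one of the free courses from Nomad Coders, a YouTuber I enjoy watching.

Environment setup

Development is done on Repl.it, a website where code can be written and run in the browser.

1. Fetching the HTML of a URL with Requests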

requests.readthedocs.io/en/master/

Line 8: result = requests.get(URL) => sends a GET request to the URL and returns a Response object; result.text holds the page's HTML as a string
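
As a rough illustration of the step above, here is a minimal sketch of building a search URL and fetching it with Requests. The endpoint `https://example.com/jobs`, the parameter values, and the helper names `build_search_url` and `fetch_html` are placeholders made up for this example, not part of the course code. The timeout and status check are additions I would consider sensible, not something the course requires.

```python
import requests

# Placeholder endpoint and parameters, used only to illustrate URL building.
SEARCH_URL = "https://example.com/jobs"
PARAMS = {"as_and": "java spring", "limit": 50, "start": 0}


def build_search_url(url, params):
    """Return the final URL requests would send, without any network call."""
    prepared = requests.Request("GET", url, params=params).prepare()
    return prepared.url


def fetch_html(url, params=None):
    """Fetch a page and return its HTML as a string (hypothetical helper)."""
    response = requests.get(url, params=params, timeout=10)
    response.raise_for_status()  # fail on 4xx/5xx instead of parsing an error page
    return response.text


# Only the URL is built here; fetch_html() is defined but not called.
print(build_search_url(SEARCH_URL, PARAMS))
```

Passing the query as a `params` dict lets Requests handle the percent-encoding, which is easier to read than a long hand-written query string.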

2. Extracting tags and values from the HTML with BeautifulSoup4

www.crummy.com/software/BeautifulSoup/bs4/doc/

3. Code walkthrough

indeed.py

import requests
from bs4 import BeautifulSoup

LIMIT = 50
URL = f"https://kr.indeed.com/%EC%B7%A8%EC%97%85?as_and=java+spring&as_phr=&as_any=&as_not=&as_ttl=&as_cmp=&jt=all&st=&salary=&radius=25&l=%EA%B2%BD%EA%B8%B0%EB%8F%84+%EC%84%B1%EB%82%A8&fromage=any&limit={LIMIT}&sort=&psf=advsrch&from=advancedsearch"

def extract_indeed_pages():
  result = requests.get(URL)

  # Print the full HTML of the response
  # print(result.text)

  # Extract data from the HTML
  soup = BeautifulSoup(result.text, "html.parser")

  pagination = soup.find("div", {"class": "pagination"})

  # print(pagination)

  links = pagination.find_all("a")

  # print(links)
  # link.find("span")
  pages = []
  for link in links[0:-1]:  # skip the last link (presumably the "next" button)
    pages.append(int(link.string))
  # print(pages)
  max_page = pages[-1]
  return max_page


def extract_indeed_jobs_and_companies(last_pages):
  url_text = requests.get(f"{URL}&start={0*LIMIT}")  # only the first page for now
  soup = BeautifulSoup(url_text.text, "html.parser")
  results = soup.find_all("div", {"class": "jobsearch-SerpJobCard"})
  for result in results:
    company = result.find("span", {"class": "company"})
    title = result.find("h2", {"class": "title"}).find("a")["title"]
    company_anchor = company.find("a")
    if company_anchor is not None:
      company = str(company_anchor.string)
    else:
      company = str(company.string)

    company = company.strip()
    print(company)
    print(title, "\n")
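
The function above receives `last_pages` but, at this point in the course, only requests the first page (`start=0`). A minimal sketch of how the remaining page URLs could be built from that value follows. The helper name `build_page_urls` and the placeholder base URL are my own assumptions, and the course may structure this step differently.

```python
LIMIT = 50
# Placeholder base URL; the real one is the long Indeed search URL in indeed.py.
BASE_URL = "https://example.com/jobs?limit=50"


def build_page_urls(base_url, last_page, limit=LIMIT):
    """Build one URL per result page using the start offset (page * limit)."""
    return [f"{base_url}&start={page * limit}" for page in range(last_page)]


for url in build_page_urls(BASE_URL, 3):
    print(url)
```

Each URL would then be fetched and parsed the same way the first page is.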

# find_all: searches the whole HTML and returns every matching element as a list (= select)

# find: returns only the first matching element, or None if nothing matches (= select_one)
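
To make the difference concrete, here is a small sketch that runs all four methods on an inline HTML snippet. The markup is made up for the example and is much simpler than a real Indeed page.

```python
from bs4 import BeautifulSoup

# Made-up HTML, only for illustrating the four lookup methods.
HTML = """
<div class="pagination">
  <a href="?start=0">1</a>
  <a href="?start=50">2</a>
</div>
<div class="other"><a href="/about">About</a></div>
"""

soup = BeautifulSoup(HTML, "html.parser")

first_link = soup.find("a")                   # first <a> in the document
all_links = soup.find_all("a")                # every <a> in the document
first_by_css = soup.select_one("div.pagination a")  # CSS selector, first match
all_by_css = soup.select("div.pagination a")        # CSS selector, all matches

print(first_link.string)
print(len(all_links))
print([a.string for a in all_by_css])
print(soup.find("table"))  # find returns None when nothing matches
```

Because `find` and `select_one` return None on a miss, chaining another call onto the result without a check raises an AttributeError.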

extract_indeed_pages function = extracts the last page number from the pagination links at the bottom of the results

Line 14: soup = BeautifulSoup(result.text, "html.parser")

Parses the HTML text fetched with requests and stores the parsed document in soup.

Line 16: pagination = soup.find("div",{"class":"pagination"})

#pagination = soup.select_one("div.pagination")

<Find the first element in soup whose tag is "div" and whose class name is "pagination", and return it!>

Line 20: links = pagination.find_all("a")
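
The pagination steps can be tried offline with a small sketch like the one below. The sample markup is invented for the example and only approximates what the page looked like. Unlike the course code, this version keeps only link texts that are digits instead of dropping the last link, and falls back to 1 when there is no pagination block; both are my own changes, not the course's approach.

```python
from bs4 import BeautifulSoup

# Invented markup that roughly imitates a pagination block.
SAMPLE = """
<div class="pagination">
  <a href="?start=50"><span class="pn">2</span></a>
  <a href="?start=100"><span class="pn">3</span></a>
  <a href="?start=150"><span class="pn">4</span></a>
  <a href="?start=50" aria-label="Next"><span class="np">Next</span></a>
</div>
"""


def extract_last_page(html):
    """Return the largest page number found in the pagination links."""
    soup = BeautifulSoup(html, "html.parser")
    pagination = soup.find("div", {"class": "pagination"})
    if pagination is None:
        return 1  # no pagination block: assume a single page of results
    pages = []
    for link in pagination.find_all("a"):
        text = link.get_text(strip=True)
        if text.isdigit():  # skips non-numeric links such as "Next"
            pages.append(int(text))
    return max(pages) if pages else 1


print(extract_last_page(SAMPLE))
```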

extract_indeed_jobs_and_companies function

Line 35: results = soup.find_all("div", {"class":"jobsearch-SerpJobCard"})

Line 37: company = result.find("span", {"class":"company"})

#company = result.select_one("div.sjcl span.company")

#company = result.select_one("span.company")

<Find the first element in result (one job card) whose tag is "span" and whose class name is "company", and return it!>
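
Here is a minimal offline sketch of the company and title extraction. The sample cards are invented and only mimic the class names used above. Instead of the course's if/else on the inner anchor, it uses `get_text(strip=True)`, which reads the text whether or not the company name is wrapped in a link, and it returns None when a card has no company span. The names `extract_job` and `extract_jobs` are my own.

```python
from bs4 import BeautifulSoup

# Invented job cards: company inside a link, company as plain text, no company.
SAMPLE = """
<div class="jobsearch-SerpJobCard">
  <h2 class="title"><a title="Java Backend Developer" href="/job/1">Java...</a></h2>
  <div class="sjcl"><span class="company"><a href="/cmp/acme"> Acme Corp </a></span></div>
</div>
<div class="jobsearch-SerpJobCard">
  <h2 class="title"><a title="Spring Engineer" href="/job/2">Spring...</a></h2>
  <div class="sjcl"><span class="company">
    Example Inc</span></div>
</div>
<div class="jobsearch-SerpJobCard">
  <h2 class="title"><a title="No Company Listed" href="/job/3">No...</a></h2>
</div>
"""


def extract_job(card):
    """Pull the title and company name out of a single job card."""
    title = card.find("h2", {"class": "title"}).find("a")["title"]
    company_tag = card.find("span", {"class": "company"})
    # get_text works with or without an inner <a>; guard against a missing span
    company = company_tag.get_text(strip=True) if company_tag else None
    return {"title": title, "company": company}


def extract_jobs(html):
    """Return one dict per job card found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    cards = soup.find_all("div", {"class": "jobsearch-SerpJobCard"})
    return [extract_job(card) for card in cards]


for job in extract_jobs(SAMPLE):
    print(job)
```

Returning dicts instead of printing makes it easier to collect results from several pages later.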

Current progress

~ #2.7 Extracting Companies (09:13)
