[Web Crawling] BeautifulSoup

선경이 2025. 4. 6. 16:42

HTML

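HTML concept

- HTML stands for HyperText Markup Language.

- It is the basic language for creating hypertext documents on the World Wide Web.

HTML characteristics

- The skeleton of an HTML document is built from pairs of tags.

-> <tagname>document content</tagname>

- Some tags, such as <BR> and <IMG>, have no closing tag.

HTML basic structure

- An HTML document is made up of tags.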

<html>

<head>

<title>Hello</title>

</head>

<body>

<p>Hello HTML!</p>

</body>

</html>

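- <HTML> ~ </HTML> : Marks the content as HTML; the opening tag is where the HTML document begins.

- <HEAD> ~ </HEAD> : Holds information about the document. Its contents affect the whole document and are not displayed on screen.

- <BODY> ~ </BODY> : Holds the actual page content. The content inside this tag is what gets displayed on screen.

HTML tag attributes

- A tag attribute gives a tag a particular property.

- A title tag with no attributes : <title>test site</title>

- A title tag with class and id attributes : <title class="t" id="ti">test site</title>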

Parsing

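- Parsing breaks a sentence down into the components that make it up.

- It determines the structure of the sentence by analyzing the hierarchical relationships among those components.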

HTML parsing

- DOM (Document Object Model) : a model that represents a web page's HTML document as a tree

- HTML parsing : reading an HTML document and converting it into a DOM
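To make the idea of "HTML in, tree out" concrete, here is a minimal sketch that uses only Python's standard-library `html.parser`. The `TreeBuilder` class and the `outline` helper are hypothetical names made up for this illustration, and the nested dict is only a toy stand-in for a real DOM. It assumes well-formed input where every opened tag is closed.

```python
from html.parser import HTMLParser


class TreeBuilder(HTMLParser):
    """Builds a nested dict tree from HTML (a toy stand-in for a DOM)."""

    def __init__(self):
        super().__init__()
        self.root = {"tag": "[document]", "children": []}
        self.stack = [self.root]  # path from the root to the currently open element

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)
        self.stack.append(node)

    def handle_endtag(self, tag):
        # Assumes well-formed input: an end tag closes the innermost open element.
        if len(self.stack) > 1:
            self.stack.pop()

    def handle_data(self, data):
        if data.strip():  # skip whitespace-only text
            self.stack[-1]["children"].append(data)


def outline(node, depth=0):
    """Return the tree as indented lines, one per node."""
    if isinstance(node, str):
        return ["  " * depth + repr(node)]
    lines = ["  " * depth + node["tag"]]
    for child in node["children"]:
        lines.extend(outline(child, depth + 1))
    return lines


builder = TreeBuilder()
builder.feed("<html><head><title>Hello</title></head><body><p>Hello HTML!</p></body></html>")
tree_lines = outline(builder.root)
print("\n".join(tree_lines))
```

The printed outline shows `head` and `body` nested under `html`, which is the same parent-child structure that BeautifulSoup exposes through its own objects.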

BeautifulSoup

- A library for parsing HTML and XML

HTML parser types

- lxml

- html5lib

- html.parser
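The parser name is passed as the second argument to `BeautifulSoup`. The sketch below tries all three on a deliberately broken snippet to show that they can repair invalid markup differently. `lxml` and `html5lib` are separate packages, so the code assumes they might be missing and records `None` in that case. The `broken` and `results` names are only for this illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

broken = "<a></p>"  # deliberately invalid: a stray closing tag

results = {}
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        results[parser] = str(BeautifulSoup(broken, parser))
    except FeatureNotFound:
        # lxml and html5lib are third-party packages and may not be installed
        results[parser] = None

for parser, markup in results.items():
    print(parser, "->", markup)
```

`html.parser` ships with Python, so it always works wherever bs4 is installed. The other two need `pip install lxml` or `pip install html5lib`.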

HTML parsing

Using BeautifulSoup

from bs4 import BeautifulSoup
html = """<html><head><title>test site</title></head><body><p>test1</p><p>test2</p></body></html>"""
soup = BeautifulSoup(html, 'lxml')

print(soup)
>>> <html><head><title>test site</title></head><body><p>test1</p><p>test2</p></body></html>

print(type(soup))
>>> <class 'bs4.BeautifulSoup'>

Pretty-printing HTML

Using BeautifulSoup.prettify()

print(soup.prettify())
>>> <html>
 <head>
  <title>
   test site
  </title>
 </head>
 <body>
  <p>
   test1
  </p>
  <p>
   test2
  </p>
 </body>
</html>

Accessing tags

Use .<tagname> on a BeautifulSoup object

- Returns the first tag with that name in the document.

html = """<html><head><title>test site</title></head><body><p>test1</p><p>test2</p></body></html>"""
soup = BeautifulSoup(html, 'lxml')

tag_title = soup.title
print(tag_title)
>>> <title>test site</title>

print(type(soup), type(tag_title))
>>> <class 'bs4.BeautifulSoup'> <class 'bs4.element.Tag'>
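Because attribute-style access returns only the first match, the second `<p>` in the sample document is never reached this way. A short sketch of the difference follows. `find_all` is not covered in the notes above and is brought in here only for contrast. The built-in `html.parser` is used so the example does not depend on lxml being installed.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>test site</title></head><body><p>test1</p><p>test2</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

first_p = soup.p             # only the first <p>
all_p = soup.find_all("p")   # every <p>, in document order
missing = soup.table         # there is no <table>, so this is None

print(first_p)
print([p.text for p in all_p])
print(missing)
```

Accessing a tag that does not exist returns `None` rather than raising an error, so it is worth checking the result before chaining further attribute access onto it.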

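Tag object attributes

- Tag.text : the text contained in the tag

- Tag.string : the tag's text, available only when its descendants contain exactly one piece of text

- Tag.name : the name of the tag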

html = """<html><head><title>test site</title></head><body><p>test1</p><p>test2</p></body></html>"""
soup = BeautifulSoup(html, 'lxml')
tag_title = soup.title

print(tag_title.text)
>>> test site

print(tag_title.string)
>>> test site

print(tag_title.name)
>>> title

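Tag attributes

- Use .attrs to access all of a tag's attributes

tag_title = BeautifulSoup('<title class="t" id="ti">test site</title>', 'lxml').title  # title with class and id
print(tag_title.attrs)
>>> {'class': ['t'], 'id': 'ti'}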

- Accessing the title tag's attributes

- Put the attribute name inside square brackets [].

- The class attribute is returned as a list because it can hold multiple values.

print(tag_title['class'])
>>> ['t']

print(tag_title['id'])
>>> ti
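The list form of `class` is easier to see when a tag carries more than one class. In this small sketch the `<p>` tag and its attribute values are invented for the illustration.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p class="intro lead" id="first">hello</p>', "html.parser")
tag_p = soup.p

classes = tag_p["class"]  # multi-valued attribute: returned as a list
tag_id = tag_p["id"]      # single-valued attribute: returned as a plain string

print(classes, tag_id)
print("lead" in classes)  # membership test works on the list
```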

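- Accessing an attribute that does not exist raises an error (KeyError).

- With get(), accessing a missing attribute does not raise an error; it returns None, or the default value passed as the second argument.

print(tag_title.get('class'))
print(tag_title.get('id'))
print(tag_title.get('size'))
print(tag_title.get('size','just ok'))

>>> ['t']
>>> ti
>>> None
>>> just ok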
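For comparison, this sketch shows what the error from square-bracket access looks like in practice, along with `has_attr()`, which is one more way to check for an attribute before reading it. The `"unknown"` fallback value is an arbitrary choice for the example.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<title class="t" id="ti">test site</title>', "html.parser")
tag_title = soup.title

try:
    size = tag_title["size"]   # the tag has no size attribute
except KeyError:
    size = "unknown"           # fallback chosen for this example

has_id = tag_title.has_attr("id")
has_size = tag_title.has_attr("size")

print(size, has_id, has_size)
```

In most scraping code `get()` with a default is shorter than `try`/`except`, since real pages often omit attributes.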

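Tag.text vs. Tag.string

- Tag.text : concatenates the text of all descendant tags and returns it.

- Tag.string : returns a string only when the tag's descendants contain exactly one piece of text.

If the tag contains more than one piece of text, it returns None.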

html = """<html><head><title class="t" id="ti">test site</title></head><body><p><span>test1</span><span>test2</span></p></body></html>"""
soup = BeautifulSoup(html, 'lxml')

tag_p = soup.p
data_text = tag_p.text
data_string = tag_p.string

print('text:',data_text, type(data_text))
>>> text: test1test2 <class 'str'>

print('string:',data_string,type(data_string))
>>> string: None <class 'NoneType'>

print(tag_p.span.string)
>>> test1
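`Tag.text` glues the pieces together with nothing in between (`test1test2`). If a separator is wanted, `get_text()` accepts one, and `.strings` yields the pieces one at a time. Neither is covered in the notes above; this is a small supplementary sketch on the same `<p>` snippet.

```python
from bs4 import BeautifulSoup

html = "<p><span>test1</span><span>test2</span></p>"
soup = BeautifulSoup(html, "html.parser")
tag_p = soup.p

joined = tag_p.get_text(" ", strip=True)  # join the pieces with a space
pieces = list(tag_p.strings)              # each piece of text separately

print(joined)
print(pieces)
```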

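HTML tag relationships

- Tag relationships

- Parent-child relationship

- <head> is the parent tag of <title>.

- <title> is a child tag of <head>.

- Sibling relationship

- <head> and <body> are siblings.

Getting child tags

.contents

- Returns all child nodes as a list, including text nodes such as '\n'.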

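# html reconstructed from the output below; line breaks between tags produce the '\n' entries
html = "<html>\n<head><title>test site</title></head>\n<body><p><a>test1</a><b>test2</b><c>test3</c></p></body>\n</html>"
soup = BeautifulSoup(html, 'lxml')
print(soup.html.contents)
>>> ['\n', <head><title>test site</title></head>, '\n', <body><p><a>test1</a><b>test2</b><c>test3</c></p></body>, '\n']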

.children

- Returns the child nodes as an iterator.

print(list(soup.html.children))
>>> ['\n', <head><title>test site</title></head>, '\n', <body><p><a>test1</a><b>test2</b><c>test3</c></p></body>, '\n']

for tag in soup.html.children:
    # text nodes have name None, so this keeps only the tags
    if tag.name:
        print(tag)

>>> <head><title>test site</title></head>
>>> <body><p><a>test1</a><b>test2</b><c>test3</c></p></body>
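Both `.contents` and `.children` stop at direct children. To walk the whole subtree, bs4 also provides `.descendants`, which is not covered in the notes above. The sketch below compares the two on the same sample document, written without line breaks so no whitespace text nodes appear.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<html><head><title>test site</title></head><body><p><a>test1</a><b>test2</b><c>test3</c></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# direct children only
child_names = [c.name for c in soup.html.children if isinstance(c, Tag)]
# every tag nested anywhere below <html>, in document order
descendant_names = [d.name for d in soup.html.descendants if isinstance(d, Tag)]

print(child_names)
print(descendant_names)
```

The `isinstance(..., Tag)` check does the same job as the `if tag.name` test: it filters out text nodes.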

Getting the parent tag

Using .parent

print(soup.title.parent)
print(soup.p.parent)

>>> <head><title>test site</title></head>
>>> <body><p><a>test1</a><b>test2</b><c>test3</c></p></body>
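`.parent` goes up one level. To follow the chain all the way to the top, `.parents` iterates over every ancestor; it is a supplementary attribute not covered in the notes above. A minimal sketch, starting from the `<a>` tag:

```python
from bs4 import BeautifulSoup

html = "<html><head><title>test site</title></head><body><p><a>test1</a><b>test2</b><c>test3</c></p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# walk upward from <a> to the top of the tree
ancestors = [parent.name for parent in soup.a.parents]
print(ancestors)
```

The last entry, `[document]`, is the name of the BeautifulSoup object itself, which sits above `<html>`.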

Getting sibling tags

.find_next_sibling() : gets the next sibling

.find_previous_sibling() : gets the previous sibling

print(soup.head.find_next_sibling())
print(soup.body.find_previous_sibling())

>>> <body><p><a>test1</a><b>test2</b><c>test3</c></p></body>
>>> <head><title>test site</title></head>
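When the HTML contains line breaks between tags, the plain `.next_sibling` attribute can return a whitespace string instead of a tag. The sketch below shows this and passes a tag name to `find_next_sibling()` to get the tag that was wanted. The one-line document here is a shortened version of the sample, with a single `\n` placed between `<head>` and `<body>` on purpose.

```python
from bs4 import BeautifulSoup

html = "<html><head><title>test site</title></head>\n<body><p>test1</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

raw_next = soup.head.next_sibling               # the very next node: the newline text
next_tag = soup.head.find_next_sibling("body")  # the next sibling that is a <body> tag
prev_tag = soup.body.find_previous_sibling("head")

print(repr(raw_next))
print(next_tag.name, prev_tag.name)
```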