2021 Second Semester Data Crawling Final Project

Last update: Aug 16, 2022

Notice

Topic

A job-posting scheduler built with web crawling

Schedule

Choose a topic
Write the code
Explain the core code + plan the PPT structure // Sat, 12/4
Prepare the PPT + script + record // Fri to Sat, ~12/10 to 12/11
Edit the video // by Sat, 12/11

Web crawler

Saramin average salary (1,000 entries)

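Background on topic selection

Heading into my final semester and getting ready to enter the job market, I had a lot to think about. Leaving the small society of school for a bigger pond was frightening at first. Just as you warm up before swimming, I decided that the first step toward getting hired was to collect recruitment information.
I wanted to know the trends within IT and which fields are hiring the most. To that end, I checked the User-Agent rules on Stack Overflow and then crawled its job postings.
I also wanted to know the average salary people earn in each field in Korea, so I checked the User-Agent rules on Saramin, one of several job sites, and then crawled its average salary information. I crawled only the 1,000 most recent entries. (10,000 would probably work too.)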
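
The User-Agent check described above can be sketched as follows. This is only a minimal sketch, and it assumes the check refers to the `User-agent` rules in a site's robots.txt. The sample rules, the `my-crawler` agent name, and the example.com URLs are hypothetical placeholders, not the actual rules of Saramin or Stack Overflow.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here instead of a live request.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_crawl_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical agent name and URLs, for illustration only.
allowed = is_crawl_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/jobs/page/1")
blocked = is_crawl_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/private/data")
print(allowed, blocked)
```

For a real site, `RobotFileParser.set_url()` followed by `read()` would load the live robots.txt instead of the in-memory sample.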

Data collection method

I decided to scrape job postings from Saramin and Stack Overflow.
Separate crawler files (salary information, job postings) export the data as CSV.
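
The CSV export step might look roughly like the sketch below. The column names (`company`, `average_salary`) and the sample row are hypothetical, since the actual crawler files are not shown here. The sketch writes to an in-memory buffer so it runs without touching the disk.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Serialize a list of dict rows into CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder row, not real crawled data.
sample_rows = [{"company": "Example Corp", "average_salary": 4000}]
csv_text = rows_to_csv(sample_rows, ["company", "average_salary"])
print(csv_text)
```

To write a real file, the same `DictWriter` calls work on a file opened with `open(path, "w", newline="", encoding="utf-8")`.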

Explanation of the core crawling code

The salary information file is fully commented.

Analysis method

Keyword frequency analysis
Keyword importance analysis
Text mining
Links referenced
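
The first two analysis steps listed above can be sketched as follows. This sketch assumes that "keyword importance" means a TF-IDF style score, which the notes do not confirm, and it splits on whitespace instead of using a real tokenizer. The sample postings are made-up placeholders.

```python
import math
from collections import Counter


def keyword_frequency(documents):
    """Count how often each whitespace-separated keyword appears overall."""
    counts = Counter()
    for doc in documents:
        counts.update(doc.lower().split())
    return counts


def tf_idf(documents):
    """Return one {term: score} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    scores = []
    for tokens in tokenized:
        term_counts = Counter(tokens)
        scores.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in term_counts.items()
        })
    return scores


# Placeholder posting titles, not real crawled data.
postings = [
    "backend developer python",
    "frontend developer react",
    "backend engineer java",
]
frequency = keyword_frequency(postings)
importance = tf_idf(postings)
print(frequency.most_common(2))
```

A term that appears in every document gets a score of zero, which is why importance can rank keywords differently from raw frequency.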

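Conclusion

The average domestic salary in a given field is such-and-such!
These days, this particular area of IT is the global trend, and it is hiring a lot of people!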

References

Saramin website
Stack Overflow website

Difficulties encountered during the project

Among the sites whose User-Agent rules permit crawling, it was hard to find one whose URLs expose the page number.
By job role
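
The reason a page number in the URL matters is that the crawler can then generate every page address up front. A minimal sketch, assuming the page number is passed as a query parameter; the base URL and the `page` parameter name are hypothetical and differ from site to site.

```python
from urllib.parse import urlencode


def build_page_urls(base_url, first_page, last_page, page_param="page"):
    """Build one URL per page number, from first_page to last_page inclusive."""
    return [
        f"{base_url}?{urlencode({page_param: page})}"
        for page in range(first_page, last_page + 1)
    ]


# Hypothetical listing URL, for illustration only.
page_urls = build_page_urls("https://example.com/jobs", 1, 3)
print(page_urls)
```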

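PPT structure

[1] - Topic
[2] - Background on topic selection
[3] - Data collection method
[4] - Explanation of the core crawling source code
[5] - Analysis method/model
[6] - Conclusion
[7] - References
[8] - Difficulties encountered during the project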

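PPT detailed structure

Stack Overflow
- Number of openings by job type (Front/Back) (8 NCS IT job roles)
- Job types in demand by country
Saramin
- Top 5 and bottom 5 salaries among 1,000 arbitrary companies
  - The top ones are banks or other industries
  - The bottom ones are service industries
- Salary range by company type (small/mid-sized/large)
- Salary range by industry
- Does the salary range differ between KOSDAQ and KOSPI?
Help current job seekers judge which job role suits them better -> conclusion
- Show results by demand per job role (Stack Overflow)
- Show results for those who prioritize salary (Saramin)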

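Analysis results

Stack Overflow
- Number of openings by job type (Front/Back) (8 NCS IT job roles)
  - Write the analysis results here
  - That is, roughly, fill them in below
  - Front / Back
  - For each of the 8 job roles
- Job types in demand by country
Saramin
- Top 5 and bottom 5 salaries among 1,000 arbitrary companies
  - The top ones are banks or other industries
  - The bottom ones are service industries
- Salary range by company type (small/mid-sized/large)
- Salary range by industry
- Does the salary range differ between KOSDAQ and KOSPI?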
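
The Saramin breakdowns listed above (top and bottom salaries, salary range per group) can be computed along these lines. This is a sketch under assumptions: the field names (`company`, `company_type`, `average_salary`) are hypothetical, and the records are placeholders, not real analysis results.

```python
from collections import defaultdict

# Placeholder records, NOT real crawl results. Field names are hypothetical.
SAMPLE_RECORDS = [
    {"company": "A", "company_type": "large", "average_salary": 7000},
    {"company": "B", "company_type": "large", "average_salary": 6000},
    {"company": "C", "company_type": "small", "average_salary": 3000},
    {"company": "D", "company_type": "mid-sized", "average_salary": 4500},
    {"company": "E", "company_type": "small", "average_salary": 3500},
]


def salary_range_by_group(records, group_key, salary_key="average_salary"):
    """Return {group: (lowest salary, highest salary)} for the given grouping field."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record[group_key]].append(record[salary_key])
    return {group: (min(values), max(values)) for group, values in grouped.items()}


def top_and_bottom(records, n, salary_key="average_salary"):
    """Return the n highest-paying and n lowest-paying records, sorted high to low."""
    ordered = sorted(records, key=lambda record: record[salary_key], reverse=True)
    return ordered[:n], ordered[-n:]


ranges = salary_range_by_group(SAMPLE_RECORDS, "company_type")
top, bottom = top_and_bottom(SAMPLE_RECORDS, 2)
print(ranges)
```

Passing a different `group_key`, such as an industry or a KOSDAQ/KOSPI listing field, would give the other breakdowns with the same function.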

Owner

Choi Eun Jeong

Frontend Developer with React & React Native
