
[1] DU Penghui, QIU Jiyang, PENG Shutao, et al. Design and implementation of Web crawler based on Scrapy[J]. Electronic Design Engineering (电子设计工程), 2019, 27(22): 120-123.

Electronic Design Engineering (《电子设计工程》) [ISSN: 1674-6236 / CN: 61-1477/TN]

Volume:
27
Issue:
2019, No. 22
Pages:
120-123
Section:
Networks and Communications
Publication date:
2019-11-20

Article Info

Title:
Design and implementation of Web crawler based on Scrapy
Article ID:
1674-6236(2019)22-0120-04
Author(s):
DU Peng-hui¹ (杜鹏辉), QIU Ji-yang¹ (仇继扬), PENG Shu-tao¹ (彭书涛), CHAI Feng-wei²,³ (柴沣伟), LIU Yi-xian² (刘意先)
(1. State Grid Shaanxi Electric Power Company, Xi’an 710048, China; 2. Xi’an University of Posts and Telecommunications, Xi’an 710121, China; 3. ZTE Corporation, Xi’an 710065, China)
Keywords:
big data; Web crawler; Python; Scrapy framework
CLC number:
TN919
DOI:
-
Document code:
A
Abstract:
With the development of information technology, network data has become an important asset, and extracting and analyzing it quickly and effectively is an active research topic. To handle the massive volume of data on the network, a web crawler is designed on the Scrapy framework to extract the data. First, the paper describes how to install and invoke Scrapy under Python and create the corresponding crawler project. Next, the structure of the target website's page source is analyzed, the required data is located within the tags, and expressions are designed accordingly to extract that data into a unified data structure. Finally, the data is saved to a file for persistent storage. The design method can provide data acquisition and analysis support for a wide range of Web-based network data analysis projects.
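The extract-then-persist workflow described in the abstract can be sketched roughly as follows. This is only an illustration, not the authors' code, which does not appear on this page. The page fragment, the tag and class names, and the `Article` fields are all hypothetical. To keep the sketch runnable without installing anything, it uses Python's standard library as a stand-in for Scrapy: in a Scrapy project the same path expressions would typically go through `response.xpath(...)` inside a spider's `parse` method, the record would be a `scrapy.Item`, and the file would come from a feed export such as `scrapy crawl <name> -o items.json`.

```python
import json
import xml.etree.ElementTree as ET
from dataclasses import dataclass, asdict

# Hypothetical page fragment. It is kept well-formed so the stdlib XML
# parser can read it; real HTML usually needs a tolerant parser.
SAMPLE_PAGE = """
<html><body>
  <div class="article">
    <h2 class="title">First post</h2>
    <span class="author">Alice</span>
  </div>
  <div class="article">
    <h2 class="title">Second post</h2>
    <span class="author">Bob</span>
  </div>
</body></html>
"""


@dataclass
class Article:
    """Unified data structure for one extracted record (hypothetical fields)."""
    title: str
    author: str


def extract_articles(page):
    """Locate each record by its tag and pull the fields out with path expressions."""
    root = ET.fromstring(page.strip())
    items = []
    # ElementTree supports a limited XPath subset, including [@attr='value'].
    for node in root.findall(".//div[@class='article']"):
        title = node.findtext("h2[@class='title']", default="").strip()
        author = node.findtext("span[@class='author']", default="").strip()
        items.append(Article(title=title, author=author))
    return items


def save_articles(items, path):
    """Persist the extracted records to a JSON file."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump([asdict(item) for item in items], fh, ensure_ascii=False, indent=2)
```

Calling `save_articles(extract_articles(SAMPLE_PAGE), "articles.json")` writes the two sample records to disk. Keeping extraction and storage in separate functions loosely mirrors the split Scrapy makes between spiders and item pipelines or feed exports.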

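Memo

Received: 2019-05-05. Manuscript No.: 201905010.
Funding: National Natural Science Foundation of China (61671377); State Grid Shaanxi Electric Power scientific research project (2018256).
About the author: DU Peng-hui (b. 1979), male, from Xi’an, Shaanxi; master's degree; engineer. Research interests: intelligent information processing and modern corporate governance.
Last Update: 2019-11-25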