Web Mining of Firm Websites: A Framework for Web Scraping and a Pilot Study for Germany

37 Pages. Posted: 31 Aug 2018. Last revised: 7 Feb 2020

Jan Kinne

Centre for European Economic Research (ZEW)

Janna Axenbeck

ZEW – Leibniz Centre for European Economic Research

Date Written: 2018

Abstract

Nowadays, almost all (relevant) firms have their own websites, which they use to publish information about their products and services. Using the example of innovation in firms, we outline a framework for extracting information from firm websites using web scraping and data mining. For this purpose, we present an easy-to-use and free web scraping tool for large-scale data retrieval from firm websites. We apply this tool in a large-scale pilot study to provide information on the data source (i.e., the population of firm websites in Germany), which has not yet been studied rigorously in terms of its qualitative and quantitative properties. We find, inter alia, that the use of websites and websites’ characteristics (number of subpages and hyperlinks, text volume, language used) differ according to firm size, age, location, and sector. Web-based studies also have to contend with distinct outliers and the fact that low broadband availability appears to prevent firms from operating a website. Finally, we propose two approaches based on neural network language models and social network analysis to derive firm-level information from the extracted web data.

Keywords: Web Mining, Web Scraping, R&D, R&I, STI, Innovation, Indicators, Text Mining

JEL Classification: O30, C81, C88

Suggested Citation

Kinne, Jan and Axenbeck, Janna, Web Mining of Firm Websites: A Framework for Web Scraping and a Pilot Study for Germany (2018). ZEW - Centre for European Economic Research Discussion Paper No. 18-033, Available at SSRN: https://ssrn.com/abstract=3240470 or http://dx.doi.org/10.2139/ssrn.3240470

Jan Kinne (Contact Author)

Centre for European Economic Research (ZEW)

P.O. Box 10 34 43
L 7,1
D-68034 Mannheim
Germany

Janna Axenbeck

ZEW – Leibniz Centre for European Economic Research

P.O. Box 10 34 43
L 7,1
D-68034 Mannheim
Germany
