Data Science: State of the Art and Trends

DOI: 10.4236/dsi.2020.11002, pp. 22-49

Keywords: Data Science, Big Data, Data Products, Data Wrangling, Data-Driven

Abstract:

The arrival of the big data era has given rise to a new discipline called Data Science. This paper first discusses a brief history of Data Science along with its interdisciplinarity, theoretical framework, and taxonomy. It then proposes the differences between domain-general and domain-specific Data Science, based on literature reviews of hot topics in big data-related studies. In addition, ten common debates in Data Science are described, covering the thinking pattern, properties of big data, enablers of intelligence, bottlenecks in data product development, data preparation, quality of services, big data analysis, evaluation of big data algorithms, the fourth paradigm, and the big data skills shortage. Moreover, the emerging trends in Data Science are presented: shifts in data analysis methodologies, adoption of model integration and meta-analysis, introduction of the "data first, schema later or never" paradigm, rethinking of data consistency in big data systems, recognition of data replication and data locality, growth in integrated data applications, changes in the complexity of data computing, the advent of data products, the rise of pro-ams and citizen data science, and the increasing demand for data scientists. In conclusion, some suggestions for further studies are proposed: to avoid misconstruing Data Science, to take advantage of the active property of big data, to balance the three dimensions of Data Science, to introduce Design of Experiments, to embrace causality analysis, and to develop data products.
