The arrival of the big data era has given rise to a new discipline called Data Science. First, a brief history of Data Science, its interdisciplinarity, its theoretical framework, and its taxonomy are discussed. Then, the differences between domain-general Data Science and domain-specific Data Science are set out, based on a literature review of hot topics in big data-related studies. In addition, ten common debates in Data Science are described, concerning thinking patterns, the properties of big data, the enablers of intelligence, bottlenecks in data product development, data preparation, quality of services, big data analysis, the evaluation of big data algorithms, the fourth paradigm, and the shortage of big data skills. Moreover, the emerging trends in Data Science are presented: shifts in data analysis methodologies, the adoption of model integration and meta-analysis, the introduction of the "data first, schema later or never" paradigm, the rethinking of data consistency in big data systems, the recognition of data replication and data locality, growth in integrated data applications, changes in the complexity of data computing, the advent of data products, the rise of pro-ams and citizen data science, and the increasing demand for data scientists. In conclusion, several suggestions for further study are proposed: to avoid misconstruing Data Science, to take advantage of the active property of big data, to balance the three dimensions of Data Science, to introduce Design of Experiments, to embrace causality analysis, and to develop data products.