data

Prediction method of college students' achievements based on learning behaviour data mining

This paper proposes a method for predicting college students' performance based on learning behaviour data mining. The method addresses the issue of limited sample size affecting prediction accuracy. It utilises the K-means clustering algorithm to mine learning behaviour data and employs a density-based approach to determine optimal clustering centres, which are then output as the results of the clustering process. These clustering results are used as input for an attention encoder-decoder model to extract features from the learning behaviour sequence, incorporating an attention mechanism, sequence feature generator, and decoder. The characteristics derived from the learning behaviour sequence are then used to establish a prediction model for college students' performance, employing support vector regression. Experimental results demonstrate that this method accurately predicts students' performance with a relative error of less than 4% by leveraging the results obtained from learning behaviour data mining.
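The abstract describes a pipeline of density-based centre selection, K-means clustering of learning behaviour data, and support vector regression on the resulting features. Below is a minimal sketch of that pipeline, assuming scikit-learn and synthetic behaviour data; the density-based centre rule, the feature columns, and the score target are illustrative stand-ins, and the attention encoder-decoder stage described in the paper is omitted.

```python
# Minimal sketch (assumptions: scikit-learn available; synthetic data stands in
# for real learning behaviour records; the attention encoder-decoder stage is omitted).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVR
from sklearn.metrics.pairwise import euclidean_distances

rng = np.random.default_rng(0)
X = rng.random((200, 5))                              # learning behaviour features (hypothetical)
y = 60 + 8 * X.sum(axis=1) + rng.normal(0, 1, 200)    # performance scores (hypothetical)

# Density-based choice of initial centres: pick points with the most neighbours
# within a fixed radius, keeping the chosen centres mutually far apart.
def density_centres(X, k, radius=0.4):
    d = euclidean_distances(X)
    density = (d < radius).sum(axis=1)
    order = np.argsort(-density)
    centres = [X[order[0]]]
    for i in order[1:]:
        if all(np.linalg.norm(X[i] - c) > radius for c in centres):
            centres.append(X[i])
        if len(centres) == k:
            break
    return np.array(centres)

centres = density_centres(X, k=4)
km = KMeans(n_clusters=4, init=centres, n_init=1, random_state=0).fit(X)

# Use the cluster assignment as an extra feature for the SVR performance model.
X_aug = np.column_stack([X, km.labels_])
model = SVR(kernel="rbf").fit(X_aug, y)
print("Mean relative error:",
      np.mean(np.abs(model.predict(X_aug) - y) / y))
```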




data

International Journal of Business Intelligence and Data Mining




data

A risk identification method for abnormal accounting data based on weighted random forest

To improve the identification precision, accuracy, and time consumption of traditional financial risk identification methods, this paper proposes a risk identification method for abnormal accounting data based on a weighted random forest. Firstly, the SMOTE algorithm is used to oversample the sparse abnormal financial data; secondly, the original accounting data is decomposed into features, and the features of abnormal data are extracted through random forests; then, the index weights are calculated according to the entropy weight method; finally, negative gradient fitting is used to determine the loss function, the weighted random forest method is used to solve for the loss function value, and the recognition result is obtained. The results show that the identification precision of this method can reach 99.9%, its accuracy can reach 96.06%, and the time consumption is only 6.8 seconds, indicating that the method identifies risk effectively.
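The abstract names SMOTE oversampling followed by a weighted random forest. The sketch below, assuming scikit-learn and imbalanced-learn are installed and using synthetic data in place of real accounting records, shows that combination; the paper's entropy-weight indexing and negative-gradient loss fitting are approximated here by simple class weighting, so this is a simplified stand-in rather than the authors' exact procedure.

```python
# Simplified stand-in (assumptions: scikit-learn and imbalanced-learn installed;
# synthetic data replaces real accounting records; the entropy-weight and
# negative-gradient steps of the paper are approximated by class weighting).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# Imbalanced "accounting" data: roughly 5% abnormal records.
X, y = make_classification(n_samples=2000, n_features=12, weights=[0.95, 0.05],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# SMOTE oversamples the minority (abnormal) class before training.
X_res, y_res = SMOTE(random_state=0).fit_resample(X_tr, y_tr)

# "Weighted" random forest: class weights bias the trees toward abnormal records.
rf = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                            random_state=0).fit(X_res, y_res)
print("Abnormal-record recall:",
      rf.score(X_te[y_te == 1], y_te[y_te == 1]))
```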




data

A data mining method based on label mapping for long-term and short-term browsing behaviour of network users

To improve the speedup and accuracy of the recognition process, this paper designs a data mining method based on label mapping for the long-term and short-term browsing behaviour of network users. First, after removing noise from the behaviour sequences, the similarity of behavioural characteristics is calculated. Then, multi-source behaviour data is mapped to the same dimension, and a behaviour label mapping layer and a behaviour data mining layer are established. Finally, the similarity of the label matrix is calculated based on the similarity calculation results, and the mining results are output using an SVM binary classification process. Experimental results show that the speedup ratio of this method exceeds 0.9 and that the area under the receiver operating characteristic curve (AUC-ROC) increases rapidly within a short time, reaching a maximum of 0.95, indicating that the mining precision of this method is high.
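A rough sketch of the pipeline the abstract outlines is given below, assuming scikit-learn: two behaviour sources are mapped to a shared dimension, a similarity score links the two views, and an SVM performs the final binary classification. The random feature matrices, the projection used in place of the paper's label mapping layer, and the labels are all hypothetical.

```python
# Minimal sketch (assumptions: scikit-learn available; random vectors stand in
# for long-term and short-term browsing features; the paper's label mapping
# layer is reduced to a simple projection to a shared dimension).
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.svm import SVC

rng = np.random.default_rng(1)
long_term = rng.random((300, 40))    # long-term behaviour features (hypothetical)
short_term = rng.random((300, 15))   # short-term behaviour features (hypothetical)
labels = rng.integers(0, 2, 300)     # target behaviour class (hypothetical)

# Map both sources to the same dimension, mimicking the label mapping layer.
dim = 10
long_mapped = TruncatedSVD(n_components=dim, random_state=0).fit_transform(long_term)
short_mapped = TruncatedSVD(n_components=dim, random_state=0).fit_transform(short_term)

# Per-user similarity between the two views becomes part of the feature set.
sim = cosine_similarity(long_mapped, short_mapped).diagonal().reshape(-1, 1)
features = np.hstack([long_mapped, short_mapped, sim])

# SVM binary classification outputs the mining result.
clf = SVC(kernel="rbf").fit(features, labels)
print("Training accuracy:", clf.score(features, labels))
```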




data

Research on fast mining of enterprise marketing investment data based on improved association rules

Because traditional enterprise marketing investment data mining methods suffer from low mining precision and slow mining speed, a fast mining method for enterprise marketing investment data based on improved association rules is proposed. First, the enterprise marketing investment data is collected through a crawler framework, and the collected data is cleaned. Then, features are extracted from the cleaned data, and the degree of correlation between features is calculated. Finally, according to the calculation results, all data items are used as constraints to reduce the number of frequent itemsets, and a pruning strategy is designed in advance. Combined with these constraints, the Apriori association rule algorithm is improved, and the improved algorithm is used to compute all frequent itemsets and obtain fast mining results for the enterprise marketing investment data. The experimental results show that the proposed method mines enterprise marketing investment data quickly and accurately.
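To make the constrained-Apriori idea concrete, here is a compact level-wise Apriori sketch in plain Python with a minimum support threshold and an illustrative item constraint used as a pruning step. The toy transactions and the constraint are invented; the paper's actual constraints and pruning rules are not given in the abstract.

```python
# Compact Apriori sketch (assumption: the constraint below is illustrative only;
# the paper's actual pruning rules are not specified in the abstract).
from itertools import combinations

transactions = [
    {"tv_ads", "social_media", "sponsorship"},
    {"tv_ads", "social_media"},
    {"social_media", "email", "sponsorship"},
    {"tv_ads", "social_media", "email"},
    {"social_media", "email"},
]
min_support = 0.4
constraint = lambda itemset: "social_media" in itemset   # example constraint

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

# Level-wise search: size-k candidates are built from frequent (k-1)-itemsets,
# and the constraint prunes candidates before support counting.
items = {frozenset([i]) for t in transactions for i in t}
frequent, current = {}, {s for s in items if support(s) >= min_support}
k = 1
while current:
    frequent.update({s: support(s) for s in current if constraint(s)})
    k += 1
    candidates = {a | b for a in current for b in current if len(a | b) == k}
    current = {c for c in candidates
               if constraint(c) and support(c) >= min_support}

for itemset, sup in sorted(frequent.items(), key=lambda x: -x[1]):
    print(sorted(itemset), round(sup, 2))
```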




data

General Data Protection Regulation: new ethical and constitutional aspects, along with new challenges to information law

The EU 'General Data Protection Regulation' (GDPR) marked the most important step towards reforming data privacy regulation in recent years, as it has brought about significant changes in data processing in various sectors, ranging from healthcare to banking and beyond. Various concerns have been raised, and as a consequence of these, certain parts of the text of the GDPR itself have already started to become questionable due to rapid technological progress, including, for example, the use of information technology, automation processes and advanced algorithms in individual decision-making activities. The road to GDPR compliance by all European Union members may prove to be a long one, and it is clear that only time will tell how GDPR matters will evolve and unfold. In this paper, we aim to offer a review of the practical, ethical and constitutional aspects of the new regulation and examine all the controversies that the new technology has given rise to in the course of the regulation's application.




data

Visualizing Research Data Records for their Better Management

As academia in general, and research funders in particular, place ever greater importance on data as an output of research, so the value of good research data management practices becomes ever more apparent. In response to this, the Innovative Design and Manufacturing Research Centre (IdMRC) at the University of Bath, UK, with funding from the JISC, ran a project to draw up a data management planning regime. In carrying out this task, the ERIM (Engineering Research Information Management) Project devised a visual method of mapping out the data records produced in the course of research, along with the associations between them. This method, called Research Activity Information Development (RAID) Modelling, is based on the Unified Modelling Language (UML) for portability. It is offered to the wider research community as an intuitive way for researchers both to keep track of their own data and to communicate this understanding to others who may wish to validate the findings or re-use the data.




data

FISHNet: encouraging data sharing and reuse in the freshwater science community

This paper describes the FISHNet project, which developed a repository environment for the curation and sharing of data relating to freshwater science, a discipline whose research community is distributed thinly across a variety of institutions, and usually works in relative isolation as individual researchers or within small groups. As in other “small sciences”, these datasets tend to be small and “hand-crafted”, created to address particular research questions rather than with a view to reuse, so they are rarely curated effectively, and the potential for sharing and reusing them is limited. The paper addresses a variety of issues and concerns raised by freshwater researchers as regards data sharing, describes our approach to developing a repository environment that addresses these concerns, and identifies the potential impact of the system within the research community.




data

Sheer Curation of Experiments: Data, Process, Provenance

This paper describes an environment for the “sheer curation” of the experimental data of a group of researchers in the fields of biophysics and structural biology. The approach involves embedding data capture and interpretation within researchers' working practices, so that it is automatic and invisible to the researcher. The environment does not capture just the individual datasets generated by an experiment, but the entire workflow that represents the “story” of the experiment, including intermediate files and provenance metadata, so as to support the verification and reproduction of published results. As the curation environment is decoupled from the researchers’ processing environment, the provenance is inferred from a variety of domain-specific contextual information, using software that implements the knowledge and expertise of the researchers. We also present an approach to publishing the data files and their provenance according to linked data principles by using OAI-ORE (Open Archives Initiative Object Reuse and Exchange) and OPMV (Open Provenance Model Vocabulary).




data

Beyond The Low Hanging Fruit: Data Services and Archiving at the University of New Mexico

Open data is becoming increasingly important in research. While individual researchers are slowly becoming aware of the value, funding agencies are taking the lead by requiring data be made available, and also by requiring data management plans to ensure the data is available in a useable form. Some journals also require that data be made available. However, in most cases, “available upon request” is considered sufficient. We describe a number of historical examples of data use and discovery, then describe two current test cases at the University of New Mexico. The lessons learned suggest that an institutional data services program needs not only to facilitate fulfilling the mandates of granting agencies but also to realize the true value of open data. Librarians and institutional archives should actively collaborate with their researchers. We should also work to find ways to make open data enhance a researcher's career. In the long run, better quality data and metadata will result if researchers are engaged and willing participants in the dissemination of their data.




data

Chempound - a Web 2.0-inspired repository for physical science data

Chempound is a new generation repository architecture based on RDF, semantic dictionaries and linked data. It has been developed to hold any type of chemical object expressible in CML (Chemical Markup Language) and is exemplified by crystallographic experiments and computational chemistry calculations. In both examples, the repository can hold >50k entries, which can be searched via SPARQL endpoints and pre-indexed key fields. The Chempound architecture is general and adaptable to other fields of data-rich science.
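Since the abstract highlights RDF storage searched through SPARQL, the following tiny rdflib example shows the kind of query such a repository endpoint would answer. The ex: vocabulary, entry identifiers and property names are invented for illustration and are not Chempound's actual schema.

```python
# Illustrative only (assumptions: rdflib installed; the ex: vocabulary below is
# invented for the example and is not Chempound's actual schema).
import rdflib

data = """
@prefix ex: <http://example.org/chem#> .
ex:entry42 a ex:CrystalStructure ;
    ex:formula "C6H6" ;
    ex:spaceGroup "Pbca" .
ex:entry43 a ex:Calculation ;
    ex:formula "H2O" ;
    ex:method "DFT" .
"""

g = rdflib.Graph()
g.parse(data=data, format="turtle")

# A SPARQL query of the sort a repository endpoint would answer:
# find all crystallographic entries and their chemical formulae.
q = """
PREFIX ex: <http://example.org/chem#>
SELECT ?entry ?formula WHERE {
    ?entry a ex:CrystalStructure ;
           ex:formula ?formula .
}
"""
for row in g.query(q):
    print(row.entry, row.formula)
```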




data

What to Teach Business Students in MIS Courses about Data and Information




data

A Data Model Validation Approach for Relational Database Design Courses




data

Restructuring an Undergraduate Database Management Course for Business Students




data

Measurement Data Logging via Bluetooth




data

Design, Development and Deployment Considerations when Applying Native XML Database Technology to the Programme Management Function of an SME




data

Exploring the Key Informational, Ethical and Legal Concerns to the Development of Population Genomic Databases for Pharmacogenomic Research




data

Advanced Data Clustering Methods of Mining Web Documents




data

Oracle Database Workload Performance Measurement and Tuning Toolkit




data

Reflecting on an Adventure-Based Data Communications Assignment: The ‘Cryptic Quest’ 




data

Meta-Analysis of Clinical Cardiovascular Data towards Evidential Reasoning for Cardiovascular Life Cycle Management




data

An Exploratory Survey in Collaborative Software in a Graduate Course in Automatic Identification and Data Capture




data

Blogs – The New Source of Data Analysis




data

A Data Driven Conceptual Analysis of Globalization — Cultural Affects and Hofstedian Organizational Frames: The Slovak Republic Example




data

Finding Diamonds in Data: Reflections on Teaching Data Mining from the Coal Face




data

Animated Courseware Support for Teaching Database Design




data

Data Modeling for Better Performance in a Bulletin Board Application




data

Derivation of Database Keys’ Operations




data

A Research Study for the Development of a SOA Middleware Prototype that used Web Services to Bridge the LMS to LOR Data Movement Interoperability Gap for Education




data

A Comparison Study of Impact Factor in Web of Science and Scopus Databases for Engineering Education and Educational Technology Journals




data

Transitioning from Data Storage to Data Curation: The Challenges Facing an Archaeological Institution




data

Planning an Iron Ore Mine: From Exploration Data to Informed Mining Decisions




data

Analyzing Computer Programming Job Trend Using Web Data Mining




data

Effectiveness of Combining Algorithm and Program Animation: A Case Study with Data Structure Course




data

Characterizing Big Data Management

Big data management is a reality for an increasing number of organizations in many areas and represents a set of challenges involving big data modeling, storage and retrieval, analysis and visualization. However, technological resources, people and processes are crucial to facilitate the management of big data in any kind of organization, allowing information and knowledge from a large volume of data to support decision-making. Big data management can be supported by three dimensions: technology, people and processes. Hence, this article discusses these dimensions: the technological dimension, which is related to the storage, analytics and visualization of big data; the human aspects of big data; and the process management dimension, which addresses big data management from both a technological and a business perspective.




data

A Data Science Enhanced Framework for Applied and Computational Math

Aim/Purpose: The primary objective of this research is to build an enhanced framework for Applied and Computational Math. This framework allows a variety of applied math concepts to be organized into a meaningful whole. Background: The framework can help students grasp new mathematical applications by comparing them to a common reference model. Methodology: In this research, we measure the most frequent words used in a sample of Math and Computer Science books. We combine these words with those obtained in an earlier study, from which we constructed our original Computational Math scale. Contribution: The enhanced framework improves the Computational Math scale by integrating selected concepts from the field of Data Science. Findings: The resulting enhanced framework better explains how abstract mathematical models and algorithms are tied to real world applications and computer implementations. Future Research: We want to empirically test our enhanced Applied and Computational Math framework in a classroom setting. Our goal is to measure how effective the use of this framework is in improving students’ understanding of newly introduced Math concepts.




data

Changing Paradigms of Technical Skills for Data Engineers

Aim/Purpose: This paper investigates the changing paradigms for technical skills that are needed by Data Engineers in 2018. Background: A decade ago, data engineers needed technical skills for Relational Database Management Systems (RDBMS), such as Oracle and Microsoft SQL Server. With the advent of Hadoop and NoSQL databases in recent years, Data Engineers require new skills to support the large distributed datastores (Big Data) that currently exist. Job demand for Data Scientists and Data Engineers has increased over the last five years. Methodology: The research methodology leveraged the Pig programming language running on MapReduce software hosted on the Amazon Web Services (AWS) Cloud. Data was collected from 100 Indeed.com job advertisements during July of 2017 and then uploaded to the AWS Cloud. Using MapReduce, phrases/words were counted and then sorted. The sorted phrase/word counts were then used to create the list of the 20 top skills needed by a Data Engineer based on the job advertisements. This list was compared to the 20 top skills for a Data Engineer presented by Stitch, which surveyed 6,500 Data Engineers in 2016. Contribution: This paper presents a list of the 20 top technical skills required by a Data Engineer.
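The study ran its phrase counting as a Pig/MapReduce job on AWS. As a small local stand-in, the sketch below mirrors the map (emit word-count pairs) and reduce (sum and sort) steps in plain Python; the job-advert strings are made up, not the study's Indeed.com data.

```python
# Local stand-in for the Pig/MapReduce counting step (the job-advert text below
# is made up; the real study processed 100 Indeed.com adverts on AWS).
from collections import Counter
from itertools import chain
import re

job_ads = [
    "Data Engineer with experience in Hadoop, Spark and SQL",
    "Seeking Data Engineer: Python, SQL, AWS, data pipelines",
    "Data Engineer role - Spark, Kafka, Python and cloud experience",
]

def map_phase(ad):
    # Emit (word, 1) pairs, the classic map step.
    return [(w, 1) for w in re.findall(r"[a-z]+", ad.lower())]

def reduce_phase(pairs):
    # Sum counts per word, the classic reduce step.
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return counts

counts = reduce_phase(chain.from_iterable(map_phase(ad) for ad in job_ads))
for word, n in counts.most_common(10):
    print(word, n)
```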




data

Machine Learning-based Flu Forecasting Study Using the Official Data from the Centers for Disease Control and Prevention and Twitter Data

Aim/Purpose: In the United States, the Centers for Disease Control and Prevention (CDC) tracks disease activity using data collected from medical practices on a weekly basis. Collection of data by CDC from medical practices on a weekly basis leads to a lag time of approximately 2 weeks before any viable action can be planned. The 2-week delay problem was addressed in the study by creating machine learning models to predict flu outbreaks. Background: The 2-week delay problem was addressed in the study by correlation of the flu trends identified from Twitter data and official flu data from the Centers for Disease Control and Prevention (CDC) in combination with creating a machine learning model using both data sources to predict flu outbreaks. Methodology: A quantitative correlational study was performed using a quasi-experimental design. Flu trends from the CDC portal and tweets with mention of flu and influenza from the state of Georgia were used over a period of 22 weeks from December 29, 2019 to May 30, 2020 for this study. Contribution: This research contributed to the body of knowledge by using a simple bag-of-words method for sentiment analysis followed by the combination of CDC and Twitter data to generate a flu prediction model with higher accuracy than using CDC data only. Findings: The study found that (a) there is no correlation between official flu data from CDC and tweets with mention of flu and (b) there is an improvement in the performance of a flu forecasting model based on a machine learning algorithm using both official flu data from CDC and tweets with mention of flu. Recommendations for Practitioners: In this study, it was found that there was no correlation between the official flu data from the CDC and the count of tweets with mention of flu, which is why tweets alone should be used with caution to predict a flu outbreak. Based on the findings of this study, social media data can be used as an additional variable to improve the accuracy of flu prediction models. It is also found that fourth-order polynomial and support vector regression models offered the best accuracy of flu prediction models. Recommendations for Researchers: Open-source data, such as Twitter feed, can be mined for useful intelligence benefiting society. Machine learning-based prediction models can be improved by adding open-source data to the primary data set. Impact on Society: A key implication of this study for practitioners in the field was to use social media postings to identify neighborhoods and geographic locations affected by seasonal outbreaks, such as influenza, which would help reduce the spread of the disease and ultimately lead to containment. Based on the findings of this study, social media data will help health authorities in detecting seasonal outbreaks earlier than just using official CDC channels of disease and illness reporting from physicians and labs, thus empowering health officials to plan their responses swiftly and allocate their resources optimally for the most affected areas. Future Research: A future researcher could use more complex deep learning algorithms, such as Artificial Neural Networks and Recurrent Neural Networks, to evaluate the accuracy of flu outbreak prediction models as compared to the regression models used in this study.
A future researcher could apply other sentiment analysis techniques, such as natural language processing and deep learning techniques, to identify context-sensitive emotion, concept extraction, and sarcasm detection for the identification of self-reporting flu tweets. A future researcher could expand the scope by continuously collecting tweets on a public cloud and applying big data applications, such as Hadoop and MapReduce, to perform predictions using several months of historical data or even years for a larger geographical area.
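The abstract reports that fourth-order polynomial and support vector regression models performed best when CDC counts and flu-tweet counts were combined. The sketch below, assuming scikit-learn, fits both model families to synthetic weekly counts; the numbers are invented and stand in for the study's CDC and Twitter series.

```python
# Sketch only (assumptions: scikit-learn available; the weekly counts below are
# synthetic, not the study's CDC or Twitter data).
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

weeks = np.arange(22)
cdc_cases = 100 + 40 * np.sin(weeks / 3.0) + np.random.default_rng(2).normal(0, 5, 22)
flu_tweets = 30 + 10 * np.sin(weeks / 3.0 + 0.5)   # weekly tweets mentioning flu

# Features: previous week's CDC count plus the current week's tweet count,
# used to predict the current week's CDC count.
X = np.column_stack([cdc_cases[:-1], flu_tweets[1:]])
y = cdc_cases[1:]

svr = SVR(kernel="rbf", C=100).fit(X, y)
poly = make_pipeline(PolynomialFeatures(degree=4), LinearRegression()).fit(X, y)

print("SVR  last-week prediction:", svr.predict(X[-1:])[0])
print("Poly last-week prediction:", poly.predict(X[-1:])[0])
```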




data

An Empirical Examination of the Effects of CTO Leadership on the Alignment of the Governance of Big Data and Information Security Risk Management Effectiveness

Aim/Purpose: Boards of Directors seek to use their big data as a competitive advantage. Still, scholars note the complexities of corporate governance in practice related to information security risk management (ISRM) effectiveness. Background: While the interest in ISRM and its relationship to organizational success has grown, the scholarly literature is unclear about the effects of Chief Technology Officers' (CTOs) leadership styles, the alignment of the governance of big data, and ISRM effectiveness in organizations in the Western United States. Methodology: The research method selected for this study was a quantitative, correlational research design. Data from 139 participant survey responses from Chief Technology Officers (CTOs) in the Western United States were analyzed using 3 regression models to test for mediation following Baron and Kenny's methodology. Contribution: Previous scholarship has established the importance of leadership styles, big data governance, and ISRM effectiveness, but not in a combined understanding of the relationship between all three variables. The researchers' primary objective was to contribute valuable knowledge to the practical field of computer science by empirically validating the relationships between CTOs' leadership styles, the alignment of the governance of big data, and ISRM effectiveness. Findings: The results of the first regression model between CTOs' leadership styles and ISRM effectiveness were statistically significant. The second regression model results between CTOs' leadership styles and the alignment of the governance of big data were not statistically significant. The results of the third regression model between CTOs' leadership styles, the alignment of the governance of big data, and ISRM effectiveness were statistically significant. The alignment of the governance of big data was a significant predictor in the model. At the same time, the predictive strength of all 3 CTOs' leadership styles was diminished between the first regression model and the third regression model. The regression models indicated that the alignment of the governance of big data was a partial mediator of the relationship between CTOs' leadership styles and ISRM effectiveness. Recommendations for Practitioners: With big data growing at an exponential rate, this research may be useful in helping other practitioners think about how to test mediation with other interconnected variables related to the alignment of the governance of big data. Overall, the alignment of governance of big data being a partial mediator of the relationship between CTOs' leadership styles and ISRM effectiveness suggests the significant role that the alignment of the governance of big data plays within an organization. Recommendations for Researchers: While this exact study has not been previously conducted with these three variables with CTOs in the Western United States, overall, these results are in agreement with the literature that information security governance does not significantly mediate the relationship between IT leadership styles and ISRM. However, some of the overall findings did vary from the literature, including the predictive relationship between transactional leadership and ISRM effectiveness. With the finding of partial mediation indicated in this study, this also suggests that the alignment of the governance of big data provides a partial intervention between CTOs' leadership styles and ISRM effectiveness.
Impact on Society: Big data breaches are increasing year after year, exposing sensitive information that can lead to harm to citizens. This study supports the broader scholarly consensus that to achieve ISRM effectiveness, better alignment of governance policies is essential. This research highlights the importance of higher-level governance as it relates to ISRM effectiveness, implying that ineffective governance could negatively impact both leadership and ISRM effectiveness, which could potentially cause reputational harm. Future Research: This study raised questions about CTO leadership styles, the specific governance structures involved related to the alignment of big data and ISRM effectiveness. While the research around these variables independently is mature, there is an overall lack of mediation studies as it relates to the impact of the alignment of the governance of big data. With the lack of alignment around a universal framework, evolving frameworks could be tested in future research to see if similar results are obtained.
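The study tested mediation with three regressions following Baron and Kenny. Below is a compact statsmodels sketch of that three-model test on synthetic data; the variable names (leadership, governance, isrm) are placeholders, not the study's survey instruments, and the coefficients used to generate the data are arbitrary.

```python
# Sketch of the Baron & Kenny three-model mediation test (assumptions:
# statsmodels installed; synthetic data with placeholder variable names,
# not the study's survey measures).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 139
leadership = rng.normal(0, 1, n)                                   # CTO leadership style score
governance = 0.5 * leadership + rng.normal(0, 1, n)                # big data governance alignment
isrm = 0.3 * leadership + 0.6 * governance + rng.normal(0, 1, n)   # ISRM effectiveness

# Model 1: predictor -> outcome.
m1 = sm.OLS(isrm, sm.add_constant(leadership)).fit()
# Model 2: predictor -> mediator.
m2 = sm.OLS(governance, sm.add_constant(leadership)).fit()
# Model 3: predictor + mediator -> outcome; a drop in the predictor's
# coefficient relative to Model 1 indicates (partial) mediation.
m3 = sm.OLS(isrm, sm.add_constant(np.column_stack([leadership, governance]))).fit()

print("Model 1 leadership coef:", m1.params[1])
print("Model 3 leadership coef:", m3.params[1], "(reduced => partial mediation)")
```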




data

From Tailored Databases to Wikis: Using Emerging Technologies to Work Together More Efficiently




data

Multi-Agent System for Knowledge-Based Access to Distributed Databases




data

Egocentric Database Operations for Social and Economic Network Analysis




data

Discovering Interesting Association Rules in the Web Log Usage Data




data

Relational Algebra Programming With Microsoft Access Databases




data

Data Visualization in Support of Executive Decision Making

Aim/Purpose: This journal paper seeks to understand historical aspects of data management, leading to the current data issues faced by organizational executives in relation to big data and how best to present the information to circumvent big data challenges for executive strategic decision making. Background: This journal paper seeks to understand what executives value in data visualization, based on the literature published from prior data studies. Methodology: The qualitative methodology was used to understand the sentiments of executives and data analysts using semi-structured interview techniques. Contribution: The preliminary findings can provide practical knowledge for data visualization designers, but can also provide academics with knowledge to reflect on and use, specifically in relation to information systems (IS) that integrate human experience with technology in more valuable and productive ways. Findings: Preliminary results from interviews with executives and data analysts point to the relevance of understanding and effectively presenting the data source and the data journey, using the right data visualization technology to fit the nature of the data, creating an intuitive platform which enables collaboration and newness, the data presenter’s ability to convey the data message, and the alignment of the visualization to the core objectives as key criteria to be applied for successful data visualizations. Recommendations for Practitioners: Practitioners, specifically data analysts, should consider the results highlighted in the findings and adopt such recommendations when presenting data visualizations. These include data and premise understanding, ensuring alignment to the executive’s objective, possessing the ability to convey messages succinctly and clearly to the audience, having knowledge of the domain to answer questions effectively, and using the right technology to convey the message. Recommendation for Researchers: The importance of human cognitive and sensory processes and its impact in IS development is paramount. More focus can be placed on the psychological factors of technology acceptance. The current TAM model, used to describe use, identifies perceived usefulness and perceived ease-of-use as the primary considerations in technology adoption. However, factors that have been identified that impact on use do not express the importance of cognitive processes in technology adoption. Future Research: Future research requires further focus on intangible and psychological factors that could affect technology adoption and use, as well as understanding data visualization effectiveness in corporate environments, not only predominantly within the Health sector. Lessons from Health sector studies in data visualization should be used as a platform.




data

A Multicluster Approach to Selecting Initial Sets for Clustering of Categorical Data

Aim/Purpose: This article proposes a methodology for selecting the initial sets for clustering categorical data. The main idea is to combine all the different values of every single criterion or attribute, to form the first proposal of the so-called multiclusters, obtaining in this way the maximum number of clusters for the whole dataset. The multiclusters thus obtained are themselves clustered in a second step, according to the desired final number of clusters. Background: Popular cluster methods for categorical data, such as the well-known K-Modes, usually select the initial sets by means of some random process. This fact introduces some randomness in the final results of the algorithms. We explore a different application of the clustering methodology for categorical data that overcomes the instability problems and ultimately provides a greater clustering efficiency. Methodology: For assessing the performance of the proposed algorithm and its comparison with K-Modes, we apply both of them to categorical databases where the response variable is known but not used in the analysis. In our examples, that response variable can be identified with the real clusters or classes to which the observations belong. With every data set, we perform a two-step analysis. In the first step we perform the clustering analysis on data where the response variable (the real clusters) has been omitted, and in the second step we use that omitted information to check the efficiency of the clustering algorithm (by comparing the real clusters to those given by the algorithm). Contribution: Simplicity, efficiency and stability are the main advantages of the multicluster method. Findings: The experimental results attained with real databases show that the multicluster algorithm has greater precision and a better grouping effect than the classical K-Modes algorithm. Recommendations for Practitioners: The method can be useful for those researchers working with small and medium size datasets, allowing them to detect the underlying structure of the data in an intuitive and reasonable way. Recommendation for Researchers: The proposed algorithm is slower than K-Modes, since it devotes a lot of time to the calculation of the initial combinations of attributes. The reduction of the computing time is therefore an important research topic. Future Research: We are concerned with the scalability of the algorithm to large and complex data sets, as well as the application to mixed data sets with both quantitative and qualitative attributes.
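One way to realise the two-step idea sketched in the abstract is shown below: the "multiclusters" are taken as the distinct combinations of attribute values, and their representatives are then clustered into the desired final number of groups with K-Modes. The kmodes package is assumed to be installed, the toy categorical dataset is made up, and this is only an approximation of the authors' procedure, not their exact algorithm.

```python
# Sketch of the two-step idea (assumptions: the kmodes package is installed;
# the toy categorical dataset below is made up).
import numpy as np
from kmodes.kmodes import KModes

data = np.array([
    ["red",  "small", "metal"],
    ["red",  "small", "metal"],
    ["blue", "large", "wood"],
    ["blue", "large", "plastic"],
    ["red",  "large", "wood"],
    ["blue", "small", "plastic"],
])

# Step 1: the "multiclusters" are the distinct combinations of attribute values,
# i.e., the maximum possible number of clusters for the dataset.
multiclusters, inverse = np.unique(data, axis=0, return_inverse=True)

# Step 2: cluster the multicluster representatives into the desired final number
# of clusters, then propagate the assignment back to the original records.
km = KModes(n_clusters=2, init="Huang", n_init=5, random_state=0)
multi_labels = km.fit_predict(multiclusters)
final_labels = multi_labels[np.ravel(inverse)]
print(final_labels)
```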




data

Challenges in Contact Tracing by Mining Mobile Phone Location Data for COVID-19: Implications for Public Governance in South Africa

Aim/Purpose: The paper’s objective is to examine the challenges of using the mobile phone to mine location data for effective contact tracing of symptomatic, pre-symptomatic, and asymptomatic individuals and the implications of this technology for public health governance. Background: The COVID-19 crisis has created an unprecedented need for contact tracing across South Africa, requiring thousands of people to be traced and their details captured in government health databases as part of public health efforts aimed at breaking the chains of transmission. Contact tracing for COVID-19 requires the identification of persons who may have been exposed to the virus and following them up daily for 14 days from the last point of exposure. Mining mobile phone location data can play a critical role in locating people from the time they were identified as contacts to the time they access medical assistance. In this case, it aids data flow to various databases designated for COVID-19 work. Methodology: The researchers conducted a review of the available literature on this subject drawing from academic articles published in peer-reviewed journals, research reports, and other relevant national and international government documents reporting on public health and COVID-19. Document analysis was used as the primary research method, drawing on the case studies. Contribution: Contact tracing remains a critical strategy in curbing the deadly COVID-19 pandemic in South Africa and elsewhere in the world. However, given increasing concern regarding its invasive nature and possible infringement of individual liberties, it is imperative to interrogate the challenges related to its implementation to ensure a balance with public governance. The research findings can thus be used to inform policies and practices associated with contact tracing in South Africa. Findings: The study found that contact tracing using mobile phone location data mining can be used to enforce quarantine measures such as lockdowns aimed at mitigating a public health emergency such as COVID-19. However, the use of technology can expose the public to criminal activities by exposing their locations. From a public governance point of view, any exposure of the public to social ills is highly undesirable. Recommendations for Practitioners: In using contact tracing apps to provide pertinent location data, caution needs to be exercised to ensure that sensitive private information is not made public to the extent that it compromises citizens’ safety and security. The study recommends the development and implementation of data use protocols to support the use of this technology, in order to mitigate against infringement of individual privacy and other civil liberties. Recommendation for Researchers: Researchers should explore ways of improving digital applications in order to improve the acceptability of the use of contact tracing technology to manage pandemics such as COVID-19, paying attention to ethical considerations. Impact on Society: Since contact tracing has implications for privacy and confidentiality it must be conducted with caution. This research highlights the challenges that the authorities must address to ensure that the right to privacy and confidentiality is upheld. Future Research: Future research could focus on collecting primary data to provide insight on contact tracing through mining mobile phone location data. 
Research could also be conducted on how app-based technology can enhance the effectiveness of contact tracing in order to optimize testing and tracing coverage. This has the potential to minimize transmission whilst also minimizing tracing delays. Moreover, it is important to develop contact tracing apps that are universally inter-operable and privacy-preserving.




data

Automatic Generation of Temporal Data Provenance From Biodiversity Information Systems

Aim/Purpose: Although the significance of data provenance has been recognized in a variety of sectors, there is currently no standardized technique or approach for gathering data provenance. Present automated techniques mostly employ workflow-based strategies. Unfortunately, the majority of current information systems do not embrace this strategy, particularly biodiversity information systems in which data is acquired by a variety of persons using a wide range of equipment, tools, and protocols. Background: This article presents an automated technique for producing temporal data provenance that is independent of biodiversity information systems. The approach depends on the changes in contextual information of data items. By mapping the modifications to a schema, a standardized representation of data provenance may be created. Consequently, temporal information may be automatically inferred. Methodology: The research methodology consists of three main activities: database event detection, event-schema mapping, and temporal information inference. First, a list of events is detected from databases. After that, the detected events are mapped to an ontology, so a common representation of data provenance is obtained. Based on the derived data provenance, rule-based reasoning is automatically used to infer temporal information. Consequently, a temporal provenance is produced. Contribution: This paper provides a new method for generating data provenance automatically without interfering with the existing biodiversity information system. In addition, it does not mandate that any information system adhere to any particular format. Ontology and the rule-based system as the core components of the solution have been confirmed to be highly valuable in biodiversity science. Findings: Detaching the solution from any biodiversity information system provides scalability in the implementation. Based on the evaluation of a typical biodiversity information system for species traits of plants, a large amount of temporal information can be generated. Using rules to encode different types of knowledge provides high flexibility to generate temporal information, enabling different temporal-based analyses and reasoning. Recommendations for Practitioners: The strategy is based on the contextual information of data items, yet most information systems simply save the most recent ones. As a result, in order for the solution to function properly, database snapshots must be stored on a frequent basis. Furthermore, a more practical technique for recording changes in contextual information would be preferable. Recommendation for Researchers: The capability to uniformly represent events using a schema has paved the way for automatic inference of temporal information. Therefore, a richer representation of temporal information should be investigated further. Also, this work demonstrates that rule-based inference provides flexibility to encode different types of knowledge from experts. Consequently, a variety of temporal-based data analyses and reasoning can be performed. Therefore, it will be better to investigate multiple domain-oriented knowledge using the solution. Impact on Society: Using a typical information system to store and manage biodiversity data has not prohibited us from generating data provenance. Since there is no restriction on the type of information system, our solution has a high potential to be widely adopted. 
Future Research: The data analysis of this work was limited to species traits data. However, there are other types of biodiversity data, including genetic composition, species population, and community composition. In the future, this work will be expanded to cover all those types of biodiversity data. The ultimate goal is to have a standard methodology or strategy for collecting provenance from any biodiversity data regardless of how the data was stored or managed.
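The methodology names three activities: database event detection, event-schema mapping, and rule-based temporal inference. The small pure-Python sketch below illustrates that flow by diffing two hypothetical snapshots of a species-trait table into events and applying one valid-time rule; the snapshots, field names, and rule are invented, and the paper maps events to an ontology rather than to Python dictionaries.

```python
# Illustrative sketch (the snapshots, field names, and rule below are invented;
# the paper maps detected events to an ontology rather than to Python dicts).
from datetime import date

# Two snapshots of a species-trait table, taken at different times.
snapshot_t1 = {1: {"species": "Ficus lyrata", "leaf_length_cm": 20}}
snapshot_t2 = {1: {"species": "Ficus lyrata", "leaf_length_cm": 25},
               2: {"species": "Ficus elastica", "leaf_length_cm": 18}}

def detect_events(old, new, t_old, t_new):
    """Diff snapshots into insert/update events (the event detection step)."""
    events = []
    for key, record in new.items():
        if key not in old:
            events.append({"type": "insert", "record": key, "at": t_new})
        elif record != old[key]:
            events.append({"type": "update", "record": key,
                           "between": (t_old, t_new)})
    return events

def infer_temporal(events):
    """Rule: a value observed at t1 and changed by t2 was valid over [t1, t2)."""
    provenance = []
    for e in events:
        if e["type"] == "update":
            start, end = e["between"]
            provenance.append(
                {"record": e["record"], "valid_from": start, "valid_until": end})
        else:
            provenance.append(
                {"record": e["record"], "valid_from": e["at"], "valid_until": None})
    return provenance

events = detect_events(snapshot_t1, snapshot_t2, date(2023, 1, 1), date(2023, 6, 1))
print(infer_temporal(events))
```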




data

Determinants of the Intention to Use Big Data Analytics in Banks and Insurance Companies: The Moderating Role of Managerial Support

Aim/Purpose: The aim of this research paper is to suggest a comprehensive model that incorporates the technology acceptance model with the task-technology fit model, information quality, security, trust, and managerial support to investigate the intended usage of big data analytics (BDA) in banks and insurance companies. Background: The emergence of the concept of “big data,” prompted by the widespread use of connected devices and social media, has been pointed out by many professionals and financial institutions in particular, which makes it necessary to assess the determinants that have an impact on behavioral intention to use big data analytics in banks and insurance companies. Methodology: The integrated model was empirically assessed using self-administered questionnaires from 181 prospective big data analytics users in Moroccan banks and insurance firms and examined using partial least squares (PLS) structural equation modeling. The results cover sample characteristics, an analysis of the validity and reliability of measurement models’ variables, an evaluation of the proposed hypotheses, and a discussion of the findings. Contribution: The paper makes a noteworthy contribution to the BDA adoption literature within the finance sector. It stands out by ingeniously amalgamating the Technology Acceptance Model (TAM) with Task-Technology Fit (TTF) while underscoring the critical significance of information quality, trust, and managerial support, due to their profound relevance and importance in the finance domain, thus showing that BDA has potential applications beyond the finance sector. Findings: The findings showed that TTF and trust’s impact on the intention to use is considerable. Information quality positively impacted perceived usefulness and ease of use, which in turn affected the intention to use. Moreover, managerial support moderates the correlation between perceived usefulness and the intention to use, whereas security did not affect the intention to use and managerial support did not moderate the influence of perceived ease of use. Recommendations for Practitioners: The results suggest that financial institutions can improve their adoption decisions for big data analytics (BDA) by understanding how users perceive it. Users are predisposed to use BDA if they presume it fits well with their tasks and is easy to use. The research also emphasizes the importance of relevant information quality, managerial support, and collaboration across departments to fully leverage the potential of BDA. Recommendation for Researchers: Further study may be done on other business sectors to confirm its generalizability, and the same research design can be employed to assess BDA adoption in organizations that are in the advanced stage of big data utilization. Impact on Society: The study’s findings can enable stakeholders of financial institutions that are at the primary stage of big data exploitation to understand how users perceive BDA technologies and the way their perception can influence their intention toward their use. Future Research: Future research is expected to conduct a comparison of the moderating effect of managerial support on users with technical expertise versus those without; in addition, international studies across developed countries are required to build a solid understanding of users’ perceptions towards BDA.
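The study analysed its model with PLS structural equation modeling. As a much simpler stand-in that is not the authors' method, the sketch below uses an ordinary least squares regression with an interaction term to show how a moderator such as managerial support can be tested against the usefulness-to-intention link; all data and coefficients are synthetic and the variable names are placeholders.

```python
# Simplified stand-in for the moderation test (the study itself used PLS-SEM;
# this is an ordinary least squares sketch on synthetic data).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
n = 181
usefulness = rng.normal(0, 1, n)          # perceived usefulness (placeholder)
support = rng.normal(0, 1, n)             # managerial support, the moderator (placeholder)
intention = (0.4 * usefulness + 0.2 * support
             + 0.3 * usefulness * support + rng.normal(0, 1, n))

# The interaction term carries the moderation effect.
X = sm.add_constant(np.column_stack([usefulness, support, usefulness * support]))
model = sm.OLS(intention, X).fit()
print("Interaction coefficient (moderation):", model.params[3])
```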




data

A New Model for Collecting, Storing, and Analyzing Big Data on Customer Feedback in the Tourism Industry

Aim/Purpose: In this study, the research proposes and experiments with a new model of collecting, storing, and analyzing big data on customer feedback in the tourism industry. The research focused on the Vietnam market. Background: Big Data describes large databases that have been “silently” built by businesses, which include product information, customer information, customer feedback, etc. This information is valuable, and its volume increases rapidly over time, but businesses often pay little attention to it or store it discretely rather than centrally, thereby wasting an extremely large resource and partly limiting business and data analysis. Methodology: The study conducted an experiment by collecting customer feedback data in the field of tourism, especially tourism in Vietnam, from 2007 to 2022. After that, the research proceeded to store and mine latent topics based on the data collected using the Topic Model. The study applied cloud computing technology to build a collection and storage model to solve difficulties, including scalability, system stability, and system cost optimization, as well as ease of access to technology. Contribution: The research has four main contributions: (1) Building a model for Big Data collection, storage, and analysis; (2) Experimenting with the solution by collecting customer feedback data from huge platforms such as Booking.com, Agoda.com, and Phuot.vn based on cloud computing, focusing mainly on tourism in Vietnam; (3) A Data Lake that stores customer feedback and discussion in the field of tourism was built, supporting researchers in the field of natural language processing; (4) Experimental research on the latent topic mining model from the collected Big Data based on the topic model. Findings: Experimental results show that the Data Lake has helped users easily extract information, thereby supporting administrators in making quick and timely decisions. Next, PySpark big data processing technology and cloud computing help speed up processing, save costs, and make model building easier when moving to SaaS. Finally, the topic model helps identify customer discussion trends and identify latent topics that customers are interested in, so business owners have a better picture of their potential customers and business. Recommendations for Practitioners: Empirical results show that facilities are the factor that customers in the Vietnamese market complain about the most in the tourism/hospitality sector. This also suggests that practitioners should moderate their expectations about facilities, because the overall level of physical facilities in the Vietnamese market is still weak and cannot be compared with other countries in the world. However, this is also information to support administrators in planning to upgrade facilities in the long term. Recommendation for Researchers: The value of Data Lake has been proven by research. The study also formed a model for big data collection, storage, and analysis. Researchers can use the same model for other fields or use the model and algorithm proposed by this study to collect and store big data in other platforms and areas. Impact on Society: Collecting, storing, and analyzing big data in the tourism sector helps government strategists to identify tourism trends and communication crises. Based on that information, government managers will be able to make decisions and strategies to develop regional tourism, propose price levels, and support innovative programs. 
That is the great social value that this research brings. Future Research: With each different platform or website, the study had to build a query scenario and choose a different technology approach, which limits the ability of the solution’s scalability to multiple platforms. Research will continue to build and standardize query scenarios and processing technologies to make scalability to other platforms easier.
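Since the abstract centres on mining latent topics from customer feedback, the sketch below runs scikit-learn's LatentDirichletAllocation, a standard topic model, over a handful of invented feedback comments as a stand-in for the paper's much larger 2007-2022 collection; the comments and the choice of two topics are illustrative only.

```python
# Topic-model sketch (assumptions: scikit-learn available; the feedback comments
# below are invented and far smaller than the paper's 2007-2022 collection).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

feedback = [
    "great beach view but the room facilities were old and broken",
    "friendly staff, delicious local food, wonderful street food tour",
    "air conditioning broken, bathroom facilities need an upgrade",
    "amazing food and very helpful staff at the hotel restaurant",
]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(feedback)

lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)

# Show the top words per latent topic.
terms = vec.get_feature_names_out()
for topic_idx, weights in enumerate(lda.components_):
    top = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"Topic {topic_idx}: {', '.join(top)}")
```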