Accepted Research Papers

Tuesday, October 16th, 11:00AM - 12:30PM, R1: Best Paper Award Nominees

Fully Automated HTML and Javascript Rewriting for Constructing a Self-healing Web Proxy. Thomas Durieux (INRIA), Youssef Hamadi (LIX) and Martin Monperrus (KTH).

Robust and Rapid Adaption for Concept Drift in Software System Anomaly Detection. Minghua Ma (Tsinghua University), Shenglin Zhang (Nankai University), Dan Pei (Tsinghua University), Xin Huang and Hongwei Dai (Sogou Inc)

Run-time Reliability Estimation of Microservice Architectures. Roberto Pietrantuono, Stefano Russo and Antonio Guerriero (University of Naples Federico II)

Tuesday, October 16th, 2:00PM - 3:30PM, R2: Journal-first Paper Presentations

Session Chair: Karama Kanoun

An architecture, system engineering, and acquisition approach for space system software resiliency. Dewanne M. Phillips (The Aerospace Corporation), Thomas A. Mazzuchi, Shahram Sarkani (George Washington University)

Software metrics thresholds calculation techniques to predict fault-proneness: An empirical comparison. Alexandre Boucher, Mourad Badri (University of Quebec)

SPIRITuS: a SimPle Information Retrieval regressIon Test Selection approach. Simone Romano, Giuseppe Scanniello (University of Basilicata), Giuliano Antoniol (Ecole Polytech. de Montreal), Alessandro Marchetto

Tuesday, October 16th, 4:00PM - 6:00PM (2 hours), R3: Reliability, Security and Safety Analysis

Session Chair: Tadashi Dohi

Online Model-based Testing Under Uncertainty. Matteo Camilli (University of Milan), Angelo Gargantini and Patrizia Scandurra (University of Bergamo)

Reliability Evaluation of Functionally Equivalent Simulink PID Controller Implementations. Kai Ding, Andrey Morozov and Klaus Janschek (TU Dresden)

A Natural Language Programming Approach for Requirements-based Security Testing. Phu X. Mai, Fabrizio Pastore, Arda Goknil and Lionel Briand (University of Luxembourg)

Safe-AR: Reducing Risk While Augmenting Reality (WAP paper). Robyn Lutz (Iowa State University)

Wednesday, October 17th, 2:00PM - 3:30PM, R4: Test Case Generation

Session Chair: Roberto Pietrantuono

Worst-Case Execution Time Testing via Evolutionary Symbolic Execution. Andrea Aquino (Università della Svizzera Italiana), Giovanni Denaro (University of Milano-Bicocca) and Pasquale Salza (Università della Svizzera Italiana)

Search-Based Test Data Generation for JavaScript Functions that Interact with the DOM. Alexander Elyasov, Wishnu Prasetya and Jurriaan Hage (Utrecht University)

DeepMutation: Mutation Testing of Deep Learning Systems. Lei Ma (Harbin Institute of Technology), Fuyuan Zhang (Nanyang Technological University), Jiyuan Sun (Kyushu University), Minhui Xue (New York University), Bo Li (University of Illinois at Urbana-Champaign), Felix Juefei-Xu (Carnegie Mellon University), Chao Xie (Kyushu University), Li Li (Monash University), Yang Liu (Nanyang Technological University), Jianjun Zhao (Kyushu University) and Yadong Wang (Harbin Institute of Technology)

Wednesday, October 17th, 4:00PM - 6:00PM (2 hours), R5: Regression Testing

Session Chair: Fabrizio Pastore

Evaluating Regression Test Selection Opportunities in a Very Large Open-Source Ecosystem. Alex Gyori, Owolabi Legunsen, Farah Hariri and Darko Marinov (University of Illinois at Urbana-Champaign)

Substate Profiling for Effective Test Suite Reduction. Rawad Abou Assi, Wes Masri and Chadi Trad (American University of Beirut)

A Study of Regression Test Selection in Continuous Integration Environments. Ting Wang and Tingting Yu (University of Kentucky)

ReTEST: A Cost Effective Test Case Selection Technique for Modern Software Development. Maral Azizi and Hyunsook Do (University of North Texas)

Thursday, October 18th, 9:00AM - 10:30AM, R6: Fault Localization and Debugging

Session Chair: Zheng Zheng

FTMES: A Failed-Test-Oriented Mutant Execution Strategy for Mutation-based Fault Localization. André Oliveira (Instituto Federal de Mato Grosso), Celso G. Camilo-Junior (Universidade Federal de Goiás), Eduardo Freitas (Instituto Federal de Goiás) and Auri Marcelo Rizzo Vincenzi (Universidade Federal de São Carlos)

Causal Distance-Metric-Based Assistance for Debugging After Compiler Fuzzing. Josie Holmes and Alex Groce (Northern Arizona University)

Bugaroo: Exposing Memory Model Bugs in Many-core Systems. Mohammad Majharul Islam (Intel) and Abdullah Muzahid (The University of Texas at San Antonio)

Thursday, October 18th, 11:00AM - 12:30PM, R7: Mobile Systems

Session Chair: Domenico Cotroneo

RedDroid: Android Application Redundancy Customization Based on Static Analysis. Yufei Jiang (Microsoft), Qinkun Bao, Shuai Wang, Xiao Liu and Dinghao Wu (The Pennsylvania State University)

DroidEH: An Exception Handling Mechanism for Android Applications. Juliana Oliveira, Hivana Macedo, Nelio Cacho (Federal University of Rio Grande do Norte) and Alexander Romanovsky (Newcastle University)

MoonlightBox: Mining Android API Histories for Uncovering Release-time Inconsistencies. Li Li (Monash University), Tegawendé F. Bissyandé and Jacques Klein (University of Luxembourg)

Thursday, October 18th, 2:00PM - 3:30PM, R8: Systems Security and Privacy

Session Chair: Nuno Antunes

Detection of Mirai by Syntactic and Behavioral Analysis. Najah Ben Said (INRIA), Fabrizio Biondi (CentraleSupelec), Vesselin Bontchev (Bulgarian Academy of Sciences), Olivier Decourbe, Thomas Given-Wilson, Axel Legay and Jean Quilbeuf (INRIA)

You Are Where You APP: An Assessment on Location Privacy of Social APPs. Fanghua Zhao, Linan Gao (Shandong University), Yang Zhang (CISPA), Zeyu Wang, Bo Wang and Shanqing Guo (Shandong University)

Iris Template Protection Based on Randomized Response Technique and Aggregated Block Information. Dongdong Zhao, Xiaoyi Hu, Jing Tian, Shengwu Xiong and Jianwen Xiang (Wuhan University of Technology)


Abstracts of Accepted Research Papers


Paper ID 5. RedDroid: Android Application Redundancy Customization Based on Static Analysis

Yufei Jiang (Microsoft), Qinkun Bao, Shuai Wang, Xiao Liu, and Dinghao Wu (The Pennsylvania State University)

Abstract

Smartphone users are installing more and bigger apps. Meanwhile, each app carries a considerable amount of unused material, called software bloat, in its APK file. As a result, the resources of a smartphone, such as hard disk and network bandwidth, have become scarcer than ever before. It is therefore critical to investigate existing apps on the market and apps in development to identify the sources of software bloat and to develop techniques and tools to remove it. In this paper, we present a comprehensive study of software bloat in Android applications and categorize it into two types: compile-time redundancy and install-time redundancy. We further propose a static-analysis-based approach to identifying and removing software bloat from Android applications. We implemented our approach in a prototype called RedDroid and evaluated it on thousands of Android applications collected from Google Play. Our experimental results not only validate the effectiveness of our approach but also report the bloatware issue in real-world Android applications for the first time.


Paper ID 8. Worst-Case Execution Time Testing via Evolutionary Symbolic Execution

Andrea Aquino (Università della Svizzera Italiana), Giovanni Denaro (University of Milano-Bicocca), and Pasquale Salza (Università della Svizzera Italiana)

Abstract

Worst-case execution time testing amounts to constructing a test case that triggers the worst-case execution time of a program, and it has many important applications in identifying, debugging, and fixing performance bottlenecks and security holes in programs. We propose Evolutionary Symbolic Execution, a novel technique for worst-case execution time testing that combines symbolic execution and evolutionary algorithms. The technique (i) considers the set of feasible program paths as the search space, (ii) embraces the execution cost of the program paths as the fitness function to pursue the worst path, (iii) exploits symbolic execution with random path selection to collect an initial set of feasible program paths, (iv) incrementally evolves by steering symbolic execution to traverse new program paths that comply with execution conditions combined and refined from the currently collected program paths, and (v) periodically applies local optimizations to the worst currently identified program path to speed up the identification of the worst path. We report on a set of initial experiments indicating that our technique succeeds in generating good worst-case execution time test cases for programs that existing approaches cannot cope with.


Paper ID 13. DroidEH: An Exception Handling Mechanism for Android Applications

Juliana Oliveira, Hivana Macedo, Nelio Cacho (Federal University of Rio Grande do Norte), and Alexander Romanovsky (Newcastle University)

Abstract

App crashing is the most common cause of complaints about Android mobile phone apps, according to recent studies. Since most Android applications are written in Java, exception handling is the primary mechanism they employ to report and handle errors, similar to standard Java applications. Unfortunately, the exception handling mechanism for the Android platform has two liabilities: (1) the “Terminate ALL” approach and (2) a lack of a holistic view on exceptional behavior. As a consequence, exceptions easily get “out of control” and, as system development progresses, exceptional control flows become less well understood, with potentially negative effects on program reliability. This paper presents an innovative exception handling mechanism for the Android platform, named DroidEH, that provides abstractions to support systematic engineering of holistic fault tolerance by applying cross-cutting reasoning about systems and their components.


Paper ID 14. Robust and Rapid Adaption for Concept Drift in Software System Anomaly Detection

Minghua Ma (Tsinghua University), Shenglin Zhang (Nankai University), Dan Pei (Tsinghua University), Xin Huang and Hongwei Dai (Sogou Inc)

Abstract

Anomaly detection is critical for web-based software systems. Anecdotal evidence suggests that, in these systems, the accuracy of a static anomaly detection method that was previously ensured is bound to degrade over time. This is due to a significant change in data distribution, namely concept drift, which is caused by software changes or evolving personal preferences. Even though dozens of anomaly detectors have been proposed over the years in the context of software systems, they have not tackled the problem of concept drift. In this paper, we present a framework, StepWise, which can detect concept drift without tuning the detection threshold or per-KPI (Key Performance Indicator) model parameters in large-scale KPI streams, take external factors into account to distinguish concept drift that falls under operators' expectations, and help any kind of anomaly detection algorithm handle it rapidly. For the prototype deployed at Sogou, our empirical evaluation shows that StepWise improves the average F-score by 206% for many widely used anomaly detectors over a baseline without any concept drift detection.


Paper ID 17. Online Model-based Testing Under Uncertainty

Matteo Camilli (University of Milan), Angelo Gargantini and Patrizia Scandurra (University of Bergamo)

Abstract

Modern software systems are required to operate in a highly uncertain and changing environment. They have to control the satisfaction of their requirements at run-time, and possibly adapt and cope with situations that have not been completely addressed at design-time. Software engineering methods and techniques are, more than ever, forced to deal with change and uncertainty (lack of knowledge) explicitly. To tackle the challenge posed by uncertainty in delivering more reliable systems, this paper proposes a novel online Model-based Testing technique that complements classic test case generation based on pseudo-random sampling strategies with an uncertainty-aware sampling strategy. To deal with system uncertainty during testing, the proposed strategy builds on an Inverse Uncertainty Quantification approach that is related to the discrepancy between the data measured at run-time (while the system executes) and a Markov Decision Process model describing the behavior (including uncertainty) of the system under test. To this end, a conformance game approach is adopted in which tests feed a Bayesian inference calibrator that continuously learns from test data to tune the system model and the system itself. A comparative evaluation between the proposed uncertainty-aware sampling policy and classical pseudo-random sampling policies is also presented using the Tele Assistance System case study, showing the differences in achieved accuracy and efficiency.


Paper ID 20. Causal Distance-Metric-Based Assistance for Debugging After Compiler Fuzzing

Josie Holmes and Alex Groce (Northern Arizona University)

Abstract

Measuring the distance between two program executions is a fundamental problem in dynamic analysis of software, and useful in many test generation and debugging algorithms. This paper proposes a metric for measuring distance between executions and specializes it to an important application: determining similarity of failing test cases for the purpose of automated fault identification and localization in debugging based on automatically generated compiler tests. The metric is based on a causal concept of distance, where executions are similar to the degree that changes in the program itself, introduced by mutation, cause similar changes in the correctness of the executions. Specifically, if two failing test cases (for the original compiler) become successful due to the same mutant, they are more likely to be due to the same fault. We evaluate our metric using more than 50 faults and 2,800 test cases for two widely used real-world compilers, and demonstrate improvements over state-of-the-art methods for fault identification and localization.
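
The causal notion of distance in this abstract can be illustrated with a small sketch. It is not the authors' metric: the abstract gives no formula, so the Jaccard-based distance, the function name `causal_distance`, and the mutant identifiers are assumptions chosen only to show the intuition that failing tests repaired by the same mutants are more likely to share a fault.

```python
from typing import Set


def causal_distance(fixers_a: Set[str], fixers_b: Set[str]) -> float:
    """Distance between two failing tests, given the mutants that make each pass.

    Assumption: distance is 1 minus the Jaccard similarity of the two sets of
    "fixing" mutants. The paper's actual metric may be defined differently.
    """
    union = fixers_a | fixers_b
    if not union:
        # No mutant repairs either test, so there is no causal evidence.
        return 1.0
    return 1.0 - len(fixers_a & fixers_b) / len(union)


# Hypothetical mutant identifiers, for illustration only.
same_fault = causal_distance({"m1", "m2"}, {"m1", "m2"})   # identical sets
unrelated = causal_distance({"m1"}, {"m7"})                # disjoint sets
partial = causal_distance({"m1", "m2"}, {"m2", "m3"})      # one shared mutant
```

Under this assumed definition, a distance of 0 means both failures are repaired by exactly the same mutants, and a distance of 1 means they share none.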


Paper ID 22. Safe-AR: Reducing Risk While Augmenting Reality (WAP paper)

Robyn Lutz (Iowa State University)

Abstract

Augmented reality (AR) systems excel at offering real-time, situation-aware information to support users' decision making. With AR, rich visualizations of relevant data can be displayed to users without blocking their view of the real world. For example, an AR automotive windshield might display a red outline around a pedestrian to alert a driver at night. Safety-critical AR applications are in use or in design for automotive driving aids, surgery, airplane maintenance, turbine assembly, and emergency response. Many of these applications can enhance operational safety. However, developing risk analysis methods to handle potential failure modes in the melded virtual and physical realities remains an open problem. This paper proposes a risk analysis method with which to study computer-generated AR visualizations of system and environmental information. The analysis also incorporates three levels at which AR interfaces with the user: perception, comprehension, and decision-making. Preliminary results reported here show that this method yields improved coverage of user-involved AR failure modes over current approaches. Our approach enables a more complete risk analysis of the cyber-physical-human system that an AR application depends upon and affects. While the focus here is on safety, the method appears to be generalizable to AR security risks.


Paper ID 27. Reliability Evaluation of Functionally Equivalent Simulink PID Controller Implementations

Kai Ding, Andrey Morozov, and Klaus Janschek (TU Dresden)

Abstract

Model-based design of embedded control systems is becoming more and more popular. Control engineers prefer to use MATLAB Simulink and suitable automatic code generators for the development and deployment of the software. Simulink provides a vast variety of functionally equivalent design solutions. For instance, a proportional-integral-derivative (PID) controller can be implemented in Simulink using i) separate blocks for the P, I, and D terms, ii) a dedicated Discrete PID Controller block, iii) a Discrete Transfer Function block, or iv) a Discrete State-Space block. However, these functionally equivalent implementations of the PID controller show completely different reliability properties. This article introduces a new analytical method for evaluating overall system reliability under data errors occurring in RAM and CPU. The method is based on a stochastic dual-graph error propagation model that captures the control and data flow structures of the assembly code and allows the computation of system-level reliability metrics in critical system outputs for specified fault probabilities. The analytical method enables an early system reliability evaluation. In addition, applying this analytical method to possible implementations of a particular control algorithm helps to select the most reliable controller implementation variant.


Paper ID 33. A Natural Language Programming Approach for Requirements-based Security Testing

Phu X. Mai, Fabrizio Pastore, Arda Goknil, and Lionel Briand (University of Luxembourg)

Abstract

To facilitate communication among stakeholders, software security requirements are typically written in natural language and capture both positive requirements (i.e., what the system is supposed to do to ensure security) and negative requirements (i.e., undesirable behavior undermining security). In this paper, we tackle the problem of automatically generating executable security test cases from security requirements in natural language (NL). More precisely, since existing approaches for the generation of test cases from NL requirements verify only positive requirements, we focus on the problem of generating test cases from negative requirements. We propose, apply, and assess Misuse Case Programming (MCP), an approach that automatically generates security test cases from misuse case specifications (i.e., use case specifications capturing the behavior of malicious users). MCP relies on natural language processing techniques to extract the concepts (e.g., inputs and activities) appearing in requirements specifications, and generates executable test cases by matching the extracted concepts to the members of a provided test driver API. MCP has been evaluated in an industrial case study, which provides initial evidence of the feasibility and benefits of the approach.


Paper ID 41. Substate Profiling for Effective Test Suite Reduction

Rawad Abou Assi, Wes Masri, and Chadi Trad (American University of Beirut)

Abstract

Test suite reduction (TSR) aims at removing redundant test cases from regression test suites. A typical TSR approach ensures that structural profile elements covered by the original test suite are also covered by the reduced test suite. It is plausible that structural profiles might be unable to segregate failing runs from passing runs, which diminishes the effectiveness of TSR with regard to defect detection. This motivated us to explore state profiles, which are based on the collective values of program variables. This paper presents Substate Profiling, a new form of state profiling that enhances existing profile-based analysis techniques such as TSR and coverage-based fault localization. Compared to current approaches for capturing program states, Substate Profiling is more practical and finer grained. We evaluated our approach using thirteen multi-fault subject programs comprising 56 defects. Our study involved greedy TSR using Substate profiles and four structural profiles, namely basic-block, branch, def-use pair, and the combination of the three. For the majority of the subjects, Substate Profiling detected considerably more defects with a comparable level of reduction. Also, Substate profiles were found to be complementary to structural profiles in many cases; thus, combining both types is beneficial.
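
For readers unfamiliar with greedy TSR, the following is a minimal sketch of the generic greedy set-cover heuristic, under the assumption that each test is described by the set of profile elements it covers. It is not the authors' tool: the function name `greedy_reduce`, the tie-breaking rule, and the element names are hypothetical, and the profile elements could equally be basic blocks, branches, def-use pairs, or substates.

```python
from typing import Dict, List, Set


def greedy_reduce(coverage: Dict[str, Set[str]]) -> List[str]:
    """Pick tests until every profile element covered by the suite is covered.

    At each step, the test covering the most still-uncovered elements is
    selected. Ties are broken by test name (an arbitrary assumption).
    """
    remaining: Set[str] = set().union(*coverage.values())
    selected: List[str] = []
    while remaining:
        # max() returns the first maximal item, so sorting fixes the tie-break.
        best = max(sorted(coverage), key=lambda t: len(coverage[t] & remaining))
        selected.append(best)
        remaining -= coverage[best]
    return selected


# Hypothetical suite: element names "a", "b", "c" stand for profile elements.
suite = {
    "t1": {"a", "b"},
    "t2": {"b", "c"},
    "t3": {"c"},
}
reduced = greedy_reduce(suite)
```

In this example the reduced suite keeps `t1` and `t2` and drops `t3`, because `t3` covers nothing that the other two do not already cover.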


Paper ID 44. Run-time Reliability Estimation of Microservice Architectures

Roberto Pietrantuono, Stefano Russo and Antonio Guerriero (University of Naples Federico II)

Abstract

Microservices are gaining popularity as an architectural paradigm for service-oriented applications, especially suited for highly dynamic contexts requiring loosely coupled independent services, frequent software releases, and decentralized governance and data management. Because of the high flexibility and evolvability characterizing microservice architectures (MSAs), it is difficult to estimate their reliability at design time, as it changes continuously due to the services’ upgrades and/or to the way applications are used by customers. This paper presents a testing method for on-demand reliability estimation of microservice applications in their operational phase. The method makes it possible to faithfully assess, upon request, the reliability of an MSA-based application under a scarce testing budget at any time while it is in operation, exploiting field data about microservice usage and failing/successful demands. A new in-vivo testing algorithm, named Microservice Adaptive Reliability Testing (MART), is developed based on an adaptive web sampling strategy. The method is evaluated by simulation, as well as by experimentation on an example application based on the Netflix Open Source Software MSA, with encouraging results in terms of estimation accuracy and, especially, efficiency.


Paper ID 56. Iris Template Protection Based on Randomized Response Technique and Aggregated Block Information

Dongdong Zhao, Xiaoyi Hu, Jing Tian, Shengwu Xiong, and Jianwen Xiang (Wuhan University of Technology)

Abstract

Nowadays, biometric recognition is widely used in real-world applications, but it has also brought potential privacy threats to users. Iris template protection enables effective iris recognition while protecting personal privacy. In this paper, we propose a method for iris template protection based on the randomized response technique and aggregated block information. Specifically, the iris data are first permuted according to an application-specific parameter; next, the permuted data are flipped using the randomized response technique; finally, the result is divided into blocks, and the aggregated information (i.e., the sum of all bits) in each block is calculated and stored instead of the original iris data for privacy protection. We demonstrate that the proposed method supports the shifting and masking strategies for enhancing recognition performance. Moreover, the proposed method satisfies the three privacy requirements prescribed in ISO/IEC 24745: irreversibility, revocability, and unlinkability. Experimental results show that the proposed method can effectively maintain the recognition performance (w.r.t. the original iris recognition system without privacy protection) on the iris database CASIA-IrisV3-Interval.
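
The three steps named in this abstract (permute, flip with randomized response, aggregate per block) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' scheme: the way the permutation is derived from the application-specific parameter (here a seeded shuffle), the flip probability, the block size, and all names (`protect_template`, `app_key`, `flip_prob`, `rr_seed`) are hypothetical.

```python
import random
from typing import List, Optional


def protect_template(bits: List[int], app_key: int, flip_prob: float,
                     block_size: int, rr_seed: Optional[int] = None) -> List[int]:
    """Return per-block bit sums of a permuted, randomly flipped iris code."""
    # Step 1: permute the bits. Assumption: the application-specific
    # parameter seeds a pseudo-random shuffle of the bit positions.
    order = list(range(len(bits)))
    random.Random(app_key).shuffle(order)
    permuted = [bits[i] for i in order]

    # Step 2: randomized response. Each bit is flipped independently
    # with probability flip_prob.
    rng = random.Random(rr_seed)
    flipped = [b ^ 1 if rng.random() < flip_prob else b for b in permuted]

    # Step 3: store only the sum of the bits in each block.
    return [sum(flipped[i:i + block_size])
            for i in range(0, len(flipped), block_size)]


# Toy 10-bit code; real iris codes are far longer.
code = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]
template = protect_template(code, app_key=42, flip_prob=0.1, block_size=4, rr_seed=7)
```

In this sketch, issuing a new `app_key` yields a different permutation and therefore a different stored template for the same iris code.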


Paper ID 64. Fully Automated HTML and Javascript Rewriting for Constructing a Self-healing Web Proxy

Thomas Durieux (INRIA), Youssef Hamadi (LIX), and Martin Monperrus (KTH)

Abstract

Over the last few years, the complexity of web applications has increased to provide more dynamic web applications to users. The drawback of this complexity is the growing number of errors in front-end applications. In this paper, we present BikiniProxy, a novel technique to provide self-healing for the web. BikiniProxy is designed as an HTTP proxy that uses five self-healing strategies to rewrite buggy HTML and Javascript code. We evaluate BikiniProxy with a new benchmark of 555 reproducible Javascript errors, of which 31% can be automatically self-healed.


Paper ID 69. Detection of Mirai by Syntactic and Behavioral Analysis

Najah Ben Said (INRIA), Fabrizio Biondi (CentraleSupelec), Vesselin Bontchev (Bulgarian Academy of Sciences), Olivier Decourbe, Thomas Given-Wilson, Axel Legay and Jean Quilbeuf (INRIA)

Abstract

The largest botnet DDoS attacks in history have been executed by devices controlled by the Mirai botnet trojan. To prevent Mirai from spreading, this paper presents and evaluates techniques to classify binary samples as Mirai based on their syntactic and behavioral properties. Syntactic malware detection is shown to have a good detection rate and no false positives, but to be very easy to circumvent. Behavioral malware detection is resistant to simple obfuscation and has a better detection rate than syntactic detection, while keeping false positives at zero. This paper demonstrates these results and concludes by showing how to combine syntactic and behavioral analysis techniques for the detection of Mirai.


Paper ID 72. ReTEST: A Cost Effective Test Case Selection Technique for Modern Software Development

Maral Azizi and Hyunsook Do (University of North Texas)

Abstract

Regression test selection offers cost savings by selecting a subset of existing test cases when testers validate the modified version of the application. Existing test selection approaches typically utilize static or dynamic analysis to decide which test cases should be selected, and these analyses are often very time-consuming. In this paper, we propose a novel language-independent Regression TEst SelecTion (ReTEST) technique that facilitates a lightweight analysis by using information retrieval. ReTEST uses fault history, test case diversity, and program change history information to develop a test case selection technique. Our empirical evaluation with four open source programs shows that our approach can be effective and efficient in practice by selecting a far smaller subset of test cases compared to existing techniques.


Paper ID 75. Search-Based Test Data Generation for JavaScript Functions that Interact with the DOM

Alexander Elyasov, Wishnu Prasetya, and Jurriaan Hage (Utrecht University)

Abstract

The popularity of JavaScript is continuously growing. Together with HTML and CSS, it is the core technology for modern web development. Because of its dynamic nature and complex interplay with HTML, JS applications are often error-prone and vulnerable. Despite active research efforts to devise intricate static and dynamic analyses to facilitate JS testing, the problem of test data generation for JS code interacting with the DOM has not yet been addressed. In this paper, we present a novel JavaScript Evolution-based testing framework with DOM as an Input, called JEDI. In order to reach a target branch, it applies genetic search for relevant input parameters of the JS function, including the global DOM state. We conducted an empirical evaluation to study the effectiveness and efficiency of our testing framework. It shows that the "genetic with restart" algorithm proposed in this paper is able to achieve complete branch coverage for all experimental subjects, taking on average 19 seconds per branch.


Paper ID 78. A Study of Regression Test Selection in Continuous Integration Environments

Ting Wang and Tingting Yu (University of Kentucky)

Abstract

Continuous integration (CI) systems perform automated build, test execution, and delivery of the software. CI can provide fast feedback on software changes, minimizing the time and effort required in each iteration. At the same time, it is important to ensure that enough testing is performed prior to code submission to avoid breaking builds. Recent approaches have been proposed to improve the cost-effectiveness of regression testing through techniques such as regression test selection (RTS). These approaches target CI environments because traditional RTS techniques often use code instrumentation or very fine-grained dependency analysis, which may not be able to handle rapid changes. In this paper, we study in depth the usage of RTS in CI environments for different open-source projects. We analyze 924 open-source projects using CI in GitHub to understand 1) under what conditions RTS is needed, and 2) how to balance the trade-offs between granularity levels to perform cost-effective RTS. The findings of this study can help practitioners and researchers develop more advanced RTS techniques adapted to CI environments.


Paper ID 79. You Are Where You APP: An Assessment on Location Privacy of Social APPs

Fanghua Zhao, Linan Gao (Shandong University), Yang Zhang (CISPA), Zeyu Wang, Bo Wang, and Shanqing Guo (Shandong University)

Abstract

The development of positioning technologies has digitalized people’s mobility traces for the first time in history. With GPS sensors built into mobile devices, people can share their positions by allowing APPs to access their location. The large amount of mobility data can help to build appealing applications, e.g., location recommendation. Meanwhile, location privacy has become a major concern. In this paper, we design a general system to assess whether an APP is vulnerable to location inference attacks. We utilize a series of automatic testing mechanisms, including UI match and API analysis, to extract the location information (distance) an APP provides. According to different characteristics of these APPs, we classify them into two categories corresponding to two kinds of attacks, namely attack with distance limitation (AWDL) and attack without distance limitation (AWODL). By evaluating 800 APPs, we found that 24.7% of them are vulnerable to the AWDL attack, while 11.0% are vulnerable to the AWODL attack. Moreover, some APPs even allow us to modify the parameters in HTTP requests, which largely increases the scope of the attacks. In addition, 5 APPs directly expose the exact geo-coordinates of the potential victims.


Paper ID 88. DeepMutation: Mutation Testing of Deep Learning Systems

Lei Ma (Harbin Institute of Technology), Fuyuan Zhang (Nanyang Technological University), Jiyuan Sun (Kyushu University), Minhui Xue (New York University), Bo Li (University of Illinois at Urbana-Champaign), Felix Juefei-Xu (Carnegie Mellon University), Chao Xie (Kyushu University), Li Li (Monash University), Yang Liu (Nanyang Technological University), Jianjun Zhao (Kyushu University) and Yadong Wang (Harbin Institute of Technology)

Abstract

Deep learning (DL) defines a new data-driven programming paradigm where the internal system logic is largely shaped by the training data. The standard way of evaluating DL models is to examine their performance on a test dataset. The quality of the test dataset is of great importance for gaining confidence in the trained models. With an inadequate test dataset, DL models that have achieved high test accuracy may still be vulnerable to (adversarial) attacks. In software testing, mutation testing is a well-established technique to evaluate the quality of test suites. However, due to the fundamental difference between traditional software and deep learning-based software, traditional mutation testing techniques cannot be directly applied to DL systems. In this paper, we propose a mutation testing framework specialized for DL systems. We first propose a source-level mutation testing technique to slightly modify the source (i.e., training data and training programs) of DL software, which shares the same spirit as traditional mutation testing. Then we design a set of model-level mutation testing operators that directly mutate DL models without a training process. The effectiveness of the proposed mutation techniques is demonstrated on two public datasets, MNIST and CIFAR-10, with three DL models.


Paper ID 100. FTMES: A Failed-Test-Oriented Mutant Execution Strategy for Mutation-based Fault Localization

André Oliveira (Instituto Federal de Mato Grosso), Celso G. Camilo-Junior (Universidade Federal de Goiás), Eduardo Freitas (Instituto Federal de Goiás), and Auri Marcelo Rizzo Vincenzi (Universidade Federal de São Carlos)

Abstract

Fault localization has been identified as one of the most expensive parts of the whole debugging activity. Spectrum-based fault localization is the most studied and evaluated fault localization (FL) technique. Another approach is mutation-based fault localization (MBFL), which needs to execute the test suite on a large number of mutants, and this increases its cost. The research community has worked toward solutions that reduce this cost while keeping a minimum quality. Current mutant execution strategies were evaluated on artificial faults. However, recent research shows that some MBFL techniques exhibit low localization efficacy when evaluated on real faults. In this paper, we propose a Failed-Test-Oriented Mutant Execution Strategy (FTMES), aiming at effective fault localization while reducing the execution cost of the mutants. FTMES uses only the set of failed test cases to execute mutants and avoids the execution of passed test cases. The experiments were conducted on 224 real faults, comparing the localization efficacy of our approach (FTMES) against 8 baselines from the literature. We found that FTMES outperforms the baselines, with a cost reduction of 82% on average and a high efficacy in ranking defective code.


Paper ID 101. MoonlightBox: Mining Android API Histories for Uncovering Release-time Inconsistencies

Li Li (Monash University), Tegawendé F. Bissyandé, and Jacques Klein (University of Luxembourg)

Abstract

In most approaches that investigate Android apps, the release time of apps is not appropriately taken into account. Through three empirical studies, we demonstrate that the app release time is key for guaranteeing performance. Indeed, not considering time may result in serious threats to the validity of proposed approaches. Unfortunately, even approaches that consider time could face threats to validity when release times are erroneous. Symptoms of such erroneous release times appear in the form of inconsistencies with the APIs leveraged by the app. We present a tool called MoonlightBox for uncovering time inconsistencies by inferring the lower bound assembly time of a given app based on the lifetime information of the APIs it uses: any assembly time below this lower bound is considered manipulated. We further perform several experiments and confirm that 1) over 7% of Android apps are subject to time inconsistency, 2) malicious apps are more likely to be affected by time inconsistency than benign apps, and 3) time inconsistencies are favoured by some specific app lineages. Finally, we revisit the three motivating empirical studies, leveraging MoonlightBox to compute a more realistic timeline of apps. The experimental results confirm that time indeed matters. The accuracy of release time is even crucial for achieving precise results.
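
The lower-bound inference can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the MoonlightBox tool: it assumes the lower bound is the latest introduction date among the APIs an app uses, and the API names, dates, and function names are hypothetical.

```python
from datetime import date


def lower_bound_assembly_time(used_apis, api_introduced):
    """Latest introduction date among the known APIs used, or None if none is known.

    Assumption: an app cannot have been assembled before every API it uses existed.
    """
    known = [api_introduced[api] for api in used_apis if api in api_introduced]
    return max(known) if known else None


def is_time_inconsistent(claimed, used_apis, api_introduced):
    """True when the claimed assembly time falls below the inferred lower bound."""
    bound = lower_bound_assembly_time(used_apis, api_introduced)
    return bound is not None and claimed < bound


if __name__ == "__main__":
    # Hypothetical API introduction dates, for illustration only.
    api_introduced = {
        "api.alpha": date(2010, 1, 1),
        "api.beta": date(2014, 6, 1),
    }
    used = ["api.alpha", "api.beta"]
    print(lower_bound_assembly_time(used, api_introduced))
    print(is_time_inconsistent(date(2012, 1, 1), used, api_introduced))
```

APIs without known lifetime information are skipped here, so the check errs toward not flagging an app when the evidence is missing.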


Paper ID 107. Bugaroo: Exposing Memory Model Bugs in Many-core Systems

Mohammad Majharul Islam (Intel) and Abdullah Muzahid (The University of Texas at San Antonio)

Abstract

Modern many-core architectures such as GPUs aggressively reorder and buffer memory accesses. Updates to shared and global data are not guaranteed to be visible to concurrent threads immediately. Such updates can be made visible to other threads by using fence instructions. Therefore, missing a required fence can introduce subtle bugs called memory model bugs. We propose Bugaroo to expose memory model bugs in arbitrary GPU programs. It works by statically instrumenting the code to buffer some shared and global data for as long as possible without violating the semantics of any fence or synchronization instruction. Any program failure that results from such buffering indicates the presence of subtle memory model bugs in the program. Bugaroo then provides detailed debugging information regarding the failure. Bugaroo is the first proposal to expose memory model bugs in GPU programs by simulating memory buffers. We present a detailed design and implementation of Bugaroo. We evaluated it using seven programs. Our approach uncovers new findings about missing and redundant fences in two of the programs. This makes Bugaroo an effective and useful tool for GPU programmers.


Paper ID 108. Evaluating Regression Test Selection Opportunities in a Very Large Open-Source Ecosystem

Alex Gyori, Owolabi Legunsen, Farah Hariri, and Darko Marinov (University of Illinois at Urbana-Champaign)

Abstract

Library developers are increasingly releasing their libraries publicly, and other developers build client projects that depend on these libraries, creating an ecosystem of interconnected software projects. This paper analyzes regression test selection (RTS) opportunities in one such ecosystem: the Maven Central repository. There, some popular libraries have up to 924,589 client projects; in turn, client projects can depend on up to 11,190 libraries. However, we found in a sample of 408 popular projects that 202 (almost half!) cannot update to the latest library versions without breaking compilation or tests. This problem shows a challenge that many library developers currently face in checking that their code changes do not break client projects. Meanwhile, some large proprietary software organizations have systems like Google’s TAP and Microsoft’s CloudBuild, which check that internally developed library changes do not break clients. We evaluate the costs that open-source library developers would incur if they ran client projects’ tests, not just the library’s tests, after each code change. We compare five RTS techniques that work at different granularity levels of analyses and changes. The results show that techniques at lower granularity provide big benefits; they run, on average, between 7.8% and 10.5% of all client projects’ tests. Our study provides strong evidence that it is worthwhile to build RTS tools that work in very large-scale software ecosystems, enable software testing research at scale, and reduce costs incurred by library developers when checking that changes do not break clients.
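
The basic idea of regression test selection can be illustrated with a generic dependency-based sketch. It is not one of the five techniques compared in the paper: it assumes that each client test has a known set of library elements it depends on, and all names in the example are hypothetical.

```python
def select_tests(test_deps, changed):
    """Select the tests whose dependencies overlap the changed elements.

    test_deps maps a test name to the library elements it depends on (assumed known).
    """
    changed = set(changed)
    return sorted(name for name, deps in test_deps.items() if changed & set(deps))


def selected_fraction(test_deps, changed):
    """Fraction of all tests that would be rerun for this change."""
    if not test_deps:
        return 0.0
    return len(select_tests(test_deps, changed)) / len(test_deps)


if __name__ == "__main__":
    # Hypothetical client tests and the library classes they depend on.
    test_deps = {
        "ClientATest": {"lib.Parser", "lib.Lexer"},
        "ClientBTest": {"lib.Writer"},
        "ClientCTest": {"lib.Parser"},
        "ClientDTest": {"lib.Cache"},
    }
    print(select_tests(test_deps, ["lib.Parser"]))
    print(selected_fraction(test_deps, ["lib.Parser"]))
```

Tracking dependencies at a finer granularity shrinks each dependency set, so fewer tests overlap a given change, at the price of a more expensive analysis.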