Saturday 22 December 2012

Mobi-Sync: Efficient Time Synchronization for Mobile Underwater Sensor Networks

Abstract


Time synchronization is an important requirement for many services provided by distributed networks. Many time synchronization protocols have been proposed for terrestrial Wireless Sensor Networks (WSNs). However, none of them can be directly applied to Underwater Sensor Networks (UWSNs). A synchronization algorithm for UWSNs must consider additional factors such as long propagation delays from the use of acoustic communication and sensor node mobility. These unique challenges make the accuracy of synchronization procedures for UWSNs even more critical. Time synchronization solutions specifically designed for UWSNs are needed to satisfy these new requirements. This paper proposes Mobi-Sync, a novel time synchronization scheme for mobile underwater sensor networks. Mobi-Sync distinguishes itself from previous approaches for terrestrial WSNs by considering spatial correlation among the mobility patterns of neighboring UWSN nodes. This enables Mobi-Sync to accurately estimate the long dynamic propagation delays. Simulation results show that Mobi-Sync outperforms existing schemes in both accuracy and energy efficiency.

Microarchitecture of a Coarse-Grain Out-of-Order Superscalar Processor

Abstract


We explore the design, implementation, and evaluation of a coarse-grain superscalar processor in the context of the microarchitecture of the Control Processor (CP) of the Multilevel Computing Architecture (MLCA), a novel architecture targeted for multimedia multicore systems. The MLCA augments a traditional multicore architecture (called the lower level) with a CP (called the top-level), which automatically extracts parallelism among coarse-grain units of computation (tasks), synchronizes these tasks and schedules them for execution on processors. It does so in a fashion similar to how instruction-level parallelism is extracted by superscalar processors, i.e., using register renaming, Out-of-Order Execution (OoOE) and scheduling. The coarse-grain nature of tasks imposes challenging constraints on the direct use of these techniques, but also offers opportunities for simpler designs. We analyze the impact of these constraints and opportunities and present novel microarchitectural mechanisms for coarse-grain superscalar execution, including register renaming, task queue, dynamic out-of-order scheduling and task-issue. We design an MLCA system around our CP microarchitecture and implement it on an FPGA. We evaluate the system using multimedia applications and show good scalability for eight processors, limited by the memory bandwidth of the FPGA platform. Furthermore, we show that the CP introduces little overhead in terms of resource usage. Finally, we show scalability beyond eight processors using cycle-accurate RTL-level simulation with an idealized memory subsystem. We demonstrate that the CP poses no performance bottlenecks and is scalable up to 32 processors.

IP-Geolocation Mapping for Moderately Connected Internet Regions

Abstract


Most IP-geolocation mapping schemes [14], [16], [17], [18] take a delay-measurement approach, based on the assumption of a strong correlation between networking delay and geographical distance between the targeted client and the landmarks. In this paper, however, we investigate a large region of moderately connected Internet and find that the delay-distance correlation is weak. But we discover a more probable rule: with high probability the shortest delay comes from the closest distance. Based on this closest-shortest rule, we develop a simple and novel IP-geolocation mapping scheme for moderately connected Internet regions, called GeoGet. In GeoGet, we take a large number of webservers as passive landmarks and map a targeted client to the geolocation of the landmark that has the shortest delay. We further use JavaScript at targeted clients to generate HTTP/Get probing for delay measurement. To control the measurement cost, we adopt a multistep probing method to refine the geolocation of a targeted client, finally to city level. The evaluation results show that when probing about 100 landmarks, GeoGet correctly maps 35.4 percent of clients to city level, which outperforms current schemes such as GeoLim [16] and GeoPing [14] by 270 and 239 percent, respectively, and the median error distance in GeoGet is around 120 km, outperforming GeoLim and GeoPing by 37 and 70 percent, respectively.
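
The closest-shortest rule described in the abstract can be illustrated with a minimal Python sketch. It only shows the final mapping step (pick the landmark with the smallest measured delay); the function name, landmark identifiers, city labels, and delay values are hypothetical, and the multistep probing and the JavaScript-based measurement are not modeled.

```python
from typing import Dict, Tuple


def closest_shortest(delays_ms: Dict[str, float],
                     landmark_city: Dict[str, str]) -> Tuple[str, str]:
    """Return (landmark, city) for the landmark with the smallest delay.

    delays_ms maps a landmark id to its measured delay in milliseconds;
    landmark_city maps a landmark id to its known city. Both are
    illustrative inputs, not data from the paper.
    """
    if not delays_ms:
        raise ValueError("no delay measurements available")
    # The landmark with the shortest delay is assumed to be the closest one.
    best = min(delays_ms, key=delays_ms.get)
    return best, landmark_city[best]


# Hypothetical measurements for one targeted client.
delays = {"lm-a": 48.0, "lm-b": 12.5, "lm-c": 30.1}
cities = {"lm-a": "CityA", "lm-b": "CityB", "lm-c": "CityC"}
result = closest_shortest(delays, cities)
print(result)
```

In this sketch the client is simply assigned the city of the winning landmark, so accuracy depends on how densely the landmarks cover the region.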

In-Network Estimation with Delay Constraints in Wireless Sensor Networks

Abstract


The use of wireless sensor networks (WSNs) for closing the loops between the cyberspace and the physical processes is more attractive and promising for future control systems. For some real-time control applications, controllers need to accurately estimate the process state within rigid delay constraints. In this paper, we propose a novel in-network estimation approach for state estimation with delay constraints in multihop WSNs. For accurately estimating a process state as well as satisfying rigid delay constraints, we address the problem through jointly designing in-network estimation operations and an aggregation scheduling algorithm. Our in-network estimation operation performed at relays not only optimally fuses the estimates obtained from the different sensors but also predicts the upper stream sensors' estimates which cannot be aggregated to the sink before deadlines. Our estimate aggregation scheduling algorithm, which is interference free, is able to aggregate as much estimate information as possible from the network to the sink within delay constraints. We proved the unbiasedness of in-network estimation, and theoretically analyzed the optimality of our approach. Our simulation results corroborate our theoretical results and show that our in-network estimation approach can obtain significant estimation accuracy gain under different network settings.

IDM: An Indirect Dissemination Mechanism for Spatial Voice Interaction in Networked Virtual Environments

Abstract


One type of Peer-to-Peer (P2P) live streaming has not yet been significantly investigated, namely topologies that provide many-to-many, interactive connectivity. Exemplar applications of such P2P systems include spatial audio services for networked virtual environments (NVEs) and distributed online games. Numerous challenging problems have to be overcome—among them providing low delay, resilience to churn, effective load balancing, and rapid convergence—in such dynamic environments. We propose a novel P2P overlay dissemination mechanism, termed IDM, that can satisfy such demanding real-time requirements. Our target application is to provide spatialized voice support in multiplayer NVEs, where each bandwidth constrained peer potentially communicates with all other peers within its area-of-interest (AoI). With IDM each peer maintains a set of partners, termed helpers, which may act as stream forwarders. We prove analytically that the system reachability is maximized when the loads of helpers are balanced proportionally to their network capacities. We then propose a game-theoretic algorithm that balances the loads of the peers in a fully distributed manner. Of practical importance in dynamic systems, we prove that our algorithm converges to an approximately balanced state from any prior state in rapid O(log log n) time, where n is the number of users. We further evaluate our technique with simulations and show that it can achieve near optimal system reachability and satisfy the tight latency constraints of interactive audio under conditions of churn, avatar mobility, and heterogeneous user access network bandwidth.

Gaussian versus Uniform Distribution for Intrusion Detection in Wireless Sensor Networks

Abstract


In a Wireless Sensor Network (WSN), intrusion detection is of significant importance in many applications in detecting malicious or unexpected intruder(s). The intruder can be an enemy in a battlefield, or a malicious moving object in the area of interest. With uniform sensor deployment, the detection probability is the same for any point in a WSN. However, some applications may require different degrees of detection probability at different locations. For example, an intrusion detection application may need improved detection probability around important entities. Gaussian-distributed WSNs can provide differentiated detection capabilities at different locations but related work is limited. This paper analyzes the problem of intrusion detection in a Gaussian-distributed WSN by characterizing the detection probability with respect to the application requirements and the network parameters under both single-sensing detection and multiple-sensing detection scenarios. Effects of different network parameters on the detection probability are examined in detail. Furthermore, performance of Gaussian-distributed WSNs is compared with uniformly distributed WSNs. This work allows us to analytically formulate detection probability in a random WSN and provides guidelines in selecting an appropriate deployment strategy and determining critical network parameters.

Fast Channel Zapping with Destination-Oriented Multicast for IP Video Delivery

Abstract


Channel zapping time is a critical quality of experience (QoE) metric for IP-based video delivery systems such as IPTV. An interesting zapping acceleration scheme based on time-shifted subchannels (TSS) was recently proposed, which can ensure a zapping delay bound as well as maintain the picture quality during zapping. However, the behaviors of the TSS-based scheme have not been fully studied yet. Furthermore, the existing TSS-based implementation adopts the traditional IP multicast, which is not scalable for a large-scale distributed system. Corresponding to such issues, this paper makes contributions in two aspects. First, we resort to theoretical analysis to understand the fundamental properties of the TSS-based service model. We show that there exists an optimal subchannel data rate which minimizes the redundant traffic transmitted over subchannels. Moreover, we reveal a start-up effect, where the existing operation pattern in the TSS-based model could violate the zapping delay bound. With a solution proposed to resolve the start-up effect, we rigorously prove that a zapping delay bound equal to the subchannel time shift is guaranteed by the updated TSS-based model. Second, we propose a destination-oriented-multicast (DOM) assisted zapping acceleration (DAZA) scheme for a scalable TSS-based implementation, where a subscriber can seamlessly migrate from a subchannel to the main channel after zapping without any control message exchange over the network. Moreover, the subchannel selection in DAZA is independent of the zapping request signaling delay, resulting in improved robustness and reduced messaging overhead in a distributed environment. We implement DAZA in ns-2 and multicast an MPEG-4 video stream over a practical network topology. Extensive simulation results are presented to demonstrate the validity of our analysis and DAZA scheme.

Exploiting Ubiquitous Data Collection for Mobile Users in Wireless Sensor Networks

Abstract


We study the ubiquitous data collection for mobile users in wireless sensor networks. People with handheld devices can easily interact with the network and collect data. We propose a novel approach for mobile users to collect the network-wide data. The routing structure of data collection is additively updated with the movement of the mobile user. With this approach, we only perform a limited modification to update the routing structure while the routing performance is bounded and controlled compared to the optimal performance. The proposed protocol is easy to implement. Our analysis shows that the proposed approach is scalable in maintenance overheads, performs efficiently in the routing performance, and provides continuous data delivery during the user movement. We implement the proposed protocol in a prototype system and test its feasibility and applicability by a 49-node testbed. We further conduct extensive simulations to examine the efficiency and scalability of our protocol with varied network settings.

Dynamic Coverage of Mobile Sensor Networks

Abstract


We study the dynamic aspects of the coverage of a mobile sensor network resulting from continuous movement of sensors. As sensors move around, initially uncovered locations may be covered at a later time, and intruders that might never be detected in a stationary sensor network can now be detected by moving sensors. However, this improvement in coverage is achieved at the cost that a location is covered only part of the time, alternating between covered and not covered. We characterize area coverage at specific time instants and during time intervals, as well as the time durations that a location is covered and uncovered. We further consider the time it takes to detect a randomly located intruder and prove that the detection time is exponentially distributed with parameter $2\lambda r \bar{v}_s$, where $\lambda$ represents the sensor density, $r$ represents the sensor's sensing range, and $\bar{v}_s$ denotes the average sensor speed. For mobile intruders, we take a game theoretic approach and derive optimal mobility strategies for both sensors and intruders. We prove that the optimal sensor strategy is to choose their directions uniformly at random in $[0, 2\pi)$. The optimal intruder strategy is to remain stationary. This solution represents a mixed strategy which is a Nash equilibrium of the zero-sum game between mobile sensors and intruders.
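
As a small worked example of the exponential detection-time result stated in the abstract, the Python sketch below computes the rate 2 * lambda * r * v_s, the mean detection time (the reciprocal of the rate), and the probability of detection within a given time. The function names and the numeric values are assumptions chosen for illustration; they do not come from the paper.

```python
import math


def detection_rate(density: float, sensing_range: float, avg_speed: float) -> float:
    """Rate of the exponential detection time: 2 * lambda * r * v_s."""
    return 2.0 * density * sensing_range * avg_speed


def mean_detection_time(density: float, sensing_range: float, avg_speed: float) -> float:
    """Mean of an exponential distribution is 1 / rate."""
    return 1.0 / detection_rate(density, sensing_range, avg_speed)


def prob_detected_within(t: float, density: float, sensing_range: float,
                         avg_speed: float) -> float:
    """Exponential CDF: P(T <= t) = 1 - exp(-rate * t)."""
    return 1.0 - math.exp(-detection_rate(density, sensing_range, avg_speed) * t)


# Hypothetical parameters: 0.01 sensors per square metre, 10 m sensing
# range, 1 m/s average sensor speed.
rate = detection_rate(0.01, 10.0, 1.0)
mean_t = mean_detection_time(0.01, 10.0, 1.0)
p_10s = prob_detected_within(10.0, 0.01, 10.0, 1.0)
print(rate, mean_t, p_10s)
```

Doubling any one of the three parameters doubles the rate and halves the expected detection time, which follows directly from the form of the parameter.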

Distributed k-Core Decomposition

Abstract


Several novel metrics have been proposed in recent literature in order to study the relative importance of nodes in complex networks. Among those, k-coreness has found a number of applications in areas as diverse as sociology, proteinomics, graph visualization, and distributed system analysis and design. This paper proposes new distributed algorithms for the computation of the k-coreness of a network, a process also known as k-core decomposition. This technique 1) allows the decomposition, over a set of connected machines, of very large graphs, when size does not allow storing and processing them on a single host, and 2) enables the runtime computation of k-cores in “live” distributed systems. Lower bounds on the algorithms complexity are given, and an exhaustive experimental analysis on real-world data sets is provided.
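
For readers unfamiliar with k-coreness, the following Python sketch computes core numbers with the classical centralized peeling procedure (repeatedly remove a minimum-degree node). This is not the distributed algorithm proposed in the paper; it is a single-machine baseline, and the function name and the example graph are assumptions made for illustration.

```python
from typing import Dict, Hashable, Set


def core_numbers(adj: Dict[Hashable, Set[Hashable]]) -> Dict[Hashable, int]:
    """Return the k-coreness of every node of an undirected graph.

    adj maps each node to the set of its neighbours. The loop repeatedly
    removes a node of minimum remaining degree; the largest minimum degree
    seen so far is the coreness of the removed node.
    """
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    remaining = set(adj)
    core = {}
    k = 0
    while remaining:
        v = min(remaining, key=degree.get)  # simple O(n) selection per step
        k = max(k, degree[v])
        core[v] = k
        remaining.remove(v)
        for u in adj[v]:
            if u in remaining:
                degree[u] -= 1
    return core


# Hypothetical graph: a triangle a-b-c with a pendant node d attached to a.
graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
core = core_numbers(graph)
print(core)
```

The triangle nodes end up in the 2-core while the pendant node only belongs to the 1-core. The linear scan for the minimum keeps the sketch short; bucket-based implementations avoid that cost on large graphs.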

Distributed Data Replenishment

Abstract


We propose a distributed data replenishment mechanism for some distributed peer-to-peer-based storage systems that automates the process of maintaining a sufficient level of data redundancy to ensure the availability of data in presence of peer departures and failures. The dynamics of peers entering and leaving the network are modeled as a stochastic process. A novel analytical time-backward technique is proposed to bound the expected time for a piece of data to remain in P2P systems. Both theoretical and simulation results are in agreement, indicating that the data replenishment via random linear network coding (RLNC) outperforms other popular strategies. Specifically, we show that the expected time for a piece of data to remain in a P2P system, the longer the better, is exponential in the number of peers used to store the data for the RLNC-based strategy, while they are quadratic for other strategies.

Cross-Layer Design of Congestion Control and Power Control in Fast-Fading Wireless Networks

Abstract


We study the cross-layer design of congestion control and power allocation with outage constraint in an interference-limited multihop wireless network. Using a complete-convexification method, we first propose a message-passing distributed algorithm that can attain the global optimal source rate and link power allocation. Despite the attractiveness of its optimality, this algorithm requires a larger message size than that of the conventional scheme, which increases network overheads. Using the bounds on outage probability, we map the outage constraint to an SIR constraint and continue developing a practical near-optimal distributed algorithm requiring only local SIR measurement at link receivers to limit the size of the message. Due to the complicated complete-convexification method, however, the congestion control of both algorithms no longer preserves the existing TCP stack. To take into account the TCP stack preserving property, we propose a third algorithm that uses a successive convex approximation method to iteratively transform the original nonconvex problem into approximated convex problems, so that the global optimal solution can be reached distributively with message-passing. Thanks to the tightness of the bounds and successive approximations, numerical results show that the gap between the three algorithms is almost indistinguishable. Although both use the same type of complete-convexification method, the numerical comparison shows that the second, near-optimal scheme has a faster convergence rate than the first, optimal one, which makes the near-optimal scheme more favorable and applicable in practice. Meanwhile, the third optimal scheme also has a faster convergence rate than that of a previous work using the logarithm successive approximation method.

Coloring-Based Inter-WBAN Scheduling for Mobile Wireless Body Area Networks

Abstract


In this study, random incomplete coloring (RIC) with low time-complexity and high spatial reuse is proposed to overcome in-between wireless-body-area-networks (WBAN) interference, which can cause serious throughput degradation and energy waste. Interference-avoidance scheduling of wireless networks can be modeled as a problem of graph coloring. For instance, high spatial-reuse scheduling for a dense sensor network is mapped to high spatial-reuse coloring; fast convergence scheduling for a mobile ad hoc network (MANET) is mapped to low time-complexity coloring. However, for a dense and mobile WBAN, inter-WBAN scheduling (IWS) should simultaneously satisfy both of the following requirements: 1) high spatial-reuse and 2) fast convergence, which are tradeoffs in conventional coloring. By relaxing the coloring rule, the proposed distributed coloring algorithm RIC avoids this tradeoff and satisfies both requirements. Simulation results verify that the proposed coloring algorithm effectively overcomes inter-WBAN interference and invariably supports higher system throughput in various mobile WBAN scenarios compared to conventional colorings.
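
To make the mapping from interference-avoidance scheduling to graph coloring concrete, here is a minimal Python sketch of a conventional greedy (complete) coloring, where each color stands for a transmission slot. It is not the RIC algorithm from the paper, which relaxes the coloring rule; the function name, the WBAN identifiers, and the conflict graph are hypothetical.

```python
from typing import Dict, Set


def greedy_coloring(conflicts: Dict[str, Set[str]]) -> Dict[str, int]:
    """Assign each WBAN the smallest slot not used by an interfering neighbour.

    conflicts maps a WBAN id to the set of WBANs it interferes with.
    Nodes are visited in sorted order so the result is deterministic.
    """
    slots: Dict[str, int] = {}
    for node in sorted(conflicts):
        used = {slots[n] for n in conflicts[node] if n in slots}
        slot = 0
        while slot in used:
            slot += 1
        slots[node] = slot
    return slots


# Hypothetical scenario: w2 interferes with both w1 and w3,
# but w1 and w3 are far enough apart to share a slot.
conflict_graph = {
    "w1": {"w2"},
    "w2": {"w1", "w3"},
    "w3": {"w2"},
}
schedule = greedy_coloring(conflict_graph)
print(schedule)
```

Here w1 and w3 reuse the same slot, which is the spatial reuse the abstract refers to, while every pair of interfering WBANs gets different slots.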

Cluster-Based Certificate Revocation with Vindication Capability for Mobile Ad Hoc Networks

Abstract


Mobile ad hoc networks (MANETs) have attracted much attention due to their mobility and ease of deployment. However, the wireless and dynamic natures render them more vulnerable to various types of security attacks than the wired networks. The major challenge is to guarantee secure network services. To meet this challenge, certificate revocation is an important integral component to secure network communications. In this paper, we focus on the issue of certificate revocation to isolate attackers from further participating in network activities. For quick and accurate certificate revocation, we propose the Cluster-based Certificate Revocation with Vindication Capability (CCRVC) scheme. In particular, to improve the reliability of the scheme, we recover the warned nodes to take part in the certificate revocation process; to enhance the accuracy, we propose the threshold-based mechanism to assess and vindicate warned nodes as legitimate nodes or not, before recovering them. The performances of our scheme are evaluated by both numerical and simulation analysis. Extensive results demonstrate that the proposed certificate revocation scheme is effective and efficient to guarantee secure communications in mobile ad hoc networks.

Analysis of Distance-Based Location Management in Wireless Communication Networks

Abstract


The performance of dynamic distance-based location management schemes (DBLMS) in wireless communication networks is analyzed. A Markov chain is developed as a mobility model to describe the movement of a mobile terminal in 2D cellular structures. The paging area residence time is characterized for arbitrary cell residence time by using the Markov chain. The expected number of paging area boundary crossings and the cost of the distance-based location update method are analyzed by using the classical renewal theory for two different call handling models. For the call plus location update model, two cases are considered. In the first case, the intercall time has an arbitrary distribution and the cell residence time has an exponential distribution. In the second case, the intercall time has a hyper-Erlang distribution and the cell residence time has an arbitrary distribution. For the call without location update model, both intercall time and cell residence time can have arbitrary distributions. Our analysis makes it possible to find the optimal distance threshold that minimizes the total cost of location management in a DBLMS.

A Secure Payment Scheme with Low Communication and Processing Overhead for Multihop Wireless Networks

Abstract


We propose RACE, a report-based payment scheme for multihop wireless networks to stimulate node cooperation, regulate packet transmission, and enforce fairness. The nodes submit lightweight payment reports (instead of receipts) to the accounting center (AC) and temporarily store undeniable security tokens called Evidences. The reports contain the alleged charges and rewards without security proofs, e.g., signatures. The AC can verify the payment by investigating the consistency of the reports, and clear the payment of the fair reports with almost no processing overhead or cryptographic operations. For cheating reports, the Evidences are requested to identify and evict the cheating nodes that submit incorrect reports. Instead of requesting the Evidences from all the nodes participating in the cheating reports, RACE can identify the cheating nodes with requesting few Evidences. Moreover, Evidence aggregation technique is used to reduce the Evidences' storage area. Our analytical and simulation results demonstrate that RACE requires much less communication and processing overhead than the existing receipt-based schemes with acceptable payment clearance delay and storage area. This is essential for the effective implementation of a payment scheme because it uses micropayment and the overhead cost should be much less than the payment value. Moreover, RACE can secure the payment and precisely identify the cheating nodes without false accusations.

A Scalable Server Architecture for Mobile Presence Services in Social Network Applications

Abstract


Social network applications are becoming increasingly popular on mobile devices. A mobile presence service is an essential component of a social network application because it maintains each mobile user's presence information, such as the current status (online/offline), GPS location and network address, and also updates the user's online friends with the information continually. If presence updates occur frequently, the enormous number of messages distributed by presence servers may lead to a scalability problem in a large-scale mobile presence service. To address the problem, we propose an efficient and scalable server architecture, called PresenceCloud, which enables mobile presence services to support large-scale social network applications. When a mobile user joins a network, PresenceCloud searches for the presence of his/her friends and notifies them of his/her arrival. PresenceCloud organizes presence servers into a quorum-based server-to-server architecture for efficient presence searching. It also leverages a directed search algorithm and a one-hop caching strategy to achieve small constant search latency. We analyze the performance of PresenceCloud in terms of the search cost and search satisfaction level. The search cost is defined as the total number of messages generated by the presence server when a user arrives; and search satisfaction level is defined as the time it takes to search for the arriving user's friend list. The results of simulations demonstrate that PresenceCloud achieves performance gains in the search cost without compromising search satisfaction.

Coalition-Based Cooperative Packet Delivery under Uncertainty: A Dynamic Bayesian Coalitional Game

Abstract


Cooperative packet delivery can improve the data delivery performance in wireless networks by exploiting the mobility of the nodes, especially in networks with intermittent connectivity, high delay and error rates such as wireless mobile delay-tolerant networks (DTNs). For such a network, we study the problem of rational coalition formation among mobile nodes to cooperatively deliver packets to other mobile nodes in a coalition. Such coalitions are formed by mobile nodes which can be either well behaved or misbehaving in the sense that the well-behaved nodes always help each other for packet delivery, while the misbehaving nodes act selfishly and may not help the other nodes. A Bayesian coalitional game model is developed to analyze the behavior of mobile nodes in coalition formation in presence of this uncertainty of node behavior (i.e., type). Given the beliefs about the other mobile nodes' types, each mobile node makes a decision to form a coalition, and thus the coalitions in the network vary dynamically. A solution concept called Nash-stability is considered to find a stable coalitional structure in this coalitional game with incomplete information. We present a distributed algorithm and a discrete-time Markov chain (DTMC) model to find the Nash-stable coalitional structures. We also consider another solution concept, namely, the Bayesian core, which guarantees that no mobile node has an incentive to leave the grand coalition. The Bayesian game model is extended to a dynamic game model for which we propose a method for each mobile node to update its beliefs about other mobile nodes' types when the coalitional game is played repeatedly. The performance evaluation results show that, for this dynamic Bayesian coalitional game, a Nash-stable coalitional structure is obtained in each subgame. Also, the actual payoff of each mobile node is close to that when all the information is completely known. 
In addition, the payoffs of the mobile nodes will be at least as high as those when they act alone (i.e., the mobile nodes do not form coalitions).

Secure Communication Based on Ambient Audio

Abstract


We propose to establish a secure communication channel among devices based on similar audio patterns. Features from ambient audio are used to generate a shared cryptographic key between devices without exchanging information about the ambient audio itself or the features utilized for the key generation process. We explore a common audio-fingerprinting approach and account for the noise in the derived fingerprints by employing error correcting codes. This fuzzy-cryptography scheme enables the adaptation of a specific value for the tolerated noise among fingerprints based on environmental conditions by altering the parameters of the error correction and the length of the audio samples utilized. In this paper, we experimentally verify the feasibility of the protocol in four different realistic settings and a laboratory experiment. The case studies include an office setting, a scenario where an attacker is capable of reproducing parts of the audio context, a setting near a traffic loaded road, and a crowded canteen environment. We apply statistical tests to show that the entropy of fingerprints based on ambient audio is high. The proposed scheme constitutes a totally unobtrusive but cryptographically strong security mechanism based on contextual information.

Successive Interference Cancellation: Carving Out MAC Layer Opportunities

Abstract


Successive interference cancellation (SIC) is a PHY capability that allows a receiver to decode packets that arrive simultaneously. While the technique is well known in communications literature, emerging software radio platforms are making practical experimentation feasible. This motivates us to study the extent of throughput gains possible with SIC from a MAC layer perspective and scenarios where such gains are worth pursuing. We find that contrary to our initial expectation, the gains are not high when the bits of interfering signals are not known a priori to the receiver. Moreover, we observe that the scope for SIC gets squeezed by the advances in bitrate adaptation. In particular, our analysis shows that interfering one-to-one transmissions benefit less from SIC than scenarios with many-to-one transmissions (such as when clients upload data to a common access point). In view of this, we develop an SIC-aware scheduling algorithm that employs client pairing and power reduction to extract the most gains from SIC. We believe that our findings will be useful guidelines for moving forward with SIC-aware protocol research.

Noninteractive Localization of Wireless Camera Sensors with Mobile Beacon

Abstract


Recent advances in the application field increasingly demand the use of wireless camera sensor networks (WCSNs), for which localization is a crucial task to enable various location-based services. Most of the existing localization approaches for WCSNs are essentially interactive, i.e., require the interaction among the nodes throughout the localization process. As a result, they are costly to realize in practice, vulnerable to sniffer attacks, inefficient in energy consumption and computation. In this paper, we propose LISTEN, a noninteractive localization approach. Using LISTEN, every camera sensor node only needs to silently listen to the beacon signals from a mobile beacon node and capture a few images until determining its own location. We design the movement trajectory of the mobile beacon node, which guarantees to locate all the nodes successfully. We have implemented LISTEN and evaluated it through extensive experiments. Both the analytical and experimental results demonstrate that it is accurate, cost-efficient, and especially suitable for WCSNs that consist of low-end camera sensors.

Vampire Attacks: Draining Life from Wireless Ad Hoc Sensor Networks

Abstract


Ad hoc low-power wireless networks are an exciting research direction in sensing and pervasive computing. Prior security work in this area has focused primarily on denial of communication at the routing or medium access control levels. This paper explores resource depletion attacks at the routing protocol layer, which permanently disable networks by quickly draining nodes' battery power. These "Vampire" attacks are not specific to any particular protocol, but rather rely on the properties of many popular classes of routing protocols. We find that all examined protocols are susceptible to Vampire attacks, which are devastating, difficult to detect, and easy to carry out using as few as one malicious insider sending only protocol-compliant messages. In the worst case, a single Vampire can increase network-wide energy usage by a factor of O(N), where N is the number of network nodes. We discuss methods to mitigate these types of attacks, including a new proof-of-concept protocol that provably bounds the damage caused by Vampires during the packet forwarding phase.

Group-Based Medium Access Control for IEEE 802.11n Wireless LANs

Abstract


The latest generation of Wireless Local Area Networks (WLANs) is based on IEEE 802.11n-2009 Standard. The standard provides very high data rates at the physical layer and aims to achieve a throughput at the Medium Access Control (MAC) layer that is higher than 100 Mbps. To do that, the standard introduces several mechanisms to improve the MAC efficiency. The most notable ones are the use of frame aggregation and Block-ACK frames. The standard, however, does not introduce a mechanism to reduce the probability of collision. This issue is significant because, with a high data rate, an AP would be able to serve a large number of stations, which would result in a high collision rate. In this paper, we propose a Group-based MAC (GMAC) scheme that reduces the probability of collision and also uses frame aggregation to improve the efficiency. The contending stations are divided into groups. Each group has one station that is the group leader. Only the leader stations contend, hence, reducing the probability of a collision. We evaluate the performance of our scheme with analytic and simulation results. The results show that GMAC achieves a high throughput, high fairness, low delay and maintains a high performance with high data rates.
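The effect of letting only group leaders contend can be illustrated with a deliberately simplified contention model. This is an assumption made for illustration, not the analysis used in the paper and not the actual 802.11n backoff procedure: each of n contenders draws one backoff slot uniformly from a window of W slots, and the round succeeds only if the smallest slot is held by exactly one contender. The function name and the numbers in the demo are hypothetical.

```python
def success_probability(n: int, window: int) -> float:
    """Probability that exactly one of n contenders holds the smallest slot.

    Simplified model (an assumption, not the 802.11n backoff procedure):
    every contender draws one slot uniformly from range(window).
    """
    if n < 1 or window < 1:
        raise ValueError("n and window must be positive")
    total = 0.0
    for slot in range(window):
        # One contender picks `slot`; the other n - 1 all pick a strictly later slot.
        later = (window - 1 - slot) / window
        total += n * (1.0 / window) * later ** (n - 1)
    return total


if __name__ == "__main__":
    # 40 stations contending individually vs. 5 group leaders (hypothetical numbers).
    for contenders in (40, 5):
        p = success_probability(contenders, window=16)
        print(f"{contenders:2d} contenders: collision probability = {1 - p:.3f}")
```

Under this toy model, shrinking the set of contenders from all stations to a handful of leaders sharply lowers the chance that a contention round ends in a collision, which is the intuition behind grouping.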

Discovery and Verification of Neighbor Positions in Mobile Ad Hoc Networks

Abstract


A growing number of ad hoc networking protocols and location-aware services require that mobile nodes learn the position of their neighbors. However, such a process can be easily abused or disrupted by adversarial nodes. In absence of a priori trusted nodes, the discovery and verification of neighbor positions presents challenges that have been scarcely investigated in the literature. In this paper, we address this open issue by proposing a fully distributed cooperative solution that is robust against independent and colluding adversaries, and can be impaired only by an overwhelming presence of adversaries. Results show that our protocol can thwart more than 99 percent of the attacks under the best possible conditions for the adversaries, with minimal false positive rates.

Autonomous Sensing Order Selection Strategies Exploiting Channel Access Information

Abstract


We design an efficient sensing order selection strategy for a distributed cognitive radio (CR) network, where two or more autonomous CRs sense the channels sequentially (in some sensing order) for spectrum opportunities. We are particularly interested in the case where CRs with false alarms autonomously select the sensing orders in which they visit channels, without coordination from a centralized entity. We propose an adaptive persistent sensing order selection strategy and show that this strategy converges and reduces the likelihood of collisions among the autonomous CRs as compared to a random selection of sensing orders. We also show that, when the number of CRs is less than or equal to the number of channels, the proposed strategy enables the CRs to converge to collision-free channel sensing orders. The proposed adaptive persistent strategy also reduces the expected time of arrival at collision-free sensing orders as compared to the randomize-after-every-collision strategy, in which a CR, upon colliding, randomly selects a new sensing order.

Mobile Relay Configuration in Data-Intensive Wireless Sensor Networks

Abstract


Wireless Sensor Networks (WSNs) are increasingly used in data-intensive applications such as microclimate monitoring, precision agriculture, and audio/video surveillance. A key challenge faced by data-intensive WSNs is to transmit all the data generated within an application's lifetime to the base station despite the fact that sensor nodes have limited power supplies. We propose using low-cost disposable mobile relays to reduce the energy consumption of data-intensive WSNs. Our approach differs from previous work in two main aspects. First, it does not require complex motion planning of mobile nodes, so it can be implemented on a number of low-cost mobile sensor platforms. Second, we integrate the energy consumption due to both mobility and wireless transmissions into a holistic optimization framework. Our framework consists of three main algorithms. The first algorithm computes an optimal routing tree assuming no nodes can move. The second algorithm improves the topology of the routing tree by greedily adding new nodes exploiting mobility of the newly added nodes. The third algorithm improves the routing tree by relocating its nodes without changing its topology. This iterative algorithm converges on the optimal position for each node given the constraint that the routing tree topology does not change. We present efficient distributed implementations for each algorithm that require only limited, localized synchronization. Because we do not necessarily compute an optimal topology, our final routing tree is not necessarily optimal. However, our simulation results show that our algorithms significantly outperform the best existing solutions.

Toward a Statistical Framework for Source Anonymity in Sensor Networks

Abstract


In certain applications, the locations of events reported by a sensor network need to remain anonymous. That is, unauthorized observers must be unable to detect the origin of such events by analyzing the network traffic. Known as the source anonymity problem, this problem has emerged as an important topic in the security of wireless sensor networks, with a variety of techniques based on different adversarial assumptions being proposed. In this work, we present a new framework for modeling, analyzing, and evaluating anonymity in sensor networks. The novelty of the proposed framework is twofold: first, it introduces the notion of "interval indistinguishability" and provides a quantitative measure to model anonymity in wireless sensor networks; second, it maps source anonymity to the statistical problem of binary hypothesis testing with nuisance parameters. We then analyze existing solutions for designing anonymous sensor networks using the proposed model. We show how mapping source anonymity to binary hypothesis testing with nuisance parameters leads to converting the problem of exposing private source information into searching for an appropriate data transformation that removes or minimizes the effect of the nuisance information. By doing so, we transform the problem from analyzing real-valued sample points to binary codes, which opens the door for coding theory to be incorporated into the study of anonymous sensor networks. Finally, we discuss how existing solutions can be modified to improve their anonymity.

On Centralized and Localized Approximation Algorithms for Interference-Aware Broadcast Scheduling

Abstract


Broadcast scheduling in multihop Wireless Sensor Networks (WSNs) is an effective mechanism to perform interference-aware broadcasting. Existing works provide centralized solutions, which cannot be implemented locally. Additionally, they consider very elementary network and interference models, in which either all sensor nodes have the same transmission range or their transmission ranges are equal to their interference ranges, neither of which is very practical. Furthermore, they entirely ignore the existence of WSNs in 3D. In this paper, we study broadcast scheduling in 2D and 3D WSNs. We consider that sensor nodes may have different transmission ranges and that their interference ranges are alpha times their transmission ranges (where alpha > 1). We devise efficient coloring methods for coloring a hexagonal tiling in the 2D plane and a truncated octahedron tiling in 3D space, based on which we propose O(1)-centralized approximation algorithms and O(1)-localized approximation algorithms for the broadcast scheduling problem in 2D and 3D WSNs, respectively. Our O(1)-centralized approximation algorithms for 3D WSNs and O(1)-localized approximation algorithms for 2D and 3D WSNs are the first approximation algorithms for the corresponding problems. Finally, we present an efficient greedy heuristic to study the effect of various priority metrics for greedily scheduling multiple interfering transmissions. Theoretical analysis and experimental results are provided to evaluate the performance of our algorithms.

Model-Based Analysis of Wireless System Architectures for Real-Time Applications

Abstract


We propose a model-based description and analysis framework for the design of wireless system architectures. Its aim is to address the shortcomings of existing approaches to system verification and the tracking of anomalies in safety-critical wireless systems. We use Architecture Analysis and Description Language (AADL) to describe an analysis-oriented architecture model with highly modular components. We also develop the cooperative tool chains required to analyze the performance of a wireless system by simulation. We show how this framework can support a detailed and largely automated analysis of a complicated, networked wireless system using examples from wireless healthcare and video broadcasting.

Channel Allocation and Routing in Hybrid Multichannel Multiradio Wireless Mesh Networks

Abstract


Many efforts have been devoted to maximizing network throughput in a multichannel multiradio wireless mesh network. Most current solutions are based on either purely static or purely dynamic channel allocation approaches. In this paper, we propose a hybrid multichannel multiradio wireless mesh networking architecture, where each mesh node has both static and dynamic interfaces. We first present an Adaptive Dynamic Channel Allocation protocol (ADCA), which considers optimization for both throughput and delay in the channel assignment. In addition, we propose an Interference and Congestion Aware Routing protocol (ICAR) in the hybrid network with both static and dynamic links, which balances the channel usage in the network. Our simulation results show that, compared to previous works, ADCA reduces the packet delay considerably without degrading the network throughput. The hybrid architecture shows much better adaptivity to changing traffic than a purely static architecture without a dramatic increase in overhead, and achieves lower delay than existing approaches for hybrid networks.

On the Real-Time Hardware Implementation Feasibility of Joint Radio Resource Management Policies for Heterogeneous Wireless Networks

Abstract


The study and design of Joint Radio Resource Management (JRRM) techniques is a key and challenging aspect in future heterogeneous wireless systems where different Radio Access Technologies (RATs) will physically coexist. In these systems, the total available radio resources need to be used in a coordinated way to guarantee adequate satisfaction levels to all users and to maximize the system revenues. In addition to making efficient use of the available radio resources, JRRM algorithms need to exhibit good computational performance to guarantee their future implementation viability. In this context, this paper proposes novel JRRM techniques based on linear programming, and investigates their computational cost when implemented in DSP platforms commonly used in mobile-based stations. The obtained results demonstrate the feasibility of implementing the proposed JRRM algorithms in future heterogeneous wireless systems.

Simple Hybrid and Incremental Postpruning Techniques for Rule Induction

Abstract


Pruning achieves the dual goal of reducing the complexity of the final hypothesis for improved comprehensibility, and improving its predictive accuracy by minimizing the overfitting due to noisy data. This paper presents a new hybrid pruning technique for rule induction, as well as an incremental postpruning technique based on a misclassification tolerance. Although both have been designed for RULES-7, the latter is also applicable to any rule induction algorithm in general. A thorough empirical evaluation reveals that the proposed techniques enable RULES-7 to outperform other state-of-the-art classification techniques. The improved classifier is also more accurate and up to two orders of magnitude faster than before.

Supporting Search-As-You-Type Using SQL in Databases

Abstract


A search-as-you-type system computes answers on-the-fly as a user types in a keyword query character by character. We study how to support search-as-you-type on data residing in a relational DBMS. We focus on how to support this type of search using the native database language, SQL. A main challenge is how to leverage existing database functionalities to meet the high-performance requirement to achieve an interactive speed. We study how to use auxiliary indexes stored as tables to increase search performance. We present solutions for both single-keyword queries and multikeyword queries, and develop novel techniques for fuzzy search using SQL by allowing mismatches between query keywords and answers. We present techniques to answer first-N queries and discuss how to support updates efficiently. Experiments on large, real data sets show that our techniques enable DBMS systems on a commodity computer to support search-as-you-type on tables with millions of records.
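The idea of answering each keystroke with plain SQL over an auxiliary index table can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the table names (`docs`, `prefix_index`), the choice to materialize every keyword prefix, and the helper functions are all assumptions made here, not the index design from the paper, and fuzzy matching is not covered.

```python
import sqlite3


def build(conn, records):
    """Create the base table and a hypothetical auxiliary prefix table."""
    conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("CREATE TABLE prefix_index (prefix TEXT, doc_id INTEGER)")
    conn.execute("CREATE INDEX idx_prefix ON prefix_index (prefix)")
    for doc_id, title in records:
        conn.execute("INSERT INTO docs VALUES (?, ?)", (doc_id, title))
        for word in set(title.lower().split()):
            # Store every prefix of every keyword so a lookup is an equality match.
            rows = [(word[:i], doc_id) for i in range(1, len(word) + 1)]
            conn.executemany("INSERT INTO prefix_index VALUES (?, ?)", rows)


def search(conn, typed, limit=10):
    """Return up to `limit` titles in which every typed token is a keyword prefix."""
    tokens = typed.lower().split()
    if not tokens:
        return []
    # One subquery per token, intersected so that all tokens must match.
    sub = " INTERSECT ".join(
        "SELECT doc_id FROM prefix_index WHERE prefix = ?" for _ in tokens
    )
    sql = f"SELECT title FROM docs WHERE id IN ({sub}) ORDER BY id LIMIT ?"
    return [row[0] for row in conn.execute(sql, (*tokens, limit))]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    build(conn, [(1, "Wireless Sensor Networks"),
                 (2, "Wireless Mesh Networks"),
                 (3, "Sensor Fusion")])
    for typed in ("w", "wir", "wir se"):
        print(repr(typed), "->", search(conn, typed))
```

Materializing all prefixes trades storage for lookup speed: each keystroke becomes an indexed equality lookup, and the `LIMIT` clause mirrors the first-N style of query mentioned in the abstract.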

Reinforced Similarity Integration in Image-Rich Information Networks

Abstract


Social multimedia sharing and hosting websites, such as Flickr and Facebook, contain billions of user-submitted images. Popular Internet commerce websites such as Amazon.com are also furnished with tremendous amounts of product-related images. In addition, images in such social networks are also accompanied by annotations, comments, and other information, thus forming heterogeneous image-rich information networks. In this paper, we introduce the concept of (heterogeneous) image-rich information network and the problem of how to perform information retrieval and recommendation in such networks. We propose a fast algorithm heterogeneous minimum order k-SimRank (HMok-SimRank) to compute link-based similarity in weighted heterogeneous information networks. Then, we propose an algorithm Integrated Weighted Similarity Learning (IWSL) to account for both link-based and content-based similarities by considering the network structure and mutually reinforcing link similarity and feature weight learning. Both local and global feature learning methods are designed. Experimental results on Flickr and Amazon data sets show that our approach is significantly better than traditional methods in terms of both relevance and speed. A new product search and recommendation system for e-commerce has been implemented based on our algorithm.

Minimally Supervised Novel Relation Extraction Using a Latent Relational Mapping

Abstract


The World Wide Web includes semantic relations of numerous types that exist among different entities. Extracting the relations that exist between two entities is an important step in various Web-related tasks such as information retrieval (IR), information extraction, and social network extraction. A supervised relation extraction system that is trained to extract a particular relation type (source relation) might not accurately extract a new type of relation (target relation) for which it has not been trained. However, it is costly to create training data manually for every new relation type that one might want to extract. We propose a method to adapt an existing relation extraction system to extract new relation types with minimum supervision. Our proposed method comprises two stages: learning a lower dimensional projection between different relations, and learning a relational classifier for the target relation type with instance sampling. First, to represent a semantic relation that exists between two entities, we extract lexical and syntactic patterns from contexts in which those two entities co-occur. Then, we construct a bipartite graph between relation-specific (RS) and relation-independent (RI) patterns. Spectral clustering is performed on the bipartite graph to compute a lower dimensional projection. Second, we train a classifier for the target relation type using a small number of labeled instances. To account for the lack of target relation training instances, we present a one-sided undersampling method. We evaluate the proposed method using a data set that contains 2,000 instances for 20 different relation types. Our experimental results show that the proposed method achieves a statistically significant macroaverage F-score of 62.77. Moreover, the proposed method outperforms numerous baselines and a previously proposed weakly supervised relation extraction method.

k-Pattern Set Mining under Constraints

Abstract


We introduce the problem of k-pattern set mining, concerned with finding a set of k related patterns under constraints. This contrasts with regular pattern mining, where one searches for many individual patterns. The k-pattern set mining problem is a very general problem that can be instantiated to a wide variety of well-known mining tasks including concept-learning, rule-learning, redescription mining, conceptual clustering and tiling. To this end, we formulate a large number of constraints for use in k-pattern set mining, both at the local level, that is, on individual patterns, and at the global level, that is, on the overall pattern set. Building general solvers for the pattern set mining problem remains a challenge. Here, we investigate to what extent constraint programming (CP) can be used as a general solution strategy. We present a mapping of pattern set constraints to constraints currently available in CP. This allows us to investigate a large number of settings within a unified framework and to gain insight into the possibilities and limitations of these solvers. This is important as it allows us to create guidelines on how to model new problems successfully and how to model existing problems more efficiently. It also opens up the way for other solver technologies.

Halite: Fast and Scalable Multiresolution Local-Correlation Clustering

Abstract


This paper proposes Halite, a novel, fast, and scalable clustering method that looks for clusters in subspaces of multidimensional data. Existing methods are typically superlinear in space or execution time. Halite's strengths are that it is fast and scalable, while still giving highly accurate results. Specifically, the main contributions of Halite are: 1) Scalability: it is linear or quasi-linear in time and space regarding the data size and dimensionality, and the dimensionality of the clusters' subspaces; 2) Usability: it is deterministic, robust to noise, doesn't take the number of clusters as an input parameter, and detects clusters in subspaces generated by original axes or by their linear combinations, including space rotation; 3) Effectiveness: it is accurate, providing results with equal or better quality compared to top related works; and 4) Generality: it includes a soft clustering approach. Experiments on synthetic data ranging from five to 30 axes and up to 1 million points were performed. Halite was on average at least 12 times faster than seven representative works, and always presented highly accurate results. On real data, Halite was at least 11 times faster than others, increasing their accuracy by up to 35 percent. Finally, we report experiments in a real scenario where soft clustering is desirable.

Finding Rare Classes: Active Learning with Generative and Discriminative Models

Abstract


Discovering rare categories and classifying new instances of them are important data mining issues in many fields, but fully supervised learning of a rare class classifier is prohibitively costly in labeling effort. There has therefore been increasing interest both in active discovery: to identify new classes quickly, and active learning: to train classifiers with minimal supervision. These goals occur together in practice and are intrinsically related because examples of each class are required to train a classifier. Nevertheless, very few studies have tried to optimise them together, meaning that data mining for rare classes in new domains makes inefficient use of human supervision. Developing active learning algorithms to optimise both rare class discovery and classification simultaneously is challenging because discovery and classification have conflicting requirements in query criteria. In this paper, we address these issues with two contributions: a unified active learning model to jointly discover new categories and learn to classify them by adapting query criteria online; and a classifier combination algorithm that switches generative and discriminative classifiers as learning progresses. Extensive evaluation on a batch of standard UCI and vision data sets demonstrates the superiority of this approach over existing methods.

Fast Activity Detection: Indexing for Temporal Stochastic Automaton-Based Activity Models

Abstract


Today, numerous applications require the ability to monitor a continuous stream of fine-grained data for the occurrence of certain high-level activities. A number of computerized systems, including ATM networks, web servers, and intrusion detection systems, systematically track every atomic action we perform, thus generating massive streams of timestamped observation data, possibly from multiple concurrent activities. In this paper, we address the problem of efficiently detecting occurrences of high-level activities from such interleaved data streams. A solution to this important problem would greatly benefit a broad range of applications, including fraud detection, video surveillance, and cyber security. There has been extensive work in the last few years on modeling activities using probabilistic models. In this paper, we propose a temporal probabilistic graph so that the elapsed time between observations also plays a role in defining whether a sequence of observations constitutes an activity. We first propose a data structure called “temporal multiactivity graph” to store multiple activities that need to be concurrently monitored. We then define an index called Temporal Multiactivity Graph Index Creation (tMAGIC) that, based on this data structure, examines and links observations as they occur. We define algorithms for insertion and bulk insertion into the tMAGIC index and show that this can be efficiently accomplished. We also define algorithms to solve two problems: the “evidence” problem that tries to find all occurrences of an activity (with probability over a threshold) within a given sequence of observations, and the “identification” problem that tries to find the activity that best matches a sequence of observations. We introduce complexity-reducing restrictions and pruning strategies to make the problem, which is intrinsically exponential, linear in the number of observations. Our experiments confirm that tMAGIC has time and space complexity linear in the size of the input, and can efficiently retrieve instances of the monitored activities.

AML: Efficient Approximate Membership Localization within a Web-Based Join Framework

Abstract


In this paper, we propose a new type of Dictionary-based Entity Recognition Problem, named Approximate Membership Localization (AML). The popular Approximate Membership Extraction (AME) provides full coverage of the true matched substrings from a given document, but many redundancies lower the efficiency of the AME process and deteriorate the performance of real-world applications using the extracted substrings. The AML problem aims to locate nonoverlapped substrings, which are a better approximation to the true matched substrings, without generating overlapped redundancies. In order to perform AML efficiently, we propose the optimized algorithm P-Prune, which prunes a large part of the overlapped redundant matched substrings before generating them. Our study using several real-world data sets demonstrates the efficiency of P-Prune over a baseline method. We also study AML in application to a proposed web-based join framework scenario, which is a search-based approach joining two tables using dictionary-based entity recognition from web documents. The results not only prove the advantage of AML over AME, but also demonstrate the effectiveness of our search-based approach.

Event Tracking for Real-Time Unaware Sensitivity Analysis (EventTracker)

Abstract


This paper introduces a platform for online Sensitivity Analysis (SA) that is applicable in large scale real-time data acquisition (DAQ) systems. Here, we use the term real-time in the context of a system that has to respond to externally generated input stimuli within a finite and specified period. Complex industrial systems such as manufacturing, healthcare, transport, and finance require high-quality information on which to base timely responses to events occurring in their volatile environments. The motivation for the proposed EventTracker platform is the assumption that modern industrial systems are able to capture data in real-time and have the necessary technological flexibility to adjust to changing system requirements. The flexibility to adapt can only be assured if data is succinctly interpreted and translated into corrective actions in a timely manner. An important factor that facilitates data interpretation and information modeling is an appreciation of the effect system inputs have on each output at the time of occurrence. Many existing sensitivity analysis methods appear to hamper efficient and timely analysis due to a reliance on historical data, or sluggishness in providing a timely solution that would be of use in real-time applications. This inefficiency is further compounded by computational limitations and the complexity of some existing models. In dealing with real-time event driven systems, the underpinning logic of the proposed method is based on the assumption that in the vast majority of cases changes in input variables will trigger events. Every single event or combination of events could subsequently result in a change to the system state. The proposed event tracking sensitivity analysis method describes variables and the system state as a collection of events. The higher the numeric occurrence of an input variable at the trigger level during an event monitoring interval, the greater is its impact on the final analysis of the system state. Experiments were designed to compare the proposed event tracking sensitivity analysis method with a comparable method (that of Entropy). An improvement of 10 percent in computational efficiency without loss in accuracy was observed. The comparison also showed that the time taken to perform the sensitivity analysis was 0.5 percent of that required when using the comparable Entropy-based method.

Detecting Intrinsic Loops Underlying Data Manifold

Abstract


Detecting intrinsic loop structures of a data manifold is the necessary prestep for the proper employment of the manifold learning techniques and of fundamental importance in the discovery of the essential representational features underlying the data lying on the loopy manifold. An effective strategy is proposed to solve this problem in this study. In line with our intuition, a formal definition of a loop residing on a manifold is first given. Based on this definition, theoretical properties of loopy manifolds are rigorously derived. In particular, a necessary and sufficient condition for detecting essential loops of a manifold is derived. An effective algorithm for loop detection is then constructed. The soundness of the proposed theory and algorithm is validated by a series of experiments performed on synthetic and real-life data sets. In each of the experiments, the essential loops underlying the data manifold can be properly detected, and the intrinsic representational features of the data manifold can be revealed along the loop structure so detected. Particularly, some of these features can hardly be discovered by the conventional manifold learning methods.

Clustering Large Probabilistic Graphs

Abstract


We study the problem of clustering probabilistic graphs. Similar to the problem of clustering standard graphs, probabilistic graph clustering has numerous applications, such as finding complexes in probabilistic protein-protein interaction (PPI) networks and discovering groups of users in affiliation networks. We extend the edit-distance-based definition of graph clustering to probabilistic graphs. We establish a connection between our objective function and correlation clustering to propose practical approximation algorithms for our problem. A benefit of our approach is that our objective function is parameter-free. Therefore, the number of clusters is part of the output. We also develop methods for testing the statistical significance of the output clustering and study the case of noisy clusterings. Using a real protein-protein interaction network and ground-truth data, we show that our methods discover the correct number of clusters and identify established protein relationships. Finally, we show the practicality of our techniques using a large social network of Yahoo! users consisting of one billion edges.
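An edit-distance-style objective for probabilistic graphs can be illustrated with a small sketch. It assumes that each edge exists independently with its stated probability (an assumption made here; the abstract does not spell out the model) and computes the expected number of edge additions and deletions needed to turn a random instance of the graph into the cluster graph induced by a given partition. The function and variable names are hypothetical, and the approximation algorithms themselves are not shown.

```python
from itertools import combinations


def expected_edit_distance(nodes, edge_prob, cluster_of):
    """Expected number of edge edits between a probabilistic graph and a clustering.

    nodes:      iterable of node ids.
    edge_prob:  dict mapping frozenset({u, v}) -> existence probability
                (missing pairs are treated as probability 0).
    cluster_of: dict mapping node id -> cluster label.
    Assumes edges exist independently of each other.
    """
    cost = 0.0
    for u, v in combinations(list(nodes), 2):
        p = edge_prob.get(frozenset((u, v)), 0.0)
        if cluster_of[u] == cluster_of[v]:
            cost += 1.0 - p  # edge wanted inside a cluster, absent with probability 1 - p
        else:
            cost += p        # edge unwanted across clusters, present with probability p
    return cost


if __name__ == "__main__":
    nodes = ["a", "b", "c", "d"]
    probs = {
        frozenset(("a", "b")): 0.9,
        frozenset(("c", "d")): 0.8,
        frozenset(("b", "c")): 0.1,
    }
    candidates = {
        "two clusters": {"a": 0, "b": 0, "c": 1, "d": 1},
        "one cluster": {n: 0 for n in nodes},
        "singletons": {n: i for i, n in enumerate(nodes)},
    }
    for name, clustering in candidates.items():
        print(name, round(expected_edit_distance(nodes, probs, clustering), 3))
```

Because the cost is defined for any partition, candidate clusterings with different numbers of clusters can be compared directly, which is consistent with the abstract's remark that the number of clusters is part of the output rather than an input.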

Anonymization of Centralized and Distributed Social Networks by Sequential Clustering

Abstract


We study the problem of privacy-preservation in social networks. We consider the distributed setting in which the network data is split between several data holders. The goal is to arrive at an anonymized view of the unified network without revealing to any of the data holders information about links between nodes that are controlled by other data holders. To that end, we start with the centralized setting and offer two variants of an anonymization algorithm which is based on sequential clustering (Sq). Our algorithms significantly outperform the SaNGreeA algorithm due to Campan and Truta which is the leading algorithm for achieving anonymity in networks by means of clustering. We then devise secure distributed versions of our algorithms. To the best of our knowledge, this is the first study of privacy preservation in distributed social networks. We conclude by outlining future research proposals in that direction.

A System to Filter Unwanted Messages from OSN User Walls

Abstract


One fundamental issue in today's Online Social Networks (OSNs) is to give users the ability to control the messages posted on their own private space so that unwanted content is not displayed. Up to now, OSNs have provided little support for this requirement. To fill the gap, in this paper, we propose a system allowing OSN users to have direct control over the messages posted on their walls. This is achieved through a flexible rule-based system that allows users to customize the filtering criteria to be applied to their walls, and a Machine Learning-based soft classifier that automatically labels messages in support of content-based filtering.

A Rough-Set-Based Incremental Approach for Updating Approximations under Dynamic Maintenance Environments

Abstract


Approximations of a concept by a variable precision rough-set model (VPRS) usually vary under a dynamic information system environment. It is thus effective to update approximations incrementally by utilizing previous data structures. This paper focuses on a new incremental method for updating approximations of VPRS while objects in the information system dynamically alter. It discusses properties of information granulation and approximations under the dynamic environment while objects in the universe evolve over time. The variation of an attribute's domain is also considered to perform incremental updating for approximations under VPRS. Finally, an extensive experimental evaluation validates the efficiency of the proposed method for dynamic maintenance of VPRS approximations.
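For readers unfamiliar with VPRS approximations, the following sketch computes them from scratch under one common formulation, which is an assumption here because the abstract does not state which variant the paper uses: an equivalence class belongs to the lower approximation of a concept X when its relative misclassification error is at most beta, and to the upper approximation when that error is below 1 - beta, with 0 <= beta < 0.5. The function name and data layout are hypothetical, and the paper's incremental update procedure is not reproduced.

```python
from collections import defaultdict


def vprs_approximations(objects, target, beta):
    """Return (lower, upper) beta-approximations of `target`.

    objects: dict mapping object id -> tuple of condition attribute values.
    target:  set of object ids forming the concept X.
    beta:    admissible classification error, 0 <= beta < 0.5.
    """
    if not 0 <= beta < 0.5:
        raise ValueError("beta must satisfy 0 <= beta < 0.5")
    # Objects with identical attribute values form one equivalence class.
    classes = defaultdict(set)
    for obj, values in objects.items():
        classes[values].add(obj)
    lower, upper = set(), set()
    for members in classes.values():
        # Relative misclassification error of this class with respect to X.
        error = 1 - len(members & target) / len(members)
        if error <= beta:
            lower |= members
        if error < 1 - beta:
            upper |= members
    return lower, upper


if __name__ == "__main__":
    objects = {1: ("a",), 2: ("a",), 3: ("a",), 4: ("a",),
               5: ("b",), 6: ("b",),
               7: ("c",), 8: ("c",), 9: ("c",)}
    concept = {1, 2, 3, 7, 8, 9}
    for beta in (0.2, 0.3):
        lower, upper = vprs_approximations(objects, concept, beta)
        print(beta, sorted(lower), sorted(upper))
```

Recomputing both sets over the whole universe after every insertion or deletion, as this sketch would require, is the cost that an incremental method aims to avoid.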

A Proxy-Based Approach to Continuous Location-Based Spatial Queries in Mobile Environments

Abstract


Caching valid regions of spatial queries at mobile clients is effective in reducing the number of queries submitted by mobile clients and the query load on the server. However, mobile clients suffer from longer waiting times while the server computes valid regions. We propose in this paper a proxy-based approach to continuous nearest-neighbor (NN) and window queries. The proxy creates estimated valid regions (EVRs) for mobile clients by exploiting spatial and temporal locality of spatial queries. For NN queries, we devise two new algorithms to accelerate EVR growth, leading the proxy to build effective EVRs even when the cache size is small. On the other hand, we propose to represent the EVRs of window queries in the form of vectors, called estimated window vectors (EWVs), to achieve larger estimated valid regions. This novel representation and the associated creation algorithm result in more effective EVRs of window queries. In addition, due to their distinct characteristics, we use separate index structures, namely EVR-tree and grid index, for NN queries and window queries, respectively. To further increase efficiency, we develop algorithms to exploit the results of NN queries to aid grid index growth, benefiting EWV creation of window queries. Similarly, the grid index is utilized to support NN query answering and EVR updating. We conduct several experiments for performance evaluation. The experimental results show that the proposed approach significantly outperforms the existing proxy-based approaches.

A Generalized Flow-Based Method for Analysis of Implicit Relationships on Wikipedia

Abstract


We focus on measuring relationships between pairs of objects in Wikipedia whose pages can be regarded as individual objects. Two kinds of relationships between two objects exist: in Wikipedia, an explicit relationship is represented by a single link between the two pages for the objects, and an implicit relationship is represented by a link structure containing the two pages. Some of the previously proposed methods for measuring relationships are cohesion-based methods, which underestimate objects having high degrees, although such objects could be important in constituting relationships in Wikipedia. The other methods are inadequate for measuring implicit relationships because they use only one or two of the following three important factors: distance, connectivity, and cocitation. We propose a new method using a generalized maximum flow which reflects all the three factors and does not underestimate objects having high degree. We confirm through experiments that our method can measure the strength of a relationship more appropriately than these previously proposed methods do. Another remarkable aspect of our method is mining elucidatory objects, that is, objects constituting a relationship. We explain that mining elucidatory objects would open a novel way to deeply understand a relationship.

Wednesday 19 December 2012

T-Drive: Enhancing Driving Directions with Taxi Drivers' Intelligence

Abstract


This paper presents a smart driving direction system leveraging the intelligence of experienced drivers. In this system, GPS-equipped taxis are employed as mobile sensors probing the traffic rhythm of a city and taxi drivers' intelligence in choosing driving directions in the physical world. We propose a time-dependent landmark graph to model the dynamic traffic pattern as well as the intelligence of experienced drivers so as to provide a user with the practically fastest route to a given destination at a given departure time. Then, a Variance-Entropy-Based Clustering approach is devised to estimate the distribution of travel time between two landmarks in different time slots. Based on this graph, we design a two-stage routing algorithm to compute the practically fastest and customized route for end users. We build our system based on a real-world trajectory data set generated by over 33,000 taxis in a period of three months, and evaluate the system by conducting both synthetic experiments and in-the-field evaluations. As a result, 60-70 percent of the routes suggested by our method are faster than the competing methods, and 20 percent of the routes share the same results. On average, 50 percent of our routes are at least 20 percent faster than the competing approaches.

Relationships between Diversity of Classification Ensembles and Single-Class Performance Measures

Abstract


In class imbalance learning problems, how to better recognize examples from the minority class is the key focus, since it is usually more important and expensive than the majority class. Quite a few ensemble solutions have been proposed in the literature with varying degrees of success. It is generally believed that diversity in an ensemble could help to improve the performance of class imbalance learning. However, no study has actually investigated diversity in depth in terms of its definitions and effects in the context of class imbalance learning. It is unclear whether diversity will have a similar or different impact on the performance of minority and majority classes. In this paper, we aim to gain a deeper understanding of if and when ensemble diversity has a positive impact on the classification of imbalanced data sets. First, we explain when and why diversity measured by Q-statistic can bring improved overall accuracy based on two classification patterns proposed by Kuncheva et al. We define and give insights into good and bad patterns in imbalanced scenarios. Then, the pattern analysis is extended to single-class performance measures, including recall, precision, and F-measure, which are widely used in class imbalance learning. Six different situations of diversity's impact on these measures are obtained through theoretical analysis. Finally, to further understand how diversity affects the single class performance and overall performance in class imbalance problems, we carry out extensive experimental studies on both artificial data sets and real-world benchmarks with highly skewed class distributions. We find strong correlations between diversity and discussed performance measures. Diversity shows a positive impact on the minority class in general. It is also beneficial to the overall performance in terms of AUC and G-mean.
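As a rough illustration of the quantities named in this abstract, the sketch below computes the pairwise Q-statistic between two classifiers and the single-class measures (recall, precision, F-measure) for a chosen class. The function names and argument layout are hypothetical and chosen for this example; the formulas are the standard textbook definitions, not code from the paper.

```python
def q_statistic(y_true, pred_a, pred_b):
    """Pairwise Q-statistic of two classifiers, from their correct/wrong counts."""
    n11 = n00 = n10 = n01 = 0
    for t, a, b in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == t), (b == t)
        if a_ok and b_ok:
            n11 += 1          # both correct
        elif not a_ok and not b_ok:
            n00 += 1          # both wrong
        elif a_ok:
            n10 += 1          # only the first classifier is correct
        else:
            n01 += 1          # only the second classifier is correct
    denominator = n11 * n00 + n01 * n10
    if denominator == 0:
        raise ValueError("Q-statistic is undefined for these counts")
    return (n11 * n00 - n01 * n10) / denominator


def single_class_measures(y_true, y_pred, positive):
    """Recall, precision and F-measure for one class (e.g. the minority class)."""
    tp = fp = fn = 0
    for t, p in zip(y_true, y_pred):
        if p == positive and t == positive:
            tp += 1
        elif p == positive:
            fp += 1
        elif t == positive:
            fn += 1
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return recall, precision, f_measure
```

Q ranges from -1 to 1; values near -1 indicate classifiers that tend to err on different examples, which is the sense of "diversity" measured here.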

Region-Based Foldings in Process Discovery

Abstract


A central problem in the area of Process Mining is to obtain a formal model that represents the processes that are conducted in a system. If realized, this simple motivation allows for powerful techniques that can be used to formally analyze and optimize a system, without the need to resort to its semiformal and sometimes inaccurate specification. The problem addressed in this paper is known as Process Discovery: to obtain a formal model from a set of system executions. The theory of regions is a valuable tool in process discovery: it aims at learning a formal model (Petri nets) from a set of traces. In its original form, the theory is applied to an automaton, and therefore one should convert the traces into an acyclic automaton in order to apply these techniques. Given that the complexity of the region-based techniques depends on the size of the input automata, revealing the underlying cycles and folding the initial automaton can significantly reduce the complexity of the region-based techniques. In this paper, we follow this idea by incorporating region information in the cycle detection algorithm, enabling the identification of complex cycles that cannot be obtained efficiently with state-of-the-art techniques. The experimental results obtained by the devised tool suggest that the techniques presented in this paper are a big step toward widening the application of the theory of regions in Process Mining for industrial scenarios.

Ranking on Data Manifold with Sink Points

Abstract


Ranking is an important problem in various applications, such as Information Retrieval (IR), natural language processing, computational biology, and social sciences. Many ranking approaches have been proposed to rank objects according to their degrees of relevance or importance. Beyond these two goals, diversity has also been recognized as a crucial criterion in ranking. Top ranked results are expected to convey as little redundant information as possible, and cover as many aspects as possible. However, existing ranking approaches either take no account of diversity, or handle it separately with some heuristics. In this paper, we introduce a novel approach, Manifold Ranking with Sink Points (MRSP), to address diversity as well as relevance and importance in ranking. Specifically, our approach uses a manifold ranking process over the data manifold, which can naturally find the most relevant and important data objects. Meanwhile, by turning ranked objects into sink points on the data manifold, we can effectively prevent redundant objects from receiving a high rank. MRSP not only shows a nice convergence property, but also has an interesting and satisfying optimization explanation. We applied MRSP to two application tasks, update summarization and query recommendation, where diversity is of great concern in ranking. Experimental results on both tasks show strong empirical performance of MRSP compared to existing ranking approaches.
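The sink-point idea can be sketched in a few lines. The code below is a simplified illustration, not the paper's implementation: it assumes the usual manifold ranking iteration on a symmetrically normalized affinity matrix, clamps the score of every sink point to zero, and greedily turns each selected object into a sink before re-ranking. The names `normalize`, `rank_scores`, `diverse_top_k`, and the default `alpha` are hypothetical.

```python
import math


def normalize(W):
    """Symmetric normalization S = D^(-1/2) W D^(-1/2) of a non-negative affinity matrix."""
    n = len(W)
    degree = [sum(row) for row in W]
    return [[W[i][j] / math.sqrt(degree[i] * degree[j]) if W[i][j] else 0.0
             for j in range(n)] for i in range(n)]


def rank_scores(S, y, sinks, alpha=0.85, iters=200):
    """Iterate f <- alpha * S f + (1 - alpha) * y while holding sink scores at zero."""
    n = len(S)
    f = [0.0] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            if i in sinks:
                updated.append(0.0)   # a sink absorbs score and passes none on
            else:
                spread = sum(S[i][j] * f[j] for j in range(n) if j not in sinks)
                updated.append(alpha * spread + (1 - alpha) * y[i])
        f = updated
    return f


def diverse_top_k(W, y, k, alpha=0.85):
    """Greedy selection: pick the best-scored object, make it a sink, then re-rank."""
    S = normalize(W)
    sinks, picked = set(), []
    for _ in range(k):
        f = rank_scores(S, y, sinks, alpha)
        best = max((i for i in range(len(W)) if i not in sinks), key=lambda i: f[i])
        picked.append(best)
        sinks.add(best)
    return picked
```

Because a sink contributes nothing to its neighbors, objects close to an already selected item lose the score they would otherwise receive from it, which is how redundancy is penalized in this sketch.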

Nonadaptive Mastermind Algorithms for String and Vector Databases, with Case Studies

Abstract


In this paper, we study sparsity-exploiting Mastermind algorithms for attacking the privacy of an entire database of character strings or vectors, such as DNA strings, movie ratings, or social network friendship data. Based on reductions to nonadaptive group testing, our methods are able to take advantage of minimal amounts of privacy leakage, such as contained in a single bit that indicates if two people in a medical database have any common genetic mutations, or if two people have any common friends in an online social network. We analyze our Mastermind attack algorithms using theoretical characterizations that provide sublinear bounds on the number of queries needed to clone the database, as well as experimental tests on genomic information, collaborative filtering data, and online social networks. By taking advantage of the generally sparse nature of these real-world databases and modulating a parameter that controls query sparsity, we demonstrate that relatively few nonadaptive queries are needed to recover a large majority of each database.
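To make the reduction to nonadaptive group testing more concrete, here is a minimal sketch of the simplest textbook decoder: every query (pool) returns one bit saying whether it shares any element with the hidden sparse set, and any item appearing in a negative pool is ruled out. This is a generic illustration of group testing, not the attack algorithm from the paper; `run_tests` and `comp_decode` are hypothetical names.

```python
def run_tests(pools, hidden):
    """One leaked bit per query: does the pool share any element with the hidden set?"""
    return [any(item in hidden for item in pool) for pool in pools]


def comp_decode(n_items, pools, outcomes):
    """Rule out every item that appears in a negative pool; the rest are candidates."""
    candidates = set(range(n_items))
    for pool, positive in zip(pools, outcomes):
        if not positive:
            candidates -= set(pool)
    return candidates
```

All pools are fixed before any answer is seen, which is what makes the scheme nonadaptive. The decoder never discards a true element, and when the hidden set is sparse, comparatively few pools are usually enough to eliminate most other items.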

Ontology Matching: State of the Art and Future Challenges

Abstract


After years of research on ontology matching, it is reasonable to consider several questions: is the field of ontology matching still making progress? Is this progress significant enough to pursue further research? If so, what are the particularly promising directions? To answer these questions, we review the state of the art of ontology matching and analyze the results of recent ontology matching evaluations. These results show a measurable improvement in the field, albeit at a slowing pace. We conjecture that significant improvements can be obtained only by addressing important challenges for ontology matching. We present such challenges with insights on how to approach them, thereby aiming to direct research into the most promising tracks and to facilitate the progress of the field.

On the Recovery of R-Trees

Abstract


We consider the recoverability of traditional R-tree index structures under concurrent updating transactions, an important issue that is neglected or treated inadequately in many proposals of R-tree concurrency control. We present two solutions to ARIES-based recovery of transactions on R-trees. These assume a standard fine-grained single-version update model with physiological write-ahead logging and steal-and-no-force buffering where records with uncommitted updates by a transaction may migrate from their original page to another page due to structure modifications caused by other transactions. Both solutions guarantee that an R-tree will remain in a consistent and balanced state in the presence of any number of concurrent forward-rolling and (totally or partially) backward-rolling multiaction transactions and in the event of process failures and system crashes. One solution maintains the R-tree in a strictly consistent state in which the bounding rectangles of pages are as tight as possible, while in the other solution this requirement is relaxed. In both solutions only a small constant number of simultaneous exclusive latches (write latches) are needed, and in the solution that only maintains relaxed consistency also the number of simultaneous nonexclusive latches is similarly limited. In both solutions, deletions are handled uniformly with insertions, and a logarithmic insertion-path length is maintained under all circumstances.

Maximum Likelihood Estimation from Uncertain Data in the Belief Function Framework

Abstract


We consider the problem of parameter estimation in statistical models in the case where data are uncertain and represented as belief functions. The proposed method is based on the maximization of a generalized likelihood criterion, which can be interpreted as a degree of agreement between the statistical model and the uncertain observations. We propose a variant of the EM algorithm that iteratively maximizes this criterion. As an illustration, the method is applied to uncertain data clustering using finite mixture models, in the cases of categorical and continuous attributes.

Large Graph Analysis in the GMine System

Abstract


Current applications have produced graphs on the order of hundreds of thousands of nodes and millions of edges. To take advantage of such graphs, one must be able to find patterns, outliers, and communities. These tasks are better performed in an interactive environment, where human expertise can guide the process. For large graphs, though, there are some challenges: the excessive processing requirements are prohibitive, and drawing hundreds of thousands of nodes results in cluttered images that are hard to comprehend. To cope with these problems, we propose an innovative framework suited for any kind of tree-like graph visual design. GMine integrates 1) a representation for graphs organized as hierarchies of partitions, namely the concepts of SuperGraph and Graph-Tree; and 2) a graph summarization methodology, CEPS. Our graph representation deals with the problem of tracing the connection aspects of a graph hierarchy with sublinear complexity, allowing one to grasp the neighborhood of a single node or of a group of nodes in a single click. As a proof of concept, the visual environment of GMine is instantiated as a system in which large graphs can be investigated globally and locally.

Evaluating Data Reliability: An Evidential Answer with Application to a Web-Enabled Data Warehouse

Abstract


There are many available methods to integrate information source reliability in an uncertainty representation, but there are only a few works focusing on the problem of evaluating this reliability. However, data reliability and confidence are essential components of a data warehousing system, as they influence subsequent retrieval and analysis. In this paper, we propose a generic method to assess data reliability from a set of criteria using the theory of belief functions. Customizable criteria and insightful decisions are provided. The chosen illustrative example comes from real-world data issued from the Sym'Previus predictive microbiology oriented data warehouse.

Distributed Processing of Probabilistic Top-k Queries in Wireless Sensor Networks

Abstract


In this paper, we introduce the notion of sufficient set and necessary set for distributed processing of probabilistic top-k queries in cluster-based wireless sensor networks. These two concepts have very nice properties that can facilitate localized data pruning in clusters. Accordingly, we develop a suite of algorithms, namely, sufficient set-based (SSB), necessary set-based (NSB), and boundary-based (BB), for intercluster query processing with bounded rounds of communications. Moreover, in responding to dynamic changes of data distribution in the network, we develop an adaptive algorithm that dynamically switches among the three proposed algorithms to minimize the transmission cost. We show the applicability of sufficient set and necessary set to wireless sensor networks with both two-tier hierarchical and tree-structured network topologies. Experimental results show that the proposed algorithms reduce data transmissions significantly and incur only small constant rounds of data communications. The experimental results also demonstrate the superiority of the adaptive algorithm, which achieves a near-optimal performance under various conditions.

Clustering Sentence-Level Text Using a Novel Fuzzy Relational Clustering Algorithm

Abstract


In comparison with hard clustering methods, in which a pattern belongs to a single cluster, fuzzy clustering algorithms allow patterns to belong to all clusters with differing degrees of membership. This is important in domains such as sentence clustering, since a sentence is likely to be related to more than one theme or topic present within a document or set of documents. However, because most sentence similarity measures do not represent sentences in a common metric space, conventional fuzzy clustering approaches based on prototypes or mixtures of Gaussians are generally not applicable to sentence clustering. This paper presents a novel fuzzy clustering algorithm that operates on relational input data; i.e., data in the form of a square matrix of pairwise similarities between data objects. The algorithm uses a graph representation of the data, and operates in an Expectation-Maximization framework in which the graph centrality of an object in the graph is interpreted as a likelihood. Results of applying the algorithm to sentence clustering tasks demonstrate that the algorithm is capable of identifying overlapping clusters of semantically related sentences, and that it is therefore of potential use in a variety of text mining tasks. We also include results of applying the algorithm to benchmark data sets in several other domains.

Automatic Semantic Content Extraction in Videos Using a Fuzzy Ontology and Rule-Based Model

Abstract


The recent increase in the use of video-based applications has revealed the need for extracting the content in videos. Raw data and low-level features alone are not sufficient to fulfill the user's needs; that is, a deeper understanding of the content at the semantic level is required. Currently, manual techniques, which are inefficient, subjective, and costly in time and which limit the querying capabilities, are being used to bridge the gap between low-level representative features and high-level semantic content. Here, we propose a semantic content extraction system that allows the user to query and retrieve objects, events, and concepts that are extracted automatically. We introduce an ontology-based fuzzy video semantic content model that uses spatial/temporal relations in event and concept definitions. This metaontology definition provides a wide-domain applicable rule construction standard that allows the user to construct an ontology for a given domain. In addition to domain ontologies, we use additional rule definitions (without using ontology) to lower spatial relation computation cost and to be able to define some complex situations more effectively. The proposed framework has been fully implemented and tested on three different domains. We have obtained satisfactory precision and recall rates for object, event and concept extraction.

A Survey of XML Tree Patterns

Abstract


With XML becoming a ubiquitous language for data interoperability purposes in various domains, efficiently querying XML data is a critical issue. This has led to the design of algebraic frameworks based on tree-shaped patterns akin to the tree-structured data model of XML. Tree patterns are graphic representations of queries over data trees. They are actually matched against an input data tree to answer a query. Since the turn of the 21st century, an astounding research effort has been focusing on tree pattern models and matching optimization (an issue of prime importance). This paper is a comprehensive survey of these topics, in which we outline and compare the various features of tree patterns. We also review and discuss the two main families of approaches for optimizing tree pattern matching, namely pattern tree minimization and holistic matching. We finally present actual tree pattern-based developments, to provide a global overview of this significant research topic.

A Graph-Based Consensus Maximization Approach for Combining Multiple Supervised and Unsupervised Models

Abstract


Ensemble learning has emerged as a powerful method for combining multiple models. Well-known methods, such as bagging, boosting, and model averaging, have been shown to improve accuracy and robustness over single models. However, due to the high costs of manual labeling, it is hard to obtain sufficient and reliable labeled data for effective training. Meanwhile, lots of unlabeled data exist in these sources, and we can readily obtain multiple unsupervised models. Although unsupervised models do not directly generate a class label prediction for each object, they provide useful constraints on the joint predictions for a set of related objects. Therefore, incorporating these unsupervised models into the ensemble of supervised models can lead to better prediction performance. In this paper, we study ensemble learning with outputs from multiple supervised and unsupervised models, a topic where little work has been done. We propose to consolidate a classification solution by maximizing the consensus among both supervised predictions and unsupervised constraints. We cast this ensemble task as an optimization problem on a bipartite graph, where the objective function favors the smoothness of the predictions over the graph, but penalizes the deviations from the initial labeling provided by the supervised models. We solve this problem through iterative propagation of probability estimates among neighboring nodes and prove the optimality of the solution. The proposed method can be interpreted as conducting a constrained embedding in a transformed space, or a ranking on the graph. Experimental results on different applications with heterogeneous data sources demonstrate the benefits of the proposed method over existing alternatives.

A Fast Clustering-Based Feature Subset Selection Algorithm for High-Dimensional Data

Abstract


Feature selection involves identifying a subset of the most useful features that produces results compatible with those of the original entire set of features. A feature selection algorithm may be evaluated from both the efficiency and effectiveness points of view. While the efficiency concerns the time required to find a subset of features, the effectiveness is related to the quality of the subset of features. Based on these criteria, a fast clustering-based feature selection algorithm (FAST) is proposed and experimentally evaluated in this paper. The FAST algorithm works in two steps. In the first step, features are divided into clusters by using graph-theoretic clustering methods. In the second step, the most representative feature that is strongly related to target classes is selected from each cluster to form a subset of features. Because features in different clusters are relatively independent, the clustering-based strategy of FAST has a high probability of producing a subset of useful and independent features. To ensure the efficiency of FAST, we adopt the efficient minimum-spanning tree (MST) clustering method. The efficiency and effectiveness of the FAST algorithm are evaluated through an empirical study. Extensive experiments are carried out to compare FAST and several representative feature selection algorithms, namely, FCBF, ReliefF, CFS, Consist, and FOCUS-SF, with respect to four types of well-known classifiers, namely, the probability-based Naive Bayes, the tree-based C4.5, the instance-based IB1, and the rule-based RIPPER before and after feature selection. The results, on 35 publicly available real-world high-dimensional image, microarray, and text data sets, demonstrate that FAST not only produces smaller subsets of features but also improves the performance of the four types of classifiers.
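The two steps described in this abstract can be sketched as follows. This is a simplified illustration under stated assumptions, not the FAST implementation: the inputs `sim` (pairwise feature similarity) and `relevance` (feature-to-class relevance) are assumed to be precomputed, and the rule for cutting tree edges (drop an edge when two features are less similar to each other than each is relevant to the class) is an assumption made for this example. All function names are hypothetical.

```python
def _find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def mst_edges(sim):
    """Kruskal's algorithm on distance = 1 - similarity; returns the spanning tree edges."""
    n = len(sim)
    parent = list(range(n))
    edges = sorted((1.0 - sim[i][j], i, j) for i in range(n) for j in range(i + 1, n))
    tree = []
    for _, i, j in edges:
        root_i, root_j = _find(parent, i), _find(parent, j)
        if root_i != root_j:
            parent[root_i] = root_j
            tree.append((i, j))
    return tree


def select_features(sim, relevance):
    """Step 1: cluster features by cutting weak tree edges. Step 2: keep one feature per cluster."""
    n = len(sim)
    parent = list(range(n))
    for i, j in mst_edges(sim):
        weak = sim[i][j] < relevance[i] and sim[i][j] < relevance[j]
        if not weak:
            parent[_find(parent, i)] = _find(parent, j)   # the edge survives, so merge clusters
    best = {}
    for feature in range(n):
        root = _find(parent, feature)
        if root not in best or relevance[feature] > relevance[best[root]]:
            best[root] = feature                          # most class-relevant feature so far
    return sorted(best.values())
```

With two tight pairs of redundant features, for example, the sketch keeps the more class-relevant member of each pair and discards the other.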

Topology Abstraction Service for IP-VPNs

Abstract


VPN service providers (VSP) and IP-VPN customers have traditionally maintained service demarcation boundaries between their routing and signaling entities. This has resulted in the VPNs viewing the VSP network as an opaque entity and therefore limiting any meaningful interaction between the VSP and the VPNs. The purpose of this research is to address this issue by enabling a VSP to share its core topology information with the VPNs through a novel topology abstraction (TA) service which is both practical and scalable in the context of managed IP-VPNs. TA service provides tunable visibility of state of the VSP's network leading to better VPN performance. A key challenge of the TA service is to generate TA with relevant network resource information for each VPN in an accurate and fair manner. We develop three decentralized schemes for generating TAs with different performance characteristics. These decentralized schemes achieve improved call performance, fair resource sharing for VPNs, and higher network utilization for the VSP. We validate the idea of the VPN TA service and study the performance of the proposed techniques using various simulation scenarios over several topologies.

Supporting HPC Analytics Applications with Access Patterns Using Data Restructuring and Data-Centric Scheduling Techniques in MapReduce

Abstract


Current High Performance Computing (HPC) applications have seen an explosive growth in the size of data in recent years. Many application scientists have initiated efforts to integrate data-intensive computing into computational-intensive HPC facilities, particularly for data analytics. We have observed several scientific applications which must migrate their data from an HPC storage system to a data-intensive one for analytics. There is a gap between the data semantics of HPC storage and data-intensive systems; hence, once migrated, the data must be further refined and reorganized. This reorganization must be performed before existing data-intensive tools such as MapReduce can be used to analyze data. This reorganization requires at least two complete scans through the data set and then at least one MapReduce program to prepare the data before analyzing it. Running multiple MapReduce phases causes significant overhead for the application, in the form of excessive I/O operations. That is, for every MapReduce phase, a distributed read and write operation on the file system must be performed. Our contribution is to develop a MapReduce-based framework for HPC analytics to eliminate the multiple scans and also reduce the number of data preprocessing MapReduce programs. We also implement a data-centric scheduler to further improve the performance of HPC analytics MapReduce programs by maintaining the data locality. We have added expressiveness to the MapReduce language to allow application scientists to specify the logical semantics of their data such that 1) the data can be analyzed without running multiple data preprocessing MapReduce programs, and 2) the data can be simultaneously reorganized as it is migrated to the data-intensive file system. Using our augmented MapReduce system, MapReduce with Access Patterns (MRAP), we have demonstrated up to 33 percent throughput improvement in one real application, and up to 70 percent in an I/O kernel of another application. Our results for scheduling show up to 49 percent improvement for an I/O kernel of a prevalent HPC analysis application.

Thermal and Energy Management of High-Performance Multicores: Distributed and Self-Calibrating Model-Predictive Controller

Abstract


As a result of technology scaling, single-chip multicore power density increases and its spatial and temporal workload variation leads to temperature hot-spots, which may cause nonuniform ageing and accelerated chip failure. These critical issues can be tackled by closed-loop thermal and reliability management policies. Model predictive controllers (MPC) outperform classic feedback controllers since they are capable of minimizing performance loss while enforcing safe working temperature. Unfortunately, MPC controllers rely on a priori knowledge of thermal models and their complexity grows exponentially with the number of controlled cores. In this paper, we present a scalable, fully distributed, energy-aware thermal management solution for single-chip multicore platforms. The model-predictive controller complexity is drastically reduced by splitting it into a set of simpler interacting controllers, each one allocated to a core in the system. Locally, each node selects the optimal frequency to meet temperature constraints while minimizing the performance penalty and system energy. Comparable performance with state-of-the-art MPC controllers is achieved by letting controllers exchange a limited amount of information at runtime on a neighborhood basis. In addition, we address model uncertainty by supporting learning of the thermal model with a novel distributed self-calibration approach that matches the controller architecture well.

Strategies for Energy-Efficient Resource Management of Hybrid Programming Models

Abstract


Many scientific applications are programmed using hybrid programming models that use both message passing and shared memory, due to the increasing prevalence of large-scale systems with multicore, multisocket nodes. Previous work has shown that energy efficiency can be improved using software-controlled execution schemes that consider both the programming model and the power-aware execution capabilities of the system. However, such approaches have focused on identifying optimal resource utilization for one programming model, either shared memory or message passing, in isolation. The potential solution space, and thus the challenge, increases substantially when optimizing hybrid models since the possible resource configurations increase exponentially. Nonetheless, with the accelerating adoption of hybrid programming models, we increasingly need improved energy efficiency in hybrid parallel applications on large-scale systems. In this work, we present new software-controlled execution schemes that consider the effects of dynamic concurrency throttling (DCT) and dynamic voltage and frequency scaling (DVFS) in the context of hybrid programming models. Specifically, we present predictive models and novel algorithms based on statistical analysis that anticipate application power and time requirements under different concurrency and frequency configurations. We apply our models and methods to the NPB MZ benchmarks and selected applications from the ASC Sequoia codes. Overall, we achieve substantial energy savings (8.74 percent on average and up to 13.8 percent) with some performance gain (up to 7.5 percent) or negligible performance loss.