15.0 INFORMATION MANAGEMENT
Managing the large, complex data sets generated by an MLS project is not a trivial task. Perhaps the best way to address the issues is to recognize that the bulk of the data falls into one of two general categories: read-only vs. mutable. In this context, read-only data refers to information that does not or should not change, such as raw measurements, and mutable data refers to items that are changeable or developed, such as extracted information. An important observation is that most of the large, unwieldy files are read-only, whereas the mutable files are typically much smaller. The two types should be handled very differently.
Once initial processing is complete (e.g., geo-referencing and classification), the core MLS data will not change much, if at all, and may be considered read-only. Once this data is stored in the appropriate format and location and backed up, it can be separated from the myriad of files controlled under normal IT processes. These files will not change, and therefore do not need to be part of incremental backups or version management. By separating the terabytes of static data from the organization’s other data, management is simplified. Information in this category may include point clouds, classifications, and imagery.
The derived data consists of much smaller and therefore more manageable files. This category may include such information as extracted curb lines and signage, CAD drawings, and metadata. These files may be incorporated into an organization’s existing data management procedures.
Note that the recommendation is not to have read-only MLS data operate outside of IT processes, but rather that IT processes should be broadened to accommodate handling large sets of static data in addition to mutable data.
The next section provides more detail about the types of data that arise during a typical MLS project and how they break down by size and read-only vs. mutable. The last section walks through the various execution steps encountered in a project and puts forth best practices based on the idea of differentiating read-only vs. mutable data.
15.1 Considerations
The types of information acquired through mobile LIDAR can be broken down into several classes, all of which are important, and have different characteristics of note:
15.1.1 Measurements of the scene
Measurement data from the scene will include instrument information such as angles and range to target, intensity of return signal, and possibly color or waveform data, as well as the vehicle's position and orientation. Typically this information is post-processed to register the point cloud to a desired coordinate system. The amount of measurement data collected can be very large, and may overwhelm unprepared IT systems and practices. Exact estimates of data sizes depend on many factors, but as a rule of thumb the total collected LAS data size (using LAS Point Data Record Format 6, with 30 bytes per point record) may be estimated as:

S = 108 × H × n × R × f
where:
S = combined size of LAS files (in gigabytes),
H = active collection time (in hours),
n = number of scanners,
R = MLS data collection rate (in MHz), and
f = fraction of space collected, e.g., road / structures vs. open sky.
The coefficient of 108 follows from 30 bytes per point × 3,600 seconds per hour × 1,000,000 points per second per MHz, converted to gigabytes. This estimate does not account for any decimation or paring down of the data, and assumes no compression. Values for f vary with scene, but rough estimates include f = 20-30% for open, flat terrain, f = 40-50% in typical low-rise areas, and f = 60-80% in urban canopy (high-rise) locations.
One may estimate the collection time H as H = M/V, where M = miles to be scanned and V = speed (in miles per hour), but be aware that M must include the total miles driven, including multiple passes and extra miles from frontage roads. If the speed V will vary during the collection, the most accurate estimate of H is obtained by summing M/V for each section of constant speed. For example, if a highway consists of a 50-mile section that is to be collected at 40 mph followed by a 40-mile section at 30 mph, the best estimate for H would be (50/40) + (40/30) = 2.6 hours.
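For readers who wish to script these estimates, the short Python sketch below implements both rules of thumb. The function names and example parameters (two scanners at 0.5 MHz each in a low-rise area) are illustrative assumptions; the coefficient of 108 assumes uncompressed 30-byte PDRF 6 records as above.

```python
def las_size_gb(hours, n_scanners, rate_mhz, fraction):
    """Rule-of-thumb uncompressed LAS size in GB (30-byte PDRF 6 records)."""
    # 30 bytes/point * 3,600 s/h * 1e6 points/s per MHz = 108 GB
    # per scanner-hour per MHz, before the capture fraction f.
    return 108.0 * hours * n_scanners * rate_mhz * fraction

def collection_hours(segments):
    """Total collection time H as the sum of M/V over (miles, mph) segments."""
    return sum(miles / mph for miles, mph in segments)

# Worked example from the text: 50 mi at 40 mph, then 40 mi at 30 mph.
H = collection_hours([(50, 40), (40, 30)])
print(f"H = {H:.1f} hours")  # H = 2.6 hours

# Hypothetical survey: two scanners at 0.5 MHz each, low-rise area (f = 0.45).
S = las_size_gb(H, n_scanners=2, rate_mhz=0.5, fraction=0.45)
print(f"S = {S:.0f} GB")
```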
Once processed into a point cloud and other files, this data should not change and may be considered read-only.
15.1.2 Ancillary data
Ancillary data refers to supplemental information acquired during field collection. This information could be anything: weather conditions (temperature, pressure, humidity, etc.), photographs, or video. The amount of data can vary significantly depending on the type and speed of information collected. For instance, video logging can require hundreds of GB per day and can be comparable in size to the measurement data. Most, if not all, of this data is read-only.
15.1.3 Extracted data
For our consideration here, extracted data refers to any information derived from either the measurement or ancillary data. For example, locations of features such as road markings or signage can be extracted either automatically or by hand from the point cloud data. Frequently, extracted data and information are stored in a separate file using a format that differs from the point cloud data, e.g., a CAD format. Such extracted data is significantly smaller than measurement or ancillary data, on the order of megabytes or a few gigabytes. Often this information has the highest value, because it contains more intelligence than the point cloud data. The tradeoff is that producing these files requires significant time and cost. Therefore, these files should be considered mutable. This is not always the case, however. In particular, an important component of extracted data is the classification of measured points. Certain file formats such as LAS support the assignment of a classification value such as ground, building, water, or low vegetation to each 3D data point, thereby combining the raw measurements and extracted information within a single file. Once the classification is complete and verified, this data may then be considered read-only.
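As an illustration of classifications stored directly in a LAS file, the sketch below tallies points per class code. It assumes the open-source laspy package and a hypothetical file name.

```python
import numpy as np
import laspy  # assumed open-source LAS reader (pip install laspy)

# Hypothetical classified deliverable; class codes follow the ASPRS LAS
# convention (e.g., 2 = ground, 6 = building, 9 = water).
las = laspy.read("corridor_classified.las")
codes = np.asarray(las.classification)

for code, count in zip(*np.unique(codes, return_counts=True)):
    print(f"class {code:3d}: {count:,} points")
```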
In general, there is a tradeoff between file size and the precision or level of detail represented by the data within the file. For a given area, a larger file typically contains more detail than a smaller one, though other considerations such as filtering or compression will influence file size as well. Note that this applies generically to all types of data (e.g., video or photographs) as well as point cloud data.
15.1.4 Computation / analysis
Downstream usage of the data is accomplished using software tools that allow extraction of higher-level information. Examples include: signage, pavement markings, and structures (bridges, tunnels, etc.). Automated or user-assisted extraction algorithms are an active area of research and development.
The information produced increases in value with each processing step. Also, additional software tools are required to interact with the information. Many different packages exist, and are often tailored to a particular use or application. These combine to create an information hierarchy: increased value comes with increased cost. Also, packages that work with point clouds can be expensive, although some GIS and CAD packages now support point clouds. Fortunately, as the knowledge value increases, the incremental storage required actually drops, leading to smaller marginal costs for storage.
For each step in the chain there are potentially manual, semi-automated and fully automated procedures that are employed to process the data from one step to another. Because manual operations are costly, slow, error-prone, and operator-dependent, much active research is currently focused on developing better and more robust automated and semi-automated tools.
Regardless, potential errors may be introduced at each step since no hardware or software algorithm can be perfect. Therefore, any data management scheme must include the ability to follow the ‘lineage’ of any extracted information back to the original data for verification purposes.
The data lineage is often split between the producers of the data—the entities performing the field collection—and the consumers—typically back-office users. It is important to understand exactly where the various steps are performed, particularly when there may be overlap or rework, or for legal purposes. Regarding the latter, it can be important to be able to trace the history of a measurement or extracted piece of information so as to be comfortable with its use and to be able to defend it against dispute. For example, if a transportation agency issues an overpass clearance height based on MLS information and the published clearance is found (perhaps after an accident) to be incorrect, it will be necessary to trace the origin of the erroneous measurement in order to prevent similar issues in the future.
In almost all cases, this information is mutable and should be included in normal agency IT procedures and policies.
15.1.5 Packaging / delivery
For the purposes of this section, the salient operations after all the data is processed and a project is complete are the storage and archival of all files. This topic is discussed in detail in section 15.2.3.
15.2 Best practices
The practical aspects of the management of large MLS data sets are important due to both the difficulty manipulating the large files and the importance of the information contained therein. Because each organization has unique resources and goals, it is not possible to prescribe a single, generally applicable protocol. However, several best practices have emerged. It is important for each organization to create an individualized plan that best serves their needs, using the following as a guide.
15.2.1 Collection & delivery
In most cases, the most practical way to receive gigabytes or terabytes of data from a service provider is via external hard drives. The information is usually collected or processed on a hard drive, and attempting to convert to DVD or other media is often overly time consuming. Relative to the value of the data, hard drives are inexpensive, with current costs well under $1/GB. If the hard drive is compatible with existing network storage devices, it can simply be plugged in and used to host the data, eliminating the time required to copy files. This approach is improved upon if the drive itself supports some form of RAID. Another advantage is that the incremental cost of storage is absorbed into the data collection process.
Recommendation: Avoid large (> few GB) files and tile data prior to delivery.
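As a sketch of the kind of fixed-grid tiling this recommendation implies, the following operates on plain coordinate arrays; the array contents and the 1,000 m tile size are illustrative assumptions, not a prescribed scheme.

```python
import numpy as np

def tile_ids(x, y, tile_size=1000.0):
    """Map projected x/y coordinates (meters) to integer (col, row) tile IDs."""
    return (np.floor(x / tile_size).astype(int),
            np.floor(y / tile_size).astype(int))

# Illustrative coordinates; in practice these come from the point cloud.
rng = np.random.default_rng(0)
x = rng.uniform(400_000, 404_000, size=1_000_000)
y = rng.uniform(4_500_000, 4_502_000, size=1_000_000)

col, row = tile_ids(x, y)
ids = np.stack([col, row], axis=1)
tiles, counts = np.unique(ids, axis=0, return_counts=True)
for (c, r), n in zip(tiles, counts):
    # Each (c, r) subset would be written to its own modest-sized file.
    print(f"tile ({c}, {r}): {n:,} points")
```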
As with all hard drives, better performance results from not filling drives to capacity and from avoiding disk fragmentation; both overfilled and fragmented disks have been shown to cause problems for large LIDAR data sets.
Solid-state drives (SSDs) are an alternative to traditional magnetic drives. Access speeds are faster, and SSDs are less prone to corruption because they contain neither magnetic media nor moving parts. The oft-cited disadvantage of SSDs is their limited write endurance, but this is not a major concern for the storage of static data. However, such devices are currently significantly more expensive than magnetic hard drives.
Whichever type of drive is used, we strongly suggest that a duplicate of the data be provided on a second drive (or set of drives for large data sets). The contents of the second drive must be identical to the first, and the drive should be placed in secure storage to serve as a backup in case of failure of the primary set. In essence, make and test the backup first, before using any data. Insist on this from your data provider: it is much easier to create and verify a duplicate immediately after field collection than in the office days or weeks later.
Recommendation: Request identical duplicate copies of the data at delivery, and verify the backup prior to any use of the data.
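One way to carry out the verification step is to hash every file on both drives and compare the results. The following is a minimal sketch; the two mount paths are hypothetical, and SHA-256 is one reasonable checksum choice.

```python
import hashlib
from pathlib import Path

def sha256(path, chunk=1 << 20):
    """Stream a file through SHA-256 so multi-GB files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def verify_duplicate(primary, backup):
    """Report any file that is missing or differs between two drives."""
    primary, backup = Path(primary), Path(backup)
    for src in primary.rglob("*"):
        if not src.is_file():
            continue
        dup = backup / src.relative_to(primary)
        if not dup.is_file():
            print(f"MISSING on backup: {dup}")
        elif sha256(src) != sha256(dup):
            print(f"MISMATCH: {src}")

verify_duplicate("/mnt/mls_primary", "/mnt/mls_backup")  # hypothetical mounts
```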
15.2.2 Storage & networks
Designing an optimal network and storage configuration can be difficult. Many options exist, at a variety of costs and complexities. One can consider the following three setups:
- Local – files reside on a single host workstation. This workstation will be used to perform the bulk of processing and analysis. The primary reason to employ this configuration is to optimize the speed with which the files may be accessed and processed. The downsides of this approach include: difficulty administering multiple workstations, inability to access data if the workstation is powered off, delays accessing the data from machines networked to the host, and lack of centralized control.
- Local area network (LAN) – files reside on a local file server and are accessed by several workstations through a fast, local network. This approach is much simpler to administer than a local configuration. File access may be limited by server throughput and network speeds. Therefore, it is recommended that strong consideration be given to using a high-speed connection (e.g., 1000BASE-T) and servers designed to handle large amounts of attached storage. Since the bulk of the MLS data is stored as static files, the server may be optimized for downloads, as opposed to balanced upload/download configurations.
- Wide area network (WAN) – files are not stored locally. Perhaps the best-known version of a WAN is ‘cloud storage,’ whereby a third-party warehouse service is used to host the data at an offsite location and access is through the Internet. The concerns are the same as for LANs, namely, the time it takes to transfer a large data set across the network. In general a WAN will be slower than a LAN, but may be adopted for organizational reasons. Important considerations when using a WAN include data security, uptime and bandwidth guarantees, and cost. In particular, MLS data may be very expensive to store ‘in the cloud.’ Software as a Service (SaaS) may become relevant in this application space in the future. Rather than having a transportation agency host its own data and processing applications, a third party could host the data and applications and provide only a thin client application to run on less powerful workstations (e.g., a virtual machine). The users then work with the full data set, but only limited visualization and extracted information need be transferred across the network.
Experience has shown that often a combination of strategies may work best. For example, an agency may configure a dedicated workstation for the initial processing of MLS data. The dedicated workstation should be configured to match the optimal requirements of the processing software, and therefore can complete its task more quickly than a general-purpose machine that is accessing data over a network. Once this initial processing is complete, the drives, which now contain both the initial and the processed data, can be removed from the workstation and connected to the LAN so that multiple users can access the information and administration is simplified.
15.2.3 Backup / Archival / Sunset
Archival needs are distinct from backups. Backups refer to immediate copies of data held for either convenience or redundancy in case of failure or loss of the originals, whereas archives refer to data collected and stored for a long period of time after the initial use has ended. Thus, there are two separate considerations: (a) the short-term backup process, and (b) the ability to access and use the data well after the working period ends. Most, if not all, transportation agencies have standard IT processes for preserving data, though read-only MLS data is typically too large to fit into an organization’s existing IT archival or backup procedures. Therefore, an independent process should be developed and incorporated into the overall IT strategy to handle these files. If the data is physically located on hard drives as suggested above, then a simple backup process is to duplicate the drive(s), disconnect the duplicates, label them, and store them in a secure area. Of course, if duplicate drives are requested from the service provider as discussed in section 15.2.1, then backups will already exist and no further procedures will be necessary. If the data is truly read-only, then the offline storage will be equivalent to the online data and can be used for recovery in case of failure of the primary drive. For this reason, editing read-only data should be avoided whenever possible: once a file that is subject to multiple sessions of editing becomes corrupt, it is often difficult and time-consuming to recreate the state immediately preceding the corrupting event. The situation may be exacerbated by the large size of the files.
Once a project is complete, all the data should be archived according to the needs of the agency and the project. The voluminous read-only data may be considered as archived already, since it has not changed since it was created. Other, usually smaller, files can be archived either on the drive holding the read-only files, or in compliance with other agency IT procedures.
As mentioned in Section 4.5.5, sunset provisions should be incorporated into agency IT plans for MLS data. Base the sunset provisions on business needs and use cases, but realize that, given the rapid pace of development in the mobile LIDAR and computer industries, it is generally impossible to guarantee that future software systems will be compatible with older formats. However, adopting published, open standards for critical data formats can facilitate continued access to the data.
Another important part of the sunset plan must be the routine transfer of data as storage units age or from one generation of storage technology to the next. If such transfers are not anticipated, too great a gap may develop between the legacy and current technologies. Performing the transfers before the legacy storage technology becomes completely obsolete avoids orphaned data.
15.2.4 Monitoring integrity
File integrity is always an important issue and is addressed by numerous standard checks routinely provided by operating systems, hardware, and IT practices. Additional considerations must be made for MLS data because much of this data may operate outside of the usual IT channels. In particular, MLS files considered to be read-only must be guaranteed to remain immutable. Read-only MLS files should be protected from accidental editing, deletion, renaming, or relocation. This can be done by restricting file and network permissions. Importantly, the files should be monitored for corruption. If the files are stored on a RAID array, the operating software should report failures as they occur. If a RAID is not employed, then it is recommended that file checksums be verified periodically, especially after copy operations. While many operating systems provide advanced file checks, corruption can still occur with large files. For example, when copying files from one network folder to another, one generally assumes that the copy is accurate unless otherwise informed by the system. However, there have been cases whereby the transfer of numerous large files has subtly corrupted important data. Even if an operating system reports an error to a software application when a file is being read, there is no guarantee that the software will handle the error appropriately. The integrity of offline storage must be checked periodically as well. One feature of the E57 format not currently supported by LAS is that the former incorporates numerous redundancy checks throughout the data so that software applications may verify integrity during use.
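A simple way to implement periodic checksum verification is to record a manifest of hashes when the read-only data is first stored, and then re-check files against it on a schedule. A minimal, self-contained sketch follows; the manifest file name and directory layout are hypothetical.

```python
import hashlib
import json
from pathlib import Path

def sha256(path, chunk=1 << 20):
    """Stream a file through SHA-256 so multi-GB files fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

def build_manifest(root, manifest="manifest.json"):
    """Record a hash for every file once the data is declared read-only."""
    root = Path(root)
    hashes = {str(p.relative_to(root)): sha256(p)
              for p in root.rglob("*") if p.is_file()}
    Path(manifest).write_text(json.dumps(hashes, indent=2))

def check_manifest(root, manifest="manifest.json"):
    """Re-hash the tree and report anything missing or silently corrupted."""
    root = Path(root)
    expected = json.loads(Path(manifest).read_text())
    for rel, digest in expected.items():
        p = root / rel
        if not p.is_file():
            print(f"MISSING: {rel}")
        elif sha256(p) != digest:
            print(f"CORRUPT: {rel}")

# Run build_manifest once at ingest, then check_manifest on a schedule.
```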
In addition to monitoring file integrity, it is recommended that ‘snapshots’ of the data be captured at significant moments during processing to ensure workflow integrity. A snapshot is a record of the state of the data at a particular moment, and/or provides enough information to reconstruct that state. The backup of initial data usually suffices as a snapshot, as would a backup generated immediately after batch or automated processing. A snapshot of operator-created files can often be supplied by version management tools, if used.
15.2.5 Interoperability & evolution
Interoperability among multiple software systems and data formats is an important requirement and is a current challenge with LIDAR data. Because of the large size of data sets, the time required to move between packages can be substantial, often ranging from several hours to days depending on the software package and computing capabilities. To this end, a distinction exists between a working format and an interchange format. The former is used natively by software systems and does not require any processing prior to use. The latter refers to formats that require substantial processing to convert to a format usable by a software system. Note that often software packages perform this conversion internally, in which case the question becomes whether or not the package reads and writes natively, i.e., without creating intermediate files of a different format.
A practical consideration whenever multiple file formats are used to deal with a single data set is the possibility for losses during conversions. To some degree any conversion from one format to another is likely to introduce artifacts, though they may not be meaningful. For instance, converting a single (X, Y, Z) coordinate triplet from binary to fixed resolution ASCII and back to binary is likely to introduce a tiny numerical discrepancy, say at the sub-millimeter level. While not significant for MLS accuracies, the discrepancy does mean that the original binary file differs from the round-trip file, which could prove to be problematic for file verification.
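The round-trip effect is easy to demonstrate. In the sketch below, a coordinate is written to millimeter-resolution ASCII and parsed back; the specific value is illustrative.

```python
x = 473812.201734912057   # original double-precision easting (m)

text = f"{x:.3f}"         # fixed, millimeter-resolution ASCII: '473812.202'
x_back = float(text)      # converted back to binary

print(text)
print(x_back - x)         # nonzero (sub-millimeter) discrepancy
```

Although the discrepancy is far below MLS accuracy, the round-trip value no longer matches the original byte for byte, so checksum-based verification would flag the file as changed.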
The recommended best practice is to avoid or minimize the use of multiple formats throughout a workflow. At present these Guidelines recommend the use of binary LAS files: the LAS format is the most mature format for MLS data, and therefore most MLS software packages can read and write it natively. Note that software and file formats for LIDAR data are an active area of research and development, and in the near future acceptable alternatives to LAS may arise. Appendix E discusses this issue in more detail.
Recommendation: Avoid or minimize the use of multiple formats and data transfer throughout a workflow.
Recommendation: Currently we recommend binary LAS files for point cloud delivery; however, E57 will likely be a suitable alternative in the future.
Evolution of file formats and software refers to the ongoing process of upgrading to newer versions that presumably offer better and additional functionality. Evolution is generally beneficial but can create challenges if not managed properly. This is the case in particular for extended projects or those that need to be revisited after a significant hiatus, such as one reopened after completion and archival storage. It is also the case when data use is expanded beyond the project and is incorporated across the organization. As is the case with most software packages, it is generally best to deploy the same version across all users and workstations for a given project, and allot extra time for snapshots, verification, and testing if application upgrades must be made during a project.