Messing around with the Palantir Government suite right now. You can get an account and mess around with it here:
You have the ability to import/export data, filter access, set up collaborative teams and access to the open archives of the US Gov and some non profits. There are two tiers of users, novice users and power users:
Restrictions for Novice Users
Novice users can only import data that is correctly mapped to the deployment ontology. Power users are exempt from this restriction.
The maximum number of rows in structured data sources that a Novice user can imported at one time is restricted by the NOVICE_IMPORT_STRUCTURED_MAX_ROWS system property. The default value for this property is 1000.
The maximum size of unstructured data sources that can be imported by a Novice user at one time is restricted by the NOVICE_IMPORT_UNSTRUCTURED_MAX_SIZE_IN_MB system property. The default value for this property is 5 megabytes.
The maximum number of tags that a Novice user can create using the Find and Tag helper is restricted by the system property NOVICE_FIND_AND_TAG_MAX_TAGS. The default setting for this property is 50.
Novice users cannot access the Tag All Occurrences in Tab option in the Browser’s Tag As dialog.
SearchAround search templates
Novice users cannot import SearchAround Templates from XML files.
Novice users cannot publish SearchAround templates for use by the entire deployment, and cannot edit published templates.
All other SearchAround features remain available.
Resolving Nexus Peering data conflicts
The Pending Changes application is available only in the Palantir Enterprise Platform, and is only accessible to Workspace users who belong to the Nexus Peering Data Managers user group.
Nexus Peering Data Managers use the Pending Changes application to check for, analyze, and resolve data conflicts that are not automatically resolved when a local nexus is synchronized with a peered nexus.
Novice users cannot delete published objects.
Novice users cannot delete objects created or changed by other users.
The maximum number of objects that Novice users can resolve together at one time is restricted by the NOVICE_RESOLVE_MAX_OBJECTS system property. This restriction does not apply to objects resolved by using existing object resolution suites in the Object Resolution Wizard or during data import.
Novice users may use the Object Resolution Wizard only when using existing object resolution suites. Novice users cannot perform Manual Object Resolution, and cannot record new resolution criteria as an Object Resolution Suite.
To learn more, see Resolving and Unresolving Objects in Workspace: Beyond the Basics.
Map application restrictions
All map metadata tools in the Layers helper are restricted.
Novice users cannot access features that allow sorting of layers by metadata, coloring by metadata, or the creation of new metadata. All other Layer helper functions remain available.
In case you didn’t get what I just said, you have access the same tools the FBI and CIA use, except some minor limitations and no access to classified documents. If you have access to Wolfram Alpha/Mathematica and can google for history on your topic of interest then most of the classified files will become redundant.
What about data mining on a budget?
Consider relying on a GPU(s). A CPU is designed to be multitasker that can quickly switch between actions, whereas a Graphical Processing Unit(GPU) is designed to do the same calculations repetitively while giving large increases in performance. The stacks in the listed papers, while giving exponentially higher speeds, did not use modern designs or graphics cards, which hindered them from running even faster.
The GPU (Graphics Prossessing Unit) is changing the face of large scale data mining by significantly speeding up the processing of data mining algorithms. For example, using the K-Means clustering algorithm, the GPU-accelerated version was found to be 200x-400x faster than the popular benchmark program MimeBench running on a single core CPU, and 6x-12x faster than a highly optimised CPU-only version running on an 8 core CPU workstation.
These GPU-accelerated performance results also hold for large data sets. For example in 2009 data set with 1 billion 2-dimensional data points and 1,000 clusters, the GPU-accelerated K-Means algorithm took 26 minutes (using a GTX 280 GPU with 240 cores) whilst the CPU-only version running on a single-core CPU workstation, using MimeBench, took close to 6 days (see research paper “Clustering Billions of Data Points using GPUs” by Ren Wu, and Bin Zhang, HP Laboratories). Substantial additional speed-ups are expected were the tests conducted today on the latest Fermi GPUs with 480 cores and 1 TFLOPS performance.
Over the last two years hundreds of research papers have been published, all confirming the substantial improvement in data mining that the GPU delivers. I will identify a further 7 data mining algorithms where substantial GPU acceleration have been achieved in the hope that it will stimulate your interest to start using GPUs to accelerate your data mining projects:
Hidden Markov Models (HMM) have many data mining applications such as financial economics, computational biology, addressing the challenges of financial time series modelling (non-stationary and non-linearity), analysing network intrusion logs, etc. Using parallel HMM algorithms designed for the GPU, researchers (see cuHMM: a CUDA Implementation of Hidden Markov Model Training and Classification by Chaun Lin, May 2009) were able to achieve performance speedup of up to 800x on a GPU compared with the time taken on a single-core CPU workstation.
Sorting is a very important part of many data mining application. Last month Duane Merrill and Andrew Grinshaw (from University of Virginia) reported achieving a very fast implementation of the radix sorting method and was able to exceed 1G keys/sec average sort rate on an the GTX480 (NVidia Fermi GPU). Seehttp://goo.gl/wpra
Density-based Clustering is an important paradigm in clustering since typically it is noise and outlier robust and very good at searching for clusters of arbitrary shape in metric and vector spaces. Tests have shown that the GPU speed-up ranged from 3.5x for 30k points to almost 15x for 2 million data points. A guaranteed GPU speedup factor of at least 10x was obtained on data sets consisting of more than 250k points. (See “Density-based Clustering using Graphics Processors” by Christian Bohm et al).
Similarity Join is an important building block for similarity search and data mining algorithms. Researchers using a special algorithm called Index-supported similarity join for the GPU to outperform the CPU by a factor of 15.9x on 180 Mbytes of data (See “Index-supported Similarity Join on Graphics Processors” by Christian Bohm et al).
Bayesian Mixture Models has applications in many areas and of particular interest is the Bayesian analysis of structured massive multivariate mixtures with large data sets. Recent research work (see “Understanding the GPU Programming for Statistical Computation: Studies in Massively Massive Mixtures” by Marc Suchard et al.) has demonstrated that an old generation GPU (GeForce GTX285 with 240 cores) was able to achieve a 120x speed-up over a quad-core CPU version.
Support Vector Machines (SVM) has many diverse data mining uses including classification and regression analysis. Training SVM and using them for classification remains computationally intensive. The GPU version of a SVM algorithm was found to be 43x-104x faster than SVM CPU version for building classification models and 112x-212x faster over SVM CPU version for building regression models. See “GPU Accelerated Support Vector Machines for Mining High-Throughput Screening Data” by Quan Liao, Jibo Wang, et al.
Kernel Machines. Algorithms based on kernel methods play a central part in data mining including modern machine learning and non-parametric statistics. Central to these algorithms are a number of linear operations on matrices of kernel functions which take as arguments the training and testing data. Recent work (See “GPUML: Graphical processes for speeding up kernel machines” by Balaji Srinivasan et al. 2009) involves transforming these Kernel Machines into parallel kernel algorithms on a GPU and the following are two example where considerable speed-ups were achieved; (1) To estimate the densities of 10,000 data points on 10,000 samples. The CPU implementation took 16 seconds whilst the GPU implementation took 13ms which is a significant speed-up will in excess of 1,230x; (2) In a Gaussian process regression, for regression 8 dimensional data the GPU took 2 seconds to make predictions whist the CPU version took hours to make the same prediction which again is a significant speed-up over the CPU version.
If you want to use the GPUs but you do not want to get your hands “dirty” writing CUDA C/C++ code (or other languages bindings such as Python, Java, .NET, Fortran, Perl, or Lau) then consider using MATLAB Parallel Computing Toolbox. This is a powerful solution for those who know MATLAB. Alternatively R now has GPU plugins. A subsequent post will cover using MATLAB and R for GPU accelerated data mining.
These are space whales flying through the sun: