I am preparing to participate in a panel discussion on "Remaining Offshore Compliance Options and International Hot Topics" at the 14th Annual University Of San Diego School Of Law Procopio International Tax Institute, here. Yesterday, while reviewing one of the slide presentations, I noted an IRS emphasis of "internal data mining." Most practitioners have known for many years that the IRS had computer algorithms to analyze and match data (either data internal to a tax return or from associated tax filings such as W-2s). But, the data mining concept is beyond that type of "mining." I had only a general sense of what data mining might be. I wanted to know more about data mining generally and in the IRS specifically.
Wikipedia, here, introduces data mining as follows (footnotes omitted):
Data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning, statistics, and database systems. Data mining is an interdisciplinary subfield of computer science with an overall goal to extract information (with intelligent methods) from a data set and transform the information into a comprehensible structure for further use. Data mining is the analysis step of the "knowledge discovery in databases" process, or KDD. Aside from the raw analysis step, it also involves database and data management aspects, data pre-processing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating.
The term "data mining" is in fact a misnomer, because the goal is the extraction of patterns and knowledge from large amounts of data, not the extraction (mining) of data itself. It also is a buzzword and is frequently applied to any form of large-scale data or information processing (collection, extraction, warehousing, analysis, and statistics) as well as any application of computer decision support system, including artificial intelligence (e.g., machine learning) and business intelligence. The book Data mining: Practical machine learning tools and techniques with Java (which covers mostly machine learning material) was originally to be named just Practical machine learning, and the term data mining was only added for marketing reasons. Often the more general terms (large scale) data analysis and analytics – or, when referring to actual methods, artificial intelligence and machine learning – are more appropriate.
The actual data mining task is the semi-automatic or automatic analysis of large quantities of data to extract previously unknown, interesting patterns such as groups of data records (cluster analysis), unusual records (anomaly detection), and dependencies (association rule mining, sequential pattern mining). This usually involves using database techniques such as spatial indices. These patterns can then be seen as a kind of summary of the input data, and may be used in further analysis or, for example, in machine learning and predictive analytics. For example, the data mining step might identify multiple groups in the data, which can then be used to obtain more accurate prediction results by a decision support system. Neither the data collection, data preparation, nor result interpretation and reporting is part of the data mining step, but do belong to the overall KDD process as additional steps.
The related terms data dredging, data fishing, and data snooping refer to the use of data mining methods to sample parts of a larger population data set that are (or may be) too small for reliable statistical inferences to be made about the validity of any patterns discovered. These methods can, however, be used in creating new hypotheses to test against the larger data populations.
I can't say I fully understand the concept. I have worked with large databases for litigation and other projects and, earlier in my career, built some of my own databases (e.g., a practice management database and litigation databases) using DBaseIII and Microsoft Access. But, the data mining concept goes way beyond any database with which I am personally familiar.
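To make the "unusual records (anomaly detection)" idea from the Wikipedia excerpt a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the function name `flag_outliers`, the made-up sample returns, the deductions-to-income ratio as the feature, and the 3.5 cutoff. It is not a description of how the IRS or Palantir actually analyzes data; it only shows how a program might flag a record that looks unlike its peers without anyone telling it in advance what to look for.

```python
from statistics import median


def flag_outliers(records, threshold=3.5):
    """Return ids of records whose deductions-to-income ratio is far from the
    group median, using a modified z-score based on the median absolute
    deviation (MAD). The feature and the threshold are illustrative choices."""
    ratios = [r["deductions"] / r["income"] for r in records]
    med = median(ratios)
    mad = median(abs(x - med) for x in ratios)
    if mad == 0:
        # No spread around the median, so nothing can be ranked as unusual.
        return []
    flagged = []
    for rec, x in zip(records, ratios):
        score = 0.6745 * (x - med) / mad  # modified z-score
        if abs(score) > threshold:
            flagged.append(rec["id"])
    return flagged


# Hypothetical sample data, invented for this sketch.
returns = [
    {"id": "A", "income": 80_000, "deductions": 8_000},
    {"id": "B", "income": 95_000, "deductions": 10_450},
    {"id": "C", "income": 60_000, "deductions": 5_400},
    {"id": "D", "income": 120_000, "deductions": 14_400},
    {"id": "E", "income": 70_000, "deductions": 42_000},
    {"id": "F", "income": 88_000, "deductions": 8_800},
]

print(flag_outliers(returns))
```

In this made-up sample, return "E" claims deductions equal to 60 percent of income while the others cluster around 10 percent, so it is the one that gets flagged. The contrast with traditional matching is that no rule here says "60 percent is too high"; the record stands out only relative to the rest of the data.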
I inquired and found out that the IRS, and IRS CI specifically, is using Palantir tools for data mining. (I have some links on Palantir toward the end of this blog.) At a recent institute, the Deputy Chief of CI announced that IRS CI will, sometime in the near future, hire data scientists to develop data mining solutions to make its agents more efficient.
So, I did some internet searches. I include below some of the links and excerpts from the results of those Google searches. I am not able to offer anything more definitive than below, but I will be on the lookout for further information that readers might find helpful on data mining.
- IRS Job Research & Analysis Job Descriptions, here:
Artificial Intelligence Analysts
As an Artificial Intelligence Specialist, you will apply artificial intelligence techniques and other advanced computing skills to solve IRS business problems using neural networks, data mining, encryption, agent-based modeling, expert systems, text generation and natural language, and sophisticated Web applications
- David Voreacos and Christian Berthelsen, Data Mining to Find Tax Cheaters (Bloomberg 5/25/17), here (discussing the data collected in the IRS offshore account voluntary disclosure programs and under FATCA).
- Kimberly Houser & Debra Sanders, The Use of Big Data Analytics by the IRS: Efficient Solutions or the End of Privacy as We Know It?, 19 Vand. J. Ent. & Tech. L. 817 (2017), here.
ABSTRACT
This Article examines the privacy issues resulting from the IRS’s big data analytics program as well as the potential violations of federal law. Although historically, the IRS chose tax returns to audit based on internal mathematical mistakes or mismatches with third party reports (such as W-2s), the IRS is now engaging in data mining of public and commercial data pools (including social media) and creating highly detailed profiles of taxpayers upon which to run data analytics. This Article argues that current IRS practices, mostly unknown to the general public are violating fair information practices. This lack of transparency and accountability not only violates federal law regarding the government’s data collection activities and use of predictive algorithms, but may also result in discrimination. While the potential efficiencies that big data analytics provides may appear to be a panacea for the IRS’s budget woes, unchecked, these activities are a significant threat to privacy. Other concerns regarding the IRS’s entrée into big data are raised including the potential for political targeting, data breaches, and the misuse of such information. This Article intends to bring attention to these privacy concerns and contribute to the academic and policy discussions about the risks presented by the IRS’s data collection, mining and analytics activities.
- Meta S. Brown, Analytics And The IRS: A New Way To Find Cheaters (Forbes 1/26/16), here.
- Palantir scores $98M contract with IRS for Case Analytics Solution, to support Criminal Investigations (G2Xchangeetc 10/2/18), here.
- Do You Know This Company? You Should – They Know You! (Bold Business 5/15/18), here.
Palantir is . . . a well-known American software and services company specializing in large data analysis. Based in Palo Alto, California, the company offers two major projects, namely Palantir Gotham and Palantir Foundry. Making its bold impact in the data-mining industry, the company provides software applications for integrating, visualizing and analyzing data, while connecting these information with humans and environments.
The company’s original clients were United States Intelligence Community (USIC) federal agencies. Ever since then, it has been expanding its clientele to offer services to local and state governments, including private corporations in the healthcare and financial industries. In 2013, the list of the company’s clientele included the FBI, the CIA, the Centre for Disease Control, the NSA, the Air Force, Special Operations Command, the Marine Corps, the IRS, and West Point.
Palantir’s Gotham and Foundry
Within its two products, Palantir Gotham platform consists a suite of capabilities for integrating several varying data sources for safe, secure, and collaborative analysis. This platform provides an enterprise knowledgebase, with the complete record of a company’s collective analysis. It also manages petabyte-scale data within a combination of measurable structure and federated data storage. Gotham was used by counter-terrorism experts at USIC and US Department of Defense, fraud agents at the Recovery Accountability and Transparency Board, and virtual analysts at Information Warfare. On the other hand, Palantir Foundry radically reinvents the method enterprises collaborate with data by extending and amplifying the capability of data integration. Foundry platform includes a suite of high-end data integration, including git-style versioning semantics, data provenance, branching, granular access controls, transformation authoring, and many more.
Palantir works closely with their clients, wherein engineers map and integrate all of the important data, regardless of volume or type, into a single, intelligible model. Upon the flow of bytes and bits data into the Palantir platforms, information are transformed into a clear and defined objects and connections, such as places, things, events, people, people, and the relationships among them.
- Tim Brown, Peter Thiel & Palantir: The CIA-Backed Tech Giant That’s Sifting & Sorting Your Info For Government (Washington Standard 1/4/18), here.
Yes, while a tyrannical government that violates the law and searches and seizes without probable cause and a warrant are realities that we should be concerned about, there is a tremendous threat coming from a company called Palantir, which is backed by the Central Intelligence Agency.
Palantir presents themselves as “We build products that make people better at their most important work — the kind of work you read about on the front page of the newspaper, not just the technology section.”
[Then quotes from a Wired article]
Palantir might streamline some criminal investigations—but there’s a possibility that it comes at a high cost, for both the police forces themselves and the communities they serve.
These documents show how Palantir applies Silicon Valley’s playbook to domestic law enforcement. New users are welcomed with discounted hardware and federal grants, sharing their own data in return for access to others’. When enough jurisdictions join Palantir’s interconnected web of police departments, government agencies, and databases, the resulting data trove resembles a pay-to-access social network—a Facebook of crime that’s both invisible and largely unaccountable to the citizens whose behavior it tracks.